“Disaster recovery” is often thought about and planned on a server-by-server basis. A server goes down and gets recovered. A bunch of servers go down and we go to backups.
The linchpin of almost every org's disaster recovery plan is recovery from backups. A bad day might consist of an incident that requires restoring from those backups. On the worst of really bad days, though, everything may be encrypted by the bad actors and even the backups might be compromised. What then? How prepared are you to recover from, well, almost nothing? You know... an entire organization that needs to be recovered... at scale.
Think this scenario is impossible? The June 2018 ransomware attack on the Alaskan borough of Matanuska-Susitna (Mat-Su) is just such an example.
When we see or have information about such a rare event, we can't just conclude “don’t click on random attachments” and waste the opportunity to look at what disaster recovery at scale really looks like. There are lessons to be learned everywhere, especially in this case.
How well prioritized are your organization's customers and their objectives in your plan for a large-scale incident? For instance, Mat-Su provided services both to support its internal operations and to serve its constituents, just like any county would. Some examples might include:
- Paying invoices to county vendors in a timely manner
- Issuing and archiving marriage licenses
- Processing constituent PII for county responsibilities
- Managing swim instruction reservations at the borough's swimming pool
They lost production systems, forcing employees to use typewriters. The ransomware tried to encrypt the backups but apparently failed. That said, my understanding is that the borough still lost years of archived email.
For months, Mat-Su communicated daily as part of their response: which services were available that day, any limitations to the level of service, and when still-impacted services were expected to become available again. This means that, behind the scenes, there was a prioritization for recovering these customer-facing services.
A reasonable disaster recovery planner can surmise that underlying the prioritization was a list of the servers that needed to be brought online to enable any given county service. There was a lot of recovery to be done, and I'd guess that the systems required for swim lessons weren't the best use of that initial recovery time.
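To make that idea concrete, here's a minimal sketch of such a service-to-server mapping. The service and server names are hypothetical, not Mat-Su's actual inventory:

```python
# Hypothetical service-to-server mapping (illustrative names only).
SERVICE_DEPENDENCIES = {
    "vendor_payments":   {"erp-app-01", "erp-db-01", "ad-dc-01"},
    "marriage_licenses": {"records-app-01", "records-db-01", "ad-dc-01"},
    "swim_reservations": {"rec-web-01", "rec-db-01"},
}

def servers_to_recover(priority_services):
    """Return the set of servers needed to bring the given services back online."""
    needed = set()
    for service in priority_services:
        needed |= SERVICE_DEPENDENCIES[service]
    return needed

# Recovering payments and licenses first touches a smaller, well-defined set of servers:
print(sorted(servers_to_recover(["vendor_payments", "marriage_licenses"])))
```

Even something this simple makes the recovery conversation about services rather than about individual servers.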
How do we translate this example taken from a county government
to your company?
In a revenue-producing company, the systems required to interact with customers and process payments would likely be considered critical services by your executive team. The applications and infrastructure required to recover those services to a normal level might be a completely different list from your priority list of services that support IT functions after a disaster. A lot of other things might be important, but not as important as those.
The business priority may not always be about revenue. When I led a well-known software application development team, our most important priority in the event of a horrendous zero-day against our own product was to recover the servers and applications that would let us develop, test, and sign a security patch in a network-isolated environment, so that we could quickly return our customers to a secure state even if the initial release was minimally tested. We did just that. Our biggest lesson learned from practicing our plan was that signing the binary turned out to be the trickiest part. So, yeah, testing your plan is important too.
So, how to proceed….
The first step is easy: simply review your current disaster recovery plan. Are customer-facing services and their infrastructure/application dependencies already reflected in it? If so, are the priorities and dependencies up to date? If not, here are my thoughts on how to plan:
- Identify the Recovery Baseline Infrastructure: There is some definable subset of infrastructure, applications, and services that IT and security need in order to begin recovering elsewhere, conduct emergency communications, keep newly recovered servers secure, and respond to any flare-ups. I call this the recovery baseline.
- Identify the Critical Set of Customer-Facing Services: What critical services does your organization provide to its customers? These might be things like processing payments, investigating claims, performing home deliveries, delivering software, creating security patches, etc.
- Prioritize Customer-Facing Services Against Non-Recovery-Baseline IT Services: Determine your critical customer-facing services, their supporting applications, and their hosting infrastructure. Think through whether their recovery should be prioritized above any of the remaining IT services that need to be recovered (see the sketch after this list).
- Determine Any Impacts to the Level of Service: You won't be recovering from backups, so in the absence of backed-up data you'll need to understand what that means for your level of service. Set expectations around that level of service and ensure it is reflected in your communications plan.
- Revise Your Plan: Ensure that your new prioritization is folded back into your recovery plan. You may also want to get feedback on the customer-facing service recovery plan from a set of customers.
- Prepare Your Customer-Focused Disaster Communications: You'll want to prepare your list of customer services and have email and website templates ready to go in case of an emergency.
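As a rough illustration of how the recovery baseline and the prioritized customer-facing services could come together, here's a minimal sketch in Python. Every name in it (servers, services, priorities) is hypothetical and stands in for whatever your own inventory contains:

```python
# Hypothetical recovery baseline: what IT/security needs before anything else.
RECOVERY_BASELINE = ["ad-dc-01", "backup-target-01", "edr-console-01", "comms-gateway-01"]

# (priority, service, servers it depends on) - lower number recovers first.
CUSTOMER_SERVICES = [
    (1, "payment_processing", ["pay-app-01", "pay-db-01"]),
    (2, "order_fulfillment",  ["wms-app-01", "wms-db-01"]),
    (9, "swim_reservations",  ["rec-web-01", "rec-db-01"]),
]

def recovery_order():
    """Baseline first, then service dependencies in priority order, no duplicates."""
    ordered, seen = [], set()
    for server in RECOVERY_BASELINE:
        if server not in seen:
            ordered.append(server)
            seen.add(server)
    for _, _, servers in sorted(CUSTOMER_SERVICES, key=lambda s: s[0]):
        for server in servers:
            if server not in seen:
                ordered.append(server)
                seen.add(server)
    return ordered

if __name__ == "__main__":
    for step, server in enumerate(recovery_order(), start=1):
        print(f"{step:2d}. {server}")
```

Even a simple artifact like this makes the prioritization explicit ahead of time instead of leaving it to be argued out in the middle of the incident.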
The above is just the application of the lessons I took from the Mat-Su incident. These would be great topics to discuss with your executive team.
What other cyber incidents are there to learn disaster recovery lessons from? Let's learn together.
Follow me on Twitter for the latest blog updates: @Opinionatedsec1