Monday, July 29, 2019

Why Do We Emphasize Everything At Scale Except Disaster Recovery?


“Disaster recovery” is often thought about and planned on a server-by-server basis. A server goes down and gets recovered. A bunch of servers go down and we go to backups.

The linchpin of almost every org's disaster recovery plan is recovery from backups. A bad day might consist of an incident that requires restoring from those backups. That said, on the worst of really bad days, everything may be encrypted by the bad actors and even the backups might be compromised. What then?




How prepared are you to recover from, well, almost nothing? You know... an entire organization that needs to be recovered... at scale.

Think this scenario is impossible? The June 2018 ransomware attack on the Alaskan borough of Matanuska-Susitna (Mat-Su) is just such an example.

When we have information about such a rare event, we can’t just conclude “don’t click on random attachments” and waste the chance to look at what disaster recovery at scale really looks like. There are lessons to be learned everywhere, especially in this case.

How prioritized are your organization’s customers and their objectives in a large-scale incident? For instance, Mat-Su provided services both to support their internal operations and to serve their constituents, just like any county would. Some examples might include:


  • Paying invoices to county vendors in a timely manner
  • Issuing and archiving marriage licenses
  • Processing constituent PII for county responsibilities
  • Managing swim instruction reservations at the borough’s swimming pool

Mat-Su didn't just get hit by ransomware. They were wrecked.

They lost production systems, forcing employees to use typewriters. The ransomware tried to encrypt backups but apparently failed. That said, my understanding is that the borough lost years of archived emails.


For months, as part of their response, Mat-Su communicated on a daily basis about which services were available that day, any limitations to the level of service, and when services still impacted were expected to return. That daily communication implies a behind-the-scenes prioritization for recovering these customer-facing services.

A reasonable disaster recovery planner can surmise that underlying the prioritization was a list of servers that needed to be brought online to enable any given county service. There was a lot of recovery to be done, and I’d guess that the systems required for swim lessons weren’t the best use of that initial recovery time.

How do we translate this example taken from a county government to your company?

In a revenue-producing company, the systems required to interact with customers and process payments would likely be considered critical services by your executive team. The applications and infrastructure required to recover those services to a normal level might comprise a completely different list from your priority list of services to support IT functions after a disaster. A lot of other things might be important, but not as important as those.

The business priority may not always be about revenue. When I led a well-known software application development team, the most important priority in the event of a horrendous zero-day against our own product was to recover the servers and applications that would let us develop, test, and sign a security patch in a network-isolated environment, so that we could quickly return our customers to a secure state even if the initial release was minimally tested. We did just that. Our biggest lesson learned from practicing our plan was that signing the binary turned out to be the trickiest part. So, yeah, testing your plan is important too.

So, how to proceed….

The first step is easy. Simply review your current disaster recovery plan. Are customer-facing services and their infrastructure/application dependencies already reflected in it? If so, are the priorities and dependencies up to date?
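If your plan's service list lives in a spreadsheet or CSV export, even a quick script can flag drift between the plan and reality. Here's a minimal sketch in Python; the file names (dr_plan.csv, service_inventory.csv) and the "service" column are hypothetical stand-ins for whatever your own tooling exports.

    import csv

    def service_names(path):
        """Read the set of service names from a CSV export with a 'service' column."""
        with open(path, newline="") as f:
            return {row["service"].strip().lower() for row in csv.DictReader(f)}

    # Hypothetical exports: the DR plan's service list and the current inventory.
    planned = service_names("dr_plan.csv")
    actual = service_names("service_inventory.csv")

    print("In inventory but missing from the DR plan:", sorted(actual - planned))
    print("In the DR plan but gone from inventory:", sorted(planned - actual))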

If customer-facing services aren't reflected in your plan at all, here are my thoughts on how to plan:

- Identify the Recovery Baseline Infrastructure: There is some definable subset of infrastructure, applications, and services that IT and security need in order to begin recovery elsewhere, conduct emergency communications, keep newly recovered servers secure, and respond to any flare-ups. I call this the recovery baseline.

- Identify the Critical Set of Customer-Facing Services: What critical services does your organization provide to its customers? These might be things like processing payments, investigating claims, performing home delivery, delivering software, creating security patches, etc.

- Prioritize Customer-Facing Services Against Non-Recovery-Baseline IT Services: Determine your critical customer-facing services, their supporting applications, and their hosting infrastructure. Think through whether their recovery should be prioritized above any of the remaining IT services that need to be recovered (a sketch of one way to capture this follows the list).

- Determine Any Impacts to the Level of Service: You won’t be recovering from backups, so in the absence of backed-up data you’ll need to understand what that means for your level of service. Set expectations around that level of service and ensure it is reflected in your communications plan.

- Revise Your Plan: Ensure that your new prioritization is folded back into your recovery plan. You may also want to get feedback about the customer-facing service recovery plan from a set of customers.

- Prepare Your Customer-Focused Disaster Communications: You’ll want to prepare your list of customer services and have email-ready and website-ready templates ready to go in case of an emergency.
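To make the prioritization step concrete, here is a minimal sketch of what a prioritized, dependency-aware service inventory might look like, written in Python purely for illustration (a spreadsheet or YAML file works just as well). The service names, dependencies, and tier numbers are invented assumptions, not taken from Mat-Su or any real plan:

    # Hypothetical prioritized service inventory: tier 0 is the recovery baseline
    # (what IT and security need before anything else); lower tiers come back first.
    SERVICES = {
        "identity/directory": {"tier": 0, "depends_on": []},
        "emergency comms":    {"tier": 0, "depends_on": ["identity/directory"]},
        "endpoint security":  {"tier": 0, "depends_on": ["identity/directory"]},
        "payment processing": {"tier": 1, "depends_on": ["identity/directory"]},
        "customer portal":    {"tier": 1, "depends_on": ["identity/directory", "payment processing"]},
        "internal wiki":      {"tier": 2, "depends_on": ["identity/directory"]},
        "swim lesson signup": {"tier": 3, "depends_on": ["customer portal"]},
    }

    def recovery_plan(services):
        """Group services by recovery tier so the order of work is explicit."""
        tiers = {}
        for name, info in services.items():
            tiers.setdefault(info["tier"], []).append(name)
        return dict(sorted(tiers.items()))

    if __name__ == "__main__":
        for tier, names in recovery_plan(SERVICES).items():
            for name in names:
                deps = ", ".join(SERVICES[name]["depends_on"]) or "none"
                print(f"tier {tier}: {name} (depends on: {deps})")

Even a toy inventory like this forces the two conversations that matter: what belongs in the recovery baseline, and which customer-facing services jump the queue ahead of everyday IT.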

The above are just my application of the lessons I took from the Mat-Su incident. They would be great topics to discuss with your executive team.

What other cyber incidents hold great disaster recovery lessons? Let’s learn together.

Follow me on Twitter for the latest blog updates: @Opinionatedsec1 
