The CrowdStrike Incident: The Devil is in the Details, and Chaos is in the Code

By Greg Sullivan, CIOSO Global

The CrowdStrike failure, a watershed moment in cybersecurity, stands as the most significant story of the year and potentially one of the most impactful of the decade. The flawed update it pushed to Windows operating systems worldwide crashed critical machines—an estimated 8.5 million—and sparked a global IT outage that will reverberate for months or even years to come.

And while fountains of ink have already been spilled on this subject, as a cybersecurity consultant with 30 years of experience in the field who helped clients through the CrowdStrike incident in real time, I find that many articles focus on the same few points over and over again. Yes, we are now finding out what happened, how it happened, and what CrowdStrike and its competitors will do to prevent such a situation from happening again.

But a wealth of interesting aspects of the failure, and its fallout, aren't widely known or being seriously discussed. They illuminate facets of what happened, and because our industry will be reacting to this incident for years to come, executives and cybersecurity/IT professionals will need to apply the lessons it teaches to reduce the impact of future incidents.

The “Trusted Provider” is Finished

First, from this point forward, the risk calculus for this type of event will be different for cybersecurity and risk management professionals. We must elevate this kind of risk: damaging software coming not from a hack or a social engineering attack but from a relied-on provider. The days of trusting partners, even those in charge of security, with absolute impunity are over.

Instead, we must treat cybersecurity updates like any other software update. This means applying greater resiliency at the end of the IT process and conducting more thorough and mandated testing before deployment. We can no longer give any partners a free pass and push their updates through. As much as we hope it won’t, we must assume that the CrowdStrike scenario could happen again.
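
As a simple illustration of that mindset shift, here is a minimal sketch in Python of a deployment gate in which no vendor, security or otherwise, is exempt from testing and sign-off. All of the names (VendorUpdate, UpdatePolicy, the example vendor) are hypothetical stand-ins for whatever change-management tooling your organization already uses; this is a sketch of the principle, not a prescription.

    from dataclasses import dataclass, field
    from typing import Optional, Set

    @dataclass
    class VendorUpdate:
        vendor: str
        version: str
        passed_test_ring: bool = False      # validated on an isolated canary group first
        approved_by: Optional[str] = None   # explicit human sign-off before rollout

    @dataclass
    class UpdatePolicy:
        # Security vendors are tracked here for reporting only; they get no special treatment.
        security_vendors: Set[str] = field(default_factory=set)

        def may_deploy(self, update: VendorUpdate) -> bool:
            # Even a "trusted" security vendor must clear testing and approval.
            return update.passed_test_ring and update.approved_by is not None

    if __name__ == "__main__":
        policy = UpdatePolicy(security_vendors={"ExampleEDR"})
        update = VendorUpdate(vendor="ExampleEDR", version="7.11.0")
        print(policy.may_deploy(update))    # False: no free pass without testing
        update.passed_test_ring = True
        update.approved_by = "change-advisory-board"
        print(policy.may_deploy(update))    # True: same gates as any other software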

Prepare For Opportunistic Exploiters

While there is no evidence (at the time I'm writing this) that the CrowdStrike failure was caused in any part by a hack or malicious actions, the fact remains that when things go wrong, bad actors will be on the scene trying to take advantage of the situation. The confusion, frustration, and desperation the average person feels when dealing with a work computer that has been "blue-screened" are real. Very soon after this incident, a malicious file started making the rounds, claiming to be a quick fix to the problem. The so-called "CrowdStrike hotfix" was simply malware, and it reached more people than usual as desperation replaced sensible action.
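
One basic safeguard is to verify any purported fix against a checksum published through the vendor's official channels before anyone runs it. Below is a minimal sketch in Python; the file name and expected hash are placeholders, and a real process would also confirm the advisory itself through a verified vendor channel rather than an emailed link.

    import hashlib
    import sys

    # Placeholder: replace with the value published in the vendor's official advisory.
    EXPECTED_SHA256 = "replace-with-hash-from-official-vendor-advisory"

    def sha256_of(path: str) -> str:
        # Hash the file in chunks so large packages don't need to fit in memory.
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "hotfix_package.zip"
        if sha256_of(path) != EXPECTED_SHA256:
            print("Checksum mismatch: do NOT run this file; report it to security.")
        else:
            print("Checksum matches the published value; proceed per your IR plan.")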

While this type of situation, a trusted provider being the source of a widespread outage, may not have been on our radar before, it is now. All employees must be prepared and know what is and is not protocol during these incidents. There is always room for more cybersecurity training for employees at every level, and that training should now incorporate lessons from the CrowdStrike failure. Start now; don't let the next incident be the beginning of your team's learning curve.

A View Into The Cure: Implementing The Fix

Some may be wondering what the breadth of responses and fixes looked like. Sure, we have all read countless retellings of the scramble that airlines and major healthcare organizations went through in the immediate aftermath of the outage. However, small to mid-sized businesses and local governments worldwide experienced just as much pain. Even in centralized offices, getting machines back up and running was manual and laborious, because the servers that would have run a fix and pushed it out were also down.

Given the number of devices being deployed remotely—as we’ve all enjoyed a more flexible approach to remote work in recent years—fixes were more complicated. In some cases, those fixes were unwieldy or borderline bizarre. IT personnel were going door to door in some locations, fixing machines one at a time. In cities, some organizations’ employees were told to bring their work computers to a central location to be worked on. Sometimes, IT would be forced to walk people through the fixes over the phone. Anyone who’s ever tried to help a relative troubleshoot their laptop over the phone knows how far from efficient that process is.

In some cases, we even saw organizations breaking some of the most cardinal rules of basic security hygiene. I heard of flash drives with the script to implement a fix being mailed nationwide. CrowdStrike knocked us back a few decades regarding the technical sophistication of our fix implementations.

Cybersecurity Tools and N+1 Redundancy

We need to reevaluate our processes and tools closely, because we can't rely solely on a "trusted provider" anymore. I suggest making the most of all the cybersecurity tools at your disposal. Some IT departments turn off overlapping software security features, but this is a mistake. It's additional work to run more processes, but that extra security insight can serve as a check and balance to confirm an exploit, or even pinpoint an entirely different attack surface not identified by other cybersecurity tools.

For example, I’ve seen organizations run multiple vulnerability checks on the same data set or database from different software programs—without fail, the results are never the same. Cybersecurity tools are imperfect, but using multiple tools can help mitigate risks. It’s essential to gather information through the lens of different cybersecurity systems and focus that data onto a single-pane-of-glass dashboard where a holistic analysis can yield more informed decisions.
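
To make the aggregation step concrete, here is a minimal sketch assuming each scanner can export its findings as asset/vulnerability pairs; the scanner names and findings are invented for illustration. Findings confirmed by every tool rise to the top of the dashboard, while findings reported by only one tool become leads to verify rather than noise to discard.

    from typing import Dict, Set, Tuple

    Finding = Tuple[str, str]  # (asset, vulnerability identifier such as a CVE)

    # Hypothetical exports from two different vulnerability scanners.
    scanner_results: Dict[str, Set[Finding]] = {
        "scanner_a": {("db-server-01", "CVE-2024-0001"), ("web-01", "CVE-2024-0002")},
        "scanner_b": {("db-server-01", "CVE-2024-0001"), ("db-server-01", "CVE-2024-0003")},
    }

    # Findings every tool agrees on: the highest-confidence items for the dashboard.
    confirmed = set.intersection(*scanner_results.values())

    # Findings only some tools report: not discarded, but queued for human review.
    disputed = set.union(*scanner_results.values()) - confirmed

    print("Confirmed by all tools:", sorted(confirmed))
    print("Reported by only some tools (needs review):", sorted(disputed))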

Organizations should look to enable N+1 redundancy in their security measures. The concept of “N+1 Redundancy” refers to a system design principle where there is an extra “+1” component for every “N” component necessary to ensure continued system operation. In the context of cybersecurity, this means having a backup cybersecurity solution that is as effective as the primary one.
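
As an illustration of the idea rather than a product recommendation, the sketch below models a set of required controls, each paired with one backup (the "+1"), and fails over when the primary agent stops reporting. The agent names and the health probe are hypothetical; in practice the probe would query the agent or its management console.

    from typing import Dict, List

    def is_healthy(agent: str) -> bool:
        # Placeholder health probe; here we simulate the primary EDR being down.
        return agent != "primary-edr"

    controls: Dict[str, List[str]] = {
        # control -> [primary, backup]; the backup is the "+1" component
        "endpoint-protection": ["primary-edr", "backup-av"],
        "log-collection": ["primary-forwarder", "backup-forwarder"],
    }

    for control, agents in controls.items():
        active = next((a for a in agents if is_healthy(a)), None)
        if active is None:
            print(f"{control}: all agents down, escalate immediately")
        elif active != agents[0]:
            print(f"{control}: primary down, running on backup '{active}'")
        else:
            print(f"{control}: primary '{active}' healthy")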

That said, it's possible we'll see some companies adapt their update policies and move to an N-1, or in rare cases N-2, model for incoming patches. Since updates themselves now represent a possible risk, there's an argument for waiting one update cycle so that any similarly disastrous bugs are discovered by others rather than by one's own company. This means living longer with known vulnerabilities, but that exposure is weighed against avoiding the disruption of a bad update. Which patches can wait an extra cycle and which are so critical that they must be applied immediately will likely be a case-by-case decision; each company should set its own policy and choose the approach that's right for it.
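
For illustration only, here is a minimal sketch of such an N-1 policy with an assumed 14-day cycle and an exception for critical fixes; the severity labels and timing are placeholders for whatever your own change-management process defines.

    from dataclasses import dataclass
    from datetime import date, timedelta

    DEFER_ONE_CYCLE = timedelta(days=14)  # assumed length of one update cycle

    @dataclass
    class Patch:
        name: str
        severity: str       # e.g. "low", "medium", "high", "critical"
        released: date

    def install_after(patch: Patch) -> date:
        # Critical fixes (actively exploited flaws, remote code execution, etc.)
        # skip the waiting period; everything else sits out one cycle.
        if patch.severity == "critical":
            return patch.released
        return patch.released + DEFER_ONE_CYCLE

    if __name__ == "__main__":
        routine = Patch("sensor-content-update", "medium", date(2024, 7, 19))
        urgent = Patch("exploited-rce-fix", "critical", date(2024, 7, 19))
        print(routine.name, "->", install_after(routine))  # deferred one cycle
        print(urgent.name, "->", install_after(urgent))    # installed immediately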

What’s Next? Review Your Incident Response Plan

Once your company has recovered from this outage, there’s no better time to reassess your incident response plan. All IT departments should hold a meeting to review the effectiveness of their response to the CrowdStrike incident and ask questions such as:

  • Were clear roles, responsibilities, and policies established?
  • Was the current plan executed as it was written up? If not, why?
  • When was the last time the communication plan was updated?
  • Were the proper individuals notified in the correct order?
  • Who is responsible for triage to assess the severity and impact?
  • Who is monitoring the restored systems and documenting the effectiveness of all actions?
  • Is the incident response plan updated to properly account for the “trusted provider” scenario discussed here?

With every new incident, reassess the plan and keep it fresh and current by performing tabletop exercises to uncover any weaknesses or outdated information. Remember, the goal of a tabletop exercise is to test an organization's preparedness and response against realistic scenarios. It's also important to conduct a debriefing, a.k.a. a hotwash, where an after-action review analyzes strengths and weaknesses.

Conclusion: Pivot and Prepare

The CrowdStrike incident may have been unexpected, but a similar one in the future should not catch organizations off guard. We need to operate under the assumption that such a failure from a "trusted" partner will happen again, somewhere, sometime. To prepare for the next occurrence, IT needs to treat cybersecurity updates the way it treats standard software patches. Following those established procedures means reviewing the update notes, applying the update in a test environment, getting the necessary approval for deployment, staggering the rollout, and informing all relevant stakeholders.

Our digital ecosystem is too interconnected for blind trust. Trusting software updates with impunity and circumventing the procedural steps applied to other patches and updates will continue to thrust our fragile IT habitat into chaos. Apply the "commonsense filter" to software updates; the devil is in the details, and chaos is in the code.
