Don’t overreact to the CrowdStrike outage; being more nuanced will help you win
Matt Saunders
22 January 2025
6 min read
Discover lessons from the CrowdStrike outage to strengthen incident response and foster a resilient DevOps culture.
Given the severity of the outage and its worldwide implications for businesses since July 2024, it has been commonplace to see organisations levelling accusations of lack of preparation at their IT teams. But the takeaways we've found, especially that 84% of organisations admitted to not having an adequate incident response plan in place, indicate a heightened level of awareness that progressive organisations can leverage to improve the situation.
You never want a serious crisis to go to waste.
Rahm Emanuel
Chief of Staff to Barack Obama
For decades, IT Operations teams have been unsung heroes, fixing servers and restarting applications to maintain uptime; only the DevOps revolution of the last fifteen years has brought these efforts to prominence. Getting a large organisation to commit people, time and effort to improving reliability and putting rigorous incident plans in place has traditionally been hard, because these efforts don't directly contribute to the bottom line. But seeing a major outage play out in ways that directly affected so many others is a morbid opportunity to secure resources to improve one's own situation, a conclusion reinforced by the fact that only 16% of our respondents said their own incident response plan was effective.
Whilst "we don't want to be the next CrowdStrike" is not the most positive way to frame operational improvements, it's a great lever for organisations to focus in on SRE (site reliability engineering), the groundwork for which is now well-established, and an opportunity for those who were under the mistaken belief that "it couldn't happen here".
Testing strategies have come under the spotlight, too. The answer to the clichéd question "why wasn't this update tested?" is, of course, "it was!" in the majority of organisations we surveyed. But one of the core reasons the CrowdStrike outage happened was that the set of circumstances that caused it was genuinely hard to test, and our research confirms that more nuanced discussions about proper testing are now taking place, with about half of the organisations surveyed intending to do more work on unit and integration tests. The outage has shone a light on situational awareness, and we're encouraging organisations to take time to understand the path less trodden when writing tests, particularly integration tests, where the software may not be used in exactly the situation its authors intended.
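To make that concrete, here is a minimal sketch of what testing the path less trodden can look like. The parse_update function and the payloads below are purely illustrative assumptions of ours, not CrowdStrike's code or any real product's update format; the point is that malformed, truncated and corrupted inputs deserve their own tests alongside the happy path.

```python
# Illustrative sketch only: a hypothetical parser for a content/config
# update, plus tests for the "paths less trodden". Nothing here reflects
# any vendor's actual code or file format.
import pytest


def parse_update(raw: bytes) -> dict:
    """Toy parser: expects UTF-8 'key=value' lines."""
    if not raw:
        raise ValueError("empty update payload")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on corrupt bytes
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        entries[key.strip()] = value.strip()
    return entries


# The happy-path test most teams already have.
def test_valid_update_parses():
    assert parse_update(b"channel=291\nversion=7") == {
        "channel": "291",
        "version": "7",
    }


# The less-trodden paths: empty, truncated and corrupted payloads should
# fail loudly and predictably rather than crash the caller.
@pytest.mark.parametrize(
    "payload",
    [b"", b"channel=291\nversio", b"\x00\xff\xfe garbage"],
)
def test_malformed_update_is_rejected(payload):
    with pytest.raises((ValueError, UnicodeDecodeError)):
        parse_update(payload)
```

Parameterising the failure cases keeps this kind of edge-case coverage cheap to extend as new "it couldn't happen here" scenarios come to light.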
Nearly everyone who responded to our survey said they were expanding their technical teams in reaction to the CrowdStrike outage. We urge organisations to look strategically at these extra hires and build up engineering competence around site reliability, testing and security through a DevOps lens of continuous improvement. As the Head of DevOps, Jobin Kuruvilla, highlights in the report, this is the only way to ensure that lasting change is made. So, rather than hiring lots of dedicated, siloed testers, organisations should look to hire people who can improve quality across the whole lifecycle of an application.
Many of the responses we saw emphasised a positive outcome, with light shone on areas of competency that are usually underfunded and underrepresented, such as operations, security and testing. But the CrowdStrike outage also highlighted to the world how life-critical some IT systems can be, and the long-term reaction to the emotions an outage of this scale provoked can set back cultural initiatives around psychological safety and learning over blame. An incident of this magnitude can lead to micro-management, excessively constrained additional processes, and a culture where innovation is stifled because every change is perceived to need heavy scrutiny to avoid incidents. Slogans such as "right first time, every time" can easily reduce delivery momentum to a crawl.
From a DevOps angle, our takeaways from the report are:
- Don't be dragged under by increasing scrutiny of everything in the belief that this alone makes software better
- Carefully analyse your software's use cases and make smart and pragmatic decisions on how to test it
- Automate all of these tests in an up-to-date and well-maintained continuous integration system
- Work actively to prevent oversight creep by fostering a collaborative, innovative culture where engineers can take calculated risks safely.
We can help. Have a look at our DevOps maturity assessment and get in touch.
Written by
Matt Saunders
DevOps Lead
From a background as a Linux sysadmin, Matt is an authority in all things DevOps. At Adaptavist and beyond, he champions DevOps ways of working, helping teams maximise people, process and technology to deliver software efficiently and safely.
DevOps