DevOps insights from the CrowdStrike outage: boosting software resilience
Matt Saunders
26 November 2024
4 min read
In this blog, we highlight lessons from the CrowdStrike outage, emphasising the importance of DevOps practices, strategic testing, and automated infrastructure.
There are significant parallels between the lessons companies are drawing from the CrowdStrike outage and the principles and practices of the DevOps movement.
A question often asked about the CrowdStrike incident is "Why was this not tested properly?", and DevOps offers many of the answers. A key tenet of DevOps is a sharp focus on feedback loops, in which iterative improvements to software are measured, tested and fed back to developers. The kneejerk reaction to the incident is that this wasn't done. A detailed analysis of the root cause does not support that view, however; it suggests instead that a niche set of hard-to-test and hard-to-predict circumstances caused the outage.
The vast majority of our survey respondents say they will expand technical teams, focusing on DevOps and testing, which presents an opportunity for every organisation to level up on both. Organisations must avoid the trap of simply adding more and more tests: each one adds to build and test cycle time and can lengthen the feedback loop, the opposite of what we want. Rather, organisations should take a more strategic DevOps approach, relying less on unit tests and expanding their integration and acceptance testing capabilities. This, in turn, helps ensure that feedback loops are representative of the real world and rapid enough to fit into a highly iterative development process.
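One way to keep an eye on the trade-off between test coverage and feedback speed is to measure where pipeline time goes. The sketch below is a minimal illustration, not a description of any particular CI tool: the `Suite` record, the tier names and the budget figure are all assumptions, and it assumes the suites run one after another.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Suite:
    """Hypothetical record of one test suite in a pipeline run."""
    name: str
    tier: str                 # e.g. "unit", "integration" or "acceptance"
    duration_seconds: float   # wall-clock time the suite took


def time_by_tier(suites: List[Suite]) -> Dict[str, float]:
    """Sum wall-clock duration per test tier."""
    totals: Dict[str, float] = {}
    for suite in suites:
        totals[suite.tier] = totals.get(suite.tier, 0.0) + suite.duration_seconds
    return totals


def over_budget(suites: List[Suite], budget_seconds: float) -> bool:
    """True when the combined (sequential) test time exceeds the feedback budget."""
    return sum(suite.duration_seconds for suite in suites) > budget_seconds
```

A team could run a check like this after each pipeline run and treat a breach of the budget as a prompt to prune or rebalance tests rather than to keep adding them.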
"There is almost no human action or decision that cannot be made to look flawed and less sensible in the misleading light of hindsight."
Sidney Dekker
Just Culture: Balancing Safety and Accountability
Work on analysing disasters and outages in non-IT industries, such as that by Sidney Dekker and Steven J Spear, suggests that chalking incidents up to bad luck or freak situations misses an opportunity for improvement, and technical teams can learn much from this approach. Every incident is an opportunity to learn and to feed new ways of testing software back into the software delivery lifecycle.
Not many organisations write software that interacts directly with the Windows kernel; the majority of applications are built for the sandbox that a web application runs in. But we can learn from efforts to improve the testing of mobile applications, which invest significantly in infrastructure to test applications automatically, using real hardware devices where necessary. For the majority of organisations this is a capability that can be bought in. Giving developers access to automated infrastructure to procure and run these tests is crucial to maintaining the cadence of a feedback loop.
Many welcome increased regulation around software delivery, but those on the inside are wary of an increased burden placed on development teams to ensure software reliability and security. Good DevOps practices such as repetition and automation can help here, giving organisations an opportunity to interpret regulations and security requirements in an automated, continuous fashion.
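As a rough illustration of what interpreting requirements in an automated, continuous fashion can look like, the sketch below expresses a few rules as code and evaluates them against a release candidate on every pipeline run. Everything in it is hypothetical: the `Release` fields, the rule names and the 10% threshold are assumptions made for the example and are not drawn from any specific regulation or product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Release:
    """Hypothetical description of a release candidate, as a pipeline might assemble it."""
    version: str
    staged_rollout_percent: int      # share of the fleet that receives the update first
    integration_tests_passed: bool
    rollback_plan_documented: bool


@dataclass
class Policy:
    """A named rule; check returns True when the release complies."""
    name: str
    check: Callable[[Release], bool]


# Illustrative rules only; real ones would come from the organisation's own requirements.
POLICIES: List[Policy] = [
    Policy("staged rollout starts at 10% or less", lambda r: r.staged_rollout_percent <= 10),
    Policy("integration tests passed", lambda r: r.integration_tests_passed),
    Policy("rollback plan documented", lambda r: r.rollback_plan_documented),
]


def evaluate(release: Release, policies: List[Policy] = POLICIES) -> List[str]:
    """Return the names of every policy the release violates (empty means compliant)."""
    return [policy.name for policy in policies if not policy.check(release)]
```

A pipeline stage could call `evaluate` and fail the build whenever the returned list is non-empty, so that compliance is checked on every change rather than in a periodic manual review.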
We specialise in building effective automated testing (CI/CD) infrastructure on public and private clouds with tools like Terraform and Kubernetes. Our teams aim to solve these problems with collaboration and pragmatism, providing a blended result that is accessible to all and, most of all, surfaces problems quickly and coherently so that developers can get on with what they are best at.
Please refer to our press release for more details on our research findings and how Adaptavist can assist your organisation in navigating this new landscape with our DevOps services. Join us in shaping the future of software engineering, where resilience and innovation go hand in hand.
For more information on our DevOps services and resources, visit our DevOps Resource Hub.
Written by
Matt Saunders
DevOps Lead
From a background as a Linux sysadmin, Matt is an authority in all things DevOps. At Adaptavist and beyond, he champions DevOps ways of working, helping teams maximise people, process and technology to deliver software efficiently and safely.