Adaptavist works with many organisations to build software to be deployed onto their own servers—often web applications that run in a user's browser. And in these cases it's possible and desirable to have a seamless and rapid deployment process based around the principles of Continuous Delivery. Small, distinct, and frequent deployments are proven to reduce risk, deliver value to customers faster, and make recovery from a bad deployment simpler.
However, software intended to run on other organisations' computers, for example on reservation terminals at airports or on embedded devices in healthcare, presents new problems. At Adaptavist, we see this in the development of plugins and add-ons such as ScriptRunner in the Atlassian ecosystem: these are designed to run not only in cloud environments but also, for our 'Data Center' products, on customers' servers with widely varying hardware specifications.
The CrowdStrike Falcon software is deployed to a very wide array of different hardware, and testing every combination is both costly and time-consuming. Similarly, software intended for mobile phones faces a huge Cartesian matrix of possible device and operating-system combinations. Testing is generally carried out on virtual servers, which is faster and simpler than using physical devices but can let some incompatibilities slip through. Equally, testing everything manually is not realistic. Finding the right balance between automated and manual testing in this situation is vital.
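To give a concrete sense of how such a combination matrix can be attacked with automation, the sketch below enumerates platform combinations as parameterised tests. The platform names, the unsupported combination, and the smoke check are hypothetical placeholders, not a description of any real product's test suite.

```python
# Illustrative only: a parameterised compatibility matrix with hypothetical
# platform names; a real suite would provision matching VMs or devices.
import itertools
import pytest

OPERATING_SYSTEMS = ["windows-server-2022", "windows-11", "rhel-9", "ubuntu-22.04"]
ARCHITECTURES = ["x86_64", "arm64"]

# Combinations known in advance not to be supported are skipped, not silently ignored.
KNOWN_UNSUPPORTED = {("windows-11", "arm64")}

def smoke_test_passes(os_name: str, arch: str) -> bool:
    """Placeholder: in practice, install the build on a matching environment and run checks."""
    return True

@pytest.mark.parametrize(
    "os_name,arch",
    list(itertools.product(OPERATING_SYSTEMS, ARCHITECTURES)),
)
def test_build_runs_on_platform(os_name, arch):
    if (os_name, arch) in KNOWN_UNSUPPORTED:
        pytest.skip(f"{os_name}/{arch} is not a supported combination")
    assert smoke_test_passes(os_name, arch)
```

Automating coverage of the matrix in this way frees manual testing effort for the combinations where virtual environments are least faithful to real hardware.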
So, once the software is built and thoroughly tested, we move to the deployment phase. We always advise a progressive or 'blue/green' rollout of software, but for software that runs on other people's computers this often isn't possible. Carrier-grade network routers often ship with the ability to deploy new software to a resilient control plane: if the update doesn't work, the device can fall back to the known working software to prevent downtime. We see the same idea echoed in software architecture, where load balancers can be configured to stop sending traffic to broken instances or pods, and we mirror it with progressive rollouts, which contemporary orchestrators such as Kubernetes and modern serverless platforms make seamless. The types of devices that CrowdStrike Falcon runs on do not support this level of resilience, which means rigorous testing is absolutely essential when you have limited control of the end-user device.
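As a minimal sketch of what a progressive rollout with automatic rollback looks like where the platform does support it, the code below pushes a new version to a growing fraction of a fleet and reverts everything touched if health checks fail. The Fleet class and its methods are hypothetical stand-ins for a real orchestrator or device-management API, not how Kubernetes, a serverless platform, or CrowdStrike's tooling actually works.

```python
# A minimal sketch of a progressive rollout with automatic rollback.
# Fleet and its methods are hypothetical stand-ins for a real deployment API.
from dataclasses import dataclass, field

@dataclass
class Fleet:
    devices: list[str]
    versions: dict[str, str] = field(default_factory=dict)

    def deploy(self, device: str, version: str) -> None:
        self.versions[device] = version          # push the given build to one device

    def healthy(self, device: str) -> bool:
        return True                              # placeholder health check

def progressive_rollout(fleet: Fleet, new_version: str, old_version: str,
                        stages=(0.01, 0.10, 0.50, 1.0)) -> bool:
    """Deploy in waves; if any wave reports failures, roll back everything touched."""
    updated: list[str] = []
    for fraction in stages:
        target = fleet.devices[:int(len(fleet.devices) * fraction)]
        for device in (d for d in target if d not in updated):
            fleet.deploy(device, new_version)
            updated.append(device)
        # In reality: wait for a soak period and collect telemetry before checking health.
        if not all(fleet.healthy(d) for d in updated):
            for device in updated:               # revert to the known good version
                fleet.deploy(device, old_version)
            return False
    return True

if __name__ == "__main__":
    fleet = Fleet(devices=[f"host-{i}" for i in range(100)])
    print("rolled out:", progressive_rollout(fleet, "2.0.1", "2.0.0"))
```

The key property is that a bad build only ever reaches a small fraction of devices before the first health check, and the rollback path uses a version already known to work.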
There are further lessons to come when the full details of how the software interacts with the operating system kernel come to light, but we can already see some valid takeaways. Ensuring loose coupling of components, documenting API contracts between components, and using circuit breakers can all mitigate the impact of a bad deployment. In this case, the interaction between CrowdStrike Falcon and Windows caused computers to crash entirely, meaning that simply redeploying a known good version of Falcon wasn't possible. This risk is common to operating systems whose security models allow software direct access to the kernel, and we don't yet fully understand why the failure was so severe. But our takeaway is to run software with the minimum possible privileges, to limit the damage a bad version can cause.
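To make the circuit-breaker point concrete, here is the general shape of the pattern: after repeated failures a caller stops invoking the troubled dependency and returns a safe fallback, which buys time to redeploy without cascading failures. This is a generic sketch with invented names, assuming an in-process dependency call; it is not the design of Falcon or of any particular library.

```python
# A minimal circuit-breaker sketch: after `max_failures` consecutive errors the
# breaker opens, calls are short-circuited, and a trial call is allowed after
# `reset_after` seconds.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        half_open = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback                      # open: fail fast without calling fn
            half_open = True                         # allow a single trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if half_open or self.failures >= self.max_failures:
                self.opened_at = time.monotonic()    # (re)open the breaker
            return fallback
        self.failures = 0
        self.opened_at = None                        # trial succeeded: close the breaker
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)

    def flaky_scan() -> str:
        raise RuntimeError("dependency unavailable")

    # After two failures the breaker opens and subsequent calls return the fallback immediately.
    for _ in range(4):
        print(breaker.call(flaky_scan, fallback="scan skipped"))
```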
Having a well-thought-out incident response process is also vital. The full details of why the CrowdStrike Falcon update was pushed to all devices at 6 am are not yet known, but we must accept that sometimes such a push absolutely has to happen. And if the rollout fails, it's critical that customers receive updates, remedial actions, and effective communication from the software vendor.
Adaptavist is in the business of helping organisations across the globe work better, through software, services, and techniques that make a difference in how these organisations operate and how software can be delivered quickly, reliably, and safely. There are still many questions over the CrowdStrike incident: we don't have all the answers yet, much commentary, including our own, is speculative, and it's likely that the issues behind such a dramatic outcome can't be completely mitigated. But we do know sound, solid principles of software engineering, testing, deployment, and incident response, and we can help.