Mission control: Lessons in agile risk management
Jean Henson
21 November 2024
10 min read
In mission control, engineers and mission managers face the highest level of pressure when systems fail, but their response isn't based on panic. Instead, it's grounded in agile principles.
Lessons in risk and resilience
On Friday 19 July 2024, at 05:27 UTC, a critical issue at CrowdStrike led to a global outage impacting multiple industries. As a result, many organisations experienced the dreaded Blue Screen of Death (BSOD), which significantly disrupted business operations across sectors including airlines, banks, and retailers.
The issue originated from a faulty update to the CrowdStrike Falcon sensor software deployed on Windows machines, but the full specifics are still unknown. The impact of this incident has been far-reaching in the tech world. Our research on this incident found that 87 percent of organisations experienced downtime, and 38 percent of companies faced severe operational disruptions lasting more than 24 hours. Organisations globally have since taken considered steps to review and improve their agile processes to help mitigate the impact of potential future disruption.
A preventable complex system failure
This event raised important questions about system resilience, response times, and organisational preparedness. Global organisations can learn lessons from this incident and ensure they are aligned with agile principles, particularly around organisational change, risk management, and adaptability.
To understand the broader implications of this outage, we can draw a comparison with a launch failure in mission control. Space agencies like NASA, which uses the Scrum agile methodology to develop software for projects like the Space Launch System (SLS), regularly deal with incredibly high stakes during rocket launches, where split-second decisions and cross-functional collaboration can mean the difference between success and failure. These ambitious launch projects involve numerous complex components, teams, and stakeholders, making interdepartmental collaboration crucial.
In mission control, engineers and mission managers face the highest level of pressure when systems fail, but their response isn't based on panic. Instead, it's grounded in agile principles that prioritise rapid iteration, collaboration, and the ability to adapt to new information. NASA uses Scrum with its iterative feedback loops, transparent communication, and failover systems to ensure that if something goes wrong, there's a path forward. Could these agile practices have made a difference with CrowdStrike?
Iterative problem-solving and rapid response
Mission control operates in short development cycles, analysing telemetry data, testing hypotheses, and quickly adapting based on new information. When a problem arises during a rocket launch, mission controllers follow a series of if-this-then-that scenarios, working through multiple contingencies until the issue is resolved.
In an agile organisation, adopting this iterative problem-solving mindset often speeds up response time during outages. For example, a more agile approach would involve setting up a cross-functional incident response team that regularly cycles through diagnosis, remediation, and validation without waiting for the 'perfect' fix. The goal is to get something working quickly, test it in real time, and iterate rapidly.
By proactively embracing these agile feedback loops, issues are detected earlier, and the team can deploy a more immediate solution rather than waiting for a larger, system-wide fix that takes longer to implement. This mirrors the way mission control constantly runs diagnostic checks during a mission, making tweaks in real time to ensure success.
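The cycle of diagnosis, remediation, and validation described above can be sketched in a few lines of Python. This is an illustration only: the simulated system, the `CandidateFix` type, and the two example fixes are all hypothetical and do not represent NASA's or CrowdStrike's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CandidateFix:
    """A hypothetical remediation to try, quickest first."""
    name: str
    apply: Callable[[dict], None]  # mutates the simulated system state


def run_incident_loop(
    system: dict,
    is_healthy: Callable[[dict], bool],
    fixes: List[CandidateFix],
) -> Optional[str]:
    """Try each fix, validate immediately, and roll back if it did not help.

    Returns the name of the fix that restored health, or None if none did.
    """
    for fix in fixes:
        snapshot = dict(system)   # copy the state so a failed fix can be undone
        fix.apply(system)         # remediate: try something quickly
        if is_healthy(system):    # validate: test it straight away
            return fix.name
        system.clear()            # discard the ineffective fix and pivot
        system.update(snapshot)
    return None


# Simulated system (assumption): a bad configuration keeps the service down.
system = {"config_version": "faulty", "service": "down"}


def restart_service(state: dict) -> None:
    # A restart only helps once the faulty configuration is gone.
    state["service"] = "up" if state["config_version"] == "good" else "down"


def roll_back_config(state: dict) -> None:
    state["config_version"] = "good"
    state["service"] = "up"


fixes = [
    CandidateFix("restart service", restart_service),
    CandidateFix("roll back config", roll_back_config),
]
result = run_incident_loop(system, lambda s: s["service"] == "up", fixes)
print(result)
```

In this sketch the quick restart is tried first, fails validation, and is discarded before the rollback succeeds. The ordering of fixes is a design choice: cheap, reversible actions go first so that each iteration costs little.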
Cross-functional collaboration under pressure
At NASA, the key to success in mission control is cross-functional collaboration. Engineers from various disciplines, including propulsion, communication, software, and telemetry, work seamlessly together to solve problems. During a launch failure, everyone involved must share information quickly, make decisions collaboratively, and remain flexible.
Similarly, when systems fail within an organisation, hindsight often shows that a more agile, cross-functional response would have helped. An agile organisation places great emphasis on collaboration and breaking down silos. Cross-functional teams working together can reach a faster, more efficient resolution, or even prevent the failure altogether. Real-time communication and shared accountability are crucial in high-stakes environments, whether in mission control or cybersecurity operations.
Transparency and communication
Transparency is key to maintaining trust when a failure occurs, whether in a space mission or a cybersecurity service. Mission control teams are trained to be transparent with astronauts and stakeholders, giving them updates on what's happening, why, and what steps to take to resolve the issue.
Regardless of industry, organisations can benefit from this approach by providing customers with more frequent and transparent updates about the progress of an outage. Agile practices emphasise continuous communication: internal teams should be in constant touch, while external stakeholders and customers (as in CrowdStrike's case) should be kept informed of what's going on and when they can expect a resolution. This kind of openness helps reduce customer anxiety and creates a feedback loop that can help accelerate problem-solving.
Fail-fast and pivot
One of the core tenets of agile is the fail-fast approach. This principle encourages teams to try solutions quickly, learn from their failures, pivot, and do something different. In mission control, this approach is essential because every second counts: if a failure is detected early, engineers can rapidly adjust or pivot before it becomes a larger crisis.
An organisation's response to an outage can benefit from an agile fail-fast culture by testing potential solutions quickly and discarding ineffective ones sooner. The agility to fail quickly, learn without blame, and improve helps teams resolve the issue faster, even if they don't have the perfect solution right away. By treating any incident as a learning opportunity, organisations can minimise both downtime and customer impact.
'The CrowdStrike outage has prompted a shift towards learning rather than blame, with companies prioritising radical candour and psychological safety.'
Adaptavist
Iterative risk management
In mission control, risk management is not a one-time task; it's a continuous process. Engineers and mission planners constantly assess and reassess risks, creating contingency plans for every possible failure scenario. This level of preparedness helps mitigate the risks of failure and ensures that systems can fail gracefully if needed.
Agile risk management includes proactive 'risk sprints': short, focused periods within a regular sprint that the team dedicates to identifying, testing, and mitigating potential vulnerabilities in smaller, more manageable chunks. In an agile context, risk management involves continuously identifying, assessing, and mitigating risks throughout the development and operational processes. The CrowdStrike outage serves as a reminder of the potential risks associated with technology and the need for proactive measures to mitigate them. As part of robust risk management practices, continuous risk assessments could leave an organisation better prepared for incidents like this one, shortening downtime or even preventing such incidents entirely.
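One common way to make continuous risk assessment concrete is a simple risk register scored by likelihood multiplied by impact. The sketch below assumes a scale of 1 to 5 for both, and the example risks and scores are invented for illustration; they are not drawn from the research cited in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Classic risk-matrix score: likelihood multiplied by impact.
        return self.likelihood * self.impact


def top_risks(register: List[Risk], limit: int) -> List[Risk]:
    """Return the highest-scoring risks to tackle in the next risk sprint."""
    return sorted(register, key=lambda r: r.score, reverse=True)[:limit]


# Hypothetical register entries, re-scored at the start of each sprint.
register = [
    Risk("Vendor update deployed without staged rollout", likelihood=3, impact=5),
    Risk("No tested rollback procedure", likelihood=2, impact=5),
    Risk("Out-of-date on-call contact list", likelihood=4, impact=2),
]

for risk in top_risks(register, limit=2):
    print(f"{risk.score:>2}  {risk.name}")
```

Re-scoring the register every sprint, rather than once a year, is what turns this from a static document into the continuous process described above.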
A culture of continuous learning
Compliance with industry standards and regulations is critical to managing risk in a company's overall business operations. Any incident highlights the need to ensure systems and processes comply with relevant security and operational standards. Agile practices can support compliance by including compliance representatives in agile processes, thus promoting transparency, regular audits, and continuous improvement. These help maintain adherence to compliance requirements and drive a cultural shift toward more proactive compliance management within agile teams. Proactive risk management fosters a culture of continuous learning, where teams regularly reflect on past incidents to improve future compliance management strategies.
The CrowdStrike incident has also driven a shift in attitudes to regulation. 42.5 percent of organisations we surveyed are now more supportive of software industry regulations due to the outage. Incident reporting and testing regulations are gaining the most traction.
An agile approach to compliance management involves integrating security practices into the development lifecycle. The CrowdStrike incident highlights the importance of considering security as a fundamental aspect of product development rather than an afterthought.
In conclusion
In mission control, it's not just about avoiding failure—it's about responding quickly and learning from it to improve. In today's fast-paced, interconnected world, agile processes aren't just nice to have; they're essential for survival.
So, what are organisations doing differently as a result of this incident? Our research findings indicate that 'widescale changes to software engineering practices are being planned, whereby 89.25 percent of respondents in our research report plan to invest in agile and DevOps practice training.'
Written by
Jean Henson
Director Business Agility
Jean has spent 20 years in information technology and business process improvement. Successful in business analyst, IT tool management, and customer success roles, at Adaptavist Jean helps enterprise clients transform their processes and teams to deliver exceptional value.