DevOps Decrypted: Ep.30 - Key takeaways from the CrowdStrike downtime

In this episode of DevOps Decrypted, Jon and Jobin discuss the CrowdStrike outage of the summer of 2024. They review a survey of over 400 affected organisations, revealing actions taken since and their preparedness for future crises.

In this lighter episode of DevOps Decrypted, Jon and Jobin meet to discuss a heavy subject: the massive CrowdStrike outage of summer 2024. Everyone in tech (and many outside of the industry) remembers the day it happened. But what have we all learned from it?

Well, we might have the answers. Adaptavist surveyed more than 400 participants from medium and large organisations, all of whom were affected by the CrowdStrike outage. Some of the results are pretty thought-provoking. Jon and Jobin run through the findings of our exclusive report, revealing what actions have been taken since—and whether they'll be enough to survive the next inevitable crisis.

Jon Mort:

Hello, everybody! Welcome to DevOps Decrypted. We're a reduced crew today. It's just Jobin and myself, Jon, because of Thanksgiving and illness and things. So you're stuck with just the 2 of us!

We're going to be looking at a recent report and survey that we've put together at Adaptavist Group, based on the findings of the CrowdStrike incident back in the summer and all of the problems that caused organisations all around the world.

So yeah, welcome. Welcome, Jobin!

Jobin Kuruvilla:

Thank you, Jon. I think it's good to have at least Jon, the CTO, on the call to talk about something as big as the CrowdStrike outage. I don't know. I mean, at Adaptavist, it feels like somebody is travelling almost every time – and I believe at this particular time I had just travelled, and I was actually on the other side of the Atlantic when this happened.

And I was like, Oh, I just escaped from this biggest outage! So it wasn't even clear what the outage was at the time because I was just in the middle of the travel, and yeah, I was not sure if I would be able to come back in time, and you know, it was all a messy situation, if I remember correctly.

Jon Mort:

It was a complete disaster. So I was actually going – I was on holiday on the day that it happened, and so flying out the evening afterwards. And we were like, Well, are we gonna make it? And it was all fine in the end!

But yeah, there were so many like – you turned up at the airport, and it was like the Apocalypse had happened there. There was, like, none of the systems were working and there were people sleeping on the floor. It was horrendous. So yeah, it certainly had a massive impact on travel.

Jobin Kuruvilla:

Yeah. And the biggest thing that caught my attention was everybody on the big screen was talking about the blue screen. And you know how technology is impacting our world. And I was like, okay, at least for one time. I feel proud that I work in the technology space. And you know, hey, we can bring systems down if you really like!

It was so funny. But yeah, I was talking to my family, my kids even, and telling them, Hey, this is what happens, you know, when an outage happens in the system because every time I tell my kids I'm doing something wonderful? They don't believe me. You're not sending rockets to the sky! That's what I say. But hey, we have technology people working behind that, too, and see what happens when we don't work or when we get something wrong, right?

Jon Mort:

Yeah, yeah, absolutely. And that was kind of like – quick segue to this report – this is one of the things we wanted to understand, really, with that. So much of industry was affected; there was obviously travel that was a big headline, and you know, in the UK, there were whole TV stations that were taken out – so Sky News was essentially off the air to kind of broadcast off emergency things. And so, it kind of affected everything.

So a bit of background on the survey.

We surveyed about 400 people from organisations with 10 million dollars in annual revenue in the UK, US, and Germany to get a sense of the size range of decent-sized organisations.

The headline stat is that 87% of people who responded said that they were significantly impacted by this CrowdStrike event.

Which kind of gives you this scale of – it's big.

So that's what we intuited—it's huge. And then 38% of them had disruptions lasting more than 24 hours.

Oh, which I mean, Jobin, have you got thoughts on that? It says to me, is this—it is big.

Jobin Kuruvilla:

It is huge. The funny thing is, you know, when we talk about DevOps a lot of the time, people actually wonder what DevOps is, right? I mean, we talk about the technology, and there's too much to talk about in technology. There are, you know, too many keywords and too much technical jargon that we use when it comes to DevOps.

So one thing I say – not just when I talk to customers, but even internally, with our business development teams, account managers, etc. – is: don't worry about all that technical jargon, but focus on what the business problem is.

What is the actual problem that the customer is facing? Now, take this as a good example. Right?

Your environment has been down for more than 24 hours. That's a huge problem. Imagine how many millions we are losing just by not being operational, right? So, there is a business problem that can be solved by DevOps, right? So that's my key takeaway from it, you know, associated with our daily problems.

If you don't realise the importance of DevOps. If you don't realise the importance of, you know, having a good operational system in place, then what are you doing? I mean, this is what's going to happen.

Jon Mort:

Yeah, yeah, absolutely. And I think even well-prepared organisations can still get caught out by something of this scale. Actually, planning for something like this takes quite a lot of imagination and bravery.

Jobin Kuruvilla:

What you mentioned caught my attention – you know, having that planning. It's interesting in the survey. What came out was, you know, there were a lot of people with those plans already in place, but only 16% of those people found them effective during the crisis.

40% actually discovered their plans were inadequate for an incident of this scale. That's very eye-opening as well, because people think they already have everything covered – we have adequate plans in place and things like that. But when a crisis like this comes, that's when people realise, oh, gosh! Our plans are not adequate, or something is not working. This actually happens quite a lot with blue-green deployments.

For example, I mean, I'm sure you have probably come across it. People have blue-green deployments. Actually, blue-green deployments is a bad example – disaster recovery. People have disaster recovery plans in place, right?

Nobody does it until the disaster happens. And then you suddenly realise, oh, my God, my DR doesn't work any more. So how do we actually plan ahead of time, and make sure that, okay, our DR is fully functional? Maybe do a test run every month, maybe? Do you know, make sure, whenever you do an upgrade, make sure the DR is also updated? So, that's an interesting side of things, right?
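To make the idea of a regular DR test run a little more concrete, here is a minimal Python sketch of a restore drill. It is only an illustration under simple assumptions, not something described in the episode or the report: the "backup" is a plain file copy, and the names `backup`, `restore_drill`, `run_demo` and the file `orders.db` are all hypothetical. The point it shows is that a drill actually restores the backup somewhere and checks the result against the source, rather than assuming the backup is good.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Copy the source file into backup_dir (a stand-in for a real backup tool)."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return target


def restore_drill(source: Path, backup_file: Path) -> bool:
    """Restore the backup into a scratch directory and compare it with the source."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup_file.name
        shutil.copy2(backup_file, restored)
        return checksum(restored) == checksum(source)


def run_demo() -> tuple[bool, bool]:
    """Run one drill against a healthy backup and one against a corrupted backup."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        source = work / "orders.db"  # hypothetical data file
        source.write_text("order-1\norder-2\n")
        backup_file = backup(source, work / "backups")
        healthy = restore_drill(source, backup_file)
        backup_file.write_text("truncated")  # simulate silent corruption of the backup
        corrupted = restore_drill(source, backup_file)
        return healthy, corrupted


if __name__ == "__main__":
    print(run_demo())  # prints (True, False)
```

In a real environment the drill would be scheduled (for example monthly, or after every upgrade, as suggested above) and a failed comparison would raise an alert instead of just returning `False`.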

Jon Mort:

Yeah, absolutely. I think, you know, exactly – the phrase is: "if you've got backups and you don't test your restores, all you've got is hope". And in that sense, you're like… That's not a plan!

And it's something that the teams building software at the Adaptavist Group do. One of the practices that we have, we call it game days, and it's about coming up with a scenario where there is a disaster that you simulate. You run it through an incident process, as realistically as you possibly can, test all of those procedures, and see what you can learn from it. And what you're trying to do is train people, so they get the experience of going through and dealing with incidents and problems and things like that.

But you also want to find holes in your procedure, in the things you're following, or holes in the systems that you've got. So it's got this kind of double-edged purpose of actually improving your practices and your ability to serve your customers.

But on that, one of the things I came across was a striking number… So 82% had lacking incident response plans.

Only 16% of them had found it to be effective.

And yeah, afterwards, 41% are now confident in their organisation's ability to prevent and recover from a CrowdStrike-like incident, and I would love that level of confidence. It seems really high.

Jobin Kuruvilla:

I would even argue that that's probably not right. I mean, see.

I'm pretty sure those 40% who later discovered that their plans were inadequate were also confident prior to the outage. I think it's Bill Gates who said, you know, success is a lousy teacher. It's failure that teaches us everything. And I think this outage taught us quite a lot.

I'm curious how people still think that you know, hey, we are now ready. Did they actually do something different? I mean, did they do something after the updates? Maybe improve their processes? Maybe, added more testing whatever, you know. I wonder if there is some cue to it in the report itself?

Jon Mort:

Yeah, well, that's what I mean. The optimist in me wants to think, hey, what we have here is that we've had this incident, and there's been a huge investment in recovery and in analysing weak spots and single points of failure and other things. And that's why, you know, nearly half of the people who were affected are now happy and think they can do better next time. But yeah, I wonder whether that is false confidence, and I guess we won't know until the next one.

Because I'm pretty sure that there will be something, I think, as things get more and more complicated and more and more dependencies and systems. I think the chances of something like this happening again, I think, they just go up.

Jobin Kuruvilla:

I'm sure there will be other incidents like this in the future. I mean, we are seeing this kind of incident. The Log4j one, you know, comes to mind. So there are a lot of these incidents that happen. But at the same time, you know, there were a lot of positive, transformational trends that emerged from this outage. That's what the report says.

So if you look at the positive transformational trends that came out, 74% or more are reporting positive outcomes across all the categories, you know, which were in the report. For some, 81% have implemented more robust development practices. 80% report enhanced cybersecurity awareness among staff.

I mean, that's a good thing. I mean, people are now more aware of what can go wrong. And the more people are aware of it, you know, they're well equipped to respond to any of these kinds of things. So I think there are positive ones coming out of this.

Jon Mort:

Yeah. And so if I put my optimistic hat on and say, hey, all of that, that definitely is the case. And actually, there's some solid action behind those things. I always wonder about human factors and things, and I wonder if, like the fact that it was such a widespread, and had such obvious impact.

And I can imagine a whole load of senior leadership folk would have been affected by that. The transport and things, the number of people who were flying around the world and suddenly grounded because of a cyber security tool – that raises the visibility of it, raises it in people's consciousness, and maybe it then becomes easier to build the business case for the investment in cyber security and training.

So, actually having, like, the human act, isn't there? Oh, like this thing, this thing that my technical team has been telling me I have to fix for a long time—now I see that's what they're talking about. It isn't some kind of fictional thing; it means I can't board my flight, I can't get home, or I can't go on vacation.

Jobin Kuruvilla:

It's not a technical debt any more. It's real.

It's something that you can actually feel. It's a tangible thing. So I think that, yeah, absolutely. But what's also interesting is, you know, a complete overhaul of development practices – 35% are increasing their focus on redundancy systems.

That is a person who has transferred their software update process entirely, which is very striking. I mean, it's a transformation of sorts, right?

Jon Mort:

Yeah. Yeah. And I think there's a lot of that.

One of the things I hope hasn't happened is that people and teams have become less willing to update policies. This is my big worry – as a reaction in organisations is that we become less keen to do the necessary security updates. We don't get rid of the zero-day problems that we're not patching quickly enough.

And so I think, I hope that those changes will lead to safer practices around those updates rather than delaying or putting a lot of red tape behind those things—because I can imagine an immediate kind of naive reaction is to go right—we're not updating anything ever anymore!

It works. Don't touch it – which is equally unsafe.
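One common way to keep updating quickly while limiting the blast radius is a staged (ring-based) rollout. The Python sketch below is an illustration of that general technique only; it is not taken from the report, and it is not a description of how any particular vendor ships updates. The names `Ring`, `staged_rollout`, `demo` and the host names are hypothetical. The update goes to a small canary ring first, and the rollout halts as soon as a ring exceeds its failure budget, so later rings are never touched.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ring:
    """A named group of hosts that receives the update at the same stage."""
    name: str
    hosts: list[str]


def staged_rollout(
    rings: list[Ring],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> tuple[list[str], Optional[str]]:
    """Apply an update ring by ring and stop at the first ring over the failure budget.

    Returns the names of the rings that completed, plus the name of the ring
    that halted the rollout (or None if every ring completed).
    """
    completed: list[str] = []
    for ring in rings:
        # apply_update returns True on success, False on failure (assumed contract)
        failures = sum(1 for host in ring.hosts if not apply_update(host))
        if ring.hosts and failures / len(ring.hosts) > max_failure_rate:
            return completed, ring.name
        completed.append(ring.name)
    return completed, None


def demo() -> tuple[list[str], Optional[str], list[str]]:
    """Simulate a rollout in which one host in the second ring fails."""
    touched: list[str] = []

    def fake_update(host: str) -> bool:
        touched.append(host)
        return host != "early-2"  # hypothetical failing host

    rings = [
        Ring("canary", ["canary-1"]),
        Ring("early", ["early-1", "early-2"]),
        Ring("broad", ["broad-1", "broad-2", "broad-3"]),
    ]
    completed, halted_at = staged_rollout(rings, fake_update)
    return completed, halted_at, touched


if __name__ == "__main__":
    print(demo())
```

In this simulation the rollout stops at the "early" ring and none of the "broad" hosts receive the update, which is the safety property the staged approach is meant to provide.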

Jobin Kuruvilla:

Oh, yeah, 100%, I agree. But again.

Whenever you talk about transformation, it's usually a bigger thing, especially if you're overhauling the entire software development process.

That itself is a huge thing that can take, you know, in some cases months, in some cases years.

So we are not talking about something that is as simple as yes, we are going to, you know, upgrade the version of the software to the next major version or the next stable version. It's as if we are overhauling the entire software development process. That's a big project in itself.

Jon Mort:

Yeah, yeah, it's huge, which makes me slightly sceptical about the numbers here because how many teams have managed to overhaul the entirety of their software development process in 6 months, which is the sort of the timeframe that we're talking about here?

Jobin Kuruvilla:

Exactly – which brings back that point about the false confidence that you were talking about earlier. Yes, you know, now around 40% of people feel more confident that if such an outage happens again, they can tackle it. But yeah, that could be a false hope.

What can you really do in 6 months' time? Right?

Jon Mort:

Yeah. Yeah. I think, let's put our optimist hats on and say, yeah, there’s been some really meaningful change off the back, which I really hope there is, because it's something that I think a lot of us have been commenting on and saying we need.

I mean, it's the reason why we're emotionally invested in the DevOps movement, right? Because we can see better.

Jobin Kuruvilla:

I am the head of DevOps for a reason! Yes, absolutely, I totally agree. But hopefully, these kinds of events actually open the eyes of, you know, those who are sceptical, as you said. The people on the ground. They know that there's a change that's required. They know that some of our practices need to be better, but sometimes, it's very difficult to get that buy-in from the senior stakeholders.

Which is something that we always try to do when we speak to customers. Right? Always bring in that senior stakeholder who can sponsor the transformational engagement; that is very key. I think we talked quite a lot about it in our last podcast with BetaNXT – when BetaNXT, our customer, came on board and talked about their transformational success story – that buy-in from the senior stakeholders is so important.

Yeah, unless we get that, I think these kinds of outages play a part in teaching them. Yes, this is real.

Jon Mort:

Yeah, yeah, absolutely. One of the other things that this survey found was that the vast majority were planning on increasing investments in cyber security training, in incident response, and also in hiring and long term hiring plans in software development and DevOps, DevOps engineering and things.

And I think that paints a really good picture for the future and the things, and I was thinking as well that if you know, if you're a professional working in this area, and you really want to build, put together a good case to work on the resiliency. And you see that there's a problem, I think using this as a case study, like the CrowdStrike incident and things. I think really, you know, it's got a lot of emotional involvement attached to it and things. And it's really tangible. So, I think a piece of advice would be to reference it in a business case and use it as part of your narrative. To say, if we don't invest here, this sort of thing can happen.

Jobin Kuruvilla:

Exactly.

And again, you mentioned that long-term planning of, you know, hiring more people. And the spread was actually interesting to me. Because you know, talking about that is its percentage of hiring in quality assurance areas: 34% is in IT operations, 32% in software developers.

I never thought companies were in shortage of software developers! But here we go.

And 31% is in DevOps engineers, which made sense to me absolutely, you know, 31% is at least in hiring new DevOps engineers, because your DevOps processes need to be better – makes sense.

But that increase in software developers and testers, that kind of surprised me, because I always thought companies were already hiring enough developers, and testing has now become the norm rather than the exception. Right?

Jon Mort:

Yeah, yeah…

All right. So, let's wrap up talking about this. I'll share my biggest takeaway, and then Jobin – so the thing that is really striking out of this survey is the meaningful change that appears to have happened as a result, which I think is really, sort of, finding the positives in what was such a big, difficult situation that a lot of organisations found themselves in.

But if we, as an industry, are in a better position as a result of it, I think that's got to be a good thing. Not wishing for a repeat, of course, but I think actually taking on those principles of looking at a bad thing happening, a disaster, and then learning from it and improving – I think that's a hugely, hugely positive story that we can see in what we found in this survey and report.

Jobin Kuruvilla:

Yeah, I mean, I would actually quote something from the report itself. It's very clear that building true resilience will require us to address deeper cultural and structural changes. Right? Again, it touches all the cornerstones that DevOps teams talk about. It's not always a tool itself. The people, the process, the tools – we need to take into account all of those factors and make sure that we have a completely resilient system in place.

And so the next time when something like this happens, at least we can fail fast and come back quickly, you know, up and running. We don't have to wait for 24 hours. That's a day – I don't want to miss a day – I don't even want to miss an hour!

We have the time change here in the US and, of course, in the UK. Missing that one hour pains me, so missing 24 hours? No, no, thanks!

Jon Mort:

Yeah, yeah.

Jobin Kuruvilla:

I think some things are happening in our world, too. So, at Adaptavist, we are going to the next big event, AWS re:Invent.

It looks like we have folks from across the Adaptavist Group going to re:Invent, leveraging our partnership with AWS as well as, you know, other strategic partners like Atlassian and GitLab. So if any of our listeners are going to re:Invent – I don't think the episode will be out by then, but at least, you know, we'll have some good marketing content for you after the event itself!

Jon Mort:

Yeah – it was nice to meet you there!

Jobin Kuruvilla:

I know, exactly!

It was nice to meet you there, but hey, look out for our marketing blogs or stuff. But there are a lot of exciting meetings that we have planned with our partners – Atlassian, AWS and GitLab – and I think we have lightning talks happening in the Atlassian booth, for example.

One of my DevOps team leads is doing that, Jason… So yeah, a lot of exciting things happening there.

Another thing was, we were at KubeCon last week, so a lot of positive takeaways from there. We were at the Gartner event in Barcelona at the end of last month.

I think the transformational aspect is now the biggest talking point. Obviously, there are things like platform engineering and developer experience, all supporting that. But at the same time, you know, every organisation, as far as I can see, is undergoing a transformational change.

And yeah, even this report points to that same fact.

Jon Mort:

Yeah, yeah, absolutely. Actually, I think that transformation, I mean, it was one of the things about the Gartner event of being sort of like, you know, less DevOps-y, but that transformational mindset is absolutely the – and seeking to improve and things which, you know, you see that as throughout the DevOps culture, of all of the various DevOps events that we end up at.

Jobin Kuruvilla:

Alright, folks – there it is, short and sweet! It's just the two of us, so we made it very easy for you.

Hopefully, you learned a lot from this, but this was the DevOps Decrypted podcast, part of the Adaptavist podcast series. And hopefully, we'll see you at re:Invent or we have seen you at re:Invent – and I hope by the time we are out there.

But please, like and subscribe – if you have any feedback, send it to devopsdecrypted@adaptivist.com

Thanks, everybody!