DevOps Decrypted: Ep.31 - The State of Resilience in 2025? Are you ready for another outage?

We start the year with the State of Resilience 2025 report and discuss cybersecurity, the 2024 CrowdStrike event, and digital resilience.

Laura Larramore:

Welcome to DevOps Decrypted, where we talk all things DevOps! I'm your host, Laura Larramore, here with our Adaptavist panel – Matt Saunders, and Jobin. We are on Episode 31 of DevOps Decrypted, and you are welcome to follow us on all of our socials and to interact with our show. We would really appreciate that!

Today, we're going to be talking a little bit about the State of Resilience report from Cockroach Labs that was just published. So, Matt, if you want to get us into that a little bit, we'll go ahead and start with that?

Matt Saunders:

Yeah, let's go for it. So let's talk about resilience. I feel like we're talking quite a lot about resilience and keeping servers up and uptime. So yes, a new report came out from our friends at Cockroach Labs, who publish databases. I don't know why they're called that, it seems like a very strange name…

But yeah, basically just reporting that 55% of companies are having weekly outages, and 14% are having daily outages. I mean, I think there's some hype behind this around, you know, what actually classifies as an outage.

But we're building ever more complicated systems.

We're monitoring them to some degree or another; I guess this is just kind of an inevitable slump into nothing ever working properly or things breaking all the time.

I don't know. What do you think, Jobin?

Jobin Kuruvilla:

I feel like we are failing our customers. I mean, that's what I feel like. To be very honest, I feel bad about it.

You know what… The biggest thing is, if you put things into perspective – obviously, you know, our customers care about cost, especially these days when there's inflation brewing and the economy is not doing so great. A staggering 100% of survey participants experienced revenue losses due to outages – 8% of them lost over $1 million in the last 12 months. Can you imagine that? I mean, that adds up to a big amount.

I feel like we, as partners and consultants, are failing our customers. I'm curious to dive deeper and see what is actually going wrong. You keep talking about a lot of DevOps, implementation, or digital transformation pitfalls. Are they running into that? I mean, is that causing it? Or are they just not caring about these things? What exactly is happening?

Matt Saunders:

Yeah… I'm thinking about this from a lens of where we're encouraging people to – I know this phrase is a little bit outdated now – "Move fast and break things" and experiment, give people an environment where they feel confident in trying things.

You set that against beliefs such as going for 100% uptime, which is going to rigidify everything that you do, so you can't ever change anything. People are moving faster these days, and this is the net result of it.

Is it useful to quantify outages in monetary terms? Definitely.

It shows that there is a real-world consequence to these things happening. What I worry about is how we then pitch the fix here because traditionally, when people come to me and say, "Do you know how much money we've lost because that server was broken?", things have not gone well…

Jobin Kuruvilla:

Absolutely. Yeah. Looking deep into the survey, it looks like most of them are reporting network and software failures as the leading cause for these outages, right? That makes sense. And again, together with cloud platform and 3rd party service reliability issues.

So there's a lot of reliability issues. They're talking about again network and software problems. I mean, that's not new.

Matt Saunders:

It's not, though I think there might be a clue there as to where we're going wrong because when you were like, "we're failing our customers and not doing this quite right, and what is actually the source of this" – I think, again showing my age, we always used to just blame the network, you know, and come to a realisation that the network is unreliable. You just have to deal with that.

And so from there I think it's a bit of a leap. But then you get to microservices and carving up the responsibility of things, and where people are running increasingly complicated apps that are split up into microservices, and microservices have to talk to each other. How's that working?

Now, in the good old days of the Java monolith, everything happened inside of a single process.

Now, the typical user journey could hit, you know, on potentially some mythical retail site, financial site, could be hitting, like, 5 or 10 microservices to put up a page or to show something in an app. And if the network isn't perfect, which it never is, then you have problems.

So… these things are getting increasingly complicated. They're being dispersed across different teams, across different pieces of infrastructure. For all the right reasons.

I mean, I don't want to go back to the days when we had Java monoliths, where if you wanted to touch a single line of code, then you had to recompile the whole damn thing. And it took 7 hours and 15 approvals.

But yeah, maybe we're missing something there.
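Matt's point about a page that depends on 5 or 10 microservices can be put into rough numbers. The sketch below assumes each call succeeds independently with the same probability and that there are no retries or fallbacks; the 99.9% per-call figure is a hypothetical chosen for illustration, not a number from the report.

```python
def journey_success_probability(per_call_success: float, calls: int) -> float:
    """Probability that every one of `calls` independent calls succeeds."""
    return per_call_success ** calls


if __name__ == "__main__":
    # Hypothetical per-call success rate of 99.9% on an imperfect network.
    for calls in (1, 5, 10):
        p = journey_success_probability(0.999, calls)
        print(f"{calls:>2} calls at 99.9% each: {p:.3%} of page loads succeed")
```

Under those assumptions, ten calls at 99.9% each leave roughly 99.0% of page loads succeeding, so the user-facing failure rate is about ten times the per-call failure rate.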

Jobin Kuruvilla:

And you would think that with all the recent advancements in cloud technologies, for example, you know, they are focusing on resilience. And you know, there's this 99.999% availability that they're promising. So, to see that network and software issues are the primary cause of these reliability issues? It's surprising.
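For a sense of scale on the 99.999% figure Jobin mentions, an availability percentage can be converted into a yearly downtime budget. This is a minimal sketch that assumes a 365-day year and treats availability as a simple fraction of wall-clock time; real service agreements define their measurement windows and exclusions differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime a year")
```

On that basis, "five nines" leaves about 5.26 minutes of downtime per year, which puts weekly or daily outages in perspective.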

Matt Saunders:

Yes, yes, all those software issues. What does that mean? That could be anything, couldn't it?

Jobin Kuruvilla:

That is a wide bucket, yeah.

It is kind of confusing what software issues mean. But again, this could potentially lead to a discussion about this proliferation of tools. How many tools are people dealing with on a day-to-day basis?

And again, we speak about—sometimes, when we do engagements, when we go to customer engagements, you know, I've seen places where over 200 tools are used daily to get their work done.

You see that trend of late, platform engineering – a consolidation of tools. And there's a reason why people are doing it. The more tools that you use, the more complicated the setup is. You have to worry about issues in each of those tools, and if one of them fails, then you have got a problem at hand. Maybe that's what it means.

Matt Saunders:

Yeah. Also, with this proliferation of tools, I guess you end up struggling with people's cognitive ability to actually understand how all these things actually join up together, right?

You've got 7 tools involved in building your CI pipeline—do you know where each tool's responsibility ends? Is that someone's job? Does the tool or the accumulation of tools actually help with that?

I think we see a lot of new tooling going in for things like security and compliance. And potentially, we're just bogging ourselves down with this sort of stuff. So yeah, I'm not saying we should just have one tool to rule them all, although… Well, actually, maybe we are.

Jobin Kuruvilla:

No, I mean, that's certainly not going to happen!

Matt Saunders:

And cyber attacks!

Jobin Kuruvilla:

Cyber attacks!

Matt Saunders:

Hmm, so this is big. This is real. You know, I was talking to some people who had an issue—this is actually last week. A company had an issue with an application that didn't have any proper security around it.

It was a JavaScript application that actually had a password in the JavaScript. So, somebody thought, "Well, we can write this app and do it entirely on the client side because that makes things a lot easier". I think it's just talking to APIs. Brilliant, fine.

The problem is that this then just works on security through obscurity. I'm telling this story because it was a little bit of a battle to actually get people to accept that hackers will find things. You may think that just by hiding something away, they won't come and find you.

Hackers don't have personal vendettas. Well, some of them do – but with the tooling you've got around cyber attacks, you can be unleashing... I don't mean YOU, Jobin, and I'm not saying you're a cyber attacker… or are you? What were you up to last weekend? I don't know…

Jobin Kuruvilla:

And I'm not saying that I have a personal vendetta against you, either…

Matt Saunders:

Hackers will find weaknesses, whether they're looking for them or not.

Jobin Kuruvilla:

And it is a bit of an oversight, isn't it? I mean, having such a childish mistake go into production. I think if they had implemented a proper DevSecOps pipeline, this would have been caught early in the cycle, and it never would have made it into a production environment. So I think there is scope for improvement there.

Again, cyber attacks. There are all these new regulations coming, including the DORA regulations and things like that, and the government is putting up these regulations for a reason. They want to limit these kinds of things from happening again and again. So I think there's a good thing happening on that front.
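The hardcoded password in Matt's story is the kind of mistake an automated check early in a pipeline can flag. The sketch below is a deliberately naive illustration of the idea, not a description of any particular scanner: the pattern, the function name `find_hardcoded_secrets`, and the sample source are all hypothetical, and a real pipeline would rely on a dedicated secret-scanning tool with far better coverage.

```python
import re

# Hypothetical heuristic: a name containing a secret-like word that is
# assigned a quoted literal of at least four characters.
SECRET_PATTERN = re.compile(
    r"""
    \b[\w.]*(password|passwd|secret|api_?key|token)\w*  # suspicious identifier
    \s*[:=]\s*                                          # assignment or object key
    (["'])[^"']{4,}\2                                   # non-trivial quoted literal
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line that looks suspicious."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if SECRET_PATTERN.search(line):
            findings.append((line_no, line.strip()))
    return findings


if __name__ == "__main__":
    # Made-up client-side snippet for illustration only.
    sample = (
        'const apiBase = "https://api.example.com";\n'
        'const adminPassword = "hunter2!";\n'
        'fetch(apiBase + "/orders");\n'
    )
    for line_no, line in find_hardcoded_secrets(sample):
        print(f"possible hardcoded secret on line {line_no}: {line}")
```

A check like this would typically fail the build when it finds anything, so the secret never reaches production. It does not fix the underlying design problem, which is that anything shipped to the browser should be treated as public.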

Matt Saunders:

Yeah, and so, in addition to all that talk of good things happening, should we talk about our CrowdStrike report?

I can't remember if we've mentioned it on the podcast before.

Yes, we did! In DevOps Decrypted Episode 30, Jobin and Jon discuss the outage and our exclusive report. You can also check out the full press release on our CrowdStrike outage research.

Jobin Kuruvilla:

We definitely mentioned it in the LinkedIn Live that we did today.

Matt Saunders:

Yeah, we did. But TL;DR, based on the outage that basically everyone experienced last summer because of a bug in the CrowdStrike software, we put out a report surveying how attitudes have changed since CrowdStrike.

So, past the headlines, where we discovered that most people were affected. And I think we were looking at about 5 billion dollars worth of value wiped off of various Fortune 500 companies because of it.

There was some good stuff there about learning and about getting better at this kind of stuff. So, things like the majority of companies off the back of that CrowdStrike report were saying that they're going to be investing more heavily in things like testing and staffing up to make this better for everyone. So yeah, I think it's good to see this report as well.

I think the CrowdStrike report that we wrote has gone down quite well – we got into Raconteur – so there are some thoughts from Jon, who's often on the podcast, on that. But yeah, the TL;DR on it seems to be that there's increased awareness around this stuff, around how reliability is really important. It takes some kind of seismic thing that affects everyone to make people sit up and listen, pay attention and realise that, oh, yes, maybe you have got all these plans for what happens in a disaster. But how far does that go? What's your scope here?

Does it really affect you, or does your plan really encompass things like all your Windows PCs not booting any more? Maybe not. And it's that kind of outside-of-the-box thinking that I think we're seeing more of now.

Jobin Kuruvilla:

Yeah, and talking about good things to take from the report, you know, I will also mention that in the State of Resilience report, organisations reported that one of the major obstacles is prioritisation and budgetary constraints, followed by system complexity and inadequate training and staffing levels.

So, there are certain things that organisations can look at and maybe prioritise.

Okay, infrastructure, stability, resilience, and overall resilience of your system are priorities—just like DevOps, and you know, you should probably pay some attention to them. We, as consultants, run into these budgetary constraints all the time with our customers, and they say, "Yes, we know that it has to be done, but there is a budget constraint at the moment, and the focus is elsewhere."

Yes, great, but eventually, you're going to pay the price for that.

Either you pay the consultants to implement the system properly, or you pay the price by having an outage and losing revenue because of that. So, I think there is also a lesson to be learned from that.

Laura Larramore:

Everything is possible with as much money and time as you have!

There's always a limit that clients have, as far as their money and their time, in what they want to do. And they kind of do have to do that risk assessment like, okay – so what are you telling me the risk of doing this is going to be? And I think that this resilience report can point that out. If 100% of clients are saying we lost money here, and then you have 8% of them saying, "We lost like a million dollars here", it's not a small bit of money, either.

That can bolster some of those arguments that, okay, well, we're telling you that, you know, this might be a worthwhile investment in what you need to do, going forward, to build up your systems and to get them to where they need to be.

Jobin Kuruvilla:

I agree. I mean, as is often said, organisations are blindsided, right? And these kinds of reports actually shed some light on, okay, this is what's going to happen. And this is how you quantify it in terms of money. So you're actually losing money by not doing things that you should be doing. So it's definitely a good thing.

Matt Saunders:

Yeah, it's a really fascinating one. And you know, I'm sure we've all been seeing this for as long as we've been working in the industry. It's like that classic thing of people prioritising features over stability.

It's like, Oh, what should we do? Oh, well, we can add all these new features, and we expect that you know, somebody's probably got a spreadsheet out and works out that adding this feature can increase our revenue by 17% or something.

What do you get in return for doing all this investment in DevSecOps and testing on the bottom line? Nothing.

But it's like that, you know, measuring a negative thing. You can't do it. And that's why it's really hard. And you're right. It's almost like this sort of stuff has to have some shock value to shake people out of that level of complacency.

Jobin Kuruvilla:

Yeah. And speaking of the question that you ask, what do you get out of the investment that I do in this area? This is the same question most cloud customers are asking as well right? I mean, people were pitching the move to cloud as an answer to these reliability issues.

And now that you're on cloud, you're still seeing these reliability issues – which then prompts the question: is cloud actually dead? And is that the reason why, sometimes, you know, customers are moving back from cloud to on-prem?

Matt Saunders:

Cloud is definitely not dead.

Jobin Kuruvilla:

Are you sure, though, I mean, is this the reason why you get me every day!

Matt Saunders:

I enjoyed that. Am I sure? Not entirely. So yeah, I think we talked about cloud repatriation before, and there's a lovely story… Let me sit down and tell you a story.

One day, we were fed up with running all these servers. We couldn't get enough servers, and there was too much lead time, so we decided to get somebody else to run them for us. We were very, very happy, because when we got lots of extra customers, we could just ask the cloud provider, with a few clicks, to add another 100 servers for us.

Thank you very much.

And then we went to bed. We were very happy. But then, the next morning, we woke up and realised we were spending an awful lot of money on the cloud. And so we all moved ourselves back into the data center.

That's the story. And people like David Heinemeier Hansson from Basecamp were talking about this a lot. There's another recent one with a provider who was actually sending out physical servers to their customers, and using a whole load of cloud spend and then repatriated cloud spend back onto those servers as they got returned because they were end of life.

And there are a lot of cases where running stuff in a data centre is going to be a lot cheaper than running on cloud. I don't think that's ever been a secret. I think it's about people keeping an eye on their cloud spend and understanding where it's going, understanding the flipside, the costs of being able to scale things up instantaneously and dramatically. What does that actually look like? We're going through this sort of pain ourselves at Adaptavist a little bit where we're like, we're spending a lot of money on the cloud, and we're not entirely sure where it's all going.

Jobin Kuruvilla:

Isn't that the key thing, though? Understanding where the spend is? I mean, that is the most important thing. Because yes, we do say that going to the cloud actually will save you money. And in the end, if you're seeing that it is actually doubling our bills, I mean, there's a reason why that's happening.

Understanding why that's happening, which service is consuming the most money, and why that's happening is key. That's probably why all these FinOps technologies are now becoming increasingly popular.

Matt Saunders:

Yeah, yeah, it's the key. You need to understand what your spend actually is. And then you're gonna have to look at where the cloud market is going, where the data center market is going. The thing with the Basecamp repatriation was that they've been running for a long time, and they could see that there were some established workloads that they knew that they weren't really going to change very much in 3 or 5 years.

Therefore, we can just buy equipment to run this stuff for 3 or 5 years. It's already fully utilised. So it's not like a growth thing anymore, and then they save money.

I think many businesses are going to be in a world where at least some of their workloads are going to be like that, and so there's a definite argument for going off and doing that to some extent.

Having said that, I think I basically just said if you know what your workload is going to be in 3 or 5 years' time, then go to the data centre. But you can also make a whole lot of savings in AWS with things like Reserved Instances. Other cloud providers offer the same sort of thing, where you get the CPU that you need at huge discounts, very, very big discounts.

The thing that people often say about the downsides of running stuff in your data centre? You have to run servers. You have to have people who understand operating systems again. Is that the right thing for your business? I know it's not for us because we have myriad things going on in our services and in our product worlds, using a whole load of cloud technology that will be very expensive to go and replicate back off in the data centre when we start running containers. All that stuff that we actually almost take for granted.
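The trade-off Matt describes, paying on demand versus committing to capacity for a known, steady workload, comes down to simple arithmetic. The sketch below is a simplified model under stated assumptions: the hourly rate, the 40% commitment discount, and the three-year term are hypothetical values, not real pricing from any provider, and it ignores staffing, hardware refresh, and other costs raised in the conversation.

```python
HOURS_PER_YEAR = 365 * 24


def on_demand_cost(hourly_rate: float, utilisation: float, years: int) -> float:
    """Pay only for the fraction of hours the capacity is actually running."""
    return hourly_rate * HOURS_PER_YEAR * years * utilisation


def committed_cost(hourly_rate: float, discount: float, years: int) -> float:
    """Pay a discounted rate for every hour of the term, used or not."""
    return hourly_rate * (1 - discount) * HOURS_PER_YEAR * years


def break_even_utilisation(discount: float) -> float:
    """Utilisation above which the commitment is cheaper than on-demand."""
    return 1 - discount


if __name__ == "__main__":
    rate, discount, years = 1.00, 0.40, 3  # all hypothetical values
    print(f"break-even utilisation: {break_even_utilisation(discount):.0%}")
    for utilisation in (0.3, 0.9):
        od = on_demand_cost(rate, utilisation, years)
        cm = committed_cost(rate, discount, years)
        print(f"utilisation {utilisation:.0%}: on-demand {od:,.0f} vs committed {cm:,.0f}")
```

In this model the commitment only pays off when the workload runs for a larger share of the term than one minus the discount, which is why it suits the fully utilised, slow-changing workloads described above and not spiky or fast-growing ones.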

Jobin Kuruvilla:

Yeah, the flexibility of spinning up more environments as needed, you know, bringing them down. That's so much easier on cloud. So, obviously, there is value in cloud. Nobody is actually debating whether cloud is dead, you know. That's a tricky question. Obviously, cloud is not dead, but at the same time, we do see a lot of companies actually taking workloads, containerising them, you know, Kubernetes, or whatever, and then having a hybrid cloud setup where some of the workloads run on AWS, some on Azure, so that hybrid cloud is actually becoming more and more popular. We do run into that quite a lot these days.

Matt Saunders:

That's interesting. So I guess that's because things like Kubernetes have become kind of ubiquitous. So you can buy a Kubernetes service from any number of different cloud providers, and I think this has changed massively in the last five or so years, where we used to talk about hybrid cloud, and people are like, "We want to run some more workload in AWS, but be able to run it in GCP", and we're like… well, both clouds have got APIs. So yeah, just change the words AWS to GCP in your Terraform, and off you go. Right?

Yeah, maybe not. So, fast-forward to where? I'm not going to say all these Kubernetes distributions are the same because they absolutely aren't. But the core of them, the way that apps are run, has become very, very standardised, hasn't it?

So yeah, that's absolutely coming on the table. And I think that's a good thing for businesses. I've lost count of the number of people who are saying things like, "Oh, don't run all your work on AWS because what if they increase their prices by tenfold next year?" and you're going to be stuck, and from there you end up having to do some sort of hybrid cloud project to guard against that sort of thing…

Jobin Kuruvilla:

It brings back that vendor lock-in cloud-agnostic discussion as well, doesn't it?

Matt Saunders:

Yes. Those sorts of things are good to have, but they can hold you back if you're trying to design and ensure that your application can run anywhere. That's going to take longer than just saying, "Well, no—we're just going to run it over there."

But just to your point, that is a lot easier now. You run an application that runs in EKS. It's going to run probably just fine in AKS or in GKE, or wherever else. And I think that is a net good thing for companies.

Because I don't believe for a moment AWS are going to increase their costs by tenfold.

That is not going to happen.

Jobin Kuruvilla:

They are, in fact, making it cheaper as the days go by, so they're making the services cheaper and cheaper to make them more affordable to a wider range of customers. That's what we see in practice.

Matt Saunders:

Yeah, 100%. So it's more choice for everyone, which has got to be a good thing.

Jobin Kuruvilla:

Having said that, we also see priorities changing even when it comes to cloud vendors; AWS just recently announced that they are discontinuing AWS CodeCommit. You know, CodeCommit, CodeBuild, CodePipeline – that was like a big service offering from AWS for a long time. But now they have said they're discontinuing AWS CodeCommit.

So, we are currently in the process of migrating many customers from AWS CodeCommit to GitLab, Bitbucket, and GitHub, so I think – I don't know. What does that tell you, Matt?

Matt Saunders:

I think so… it's an ongoing, not really a joke, but it's an observation that AWS are very, very good at spinning up lots of different services. And I think, well, Google, too. And Microsoft – they spin up lots of things, put out something that is better, if you're already running in their ecosystem, than some of the competition.

And then they kill some of those services. They've caught some flak for that, like, well, you had this service running, and now you've killed it. What are we supposed to do now?

Which I think is unfair, you know? AWS is a business, after all. They want to find things that help their customers. But also, they want to find things that are going to make them money – fine. So it's all part of the natural ebb and flow of things that some products will come out and then be killed.

Everyone does it. Anyone who spins up a product, realises it's maybe not the right thing for them, or doesn't represent the direction they want to go in or whatever, and then just carries on with it?

That doesn't make any sense.

So yeah, so CodeCommit – to be honest, I've only really dabbled in it. Haven't used it in anger.

It was a decent product, right? I think. I mean, you and your team have spent a lot more time in it than I have, but especially given the integration with the other AWS services out there, it did a good job.

Jobin Kuruvilla:

That was a huge reason many customers adopted CodeCommit in the first place. But you're right. I think they're discontinuing it not because it is a bad product but mostly because they're reprioritising their efforts elsewhere.

They're probably going back to that original idea of, okay, AWS is a service provider. We are here to help you host your stuff here, and not to actually provide SCM or CI/CD – that can be taken care of by the likes of GitLab and GitHub. There are other products that are better and have more features that you may be able to use.

To their credit, AWS has been really wonderful in terms of helping their customers come up with a plan. In fact, even working with service providers like Adaptavist – we just did a recent webinar with the help of AWS and GitLab explaining to our customers how Adaptavist can help them move from CodeCommit to GitLab.

We are also doing a similar one for GitHub and Bitbucket. Our campaigns explain how we can help, so yeah, there are a lot of things happening there. But I think it's all about prioritisation. We want to make our cloud more reliable and not worry so much about having a thousand services in there. Maybe that's what it is.

Matt Saunders:

Yeah, and fair's fair.

I'm coming back to what we were talking about a few minutes ago – this idea of lock-in. It goes just the same for source code management as it does for where you actually run your apps, I guess.

I'm not saying you should be ready to move your SCM at any moment, because well, quite clearly, it's something that's quite central and you want to be able to hang on to it for a while.

But yeah, it sounds reasonable. And yes, we have services to help people move – and not just move, but like, get really ahead of the game and start… I think we came into this call talking about fragmentation of apps and having too many of the damn things. Well, yes, come and talk to us, and we can get you onto something that will let you have fewer apps!

Jobin Kuruvilla:

Yeah, absolutely. I mean, here's what I say… Since a tool will be discontinued, yes, it is going to cause a disruption to what you're doing on a day to day basis. It is going to reduce your team velocity because you suddenly have something that you need to worry about, which you weren't worrying about earlier.

But on the flip side, this is also an opportunity for you to transform the way that you're working. Maybe it is an opportunity in the right sense, to maybe improve your ways of working, so I will say – take this as an opportunity. See what's the next best thing out there, and how it is going to improve your life on a day-to-day basis.

Matt Saunders:

Yeah. The other thing is that these things have evolved since CodeCommit has been around.

Jobin Kuruvilla:

Exactly.

Matt Saunders:

We've managed to get, I don't know, how many minutes into this podcast without mentioning AI?!

Jobin Kuruvilla:

Did we? Wow!

Matt Saunders:

We did, we did. But yeah, you can move to one of these source code management platforms that actually – not that I'm saying AWS isn't doing great things in AI, because they clearly are – have AI built in, in things like GitLab, which is backed by Amazon.

Yeah, you can take another big step up in what you do.

Jobin Kuruvilla:

Yeah, I was going to say that it is actually backed by AWS because, you know, if you look at the GitLab Duo, the AI engine on GitLab, it is backed by Amazon Q, and all you have to do is, you know, search Q code – and there goes your automated code generation, there's your automated test cases. So it is becoming easier and easier.

Yes, by moving to those tools, you're probably improving the way you work. So…

Matt Saunders:

There we go!

Jobin Kuruvilla:

There's the opportunity.

Laura Larramore:

I always appreciate ways to improve the way we work! Thanks, you guys, for this conversation. It was very interesting, and I hope that it was interesting to our audience. That study again was the State of Resilience 2025 report—it was published by Cockroach Labs.

That was mostly the conversation today; we were talking about that at length, so go and have a look at it if you would like.

Thanks, you guys. Thanks for the discussion. This has been DevOps Decrypted. I'm your host, Laura Larramore. Please connect with us on our socials. Have a great day!