You're Crazy To Think You Can Predict Your Next Outage!

You’re probably terrible at predicting the next outage in your system!

Not because you’re a bad engineer, I’m guilty of it too. It’s just how we humans work. You’ll have a mental model of what could go wrong based on what has gone wrong. If a particular type of failure has never hit your system, the odds of you proactively designing against it are pretty slim. You don’t know what you don’t know.

This has some pretty uncomfortable implications for how you might think about reliability. And even more so as you move further away from writing individual lines of code yourself, into the world of LLMs writing software for you.

The Monitoring Trap

The classic response to “how do we know when things go wrong?” is to add more monitoring. Dashboards, alerts, thresholds. CPU over 80%? Alert. Error rate above 1%? Alert. P99 latency over 500ms? Alert.

Imagine you have an outage, your pager goes off at 2am because something’s wrong. You sleepily stumble out of bed, and open up your observability platform of choice. You see that the production RDS instance has run out of disk space. Annoying, but a relatively easy fix. You bundle more storage into the instance (gotta love the cloud) and then go back to bed.

Hopefully, the next day you run some kind of post-mortem. And one of the probable outcomes of that is to create some kind of monitor watching the disk space. If disk usage is over 80%, notify somebody. Problem solved.
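The post-mortem outcome above is worth making concrete, because it shows the limitation too. A minimal sketch in TypeScript (the metric names and rule shape are hypothetical, not from any real monitoring product): a threshold rule can only fire for a condition someone thought to define in advance.

```typescript
// Threshold-based monitoring in miniature: rules are written after
// an incident, so they only cover failures you already anticipated.
type Metric = { name: string; value: number };

type Rule = { metric: string; threshold: number; message: string };

// The rule born out of the 2am disk-space incident.
const rules: Rule[] = [
  { metric: "disk_used_percent", threshold: 80, message: "RDS disk above 80%" },
];

function evaluate(rules: Rule[], metrics: Metric[]): string[] {
  return rules.flatMap((rule) =>
    metrics
      .filter((m) => m.name === rule.metric && m.value > rule.threshold)
      .map(() => rule.message)
  );
}
```

A metric with no matching rule, however alarming its value, stays silent. That gap is exactly what the rest of this post is about.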

This kind of stuff has value. But it’s not observability. There’s a really important distinction between monitoring and observability, and conflating the two is where you’re going to get stuck.

Monitoring tells you when something you already anticipated is happening. You set a threshold because you know that metric matters. You’ve seen it spike before. You know what it means.

Observability is different. Observability is the ability to sit down at any point in time, ask any question about your system, and actually get an answer. Slice the data in ways you hadn’t thought of before. Follow breadcrumbs back through a chain of events to find the actual root cause, not just the symptom.

Logs, metrics and traces (and to some extent profiling) aren’t observability. They’re the raw material. What you do with them, and how flexibly you can query and explore them, that’s where observability actually lives.

The reason this matters so much comes back to that first point. If you can only answer questions you thought to ask in advance, you’re stuck inside your own mental model of how the system can fail. And your mental model has gaps. It always does.

Metrics for Systems, Observability for Software

One thing that’s helped me think about this more clearly is drawing a line between the components my application runs on top of, and the application itself.

Infrastructure has a fixed-ish set of constraints and problems. Redis starts to fall over under certain memory pressure. Kafka consumer lag grows when throughput spikes. EC2 instances hit CPU/memory limits. These failure modes are well understood, well documented, and relatively stable. Metrics work brilliantly here. Set your thresholds, configure your auto-scaling, and you’re largely covered.

If you’re running something like Nginx and want to check the traffic running through it, logs are about your only option. Stream the logs from Nginx and do something with them. The same goes for Amazon CloudTrail logs. Many infrastructure services output logs in some shape or form, and that’s really helpful.

The software you (or Claude) write is a completely different story. Application code changes constantly. A new feature ships, a refactor lands, a dependency gets bumped. Any metric threshold you defined last month might be completely meaningless today because the code it was measuring looks nothing like it did when you set it.

These changes are also being made by several people. Whether you’re running a lean, vibe-coded startup or working in an enormous development organisation, there are multiple parties contributing to the code base. And all those parties will probably have a slightly different mental model of the system, and a very different set of previous experiences. I use the word parties intentionally; I’d include coding assistants in that bucket as well.

On top of all of that, an individual software component runs as part of a larger system. The complexity of a system emerges from the interactions between the parts, not the individual parts themselves. And the entire system, truly considered, includes the software components, the developers, the coding assistants, the product managers, executive leadership, executives using LLMs to make decisions, users of the software, prospects of the software, the cloud provider you use to host everything, and more besides.

All of which is changing constantly. Not so simple, is it?

This is where the ability to ask arbitrary questions of your system becomes genuinely useful. Instead of asking “is this metric above the threshold I set six weeks ago?”, you can ask “how does this system behave right now compared to before this deployment?”. That kind of comparison, made freely and without having to pre-define everything, is what lets you keep up with software that never stops changing.
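The “compare behaviour across a deployment” question can be sketched too. A minimal TypeScript example (the event shape and field names are hypothetical illustrations, not a real observability API): given raw structured events tagged with a version, you can derive an error rate per version on the fly, with no dashboard or threshold defined up front.

```typescript
// Arbitrary questions from raw material: slice events by a
// dimension (deploy version) that no one pre-configured a chart for.
type TelemetryEvent = { timestamp: number; version: string; error: boolean };

function errorRateByVersion(events: TelemetryEvent[]): Map<string, number> {
  const totals = new Map<string, { errors: number; count: number }>();
  for (const e of events) {
    const t = totals.get(e.version) ?? { errors: 0, count: 0 };
    t.count += 1;
    if (e.error) t.errors += 1;
    totals.set(e.version, t);
  }
  // Convert running totals into a rate per version.
  return new Map([...totals].map(([v, t]) => [v, t.errors / t.count]));
}
```

Tomorrow the question might be “by region” or “by customer tier” instead; the point is that rich, queryable events let you slice however the incident demands.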

Designing for Change in the First Place

Which brings me to something I think gets undervalued in many architectural conversations: the rate of change itself is something worth designing for. I’d argue evolvability is the most important ‘-ility’.

Speed of change is a competitive advantage. If you can ship safely, iterate quickly, and roll back confidently, you have an advantage over teams that are scared of their own codebase. Some of what might scare you about your codebase is accidental complexity that’s crept in over time.

Or the fact the whole thing has been vibe-coded with little to no thoughts about software quality.

One area where I see this play out is in how teams think about reuse. There’s a real temptation to build reusable components early. It feels responsible. It feels like good engineering. But I’ve been burned by this enough times to be pretty cautious about it now.

Build the thing first. Use it in one place. Understand it properly. Then think about reuse. Premature abstraction is its own kind of complexity, and unpicking a badly designed abstraction that’s been reused in fifteen places is a painful way to spend a sprint.

The exception I’d make here is around observability and schema design. Getting those right early, and building them consistently, tends to pay for itself pretty quickly.

Making Illegal States Impossible

One of my favourite ways to reduce the surface area for unexpected failures is to make invalid states literally impossible to represent in code.

Take something like an email address. In a lot of systems, an email can be verified or unverified, and that distinction gets tracked with a boolean flag somewhere. Which means somewhere in the codebase, some function that should only operate on verified emails is probably trusting that the caller checked the flag. Or maybe it isn’t. Hard to tell.

A better approach is to make these two things separate types. UnverifiedEmail and VerifiedEmail. The constructor for VerifiedEmail is private, and the only way to get one is through a verification service that you actually trust. Now any function that needs a verified email just takes a VerifiedEmail as a parameter. The compiler enforces the rule. You can’t pass an unverified email in by accident because the types won’t let you.
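The pattern above looks something like this in TypeScript (a minimal sketch; the class and function names are illustrative, and the verification check is stubbed down to a boolean). Note the private brand field: TypeScript’s type system is structural, so without it a plain object with the same shape could still sneak past the compiler.

```typescript
class UnverifiedEmail {
  constructor(readonly address: string) {}
}

class VerifiedEmail {
  // Private brand makes this type nominal: structurally identical
  // objects (like UnverifiedEmail) are no longer assignable to it.
  private readonly brand = "verified";

  // Private constructor: the only way to obtain a VerifiedEmail is
  // through the verification path below.
  private constructor(readonly address: string) {}

  static fromVerification(
    email: UnverifiedEmail,
    codeMatches: boolean // stand-in for a real verification check
  ): VerifiedEmail | null {
    return codeMatches ? new VerifiedEmail(email.address) : null;
  }
}

// Any function needing a verified email says so in its signature;
// the compiler enforces the rule instead of a boolean flag.
function sendInvoice(to: VerifiedEmail): string {
  return `invoice sent to ${to.address}`;
}

// sendInvoice(new UnverifiedEmail("a@b.com")); // compile error
```

The boolean flag hasn’t disappeared; it’s been moved to the one place that’s allowed to interpret it, and everywhere else the type carries the proof.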

This is the kind of thing that sounds like a small detail but compounds nicely across a whole codebase. Fewer implicit assumptions. Fewer “I assumed the caller had already checked that” bugs. Fewer surprises at 2am.

These are also the kinds of things that LLM-written code will never do without explicit prompting; they just don’t think that way.

You Can’t Predict Everything, So Stop Trying

Coming back to where I started. You can’t predict every failure. You can’t anticipate every question you’ll want to ask about your system when something goes wrong at 2am. And you can’t design away every edge case before it surfaces in production.

There’s no way you could even see all the different edges, let alone track them.

What you can do is build systems that are easier to understand when things go wrong. Invest in real observability so you can ask questions you haven’t thought of yet. Design your application code so that change is cheap and safe. Push complexity down to the infrastructure layer where failure modes are predictable, and keep your application layer as clean and queryable as possible.

The goal isn’t a system that never breaks. It’s a system where, when it does break, you can actually figure out why.