Ergodicity: Why Averages Lie and Systems Break
The migration that shouldn’t have been hard
Your company has always been an Azure shop. Then a decision comes down from way above your pay grade. The cloud contract went out to tender, and AWS won over Azure. Now you need to move all your applications to AWS.
You’re not worried. You’ve containerised everything with Kubernetes. Kubernetes runs anywhere. That’s the whole point. You sit there, smugly I might add, knowing the migration will be easy.
It isn’t.
Almost immediately, you realise a monumental problem. The problem isn’t Kubernetes. That part works exactly as you expected. Stateless containers run anywhere. The problem is CosmosDB — Azure’s proprietary database service. CosmosDB doesn’t run on AWS.
It’s embedded across multiple applications. Swapping it out means significant application-level code changes. Not just infrastructure changes. Not the kind of thing you can abstract away with a container orchestrator. The kind of thing that means months of rework you hadn’t budgeted for.
You’d made an assumption: our architecture is portable across cloud providers. Kubernetes gave you confidence in that assumption. But Kubernetes abstracts away the compute layer. It doesn’t abstract away the data layer. Or any other layer.
Nobody had asked the question that would have surfaced this: “What happens if we need to leave Azure?” Not because leaving Azure was likely. Just to understand what the answer would be.
You hadn’t made a technical mistake. You’d made a reasoning mistake — and it has a name.
What ergodicity actually means
Ergodicity. It sounds like something from a statistics textbook you skimmed in university and never thought about again.
A coin flip is the clean example. Send 100 people out to each flip a coin once and you’ll get roughly 50 heads and 50 tails. Have one person flip a coin 100 times and you’ll get roughly the same. The path doesn’t matter. A run of 10 tails in a row doesn’t change the probability of the next flip. The group average and the individual average converge.
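The convergence is easy to check with a quick simulation (a sketch in Python, with heads modelled as 1 and tails as 0):

```python
import random

random.seed(42)
N = 10_000

# Ensemble average: N people each flip a coin once.
ensemble = [random.randint(0, 1) for _ in range(N)]
ensemble_avg = sum(ensemble) / N

# Time average: one person flips the same coin N times.
time_series = [random.randint(0, 1) for _ in range(N)]
time_avg = sum(time_series) / N

# Both converge on 0.5: the path doesn't matter. The coin flip is ergodic.
print(f"ensemble average: {ensemble_avg:.3f}")
print(f"time average:     {time_avg:.3f}")
```

Run it with any seed and the two averages land in the same place. That interchangeability is exactly what non-ergodic systems lack.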
As humans, we think in averages. We apply the same reasoning to plenty of real-world scenarios: mean time to failure, average returns, expected value. Mean time to failure is 2 hours? Great. But what works for the group doesn’t always hold for the individual. The average can be fine, yet one massive outlier can destroy you.
Most of the statistical tools we reach for assume ergodicity. Most real-world systems don’t satisfy it.
That distinction — between the group average and the individual path — is what we most often get wrong. Once you see it, though, you start noticing it everywhere.
The place it’s easiest to see isn’t in software. It’s in money.
The pattern is everywhere
Your production architecture has more in common with a 1966 retirement portfolio than you might think.
Say you retire and plan to draw down a fixed amount each year. You’ve done the maths. 8% is the average historical return, so you use 5% as your number — conservative, sensible. The numbers work. Excellent.
If you retired in 1966, here’s what actually happened. Near-zero returns from 1966 to 1982. Then roughly 15% per year from 1982 to 1997. The average return across the full period looks fine. But you were drawing down during the flat years, before the growth arrived. By the time the good years came, there might not be enough left in the pot to benefit from them.
The group average looks healthy. It tells you nothing useful about the person who retired in 1966. The path mattered. The sequence mattered. The average didn’t protect you.
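You can see the effect in a few lines of Python. The numbers below are stylised, not historical: sixteen flat years followed by fifteen boom years, versus the same returns in reverse order. Same average return, opposite outcomes.

```python
def simulate(pot: float, withdrawal: float, returns: list[float]) -> float:
    """Draw down a fixed amount each year, then apply that year's return."""
    for r in returns:
        pot -= withdrawal
        if pot <= 0:
            return 0.0  # ruined: a pot at zero can't recover, whatever comes next
        pot *= 1 + r
    return pot

# Stylised version of the 1966 retiree: 16 flat years, then 15 boom years.
flat_then_boom = [0.00] * 16 + [0.15] * 15
boom_then_flat = list(reversed(flat_then_boom))

# Same returns, same average. Only the sequence differs.
ruined = simulate(1_000_000, 50_000, flat_then_boom)    # runs out mid-boom
thriving = simulate(1_000_000, 50_000, boom_then_flat)  # ends well above the start
print(f"flat then boom: {ruined:,.0f}")
print(f"boom then flat: {thriving:,.0f}")
```

The retiree who gets the boom first finishes with several times their starting pot. The retiree who gets the flat years first goes broke before the boom can save them, despite experiencing identical returns.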
Sequence of returns risk is a well-documented phenomenon. You can mitigate it: diversified investments, sensible asset allocation, an emergency fund, regular reviews of your portfolio.
The answer isn’t to predict the future better. It’s to build in a buffer that keeps you in the game long enough for the averages to work in your favour. A cash reserve. Shifting some of your investments into bonds. Keeping a margin of safety. Something that absorbs a bad sequence before it becomes fatal.
Back to software
So back to you, your Azure shop, and CosmosDB.
The ergodic assumption the team had made was reasonable on the surface: because we chose to use Kubernetes, we built portability as a first-class citizen. In an ergodic system that would hold. The path — which cloud you’re on — wouldn’t matter, because the architecture behaves the same regardless.
But the system was non-ergodic. The path had locked you in. CosmosDB was embedded in ways Kubernetes couldn’t abstract away. Nobody thought to ask what a migration would cost until it was already underway.
The question that should have been asked wasn’t “are we likely to leave Azure?” It was “what would it take if we did?” The answer would have surfaced the dependency. It might not have changed the decision to use CosmosDB. But wrapping it behind an interface would have made it a contained change, not a codebase-wide rework.
Migrating from CosmosDB to an AWS-managed database still wouldn’t be easy — it definitely wouldn’t. But at least the changes would stay contained within the codebase.
That’s one side of it. What about when it goes well?
A team builds a system for roughly 100 active users. They choose a serverless architecture — partly for cost, partly for simplicity. The system grows. 10,000 users. Then 20,000. And because of the serverless foundation, it scales elastically.
Here’s the honest version of that story though: the resilience was accidental. The team didn’t sit down and reason explicitly about ergodicity or future states. They got lucky. The early architectural choice happened to be path-independent in exactly the right way.
Whilst you can get the right answer without thinking formally about ergodicity, you’re playing a luck game. Sometimes you might come good. If you were the investor from 1982 to 1997 you’d think you were a genius. You just happened to be in the right place at the right time.
Using ergodicity as a lens gives you a deliberate path to better outcomes. Instead of hoping your early choices happen to age well, you ask the question upfront: am I locking in a path here? What does this system look like if usage grows by 100x? What does it look like if we need to change providers? What does it look like if we discover Godzilla is real, and he comes storming out of the sea and stomps all over us-east-1?
This isn’t about predicting every possible future. It’s about knowing where your path dependencies live before they surface in an emergency.
The intellectual lineage
This isn’t a new idea. It’s been formalised — and that formalisation matters.
Barry O’Reilly’s work in Residuality Theory overlaps closely with this framing. I’d read both of his books — Residuality Theory and The Architect’s Paradox — and understood the ideas. But a specific post O’Reilly made on LinkedIn — explicitly connecting ergodicity to software architecture — made the concept click.
The central move in Residuality Theory is this: map the possible future states of a system, not just the expected state. Ask “what if…” systematically rather than optimising only for the happy path. Instead of asking “will my architecture handle the average case?”, ask “what does my architecture look like across the distribution of possible futures?”
If you want to go deeper, Barry’s work is the place to start.
Which brings it down to a single question you can ask before any design decision.
One question
Have I thought about whether my architecture has hidden dependencies that would make future changes expensive?
That is the question.
The follow-up is just as useful: how will this system evolve over time, and have I thought through the possible future states?
The CosmosDB team didn’t need to predict the contract change. They needed to know that CosmosDB was a path dependency — so that if a migration was ever required, it wasn’t a surprise. The serverless team got lucky. Asking the question deliberately means you don’t have to rely on luck.
The goal isn’t to solve every potential future problem upfront. It’s knowing where the constraints live. Where the assumptions are baked in. Where a path dependency could cost you months when the future diverges from the average. Ask the questions and briefly capture the answers: a paragraph in a decision log, a note in an ADR (architecture decision record). You need a record that the question was asked and what the answer was at the time.
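The record doesn’t need to be elaborate. A few lines in an ADR is enough (the wording and numbering below are illustrative):

```markdown
## ADR-014: Use CosmosDB for the orders service

Decision: CosmosDB, for its change feed and low-latency reads.

Path dependency: CosmosDB is Azure-only. We asked "what would it take
to leave Azure?" Answer: data access is wrapped behind a single storage
interface, so a migration means one new adapter plus a data migration.
Painful, but contained. Revisit if that boundary erodes.
```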
Most of the time the answer will be “this is fine.” But occasionally you’ll catch something. And when you do, it’s far more valuable than discovering it three months into a full codebase refactor.