In the early hours of February 24, 2022, as Russian forces crossed the Ukrainian border, a different kind of evacuation was already underway. Not of people — though that would come — but of data. Government databases, citizen records, tax systems, land registries, pension records. The digital infrastructure of a sovereign nation, being moved out of the path of artillery.
Ukraine’s Ministry of Digital Transformation, working with Amazon Web Services and Microsoft, executed one of the largest emergency data migrations in history. Critical systems were lifted from on-premises data centers in Kyiv and Kharkiv — some of which would be physically destroyed within days — and relocated to cloud infrastructure outside the country. The operation had been quietly prepared for weeks before the invasion, after US intelligence warnings made the threat concrete. When the moment came, the migration was not theoretical. It was live-fire disaster recovery.
The organizations that survived digitally were those that had already architected for portability and redundancy. Those that hadn't architected that way faced losses that were, in some cases, permanent.
This is the story the technology industry tells itself about resilience. And it is a good story. It demonstrates the value of preparation, the power of distributed infrastructure, the importance of treating disaster recovery not as a compliance exercise but as an architectural principle. The lesson has been reinforced by subsequent events — the ongoing conflicts in the Middle East, the increasing frequency of state-level cyber operations, the growing recognition that geopolitical risk is infrastructure risk.
But here is the philosophical error: we have learned this lesson only for the systems we built ourselves.
The Comfortable Illusion
Every enterprise has a disaster recovery plan. Most have never tested it under real stress.
The typical corporate DR exercise is a tabletop simulation. Stakeholders gather in a conference room, walk through a scenario — ransomware attack, data center outage, regional disruption — and document their theoretical responses. Recovery Time Objectives and Recovery Point Objectives are recorded in spreadsheets. Runbooks are filed. Compliance boxes are checked.
The gap between these documented plans and actual recovery capability is enormous. And most boards don’t know it.
This is precisely why chaos engineering emerged as a discipline. Netflix’s Chaos Monkey, introduced in 2011, deliberately killed production instances at random to force engineers to build systems that could survive failure. The insight was simple and profound: you cannot know whether your systems are resilient until you have broken them. Controlled failure in production — not in a conference room, not in a staging environment, but in the live system — is the only honest test of resilience.
AWS GameDay exercises, Google’s DiRT (Disaster Recovery Testing), fault injection frameworks — these all operate on the same principle. Break it. Watch what happens. Fix the weaknesses you discover. Repeat.
This works because digital infrastructure has a property so fundamental that we rarely name it: reversibility. You can terminate an instance and launch another. You can restore from a snapshot. You can fail over to a secondary region. You can roll back a deployment. The blast radius of a deliberate failure is bounded and recoverable. That is what makes chaos engineering possible — not courage, but the underlying architecture of recoverability.
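That precondition can be written down in a few lines. The following is a minimal sketch, not Netflix's actual Chaos Monkey; the fleet abstraction, instance names, and recovery behaviour are hypothetical stand-ins chosen to show one thing: the experiment is only allowed to run because the system can relaunch what it kills.

```python
import random

# Toy illustration of the chaos-engineering precondition described above.
# The Fleet class and its instances are invented for this sketch.

class Fleet:
    """A service fleet in which any terminated instance can be relaunched."""

    def __init__(self, instances):
        self.instances = set(instances)

    def terminate(self, instance):
        print(f"Injecting failure: terminating {instance}")
        self.instances.discard(instance)

    def relaunch(self):
        replacement = f"i-{random.randrange(16**8):08x}"
        self.instances.add(replacement)
        print(f"Recovered: launched replacement {replacement}")


def chaos_experiment(fleet, min_healthy=2):
    """Kill a random instance only if the blast radius is bounded:
    enough redundant capacity remains to absorb the loss, and
    recovery is known to work."""
    if len(fleet.instances) <= min_healthy:
        print("Aborting: not enough redundant capacity to absorb a failure")
        return
    victim = random.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    fleet.relaunch()  # Reversibility is what makes the experiment safe.


if __name__ == "__main__":
    chaos_experiment(Fleet({"i-0a1b2c3d", "i-1b2c3d4e", "i-2c3d4e5f"}))
```

Strip out the `relaunch` step and the exercise stops being an experiment. It is just damage.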
Now consider what happens when you remove that property.
The Asymmetry
When you drain an aquifer, there is no failover.
The Ogallala Aquifer stretches beneath eight US states, from South Dakota to Texas. It took roughly ten thousand years to fill — the accumulated gift of the last ice age, water filtering through sand and gravel across geological time. Parts of the Texas Panhandle and western Kansas have drawn it down by more than 150 feet since the 1950s. At current extraction rates in those regions, the water will be functionally gone within a generation. There is no secondary region. There is no snapshot to restore from. The recovery time objective, if we stopped pumping today, is measured in millennia.
When you collapse a fishery, there is no rollback.
The Atlantic cod fishery off Newfoundland sustained communities for five hundred years. In the late 1980s, catches began to decline sharply. Scientists warned that stocks were being depleted faster than they could reproduce. The warnings were overridden by economic and political pressure. In 1992, the Canadian government declared a moratorium. Forty thousand people lost their livelihoods overnight. Three decades later — three decades in which the commercial fishery has remained essentially closed — the cod stocks have not recovered.
When you degrade soil to the point of desertification, there is no backup region to fail over to.
The United Nations estimates that 24 billion tons of fertile soil are lost annually to erosion and degradation. A third of the world’s arable land has been lost in the last forty years. Topsoil — the thin, biologically active layer that makes agriculture possible — takes roughly five hundred years to form one inch naturally. We are consuming it in decades. Some agricultural scientists have estimated that at current rates of degradation, the world’s topsoil could support only about sixty more harvests.
These are not edge cases. They are the systems that keep civilization running. And they share a property that is the precise inverse of digital infrastructure: irreversibility. There is no Chaos Monkey for the Ogallala Aquifer, because you cannot restore what took ten thousand years to create. There is no GameDay exercise for Atlantic cod, because you cannot relaunch a collapsed ecosystem from a backup.
In technology, resilience means redundancy, failover, and recovery. You build systems that can break and be rebuilt.
In nature, resilience means not breaking it in the first place — because the recovery time exceeds any meaningful human horizon.
We have no backup planet.
The Recovery Gap
The technology industry has developed sophisticated language for thinking about failure and recovery. Two concepts in particular are worth examining through this lens.
Recovery Time Objective (RTO) is the maximum acceptable time between a failure and the restoration of service. For a critical financial system, the RTO might be minutes. For a content management system, hours. The number is chosen based on business impact — what is the cost per unit of time of being down?
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. An RPO of one hour means you can tolerate losing up to one hour of transactions. An RPO of zero means no data loss is acceptable — which requires synchronous replication across multiple locations.
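In practice the arithmetic is almost embarrassingly simple. Here is a hedged sketch, with invented numbers, of the check an architect would run: worst case, a failure lands just before the next backup completes, so the maximum data loss equals the backup interval, and the restore estimate either fits inside the RTO or it does not.

```python
from datetime import timedelta

# Illustrative check of whether a backup schedule meets a stated RPO and
# whether an estimated restore time meets the RTO. Numbers are invented.

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case, failure strikes just before the next backup finishes,
    # so maximum data loss equals the interval between backups.
    return backup_interval <= rpo

def meets_rto(estimated_restore: timedelta, rto: timedelta) -> bool:
    return estimated_restore <= rto

print(meets_rpo(backup_interval=timedelta(hours=6), rpo=timedelta(hours=1)))    # False
print(meets_rto(estimated_restore=timedelta(hours=4), rto=timedelta(hours=2)))  # False
```

The value of writing it down this plainly is that the check fails loudly when the inputs are absurd. Keep that in mind for what follows.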
Now apply these concepts to natural systems.
What is the RTO for a depleted aquifer? Thousands of years. What is the RTO for collapsed fishery stocks? Decades — if recovery happens at all. What is the RTO for an extinct species? The answer is undefined. There is no recovery. The RTO is infinity.
What is the RPO for biodiversity? Every species lost is a permanent deletion from a database that took 3.8 billion years to populate. There is no transaction log. There is no point-in-time recovery.
If a CTO presented these numbers to a board — infinite recovery times, permanent data loss with no possibility of restoration, no failover capability — the response would be immediate and unambiguous: this system is unacceptable. No responsible organization would operate critical infrastructure with these parameters.
And yet. This is precisely how we operate the most critical infrastructure we have. The systems that produce our food, filter our water, regulate our climate, pollinate our crops, and stabilize our coastlines have no redundancy, no failover, and recovery times measured in geological epochs.
The Dasgupta Review, commissioned by the UK Treasury and published in 2021, put this in economic terms: humanity’s demands on nature now exceed its capacity to supply by an estimated 1.6 times. We are running a deficit against a resource base that cannot be recapitalized by quarterly earnings.
The Chaos We Cannot Engineer
Chaos engineering works because it operates within a boundary condition: the system under test is rebuildable. You can kill a process because you can restart it. You can corrupt data because you have backups. You can take down a region because you have another region.
Remove that boundary condition and chaos engineering becomes something else entirely. It becomes just chaos.
We are, in effect, running uncontrolled chaos experiments on natural systems — but without the safety nets that make chaos engineering a discipline rather than sabotage. We are injecting failures (pollution, overextraction, habitat destruction) into production (the biosphere) without snapshots, without failover, and without rollback capability. And we are doing it without monitoring dashboards, without alerting, and often without even knowing what we are losing.
This is not a metaphor stretched too thin. The structural parallel is precise. In technology, you would never run a chaos experiment on a system with no backups, no monitoring, and no recovery plan. You would call that negligence. You would call it a career-ending mistake. And yet this is the default mode of operation for how industrial civilization interacts with natural systems.
Consider the monitoring gap alone. A modern cloud environment generates metrics on every resource — CPU utilization, memory pressure, network latency, error rates, queue depths. Anomalies trigger alerts. Dashboards provide real-time visibility. You know the state of your system at all times.
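The mechanism is mundane: a metric crosses a threshold, an alert fires, a human or an automated remediation responds. A minimal sketch, with metric names and thresholds invented purely for illustration, looks like this.

```python
# Illustrative threshold alerting. Metric names and limits are invented,
# not taken from any real monitoring stack.

THRESHOLDS = {
    "cpu_utilization_pct": 90.0,
    "error_rate_per_min": 5.0,
    "queue_depth": 10_000,
}

def evaluate(sample: dict[str, float]) -> list[str]:
    """Return an alert for every metric that breaches its threshold."""
    return [
        f"ALERT {name}={value} exceeds {THRESHOLDS[name]}"
        for name, value in sample.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

print(evaluate({"cpu_utilization_pct": 97.2, "error_rate_per_min": 0.4}))
```

Trivial as it is, this loop only works if you know which metrics exist and can measure them continuously.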
What is the equivalent for an ecosystem? We do not have real-time monitoring of most aquifer levels. We do not have comprehensive dashboards for soil health across agricultural regions. We do not have alerting on biodiversity loss at the speed it occurs. The Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) estimated in 2019 that one million species are at risk of extinction — but the confidence interval on that number is wide, because we have not yet catalogued most of the species on Earth. We are losing components of a system we have not finished inventorying.
Imagine operating a data center where you didn’t have a complete asset register, couldn’t monitor half your servers, and discovered outages only after customers complained. No one would accept that. But it is a reasonable description of our relationship with the biosphere.
What Resilience Actually Requires
The Ukraine data evacuation succeeded for specific, instructive reasons. The preparation began before the crisis. The architecture had been designed — or was rapidly redesigned — for portability. International cooperation between the Ukrainian government and US-headquartered cloud providers made the migration physically possible. And critically, the people involved understood the value of what they were protecting.
Each of these conditions has a parallel in natural resource resilience.
Preparation before the crisis. The Ogallala Aquifer’s depletion has been documented for decades. The decline of global fisheries is well-characterized. Soil degradation is measured and mapped. The science is not the bottleneck. The bottleneck is the same one that existed in corporate DR before chaos engineering: the illusion that documented awareness is the same as operational readiness. Knowing you have a problem and having the architecture to survive it are different things.
Portability and redundancy. In natural systems, this translates to biodiversity itself. A diverse ecosystem is a redundant system — multiple species performing overlapping functions, so that the loss of one does not cascade into systemic failure. Monocultures, whether in agriculture or in how we think about solutions, are the equivalent of running your entire operation in a single availability zone. The efficiency is seductive. The fragility is hidden until it isn’t.
International cooperation. The Ukraine migration worked because organizations with the relevant capability chose to act. Natural resource resilience requires the same — coordinated action across jurisdictions, industries, and incentive structures. The Paris Agreement, the Kunming-Montreal Global Biodiversity Framework, the UN Decade on Ecosystem Restoration: these are the equivalents of mutual aid agreements between cloud providers. They exist on paper. Their operational readiness is untested.
Understanding the value of what you protect. This is the deepest parallel and the most important. Ukraine’s digital evacuation was driven by an acute understanding that these systems were irreplaceable — that losing them meant losing the administrative capacity of a nation. The urgency was visceral because the threat was visible.
The threat to natural systems is slower, more distributed, and easier to abstract away. But the stakes are higher. The systems we are degrading — water, soil, biodiversity, climate regulation — are not amenities. They are infrastructure. They are the operating system on which every human system, including every digital system, depends.
The Question
I spent years helping enterprises build resilient cloud architectures. Designing for failure. Architecting redundancy. Testing recovery. The discipline of resilience is, at its core, a discipline of honesty — forcing yourself to confront what would actually happen if the thing you depend on disappeared, and building accordingly.
That same discipline, applied with the same rigor to the natural systems we depend on, leads to an uncomfortable conclusion: we are nowhere close to resilient. Our RTO is infinity. Our RPO is permanent loss. Our monitoring is incomplete. Our failover capability is nonexistent. And unlike a cloud region, we cannot build another one.
The technology industry has demonstrated, under the most extreme conditions imaginable, that it knows how to protect critical infrastructure when it decides to. The question is whether we can extend that same architectural seriousness — the same refusal to accept unrecoverable failure — to the systems that actually keep us alive.
Because the planet is not a staging environment. There is no production-equivalent to fail over to. And we do not get to restore from backup.
What is your disaster recovery plan for that?