Cloud Migration

What we learned migrating 12,000 VMs in 90 days.

April 28, 2026

On a Friday in late January, our regulator notified us — with the formal politeness regulators reserve for moments of maximum unhelpfulness — that one of our data center contracts was no longer compliant with new sovereignty rules, and would not be renewed. We had until the end of April to be out. Approximately twelve thousand virtual machines, four hundred terabytes of relational data, and a tangle of seventeen years of decisions made by people who no longer worked here. Ninety days.

We made it. Two services went down for longer than we wanted. None lost data. The retrospective took six weeks and produced a document longer than this article. What follows is the shortest version I can write while still being honest.

The wave structure

We organized everything into five waves. Wave zero was discovery: every running process, every dependency, every certificate, every hard-coded IP. Waves one through four were the actual migration, sequenced by blast radius — the things that could fail loudly without taking the company down went first, the things that could not fail at all went last. Each wave was two weeks. Each wave ended with a real cut-over of real production. There was no “pilot” wave that didn’t count. Every wave counted.
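To make the sequencing concrete, here is a minimal sketch of ordering workloads by blast radius and splitting them across the four migration waves. The numeric `blast_radius` score, the workload names, and the `assign_waves` helper are all hypothetical, invented for illustration; this is not the tooling we actually ran.

```python
def assign_waves(blast_radius, waves=4):
    """Split workloads into waves, smallest blast radius first.

    blast_radius maps a workload name to a hypothetical score
    (higher means a failure hurts more). Returns {wave_number: [names]}.
    """
    # Sort by score, then by name so the ordering is deterministic.
    ordered = sorted(blast_radius, key=lambda name: (blast_radius[name], name))

    # Contiguous chunks of near-equal size; earlier waves absorb the remainder.
    base, extra = divmod(len(ordered), waves)
    plan = {}
    start = 0
    for wave in range(1, waves + 1):
        size = base + (1 if wave <= extra else 0)
        plan[wave] = ordered[start:start + size]
        start += size
    return plan


if __name__ == "__main__":
    # Illustrative scores only.
    scores = {"wiki": 1, "ci": 2, "search": 5, "billing": 9, "auth": 10}
    for wave, names in assign_waves(scores).items():
        print(f"wave {wave}: {', '.join(names)}")
```

In practice a score like this would be argued over by people rather than computed, but writing the order down in one place makes the argument shorter.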

The three rules

Before wave one started, we wrote three rules on a whiteboard. They stayed up for ninety days. Every architectural argument got resolved by checking which rule applied. I cannot overstate how much time this saved.

Rule 1 — Lift, then shift, then improve. Never simultaneously.

The temptation to “fix this while we’re touching it” is the single biggest reason migrations slip. We had a long list of things we wanted to clean up — services that should be containerized, databases that should be sharded, scheduled jobs that should be event-driven — and we deferred all of them. Every one. The migration’s job was to move the workload to the new substrate without changing its shape. Improvement comes after stability, not during transition.

Rule 2 — If you can’t roll it back in an hour, you can’t roll it forward today.

Every cut-over had a documented rollback path tested before the cut-over happened. If a rollback would take more than an hour, the cut-over didn’t happen that day. This sounds slow. It was faster. We rolled back twice in ninety days, both times to a clean known-good state, both times without an all-hands incident. The rollback discipline is the only reason the team could move quickly without being terrified.
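Rule 2 is simple enough to express as a gate. The sketch below assumes a hypothetical `CutoverPlan` record with a rehearsed rollback duration; the names and fields are illustrative, not a description of our actual tooling.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# The one-hour ceiling from Rule 2.
MAX_ROLLBACK = timedelta(hours=1)


@dataclass
class CutoverPlan:
    workload: str
    rollback_documented: bool
    # Duration measured in a rollback rehearsal; None if never tested.
    rehearsed_rollback: Optional[timedelta] = None


def may_cut_over(plan):
    """Return (allowed, reason) for a proposed cut-over."""
    if not plan.rollback_documented:
        return False, "no documented rollback path"
    if plan.rehearsed_rollback is None:
        return False, "rollback has not been tested"
    if plan.rehearsed_rollback > MAX_ROLLBACK:
        return False, "rollback takes longer than one hour"
    return True, "ok"


if __name__ == "__main__":
    plan = CutoverPlan("billing-db", True, timedelta(minutes=95))
    print(may_cut_over(plan))
```

The point of encoding it this way is that an untested rollback and a slow rollback get the same answer: not today.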

Rule 3 — The owner of the workload owns the cut-over.

The platform team did not migrate anyone’s service. We migrated the substrate. The teams who owned each workload did the cut-over themselves, on our tooling, with our support. This was unpopular for the first two weeks and indispensable by week three. Nobody knows the strange behavior of a system better than the people who built and ran it. Centralizing the cut-over decision in the platform team would have created a bottleneck and, worse, an accountability gap when something went sideways.

What broke anyway

Two outages, both during wave three, both caused by the same thing: a hard-coded IP address in a configuration file that was managed by a different team than the one running the cut-over. The first time it happened we lost forty-six minutes of write availability on a non-critical service. The second time we lost twenty-one minutes, on a more critical service, but we knew what to look for and the rollback was clean.

Both came down to the same gap: our discovery tooling was good at finding network dependencies between hosts, and bad at finding configuration that named hosts by IP rather than DNS. Between waves one and two we added a static-analysis pass that grepped every config repo for IP-shaped strings. We thought we’d caught them all. We had caught maybe 80%. The other 20% were in places like the body of an alerting webhook or a commented-out-but-still-active block in a Puppet manifest that had been forgotten for nine years.
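For readers who want to try something similar, here is a minimal sketch of an IP-shaped-string scan. It is an illustration rather than the pass we ran: the `find_ips` and `scan_repo` names are made up, it covers IPv4 only, and it will report dotted strings that merely look like addresses, such as some four-part version numbers.

```python
import ipaddress
import re
from pathlib import Path

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IP_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")


def find_ips(text):
    """Return (line_number, ip) pairs for every valid IPv4 literal in text.

    Lines are scanned as raw text, so comments and URL bodies are included.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IP_CANDIDATE.finditer(line):
            candidate = match.group(0)
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # IP-shaped but out of range, e.g. 999.1.1.1
            hits.append((lineno, candidate))
    return hits


def scan_repo(root):
    """Yield (path, line_number, ip) for every readable file under root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, ip in find_ips(text):
            yield path, lineno, ip


if __name__ == "__main__":
    sample = "server = 10.0.0.5\n# hook: http://192.168.1.20:8080/alert\n"
    for lineno, ip in find_ips(sample):
        print(lineno, ip)
```

Because the scan ignores file syntax, a commented-out line is treated like any other. What it cannot see is configuration that lives outside the repositories being scanned, which is where a webhook body is likely to hide.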

The one decision that mattered more than the rest

Before wave zero, we spent ten days arguing about tooling. Cloud A or cloud B; this migration platform or that one; rehost or replatform. The argument felt important. It was not. Every option on the shortlist would have worked. None of them would have failed catastrophically. The decisions inside the wave structure — which order, which rollback, which owner — mattered far more than the substrate decisions we agonized over.

If I could give my January self one piece of advice, it would be: pick the cloud you have the most operational experience with, even if the other one is technically better on paper. Familiarity is a multiplier on every other decision you make for ninety straight days. We picked the one we knew. I am certain that was right.

After

Wave four cut over on April 23rd. The last machine in the old data center was decommissioned on April 28th, two days before the deadline. The team took the long weekend off, which was the first long weekend any of us had taken since January. The improvement work — the deferred containerization, the sharding, the event-driven cleanups — started in earnest in May and will run for most of this year. We are doing it slowly, with proper design reviews, with no regulator in the room. It is much more pleasant work.

The migration was the hardest thing this team has done together. It is also, by a comfortable margin, the proudest I have ever been of a group of engineers. Ninety days. Three rules. One whiteboard. The substrate matters less than people will tell you. The discipline matters more.
