Replacing Legacy Systems Without Stopping Them
Your platform has been running on the same monolith for years. Millions of lines of code. Database tables with foreign keys that cross every domain boundary. Leadership wants it migrated to cloud-native microservices. The engineering team has been told to “just rewrite it.”
Demolish the building. Build a new one. Move everyone in over the weekend.
Six months into the rewrite, the new system covers 40% of the monolith’s functionality. The original system has had three hotfix deployments that nobody applied to the new codebase. Feature parity is a target that moves every time you look at it. The old building keeps getting renovated while the new one is under construction. And the business won’t accept a 48-hour cutover window because peak season is approaching. You know how this story ends. Everyone does. The rewrite gets quietly shelved and the monolith wins by default. (The monolith always wins by default.)
- Big-bang rewrites fail because feature parity is a moving target. The monolith ships hotfixes nobody applies to the new codebase. The gap widens every sprint until the rewrite gets shelved.
- Strangler fig routes traffic incrementally. One URL path at a time. The vine growing around the tree. The monolith doesn’t know it’s being replaced. Rollback is a config change.
- Database decoupling is the hardest part, not application code. Shared foreign keys across domains create coupling that defeats the purpose of extraction.
- Shadow mode testing catches bugs no synthetic test can find. Run both systems in parallel for 2-4 weeks before committing production traffic to the new one.
- Extract the highest-value, lowest-coupling domain first. Not the easiest. Not the most complex. The one that delivers measurable business value with the fewest cross-domain dependencies.
The cruel irony: systems that need migration most urgently (highest traffic, most integrations, most critical) are exactly the ones that cannot tolerate a cutover window. And rollback from a failed big-bang cutover means restoring from backup and accepting lost transactions. A coin flip disguised as a migration strategy.
Martin Fowler documented the strangler fig pattern as the safest migration approach. Each step individually reversible. Legacy keeps running throughout. Effective cloud migration is built around exactly this approach, starting from these prerequisites:
- API gateway or reverse proxy deployed in front of the legacy system (the future facade)
- Domain boundaries identified and documented (which code owns which data)
- Legacy database schema mapped with cross-domain foreign key dependencies cataloged
- CDC-capable database (PostgreSQL WAL or MySQL binlog accessible)
- Monitoring baseline established for legacy system latency, error rates, and throughput per endpoint
The Facade and Incremental Routing
The facade layer is an API gateway sitting in front of everything, making routing decisions. Initially, every request routes to the legacy system. The facade is transparent. Users see no difference. Zero traffic to the new system. Zero risk. Day one should be boring. Boring is exactly what you want.
As capabilities are built on the new cloud-native platform, the facade’s routing configuration shifts specific request types to the new system. Start with the simplest, lowest-risk capability: a read-only endpoint with no write side effects. Something like a product catalog lookup or account balance query. Validate the new system’s response against the legacy system’s response before committing real traffic. Pick the most boring endpoint you have. Migrate that first.
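A minimal sketch of that routing decision, assuming a hypothetical route table keyed by path prefix. Real deployments express this in the gateway’s own config language (NGINX, Envoy, Kong), but the logic is the same:

```python
from urllib.request import Request, urlopen

LEGACY_ORIGIN = "http://legacy.internal:8080"     # assumed internal hosts
NEW_ORIGIN = "http://new-platform.internal:8080"

# Only explicitly migrated path prefixes route to the new system.
# Day one, this set is empty: every request goes to the monolith.
MIGRATED_PREFIXES = {
    "/catalog",  # read-only product lookup: the boring endpoint that moves first
}

def route(path: str) -> str:
    """Return the origin that should serve this request path."""
    if any(path.startswith(p) for p in MIGRATED_PREFIXES):
        return NEW_ORIGIN
    return LEGACY_ORIGIN  # default: the monolith keeps serving

def forward(path: str) -> bytes:
    """Proxy the request to whichever origin owns the path."""
    with urlopen(Request(route(path) + path)) as resp:
        return resp.read()
```

Rollback is removing a prefix from that set and reloading the gateway config. No deploy, no data migration, no cutover window.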
Shadow Mode: The Migration Safety Net
Shadow testing is the single most effective risk reduction technique in any migration. Also the one teams most frequently skip “to save time.” That shortcut always costs more than it saves.
In shadow mode, the facade sends each request to both systems at the same time. The legacy system’s response is served to the user. The new system’s response is compared against it automatically. Divergences get logged and alerted on. Users are never affected because the new system’s responses are never served. Zero risk, all the learning.
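A sketch of that dispatch, with stand-in functions for the two backends (the names and the normalize rules are assumptions, not any specific gateway’s API). The two properties that matter: the legacy response is always what the user receives, and the comparison runs off the request path:

```python
import concurrent.futures
import logging

log = logging.getLogger("shadow")
pool = concurrent.futures.ThreadPoolExecutor(max_workers=32)

def call_legacy(path: str) -> dict:
    return {"source": "legacy", "path": path}  # stand-in for the real HTTP call

def call_new(path: str) -> dict:
    return {"source": "new", "path": path}     # stand-in for the real HTTP call

def normalize(resp: dict) -> dict:
    # Strip fields expected to differ (timestamps, hostnames, trace ids)
    # so only meaningful divergences get flagged.
    return {k: v for k, v in resp.items() if k != "source"}

def handle(path: str) -> dict:
    """Serve the legacy response; mirror to the new system off the request path."""
    legacy_resp = call_legacy(path)                 # authoritative answer
    pool.submit(shadow_compare, path, legacy_resp)  # never blocks the user
    return legacy_resp

def shadow_compare(path: str, legacy_resp: dict) -> None:
    try:
        new_resp = call_new(path)
        if normalize(new_resp) != normalize(legacy_resp):
            log.warning("divergence path=%s legacy=%r new=%r",
                        path, legacy_resp, new_resp)
    except Exception:
        # A new-system failure in shadow mode is data, not an incident.
        log.exception("shadow call failed path=%s", path)
```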
What shadow testing catches that no other approach can: differences in behavior under real production data patterns, race conditions that only appear at production concurrency, edge cases in data that no test fixture anticipated, and performance characteristics under real load distributions. Shadow testing regularly catches timezone handling differences, floating-point rounding mismatches, and character encoding edge cases that would have corrupted customer data after cutover. Each is a 15-minute fix found in shadow mode. Each is a multi-day incident found in production.
Two weeks of shadow testing against live traffic is worth more than two months of testing against synthetic data.
Data Synchronization During Transition
Database consistency between both systems during parallel running is the hardest technical problem in a strangler fig migration. Every team underestimates it. Without deliberate synchronization, data drift starts producing customer-visible errors within the first two weeks, errors harder to debug than any cutover failure would have been. A customer updates their address in the new system. The legacy system still has the old address. An invoice goes to the wrong place.
Simple principle. Hard to enforce. Each data entity has a single system of record at any given time. For the customer profile entity, either the legacy system is authoritative or the new system is. Never both at once. When migrating the customer profile capability, writes go to the new system and replicate to the legacy system. The legacy system reads from its local copy but doesn’t write to it.
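One way to make the rule enforceable rather than aspirational is an explicit authority map consulted on every write. A minimal sketch with hypothetical entity names and stand-in write functions:

```python
# Which system is currently authoritative for writes, per entity type.
# Flipping an entry is the migration event. It is never "both".
WRITE_AUTHORITY = {
    "customer_profile": "new",     # migrated: writes land in the new system
    "orders":           "legacy",  # not yet migrated
}

def write_to_new(entity: str, payload: dict) -> None:
    print("new <-", entity, payload)     # stand-in for the real write

def write_to_legacy(entity: str, payload: dict) -> None:
    print("legacy <-", entity, payload)  # stand-in for the real write

def write(entity: str, payload: dict) -> None:
    """Route every write to the single system of record for this entity."""
    if WRITE_AUTHORITY[entity] == "new":
        write_to_new(entity, payload)    # replication back to legacy is async
    else:
        write_to_legacy(entity, payload) # CDC replicates legacy -> new
```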
Change Data Capture from the legacy database enables this replication without modifying the legacy application. Debezium on top of PostgreSQL’s WAL or MySQL’s binlog streams every change to Kafka, where it gets consumed and applied to the new system’s database. The reverse replication (new system back to legacy) is messier. Usually a dedicated sync service that handles conflicts, applies data model transformations, and fails loudly when it hits data it can’t reconcile. The “fail loudly” part matters most. Silent data divergence kills migrations.
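A sketch of the consuming side, assuming Debezium’s JSON change envelope with schemas disabled and the kafka-python client; topic and table names are illustrative:

```python
import json
from kafka import KafkaConsumer  # kafka-python client

def upsert_customer(row: dict) -> None:
    print("upsert", row)          # stand-in for an idempotent write to the new DB

def delete_customer(customer_id) -> None:
    print("delete", customer_id)  # stand-in

# Debezium publishes one topic per table (server.schema.table).
consumer = KafkaConsumer(
    "legacy.public.customers",
    bootstrap_servers="kafka.internal:9092",
    group_id="new-platform-sync",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    enable_auto_commit=False,     # commit only after a successful apply
)

for msg in consumer:
    event = msg.value             # Debezium envelope: op, before, after
    if event["op"] in ("c", "r", "u"):  # insert, snapshot read, update
        upsert_customer(event["after"])
    elif event["op"] == "d":            # deletes carry "before", not "after"
        delete_customer(event["before"]["id"])
    consumer.commit()  # at-least-once delivery: applies must be idempotent
```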
The migration sequence follows the data dependency graph. Migrate domains whose data has no dependencies on domains still in the legacy system first. Migrate write-heavy domains last, after all downstream read consumers have been migrated. The CDC streams and sync services described above are the backbone that makes this sequencing work.
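The extraction order falls out of the dependency catalog mechanically. A sketch using the standard library’s topological sort, with illustrative domain names:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# domain -> set of other domains whose data it reads (illustrative catalog)
data_deps = {
    "catalog":   set(),                    # no cross-domain reads: migrate early
    "customers": set(),
    "orders":    {"catalog", "customers"},
    "billing":   {"orders", "customers"},  # write-heavy, downstream of everything
}

# static_order() yields each domain only after its dependencies,
# i.e. a domain is extracted only once the data it reads has already moved.
print(list(TopologicalSorter(data_deps).static_order()))
# one valid order: ['catalog', 'customers', 'orders', 'billing']
```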
| Migration Phase | Data Authority | Replication Direction | Risk Level |
|---|---|---|---|
| Shadow mode | Legacy (writes and reads) | Legacy to new (read-only sync) | Minimal |
| Read migration | Legacy (writes), new (reads) | Legacy to new (continuous CDC) | Low |
| Write migration | New (writes and reads) | New to legacy (backward compat) | Medium |
| Decommission | New only | None (legacy data archived) | Low |
The Decommission Problem
Strangler fig migrations that never finish are worse than big-bang failures. This rarely gets discussed, but it’s the most common outcome. At least a failed cutover surfaces the problem immediately. A stalled migration leaves two live systems with a permanent facade. Two codebases to maintain. Two sets of infrastructure to patch. None of the simplification anyone promised. You didn’t migrate. You doubled your operational burden and called it progress.
Don’t: Let the facade accumulate business logic. Once teams start building features against the facade API instead of the new system, the facade becomes a permanent third system. Now you maintain three codebases instead of migrating from one to another.
Do: Keep the facade strictly to routing and protocol translation. Zero business logic. Every feature request goes to the new system. Explicit decommission gates with 30-day zero-traffic validation windows for each migrated capability.
Actually finishing requires organizational will you establish at the start, not negotiate at the end. Decommission milestones need to exist before the first line of migration code. Each phase has an explicit gate: zero traffic routed to the legacy system for this capability for 30 consecutive days, no data dependencies from remaining legacy capabilities on this data domain. Without hard gates, timelines don’t slip. They evaporate.
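The gate itself can be a scheduled check instead of a judgment call. A sketch assuming a hypothetical metrics lookup (in practice, a range query against the facade’s per-route request counters):

```python
from datetime import date, timedelta

GATE_DAYS = 30  # consecutive zero-traffic days required to decommission

def legacy_requests_on(capability: str, day: date) -> int:
    # Hypothetical metrics lookup, e.g. a Prometheus range query over the
    # gateway's per-route counters. Stubbed here.
    return 0

def decommission_gate(capability: str, today: date) -> bool:
    """True only if the legacy path served zero requests for 30 straight days."""
    return all(
        legacy_requests_on(capability, today - timedelta(days=d)) == 0
        for d in range(1, GATE_DAYS + 1)
    )
```

Wire this into a weekly job. A capability that fails the gate keeps both of its codepaths until it passes.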
Cloud-native architecture ROI is realized only when the legacy system is actually decommissioned. While both systems run, you pay for both, maintain both, and patch both.
When Lift-and-Shift Is the Better Answer
Not every migration needs the strangler fig. The complexity is justified when the legacy system has real architectural problems worth fixing: tight coupling, scaling limits, patterns blocking business progress. If the system is well-designed and just needs cloud hosting, moving to cloud infrastructure and improving architecture incrementally is more pragmatic.
| Factor | Strangler Fig | Lift-and-Shift |
|---|---|---|
| Codebase | Large monolith with architectural debt | Well-structured, needs cloud hosting |
| Goal | Modernize architecture + migrate | Migrate infrastructure only |
| Timeline | Months to years (incremental) | Weeks to months |
| Risk per step | Low (each step reversible) | Moderate (bigger blast radius per move) |
| When it wins | Architecture is the bottleneck | Infrastructure is the bottleneck |
What the Industry Gets Wrong About Strangler Fig Migration
“Start with the easiest domain to extract.” Start with the highest-value domain. Easy extractions prove the pattern but don’t prove the value. Extracting the user profile service is technically simple and delivers zero business impact. Extracting the ordering domain is harder and immediately enables independent deployment for revenue-critical features.
“Rewrite the database layer while extracting services.” Separate the concerns. Extract the service first with a shared database and an anti-corruption layer for data access. Decouple the database second. Attempting both at the same time doubles the risk and the timeline. One migration per dimension.
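A sketch of that anti-corruption layer, assuming a hypothetical legacy CUSTOMERS table with cryptic column names and a DB-API style connection (e.g. psycopg). The point is that legacy schema quirks stop at this boundary:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    """The extracted service's own domain model."""
    customer_id: int
    email: str
    marketing_opt_in: bool

class LegacyCustomerRepository:
    """Anti-corruption layer over the shared legacy database: the service
    speaks Customer; legacy naming and encodings never leak upward."""

    def __init__(self, conn):
        self.conn = conn  # e.g. a psycopg connection to the shared DB

    def get(self, customer_id: int) -> Customer:
        # Column names are illustrative legacy cruft; adapt to the real schema.
        row = self.conn.execute(
            "SELECT CUST_NO, EMAIL_ADDR, MKTG_FLG"
            " FROM CUSTOMERS WHERE CUST_NO = %s",
            (customer_id,),
        ).fetchone()
        return Customer(
            customer_id=row[0],
            email=row[1].strip().lower(),      # legacy stores padded uppercase
            marketing_opt_in=(row[2] == "Y"),  # legacy 'Y'/'N' flag -> bool
        )
```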
Same monolith. Same team. Traffic shifts one capability at a time through the facade. Shadow mode catches the timezone bug and the rounding mismatch before any customer sees them. The ordering service goes live after two weeks of shadow validation. Rollback is a gateway config change. No cutover window. No coin flip.

One last decision check: if the target system is architecturally similar to the source (the same monolith on different hosting), just lift and shift. If you’re modernizing, the strangler fig payoff scales with how much architecture you’re actually fixing. A legacy modernization assessment tells you which one you’re doing before you commit.