Replacing Legacy Systems Without Stopping Them
Your platform has been running on the same monolith for years. Millions of lines of code. Database tables with foreign keys that cross every domain boundary. Leadership wants it migrated to cloud-native microservices. The engineering team has been told to “just rewrite it.”
Demolish the building. Build a new one. Move everyone in over the weekend.
Six months into the rewrite, the new system covers 40% of the monolith’s functionality. The original system has had three hotfix deployments that nobody applied to the new codebase. Feature parity is a target that moves every time you look at it. The old building keeps getting renovated while the new one is under construction. And the business won’t accept a 48-hour cutover window because peak season is approaching. You know how this story ends. Everyone does. The rewrite gets quietly shelved and the monolith wins by default. (The monolith always wins by default.)
- Big-bang rewrites fail because feature parity is a moving target. The monolith ships hotfixes nobody applies to the new codebase. The gap widens every sprint until the rewrite gets shelved.
- Strangler fig routes traffic incrementally. One URL path at a time. The vine growing around the tree. The monolith doesn’t know it’s being replaced. Rollback is a config change.
- Database decoupling is the hardest part, not application code. Shared foreign keys across domains create coupling that defeats the purpose of extraction.
- Shadow mode testing catches bugs no synthetic test can find. Run both systems in parallel for 2-4 weeks before committing production traffic to the new one.
- Extract the highest-value, lowest-coupling domain first. Not the easiest. Not the most complex. The one that delivers measurable business value with the fewest cross-domain dependencies.
The cruel irony: systems that need migration most urgently (highest traffic, most integrations, most critical) are exactly the ones that cannot tolerate a cutover window. And rollback from a failed big-bang cutover means restoring from backup and accepting lost transactions. A coin flip disguised as a migration strategy.
Martin Fowler documented the strangler fig pattern as the safest migration approach. Each step individually reversible. Legacy keeps running throughout. Effective cloud migration is built around exactly this approach, starting from these prerequisites:
- API gateway or reverse proxy deployed in front of the legacy system (the future facade)
- Domain boundaries identified and documented (which code owns which data)
- Legacy database schema mapped with cross-domain foreign key dependencies cataloged
- CDC-capable database (PostgreSQL WAL or MySQL binlog accessible)
- Monitoring baseline established for legacy system latency, error rates, and throughput per endpoint
The Facade and Incremental Routing
The facade layer is an API gateway sitting in front of everything, making routing decisions. Initially, every request routes to the legacy system. The facade is transparent. Users see no difference. Zero traffic to the new system. Zero risk. Day one should be boring. Boring is exactly what you want.
As capabilities are built on the new cloud-native platform, the facade’s routing configuration shifts specific request types to the new system. Start with the simplest, lowest-risk capability: a read-only endpoint with no write side effects. Something like a product catalog lookup or account balance query. Validate the new system’s response against the legacy system’s response before committing real traffic. Pick the most boring endpoint you have. Migrate that first.
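A minimal sketch of that routing decision, assuming a hypothetical route table keyed by path prefix. Real deployments express this in the gateway’s own config language (NGINX, Envoy, Kong), but the logic is the same:

```python
from urllib.request import Request, urlopen

LEGACY_ORIGIN = "http://legacy.internal:8080"     # assumed internal hosts
NEW_ORIGIN = "http://new-platform.internal:8080"

# Only explicitly migrated path prefixes route to the new system.
# Day one, this set is empty: every request goes to the monolith.
MIGRATED_PREFIXES = {
    "/catalog",  # read-only product lookup: the boring endpoint that moves first
}

def route(path: str) -> str:
    """Return the origin that should serve this request path."""
    if any(path.startswith(p) for p in MIGRATED_PREFIXES):
        return NEW_ORIGIN
    return LEGACY_ORIGIN  # default: the monolith keeps serving

def forward(path: str) -> bytes:
    """Proxy the request to whichever origin owns the path."""
    with urlopen(Request(route(path) + path)) as resp:
        return resp.read()
```

Rollback is removing a prefix from that set and reloading the gateway config. No deploy, no data migration, no cutover window.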
Shadow Mode: The Migration Safety Net
Shadow testing is the single most effective risk reduction technique in any migration. Also the one teams most frequently skip “to save time.” That shortcut always costs more than it saves.
In shadow mode, the facade sends each request to both systems at the same time. The legacy system’s response is served to the user. The new system’s response is compared against it automatically. Divergences get logged and alerted on. Users are never affected because the new system’s responses are never served. Zero risk, all the learning.
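A sketch of that dispatch, with stand-in functions for the two backends (the names and the normalize rules are assumptions, not any specific gateway’s API). The two properties that matter: the legacy response is always what the user receives, and the comparison runs off the request path:

```python
import concurrent.futures
import logging

log = logging.getLogger("shadow")
pool = concurrent.futures.ThreadPoolExecutor(max_workers=32)

def call_legacy(path: str) -> dict:
    return {"source": "legacy", "path": path}  # stand-in for the real HTTP call

def call_new(path: str) -> dict:
    return {"source": "new", "path": path}     # stand-in for the real HTTP call

def normalize(resp: dict) -> dict:
    # Strip fields expected to differ (timestamps, hostnames, trace ids)
    # so only meaningful divergences get flagged.
    return {k: v for k, v in resp.items() if k != "source"}

def handle(path: str) -> dict:
    """Serve the legacy response; mirror to the new system off the request path."""
    legacy_resp = call_legacy(path)                 # authoritative answer
    pool.submit(shadow_compare, path, legacy_resp)  # never blocks the user
    return legacy_resp

def shadow_compare(path: str, legacy_resp: dict) -> None:
    try:
        new_resp = call_new(path)
        if normalize(new_resp) != normalize(legacy_resp):
            log.warning("divergence path=%s legacy=%r new=%r",
                        path, legacy_resp, new_resp)
    except Exception:
        # A new-system failure in shadow mode is data, not an incident.
        log.exception("shadow call failed path=%s", path)
```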
What shadow testing catches that no other approach can: differences in behavior under real production data patterns, race conditions that only appear at production concurrency, edge cases in data that no test fixture anticipated, and performance characteristics under real load distributions. Shadow testing regularly catches timezone handling differences, floating-point rounding mismatches, and character encoding edge cases that would have corrupted customer data after cutover. Each is a 15-minute fix found in shadow mode. Each is a multi-day incident found in production.
Two weeks of shadow testing against live traffic is worth more than two months of testing against synthetic data.
Data Synchronization During Transition
Database consistency between both systems during parallel running is the hardest technical problem in a strangler fig migration. Every team underestimates it. Without deliberate synchronization, data drift starts producing customer-visible errors within the first two weeks, errors harder to debug than any cutover failure would have been. A customer updates their address in the new system. The legacy system still has the old address. An invoice goes to the wrong place.
Simple principle. Hard to enforce. Each data entity has a single system of record at any given time. For the customer profile entity, either the legacy system is authoritative or the new system is. Never both at once. When migrating the customer profile capability, writes go to the new system and replicate to the legacy system. The legacy system reads from its local copy but doesn’t write to it.
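One way to make the rule enforceable rather than aspirational is an explicit authority map consulted on every write. A minimal sketch with hypothetical entity names and stand-in write functions:

```python
# Which system is currently authoritative for writes, per entity type.
# Flipping an entry is the migration event. It is never "both".
WRITE_AUTHORITY = {
    "customer_profile": "new",     # migrated: writes land in the new system
    "orders":           "legacy",  # not yet migrated
}

def write_to_new(entity: str, payload: dict) -> None:
    print("new <-", entity, payload)     # stand-in for the real write

def write_to_legacy(entity: str, payload: dict) -> None:
    print("legacy <-", entity, payload)  # stand-in for the real write

def write(entity: str, payload: dict) -> None:
    """Route every write to the single system of record for this entity."""
    if WRITE_AUTHORITY[entity] == "new":
        write_to_new(entity, payload)    # replication back to legacy is async
    else:
        write_to_legacy(entity, payload) # CDC replicates legacy -> new
```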
Change Data Capture from the legacy database enables this replication without modifying the legacy application. Debezium on top of PostgreSQL’s WAL or MySQL’s binlog streams every change to Kafka, where it gets consumed and applied to the new system’s database. The reverse replication (new system back to legacy) is messier. Usually a dedicated sync service that handles conflicts, applies data model transformations, and fails loudly when it hits data it can’t reconcile. The “fail loudly” part matters most. Silent data divergence kills migrations.
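A sketch of the consuming side, assuming Debezium’s JSON change envelope with schemas disabled and the kafka-python client; topic and table names are illustrative:

```python
import json
from kafka import KafkaConsumer  # kafka-python client

def upsert_customer(row: dict) -> None:
    print("upsert", row)          # stand-in for an idempotent write to the new DB

def delete_customer(customer_id) -> None:
    print("delete", customer_id)  # stand-in

# Debezium publishes one topic per table (server.schema.table).
consumer = KafkaConsumer(
    "legacy.public.customers",
    bootstrap_servers="kafka.internal:9092",
    group_id="new-platform-sync",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    enable_auto_commit=False,     # commit only after a successful apply
)

for msg in consumer:
    event = msg.value             # Debezium envelope: op, before, after
    if event["op"] in ("c", "r", "u"):  # insert, snapshot read, update
        upsert_customer(event["after"])
    elif event["op"] == "d":            # deletes carry "before", not "after"
        delete_customer(event["before"]["id"])
    consumer.commit()  # at-least-once delivery: applies must be idempotent
```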
The migration sequence follows the data dependency graph. Migrate domains whose data has no dependencies on domains still in the legacy system first. Migrate write-heavy domains last, after all downstream read consumers have been migrated. The CDC streams and sync services described above are the backbone that makes this sequencing work.
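The extraction order falls out of the dependency catalog mechanically. A sketch using the standard library’s topological sort, with illustrative domain names:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# domain -> set of other domains whose data it reads (illustrative catalog)
data_deps = {
    "catalog":   set(),                    # no cross-domain reads: migrate early
    "customers": set(),
    "orders":    {"catalog", "customers"},
    "billing":   {"orders", "customers"},  # write-heavy, downstream of everything
}

# static_order() yields each domain only after its dependencies,
# i.e. a domain is extracted only once the data it reads has already moved.
print(list(TopologicalSorter(data_deps).static_order()))
# one valid order: ['catalog', 'customers', 'orders', 'billing']
```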
| Migration Phase | Data Authority | Replication Direction | Risk Level |
|---|---|---|---|
| Shadow mode | Legacy (writes and reads) | Legacy to new (read-only sync) | Minimal |
| Read migration | Legacy (writes), new (reads) | Legacy to new (continuous CDC) | Low |
| Write migration | New (writes and reads) | New to legacy (backward compat) | Medium |
| Decommission | New only | None (legacy data archived) | Low |
The Decommission Problem
Strangler fig migrations that never finish are worse than big-bang failures. This rarely gets discussed, but it’s the most common outcome. At least a failed cutover surfaces the problem immediately. A stalled migration leaves two live systems with a permanent facade. Two codebases to maintain. Two sets of infrastructure to patch. None of the simplification anyone promised. You didn’t migrate. You doubled your operational burden and called it progress.
Don’t: Let the facade accumulate business logic. Once teams start building features against the facade API instead of the new system, the facade becomes a permanent third system. Now you maintain three codebases instead of migrating from one to another.
Do: Keep the facade strictly to routing and protocol translation. Zero business logic. Every feature request goes to the new system. Explicit decommission gates with 30-day zero-traffic validation windows for each migrated capability.
Actually finishing requires organizational will you establish at the start, not negotiate at the end. Decommission milestones need to exist before the first line of migration code. Each phase has an explicit gate: zero traffic routed to the legacy system for this capability for 30 consecutive days, no data dependencies from remaining legacy capabilities on this data domain. Without hard gates, timelines don’t slip. They evaporate.
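The gate itself can be a scheduled check instead of a judgment call. A sketch assuming a hypothetical metrics lookup (in practice, a range query against the facade’s per-route request counters):

```python
from datetime import date, timedelta

GATE_DAYS = 30  # consecutive zero-traffic days required to decommission

def legacy_requests_on(capability: str, day: date) -> int:
    # Hypothetical metrics lookup, e.g. a Prometheus range query over the
    # gateway's per-route counters. Stubbed here.
    return 0

def decommission_gate(capability: str, today: date) -> bool:
    """True only if the legacy path served zero requests for 30 straight days."""
    return all(
        legacy_requests_on(capability, today - timedelta(days=d)) == 0
        for d in range(1, GATE_DAYS + 1)
    )
```

Wire this into a weekly job. A capability that fails the gate keeps both of its codepaths until it passes.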
Cloud-native architecture ROI is realized only when the legacy system is actually decommissioned. While both systems run, you pay for both, maintain both, and patch both.
When Lift-and-Shift Is the Better Answer
Not every migration needs the strangler fig. The complexity is justified when the legacy system has real architectural problems worth fixing: tight coupling, scaling limits, patterns blocking business progress. If the system is well-designed and just needs cloud hosting, moving to cloud infrastructure and improving architecture incrementally is more pragmatic.
| Factor | Strangler Fig | Lift-and-Shift |
|---|---|---|
| Codebase | Large monolith with architectural debt | Well-structured, needs cloud hosting |
| Goal | Modernize architecture + migrate | Migrate infrastructure only |
| Timeline | Months to years (incremental) | Weeks to months |
| Risk per step | Low (each step reversible) | Moderate (bigger blast radius per move) |
| When it wins | Architecture is the bottleneck | Infrastructure is the bottleneck |
What the Industry Gets Wrong About Strangler Fig Migration
“Start with the easiest domain to extract.” Start with the highest-value domain. Easy extractions prove the pattern but don’t prove the value. Extracting the user profile service is technically simple and delivers zero business impact. Extracting the ordering domain is harder and immediately enables independent deployment for revenue-critical features.
“Rewrite the database layer while extracting services.” Separate the concerns. Extract the service first with a shared database and an anti-corruption layer for data access. Decouple the database second. Attempting both at the same time doubles the risk and the timeline. One migration per dimension.
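A sketch of that anti-corruption layer, assuming a hypothetical legacy CUSTOMERS table with cryptic column names and a DB-API style connection (e.g. psycopg). The point is that legacy schema quirks stop at this boundary:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    """The extracted service's own domain model."""
    customer_id: int
    email: str
    marketing_opt_in: bool

class LegacyCustomerRepository:
    """Anti-corruption layer over the shared legacy database: the service
    speaks Customer; legacy naming and encodings never leak upward."""

    def __init__(self, conn):
        self.conn = conn  # e.g. a psycopg connection to the shared DB

    def get(self, customer_id: int) -> Customer:
        # Column names are illustrative legacy cruft; adapt to the real schema.
        row = self.conn.execute(
            "SELECT CUST_NO, EMAIL_ADDR, MKTG_FLG"
            " FROM CUSTOMERS WHERE CUST_NO = %s",
            (customer_id,),
        ).fetchone()
        return Customer(
            customer_id=row[0],
            email=row[1].strip().lower(),      # legacy stores padded uppercase
            marketing_opt_in=(row[2] == "Y"),  # legacy 'Y'/'N' flag -> bool
        )
```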
Same monolith. Same team. Traffic shifts one capability at a time through the facade. Shadow mode catches the timezone bug and the rounding mismatch before any customer sees them. The ordering service goes live after two weeks of shadow validation. Rollback is a gateway config change. No cutover window. No coin flip.

One last decision check: if the target system is architecturally similar to the source (the same monolith on different hosting), just lift and shift. If you’re modernizing, the strangler fig payoff scales with how much architecture you’re actually fixing. A legacy modernization assessment tells you which one you’re doing before you commit.