Replacing Legacy Systems Without Stopping Them

Metasphere Engineering · 11 min read

Your platform has been running on the same monolith for years. Millions of lines of code. Database tables with foreign keys that cross every domain boundary. Leadership wants it migrated to cloud-native microservices. The engineering team has been told to “just rewrite it.”

Demolish the building. Build a new one. Move everyone in over the weekend.

Six months into the rewrite, the new system covers 40% of the monolith’s functionality. The original system has had three hotfix deployments that nobody applied to the new codebase. Feature parity is a target that moves every time you look at it. The old building keeps getting renovated while the new one is under construction. And the business won’t accept a 48-hour cutover window because peak season is approaching. You know how this story ends. Everyone does. The rewrite gets quietly shelved and the monolith wins by default. (The monolith always wins by default.)

Key takeaways
  • Big-bang rewrites fail because feature parity is a moving target. The monolith ships hotfixes nobody applies to the new codebase. The gap widens every sprint until the rewrite gets shelved.
  • Strangler fig routes traffic incrementally. One URL path at a time. The vine growing around the tree. The monolith doesn’t know it’s being replaced. Rollback is a config change.
  • Database decoupling is the hardest part, not application code. Shared foreign keys across domains create coupling that defeats the purpose of extraction.
  • Shadow mode testing catches bugs no synthetic test can find. Run both systems in parallel for 2-4 weeks before committing production traffic to the new one.
  • Extract the highest-value, lowest-coupling domain first. Not the easiest. Not the most complex. The one that delivers measurable business value with the fewest cross-domain dependencies.

The cruel irony: the systems that need migration most urgently (highest traffic, most integrations, most critical) are exactly the ones that cannot tolerate a cutover window. And when a big-bang cutover fails, rollback means restoring from backup and accepting lost transactions. A coin flip disguised as a migration strategy.

Martin Fowler documented the strangler fig pattern as the safest migration approach. Each step individually reversible. Legacy keeps running throughout. Effective cloud migration is built around exactly this approach.

Prerequisites
  1. API gateway or reverse proxy deployed in front of the legacy system (the future facade)
  2. Domain boundaries identified and documented (which code owns which data)
  3. Legacy database schema mapped with cross-domain foreign key dependencies cataloged
  4. CDC-capable database (PostgreSQL WAL or MySQL binlog accessible)
  5. Monitoring baseline established for legacy system latency, error rates, and throughput per endpoint

The Facade and Incremental Routing

The facade layer is an API gateway sitting in front of everything, making routing decisions. Initially, every request routes to the legacy system. The facade is transparent. Users see no difference. Zero traffic to the new system. Zero risk. Day one should be boring. Boring is exactly what you want.

As capabilities are built on the new cloud-native platform, the facade’s routing configuration shifts specific request types to the new system. Start with the simplest, lowest-risk capability: a read-only endpoint with no write side effects. Something like a product catalog lookup or an account balance query. Validate the new system’s response against the legacy system’s response before committing real traffic. Pick the most boring endpoint you have. Migrate that first.
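
In code, the routing table is small. A minimal sketch of the idea in Python, with hypothetical backend URLs; a real deployment would express the same table as API gateway configuration rather than application code:

```python
# Hypothetical routing table: path prefix -> backend.
# Anything not explicitly migrated falls through to the monolith.
ROUTES = {
    "/catalog": "https://catalog.new.internal",  # fully migrated
    "/users": "https://users.new.internal",      # fully migrated
}
LEGACY = "https://monolith.internal"

def resolve_backend(path: str) -> str:
    """Route by prefix; default to the legacy system."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend
    return LEGACY  # day one: every request lands here

# Rollback for a migrated capability is deleting one entry from ROUTES.
assert resolve_backend("/catalog/items/42") == "https://catalog.new.internal"
assert resolve_backend("/billing/invoices/7") == LEGACY  # still on the monolith
```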

[Figure: Strangler fig routing. An API facade routes by domain and migration status: the user service is fully migrated (100% of its traffic), the catalog service is validating in shadow mode, and billing and orders still hit the shrinking monolith. The monolith shrinks one domain at a time. No big bang. No deadline pressure.]

[Figure: Four-phase migration timeline. The facade starts with 100% of traffic on the monolith, then routes 30%, then 70%, to extracted services (users, orders, billing, auth, search), until only a tiny legacy stub remains at 100% migrated.]

Shadow Mode: The Migration Safety Net

Shadow testing is the single most effective risk reduction technique in any migration. Also the one teams most frequently skip “to save time.” That shortcut always costs more than it saves.

In shadow mode, the facade sends each request to both systems at the same time. The legacy system’s response is served to the user. The new system’s response is compared against it automatically. Divergences get logged and alerted on. Users are never affected because the new system’s responses are never served. Zero risk, all the learning.
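
A minimal sketch of the dual dispatch, assuming Python and the requests library; the service URLs are illustrative, and a production facade would issue the shadow call off the request path (queue or background task) so it adds no user-facing latency:

```python
import logging

import requests  # any HTTP client works; requests keeps the sketch short

LEGACY = "https://monolith.internal"
SHADOW = "https://catalog.new.internal"  # hypothetical new service
log = logging.getLogger("shadow")

def handle_get(path: str, params: dict) -> requests.Response:
    # The legacy response is always what the user receives.
    primary = requests.get(f"{LEGACY}{path}", params=params, timeout=2)
    try:
        # Shadow call: compared and logged, never served.
        shadow = requests.get(f"{SHADOW}{path}", params=params, timeout=2)
        if (shadow.status_code, shadow.text) != (primary.status_code, primary.text):
            log.warning("divergence on %s: legacy=%s new=%s",
                        path, primary.status_code, shadow.status_code)
    except Exception:
        log.exception("shadow call failed for %s", path)  # never affects the user
    return primary
```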

[Figure: Shadow mode. The facade sends each request to both systems simultaneously. The legacy response is served to the user; the new system’s response is captured and compared, with differences logged. When the match rate holds at 99.9% over two weeks, traffic shifts to the new system, legacy becomes the shadow, and then legacy is decommissioned.]

What shadow testing catches that no other approach can: differences in behavior under real production data patterns, race conditions that only appear at production concurrency, edge cases in data that no test fixture anticipated, and performance characteristics under real load distributions. Shadow testing regularly catches timezone handling differences, floating-point rounding mismatches, and character encoding edge cases that would have corrupted customer data after cutover. Each is a 15-minute fix found in shadow mode. Each is a multi-day incident found in production.
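
These show up in the comparator, which has to canonicalize representation-only differences (key order, timezone offsets that denote the same instant, sub-cent float noise) so that genuine divergences stand out. A sketch of the idea in Python, with illustrative tolerance rules:

```python
from datetime import datetime, timezone

def normalize(value):
    """Canonicalize representation-only differences before comparing."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, float):
        return round(value, 2)  # tolerate sub-cent rounding noise
    if isinstance(value, str):
        try:
            # Same instant, different offset: not a divergence.
            return datetime.fromisoformat(value).astimezone(timezone.utc).isoformat()
        except ValueError:
            return value
    return value

# Equal instants in different zones normalize to the same value; a real
# timezone bug (wrong instant) still shows up as a divergence afterward.
assert normalize("2024-03-01T09:00:00+01:00") == normalize("2024-03-01T08:00:00+00:00")
```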

Two weeks of shadow testing against live traffic is worth more than two months of testing against synthetic data.

Data Synchronization During Transition

Database consistency between both systems during parallel running is the hardest technical problem in a strangler fig migration. Every team underestimates it. Left unmanaged, data drift starts producing customer-visible errors within the first two weeks, errors harder to debug than any cutover failure would have been. A customer updates their address in the new system. The legacy system still has the old address. An invoice goes to the wrong place.

The feature parity race: the losing game of trying to match the monolith’s functionality while the monolith continues shipping features. By the time the new system covers 80% of the monolith, the monolith has added another 15% of new functionality. The target keeps moving. Strangler fig avoids this race by replacing incrementally rather than rewriting in parallel.

Simple principle. Hard to enforce. Each data entity has a single system of record at any given time. For the customer profile entity, either the legacy system is authoritative or the new system is. Never both at once. When migrating the customer profile capability, writes go to the new system and replicate to the legacy system. The legacy system reads from its local copy but doesn’t write to it.
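
One way to keep the rule enforceable rather than aspirational is to make the authority mapping explicit and consult it on every write path. A minimal sketch, with hypothetical domain names:

```python
# Single system of record per data domain, per migration phase.
AUTHORITY = {
    "customer_profile": "new",  # migrated: all writes go to the new system
    "orders": "legacy",
    "billing": "legacy",
}

def write_target(domain: str) -> str:
    try:
        return AUTHORITY[domain]
    except KeyError:
        # Fail loudly: an undeclared domain must never be double-written.
        raise ValueError(f"no system of record declared for {domain!r}")

assert write_target("customer_profile") == "new"
assert write_target("orders") == "legacy"
```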

Change Data Capture from the legacy database enables this replication without modifying the legacy application. Debezium on top of PostgreSQL’s WAL or MySQL’s binlog streams every change to Kafka, where it gets consumed and applied to the new system’s database. The reverse replication (new system back to legacy) is messier. Usually a dedicated sync service that handles conflicts, applies data model transformations, and fails loudly when it hits data it can’t reconcile. The “fail loudly” part matters most. Silent data divergence kills migrations.
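
A sketch of the consuming side, assuming Debezium is already streaming the legacy customers table to Kafka, using the kafka-python client; the topic name follows Debezium’s default server.schema.table convention, and the two apply functions are hypothetical stand-ins for the real apply logic:

```python
import json

from kafka import KafkaConsumer  # kafka-python

def upsert_into_new_system(row: dict) -> None:
    ...  # hypothetical: map legacy columns to the new model and upsert

def delete_from_new_system(row: dict) -> None:
    ...  # hypothetical: remove the corresponding record

# Debezium publishes one topic per table: <server>.<schema>.<table>.
consumer = KafkaConsumer(
    "legacy.public.customers",
    bootstrap_servers="kafka:9092",
    group_id="legacy-sync",
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
    enable_auto_commit=False,
)

for message in consumer:
    if message.value is None:
        continue  # tombstone that follows a delete
    event = message.value["payload"]  # Debezium envelope: op, before, after
    if event["op"] in ("c", "u", "r"):  # create, update, snapshot read
        upsert_into_new_system(event["after"])
    elif event["op"] == "d":
        delete_from_new_system(event["before"])
    else:
        # Fail loudly: silent divergence kills migrations.
        raise RuntimeError(f"unhandled change event op: {event['op']}")
    consumer.commit()
```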

The migration sequence follows the data dependency graph. Migrate domains whose data has no dependencies on domains still in the legacy system first. Migrate write-heavy domains last, after all downstream read consumers have been migrated. Data engineering pipelines are the backbone that makes this sequencing work.
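
That sequencing is a topological sort over the dependency graph, which the Python standard library does directly. An illustrative set of domains:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# domain -> domains whose data it depends on (illustrative)
deps = {
    "catalog": set(),
    "users": set(),
    "orders": {"users", "catalog"},
    "billing": {"orders", "users"},  # write-heavy, downstream of everything
}

# Dependencies come out before dependents: leaf domains migrate first,
# write-heavy domains last, matching the sequencing rule above.
print(list(TopologicalSorter(deps).static_order()))
# -> ['catalog', 'users', 'orders', 'billing'] (or another valid order)
```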

| Migration Phase | Data Authority | Replication Direction | Risk Level |
| --- | --- | --- | --- |
| Shadow mode | Legacy (writes and reads) | Legacy to new (read-only sync) | Minimal |
| Read migration | Legacy (writes), new (reads) | Legacy to new (continuous CDC) | Low |
| Write migration | New (writes and reads) | New to legacy (backward compat) | Medium |
| Decommission | New only | None (legacy data archived) | Low |

The Decommission Problem

Strangler fig migrations that never finish are worse than big-bang failures. This rarely gets discussed, but it’s the most common outcome. At least a failed cutover surfaces the problem immediately. A stalled migration leaves two live systems with a permanent facade. Two codebases to maintain. Two sets of infrastructure to patch. None of the simplification anyone promised. You didn’t migrate. You doubled your operational burden and called it progress.

Anti-pattern

Don’t: Let the facade accumulate business logic. Once teams start building features against the facade API instead of the new system, the facade becomes a permanent third system. Now you maintain three codebases instead of migrating from one to another.

Do: Keep the facade strictly to routing and protocol translation. Zero business logic. Every feature request goes to the new system. Explicit decommission gates with 30-day zero-traffic validation windows for each migrated capability.

Actually finishing requires organizational will you establish at the start, not negotiate at the end. Decommission milestones need to exist before the first line of migration code. Each phase has an explicit gate: zero traffic routed to the legacy system for this capability for 30 consecutive days, no data dependencies from remaining legacy capabilities on this data domain. Without hard gates, timelines don’t slip. They evaporate.
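
A gate phrased this way is checkable by a script rather than argued in a meeting. A sketch, assuming a hypothetical metrics query that returns daily legacy request counts per capability:

```python
from datetime import date, timedelta

GATE_DAYS = 30  # consecutive days of zero legacy traffic required

def daily_legacy_requests(capability: str, day: date) -> int:
    ...  # hypothetical: query the facade's metrics store for that day

def decommission_gate_passed(capability: str, today: date) -> bool:
    """True only after GATE_DAYS consecutive days of zero legacy traffic."""
    return all(
        daily_legacy_requests(capability, today - timedelta(days=n)) == 0
        for n in range(1, GATE_DAYS + 1)
    )
```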

Cloud-native architecture ROI is realized only when the legacy system is actually decommissioned. While both systems run, you pay for both, maintain both, and patch both.

[Figure: Decommission timeline with validation gates. Phase 0: facade deployed, 0% migrated. Phase 1: read-only shadow mode, two weeks of validation. Phase 2: user service live with a 30-day validation window, then a gate check for zero legacy traffic; if it fails, keep validating, and if it passes, remove the capability from the legacy system. Repeat with the next service, same gates, until legacy is retired.]

When Lift-and-Shift Is the Better Answer

Not every migration needs the strangler fig. The complexity is justified when the legacy system has real architectural problems worth fixing: tight coupling, scaling limits, patterns blocking business progress. If the system is well-designed and just needs cloud hosting, moving to cloud infrastructure and improving architecture incrementally is more pragmatic.

| Factor | Strangler Fig | Lift-and-Shift |
| --- | --- | --- |
| Codebase | Large monolith with architectural debt | Well-structured, needs cloud hosting |
| Goal | Modernize architecture + migrate | Migrate infrastructure only |
| Timeline | Months to years (incremental) | Weeks to months |
| Risk per step | Low (each step reversible) | Moderate (bigger blast radius per move) |
| When it wins | Architecture is the bottleneck | Infrastructure is the bottleneck |

What the Industry Gets Wrong About Strangler Fig Migration

“Start with the easiest domain to extract.” Start with the highest-value domain. Easy extractions prove the pattern but don’t prove the value. Extracting the user profile service is technically simple and delivers zero business impact. Extracting the ordering domain is harder and immediately enables independent deployment for revenue-critical features.

“Rewrite the database layer while extracting services.” Separate the concerns. Extract the service first with a shared database and an anti-corruption layer for data access. Decouple the database second. Attempting both at the same time doubles the risk and the timeline. One migration per dimension.
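
The anti-corruption layer is what lets the extracted service read the shared database without importing the legacy schema’s quirks into its own domain model. A minimal sketch, with an illustrative legacy row shape:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    """The new service's domain model, independent of the legacy schema."""
    customer_id: str
    email: str
    is_active: bool

def from_legacy_row(row: dict) -> Customer:
    """Anti-corruption layer: translate legacy quirks at the boundary
    so they never leak into the new service's code."""
    return Customer(
        customer_id=str(row["CUST_NO"]),          # legacy integer key -> id
        email=row["EMAIL_ADDR"].strip().lower(),  # legacy stores raw input
        is_active=row["STATUS_CD"] == "A",        # legacy status codes
    )

print(from_legacy_row({"CUST_NO": 42, "EMAIL_ADDR": " Bob@X.COM ", "STATUS_CD": "A"}))
```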

Our take: the database is the real migration bottleneck, not the application code. Shared foreign keys, cross-domain joins, and transaction boundaries that span multiple domains are what make the remaining extractions hard after the first few easy ones are done. Plan the database decoupling strategy before the first extraction, even if you don’t execute it until the third or fourth. Skip this and you hit a wall around the fourth or fifth extraction, when every remaining domain has foreign key dependencies on something not yet migrated.

Same monolith. Same team. Traffic shifts one capability at a time through the facade. Shadow mode catches the timezone bug and the rounding mismatch before any customer sees them. The ordering service goes live after two weeks of shadow validation. Rollback is a gateway config change. No cutover window. No coin flip.

If the target is the same architecture on different hosting, just lift and shift. If you’re modernizing, the strangler fig payoff scales with how much architecture you’re actually fixing. A legacy modernization assessment tells you which one you’re doing before you commit.

Migrate Your Legacy System Without the Cutover Risk

Big-bang migrations fail predictably: scope grows, timelines slip, the cutover window arrives before the system is ready. Strangler fig with facade routing, CDC data sync, and explicit decommission plans reduces migration risk to manageable, reversible steps.

Frequently Asked Questions

What is the strangler fig pattern and where does the name come from?

The strangler fig is a tropical tree that germinates in a host tree’s canopy, sends roots to the soil, and gradually surrounds and replaces the host. Martin Fowler applied the metaphor to software migration: a new system grows around the old one, taking over capabilities incrementally, until the old system is fully replaced. The legacy system keeps running throughout, so every migration step is individually reversible and no single step carries catastrophic risk.

What is the facade layer in a strangler fig migration?

The facade sits in front of both the legacy and new system, routing requests between them based on configuration. Initially all traffic routes to legacy. As capabilities are built on the new system, the facade routes specific request types to the new system. It also handles protocol translation and data format conversion. If the new system has a problem, rerouting back to legacy is a configuration change, not a redeployment.

How do you handle data synchronization during parallel running?

Define a single system of record per data domain. Route all writes for that domain to one system and replicate to the other. Migrate read traffic before write traffic since reads are safer to validate and simpler to roll back. Bidirectional sync is operationally painful. The standard approach avoids true bidirectional writes by picking one system as the authority for each domain at each migration phase.

How do you prevent the facade from becoming permanent?

The most common strangler fig failure mode is the facade accumulating business logic until it becomes a permanent layer. Teams that let this happen end up paying to maintain three systems instead of two, and the ongoing ops cost dwarfs the original migration budget. Preventing it needs strict routing-only scope, explicit decommission milestones, and executive commitment to a sunset date. If teams start building features against the facade API, the migration has stalled.

When is lift-and-shift better than the strangler fig?

Lift-and-shift works better when the legacy system is well-architected but needs cloud hosting, the timeline is tight, and the system is small. Systems under 100K lines of code with fewer than 20 external integrations are often safe candidates. If the migration is also a modernization effort that breaks apart a monolith or adopts managed services, the strangler fig payoff scales with how much architecture you’re actually fixing.