Strangler Fig Pattern: De-Risk Legacy Cloud Migration
Your platform has been running on the same monolith for years. Millions of lines of code. Database tables with foreign keys that cross every domain boundary. Leadership wants it migrated to cloud-native microservices. Your engineering team has been told to “just rewrite it.” Six months into the rewrite, the new system covers 40% of the monolith’s functionality. The original system has had three hotfix deployments that nobody remembered to apply to the new codebase. Feature parity is a moving target. And the business will not accept the proposed 48-hour cutover window because peak season is approaching. You know how this story ends.
Big-bang migrations fail in predictable, avoidable ways. The systems that need migration most urgently (the ones running the business’s most critical processes, with the most integrations and the highest traffic) are exactly the ones that cannot tolerate a 48-hour outage window. If something goes wrong during that window, rollback means restoring from backup and accepting that every transaction processed since the cutover started is at risk. That is not a migration strategy. That is a bet.
The strangler fig pattern addresses this by making migration a continuous, incremental process where each step is individually reversible and the legacy system keeps running safely throughout. Effective cloud migration and modernization is built around this incremental approach.
The Mechanics
The facade layer is the enabling technology. Think of it as an API gateway that sits in front of everything and makes routing decisions. Initially, every request routes to the legacy system. The facade is transparent. Users see no difference. The new system receives zero traffic and zero risk. Day one is boring, and boring is exactly what you want.
As capabilities are built on the new cloud-native system, the facade’s routing configuration is updated to send specific request types to the new system. The first migration is always the simplest, lowest-risk capability: a read-only endpoint with no write side-effects. Something like a product catalog lookup or account balance query where the new system’s response can be validated against the legacy system’s response before any real traffic is committed. Pick the most boring endpoint you have. Migrate that first.
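The routing mechanics can be sketched as a per-capability table that the team flips one entry at a time. This is a minimal illustration, not a real gateway configuration; the capability names and backend labels are hypothetical.

```python
# Hypothetical facade routing table: every capability starts on the legacy
# system, and entries are flipped individually as capabilities are validated
# on the new platform. Capability names here are illustrative.

LEGACY = "legacy"
NEW = "new"

routing_table = {
    "catalog.lookup": NEW,     # first migration: read-only, no side effects
    "billing.charge": LEGACY,  # write-heavy, migrates last
}

def route(capability: str) -> str:
    """Return the backend for a capability; default to legacy for safety."""
    return routing_table.get(capability, LEGACY)
```

Defaulting unknown capabilities to the legacy system is the important design choice: a request the facade has never seen must land on the system known to handle it.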
Shadow mode is the most underused technique in migration engineering, and skipping it is one of the most expensive shortcuts you can take. Run both systems in parallel for a capability. Send production requests to both. Compare their responses automatically. Alert on divergences. Do this for 2-4 weeks before committing any production traffic to the new system. The divergences you find during shadow mode are the bugs that would have surfaced after cutover, under full production load, at 3 AM. Shadow testing routinely catches timezone handling differences, floating-point rounding mismatches, and character encoding edge cases that would have corrupted customer data in production. Each of those is a 15-minute fix when found in shadow mode and a multi-day incident when found after cutover.
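A shadow comparator needs to normalize away known-benign representation differences (the rounding and timezone classes mentioned above) before flagging a divergence, or it drowns in noise. The following is a minimal sketch under assumed response shapes; the six-digit float tolerance and UTC normalization are illustrative choices, not prescriptions.

```python
# Hypothetical shadow-mode comparison: the legacy response is served to the
# user, the new system's response is only compared. Tolerances are assumptions.
from datetime import datetime, timezone

def normalize(value):
    """Coerce known-benign representation differences before comparing."""
    if isinstance(value, float):
        return round(value, 6)                 # absorb rounding noise
    if isinstance(value, datetime):
        return value.astimezone(timezone.utc)  # absorb timezone encoding
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    return value

def compare_shadow(legacy_resp: dict, new_resp: dict) -> list:
    """Return field names that genuinely diverge; alert on any non-empty result."""
    diverging = []
    for key in legacy_resp.keys() | new_resp.keys():
        if normalize(legacy_resp.get(key)) != normalize(new_resp.get(key)):
            diverging.append(key)
    return sorted(diverging)
```

Every divergence the comparator surfaces is either a normalization rule you add (a benign difference, now documented) or a bug you fix before it ever serves traffic.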
The facade enables incremental migration by making routing decisions per capability. As each service is built and validated on the new platform, the facade shifts traffic without requiring a single cutover window. Shadow mode testing against live traffic catches the subtle data handling differences that synthetic tests miss.
Data Synchronization During Transition
The hardest technical problem in a strangler fig migration is maintaining data consistency between the legacy system and the new system while both handle production traffic. Every team underestimates this. Teams typically discover within the first two weeks of parallel running that data drift between systems produces customer-visible errors that are harder to debug than any cutover failure would have been. This is where migrations go to die.
The key principle is simple but enforcing it is hard: each data entity has a single system of record at any given time. For the customer profile entity, either the legacy system is authoritative or the new system is. Never both simultaneously. When migrating the customer profile capability, writes go to the new system and replicate to the legacy system. The legacy system reads from its local copy but does not write to it. The new system is the write authority for that domain. No ambiguity.
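One way to make the single-system-of-record rule enforceable rather than aspirational is a shared ownership map that every write path must consult. This sketch assumes illustrative entity names; in practice the map would live in shared configuration.

```python
# Hypothetical enforcement of one write authority per entity. Entity names
# and the ownership map are illustrative assumptions.

system_of_record = {
    "customer_profile": "new",   # migrated: the new system owns writes
    "order": "legacy",           # not yet migrated
}

def assert_write_allowed(entity: str, writing_system: str) -> None:
    """Reject any write issued by a system that is not the entity's owner."""
    owner = system_of_record.get(entity)
    if owner is None:
        raise ValueError(f"no system of record declared for {entity!r}")
    if writing_system != owner:
        raise PermissionError(
            f"{writing_system} may not write {entity!r}; owner is {owner}"
        )
```

Raising on an undeclared entity, rather than guessing, is deliberate: ambiguity about ownership is exactly the failure mode this rule exists to prevent.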
Change Data Capture from the legacy database enables this replication without modifying the legacy application. Debezium on top of PostgreSQL’s WAL or MySQL’s binlog streams every change to Kafka, where it is consumed and applied to the new system’s database. The reverse replication (new system back to legacy) is more complex and usually implemented with a dedicated synchronization service that handles conflicts, applies data model transformations, and fails loudly when it encounters data it cannot reconcile. That “fail loudly” part is critical. Silent data divergence is what kills migrations. Every single time.
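The fail-loudly behavior of the synchronization service can be sketched as a small apply function over Debezium-style change events (`op` of `c`/`u`/`d` with `before`/`after` payloads). The legacy column name and the target model shape here are assumptions for illustration; the point is that anything unreconcilable raises instead of being skipped.

```python
# Minimal sketch of applying one Debezium-style change event to the target
# store (an in-memory dict standing in for the new system's database).
# The "legacy_name" column and target model are hypothetical.

def apply_change(event: dict, store: dict) -> None:
    """Apply a change event, failing loudly on anything unreconcilable."""
    op, before, after = event.get("op"), event.get("before"), event.get("after")
    if op in ("c", "u"):                         # create / update -> upsert
        if after is None or "id" not in after:
            raise ValueError(f"unreconcilable event, refusing to continue: {event}")
        # Data model transformation: legacy column -> new model field.
        store[after["id"]] = {"name": after["legacy_name"].strip()}
    elif op == "d":                              # delete
        store.pop(before["id"], None)
    else:
        raise ValueError(f"unknown op {op!r}: sync must fail loudly, not skip")
```

A production version would consume from Kafka, batch writes, and dead-letter with alerts rather than crash, but the invariant is the same: no silent skips.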
The migration sequence follows the data dependency graph. Migrate domains whose data has no dependencies on domains still in the legacy system first. Migrate write-heavy domains last, after all downstream read consumers have been migrated to read from the new system. The CDC streams and synchronization services are the infrastructure backbone that makes this sequencing possible.
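Ordering domains by their data dependencies is a topological sort. This sketch uses Python's standard-library `graphlib`; the domain names and edges are hypothetical, with an entry `A: {B}` meaning domain A's data depends on domain B, so B must migrate first.

```python
# Sequencing the migration by the data dependency graph: dependencies come
# first, write-heavy domains with many dependents land last. Domains and
# edges are illustrative assumptions.
from graphlib import TopologicalSorter

depends_on = {
    "customers": set(),               # no legacy data dependencies
    "catalog": set(),
    "orders": {"customers", "catalog"},
    "billing": {"orders"},            # write-heavy, migrates last
}

migration_order = list(TopologicalSorter(depends_on).static_order())
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which is worth knowing early: a cycle in the data dependency graph means two domains must migrate together as one unit.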
Shadow Mode: The Migration Safety Net
Shadow testing deserves its own section because it is the single most effective risk reduction technique in any migration. It is also the one teams most frequently skip “to save time.” That shortcut will cost you ten times what it saved.
In shadow mode, the facade sends each request to both the legacy system and the new system simultaneously. The legacy system’s response is served to the user. The new system’s response is compared against it automatically. Divergences are logged, categorized, and alerted on. The new system has zero impact on users because its responses are never served. Zero risk, maximum learning.
What shadow testing catches that no other testing approach can: differences in behavior under real production data patterns, race conditions that only appear at production concurrency, edge cases in data that no test fixture anticipated, and performance characteristics under real load distributions. Two weeks of shadow testing against live traffic is worth more than two months of testing against synthetic data. That is not an exaggeration.
The Decommission Problem
Strangler fig migrations that never finish are worse than big-bang failures. This is the outcome nobody talks about. At least a failed cutover surfaces the problem immediately. A stalled migration produces two live systems with a permanent facade, two codebases to maintain, two sets of infrastructure to patch, and none of the promised simplification. Industry data suggests 30-40% of enterprise migrations reach this stalled state where both systems remain in production indefinitely. You have not migrated. You have doubled your operational burden.
The discipline of actually finishing the migration requires organizational will established at the start, not negotiated at the end. Decommission milestones must be defined before the first line of migration code is written. Each phase has an explicit gate: zero traffic routed to the legacy system for this capability for 30 consecutive days, and no remaining data dependencies from legacy capabilities on this data domain. These create objective criteria rather than open-ended negotiation about when it is “safe enough” to turn something off. Without hard gates, migration timelines always drift.
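Because the gates are objective, they can be checked mechanically rather than argued about in meetings. A trivial sketch, with field names and the 30-day threshold taken from the criteria above:

```python
# Hypothetical decommission gate check: both objective criteria must hold.
# Field names are illustrative; in practice these come from traffic metrics
# and a dependency inventory.
from dataclasses import dataclass

@dataclass
class CapabilityStatus:
    days_with_zero_legacy_traffic: int
    legacy_data_dependents: int  # legacy capabilities still reading this domain

def decommission_gate(status: CapabilityStatus) -> bool:
    """True only when every gate criterion is met; no negotiation."""
    return (status.days_with_zero_legacy_traffic >= 30
            and status.legacy_data_dependents == 0)
```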
The cloud-native architecture ROI is realized only when the legacy system is actually decommissioned. While both systems run, you pay for both, maintain both, and patch both. Set decommission deadlines at the executive level, tied to infrastructure budget. This prevents the migration from becoming an indefinite parallel-running state where the “migration” has simply added a new system on top of the old one.
When Lift-and-Shift Is the Better Answer
Not every migration needs the strangler fig. Strangler fig complexity is justified when the legacy system has real architectural problems worth fixing during the migration: tight coupling, scaling limitations, outdated patterns that prevent the business from moving forward. If the legacy system is well-designed but simply needs cloud hosting, moving it to cloud infrastructure and addressing architecture incrementally afterward is often more pragmatic. Do not over-engineer the migration.
The decision test is simple: if the new cloud system will be architecturally similar to the legacy system (same monolith, same patterns, just running on AWS), the strangler fig overhead is hard to justify. Just lift and shift. If the migration is also a modernization that decomposes a monolith, adopts managed services, and enables horizontal scaling, the incremental approach pays dividends proportional to the architectural improvement. A thorough legacy modernization assessment before committing engineering resources is what prevents wasted effort in either direction. Measure twice, migrate once.