Database Migration Without Downtime

Metasphere Engineering · 10 min read

You complete what looks like a successful Oracle-to-Aurora migration. Application tier deployed cleanly. Data loaded. Smoke tests passed. Green across the board. You redirect production traffic and within 45 minutes hit connection exhaustion. Your ORM was configured to open a connection per request, which worked fine against on-premises Oracle with capacity for 2,000 connections but overwhelms Aurora’s limits at production volume. Three weeks of unplanned PgBouncer retrofitting follow, all because of a configuration nobody thought to load-test.

Key takeaways
  • Connection pooling causes more migration failures than schema conversion. Teams spend months on schema and an afternoon on pool sizing. Production teaches them the ratio was backward.
  • CDC (Change Data Capture) cuts cutover windows from hours to minutes. Stream changes from source to target in real time. Cut over when replication lag drops to seconds.
  • Row counts matching does not mean data is correct. A precision mismatch between Oracle NUMBER and PostgreSQL NUMERIC produces the right number of wrong rows.
  • Test with production-volume data, not staging subsets. Query plans change with table statistics. A migration that works at 10GB fails at 2TB.
  • Keep the rollback path live for 2-4 weeks after cutover. Bidirectional replication ensures the old database stays current while you validate the new one.

Debezium is the standard open-source CDC platform; AWS DMS covers managed migrations. Schema conversion absorbs months of effort while connection behavior gets an afternoon. That ratio needs to flip.

Assessment: Where Migrations Actually Break

Schema conversion tools (AWS SCT, ora2pg) automate DDL and do it reasonably well. Do not trust them for stored procedures. PL/SQL to PL/pgSQL translation needs human review for every non-trivial procedure because the tools translate syntax without understanding behavior. A rounding difference between Oracle NUMBER(10,2) and PostgreSQL NUMERIC hides until financial reconciliation fails weeks after cutover.

Prerequisites
  1. Full inventory of stored procedures, triggers, and views with engine-specific syntax
  2. Data type mapping validated for precision-sensitive columns (financial, scientific)
  3. Collation and character set differences documented between source and target
  4. Connection pool sizing calculated for target engine limits under production concurrency
  5. Rollback procedure documented and tested with the operations team

Infrastructure-as-code teams have an easier assessment because schemas are already version-controlled. But version-controlled schemas only help if the test suite exercises the edge cases. The migration that breaks is never the straightforward table copy. It is the view that references an Oracle-specific function, the trigger that depends on session-level variables, or the stored procedure with implicit type coercion that PostgreSQL handles differently.
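
For the inventory itself, a scan of the Oracle data dictionary catches most of these objects before conversion starts. A minimal sketch, assuming Python with the python-oracledb driver and read access to ALL_SOURCE and ALL_TAB_COLUMNS; the schema name, credentials, and pattern list are illustrative:

# Sketch: flag engine-specific constructs in stored code and NUMBER columns
# with implicit precision before running schema conversion tools.
import oracledb

# Constructs that rarely translate 1:1 to PL/pgSQL and need human review.
RISKY_PATTERNS = ["CONNECT BY", "DECODE(", "ROWNUM", "SYSDATE",
                  "AUTONOMOUS_TRANSACTION", "DBMS_"]

conn = oracledb.connect(user="auditor", password="...", dsn="prod-db:1521/ORCL")
cur = conn.cursor()

# 1. Stored procedures, packages, and triggers containing engine-specific
#    syntax (view definitions live in ALL_VIEWS and need a similar pass).
cur.execute("SELECT owner, name, type, line, text FROM all_source WHERE owner = :owner",
            owner="APP")
for owner, name, obj_type, line, text in cur:
    upper = (text or "").upper()
    for pattern in RISKY_PATTERNS:
        if pattern in upper:
            print(f"REVIEW {obj_type} {owner}.{name} line {line}: uses {pattern}")

# 2. NUMBER columns with no explicit precision: they map to unbounded NUMERIC
#    in PostgreSQL, which is where rounding surprises tend to hide.
cur.execute("SELECT table_name, column_name FROM all_tab_columns "
            "WHERE owner = :owner AND data_type = 'NUMBER' AND data_precision IS NULL",
            owner="APP")
for table_name, column_name in cur:
    print(f"CHECK precision mapping for {table_name}.{column_name}")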

The CDC Migration Pattern

Migration Strategy | Cutover Downtime | Rollback Capability | Data Risk | Best For
CDC + incremental sync | 5-15 minutes | Keep source running, revert traffic | Low (continuous validation) | Production databases with SLA
Dump and restore | Hours (scales with size) | Restore from backup (hours) | Medium (gap between dump and restore) | Dev/staging, small databases
Dual-write | Zero (gradual traffic shift) | Stop writing to new target | High (two-write consistency risk) | When zero downtime is mandatory
Blue-green database | Minutes (DNS swap) | Swap back | Low (both run in parallel) | When parallel operation is affordable

CDC keeps the target continuously synced while the source handles live traffic. At cutover: stop writes, apply the final CDC batch (typically under 30 seconds), redirect connections. Total window: 5-15 minutes.
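
Knowing when lag has actually dropped to seconds is the gate for opening that window. One common approach is a heartbeat table that the source updates every second and CDC replicates like any other table; reading the replicated copy on the target gives the current lag. A sketch, assuming Python with psycopg2 and a hypothetical cdc_heartbeat table:

# Sketch: gate the cutover on replication lag measured through a heartbeat row.
# Assumes cdc_heartbeat(id, updated_at) is updated every second on the source
# and replicated to the Aurora target by the CDC pipeline.
import time
import psycopg2

LAG_THRESHOLD_SECONDS = 2.0

target = psycopg2.connect(host="aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com",
                          port=5432, dbname="app", user="migrator", password="...")

def replication_lag_seconds(conn) -> float:
    with conn.cursor() as cur:
        cur.execute("SELECT EXTRACT(EPOCH FROM (now() - updated_at)) "
                    "FROM cdc_heartbeat WHERE id = 1")
        (lag,) = cur.fetchone()
        return float(lag)

# Poll until lag holds under the threshold; only then stop writes on the source
# and apply the final CDC batch.
while True:
    lag = replication_lag_seconds(target)
    print(f"replication lag: {lag:.1f}s")
    if lag < LAG_THRESHOLD_SECONDS:
        print("lag under threshold: safe to stop writes and apply the final batch")
        break
    time.sleep(5)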

[Diagram: CDC-Based Migration, From Replication to Cutover. Initial bulk load from the Oracle source to the Aurora target; continuous CDC replication with lag falling from 30s to under 1s; stop writes and apply the final CDC batch; redirect traffic. Total downtime: 5-15 minutes.]

Each phase has validation gates. Skipping application-level validation is the most common cutover-day failure, and the one with the worst blast radius.

[Diagram: Database Migration, Six Phases to Production. Assessment (schema analysis), Schema Convert (type mapping), Initial Load (bulk transfer), CDC Replication (continuous sync), Validate (shadow queries), Cutover (DNS flip). CDC keeps the source alive during migration; cutover is a DNS change, not a prayer.]

One critical timing detail: CDC must start before or during the initial load. If the load takes 6 hours and capture starts after, those 6 hours of changes get missed. Row counts match. Data doesn’t.
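
With AWS DMS, this is the difference between a full-load task and a full-load-and-cdc task, which starts capturing changes alongside the bulk copy. A rough boto3 sketch; the ARNs, task name, and table mappings are placeholders:

# Sketch: start CDC capture together with the bulk load so changes made during
# the initial copy are not lost. ARNs and mappings are placeholders.
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-app-schema",
        "object-locator": {"schema-name": "APP", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora-cutover",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:source-oracle",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:target-aurora",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:migration-instance",
    MigrationType="full-load-and-cdc",   # bulk load and continuous CDC in one task
    TableMappings=json.dumps(table_mappings),
)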

Data Integrity Validation

Row counts match. Every decimal is wrong. A type mismatch between Oracle NUMBER and PostgreSQL NUMERIC slips through row-count checks undetected: every row present, every value slightly off. Validation intensity must match the data’s risk level.

Data Risk Level | Examples | Validation Method | Confidence
Low | Lookup tables, static configuration, code reference data | Row count comparison + spot-check sample rows | Sufficient. Low blast radius if wrong
Medium | User profiles, product catalog, order history | Row count + column-level hash comparison on critical columns | High. Catches most corruption and truncation
High | Financial records, regulated PII, payment transactions | Shadow queries under production-representative load. Source vs target result diff | Maximum. Validates both data and query behavior
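
For the medium tier, the column-level hash comparison can be as simple as pulling the critical columns in the same order from both engines and comparing digests. A sketch, assuming Python with the oracledb and psycopg2 drivers; the table, columns, and single-slice query are illustrative, and a real run would chunk the full table by primary-key range:

# Sketch: column-level hash comparison on critical columns, source vs target.
import hashlib
from decimal import Decimal

import oracledb
import psycopg2

# Return Oracle NUMBER as Decimal, matching psycopg2's handling of NUMERIC.
oracledb.defaults.fetch_decimals = True

QUERY = ("SELECT id, amount, currency FROM orders "
         "ORDER BY id FETCH FIRST 10000 ROWS ONLY")

def normalize(value):
    # Decimal('10.50') and Decimal('10.5') must hash identically, otherwise a
    # harmless scale difference looks like corruption.
    if isinstance(value, Decimal):
        return format(value.normalize(), "f")
    return str(value)

def digest(cursor) -> str:
    h = hashlib.md5()
    for row in cursor:
        h.update("|".join(normalize(v) for v in row).encode())
        h.update(b"\n")
    return h.hexdigest()

source = oracledb.connect(user="auditor", password="...", dsn="prod-db:1521/ORCL")
target = psycopg2.connect(host="aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com",
                          dbname="app", user="migrator", password="...")

src_cur, tgt_cur = source.cursor(), target.cursor()
src_cur.execute(QUERY)
tgt_cur.execute(QUERY)

if digest(src_cur) != digest(tgt_cur):
    print("MISMATCH on orders(id, amount, currency): investigate before cutover")
else:
    print("orders slice matches")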

Shadow queries (running identical requests against source and target under production load) are the gold standard for high-value data. They also surface query performance gaps, planner differences, and connection behavior that staging never reveals. Cloud migration teams should run shadow validation for weeks, not days, before cutover.
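
A shadow harness does not need to be elaborate: issue the same query to both engines, diff the rows, and compare latency, since planner regressions usually show up as latency gaps first. A sketch reusing the src_cur and tgt_cur cursors opened in the previous example; the query and the bind-style mapping are illustrative:

# Sketch: shadow one production query against source and target, comparing
# both the result set and the latency.
import time

def shadow(cursor, sql, params):
    start = time.perf_counter()
    cursor.execute(sql, params)
    rows = cursor.fetchall()
    return rows, (time.perf_counter() - start) * 1000

SQL = "SELECT status, total FROM orders WHERE customer_id = %s ORDER BY created_at"

# Oracle uses :1-style binds where psycopg2 uses %s; a real harness maps both.
src_rows, src_ms = shadow(src_cur, SQL.replace("%s", ":1"), [42])
tgt_rows, tgt_ms = shadow(tgt_cur, SQL, [42])

if src_rows != tgt_rows:
    print("result mismatch: transformation, precision, or collation issue")
if tgt_ms > src_ms * 2:
    print(f"latency regression: {src_ms:.0f}ms -> {tgt_ms:.0f}ms (compare query plans)")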

Anti-pattern

Don’t: Validate with staging-sized data and call it good. A 10GB test database with 10 concurrent connections reveals nothing about how 2TB and 500 connections behave. Query plans change with table statistics. Connection pools exhaust at thresholds staging never reaches.

Do: Load production-volume data into the target. Replay production query patterns at production concurrency. The migration either works under real conditions or it does not. There is no middle ground worth betting on.

Connection Pooling Architecture

Data integrity is validated. The migration looks clean. Then you flip the switch and the database falls over. Not because of bad data. Because of math.

100 connections per pod, 100 pods = 10,000 connections. Aurora supports roughly 5,000. Dead in production. Perfectly fine in staging with 10 pods. Nobody did the multiplication.

# PgBouncer config - transaction mode pooling
[databases]
production = host=aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0          # Where PgBouncer accepts app connections
listen_port = 6432             # PgBouncer's default port
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt   # User/password list (adjust path to your deployment)
pool_mode = transaction        # Release connection after each transaction
max_client_conn = 10000        # Accept thousands of app connections
default_pool_size = 200        # Only 200 actual DB connections
reserve_pool_size = 20         # Emergency overflow
server_idle_timeout = 300      # Reclaim idle connections after 5min

PgBouncer multiplexes 10,000 application connections over 200 actual database connections. Test this under production load before cutover. There is no “fix it after” if connection exhaustion takes you down on cutover day.
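
From the application’s point of view nothing changes except where the connection string points. A sketch, assuming psycopg2 and PgBouncer listening on its default port 6432 at a hypothetical internal hostname:

# Sketch: the application connects to PgBouncer instead of Aurora directly.
# PgBouncer maps the logical "production" database from its [databases]
# section to the Aurora cluster behind it.
import psycopg2

conn = psycopg2.connect(
    host="pgbouncer.internal",   # the PgBouncer service, not the Aurora endpoint
    port=6432,                   # PgBouncer's default listen_port
    dbname="production",         # matches the [databases] entry above
    user="app",
    password="...",
)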

How the connection pool formula works

The formula connections = (core_count * 2) + effective_spindle_count comes from PostgreSQL tuning guides and represents the optimal pool size for a given instance. For cloud instances without spinning disks, the spindle count is effectively 1 (SSD). A db.r6g.2xlarge with 8 vCPUs gives an optimal pool of (8 * 2) + 1 = 17 connections per application instance. With 100 pods, that is 1,700 total connections, well within Aurora’s limits. Most teams skip this calculation and guess a round number like 100, which works until it catastrophically does not.
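
The arithmetic is worth writing down where the whole team can see it. A few lines, using the article’s example numbers:

# Sketch: pool sizing per the (core_count * 2) + effective_spindle_count rule,
# then the total the whole fleet opens against the database.
def optimal_pool_size(core_count: int, effective_spindles: int = 1) -> int:
    # SSD-backed cloud instances: treat the spindle count as 1.
    return core_count * 2 + effective_spindles

vcpus = 8                                   # db.r6g.2xlarge
pods = 100
per_pod = optimal_pool_size(vcpus)          # 17
fleet_total = per_pod * pods                # 1,700

aurora_limit = 5000                         # approximate limit for this instance class
status = "OK" if fleet_total < aurora_limit else "OVER LIMIT"
print(f"{per_pod} per pod x {pods} pods = {fleet_total} ({status} vs ~{aurora_limit})")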

Dimension | Direct ORM Connections | PgBouncer Transaction Mode
Setup | 100 pods x 100 connections each | 100 pods x 100 connections to PgBouncer
Database connections | 10,000 direct connections to Aurora | 200 multiplexed connections to Aurora
Aurora limit | db.r6g.2xlarge supports ~5,000 | Well within limit at 200
At production traffic | Connection exhaustion. Service unavailable | Stable. PgBouncer queues excess requests
Latency overhead | None (direct) | <1ms per transaction (routing overhead)
When it breaks | | Prepared statements require session mode (not transaction mode)

Rollback Planning That Actually Works

A rollback plan that requires restoring from backup is disaster recovery, not rollback. The distinction matters because disaster recovery has a multi-hour RTO. A genuine rollback requires three things: the source database still running and receiving reverse CDC, bidirectional replication keeping both databases current, and traffic routing switchable in seconds. Rehearse the procedure until the team completes it in under 10 minutes.
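
What “switchable in seconds” looks like depends on the routing layer. One common shape is a low-TTL DNS record that the application resolves instead of a hard-coded endpoint; a sketch using boto3 and Route 53, with the hosted zone and hostnames as placeholders:

# Sketch: rollback becomes one record change when the app resolves a low-TTL
# CNAME rather than a hard-coded database endpoint. Zone ID and hostnames are
# placeholders.
import boto3

route53 = boto3.client("route53")

def point_database_at(endpoint: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={
            "Comment": f"database cutover/rollback -> {endpoint}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "db.internal.example.com",
                    "Type": "CNAME",
                    "TTL": 30,   # low TTL so a change propagates in seconds
                    "ResourceRecords": [{"Value": endpoint}],
                },
            }],
        },
    )

# Cutover:
point_database_at("aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com")
# Rollback, rehearsed until it takes minutes:
# point_database_at("oracle-primary.onprem.example.com")

Existing pooled connections keep using the endpoint they already resolved until they are recycled, which is one more reason the rehearsal matters.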

Keep the source running with reverse CDC for at least two weeks post-cutover. Resilience engineering demands it. Tell finance it is insurance. The cost of running two databases for two weeks is trivial compared to the cost of an unrecoverable cutover failure.

[Diagram: Rollback Architecture, Source Stays Alive. After cutover the target handles all traffic while the source stays alive as a read-only standby. Reverse CDC streams every write on the target back to the source, keeping both databases synchronized during the validation window; if issues arise, one gateway config change routes traffic back. Never decommission the source until reverse CDC proves zero divergence for 30 days.]

The Connection Exhaustion Cliff: the production traffic volume at which connection management differences between the source and target database cause failures. On-premises databases tolerate thousands of direct connections. Cloud-managed databases have strict limits tied to instance size. The cliff is invisible in staging with 10 connections. It appears within minutes at production scale with 500+.

What the Industry Gets Wrong About Database Migration

“Schema conversion is the hard part.” Schema conversion tools handle the bulk of DDL automatically. Connection pooling, query plan differences, and transaction isolation behavior under real load are what actually cause migration failures on cutover day. The effort allocation is almost always backward: months on schema, an afternoon on the runtime behavior that actually kills you.

“Test in staging, deploy with confidence.” Staging validates syntax. Production validates behavior. Query plans, connection pool limits, and replication lag all behave differently at real volume. If your migration test doesn’t stress the target at production scale, the cutover itself becomes your first real test.

Our take: CDC with bidirectional replication is the only migration pattern that provides a safe rollback path for production databases. Dump-and-restore has no rollback once the cutover window closes. Dual-write introduces consistency risks that are deceptively hard to detect. CDC keeps both databases synchronized throughout the transition, and the cutover shrinks to minutes because the target is already current. Every other pattern trades safety for simplicity, and that trade is never worth it for production data.

That Oracle-to-Aurora migration? With shadow traffic testing, CDC replication, and connection pooling validated at production volume, the 45-minute failure never happens. The rollback switch stays untouched. The three-week retrofit stays hypothetical.

Migrate Your Database Without the Emergency Rollback

Database migrations that fail cost more than the ones never attempted. CDC replication for minimal-downtime cutover, integrity validation gates at every phase, connection pooling sized for the target engine, and rollback plans tested under production load before go-live.

Engineer Your Cutover

Frequently Asked Questions

What is CDC and why is it essential for minimal-downtime database migration?

Change Data Capture streams every transaction as it commits to the source database. During migration, CDC keeps the target in sync while the source stays live. At cutover, you stop writes, apply the final CDC batch, and redirect traffic. The cutover window is 5-15 minutes, not hours. Without CDC, cutover takes as long as a full data copy, which for terabyte databases means hours of downtime.

What database objects cause the most problems during cloud migration?

Stored procedures and triggers, especially when switching engines like Oracle to PostgreSQL. Cloud-managed databases work differently in procedural language support and system function behavior. Schema conversion tools handle DDL fine but struggle with PL/SQL to PL/pgSQL translation. Views with engine-specific syntax, collation differences, and partitioning variations are common pain points that only show up with thorough testing.

How do you validate data integrity after migration?

Row count comparison is the minimum but not enough. A matching count with a silent transformation error gives you the right number of wrong rows. Checksum comparison catches corruption. Shadow queries (running the same queries against source and target at the same time) catch transformation errors that checksums miss. For high-value transactional data, shadow queries under production-level load give the highest confidence.

What connection pooling changes does cloud migration require?

Cloud-managed databases have hard connection limits tied to instance size, and each idle connection consumes 5-10MB of memory. An ORM configured for 100 connections per pod across 100 pods opens 10,000 connections, exhausting the limit before request volume does. PgBouncer in transaction mode multiplexes thousands of application connections over 200-300 actual database connections. This architecture change must be tested under production load before cutover.

What is a real rollback plan for a database migration?

A real rollback plan requires the source database still running and intact, CDC still capturing changes from the target back to the source, and the ability to redirect traffic within minutes. A rollback plan that requires restoring from backup is disaster recovery with multi-hour RTO, not a rollback plan. Test the rollback procedure as thoroughly as you test the cutover itself.