Database Cloud Migration: CDC Replication and Zero-Downtime Cutover
You complete what looks like a successful Oracle-to-Aurora migration. Application tier deployed cleanly. Data loaded. Smoke tests passed. Green across the board. You redirect production traffic and within 45 minutes hit connection exhaustion. Your ORM was configured to open a connection per request, which worked fine against on-premises Oracle with a 2,000-connection capacity but overwhelms Aurora’s connection limits at production volume. Nobody tested connection behavior under real load. You roll back to Oracle within the hour and spend the next three weeks retrofitting PgBouncer. Three weeks of unplanned work because of a configuration nobody thought to test.
Database migrations fail at a higher rate than application migrations, and the failures are far more consequential. An application rollback is a few hours of lost productivity. A database migration failure during cutover means data inconsistency, extended downtime, and integrity issues that do not surface for days. And here is the part that should worry you: the things that cause the most failures are rarely the things teams spend the most time planning for. Schema conversion gets months of attention. Connection pooling gets an afternoon, if that.
The Assessment Phase Is Not Optional
Before you touch production, a thorough assessment of the source database must produce explicit answers to specific questions. Not a high-level summary. Not a PowerPoint. A line-item inventory.
What stored procedures, triggers, and user-defined functions exist, and what are their engine-specific dependencies? What character sets and collations are in use, and do they have exact equivalents on the target? Are there queries relying on implicit type coercion, engine-specific date handling, or optimizer hints that will behave differently on the target engine? You need answers to all of these before you write a single migration script.
Schema conversion tools like AWS SCT, pgloader, and ora2pg automate a meaningful portion of DDL conversion. They handle table structures, indexes, and basic views well. They are consistently incomplete or incorrect for stored procedure conversion. Do not trust them for this. PL/SQL to PL/pgSQL translation requires manual review of every procedure. The tool output for procedures is a starting point that needs a developer who understands both dialects to complete. Teams that trust the tool output for procedure conversion regularly discover data corruption in production weeks later, traced to something as small as a rounding-behavior difference between Oracle’s NUMBER and PostgreSQL’s NUMERIC in an edge case.
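The kind of edge case described above can be illustrated with Python’s decimal module. This is a sketch, not a statement of either engine’s exact semantics: the two rounding modes stand in for any pair of engines that round ties differently when reducing precision.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

# The same value reduced to two decimal places under two different
# tie-breaking rules -- a stand-in for an engine-level rounding
# difference that survives schema conversion unnoticed.
value = Decimal("123.445")

half_up = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
half_even = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

print(half_up)    # 123.45
print(half_even)  # 123.44

# Identical row counts, different stored values: invisible to a
# count-based check, visible only to a value-level comparison.
assert half_up != half_even
```

The divergence only appears on exact ties at the truncated digit, which is why it can sit undetected through smoke tests and surface weeks later in production.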
The assessment output should be an explicit list of every known incompatibility with a resolution plan for each. Teams with mature infrastructure-as-code practices have an easier assessment phase because their schemas are already version-controlled. The source of truth is in code, not inferred from the live database through reverse engineering.
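One minimal shape for that assessment output, sketched in Python. Every object name and issue below is illustrative, not taken from a real migration; the point is the gate at the end, which refuses any inventory item without a resolution plan.

```python
# Hypothetical assessment inventory: each known incompatibility is
# paired with an explicit, owned resolution. All names are made up.
incompatibilities = [
    {
        "object": "PKG_BILLING.CALC_PRORATION",
        "issue": "Oracle NUMBER rounding differs from target NUMERIC",
        "resolution": "manual PL/pgSQL rewrite + value-level validation",
    },
    {
        "object": "ORDERS.CREATED_AT",
        "issue": "implicit DATE-to-string coercion in reporting queries",
        "resolution": "explicit to_char() casts in converted queries",
    },
]

# The gate: nothing proceeds while any item lacks a resolution plan.
unresolved = [i["object"] for i in incompatibilities if not i.get("resolution")]
assert not unresolved, f"unresolved incompatibilities: {unresolved}"
```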
The CDC Migration Pattern
CDC-based migration is what makes the cutover window short enough to be safe. This is the pattern that works. After the initial bulk data load and replication lag stabilization, the target database is continuously synchronized with the source. The actual cutover is the time required to stop application writes, apply the final CDC batch (usually under 30 seconds of accumulated changes), and redirect traffic. Total: 5-15 minutes rather than the hours required for a full copy-and-switch approach.
The full migration follows a phased approach where each stage has explicit validation gates before proceeding. Do not rush through phases. Skipping application-level validation on the target is the single most common source of cutover-day failures.
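The phased flow with its validation gates can be sketched as a simple sequence where each phase runs only if its entry gate passes. The gate functions here are placeholders (assumed names, trivially true) standing in for real checks.

```python
# Sketch of the phased migration: a gate must pass before each phase.
# The lambdas are placeholders for real validation checks.
phases = [
    ("assessment",        lambda: True),  # inventory complete, resolutions assigned
    ("bulk_load",         lambda: True),  # row counts reconciled
    ("cdc_sync",          lambda: True),  # replication lag below threshold
    ("target_validation", lambda: True),  # checksums + application-level tests pass
    ("cutover",           lambda: True),  # rollback rehearsed and timed
]

completed = []
for name, gate in phases:
    if not gate():
        raise RuntimeError(f"gate failed before {name}; do not proceed")
    completed.append(name)
```

The structure makes the rule from the text mechanical: skipping target_validation is not a shortcut the process allows, it is a gate you would have to delete.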
CDC tooling choices: AWS Database Migration Service for managed AWS migrations, Debezium for open-source CDC from MySQL, PostgreSQL, Oracle, MongoDB, and SQL Server, and database-native tools like PostgreSQL’s logical replication. Validate the choice against your source engine, migration volume, and target platform during assessment. Not during cutover. You do not want surprises on cutover day.
One critical implementation detail that burns teams who do not know about it: CDC must capture changes that occur during the initial data load. If the initial load takes 6 hours, 6 hours of inserts, updates, and deletes are happening on the source that the initial load does not include. The CDC capture must start before or simultaneously with the initial load so those changes are captured and replayed on the target. CDC tools that start capture after the initial load completes produce an inconsistent target that looks correct in row counts but has stale data for every record that changed during the load window. The row counts match. The data does not. This is exactly the kind of bug that does not show up until production.
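The capture-ordering requirement can be shown with a toy simulation, using plain dicts as stand-ins for the source and target databases. Because capture starts before the snapshot is taken, writes that land mid-load are recorded and replayed, and the target converges.

```python
# Simulate writes landing on the source while a snapshot copy is in flight.
source = {1: "alice", 2: "bob", 3: "carol"}

# 1. Start CDC capture BEFORE the bulk load begins.
cdc_log = []

def apply_write(db, key, value, capture=None):
    db[key] = value
    if capture is not None:
        capture.append((key, value))

# 2. Take the snapshot (the "6-hour bulk load").
snapshot = dict(source)

# 3. Writes land on the source while the load is still running.
apply_write(source, 2, "bob-updated", capture=cdc_log)
apply_write(source, 4, "dave", capture=cdc_log)

# 4. Load finishes; the target holds stale data for key 2 and is
#    missing key 4. The stale update alone would pass a row-count check.
target = snapshot

# 5. Replay the captured changes on the target to converge.
for key, value in cdc_log:
    target[key] = value

assert target == source
```

Starting capture at step 4 instead of step 1 is the failure mode from the text: the log would be empty, and the target would silently keep `bob` for key 2.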
Data Integrity Validation
Row count comparison confirms you transferred the right number of rows. It does not confirm you transferred the right data. Do not confuse these two things. A transformation error that corrupts values while preserving row counts is invisible to row count checks. In one migration we reviewed, row counts matched perfectly and every decimal value was wrong because of a precision mismatch between Oracle’s NUMBER(10,2) and the PostgreSQL target type. The right number of wrong rows.
The validation strategy must match each data set’s risk profile.
For reference data and lookup tables, sampled spot-checks are often sufficient. For transactional data (financial records, order histories, user activity logs), checksum comparison of critical columns provides stronger assurance. For the highest-value data in regulated environments, shadow queries are the gold standard. Run identical queries against both source and target simultaneously under production-representative load and compare results field by field. Yes, this is expensive to set up. It is far cheaper than finding corruption after cutover.
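A checksum comparison of critical columns can be sketched as follows. The column names are illustrative, and in practice the per-row digests would be computed in SQL on each engine; the point here is that equal row counts and unequal checksums are independent signals.

```python
import hashlib

def table_checksum(rows):
    """Order-independent checksum over the critical columns of each row."""
    digest = 0
    for row in rows:
        payload = "|".join(str(row[c]) for c in ("id", "amount", "currency"))
        digest ^= int.from_bytes(hashlib.sha256(payload.encode()).digest()[:8], "big")
    return digest

source_rows = [
    {"id": 1, "amount": "19.99", "currency": "USD"},
    {"id": 2, "amount": "123.45", "currency": "USD"},
]
# Same row count, one corrupted value: the precision-mismatch scenario.
target_rows = [
    {"id": 1, "amount": "19.99", "currency": "USD"},
    {"id": 2, "amount": "123.44", "currency": "USD"},
]

assert len(source_rows) == len(target_rows)                        # counts match
assert table_checksum(source_rows) != table_checksum(target_rows)  # data does not
```

XOR-folding the per-row digests makes the table checksum independent of row order, which matters because the two engines will not return rows in the same order without an explicit sort.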
The validation phase also surfaces secondary issues that would otherwise appear after cutover: query performance differences from missing indexes or uncollected statistics on the target, query planner differences between engine versions, and connection pooling behavior under load. Discovering these under controlled conditions is the goal. Discovering them during the first production morning is the expensive alternative, and it happens more often than anyone admits. Teams running this as part of a broader cloud migration program should validate connection behavior as early as possible.
Connection Pooling Architecture
Data engineering teams migrating from on-premises Oracle or SQL Server to cloud-managed PostgreSQL consistently underestimate the connection model change. This is the issue that catches almost every team. On-premises databases are often configured for hundreds or thousands of direct connections. Cloud-managed databases have connection limits tied to instance size, and each idle connection consumes 5-10MB of shared memory.
Here is the math that breaks things. An ORM configured to open 100 connections per application instance across 100 Kubernetes pods opens 10,000 connections. Aurora PostgreSQL on a db.r6g.2xlarge supports around 5,000. You have exhausted the connection limit before your application handles its first request at scale. The application ran fine in dev with 3 pods. It ran fine in staging with 10 pods. It fell over in production with 100 pods the first morning under real traffic. Nobody thought to do the multiplication.
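The multiplication nobody did fits in a few lines. The numbers are the ones from the text; the instance limit is the approximate Aurora figure cited above, not an exact specification.

```python
# Per-pod pool size times pod count, checked against the instance limit.
pool_per_pod = 100
instance_limit = 5_000          # approximate limit cited above

for pods in (3, 10, 100):       # dev, staging, production
    total = pool_per_pod * pods
    status = "OK" if total <= instance_limit else "EXHAUSTED"
    print(f"{pods:>3} pods -> {total:>6} connections ({status})")
```

Dev and staging pass; production exhausts the limit by a factor of two. Running this arithmetic during assessment is how the problem gets caught before cutover morning.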
PgBouncer in transaction mode is the standard solution. It multiplexes thousands of application-level connections over a smaller pool of actual database connections. A PgBouncer pool of 200 database connections can serve 10,000 application connections by reusing connections during the brief gaps between transactions. The architecture change (inserting a connection pooler between application and database) is straightforward. But it must be tested under production-representative load before cutover. Not after. There is no “after” if connection exhaustion takes down your application on cutover day.
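A minimal transaction-mode configuration might look like the sketch below, using the pool sizes from the paragraph above. Hostnames, database names, and file paths are placeholders; real deployments will need authentication and TLS settings tuned to their environment.

```ini
; Minimal PgBouncer sketch for transaction-mode pooling.
; Hostnames, database names, and file paths are placeholders.
[databases]
appdb = host=aurora-writer.example.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; release server connections between transactions
default_pool_size = 200        ; actual database connections per db/user pair
max_client_conn = 10000        ; application-side connections accepted
```

The `pool_mode = transaction` line is what enables the multiplexing: a server connection is returned to the pool the moment a transaction commits, so 200 database connections can absorb 10,000 mostly-idle clients.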
For teams running this migration as part of a broader cloud-native modernization, connection pooling architecture should be standardized and load-tested across all services before any database cutover begins. Not retrofitted after the first cutover fails under load.
And when things go wrong (because something always goes wrong), you need a real rollback plan.
Rollback Planning
A rollback plan that requires restoring from backup is not a rollback plan. It is disaster recovery with a multi-hour RTO. Do not confuse the two. A real rollback plan assumes the worst case: the cutover completed, production traffic ran against the new database for 30 minutes, something critical failed, and you need to be fully back on the old database within 10 minutes.
That requires three things in place before you start the cutover. First, the source database still running and fully intact. Never decommission during the cutover weekend. Second, CDC still running, capturing any writes made to the target during the cutover window and applying them back to the source. Third, application routing that can switch traffic back to the source in seconds, not minutes. All three. Non-negotiable.
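The three preconditions translate naturally into a go/no-go gate run immediately before cutover. The boolean values here are placeholders; in practice each would be wired to a real health probe.

```python
# Go/no-go gate over the three rollback preconditions from the text.
# The True values are placeholders for real health checks.
preconditions = {
    "source database still running and intact": True,
    "reverse CDC capturing target writes back to source": True,
    "traffic routing can switch back in seconds": True,
}

failed = [name for name, ok in preconditions.items() if not ok]
if failed:
    raise RuntimeError(f"do not start cutover; missing: {failed}")
print("rollback preconditions satisfied; cutover may proceed")
```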
The rollback procedure should be documented step-by-step, rehearsed in staging, and timed to confirm it actually meets the target RTO. If your rehearsal takes 45 minutes, your production rollback will take longer, not shorter. Practice until the procedure completes in under 10 minutes with confidence.
Decommissioning the source database immediately after a seemingly successful cutover is tempting for cost reasons. Do not do it. Keep the source running with reverse CDC for at least two weeks while you validate behavior under real production patterns. The cost of running both databases for two weeks is negligible compared to discovering a critical issue three days after cutover with no rollback path. For teams operating under resilience and disaster recovery requirements, this parallel-running period is non-negotiable.
The database migration that succeeds is the one where the team planned for the rollback as thoroughly as the cutover. The source database alive with reverse CDC for two weeks after cutover is insurance that costs pennies compared to discovering a critical issue with no path back. Plan for the rollback you hope you never need, and test it until you know it works.