Database Migration Without Downtime
You complete what looks like a successful Oracle-to-Aurora migration. Application tier deployed cleanly. Data loaded. Smoke tests passed. Green across the board. You redirect production traffic and within 45 minutes hit connection exhaustion. Your ORM was configured to open a connection per request, which worked fine against on-premises Oracle with a 2,000-connection capacity but overwhelms Aurora’s limits at production volume. Three weeks of unplanned PgBouncer retrofitting because of a configuration nobody thought to load-test.
- Connection pooling causes more migration failures than schema conversion. Teams spend months on schema and an afternoon on pool sizing. Production teaches them the ratio was backward.
- CDC (Change Data Capture) cuts cutover windows from hours to minutes. Stream changes from source to target in real time. Cut over when replication lag drops to seconds.
- Matching row counts do not mean the data is correct. A precision mismatch between Oracle NUMBER and PostgreSQL NUMERIC produces the right number of wrong rows.
- Test with production-volume data, not staging subsets. Query plans change with table statistics. A migration that works at 10GB fails at 2TB.
- Keep the rollback path live for 2-4 weeks after cutover. Bidirectional replication ensures the old database stays current while you validate the new one.
Debezium is the standard open-source CDC platform. AWS DMS covers managed migrations. Schema conversion absorbs months of effort. Connection behavior absorbs an afternoon. That ratio needs to flip.
Assessment: Where Migrations Actually Break
Schema conversion tools (AWS SCT, ora2pg) automate DDL and do it reasonably well. Do not trust them for stored procedures. PL/SQL to PL/pgSQL translation needs human review for every non-trivial procedure because the tools translate syntax without understanding behavior. A rounding difference between Oracle NUMBER(10,2) and PostgreSQL NUMERIC hides until financial reconciliation fails weeks after cutover.
- Full inventory of stored procedures, triggers, and views with engine-specific syntax
- Data type mapping validated for precision-sensitive columns (financial, scientific); a quick inventory sketch follows this checklist
- Collation and character set differences documented between source and target
- Connection pool sizing calculated for target engine limits under production concurrency
- Rollback procedure documented and tested with the operations team
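The data type item above deserves tooling rather than eyeballing. As a starting point, precision-sensitive columns can be inventoried straight from Oracle’s data dictionary. This is a minimal sketch, assuming the python-oracledb driver and read access to ALL_TAB_COLUMNS; the schema name, credentials, and DSN are placeholders.
# Sketch: flag Oracle NUMBER columns whose target NUMERIC(p,s) mapping needs review.
# Unbounded NUMBER columns are the ones that quietly change rounding behavior.
import oracledb  # python-oracledb, thin mode

SCHEMA = "APP"  # placeholder schema
conn = oracledb.connect(user="assessor", password="***",
                        dsn="oracle-prod.internal:1521/ORCLPDB1")  # placeholders
cur = conn.cursor()
cur.execute("""
    SELECT table_name, column_name, data_precision, data_scale
    FROM   all_tab_columns
    WHERE  owner = :owner AND data_type = 'NUMBER'
    ORDER  BY table_name, column_id
""", owner=SCHEMA)
for table, column, precision, scale in cur:
    if precision is None:
        # Unbounded NUMBER: decide an explicit NUMERIC(p,s) target by hand
        print(f"REVIEW  {table}.{column}: NUMBER with no declared precision")
    else:
        print(f"MAP     {table}.{column}: NUMBER({precision},{scale or 0}) -> NUMERIC({precision},{scale or 0})")
conn.close()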
Infrastructure-as-code teams have an easier assessment because schemas are already version-controlled. But version-controlled schemas only help if the test suite exercises the edge cases. The migration that breaks is never the straightforward table copy. It is the view that references an Oracle-specific function, the trigger that depends on session-level variables, or the stored procedure with implicit type coercion that PostgreSQL handles differently.
The CDC Migration Pattern
| Migration Strategy | Cutover Downtime | Rollback Capability | Data Risk | Best For |
|---|---|---|---|---|
| CDC + incremental sync | 5-15 minutes | Keep source running, revert traffic | Low (continuous validation) | Production databases with SLA |
| Dump and restore | Hours (scales with size) | Restore from backup (hours) | Medium (gap between dump and restore) | Dev/staging, small databases |
| Dual-write | Zero (gradual traffic shift) | Stop writing to new target | High (two-write consistency risk) | When zero downtime is mandatory |
| Blue-green database | Minutes (DNS swap) | Swap back | Low (both run in parallel) | When parallel operation is affordable |
CDC keeps the target continuously synced while the source handles live traffic. At cutover: stop writes, apply the final CDC batch (typically under 30 seconds), redirect connections. Total window: 5-15 minutes.
Each phase has validation gates. Skipping application-level validation is the most common cutover-day failure, and the one with the worst blast radius.
One critical timing detail: CDC must start before or during the initial load. If the load takes 6 hours and capture starts after, those 6 hours of changes get missed. Row counts match. Data doesn’t.
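To make that ordering concrete, here is roughly what the capture side looks like when Debezium is the CDC engine. A minimal sketch, assuming a Kafka Connect worker with the Debezium Oracle connector installed; property names follow the Debezium 2.x documentation, and every hostname, credential, and table name is a placeholder. snapshot.mode=initial is what ties the bulk copy and the change stream together so those six hours of changes cannot fall through the gap.
# Sketch: register a Debezium Oracle source connector via the Kafka Connect REST API.
# snapshot.mode=initial takes the bulk copy and then streams redo-log changes,
# so capture effectively starts with the initial load rather than after it.
import json
import requests  # assumes the requests package is available

connector = {
    "name": "oracle-app-cdc",  # placeholder connector name
    "config": {
        "connector.class": "io.debezium.connector.oracle.OracleConnector",
        "database.hostname": "oracle-prod.internal",      # placeholder
        "database.port": "1521",
        "database.user": "dbzuser",                        # placeholder
        "database.password": "***",
        "database.dbname": "ORCLCDB",                      # placeholder
        "topic.prefix": "appdb",
        "table.include.list": "APP.ORDERS,APP.PAYMENTS",   # placeholder tables
        "snapshot.mode": "initial",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.appdb",
    },
}
resp = requests.post("http://connect:8083/connectors",  # Kafka Connect REST endpoint
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector), timeout=30)
resp.raise_for_status()
print("Connector registered:", resp.json()["name"])
Whether the apply side is a JDBC sink or AWS DMS, the gate for cutover is the same one described above: replication lag measured in seconds.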
Data Integrity Validation
Row counts match, yet every decimal is wrong. A precision mismatch between Oracle NUMBER and PostgreSQL NUMERIC slips through row-count checks undetected: every row present, every value slightly off. Validation intensity must match the data’s risk level.
| Data Risk Level | Examples | Validation Method | Confidence |
|---|---|---|---|
| Low | Lookup tables, static configuration, code reference data | Row count comparison + spot-check sample rows | Sufficient. Low blast radius if wrong |
| Medium | User profiles, product catalog, order history | Row count + column-level hash comparison on critical columns (sketch below) | High. Catches most corruption and truncation |
| High | Financial records, regulated PII, payment transactions | Shadow queries under production-representative load. Source vs target result diff | Maximum. Validates both data and query behavior |
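The column-level hash comparison from the middle row is easiest to do client-side, because Oracle and PostgreSQL share no built-in hash aggregate. A minimal sketch, assuming the python-oracledb and psycopg2 drivers; table, key, and column names are placeholders, and the TO_CHAR format pins both engines to the same canonical text before hashing.
# Sketch: hash critical columns from both engines in primary-key order and compare.
# Any precision or truncation drift changes the digest even when row counts match.
import hashlib
import oracledb, psycopg2

QUERY_ORA = "SELECT order_id, TO_CHAR(amount, 'FM9999999990.00') FROM app.orders ORDER BY order_id"
QUERY_PG  = "SELECT order_id, TO_CHAR(amount, 'FM9999999990.00') FROM orders ORDER BY order_id"

def digest(cursor, query):
    cursor.execute(query)
    h = hashlib.sha256()
    for row in cursor:
        # Canonical text form so both engines feed identical bytes to the hash
        h.update("|".join(str(col) for col in row).encode() + b"\n")
    return h.hexdigest()

ora = oracledb.connect(user="validator", password="***", dsn="oracle-prod.internal:1521/ORCLPDB1")
pg = psycopg2.connect("host=aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com dbname=app user=validator")
src, tgt = digest(ora.cursor(), QUERY_ORA), digest(pg.cursor(), QUERY_PG)
print("MATCH" if src == tgt else f"MISMATCH source={src} target={tgt}")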
Shadow queries (running identical requests against source and target under production load) are the gold standard for high-value data. They also surface query performance gaps, planner differences, and connection behavior that staging never reveals. Cloud migration teams should run shadow validation for weeks, not days, before cutover.
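A minimal shadow-comparison sketch in the same vein, reusing the two connections from the hash check above: replay a handful of representative queries against both engines and diff result shape and latency. The query list and regression threshold are placeholders; in practice the queries come from captured production traffic and run for weeks at production concurrency.
# Sketch: replay representative queries on source and target, diff shape and timing.
# Surfaces planner regressions and behavioral gaps; value-level correctness is
# handled by the hash comparison above.
import time

SHADOW_QUERIES = [  # placeholders; normally captured from production traffic
    "SELECT status, COUNT(*) FROM orders GROUP BY status ORDER BY status",
    "SELECT order_id, amount FROM orders WHERE customer_id = 4211 ORDER BY order_id FETCH FIRST 50 ROWS ONLY",
]

def run(cursor, sql):
    start = time.perf_counter()
    cursor.execute(sql)
    return cursor.fetchall(), time.perf_counter() - start

for sql in SHADOW_QUERIES:
    src_rows, src_t = run(ora.cursor(), sql)   # ora, pg: connections from the sketch above
    tgt_rows, tgt_t = run(pg.cursor(), sql)
    if len(src_rows) != len(tgt_rows):
        print(f"RESULT DIFF   {sql[:50]}...  source={len(src_rows)} rows, target={len(tgt_rows)} rows")
    if tgt_t > src_t * 2:                      # placeholder latency-regression threshold
        print(f"LATENCY DIFF  {sql[:50]}...  source={src_t:.3f}s, target={tgt_t:.3f}s")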
Don’t: Validate with staging-sized data and call it good. A 10GB test database with 10 concurrent connections reveals nothing about how 2TB and 500 connections behave. Query plans change with table statistics. Connection pools exhaust at thresholds staging never reaches.
Do: Load production-volume data into the target. Replay production query patterns at production concurrency. The migration either works under real conditions or it does not. There is no middle ground worth betting on.
Connection Pooling Architecture
Data integrity is validated. The migration looks clean. Then you flip the switch and the database falls over. Not because of bad data. Because of math.
100 connections per pod, 100 pods = 10,000 connections. Aurora supports roughly 5,000. Dead in production. Perfectly fine in staging with 10 pods. Nobody did the multiplication.
# PgBouncer config - transaction mode pooling
[databases]
production = host=aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com port=5432 dbname=app
[pgbouncer]
pool_mode = transaction # Release connection after each transaction
max_client_conn = 10000 # Accept thousands of app connections
default_pool_size = 200 # Only 200 actual DB connections
reserve_pool_size = 20 # Emergency overflow
server_idle_timeout = 300 # Reclaim idle connections after 5min
PgBouncer multiplexes 10,000 application connections over 200 actual database connections. Test this under production load before cutover. There is no “fix it after” if connection exhaustion takes you down on cutover day.
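The application side has to cooperate: point the ORM at PgBouncer, keep the per-pod client pool modest, and avoid session state that transaction mode cannot preserve. A minimal SQLAlchemy sketch under those assumptions; the PgBouncer hostname and the pool numbers are placeholders, not recommendations.
# Sketch: the ORM connects to PgBouncer, not Aurora. The client pool stays small
# because PgBouncer does the multiplexing; transaction mode means no reliance on
# session state (SET, advisory locks, server-side prepared statements).
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:***@pgbouncer.internal:6432/production",  # placeholder host
    pool_size=20,        # per-pod client pool; the real DB pool lives in PgBouncer
    max_overflow=10,
    pool_pre_ping=True,  # detect connections PgBouncer has recycled
)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())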
How the connection pool formula works
The formula connections = (core_count * 2) + effective_spindle_count comes from PostgreSQL tuning guides and estimates how many active connections the database server can actually keep busy. It sizes the database-side pool (the connections that hit Aurora), not the per-pod client pool. For cloud instances without spinning disks, the spindle count is effectively 1 (SSD). A db.r6g.2xlarge with 8 vCPUs gives a starting point of (8 * 2) + 1 = 17 database connections; workloads that spend much of their time waiting on I/O or locks can sustain more, which is why the PgBouncer example above uses a default_pool_size of 200 rather than 17. Either way, the number comes from the instance, not from the pod count. Most teams skip the calculation and guess a round number like 100 per pod, which works until it catastrophically does not.
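The arithmetic is worth encoding so it gets rerun whenever the instance class or pod count changes. A minimal sketch using the numbers from this section.
# Sketch: database-side pool starting point from the (cores * 2) + spindles heuristic,
# compared with what direct per-pod connections would open against Aurora.
def db_pool_starting_point(vcpus: int, effective_spindles: int = 1) -> int:
    """Approximate number of active connections the database can keep busy."""
    return vcpus * 2 + effective_spindles

def direct_connections(pods: int, pool_per_pod: int) -> int:
    """Connections Aurora sees if every pod opens its own full pool."""
    return pods * pool_per_pod

print("db-side pool starting point:", db_pool_starting_point(vcpus=8))            # 17
print("direct ORM connections:", direct_connections(pods=100, pool_per_pod=100))  # 10,000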
| Dimension | Direct ORM Connections | PgBouncer Transaction Mode |
|---|---|---|
| Setup | 100 pods x 100 connections each | 100 pods x 100 connections to PgBouncer |
| Database connections | 10,000 direct connections to Aurora | 200 multiplexed connections to Aurora |
| Aurora limit | db.r6g.2xlarge supports ~5,000 | Well within limit at 200 |
| At production traffic | Connection exhaustion. Service unavailable | Stable. PgBouncer queues excess requests |
| Latency overhead | None (direct) | <1ms per transaction (routing overhead) |
| When it breaks | When total connections exceed the instance limit (~5,000) | Server-side prepared statements require session mode (not transaction mode) |
Rollback Planning That Actually Works
A rollback plan that requires restoring from backup is disaster recovery, not rollback. The distinction matters because disaster recovery has multi-hour RTO. A genuine rollback requires three things: the source database still running and receiving reverse CDC, bidirectional replication keeping both databases current, and traffic routing switchable in seconds. Rehearse until the team completes it in under 10 minutes.
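If applications reach the database through a short-TTL CNAME, the traffic switch in either direction is a single record change, which is what makes a sub-10-minute rehearsal realistic. A minimal sketch using boto3’s Route53 API; the hosted zone ID, record name, and endpoints are placeholders, and it assumes a private zone the applications actually resolve.
# Sketch: flip the database CNAME between Aurora (cutover) and Oracle (rollback).
# With a 60-second TTL, clients converge on the new endpoint within about a minute.
import boto3

route53 = boto3.client("route53")

def point_db_endpoint(target: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z0EXAMPLE",  # placeholder private hosted zone
        ChangeBatch={
            "Comment": f"database traffic switch -> {target}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "db.internal.example.com.",  # placeholder record
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

# Cutover:  point_db_endpoint("aurora-cluster.xxxxx.us-east-1.rds.amazonaws.com")
# Rollback: point_db_endpoint("oracle-prod.internal")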
Keep the source running with reverse CDC for at least two weeks post-cutover. Resilience engineering demands it. Tell finance it is insurance. The cost of running two databases for two weeks is trivial compared to the cost of an unrecoverable cutover failure.
What the Industry Gets Wrong About Database Migration
“Schema conversion is the hard part.” Schema conversion tools handle the bulk of DDL automatically. Connection pooling, query plan differences, and transaction isolation behavior under real load are what actually cause migration failures on cutover day. The effort allocation is almost always backward: months on schema, an afternoon on the runtime behavior that actually kills you.
“Test in staging, deploy with confidence.” Staging validates syntax. Production validates behavior. Query plans, connection pool limits, and replication lag all behave differently at real volume. If your migration test doesn’t stress the target at production scale, the cutover itself becomes your first real test.
That Oracle-to-Aurora migration? With shadow traffic testing, CDC replication, and a connection pool validated at production volume, the 45-minute failure never happens. The rollback switch stays untouched. The three-week retrofit stays hypothetical.