Serverless Events: Handling Failures, Duplicates, and Partial State
Your order processing pipeline handles 50 events per minute in staging without a hitch. Black Friday hits. Traffic spikes to 50,000 events per minute and Lambda scales to meet it. Exactly as advertised. The relay team went from jogging to sprinting. Then three things blow up at once.
A slice of orders got processed twice because SQS delivered duplicates during the spike and your handler wasn’t idempotent. Duplicate charges pile up and support tickets arrive before anyone even notices. The baton passed twice. Both runners ran. A downstream payment failure left 400 orders stuck in a half-processed state. Inventory reserved, no payment collected, no automated recovery path. A runner fell. Nobody reversed the handoffs. And the alerting that should have caught all of this? Dead silent. Nobody wired OpenTelemetry correlation IDs through the event chain, so the monitoring couldn’t connect a failed payment back to the original order. No tracking chip on the baton. Nobody knows where it went.
Scaling was never the hard part. Surviving what happens after scaling is.
- SQS delivers duplicates during traffic spikes. The baton passed twice. If your handler isn’t idempotent, duplicate charges pile up before anyone notices. At-least-once delivery means exactly that.
- Idempotency keys must come from business identifiers, not infrastructure IDs. Use order-123-payment-attempt-1, not the SQS message ID. Message IDs change on redelivery. Different baton, same race.
- Every DLQ message is a confirmed business failure. Alert on depth above zero within 1 minute. Build the replay mechanism before you need it. 2 messages/hour = 96 failed operations over a weekend. Dropped batons piling up.
- Correlation IDs must travel through every event in the chain, including error paths and DLQ handlers. The tracking chip on every baton. Without them, debugging becomes CloudWatch timestamp archaeology.
- Partial failure recovery requires compensation logic. A saga that reverses the handoffs. Inventory reserved but payment failed? A retry won’t release the reservation. Only a compensating transaction will.
The groundwork that has to exist before any of this holds up:
- DynamoDB table (or equivalent) provisioned for idempotency key storage with TTL configured
- DLQ configured on every SQS queue and Lambda event source mapping
- CloudWatch alarm on DLQ depth > 0 with notification within 1 minute
- Structured logging format adopted across all Lambda functions
- Trace ID generation at the API gateway entry point with propagation logic in shared middleware
Idempotency: The Non-Negotiable Foundation
At-least-once delivery. Full stop. SQS standard queues, SNS, and EventBridge all work this way, and your handlers must account for it. The implementation is a DynamoDB conditional write: insert the event ID, process if new, skip if the key already exists. Atomic check-and-record.
Most implementations break at one specific point. The processing step between the initial conditional write and the completion marker is the danger zone. If your handler processes the business logic successfully but crashes before marking the event as complete, the next retry sees the event ID as in-progress, not done. You need a status field on the deduplication record: PROCESSING on first receipt, COMPLETE after successful processing. Retries that find PROCESSING older than your expected max processing time (say 5 minutes) should re-attempt, because the first attempt likely crashed.
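The PROCESSING/COMPLETE state machine described above can be sketched as follows. This uses an in-memory dict as a stand-in for the DynamoDB table (in production the first write is a conditional PutItem with `attribute_not_exists`, and the record carries a TTL attribute); the function and field names are illustrative, not an AWS API.

```python
STALE_AFTER_SECONDS = 300  # expected max processing time (5 minutes)

def claim_event(store: dict, key: str, now: float) -> bool:
    """Return True if this invocation should process the event."""
    record = store.get(key)
    if record is None:
        # First receipt: record PROCESSING atomically before doing work.
        store[key] = {"status": "PROCESSING", "started_at": now}
        return True
    if record["status"] == "COMPLETE":
        return False  # duplicate delivery, already done
    if now - record["started_at"] > STALE_AFTER_SECONDS:
        # A prior attempt likely crashed mid-processing; re-claim it.
        store[key] = {"status": "PROCESSING", "started_at": now}
        return True
    return False  # another attempt is still in flight

def mark_complete(store: dict, key: str) -> None:
    store[key]["status"] = "COMPLETE"
```

The crash scenario falls out directly: an attempt that claims the key but never calls `mark_complete` blocks retries only until the record goes stale, after which the next retry re-claims and re-processes.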
For payment and inventory operations specifically, derive the idempotency key from the business operation’s natural identifier, not the infrastructure event ID. Use order-123-payment-attempt-1 rather than the SQS message ID. SQS redelivery scenarios change the message ID while the business operation remains the same. Get this wrong and you process the same charge twice with two different message IDs, and your deduplication logic sees them as distinct events.
Don’t: Use the SQS message ID as your idempotency key. Message IDs change when SQS redelivers a message after visibility timeout expiry. Two different message IDs, same business operation, double charge.
Do: Derive the key from the business operation: order-123-payment-attempt-1. This survives queue redelivery, dead-letter requeuing, and manual replay because the business identity never changes.
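The Do/Don't above reduces to a one-line helper. This is an illustrative function, not an AWS API; the point is that every input is a business identifier, so the key is stable across redelivery.

```python
def payment_idempotency_key(order_id: str, attempt: int) -> str:
    # Built from business identity only: stable across SQS redelivery,
    # DLQ requeuing, and manual replay. Never use event.message_id here,
    # because SQS assigns a new message ID on redelivery.
    return f"order-{order_id}-payment-attempt-{attempt}"
```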
Sagas: Design the Failure Path First
A distributed transaction spanning inventory, payment, and shipping cannot rely on two-phase commit. The saga pattern coordinates it through events, with compensating transactions running in reverse when something fails downstream.
Two flavors, and the operational gap between them is huge.
| Aspect | Choreography | Orchestration (Step Functions) |
|---|---|---|
| Visibility | Invisible flow across repos | Explicit state machine, visual debugger |
| Debugging | Hours tracing events through CloudWatch | Single execution view, queryable history |
| Compensation | Scattered across services | Centralized, versioned |
| Complexity ceiling | Manageable at 2-3 steps | Scales to 10+ steps |
| Best for | Simple, 2-step workflows | 4+ step business processes |
For anything beyond three steps, orchestration wins on operability and it’s not close. Choreography looks beautiful on a whiteboard. Runners who just hand off to whoever is next. In production, tracing a compensation failure across four repos at 3x normal traffic volume is the kind of experience that makes people update their resumes. (Ask me how I know.)
Design compensation first. Every compensating transaction needs to be idempotent and handle partial success. A refund must work whether the charge was fully processed, partially processed, or authorized-not-captured. If your compensation logic only handles the happy failure path, the first partial failure in production will leave you with an order that cannot complete and cannot roll back. Stuck between states with no automated recovery.
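A minimal sketch of what "handles partial success" means for the payment compensation described above. The charge states and return values are assumptions for illustration, not a real payment-provider API; the structure is what matters: every branch is safe to run twice, and each state gets its own reversal.

```python
def compensate_payment(charge: dict) -> str:
    """Idempotently reverse a charge in whatever state it was left."""
    state = charge["state"]
    if state in ("refunded", "voided"):
        return "already-reversed"        # compensation itself must be idempotent
    if state == "captured":
        charge["state"] = "refunded"     # settled charge: issue a refund
        return "refunded"
    if state == "authorized":
        charge["state"] = "voided"       # authorized-not-captured: void, don't refund
        return "voided"
    return "nothing-to-reverse"          # the charge never went through
```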
DLQ Discipline: Every Message Is a Business Failure
Teams set up DLQs and never look at them. A trickle of 2 messages per hour means 96 failed business operations over a weekend. An order not placed. A notification not sent. A payment not processed. Alert on depth above zero within 1 minute. Build the replay mechanism before you need it, not during the incident.
The replay mechanism needs its own idempotency handling. Messages replayed from the DLQ hit the same handler that originally failed. If the root cause was a transient infrastructure issue (downstream timeout, throttling), replay works cleanly. If the root cause was a data issue (malformed payload, schema mismatch), replaying without fixing the data produces the same failure. Categorize DLQ messages by failure type before replaying. Not every message gets the same treatment.
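The triage step above can be as simple as splitting on recorded error type. This sketch assumes your failure handler stamps an `error_type` field onto each DLQ record before it lands there; that field and the error names are illustrative, not SQS attributes.

```python
# Errors that point at infrastructure, not data: safe to replay as-is.
TRANSIENT_ERRORS = {"Timeout", "Throttling", "ServiceUnavailable"}

def triage(dlq_messages: list) -> tuple:
    """Split DLQ messages into replayable vs needs-a-data-fix-first."""
    replayable, needs_fix = [], []
    for msg in dlq_messages:
        if msg["error_type"] in TRANSIENT_ERRORS:
            replayable.append(msg)
        else:
            # Malformed payload, schema mismatch: replaying reproduces
            # the same failure until the data is corrected.
            needs_fix.append(msg)
    return replayable, needs_fix
```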
Observability: Correlation IDs or Archaeology
Without correlation IDs, debugging a failed order across an event chain means hours of timestamp guessing across CloudWatch log groups. With them, it is one query returning in seconds. The math on that tradeoff is not complicated.
Generate a UUID at the API gateway entry point. Propagate it through every event payload, every log line, every error handler, and every DLQ record. Most teams wire correlation into the happy path and skip error paths and DLQ handlers. But the happy path isn’t where you need observability.
Invest in observability tooling early. Retrofitting correlation IDs into a live event chain means touching every handler, every schema, every log statement. Wire it in from day one.
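The propagation discipline above is small in code, which is exactly why it belongs in shared middleware rather than copy-paste. A sketch, with illustrative function and field names: generate once at the entry point, then copy the same ID into every downstream event and every structured log line, error paths and DLQ handlers included.

```python
import json
import uuid

def ensure_correlation_id(event: dict) -> str:
    """Reuse an inbound correlation ID, or mint one at the entry point."""
    cid = event.get("correlation_id") or str(uuid.uuid4())
    event["correlation_id"] = cid
    return cid

def log_line(level: str, message: str, correlation_id: str) -> str:
    # Structured logging: one query on correlation_id finds the whole chain.
    return json.dumps({"level": level, "msg": message,
                       "correlation_id": correlation_id})

def build_downstream_event(source_event: dict, payload: dict) -> dict:
    # Every emitted event carries the same ID, including events emitted
    # from error handlers, DLQ replays, and compensation steps.
    return {"correlation_id": source_event["correlation_id"], **payload}
```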
When Serverless Stops Making Sense
| Signal | Threshold | Better Alternative |
|---|---|---|
| Sustained invocations | Above ~10,000/minute, steady traffic | Containers (ECS/Kubernetes) |
| Latency requirement | Sub-100ms consistently | Containers with minimum replicas |
| Provisioned concurrency cost | Approaches container cost | Containers (you lost the cost advantage) |
| Execution duration | Hitting 15-minute timeout regularly | Step Functions or containers |
| State across requests | In-memory state required | Containers with sticky sessions |
Above roughly 10,000 invocations per minute with consistent traffic, containers cost materially less. When you need sub-100ms latency with variable traffic, containers with minimum replicas beat provisioned concurrency on cost. Provisioned concurrency eliminates cold starts but also eliminates the pricing advantage. At that point you are paying container prices for serverless operational constraints. Distributed systems that need consistent latency should evaluate the crossover point carefully.
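The crossover claim is back-of-envelope arithmetic. A sketch, using illustrative per-request and per-GB-second rates (check current regional pricing; the shape of the curve matters more than the exact dollars):

```python
REQUEST_RATE = 0.20 / 1_000_000   # assumed $ per invocation
GB_SECOND_RATE = 0.0000166667     # assumed $ per GB-second

def monthly_lambda_cost(invocations_per_minute: int,
                        avg_duration_s: float,
                        memory_gb: float) -> float:
    """Rough monthly compute bill for a steady invocation rate."""
    invocations = invocations_per_minute * 60 * 24 * 30
    request_cost = invocations * REQUEST_RATE
    compute_cost = invocations * avg_duration_s * memory_gb * GB_SECOND_RATE
    return request_cost + compute_cost
```

At a steady 10,000 invocations/minute with 200ms handlers at 512MB, this lands around $800/month for compute alone, which is the territory where a small always-on container fleet starts to undercut it.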
What the Industry Gets Wrong About Event-Driven Serverless
“SQS guarantees exactly-once delivery.” SQS guarantees at-least-once. During traffic spikes, duplicate deliveries are expected behavior, not a bug. If your handler is not idempotent, every spike produces duplicate processing. The infrastructure is working correctly. Your code is not accounting for it.
“Serverless eliminates infrastructure concerns.” Serverless eliminates server management. It introduces different concerns: cold starts, concurrent execution limits, 15-minute timeout walls, DLQ management, and correlation tracking across asynchronous invocations. Different concerns. Not fewer.
That pipeline with duplicate charges piling up? Same spike. Same relay race. Idempotency keys catch the duplicates. Runners who received the baton twice only run once. DLQ alerting flags the payment failures within a minute. Dropped batons found in seconds. Correlation IDs trace every order from entry to finish in a single query. Every baton tracked. Saga compensation releases the stuck inventory automatically. The referee reverses the handoffs. Every order processes exactly once. Scaling was never the problem.