Serverless Events: Handling Failures, Duplicates, and Partial State
Your order processing pipeline handles 50 events per minute in staging without a hitch. Black Friday hits. Traffic spikes to 50,000 events per minute and Lambda scales to meet it. Exactly as advertised. The relay team went from jogging to sprinting. Then three things blow up at once.
A slice of orders got processed twice because SQS delivered duplicates during the spike and your handler wasn’t idempotent. Duplicate charges pile up and support tickets arrive before anyone even notices. The baton passed twice. Both runners ran. A downstream payment failure left 400 orders stuck in a half-processed state. Inventory reserved, no payment collected, no automated recovery path. A runner fell. Nobody reversed the handoffs. And the alerting that should have caught all of this? Dead silent. Nobody wired OpenTelemetry correlation IDs through the event chain, so the monitoring couldn’t connect a failed payment back to the original order. No tracking chip on the baton. Nobody knows where it went.
Scaling was never the hard part. Surviving what happens after scaling is.
- SQS delivers duplicates during traffic spikes. The baton passed twice. If your handler isn’t idempotent, duplicate charges pile up before anyone notices. At-least-once delivery means exactly that.
- Idempotency keys must come from business identifiers, not infrastructure IDs. Use order-123-payment-attempt-1, not the SQS message ID. Message IDs change on redelivery. Different baton, same race.
- Every DLQ message is a confirmed business failure. Alert on depth above zero within 1 minute. Build the replay mechanism before you need it. 2 messages/hour = 96 failed operations over a weekend. Dropped batons piling up.
- Correlation IDs must travel through every event in the chain, including error paths and DLQ handlers. The tracking chip on every baton. Without them, debugging becomes CloudWatch timestamp archaeology.
- Partial failure recovery requires compensation logic. A saga that reverses the handoffs. Inventory reserved but payment failed? A retry won’t release the reservation. Only a compensating transaction will.
The groundwork that has to exist before any of this holds up:
- DynamoDB table (or equivalent) provisioned for idempotency key storage with TTL configured
- DLQ configured on every SQS queue and Lambda event source mapping
- CloudWatch alarm on DLQ depth > 0 with notification within 1 minute
- Structured logging format adopted across all Lambda functions
- Trace ID generation at the API gateway entry point with propagation logic in shared middleware
Idempotency: The Non-Negotiable Foundation
At-least-once delivery. Full stop. SQS standard queues, SNS, and EventBridge all work this way, and your handlers must account for it. The implementation is a DynamoDB conditional write: insert the event ID, process if new, skip if the key already exists. Atomic check-and-record.
Most implementations break at one specific point. The processing step between the initial conditional write and the completion marker is the danger zone. If your handler processes the business logic successfully but crashes before marking the event as complete, the next retry sees the event ID as in-progress, not done. You need a status field on the deduplication record: PROCESSING on first receipt, COMPLETE after successful processing. Retries that find PROCESSING older than your expected max processing time (say 5 minutes) should re-attempt, because the first attempt likely crashed.
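The PROCESSING/COMPLETE state machine described above can be sketched as follows. This uses an in-memory dict as a stand-in for the DynamoDB table (in production the first write is a conditional PutItem with `attribute_not_exists`, and the record carries a TTL attribute); the function and field names are illustrative, not an AWS API.

```python
STALE_AFTER_SECONDS = 300  # expected max processing time (5 minutes)

def claim_event(store: dict, key: str, now: float) -> bool:
    """Return True if this invocation should process the event."""
    record = store.get(key)
    if record is None:
        # First receipt: record PROCESSING atomically before doing work.
        store[key] = {"status": "PROCESSING", "started_at": now}
        return True
    if record["status"] == "COMPLETE":
        return False  # duplicate delivery, already done
    if now - record["started_at"] > STALE_AFTER_SECONDS:
        # A prior attempt likely crashed mid-processing; re-claim it.
        store[key] = {"status": "PROCESSING", "started_at": now}
        return True
    return False  # another attempt is still in flight

def mark_complete(store: dict, key: str) -> None:
    store[key]["status"] = "COMPLETE"
```

The crash scenario falls out directly: an attempt that claims the key but never calls `mark_complete` blocks retries only until the record goes stale, after which the next retry re-claims and re-processes.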
For payment and inventory operations specifically, derive the idempotency key from the business operation’s natural identifier, not the infrastructure event ID. Use order-123-payment-attempt-1 rather than the SQS message ID. SQS redelivery scenarios change the message ID while the business operation remains the same. Get this wrong and you process the same charge twice with two different message IDs, and your deduplication logic sees them as distinct events.
Don’t: Use the SQS message ID as your idempotency key. Message IDs change when SQS redelivers a message after visibility timeout expiry. Two different message IDs, same business operation, double charge.
Do: Derive the key from the business operation: order-123-payment-attempt-1. This survives queue redelivery, dead-letter requeuing, and manual replay because the business identity never changes.
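The Do/Don't above reduces to a one-line helper. This is an illustrative function, not an AWS API; the point is that every input is a business identifier, so the key is stable across redelivery.

```python
def payment_idempotency_key(order_id: str, attempt: int) -> str:
    # Built from business identity only: stable across SQS redelivery,
    # DLQ requeuing, and manual replay. Never use event.message_id here,
    # because SQS assigns a new message ID on redelivery.
    return f"order-{order_id}-payment-attempt-{attempt}"
```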
Sagas: Design the Failure Path First
A distributed transaction spanning inventory, payment, and shipping cannot rely on two-phase commit. The saga pattern coordinates it through events, with compensating transactions running in reverse when something fails downstream.
Two flavors, and the operational gap between them is huge.
| Aspect | Choreography | Orchestration (Step Functions) |
|---|---|---|
| Visibility | Invisible flow across repos | Explicit state machine, visual debugger |
| Debugging | Hours tracing events through CloudWatch | Single execution view, queryable history |
| Compensation | Scattered across services | Centralized, versioned |
| Complexity ceiling | Manageable at 2-3 steps | Scales to 10+ steps |
| Best for | Simple, 2-step workflows | 4+ step business processes |
For anything beyond three steps, orchestration wins on operability and it’s not close. Choreography looks beautiful on a whiteboard. Runners who just hand off to whoever is next. In production, tracing a compensation failure across four repos at 3x normal traffic volume is the kind of experience that makes people update their resumes. (Ask me how I know.)
Design compensation first. Every compensating transaction needs to be idempotent and handle partial success. A refund must work whether the charge was fully processed, partially processed, or authorized-not-captured. If your compensation logic only handles the happy failure path, the first partial failure in production will leave you with an order that cannot complete and cannot roll back. Stuck between states with no automated recovery.
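A minimal sketch of what "handles partial success" means for the payment compensation described above. The charge states and return values are assumptions for illustration, not a real payment-provider API; the structure is what matters: every branch is safe to run twice, and each state gets its own reversal.

```python
def compensate_payment(charge: dict) -> str:
    """Idempotently reverse a charge in whatever state it was left."""
    state = charge["state"]
    if state in ("refunded", "voided"):
        return "already-reversed"        # compensation itself must be idempotent
    if state == "captured":
        charge["state"] = "refunded"     # settled charge: issue a refund
        return "refunded"
    if state == "authorized":
        charge["state"] = "voided"       # authorized-not-captured: void, don't refund
        return "voided"
    return "nothing-to-reverse"          # the charge never went through
```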
DLQ Discipline: Every Message Is a Business Failure
Teams set up DLQs and never look at them. A trickle of 2 messages per hour means 96 failed business operations over a weekend. An order not placed. A notification not sent. A payment not processed. Alert on depth above zero within 1 minute. Build the replay mechanism before you need it, not during the incident.
The replay mechanism needs its own idempotency handling. Messages replayed from the DLQ hit the same handler that originally failed. If the root cause was a transient infrastructure issue (downstream timeout, throttling), replay works cleanly. If the root cause was a data issue (malformed payload, schema mismatch), replaying without fixing the data produces the same failure. Categorize DLQ messages by failure type before replaying. Not every message gets the same treatment.
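The triage step above can be as simple as splitting on recorded error type. This sketch assumes your failure handler stamps an `error_type` field onto each DLQ record before it lands there; that field and the error names are illustrative, not SQS attributes.

```python
# Errors that point at infrastructure, not data: safe to replay as-is.
TRANSIENT_ERRORS = {"Timeout", "Throttling", "ServiceUnavailable"}

def triage(dlq_messages: list) -> tuple:
    """Split DLQ messages into replayable vs needs-a-data-fix-first."""
    replayable, needs_fix = [], []
    for msg in dlq_messages:
        if msg["error_type"] in TRANSIENT_ERRORS:
            replayable.append(msg)
        else:
            # Malformed payload, schema mismatch: replaying reproduces
            # the same failure until the data is corrected.
            needs_fix.append(msg)
    return replayable, needs_fix
```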
Observability: Correlation IDs or Archaeology
Without correlation IDs, debugging a failed order across an event chain means hours of timestamp guessing across CloudWatch log groups. With them, it is one query returning in seconds. The math on that tradeoff is not complicated.
Generate a UUID at the API gateway entry point. Propagate it through every event payload, every log line, every error handler, and every DLQ record. Most teams wire correlation into the happy path and skip error paths and DLQ handlers. But the happy path isn’t where you need observability.
Invest in observability tooling early. Retrofitting correlation IDs into a live event chain means touching every handler, every schema, every log statement. Wire it in from day one.
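The propagation discipline above is small in code, which is exactly why it belongs in shared middleware rather than copy-paste. A sketch, with illustrative function and field names: generate once at the entry point, then copy the same ID into every downstream event and every structured log line, error paths and DLQ handlers included.

```python
import json
import uuid

def ensure_correlation_id(event: dict) -> str:
    """Reuse an inbound correlation ID, or mint one at the entry point."""
    cid = event.get("correlation_id") or str(uuid.uuid4())
    event["correlation_id"] = cid
    return cid

def log_line(level: str, message: str, correlation_id: str) -> str:
    # Structured logging: one query on correlation_id finds the whole chain.
    return json.dumps({"level": level, "msg": message,
                       "correlation_id": correlation_id})

def build_downstream_event(source_event: dict, payload: dict) -> dict:
    # Every emitted event carries the same ID, including events emitted
    # from error handlers, DLQ replays, and compensation steps.
    return {"correlation_id": source_event["correlation_id"], **payload}
```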
When Serverless Stops Making Sense
| Signal | Threshold | Better Alternative |
|---|---|---|
| Sustained invocations | Above ~10,000/minute, steady traffic | Containers (ECS/Kubernetes) |
| Latency requirement | Sub-100ms consistently | Containers with minimum replicas |
| Provisioned concurrency cost | Approaches container cost | Containers (you lost the cost advantage) |
| Execution duration | Hitting 15-minute timeout regularly | Step Functions or containers |
| State across requests | In-memory state required | Containers with sticky sessions |
Above roughly 10,000 invocations per minute with consistent traffic, containers cost materially less. When you need sub-100ms latency with variable traffic, containers with minimum replicas beat provisioned concurrency on cost. Provisioned concurrency eliminates cold starts but also eliminates the pricing advantage. At that point you are paying container prices for serverless operational constraints. Distributed systems that need consistent latency should evaluate the crossover point carefully.
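The crossover claim is back-of-envelope arithmetic. A sketch, using illustrative per-request and per-GB-second rates (check current regional pricing; the shape of the curve matters more than the exact dollars):

```python
REQUEST_RATE = 0.20 / 1_000_000   # assumed $ per invocation
GB_SECOND_RATE = 0.0000166667     # assumed $ per GB-second

def monthly_lambda_cost(invocations_per_minute: int,
                        avg_duration_s: float,
                        memory_gb: float) -> float:
    """Rough monthly compute bill for a steady invocation rate."""
    invocations = invocations_per_minute * 60 * 24 * 30
    request_cost = invocations * REQUEST_RATE
    compute_cost = invocations * avg_duration_s * memory_gb * GB_SECOND_RATE
    return request_cost + compute_cost
```

At a steady 10,000 invocations/minute with 200ms handlers at 512MB, this lands around $800/month for compute alone, which is the territory where a small always-on container fleet starts to undercut it.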
What the Industry Gets Wrong About Event-Driven Serverless
“SQS guarantees exactly-once delivery.” SQS guarantees at-least-once. During traffic spikes, duplicate deliveries are expected behavior, not a bug. If your handler is not idempotent, every spike produces duplicate processing. The infrastructure is working correctly. Your code is not accounting for it.
“Serverless eliminates infrastructure concerns.” Serverless eliminates server management. It introduces different concerns: cold starts, concurrent execution limits, 15-minute timeout walls, DLQ management, and correlation tracking across asynchronous invocations. Different concerns. Not fewer.
That pipeline with duplicate charges piling up? Same spike. Same relay race. Idempotency keys catch the duplicates. Runners who received the baton twice only run once. DLQ alerting flags the payment failures within a minute. Dropped batons found in seconds. Correlation IDs trace every order from entry to finish in a single query. Every baton tracked. Saga compensation releases the stuck inventory automatically. The referee reverses the handoffs. Every order processes exactly once. Scaling was never the problem.