
Serverless Events: Handling Failures, Duplicates, and Partial State

Metasphere Engineering · 10 min read

Your order processing pipeline handles 50 events per minute in staging without a hitch. Black Friday hits. Traffic spikes to 50,000 events per minute and Lambda scales to meet it. Exactly as advertised. The relay team went from jogging to sprinting. Then three things blow up at once.

A slice of orders got processed twice because SQS delivered duplicates during the spike and your handler wasn’t idempotent. Duplicate charges pile up and support tickets arrive before anyone even notices. The baton passed twice. Both runners ran. A downstream payment failure left 400 orders stuck in a half-processed state. Inventory reserved, no payment collected, no automated recovery path. A runner fell. Nobody reversed the handoffs. And the alerting that should have caught all of this? Dead silent. Nobody wired OpenTelemetry correlation IDs through the event chain, so the monitoring couldn’t connect a failed payment back to the original order. No tracking chip on the baton. Nobody knows where it went.

Scaling was never the hard part. Surviving what happens after scaling is.

Key takeaways
  • SQS delivers duplicates during traffic spikes. The baton passed twice. If your handler isn’t idempotent, duplicate charges pile up before anyone notices. At-least-once delivery means exactly that.
  • Idempotency keys must come from business identifiers, not infrastructure IDs. Use order-123-payment-attempt-1, not the SQS message ID. Message IDs change on redelivery. Different baton, same race.
  • Every DLQ message is a confirmed business failure. Alert on depth above zero within 1 minute. Build the replay mechanism before you need it. 2 messages/hour = 96 failed operations over a weekend. Dropped batons piling up.
  • Correlation IDs must travel through every event in the chain, including error paths and DLQ handlers. The tracking chip on every baton. Without them, debugging becomes CloudWatch timestamp archaeology.
  • Partial failure recovery requires compensation logic. A saga that reverses the handoffs. Inventory reserved but payment failed? A retry won’t release the reservation. Only a compensating transaction will.
Prerequisites
  1. DynamoDB table (or equivalent) provisioned for idempotency key storage with TTL configured
  2. DLQ configured on every SQS queue and Lambda event source mapping
  3. CloudWatch alarm on DLQ depth > 0 with notification within 1 minute
  4. Structured logging format adopted across all Lambda functions
  5. Trace ID generation at the API gateway entry point with propagation logic in shared middleware

Idempotency: The Non-Negotiable Foundation

At-least-once delivery. Full stop. SQS standard queues, SNS, and EventBridge all work this way, and your handlers must account for it. The implementation is a DynamoDB conditional write: insert the event ID, process if new, skip if the key already exists. Atomic check-and-record.

Most implementations break at one specific point: the processing step between the initial conditional write and the completion marker. If your handler crashes after claiming the key but before marking the event as complete, a bare deduplication record cannot tell the next retry whether the work actually finished. You need a status field on the deduplication record: PROCESSING on first receipt, COMPLETE after successful processing. Retries that find a PROCESSING record older than your expected maximum processing time (say, 5 minutes) should re-attempt, because the first attempt likely crashed.
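
A minimal sketch of that claim-process-complete flow, assuming a DynamoDB table named idempotency_keys with partition key pk, a TTL attribute expires_at, and a hypothetical process_order function standing in for the business logic:

```python
import time
import boto3
from botocore.exceptions import ClientError

ddb = boto3.resource("dynamodb")
table = ddb.Table("idempotency_keys")   # assumed table name
MAX_PROCESSING_SECONDS = 300            # expected max handler runtime (5 minutes)


def process_order(event: dict) -> None:
    """Placeholder for the actual business logic."""


def handle_event(key: str, event: dict) -> None:
    now = int(time.time())
    try:
        # Claim the key atomically: succeed only if the key is absent, or if a
        # previous attempt has sat in PROCESSING past the allowed window
        # (it almost certainly crashed and should be re-attempted).
        table.put_item(
            Item={
                "pk": key,
                "status": "PROCESSING",
                "claimed_at": now,
                "expires_at": now + 7 * 24 * 3600,  # TTL cleanup
            },
            ConditionExpression=(
                "attribute_not_exists(pk) OR "
                "(#s = :processing AND claimed_at < :stale)"
            ),
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":processing": "PROCESSING",
                ":stale": now - MAX_PROCESSING_SECONDS,
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # Either already COMPLETE, or another attempt is actively processing.
            return
        raise

    process_order(event)

    # Mark done so future redeliveries short-circuit at the conditional write.
    table.update_item(
        Key={"pk": key},
        UpdateExpression="SET #s = :complete",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":complete": "COMPLETE"},
    )
```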

The Idempotency Prerequisite: the pattern that has to be in place before any event-driven serverless system touches production traffic. Without idempotent handlers, every retry, every SQS duplicate, every Lambda re-invocation after a crash or timeout produces duplicate side effects. Not an optimization. A correctness requirement.

For payment and inventory operations specifically, derive the idempotency key from the business operation’s natural identifier, not the infrastructure event ID. Use order-123-payment-attempt-1 rather than the SQS message ID. SQS redelivery scenarios change the message ID while the business operation remains the same. Get this wrong and you process the same charge twice with two different message IDs, and your deduplication logic sees them as distinct events.

Diagram: Idempotency, process once, deliver safely twice. An event arrives with key order-123. The handler checks DynamoDB for the idempotency key: if found, it returns the cached result and skips processing entirely; if not found, it runs the business logic and stores the result with a conditional write (atomic, no race conditions). Process the same event twice, get the same result, every time.
Anti-pattern

Don’t: Use the SQS message ID as your idempotency key. Message IDs change when SQS redelivers a message after visibility timeout expiry. Two different message IDs, same business operation, double charge.

Do: Derive the key from the business operation: order-123-payment-attempt-1. This survives queue redelivery, dead-letter requeueing, and manual replay because the business identity never changes.
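
As a sketch, assuming each SQS record body carries order_id and attempt fields (the exact payload shape is yours):

```python
import json


def idempotency_key(record: dict) -> str:
    """Derive the key from the business operation, never from SQS infrastructure."""
    body = json.loads(record["body"])
    # Survives visibility-timeout redelivery, dead-letter requeueing, and manual
    # replay: the business identity never changes even when the messageId does.
    return f"order-{body['order_id']}-payment-attempt-{body['attempt']}"
    # Anti-pattern: return record["messageId"]  # changes on redelivery
```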

Sagas: Design the Failure Path First

A distributed transaction spanning inventory, payment, and shipping cannot rely on two-phase commit. The saga pattern coordinates it through events, with compensating transactions running in reverse when something fails downstream.

Two flavors, and the operational gap between them is huge.

| Aspect | Choreography | Orchestration (Step Functions) |
| --- | --- | --- |
| Visibility | Invisible flow across repos | Explicit state machine, visual debugger |
| Debugging | Hours tracing events through CloudWatch | Single execution view, queryable history |
| Compensation | Scattered across services | Centralized, versioned |
| Complexity ceiling | Manageable at 2-3 steps | Scales to 10+ steps |
| Best for | Simple, 2-step workflows | 4+ step business processes |

For anything beyond three steps, orchestration wins on operability and it’s not close. Choreography looks beautiful on a whiteboard. Runners who just hand off to whoever is next. In production, tracing a compensation failure across four repos at 3x normal traffic volume is the kind of experience that makes people update their resumes. (Ask me how I know.)

Design compensation first. Every compensating transaction needs to be idempotent and handle partial success. A refund must work whether the charge was fully processed, partially processed, or authorized-not-captured. If your compensation logic only handles the happy failure path, the first partial failure in production will leave you with an order that cannot complete and cannot roll back. Stuck between states with no automated recovery.
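
A hedged sketch of what "handles partial success" means for the payment step, written against a hypothetical payments client whose get_charge, void_authorization, and refund methods stand in for whatever your provider exposes:

```python
def compensate_payment(order_id: str, payments) -> None:
    """Compensating transaction for the Charge Payment step.

    Idempotent, and covers every partial state, not just the fully-captured
    happy failure path.
    """
    charge = payments.get_charge(order_id)

    if charge is None:
        return  # payment never started; nothing to undo
    if charge.status == "REFUNDED":
        return  # already compensated; safe to re-run
    if charge.status == "AUTHORIZED":
        # Authorized but never captured: void instead of refunding.
        payments.void_authorization(charge.id)
    elif charge.status in ("CAPTURED", "PARTIALLY_CAPTURED"):
        # A provider-side idempotency key makes a retried refund a no-op.
        payments.refund(charge.id, idempotency_key=f"refund-{order_id}")
```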

Diagram: Saga pattern, forward execution and compensation rollback. An order saga runs Reserve Inventory, Charge Payment, and Ship Order; when Ship Order fails (shipping service unavailable), compensating transactions run in reverse order, refunding the payment and then releasing the inventory, so the saga rolls back cleanly with no partial state.

DLQ Discipline: Every Message Is a Business Failure

Teams set up DLQs and never look at them. A trickle of 2 messages per hour means 96 failed business operations over a weekend. An order not placed. A notification not sent. A payment not processed. Alert on depth above zero within 1 minute. Build the replay mechanism before you need it, not during the incident.
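
A sketch of that alarm with boto3, assuming a DLQ named orders-dlq and an existing SNS topic wired to your paging tool; the names and ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page within a minute of the first message landing in the DLQ.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=60,                      # evaluate every minute
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # your SNS topic
)
```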

Diagram: DLQ discipline, every failed message gets attention. A message fails after Lambda retries are exhausted and lands in the DLQ; an alarm on depth above zero pages on-call immediately; the on-call engineer reads the payload, checks the error in the logs, and fixes the root cause; then messages are replayed from the DLQ back onto the source queue. A DLQ with no alerting is a data graveyard: messages go in and never come out.

The replay mechanism needs its own idempotency handling. Messages replayed from the DLQ hit the same handler that originally failed. If the root cause was a transient infrastructure issue (downstream timeout, throttling), replay works cleanly. If the root cause was a data issue (malformed payload, schema mismatch), replaying without fixing the data produces the same failure. Categorize DLQ messages by failure type before replaying. Not every message gets the same treatment.
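
A sketch of a replay script that does that triage before re-enqueueing, assuming queue URLs for the DLQ and its source queue; the is_transient classifier is a stand-in for however your handlers record failure causes:

```python
import boto3

sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"        # placeholder
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"   # placeholder


def is_transient(message: dict) -> bool:
    # Stand-in classifier: in practice, inspect the error your handler attached
    # (message attribute, payload field) to separate infrastructure failures
    # from data failures.
    body = message.get("Body", "")
    return "Timeout" in body or "Throttling" in body


def replay_dlq(max_messages: int = 100) -> None:
    replayed = 0
    while replayed < max_messages:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            if is_transient(msg):
                # Re-enqueue to the source queue; the handler's idempotency key
                # protects against double effects if the original run got partway.
                sqs.send_message(QueueUrl=SOURCE_URL, MessageBody=msg["Body"])
                sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
                replayed += 1
            # Data issues (malformed payload, schema mismatch) stay in the DLQ
            # for manual correction; replaying them unchanged just fails again.
```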

Observability: Correlation IDs or Archaeology

Without correlation IDs, debugging a failed order across an event chain means hours of timestamp guessing across CloudWatch log groups. With them, it is one query returning in seconds. The math on that tradeoff is not complicated.

Generate a UUID at the API gateway entry point. Propagate it through every event payload, every log line, every error handler, and every DLQ record. Most teams wire correlation into the happy path and skip error paths and DLQ handlers. But the happy path isn’t where you need observability.
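
A sketch of that propagation in a Lambda handler that consumes SQS and publishes to the next hop. The attribute name correlation_id and the downstream queue URL are placeholders; the point is that the ID is read, logged, forwarded, and logged again on the error path:

```python
import json
import logging
import uuid

import boto3

logger = logging.getLogger()
sqs = boto3.client("sqs")

NEXT_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payments-queue"  # placeholder


def lambda_handler(event, context):
    record = event["Records"][0]
    attrs = record.get("messageAttributes", {})
    # Read the ID the upstream hop attached; generate one only if it is missing,
    # so the chain is never silently broken.
    correlation_id = attrs.get("correlation_id", {}).get("stringValue") or str(uuid.uuid4())

    try:
        logger.info(json.dumps({"correlation_id": correlation_id, "step": "order-received"}))
        # ... business logic ...

        # Hand the ID to the next hop in the event chain.
        sqs.send_message(
            QueueUrl=NEXT_QUEUE_URL,
            MessageBody=json.dumps({"correlation_id": correlation_id}),
            MessageAttributes={
                "correlation_id": {"DataType": "String", "StringValue": correlation_id},
            },
        )
    except Exception:
        # Error paths log the same ID; without this, the trace ends exactly
        # where you need it most.
        logger.exception(json.dumps({"correlation_id": correlation_id, "step": "order-failed"}))
        raise
```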

Diagram: Correlation IDs, one grep to find everything. API Gateway generates X-Correlation-Id: abc-123-def; Lambda A logs with the ID and passes it in SQS message attributes; Lambda B reads it and logs with it; a single CloudWatch Logs query for abc-123-def reconstructs the full request path. Without correlation IDs, debugging serverless is archaeology.

Invest in observability tooling early. Retrofitting correlation IDs into a live event chain means touching every handler, every schema, every log statement. Wire it in from day one.

When Serverless Stops Making Sense

| Signal | Threshold | Better Alternative |
| --- | --- | --- |
| Sustained invocations | Above ~10,000/minute, steady traffic | Containers (ECS/Kubernetes) |
| Latency requirement | Sub-100ms consistently | Containers with minimum replicas |
| Provisioned concurrency cost | Approaches container cost | Containers (you lost the cost advantage) |
| Execution duration | Hitting the 15-minute timeout regularly | Step Functions or containers |
| State across requests | In-memory state required | Containers with sticky sessions |

Above roughly 10,000 invocations per minute with consistent traffic, containers cost materially less. When you need sub-100ms latency with variable traffic, containers with minimum replicas beat provisioned concurrency on cost. Provisioned concurrency eliminates cold starts but also eliminates the pricing advantage. At that point you are paying container prices for serverless operational constraints. Distributed systems that need consistent latency should evaluate the crossover point carefully.

What the Industry Gets Wrong About Event-Driven Serverless

“SQS guarantees exactly-once delivery.” SQS guarantees at-least-once. During traffic spikes, duplicate deliveries are expected behavior, not a bug. If your handler is not idempotent, every spike produces duplicate processing. The infrastructure is working correctly. Your code is not accounting for it.

“Serverless eliminates infrastructure concerns.” Serverless eliminates server management. It introduces different concerns: cold starts, concurrent execution limits, 15-minute timeout walls, DLQ management, and correlation tracking across asynchronous invocations. Different concerns. Not fewer.

Our take: Wire idempotency keys, DLQ alerting, and correlation IDs before the first production event. Not after the first duplicate incident. These three controls take days to implement. The incidents they prevent take weeks to investigate and months of user trust to rebuild. Treating them as post-launch improvements is how teams end up explaining duplicate charges to their VP of Engineering.

That pipeline with duplicate charges piling up? Same spike. Same relay race. Idempotency keys catch the duplicates. Runners who received the baton twice only run once. DLQ alerting flags the payment failures within a minute. Dropped batons found in seconds. Correlation IDs trace every order from entry to finish in a single query. Every baton tracked. Saga compensation releases the stuck inventory automatically. The referee reverses the handoffs. Every order processes exactly once. Scaling was never the problem.

Build Event-Driven Serverless That Survives Production

Event-driven serverless that works in demos fails silently in production when idempotency is missing, sagas lack compensation, and DLQs fill without alerting. Build failure handling, correlation ID propagation, and recovery logic in from day one.

Fix Your Event Pipeline

Frequently Asked Questions

What makes a workload well-suited for serverless functions?

Serverless excels for event-triggered workloads with spiky or unpredictable traffic, stateless tasks completing in seconds to minutes, and glue logic orchestrating managed services. It performs poorly for sustained high-throughput above roughly 10,000 invocations per minute where per-invocation overhead exceeds container cost, anything requiring sub-100ms consistent latency due to 100-3,000ms cold starts, and stateful workloads needing in-memory state across requests.

What is idempotency and why is it required for event-driven systems?

An idempotent operation produces the same result no matter how many times it runs with the same input. Event systems use at-least-once delivery, so your function will receive duplicates due to retries. Without idempotency, duplicates produce double charges, duplicate notifications, or corrupted records. Implementation requires an atomic check-and-record using the event ID via DynamoDB conditional writes or database transactions. The atomicity is where most implementations break.

What is the saga pattern and when should I use it?

The saga pattern manages distributed transactions across multiple services without two-phase commit. Each step publishes an event triggering the next, with compensating transactions running in reverse on failure. Use sagas for business processes spanning 3+ services that need consistency guarantees. For 4+ step sagas, AWS Step Functions provides clear operational advantages over hand-rolled choreography, including visible state machines and queryable execution history.

How do dead letter queues work and what should we do with the messages?

A DLQ receives messages that failed processing after exhausting retries, typically 3-5 attempts with exponential backoff. Every DLQ message is a confirmed business failure: an order not placed, a notification not sent. Alert on DLQ depth above zero within 1 minute. After fixing the root cause, replay accumulated messages in order. A DLQ filling silently is a monitoring gap hiding active production failures.

How do you debug an event chain spanning multiple Lambda functions?

Inject a trace ID at the entry point and propagate it through every event payload and log line. AWS X-Ray, Datadog, or Honeycomb reconstruct the full trace from correlated spans. Without correlation IDs, reconstructing a failed order across several Lambda functions takes hours of CloudWatch timestamp archaeology and guesswork. With them, it is a single query returning in seconds.