
Serverless Event-Driven Patterns: Sagas, DLQs, Idempotency

Metasphere Engineering · 9 min read

Your order processing pipeline handles 50 events per minute in staging without a hitch. Black Friday hits, traffic spikes to 50,000 events per minute, and Lambda scales to meet it. That part works exactly as advertised. Then three things blow up at once. Two percent of orders are processed twice because SQS delivers duplicates during the spike and your handler is not idempotent. That is 1,000 duplicate charges and a wave of support tickets before anyone even notices. A downstream payment failure leaves 400 orders stuck in a half-processed state: inventory reserved, no payment collected, no automated recovery path. And the alerting that should have caught all of this? Dead silent. Nobody wired correlation IDs through the event chain, so the monitoring could not connect the dots.

This exact scenario plays out multiple times a year. The scaling part of serverless is a solved problem. It is genuinely boring at this point. The engineering discipline around idempotency, failure handling, and observability is where every team gets burned. These are solvable problems, but they require deliberate architecture, not a retreat back to monoliths. A disciplined serverless engineering approach addresses each of these failure modes directly.

Idempotency: The Non-Negotiable Foundation

At-least-once delivery is the guarantee SQS, EventBridge, and most event systems provide. Exactly-once delivery either does not exist in your system or is prohibitively expensive to achieve. The practical implication is non-negotiable: your event handlers must be idempotent. And making them idempotent requires explicit engineering. There is no shortcut.

The standard pattern is event ID deduplication with DynamoDB. When an event arrives, use a conditional write to attempt inserting the event ID into a deduplication table. If the conditional write succeeds (the ID did not exist), process the event and update the record to mark completion. If it fails (the ID already exists), skip processing and acknowledge the event. The conditional write makes the check-and-record atomic.

Here is where most implementations break. This is the mistake that catches every team eventually. The processing step between the initial conditional write and the completion marker is the danger zone. If your handler records the event ID, processes the business logic, but crashes before marking the event as complete, a naive implementation sees the ID already present on the next retry and skips the event, silently dropping it. You need a status field on the deduplication record: PROCESSING on first receipt, COMPLETE after successful processing. Retries that find a PROCESSING record older than your expected max processing time (say, 5 minutes) should re-attempt, because the first attempt likely crashed.
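The claim-then-complete flow above can be sketched as follows, with an in-memory dict standing in for the DynamoDB deduplication table. The function names, field names, and 5-minute timeout are illustrative; in production, the claim would be a single put_item with a ConditionExpression such as attribute_not_exists(event_id) so the check-and-record stays atomic.

```python
import time

PROCESSING_TIMEOUT_SECONDS = 300  # expected max processing time (5 minutes)

# In-memory stand-in for a DynamoDB deduplication table (illustrative only).
_dedup_table = {}

def claim_event(event_id, now=None):
    """Atomically claim an event for processing.

    Returns True if this caller should process the event,
    False if the event should be skipped and acknowledged.
    """
    now = now if now is not None else time.time()
    record = _dedup_table.get(event_id)
    if record is None:
        # First receipt: record PROCESSING and proceed.
        _dedup_table[event_id] = {"status": "PROCESSING", "started_at": now}
        return True
    if record["status"] == "COMPLETE":
        # Duplicate delivery of an already-processed event: skip it.
        return False
    # PROCESSING: re-attempt only if the first attempt likely crashed.
    if now - record["started_at"] > PROCESSING_TIMEOUT_SECONDS:
        _dedup_table[event_id] = {"status": "PROCESSING", "started_at": now}
        return True
    return False

def mark_complete(event_id):
    """Mark the event as fully processed after the business logic succeeds."""
    _dedup_table[event_id]["status"] = "COMPLETE"
```

A handler would call claim_event on receipt, run the business logic only when it returns True, and call mark_complete as the final step.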

For payment and inventory operations specifically, derive the idempotency key from the business operation’s natural identifier, not the infrastructure event ID. Use order-123-payment-attempt-1 rather than the SQS message ID. This ensures idempotency survives queue redelivery scenarios where the infrastructure event ID changes but the business operation is the same. Get this wrong and you will process the same charge twice with two different message IDs.
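Deriving the key from the business operation can be as simple as the sketch below (the field names are illustrative, not a real schema):

```python
def payment_idempotency_key(order):
    # Derived from the business operation's natural identifier, not the
    # SQS message ID: the same order/attempt pair always yields the same
    # key, even when the queue redelivers the event under a new message ID.
    return f"order-{order['order_id']}-payment-attempt-{order['attempt']}"
```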

Sagas: Designing the Failure Path First

A business process spanning multiple services (reserving inventory, charging payment, creating a shipping record) cannot use a database transaction across service boundaries. You already know this. The saga pattern handles it with a sequence of local transactions and compensating transactions for rollback.

Two implementation approaches exist. Choosing the wrong one will cost you months.

Choreography-based sagas have each service publish events triggering the next step. No central coordinator. Maximum decoupling. Sounds great in a design meeting. The problem is visibility. With 5 services each publishing events to trigger the next step, the overall business process flow is invisible. It only exists in the aggregate behavior of independent event handlers. When step 4 of 5 fails, figuring out which upstream steps need compensation means reading code across 5 repositories. Teams routinely spend 8 hours or more tracing compensation logic because nobody had a single view of the overall flow. That is not debugging. That is archaeology.

Orchestration-based sagas use a central coordinator, typically AWS Step Functions, that calls each service in sequence and manages compensation on failure. The flow is explicit and visible in the Step Functions console. Individual step failures show their error, timing, and input. Retry and backoff policies are declarative. The compensation path is defined alongside the happy path in a single state machine definition.

For workflows with more than 3-4 steps or non-trivial compensation logic, Step Functions wins decisively. Do not overthink this. The cost optimization math also favors it: a choreography saga with 5 steps generates at minimum 5 Lambda invocations and 4 intermediate queue messages per transaction. The equivalent Step Functions workflow is a single execution with 5 states, often cheaper at scale and significantly easier to debug.

When a step in the saga fails, the compensating transactions must execute in reverse order to undo the effects of previously completed steps. Design these compensation paths before implementing the forward path. This is not optional advice. It reveals the hard failure modes early, including the nasty cases where compensation itself can fail.
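The reverse-order compensation logic can be sketched as a minimal saga runner. This is an illustrative in-process sketch of the control flow, not a Step Functions implementation; in practice each action and compensation would be a service call, and the orchestrator would persist state between steps.

```python
def run_saga(steps, ctx):
    """Execute saga steps in order; on any failure, run the compensations
    of all previously completed steps in reverse order.

    Each step is a (name, action, compensate) tuple. Actions and
    compensations receive a shared context dict they may mutate.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action(ctx)
            completed.append((name, compensate))
        except Exception:
            # Roll back: compensate completed steps newest-first.
            for _, comp in reversed(completed):
                comp(ctx)
            return False  # saga rolled back
    return True  # saga committed
```

For the order saga in the diagram, a Ship Order failure after Reserve Inventory and Charge Payment succeed would trigger Refund Payment, then Release Inventory, in that order.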

Figure: Saga pattern with compensation rollback. An order saga executes three steps: Reserve Inventory (step 1), Charge Payment (step 2), and Ship Order (step 3). Ship Order fails because the shipping service is unavailable, triggering compensating transactions in reverse order: Refund Payment (undo step 2), then Release Inventory (undo step 1). The saga rolls back cleanly, all resources released, no partial state.

The hardest part of saga design is not the happy path. It never is. The hard part is designing the compensation path first. Every step that can succeed needs a corresponding compensating transaction that undoes its effects. Every compensating transaction must itself be idempotent and handle partial success of the original step. A payment refund must work whether the original charge was fully processed, partially processed, or authorized but not captured. Design compensation first. It reveals the hard cases before they show up in production with 400 orders in an unrecoverable state and your on-call engineer staring at a Slack thread at 2 AM.
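A payment compensation that handles every state the original step can leave behind might look like the sketch below. The state names are illustrative, not a real payment provider's API; the point is that the function is safe to run any number of times against any of those states.

```python
def compensate_payment(charge):
    """Idempotent refund compensation for a charge record.

    Handles full capture, authorization without capture, and a charge
    that never started. Re-running against an already-compensated
    record is a no-op.
    """
    state = charge.get("state", "NONE")
    if state == "CAPTURED":
        charge["state"] = "REFUNDED"   # undo a fully processed charge
    elif state == "AUTHORIZED":
        charge["state"] = "VOIDED"     # release a hold that was never captured
    # REFUNDED, VOIDED, NONE: nothing left to undo.
    return charge
```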

DLQ Discipline: Every Message Is a Business Failure

A dead letter queue receives messages that exhausted their retry budget, typically 3-5 attempts with exponential backoff. The DLQ prevents infinite retry loops while preserving failed events for investigation. So far, so straightforward. Here is where teams consistently drop the ball.

They set up DLQs and then never monitor them. A DLQ accumulating messages at 2 per hour means 2 business transactions per hour are failing silently. An order not placed. A notification not sent. Data not synchronized. Over a weekend, that is 96 unprocessed business operations that nobody noticed. Every single one is a customer impact.

The non-negotiable DLQ practice: alert on depth above zero within 1 minute. Not “alert when it reaches 100.” Not “check it during the morning standup.” Zero tolerance for silent accumulation. When the alert fires, investigate immediately. After fixing the root cause, replay the accumulated messages in order using a dedicated replay Lambda that reads from the DLQ and resubmits to the original queue.

Build the replay mechanism before you need it. Writing message replay code during an incident is how you turn a bad situation into a catastrophe.
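The replay Lambda's core loop can be sketched with plain lists standing in for the SQS queues (in production these would be receive_message, send_message, and delete_message calls against the real DLQ and source queue; the max_messages cap is an illustrative safety limit):

```python
def replay_dlq(dlq, source_queue, max_messages=1000):
    """Drain the DLQ in order and resubmit each message to the original
    queue. A message is removed from the DLQ only after its resubmit
    succeeds, so a crash mid-replay loses nothing.
    """
    replayed = 0
    while dlq and replayed < max_messages:
        message = dlq[0]              # receive the oldest message first
        source_queue.append(message)  # resubmit to the original queue
        dlq.pop(0)                    # delete from the DLQ only after resubmit
        replayed += 1
    return replayed
```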

Observability: Correlation IDs or Chaos

Debugging a synchronous system means reading a stack trace. Debugging an event-driven system means reconstructing a timeline from correlated logs across multiple functions, queues, and databases. Without correlation IDs, that reconstruction is a 2-4 hour detective exercise involving timestamp approximation and guesswork across CloudWatch log groups. With correlation IDs, it is one query. 90 seconds. Done.

The implementation is straightforward but must be rigorous. No exceptions. At the entry point (API Gateway, the first event producer), generate a UUID trace ID. Every event published downstream includes { "traceId": "abc-123", "orderId": "order-456" } in its payload. Every Lambda function logs { "traceId": "abc-123", "step": "charge-payment", "status": "success", "durationMs": 340 } on every significant operation. CloudWatch Insights or Datadog reconstructs the complete execution timeline from a single filter traceId = 'abc-123' query.

Here is the mistake that bites teams hardest: implementing correlation IDs in the happy path but not in error paths, DLQ handlers, and compensation logic. Think about that for a second. When you are debugging a failure, the error path is exactly where you need the trace. Every catch block, every DLQ handler, every compensation function must log the trace ID. The observability and monitoring investment here is small. The debugging time it saves during incidents is enormous.
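A minimal sketch of that discipline in a single handler, covering both the happy path and the error path (the step name and payload fields are illustrative):

```python
import json
import uuid

def log_event(trace_id, step, status, **fields):
    # One structured line per significant operation. The traceId appears
    # on every line, error paths included, so a single query can
    # reconstruct the full timeline.
    print(json.dumps({"traceId": trace_id, "step": step, "status": status, **fields}))

def handle(event):
    # Propagate the upstream trace ID; generate one only at the entry point.
    trace_id = event.get("traceId") or str(uuid.uuid4())
    try:
        log_event(trace_id, "charge-payment", "started")
        result = {"traceId": trace_id, "orderId": event["orderId"], "charged": True}
        log_event(trace_id, "charge-payment", "success")
        return result  # downstream events carry the same traceId
    except Exception as exc:
        # The catch block logs the trace ID too: this is exactly where
        # the trace is needed most.
        log_event(trace_id, "charge-payment", "error", error=str(exc))
        raise
```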

When to Move Back to Containers

Serverless is not always the right answer. Knowing when to migrate a workload back to containers is just as important as knowing when to go serverless in the first place. Blind loyalty to any architecture pattern is the wrong approach.

The crossover point is sustained throughput. Below roughly 10,000 invocations per minute, Lambda’s per-invocation pricing and zero-idle-cost model usually win. Above that, especially for workloads with consistent traffic patterns rather than spikes, the per-invocation overhead exceeds what you would pay for reserved Fargate tasks or EKS pods running continuously. Teams routinely save 40-60% by moving sustained-traffic workloads from Lambda to Fargate after initial serverless prototyping proves the architecture. That is real money.
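The back-of-envelope math looks like this. The function takes all prices as arguments; the numbers in the usage note are illustrative assumptions, not current list prices, so check AWS pricing before using them for a real decision.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month approximation

def lambda_monthly_cost(invocations_per_minute, avg_duration_s, memory_gb,
                        price_per_million_requests, price_per_gb_second):
    """Back-of-envelope monthly Lambda cost for a sustained workload.

    All prices are caller-supplied assumptions; compare the result
    against the fixed monthly cost of always-on containers.
    """
    invocations = invocations_per_minute * MINUTES_PER_MONTH
    request_cost = invocations / 1_000_000 * price_per_million_requests
    compute_cost = invocations * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost
```

With illustrative inputs of 10,000 invocations per minute, 200ms average duration, 512MB memory, $0.20 per million requests, and $0.0000166667 per GB-second, the sustained workload lands near $800 per month, which is the kind of figure to weigh against always-on Fargate tasks sized for the same load.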

The other migration trigger is cold start sensitivity. If your use case requires consistent sub-100ms response times and you cannot use provisioned concurrency (because traffic is too variable to pre-warm efficiently), containers with a minimum replica count give you the consistent latency that distributed systems often demand. Provisioned concurrency can work, but it costs roughly as much as running containers, which eliminates the cost advantage: you end up paying container prices for serverless constraints. Just run containers.

Build Event-Driven Serverless That Survives Production

Event-driven serverless architectures that work in demos fail silently in production when idempotency is missing, sagas lack compensation logic, and DLQs fill without alerting. Metasphere builds event-driven systems with proper failure handling, observability, and recovery from day one.


Frequently Asked Questions

What makes a workload well-suited for serverless functions?


Serverless excels for event-triggered workloads with spiky or unpredictable traffic, stateless tasks completing in seconds to minutes, and glue logic orchestrating managed services. It performs poorly for sustained high-throughput above roughly 10,000 invocations per minute where per-invocation overhead exceeds container cost, anything requiring sub-100ms consistent latency due to 100-3,000ms cold starts, and stateful workloads needing in-memory state across requests.

What is idempotency and why is it required for event-driven systems?


An idempotent operation produces the same result no matter how many times it runs with the same input. Event systems use at-least-once delivery, so your function will receive duplicates due to retries. Without idempotency, duplicates produce double charges, duplicate notifications, or corrupted records. Implementation requires an atomic check-and-record using the event ID via DynamoDB conditional writes or database transactions. The atomicity is where most implementations break.

What is the saga pattern and when should I use it?


The saga pattern manages distributed transactions across multiple services without two-phase commit. Each step publishes an event triggering the next, with compensating transactions running in reverse on failure. Use sagas for business processes spanning 3+ services that need consistency guarantees. For 4+ step sagas, AWS Step Functions provides significant operational advantages over hand-rolled choreography, including visible state machines and queryable execution history.

How do dead letter queues work and what should we do with the messages?


A DLQ receives messages that failed processing after exhausting retries, typically 3-5 attempts with exponential backoff. Every DLQ message is a confirmed business failure: an order not placed, a notification not sent. Alert on DLQ depth above zero within 1 minute. After fixing the root cause, replay accumulated messages in order. A DLQ filling silently is a monitoring gap hiding active production failures.

How do you debug an event chain spanning multiple Lambda functions?


Inject a trace ID at the entry point and propagate it through every event payload and log line. AWS X-Ray, Datadog, or Honeycomb reconstruct the full trace from correlated spans. Without correlation IDs, reconstructing a failed order across 5 Lambda functions takes 2-4 hours using CloudWatch timestamps and guesswork. With them, it is a single query taking under 2 minutes.