Serverless Data Processing: Pay for What Runs
You inherited a Spark cluster that runs 22 hours a day processing data that arrives in 3 bursts. During those bursts, the cluster hits 80% utilization. The other 18 hours it sits at 4%, burning compute on idle executors: a full-time kitchen staff sitting around between three meal rushes. You can feel the money evaporating. Serverless could eliminate that idle cost. Hire caterers for the events; send them home between meals. But the last time someone tried moving a pipeline to Lambda, the largest partition hit the 15-minute timeout and the whole migration got rolled back while the data team watched. The caterer who leaves at a fixed time, whether or not dessert has been served.
Serverless data processing works great for the right workloads. It’s also a mess for the wrong ones. The difference comes down to partition strategy, orchestration choices, and knowing exactly where the cost crossover lives.
- A Spark cluster at 4% utilization for 18 hours/day is the textbook case for serverless. Bursty workloads with long idle periods save a lot on compute.
- The 15-minute Lambda timeout is a binary wall, not a graceful limit. The caterer who walks out at 15 minutes regardless. Partition your data so no single invocation exceeds 10 minutes. The largest partition determines whether Lambda works at all.
- Step Functions orchestrate multi-stage pipelines with retry, error handling, and state persistence. Never chain Lambdas through direct invocation.
- Exactly-once processing does not exist. Idempotent design with deduplication keys approximates it so closely the distinction stops mattering.
- Cost crossover happens at moderate sustained utilization. Once your kitchen stays busy more often than it sits idle, full-time staff pull ahead.
The Workload Fitness Test
Not every pipeline belongs on Lambda. Apache Flink handles stateful streams where you need windowed counts and session tracking. Spark excels at heavy joins across terabyte-scale datasets. Serverless works when each piece of work stands alone and finishes within the timeout. Concretely, all of the following should hold:
- Each processing unit completes independently without cross-partition state
- No single partition requires more than 10 minutes of compute
- Workload has idle periods where zero-cost scaling delivers real savings
- Downstream consumers tolerate eventual consistency (minutes, not seconds)
- Data volume per run stays under 500GB
| Workload characteristic | Serverless (Lambda + Step Functions) | Spark (EMR / Glue) |
|---|---|---|
| Event-driven triggers | Native S3/SQS/EventBridge triggers, zero idle cost | Cluster must be running or takes minutes to start |
| Partition-friendly transforms | Each file processed independently, linear scale-out | Overkill for embarrassingly parallel work |
| Shuffle-heavy joins | Impossible without external coordination | Built-in shuffle, sort-merge join, broadcast join |
| Stateful processing | No built-in state management | Windows, sessionization, iterative ML |
| Bursty schedule | Scales to zero between runs | Cluster idles or needs warm-up time |
| Sustained throughput | Cost escalates with duration | Reserved pricing wins above 3-4 hours/day |
The same decision through measurable signals, including the middle tier the first table leaves out:

| Signal | Serverless (Lambda/Step Functions) | Spark (EMR/Glue) | The Gap (Fargate/Glue) |
|---|---|---|---|
| Trigger pattern | Event-driven: S3 object arrives, SQS message, EventBridge schedule | Scheduled batch: daily/hourly ETL runs | Scheduled but lightweight |
| Data shape | Each file independent, no cross-partition joins | Large table joins, global aggregations, data redistribution | Medium joins, moderate shuffle |
| Processing time | Sub-10 min per chunk | Hours of sustained compute | 10 min to 3 hours |
| Traffic pattern | Bursty: 3 bursts/day, zero cost between | Sustained: 3+ hours daily, predictable volume | Moderate: too long for Lambda, too short for Spark |
| State needs | Stateless per invocation | Window functions, sessionization, iterative ML | Light state via checkpoints |
| Cost model | Per-invocation. Wins when idle time > 60% | Per-cluster-hour. Wins at sustained throughput | Per-vCPU-second. Middle ground |
The “too big for Lambda, too small for Spark” gap catches more teams than either extreme. AWS Glue (Spark with per-second billing) or Fargate tasks fill it. If your transforms need 20 minutes but not 3 hours, that middle tier is where you belong.
The 15-Minute Wall
In practice, you get about 10 minutes of safe processing time once you factor in cold starts, S3 uploads, and graceful shutdown. Chunk the dataset so no invocation exceeds that margin. Simple format conversions handle 500MB-1GB per Lambda. Complex transformations with parsing and enrichment: 50-100MB per chunk.
Failed chunks retry on their own. In Spark, one partition failure often restarts the whole stage. At terabyte scale, that’s the difference between a 30-second retry and a 45-minute restart.
Don’t: Set the Lambda timeout to 15 minutes and let it race the clock. One slow S3 write at minute 14 and the entire invocation fails without completing its output, leaving partial data that poisons downstream consumers.
Do: Target 10-minute processing with 5 minutes of safety margin. If a chunk approaches 8 minutes, that chunk needs further subdivision, not a longer timeout.
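A minimal chunk-planning sketch of that rule in Python. The throughput constant, bucket layout, and 8-minute budget are assumptions you would calibrate against a representative test run, not fixed numbers:

```python
import boto3

# Assumed throughput, measured from a representative test run of the transform.
BYTES_PER_SECOND = 20 * 1024 * 1024   # ~20 MB/s for a parse+enrich workload
SAFE_BUDGET_SECONDS = 8 * 60          # subdivide anything projected past 8 minutes
MAX_CHUNK_BYTES = BYTES_PER_SECOND * SAFE_BUDGET_SECONDS

s3 = boto3.client("s3")

def plan_chunks(bucket: str, prefix: str) -> list[list[dict]]:
    """Group objects under a prefix into chunks that fit the time budget."""
    paginator = s3.get_paginator("list_objects_v2")
    chunks, current, current_bytes = [], [], 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Size"] > MAX_CHUNK_BYTES:
                # A single object already blows the budget: split it by byte
                # range downstream (S3 supports ranged GETs) rather than
                # raising the Lambda timeout.
                chunks.append([{"key": obj["Key"], "split": True}])
                continue
            if current_bytes + obj["Size"] > MAX_CHUNK_BYTES:
                chunks.append(current)
                current, current_bytes = [], 0
            current.append({"key": obj["Key"], "split": False})
            current_bytes += obj["Size"]
    if current:
        chunks.append(current)
    return chunks
```

The point of the throughput constant is that chunk size is a derived number, not a guess: measure once, then let the planner enforce the budget.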
Fan-Out Orchestration and Backpressure
Fan-out is where serverless pipelines get their speed. One S3 event triggers a coordinator Lambda that distributes 1,000 chunks across 1,000 concurrent invocations. The whole pipeline finishes in minutes instead of hours.
But fan-out without backpressure is a footgun. Set MaximumConcurrency on the SQS event source mapping. Without it, 10,000 backlogged messages launch 10,000 concurrent Lambdas that overwhelm DynamoDB write capacity or exhaust your account’s concurrency limit. The pipeline doesn’t slow down gracefully. It crashes.
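Capping the fan-out is one API call. A boto3 sketch, assuming an existing queue and worker function (both names hypothetical); `ScalingConfig.MaximumConcurrency` is the event source mapping setting that bounds how many concurrent Lambdas the SQS poller will run:

```python
import boto3

lambda_client = boto3.client("lambda")

# Bound fan-out at 50 concurrent workers so a backlog drains steadily
# instead of stampeding DynamoDB or the account concurrency limit.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:chunk-queue",  # hypothetical
    FunctionName="process-chunk",                                     # hypothetical
    BatchSize=10,
    ScalingConfig={"MaximumConcurrency": 50},
)
```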
For fan-in collection, each worker writes to S3 with deterministic key patterns. The final step lists the output prefix and merges results. Use Step Functions Express workflows for high-fan-out orchestration. Standard workflows charge per state transition, and at 10,000+ items that cost compounds fast. Express workflows cut orchestration cost by 10x or more.
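Below is a fan-in sketch under those assumptions. The `results/{run_id}/` key layout is hypothetical, and the completeness check only works because workers write deterministic keys:

```python
import json
import boto3

s3 = boto3.client("s3")

def merge_results(bucket: str, run_id: str, expected_chunks: int) -> list:
    """List every worker's output under the run prefix and merge it."""
    prefix = f"results/{run_id}/"
    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    # Deterministic keys make completeness a simple count check.
    if len(keys) != expected_chunks:
        raise RuntimeError(f"expected {expected_chunks} outputs, found {len(keys)}")
    merged = []
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        merged.extend(json.loads(body))  # assumes each worker wrote a JSON array
    return merged
```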
Idempotency Over Exactly-Once Illusions
True exactly-once delivery doesn’t exist in distributed systems. Every retry can create a duplicate. The real answer is making duplicates harmless through idempotent design.
S3 writes are naturally idempotent. Writing the same object twice produces the same result. For database writes, generate a deterministic key from the input: a SHA256 hash of the S3 key plus chunk offset. Check DynamoDB with a conditional PutItem before processing. Key exists? Skip the work and return success.
Set a 24-hour TTL on the deduplication table. Without it, the table grows forever and DynamoDB costs creep up until someone notices months later. Data engineering pipelines that bake in idempotency from day one skip the painful retrofit that every “just get it working” pipeline eventually needs.
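A minimal sketch of that dedup gate, assuming a DynamoDB table named `dedup` with partition key `pk` and TTL enabled on an `expires_at` attribute (all names hypothetical):

```python
import hashlib
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def already_processed(s3_key: str, chunk_offset: int) -> bool:
    """Claim the chunk with a conditional write; True means a duplicate."""
    dedup_key = hashlib.sha256(f"{s3_key}:{chunk_offset}".encode()).hexdigest()
    try:
        dynamodb.put_item(
            TableName="dedup",  # hypothetical table, TTL enabled on expires_at
            Item={
                "pk": {"S": dedup_key},
                "expires_at": {"N": str(int(time.time()) + 24 * 3600)},
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return False          # first claim wins; proceed with the work
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True       # a retry or duplicate event; skip and return success
        raise
```

The conditional write doubles as the lock: whichever invocation claims the key first does the work, and every retry after that short-circuits to success.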
The Small-File Problem
Five hundred concurrent Lambdas produce 500 small files. Query engines like Athena pay a per-file tax: every file has to be listed, opened, and its footer metadata read before a single row is scanned. Thousands of tiny Parquet files make queries painfully slow compared to a few hundred properly sized ones.
Target: 128-256MB Parquet files with Snappy compression. Partition by the columns your queries actually filter on. Date is almost always right. Over-partitioning on multiple columns creates a directory tree with thousands of near-empty files, which is worse than no partitioning at all.
A compaction step is unavoidable. Schedule a Glue job or a Lambda triggered by S3 Inventory to merge small files into the right sizes; a sketch of the Lambda-triggered variant follows the table below. This is the least exciting part of the architecture. It’s also the part that decides whether your analytics queries are fast or miserable.
| Approach | Effort | Ongoing cost | Query impact |
|---|---|---|---|
| No compaction (raw Lambda output) | None | None | Catastrophic. Thousands of file opens per query. |
| Scheduled Glue compaction | Medium (days) | Low. Per-second Glue billing. | Excellent. Optimal file sizes for Athena/Spark. |
| Lambda-triggered compaction | Medium (days) | Very low. Runs only when needed. | Good. Slightly less optimal sizing. |
| Write directly to Iceberg/Hudi | High (weeks) | Medium. Table maintenance overhead. | Excellent. Built-in compaction and time travel. |
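For the Lambda-triggered row above, here is a compaction sketch using pyarrow (one choice among several; Glue or DuckDB would do the same job). The `raw/` and `compacted/` prefixes are hypothetical, and this version assumes a single partition fits in the Lambda’s memory:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

def compact_partition(bucket: str, partition: str) -> None:
    """Merge the small Parquet files under one partition into a single file."""
    src = f"{bucket}/raw/{partition}"        # hypothetical layout: raw/dt=2024-01-01/
    dst = f"{bucket}/compacted/{partition}"
    # Loads the whole partition into memory; beyond a few GB, stream row
    # groups or hand the job to Glue instead.
    table = ds.dataset(src, format="parquet", filesystem=s3).to_table()
    pq.write_table(
        table,
        f"{dst}/part-0.parquet",
        filesystem=s3,
        compression="snappy",
        row_group_size=1_000_000,  # tune so files land in the 128-256MB target
    )
```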
Orchestration: Step Functions vs. Airflow
Two orchestrators dominate serverless data pipelines, and they solve different problems.
Step Functions is serverless and AWS-native. Built-in retry, timeout, and error handling. The ASL (Amazon States Language) definition gets wordy for complex branching, but for straight-line and fan-out pipelines it’s clean. No servers to manage. You pay per state transition.
Airflow is Python-native and works across systems. Full visibility into DAG run history. Rich set of connectors for databases, SaaS APIs, and on-prem systems. But it needs a running scheduler, and managed Airflow (MWAA) costs money whether your pipelines run or not.
| When Step Functions fits | When Airflow fits |
|---|---|
| Pipeline is entirely within AWS | Pipeline spans cloud providers, SaaS, on-premises |
| Fan-out to hundreds of concurrent Lambdas | Complex dependency graphs with conditional branching |
| Zero idle cost matters (no always-on scheduler) | Team already runs Airflow for other workloads |
| Pipeline logic is straightforward (ingest, transform, load) | Pipeline requires Python-level orchestration logic |
Already running Airflow? Add Lambda steps to your existing DAGs. Don’t run two orchestrators. The pain of keeping both running outweighs whatever architectural neatness you get from picking the “right” one for each job.
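If Airflow is your orchestrator, a Lambda step drops into an existing DAG as one operator from the Amazon provider package. A sketch, with the DAG id and function name as placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)

with DAG(
    dag_id="nightly_ingest",          # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Hand the bursty chunk processing to Lambda while the rest of the
    # DAG keeps its existing operators.
    transform = LambdaInvokeFunctionOperator(
        task_id="transform_chunks",
        function_name="process-chunk",   # hypothetical Lambda
        payload='{"run_date": "{{ ds }}"}',  # Jinja-templated by Airflow
    )
```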
What the Industry Gets Wrong About Serverless Data Processing
“Move everything to Lambda.” Lambda has a 15-minute hard timeout. Any partition that takes longer fails irrecoverably. Serverless Spark (EMR Serverless, Databricks Serverless) handles long-running jobs without that wall. Forcing a 45-minute transform into Lambda isn’t a serverless migration. It’s creative chunking that eventually collapses under data skew.
“Serverless is always cheaper for data processing.” When a cluster sits idle more than it runs, serverless wins by killing idle costs. Once utilization tips past the halfway mark, reserved containers pull ahead. A Spark cluster that genuinely sustains 80% utilization for 22 hours a day isn’t a serverless candidate. A few hours of bursty ingestion followed by 18 hours at 4%? That is.
That Spark cluster burning money at 4% utilization for 18 hours? The bursts run on Lambda now. The sustained transforms run on reserved containers. Idle cost dropped to near zero. The architecture got less elegant and the bill got a lot smaller. Sometimes the boring answer is the right one.