
Serverless Data Processing: Pay for What Runs

Metasphere Engineering

You inherited a Spark cluster that runs 22 hours a day processing data that arrives in 3 bursts. During those bursts, the cluster hits 80% utilization. The other 18 hours it sits at 4%, burning compute on idle executors. A full-time kitchen staff sitting around between three meal rushes. You can feel the money evaporating. Serverless could eliminate that idle cost. Hire caterers for the events. Fire them between meals. But the last time someone tried moving a pipeline to Lambda, the largest partition hit the 15-minute timeout. The whole migration got rolled back while the data team watched. The caterer who leaves at a fixed time regardless of whether dessert is served.

Serverless data processing works great for the right workloads. It’s also a mess for the wrong ones. The difference comes down to partition strategy, orchestration choices, and knowing exactly where the cost crossover lives.

Key takeaways
  • A Spark cluster at 4% utilization for 18 hours/day is the textbook case for serverless. Bursty workloads with long idle periods save a lot on compute.
  • The 15-minute Lambda timeout is a binary wall, not a graceful limit. The caterer who walks out at 15 minutes regardless. Partition your data so no single invocation exceeds 10 minutes. The largest partition determines whether Lambda works at all.
  • Step Functions orchestrate multi-stage pipelines with retry, error handling, and state persistence. Never chain Lambdas through direct invocation.
  • Exactly-once processing does not exist. Idempotent design with deduplication keys approximates it so closely the distinction stops mattering.
  • Cost crossover happens at moderate sustained utilization. Once your kitchen stays busy more often than it sits idle, full-time staff pull ahead.

The Workload Fitness Test

Not every pipeline belongs on Lambda. Apache Flink handles stateful streams where you need windowed counts and session tracking. Spark excels at heavy joins across terabyte-scale datasets. Serverless works when each piece of work stands alone and finishes within the timeout.

Prerequisites
  1. Each processing unit completes independently without cross-partition state
  2. No single partition requires more than 10 minutes of compute
  3. Workload has idle periods where zero-cost scaling delivers real savings
  4. Downstream consumers tolerate eventual consistency (minutes, not seconds)
  5. Data volume per run stays under 500GB
Workload characteristic | Serverless (Lambda + Step Functions) | Spark (EMR / Glue)
Event-driven triggers | Native S3/SQS/EventBridge triggers, zero idle cost | Cluster must be running or takes minutes to start
Partition-friendly transforms | Each file processed independently, linear scale-out | Overkill for embarrassingly parallel work
Shuffle-heavy joins | Impossible without external coordination | Built-in shuffle, sort-merge join, broadcast join
Stateful processing | No built-in state management | Windows, sessionization, iterative ML
Bursty schedule | Scales to zero between runs | Cluster idles or needs warm-up time
Sustained throughput | Cost escalates with duration | Reserved pricing wins above 3-4 hours/day
Signal | Serverless (Lambda/Step Functions) | Spark (EMR/Glue) | The Gap (Fargate/Glue)
Trigger pattern | Event-driven: S3 object arrives, SQS message, EventBridge schedule | Scheduled batch: daily/hourly ETL runs | Scheduled but lightweight
Data shape | Each file independent, no cross-partition joins | Large table joins, global aggregations, data redistribution | Medium joins, moderate shuffle
Processing time | Sub-10 min per chunk | Hours of sustained compute | 10 min to 3 hours
Traffic pattern | Bursty: 3 bursts/day, zero cost between | Sustained: 3+ hours daily, predictable volume | Moderate: too long for Lambda, too short for Spark
State needs | Stateless per invocation | Window functions, sessionization, iterative ML | Light state via checkpoints
Cost model | Per-invocation. Wins when idle time > 60% | Per-cluster-hour. Wins at sustained throughput | Per-vCPU-second. Middle ground

The “too big for Lambda, too small for Spark” gap catches more teams than either extreme. AWS Glue (Spark with per-second billing) or Fargate tasks fill it. If your transforms need 20 minutes but not 3 hours, that middle tier is where you belong.

The 15-Minute Wall

The Timeout Wall: the 15-minute Lambda execution limit that determines whether a data processing workload can run serverless. If the largest partition processes in 12 minutes, Lambda works. If it processes in 16, Lambda fails irrecoverably. Binary, not graceful. No “almost fits.”

In practice, you get about 10 minutes of safe processing time once you factor in cold starts, S3 uploads, and graceful shutdown. Chunk the dataset so no invocation exceeds that window. Simple format conversions handle 500MB-1GB per Lambda. Complex transformations with parsing and enrichment: 50-100MB per chunk.
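
To make that concrete, here is a minimal coordinator sketch that splits one S3 object into byte-range chunks sized for ranged GETs. The function name and the 64MB default are illustrative assumptions, not a prescribed API:

```python
import boto3

s3 = boto3.client("s3")

def plan_chunks(bucket: str, key: str, chunk_bytes: int = 64 * 1024 * 1024):
    """Split one S3 object into byte ranges that each finish well inside
    the ~10-minute safe window. Each range becomes one Lambda invocation
    doing a ranged GET."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    return [
        {
            "bucket": bucket,
            "key": key,
            # HTTP Range headers are inclusive on both ends.
            "range": f"bytes={start}-{min(start + chunk_bytes, size) - 1}",
        }
        for start in range(0, size, chunk_bytes)
    ]
```

One caveat: byte ranges cut records at arbitrary boundaries, so workers on text formats must handle the record that straddles a boundary, or you chunk on file and row-group boundaries instead.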

[Diagram: Step Functions chunk, fan-out, collect. An S3 event (new file uploaded) triggers the workflow; a Map state splits the file into N chunks and distributes them to parallel Lambdas; a collect step merges the results to Parquet and registers them in the Glue Catalog.]

Total time = overhead + longest chunk. Not the sum of all chunks. That is the point.

Failed chunks retry on their own. In Spark, one partition failure often restarts the whole stage. At terabyte scale, that’s the difference between a 30-second retry and a 45-minute restart.

Anti-pattern

Don’t: Set the Lambda timeout to 15 minutes and let it race the clock. One slow S3 write at minute 14 and the entire invocation fails without completing its output, leaving partial data that poisons downstream consumers.

Do: Target 10-minute processing with 5 minutes of safety margin. If a chunk approaches 8 minutes, that chunk needs further subdivision, not a longer timeout.
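
One way to enforce that margin is a deadline-aware handler that checks the remaining time Lambda reports and hands back the unprocessed tail instead of racing the clock. A sketch, where the 2-minute margin, the event shape, and the process() stub are assumptions:

```python
SAFETY_MARGIN_MS = 120_000  # stop with 2 minutes left, never race to 15:00

def process(record):
    return record  # stand-in for the real transform

def handler(event, context):
    done = []
    for i, record in enumerate(event["records"]):
        # get_remaining_time_in_millis() is the Lambda context's countdown
        # to the configured timeout.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Return the unprocessed tail so the orchestrator can
            # re-dispatch it, instead of emitting partial output.
            return {"status": "partial", "resume_from": i, "done": done}
        done.append(process(record))
    return {"status": "complete", "done": done}
```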

Fan-Out Orchestration and Backpressure

[Diagram: serverless fan-out/fan-in pipeline in five stages. 1. Trigger: a new S3 object fires an EventBridge rule (~1s latency). 2. Chunk: Step Functions lists the S3 objects and calculates chunks (~3s overhead). 3. Fan-out: a Map state pushes chunks through SQS to N concurrent Lambda workers. 4. Collect: verify all chunks completed and merge keys (~2s fan-in). 5. Output: Parquet files land in S3 and partitions register in the Glue Catalog. Total pipeline = overhead + longest chunk processing time, not the sum of all chunks.]

Fan-out is where serverless pipelines get their speed. One S3 event triggers a coordinator Lambda that distributes 1,000 chunks across 1,000 concurrent invocations. The whole pipeline finishes in minutes instead of hours.

But fan-out without backpressure is a footgun. Set MaximumConcurrency on the SQS event source mapping. Without it, 10,000 backlogged messages launch 10,000 concurrent Lambdas that overwhelm DynamoDB write capacity or exhaust your account’s concurrency limit. The pipeline doesn’t slow down gracefully. It crashes.
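
Capping concurrency at the event source is one boto3 call when the mapping is created. The ARNs and the limit of 50 below are placeholders; size the cap to what your downstream store can absorb:

```python
import boto3

lambda_client = boto3.client("lambda")

# ScalingConfig throttles how aggressively Lambda polls the queue, which
# degrades gracefully. Reserved concurrency on the function would instead
# make deliveries fail and retry.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:chunk-queue",
    FunctionName="process-chunk",
    BatchSize=1,  # one chunk per invocation keeps retries cheap
    ScalingConfig={"MaximumConcurrency": 50},  # allowed range is 2-1000
)
```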

For fan-in collection, each worker writes to S3 with deterministic key patterns. The final step lists the output prefix and merges results. Use Step Functions Express workflows for high-fan-out orchestration. Standard workflows charge per state transition, and at 10,000+ items that cost compounds fast. Express workflows cut orchestration cost by 10x or more.
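
A minimal fan-in check under those assumptions, where each worker writes exactly one object under output/<job_id>/ so the collector can verify completeness by counting keys:

```python
import boto3

s3 = boto3.client("s3")

def all_chunks_done(bucket: str, job_id: str, expected: int) -> bool:
    """Count worker outputs under the job's deterministic prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=f"output/{job_id}/"):
        count += len(page.get("Contents", []))
    return count == expected
```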

Idempotency Over Exactly-Once Illusions

True exactly-once delivery doesn’t exist in distributed systems. Every retry can create a duplicate. The real answer is making duplicates harmless through idempotent design.

S3 writes are naturally idempotent. Writing the same object twice produces the same result. For database writes, generate a deterministic key from the input: a SHA256 hash of the S3 key plus chunk offset. Check DynamoDB with a conditional PutItem before processing. Key exists? Skip the work and return success.
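
A sketch of that dedup gate, assuming a DynamoDB table named pipeline-dedup with partition key pk:

```python
import hashlib
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("pipeline-dedup")

def already_processed(s3_key: str, offset: int) -> bool:
    """Deterministic key: the same input always hashes to the same id,
    so a retried invocation sees its predecessor's record and skips."""
    dedup_id = hashlib.sha256(f"{s3_key}:{offset}".encode()).hexdigest()
    try:
        table.put_item(
            Item={
                "pk": dedup_id,
                # TTL attribute; DynamoDB deletes the record ~24h later.
                "expires_at": int(time.time()) + 24 * 3600,
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return False  # first time this chunk has been seen
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # duplicate delivery; skip the work
        raise
```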

[Diagram: idempotent processing, safe to retry. An event arrives from SQS or EventBridge and may be a duplicate. A dedup check does a DynamoDB conditional write; if the event_id was already processed, the cached result is returned. Otherwise the transform runs, then the result and event_id are stored atomically, so the conditional write prevents the race and the next duplicate returns the stored result.]

Exactly-once is a myth. At-least-once + idempotency = effectively-once.

Set a 24-hour TTL on the deduplication table. Without it, the table grows forever and DynamoDB costs creep up until someone notices months later. Data engineering pipelines that bake in idempotency from day one skip the painful retrofit that every “just get it working” pipeline eventually needs.
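
Enabling the TTL is a one-time call; the attribute name has to match what the dedup write sets (expires_at in the sketch above):

```python
import boto3

boto3.client("dynamodb").update_time_to_live(
    TableName="pipeline-dedup",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
```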

The Small-File Problem

Five hundred concurrent Lambdas produce 500 small files. Query engines like Athena open each file one by one, read the metadata, and plan the query. Thousands of tiny Parquet files make queries painfully slow compared to a few hundred properly-sized ones.

Target: 128-256MB Parquet files with Snappy compression. Partition by the columns your queries actually filter on. Date is almost always right. Over-partitioning on multiple columns creates a directory tree with thousands of near-empty files, which is worse than no partitioning at all.

A compaction step is unavoidable. Schedule a Glue job or a Lambda triggered by S3 Inventory to merge small files into the right sizes. This is the least exciting part of the architecture. It’s also the part that decides whether your analytics queries are fast or miserable.
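
For a sense of what the Lambda variant looks like, here is a minimal compaction sketch using pyarrow. It assumes all files under the prefix share a schema and the partition fits in Lambda memory; beyond that, reach for the Glue job:

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def compact_prefix(bucket: str, prefix: str, out_key: str) -> None:
    """Merge the small Parquet files under one partition prefix into a
    single right-sized file (target 128-256MB)."""
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix
        )
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]
    tables = [
        pq.read_table(
            pa.BufferReader(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        )
        for key in keys
    ]
    merged = pa.concat_tables(tables)  # requires matching schemas
    out = pa.BufferOutputStream()
    pq.write_table(merged, out, compression="snappy")
    s3.put_object(Bucket=bucket, Key=out_key, Body=out.getvalue().to_pybytes())
```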

Approach | Effort | Ongoing cost | Query impact
No compaction (raw Lambda output) | None | None | Catastrophic. Thousands of file opens per query.
Scheduled Glue compaction | Medium (days) | Low. Per-second Glue billing. | Excellent. Optimal file sizes for Athena/Spark.
Lambda-triggered compaction | Medium (days) | Very low. Runs only when needed. | Good. Slightly less optimal sizing.
Write directly to Iceberg/Hudi | High (weeks) | Medium. Table maintenance overhead. | Excellent. Built-in compaction and time travel.

Orchestration: Step Functions vs. Airflow

Two orchestrators dominate serverless data pipelines, and they solve different problems.

Step Functions is serverless and AWS-native. Built-in retry, timeout, and error handling. The ASL (Amazon States Language) definition gets wordy for complex branching, but for straight-line and fan-out pipelines it’s clean. No servers to manage. You pay per state transition.
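
For a sense of the shape, here is a minimal Map-state definition expressed as a Python dict and registered as an Express workflow. The names, ARNs, and concurrency value are placeholders:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# One Map state fanning items out to a worker Lambda, with retry baked in.
definition = {
    "StartAt": "FanOut",
    "States": {
        "FanOut": {
            "Type": "Map",
            "ItemsPath": "$.chunks",
            "MaxConcurrency": 100,
            "Iterator": {
                "StartAt": "ProcessChunk",
                "States": {
                    "ProcessChunk": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-chunk",
                        "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="chunk-fan-out",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-exec",
    type="EXPRESS",  # billed per request and duration, not per state transition
)
```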

Airflow is Python-native and works across systems. Full visibility into DAG run history. Rich set of connectors for databases, SaaS APIs, and on-prem systems. But it needs a running scheduler, and managed Airflow (MWAA) costs money whether your pipelines run or not.

When Step Functions fits | When Airflow fits
Pipeline is entirely within AWS | Pipeline spans cloud providers, SaaS, on-premises
Fan-out to hundreds of concurrent Lambdas | Complex dependency graphs with conditional branching
Zero idle cost matters (no always-on scheduler) | Team already runs Airflow for other workloads
Pipeline logic is straightforward (ingest, transform, load) | Pipeline requires Python-level orchestration logic

Already running Airflow? Add Lambda steps to your existing DAGs. Don’t run two orchestrators. The pain of keeping both running outweighs whatever architectural neatness you get from picking the “right” one for each job.
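
With the Amazon provider package installed, a Lambda transform drops into an existing DAG as one more task. A sketch assuming Airflow 2.x; the DAG id, schedule, and function name are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)

with DAG(
    dag_id="ingest_with_lambda_step",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Invokes the Lambda synchronously and surfaces failures as task
    # failures, so Airflow's retries and alerting apply as usual.
    transform = LambdaInvokeFunctionOperator(
        task_id="transform_chunk",
        function_name="process-chunk",
        payload='{"bucket": "raw-data", "prefix": "daily/"}',
    )
```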

What the Industry Gets Wrong About Serverless Data Processing

“Move everything to Lambda.” Lambda has a 15-minute hard timeout. Any partition that takes longer fails irrecoverably. Serverless Spark (EMR Serverless, Databricks Serverless) handles long-running jobs without that wall. Forcing a 45-minute transform into Lambda isn’t a serverless migration. It’s creative chunking that eventually collapses under data skew.

“Serverless is always cheaper for data processing.” When a cluster sits idle more than it runs, serverless wins by killing idle costs. Once utilization tips past the halfway mark, reserved containers pull ahead. A Spark cluster running 22 hours/day at 80% utilization isn’t a serverless candidate. The 3 hours of bursty ingestion with 4% utilization afterward? That is.

Our take: Split the workload by traffic pattern. Run bursty, short-lived transforms on Lambda with Step Functions. Run sustained, long-running jobs on Serverless Spark or reserved containers. Forcing your whole pipeline into one compute model looks clean on a diagram but breaks in production. The boring hybrid answer saves the most money and causes the fewest incidents.

That Spark cluster burning money at 4% utilization for 18 hours? The bursts run on Lambda now. The sustained transforms run on reserved containers. Idle cost dropped to near zero. The architecture got less elegant and the bill got a lot smaller. Sometimes the boring answer is the right one.

Your Spark Cluster Burns Money 18 Hours a Day

Serverless ETL eliminates idle cluster costs but introduces timeout traps, cold start delays, and exactly-once illusions. Getting the partition strategy, orchestration, and cost crossover analysis right is what separates savings from a six-month detour.


Frequently Asked Questions

When does serverless ETL make sense versus Spark on EMR?

Serverless ETL wins when your pipelines process under 500GB per run and your schedule is bursty or unpredictable. Lambda plus Step Functions handles those workloads with zero idle cost. Above 500GB, or when you need heavy joins across big datasets or iterative ML work, Spark on EMR or Glue costs less. The crossover sits at roughly 3-4 hours of sustained compute per day.

How do you work around the 15-minute Lambda timeout?

Break your data into chunks that each finish within 10 minutes, leaving a 5-minute safety buffer. Use S3 event triggers or SQS messages to spread work across parallel Lambda calls, one chunk each. When the chunks need coordinating as a single job, the Step Functions Map state hands them out to concurrent Lambdas and gathers the results. Pipelines processing terabytes routinely finish in under 20 minutes using 1,000+ parallel Lambda calls.

Is exactly-once processing possible in serverless pipelines?

True exactly-once delivery doesn’t exist in distributed systems. Serverless pipelines get close by making every step idempotent: design each transformation so running it twice with the same input gives the same output. Use DynamoDB conditional writes or S3 object versioning to catch duplicates. With proper idempotency keys, duplicate processing becomes so rare it practically doesn’t matter.

How do cold starts affect batch processing windows?

A single Lambda cold start adds 200-800ms depending on runtime and package size. In a fan-out pipeline launching 500 functions at once, cold starts spread across all those parallel runs and add roughly 1-2 seconds to total pipeline time. Provisioned Concurrency kills cold starts but adds baseline cost. For batch windows above 5 minutes, cold start impact is tiny. For sub-minute pipelines, use Provisioned Concurrency or keep functions warm with scheduled pings.

What file format should serverless pipelines write?

Parquet with Snappy compression and file sizes around 128-256MB hits the sweet spot between query speed and write efficiency. Lambda functions writing lots of small Parquet files create a metadata headache for query engines like Athena. Use a compaction step (a scheduled Glue job or a Lambda triggered by S3 inventory) to merge small files into proper sizes. A well-partitioned Parquet dataset with 128MB files makes Athena queries 10x faster or more compared to raw CSV.