
Serverless Data Processing: Pay for What Runs

Metasphere Engineering

You inherited a Spark cluster that runs 22 hours a day processing data that arrives in 3 bursts. During those bursts, the cluster hits 80% utilization. The other 18 hours it sits at 4%, burning compute on idle executors. A full-time kitchen staff sitting around between three meal rushes. You can feel the money evaporating. Serverless could eliminate that idle cost. Hire caterers for the events. Fire them between meals. But the last time someone tried moving a pipeline to Lambda, the largest partition hit the 15-minute timeout. The whole migration got rolled back while the data team watched. The caterer who leaves at a fixed time regardless of whether dessert is served.

Serverless data processing works great for the right workloads. It’s also a mess for the wrong ones. The difference comes down to partition strategy, orchestration choices, and knowing exactly where the cost crossover lives.

Key takeaways
  • A Spark cluster at 4% utilization for 18 hours/day is the textbook case for serverless. Bursty workloads with long idle periods save a lot on compute.
  • The 15-minute Lambda timeout is a binary wall, not a graceful limit. The caterer who walks out at 15 minutes regardless. Partition your data so no single invocation exceeds 10 minutes. The largest partition determines whether Lambda works at all.
  • Step Functions orchestrate multi-stage pipelines with retry, error handling, and state persistence. Never chain Lambdas through direct invocation.
  • Exactly-once processing does not exist. Idempotent design with deduplication keys approximates it so closely the distinction stops mattering.
  • Cost crossover happens at moderate sustained utilization. Once your kitchen stays busy more often than it sits idle, full-time staff pull ahead.

The Workload Fitness Test

Not every pipeline belongs on Lambda. Apache Flink handles stateful streams where you need windowed counts and session tracking. Spark excels at heavy joins across terabyte-scale datasets. Serverless works when each piece of work stands alone and finishes within the timeout.

Prerequisites
  1. Each processing unit completes independently without cross-partition state
  2. No single partition requires more than 10 minutes of compute
  3. Workload has idle periods where zero-cost scaling delivers real savings
  4. Downstream consumers tolerate eventual consistency (minutes, not seconds)
  5. Data volume per run stays under 500GB
Workload characteristic | Serverless (Lambda + Step Functions) | Spark (EMR / Glue)
Event-driven triggers | Native S3/SQS/EventBridge triggers, zero idle cost | Cluster must be running or takes minutes to start
Partition-friendly transforms | Each file processed independently, linear scale-out | Overkill for embarrassingly parallel work
Shuffle-heavy joins | Impossible without external coordination | Built-in shuffle, sort-merge join, broadcast join
Stateful processing | No built-in state management | Windows, sessionization, iterative ML
Bursty schedule | Scales to zero between runs | Cluster idles or needs warm-up time
Sustained throughput | Cost escalates with duration | Reserved pricing wins above 3-4 hours/day
Signal | Serverless (Lambda/Step Functions) | Spark (EMR/Glue) | The Gap (Fargate/Glue)
Trigger pattern | Event-driven: S3 object arrives, SQS message, EventBridge schedule | Scheduled batch: daily/hourly ETL runs | Scheduled but lightweight
Data shape | Each file independent, no cross-partition joins | Large table joins, global aggregations, data redistribution | Medium joins, moderate shuffle
Processing time | Sub-10 min per chunk | Hours of sustained compute | 10 min to 3 hours
Traffic pattern | Bursty: 3 bursts/day, zero cost between | Sustained: 3+ hours daily, predictable volume | Moderate: too long for Lambda, too short for Spark
State needs | Stateless per invocation | Window functions, sessionization, iterative ML | Light state via checkpoints
Cost model | Per-invocation. Wins when idle time > 60% | Per-cluster-hour. Wins at sustained throughput | Per-vCPU-second. Middle ground

The “too big for Lambda, too small for Spark” gap catches more teams than either extreme. AWS Glue (Spark with per-second billing) or Fargate tasks fill it. If your transforms need 20 minutes but not 3 hours, that middle tier is where you belong.

The 15-Minute Wall

The Timeout Wall: the 15-minute Lambda execution limit that determines whether a data processing workload can run serverless. If the largest partition processes in 12 minutes, Lambda works. If it processes in 16, Lambda fails irrecoverably. Binary, not graceful. No “almost fits.”

In practice, you get about 10 minutes of safe processing time once you factor in cold starts, S3 uploads, and graceful shutdown. Chunk the dataset so no invocation exceeds that window. Simple format conversions handle 500MB-1GB per Lambda. Complex transformations with parsing and enrichment: 50-100MB per chunk.
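
To make that concrete, here is a minimal coordinator sketch that splits one S3 object into byte-range chunks sized for ranged GETs. The function name and the 64MB default are illustrative assumptions, not a prescribed API:

```python
import boto3

s3 = boto3.client("s3")

def plan_chunks(bucket: str, key: str, chunk_bytes: int = 64 * 1024 * 1024):
    """Split one S3 object into byte ranges that each finish well inside
    the ~10-minute safe window. Each range becomes one Lambda invocation
    doing a ranged GET."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    return [
        {
            "bucket": bucket,
            "key": key,
            # HTTP Range headers are inclusive on both ends.
            "range": f"bytes={start}-{min(start + chunk_bytes, size) - 1}",
        }
        for start in range(0, size, chunk_bytes)
    ]
```

One caveat: byte ranges cut records at arbitrary boundaries, so workers on text formats must handle the record that straddles a boundary, or you chunk on file and row-group boundaries instead.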

[Diagram: Step Functions chunk, fan-out, collect. An S3 event (new file uploaded) triggers the workflow; a Map state splits the file into N chunks and distributes them to parallel Lambdas; a collect step merges the results to Parquet and registers them in the Glue Catalog.]

Total time = overhead + longest chunk. Not the sum of all chunks. That is the point.

Failed chunks retry on their own. In Spark, one partition failure often restarts the whole stage. At terabyte scale, that’s the difference between a 30-second retry and a 45-minute restart.

Anti-pattern

Don’t: Set the Lambda timeout to 15 minutes and let it race the clock. One slow S3 write at minute 14 and the entire invocation fails without completing its output, leaving partial data that poisons downstream consumers.

Do: Target 10-minute processing with 5 minutes of safety margin. If a chunk approaches 8 minutes, that chunk needs further subdivision, not a longer timeout.
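
One way to enforce that margin is a deadline-aware handler that checks the remaining time Lambda reports and hands back the unprocessed tail instead of racing the clock. A sketch, where the 2-minute margin, the event shape, and the process() stub are assumptions:

```python
SAFETY_MARGIN_MS = 120_000  # stop with 2 minutes left, never race to 15:00

def process(record):
    return record  # stand-in for the real transform

def handler(event, context):
    done = []
    for i, record in enumerate(event["records"]):
        # get_remaining_time_in_millis() is the Lambda context's countdown
        # to the configured timeout.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Return the unprocessed tail so the orchestrator can
            # re-dispatch it, instead of emitting partial output.
            return {"status": "partial", "resume_from": i, "done": done}
        done.append(process(record))
    return {"status": "complete", "done": done}
```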

Fan-Out Orchestration and Backpressure

[Diagram: serverless fan-out/fan-in pipeline in five stages. 1. Trigger: a new S3 object fires an EventBridge rule (~1s latency). 2. Chunk: Step Functions lists the S3 objects and calculates chunks (~3s overhead). 3. Fan-out: a Map state pushes chunks through SQS to N concurrent Lambda workers. 4. Collect: verify all chunks completed and merge keys (~2s fan-in). 5. Output: Parquet files land in S3 and partitions register in the Glue Catalog. Total pipeline = overhead + longest chunk processing time, not the sum of all chunks.]

Fan-out is where serverless pipelines get their speed. One S3 event triggers a coordinator Lambda that distributes 1,000 chunks across 1,000 concurrent invocations. The whole pipeline finishes in minutes instead of hours.

But fan-out without backpressure is a footgun. Set MaximumConcurrency on the SQS event source mapping. Without it, 10,000 backlogged messages launch 10,000 concurrent Lambdas that overwhelm DynamoDB write capacity or exhaust your account’s concurrency limit. The pipeline doesn’t slow down gracefully. It crashes.
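
Capping concurrency at the event source is one boto3 call when the mapping is created. The ARNs and the limit of 50 below are placeholders; size the cap to what your downstream store can absorb:

```python
import boto3

lambda_client = boto3.client("lambda")

# ScalingConfig throttles how aggressively Lambda polls the queue, which
# degrades gracefully. Reserved concurrency on the function would instead
# make deliveries fail and retry.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:chunk-queue",
    FunctionName="process-chunk",
    BatchSize=1,  # one chunk per invocation keeps retries cheap
    ScalingConfig={"MaximumConcurrency": 50},  # allowed range is 2-1000
)
```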

For fan-in collection, each worker writes to S3 with deterministic key patterns. The final step lists the output prefix and merges results. Use Step Functions Express workflows for high-fan-out orchestration. Standard workflows charge per state transition, and at 10,000+ items that cost compounds fast. Express workflows cut orchestration cost by 10x or more.
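
A minimal fan-in check under those assumptions, where each worker writes exactly one object under output/<job_id>/ so the collector can verify completeness by counting keys:

```python
import boto3

s3 = boto3.client("s3")

def all_chunks_done(bucket: str, job_id: str, expected: int) -> bool:
    """Count worker outputs under the job's deterministic prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=f"output/{job_id}/"):
        count += len(page.get("Contents", []))
    return count == expected
```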

Idempotency Over Exactly-Once Illusions

True exactly-once delivery doesn’t exist in distributed systems. Every retry can create a duplicate. The real answer is making duplicates harmless through idempotent design.

S3 writes are naturally idempotent. Writing the same object twice produces the same result. For database writes, generate a deterministic key from the input: a SHA256 hash of the S3 key plus chunk offset. Check DynamoDB with a conditional PutItem before processing. Key exists? Skip the work and return success.
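
A sketch of that dedup gate, assuming a DynamoDB table named pipeline-dedup with partition key pk:

```python
import hashlib
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("pipeline-dedup")

def already_processed(s3_key: str, offset: int) -> bool:
    """Deterministic key: the same input always hashes to the same id,
    so a retried invocation sees its predecessor's record and skips."""
    dedup_id = hashlib.sha256(f"{s3_key}:{offset}".encode()).hexdigest()
    try:
        table.put_item(
            Item={
                "pk": dedup_id,
                # TTL attribute; DynamoDB deletes the record ~24h later.
                "expires_at": int(time.time()) + 24 * 3600,
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return False  # first time this chunk has been seen
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # duplicate delivery; skip the work
        raise
```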

[Diagram: idempotent processing, safe to retry. An event arrives from SQS or EventBridge and may be a duplicate. A dedup check does a DynamoDB conditional write; if the event_id was already processed, the cached result is returned. Otherwise the transform runs, then the result and event_id are stored atomically, so the conditional write prevents the race and the next duplicate returns the stored result.]

Exactly-once is a myth. At-least-once + idempotency = effectively-once.

Set a 24-hour TTL on the deduplication table. Without it, the table grows forever and DynamoDB costs creep up until someone notices months later. Data engineering pipelines that bake in idempotency from day one skip the painful retrofit that every “just get it working” pipeline eventually needs.
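
Enabling the TTL is a one-time call; the attribute name has to match what the dedup write sets (expires_at in the sketch above):

```python
import boto3

boto3.client("dynamodb").update_time_to_live(
    TableName="pipeline-dedup",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
```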

The Small-File Problem

Five hundred concurrent Lambdas produce 500 small files. Query engines like Athena open each file one by one, read the metadata, and plan the query. Thousands of tiny Parquet files make queries painfully slow compared to a few hundred properly-sized ones.

Target: 128-256MB Parquet files with Snappy compression. Partition by the columns your queries actually filter on. Date is almost always right. Over-partitioning on multiple columns creates a directory tree with thousands of near-empty files, which is worse than no partitioning at all.

A compaction step is unavoidable. Schedule a Glue job or a Lambda triggered by S3 Inventory to merge small files into the right sizes. This is the least exciting part of the architecture. It’s also the part that decides whether your analytics queries are fast or miserable.
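
For a sense of what the Lambda variant looks like, here is a minimal compaction sketch using pyarrow. It assumes all files under the prefix share a schema and the partition fits in Lambda memory; beyond that, reach for the Glue job:

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def compact_prefix(bucket: str, prefix: str, out_key: str) -> None:
    """Merge the small Parquet files under one partition prefix into a
    single right-sized file (target 128-256MB)."""
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix
        )
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]
    tables = [
        pq.read_table(
            pa.BufferReader(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        )
        for key in keys
    ]
    merged = pa.concat_tables(tables)  # requires matching schemas
    out = pa.BufferOutputStream()
    pq.write_table(merged, out, compression="snappy")
    s3.put_object(Bucket=bucket, Key=out_key, Body=out.getvalue().to_pybytes())
```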

Approach | Effort | Ongoing cost | Query impact
No compaction (raw Lambda output) | None | None | Catastrophic. Thousands of file opens per query.
Scheduled Glue compaction | Medium (days) | Low. Per-second Glue billing. | Excellent. Optimal file sizes for Athena/Spark.
Lambda-triggered compaction | Medium (days) | Very low. Runs only when needed. | Good. Slightly less optimal sizing.
Write directly to Iceberg/Hudi | High (weeks) | Medium. Table maintenance overhead. | Excellent. Built-in compaction and time travel.

Orchestration: Step Functions vs. Airflow

Two orchestrators dominate serverless data pipelines, and they solve different problems.

Step Functions is serverless and AWS-native. Built-in retry, timeout, and error handling. The ASL (Amazon States Language) definition gets wordy for complex branching, but for straight-line and fan-out pipelines it’s clean. No servers to manage. You pay per state transition.
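
For a sense of the shape, here is a minimal Map-state definition expressed as a Python dict and registered as an Express workflow. The names, ARNs, and concurrency value are placeholders:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# One Map state fanning items out to a worker Lambda, with retry baked in.
definition = {
    "StartAt": "FanOut",
    "States": {
        "FanOut": {
            "Type": "Map",
            "ItemsPath": "$.chunks",
            "MaxConcurrency": 100,
            "Iterator": {
                "StartAt": "ProcessChunk",
                "States": {
                    "ProcessChunk": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-chunk",
                        "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="chunk-fan-out",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-exec",
    type="EXPRESS",  # billed per request and duration, not per state transition
)
```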

Airflow is Python-native and works across systems. Full visibility into DAG run history. Rich set of connectors for databases, SaaS APIs, and on-prem systems. But it needs a running scheduler, and managed Airflow (MWAA) costs money whether your pipelines run or not.

When Step Functions fits | When Airflow fits
Pipeline is entirely within AWS | Pipeline spans cloud providers, SaaS, on-premises
Fan-out to hundreds of concurrent Lambdas | Complex dependency graphs with conditional branching
Zero idle cost matters (no always-on scheduler) | Team already runs Airflow for other workloads
Pipeline logic is straightforward (ingest, transform, load) | Pipeline requires Python-level orchestration logic

Already running Airflow? Add Lambda steps to your existing DAGs. Don’t run two orchestrators. The pain of keeping both running outweighs whatever architectural neatness you get from picking the “right” one for each job.
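
With the Amazon provider package installed, a Lambda transform drops into an existing DAG as one more task. A sketch assuming Airflow 2.x; the DAG id, schedule, and function name are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)

with DAG(
    dag_id="ingest_with_lambda_step",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Invokes the Lambda synchronously and surfaces failures as task
    # failures, so Airflow's retries and alerting apply as usual.
    transform = LambdaInvokeFunctionOperator(
        task_id="transform_chunk",
        function_name="process-chunk",
        payload='{"bucket": "raw-data", "prefix": "daily/"}',
    )
```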

What the Industry Gets Wrong About Serverless Data Processing

“Move everything to Lambda.” Lambda has a 15-minute hard timeout. Any partition that takes longer fails irrecoverably. Serverless Spark (EMR Serverless, Databricks Serverless) handles long-running jobs without that wall. Forcing a 45-minute transform into Lambda isn’t a serverless migration. It’s creative chunking that eventually collapses under data skew.

“Serverless is always cheaper for data processing.” When a cluster sits idle more than it runs, serverless wins by killing idle costs. Once utilization tips past the halfway mark, reserved containers pull ahead. A Spark cluster running 22 hours/day at 80% utilization isn’t a serverless candidate. The 3 hours of bursty ingestion with 4% utilization afterward? That is.

Our take: Split the workload by traffic pattern. Run bursty, short-lived transforms on Lambda with Step Functions. Run sustained, long-running jobs on Serverless Spark or reserved containers. Forcing your whole pipeline into one compute model looks clean on a diagram but breaks in production. The boring hybrid answer saves the most money and causes the fewest incidents.

That Spark cluster burning money at 4% utilization for 18 hours? The bursts run on Lambda now. The sustained transforms run on reserved containers. Idle cost dropped to near zero. The architecture got less elegant and the bill got a lot smaller. Sometimes the boring answer is the right one.

Your Spark Cluster Burns Money 18 Hours a Day

Serverless ETL eliminates idle cluster costs but introduces timeout traps, cold start delays, and exactly-once illusions. Getting the partition strategy, orchestration, and cost crossover analysis right is what separates savings from a six-month detour.


Frequently Asked Questions

When does serverless ETL make sense versus Spark on EMR?

Serverless ETL wins when your pipelines process under 500GB per run and your schedule is bursty or unpredictable. Lambda plus Step Functions handles those workloads with zero idle cost. Above 500GB, or when you need heavy joins across big datasets or iterative ML work, Spark on EMR or Glue costs less. The crossover sits at roughly 3-4 hours of sustained compute per day.

How do you work around the 15-minute Lambda timeout?

Break your data into chunks that each finish within 10 minutes, leaving a 5-minute safety buffer. Use S3 event triggers or SQS messages to spread work across parallel Lambda calls, one chunk each. When the chunks need coordinating as a single job, the Step Functions Map state hands them out to concurrent Lambdas and gathers the results. Pipelines processing terabytes routinely finish in under 20 minutes using 1,000+ parallel Lambda calls.

Is exactly-once processing possible in serverless pipelines?

True exactly-once delivery doesn’t exist in distributed systems. Serverless pipelines get close by making every step idempotent: design each transformation so running it twice with the same input gives the same output. Use DynamoDB conditional writes or S3 object versioning to catch duplicates. With proper idempotency keys, duplicate processing becomes so rare it practically doesn’t matter.

How do cold starts affect batch processing windows?

A single Lambda cold start adds 200-800ms depending on runtime and package size. In a fan-out pipeline launching 500 functions at once, cold starts spread across all those parallel runs and add roughly 1-2 seconds to total pipeline time. Provisioned Concurrency kills cold starts but adds baseline cost. For batch windows above 5 minutes, cold start impact is tiny. For sub-minute pipelines, use Provisioned Concurrency or keep functions warm with scheduled pings.

What file format should serverless pipelines write?

Parquet with Snappy compression and file sizes around 128-256MB hits the sweet spot between query speed and write efficiency. Lambda functions writing lots of small Parquet files create a metadata headache for query engines like Athena. Use a compaction step (a scheduled Glue job or a Lambda triggered by S3 inventory) to merge small files into proper sizes. A well-partitioned Parquet dataset with 128MB files makes Athena queries 10x faster or more compared to raw CSV.