Serverless Data Processing: ETL Without Servers
You inherited a Spark cluster that runs for 22 hours a day processing data that arrives in 3 bursts. During those bursts, the cluster is at 80% utilization. The other 18 hours, it sits at 4% utilization, burning compute budget on idle executors. You can feel the money evaporating. You know serverless could eliminate that idle cost. You also know that last time someone tried to move a data pipeline to Lambda, it hit the 15-minute timeout on the largest partition and the whole thing had to be rolled back on a Friday afternoon while the data team watched.
This is the tension at the center of serverless data processing. The economics are compelling when the workload fits. The failure modes are sharp when it does not. The gap between “this is perfect for Lambda” and “this will never work on Lambda” is narrower than most architecture diagrams suggest. Knowing which side of that line your pipeline falls on is the decision that determines whether this works or becomes a six-month detour.
Where Serverless ETL Wins and Where It Doesn’t
Here is what actually works in production. Serverless data processing works best for event-driven, partition-friendly workloads. An S3 object lands, a Lambda function transforms it, writes the result to another S3 prefix. Each invocation is independent. Scaling is automatic. Idle cost is zero. Clean.
It works poorly for workloads that require shuffling data across partitions (large joins, global aggregations), maintaining state between processing steps (windowed computations, sessionization), or processing that exceeds 15 minutes per unit of work. Do not try to engineer around these limitations. They are fundamental to the execution model.
The “too big for Lambda, too small for Spark” gap is real, and more pipelines fall into it than you would expect. A transformation processing 200GB with a 30-minute window does not justify a Spark cluster but will time out on Lambda without careful chunking. AWS Glue (which runs Spark under the hood but with per-second billing) or Fargate tasks (containers with no timeout limit and pay-per-second pricing) fill this gap. Glue’s DPU-hour billing makes it cost-competitive with Lambda for jobs that run 5-30 minutes on moderate data volumes.
For pipelines that do fit the serverless model, the 15-minute timeout is the constraint that shapes everything.
The 15-Minute Wall and How to Break Through It
Lambda’s hard timeout at 15 minutes is the constraint that shapes every serverless data pipeline. And you do not actually get 15 minutes of processing. You get about 10 minutes of safe processing time, because you need margin for cold starts, S3 upload time, and graceful shutdown. Plan for 10, not 15.
The solution is chunking. Break the input dataset into pieces that each process within 10 minutes. The chunk boundary depends on your transformation complexity. For simple format conversions (CSV to Parquet, JSON flattening), a single Lambda can handle 500MB-1GB per invocation. For complex transformations with validation, enrichment, and multiple output writes, 50-100MB per chunk is safer.
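A minimal sketch of the chunking step, assuming the input is an S3 listing of `(key, size)` pairs. The function name and the 100MB budget are illustrative, not a fixed API:

```python
# Sketch: plan Lambda-sized chunks from an S3 listing.
# plan_chunks and the 100MB default budget are illustrative choices.
from typing import List, Tuple

def plan_chunks(objects: List[Tuple[str, int]],
                budget_bytes: int = 100 * 1024**2) -> List[List[str]]:
    """Greedily bin (key, size) pairs so each chunk stays under budget_bytes.

    An object larger than the budget gets a chunk of its own; such objects
    need byte-range splitting or a Glue/Fargate path instead of Lambda.
    """
    chunks, current, used = [], [], 0
    for key, size in objects:
        if current and used + size > budget_bytes:
            chunks.append(current)
            current, used = [], 0
        current.append(key)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

Each inner list becomes one unit of work: one Map-state item, one Lambda invocation, one retryable failure domain.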
Step Functions Map state is the orchestration primitive for this pattern. It takes a list of chunk references, launches a Lambda invocation for each, runs them concurrently (up to the Map’s MaxConcurrency setting), and collects results. Failed chunks can be retried independently without reprocessing the entire dataset. This is where serverless data processing genuinely outperforms a monolithic Spark job. If one partition fails in Spark, you often restart the entire stage. With Step Functions, you retry the one failed chunk. That granularity matters when your pipeline processes terabytes.
The chunking pattern naturally leads to the broader question of how to distribute and collect work across many Lambda invocations.
Fan-Out / Fan-In: Parallelism Without Coordination
The fan-out/fan-in pattern is the workhorse of serverless data processing. A coordinator function (or Step Functions state machine) distributes work across hundreds or thousands of parallel Lambda invocations. Each invocation processes its chunk independently. Results are collected afterward. Simple concept. Deceptively tricky to get right.
The SQS-plus-Lambda variant is the most common. Drop chunk references onto an SQS queue. Lambda’s event source mapping pulls messages and invokes functions concurrently, scaling up to the queue’s throughput. The key configuration that teams miss: set MaximumConcurrency on the event source mapping to prevent Lambda from overwhelming downstream resources. Without this limit, a queue backlog of 10,000 messages will launch 10,000 concurrent Lambda invocations, which will exhaust your DynamoDB write capacity, S3 request rate, or VPC IP addresses. This routinely takes down production DynamoDB tables that have nothing to do with the pipeline.
For fan-in (collecting results), do not invoke Lambda from Lambda directly. That way lies madness and retry storms. Instead, each worker writes its output to S3 with a predictable key pattern. The final step lists the output prefix and either merges results or triggers the downstream consumer. S3’s strong consistency (since December 2020) means a ListObjects call immediately after the last write will return all objects. Before that guarantee, fan-in on S3 was a race condition waiting to happen. Thankfully, that era is over.
The coordination cost depends on the orchestrator. SQS pricing is negligible for most data pipeline volumes. Step Functions Standard workflows charge per state transition, which adds up fast for Map states processing thousands of items. Express workflows (charge per invocation and duration, not per transition) are the right choice for high-fan-out data processing. The pricing-model shift reduces orchestration cost by 10-50x for large pipelines. Use Express. Do not learn this lesson from your bill.
Fan-out and fan-in guarantee that your transformation will run at least once. The question is what happens when it runs more than once.
Exactly-Once Is a Lie. Idempotency Is the Answer.
Every layer of this stack delivers messages at-least-once. SQS guarantees at-least-once delivery. Lambda’s event source mapping retries on failure. Step Functions retries failed states. At every layer, your transformation function will run more than once with the same input. Not “might.” Will.
The standard advice is “make your functions idempotent.” Everyone says this. Nobody explains how. For a function that reads from S3, transforms data, and writes to S3, idempotency comes naturally. Writing the same Parquet file to the same key is a no-op. But for functions that write to databases, send notifications, or call external APIs, idempotency requires deliberate engineering.
The pattern: generate a deterministic idempotency key from the input (typically a hash of the S3 key plus the chunk offset). Before processing, check whether that key exists in a deduplication store (DynamoDB with a conditional PutItem, or a PostgreSQL INSERT with ON CONFLICT DO NOTHING). If the key exists, skip processing. If not, process and write the key atomically.
The DynamoDB deduplication table should have a TTL set to the pipeline’s maximum retry window plus margin. For most pipelines, a 24-hour TTL is sufficient. Without TTL, the table grows indefinitely and the conditional writes get slower as partition sizes increase.
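A sketch of both halves: the key derivation is pure and deterministic, and the claim step uses a DynamoDB conditional write. The table schema, attribute names, and 24-hour TTL are illustrative; `table` is assumed to be a boto3 DynamoDB Table resource with TTL enabled on `expires_at`:

```python
import hashlib
import time

def idempotency_key(s3_key: str, chunk_offset: int) -> str:
    """Deterministic key: the same input chunk always hashes to the same key."""
    return hashlib.sha256(f"{s3_key}:{chunk_offset}".encode()).hexdigest()

def claim(table, key: str, ttl_hours: int = 24) -> bool:
    """Atomically claim a chunk; False means another invocation got there first.

    `table` is a boto3 DynamoDB Table resource; attribute names and the TTL
    window are illustrative. The conditional expression makes the check and
    the write a single atomic operation, so two concurrent retries cannot
    both claim the same chunk.
    """
    try:
        table.put_item(
            Item={"pk": key, "expires_at": int(time.time()) + ttl_hours * 3600},
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return False
```

The worker calls `claim(table, idempotency_key(s3_key, offset))` before doing any work with side effects, and skips the chunk on `False`. Read-then-write without the condition expression reintroduces the race the pattern exists to close.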
Effective data engineering practice treats idempotency as a pipeline design requirement, not an afterthought bolted on when duplicates are discovered in production. If you are designing the idempotency layer after you find duplicates in production data, you are already having a bad week.
Idempotency keeps your data correct. The next problem is keeping your data queryable.
Data Format Optimization: Small Files Will Destroy Your Query Performance
Lambda functions process data in parallel, and each writes its own output file. A pipeline with 500 concurrent Lambda invocations processing 200MB chunks produces 500 output files. This is where teams get bitten. If the downstream query engine is Athena, Redshift Spectrum, or Presto, those 500 small files create two problems: metadata overhead (listing and opening 500 files takes longer than scanning them) and suboptimal compression ratios (Parquet’s columnar compression works better on larger row groups).
The target is 128-256MB per output Parquet file with Snappy compression. This balances S3 request overhead against memory requirements for reading. Each file should contain complete row groups (ideally 128MB uncompressed per group) so that predicate pushdown can skip irrelevant groups.
Partition the output by the columns your queries filter on most. For time-series data: year/month/day. For multi-tenant data: tenant_id/year/month. Over-partitioning is worse than under-partitioning. Always. Ten thousand partitions holding 1MB each perform worse than ten partitions holding 1GB each: every extra partition adds listing and metadata overhead while shrinking files below useful row-group sizes. The Glue Catalog or Athena’s MSCK REPAIR TABLE handles partition discovery.
The compaction step is unavoidable. Do not skip it. Either schedule a Glue job that rewrites small files into optimally-sized ones, or trigger a Lambda compaction function when the small file count in a partition exceeds a threshold. The cost of compaction is a fraction of the query performance gains. An un-compacted dataset with 50,000 small files takes 10-15x longer to query than the same data in 200 properly-sized files. Your analysts will blame the query engine. The real problem is the file layout.
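The planning half of a compaction job is simple enough to sketch. Assuming `files` is a listing of `(key, size_bytes)` for one partition, a greedy binning toward the 128-256MB target looks like this (names and the 200MB target are illustrative):

```python
def plan_compaction(files, target_bytes=200 * 1024**2):
    """Greedily batch small files into merge groups near the 128-256MB target.

    `files` is a list of (key, size_bytes) pairs from listing one partition;
    each returned group becomes one rewrite into a single Parquet file.
    Sorting smallest-first merges the worst offenders together.
    """
    groups, current, used = [], [], 0
    for key, size in sorted(files, key=lambda f: f[1]):
        current.append(key)
        used += size
        if used >= target_bytes:
            groups.append(current)
            current, used = [], 0
    if current:
        groups.append(current)
    return groups
```

Each group then goes to a Glue job or a Lambda that reads the members, concatenates them, and writes one Snappy-compressed Parquet file before deleting the originals.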
With the data format sorted, you need to choose how to orchestrate the pipeline itself.
Orchestration: Step Functions vs Airflow
Step Functions and Airflow both orchestrate data pipelines. They solve the same problem with fundamentally different trade-offs.
Step Functions is serverless, deeply integrated with AWS services, and scales without operational overhead. It handles retry, timeout, and error states declaratively. Express Workflows execute up to 5 minutes; Standard Workflows can run for up to a year. The limitation: Step Functions’ state language (ASL) is painfully verbose for complex branching logic, and debugging failed executions means clicking through the console’s state machine visualization. If you have ever tried to debug a 40-step ASL definition, you know the pain.
Airflow (or MWAA, AWS’s managed version) is a Python-based orchestration platform with a scheduler, web UI, and plugin ecosystem. It excels at complex DAG logic, cross-system dependencies (trigger a Spark job, wait for a database load, then run a Lambda), and operational visibility. The limitation: Airflow itself needs a server. MWAA’s smallest environment runs continuously, adding baseline cost even when no pipelines are active. You are paying for the orchestrator to exist, not just to orchestrate.
The decision boundary is clear: if your pipeline is entirely AWS-native (S3, Lambda, Glue, DynamoDB), Step Functions is simpler and cheaper. If your pipeline spans AWS services, databases, SaaS APIs, and on-premises systems, Airflow’s flexibility justifies its operational overhead. For teams already running Airflow for Spark jobs, adding Lambda-based steps to existing DAGs is more pragmatic than introducing a second orchestration tool. Do not run two orchestrators unless you have a very good reason.
Scalable infrastructure design treats the orchestration layer as a first-class component, not an afterthought. The wrong choice here creates operational debt that compounds as pipeline count grows. And pipeline count always grows.
Cold Starts and Batch Windows
Lambda cold starts affect data pipelines differently than they affect APIs. For an API, a 500ms cold start means one user waits an extra half-second. For a batch pipeline launching 500 concurrent functions, cold starts are parallelized. All 500 cold starts happen simultaneously, so they add roughly one cold start’s worth of latency plus scaling ramp-up (call it 1-2 seconds) to total pipeline duration, not 500 times 500ms. This is one case where serverless parallelism actually works in your favor.
The exception is sequential pipeline stages. If stage 2 depends on stage 1’s output, and stage 2 launches after stage 1 completes, you pay cold start latency for each stage serially. A 4-stage pipeline with 800ms cold starts per stage adds over 3 seconds of pure cold start overhead.
Provisioned Concurrency eliminates cold starts but adds continuous cost. It makes sense for pipelines with tight batch windows (sub-5-minute SLA) or pipelines that run frequently enough that warm instances would be maintained naturally. For pipelines running a few times per day, the cold start cost is negligible compared to the total pipeline duration, and Provisioned Concurrency just adds waste.
The Python runtime typically cold starts in 200-400ms at 512MB memory. Java cold starts range from 2-6 seconds without SnapStart. If cold starts matter for your pipeline, choosing the right runtime saves more than any configuration tuning. Node.js and Python are the pragmatic choices for data processing Lambdas. Pick Python unless you have a specific reason not to. Java is justified only when you need libraries (like Apache Parquet’s Java SDK) that do not have equivalent quality in other runtimes.
The economics of serverless architecture flip at sustained utilization. Below 30% average utilization, serverless wins on cost. Above that threshold, containers with auto-scaling become more efficient. Track your pipeline’s effective utilization. If Lambda functions are running 8+ hours per day, the pipeline has outgrown serverless. Migrate to Fargate or EKS before the bill forces the conversation.
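A rough break-even sketch makes the threshold concrete. The prices below are illustrative (approximate us-east-1 list prices, excluding request and orchestration charges); plug in your own region’s numbers:

```python
# Illustrative per-day cost: Lambda (pay for busy seconds) vs an always-on
# Fargate task (pay for 24 hours). Prices are approximate us-east-1 list
# prices and exclude request/orchestration charges.
LAMBDA_PER_GB_SECOND = 0.0000166667
FARGATE_VCPU_HOUR = 0.04048
FARGATE_GB_HOUR = 0.004445

def lambda_cost_per_day(busy_hours: float, memory_gb: float = 2.0) -> float:
    # Lambda bills only the seconds functions actually run.
    return busy_hours * 3600 * memory_gb * LAMBDA_PER_GB_SECOND

def fargate_cost_per_day(vcpu: float = 1.0, memory_gb: float = 2.0) -> float:
    # An always-on container bills 24 hours whether or not work arrives.
    return 24 * (vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR)
```

With these numbers, a 2GB workload crosses over near 10 busy hours per day; adding request, Step Functions, and data-transfer charges pulls that closer to the 8-hour rule of thumb above.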
Serverless data processing is not a universal replacement for Spark, EMR, or managed ETL platforms. It is a precise tool for a specific workload profile: event-driven, partition-friendly, bursty, and tolerant of the 15-minute execution boundary. When the workload fits, serverless ETL delivers zero idle cost, automatic scaling, and per-invocation pricing that traditional platforms cannot match. When it does not fit, forcing it creates timeout workarounds, coordination complexity, and costs that exceed what a simple Spark cluster would have handled. The teams that get this right are not the ones who chose serverless for everything. They are the ones who chose it for exactly the right things.