
Real-Time Streaming in Production

Metasphere Engineering · 12 min read

Your fraud detection pipeline works perfectly in staging. You replay a week of production events, windowed aggregations look right, alerts fire on the correct patterns. Deploy day arrives and confidence is high.

Testing the dam with a garden hose. Perfect flow. No leaks.

Then consumer lag starts climbing. 30 seconds. 60. 90. Two weeks in, the pipeline runs four minutes behind real time. A fraud pattern that should trigger a card block in 5 seconds arrives 4 minutes late. The customer’s account is already drained.

The processing logic was fine. It worked beautifully against replayed events, which arrive in a predictable stream off disk. Production traffic is a different animal. Variable arrival rates. Bursty patterns. Late events from mobile devices in tunnels. State that needed to survive two Flink redeployments that week. Staging gave you a garden hose. Production gave you the river.

Key takeaways
  • Staging lies about streaming. Replayed events arrive predictably off disk. Live traffic has variable rates, bursts, and late events from devices in tunnels.
  • Watermark configuration requires production data. Too tight (1s) and you drop late events quietly. Too loose (60s) and you delay output for no reason. Start with 5s for server events, 30s for mobile.
  • Consumer lag is the single most important streaming metric. Rising lag means you’re falling behind. Alert on lag exceeding your business’s stale-data tolerance.
  • Exactly-once carries a real throughput penalty. Idempotent at-least-once with dedup delivers the same correctness for the vast majority of use cases at much higher throughput.
  • Savepoints before every deployment are non-negotiable. Skip it once and you lose state exactly when it matters most.

Late events, state management, watermark tuning, backpressure, consumer rebalancing. Streaming problems need streaming solutions and serious data engineering discipline.

[Figure] Consumer Lag: Growth, Detection, and Auto-Scaling Recovery — a producer traffic spike pushes consumer lag from near zero to four minutes; auto-scaling adds new consumer instances and lag recovers back toward zero.

Why Streaming Is Architecturally Different

Batch pipelines process a bounded dataset. “All records with a timestamp in the last hour.” The dataset has edges. You know when you’re done. A bucket you fill and empty. Streaming pipelines process an unbounded dataset: every event as it arrives, forever. The river. One difference, but it rewrites every assumption about the programming model.

Stateful aggregations need state that persists across individual events. Counting events per user over a sliding 30-minute window. Computing a running average. Maintaining session state. In batch, this is trivial: load all records for the period, aggregate in memory. In streaming, state has to be maintained one event at a time, checkpointed to durable storage (Flink uses RocksDB backed by S3) so it survives restarts, and bounded carefully. A state store tracking “events per user” grows with your user base. At 10 million active users with 30-minute windows, that’s 10 million state entries updating on every event. Without TTL-based state expiration, memory grows until your Flink TaskManagers OOM and take the whole pipeline down.
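
A minimal sketch of what that TTL looks like in Flink's Java API. The 30-minute TTL, the state name, and the surrounding function are illustrative assumptions, not values from the pipeline above.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Inside open() of a keyed function (e.g., a KeyedProcessFunction).
// Expire per-user counters 30 minutes after the last write (assumed window size).
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.minutes(30))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)            // reset TTL on every update
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .cleanupInRocksdbCompactFilter(1000)                                   // purge expired entries during compaction
    .build();

ValueStateDescriptor<Long> eventsPerUser =
    new ValueStateDescriptor<>("events-per-user", Long.class);
eventsPerUser.enableTimeToLive(ttlConfig);   // without this, state grows with the user base until OOM
```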

Windowing solves the time boundary problem, but window type choice directly determines what the aggregation means. Pick the wrong one and you get subtly incorrect results. The kind that live in production for weeks before anyone notices the numbers are off.

Window Type | Behavior | Best For | Watch Out For
Tumbling | Non-overlapping fixed periods (10:00-10:05, then 10:05-10:10) | Periodic aggregations, billing intervals | Edge-of-window events split across two windows
Sliding | Moves continuously (10-min window, 1-min slide) | Trend detection, moving averages | Higher CPU cost; each event participates in multiple windows
Session | Groups by inactivity gap (30 min silence closes session) | User journey analysis, clickstream | Variable-length windows complicate downstream joins
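
For concreteness, here is roughly how the three window types map onto Flink's Java DataStream API. The `events` stream, the key selector, and the aggregation functions are assumed names for illustration:

```java
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Tumbling: non-overlapping 5-minute buckets (billing, periodic aggregation)
events.keyBy(e -> e.userId)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .sum("amount");                             // assumes a POJO field named "amount"

// Sliding: 10-minute window recomputed every minute (trends, moving averages)
events.keyBy(e -> e.userId)
      .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
      .aggregate(new RunningAverage());           // hypothetical AggregateFunction

// Session: 30 minutes of silence closes the window (clickstream sessions)
events.keyBy(e -> e.userId)
      .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
      .process(new SessionSummary());             // hypothetical ProcessWindowFunction
```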

Watermarks manage the gap between event time and processing time. An event produced at 10:00:00 might arrive at 10:00:03 from network delay. Or at 10:01:30 because the phone was in a tunnel. The dam’s patience for late drops. Watermarks tell the processor how long to wait before closing a window. Too tight (1 second) and you quietly drop legitimate late events. Too loose (60 seconds) and you delay output uselessly. Getting this right requires analyzing your data’s actual arrival distribution in production. Staging approximations are fiction. Start with a 5-second watermark for server-side events, 30 seconds for mobile/IoT. Tune from there with real production data.
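
A sketch of the corresponding watermark setup, assuming a Kafka source of `Event` records carrying an `eventTimeMillis` field (`env` and `kafkaSource` are assumed to exist):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Server-side events: tolerate up to 5 seconds of out-of-orderness
WatermarkStrategy<Event> serverSide = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, recordTimestamp) -> event.eventTimeMillis);

// Mobile/IoT events: start looser (30s) and tighten once real arrival skew is measured
WatermarkStrategy<Event> mobile = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30))
    .withTimestampAssigner((event, recordTimestamp) -> event.eventTimeMillis)
    .withIdleness(Duration.ofMinutes(1));   // don't stall watermarks on quiet partitions

DataStream<Event> stream = env.fromSource(kafkaSource, serverSide, "events");
```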

[Figure] Kappa Architecture: One Path for Real-Time and Replay — event sources (application events, database CDC, IoT sensors) flow into Kafka's immutable log; Flink processes both the live and replay paths with RocksDB state; results serve an OLAP store (ClickHouse) and a feature cache (Redis). One processing path: replay is just "start from an earlier offset."

Consumer Groups, Lag, and the Metrics That Matter

Kafka consumer groups distribute partition consumption across multiple instances. Scale up consumers and partitions rebalance. A consumer fails and its partitions reassign to survivors. Sounds smooth, but rebalancing is disruptive in practice: in-flight processing pauses until rebalancing finishes, then resumes from the last committed offset.

For stateful consumers, rebalancing gets harder. The state tied to a partition’s event history has to move with the partition. Flink and Kafka Streams handle this automatically. Custom consumers? You handle it yourself. The complexity is routinely underestimated, and honestly, this alone is one of the strongest arguments for Flink over a custom consumer framework for anything stateful.

Consumer lag is the gap between the latest produced offset and the latest consumed offset. Rising lag means the pipeline is falling behind. A sudden spike means a processing bottleneck. Pairing lag monitoring with observability tooling catches trends before they become business incidents.

Rule of thumb: your pipeline should process events within 2-3x its configured window size under normal load. For fraud detection, alert on lag exceeding 30 seconds. For analytics, 5 minutes might be acceptable. Define the threshold based on how quickly stale results start hurting the business.

Anti-pattern

Don’t: Alert only when consumer lag crosses a fixed high-water mark (e.g., 10 minutes). By then the pipeline is already minutes behind and users are affected.

Do: Alert on the lag rate of change. If lag increases by more than 10 seconds per minute for 3 consecutive minutes, something is degrading. You catch the trend before it becomes a business incident.
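
A sketch of that rate-of-change check, assuming total lag is sampled once per minute; the 10-seconds-per-minute and 3-sample thresholds come straight from the rule above. In practice this logic usually lives in your alerting system as a query over lag metrics rather than in application code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Flags degradation when lag grows faster than a threshold for N consecutive samples. */
final class LagTrendDetector {
    private static final long MAX_GROWTH_PER_MINUTE = 10;  // seconds of lag gained per minute
    private static final int  CONSECUTIVE_SAMPLES   = 3;

    private final Deque<Long> recentLagSeconds = new ArrayDeque<>();

    /** Call once per minute with the current total consumer lag in seconds. */
    boolean record(long lagSeconds) {
        recentLagSeconds.addLast(lagSeconds);
        if (recentLagSeconds.size() > CONSECUTIVE_SAMPLES + 1) {
            recentLagSeconds.removeFirst();
        }
        if (recentLagSeconds.size() <= CONSECUTIVE_SAMPLES) {
            return false;                           // not enough history yet
        }
        Long previous = null;
        for (long lag : recentLagSeconds) {
            if (previous != null && lag - previous <= MAX_GROWTH_PER_MINUTE) {
                return false;                       // at least one healthy interval
            }
            previous = lag;
        }
        return true;                                // lag grew >10s/min for 3 straight minutes
    }
}
```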

The Exactly-Once Trade-Off

Exactly-once delivery sounds like the obvious correctness target, but the cost is steep. Every message requires transactional coordination between broker and state store. In practice with Flink, throughput drops noticeably and per-record latency increases compared to at-least-once.

Most pipelines don’t actually need that guarantee.

Idempotent at-least-once is usually the better trade-off. Design your processing operations to be safe when applied multiple times with the same input. A lock that stays locked no matter how many times you turn the key. Store a dedup set of recently seen event IDs (keep the last 24 hours in Redis or RocksDB). Duplicate arrives after failure recovery? The dedup check catches it. The pipeline might process an event twice in rare failure scenarios, but it handles it correctly both times. And it runs a lot faster.
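
One way to sketch that dedup inside Flink itself: key the stream by event ID and use TTL'd keyed state as the 24-hour "recently seen" set. `Event` and `eventId` are assumed names; a Redis SET with NX and a 24-hour expiry plays the same role outside Flink:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Drops duplicates of events already seen in the last 24 hours. Key the stream by event ID. */
final class DedupByEventId extends KeyedProcessFunction<String, Event, Event> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(24)).build();
        ValueStateDescriptor<Boolean> desc =
            new ValueStateDescriptor<>("seen-event-ids", Boolean.class);
        desc.enableTimeToLive(ttl);                 // bounds the dedup set to ~24h of IDs
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (seen.value() == null) {                 // first time this event ID appears in 24h
            seen.update(true);
            out.collect(event);                     // duplicates after failure recovery never get past here
        }
    }
}

// Usage: events.keyBy(e -> e.eventId).process(new DedupByEventId());
```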

Delivery Guarantee | Throughput | Latency | Operational Complexity | Best For
At-most-once | Highest | Lowest | Minimal | Metrics, analytics where drops are acceptable
At-least-once + idempotent dedup | High | Low | Moderate (dedup store) | Most production pipelines, fraud detection
Exactly-once | Reduced | Higher | Significant (transactional coordination) | Financial settlement, regulatory reporting

Financial settlement and regulatory reporting often do need true exactly-once. Double-counting a payment has legal consequences. Analytics aggregation almost always works fine with idempotent at-least-once. Make this decision explicitly at architecture time. Not as a performance surprise after deployment.
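
In Flink's Kafka sink, the guarantee is a one-line configuration choice, which is exactly why it deserves an explicit decision at architecture time. A sketch with placeholder broker and topic names:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("kafka:9092")                        // placeholder broker
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("fraud-alerts")                             // placeholder topic
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    // Most pipelines: at-least-once, paired with the dedup sketched above
    .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
    // Financial settlement / regulatory reporting: exactly-once, which requires Kafka
    // transactions plus a transactional ID prefix, and pays the throughput penalty
    // .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    // .setTransactionalIdPrefix("fraud-pipeline")
    .build();
```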

State Store Recovery and Savepoints

When a stream processor restarts, it must recover its in-memory state from durable storage. No shortcuts. Flink uses RocksDB as its state backend, with periodic checkpoints written to S3 or HDFS. On restart, it restores from the last successful checkpoint, then replays events from the matching Kafka offset.

Prerequisites
  1. RocksDB state backend configured with incremental checkpointing enabled
  2. Checkpoint storage (S3 or HDFS) accessible from all TaskManagers with write permissions
  3. Checkpoint interval set based on state size (60s default, tune for jobs with 50GB+ state)
  4. Savepoint trigger built into CI/CD pipeline for streaming job deployments
  5. Monitoring on checkpoint duration and size to detect state growth before OOM

Checkpoint interval is a direct trade-off: recovery time versus processing overhead. A 30-second interval means at most 30 seconds of reprocessing after a crash, but checkpointing that often adds CPU and I/O overhead. A 5-minute interval means 5 minutes of reprocessing but lower steady-state cost. Start with 60-second checkpoints for most production workloads. Tune based on state size. A job with 50GB of RocksDB state takes far longer to checkpoint than one with 500MB.
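
A minimal checkpointing setup matching those prerequisites, in Flink's Java API. The S3 path, intervals, and timeout are illustrative starting points, not tuned values:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// RocksDB state backend with incremental checkpoints (only changed files are uploaded)
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

// 60-second checkpoints as a starting point; tune against checkpoint duration and size metrics
env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoints/fraud-pipeline/");  // placeholder bucket
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);   // let the job breathe between snapshots
env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000);       // fail slow checkpoints instead of piling up
```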

Savepoints are manually triggered checkpoints for planned maintenance. Upgrading the application, migrating state schema, scaling parallelism. The workflow: trigger savepoint, stop job, deploy new version, resume from savepoint. New version picks up exactly where the old one left off. Zero data loss. Minimal reprocessing.

Run flink savepoint $JOB_ID s3://checkpoints/savepoints/ before every deployment. Make it non-negotiable. Skip it once and you lose state exactly when it matters most.

[Figure] Flink Checkpoints: Automatic Recovery Points — events flow through operators continuously; every 1-5 minutes an asynchronous, non-blocking checkpoint snapshots state to S3/HDFS; after a node crash or OOM, the job restores from the last checkpoint and replays from the matching Kafka offset, recovering in seconds with no data loss. Checkpoints are the undo button for streaming; without them, failure means replay from zero.

The Staging Lie: the gap between streaming behavior in staging (replayed events arriving in a predictable stream from disk) and production (variable arrival rates, bursty patterns, late events from mobile devices in tunnels). Every team that tests streaming in staging and deploys with confidence discovers this gap within the first week of production traffic.

When Streaming Is Not Worth the Complexity

Not every data pipeline needs to be real-time. The complexity tax is real, and paying it when batch would suffice is a common and expensive architectural mistake.

Choose Streaming | Choose Micro-Batch | Choose Batch
Sub-60-second latency is a hard business need | 1-5 minute latency is acceptable | Hourly or daily latency is fine
Event ordering matters across producers | Events are independent and idempotent | Processing order is irrelevant
Stateful aggregations drive real-time decisions | Simple transformations dominate | Complex joins across large datasets
Fraud detection, live inventory, personalization | Near-real-time dashboards, alerting | Reporting, ML training, data warehouse loads

Spark Structured Streaming in micro-batch mode delivers most of the value of “real-time” with a fraction of the operational overhead. If your stakeholders say “real-time” but actually mean “within 5 minutes,” micro-batch is the pragmatic choice. Push back on requirements before committing to the complexity.
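
If micro-batch is the right call, the Spark side stays deliberately simple. A sketch in Spark's Java API, with placeholder broker, topic, and paths:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

SparkSession spark = SparkSession.builder().appName("near-real-time-dashboards").getOrCreate();

// Read the event stream from Kafka
Dataset<Row> events = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")    // placeholder broker
    .option("subscribe", "events")                       // assumed topic name
    .load();

// Micro-batches every 5 minutes: "real-time enough" for dashboards and alerting
events.writeStream()
    .format("parquet")
    .option("path", "s3://warehouse/events/")            // placeholder path
    .option("checkpointLocation", "s3://warehouse/checkpoints/events/")
    .trigger(Trigger.ProcessingTime("5 minutes"))
    .start();
```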

What the Industry Gets Wrong About Streaming Architecture

“Kafka makes it real-time.” Kafka is a log. It doesn’t make your processing real-time. If your consumer processes events in micro-batches every 30 seconds, you have a slightly faster batch system with Kafka in front. Real-time requires stream processing (Flink, Kafka Streams) with windowing, watermarks, and state management.

“Exactly-once is the correct default.” Exactly-once carries a real throughput penalty. Idempotent at-least-once with a deduplication set delivers the same correctness for nearly every pipeline that doesn’t involve money. Reserve exactly-once for financial settlement where double-counting has legal consequences.

Our take: Tune watermarks from production data, not staging. The arrival distribution of events in production is completely different from replayed data on disk. Start with a 5-second watermark for server-side events and 30 seconds for mobile/IoT. Monitor late-event drop rates for the first two weeks and adjust. Staging watermark tuning is fiction.

Same fraud pipeline. Same production traffic. Consumer lag holds at 2 seconds instead of climbing to 4 minutes. The river flows through the dam. Watermarks tuned to actual event skew. State checkpointed so recovery fits inside the SLA. The card block fires in 5 seconds, as designed, because the architecture was built for unbounded data from the start.

Consumer Lag Is Climbing and Nobody Noticed

Streaming design decisions made in week one calcify into production constraints within months. Watermark handling, state management across checkpoints, and failure recovery that actually fits your SLA are what production streaming requires.

Frequently Asked Questions

When does real-time streaming justify its complexity over micro-batch?

Real-time streaming earns its complexity when business decisions need to happen within seconds: fraud detection, live inventory, or personalization where value drops fast after 60 seconds. If you can tolerate 5-minute latency, Spark Structured Streaming in micro-batch mode gives most of the benefit with far less operational pain. The tipping point is sub-60-second latency as a real business need, not just a preference.

What is the difference between Kafka and Kinesis?

Kafka (self-hosted or via Confluent/MSK) offers control over partition count, retention, and consumer groups but needs operational expertise. Kinesis has simpler operations with tighter AWS integration but is limited to 1MB messages, 7-day max retention, and fewer consumer group options. For AWS-native teams without Kafka expertise, Kinesis is pragmatic. For multi-cloud or volumes above 1GB/s, Kafka gives necessary control.

What are exactly-once semantics and why do they cost performance?

Exactly-once means each event gets processed once even across failures, which needs transactional coordination between broker and state store. This visibly cuts Flink throughput and adds real latency. Most production systems use idempotent at-least-once instead: operations designed to be safe when run more than once, deduplicated by event ID. Same correctness, much higher throughput for most use cases.

What is the difference between Lambda and Kappa streaming architectures?

Lambda runs two parallel paths: a batch layer for accurate historical results and a speed layer for low-latency approximate results, merged at query time. Kappa removes the batch layer, using a single streaming system for real-time and historical reprocessing via log replay. Kappa cuts complexity by removing a second code path. Lambda is occasionally needed when batch and streaming accuracy needs can’t be met in a single processor.

What is backpressure and why does it matter?

Backpressure is how a slow consumer tells upstream to slow down. Without it, a producer outpacing its consumer fills up buffer memory, then drops events quietly. Apache Flink handles backpressure natively through network buffer management. Without backpressure handling, a 2x traffic spike causes silent data loss. These systems fail in the worst way: they look like they’re running while dropping records.