Real-Time Streaming Architecture: Kafka, Flink, and Watermarks
Your fraud detection pipeline works perfectly in staging. You replay a week of production events, the windowed aggregations look right, the alerts fire on the correct patterns. You deploy to production feeling confident. Within days, someone notices the consumer lag graph trending steadily upward: 30 seconds, 60 seconds, 90 seconds. Within two weeks, the pipeline is 4 minutes behind real time. A legitimate fraud pattern that should trigger a card block within 5 seconds now produces its alert 4 minutes too late. The customer’s account has already been drained.
That is not a Kafka problem. That is not a code bug. It is a streaming problem. The processing logic that worked against replayed historical events (which arrive in a steady, predictable stream from disk) behaves completely differently against live traffic with variable arrival rates, bursty patterns, late events from mobile devices on spotty connections, and state that needed to survive the two Flink redeployments that happened that week.
The temptation to treat streaming as “fast batch” is understandable. You already have a batch pipeline that runs every hour. You want lower latency. So you add Kafka in front and run the pipeline every 30 seconds. The architecture looks similar, the team’s existing skills apply, and the latency improves. Until it does not. And when it stops improving, it degrades in ways batch never prepared you for.
Late events, state management, watermark tuning, backpressure, consumer group rebalancing. These are not batch problems. They are streaming problems, and they require streaming solutions. Effective data engineering is built around exactly this operational discipline.
Why Streaming Is Architecturally Different
Batch pipelines process a bounded dataset: “all records with a timestamp in the last hour.” The dataset has edges. You know when you are done. Streaming pipelines process an unbounded dataset: every event as it arrives, forever. That single difference changes everything about the programming model.
Stateful aggregations require state that persists across individual event processing. Counting events per user over a sliding 30-minute window, computing a running average, maintaining session state. In batch, this is trivial: load all records for the period and aggregate in memory. In streaming, state must be maintained incrementally, checkpointed to durable storage (Flink uses RocksDB backed by S3) so it survives process restarts, and bounded carefully. A state store that tracks “events per user” grows with your user base. At 10 million active users with 30-minute windows, that is 10 million state entries being updated on every event. Without TTL-based state expiration, memory grows until your Flink TaskManagers OOM. This is not a theoretical concern. It happens.
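A minimal sketch makes the TTL point concrete. This is not Flink's state API (Flink handles checkpointing and expiration through its state backend); it is an illustration of incrementally maintained per-user window counts where expired entries are dropped so state stays bounded:

```python
from collections import deque

# Illustrative sketch (not Flink's API): per-user event counts over a
# time window, with expiration so state does not grow without bound.
class PerUserWindowCount:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = {}  # user_id -> deque of in-window event timestamps

    def add(self, user_id, ts):
        q = self.events.setdefault(user_id, deque())
        q.append(ts)
        self._expire(user_id, ts)
        return len(q)  # current in-window count for this user

    def _expire(self, user_id, now):
        q = self.events[user_id]
        while q and q[0] <= now - self.window:
            q.popleft()
```

Without the `_expire` step, the per-user deques accumulate forever; this is the in-miniature version of the TaskManager OOM described above.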
Windowing addresses the time boundary problem. A tumbling window aggregates events in non-overlapping fixed periods (10:00-10:05, then 10:05-10:10). A sliding window moves continuously (10-minute window, slides every 1 minute). A session window groups events by activity with a configurable inactivity gap (30 minutes of silence closes the session). The windowing semantics directly determine the meaning of the aggregation. Choosing tumbling when you need sliding produces subtly incorrect results that live in production for weeks before anyone notices the numbers are off. Bugs of exactly this kind routinely survive multiple code reviews.
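The semantic difference is easiest to see in the window-assignment logic itself. This sketch (function names are ours, not Flink's; times in seconds) shows that a tumbling window assigns each event to exactly one window, a sliding window assigns it to several, and a session window is derived from the gaps in the data:

```python
def tumbling_window(ts, size):
    """Each event belongs to exactly one non-overlapping window."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Each event belongs to every window whose span covers it."""
    windows = []
    start = ts - (ts % slide)          # latest window start covering ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows

def session_windows(timestamps, gap):
    """Group sorted timestamps into sessions separated by > gap of silence."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] > gap:
            sessions.append(current)  # inactivity gap closes the session
            current = [ts]
        else:
            current.append(ts)
    sessions.append(current)
    return sessions
```

Note that an event in a 3-minute sliding window with a 1-minute slide lands in three windows, so every count is triple-reported relative to the tumbling version. That multiplicity is the "subtly incorrect results" failure mode when the two are confused.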
Watermarks manage the gap between event time and processing time. An event produced at 10:00:00 may arrive at 10:00:03 due to network delay, or at 10:01:30 because the mobile device was offline. The watermark tells the processor how long to wait before closing a window. Too tight (1 second) and you drop legitimate late events. Too loose (60 seconds) and you delay output unnecessarily. Getting watermark configuration right requires analyzing your data’s actual arrival distribution in production. Staging approximations will not cut it. Typical starting points: 5-second watermark for server-side events, 30-second watermark for mobile/IoT events.
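The mechanics reduce to a small amount of logic. This sketches a bounded-out-of-orderness watermark in the style of Flink's bounded-out-of-orderness strategy (Flink's implementation differs in details such as millisecond offsets; the class and function names here are illustrative):

```python
# Sketch of a bounded-out-of-orderness watermark. max_delay is how long
# we are willing to wait for late events, per the tuning discussion above.
class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_ts = float("-inf")

    def observe(self, event_ts):
        self.max_event_ts = max(self.max_event_ts, event_ts)
        return self.watermark()

    def watermark(self):
        # Claim: no event older than this will be accepted into a window.
        return self.max_event_ts - self.max_delay

def window_is_closed(window_end, watermark):
    """A window closes once the watermark passes its end time."""
    return watermark >= window_end
```

With `max_delay=1` (too tight), any mobile event arriving 90 seconds late is dropped; with `max_delay=60` (too loose), every window's output is delayed a full minute. The right value comes from the arrival distribution, exactly as the text argues.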
Consumer Group Management and Lag
Understanding how watermarks interact with windowing is essential for tuning streaming correctness. The watermark determines when the processor considers a window complete. Get this wrong and you either drop valid late events or delay output unnecessarily. Neither is acceptable in production.
Kafka consumer groups distribute partition consumption across multiple consumer instances. When you scale up consumers, partitions rebalance. When a consumer fails, its partitions reassign to surviving instances. Rebalancing is disruptive: in-flight processing pauses while rebalancing completes, then resumes from the last committed offset. With the cooperative rebalancing strategy (available since Kafka 2.4), the disruption is reduced to only the reassigned partitions rather than all partitions, but it still causes a processing pause.
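Opting into cooperative rebalancing is a configuration choice on the consumer. Shown here as a plain Python dict using librdkafka-style property names (as consumed by confluent-kafka-python; the broker address and group name are example values):

```python
# Consumer configuration enabling cooperative (incremental) rebalancing.
# Key names are librdkafka configuration properties; values are examples.
consumer_config = {
    "bootstrap.servers": "broker1:9092",          # example address
    "group.id": "fraud-detection",                # example group name
    # cooperative-sticky: only reassigned partitions pause in a rebalance,
    # instead of stopping the world for the whole group
    "partition.assignment.strategy": "cooperative-sticky",
    "enable.auto.commit": False,                  # commit only after processing
    "auto.offset.reset": "earliest",
}
```

For the Java client, the equivalent is setting the assignor to `CooperativeStickyAssignor`. Either way, every consumer in the group must agree on the strategy, which is why rolling this out to an existing group needs a careful two-step upgrade.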
For stateful consumers, rebalancing requires state migration. The state associated with a partition’s event history must move with the partition. Flink and Kafka Streams handle this automatically. If you are writing custom consumers, you need to handle it explicitly. The implementation complexity is routinely underestimated. Do not build this yourself unless you have a very good reason. This is one of the strongest arguments for using Flink rather than a custom consumer framework for anything stateful.
Consumer lag (the difference between the latest produced offset and the latest consumed offset) is the single most important operational metric for streaming pipelines. Full stop. Rising lag means you are falling behind. A sudden spike indicates a processing bottleneck. Teams running streaming pipelines must monitor lag continuously, not just alert when it is already critical. Pairing this with observability monitoring ensures you catch lag trends before they become business incidents.
The practical target: your pipeline should process events within 2-3x its configured window size under normal load. For fraud detection, alert on lag exceeding 30 seconds. For analytics aggregation, 5 minutes might be acceptable. Define the threshold based on how quickly stale results become harmful.
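Translating offset lag into a time-based alert threshold is a small calculation. This sketch (our function names; `events_per_second` is the observed production rate you would measure from broker metrics) shows the shape of it:

```python
# Sketch of lag-based alerting. Lag is latest produced offset minus latest
# committed offset, per partition; the alert threshold is expressed in
# seconds of staleness, per the targets in the text.

def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: events produced but not yet consumed."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0)
            for p in end_offsets}

def should_alert(lag_by_partition, events_per_second, max_lag_seconds):
    """Convert offset lag to time lag using the observed production rate."""
    total_lag = sum(lag_by_partition.values())
    lag_seconds = total_lag / events_per_second
    return lag_seconds > max_lag_seconds
```

For the fraud pipeline from the opening, `max_lag_seconds=30`; for an analytics aggregation, 300 might be fine. The point is that the threshold is a business decision expressed in seconds, not an arbitrary offset count.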
The Exactly-Once Trade-Off
Exactly-once delivery sounds like the obvious correctness target. Every event processed exactly once, even across failures. Correctness without compromise. But the cost is real: every message requires transactional coordination between the broker and the consumer’s state store. In our benchmarks with Flink, this reduces throughput by 20-40% and adds 5-15ms of latency per record compared to at-least-once. That is a steep price.
For most pipelines, idempotent at-least-once processing is the better trade-off. Design processing operations to be safe when applied multiple times with the same input. Store a deduplication set of recently seen event IDs (keep the last 24 hours in Redis or RocksDB). When a duplicate arrives after a failure recovery, the dedup check catches it. The result is a pipeline that may process an event twice in rare failure scenarios but handles it correctly both times, and runs significantly faster.
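The dedup pattern fits in a few lines. This sketch stands in an in-memory dict where production would use Redis or RocksDB (as the text suggests); the class and parameter names are ours:

```python
import time

# Sketch of idempotent at-least-once processing: a TTL'd seen-set of
# event IDs catches duplicates replayed after a failure recovery.
class IdempotentProcessor:
    def __init__(self, apply_fn, ttl_seconds=24 * 3600, clock=time.time):
        self.apply_fn = apply_fn
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}  # event_id -> first-seen timestamp

    def process(self, event_id, payload):
        now = self.clock()
        # Drop expired entries so the dedup set stays bounded (24h window).
        self.seen = {eid: ts for eid, ts in self.seen.items()
                     if now - ts < self.ttl}
        if event_id in self.seen:
            return False  # duplicate: already applied, safely skipped
        self.seen[event_id] = now
        self.apply_fn(payload)
        return True
```

The crucial requirement is that producers attach a stable event ID; dedup by payload hash or arrival time does not survive retries. With stable IDs, the pipeline may see an event twice but applies it once.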
The decision depends on business requirements. Financial settlement and regulatory reporting often need true exactly-once because double-counting a payment has legal consequences. Analytics aggregation pipelines almost always tolerate idempotent at-least-once. Make this decision explicitly at architecture time. Do not discover it as a performance constraint after deployment. By then, your architecture is locked in and the refactor is painful. For the infrastructure backbone underpinning these decisions, see our work on scalable cloud-native infrastructure.
State Store Recovery and Savepoints
When a stream processor restarts, it must recover its in-memory state from durable storage. There is no shortcut here. Flink uses RocksDB as its state backend, with periodic checkpoints written to S3 or HDFS. On restart, Flink restores from the last successful checkpoint, then replays events from the corresponding Kafka offset.
The checkpoint interval is a direct trade-off between recovery time and processing overhead. A 30-second interval means at most 30 seconds of event reprocessing after a crash, but checkpointing that frequently adds CPU and I/O overhead. A 5-minute interval means 5 minutes of reprocessing but lower steady-state overhead. For most production workloads, start with 60-second checkpoints. Tune from there based on your state size. A job with 50GB of RocksDB state takes longer to checkpoint than one with 500MB.
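The trade-off is worth putting numbers on. A back-of-envelope sketch, under the simplifying assumption that checkpoint cost scales linearly and ignoring Flink's incremental/asynchronous checkpointing optimizations (`checkpoint_duration_s` would be a measured value from your job):

```python
# Back-of-envelope model of the checkpoint interval trade-off:
# worst-case replay after a crash vs. steady-state checkpoint overhead.

def worst_case_reprocessed_events(checkpoint_interval_s, events_per_second):
    """A crash just before a checkpoint replays up to one full interval."""
    return checkpoint_interval_s * events_per_second

def checkpoint_overhead_fraction(checkpoint_interval_s, checkpoint_duration_s):
    """Fraction of wall-clock time spent writing checkpoints."""
    return checkpoint_duration_s / checkpoint_interval_s

# 60s interval at 10,000 events/s: up to 600,000 events replayed on recovery.
# 60s interval with 3s checkpoints: 5% steady-state overhead.
```

Shortening the interval shrinks the replay bound linearly while inflating the overhead fraction, which is why the 60-second starting point in the text is a midpoint, not a universal answer.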
Savepoints are manually triggered checkpoints used for planned maintenance: upgrading the application, migrating state schema, or scaling the job’s parallelism. The workflow is: trigger savepoint, stop job, deploy new version, resume from savepoint. The new version picks up exactly where the old one left off, with zero data loss and minimal reprocessing. This is Flink’s answer to the “how do I deploy without losing state” problem, and it is one of the primary reasons teams choose Flink over simpler streaming solutions for stateful workloads. Run flink savepoint $JOB_ID s3://checkpoints/savepoints/ before every deployment. Make it a non-negotiable part of your CI/CD pipeline for streaming jobs. Skip it once and you will lose state exactly when it matters most.
Streaming architecture is not fast batch. The unbounded data model, stateful processing requirements, watermark tuning, and failure recovery patterns are fundamentally different from anything batch pipelines prepare you for. Teams that treat streaming as a speed optimization over batch discover this gap in production. Consumer lag grows. Late events get silently dropped. State recovery takes longer than the SLA allows. By the time you realize the architecture is wrong, the system is already in production and users already depend on it. Respect the complexity from day one or budget for the rewrite later.