Event-Driven Architecture: Schema Governance
Your analytics team has been complaining for two weeks that dashboard numbers look off. You investigate. The story unravels slowly: six weeks ago, an application team refactored the orders service and renamed a field in the orders.created event from customer_id to customerId. The refactoring passed code review. The service’s own tests passed. The deployment was clean. But the analytics consumer, which expects customer_id, started failing on deserialization and sending events to the DLQ. Nobody set up DLQ depth alerting. The DLQ now has 2.3 million unprocessed events. The business has been making decisions on incomplete order data for six weeks without anyone noticing.
Someone changed their address and didn’t tell anyone who writes to them. Six weeks of mail sitting in the dead letter office. Nobody checked.
- Schema registries prevent the #1 event architecture failure: field renames that quietly break downstream consumers. Enforce compatibility at the broker level, not in docs.
- Dead letter queues without depth alerting are invisible failure sinks. Events pile up for weeks with nobody noticing until business metrics look wrong.
- Event sourcing and CQRS solve different problems. Event sourcing gives you an immutable audit trail. CQRS separates read/write models. Neither requires the other.
- At-least-once with idempotent handlers beats exactly-once for most use cases. Lower latency, higher throughput, same correctness.
- Start with events for integration between services, not for everything. Synchronous APIs remain better for request/response where the caller needs an immediate answer.
That opening incident was not a Kafka problem. No schema registry, no DLQ alerting, no data contracts. The governance layer is the difference between sustainable data engineering and collapse at 200 topics.
Schema Registry as Critical Infrastructure
The schema registry sits between “a developer pushed a change” and “that change reaches production consumers.” Without a registry, those are the same moment. With a registry, an automated checkpoint catches breaking changes before they break anything. The postal service’s address validator. Wrong format? Letter rejected before it ships.
| Compatibility Mode | Allows | Blocks | Use When |
|---|---|---|---|
| BACKWARD | Add optional fields, remove fields | Add required fields, change types | Consumers upgrade before producers |
| FORWARD | Add fields, remove optional fields | Remove required fields | Producers upgrade before consumers |
| FULL | Add/remove optional fields only | Any required field change | Both sides upgrade independently |
| NONE | Anything | Nothing | Development only. Never production. |
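Compatibility is configured per subject, and it is worth pinning explicitly rather than relying on the registry default. A minimal sketch using the Schema Registry REST config endpoint; the registry URL and the subject name (assuming the default naming of topic name plus "-value") are placeholders:

```python
# Sketch: pin BACKWARD compatibility on one subject via the Schema Registry
# REST config endpoint. Registry URL and subject name are placeholders.
import requests

REGISTRY_URL = "http://schema-registry:8081"   # assumed registry address
SUBJECT = "orders.created-value"               # assumes topic name + "-value" subject naming

resp = requests.put(
    f"{REGISTRY_URL}/config/{SUBJECT}",
    json={"compatibility": "BACKWARD"},
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # {"compatibility": "BACKWARD"} on success
```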
Before publishing events with a new schema version, the producer registers it with the registry. The registry checks the new version against the configured compatibility mode. If the change would break consumers (removing a required field, changing a field type, renaming a field without a transition period) the registry rejects the registration. The producer literally can’t publish events with the breaking schema.
Nobody reads documentation under deadline pressure. Everyone hits a CI gate that blocks their merge. The customer_id to customerId rename from the intro? A schema registry configured with BACKWARD compatibility would have rejected that registration right away. The developer would have seen the error in CI, understood the impact, and followed the versioning process. The friction is the entire point. (Speed bumps aren’t bugs. They’re features.)
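What that CI gate can look like: a short sketch that posts the candidate schema to the registry's compatibility-check endpoint and fails the build if the registry says no. The registry URL, subject name, and schema file path are illustrative assumptions:

```python
# Sketch of a CI gate: check a candidate Avro schema against the latest
# registered version and block the merge if it is incompatible.
import sys
import requests

REGISTRY_URL = "http://schema-registry:8081"    # assumed registry address
SUBJECT = "orders.created-value"                # assumed subject name
SCHEMA_PATH = "schemas/order_created.avsc"      # assumed path in the repo

with open(SCHEMA_PATH) as f:
    candidate_schema = f.read()

resp = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    json={"schema": candidate_schema},
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    timeout=10,
)
resp.raise_for_status()

if not resp.json().get("is_compatible", False):
    print(f"Breaking change detected for {SUBJECT}: follow the v2 topic process instead.")
    sys.exit(1)   # non-zero exit fails the CI job and blocks the merge
print("Schema change is compatible.")
```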
Confluent Schema Registry (source-available under the Confluent Community License) is the most common implementation for Kafka, handling 100K+ schemas in large deployments. AWS Glue Schema Registry provides the same capabilities for Kinesis and MSK. Setup takes a day. The incidents it prevents are worth months of engineering time.
Event Versioning Discipline
Schema registries handle additive changes automatically. Adding an optional field? Ship it. The registry accepts it, consumers ignore what they don’t recognize, and life goes on.
The trouble starts when a breaking change is genuinely unavoidable. Business models evolve. Data models follow. When that happens, explicit versioning through a dual-publish pattern lets producers and consumers migrate on their own timelines. The OrderCreated schema below grounds the difference between the two kinds of change.
{
"type": "record",
"name": "OrderCreated",
"namespace": "com.company.orders",
"fields": [
{"name": "order_id", "type": "string"},
{"name": "customer_id", "type": "string"},
{"name": "total_cents", "type": "long"},
{"name": "currency", "type": "string", "default": "USD"},
{"name": "created_at", "type": "long", "logicalType": "timestamp-millis"},
{"name": "source_system", "type": ["null", "string"], "default": null}
]
}
The source_system field above is additive and optional. A non-breaking change that the registry handles automatically. Breaking changes need a different approach entirely.
Don’t: Override schema registry compatibility checks to push a breaking change quickly. Every “just this once” override has caused a production incident. Every single one.
Do: Create a v2 topic. Dual-publish for 30-60 days. Consumers migrate at their own pace. More ceremony, far fewer fires.
Safe changes: adding a new optional field, adding a new optional enum value, documentation-only updates. Unsafe changes: removing a field consumers depend on, changing a field type (integer to string), changing timestamp formats, renaming a field. These are almost never compatible without a versioned transition. Don’t convince yourself otherwise.
For breaking changes: create orders.created.v2 as a new topic. The producer publishes to both during a 30-60 day transition. Consumers migrate at their own pace. Once everyone’s on v2, deprecate and eventually delete v1. More work than in-place migration? Yes. Far less work than debugging six simultaneous consumer failures from an incompatible schema change that shipped without warning.
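A sketch of the dual-publish window from the producer's side, assuming the confluent-kafka Python client. Topic names, the hypothetical v2 field restructuring, and plain JSON serialization are illustrative simplifications; in practice each topic would carry its own registered Avro schema:

```python
# Sketch: during the 30-60 day transition, every order is published to both
# the v1 and v2 topics so consumers can migrate on their own schedule.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})  # assumed broker address

def publish_order_created(order: dict) -> None:
    v1_event = {
        "order_id": order["id"],
        "customer_id": order["customer_id"],        # original field name, kept for v1 consumers
        "total_cents": order["total_cents"],
    }
    v2_event = {
        "order_id": order["id"],
        "customer": {"id": order["customer_id"]},   # hypothetical v2 restructuring
        "total_cents": order["total_cents"],
        "currency": order.get("currency", "USD"),
    }
    producer.produce("orders.created", key=order["id"], value=json.dumps(v1_event))
    producer.produce("orders.created.v2", key=order["id"], value=json.dumps(v2_event))
    producer.flush()

publish_order_created({"id": "o-123", "customer_id": "c-456", "total_cents": 2599})
```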
Codify this versioning process in infrastructure-as-code templates so every team follows the same discipline.
The Data Contract Prerequisite
An event-driven architecture without explicit contracts between producers and consumers is an architecture where any producer change surprises everyone downstream. You changed the letter format without telling anyone who reads your letters. The contract doesn’t need to be a heavyweight document. It needs to cover four things: the event schema (enforced by the registry), the delivery SLA, the meaning of each field (what does status: completed actually mean for a payment event? what does null mean for discount_code?), and the notification process for schema changes.
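The contract can live as a small, reviewable artifact next to the schema. A sketch of what that might look like; the structure and wording here are illustrative, not a standard format:

```python
# Sketch of a lightweight data contract record for orders.created.
# Field names and values are illustrative placeholders.
ORDERS_CREATED_CONTRACT = {
    "event": "orders.created",
    "schema_subject": "orders.created-value",   # the schema itself is enforced by the registry
    "delivery_sla": "99.9% of events visible to consumers within 60 seconds",
    "field_semantics": {
        "status=completed": "payment captured, not merely authorized",
        "discount_code=null": "no discount applied (never an empty string)",
    },
    "schema_change_notice": "breaking changes announced 30 days ahead to registered consumers",
}
```

However the contract is written down, the checklist below is the minimum governance surface to have in place before topic count grows: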
- Schema registry deployed and enforcing BACKWARD or FULL compatibility on all production topics
- DLQ depth alerting set to fire on any depth above zero
- Consumer lag monitoring alerting within 5 minutes of threshold breach
- Topic ownership metadata registered in a service catalog (Backstage, DataHub, or equivalent)
- Breaking change process documented and enforced through CI gates
Consumer discovery is the organizational piece that schema registries can’t enforce. When an application team is about to change an event schema, they need to identify downstream consumers and communicate the timeline. You need to know who’s getting your mail before you change the format. Tools like Backstage or DataHub provide this: given a topic name, show which teams have registered consumers. Without this, producer teams genuinely don’t know who they’d break. Build the tooling or accept that teams will break each other.
Observability into consumer lag and DLQ depth provides the feedback loop that makes contract violations visible in minutes rather than weeks. Wire these three alerts from day one (a monitoring sketch follows the list):
- DLQ depth > 0: any events in the DLQ means something failed. Alert right away.
- Consumer lag > threshold: lag increasing after a deployment means the new schema is causing processing delays. Alert within 5 minutes.
- Consumer throughput drop > 20%: a sudden drop in events processed per second means the consumer is quietly rejecting events. Alert within 5 minutes.
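A sketch of how the first two signals could be read with the confluent-kafka Python client, from broker watermarks and committed offsets. Broker address, topic names, group ids, and the lag threshold are placeholders, and in production these numbers feed an alerting system rather than print statements; the throughput signal usually comes from the consumer's own metrics.

```python
# Sketch: DLQ depth (messages retained in the topic) and consumer lag
# (high watermark minus committed offset), read via the Kafka broker APIs.
from confluent_kafka import Consumer, TopicPartition

BROKERS = "kafka:9092"  # assumed broker address

def _probe(group_id: str) -> Consumer:
    return Consumer({"bootstrap.servers": BROKERS, "group.id": group_id,
                     "enable.auto.commit": False})

def topic_depth(topic: str) -> int:
    """Messages currently sitting in a topic: high minus low watermark, summed over partitions."""
    c = _probe("depth-monitor")
    meta = c.list_topics(topic, timeout=10)
    depth = 0
    for p in meta.topics[topic].partitions:
        low, high = c.get_watermark_offsets(TopicPartition(topic, p), timeout=10)
        depth += high - low
    c.close()
    return depth

def group_lag(topic: str, group_id: str) -> int:
    """Total lag of a consumer group on a topic."""
    c = _probe(group_id)
    meta = c.list_topics(topic, timeout=10)
    partitions = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]
    lag = 0
    for tp in c.committed(partitions, timeout=10):
        low, high = c.get_watermark_offsets(tp, timeout=10)
        committed = tp.offset if tp.offset >= 0 else low   # nothing committed yet
        lag += max(high - committed, 0)
    c.close()
    return lag

if topic_depth("orders.created.dlq") > 0:                       # alert on any depth above zero
    print("ALERT: events are piling up in the DLQ")
if group_lag("orders.created", "analytics-consumer") > 10_000:  # assumed lag threshold
    print("ALERT: analytics consumer is falling behind")
```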
These three alerts would have caught the customer_id incident from the intro within minutes instead of six weeks. (Six weeks. That’s not a bug. That’s a relationship problem.)
When Events Are (and Aren’t) the Right Pattern
Events solve temporal coupling. They don’t solve all coupling. Choosing wrong between synchronous and asynchronous communication is one of the most expensive architectural mistakes because it’s hard to reverse once consumers depend on the pattern.
| When events fit | When they don’t |
|---|---|
| One producer, many consumers who process independently | Request-response where the caller needs an immediate answer |
| Consumers can tolerate seconds to minutes of latency | Sub-millisecond latency between services |
| Event ordering matters and replay adds value | Simple CRUD operations with a single consumer |
| Audit trail of state changes is a business requirement | Tight transactional consistency across services |
| Services evolve and deploy on independent schedules | Coordination logic that needs synchronous orchestration |
Not everything should be a letter. Some conversations need a phone call.
What the Industry Gets Wrong About Event-Driven Architecture
“Events solve coupling.” Events solve temporal coupling. They can create a different, less visible kind: implicit schema dependencies between producers and consumers that nobody tracks. Without a schema registry, event-driven architectures trade visible REST contracts for invisible event contracts. The coupling is still there. It just hides better. (Hiding coupling is not removing coupling.)
“Exactly-once delivery is the right default.” Exactly-once carries a measurable throughput penalty and adds operational complexity through transaction coordinators. Idempotent at-least-once delivers the same correctness for most use cases. Some letters arrive twice. You throw away the duplicate. Reserve exactly-once for financial settlement and regulatory reporting where double-counting has legal consequences. Registered mail for legal documents. Regular mail for everything else.
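A minimal sketch of what an idempotent at-least-once handler looks like; the event shape and helper are illustrative, and the in-memory set stands in for a durable deduplication store (a database unique constraint or a key-value store in practice):

```python
# Sketch: duplicates are detected by event id and dropped, so a redelivered
# event cannot double-count. Event shape and record_order are illustrative.
processed_ids: set[str] = set()   # stand-in for durable storage

def record_order(event: dict) -> None:
    print("recorded order", event["order_id"])   # the actual side effect goes here

def handle_order_created(event: dict) -> None:
    event_id = event["event_id"]
    if event_id in processed_ids:
        return                       # duplicate delivery: already processed, safe to skip
    record_order(event)
    processed_ids.add(event_id)      # mark done only after the side effect succeeds

# Simulate at-least-once delivery: the second call is a no-op.
evt = {"event_id": "evt-1", "order_id": "o-123"}
handle_order_created(evt)
handle_order_created(evt)
```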
“More topics means better decoupling.” Topic proliferation without governance produces the same mess as microservice proliferation without boundaries. Fifty topics nobody documents, ten of which overlap in meaning, three of which are abandoned but still receiving events. Opening 50 PO boxes and forgetting which is for what. Topic lifecycle management is as important as topic creation.
Serialization Format Comparison: Avro vs. Protobuf vs. JSON Schema
Avro provides binary serialization with strong evolution support. The schema is stored in the registry, not embedded in every message, which keeps payloads compact. Most common choice for Kafka because the registry enforces compatibility at publish time. Downside: requires the registry to be available for deserialization.
Protobuf offers more compact serialization with explicit field numbering, which provides naturally stable backward compatibility. Strong choice when the same events flow through gRPC services and Kafka. Downside: field renumbering is a breaking change that’s easy to miss.
JSON Schema is human-readable and debuggable with standard tools. Good for lower-volume topics where developer experience matters more than wire efficiency. Downside: no binary serialization means larger payloads and higher broker storage costs at scale.
For most Kafka deployments, Avro with Confluent Schema Registry is the path of least resistance. Switch to Protobuf if your services already use gRPC heavily and you want schema consistency across both transport layers.
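For a sense of how the registry participates at publish time on the Avro path, here is a sketch assuming the confluent-kafka Python client; the addresses and the trimmed schema are illustrative. The serializer registers or looks up the schema in the registry and embeds only a small schema id in each message, which is why payloads stay compact:

```python
# Sketch: producing Avro events with the schema held in the registry rather
# than embedded in each message. Addresses and schema are placeholders.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

ORDER_CREATED_SCHEMA = """
{"type": "record", "name": "OrderCreated", "namespace": "com.company.orders",
 "fields": [{"name": "order_id", "type": "string"},
            {"name": "customer_id", "type": "string"},
            {"name": "total_cents", "type": "long"}]}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serialize_value = AvroSerializer(registry, ORDER_CREATED_SCHEMA)
producer = Producer({"bootstrap.servers": "kafka:9092"})

event = {"order_id": "o-123", "customer_id": "c-456", "total_cents": 2599}
producer.produce(
    "orders.created",
    key=event["order_id"],
    value=serialize_value(event, SerializationContext("orders.created", MessageField.VALUE)),
)
producer.flush()
```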
The DLQ hit 2.3 million because a field rename passed code review with no schema registry to reject it. With schema governance in place, that rename fails at publish time, the producer team gets a compatibility error in CI, and the analytics dashboards never miss a single order. Same refactor. Completely different outcome. Same letter. The address validator caught it before it shipped.