
Event-Driven Architecture: Schema Governance

Metasphere Engineering · 13 min read

Your analytics team has been complaining for two weeks that dashboard numbers look off. You investigate. The story unravels slowly: six weeks ago, an application team refactored the orders service and renamed a field in the orders.created event from customer_id to customerId. The refactoring passed code review. The service’s own tests passed. The deployment was clean. But the analytics consumer, which expects customer_id, started failing on deserialization and sending events to the DLQ. Nobody set up DLQ depth alerting. The DLQ now has 2.3 million unprocessed events. The business has been making decisions on incomplete order data for six weeks without anyone noticing.

Someone changed their address and didn’t tell anyone who writes to them. Six weeks of mail sitting in the dead letter office. Nobody checked.

Key takeaways
  • Schema registries prevent the #1 event architecture failure: field renames that quietly break downstream consumers. Enforce compatibility at the broker level, not in docs.
  • Dead letter queues without depth alerting are invisible failure sinks. Events pile up for weeks with nobody noticing until business metrics look wrong.
  • Event sourcing and CQRS solve different problems. Event sourcing gives you an immutable audit trail. CQRS separates read/write models. Neither requires the other.
  • At-least-once with idempotent handlers beats exactly-once for most use cases. Lower latency, higher throughput, same correctness.
  • Start with events for integration between services, not for everything. Synchronous APIs remain better for request/response where the caller needs an immediate answer.

This is not a Kafka problem. There was no schema registry, no DLQ alerting, no data contracts. The governance layer is the difference between sustainable data engineering and collapse at 200 topics.

Schema Registry as Critical Infrastructure

The schema registry sits between “a developer pushed a change” and “that change reaches production consumers.” Without a registry, those are the same moment. With a registry, an automated checkpoint catches breaking changes before they break anything. The postal service’s address validator. Wrong format? Letter rejected before it ships.

| Compatibility Mode | Allows | Blocks | Use When |
| --- | --- | --- | --- |
| BACKWARD | Add optional fields, remove fields | Add required fields, change types | Consumers upgrade before producers |
| FORWARD | Add fields, remove optional fields | Remove required fields | Producers upgrade before consumers |
| FULL | Add/remove optional fields only | Any required field change | Both sides upgrade independently |
| NONE | Anything | Nothing | Development only. Never production. |

Before publishing events with a new schema version, the producer registers it with the registry. The registry checks the new version against the configured compatibility mode. If the change would break consumers (removing a required field, changing a field type, renaming a field without a transition period) the registry rejects the registration. The producer literally can’t publish events with the breaking schema.

Nobody reads documentation under deadline pressure. Everyone hits a CI gate that blocks their merge. The customer_id to customerId rename from the intro? A schema registry configured with BACKWARD compatibility would have rejected that registration right away. The developer would have seen the error in CI, understood the impact, and followed the versioning process. The friction is the entire point. (Speed bumps aren’t bugs. They’re features.)
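To make the checkpoint concrete, here is a minimal sketch of what a BACKWARD compatibility check does. This is a toy version that only inspects field names, types, and defaults; real registries (Confluent, AWS Glue) apply the full Avro schema-resolution rules. The schemas shown mirror the customer_id rename from the intro.

```python
# Simplified BACKWARD compatibility check: can a consumer using the NEW
# schema still read events written with the OLD schema?
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """old_fields / new_fields map field name -> {"type": ..., "default": ...?}."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # A field absent from old events must carry a default,
            # or old events can't be deserialized under the new schema.
            if "default" not in spec:
                return False
        elif old_fields[name]["type"] != spec["type"]:
            # Type changes break deserialization of old events.
            return False
    return True

v1 = {"order_id": {"type": "string"}, "customer_id": {"type": "string"}}

# Rename customer_id -> customerId: old events lack customerId, no default.
rename = {"order_id": {"type": "string"}, "customerId": {"type": "string"}}

# Add an optional field with a default instead.
additive = dict(v1, source_system={"type": "string", "default": None})

print(is_backward_compatible(v1, rename))    # False: rejected
print(is_backward_compatible(v1, additive))  # True: accepted
```

The rename fails exactly the way the registry would fail it in CI: the new schema introduces a required field (customerId) that no old event carries.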

[Diagram: schema evolution — breaking change rejected by the registry, followed by the correct dual-publish versioning approach. The orders-service producer proposes renaming customer_id to customerId; the schema registry, in BACKWARD compatibility mode, rejects it as backward incompatible. The correct approach follows: create a v2 topic, dual-publish to v1 (customer_id) and v2 (customerId), migrate consumers (analytics, reporting, billing) one by one, then deprecate v1. All consumers land on v2 with zero breakage.]

Confluent Schema Registry is the most common implementation for Kafka, handling 100K+ schemas in large deployments. AWS Glue Schema Registry provides the same capabilities for Kinesis and MSK. Setup takes a day. The incidents it prevents are worth months of engineering time.

Event Versioning Discipline

Schema registries handle additive changes automatically. Adding an optional field? Ship it. The registry accepts it, consumers ignore what they don’t recognize, and life goes on.

The trouble starts when a breaking change is genuinely unavoidable. Business models evolve. Data models follow. When that happens, explicit versioning through a dual-publish pattern lets producers and consumers migrate on their own timelines.

{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "source_system", "type": ["null", "string"], "default": null}
  ]
}

The source_system field above is additive and optional. A non-breaking change that the registry handles automatically. Breaking changes need a different approach entirely.
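Why is additive-with-default safe? Because a consumer on the new schema fills missing fields from their defaults when reading an old event. The toy resolution logic below illustrates the idea; real Avro readers do this per the spec's schema-resolution rules, and the defaults shown are taken from the schema above.

```python
# Defaults declared by the new (v2) schema, keyed by field name.
NEW_SCHEMA_DEFAULTS = {"currency": "USD", "source_system": None}

def resolve(event: dict, defaults: dict) -> dict:
    """Return the event with schema defaults applied for absent fields."""
    resolved = dict(defaults)   # start from defaults...
    resolved.update(event)      # ...and let actual event fields win
    return resolved

# An old event written before currency and source_system existed.
old_event = {"order_id": "o-1", "customer_id": "c-9", "total_cents": 4200}
resolved = resolve(old_event, NEW_SCHEMA_DEFAULTS)
print(resolved["currency"], resolved["source_system"])  # USD None
```

Nothing breaks: the old event deserializes cleanly, with currency resolving to "USD" and source_system to null.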

[Diagram: event schema evolution, compatible vs breaking. Event v1 OrderCreated { orderId, amount, currency, timestamp }. Adding an optional field (customerId) in v2 is backward compatible — old consumers ignore the new field. Renaming amount to total in v2 is breaking — old consumers crash on the missing field. The schema registry enforces these compatibility checks before producers can publish new event versions; deploy without it and pray.]
Anti-pattern

Don’t: Override schema registry compatibility checks to push a breaking change quickly. Every “just this once” override has caused a production incident. Every single one.

Do: Create a v2 topic. Dual-publish for 30-60 days. Consumers migrate at their own pace. More ceremony, far fewer fires.

Safe changes: adding a new optional field, adding a new optional enum value, documentation-only updates. Unsafe changes: removing a field consumers depend on, changing a field type (integer to string), changing timestamp formats, renaming a field. These are almost never compatible without a versioned transition. Don’t convince yourself otherwise.

For breaking changes: create orders.created.v2 as a new topic. The producer publishes to both during a 30-60 day transition. Consumers migrate at their own pace. Once everyone’s on v2, deprecate and eventually delete v1. More work than in-place migration? Yes. Far less work than debugging six simultaneous consumer failures from an incompatible schema change that shipped without warning.
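The dual-publish window can be sketched in a few lines. The publish function, topic names, and field mappings below are illustrative stand-ins, not a real Kafka client API; the point is that one order fans out to both schema versions during the transition.

```python
def to_v1(order: dict) -> dict:
    """Legacy shape: snake_case field expected by unmigrated consumers."""
    return {"order_id": order["id"], "customer_id": order["customer"]}

def to_v2(order: dict) -> dict:
    """New shape: camelCase field for migrated consumers."""
    return {"order_id": order["id"], "customerId": order["customer"]}

def dual_publish(order: dict, publish) -> None:
    # During the 30-60 day window, every order goes to both topics.
    publish("orders.created", to_v1(order))      # legacy consumers
    publish("orders.created.v2", to_v2(order))   # migrated consumers

sent = []
dual_publish({"id": "o-1", "customer": "c-9"},
             lambda topic, event: sent.append((topic, event)))

for topic, event in sent:
    print(topic, event)
```

Both topics receive the order, so consumers migrate on their own timeline and nobody downstream breaks mid-transition.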

Codify this versioning process in infrastructure-as-code templates so every team follows the same discipline.

The Data Contract Prerequisite

An event-driven architecture without explicit contracts between producers and consumers is an architecture where any producer change surprises everyone downstream. You changed the letter format without telling anyone who reads your letters. The contract doesn’t need to be a heavyweight document. It needs to cover four things: the event schema (enforced by the registry), the delivery SLA, the meaning of each field (what does status: completed actually mean for a payment event? what does null mean for discount_code?), and the notification process for schema changes.

Prerequisites
  1. Schema registry deployed and enforcing BACKWARD or FULL compatibility on all production topics
  2. DLQ depth alerting set to fire on any depth above zero
  3. Consumer lag monitoring alerting within 5 minutes of threshold breach
  4. Topic ownership metadata registered in a service catalog (Backstage, DataHub, or equivalent)
  5. Breaking change process documented and enforced through CI gates

Consumer discovery is the organizational piece that schema registries can’t enforce. When an application team is about to change an event schema, they need to identify downstream consumers and communicate the timeline. You need to know who’s getting your mail before you change the format. Tools like Backstage or DataHub provide this: given a topic name, show which teams have registered consumers. Without this, producer teams genuinely don’t know who they’d break. Build the tooling or accept that teams will break each other.

Observability into consumer lag and DLQ depth provides the feedback loop that makes contract violations visible in minutes rather than weeks. Wire these three alerts from day one:

  1. DLQ depth > 0: any events in the DLQ means something failed. Alert right away.
  2. Consumer lag > threshold: lag increasing after a deployment means the new schema is causing processing delays. Alert within 5 minutes.
  3. Consumer throughput drop > 20%: a sudden drop in events processed per second means the consumer is quietly rejecting events. Alert within 5 minutes.
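The three alerts above reduce to threshold checks over polled metrics. The sketch below shows the evaluation logic; metric names and thresholds are illustrative assumptions, and in practice these live in your monitoring system (Prometheus rules, CloudWatch alarms, or equivalent).

```python
def evaluate_alerts(metrics: dict) -> list:
    """Return the list of day-one alerts that should fire for these metrics."""
    alerts = []
    if metrics["dlq_depth"] > 0:
        alerts.append("DLQ depth > 0: events are failing")
    if metrics["consumer_lag"] > metrics["lag_threshold"]:
        alerts.append("Consumer lag above threshold")
    if metrics["throughput"] < 0.8 * metrics["baseline_throughput"]:
        alerts.append("Throughput dropped > 20%: consumer may be rejecting events")
    return alerts

# The customer_id incident as these metrics would have shown it on day one:
fired = evaluate_alerts({
    "dlq_depth": 14,             # events already piling up in the DLQ
    "consumer_lag": 90_000,
    "lag_threshold": 10_000,
    "throughput": 0,             # analytics consumer stopped processing
    "baseline_throughput": 1_200,
})
print(fired)
```

All three alerts fire within the first polling interval, which is the difference between minutes and six weeks.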

These three alerts would have caught the customer_id incident from the intro within minutes instead of six weeks. (Six weeks. That’s not a bug. That’s a relationship problem.)

When Events Are (and Aren’t) the Right Pattern

Events solve temporal coupling. They don’t solve all coupling. Choosing wrong between synchronous and asynchronous communication is one of the most expensive architectural mistakes because it’s hard to reverse once consumers depend on the pattern.

| When events fit | When they don't |
| --- | --- |
| One producer, many consumers who process independently | Request-response where the caller needs an immediate answer |
| Consumers can tolerate seconds to minutes of latency | Sub-millisecond latency between services |
| Event ordering matters and replay adds value | Simple CRUD operations with a single consumer |
| Audit trail of state changes is a business requirement | Tight transactional consistency across services |
| Services evolve and deploy on independent schedules | Coordination logic that needs synchronous orchestration |

Not everything should be a letter. Some conversations need a phone call.

The Invisible Contract

The implicit dependency between an event producer's schema and every consumer that deserializes it. Unlike REST APIs where the contract is explicit in the URL and response shape, event contracts exist only in the serialization format. A field rename that passes the producer's tests breaks every consumer quietly. An address change that the sender's spell-check accepts but every recipient's mailroom rejects. Making the invisible contract visible through schema registries and data catalogs is the entire purpose of event governance.

What the Industry Gets Wrong About Event-Driven Architecture

“Events solve coupling.” Events solve temporal coupling. They can create a different, harder-to-track kind: implicit schema dependencies between producers and consumers that nobody tracks. Without a schema registry, event-driven architectures trade visible REST contracts for invisible event contracts. The coupling is still there. It just hides better. (Hiding coupling is not removing coupling.)

“Exactly-once delivery is the right default.” Exactly-once carries a measurable throughput penalty and adds operational complexity through transaction coordinators. Idempotent at-least-once delivers the same correctness for most use cases. Some letters arrive twice. You throw away the duplicate. Reserve exactly-once for financial settlement and regulatory reporting where double-counting has legal consequences. Registered mail for legal documents. Regular mail for everything else.
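Idempotent at-least-once in miniature: the handler below keys on an event ID and silently drops duplicates, so redelivery never double-counts. In production the seen-set would be a database unique constraint or a keyed upsert rather than an in-memory set; the field names are illustrative.

```python
processed = set()   # event IDs we have already handled (a DB constraint in real life)
totals = {}         # per-order running totals

def handle(event: dict) -> None:
    """Process a payment event exactly once, even if delivered many times."""
    if event["event_id"] in processed:
        return                          # duplicate delivery: throw it away
    processed.add(event["event_id"])
    order = event["order_id"]
    totals[order] = totals.get(order, 0) + event["amount_cents"]

event = {"event_id": "e-1", "order_id": "o-1", "amount_cents": 500}
handle(event)
handle(event)   # at-least-once redelivery of the same event
print(totals)   # {'o-1': 500} — counted exactly once despite two deliveries
```

Same correctness as exactly-once for this workload, with none of the transaction-coordinator overhead.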

“More topics means better decoupling.” Topic proliferation without governance produces the same mess as microservice proliferation without boundaries. Fifty topics nobody documents, ten of which overlap in meaning, three of which are abandoned but still receiving events. Opening 50 PO boxes and forgetting which is for what. Topic lifecycle management is as important as topic creation.

Our take: Deploy a schema registry before the second producer goes live. The cost of retrofitting schema governance after 50 topics and 200 consumers is measured in quarters of engineering time. The cost of deploying it alongside the first topic is measured in days. One of those timelines is acceptable.

Serialization Format Comparison: Avro vs Protobuf vs JSON Schema

Avro provides binary serialization with strong evolution support. The schema is stored in the registry, not embedded in every message, which keeps payloads compact. Most common choice for Kafka because the registry enforces compatibility at publish time. Downside: requires the registry to be available for deserialization.

Protobuf offers more compact serialization with explicit field numbering, which provides naturally stable backward compatibility. Strong choice when the same events flow through gRPC services and Kafka. Downside: field renumbering is a breaking change that’s easy to miss.

JSON Schema is human-readable and debuggable with standard tools. Good for lower-volume topics where developer experience matters more than wire efficiency. Downside: no binary serialization means larger payloads and higher broker storage costs at scale.

For most Kafka deployments, Avro with Confluent Schema Registry is the path of least resistance. Switch to Protobuf if your services already use gRPC heavily and you want schema consistency across both transport layers.

The DLQ hit 2.3 million because a field rename passed code review with no schema registry to reject it. With schema governance in place, that rename fails at publish time, the producer team gets a compatibility error in CI, and the analytics dashboards never miss a single order. Same refactor. Completely different outcome. Same letter. The address validator caught it before it shipped.

Your DLQ Is Growing and Nobody Knows

Kafka clusters that grow without governance become impossible to audit and expensive to operate. Schema registries enforcing compatibility, data contracts between producers and consumers, and topic lifecycle management are what separate sustainable event platforms from silent data loss.

Govern Your Event Platform

Frequently Asked Questions

Why does event schema matter for Kafka architectures?


Without explicit schema, producers and consumers develop an unspoken understanding that breaks quietly when either side changes. Avro with Schema Registry is the most common choice for Kafka because the registry enforces compatibility at publish time, rejecting breaking changes before they reach consumers. Protobuf offers more compact serialization with explicit field numbering. JSON Schema is human-readable but lacks binary efficiency at scale.

What is a schema registry and why is it critical?


A schema registry stores event schemas and enforces compatibility rules on new versions. BACKWARD compatibility means new consumers can read old events. FORWARD means old consumers can read new events. FULL means both. Without a registry, breaking schema changes ship without warning and are found when consumer failures show up in production. Confluent Schema Registry handles 100K+ schemas in large deployments.

What is event replay and when do you need it?


Event replay re-processes historical events by resetting consumer offsets to an earlier position. You need it for: fixing a consumer bug (replay from before the bug), populating a new data store from the event stream, or A/B testing new logic against historical data. Kafka keeps events for a configurable period, typically 7-30 days. For arbitrary replay, archive events to object storage in Parquet format for indefinite retention.
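Replay in miniature: a consumer re-reads a retained log from an earlier offset to rebuild state after a bug fix. The log and consumer here are plain Python stand-ins for a Kafka partition and a consumer group's offset; in practice the reset is done with Kafka's consumer-group tooling.

```python
# A retained partition: events survive past consumption, keyed by offset.
log = [
    {"offset": 0, "order_id": "o-1", "total_cents": 500},
    {"offset": 1, "order_id": "o-2", "total_cents": 300},
    {"offset": 2, "order_id": "o-3", "total_cents": 700},
]

def replay_from(log: list, offset: int) -> int:
    """Re-process every retained event at or after `offset`; return the rebuilt total."""
    total = 0
    for event in log:
        if event["offset"] >= offset:
            total += event["total_cents"]
    return total

# A bug shipped after offset 0 was processed: reset to offset 1 and re-process.
print(replay_from(log, 1))   # 1000
```

The same mechanism populates a brand-new data store: reset to offset 0 and consume the whole retained history.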

How do dead letter queues work in event pipelines?


A DLQ receives events that a consumer failed to process after exhausting retries. It prevents one bad event from blocking an entire partition. Classify failures: transient errors (network timeout) retry with exponential backoff up to 3-5 attempts. Permanent errors (deserialization failure) route directly to DLQ. Alert on any DLQ depth above zero. A growing unalerted DLQ is silent data loss happening in slow motion.
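The retry/DLQ routing described above can be sketched as follows. The error classes, attempt cap, and backoff schedule are illustrative assumptions: transient errors retry with exponential backoff up to a capped number of attempts, while permanent errors route straight to the DLQ without burning retries.

```python
class TransientError(Exception): pass     # e.g. network timeout
class PermanentError(Exception): pass     # e.g. deserialization failure

def consume(event, handler, dlq: list, max_attempts: int = 4) -> None:
    """Run handler with retry-then-DLQ semantics for a single event."""
    for attempt in range(max_attempts):
        try:
            handler(event)
            return
        except PermanentError:
            dlq.append(event)             # no point retrying: route to DLQ
            return
        except TransientError:
            # Would sleep 2 ** attempt seconds here (exponential backoff).
            continue
    dlq.append(event)                     # transient retries exhausted

dlq = []

def bad_payload(event):
    raise PermanentError("cannot deserialize")

consume({"id": "e-1"}, bad_payload, dlq)
print(len(dlq))   # 1 — permanent failure routed to DLQ on first attempt

attempts = 0
def flaky(event):
    global attempts
    attempts += 1
    raise TransientError("timeout")

consume({"id": "e-2"}, flaky, dlq)
print(attempts, len(dlq))   # 4 2 — retried to the cap, then routed to DLQ
```

Either way the partition keeps moving, which is the whole point: one bad event never blocks its neighbors.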

What is event bus sprawl and how do you prevent it?


Event bus sprawl is the buildup of poorly named, overlapping, and abandoned Kafka topics nobody understands. Teams with 50+ topics and no governance lose a painful share of engineering time to incident triage from undocumented topic changes. Prevention requires enforced naming conventions (domain.entity.event_type), ownership metadata per topic, review process for creation, and lifecycle policies for deprecation.
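The naming convention mentioned above can be enforced mechanically, for example as a CI gate that runs before topic creation. The regex below is an assumption about the exact shape of the domain.entity.event_type convention (lowercase segments, underscores allowed).

```python
import re

# Three dot-separated lowercase segments: domain.entity.event_type.
TOPIC_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")

def valid_topic_name(name: str) -> bool:
    """True if the topic name follows the domain.entity.event_type convention."""
    return bool(TOPIC_PATTERN.match(name))

print(valid_topic_name("orders.order.created"))   # True
print(valid_topic_name("OrdersTopic2"))           # False: no structure
print(valid_topic_name("orders.created"))         # False: missing entity segment
```

Pair the check with ownership metadata at creation time and abandoned, unclassifiable topics stop accumulating in the first place.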