Data Contracts: Schema Changes Without the Breakage
Late in the day. A backend engineer opens a PR that removes the discount_amount column from the orders table. The column was added for a promotion that ended six months ago. Dead code. The PR passes code review. CI is green. The migration runs cleanly.
The following morning. The head of finance drops into #data-support: “The weekly revenue report shows zero discounts applied. Promotions ran real volume last week. Something is very wrong.” The data team investigates. The revenue pipeline used discount_amount in a join to calculate net revenue. The column is gone. The join silently returned nulls. The pipeline ran successfully. Zero errors. Zero alerts. The revenue reports are wrong, and the CFO is asking questions nobody wants to answer.
The landlord knocked down a wall without checking who lived on the other side.
The Data Contract specification was built to prevent this kind of failure. A contract records the dependency, blocks the deployment, and requires migration coordination before merge.
- The #1 data engineering failure mode: upstream schema changes that quietly break downstream pipelines. No errors thrown. No alerts fired. Wrong numbers in the executive dashboard for days.
- A data contract is a versioned, machine-readable agreement between a producer and its consumers. Schema, freshness SLA, quality thresholds, ownership, change policy.
- Consumer registries show you who depends on what. Without one, engineers delete columns consumed by pipelines they’ve never heard of. Tenants the landlord doesn’t know about.
- Contract tests in CI block breaking changes before they merge. The PR fails with “3 downstream consumers depend on discount_amount.”
- Ownership belongs to a team, not a person. People leave. Teams persist.
What Makes a Contract Complete
A contract is not a schema file. It’s an agreement between a producer and every system that consumes its output. A lease. Four components make it real instead of wishful.
```yaml
# data-contracts/orders.yaml
dataContract:
  name: orders
  version: 2.1.0
  owner:
    team: commerce-platform
    slack: "#commerce-data"
    oncall: commerce-data-oncall@company.com
  schema:
    fields:
      - name: order_id
        type: string
        required: true
        description: "UUID, unique per order"
      - name: status
        type: string
        enum: [pending, paid, shipped, delivered, cancelled]
        description: "paid = payment confirmed. shipped = left warehouse."
      - name: total_amount
        type: decimal
        required: true
  sla:
    freshness: 15m
    availability: 99.5%
    incident_response: 1h acknowledge, 4h resolve
  change_policy:
    breaking_change_notice: 14d minimum
    dual_publish_period: 30d
```
Schema defines fields, types, nullability, and what things mean. What does status: paid mean? The revenue report and the shipping report read it differently unless the contract spells it out. The lease that says “furnished” without listing the furniture.
SLA commits to freshness, availability, and incident response times. The maintenance clause. “Landlord fixes plumbing within 24 hours.” Your data pipelines need these numbers to schedule downstream jobs and set alerts.
Ownership names the producer team, escalation path, and on-call rotation. If you can’t answer “who gets paged when this table is stale?” you don’t have ownership. You have an orphan. An apartment with no landlord. Good luck getting the heat fixed.
Change policy specifies how much notice before breaking changes (2-4 weeks) and commits to migration support. That’s what stops the discount_amount scenario. The renovation clause. Can’t knock down walls without giving tenants notice.
A contract without enforcement is a lease nobody reads. Just a well-intentioned wiki page.
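Enforcement can start at the record level. A minimal sketch of validating a record against the schema above, assuming the contract YAML has already been parsed (e.g. with PyYAML) into a plain dict; the decimal type is approximated as int/float for brevity:

```python
# The orders schema as it would look after parsing the contract YAML
# (parsing itself is omitted to keep this self-contained).
ORDERS_SCHEMA = {
    "order_id": {"type": str, "required": True},
    "status": {"type": str,
               "enum": {"pending", "paid", "shipped", "delivered", "cancelled"}},
    "total_amount": {"type": (int, float), "required": True},  # decimal, approximated
}

def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: wrong type {type(value).__name__}")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: {value!r} not in allowed values")
    return errors
```

A conforming record returns no errors; a record with a missing required field or an out-of-enum status returns one message per violation, which is what a quality gate would log or alert on.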
Contract Testing in CI/CD
Tests check the proposed schema against what each consumer says it needs. If the change removes a consumed field, the test fails and blocks deployment. The building inspector who checks with every tenant before approving a renovation. Same idea as continuous integration applied to data. dbt tests handle the consumer side: fail if an expected column is missing, a value has disappeared from an enum, or a null rate crosses the threshold.
- Consumer registry exists with declared field dependencies for each consuming pipeline
- Contract test runner in the producer CI/CD pipeline with deploy-blocking authority
- Schema registry checks event schemas at publish time (Confluent Schema Registry or equivalent)
- dbt tests validate downstream expectations on every pipeline run
- Alerting triggers when SLA freshness or availability thresholds are breached
Don’t: Rely on Confluence docs to track schema dependencies. Docs go stale within weeks, have no CI integration, and can’t stop a bad deploy. A verbal agreement. “I thought the dishwasher was included.”
Do: Store contracts as version-controlled YAML files alongside code. Wire contract tests into CI/CD so CI catches breaking changes automatically, with the consuming team’s contact info in the failure message.
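A deploy-blocking contract test can be little more than a set difference against the consumer registry. A sketch under the assumption that the registry is a mapping of consumer name to declared fields; the consumer names and Slack channels here are hypothetical:

```python
# Hypothetical consumer registry: each consumer declares the fields it reads.
CONSUMER_REGISTRY = {
    "revenue_pipeline":  {"team": "#data-support", "fields": {"order_id", "total_amount", "discount_amount"}},
    "shipping_report":   {"team": "#logistics",    "fields": {"order_id", "status"}},
    "finance_dashboard": {"team": "#finance-eng",  "fields": {"order_id", "total_amount", "discount_amount"}},
    "promo_analytics":   {"team": "#growth-data",  "fields": {"order_id", "discount_amount"}},
}

def contract_test(current_fields: set[str], proposed_fields: set[str]) -> list[str]:
    """Return one failure message per consumer broken by the proposed schema."""
    removed = current_fields - proposed_fields
    failures = []
    for name, consumer in CONSUMER_REGISTRY.items():
        broken = removed & consumer["fields"]
        if broken:
            failures.append(
                f"{name} ({consumer['team']}) depends on: {', '.join(sorted(broken))}"
            )
    return failures  # CI blocks the merge if this list is non-empty
```

Dropping discount_amount against this registry produces three failures, each naming the consuming team's channel, which is exactly the failure message the engineer needs to start coordinating.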
Versioning and Schema Evolution
Semantic versioning adapted for data: patch (documentation-only changes), minor (backward-compatible additions: new nullable column, new enum value), major (breaking changes: removed columns, type changes, renames).
Breaking changes require dual-produce to both old and new schemas for 30-60 days. Running old and new plumbing at the same time during renovation. Tenants keep their water. Consumers migrate at their own pace. The producer tracks progress through the consumer registry. Once all consumers confirm, the old version sunsets. Data engineering teams track migration completion as a percentage, and it becomes a forcing function for lagging consumers.
| Change Type | Example | Consumer Impact | Required Process |
|---|---|---|---|
| Patch | Fix a field description typo | None | Deploy freely |
| Minor | Add nullable discount_type column | None (new field is optional) | Deploy with notification |
| Major | Remove discount_amount column | Breaking for any consumer using that field | 2-4 week notice, dual-produce, consumer sign-off |
| Major | Change order_id from string to integer | Breaking for all consumers | Full migration plan, dual-produce, coordinated cutover |
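The table above can be encoded as a classifier that CI runs on every schema diff. A sketch, assuming each schema is a mapping of field name to type and nullability; it conservatively treats a new required field as major:

```python
def classify_change(old: dict[str, dict], new: dict[str, dict]) -> str:
    """Classify a schema change as major / minor / patch.

    Each schema maps field name -> {"type": str, "required": bool}.
    A rename shows up as a removal plus an addition, so it lands in major.
    """
    removed = old.keys() - new.keys()
    retyped = {f for f in old.keys() & new.keys() if old[f]["type"] != new[f]["type"]}
    if removed or retyped:
        return "major"   # removals, renames, and type changes break consumers
    added = new.keys() - old.keys()
    if any(new[f].get("required") for f in added):
        return "major"   # conservative call: a new required field breaks old writers
    if added:
        return "minor"   # backward-compatible addition, e.g. a nullable column
    return "patch"       # only metadata such as descriptions changed
```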
Dual-produce implementation pattern
During a breaking schema migration, the producer publishes to both the old and new schema versions at the same time. The old schema is frozen (no new features) while the new schema gets all updates.
- Day 0: Producer announces breaking change, provides migration guide
- Day 1-7: Consumers assess impact and plan migration
- Day 7-30: Producer dual-publishes. Consumers migrate at their own pace. Registry tracks completion.
- Day 30 (or when all consumers confirm): Old schema deprecated. Producer publishes only to new schema.
- Day 60: Old schema decommissioned. Any consumer still reading it gets a clear error rather than stale data.
The dual-produce period is the insurance policy. It gives consumers a real migration window without creating a hard deadline that forces rushed, error-prone changes. The renovation happens while tenants keep living there. Nobody gets displaced.
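The publishing side of dual-produce can be sketched as one write fanned out to two serializers. The v2 shape below (discounts folded into an adjustments list) is purely hypothetical; the point is that the frozen old schema stays populated until sunset:

```python
def to_v1(order: dict) -> dict:
    """Frozen old schema: still carries discount_amount for unmigrated consumers."""
    return {"order_id": order["order_id"],
            "total_amount": order["total_amount"],
            "discount_amount": order.get("discount_amount", 0)}

def to_v2(order: dict) -> dict:
    """New schema (hypothetical): discount folded into an adjustments breakdown."""
    return {"order_id": order["order_id"],
            "total_amount": order["total_amount"],
            "adjustments": [{"kind": "discount",
                             "amount": order.get("discount_amount", 0)}]}

def publish(order: dict, sinks: dict) -> None:
    """Dual-produce: write every order to both schema versions until sunset."""
    sinks["orders_v1"].append(to_v1(order))   # frozen, removed at decommission
    sinks["orders_v2"].append(to_v2(order))   # gets all new features
```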
Measuring Contract Effectiveness
Without clear metrics, someone questions the ROI within a quarter. Three measurements matter.
| Metric | Before Contracts | With Contracts | How to Measure |
|---|---|---|---|
| Pipeline incidents from schema changes | Multiple per quarter | Rare after first quarter | Incident tags in your ticketing system |
| MTTD for breaking changes | Hours (manual discovery) | Minutes (CI catches it) | Time from schema deploy to first alert |
| Unplanned data team firefighting | Substantial portion of sprint capacity | Fraction of previous level | Sprint retrospective tracking |
Put contracts on your worst data products and incident counts drop within the first quarter. Detection time goes from hours of manual digging to minutes of automated CI catches. The overhead per contract is small. The ROI conversation gets easy fast. The building stopped having pipe bursts. The tenants stopped calling. The lease paid for itself.
Adoption That Actually Sticks
Mandating contracts across the organization on day one fails. Adoption needs trust, and trust needs demonstrated results.
Start with 3-5 contracts on the producer-consumer relationships that hurt the most. The tables that break pipelines most often. The joins that silently return nulls. Show clear value within 4-6 weeks. Then let other teams ask for contracts for their own pain points. Pull-based adoption. Within two to three quarters, coverage spreads because teams have seen the results and want in. Not because a memo told them to. Nobody mandates fire extinguishers after the building stops having fires. They just become obviously necessary.
| Adoption approach | Effort | Risk | Outcome |
|---|---|---|---|
| Top-down mandate | High (org-wide policy, tooling, training) | Resentment, shallow compliance | Fast coverage, low quality contracts |
| Bottom-up, pain-driven | Low initial, grows organically | Slow early adoption | Deep contracts, genuine buy-in |
| Hybrid (recommended) | Medium (start small, expand with executive support) | Moderate | Fast wins that fund broader rollout |
Contracts are the foundation of a mature data mesh where each domain publishes data with defined quality guarantees.
What the Industry Gets Wrong About Data Contracts
“Documentation is enough.” A Confluence page describing schema fields is not a contract. Machines can’t read it. It’s not versioned. CI can’t test it. It can’t block a deploy. The producer can change the schema without the documentation author knowing. A verbal agreement. Worse than useless when things go wrong because it creates false confidence. Documentation describes intent. Contracts enforce it.
“Data contracts create friction.” Uncoordinated schema changes create more friction. The multi-day debugging session after a field rename costs more than the 20-minute migration coordination a contract requires. Contracts convert unplanned friction (incidents) into planned friction (coordination). The lease creates paperwork. Not having a lease creates lawsuits. Planned friction is always cheaper.
That discount_amount column removal from the opening. With a contract in place, the PR triggers the contract test in CI. The test checks the consumer registry, finds the revenue pipeline’s dependency, and blocks the merge. The backend engineer sees the failure, opens the registry, contacts the data team. They coordinate the migration together. The landlord checked the tenant list before knocking down the wall. The CFO never asks the question. The revenue report is never wrong.