Data Quality: When the Pipeline Lies
It’s a quiet afternoon. The VP of Sales pings the data team: “Revenue on the dashboard dropped 30% overnight. What happened?” Two hours of investigation later, the answer: nothing happened to the business. A product engineer renamed an enum value from "completed" to "complete" in the orders API a few days earlier. The pipeline kept running. The transformation kept filtering for "completed". Four days of orders silently fell through the filter. The dashboard showed real revenue dropping. The VP almost pulled a campaign based on phantom numbers.
The production line kept running. The batch looked fine. The contamination was invisible until the customer got sick.
If you’ve been in data engineering for more than a year, you’ve lived some version of this story. Not the pipeline that crashes with a stack trace. That one is easy. The expensive failure is the pipeline that runs successfully, returns HTTP 200, and delivers subtly wrong numbers to people who make decisions with them. A pipeline producing zero rows is visible right away. Someone asks why the dashboard is empty. A pipeline producing rows with incorrect totals is visible only when someone notices a metric doesn’t match their gut feel. By then, decisions have been made on bad data, and the harder question is not “what went wrong” but “does anyone still trust the numbers?”
Trust in data is earned in months and lost in a single incident. Rebuilding it takes longer. (Always longer.)
- The most expensive data quality bug is the one that runs successfully. No errors. No alerts. Subtly wrong numbers reaching decision-makers for days before anyone notices.
- Five dimensions of data quality need different enforcement strategies. Completeness, freshness, accuracy, uniqueness, validity. A single approach can’t cover all five.
- Schema contract tests catch upstream changes before they break pipelines. Enum renamed from “completed” to “complete”? Caught at the PR stage. Before it reaches production.
- Statistical anomaly detection catches what schema tests miss. Revenue drops 30% day-over-day? Alert fires before the VP notices.
- Quality gates should block pipeline execution, not just generate warnings. A warning nobody reads is the same as no test at all.
The Five Dimensions of Data Quality
Data quality is not one thing. It’s five, and which one matters most depends on who uses the data and what they do with it.
| Dimension | What It Measures | Enforcement Approach | Impact When Violated |
|---|---|---|---|
| Completeness | Required fields filled | Null count checks per column | Missing customer_id = useless record |
| Freshness | Data reflects current state | Timestamp comparison against SLA window | Decisions on yesterday’s reality |
| Accuracy | Values match reality | Cross-reference against source of truth | Small errors compound into measurable ML problems |
| Uniqueness | No duplicate records | Dedup checks on natural keys | Even low duplication rates produce material misstatements |
| Validity | Values match expected formats/ranges | Regex and range checks | Months of country_code: "XX" in 8% of rows |
For ML training data, even 95% completeness introduces bias if the missing 5% isn’t random. And it almost never is. Freshness is deceptive: display “data as of” timestamps on every dashboard. Analysts who know data is 2 hours old make different decisions than analysts who assume it’s current. Validity checks are the cheapest assertions you’ll ever write, yet production pipelines run for months with country_code: "XX" in 8% of records because nobody spent ten minutes on a check. The ingredient label says “flour.” The bag contains sand. Nobody looked.
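To make the cost concrete, here is a minimal sketch of completeness and validity assertions on a pandas DataFrame. The column names and thresholds are illustrative, not taken from any particular pipeline.

```python
import pandas as pd

def check_batch(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []

    # Completeness: required fields must be filled
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0:
        failures.append(f"customer_id null in {null_rate:.2%} of rows (expected 0%)")

    # Validity: formats and ranges
    bad_codes = ~df["country_code"].fillna("").str.match(r"^[A-Z]{2}$")
    if bad_codes.mean() > 0.01:  # tolerate up to 1% junk before alerting
        failures.append(f"country_code invalid in {bad_codes.mean():.2%} of rows")
    if (df["order_total"] < 0).any():
        failures.append("order_total contains negative values")

    return failures
```

Wire the returned failures into whatever blocks the pipeline. The point is that each assertion is a few lines, not a project.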
Three Layers of Quality Gates
Quality checks need to live at three distinct points in the pipeline. Miss a layer, and a category of bug gets through. Three inspection stations on the production line. Each catches what the others miss. Before any layer can do its job, a few operational prerequisites need to exist:
- Pipeline has defined owners (named individuals, not team aliases) for each data product
- Monitoring can alert within 5 minutes of a quality threshold breach
- Quarantine storage exists for failed records with at least 30 days retention
- Rollback mechanism can restore the previous clean state within 15 minutes
- Source systems expose schema version metadata accessible at ingestion time
Source validation runs at ingestion. The receiving dock. Schema conformance, type checking, row count sanity. When the source API returns a new column or drops one, this layer catches it within minutes instead of when a report looks wrong next week. Check the delivery against the order form before it enters the warehouse.
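A sketch of that receiving-dock check, assuming batches arrive as lists of dicts. The expected schema and the row-count floor are illustrative.

```python
EXPECTED_SCHEMA = {            # column -> expected Python type (illustrative)
    "order_id": str,
    "customer_id": str,
    "status": str,
    "total_amount": float,
}
MIN_ROWS = 1_000               # sanity floor for a daily batch (illustrative)

def validate_ingest(batch: list[dict]) -> None:
    """Fail loudly at the dock, before anything lands in the warehouse."""
    if len(batch) < MIN_ROWS:
        raise ValueError(f"row count sanity check failed: {len(batch)} < {MIN_ROWS}")

    sample = batch[0]
    missing = EXPECTED_SCHEMA.keys() - sample.keys()
    extra = sample.keys() - EXPECTED_SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"schema drift: missing={sorted(missing)}, unexpected={sorted(extra)}")

    for column, expected_type in EXPECTED_SCHEMA.items():
        if not isinstance(sample[column], expected_type):
            raise TypeError(f"{column}: expected {expected_type.__name__}, "
                            f"got {type(sample[column]).__name__}")
```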
Transformation validation runs after each significant step. The mid-line inspection. A join dropping 15% of records because a key format changed gets flagged before downstream models consume the gap. Simple test: assert row count ratios between stages. If a join produces fewer rows than expected, something changed upstream.
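The ratio assertion is one conditional. The 95% floor below is a placeholder to tune per join.

```python
def assert_join_ratio(rows_before: int, rows_after: int, min_ratio: float = 0.95) -> None:
    """Flag a join that silently drops records between stages."""
    ratio = rows_after / rows_before if rows_before else 0.0
    if ratio < min_ratio:
        raise AssertionError(
            f"join kept {ratio:.1%} of rows (floor {min_ratio:.0%}); "
            "a changed key format upstream is the usual suspect"
        )
```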
Output validation is the final gate before shipping. Referential integrity across fact and dimension tables. Aggregate consistency (do line items sum to the total?). Cross-system reconciliation against authoritative sources. The final taste test before the product leaves the factory.
```yaml
# dbt schema.yml - quality gates that block bad data
version: 2

models:
  - name: mart_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'paid', 'shipped', 'delivered', 'cancelled']
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
```
Gates should be blocking by default. Teams that start with non-blocking checks and plan to “make them blocking later” never follow through. The inspection station that waves everything through and writes a report. Nobody reads the report. The transition from warning to enforcement needs a cultural shift that gets harder every month it’s delayed.
Great Expectations handles Python-native assertions, Soda provides YAML-based checks accessible to analysts, and dbt tests run as part of the build itself. Test fails? Model doesn’t materialize. No bad data reaches production. The contaminated batch never ships.
The Root Cause Problem
Most quality incidents share a common origin: application changes that upstream teams didn’t communicate downstream. A product team renames a field, changes enum values, adjusts a default. Their app tests pass. Your pipeline keeps running. Reports show wrong numbers for days. The supplier changed the recipe. Didn’t tell the factory. The factory kept making the product. The customers noticed.
The structural solution is data contracts that block deployment when schema changes affect downstream consumers. A contract test in CI catches the enum rename before the deploy goes through. Four days of wrong data becomes zero. Prevention beats detection every time. The supplier’s delivery gets checked against the contract before it enters the building.
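One way to express that contract, sketched as a CI test living in the producer's repository. The OrderStatus import and the contract set are hypothetical; the point is that the producer's build fails before the rename ships.

```python
# test_order_contract.py -- runs in the producer's CI, owned jointly with the data team
from orders.models import OrderStatus   # hypothetical producer enum

# Values the downstream pipeline filters and aggregates on: the contract.
CONTRACT_STATUSES = {"pending", "completed", "cancelled"}

def test_order_status_values_match_contract():
    producer_values = {status.value for status in OrderStatus}
    removed = CONTRACT_STATUSES - producer_values
    assert not removed, (
        f"enum values removed or renamed: {sorted(removed)}; "
        "downstream filters depend on these -- change the contract with the data team first"
    )
```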
Anomaly Detection Beyond Static Thresholds
Static thresholds catch sudden breaks. They miss gradual drift. Row count dropping 2% per week stays beneath every fixed check. Three months later, nearly a quarter of your records are gone. Nobody noticed. The slow leak. The frog in the pot.
Z-scores on rolling windows solve this. If today’s count is 2.5+ standard deviations below the 30-day mean, alert regardless of where the static floor sits. This catches both sudden anomalies and sustained drift that erodes quality slowly. The inspector who compares today’s batch to the last 30 batches, not just a fixed minimum.
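A minimal rolling z-score check on a daily row-count series, assuming a pandas Series ordered by date. The 30-day window and 2.5-sigma threshold mirror the numbers above.

```python
import pandas as pd

def zscore_alert(daily_counts: pd.Series, window: int = 30, threshold: float = 2.5) -> bool:
    """True if today's count sits more than `threshold` std devs below the rolling mean."""
    history = daily_counts.iloc[:-1].tail(window)   # baseline excludes today
    today = daily_counts.iloc[-1]
    mean, std = history.mean(), history.std()
    if std == 0:
        return bool(today != mean)   # a flat series that suddenly moves is also an anomaly
    return (mean - today) / std > threshold
```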
Add seasonal decomposition for cyclical data. Alert on the residual component, not raw counts, or drown in false positives every Monday morning. A Monday row count 40% below Friday’s is normal for most businesses. A Monday row count 40% below the previous Monday’s is not.
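Sketched with statsmodels’ seasonal_decompose, alerting on the residual rather than the raw count. The weekly period and the 3-sigma cut are assumptions to tune.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def residual_alert(daily_counts: pd.Series, period: int = 7, sigmas: float = 3.0) -> bool:
    """Strip the weekly cycle, then flag a day whose residual is extreme."""
    result = seasonal_decompose(daily_counts, model="additive", period=period,
                                extrapolate_trend="freq")  # avoids NaN residuals at the edges
    resid = result.resid
    return abs(resid.iloc[-1] - resid.mean()) > sigmas * resid.std()
```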
For multivariate anomalies, isolation forests trained on row count, null rates, distinct counts, and record size flag days where no individual metric looks wrong but the combination is suspicious. A batch where the weight, color, and temperature are each within range, but the combination has never occurred before. Data observability platforms extend these approaches into self-tuning monitors that adapt their baselines automatically.
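A sketch of that multivariate monitor with scikit-learn’s IsolationForest, trained on one row per day of table-level metrics. The feature names and contamination rate are illustrative.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

FEATURES = ["row_count", "null_rate", "distinct_product_ids", "avg_record_bytes"]

def fit_daily_monitor(history: pd.DataFrame) -> IsolationForest:
    """Train on past days' per-day metrics (assumes that history was mostly healthy)."""
    model = IsolationForest(contamination=0.02, random_state=42)
    return model.fit(history[FEATURES])

def is_suspicious(model: IsolationForest, today: pd.DataFrame) -> bool:
    """IsolationForest.predict returns -1 for outliers, 1 for inliers."""
    return bool(model.predict(today[FEATURES])[0] == -1)
```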
Partition-Level and Cross-Table Checks
Partition-level checks validate each day on its own. A table at 1.2% nulls looks healthy. Today’s partition at 15% nulls does not. The total hides the spike.
Cross-table consistency catches records missing between related tables. Orders without matching entries in order_items. Invoices referencing customers that don’t exist in the customer dimension.
Historical comparison with rolling averages and 2-sigma thresholds catches gradual erosion. Compare today’s distinct value count against the 30-day rolling average. A slow decline in distinct product IDs might indicate a filtering bug eating categories.
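The first two checks in pandas, assuming orders and order_items frames with a load_date column; table and column names are illustrative.

```python
import pandas as pd

def partition_null_rate(df: pd.DataFrame, column: str, load_date: str) -> float:
    """Null rate for one day's partition, so a spike can't hide in the table-wide average."""
    partition = df[df["load_date"] == load_date]
    return partition[column].isna().mean()

def orphaned_orders(orders: pd.DataFrame, order_items: pd.DataFrame) -> pd.DataFrame:
    """Orders with no matching rows in order_items: a left anti-join on order_id."""
    merged = orders.merge(order_items[["order_id"]].drop_duplicates(),
                          on="order_id", how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"]
```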
| Detection Method | Catches | Misses | Best For |
|---|---|---|---|
| Static thresholds | Sudden breaks, known failure modes | Gradual drift, seasonal variation | Simple pipelines with stable patterns |
| Rolling z-score | Drift, sudden anomalies | Seasonal effects without decomposition | Non-cyclical metrics |
| Seasonal decomposition | Weekly/monthly patterns | Multivariate anomalies | Business metrics with known cycles |
| Isolation forest | Complex multivariate anomalies | Anomalies that resemble its training history | Mature pipelines with history |
What the Industry Gets Wrong About Data Quality
“Data quality is a dashboard problem.” Dashboards display quality metrics. They don’t enforce them. A dashboard showing 99.5% completeness is useless if the pipeline that produces bad data runs uninterrupted. The report that says the factory is clean doesn’t stop the contaminated batch from shipping. Quality gates that block pipeline execution on failure are the control. Dashboards are the reporting layer.
“More tests mean better quality.” Tests that nobody maintains, that generate false positives teams learn to ignore, or that run after the pipeline has already shipped corrupted data create false confidence. Worse than no tests. A food safety poster in a kitchen nobody cleans. Five well-maintained tests per model that block on failure outperform 50 stale assertions that generate warnings nobody reads.
“Schema validation covers data quality.” Schema checks confirm structure. They say nothing about whether the values make business sense. A valid integer in the order_total column could still be negative, or zero, or a million times larger than any real order. The package weighs what the label says. The contents are wrong. Validity rules and business logic assertions catch what schema checks wave through.
Don’t: Start with non-blocking quality checks and plan to “make them blocking later.” The transition never happens. Teams build workflows around ignoring warnings, and by month three, the alert channel is muted. The inspection report nobody reads. Again.
Do: Start blocking on the three most critical checks from day one. Expand blocking coverage as confidence in the checks grows. The short-term pain of a blocked pipeline trains teams to fix root causes instead of ignoring symptoms.
That enum rename from the opening? A schema contract test catches it before the pipeline ever runs. Four days of wrong data becomes zero. Each post-mortem produces a new quality check. Over time, a team builds hundreds of assertions reflecting actual failure modes. The data engineering pipeline supporting these checks becomes institutional knowledge that doesn’t walk out when someone leaves. The factory’s inspection handbook. Every check is a scar from a real incident. Data quality is a practice that compounds.