Data Quality: When the Pipeline Lies
It’s a quiet afternoon. The VP of Sales pings the data team: “Revenue on the dashboard dropped 30% overnight. What happened?” Two hours of investigation later, the answer: nothing happened to the business. A product engineer renamed an enum value from "completed" to "complete" in the orders API a few days earlier. The pipeline kept running. The transformation kept filtering for "completed". Four days of orders silently fell through the filter. The dashboard showed real revenue dropping. The VP almost pulled a campaign based on phantom numbers.
The production line kept running. The batch looked fine. The contamination was invisible until the customer got sick.
If you’ve been in data engineering for more than a year, you’ve lived some version of this story. Not the pipeline that crashes with a stack trace. That one is easy. The expensive failure is the pipeline that runs successfully, returns HTTP 200, and delivers subtly wrong numbers to people who make decisions with them. A pipeline producing zero rows is visible right away. Someone asks why the dashboard is empty. A pipeline producing rows with incorrect totals is visible only when someone notices a metric doesn’t match their gut feel. By then, decisions have been made on bad data, and the harder question is not “what went wrong” but “does anyone still trust the numbers?”
Trust in data is earned in months and lost in a single incident. Rebuilding it takes longer. (Always longer.)
- The most expensive data quality bug is the one that runs successfully. No errors. No alerts. Subtly wrong numbers reaching decision-makers for days before anyone notices.
- Five dimensions of data quality need different enforcement strategies. Completeness, freshness, accuracy, uniqueness, validity. A single approach can’t cover all five.
- Schema contract tests catch upstream changes before they break pipelines. Enum renamed from “completed” to “complete”? Caught at the PR stage. Before it reaches production.
- Statistical anomaly detection catches what schema tests miss. Revenue drops 30% day-over-day? Alert fires before the VP notices.
- Quality gates should block pipeline execution, not just generate warnings. A warning nobody reads is the same as no test at all.
The Five Dimensions of Data Quality
Data quality is not one thing. It’s five, and which one matters most depends on who uses the data and what they do with it.
| Dimension | What It Measures | Enforcement Approach | Impact When Violated |
|---|---|---|---|
| Completeness | Required fields filled | Null count checks per column | Missing customer_id = useless record |
| Freshness | Data reflects current state | Timestamp comparison against SLA window | Decisions on yesterday’s reality |
| Accuracy | Values match reality | Cross-reference against source of truth | Small errors compound into measurable ML problems |
| Uniqueness | No duplicate records | Dedup checks on natural keys | Even low duplication rates produce material misstatements |
| Validity | Values match expected formats/ranges | Regex and range checks | Months of country_code: "XX" in 8% of rows |
For ML training data, even 95% completeness introduces bias if the missing 5% isn’t random. And it almost never is. Freshness is deceptive: display “data as of” timestamps on every dashboard. Analysts who know data is 2 hours old make different decisions than analysts who assume it’s current. Validity checks are the cheapest assertions you’ll ever write, yet production pipelines run for months with country_code: "XX" in 8% of records because nobody spent ten minutes on a check. The ingredient label says “flour.” The bag contains sand. Nobody looked.
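To make the cost concrete, here is a minimal sketch of completeness and validity assertions on a pandas DataFrame. The column names and thresholds are illustrative, not taken from any particular pipeline.

```python
import pandas as pd

def check_batch(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []

    # Completeness: required fields must be filled
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0:
        failures.append(f"customer_id null in {null_rate:.2%} of rows (expected 0%)")

    # Validity: formats and ranges
    bad_codes = ~df["country_code"].fillna("").str.match(r"^[A-Z]{2}$")
    if bad_codes.mean() > 0.01:  # tolerate up to 1% junk before alerting
        failures.append(f"country_code invalid in {bad_codes.mean():.2%} of rows")
    if (df["order_total"] < 0).any():
        failures.append("order_total contains negative values")

    return failures
```

Wire the returned failures into whatever blocks the pipeline. The point is that each assertion is a few lines, not a project.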
Three Layers of Quality Gates
Quality checks need to live at three distinct points in the pipeline. Miss a layer, and a category of bug gets through. Three inspection stations on the production line. Each catches what the others miss. Before any layer can do its job, a few operational prerequisites need to exist:
- Pipeline has defined owners (named individuals, not team aliases) for each data product
- Monitoring can alert within 5 minutes of a quality threshold breach
- Quarantine storage exists for failed records with at least 30 days retention
- Rollback mechanism can restore the previous clean state within 15 minutes
- Source systems expose schema version metadata accessible at ingestion time
Source validation runs at ingestion. The receiving dock. Schema conformance, type checking, row count sanity. When the source API returns a new column or drops one, this layer catches it within minutes instead of when a report looks wrong next week. Check the delivery against the order form before it enters the warehouse.
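A sketch of that receiving-dock check, assuming batches arrive as lists of dicts. The expected schema and the row-count floor are illustrative.

```python
EXPECTED_SCHEMA = {            # column -> expected Python type (illustrative)
    "order_id": str,
    "customer_id": str,
    "status": str,
    "total_amount": float,
}
MIN_ROWS = 1_000               # sanity floor for a daily batch (illustrative)

def validate_ingest(batch: list[dict]) -> None:
    """Fail loudly at the dock, before anything lands in the warehouse."""
    if len(batch) < MIN_ROWS:
        raise ValueError(f"row count sanity check failed: {len(batch)} < {MIN_ROWS}")

    sample = batch[0]
    missing = EXPECTED_SCHEMA.keys() - sample.keys()
    extra = sample.keys() - EXPECTED_SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"schema drift: missing={sorted(missing)}, unexpected={sorted(extra)}")

    for column, expected_type in EXPECTED_SCHEMA.items():
        if not isinstance(sample[column], expected_type):
            raise TypeError(f"{column}: expected {expected_type.__name__}, "
                            f"got {type(sample[column]).__name__}")
```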
Transformation validation runs after each significant step. The mid-line inspection. A join dropping 15% of records because a key format changed gets flagged before downstream models consume the gap. Simple test: assert row count ratios between stages. If a join produces fewer rows than expected, something changed upstream.
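The ratio assertion is one conditional. The 95% floor below is a placeholder to tune per join.

```python
def assert_join_ratio(rows_before: int, rows_after: int, min_ratio: float = 0.95) -> None:
    """Flag a join that silently drops records between stages."""
    ratio = rows_after / rows_before if rows_before else 0.0
    if ratio < min_ratio:
        raise AssertionError(
            f"join kept {ratio:.1%} of rows (floor {min_ratio:.0%}); "
            "a changed key format upstream is the usual suspect"
        )
```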
Output validation is the final gate before shipping. Referential integrity across fact and dimension tables. Aggregate consistency (do line items sum to the total?). Cross-system reconciliation against authoritative sources. The final taste test before the product leaves the factory.
```yaml
# dbt schema.yml - quality gates that block bad data
version: 2

models:
  - name: mart_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'paid', 'shipped', 'delivered', 'cancelled']
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
```
Gates should be blocking by default. Teams that start with non-blocking checks and plan to “make them blocking later” never follow through. The inspection station that waves everything through and writes a report. Nobody reads the report. The transition from warning to enforcement needs a cultural shift that gets harder every month it’s delayed.
Great Expectations handles Python-native assertions, Soda provides YAML-based checks accessible to analysts, and dbt tests run as part of the build itself. Test fails? Model doesn’t materialize. No bad data reaches production. The contaminated batch never ships.
The Root Cause Problem
Most quality incidents share a common origin: application changes that upstream teams didn’t communicate downstream. A product team renames a field, changes enum values, adjusts a default. Their app tests pass. Your pipeline keeps running. Reports show wrong numbers for days. The supplier changed the recipe. Didn’t tell the factory. The factory kept making the product. The customers noticed.
The structural solution is data contracts that block deployment when schema changes affect downstream consumers. A contract test in CI catches the enum rename before the deploy goes through. Four days of wrong data becomes zero. Prevention beats detection every time. The supplier’s delivery gets checked against the contract before it enters the building.
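One way to express that contract, sketched as a CI test living in the producer's repository. The OrderStatus import and the contract set are hypothetical; the point is that the producer's build fails before the rename ships.

```python
# test_order_contract.py -- runs in the producer's CI, owned jointly with the data team
from orders.models import OrderStatus   # hypothetical producer enum

# Values the downstream pipeline filters and aggregates on: the contract.
CONTRACT_STATUSES = {"pending", "completed", "cancelled"}

def test_order_status_values_match_contract():
    producer_values = {status.value for status in OrderStatus}
    removed = CONTRACT_STATUSES - producer_values
    assert not removed, (
        f"enum values removed or renamed: {sorted(removed)}; "
        "downstream filters depend on these -- change the contract with the data team first"
    )
```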
Anomaly Detection Beyond Static Thresholds
Static thresholds catch sudden breaks. They miss gradual drift. Row count dropping 2% per week stays beneath every fixed check. Three months later, nearly a quarter of your records are gone. Nobody noticed. The slow leak. The frog in the pot.
Z-scores on rolling windows solve this. If today’s count is 2.5+ standard deviations below the 30-day mean, alert regardless of where the static floor sits. This catches both sudden anomalies and sustained drift that erodes quality slowly. The inspector who compares today’s batch to the last 30 batches, not just a fixed minimum.
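A minimal rolling z-score check on a daily row-count series, assuming a pandas Series ordered by date. The 30-day window and 2.5-sigma threshold mirror the numbers above.

```python
import pandas as pd

def zscore_alert(daily_counts: pd.Series, window: int = 30, threshold: float = 2.5) -> bool:
    """True if today's count sits more than `threshold` std devs below the rolling mean."""
    history = daily_counts.iloc[:-1].tail(window)   # baseline excludes today
    today = daily_counts.iloc[-1]
    mean, std = history.mean(), history.std()
    if std == 0:
        return bool(today != mean)   # a flat series that suddenly moves is also an anomaly
    return (mean - today) / std > threshold
```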
Add seasonal decomposition for cyclical data. Alert on the residual component, not raw counts, or drown in false positives every Monday morning. A Monday row count 40% below Friday’s is normal for most businesses. A Monday row count 40% below the previous Monday’s is not.
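Sketched with statsmodels’ seasonal_decompose, alerting on the residual rather than the raw count. The weekly period and the 3-sigma cut are assumptions to tune.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def residual_alert(daily_counts: pd.Series, period: int = 7, sigmas: float = 3.0) -> bool:
    """Strip the weekly cycle, then flag a day whose residual is extreme."""
    result = seasonal_decompose(daily_counts, model="additive", period=period,
                                extrapolate_trend="freq")  # avoids NaN residuals at the edges
    resid = result.resid
    return abs(resid.iloc[-1] - resid.mean()) > sigmas * resid.std()
```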
For multivariate anomalies, isolation forests trained on row count, null rates, distinct counts, and record size flag days where no individual metric looks wrong but the combination is suspicious. A batch where the weight, color, and temperature are each within range, but the combination has never occurred before. Data observability platforms extend these approaches into self-tuning monitors that adapt their baselines automatically.
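A sketch of that multivariate monitor with scikit-learn’s IsolationForest, trained on one row per day of table-level metrics. The feature names and contamination rate are illustrative.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

FEATURES = ["row_count", "null_rate", "distinct_product_ids", "avg_record_bytes"]

def fit_daily_monitor(history: pd.DataFrame) -> IsolationForest:
    """Train on past days' per-day metrics (assumes that history was mostly healthy)."""
    model = IsolationForest(contamination=0.02, random_state=42)
    return model.fit(history[FEATURES])

def is_suspicious(model: IsolationForest, today: pd.DataFrame) -> bool:
    """IsolationForest.predict returns -1 for outliers, 1 for inliers."""
    return bool(model.predict(today[FEATURES])[0] == -1)
```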
Partition-Level and Cross-Table Checks
Partition-level checks validate each day on its own. A table at 1.2% nulls looks healthy. Today’s partition at 15% nulls does not. The total hides the spike.
Cross-table consistency catches records missing between related tables. Orders without matching entries in order_items. Invoices referencing customers that don’t exist in the customer dimension.
Historical comparison with rolling averages and 2-sigma thresholds catches gradual erosion. Compare today’s distinct value count against the 30-day rolling average. A slow decline in distinct product IDs might indicate a filtering bug eating categories.
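The first two checks in pandas, assuming orders and order_items frames with a load_date column; table and column names are illustrative.

```python
import pandas as pd

def partition_null_rate(df: pd.DataFrame, column: str, load_date: str) -> float:
    """Null rate for one day's partition, so a spike can't hide in the table-wide average."""
    partition = df[df["load_date"] == load_date]
    return partition[column].isna().mean()

def orphaned_orders(orders: pd.DataFrame, order_items: pd.DataFrame) -> pd.DataFrame:
    """Orders with no matching rows in order_items: a left anti-join on order_id."""
    merged = orders.merge(order_items[["order_id"]].drop_duplicates(),
                          on="order_id", how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"]
```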
| Detection Method | Catches | Misses | Best For |
|---|---|---|---|
| Static thresholds | Sudden breaks, known failure modes | Gradual drift, seasonal variation | Simple pipelines with stable patterns |
| Rolling z-score | Drift, sudden anomalies | Seasonal effects without decomposition | Non-cyclical metrics |
| Seasonal decomposition | Weekly/monthly patterns | Multivariate anomalies | Business metrics with known cycles |
| Isolation forest | Complex multivariate anomalies | Anomalies that resemble its training history | Mature pipelines with history |
What the Industry Gets Wrong About Data Quality
“Data quality is a dashboard problem.” Dashboards display quality metrics. They don’t enforce them. A dashboard showing 99.5% completeness is useless if the pipeline that produces bad data runs uninterrupted. The report that says the factory is clean doesn’t stop the contaminated batch from shipping. Quality gates that block pipeline execution on failure are the control. Dashboards are the reporting layer.
“More tests mean better quality.” Tests that nobody maintains, that generate false positives teams learn to ignore, or that run after the pipeline has already shipped corrupted data create false confidence. Worse than no tests. A food safety poster in a kitchen nobody cleans. Five well-maintained tests per model that block on failure outperform 50 stale assertions that generate warnings nobody reads.
“Schema validation covers data quality.” Schema checks confirm structure. They say nothing about whether the values make business sense. A valid integer in the order_total column could still be negative, or zero, or a million times larger than any real order. The package weighs what the label says. The contents are wrong. Validity rules and business logic assertions catch what schema checks wave through.
Don’t: Start with non-blocking quality checks and plan to “make them blocking later.” The transition never happens. Teams build workflows around ignoring warnings, and by month three, the alert channel is muted. The inspection report nobody reads. Again.
Do: Start blocking on the three most critical checks from day one. Expand blocking coverage as confidence in the checks grows. The short-term pain of a blocked pipeline trains teams to fix root causes instead of ignoring symptoms.
That enum rename from the opening? A schema contract test catches it before the pipeline ever runs. Four days of wrong data becomes zero. Each post-mortem produces a new quality check. Over time, a team builds hundreds of assertions reflecting actual failure modes. The data engineering pipeline supporting these checks becomes institutional knowledge that doesn’t walk out when someone leaves. The factory’s inspection handbook. Every check is a scar from a real incident. Data quality is a practice that compounds.