Data Quality Pipelines: Catching Corruption Before Dashboards
It is a quiet afternoon. The VP of Sales pings the data team: “Revenue on the dashboard dropped 30% overnight. What happened?” Two hours of investigation later, the answer: nothing happened to the business. A product engineer renamed an enum value from "completed" to "complete" in the orders API a few days earlier. The pipeline kept running. The transformation kept filtering for "completed". Four days of orders silently fell through the filter. The dashboard showed real revenue dropping. The VP almost pulled a campaign based on phantom numbers. If you have been in data engineering for more than a year, you have lived some version of this story.
Your most expensive data quality problem is not the pipeline that fails with an error. That one is easy. It is the pipeline that runs successfully, returns HTTP 200, and delivers subtly wrong numbers to people who make decisions with them. A pipeline that produces zero rows is visible immediately. Someone asks why the dashboard is empty. A pipeline that produces rows with incorrect aggregations is visible only when someone notices a metric does not match their gut feel. That might be days, weeks, or months after the corruption started. By then, decisions have been made on bad data, and the harder question is not “what went wrong.” It is “does anyone still trust the numbers?”
Data quality as an engineering discipline is about building automated checks, monitoring, and response processes that detect corruption before it reaches the people who depend on the data. Not because quality is a compliance exercise. Because bad data has real operational consequences that compound with every day it goes undetected.
The Five Dimensions of Data Quality
Data quality is multidimensional, and enforcement priorities depend on who is consuming the data and why. Here are the five dimensions that matter in practice.
Completeness: required fields are populated. An order record without a customer_id is incomplete. Completeness checks are straightforward: measure null rates against thresholds. For most analytics tables, 99%+ completeness on key fields is the minimum. For ML training data, even 95% completeness introduces bias if the missing 5% is not random.
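A completeness check can be a few lines of code. Here is a minimal sketch, assuming a hypothetical list-of-dicts batch and an illustrative 1% null-rate threshold (field names are invented for the example):

```python
# Sketch of a completeness check: measure null rate per required field
# against a threshold. Record shape and threshold are illustrative.

def null_rate(records, field):
    """Fraction of records where `field` is missing or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def check_completeness(records, required_fields, max_null_rate=0.01):
    """Return the fields whose null rate exceeds the threshold."""
    return {
        f: rate
        for f in required_fields
        if (rate := null_rate(records, f)) > max_null_rate
    }

orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": None},   # incomplete record
    {"order_id": 3, "customer_id": "c2"},
]
# customer_id fails the threshold: 1 of 3 records is null
violations = check_completeness(orders, ["order_id", "customer_id"])
```

The same shape works against a warehouse: swap the Python loop for a `COUNT(*) FILTER (WHERE field IS NULL)` query and keep the threshold logic.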
Freshness: data reflects the current state of the world within an acceptable window. A daily batch pipeline that did not run yesterday produces stale data. Monitor pipeline completion timestamps, alert at 80% of SLA threshold (not after the breach), and display a “data as of” timestamp on every dashboard. Analysts who know the data is 2 hours old make different decisions than analysts who assume it is current.
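The alert-before-breach rule above can be sketched as a small classifier. The SLA window, timestamps, and 80% warning fraction are illustrative:

```python
# Sketch of a freshness check with early warning at 80% of the SLA,
# so the alert fires before the breach rather than after it.
from datetime import datetime, timedelta, timezone

def freshness_status(last_success, sla, now, warn_fraction=0.8):
    """Classify pipeline freshness as ok / warning / breach."""
    age = now - last_success
    if age > sla:
        return "breach"
    if age > warn_fraction * sla:
        return "warning"   # still within SLA, but trending toward a miss
    return "ok"

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
sla = timedelta(hours=24)
assert freshness_status(now - timedelta(hours=2), sla, now) == "ok"
assert freshness_status(now - timedelta(hours=20), sla, now) == "warning"
assert freshness_status(now - timedelta(hours=30), sla, now) == "breach"
```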
Accuracy: values reflect reality. This is the hardest dimension to measure automatically because it requires ground truth. Accuracy checks typically involve cross-referencing against authoritative sources (comparing your transaction counts to Stripe’s settlement reports) or statistical validation against expected distributions. A 5% inaccuracy rate in labeled training data degrades ML model performance by 15-20%. That is not a rounding error. That is a broken model.
Uniqueness: records are not duplicated. Duplicate records cause double-counting in every downstream aggregation. One duplicated order in a revenue report is a rounding error. A pipeline bug that duplicates 3% of orders is a material misstatement. Check natural keys for uniqueness at every layer.
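A natural-key uniqueness check is equally compact. This sketch uses a hypothetical single-field key; in practice the key is often composite:

```python
# Sketch of a natural-key uniqueness check. Key fields are illustrative.
from collections import Counter

def duplicate_keys(records, key_fields):
    """Map each duplicated natural key to its occurrence count."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {k: n for k, n in counts.items() if n > 1}

orders = [
    {"order_id": "A1", "amount": 40},
    {"order_id": "A2", "amount": 25},
    {"order_id": "A1", "amount": 40},  # duplicate: would double-count revenue
]
dupes = duplicate_keys(orders, ("order_id",))
# ("A1",) appears twice; fail the batch or dedupe before aggregating
```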
Validity: values conform to expected formats, ranges, and constraints. Phone numbers parse correctly. Ages fall between 0 and 130. Currency codes exist in ISO 4217. Validity checks are explicit assertions against business rules. They are cheap to write and they catch a surprising number of issues. Production pipelines routinely run for months with country_code: "XX" in 8% of records because nobody wrote a validity check. Months.
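Validity rules read almost exactly like the business rules they encode. A minimal sketch, with invented rules and a deliberately tiny ISO 4217 subset for illustration:

```python
# Sketch of explicit validity assertions against business rules.
# The rules and the record shape are illustrative, not a fixed schema.
import re

ISO_4217_SAMPLE = {"USD", "EUR", "GBP", "JPY"}  # subset for illustration

RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
    "currency": lambda v: v in ISO_4217_SAMPLE,
    # two uppercase letters, and reject the "unknown" placeholder outright
    "country_code": lambda v: bool(re.fullmatch(r"[A-Z]{2}", v or "")) and v != "XX",
}

def validity_violations(record):
    """Return the fields in a record that fail their validity rule."""
    return [f for f, rule in RULES.items() if f in record and not rule(record[f])]

bad = {"age": 212, "currency": "USD", "country_code": "XX"}
assert validity_violations(bad) == ["age", "country_code"]
```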
Data Quality SLAs and On-Call
Data quality without accountability becomes data quality with exceptions: everyone agrees quality matters, but nobody’s pager goes off when it degrades. That gap between agreeing and enforcing is where data trust goes to die. The organizational structure that makes quality real is SLAs with named owners and on-call rotations.
A data product SLA specifies: freshness (data no older than X minutes), completeness (null rate below Y%), and row count expectations (between Z_min and Z_max rows per day). When the SLA is violated, an alert fires and a named engineer is responsible for investigating. Not “the data team.” A person. With a phone number. If you cannot name the person, you do not have an SLA. You have a wish.
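The SLA described above is concrete enough to be code. A minimal sketch, with an invented owner name and illustrative thresholds:

```python
# Sketch of a data product SLA as a record with a named owner,
# evaluated against observed metrics. Names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class DataProductSLA:
    owner: str              # a person, not "the data team"
    max_age_minutes: int    # freshness
    max_null_rate: float    # completeness
    min_rows: int           # row count expectations
    max_rows: int

    def violations(self, age_minutes, null_rate, rows):
        """Return the list of SLA dimensions currently in breach."""
        out = []
        if age_minutes > self.max_age_minutes:
            out.append("freshness")
        if null_rate > self.max_null_rate:
            out.append("completeness")
        if not (self.min_rows <= rows <= self.max_rows):
            out.append("row_count")
        return out

sla = DataProductSLA(owner="jsmith", max_age_minutes=90,
                     max_null_rate=0.01, min_rows=8_000, max_rows=15_000)
assert sla.violations(age_minutes=45, null_rate=0.002, rows=9_500) == []
assert sla.violations(age_minutes=120, null_rate=0.02, rows=5_000) == [
    "freshness", "completeness", "row_count"]
```

A nonempty result routes an alert to `sla.owner`, not to a shared channel.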
The maturity required here mirrors software engineering’s SRE model. The team that owns the pipeline is responsible for its reliability, including quality. Data quality incidents follow the same on-call process as service incidents: acknowledge within 15 minutes, investigate, mitigate, post-mortem. The data engineering pipelines practice covers the tooling and team structures that support data reliability engineering.
Each post-mortem produces one or more new quality checks. This is the compounding value that makes the whole system work: you encode the lesson learned as an automated test that catches the same failure category permanently. Over 12 months, a team that runs post-mortems consistently builds a quality check library of 200+ assertions that reflect the actual failure modes of their specific data environment. That is institutional knowledge that does not walk out the door when an engineer leaves.
Implementing Quality Gates with Modern Tooling
Quality dimensions and SLAs are only as effective as the tooling that enforces them. The good news: the modern data quality stack has matured considerably. Tools like Great Expectations, Soda, and dbt tests give teams the ability to embed quality assertions directly into pipeline code rather than running checks as a separate afterthought.
Great Expectations excels at expressive, Python-native assertions. A single expectation suite can validate that a column’s values fall within a range, that row counts match a historical baseline within 10%, or that a foreign key relationship holds across two tables. Teams running Great Expectations report catching 85% of data quality issues before they reach production, compared to roughly 40% with ad-hoc SQL checks.

Soda takes a declarative approach with YAML-based check definitions, making it accessible to analysts who do not write Python. Its scan results integrate directly with Slack, PagerDuty, and other alerting tools.

dbt tests are the most lightweight option. They run as part of the dbt build process itself, interleaved with the transformations: when a test fails during `dbt build`, downstream models are skipped rather than built on top of the bad data. That is a quality gate with zero additional infrastructure.
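To make the dbt option concrete, here is a minimal schema file using dbt’s built-in `unique`, `not_null`, and `accepted_values` tests (the model and column names are hypothetical):

```yaml
# models/schema.yml -- illustrative model using dbt's built-in tests
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["completed", "cancelled", "refunded"]
```

Note that an `accepted_values` test on `status` would have caught the enum rename from the opening story the first time a "complete" value arrived.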
The most effective quality architectures structure checks in three layers.

Source validation runs at ingestion and catches issues before data enters the warehouse. These checks focus on schema conformance, field-level type validation, and row count sanity bounds. If a source API starts returning a new column or drops an existing one, source validation catches it within minutes.

Transformation validation runs after each significant transformation step and asserts that business logic produces expected outputs. If a join drops 15% of records because a key format changed, transformation validation flags the discrepancy before downstream models consume the incomplete result.

Output validation runs as the final gate before data reaches consumers and checks end-to-end properties like referential integrity across fact and dimension tables, aggregate consistency (do individual line items sum to the reported total?), and cross-system reconciliation against authoritative sources.
Within each layer, specific patterns make quality gates more precise. Partition-level checks validate each daily or hourly partition independently rather than checking aggregate statistics across the full table. A table-level null rate of 1.2% might look healthy, but if today’s partition has a 15% null rate while historical partitions average 0.8%, that is a clear signal something broke. Cross-table consistency checks verify that records flowing through a pipeline maintain relationships. If the orders table has 10,000 records for yesterday but the order_items table has entries for only 7,200 distinct orders, something dropped 28% of the parent records during transformation. Historical comparison checks use rolling averages and standard deviations to detect drift. Comparing today’s row count against the 30-day rolling average with a 2-sigma threshold catches both sudden drops and gradual erosion.
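The partition-level pattern can be sketched in a few lines. The null rates and the 3x alert ratio below are illustrative:

```python
# Sketch of a partition-level check: compare today's partition null
# rate against the historical per-partition average, rather than the
# whole-table rate. Numbers and the alert ratio are illustrative.
import statistics

def partition_null_alert(history, today, max_ratio=3.0):
    """Alert when today's partition null rate far exceeds the baseline."""
    baseline = statistics.fmean(history)
    return baseline > 0 and today / baseline > max_ratio

history = [0.008, 0.007, 0.009, 0.008]   # ~0.8% nulls per historical partition
assert not partition_null_alert(history, today=0.012)  # mild wobble: no alert
assert partition_null_alert(history, today=0.15)       # today's 15% spike fires
```

A whole-table check would have averaged today’s 15% spike into a healthy-looking 1.2%; the per-partition comparison is what surfaces it.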
Integration with orchestrators like Airflow or Dagster makes quality gates truly blocking. In Airflow, quality checks run as dedicated tasks between transformation tasks. If the check task fails, the DAG halts and downstream tasks never execute. Dagster’s asset-based model is even more natural for this pattern. Each asset can declare quality expectations as part of its definition, and Dagster will refuse to materialize downstream assets if upstream quality checks fail. Here is the design principle that matters most: quality gates should be blocking by default and non-blocking only by explicit exception. Teams that start with non-blocking alerts and plan to “make them blocking later” never do. The alerts become noise within weeks. Make them blocking from day one or do not bother.
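The blocking-by-default pattern is orchestrator-agnostic at its core: a gate runs checks and raises on failure, and the orchestrator task wrapping it fails, halting downstream work. A minimal sketch with invented check names (not Airflow or Dagster API code):

```python
# Sketch of a blocking quality gate: any failed check raises, which in
# Airflow fails the task (halting the DAG) and in Dagster blocks
# materialization of downstream assets. Check names are illustrative.

class QualityGateError(Exception):
    pass

def run_quality_gate(checks, data):
    """Run named check functions against data; raise if any fail."""
    failures = [name for name, check in checks.items() if not check(data)]
    if failures:
        raise QualityGateError(f"quality gate failed: {failures}")

checks = {
    "nonempty": lambda rows: len(rows) > 0,
    "row_count_floor": lambda rows: len(rows) >= 3,
}

run_quality_gate(checks, data=[1, 2, 3])   # passes silently
try:
    run_quality_gate(checks, data=[1])     # floor violated: gate blocks
    blocked = False
except QualityGateError:
    blocked = True
```

Making the gate non-blocking then becomes a deliberate act (catching the exception), which is exactly the default the text argues for.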
The Root Cause Problem
Here is the uncomfortable truth: most data quality incidents are not pipeline bugs. They are application behavior changes that upstream teams did not communicate to downstream consumers.
A product engineer renames a column. Changes an enum. Starts populating a previously null field with different semantics. The change is correct from the application’s perspective. No application tests fail. The data pipeline keeps running. Reports start showing wrong numbers. The data team finds out days later when a stakeholder notices something looks off. This pattern repeats itself every quarter at organizations without contracts in place.
The structural solution is data contracts: formal agreements between data producers and consumers that specify schema, semantics, SLAs, and change notification requirements. When an application team changes a field, the contract blocks deployment until downstream consumers acknowledge the change. Organizations with contracts in place reduce pipeline incidents caused by upstream changes by 70% within the first quarter of enforcement.
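Even before full producer-side enforcement exists, a consumer-side contract check catches the silent-change failure mode. A minimal sketch; the fields, types, and allowed values are illustrative, echoing the enum rename from the opening story:

```python
# Sketch of a consumer-side data contract: declared fields, types, and
# allowed enum values. Schema contents are illustrative.

CONTRACT = {
    "order_id": {"type": str},
    "status": {"type": str, "allowed": {"completed", "cancelled", "refunded"}},
    "amount_cents": {"type": int},
}

def contract_violations(record):
    """Check one record against the contract; return violation messages."""
    problems = []
    for field, spec in CONTRACT.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: wrong type {type(value).__name__}")
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems

# The silent enum rename from the opening story is caught explicitly:
renamed = {"order_id": "A1", "status": "complete", "amount_cents": 4000}
assert contract_violations(renamed) == ["status: unexpected value 'complete'"]
```

The producer-side version of the same idea runs this check in the application team’s CI, blocking their deployment instead of alerting yours.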
This is as much a process change as a technical one. It requires organizational buy-in beyond the data engineering team. Product engineering leadership needs to accept that their teams have downstream data consumers, and that breaking those consumers has a real business cost. Without that buy-in, you are just building better cleanup downstream instead of preventing the mess at the source. That is the wrong investment.
Anomaly Detection Beyond Static Thresholds
Static thresholds are the starting point for data quality enforcement, but they have a fundamental blind spot: gradual drift. This is the boiling frog problem in data quality. A static rule like “alert if row count drops below 8,000” catches sudden failures. It does not catch a row count that was 10,000 three months ago, dropped to 9,800 the next week, then 9,600 the week after that. Each individual week looks fine. The threshold never fires. Three months later, the table has 4,800 rows per day. That is a 52% loss that accumulated slowly enough to stay beneath every static check. The data degrades incrementally, each increment is within tolerance, and by the time someone notices, months of reports are compromised.
Statistical anomaly detection addresses this by evaluating data quality metrics against their own recent history rather than against fixed numbers. The simplest approach that works well: z-scores on a rolling window. If today’s row count is more than 2.5 standard deviations below the 30-day rolling mean, it triggers an alert regardless of whether it crosses a static floor. Rolling z-scores catch both sudden anomalies and sustained directional drift because the rolling mean itself shifts gradually. A metric drifting downward will eventually produce a z-score spike when the current value diverges far enough from the trailing window.

Seasonal decomposition adds another layer. Many data sources have weekly, monthly, or quarterly cycles. A retail dataset’s Saturday row count is naturally 3x higher than Tuesday’s. Alerting on raw row counts produces false positives every Tuesday. Decomposing the time series into trend, seasonal, and residual components lets you alert on the residual. The residual strips out known cyclicality and exposes genuine anomalies.
Here is a concrete example that illustrates why this matters. An e-commerce pipeline processes order events. Historical daily row counts average 10,000 on weekdays with a standard deviation of 400. A static threshold set at 8,000 rows (20% below average) seems reasonable. Now suppose a bug in event tracking causes 2% of events to be silently dropped each week. Week one: 9,800 rows. Week two: 9,604. Week eight: 8,508. Week twelve: 7,847. The static threshold finally fires at week twelve, after nearly three months of data loss totaling over 100,000 missing records. A z-score on a 14-day rolling window would have flagged the anomaly by week three or four, when the cumulative 6-8% drop pushed the current value beyond 2 standard deviations of the short-term average. Three months versus three weeks. That is not a marginal improvement.
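The scenario above can be simulated directly. This sketch omits the day-to-day noise for determinism (with the stated 400-row standard deviation, detection would arrive somewhat later and less predictably), but the mechanics are the same:

```python
# Simulation of the 2%-per-week silent data loss above, detected by a
# z-score against a 14-day trailing window. Noise omitted for determinism.
import statistics

def first_anomaly_day(counts, window=14, threshold=-2.5):
    """Index of the first day whose z-score vs the trailing window
    falls below the threshold, or None if never."""
    for i in range(window, len(counts)):
        win = counts[i - window:i]
        mean, sd = statistics.fmean(win), statistics.pstdev(win)
        if sd > 0 and (counts[i] - mean) / sd < threshold:
            return i
    return None

# Daily row counts: each week loses a further 2% of events.
counts = []
for week in range(1, 13):
    counts.extend([round(10_000 * 0.98 ** week)] * 7)

zscore_day = first_anomaly_day(counts)
static_day = next(i for i, c in enumerate(counts) if c < 8_000)
assert zscore_day == 14   # first day of week 3: rolling z-score fires
assert static_day == 77   # first day of week 12: static floor finally fires
```

Week 3 versus week 12, exactly the gap the prose describes: the rolling window keeps the comparison local, so even a small weekly step stands out against the tight recent spread.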
For more complex scenarios, machine learning approaches detect multivariate anomalies that no single metric captures. An isolation forest trained on a feature vector of row count, null rates across key columns, distinct value counts, and average record size can identify days where no individual metric is anomalous but the combination is unusual. For example, row count is normal, but the null rate on email increased by 3% while the distinct count of country_code dropped by 40%. Individually, both are within threshold. Together, they suggest a geographic segment of users stopped being tracked. These multivariate methods require more setup and ongoing tuning, but they catch the subtle, correlated failures that single-metric checks miss entirely. Teams investing in advanced analytics capabilities can extend these same techniques to build self-tuning quality monitors that adapt their sensitivity as data patterns evolve.
Data quality is not a project with a completion date. It is a practice that compounds value over time as quality checks accumulate, ownership matures, and the organization develops the reflexes to catch corruption at the source rather than discovering it in a dashboard days later. The teams that treat data quality as infrastructure build trust. The teams that treat it as a project lose trust the moment the project ends.