
MLOps Pipelines: From Notebook to Production ML

Metasphere Engineering · 7 min read

The data science team spent three months training a fraud detection model that achieved 94% precision on the test set. They pickled it, uploaded it to an S3 bucket, and asked DevOps to “just deploy it.” Six months later, the model’s precision had drifted to 71%, well below the threshold where it was generating value. Nobody noticed because nobody was monitoring it. When the team tried to reproduce the original results for a compliance audit, they could not. The notebook had been modified since training. The training data path pointed to a directory that had been reorganized. The random seed was never set. The exact version of scikit-learn was not recorded. If this sounds familiar, you’re not alone.

This is the most common MLOps failure mode, and it is not a people problem. It is an architecture problem. The research mindset that produces good models is fundamentally incompatible with the engineering mindset required to keep them working in production. Experimentation culture accepts that most experiments fail and iterations are exploratory. Production engineering requires that deployed systems are reproducible, monitored, and systematically improvable. MLOps resolves this tension not by making research less experimental, but by applying software engineering discipline to the parts of the ML lifecycle that production demands. Teams that do this well treat model deployment with the same rigor as application deployment. Teams that don't end up in the scenario above. Every time.

Reproducibility Is Not Optional

Here is a test worth running with your ML team right now. Pick your best production model. Can you reproduce the exact training run that created it? Not approximately. Exactly. Same data, same code, same hyperparameters, same output within floating-point tolerance.

About 15% of teams can. That number should scare you. The rest have some combination of: training scripts that loaded data from a path that no longer exists, preprocessing done interactively in a notebook modified after training, random seeds not set, exact library versions not recorded, or training data that has been updated in place since the run completed.

None of this matters for a prototype. All of it matters when you need to debug a production performance regression, investigate why a model is behaving unexpectedly, or prove to a regulator exactly what data was used to train a model making consequential decisions about loans, insurance, or healthcare. And that day will come.

Reproducibility requires versioning three things together. Miss any one of them and the whole guarantee collapses.

Code. The training script, model architecture definition, and preprocessing logic. Git provides this. Pin your requirements.txt or poetry.lock. Do not use >= version specifiers for ML libraries. A minor version bump in scikit-learn or PyTorch will change model outputs. Not might. Will.

Data. The exact training and validation datasets at the exact version used for training. Data versioning tools like DVC or Delta Lake snapshots provide this. The key is content-addressable storage: each dataset version gets a hash, and the training run references that hash. Robust data engineering pipelines provide the infrastructure required to make this tractable at scale.

Configuration. Hyperparameters, preprocessing steps, random seeds, framework versions. Experiment tracking tools like MLflow or Weights & Biases capture this alongside the resulting model artifact, linking all three together. Here are the calls every team should be running: mlflow.log_params() with every hyperparameter, mlflow.log_artifact() with the frozen requirements file, and mlflow.log_input() with the DVC data reference.

If any of these three change without tracking, you cannot reproduce the model. Period.
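Whether you use MLflow or something homegrown, the underlying mechanism is the same: content-address the data, then write the configuration and environment down next to that hash. Here is a dependency-free sketch of that idea; the file paths, hyperparameter names, and manifest format are illustrative, not a real MLflow or DVC schema:

```python
import hashlib
import json
import platform
from pathlib import Path

def dataset_hash(path: str, chunk: int = 1 << 20) -> str:
    """Content-address a dataset file: identical bytes always yield the same hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_run_manifest(data_path: str, params: dict, out: str = "run_manifest.json") -> dict:
    """Record everything needed to reproduce a training run in one place."""
    manifest = {
        "data_sha256": dataset_hash(data_path),  # data: a content hash, not a path
        "params": params,                        # configuration, incl. random seed
        "python": platform.python_version(),     # environment marker
        # A real pipeline would also record the git commit and `pip freeze` output.
    }
    Path(out).write_text(json.dumps(manifest, indent=2))
    return manifest

# Toy dataset and hypothetical hyperparameters, for illustration only.
Path("train.csv").write_text("amount,label\n12.5,0\n99.0,1\n")
manifest = write_run_manifest("train.csv", {"n_estimators": 300, "random_state": 42})
```

If the dataset is edited in place, the hash changes and the manifest no longer matches, which is exactly the failure the audit scenario above could not detect.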

With reproducibility infrastructure in place, you can build the real prize: a full ML pipeline that connects data ingestion through experimentation, evaluation, deployment safety, and production monitoring into a continuous loop where drift detection triggers retraining automatically.

Feature Management at Scale

This is where most production ML systems silently fall apart. Training-serving skew accounts for an estimated 30-40% of production ML performance issues. The failure mode is distinctive: the same feature is computed differently for training and for inference. Training uses a batch pipeline that computes 30-day rolling average purchase frequency on historical data. The inference pipeline computes the same feature on live data with a slightly different window or different null handling. The model receives features during inference that look subtly different from training, and performance degrades in ways that are brutally hard to diagnose.

Feature stores (Feast, Tecton, or well-designed custom implementations) solve this by centralizing feature computation and ensuring the same logic runs for both training (reading from the offline store) and inference (reading from the online store). Make this investment before scale forces the issue, not after debugging three months of silent performance degradation. This is a core component of MLOps model lifecycle automation. For a deeper dive into when and how to adopt feature stores, see our dedicated guide on ML feature stores and training-serving skew.
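The discipline a feature store enforces can be shown in miniature: one feature definition, two read paths. A toy sketch of the rolling-frequency feature from above, assuming in-memory purchase timestamps (the function name and windowing are illustrative, not the Feast or Tecton API):

```python
from datetime import datetime, timedelta

def rolling_purchase_frequency(purchases: list[datetime], now: datetime,
                               window_days: int = 30) -> float:
    """Single feature definition shared by training and serving.

    Purchases per day over the trailing window; an empty history yields 0.0
    rather than None, so null handling cannot diverge between pipelines.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [p for p in purchases if cutoff <= p <= now]
    return len(recent) / window_days

history = [datetime(2024, 1, d) for d in (2, 5, 9, 20)]

# Offline (training): replay a historical point in time.
train_value = rolling_purchase_frequency(history, now=datetime(2024, 1, 31))

# Online (inference): the identical code path, just called with the live clock.
serve_value = rolling_purchase_frequency(history, now=datetime(2024, 1, 31))
```

Because both paths call the same function, a changed window or different null handling is a single-line diff instead of two pipelines silently disagreeing.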

Deployment Safety for ML Models

Model deployment has risks that are fundamentally different from application deployment. A bad application crashes. A bad model fails silently, producing predictions that are wrong but plausible, with no exception thrown and no error in the logs. The fraud model dropping from 94% to 71% precision produces no alerts by default. It just quietly costs you money.

[Chart: model accuracy falling from 94% in Month 1 to 71% by Month 6 (94%, 91%, 87%, 82%, 76%, 71%), with a dashed minimum-acceptable-accuracy threshold at 80%. When the line crosses below the threshold, drift is detected and retraining is triggered.]

Shadow deployment is the ML equivalent of a dark launch, and every team should use it. The new model processes real production inputs but does not serve its predictions to users. Predictions are logged and compared against the current production model. Systematic differences surface before any user impact and before any business impact.
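The serving-path change is small. A minimal sketch, with lambdas standing in for real fraud scorers (the model signatures and log fields are hypothetical):

```python
import logging

logger = logging.getLogger("shadow")

def serve(request_features: dict, champion, shadow) -> float:
    """Serve the champion's prediction; run the shadow model on the same input
    and log the pair for offline comparison. Users never see shadow output."""
    live = champion(request_features)
    try:
        candidate = shadow(request_features)
        logger.info("shadow_compare live=%s candidate=%s", live, candidate)
    except Exception:
        # A shadow failure must never affect the user-facing response.
        logger.exception("shadow model failed")
    return live

# Illustrative stand-ins for the production and candidate fraud models.
champion_model = lambda f: 0.2 if f["amount"] < 100 else 0.9
shadow_model = lambda f: 0.1 if f["amount"] < 120 else 0.95

result = serve({"amount": 150.0}, champion_model, shadow_model)
```

In production you would run the shadow call asynchronously so it adds no latency, but the invariant is the same: the champion's output is the only one that leaves the building.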

Champion-challenger testing goes further, and this is where you get real confidence. A small percentage of production traffic is served predictions from the challenger (new candidate). Business metrics, not just ML metrics, are compared between champion and challenger cohorts: conversion rate, fraud detection rate, click-through rate. The challenger is promoted only when it demonstrably outperforms the champion on metrics the business cares about, not just on offline test set performance.
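Routing that small percentage has one subtlety worth getting right: assignment should be deterministic per user, so each user stays in one cohort and the business metrics stay comparable. A minimal sketch (the 5% split and user-id scheme are illustrative):

```python
import hashlib

def assign_cohort(user_id: str, challenger_pct: float = 0.05) -> str:
    """Deterministically route a slice of traffic to the challenger.

    Hashing the user id keeps each user in the same cohort across requests,
    so per-cohort conversion or fraud-catch rates are not diluted by users
    bouncing between models.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "challenger" if bucket < challenger_pct * 10_000 else "champion"

# Over many users the split converges on the configured percentage.
cohorts = [assign_cohort(f"user-{i}") for i in range(10_000)]
challenger_share = cohorts.count("challenger") / len(cohorts)
```

The same hash trick also makes the experiment auditable after the fact: given a user id, you can recompute which model scored them.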

The AI/ML engineering discipline that separates teams with reliable production models from teams with fragile ones comes down to this: treat model deployment as a software deployment problem requiring systematic validation. “It worked on the test set” is not evidence. It’s a hypothesis. The fraud model that drifted to 71% would have been caught in shadow mode within days of the distribution shift beginning.

For teams in financial services, where model drift has direct capital impact, our guide on financial AI data quality covers the automated retraining loops and drift detection thresholds specific to that domain. For teams managing the governance layer around production models, responsible AI governance covers bias monitoring and audit trail requirements. The models you ship are only as good as the infrastructure keeping them honest.

Engineer Your ML Production Systems

A model is only as good as the pipeline that keeps it trained, monitored, and deployed correctly. Metasphere builds MLOps infrastructure that applies real software engineering discipline to your ML lifecycle - reproducible training, safe deployments, and automated drift detection.

Build Your MLOps Pipeline

Frequently Asked Questions

What is model drift and how do you detect it before it causes business impact?


Model drift is performance degradation because real-world data has shifted from the training distribution. Most models experience measurable drift within 3-6 months of deployment. Detection requires monitoring input feature distributions using statistical tests like PSI (threshold > 0.2 indicates significant drift) or KS test (p-value < 0.05). Automated alerts should trigger retraining evaluation when drift scores exceed thresholds on more than 10% of monitored features.
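The PSI check mentioned above is simple enough to compute by hand. A minimal sketch with equal-width bins derived from the training sample (production implementations typically use quantile bins, and the epsilon for empty bins is a common but arbitrary choice):

```python
import math

def population_stability_index(expected: list[float], actual: list[float],
                               bins: int = 10) -> float:
    """PSI between a training-time ('expected') and live ('actual') feature
    sample. Bin edges come from the expected sample; > 0.2 flags significant drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of v's bin
        # Small epsilon avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions score ~0; a shifted distribution scores well above 0.2.
baseline = [i / 1000 for i in range(1000)]
drifted = [v + 0.5 for v in baseline]
psi_same = population_stability_index(baseline, list(baseline))
psi_shift = population_stability_index(baseline, drifted)
```

Run this per monitored feature on a schedule; when the share of features above threshold crosses the 10% line, page the team rather than silently retraining.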

What does ML reproducibility actually require in practice?


Reproducibility requires versioning three things together: code (training script and model architecture via Git), data (exact training and validation datasets tracked by content hash, using DVC or Delta Lake snapshots), and configuration (hyperparameters, preprocessing steps, random seeds via MLflow or Weights & Biases). If any of these three change without tracking, you cannot reproduce the model - and you cannot prove to a regulator what data was used.

What is shadow deployment for ML models?


Shadow deployment runs a new model candidate alongside the current production model. Production traffic is served normally by the production model. Simultaneously, the same inputs are sent to the shadow model and predictions are logged but not served to users. This lets you evaluate the new model’s behavior on real production traffic - including data distributions and edge cases your offline test set does not capture - without any user-facing impact.

When should you retrain a model vs. roll it back?


Roll back when performance has degraded suddenly due to a specific identifiable cause - a data pipeline bug introduced corrupted features, a deployment error changed preprocessing logic. Retrain when performance has degraded gradually due to data drift and you have new labeled data available. The distinction matters: retraining on drifted data without understanding the cause may produce a model that improves on evaluation metrics but has learned the wrong patterns.

What is the difference between online and offline model evaluation?


Offline evaluation uses a held-out test set and runs in minutes, but can miss distribution shifts between test data and production. Online evaluation measures real production traffic using business metrics like conversion rate or fraud detection accuracy. Most teams find a 5-15% gap between offline test metrics and real production performance. Online evaluation requires shadow deployment or A/B testing infrastructure, and typically needs 2-4 weeks of data to reach statistical significance.
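Whether an online difference has reached significance is an ordinary two-proportion test. A minimal sketch with hypothetical cohort counts (substitute your own conversion numbers; real experiments also need a pre-registered minimum detectable effect):

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference in conversion rate between champion (a)
    and challenger (b) cohorts. |z| > 1.96 is significant at the 95% level."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts after two weeks of champion-challenger traffic:
# champion converts 480 of 10,000 users, challenger 290 of 5,000.
z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=290, n_b=5_000)
significant = abs(z) > 1.96
```

The 2-4 week horizon in the answer above is exactly the time it takes cohorts of this size to push |z| past the threshold for realistic effect sizes.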