MLOps: From Notebook to Monitored Production

Metasphere Engineering · 13 min read

The data science team spent three months training a fraud detection model that achieved 94% precision on the test set. They pickled it, uploaded it to an S3 bucket, and asked DevOps to “just deploy it.” Six months later, the model’s precision had drifted to 71%, well below the threshold where it generated value. Nobody noticed because nobody was monitoring it.

The drug worked in the lab. The factory shipped it. Six months later, the disease mutated and the drug stopped working. Nobody checked.

When the team tried to reproduce the original results for a compliance audit, they couldn’t. The notebook had been modified since training. The training data path pointed to a directory that had been reorganized. The random seed was never set. The exact version of scikit-learn was not recorded. The lab notebook was illegible. The ingredients were moved. The recipe was never written down.

If this sounds familiar, you’re in good company. MLflow and DVC exist specifically because almost every ML team has lived some version of this story.

Key takeaways
  • Reproducibility is non-negotiable. If you can’t recreate the exact model that’s in production (same data, same code, same result), you can’t debug it, audit it, or improve it. If you can’t reproduce the drug, you can’t certify it.
  • Model monitoring catches drift before business impact. Precision dropped from 94% to 71% over six months with zero alerts. Production model metrics need the same observability as application metrics.
  • Feature pipelines need versioned, content-addressable datasets. “Training data” pointing to a directory that got reorganized is how compliance audits fail. Ingredients with no batch number.
  • Canary deployments work for models too. Route 5% of traffic to the new model. Compare predictions against the champion. Promote only when metrics improve. Clinical trials before the drug ships.
  • The notebook-to-production gap is where most ML initiatives stall. MLOps bridges it by applying software engineering discipline to the parts of the lifecycle that production demands.

MLOps applies engineering discipline to the ML lifecycle, giving model deployment the same rigor as application deployment.

Reproducibility: The Foundation Everything Else Depends On

Try this right now. Pick your best production model. Can you reproduce the exact training run that created it? Same data, same code, same hyperparameters, same output within floating-point tolerance.

Most teams cannot. The typical situation: training scripts load data from a path that no longer exists. Preprocessing was done interactively in a notebook that’s been modified since. Random seeds were never set. Library versions were never pinned. The lab burned down and the formula was on a whiteboard. For a prototype, none of this matters. All of it matters when you need to debug a production performance regression. Or investigate unexpected model behavior. Or prove to a regulator exactly what data trained a model making decisions about loans, insurance, or healthcare.

Reproducibility requires versioning three things together. Miss any one and the guarantee collapses. The formula. The ingredients. The process. All three or it’s not the same drug.

Code. The training script, model architecture definition, and preprocessing logic. Git handles this, but pin your requirements.txt or poetry.lock to exact versions. Don’t use >= specifiers for ML libraries. A minor version bump in scikit-learn or PyTorch will change model outputs. You’ll spend a week tracking down why predictions suddenly differ, and the answer will be a patch version of numpy you didn’t pin. (A slightly different ingredient supplier. Same name on the label. Different chemical composition.)
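
As a concrete starting point, here is a minimal sketch, assuming pip and a training script that can shell out; the requirements.lock.txt filename is a convention of this sketch, not a standard:

```python
# A minimal sketch, assuming pip: snapshot the exact environment at training
# time so the run can be rebuilt later. The lock-file name is our convention.
import subprocess
import sys

frozen = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True, text=True, check=True,
).stdout

with open("requirements.lock.txt", "w") as f:
    f.write(frozen)  # exact ==pinned versions, stored alongside the model
```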

Data. The exact training and validation datasets at the exact version used for training. Data versioning tools like DVC or Delta Lake snapshots provide content-addressable storage: each dataset version gets a hash, and the training run references that hash. The ingredient batch number. Solid data engineering pipelines provide the infrastructure to make this tractable at scale.
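
A minimal sketch of reading a pinned dataset with dvc.api; the repository URL, data path, and Git tag are hypothetical:

```python
# A minimal sketch with dvc.api: load the exact dataset version a training
# run used. The repo URL, path, and Git tag are hypothetical.
import dvc.api
import pandas as pd

with dvc.api.open(
    "data/transactions.csv",
    repo="https://github.com/acme/fraud-model",  # hypothetical repository
    rev="train-2024-03",  # Git tag pinning the .dvc pointer, hence the data hash
) as f:
    train_df = pd.read_csv(f)
```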

Configuration. Hyperparameters, preprocessing steps, random seeds, framework versions. Experiment tracking tools like MLflow or W&B capture this alongside the resulting model artifact. The manufacturing process. Temperature, pressure, duration. The commands every team should be running: mlflow.log_params() with every hyperparameter, mlflow.log_artifact() with the frozen requirements file, and mlflow.log_input() with the DVC data reference.
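
Put together, a minimal sketch of one tracked run, assuming MLflow 2.4+ (where mlflow.data and mlflow.log_input are available); the data path, dvc:// source string, and hyperparameters are illustrative:

```python
# A minimal sketch tying all three records to one MLflow run. The data path,
# dvc:// source string, and hyperparameters are hypothetical.
import mlflow
import pandas as pd

df = pd.read_csv("data/transactions.csv")  # assumed to be under DVC control
dataset = mlflow.data.from_pandas(df, source="dvc://data/transactions.csv")

with mlflow.start_run():
    mlflow.log_params({"n_estimators": 300, "max_depth": 8, "random_state": 42})
    mlflow.log_artifact("requirements.lock.txt")   # frozen file from the earlier sketch
    mlflow.log_input(dataset, context="training")  # links the run to the data version
```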

Anti-pattern

Don’t: Version the model artifact alone and call it reproducibility. A pickled model without its training data hash, hyperparameter record, and pinned library versions is a binary blob you can deploy but can’t explain, audit, or improve. A drug with no formula sheet.

Do: Link every model artifact to its exact code commit, data snapshot hash, and configuration record. If any of the three is missing, the model isn’t reproducible. Treat it like an unsigned binary.

With reproducibility infrastructure in place, the real prize becomes reachable: a full ML pipeline connecting data ingestion through experimentation, deployment safety, and production monitoring into a continuous loop. The manufacturing line from raw ingredients to certified drug.

[Figure: ML pipeline with six gated stages. Data ingestion with schema and freshness checks. Feature engineering with skew detection. Training with hyperparameter and experiment tracking. An evaluation gate checking accuracy and fairness that blocks regressions. Shadow deployment on real traffic with no user impact. Canary rollout at 5% with auto-rollback. No model reaches production without passing every gate.]

Feature Management and Training-Serving Skew

Most production ML systems fall apart here quietly. Training-serving skew is when the same feature gets computed differently for training and for inference. The lab uses one measurement technique. The factory uses another. Same ingredient name. Different substance. The training pipeline uses a batch job that computes 30-day rolling average purchase frequency on historical data. The inference pipeline computes the same feature on live data with a slightly different window or different null handling. The model receives features during inference that look subtly different from training, and performance degrades in ways that are brutally hard to diagnose because nothing looks broken.
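
A contrived sketch of how this happens; both functions claim to compute the same 30-day feature, and every name here is hypothetical:

```python
# A contrived sketch of training-serving skew: two "identical" features
# that disagree on null handling. All names are hypothetical.
import pandas as pd

def training_feature(purchases: pd.Series) -> float:
    # Batch job: missing days are dropped before averaging
    return purchases.dropna().tail(30).mean()

def serving_feature(purchases: pd.Series) -> float:
    # Online service: missing days are treated as zero purchases
    return purchases.fillna(0).tail(30).mean()

days = pd.Series([2.0, None, 3.0, None, 1.0])
print(training_feature(days))  # 2.0
print(serving_feature(days))   # 1.2 -- same feature name, different value
```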

Feature stores solve this by centralizing feature computation and making sure the same logic runs for both training (reading from the offline store) and inference (reading from the online store). Same formula for the lab and the factory. Make this investment before scale forces the issue, not after three months of debugging silent performance degradation. This is a core component of MLOps model lifecycle automation. For a deeper dive into when and how to adopt feature stores, see the dedicated guide on ML feature stores and training-serving skew.
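
A sketch of what that looks like in practice, assuming Feast as the feature store; the feature and entity names are hypothetical:

```python
# A sketch assuming Feast as the feature store; feature and entity names
# are hypothetical. One definition backs both offline and online reads.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training: point-in-time correct join from the offline store
entity_df = pd.DataFrame({
    "user_id": [123, 456],
    "event_timestamp": pd.to_datetime(["2024-03-01", "2024-03-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["purchases:rolling_30d_avg"],
).to_df()

# Serving: the same feature definition, read from the online store
online = store.get_online_features(
    features=["purchases:rolling_30d_avg"],
    entity_rows=[{"user_id": 123}],
).to_dict()
```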

Deployment Safety for ML Models

Model deployment carries risks that differ in kind from application deployment. A bad application deploy crashes, throws an exception, returns a 500. You know it’s broken. A bad model fails quietly, producing predictions that are wrong but plausible. No exception. No error in the logs. The drug that stopped working doesn’t announce it. The patients just stop getting better. The fraud model dropping from 94% to 71% precision generates no alerts by default. It just quietly costs you while looking perfectly healthy on every operational dashboard.

[Figure: Model accuracy drifting from 94% to 71% over six months (94%, 91%, 87%, 82%, 76%, 71%). A dashed threshold at 80% marks minimum acceptable accuracy; when the line crosses below it, a drift alert fires and retraining is triggered.]
| ML Deployment Strategy | Risk Level | When to Use | Catches |
|---|---|---|---|
| Shadow deployment | Zero (no user impact) | Every model change | Quality regressions, latency issues |
| Champion-challenger | Low (small traffic split) | After shadow validates | Business metric impact at scale |
| Gradual rollout | Medium (increasing exposure) | After challenger wins | Long-tail edge cases |
| Feature flag gated | Low (instant toggle) | Any model serving change | Enables instant rollback |

Shadow deployment is the ML equivalent of a clinical trial. The new drug given alongside the proven one. Predictions logged. Results compared. Zero patient risk. Every team should use it. The new model processes real production inputs but doesn’t serve its predictions to users. Predictions get logged and compared against the current production model. Systematic differences surface before any user impact.
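
A minimal in-process sketch of that routing; the model objects and their predict() interface are assumptions of this sketch:

```python
# A minimal in-process sketch of shadow serving; the model objects and
# their predict() interface are hypothetical.
import logging
import uuid

logger = logging.getLogger("predictions")

def handle_request(features: dict, champion, shadow) -> dict:
    request_id = str(uuid.uuid4())
    served = champion.predict(features)  # only this prediction reaches the user
    try:
        shadowed = shadow.predict(features)  # logged for comparison, never served
    except Exception:
        shadowed = None  # a shadow failure must never affect the user path
    logger.info("request_id=%s champion=%s shadow=%s", request_id, served, shadowed)
    return {"request_id": request_id, "prediction": served}
```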

Prerequisites
  1. Model serving infrastructure supports routing production inputs to two models at the same time
  2. Prediction logging captures both champion and challenger outputs with matching request IDs
  3. Statistical comparison pipeline can detect meaningful differences within 1-2 weeks of traffic
  4. Rollback procedure tested: production traffic can revert to champion-only within minutes
  5. Business metric tracking (conversion rate, fraud detection rate) set up for cohort comparison

Champion-challenger testing goes further. A small share of production traffic gets served predictions from the challenger model. The clinical trial where some patients get the new drug. Business metrics, not just ML metrics, are compared between cohorts: conversion rate, fraud detection rate, click-through rate. The challenger is promoted only when it demonstrably outperforms on metrics the business actually cares about. Not lab results. Real-world outcomes.
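
A minimal sketch of the traffic split, assuming a stable user_id to hash on; the 5% share mirrors the example above:

```python
# A minimal sketch of deterministic traffic splitting for champion-challenger;
# hashing on user_id is an assumption, and the 5% share is illustrative.
import hashlib

def assign_cohort(user_id: str, challenger_share: float = 0.05) -> str:
    # Stable bucket in [0, 10000) so the same user always hits the same model,
    # which keeps cohort business metrics comparable.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "challenger" if bucket < challenger_share * 10_000 else "champion"
```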

[Figure: Three-phase model promotion. Phase 1, shadow deploy: 100% of traffic to the champion, the new model logs only, zero user impact, one to two weeks. Phase 2, champion-challenger: 5-10% of traffic to the challenger, business metrics compared with a statistical significance test, two to four weeks. Phase 3, promoted: 100% of traffic shifted, the old model kept for rollback, drift monitoring active. A bad model fails quietly; these gates catch it before users do.]

MLOps Maturity Levels: Where Most Teams Actually Are

Level 0 (Ad-Hoc): Pickled models in S3. No versioning. Manual deployment. No monitoring. The drug formula on a whiteboard. Most teams start here.

Level 1 (Tracked): MLflow experiment tracking. DVC for data versioning. Manual deployment pipeline. Basic accuracy monitoring. The first meaningful step. Lab notebooks that are actually legible.

Level 2 (Automated): CI/CD for model deployment. Shadow and canary testing. Feature store integration. Drift detection alerts. The manufacturing line. The point where ML engineering becomes sustainable.

Level 3 (Autonomous): Automated retraining triggers. Champion-challenger testing as a pipeline stage. Full audit trail. Self-healing pipelines that detect drift and retrain without human intervention. The factory that reformulates when the disease mutates. Few teams reach this level, and most don’t need to.

The Notebook-to-Production Gap

The distance between a model that works in a Jupyter notebook and one that works in production. The gap between “it works in the lab” and “it works in the factory.” It includes containerization, API serving, monitoring, canary deployment, rollback procedures, and compliance documentation. Most teams badly underestimate this distance when they say “the model is ready.” The formula is ready. The manufacturing line is not.

What the Industry Gets Wrong About MLOps

“Data scientists should own model deployment.” Data scientists should own model quality. Platform engineers should own deployment infrastructure. Asking a data scientist to configure Kubernetes manifests, CI/CD pipelines, and monitoring alerts is asking the chemist to build the factory. Wrong skills, wrong incentives, wrong outcomes.

“Version the model artifact and you have reproducibility.” Reproducibility requires versioning the model, the training data, the feature pipeline, the hyperparameters, the random seed, and the library versions. All linked together. A model artifact without its full lineage is a binary blob you can deploy but can’t explain. A drug with no formula sheet.

“Offline evaluation is enough.” Most teams find a noticeable gap between offline test metrics and real production performance. The held-out test set was drawn from the same distribution as training data. Production data drifts. The lab results don’t match the clinical trial. Shadow deployment on real traffic is the only reliable measure of how a model will actually perform.

Our take

Shadow deployment is mandatory for every model change. The clinical trial. Route production traffic to the new model, compare predictions against the champion, promote only when metrics improve. Teams that skip shadow mode and deploy directly to production discover regressions from their users, not their monitoring. The infrastructure cost of shadow deployment is trivial compared to the cost of shipping a degraded model. Run the trial. Every time.

The fraud model that drifted to 71%? Shadow mode would have caught it within days of the distribution shift. The clinical trial that would have shown the drug stopped working. Instead, it ran unchecked for months. The AI/ML engineering discipline behind reliable production models comes down to treating model deployment as a software deployment problem requiring systematic validation. “It worked on the test set” is a hypothesis, not evidence. “It worked in the lab” is the start, not the finish. Monitoring, not talent, makes the difference between a model that works and one that merely worked once.

For financial services, where drift has direct capital impact, financial AI data quality covers retraining loops and domain-specific thresholds. For the governance layer, responsible AI governance covers bias monitoring and audit trails.

Your Model Drifted and Nobody Noticed

Precision drops silently while every operational dashboard stays green. MLOps infrastructure with reproducible training, shadow deployments, and automated drift detection catches regression before your users do. Feature stores, champion-challenger testing, and versioned data pipelines close the notebook-to-production gap.

Frequently Asked Questions

What is model drift and how do you detect it before it causes business impact?

Model drift is performance degradation caused by real-world data shifting away from the training distribution. Most models experience measurable drift within 3-6 months of deployment. Detection requires monitoring input feature distributions with statistical tests like PSI (a score above 0.2 means significant drift) or the KS test (p-value < 0.05). Automated alerts should trigger a retraining evaluation when drift scores exceed thresholds on more than 10% of monitored features.
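
For the PSI check specifically, a minimal sketch assuming 1-D numeric feature arrays; the function is ours, not a library call:

```python
# A minimal PSI sketch, assuming 1-D numeric feature arrays; bin edges come
# from the training (expected) distribution, and 0.2 is the threshold above.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# psi(train_feature, prod_feature) > 0.2 would flag significant drift
```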

What does ML reproducibility actually require in practice?

Reproducibility requires versioning three things together: code (training script and model architecture via Git), data (exact training and validation datasets tracked by content hash, using DVC or Delta Lake snapshots), and configuration (hyperparameters, preprocessing steps, random seeds via MLflow or Weights and Biases). If any of these three change without tracking, you can’t reproduce the model and you can’t prove to a regulator what data was used.

What is shadow deployment for ML models?

Shadow deployment runs a new model candidate alongside the current production model. Production traffic is served normally by the production model. At the same time, the same inputs go to the shadow model and predictions are logged but not served to users. This lets you evaluate the new model’s behavior on real production traffic, including data distributions and edge cases your offline test set doesn’t capture, without any user-facing impact.

When should you retrain a model vs. roll it back?

Roll back when performance dropped suddenly due to a specific identifiable cause like a data pipeline bug or deployment error. Retrain when performance dropped gradually due to data drift and you have new labeled data available. The distinction matters: retraining on drifted data without understanding the cause may produce a model that improves on evaluation metrics but has learned the wrong patterns.

What is the difference between online and offline model evaluation?

Offline evaluation uses a held-out test set and runs in minutes, but can miss distribution shifts between test data and production. Online evaluation measures real production traffic using business metrics like conversion rate or fraud detection accuracy. Most teams find a noticeable gap between offline test metrics and real production performance. Online evaluation needs shadow deployment or A/B testing infrastructure, and typically takes 2-4 weeks of data to reach statistical significance.