MLOps: From Notebook to Monitored Production
The data science team spent three months training a fraud detection model that achieved 94% precision on the test set. They pickled it, uploaded it to an S3 bucket, and asked DevOps to “just deploy it.” Six months later, the model’s precision had drifted to 71%, well below the threshold where it generated value. Nobody noticed because nobody was monitoring it.
The drug worked in the lab. The factory shipped it. Six months later, the disease mutated and the drug stopped working. Nobody checked.
When the team tried to reproduce the original results for a compliance audit, they couldn’t. The notebook had been modified since training. The training data path pointed to a directory that had been reorganized. The random seed was never set. The exact version of scikit-learn was not recorded. The lab notebook is illegible. The ingredients were moved. The recipe was never written down.
If this sounds familiar, you’re in good company. MLflow and DVC exist specifically because almost every ML team has lived some version of this story.
- Reproducibility is non-negotiable. If you can’t recreate the exact model that’s in production (same data, same code, same result), you can’t debug it, audit it, or improve it. If you can’t reproduce the drug, you can’t certify it.
- Model monitoring catches drift before business impact. Precision dropped from 94% to 71% over six months with zero alerts. Production model metrics need the same observability as application metrics.
- Feature pipelines need versioned, content-addressable datasets. “Training data” pointing to a directory that got reorganized is how compliance audits fail. Ingredients with no batch number.
- Canary deployments work for models too. Route 5% of traffic to the new model. Compare predictions against the champion. Promote only when metrics improve. Clinical trials before the drug ships.
- The notebook-to-production gap is where most ML initiatives stall. MLOps bridges it by applying software engineering discipline to the parts of the lifecycle that production demands.
MLOps applies engineering discipline to the ML lifecycle, giving model deployment the same rigor as application deployment.
Reproducibility: The Foundation Everything Else Depends On
Try this right now. Pick your best production model. Can you reproduce the exact training run that created it? Same data, same code, same hyperparameters, same output within floating-point tolerance.
Most teams cannot. The typical situation: training scripts load data from a path that no longer exists. Preprocessing was done interactively in a notebook that’s been modified since. Random seeds were never set. Library versions were never pinned. The lab burned down and the formula was on a whiteboard. For a prototype, none of this matters. All of it matters when you need to debug a production performance regression. Or investigate unexpected model behavior. Or prove to a regulator exactly what data trained a model making decisions about loans, insurance, or healthcare.
Reproducibility requires versioning three things together. Miss any one and the guarantee collapses. The formula. The ingredients. The process. All three or it’s not the same drug.
Code. The training script, model architecture definition, and preprocessing logic. Git handles this, but pin your requirements.txt or poetry.lock to exact versions. Don’t use >= specifiers for ML libraries. A minor version bump in scikit-learn or PyTorch can change model outputs. You’ll spend a week tracking down why predictions suddenly differ, and the answer will be a patch version of numpy you didn’t pin. (A slightly different ingredient supplier. Same name on the label. Different chemical composition.)
Data. The exact training and validation datasets at the exact version used for training. Data versioning tools like DVC or Delta Lake snapshots provide content-addressable storage: each dataset version gets a hash, and the training run references that hash. The ingredient batch number. Solid data engineering pipelines provide the infrastructure to make this tractable at scale.
Configuration. Hyperparameters, preprocessing steps, random seeds, framework versions. Experiment tracking tools like MLflow or W&B capture this alongside the resulting model artifact. The manufacturing process. Temperature, pressure, duration. The commands every team should be running: mlflow.log_params() with every hyperparameter, mlflow.log_artifact() with the frozen requirements file, and mlflow.log_input() with the DVC data reference.
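A minimal sketch of that logging pattern, assuming a hypothetical fraud training script: the data path, label column, and hyperparameters below are placeholders, and a plain SHA-256 of the training file stands in for the DVC content hash to keep the example self-contained.

```python
import hashlib
import subprocess

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical paths and hyperparameters -- substitute your own.
DATA_PATH = "data/transactions.parquet"
PARAMS = {"n_estimators": 200, "max_depth": 12, "random_state": 42}


def file_sha256(path: str) -> str:
    """Content hash of the training data, standing in for the DVC snapshot reference."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


with mlflow.start_run(run_name="fraud-model-training"):
    # Configuration: every hyperparameter, including the random seed.
    mlflow.log_params(PARAMS)

    # Code/environment: freeze exact library versions and attach the file.
    with open("requirements.lock", "w") as f:
        subprocess.run(["pip", "freeze"], stdout=f, check=True)
    mlflow.log_artifact("requirements.lock")

    # Data: record the exact dataset snapshot this run consumed.
    mlflow.log_param("training_data_sha256", file_sha256(DATA_PATH))

    df = pd.read_parquet(DATA_PATH)
    X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
    model = RandomForestClassifier(**PARAMS).fit(X, y)

    # Model artifact, now linked to its code, data, and configuration.
    mlflow.sklearn.log_model(model, "model")
```

Any run logged this way can later be pulled up with its seed, frozen environment, and data hash side by side, which is exactly what the compliance audit in the opening story needed and didn’t have.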
Don’t: Version the model artifact alone and call it reproducibility. A pickled model without its training data hash, hyperparameter record, and pinned library versions is a binary blob you can deploy but can’t explain, audit, or improve. A drug with no formula sheet.
Do: Link every model artifact to its exact code commit, data snapshot hash, and configuration record. If any of the three is missing, the model isn’t reproducible. Treat it like an unsigned binary.
With reproducibility infrastructure in place, the real prize becomes reachable: a full ML pipeline connecting data ingestion through experimentation, deployment safety, and production monitoring into a continuous loop. The manufacturing line from raw ingredients to certified drug.
Feature Management and Training-Serving Skew
Most production ML systems quietly fall apart here. Training-serving skew occurs when the same feature is computed differently for training and for inference. The lab uses one measurement technique. The factory uses another. Same ingredient name. Different substance. The training pipeline uses a batch job that computes a 30-day rolling average purchase frequency on historical data. The inference pipeline computes the same feature on live data with a slightly different window or different null handling. The model receives features during inference that look subtly different from training, and performance degrades in ways that are brutally hard to diagnose because nothing looks broken.
Feature stores solve this by centralizing feature computation and making sure the same logic runs for both training (reading from the offline store) and inference (reading from the online store). Same formula for the lab and the factory. Make this investment before scale forces the issue, not after three months of debugging silent performance degradation. This is a core component of MLOps model lifecycle automation. For a deeper dive into when and how to adopt feature stores, see the dedicated guide on ML feature stores and training-serving skew.
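The core mechanic is easy to sketch without committing to any particular feature store product: define the feature once, and import that same definition from both the batch training job and the online serving path. The function and column names below are hypothetical.

```python
import pandas as pd


def purchase_frequency_30d(purchases: pd.DataFrame) -> pd.Series:
    """30-day rolling purchase count per user -- the single definition both paths import.

    Expects one row per purchase with columns: user_id, purchase_ts.
    """
    df = purchases.copy()
    df["purchase_ts"] = pd.to_datetime(df["purchase_ts"])
    df = df.sort_values("purchase_ts").set_index("purchase_ts")
    return (
        df.groupby("user_id")["user_id"]
        .rolling("30D")          # window length lives in exactly one place
        .count()
        .rename("purchase_frequency_30d")
    )


# Offline: the training pipeline materializes the feature over historical data.
# train_features = purchase_frequency_30d(historical_purchases)

# Online: the serving path calls the same function on the user's recent events,
# so window length and null handling cannot silently diverge.
# live_value = purchase_frequency_30d(recent_purchases_for_user).iloc[-1]
```

A feature store adds the storage and serving machinery around this, but the discipline of one owned definition per feature is what actually removes the skew.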
Deployment Safety for ML Models
Model deployment carries risks that differ in kind from application deployment. A bad application deploy crashes, throws an exception, returns a 500. You know it’s broken. A bad model fails quietly, producing predictions that are wrong but plausible. No exception. No error in the logs. The drug that stopped working doesn’t announce it. The patients just stop getting better. The fraud model dropping from 94% to 71% precision generates no alerts by default. It just quietly costs you while looking perfectly healthy on every operational dashboard.
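Making that failure loud does not require exotic tooling. A sketch, assuming confirmed fraud labels arrive with a delay and get joined back onto the prediction log; the column names and precision floor are placeholders.

```python
import logging

import pandas as pd

log = logging.getLogger("model_monitoring")

PRECISION_FLOOR = 0.85  # hypothetical: below this, the fraud model stops paying for itself


def check_recent_precision(scored: pd.DataFrame) -> float:
    """scored: the last N days of predictions joined with their (delayed) fraud labels.

    Expects boolean columns `flagged` (model said fraud) and `confirmed_fraud` (ground truth).
    """
    flagged = scored[scored["flagged"]]
    if flagged.empty:
        return float("nan")
    precision = flagged["confirmed_fraud"].mean()
    if precision < PRECISION_FLOOR:
        # Route this through the same alerting channel as application metrics.
        log.error(
            "fraud model precision %.1f%% below floor %.0f%%",
            100 * precision, 100 * PRECISION_FLOOR,
        )
    return precision
```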
| ML Deployment Strategy | Risk Level | When to Use | Catches |
|---|---|---|---|
| Shadow deployment | Zero (no user impact) | Every model change | Quality regression, latency issues |
| Champion-challenger | Low (small traffic split) | After shadow validates | Business metric impact at scale |
| Gradual rollout | Medium (increasing exposure) | After challenger wins | Long-tail edge cases |
| Feature flag gated | Low (instant toggle) | Any model serving change | Nothing by itself; enables instant rollback |
Shadow deployment is the ML equivalent of a clinical trial. The new drug given alongside the proven one. Predictions logged. Results compared. Zero patient risk. Every team should use it. The new model processes real production inputs but doesn’t serve its predictions to users. Predictions get logged and compared against the current production model. Systematic differences surface before any user impact. Shadow mode needs a few things in place first; a minimal routing sketch follows the checklist.
- Model serving infrastructure supports routing production inputs to two models at the same time
- Prediction logging captures both champion and challenger outputs with matching request IDs
- Statistical comparison pipeline can detect meaningful differences within 1-2 weeks of traffic
- Rollback procedure tested: production traffic can revert to champion-only within minutes
- Business metric tracking (conversion rate, fraud detection rate) set up for cohort comparison
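The routing itself is small. A sketch with hypothetical model handles passed in; in a real service the challenger call would typically run off the critical path so it cannot add latency.

```python
import json
import logging
import uuid
from typing import Protocol

log = logging.getLogger("shadow")


class Scorer(Protocol):
    def predict(self, features: dict) -> float: ...


def score_transaction(features: dict, champion: Scorer, challenger: Scorer) -> float:
    """Serve the champion's prediction; run the challenger in shadow on the same input."""
    request_id = str(uuid.uuid4())

    champion_score = champion.predict(features)
    challenger_score = challenger.predict(features)  # computed and logged, never served

    # Matching request IDs let the comparison pipeline diff the two models row by row.
    for name, score in (("champion", champion_score), ("challenger", challenger_score)):
        log.info(json.dumps({"request_id": request_id, "model": name, "score": score}))

    return champion_score  # only the champion's output reaches the user
```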
Champion-challenger testing goes further. A small share of production traffic gets served predictions from the challenger model. The clinical trial where some patients get the new drug. Business metrics, not just ML metrics, are compared between cohorts: conversion rate, fraud detection rate, click-through rate. The challenger is promoted only when it demonstrably outperforms on metrics the business actually cares about. Not lab results. Real-world outcomes.
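Cohort assignment is the only new moving part, and it should be deterministic: the same user always sees the same model, and every prediction and downstream business event carries the cohort label so metrics can be compared per cohort. A sketch with a hypothetical 5% split.

```python
import hashlib

CHALLENGER_TRAFFIC_PCT = 5  # hypothetical starting split


def assign_cohort(user_id: str) -> str:
    """Deterministically bucket a user into the champion or challenger cohort."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < CHALLENGER_TRAFFIC_PCT else "champion"
```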
MLOps Maturity Levels: Where Most Teams Actually Are
Level 0 (Ad-Hoc): Pickled models in S3. No versioning. Manual deployment. No monitoring. The drug formula on a whiteboard. Most teams start here.
Level 1 (Tracked): MLflow experiment tracking. DVC for data versioning. Manual deployment pipeline. Basic accuracy monitoring. The first meaningful step. Lab notebooks that are actually legible.
Level 2 (Automated): CI/CD for model deployment. Shadow and canary testing. Feature store integration. Drift detection alerts. The manufacturing line. The point where ML engineering becomes sustainable.
Level 3 (Autonomous): Automated retraining triggers. Champion-challenger testing as a pipeline stage. Full audit trail. Self-healing pipelines that detect drift and retrain without human intervention. The factory that reformulates when the disease mutates. Few teams reach this level, and most don’t need to.
What the Industry Gets Wrong About MLOps
“Data scientists should own model deployment.” Data scientists should own model quality. Platform engineers should own deployment infrastructure. Asking a data scientist to configure Kubernetes manifests, CI/CD pipelines, and monitoring alerts is asking the chemist to build the factory. Wrong skills, wrong incentives, wrong outcomes.
“Version the model artifact and you have reproducibility.” Reproducibility requires versioning the model, the training data, the feature pipeline, the hyperparameters, the random seed, and the library versions. All linked together. A model artifact without its full lineage is a binary blob you can deploy but can’t explain. A drug with no formula sheet.
“Offline evaluation is enough.” Most teams find a noticeable gap between offline test metrics and real production performance. The held-out test set was drawn from the same distribution as training data. Production data drifts. The lab results don’t match the clinical trial. Shadow deployment on real traffic is the only reliable measure of how a model will actually perform.
The fraud model that drifted to 71%? Shadow mode would have caught it within days of the distribution shift. The clinical trial that would have shown the drug stopped working. Instead, it ran unchecked for months. The AI/ML engineering discipline behind reliable production models comes down to treating model deployment as a software deployment problem requiring systematic validation. “It worked on the test set” is a hypothesis, not evidence. “It worked in the lab” is the start, not the finish. Monitoring, not talent, makes the difference between a model that works and one that merely worked once.
For financial services, where drift has direct capital impact, the guide on financial AI data quality covers retraining loops and domain-specific thresholds. For the governance layer, the guide on responsible AI governance covers bias monitoring and audit trails.