MLOps Step 3: ML Model Training
Training a model is not a one-time event — it's a repeatable process that happens every time your data changes, your code improves, or you want to try a different approach. This guide focuses on making training reproducible, tracked, and deployable to Kubernetes at scale.
We'll train a LogisticRegression model to predict employee attrition — whether an employee is likely to leave the company. The training script uses a scikit-learn Pipeline that scales numeric features and feeds them into a logistic regression classifier, and tracks the run with MLflow.
The Training Pipeline
Every ML training run follows the same pattern: load data, define model, train, evaluate, save. The employee attrition model uses a scikit-learn Pipeline to chain a StandardScaler with a LogisticRegression classifier. Using a Pipeline means preprocessing and model are serialized together — when you load the model at serving time, you don't need to re-implement the scaling logic separately.
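The load, define, train, evaluate pattern can be sketched as follows. This is a minimal illustration, not the guide's actual training script: the data is synthetic (the real attrition dataset and its column names are not shown here), and the step names "scaler" and "clf" are arbitrary labels chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attrition data (assumption: the real
# features and labels come from the project's own dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))  # four made-up numeric features
noise = rng.normal(scale=0.5, size=500)
y = (X[:, 0] - 0.5 * X[:, 1] + noise > 0).astype(int)  # 1 = leaves

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler and the classifier are fitted, and later serialized,
# as a single object.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Evaluate on held-out data using the predicted probability of class 1.
test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"test ROC AUC: {test_auc:.3f}")
```

Because the scaler is fitted inside the Pipeline, calling pipeline.predict on raw feature values applies the same scaling that was learned during training.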
MLflow Experiment Tracking
MLflow is a widely used open-source platform for tracking ML experiments. It records each run's parameters and metrics in a backend store you can query, and keeps the run's artifacts alongside them, so you can compare runs and trace any result back to the configuration that produced it.
MLflow Components
- Tracking Server: Stores experiment metadata (params, metrics). Backed by PostgreSQL in production.
- Artifact Store: Stores large files (model binaries, plots). Backed by S3.
- Model Registry: Versioned model catalog with stage transitions (Staging → Production).
- UI: Web interface for comparing runs, viewing metrics, and downloading artifacts.
Hyperparameter Tuning with Optuna
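Manual hyperparameter search doesn't scale. Optuna is a hyperparameter optimization framework whose default sampler, the Tree-structured Parzen Estimator (TPE), is a Bayesian optimization method: it uses the results of earlier trials to decide which values to try next, so it typically finds good parameters in far fewer trials than an exhaustive grid search.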
Model Evaluation and Promotion
Before promoting a model to production, you need an automated evaluation gate that checks it meets minimum performance thresholds — and that it's better than the current production model.
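One possible shape for such a gate is a small function that takes the candidate's and the production model's metrics and returns a decision with a reason. The function name should_promote, the "roc_auc" key, and the 0.80 floor are hypothetical choices for this sketch. A real gate would read these metrics from wherever the project stores them.

```python
def should_promote(candidate, production, min_auc=0.80, min_gain=0.0):
    """Return (decision, reason) for a candidate model's metrics.

    candidate and production are dicts with a "roc_auc" entry.
    production may be None when no model has been deployed yet.
    """
    cand_auc = candidate["roc_auc"]

    # Check 1: absolute floor, independent of any other model.
    if cand_auc < min_auc:
        return False, f"candidate AUC {cand_auc:.3f} is below the floor {min_auc:.3f}"

    # With nothing in production, passing the floor is enough.
    if production is None:
        return True, "no production model to compare against"

    # Check 2: the candidate must beat production by more than min_gain.
    prod_auc = production["roc_auc"]
    if cand_auc - prod_auc <= min_gain:
        return False, f"candidate AUC {cand_auc:.3f} does not beat production {prod_auc:.3f}"

    return True, f"candidate AUC {cand_auc:.3f} beats production {prod_auc:.3f}"


decision, reason = should_promote({"roc_auc": 0.88}, {"roc_auc": 0.86})
print(decision, reason)
```

Returning the reason alongside the decision makes a rejected promotion easier to diagnose from pipeline logs.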
Running Training on Kubernetes
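For models or datasets too large to train comfortably on a local workstation, you can run training as a Kubernetes Job. A Job can be scheduled onto GPU nodes or large-memory instances, it runs to completion rather than staying up like a service, and finished Jobs can be cleaned up automatically by setting ttlSecondsAfterFinished.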
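To make this concrete, the sketch below builds a Job manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The Job name, image, command, and resource figures are hypothetical. The nvidia.com/gpu resource name assumes the cluster runs the NVIDIA device plugin.

```python
import json


def training_job_manifest(name, image, gpus=0, memory="8Gi", ttl_seconds=3600):
    """Build a batch/v1 Job manifest for a single training container."""
    limits = {"memory": memory}
    if gpus:
        # Assumption: GPUs are exposed through the NVIDIA device plugin.
        limits["nvidia.com/gpu"] = gpus

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "ttlSecondsAfterFinished": ttl_seconds,  # automatic cleanup
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "train",
                        "image": image,
                        "command": ["python", "train.py"],  # hypothetical entrypoint
                        "resources": {
                            "limits": limits,
                            "requests": {"memory": memory},
                        },
                    }],
                }
            },
        },
    }


manifest = training_job_manifest(
    "attrition-train",
    "registry.example.com/attrition-train:latest",  # hypothetical image
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would submit the Job, assuming the image exists and the cluster has a node that satisfies the resource limits.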