MLOps Step 3: ML Model Training
Training a model is not a one-time event — it's a repeatable process that happens every time your data changes, your code improves, or you want to try a different approach. This guide focuses on making training reproducible, tracked, and runnable at scale on Kubernetes.
The Training Loop
Every ML training run follows the same pattern: load data, define model, train iteratively, evaluate, save. The specifics differ by model type, but the structure is universal.
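That five-step pattern can be sketched end to end in plain Python. This is a deliberately tiny example — synthetic data, a one-parameter linear model fit by gradient descent, and a hypothetical `model.pkl` output path — meant to show the structure, not a real training script:

```python
import pickle
import random

# 1. Load data — synthetic points from y = 2x plus a little noise
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in [i / 100 for i in range(100)]]

# 2. Define model — a single weight w, predicting y = w * x
w = 0.0

# 3. Train iteratively — gradient descent on mean squared error
lr = 0.5
for epoch in range(300):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

# 4. Evaluate — mean squared error on the data
mse = sum((w * x - y) ** 2 for x, y in data) / len(data)
print(f"w={w:.3f} mse={mse:.4f}")

# 5. Save — persist the trained parameter for later serving
with open("model.pkl", "wb") as f:
    pickle.dump({"w": w}, f)
```

Swap in your framework of choice for steps 2–3 (PyTorch, XGBoost, scikit-learn); the surrounding load/evaluate/save scaffolding stays the same.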
MLflow Experiment Tracking
MLflow is the de facto standard for tracking ML experiments. It stores every run's parameters, metrics, and artifacts in a queryable database, so you can compare runs and reproduce any result.
MLflow Components
- Tracking Server: Stores experiment metadata (params, metrics). Backed by PostgreSQL in production.
- Artifact Store: Stores large files (model binaries, plots). Backed by S3.
- Model Registry: Versioned model catalog with stage transitions (Staging → Production).
- UI: Web interface for comparing runs, viewing metrics, and downloading artifacts.
Hyperparameter Tuning with Optuna
Manual hyperparameter search doesn't scale. Optuna is a modern, efficient hyperparameter optimization framework whose default sampler (Tree-structured Parzen Estimator, a form of Bayesian optimization) uses the results of past trials to propose promising parameters, finding good configurations far faster than grid search.
Model Evaluation and Promotion
Before promoting a model to production, you need an automated evaluation gate that checks it meets minimum performance thresholds — and that it's better than the current production model.
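The gate itself can be a small pure function. A sketch, assuming you have already fetched metric dictionaries for both models (e.g. from MLflow) and that higher values are better for every tracked metric:

```python
def should_promote(candidate, production, thresholds):
    """Promote only if the candidate clears every absolute threshold
    AND beats the current production model on every tracked metric.

    candidate / production: dicts of metric name -> value (higher is better)
    thresholds: dict of metric name -> required minimum for the candidate
    """
    meets_minimums = all(
        candidate.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )
    beats_production = all(
        candidate.get(name, float("-inf")) > value
        for name, value in production.items()
    )
    return meets_minimums and beats_production

# Candidate clears the 0.90 accuracy floor and beats production on both metrics
print(should_promote(
    candidate={"accuracy": 0.93, "f1": 0.88},
    production={"accuracy": 0.91, "f1": 0.85},
    thresholds={"accuracy": 0.90},
))  # True
```

Running this check in CI, rather than by hand, is what turns promotion into an auditable decision instead of a judgment call.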
Running Training on Kubernetes
For long-running jobs, large models, or datasets that outgrow your laptop or CI runner, run training as a Kubernetes Job. This gives you access to GPU nodes, large-memory instances, and automatic cleanup when the job finishes.
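A minimal Job manifest might look like the following. The job name, image, and in-cluster MLflow URL are all hypothetical placeholders for your own values:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model            # illustrative name
spec:
  backoffLimit: 1
  ttlSecondsAfterFinished: 3600   # automatic cleanup after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:latest  # hypothetical image
          command: ["python", "train.py"]
          env:
            - name: MLFLOW_TRACKING_URI
              value: "http://mlflow.mlops.svc:5000"      # hypothetical in-cluster URL
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: 1   # schedules the pod onto a GPU node
```

Because the trainer reads its tracking URI from the environment, the same image runs unchanged on a laptop and in the cluster, with all runs landing in the same MLflow experiment.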