Foundation

DevOps to MLOps: What You Actually Need to Learn

● Beginner · ⏱ 35 min read

If you're a DevOps engineer being asked to support ML workloads, the good news is you already have 60–70% of the skills you need. The bad news is it's not always obvious which 30–40% you're missing — and that gap can cause expensive production incidents.

This guide gives you a clear map: what transfers directly, what needs to be reframed, and what is genuinely new territory.

The Mental Model Shift

In DevOps, you operate deterministic systems. Given the same code and the same inputs, a service behaves the same way every time. Infrastructure as Code means you can reproduce any environment exactly. Tests pass or fail with binary clarity.

In MLOps, you operate probabilistic systems. A model produces outputs with associated confidence scores, not binary right/wrong answers. "Is this deployment healthy?" becomes a statistical question: "Has the distribution of predictions shifted significantly from our baseline?"

This creates a fundamentally different operational posture:

  • Your alerting needs statistical thresholds, not just error rates
  • Your "tests" include data quality checks that can have gray areas
  • A model can be "working" (returning predictions, no crashes) but failing (returning bad predictions)
  • Rollbacks aren't just about code — you may need to roll back the model, the training data, or both
⚠️ Common Mistake: Many DevOps engineers new to MLOps set up monitoring that only checks if the inference server is running and returning 200 status codes. The model can be perfectly healthy by these metrics while silently returning garbage predictions. You must also monitor the predictions themselves.
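As a concrete sketch of what a "statistical threshold" looks like (versus a 200-status check), here is a minimal drift check using the Population Stability Index, written in plain NumPy. PSI is one common drift metric among several; the bin count and the alert thresholds in the comments are widely used rules of thumb, not values from this guide, and a production setup would typically lean on a library such as Evidently AI rather than hand-rolling this.

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two prediction-score samples.
    Common rule of thumb: < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 alert."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    # Clip both samples into the baseline's range so every score lands in a bin.
    b = np.histogram(np.clip(baseline, edges[0], edges[-1]), edges)[0] / len(baseline)
    c = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

# Simulated prediction scores: note the serving endpoint stays "up" either way.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.7, scale=0.1, size=5000)  # scores at deploy time
shifted = rng.normal(loc=0.5, scale=0.1, size=5000)   # scores after silent drift
drift_score = psi(baseline, shifted)
```

The key operational point: `drift_score` crosses the alert threshold even though every request returned a prediction and a 200 status, which is exactly the failure mode the health check above misses.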

MLOps Maturity Model

Google's MLOps maturity model defines three levels. Most teams start at Level 0 and should aim for Level 1 before worrying about Level 2.

| Level | Training | Deployment | Monitoring |
| --- | --- | --- | --- |
| Level 0 (Manual) | Manual, ad hoc, in notebooks | Script or manual deploy, no CI/CD | None or manual checks |
| Level 1 (ML Pipeline) | Automated training pipeline, experiment tracking | CI/CD for model, automated validation gate | Basic prediction monitoring, alerting |
| Level 2 (CI/CD Pipeline) | Automated retraining triggered by drift | Canary/shadow deployments, automated rollback | Full drift detection, automated retraining loop |

Most production ML teams operate at Level 1. Getting to Level 2 is a multi-team effort and requires significant investment. Don't try to skip to Level 2 from Level 0.

Tool Mapping: DevOps → MLOps

The MLOps ecosystem looks intimidating until you realize most tools solve familiar problems:

| Problem | DevOps Tool | MLOps Tool |
| --- | --- | --- |
| Source code version control | Git | Git (same) |
| Large binary artifact versioning | Artifactory / Nexus | DVC + S3 / MLflow Artifacts |
| Build & test automation | Jenkins / GitHub Actions | Kubeflow Pipelines / Argo Workflows |
| Artifact registry | Docker Registry / Nexus | MLflow Model Registry |
| Runtime environment | Docker | Docker (same) + KServe serving runtime |
| Infrastructure orchestration | Kubernetes | Kubernetes + GPU nodes + KServe |
| Monitoring & alerting | Prometheus + Grafana | Prometheus + Grafana + Evidently AI |
| Secrets management | Vault / K8s Secrets | Same + feature store credentials |
| Environment management | Terraform / Helm | Same + GPU node pools, quotas |

Skills You Already Have (and How They Map)

Container expertise

Everything in MLOps runs in containers. Training jobs, serving endpoints, data pipelines — all Docker. Your ability to write efficient Dockerfiles, understand layer caching, and debug container issues transfers 100%. The difference: ML containers are often much larger (CUDA libraries, model weights) and may need GPU access.

Kubernetes

K8s is the platform for production MLOps. You'll use namespaces, resource quotas, node affinity, and persistent volumes — all familiar. New additions: GPU resource requests (nvidia.com/gpu: 1), the KServe operator, and understanding how the NVIDIA device plugin works.
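For illustration, a GPU request in a pod spec might look like the following sketch (the pod name and image are placeholders, not from this guide). Note that extended resources like nvidia.com/gpu go under limits, since GPUs cannot be overcommitted, and the resource only becomes schedulable once the NVIDIA device plugin is running on the node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu            # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1      # advertised by the NVIDIA device plugin
```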

Pipeline thinking

Your CI/CD pipeline intuition directly applies. An ML pipeline is just: ingest → validate → transform → train → evaluate → package → deploy → monitor. Each stage is a job with inputs, outputs, and success criteria. The tooling is different (Airflow, Kubeflow) but the mental model is identical.
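The stage-as-job model above can be sketched in a few lines of plain Python. All of the stage functions here are hypothetical toys, not real pipeline code; in a real system each stage runs as a separate containerized job with its artifacts persisted between steps. The point is only the shape: explicit inputs, explicit outputs, and a gate that fails the pipeline rather than deploying a bad model.

```python
# Toy stages: each takes the previous stage's output and returns its own.
def ingest():
    return {"rows": [(0.2, 0), (0.9, 1), (0.4, 0)]}  # (score, label) pairs

def validate(data):
    assert data["rows"], "empty dataset"  # data quality gate
    return data

def transform(data):
    return {"X": [r[0] for r in data["rows"]],
            "y": [r[1] for r in data["rows"]]}

def train(features):
    # "Model" = classify above/below the mean score (a stand-in for training).
    return {"threshold": sum(features["X"]) / len(features["X"])}

def evaluate(model, features):
    preds = [x > model["threshold"] for x in features["X"]]
    acc = sum(p == bool(y) for p, y in zip(preds, features["y"])) / len(preds)
    if acc < 0.5:
        raise RuntimeError("validation gate failed")  # block the deploy stage
    return acc

# Run the pipeline: ingest -> validate -> transform -> train -> evaluate.
features = transform(validate(ingest()))
model = train(features)
accuracy = evaluate(model, features)
```

In Airflow or Kubeflow each of these functions becomes a task or component, and the success criterion (the `RuntimeError` here) becomes the automated validation gate from the Level 1 maturity row.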

Infrastructure as Code

Terraform and Helm apply directly. You'll additionally need to provision GPU node pools, configure the NVIDIA operator, and deploy KServe. These are all just Helm charts and Terraform modules.

Your 90-Day Learning Plan

💡 Focus on Depth Over Breadth: Don't try to learn every tool in the ecosystem. Pick one tool per category and get good at it: MLflow for experiment tracking, Airflow for pipelines, KServe for serving. Once you understand the pattern, switching to alternatives (Weights & Biases, Kubeflow, Seldon) is straightforward.

The remaining guides in this series walk through each phase of this plan in detail. The next guide dives into building your first dataset pipeline — the foundation of every MLOps system.