Kubernetes Built-in Features for AI/ML
Before reaching for specialized ML platforms like Kubeflow or Ray, it's worth deeply understanding the Kubernetes primitives that already solve many ML infrastructure problems. Node affinity, taints, Jobs, and resource quotas are available in every K8s cluster and together give you most of what you need for a production ML platform.
Node Affinity: Scheduling ML Pods on the Right Nodes
ML workloads are heterogeneous. Training jobs need GPUs and lots of RAM. Inference servers need fast CPUs and low latency. Preprocessing jobs just need general-purpose nodes. Node affinity lets you express these requirements declaratively.
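As a sketch, a training pod can require GPU nodes with a hard rule and express a soft preference on top (the label keys `node-type` and `gpu-interconnect` and their values are illustrative, not standard Kubernetes labels):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only schedule on nodes labeled for GPU training.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type              # illustrative label key
            operator: In
            values: ["gpu-training"]
      # Soft preference: favor NVLink-equipped nodes when available.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: gpu-interconnect       # illustrative label key
            operator: In
            values: ["nvlink"]
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

The `required...` rule is enforced at scheduling time; the `preferred...` rule only biases the scheduler's scoring, so the pod still schedules if no NVLink node is free.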
Setting Up Node Labels
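Affinity rules match on node labels, which you attach with `kubectl label`. A minimal sketch, assuming node names and a `node-type` label taxonomy of your own choosing:

```shell
# Label nodes by workload class (node names and label values are illustrative).
kubectl label node gpu-node-1 node-type=gpu-training
kubectl label node cpu-node-1 node-type=inference
kubectl label node cpu-node-2 node-type=general

# Verify: -L adds a column showing the label's value per node.
kubectl get nodes -L node-type
```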
Taints and Tolerations: Dedicated ML Nodes
Node affinity says "prefer/require this node type." But it doesn't prevent other pods from landing on your expensive GPU nodes. Taints + tolerations do: a taint on a node repels all pods that don't explicitly tolerate it.
Taint your GPU nodes with the NoSchedule effect so general workloads don't accidentally consume them. The cluster autoscaler will not scale down GPU nodes while pods that need them are running or pending; this is the correct behavior.
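A minimal sketch of the pattern, assuming an illustrative taint key `dedicated=gpu` (apply the taint with `kubectl taint node gpu-node-1 dedicated=gpu:NoSchedule`). The toleration lets the pod onto the tainted node; pair it with a nodeSelector or affinity so the pod actually targets that node rather than merely being allowed there:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  tolerations:
  - key: dedicated        # must match the taint key on the node
    operator: Equal
    value: gpu
    effect: NoSchedule
  nodeSelector:
    node-type: gpu-training   # illustrative label from earlier
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```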
Persistent Volumes for Datasets
Training jobs need to read large datasets from shared storage. Ephemeral container storage doesn't work — you need Persistent Volumes that survive pod restarts and can be mounted by multiple pods simultaneously (ReadWriteMany).
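A sketch of a ReadWriteMany claim and a pod that mounts it read-only. This assumes a storage class backed by a shared filesystem (NFS, CephFS, or a cloud file service); the class name `shared-nfs` is illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-datasets
spec:
  accessModes: ["ReadWriteMany"]   # multiple pods may mount simultaneously
  storageClassName: shared-nfs     # illustrative; depends on your cluster
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    volumeMounts:
    - name: datasets
      mountPath: /data
      readOnly: true      # training only reads; avoids accidental writes
  volumes:
  - name: datasets
    persistentVolumeClaim:
      claimName: training-datasets
```

Mounting read-only also lets many concurrent training pods share one dataset volume safely.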
Kubernetes Jobs for ML Training
Kubernetes Jobs are purpose-built for batch workloads like ML training — they run to completion and report success or failure. Key features for ML:
- completions: How many successful pod completions the Job requires (e.g., one per configuration in a hyperparameter search across N configurations)
- parallelism: How many pods may run concurrently
- backoffLimit: How many times to retry failed pods before marking the Job failed
- ttlSecondsAfterFinished: Automatically delete the Job and its pods this many seconds after it finishes
- activeDeadlineSeconds: Hard timeout that prevents runaway training jobs from burning GPU budget
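The fields above combine into a single manifest; a sketch of a small hyperparameter search, with illustrative values and a placeholder image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hparam-search
spec:
  completions: 8                 # 8 successful trials required
  parallelism: 2                 # at most 2 trial pods at a time
  backoffLimit: 3                # retry failed pods up to 3 times
  activeDeadlineSeconds: 14400   # kill the whole Job after 4 hours
  ttlSecondsAfterFinished: 3600  # garbage-collect 1 hour after finishing
  template:
    spec:
      restartPolicy: Never       # let the Job controller handle retries
      containers:
      - name: trial
        image: my-registry/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```

For a search where each trial must know which configuration it runs, setting `completionMode: Indexed` gives each pod a distinct completion index via the JOB_COMPLETION_INDEX environment variable.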
Resource Quotas and LimitRanges
Without quotas, a single training job can consume all cluster resources and starve other workloads. Use ResourceQuota to cap namespace-level consumption and LimitRange to set per-pod defaults.
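A sketch of both objects for a training namespace; the namespace name and all numbers are illustrative, and the GPU quota assumes the NVIDIA device plugin's `nvidia.com/gpu` extended resource:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-training         # illustrative namespace
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8" # cap total GPUs requested in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-team-defaults
  namespace: ml-training
spec:
  limits:
  - type: Container
    default:              # applied as limits when a container sets none
      cpu: "2"
      memory: 4Gi
    defaultRequest:       # applied as requests when a container sets none
      cpu: "1"
      memory: 2Gi
```

Note that once a ResourceQuota covers a resource, every pod in the namespace must set requests for it; the LimitRange defaults keep pods that omit them from being rejected.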