Kubernetes Built-in Features for AI/ML
Before reaching for specialized ML platforms like Kubeflow or Ray, it's worth deeply understanding the Kubernetes primitives that already solve many ML infrastructure problems. Node affinity, taints, Jobs, and resource quotas are available in every K8s cluster and together give you most of what you need for a production ML platform.
Node Affinity: Scheduling ML Pods on the Right Nodes
ML workloads are heterogeneous. Training jobs need GPUs and lots of RAM. Inference servers need fast CPUs and low latency. Preprocessing jobs just need general-purpose nodes. Node affinity lets you express these requirements declaratively.
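As a sketch, a training pod can require GPU nodes with a hard rule and express a soft preference on top (the label keys `node-type` and `gpu-interconnect` and their values are illustrative, not standard Kubernetes labels):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only schedule on nodes labeled for GPU training.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type              # illustrative label key
            operator: In
            values: ["gpu-training"]
      # Soft preference: favor NVLink-equipped nodes when available.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: gpu-interconnect       # illustrative label key
            operator: In
            values: ["nvlink"]
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

The `required...` rule is enforced at scheduling time; the `preferred...` rule only biases the scheduler's scoring, so the pod still schedules if no NVLink node is free.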
Setting Up Node Labels
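Affinity rules match on node labels, which you attach with `kubectl label`. A minimal sketch, assuming node names and a `node-type` label taxonomy of your own choosing:

```shell
# Label nodes by workload class (node names and label values are illustrative).
kubectl label node gpu-node-1 node-type=gpu-training
kubectl label node cpu-node-1 node-type=inference
kubectl label node cpu-node-2 node-type=general

# Verify: -L adds a column showing the label's value per node.
kubectl get nodes -L node-type
```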
Taints and Tolerations: Dedicated ML Nodes
Node affinity says "prefer/require this node type." But it doesn't prevent other pods from landing on your expensive GPU nodes. Taints + tolerations do: a taint on a node repels all pods that don't explicitly tolerate it.
Taint your GPU nodes with the NoSchedule effect so general workloads don't accidentally consume them. The cluster autoscaler will not scale down GPU nodes while pods that need them are running or pending; this is the correct behavior.
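A minimal sketch of the pattern, assuming an illustrative taint key `dedicated=gpu` (apply the taint with `kubectl taint node gpu-node-1 dedicated=gpu:NoSchedule`). The toleration lets the pod onto the tainted node; pair it with a nodeSelector or affinity so the pod actually targets that node rather than merely being allowed there:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  tolerations:
  - key: dedicated        # must match the taint key on the node
    operator: Equal
    value: gpu
    effect: NoSchedule
  nodeSelector:
    node-type: gpu-training   # illustrative label from earlier
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```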
Persistent Volumes for Datasets
Training jobs need to read large datasets from shared storage. Ephemeral container storage doesn't work — you need Persistent Volumes that survive pod restarts and can be mounted by multiple pods simultaneously (ReadWriteMany).
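A sketch of a ReadWriteMany claim and a pod that mounts it read-only. This assumes a storage class backed by a shared filesystem (NFS, CephFS, or a cloud file service); the class name `shared-nfs` is illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-datasets
spec:
  accessModes: ["ReadWriteMany"]   # multiple pods may mount simultaneously
  storageClassName: shared-nfs     # illustrative; depends on your cluster
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    volumeMounts:
    - name: datasets
      mountPath: /data
      readOnly: true      # training only reads; avoids accidental writes
  volumes:
  - name: datasets
    persistentVolumeClaim:
      claimName: training-datasets
```

Mounting read-only also lets many concurrent training pods share one dataset volume safely.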
Kubernetes Jobs for ML Training
Kubernetes Jobs are purpose-built for batch workloads like ML training — they run to completion and report success or failure. Key features for ML:
- completions: How many successful pod completions the Job requires (e.g., one per configuration in a hyperparameter search across N configurations)
- parallelism: How many pods may run concurrently
- backoffLimit: How many times to retry failed pods before marking the Job failed
- ttlSecondsAfterFinished: Automatically delete the Job and its pods this many seconds after it finishes
- activeDeadlineSeconds: Hard timeout that prevents runaway training jobs from burning GPU budget
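The fields above combine into a single manifest; a sketch of a small hyperparameter search, with illustrative values and a placeholder image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hparam-search
spec:
  completions: 8                 # 8 successful trials required
  parallelism: 2                 # at most 2 trial pods at a time
  backoffLimit: 3                # retry failed pods up to 3 times
  activeDeadlineSeconds: 14400   # kill the whole Job after 4 hours
  ttlSecondsAfterFinished: 3600  # garbage-collect 1 hour after finishing
  template:
    spec:
      restartPolicy: Never       # let the Job controller handle retries
      containers:
      - name: trial
        image: my-registry/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```

For a search where each trial must know which configuration it runs, setting `completionMode: Indexed` gives each pod a distinct completion index via the JOB_COMPLETION_INDEX environment variable.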
Resource Quotas and LimitRanges
Without quotas, a single training job can consume all cluster resources and starve other workloads. Use ResourceQuota to cap namespace-level consumption and LimitRange to set per-pod defaults.
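A sketch of both objects for a training namespace; the namespace name and all numbers are illustrative, and the GPU quota assumes the NVIDIA device plugin's `nvidia.com/gpu` extended resource:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-training         # illustrative namespace
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8" # cap total GPUs requested in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-team-defaults
  namespace: ml-training
spec:
  limits:
  - type: Container
    default:              # applied as limits when a container sets none
      cpu: "2"
      memory: 4Gi
    defaultRequest:       # applied as requests when a container sets none
      cpu: "1"
      memory: 2Gi
```

Note that once a ResourceQuota covers a resource, every pod in the namespace must set requests for it; the LimitRange defaults keep pods that omit them from being rejected.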