Kubernetes for ML

How Kubernetes Actually Runs GPU Workloads


Running GPU workloads on Kubernetes involves more than adding nvidia.com/gpu: 1 to your resource spec. You need to understand the full stack: how the NVIDIA kernel driver is installed on nodes, how the device plugin exposes GPUs to the scheduler, how containers get access to CUDA libraries, and how you monitor GPU utilization in production.

How Kubernetes Sees GPUs

Kubernetes treats GPUs as "extended resources" — custom resource types that can be requested by pods. The GPU count is tracked in the node's allocatable resources. Unlike CPU and memory, GPUs are not overcommittable: if a node has 4 GPUs and a pod requests 4, no other pod can get a GPU on that node until the first pod releases them.

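You can check what a node advertises with kubectl; the node name below is illustrative:

```bash
# Show each node's allocatable GPU count (dots in the resource
# name must be backslash-escaped in custom-columns)
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

# Inspect capacity vs. allocatable on one node (name is illustrative)
kubectl describe node gpu-node-1 | grep -A 8 Allocatable
```

If the `nvidia.com/gpu` column is empty or `<none>`, the device plugin (next section) is not running or not healthy on that node.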

NVIDIA Device Plugin

The NVIDIA device plugin is a DaemonSet that runs on every GPU node. It's responsible for three things: discovering GPUs on the node, reporting them to the kubelet as allocatable resources, and mounting the GPU device files into containers that request them.

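One common way to deploy the device plugin is NVIDIA's Helm chart; the release and namespace names here are a sketch, not a requirement:

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace

# One device-plugin pod should appear per GPU node
kubectl get pods -n nvidia-device-plugin -o wide
```

Note that the chart assumes the NVIDIA driver and container toolkit are already present on the node; it only handles the Kubernetes registration described above.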
ℹ️
Device Plugin vs. GPU Operator: the device plugin handles the Kubernetes-side GPU registration. It does NOT install or manage NVIDIA drivers, the CUDA toolkit, or the container runtime configuration on the node. To manage the full GPU software stack, use the GPU Operator (described below).

Requesting GPUs in Pod Specs

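A minimal pod that claims one GPU looks like this; the CUDA image tag is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # GPUs go under limits; fractional values are not allowed
```

GPUs can only be specified in limits (the request is implied equal to the limit), and a container cannot ask for a fraction of a GPU — that is what MIG (below) is for.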

NVIDIA GPU Operator

The GPU Operator automates installation and lifecycle management of the entire GPU software stack on Kubernetes nodes. Instead of manually installing NVIDIA drivers, CUDA, and the container runtime configuration on each node, you install the GPU Operator once; it then deploys and manages the driver, container toolkit, device plugin, and DCGM monitoring components cluster-wide.

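Installation is a single Helm release; the namespace name below follows NVIDIA's documented default but is configurable:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Watch the operator roll out driver, toolkit, device-plugin, and DCGM pods
kubectl get pods -n gpu-operator
```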

The GPU Operator installs a "driver container" on each node that loads the NVIDIA kernel driver without needing to modify the host OS — critical for immutable OS distributions like CoreOS/Flatcar.

MIG: Multi-Instance GPU Partitioning

NVIDIA A100 and H100 GPUs support MIG (Multi-Instance GPU), which partitions a single physical GPU into multiple isolated "GPU instances" that can be independently assigned to pods. This is useful for inference workloads that don't need a full GPU.

A100 80 GB MIG profiles:

| MIG Profile | Memory | Compute | Max Per GPU | Use Case |
|-------------|--------|---------|-------------|----------|
| 1g.10gb | 10 GB | 1/7 | 7 | Small inference |
| 2g.20gb | 20 GB | 2/7 | 3 | Medium inference |
| 3g.40gb | 40 GB | 3/7 | 2 | Large inference / fine-tuning |
| 7g.80gb | 80 GB | 7/7 | 1 | Full GPU training |
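
With the GPU Operator's MIG manager deployed, a profile from the table can be applied by labeling the node; the node name below is illustrative:

```bash
# Slice every GPU on this node into 1g.10gb instances, using one of the
# MIG manager's built-in configs
kubectl label node gpu-node-1 nvidia.com/mig.config=all-1g.10gb --overwrite
```

Under the mixed MIG strategy, pods then request a specific slice, e.g. nvidia.com/mig-1g.10gb: 1, instead of a whole nvidia.com/gpu.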

DCGM Monitoring

NVIDIA DCGM (Data Center GPU Manager) provides deep GPU health and performance metrics. The GPU Operator deploys DCGM Exporter as a DaemonSet, which exposes metrics in Prometheus format.

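A quick way to sanity-check the exporter is to scrape it directly; the namespace and service name below assume the GPU Operator defaults:

```bash
# DCGM Exporter listens on port 9400 by default
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E "DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP)"
```

In Prometheus, a query like `avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)` gives per-GPU utilization for dashboards and alerting.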
GPU Utilization Baseline: a healthy training job should show GPU utilization consistently above 85%. If utilization oscillates between 0% and 100%, the data loading pipeline is the bottleneck and the GPU is waiting for data. Use multiple DataLoader workers and pin_memory=True in PyTorch to maximize data throughput.
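
That advice can be sketched in PyTorch; the dataset here is synthetic and the worker count is a starting point to tune, not a recommendation:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real training set (shapes are illustrative).
data = torch.randn(1024, 16)
labels = torch.randint(0, 10, (1024,))
dataset = TensorDataset(data, labels)

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,                          # workers overlap loading with GPU compute
    pin_memory=torch.cuda.is_available(),   # page-locked staging speeds host-to-device copies
    persistent_workers=True,                # avoid re-forking workers every epoch
)

for x, y in loader:
    # In a real loop: x = x.to("cuda", non_blocking=True), then forward/backward.
    pass
```

Raising num_workers helps until the CPU side saturates; watch DCGM's utilization metric while tuning rather than guessing.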