ML Docker Image Optimization: From 3 GB to Under 400 MB
A fresh ML Docker image in a Kubeflow pipeline project landed at 3.17 GB. After applying the techniques in this guide, it shrank to 354 MB — an 89% reduction. Pull times dropped from minutes to seconds. Kubernetes pod startup went from slow to near-instant. Node disk pressure alerts went quiet.
This guide explains why ML images balloon, what to do about it, and how to measure the results. The same techniques apply to training images, inference images, Airflow worker images, and any other ML container in your stack.
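The headline percentage can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `percent_reduction` is illustrative, and it assumes 1 GB = 1000 MB when converting the 3.17 GB starting size.

```python
def percent_reduction(before_mb: float, after_mb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    if before_mb <= 0:
        raise ValueError("before_mb must be positive")
    return (before_mb - after_mb) / before_mb * 100


# Figures quoted in this guide, assuming 1 GB = 1000 MB.
before_mb = 3.17 * 1000
after_mb = 354
print(f"{percent_reduction(before_mb, after_mb):.1f}% smaller")  # 88.8% smaller
```

Rounded to a whole number, that is the 89% quoted above.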
Why ML Images Bloat
Standard application images hover around 50–200 MB. ML images routinely exceed 2–4 GB. The bloat comes from five compounding sources:
| Source | Typical Size | Why It's There |
|---|---|---|
| CUDA runtime + cuDNN | 1.2–2.5 GB | GPU acceleration for PyTorch/TensorFlow |
| PyTorch or TensorFlow | 700 MB–2 GB | Deep learning framework with all backends |
| scipy / numpy / sklearn stack | 100–300 MB | Scientific Python, often pulled transitively |
| Build tools left in image | 50–500 MB | gcc, cmake, headers installed during pip compile |
| Unused pip packages | varies | Requirements copied from dev environment |
The most common mistake is starting from an nvidia/cuda or pytorch/pytorch base image that ships the full CUDA toolkit, then layering on every library the data science team uses locally, including dev tools, Jupyter, and visualization packages that the pipeline never calls.
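Running pip install in a RUN layer without cleanup bakes the download cache into the image.
Use --no-cache-dir, or append && pip cache purge to the same RUN instruction; a purge in a later layer does not shrink the layer that already contains the cache. This alone can save 200–400 MB on a heavy ML install.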
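One way to catch this early is a small lint step over the Dockerfile. The sketch below is a heuristic, not a Dockerfile parser: the function names are hypothetical, it only looks at RUN instructions, and it does not know about settings such as a `PIP_NO_CACHE_DIR` environment variable.

```python
import re


def run_instructions(dockerfile: str) -> list[str]:
    """Return each RUN instruction with backslash continuations joined."""
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile)
    return [line.strip() for line in joined.splitlines()
            if line.strip().upper().startswith("RUN ")]


def pip_installs_keeping_cache(dockerfile: str) -> list[str]:
    """Return RUN instructions that pip install without discarding the cache."""
    flagged = []
    for run in run_instructions(dockerfile):
        if "pip install" not in run:
            continue
        # Either option keeps the cache out of this layer.
        if "--no-cache-dir" in run or "pip cache purge" in run:
            continue
        flagged.append(run)
    return flagged


# Illustrative Dockerfile text; only the first pip install should be flagged.
sample = (
    "FROM python:3.11-slim-bookworm\n"
    "RUN pip install -r requirements.txt\n"
    "RUN pip install --no-cache-dir \\\n"
    "    numpy\n"
)
for instruction in pip_installs_keeping_cache(sample):
    print("cache kept in layer:", instruction)
```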
Base Image Selection
The base image is the largest single lever you have. The difference between a bad and a good choice can be 1–2 GB before you install a single package.
| Base Image | Typical Size | Best For |
|---|---|---|
| nvidia/cuda:12.x-runtime-ubuntu22.04 | ~500 MB | GPU inference: CUDA runtime only, no toolkit |
| nvidia/cuda:12.x-devel-ubuntu22.04 | ~3.5 GB | Building CUDA extensions; never use for serving |
| python:3.11-slim-bookworm | ~130 MB | CPU-only workloads, Airflow tasks, data processing |
| python:3.11-alpine | ~55 MB | Ultra-minimal; some C extensions won't compile |
| pytorch/pytorch:2.x-cuda12-runtime | ~2 GB | PyTorch + CUDA, but still large; prefer slim + install |
The key insight: use -runtime not -devel. The devel variant includes the full CUDA compiler toolkit needed to build CUDA extensions from source. Your serving container doesn't need to compile anything — it needs to run already-compiled code. The runtime variant has everything needed for inference at a fraction of the size.
For CPU-only steps such as Airflow tasks and data processing, use python:3.11-slim as the base. You'll save 500 MB–1.5 GB instantly. Reserve the CUDA base images for training and inference containers.
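The runtime-versus-devel rule can also be enforced mechanically. The following sketch reads a Dockerfile's text and reports whether the final stage (the one that ships) is built on a `-devel` image. The function names are hypothetical, and the check is deliberately simple: if the last `FROM` refers to an earlier stage by alias, it will not trace that alias back to its base image.

```python
def final_base_image(dockerfile: str) -> str:
    """Return the image named by the last FROM instruction (the shipped stage)."""
    base = None
    for line in dockerfile.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "FROM":
            # Skip flags such as --platform=linux/amd64.
            args = [p for p in parts[1:] if not p.startswith("--")]
            if args:
                base = args[0]
    if base is None:
        raise ValueError("no FROM instruction found")
    return base


def ships_devel_toolkit(dockerfile: str) -> bool:
    """True when the final stage is built on a -devel image."""
    return "-devel" in final_base_image(dockerfile)


# A devel builder is fine as long as the final stage uses the runtime variant.
multi_stage = (
    "FROM nvidia/cuda:12.x-devel-ubuntu22.04 AS builder\n"
    "RUN make\n"
    "FROM nvidia/cuda:12.x-runtime-ubuntu22.04\n"
    "COPY --from=builder /out /out\n"
)
print(final_base_image(multi_stage), ships_devel_toolkit(multi_stage))
```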
Multi-Stage Builds
Multi-stage builds are the most powerful technique for ML images. The idea: use a fat builder stage that has all the compilers, headers, and build tools needed to install Python packages with C extensions — but copy only the resulting site-packages into the final slim runtime stage.
# Stage 1: Builder with gcc and g++ for compiling C extensions
FROM python:3.11-slim-bookworm AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ libgomp1 \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime — starts fresh, no build tools
FROM python:3.11-slim-bookworm AS runtime
# Copy only the installed packages from the builder
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
CMD ["python", "pipeline.py"]
The runtime stage never sees gcc, g++, or the apt package lists. All the build scaffolding stays in the builder stage, which is discarded. The final image contains only Python and the installed runtime packages.
If a package depends on shared libraries that exist only in the builder stage, copy those .so files explicitly from the builder.
Dependency Optimization
The requirements file that a data scientist uses locally is not the right requirements file for a pipeline image. Local dev installs include Jupyter, matplotlib, ipykernel, and dozens of other tools the pipeline never calls.
Split requirements by environment
# requirements-dev.txt — local dev and notebooks
jupyter
jupyterlab
matplotlib
seaborn
ipykernel
ipywidgets
black
pytest
-r requirements-base.txt  # plus everything in requirements-base.txt
# requirements-base.txt — what the pipeline actually needs at runtime
scikit-learn==1.4.2
pandas==2.2.1
numpy==1.26.4
mlflow-skinny==2.12.1 # skinny = no heavy ML framework deps
dvc[s3]==3.50.0
pandera==0.19.2
The Dockerfile uses requirements-base.txt. Data scientists use requirements-dev.txt. The key packages to audit:
| Package | Swap For | Savings |
|---|---|---|
| mlflow | mlflow-skinny | ~200 MB (omits the server, UI, SQL storage, and data science dependencies) |
| torch (CPU+CUDA) | torch --index-url .../cpu | ~1.2 GB if CPU only |
| tensorflow | tensorflow-cpu | ~800 MB on non-GPU nodes |
| matplotlib | Remove (log plots to MLflow) | ~50 MB |
| jupyter | Remove entirely | ~350 MB |
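A quick audit script can flag dev-only packages that have leaked into a pipeline requirements file. This is a rough sketch: the `DEV_ONLY` set simply mirrors the dev packages named in this guide, and the helper names are illustrative rather than part of any tool.

```python
import re

# Packages this guide treats as dev-only; extend the set for your own team.
DEV_ONLY = {"jupyter", "jupyterlab", "matplotlib", "seaborn",
            "ipykernel", "ipywidgets", "black", "pytest"}


def package_name(requirement: str) -> str:
    """Extract the lowercased distribution name from one requirements line."""
    line = requirement.split("#", 1)[0].strip()
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return match.group(0).lower().replace("_", "-") if match else ""


def dev_only_packages(requirements_text: str) -> list[str]:
    """Return the dev-only packages found in a requirements file, sorted."""
    names = (package_name(line) for line in requirements_text.splitlines())
    return sorted({name for name in names if name in DEV_ONLY})


# Illustrative file contents; the commented-out pytest line is ignored.
requirements = (
    "scikit-learn==1.4.2\n"
    "Jupyter\n"
    "matplotlib>=3.8  # plots\n"
    "# pytest\n"
    "dvc[s3]==3.50.0\n"
)
print(dev_only_packages(requirements))  # ['jupyter', 'matplotlib']
```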
Pin exact versions
Unpinned dependencies let pip resolve at build time, which can pull in newer, larger transitive deps. Pin every package with ==. Generate the pinned file from your dev environment:
# In your dev environment, after testing your pipeline works:
pip freeze | grep -v "^-e" > requirements-base.txt
# Review carefully — remove anything that's dev-only
# Then commit to git
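To keep the file pinned over time, a check like the one below could run before the image build. It is a minimal sketch under simple assumptions: every requirement sits on its own line, and anything without `==` counts as unpinned. The function name is hypothetical.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with ==."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        # Skip blanks, comments, and pip options such as -r or -e.
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Illustrative file contents with two unpinned entries.
requirements = (
    "scikit-learn==1.4.2\n"
    "pandas>=2.0\n"
    "numpy\n"
    "# a comment\n"
    "-r other.txt\n"
    "mlflow-skinny==2.12.1  # skinny\n"
)
print(unpinned_requirements(requirements))  # ['pandas>=2.0', 'numpy']
```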
Use --no-cache-dir and purge
# Always use --no-cache-dir for pip installs in Docker
RUN pip install --no-cache-dir -r requirements-base.txt
# For apt: clean up in the same RUN layer
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
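Chain the install and cleanup steps with && in a single RUN command; files deleted in a later layer still take up space in the earlier one.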
Layer Caching Strategy
Layer ordering matters for build speed in CI/CD. Docker caches layers from top to bottom, invalidating all subsequent layers when a layer changes. For ML images:
FROM python:3.11-slim-bookworm
# 1. System deps first — changes rarely, cache stays warm for months
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 libglib2.0-0 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# 2. Requirements before app code — changes less often than code
COPY requirements-base.txt .
RUN pip install --no-cache-dir -r requirements-base.txt
# 3. App code last — changes every commit, but pip install is already cached
COPY . /app
WORKDIR /app
CMD ["python", "pipeline.py"]
With this ordering, when you change a Python file, Docker reuses the cached pip install layer. A code change that previously triggered a 3-minute pip install now completes in seconds.
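The effect of ordering can be illustrated with a toy model of the cache. This sketch is a simplification, not how Docker is implemented: each layer is described by a name and the files it copies in, the first layer whose inputs changed is a cache miss, and every later layer is rebuilt. All names are illustrative.

```python
def layers_to_rebuild(layers: list[tuple[str, set[str]]],
                      changed: set[str]) -> list[str]:
    """Return the names of layers that would be rebuilt for the changed files."""
    for index, (_, inputs) in enumerate(layers):
        if inputs & changed:
            # First cache miss: this layer and everything after it is rebuilt.
            return [name for name, _ in layers[index:]]
    return []


all_files = {"requirements-base.txt", "pipeline.py"}
good_order = [
    ("apt install", set()),
    ("COPY requirements-base.txt", {"requirements-base.txt"}),
    ("pip install", set()),
    ("COPY . /app", all_files),
]
bad_order = [
    ("apt install", set()),
    ("COPY . /app", all_files),
    ("pip install", set()),
]

print(layers_to_rebuild(good_order, {"pipeline.py"}))  # ['COPY . /app']
print(layers_to_rebuild(bad_order, {"pipeline.py"}))   # ['COPY . /app', 'pip install']
```

In the second ordering a one-line code change forces the pip install to run again, which is the slow path the recommended ordering avoids.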
Case Study: Kubeflow Image (89% Reduction)
Here's a real example of what this looks like in practice. A Kubeflow pipeline project had a training image that weighed in at 3.17 GB. The audit uncovered several layers of bloat, each owned by a different team:
| Finding | Owner | Fix | Savings |
|---|---|---|---|
| Base image was nvidia/cuda:12-devel | DevOps | Switch to cuda:12-runtime | ~2.2 GB |
| Full torch with CUDA wheel included for CPU-only preprocessing step | Data Scientist | Split into two images: CPU preprocessing + GPU training | ~1.2 GB on preprocessing |
| mlflow full package installed | ML Engineer | Switch to mlflow-skinny | ~200 MB |
| Jupyter + visualization libs in pipeline image | Data Scientist | Move to dev requirements only | ~350 MB |
| pip cache not purged | DevOps | Add --no-cache-dir | ~180 MB |
| apt build-essential left installed | DevOps | Multi-stage build | ~90 MB |
The final image size: 354 MB — an 89% reduction from the original 3.17 GB. The Dockerfile that achieved this:
# Builder: has gcc for compiling C extensions
FROM python:3.11-slim-bookworm AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ libgomp1 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements-pipeline.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements-pipeline.txt
# Runtime: clean slate, no compilers
FROM python:3.11-slim-bookworm AS runtime
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 libglib2.0-0 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local
WORKDIR /app
COPY pipeline/ ./pipeline/
COPY config/ ./config/
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1
CMD ["python", "-m", "pipeline.main"]
And the slimmed requirements file (CPU-only preprocessing pipeline):
# requirements-pipeline.txt
scikit-learn==1.4.2
pandas==2.2.1
numpy==1.26.4
mlflow-skinny==2.12.1
boto3==1.34.69
pandera==0.19.2
dvc[s3]==3.50.0
Measuring and Verifying
Optimization without measurement is guesswork. Use these commands to audit images before and after:
# See the uncompressed size on disk (registries transfer compressed layers, so pulls are smaller)
docker image ls your-ml-image:latest
# Inspect each layer's contribution to total size
docker image history your-ml-image:latest --human --format "table {{.CreatedBy}}\t{{.Size}}"
# List installed packages (names and versions only; pip does not report sizes)
docker run --rm your-ml-image:latest \
    pip list --format=columns | head -30
# Find the largest directories inside the image (sh -c keeps the glob and the pipe inside the container)
docker run --rm your-ml-image:latest \
    sh -c 'du -sh /usr/local/lib/python3.11/site-packages/* | sort -rh | head -20'
# Approximate the compressed size in bytes (closer to what a registry pull transfers)
docker save your-ml-image:latest | gzip | wc -c
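Once the layer history has been captured as text, a short script can rank layers by size. The sketch below makes two assumptions: the rows are tab-separated "CreatedBy, Size" pairs (for example, from a `--format` string without the `table` keyword), and sizes use decimal units such as `kB`, `MB`, and `GB`. The sample rows are made up for illustration.

```python
import re

# Decimal units, as used in human-readable size strings like "354MB".
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def size_to_bytes(text: str) -> int:
    """Convert a size such as '354MB' or '1.5GB' to bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if not match or match.group(2) not in UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return round(float(match.group(1)) * UNITS[match.group(2)])


def largest_layers(history: str, top: int = 3) -> list[tuple[str, int]]:
    """Rank tab-separated 'CreatedBy<TAB>Size' rows by size, largest first."""
    rows = []
    for line in history.splitlines():
        created_by, sep, size = line.rpartition("\t")
        if not sep:
            continue  # not a data row
        rows.append((created_by.strip(), size_to_bytes(size)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# Hypothetical history rows, not output from a real image.
sample = "\n".join([
    "COPY . /app\t1.5MB",
    "RUN pip install --no-cache-dir -r requirements-base.txt\t212MB",
    "WORKDIR /app\t0B",
])
for created_by, size in largest_layers(sample):
    print(f"{size / 10**6:8.1f} MB  {created_by}")
```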
Integrate into CI/CD
Add an image size gate to your GitHub Actions workflow so images can't silently regrow:
# .github/workflows/build.yml
- name: Build ML image
  run: docker build -t ml-pipeline:${{ github.sha }} .
- name: Check image size
  run: |
    SIZE_MB=$(docker image inspect ml-pipeline:${{ github.sha }} \
      --format='{{.Size}}' | awk '{print int($1/1024/1024)}')
    echo "Image size: ${SIZE_MB} MB"
    if [ "$SIZE_MB" -gt 500 ]; then
      echo "ERROR: Image size ${SIZE_MB}MB exceeds 500MB limit"
      exit 1
    fi
What to Track Over Time
| Metric | How to Measure | Target |
|---|---|---|
| Image size (uncompressed) | docker image ls | Set per-image budget; alert on growth >10% |
| Pull time on cold node | Kubernetes pod startup latency | <30s for most ML images |
| Node disk pressure events | Prometheus kube_node_status_condition | Zero DiskPressure events |
| Registry storage cost | Registry dashboard or aws ecr describe-repositories | Track month-over-month |
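The size target in the table can be turned into a simple check that runs after each build. This is a sketch under stated assumptions: the 500 MB default budget is borrowed from the CI gate above, the 10% threshold comes from the table, and the function name and the way the previous size is stored are left up to you.

```python
def size_alerts(current_mb: float, previous_mb: float,
                budget_mb: float = 500, max_growth: float = 0.10) -> list[str]:
    """Return alerts for an image build; an empty list means it is healthy."""
    alerts = []
    if current_mb > budget_mb:
        alerts.append(f"over budget: {current_mb:.0f} MB > {budget_mb:.0f} MB")
    if previous_mb > 0:
        growth = (current_mb - previous_mb) / previous_mb
        if growth > max_growth:
            alerts.append(f"grew {growth:.0%} since last build")
    return alerts


print(size_alerts(354, 350))  # []
print(size_alerts(420, 354))  # ['grew 19% since last build']
```

Tracking growth relative to the last build catches slow regrowth that stays under the absolute budget.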
Smaller images are not just about disk space. They reduce attack surface (fewer packages = fewer CVEs), speed up Kubernetes pod scheduling, and cut CI/CD build times. For teams running dozens of daily training jobs, a 2 GB → 400 MB reduction meaningfully lowers infrastructure costs.