MLOps Pipeline

ML Docker Image Optimization: From 3 GB to Under 400 MB

● Intermediate ⏱ 35 min read

A fresh ML Docker image in a Kubeflow pipeline project landed at 3.17 GB. After applying the techniques in this guide, it shrank to 354 MB — an 89% reduction. Pull times dropped from minutes to seconds. Kubernetes pod startup went from slow to near-instant. Node disk pressure alerts went quiet.

This guide explains why ML images balloon, what to do about it, and how to measure the results. The same techniques apply to training images, inference images, Airflow worker images, and any other ML container in your stack.

💡

Optimization is a Team Sport: Image optimization is not a pure DevOps problem. A DevOps engineer cannot safely remove a Python library without confirming with the data scientist whether the model actually needs it at runtime. Real reduction happens when all teams ask together: What does this image actually need to run?

Why ML Images Bloat

Standard application images hover around 50–200 MB. ML images routinely exceed 2–4 GB. The bloat comes from five compounding sources:

Source	Typical Size	Why It's There
CUDA runtime + cuDNN	1.2–2.5 GB	GPU acceleration for PyTorch/TensorFlow
PyTorch or TensorFlow	700 MB–2 GB	Deep learning framework with all backends
scipy / numpy / sklearn stack	100–300 MB	Scientific Python, often pulled transitively
Build tools left in image	50–500 MB	gcc, cmake, headers installed to compile C extensions during pip install
Unused pip packages	varies	Requirements copied from dev environment

The most common mistake is starting from an nvidia/cuda or pytorch/pytorch base image that ships the full CUDA toolkit, then layering on every library the data science team uses locally, including dev tools, Jupyter, and visualization packages that the pipeline never calls.

⚠️

Each pip install in a RUN layer without cleanup bakes the download cache into the image. Always append && pip cache purge or use --no-cache-dir. This alone can save 200–400 MB on a heavy ML install.

Base Image Selection

The base image is the largest single lever you have. The difference between a bad and a good choice can be 1–2 GB before you install a single package.

Base Image	Typical Size	Best For
`nvidia/cuda:12.x-runtime-ubuntu22.04`	~500 MB	GPU inference — CUDA runtime only, no toolkit
`nvidia/cuda:12.x-devel-ubuntu22.04`	~3.5 GB	Building CUDA extensions — never use for serving
`python:3.11-slim-bookworm`	~130 MB	CPU-only workloads, Airflow tasks, data processing
`python:3.11-alpine`	~55 MB	Ultra-minimal — some C extensions won't compile
`pytorch/pytorch:2.x-cuda12-runtime`	~2 GB	PyTorch + CUDA, but still large — prefer slim + install

The key insight: use -runtime not -devel. The devel variant includes the full CUDA compiler toolkit needed to build CUDA extensions from source. Your serving container doesn't need to compile anything — it needs to run already-compiled code. The runtime variant has everything needed for inference at a fraction of the size.

💡

CPU-Only Pipelines Don't Need CUDA at All: If your Airflow data pipeline, feature engineering job, or validation step doesn't use a GPU, use python:3.11-slim as the base. You'll save 500 MB–1.5 GB instantly. Reserve the CUDA base images for training and inference containers.
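
The selection rules above can be written down as a small decision helper. This is only a sketch: the function name `choose_base_image` is hypothetical, and the tags are copied from the table as-is, including the `12.x` placeholder, so they need a concrete version before use.

```python
def choose_base_image(needs_gpu: bool, builds_cuda_extensions: bool = False) -> str:
    """Pick a base image following the table above (tags are placeholders)."""
    if not needs_gpu:
        # CPU-only workloads: no CUDA layers at all.
        return "python:3.11-slim-bookworm"
    if builds_cuda_extensions:
        # Only appropriate for a builder stage, never for the serving image.
        return "nvidia/cuda:12.x-devel-ubuntu22.04"
    # GPU inference or training with prebuilt wheels: runtime libraries only.
    return "nvidia/cuda:12.x-runtime-ubuntu22.04"


print(choose_base_image(needs_gpu=False))
print(choose_base_image(needs_gpu=True))
```

Encoding the choice in one place makes it easier to review when a new pipeline step is added.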

Multi-Stage Builds

Multi-stage builds are the most powerful technique for ML images. The idea: use a fat builder stage that has all the compilers, headers, and build tools needed to install Python packages with C extensions — but copy only the resulting site-packages into the final slim runtime stage.

dockerfile

# Stage 1: Builder, with gcc and g++ for compiling C extensions
FROM python:3.11-slim-bookworm AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build
COPY requirements.txt .

RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime — starts fresh, no build tools
FROM python:3.11-slim-bookworm AS runtime

# Copy only the installed packages from the builder
COPY --from=builder /install /usr/local

WORKDIR /app
COPY . .

CMD ["python", "pipeline.py"]

The runtime stage never sees gcc, g++, or any other package installed in the builder. All the build scaffolding stays in the builder stage, which is left out of the final image. The final image contains only Python and the installed runtime packages.
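
One way to confirm that no build tools leaked into the runtime stage is to run a small check inside the finished image. The sketch below is an assumption about how such a check could look: the tool list and the function name `find_build_tools` are illustrative, not part of the original project.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Hypothetical list of tools that should exist only in the builder stage.
BUILD_TOOLS = ("gcc", "g++", "cmake", "make")


def find_build_tools(
    tools: Iterable[str] = BUILD_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the build tools that are still reachable on PATH."""
    return [tool for tool in tools if which(tool) is not None]


leftovers = find_build_tools()
print("build tools found:", leftovers or "none")
```

Saved as a script (for example a hypothetical `check_build_tools.py` copied into the image), it could be run with `docker run --rm your-ml-image:latest python check_build_tools.py`.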

ℹ️

When Multi-Stage Doesn't Apply: Some packages bundle their own shared libraries. PyTorch's wheels ship libtorch, and its CUDA builds bring the CUDA runtime libraries into site-packages; the libcuda driver library is supplied by the host's NVIDIA driver, not by the wheel. Copying just site-packages works fine for these. But if you build a custom CUDA extension or compile something that links to system libraries, you may need to also copy those .so files explicitly from the builder.

Dependency Optimization

The requirements file that a data scientist uses locally is not the right requirements file for a pipeline image. Local dev installs include Jupyter, matplotlib, ipykernel, and dozens of other tools the pipeline never calls.

Split requirements by environment

text

# requirements-dev.txt — local dev and notebooks
jupyter
jupyterlab
matplotlib
seaborn
ipykernel
ipywidgets
black
pytest
-r requirements-base.txt   # include everything the pipeline needs at runtime

# requirements-base.txt — what the pipeline actually needs at runtime
scikit-learn==1.4.2
pandas==2.2.1
numpy==1.26.4
mlflow-skinny==2.12.1   # skinny = lightweight client without server, UI, or data science deps
dvc[s3]==3.50.0
pandera==0.19.2

The Dockerfile uses requirements-base.txt. Data scientists use requirements-dev.txt. The key packages to audit:

Package	Swap For	Savings
`mlflow`	`mlflow-skinny`	~200 MB (drops the tracking server, UI, SQL storage, and data science dependencies)
`torch` (CPU+CUDA)	`torch --index-url .../cpu`	~1.2 GB if CPU only
`tensorflow`	`tensorflow-cpu`	~800 MB on non-GPU nodes
`matplotlib`	Remove (log plots to MLflow)	~50 MB
`jupyter`	Remove entirely	~350 MB
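
The audit in the table can be partly automated. The following is a minimal sketch, assuming a plain requirements file with one package per line; the `SUGGESTIONS` mapping and the function names are hypothetical. `torch` is left out because whether it is the CPU or CUDA build depends on the index URL, which a requirements line alone does not show.

```python
import re
from typing import Dict, List

# Suggestions copied from the audit table; extend for your own stack.
SUGGESTIONS: Dict[str, str] = {
    "mlflow": "swap for mlflow-skinny",
    "tensorflow": "swap for tensorflow-cpu on non-GPU nodes",
    "matplotlib": "remove (log plots to MLflow)",
    "jupyter": "remove entirely",
}


def package_name(line: str) -> str:
    """Extract the bare, lower-cased package name from a requirements line."""
    line = line.split("#", 1)[0].strip()
    match = re.match(r"[A-Za-z0-9._-]+", line)
    return match.group(0).lower() if match else ""


def audit(requirements_text: str) -> List[str]:
    """Return one finding per line whose package has a suggestion."""
    findings = []
    for line in requirements_text.splitlines():
        name = package_name(line)
        if name in SUGGESTIONS:
            findings.append(f"{name}: {SUGGESTIONS[name]}")
    return findings


sample = "mlflow==2.12.1\njupyter\npandas==2.2.1\nmlflow-skinny==2.12.1\n"
for finding in audit(sample):
    print(finding)
```

Because the name match is exact, `mlflow-skinny` is not flagged while `mlflow` is.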

Pin exact versions

Unpinned dependencies let pip resolve at build time, which can pull in newer, larger transitive deps. Pin every package with ==. Generate the pinned file from your dev environment:

bash

# In your dev environment, after testing your pipeline works:
pip freeze | grep -v "^-e" > requirements-base.txt

# Review carefully — remove anything that's dev-only
# Then commit to git
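
A quick way to enforce the pinning rule is to scan the requirements file for any line that lacks an exact `==` version. This is a rough sketch under simplifying assumptions: comments, blank lines, and pip options (lines starting with `-`) are skipped, and environment markers or URL requirements are not handled and would be reported as unpinned. The function name `unpinned_lines` is hypothetical.

```python
import re
from typing import List

# name, optional [extras], then an exact == pin
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[^=\s]+$")


def unpinned_lines(requirements_text: str) -> List[str]:
    """Return requirement lines that are not pinned with an exact '=='."""
    problems = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # blank line, comment, or pip option such as -r
        if not PINNED.match(line):
            problems.append(line)
    return problems


sample = "scikit-learn==1.4.2\npandas>=2.0\nnumpy\ndvc[s3]==3.50.0\n"
print(unpinned_lines(sample))
```

Running this in CI before the image build would catch an unpinned package before it changes the image size.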

Use `--no-cache-dir` and purge

dockerfile

# Always use --no-cache-dir for pip installs in Docker
RUN pip install --no-cache-dir -r requirements-base.txt

# For apt: clean up in the same RUN layer
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

⚠️

Cleanup must be in the same RUN layer as the install. Docker layers are immutable. If you install in one RUN and delete cache in the next, the cache is still baked into the first layer. Always chain cleanup with && in a single RUN command.
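
The same-layer rule lends itself to a simple lint. The sketch below is a simplified assumption of how such a check might work, using plain string matching on each RUN instruction; it is not a full Dockerfile parser, it ignores heredocs, and the function names are hypothetical.

```python
from typing import List


def run_instructions(dockerfile_text: str) -> List[str]:
    """Join backslash-continued lines and return each RUN instruction as one string."""
    logical, buffer = [], ""
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line
        if line.endswith("\\"):
            buffer += line[:-1] + " "
            continue
        buffer += line
        if buffer:
            logical.append(buffer)
        buffer = ""
    return [item for item in logical if item.upper().startswith("RUN ")]


def lint(dockerfile_text: str) -> List[str]:
    """Flag installs whose cleanup is missing from the same RUN instruction."""
    warnings = []
    for run in run_instructions(dockerfile_text):
        if "pip install" in run and "--no-cache-dir" not in run and "pip cache purge" not in run:
            warnings.append("pip install keeps its cache: " + run)
        if "apt-get install" in run and "rm -rf /var/lib/apt/lists" not in run:
            warnings.append("apt lists not removed in the same RUN: " + run)
    return warnings


BAD = r"""FROM python:3.11-slim-bookworm
RUN apt-get update && apt-get install -y gcc
RUN rm -rf /var/lib/apt/lists/*
RUN pip install -r requirements.txt
"""

GOOD = r"""FROM python:3.11-slim-bookworm
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -r requirements.txt
"""

for warning in lint(BAD):
    print(warning)
```

In the `BAD` sample the cleanup runs in a separate RUN, so the apt lists are still stored in the earlier layer and the lint flags it.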

Layer Caching Strategy

Layer ordering matters for build speed in CI/CD. Docker caches layers from top to bottom, invalidating all subsequent layers when a layer changes. For ML images:

dockerfile

FROM python:3.11-slim-bookworm

# 1. System deps first — changes rarely, cache stays warm for months
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 libglib2.0-0 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# 2. Requirements before app code — changes less often than code
COPY requirements-base.txt .
RUN pip install --no-cache-dir -r requirements-base.txt

# 3. App code last — changes every commit, but pip install is already cached
COPY . /app
WORKDIR /app

CMD ["python", "pipeline.py"]

With this ordering, when you change a Python file, Docker reuses the cached pip install layer. A code change that previously triggered a 3-minute pip install now completes in seconds.
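
The invalidation rule can be illustrated with a toy model. The sketch below is not how Docker computes cache keys; it only assumes the rule stated above, that the first changed layer and every layer after it are rebuilt. The layer labels are illustrative.

```python
from typing import List, Sequence, Set


def layers_to_rebuild(layers: Sequence[str], changed: Set[str]) -> List[str]:
    """Return the first changed layer and everything after it."""
    for index, layer in enumerate(layers):
        if layer in changed:
            return list(layers[index:])
    return []


code_last = ["apt install", "COPY requirements", "pip install", "COPY app code"]
code_first = ["apt install", "COPY app code", "COPY requirements", "pip install"]

# A code-only change with the recommended ordering rebuilds one cheap layer.
print(layers_to_rebuild(code_last, {"COPY app code"}))
# The same change with app code copied early also reruns pip install.
print(layers_to_rebuild(code_first, {"COPY app code"}))
```

The second ordering is the situation in which every commit pays for a full dependency install.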

Case Study: Kubeflow Image (89% Reduction)

Here's a real example of what this looks like in practice. A Kubeflow pipeline project had a training image that weighed in at 3.17 GB. The audit uncovered several layers of bloat, each owned by a different team:

Finding	Owner	Fix	Savings
Base image was `nvidia/cuda:12-devel`	DevOps	Switch to `cuda:12-runtime`	~2.2 GB
Full `torch` with CUDA wheel included for CPU-only preprocessing step	Data Scientist	Split into two images: CPU preprocessing + GPU training	~1.2 GB on preprocessing
`mlflow` full package installed	ML Engineer	Switch to `mlflow-skinny`	~200 MB
Jupyter + visualization libs in pipeline image	Data Scientist	Move to dev requirements only	~350 MB
pip cache not purged	DevOps	Add `--no-cache-dir`	~180 MB
apt build-essential left installed	DevOps	Multi-stage build	~90 MB
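
The savings column can be sanity-checked against the headline numbers (3.17 GB before, 354 MB after). The arithmetic below assumes 1 GB = 1000 MB and uses only figures from this guide. The listed savings add up to more than the total reduction, which suggests they are per-finding estimates that overlap (the ~1.2 GB figure is stated for the preprocessing image specifically), so they should not be expected to sum to the overall result.

```python
ORIGINAL_MB = 3170  # 3.17 GB, taking 1 GB = 1000 MB
FINAL_MB = 354


def percent_reduction(before_mb: float, after_mb: float) -> float:
    """Size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100


# Savings column from the table above, in MB.
listed_savings_mb = {
    "devel -> runtime base": 2200,
    "CPU-only torch for preprocessing": 1200,
    "mlflow-skinny": 200,
    "Jupyter + visualization libs": 350,
    "pip cache": 180,
    "build-essential": 90,
}

actual_saved_mb = ORIGINAL_MB - FINAL_MB
listed_total_mb = sum(listed_savings_mb.values())

print(f"reduction: {percent_reduction(ORIGINAL_MB, FINAL_MB):.1f}%")
print(f"actually saved: {actual_saved_mb} MB, listed savings: {listed_total_mb} MB")
```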

The final image size: 354 MB — an 89% reduction from the original 3.17 GB. The Dockerfile that achieved this:

dockerfile

# Builder: has gcc for compiling C extensions
FROM python:3.11-slim-bookworm AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build
COPY requirements-pipeline.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements-pipeline.txt

# Runtime: clean slate, no compilers
FROM python:3.11-slim-bookworm AS runtime

RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 libglib2.0-0 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

COPY --from=builder /install /usr/local

WORKDIR /app
COPY pipeline/ ./pipeline/
COPY config/ ./config/

ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

CMD ["python", "-m", "pipeline.main"]

And the slimmed requirements file (CPU-only preprocessing pipeline):

text

# requirements-pipeline.txt
scikit-learn==1.4.2
pandas==2.2.1
numpy==1.26.4
mlflow-skinny==2.12.1
boto3==1.34.69
pandera==0.19.2
dvc[s3]==3.50.0

💡

Separate Training and Serving Images: Your training image and your serving image have completely different runtime requirements. The training image needs experiment tracking, data loading, and evaluation code. The serving image needs only the model loader and prediction logic. Maintaining two separate Dockerfiles (one for training, one for serving) typically halves both image sizes.

Measuring and Verifying

Optimization without measurement is guesswork. Use these commands to audit images before and after:

bash

# See the uncompressed, on-disk size (registry pulls transfer compressed layers, which are smaller)
docker image ls your-ml-image:latest

# Inspect each layer's contribution to total size
docker image history --human --format "table {{.CreatedBy}}\t{{.Size}}" your-ml-image:latest

# List installed packages (names and versions only; sizes come from du below)
docker run --rm your-ml-image:latest \
  pip list --format=columns | head -30

# Find the largest directories inside the image
# (sh -c so the glob and the pipes run inside the container, not on the host)
docker run --rm your-ml-image:latest \
  sh -c 'du -sh /usr/local/lib/python3.11/site-packages/* | sort -rh | head -20'

# Approximate the compressed size (closer to what registry pulls transfer)
docker save your-ml-image:latest | gzip | wc -c
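
To rank layers automatically, the history output can be parsed with a short script. This sketch assumes the output was produced with a non-table format such as `--format "{{.CreatedBy}}\t{{.Size}}"` and that sizes use Docker's decimal human-readable units (B, kB, MB, GB). The sample lines are made up for illustration and are not measurements from the case study.

```python
import re
from typing import List, Tuple

UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> int:
    """Convert a human-readable size such as '1.5MB' into bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return round(float(number) * UNITS[unit.upper()])


def largest_layers(history_output: str, top: int = 3) -> List[Tuple[int, str]]:
    """Return (bytes, created_by) for the largest layers, biggest first."""
    layers = []
    for line in history_output.splitlines():
        if not line.strip():
            continue
        created_by, size = line.rsplit("\t", 1)
        layers.append((parse_size(size), created_by.strip()))
    return sorted(layers, reverse=True)[:top]


# Hypothetical history output, one layer per line.
sample = "\n".join([
    "CMD python pipeline.py\t0B",
    "COPY . /app\t1.5MB",
    "RUN pip install --no-cache-dir -r requirements-base.txt\t310MB",
    "RUN apt-get update && apt-get install -y libgomp1\t12MB",
])

for size, created_by in largest_layers(sample):
    print(f"{size / 10**6:8.1f} MB  {created_by}")
```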

Integrate into CI/CD

Add an image size gate to your GitHub Actions workflow so images can't silently regrow:

yaml

# .github/workflows/build.yml
- name: Build ML image
  run: docker build -t ml-pipeline:${{ github.sha }} .

- name: Check image size
  run: |
    SIZE_MB=$(docker image inspect ml-pipeline:${{ github.sha }} \
      --format='{{.Size}}' | awk '{print int($1/1024/1024)}')
    echo "Image size: ${SIZE_MB} MB"
    if [ "$SIZE_MB" -gt 500 ]; then
      echo "ERROR: Image size ${SIZE_MB}MB exceeds 500MB limit"
      exit 1
    fi

ℹ️

Dive is a Great Visual Tool: dive is an open-source CLI tool that renders a visual layer-by-layer breakdown of a Docker image, showing exactly what each RUN command added or deleted. It's particularly useful for finding unexpectedly large layers caused by forgotten build artifacts.

What to Track Over Time

Metric	How to Measure	Target
Image size (uncompressed)	`docker image ls`	Set per-image budget; alert on growth >10%
Pull time on cold node	Kubernetes pod startup latency	<30s for most ML images
Node disk pressure events	Prometheus `kube_node_status_condition`	Zero DiskPressure events
Registry storage cost	Registry dashboard or `aws ecr describe-images` (`imageSizeInBytes`)	Track month-over-month
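
The first row of the table combines two checks: an absolute budget and a growth alert. A minimal sketch of both follows, assuming sizes are already available in MB (for example from `docker image inspect`) and that the previous build's size is stored somewhere by your CI. The function name and the example numbers other than 354 MB and the 500 MB limit are hypothetical.

```python
from typing import List, Optional


def check_image_size(
    size_mb: float,
    budget_mb: float,
    previous_mb: Optional[float] = None,
    max_growth: float = 0.10,
) -> List[str]:
    """Return a list of problems; an empty list means the image passes."""
    problems = []
    if size_mb > budget_mb:
        problems.append(f"size {size_mb} MB exceeds budget of {budget_mb} MB")
    if previous_mb and size_mb > previous_mb * (1 + max_growth):
        growth = (size_mb - previous_mb) / previous_mb * 100
        problems.append(f"size grew {growth:.0f}% since the previous build")
    return problems


print(check_image_size(354, budget_mb=500, previous_mb=350))
print(check_image_size(450, budget_mb=500, previous_mb=400))
```

A CI step could fail the build whenever the returned list is non-empty.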

Smaller images are not just about disk space. They reduce attack surface (fewer packages = fewer CVEs), speed up Kubernetes pod startup by shortening image pulls, and cut CI/CD build times. For teams running dozens of daily training jobs, a 2 GB → 400 MB reduction meaningfully lowers infrastructure costs.
