Versioning Data With DVC
You trained a model six months ago. It's now in production and accuracy has degraded. You need to retrain it — but on which data? What was the exact dataset used in the original training run? Without dataset versioning you're guessing. DVC (Data Version Control) solves this: every model is linked to a Git commit, and every Git commit is linked to a precise, reproducible snapshot of your dataset.
What is DVC?
DVC is an open-source tool that brings Git-style versioning to large files — datasets, models, and anything else that doesn't belong in a Git repository directly.
Why Git Can't Handle Large Files
Git is designed for text files and small binaries. The moment you try to track a 2 GB CSV or a 500 MB model checkpoint, Git performance collapses: cloning takes forever, every pull re-downloads gigabytes, and the repository history bloats permanently. GitHub even rejects files over 100 MB. You need a different strategy for large ML artifacts.
DVC = "Git for Data"
DVC stores your actual data in a remote (typically MinIO, S3, GCS,
or Azure Blob), and stores a tiny pointer file (a
.dvc file) in Git. That pointer contains the MD5
checksum and file path — enough information to retrieve the exact
file version at any point in history. When you
git checkout an old commit, DVC knows which data
version matches, and dvc pull fetches it.
| What goes in Git | What goes in MinIO (via DVC) |
|---|---|
| `HR-Employee-Attrition.csv.dvc` (pointer, ~200 bytes) | `HR-Employee-Attrition.csv` (actual data, 150 MB) |
| `.dvc/config` (remote URL) | Hashed cache objects (deduplicated) |
| All your Python code and notebooks | Model artifacts, large binaries |

A `git commit` locks both the code and the dataset version.
How DVC Works Under the Hood
Understanding the internals helps you debug issues and design your pipeline correctly.
Pointer Files (.dvc files)
When you run dvc add mydata.csv, DVC creates
mydata.csv.dvc alongside it. This tiny YAML file
records the MD5 checksum, file size, and path. Git tracks this file
— not the data itself. When someone clones the repo, they get the
pointer. Running dvc pull then fetches the matching
data from remote storage.
```yaml
# HR-Employee-Attrition.csv.dvc — what DVC commits to Git
outs:
- md5: a3f8c2d1e4b7...
  size: 157286400
  path: HR-Employee-Attrition.csv
```
Local Cache (.dvc/cache)
DVC keeps a local cache at .dvc/cache/ in your project
root. Files are stored by their MD5 hash (e.g.,
.dvc/cache/a3/f8c2d1e4b7...). When you run
dvc add, the file is copied into the cache and a
symlink or hard link is created in your working directory. This
means DVC can restore any version without re-downloading it, as long
as it's in the local cache.
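The content-addressing scheme can be sketched in a few lines of Python. This is a simplified illustration, not DVC's actual implementation (DVC also tracks state and uses links rather than always copying); `cache_file` is a hypothetical helper:

```python
import hashlib
import shutil
from pathlib import Path

def cache_file(path: str, cache_dir: str = ".dvc/cache") -> str:
    """Copy a file into an MD5-addressed cache, DVC-style (simplified sketch)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    # The cache is sharded by the first two hex characters,
    # e.g. .dvc/cache/a3/f8c2d1e4b7...
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():  # already cached: nothing to copy
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return digest
```

Because the cache path is derived purely from the content, adding the same file twice is a no-op, and restoring any version is just a lookup by checksum.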
Remote Storage (MinIO)
The remote is where the cache is backed up for sharing. You
configure one remote per project (or multiple if needed). When you
run dvc push, DVC uploads any locally cached objects
not yet in MinIO. When a teammate runs dvc pull, DVC
downloads only the objects missing from their local cache. Storage is
deduplicated at the file level: byte-identical files share a single
cache object, no matter how many datasets or commits reference them.
Deduplication via MD5
Every file version gets a unique MD5 checksum. Deduplication operates at the file level, not the block level: if you update a 1 GB dataset and only 10 MB changed, DVC stores the entire new file as a fresh cache object. For datasets that frequently change in small ways, consider partitioning them into smaller files so that unchanged partitions are deduplicated.
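A quick sketch of why partitioning helps, using nothing but MD5 checksums (the byte strings are illustrative stand-ins for a real dataset):

```python
import hashlib

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# One monolithic file: a tiny append changes the checksum, so the
# entire new file becomes a fresh cache object on the next push.
big_v1 = b"row1\nrow2\nrow3\n" * 1000
big_v2 = big_v1 + b"row4\n"
monolith_bytes_stored = len(big_v2) if md5_of(big_v2) != md5_of(big_v1) else 0

# Partitioned dataset: only the partition whose checksum changed is
# stored (and pushed) again; unchanged partitions are deduplicated.
parts_v1 = [b"row1\nrow2\n", b"row3\n"]
parts_v2 = [b"row1\nrow2\n", b"row3\nrow4\n"]
known = {md5_of(p) for p in parts_v1}
partition_bytes_stored = sum(len(p) for p in parts_v2 if md5_of(p) not in known)
```

With the monolithic layout, every small edit re-stores and re-pushes the full file; with partitions, only the changed partition is stored again.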
Keep .dvc/cache Out of Git
The local cache can grow large. DVC adds it to
.gitignore automatically, but double-check this if
you see unexpected large files staged in Git. The cache directory
should never be committed.
Where DVC Runs
DVC isn't just a local developer tool — it integrates into your automated pipeline at multiple points.
In Airflow (Automated Versioning)
Your ETL DAG produces a fresh CSV every time it runs. Rather than
overwriting the file silently, the DAG can call DVC to version it
automatically. The Airflow worker runs dvc add on the
output CSV, then pushes to MinIO and commits the updated pointer to
Git. Every pipeline run produces a versioned, reproducible dataset
snapshot.
```python
# Snippet from an Airflow DAG task
def version_dataset(**context):
    import subprocess
    run_date = context['ds']  # e.g. "2024-03-15"
    csv_path = f"/opt/airflow/datasets/HR-Employee-Attrition_{run_date}.csv"
    # Add to DVC (updates the .dvc pointer file)
    subprocess.run(["dvc", "add", csv_path], check=True)
    # Push data to MinIO
    subprocess.run(["dvc", "push"], check=True)
    # Commit the updated pointer to Git
    subprocess.run(["git", "add", f"{csv_path}.dvc"], check=True)
    subprocess.run(["git", "commit", "-m", f"chore: version dataset {run_date}"], check=True)
    subprocess.run(["git", "push"], check=True)
```
On Local Machines (Data Scientists)
A data scientist clones the repo and wants the exact dataset used in
a particular training run. They check out the relevant Git commit
and run dvc pull. DVC reads the pointer file, looks up
the MD5 in MinIO, and downloads the exact file version. They're
guaranteed to have the same data the original training run used — no
matter when or where they run this command.
```shell
# Reproduce the exact dataset from any past training run
git checkout abc1234   # check out the commit from 3 months ago
dvc pull               # downloads the matching dataset from MinIO
python train.py        # trains on the historically-accurate data
```
DevOps Responsibilities
As a DevOps engineer, your role in the DVC setup covers three areas:
- **MinIO bucket provisioning**: the `dvc` bucket is created automatically by the MinIO `mc` sidecar on first deploy
- **MinIO credentials**: Airflow workers need `MINIO_ACCESS_KEY`/`MINIO_SECRET_KEY` (from `minio/.env`) and `MLFLOW_S3_ENDPOINT_URL=https://s3.ops4life.com` in their environment
- **DVC CLI in the worker image**: the Airflow worker Docker image needs `dvc` and `dvc-s3` installed, with the MinIO credentials and endpoint available at runtime
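Before running the versioning task, a DAG can fail fast when the worker image or environment is incomplete. A minimal sketch (the helper name and messages are hypothetical; it assumes the MinIO keys are exported as the standard AWS variables that DVC's S3 remote reads):

```python
import os
import shutil

# DVC's S3-compatible remote reads the standard AWS credential variables;
# the MinIO keys are assumed to be mapped onto them in the worker env.
REQUIRED_ENV = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def check_dvc_prereqs(env=os.environ) -> list:
    """Return a list of problems; an empty list means the worker looks ready."""
    problems = []
    if shutil.which("dvc") is None:
        problems.append("dvc CLI not found on PATH (bake dvc and dvc-s3 into the image)")
    problems.extend(f"{var} is not set" for var in REQUIRED_ENV if not env.get(var))
    return problems
```

Calling this at the top of the versioning task turns a cryptic mid-run subprocess failure into an immediate, readable error.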
Hands-On: Version a Dataset with DVC
This walkthrough takes the
IBM HR Analytics dataset from Kaggle
used across the MLOps pipeline guides and puts it under DVC version
control. The dvc bucket in MinIO is pre-created by the
MinIO deploy — no manual setup needed.
Step 1: Prerequisites
Before starting, you need:
- MinIO deployed and running: credentials in `minio/.env` (`MINIO_ACCESS_KEY`/`MINIO_SECRET_KEY`)
- The `mlops-get-started` repository cloned locally
Step 2: Pull the Latest Repository Changes
```shell
git pull origin main
```
Step 3: Install DVC
Install DVC along with its S3 extension. The
dvc-s3 package handles the S3-compatible protocol that
MinIO speaks — without it, DVC won't be able to push or pull data.
```shell
pip install dvc dvc-s3
```
Step 4: Initialize DVC
Run this from the root of your Git repository. DVC creates a
.dvc/ folder (similar to .git/) and adds
it to the repo. The .dvc/config file holds your remote
configuration. Commit this initialization to Git so everyone on the
team shares the same setup.
```shell
dvc init
git add .dvc/
git commit -m "chore: initialize DVC"
```
Step 5: Add Your MinIO Remote
Configure MinIO as the default remote storage. The
-d flag marks this as the default remote. The
endpointurl tells DVC to use the MinIO S3-compatible
API instead of AWS.
```shell
dvc remote add -d ml-dataset s3://dvc/store
dvc remote modify ml-dataset endpointurl https://s3.ops4life.com
export AWS_ACCESS_KEY_ID=<MINIO_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<MINIO_SECRET_KEY>
git add .dvc/config
git commit -m "chore: configure MinIO DVC remote"
```
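For reference, the committed `.dvc/config` should then look roughly like this (a sketch; DVC writes an INI-style file and may order sections differently):

```ini
[core]
    remote = ml-dataset
['remote "ml-dataset"']
    url = s3://dvc/store
    endpointurl = https://s3.ops4life.com
```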
Step 6: Stop Tracking the Dataset in Git
If the CSV is already tracked by Git, remove it from the index
first. This does not delete the file from disk — it only
tells Git to stop tracking it. The .gitignore entry DVC
creates will prevent it from being accidentally re-added.
```shell
git rm -r --cached datasets/HR-Employee-Attrition.csv
```
Step 7: Add the Dataset to DVC
This is the core DVC command. It computes the MD5 checksum of the
file, copies it into the local cache (.dvc/cache/),
creates the .dvc pointer file, and adds the data file
to .gitignore so Git ignores it going forward.
```shell
dvc add datasets/HR-Employee-Attrition.csv
# DVC creates these two files:
#   datasets/HR-Employee-Attrition.csv.dvc  ← track this in Git
#   datasets/.gitignore                     ← auto-generated
```
Step 8: Push the Dataset to MinIO
Upload the locally cached data to the MinIO remote. DVC only uploads objects that aren't already in the remote — subsequent pushes after small dataset changes are fast.
```shell
dvc push
# Uploading to s3://dvc/store (via https://s3.ops4life.com)
# 1 file pushed
```
Step 9: Commit the Pointer to Git
Commit the .dvc pointer file and the auto-generated
.gitignore. This Git commit is the versioning anchor —
anyone who checks out this commit and runs
dvc pull will get exactly the dataset you just pushed.
```shell
git add datasets/HR-Employee-Attrition.csv.dvc
git add datasets/.gitignore
git commit -m "feat: track HR-Employee-Attrition dataset with DVC"
git push
```
Step 10: Verify the Setup
To confirm the round trip works, delete the local copy and the cache (`rm datasets/HR-Employee-Attrition.csv && rm -rf .dvc/cache`), then run `dvc pull`. DVC should download the file from MinIO and restore it to its original location. If it does, your setup is correct.
Congratulations — you've completed the full MLOps learning path. You now understand the complete lifecycle: from data ingestion and preparation, through model training and deployment with KServe, Kubernetes GPU workloads, production monitoring with drift detection, and finally dataset versioning with DVC to close the reproducibility loop.