Operations & Monitoring

Versioning Data With DVC

● Intermediate · ⏱ 30 min read · Operations

You trained a model six months ago. It's now in production and accuracy has degraded. You need to retrain it — but on which data? What was the exact dataset used in the original training run? Without dataset versioning you're guessing. DVC (Data Version Control) solves this: every model is linked to a Git commit, and every Git commit is linked to a precise, reproducible snapshot of your dataset.

What is DVC?

DVC is an open-source tool that brings Git-style versioning to large files — datasets, models, and anything else that doesn't belong in a Git repository directly.

Why Git Can't Handle Large Files

Git is designed for text files and small binaries. The moment you try to track a 2 GB CSV or a 500 MB model checkpoint, Git performance collapses: cloning takes forever, every pull re-downloads gigabytes, and the repository history bloats permanently. GitHub even rejects files over 100 MB. You need a different strategy for large ML artifacts.

DVC = "Git for Data"

DVC stores your actual data in a remote (typically MinIO, S3, GCS, or Azure Blob), and stores a tiny pointer file (a .dvc file) in Git. That pointer contains the MD5 checksum and file path — enough information to retrieve the exact file version at any point in history. When you git checkout an old commit, DVC knows which data version matches, and dvc pull fetches it.

| What goes in Git | What goes in MinIO (via DVC) |
| --- | --- |
| HR-Employee-Attrition.csv.dvc (pointer, ~200 bytes) | HR-Employee-Attrition.csv (actual data, 150 MB) |
| .dvc/config (remote URL) | Hashed cache objects (deduplicated) |
| All your Python code and notebooks | Model artifacts, large binaries |
The Key Insight

DVC doesn't replace Git — it extends it. Git tracks your code and the pointer files. DVC uses those pointers to sync actual data with remote storage. The two work together: one git commit locks both the code and the dataset version.

How DVC Works Under the Hood

Understanding the internals helps you debug issues and design your pipeline correctly.

Pointer Files (.dvc files)

When you run dvc add mydata.csv, DVC creates mydata.csv.dvc alongside it. This tiny YAML file records the MD5 checksum, file size, and path. Git tracks this file — not the data itself. When someone clones the repo, they get the pointer. Running dvc pull then fetches the matching data from remote storage.

yaml
# HR-Employee-Attrition.csv.dvc — what DVC commits to Git
outs:
- md5: a3f8c2d1e4b7...
  size: 157286400
  path: HR-Employee-Attrition.csv

Local Cache (.dvc/cache)

DVC keeps a local cache at .dvc/cache/ in your project root. Files are stored by their MD5 hash (e.g., .dvc/cache/a3/f8c2d1e4b7...). When you run dvc add, the file is copied into the cache and the working copy is linked back to it (a plain copy by default; reflink, hardlink, or symlink depending on your cache.type setting). This means DVC can restore any version without re-downloading it, as long as it's in the local cache.
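The cache layout described above can be sketched in a few lines of Python. This is an illustrative sketch, not DVC's actual implementation: it streams a file through MD5 and derives the cache location from the digest, mirroring the .dvc/cache/a3/f8... layout shown above (the function name and chunk size are assumptions).

```python
import hashlib
from pathlib import Path

def dvc_cache_path(file_path: str, cache_root: str = ".dvc/cache") -> Path:
    """Illustrative sketch: where a DVC-style cache would store this file.

    Objects are keyed by MD5 checksum and stored under
    <cache_root>/<first 2 hex chars>/<remaining 30 hex chars>.
    """
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        # Stream in 1 MB chunks so large datasets don't need to fit in memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    return Path(cache_root) / digest[:2] / digest[2:]
```

Because the path depends only on the file's content, re-adding an unchanged file maps to the same cache object, which is what makes restores and deduplication cheap.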

Remote Storage (MinIO)

The remote is where the cache is backed up for sharing. You configure one remote per project (or multiple if needed). When you run dvc push, DVC uploads any locally cached objects not yet in MinIO. When a teammate runs dvc pull, DVC downloads only the objects missing from their local cache. Storage is deduplicated at the file level: a file with a given checksum is stored once, no matter how many datasets or versions reference it.
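The push/pull logic above boils down to a set difference over object hashes. A minimal sketch (not DVC's real code; the function name and sample hashes are made up for illustration):

```python
def objects_to_push(local_cache: set, remote_store: set) -> set:
    """Objects present in the local cache but missing from the remote.

    A push transfers exactly this set; a pull computes the reverse
    (objects referenced by pointer files but absent from the local cache).
    """
    return local_cache - remote_store

# Example: two locally cached object hashes, one already on the remote,
# so only the second needs uploading
local = {"a3f8c2d1", "b9e1d402"}
remote = {"a3f8c2d1"}
assert objects_to_push(local, remote) == {"b9e1d402"}
```

This is why repeated pushes after small changes are fast: most hashes already exist on both sides, so the difference is small.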

Deduplication via MD5

Every file version gets a unique MD5 checksum, and identical files are stored only once. The limitation: DVC operates at the file level, not the block level. If you update a 1 GB dataset and only 10 MB changed, the file's checksum changes, so DVC stores the full new 1 GB file alongside the old one. For datasets that frequently change in small ways, consider partitioning them into smaller files so that unchanged partitions keep their checksums and deduplicate across versions.
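The partitioning advice can be sketched as follows: split a large CSV into fixed-size shards so that an update touching a few rows rewrites only one shard, while the unchanged shards keep their old MD5s and deduplicate for free. The shard size and part-NNNN naming here are assumptions for illustration, not a DVC convention.

```python
import csv
from pathlib import Path

def shard_csv(src: str, out_dir: str, rows_per_shard: int = 100_000) -> list:
    """Split src into shard files of at most rows_per_shard data rows each.

    Each shard repeats the header so it is independently readable. Shards
    whose rows don't change keep identical bytes across dataset versions,
    so file-level MD5 deduplication stores them only once.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows, idx = [], 0
        for row in reader:
            rows.append(row)
            if len(rows) == rows_per_shard:
                shards.append(_write_shard(out, idx, header, rows))
                rows, idx = [], idx + 1
        if rows:  # final partial shard
            shards.append(_write_shard(out, idx, header, rows))
    return shards

def _write_shard(out: Path, idx: int, header, rows) -> Path:
    path = out / f"part-{idx:04d}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

After sharding, you would run dvc add on the shard directory instead of the monolithic CSV; DVC versions the directory as one unit but caches each shard as a separate object.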

⚠️ Add .dvc/cache to .gitignore

The local cache can grow large. DVC adds it to .gitignore automatically, but double-check this if you see unexpected large files staged in Git. The cache directory should never be committed.

Where DVC Runs

DVC isn't just a local developer tool — it integrates into your automated pipeline at multiple points.

In Airflow (Automated Versioning)

Your ETL DAG produces a fresh CSV every time it runs. Rather than overwriting the file silently, the DAG can call DVC to version it automatically. The Airflow worker runs dvc add on the output CSV, then pushes to MinIO and commits the updated pointer to Git. Every pipeline run produces a versioned, reproducible dataset snapshot.

python
# Snippet from an Airflow DAG task
def version_dataset(**context):
    import subprocess
    # Assumes the task runs with the Git repo root as its working directory;
    # otherwise pass cwd=... to each subprocess.run call below.
    run_date = context['ds']  # e.g. "2024-03-15"
    csv_path = f"/opt/airflow/datasets/HR-Employee-Attrition_{run_date}.csv"

    # Add to DVC (updates the .dvc pointer file)
    subprocess.run(["dvc", "add", csv_path], check=True)

    # Push data to MinIO
    subprocess.run(["dvc", "push"], check=True)

    # Commit the updated pointer to Git. Note: if the dataset is unchanged,
    # `git commit` exits non-zero (nothing to commit) and check=True raises.
    subprocess.run(["git", "add", f"{csv_path}.dvc"], check=True)
    subprocess.run(["git", "commit", "-m", f"chore: version dataset {run_date}"], check=True)
    subprocess.run(["git", "push"], check=True)

On Local Machines (Data Scientists)

A data scientist clones the repo and wants the exact dataset used in a particular training run. They check out the relevant Git commit and run dvc pull. DVC reads the pointer file, looks up the MD5 in MinIO, and downloads the exact file version. They're guaranteed to have the same data the original training run used — no matter when or where they run this command.

bash
# Reproduce the exact dataset from any past training run
git checkout abc1234          # check out the commit from 3 months ago
dvc pull                      # downloads the matching dataset from MinIO
python train.py               # trains on the historically-accurate data

DevOps Responsibilities

As a DevOps engineer, your role in the DVC setup covers three areas:

  • MinIO bucket provisioning — the dvc bucket is created automatically by the MinIO mc sidecar on first deploy
  • MinIO credentials — Airflow workers need MINIO_ACCESS_KEY / MINIO_SECRET_KEY (from minio/.env) and MLFLOW_S3_ENDPOINT_URL=https://s3.ops4life.com in their environment
  • DVC CLI in the worker image — the Airflow worker Docker image needs dvc and dvc-s3 installed, with the MinIO credentials and endpoint available at runtime
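Since a missing credential only surfaces when dvc push fails mid-DAG, it can pay to fail fast. A sketch of a preflight check the Airflow task could run before calling DVC — the AWS_* names are what DVC's S3 backend reads (as in Step 5 below, where MinIO credentials are exported under those names); adapt to your deployment:

```python
import os

# DVC's S3 backend reads the standard AWS credential variable names;
# in this setup they are populated from MINIO_ACCESS_KEY / MINIO_SECRET_KEY.
REQUIRED_ENV = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_dvc_env(env=os.environ) -> list:
    """Names of required credential variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]
```

Calling missing_dvc_env() at the top of the versioning task and raising if the list is non-empty turns a cryptic mid-push S3 error into an immediate, readable failure.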

Hands-On: Version a Dataset with DVC

This walkthrough takes the IBM HR Analytics dataset from Kaggle used across the MLOps pipeline guides and puts it under DVC version control. The dvc bucket in MinIO is pre-created by the MinIO deploy — no manual setup needed.

Step 1: Prerequisites

Before starting, you need:

  • MinIO deployed and running — credentials in minio/.env (MINIO_ACCESS_KEY / MINIO_SECRET_KEY)
  • The mlops-get-started repository cloned locally

Step 2: Pull the Latest Repository Changes

bash
git pull origin main

Step 3: Install DVC

Install DVC along with its S3 extension. The dvc-s3 package handles the S3-compatible protocol that MinIO speaks — without it, DVC won't be able to push or pull data.

bash
pip install dvc dvc-s3

Step 4: Initialize DVC

Run this from the root of your Git repository. DVC creates a .dvc/ folder (similar to .git/) and adds it to the repo. The .dvc/config file holds your remote configuration. Commit this initialization to Git so everyone on the team shares the same setup.

bash
dvc init
git add .dvc/
git commit -m "chore: initialize DVC"

Step 5: Add Your MinIO Remote

Configure MinIO as the default remote storage. The -d flag marks this as the default remote. The endpointurl tells DVC to use the MinIO S3-compatible API instead of AWS.

bash
dvc remote add -d ml-dataset s3://dvc/store
dvc remote modify ml-dataset endpointurl https://s3.ops4life.com
export AWS_ACCESS_KEY_ID=<MINIO_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<MINIO_SECRET_KEY>
git add .dvc/config
git commit -m "chore: configure MinIO DVC remote"

Step 6: Stop Tracking the Dataset in Git

If the CSV is already tracked by Git, remove it from the index first. This does not delete the file from disk — it only tells Git to stop tracking it. The .gitignore entry DVC creates will prevent it from being accidentally re-added.

bash
git rm -r --cached datasets/HR-Employee-Attrition.csv

Step 7: Add the Dataset to DVC

This is the core DVC command. It computes the MD5 checksum of the file, copies it into the local cache (.dvc/cache/), creates the .dvc pointer file, and adds the data file to .gitignore so Git ignores it going forward.

bash
dvc add datasets/HR-Employee-Attrition.csv

# DVC creates these two files:
# datasets/HR-Employee-Attrition.csv.dvc  ← track this in Git
# datasets/.gitignore                      ← auto-generated

Step 8: Push the Dataset to MinIO

Upload the locally cached data to the MinIO remote. DVC only uploads objects that aren't already in the remote — subsequent pushes after small dataset changes are fast.

bash
dvc push
# Uploading to s3://dvc/store (via https://s3.ops4life.com)
# 1 file pushed

Step 9: Commit the Pointer to Git

Commit the .dvc pointer file and the auto-generated .gitignore. This Git commit is the versioning anchor — anyone who checks out this commit and runs dvc pull will get exactly the dataset you just pushed.

bash
git add datasets/HR-Employee-Attrition.csv.dvc
git add datasets/.gitignore
git commit -m "feat: track HR-Employee-Attrition dataset with DVC"
git push

Verify it works

Delete the local file and cache (rm datasets/HR-Employee-Attrition.csv && rm -rf .dvc/cache), then run dvc pull. DVC should download the file from MinIO and restore it to its original location. If it does, your setup is correct.

Congratulations — you've completed the full MLOps learning path. You now understand the complete lifecycle: from data ingestion and preparation, through model training and deployment with KServe, Kubernetes GPU workloads, production monitoring with drift detection, and finally dataset versioning with DVC to close the reproducibility loop.