MLOps Pipeline

Airflow + DVC Pipeline on Kubernetes

● Advanced ⏱ 45 min read

In the DVC guide you versioned a dataset manually — running dvc add and dvc push from your laptop to understand how it works. In real setups, ETL pipelines run on a schedule and data versioning happens automatically with no manual steps. This guide integrates DVC into an Airflow DAG running on Kubernetes, exactly the way production teams do it.

By the end, you will understand how Airflow DAGs work, how the KubernetesExecutor spawns pods per task, and how DVC commits a versioned dataset pointer back to Git — all without a single manual step.

What is Apache Airflow?

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows and data pipelines. Over 80,000 organizations run Airflow, with more than 30% using it for MLOps workloads. Apache Airflow 3 is a ground-up redesign that extends the platform to AI/ML and near-real-time data workloads.

Quick Airflow Primer

A data pipeline is a series of steps: collect data, clean it, transform it, and store it — the ETL pattern. Each individual step is called a Task. Tasks don't run in isolation; they run in a specific order. You can't clean data before collecting it, and you can't store data before transforming it.

This ordering is defined by a DAG — Directed Acyclic Graph. A DAG is your entire pipeline written in Python. It describes all the tasks and the order in which they run.

  • Directed: Tasks flow in one direction: collect → clean → transform → store
  • Acyclic: No cycles. Once a task completes, the workflow never goes back to a previous step
  • Graph: The pipeline is a graph: each task is a node, each dependency is an edge
📌
DevOps Insight Terraform already uses this concept internally. It builds a resource dependency graph (also a DAG) before running terraform apply — that's how it knows to create the VPC before the subnet. If you've worked with CI/CD pipelines, a DAG applies the same model to data: one failure stops everything downstream.
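These three properties boil down to a topological sort, which is exactly what a scheduler computes before running anything. A minimal stdlib sketch using Python's graphlib, with the ETL steps from above as task names:

```python
from graphlib import TopologicalSorter

# Each key lists the tasks it depends on (its incoming edges).
# This is the "Directed" and "Graph" part of DAG.
deps = {
    "clean": {"collect"},
    "transform": {"clean"},
    "store": {"transform"},
}

# static_order() raises CycleError if the graph is not acyclic —
# the same guarantee Airflow enforces when it parses a DAG.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['collect', 'clean', 'transform', 'store']
```

Because the chain is linear, the order is fully determined: collect runs first, store last, and a cycle would be rejected before anything executes.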

Airflow Executors

Where does the Python DAG code actually run? That's what the Executor decides. Just as Jenkins uses agents and GitHub Actions uses runners to execute CI/CD jobs, Airflow uses Executors to decide how and where tasks run.

For Kubernetes deployments, the KubernetesExecutor is the right choice. When a DAG is triggered, the KubernetesExecutor spins up a fresh pod for every task. The pod runs the task, then gets deleted. This means:

  • Tasks are isolated — one task's failure doesn't affect others
  • Each task can have its own resource requests (CPU, memory, GPU)
  • The cluster scales automatically: no tasks, no pods consuming resources
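A rough sketch of what "one pod per task" means at the Kubernetes level — a hypothetical helper that builds the kind of minimal pod manifest the executor submits. The image name comes from this guide; the field values are illustrative, not the executor's exact generated spec:

```python
def pod_spec(task_id: str, cpu: str, memory: str) -> dict:
    """Build a minimal one-shot pod manifest, one per task."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"airflow-task-{task_id}"},
        "spec": {
            "restartPolicy": "Never",  # run the task once, then the pod is deleted
            "containers": [{
                "name": "base",
                "image": "ops4life/airflow-dvc-worker:v2.0.0",
                # Per-task resources: a heavy transform can ask for more
                # than a lightweight pull task.
                "resources": {"requests": {"cpu": cpu, "memory": memory}},
            }],
        },
    }

spec = pod_spec("modify_data", cpu="2", memory="4Gi")
print(spec["spec"]["containers"][0]["resources"])
```

Because each task gets its own manifest, isolation and per-task sizing fall out for free: a failed pod affects only its own task, and idle pipelines consume no pods at all.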

How Airflow, DVC, and S3 Work Together

Here is the full architecture of the automated pipeline:

  1. Developers push DAG files to GitHub through CI/CD. The entire pipeline code, including DVC tasks, lives in Git.
  2. GitSync keeps Airflow in sync. The DAG Processor runs a GitSync sidecar that continuously watches the repository. When a new DAG is pushed, Airflow picks it up automatically — no restart needed.
  3. The Scheduler reads the DAG and decides what tasks should run and when (scheduled or on-demand).
  4. The KubernetesExecutor creates a separate pod per task. Pods are short-lived: they start, execute the task, then are deleted.
  5. Worker pods use Pod Identity mapped to an IAM Role, allowing secure S3 access without storing any credentials in the pod or the DAG code.
  6. DVC pushes the dataset to S3. After the final dataset is produced, the worker runs dvc push to upload it.
  7. DVC commits the pointer file to GitHub. The updated .dvc file is committed and pushed back — so every pipeline run is traceable in Git history.
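Steps 6 and 7 can be sketched in miniature: the dataset bytes go to remote storage keyed by content hash, while Git receives only a small pointer file. A stand-in using hashlib — real DVC writes the pointer as YAML via dvc add; the bytes and filename here are made up:

```python
import hashlib

# A toy "dataset" standing in for the real CSV produced by the pipeline.
dataset = b"emp_id,age,attrition\n1001,34,No\n"
md5 = hashlib.md5(dataset).hexdigest()

# The pointer mimics the shape of a real .dvc file: hash, size, path.
pointer = f"outs:\n- md5: {md5}\n  size: {len(dataset)}\n  path: employee_attrition.csv\n"
print(pointer)
```

This is why the pipeline is traceable: however large the dataset grows, the thing committed to Git stays a few lines long, and the hash pins the exact bytes in S3.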

Custom Airflow Worker Image

A default Airflow worker pod on Kubernetes has none of the tools required for DVC versioning. You need a custom Docker image that includes:

  • DVC installed and baked in, along with your pipeline's Python dependencies
  • Git installed and configured so the pod can run git add, git commit, and git push from inside the DAG

This is the same pattern as configuring a CI/CD runner for specific tasks. The pre-built image airflow-dvc-worker:v2.0.0 in the MLOps repository already includes everything needed.

The DVC Pipeline DAG

The dvc-pipeline.py DAG (at platform-tools/airflow/dags/ in the MLOps repo) has three tasks, each running in its own pod via KubernetesPodOperator. All pods share a Persistent Volume Claim (PVC) mounted at /shared so data flows between tasks.

  • pull_data: Clones the GitHub repo into /shared/repo using credentials from a Kubernetes secret, then runs dvc pull to fetch the current dataset from S3
  • modify_data: Reads the CSV from the shared PVC, appends one new randomly generated record (simulating an ETL step), and writes the updated file back to the same path
  • push_data: Runs dvc add on the updated CSV, dvc push to upload it to S3, stages the updated .dvc pointer, and commits and pushes to GitHub
📌
Key Insight In this hands-on DAG, Task 1 uses dvc pull to start with existing versioned data. In a real production pipeline, Task 1 would be a series of tasks that collect raw data and run ETL jobs. Only the final DVC push step (Task 3) stays the same.
python
# Simplified view of the three-task DAG structure
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

IMAGE = "ops4life/airflow-dvc-worker:v2.0.0"
PVC = "airflow-shared-pvc"

# Shared PVC volume, its mount point, and the git-credentials secret
# referenced by the tasks below
shared_volume = k8s.V1Volume(
    name="shared",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name=PVC),
)
shared_mount = k8s.V1VolumeMount(name="shared", mount_path="/shared")
git_credentials_secret = k8s.V1EnvFromSource(
    secret_ref=k8s.V1SecretEnvSource(name="git-credentials")
)
with DAG("dataset_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:

    pull_data = KubernetesPodOperator(
        task_id="pull_data",
        image=IMAGE,
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
        env_from=[git_credentials_secret],
        cmds=["bash", "-c"],
        arguments=["git clone $GIT_REPO /shared/repo && cd /shared/repo && dvc pull"],
    )

    modify_data = KubernetesPodOperator(
        task_id="modify_data",
        image=IMAGE,
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
        cmds=["python", "/scripts/modify_data.py"],
    )

    push_data = KubernetesPodOperator(
        task_id="push_data",
        image=IMAGE,
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
        env_from=[git_credentials_secret],
        cmds=["bash", "-c"],
        arguments=["cd /shared/repo && dvc add datasets/employee_attrition.csv && dvc push && git add . && git commit -m 'chore: update dataset' && git push"],
    )

    pull_data >> modify_data >> push_data
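The /scripts/modify_data.py script baked into the worker image is not shown above. A hypothetical sketch of what it might do, demonstrated here against a temp file standing in for the shared PVC path (the column names and record shape are assumptions):

```python
import csv
import os
import random
import tempfile

def append_random_record(path: str) -> None:
    """Append one synthetic row, mimicking the modify_data ETL step."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    record = [str(len(rows)), str(random.randint(22, 60)), random.choice(["Yes", "No"])]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows + [record])

# Demo against a temp file standing in for /shared/repo/datasets/...:
path = os.path.join(tempfile.mkdtemp(), "employee_attrition.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["emp_id", "age", "attrition"], ["1", "34", "No"]])

append_random_record(path)
with open(path, newline="") as f:
    n_rows = len(list(csv.reader(f)))
print(n_rows)  # 3: header + original row + appended row
```

The only contract that matters to the DAG is the file path: modify_data reads and writes the same location on the shared volume that pull_data populated and push_data will version.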

Hands-On: Deploy on Kubernetes

This walkthrough deploys Airflow on EKS with the KubernetesExecutor and runs the DVC pipeline DAG. The same steps apply to any Kubernetes cluster — adjust provider-specific config (eksctl, IAM) to match your environment.

⚠️
Prerequisite Complete the Versioning Data With DVC guide first. The first task in this DAG uses dvc pull to fetch an existing versioned dataset from S3. You need the DVC configuration and the initial dataset already in your S3 bucket.

Step 1: Pull Latest Repository Changes

Pull the latest MLOps repository changes to get the DAG, Helm values, and EKS configuration files:

bash
git pull origin main

Step 2: Update the DAG with Your Repository Details

Open platform-tools/airflow/dags/dvc-pipeline.py and replace the GIT_REPO value with your forked repository URL:

python
GIT_REPO = "https://github.com/<your-username>/mlops-get-started.git"

Push the change to GitHub:

bash
git add platform-tools/airflow/dags/dvc-pipeline.py
git commit -m "chore: set git repo for airflow dvc dag"
git push origin main

Step 3: Deploy an EKS Cluster

An eksctl cluster configuration file is included in the repository at platform-tools/eks/mlops-cluster.yaml. Update the VPC, subnets, region, and public key details to match your environment, then deploy:

bash
cd platform-tools/eks
eksctl create cluster -f mlops-cluster.yaml

The cluster configuration enables the Pod Identity add-on, which allows Airflow worker pods to access S3 via IAM roles without storing credentials.

Step 4: Enable S3 Access for Airflow Worker Pods

Run the setup script to create an IAM role (Airflow-DVCS3Role) and associate it with the Airflow worker service account via Pod Identity:

bash
# Update AWS_REGION and DVC_BUCKET in script.sh first
chmod +x script.sh
./script.sh create

After this step, any pod associated with the Airflow worker service account can read and write to your DVC S3 bucket without any credentials in the pod spec or environment variables.

Step 5: Create the Git Credentials Secret

The push_data task needs to commit and push the updated .dvc pointer file to GitHub. Create a Kubernetes secret with your GitHub credentials:

bash
kubectl -n airflow create secret generic git-credentials \
  --from-literal=GIT_SYNC_USERNAME=<your-github-username> \
  --from-literal=GIT_SYNC_PASSWORD=<your-github-token> \
  --from-literal=GITSYNC_USERNAME=<your-github-username> \
  --from-literal=GITSYNC_PASSWORD=<your-github-token>

Get your GitHub token from GitHub → Settings → Developer settings → Personal access tokens. The token needs the repo scope to push commits.

Step 6: Create the Shared PVC

All three tasks share a Persistent Volume Claim to pass the dataset between pods. Apply the manifest:

bash
cd platform-tools/airflow/helm
kubectl apply -f pvc.yaml
📌
Production Note The shared PVC approach works well for learning and tightly coupled pipelines. In production, replace it with Amazon S3 or Amazon EFS as the shared storage layer between tasks.

Step 7: Configure Helm Values

The Helm values file at platform-tools/airflow/helm/custom-values.yaml configures Airflow to use the KubernetesExecutor and sets up GitSync to pull DAGs from your repository. Update the GitSync repository URL to point to your fork:

yaml
# custom-values.yaml (relevant section)
executor: KubernetesExecutor

dags:
  gitSync:
    enabled: true
    repo: https://github.com/<your-username>/mlops-get-started.git
    branch: main
    subPath: platform-tools/airflow/dags

Step 8: Deploy Airflow with Helm

bash
cd platform-tools/airflow/helm
helm repo add apache-airflow https://airflow.apache.org
helm install airflow apache-airflow/airflow \
  -n airflow --create-namespace \
  -f custom-values.yaml

Wait for all pods to be ready:

bash
kubectl -n airflow get po

You should see the Airflow scheduler, DAG processor, and triggerer pods running, along with the GitSync sidecar that continuously syncs DAGs from your GitHub repository.

Step 9: Access the Airflow UI

bash
kubectl -n airflow port-forward svc/airflow-api-server 8080:8080

Open http://localhost:8080. Default credentials: username admin, password admin. In production, use LDAP or OAuth with SSL/TLS to protect the web server.

Navigate to DAGs. The dataset_pipeline DAG should appear automatically — GitSync pulled it from your repository. Click the DAG to open the Graph View and see the three tasks in sequence.

Step 10: Trigger the DAG

Click the Trigger button, choose a single run, and confirm. The KubernetesExecutor creates a pod for each task as it becomes ready. Monitor progress in the Task Instance view.

📌
Production Triggering Patterns In real MLOps pipelines, DAGs are triggered in three ways: scheduled (cron-based, e.g., nightly at 02:00), event-driven (e.g., on data arrival in S3), or API-triggered for ad-hoc runs from CI/CD pipelines via the Airflow REST API.
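The API-triggered pattern can be sketched with the stdlib: build an authenticated POST against the stable REST API's dagRuns endpoint. The path shown is the Airflow 2 stable API (Airflow 3 relocated the API, so adjust the prefix for your version); host and credentials are placeholders:

```python
import base64
import json
import urllib.request

def build_trigger_request(base_url: str, dag_id: str, user: str, password: str):
    """Build (but do not send) a POST that triggers a DAG run."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({"conf": {}}).encode()  # optional run-time parameters
    return urllib.request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

req = build_trigger_request("http://localhost:8080", "dataset_pipeline", "admin", "admin")
print(req.full_url)
# urllib.request.urlopen(req) would submit the run against a live server
```

A CI/CD job can fire the same request after a deploy, which is how ad-hoc pipeline runs are usually wired up.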

Verify DVC Versioning

After the pipeline completes, Task 3 has committed and pushed the updated .dvc pointer file to GitHub. Before running any local DVC checks, pull the latest changes — otherwise your local .dvc file will point to the previous version:

bash
git pull origin main

Inspect the updated DVC pointer to see the new MD5 hash:

bash
cat phase-1-local-dev/datasets/employee_attrition.csv.dvc

# Expected output:
outs:
- md5: 5911ebf0fa91033fb323989b7c6d7fbc
  size: 9983309
  path: employee_attrition.csv

In S3, DVC stores data using the MD5 hash. Files are organized into folders named with the first two characters of the hash (e.g., 59/), with the file named after the remaining characters. To verify the object exists:

bash
aws s3 ls s3://<your-dvc-bucket>/store/59/ --recursive
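The hash-to-key mapping described above can be reproduced in a few lines (illustrative bytes; store/ is the remote prefix used in this guide):

```python
import hashlib

# DVC derives the object key from the content hash: the first two hex
# characters become the "folder", the remaining thirty the object name.
data = b"example dataset bytes"
md5 = hashlib.md5(data).hexdigest()

key = f"store/{md5[:2]}/{md5[2:]}"
print(key)
```

Given the pointer file's md5 of 5911ebf0..., this is why the object lands under store/59/ in the bucket.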

Any data scientist who wants to work with this exact dataset version can now run:

bash
git pull origin main   # get the latest .dvc pointer
dvc pull               # download the matching dataset from S3
python train.py        # train on the pipeline-versioned data

Clean Up

When you're done with the hands-on, tear down the EKS cluster to avoid unexpected charges:

bash
# Remove Pod Identity association and IAM role first
./script.sh delete

# Delete the EKS cluster (takes 10-15 minutes)
eksctl delete cluster -f mlops-cluster.yaml

You now have a fully automated data versioning pipeline. Any data scientist can run git checkout <commit> && dvc pull to reproduce the exact dataset used in any training run — months or years later. The pipeline runs on schedule, versions every output, and leaves a complete audit trail in Git.