Airflow + DVC Pipeline on Kubernetes
In the DVC guide you versioned a dataset manually — running dvc add and dvc push from your laptop to understand how it works. In real setups, ETL pipelines run on a schedule and data versioning happens automatically with no manual steps. This guide integrates DVC into an Airflow DAG running on Kubernetes, exactly the way production teams do it.
By the end, you will understand how Airflow DAGs work, how the KubernetesExecutor spawns pods per task, and how DVC commits a versioned dataset pointer back to Git — all without a single manual step.
What is Apache Airflow?
Apache Airflow is an open-source platform for building, scheduling, and monitoring complex workflows and data pipelines. Over 80,000 organizations run Airflow, with more than 30% using it for MLOps workloads. Apache Airflow 3 is a major redesign that expands its capabilities to support AI/ML and near-real-time data workloads.
Quick Airflow Primer
A data pipeline is a series of steps: collect data, clean it, transform it, and store it — the ETL pattern. Each individual step is called a Task. Tasks don't run in isolation; they run in a specific order. You can't clean data before collecting it, and you can't store data before transforming it.
This ordering is defined by a DAG — Directed Acyclic Graph. A DAG is your entire pipeline written in Python. It describes all the tasks and the order in which they run.
| Term | What it means |
|---|---|
| Directed | Tasks flow in one direction: collect → clean → transform → store |
| Acyclic | No cycles. Once a task completes, the workflow never goes back to a previous step |
| Graph | The pipeline is a graph: each task is a node, each dependency is an edge |
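The "directed acyclic" property is exactly what lets a scheduler compute a valid run order. A quick stdlib illustration (plain Python's `graphlib`, not Airflow itself) of the collect → clean → transform → store pipeline:

```python
# Pure-Python illustration of DAG ordering (stdlib graphlib, not Airflow)
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on
pipeline = {
    "clean": {"collect"},
    "transform": {"clean"},
    "store": {"transform"},
}

# A topological sort yields an order where every dependency runs first
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['collect', 'clean', 'transform', 'store']
```

If the graph contained a cycle, `TopologicalSorter` would raise a `CycleError` instead of producing an order, which is the same guarantee the "acyclic" rule gives an Airflow DAG.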
If you've used Terraform, you've already seen a DAG in action: Terraform builds a dependency graph every time you run terraform apply — that's how it knows to create the VPC before the subnet. If you've worked with CI/CD pipelines, a DAG applies the same model to data: one failure stops everything downstream.
Airflow Executors
Where does the Python DAG code actually run? That's what the Executor decides. Just as Jenkins uses agents and GitHub Actions uses runners to execute CI/CD jobs, Airflow uses Executors to decide how and where tasks run.
For Kubernetes deployments, the KubernetesExecutor is the right choice. When a DAG is triggered, the KubernetesExecutor spins up a fresh pod for every task. The pod runs the task, then gets deleted. This means:
- Tasks are isolated — one task's failure doesn't affect others
- Each task can have its own resource requests (CPU, memory, GPU)
- The cluster scales automatically: no tasks, no pods consuming resources
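Those per-task resource requests use standard Kubernetes pod-spec fields. A sketch of what a single heavy task's pod override could request (the exact wiring into a task, e.g. via an executor pod override, varies by Airflow version):

```yaml
# Per-task pod override (sketch): one memory-hungry task, others stay small
spec:
  containers:
    - name: base
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          memory: 4Gi
```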
How Airflow, DVC, and S3 Work Together
Here is the full architecture of the automated pipeline:
- Developers push DAG files to GitHub through CI/CD. The entire pipeline code, including DVC tasks, lives in Git.
- GitSync keeps Airflow in sync. The DAG Processor runs a GitSync sidecar that continuously watches the repository. When a new DAG is pushed, Airflow picks it up automatically — no restart needed.
- The Scheduler reads the DAG and decides what tasks should run and when (scheduled or on-demand).
- The KubernetesExecutor creates a separate pod per task. Pods are short-lived: they start, execute the task, then are deleted.
- Worker pods use Pod Identity mapped to an IAM Role, allowing secure S3 access without storing any credentials in the pod or the DAG code.
- DVC pushes the dataset to S3. After the final dataset is produced, the worker runs `dvc push` to upload it.
- DVC commits the pointer file to GitHub. The updated `.dvc` file is committed and pushed back — so every pipeline run is traceable in Git history.
Custom Airflow Worker Image
A default Airflow worker pod on Kubernetes has none of the tools required for DVC versioning. You need a custom Docker image that includes:
- DVC installed and baked in, along with your pipeline's Python dependencies
- Git installed and configured so the pod can run `git add`, `git commit`, and `git push` from inside the DAG
This is the same pattern as configuring a CI/CD runner for specific tasks. The pre-built image airflow-dvc-worker:v2.0.0 in the MLOps repository already includes everything needed.
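A minimal sketch of such an image (a hypothetical Dockerfile; the actual `airflow-dvc-worker:v2.0.0` image in the MLOps repo may differ in base tag and dependencies):

```dockerfile
# Hypothetical sketch of a DVC-capable Airflow worker image
FROM apache/airflow:3.0.0

# Git is needed so tasks can clone, commit, and push from inside the pod
USER root
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# DVC with the S3 remote extra, plus the pipeline's Python dependencies
USER airflow
RUN pip install --no-cache-dir "dvc[s3]" pandas
```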
The DVC Pipeline DAG
The dvc-pipeline.py DAG (at platform-tools/airflow/dags/ in the MLOps repo) has three tasks, each running in its own pod via KubernetesPodOperator. All pods share a Persistent Volume Claim (PVC) mounted at /shared so data flows between tasks.
| Task | What it does |
|---|---|
| `pull_data` | Clones the GitHub repo into `/shared/repo` using credentials from a Kubernetes secret, then runs `dvc pull` to fetch the current dataset from S3 |
| `modify_data` | Reads the CSV from the shared PVC, appends one new randomly generated record (simulating an ETL step), and writes the updated file back to the same path |
| `push_data` | Runs `dvc add` on the updated CSV, `dvc push` to upload it to S3, stages the updated `.dvc` pointer, and commits + pushes to GitHub |
Note that this demo pipeline starts with dvc pull to start with existing versioned data. In a real production pipeline, Task 1 would be a series of tasks that collect raw data and run ETL jobs. Only the final DVC push step (Task 3) stays the same.
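The middle task's script is not shown in the repo excerpt here; a minimal, stdlib-only sketch of what `/scripts/modify_data.py` could look like (hypothetical, the real script's column logic will differ):

```python
# Hypothetical sketch of /scripts/modify_data.py: append one synthetic record
import csv
import random

CSV_PATH = "/shared/repo/datasets/employee_attrition.csv"


def append_record(path):
    """Append one randomly generated row matching the existing header width."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    # Synthetic values only; a real ETL step would compute meaningful fields
    row = [random.randint(0, 100) for _ in header]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)


if __name__ == "__main__":
    append_record(CSV_PATH)
```

Because the file lives on the shared PVC, the appended row written here is exactly what `push_data` versions in the next task.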
# Simplified view of the three-task DAG structure
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

IMAGE = "ops4life/airflow-dvc-worker:v2.0.0"
PVC = "airflow-shared-pvc"

# Shared PVC volume so the dataset flows between the task pods
shared_volume = k8s.V1Volume(
    name="shared",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name=PVC),
)
shared_mount = k8s.V1VolumeMount(name="shared", mount_path="/shared")

# Inject the git-credentials Kubernetes secret as environment variables
git_credentials_secret = k8s.V1EnvFromSource(
    secret_ref=k8s.V1SecretEnvSource(name="git-credentials")
)

with DAG("dataset_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    pull_data = KubernetesPodOperator(
        task_id="pull_data",
        image=IMAGE,
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
        env_from=[git_credentials_secret],
        cmds=["bash", "-c"],
        arguments=["git clone $GIT_REPO /shared/repo && cd /shared/repo && dvc pull"],
    )
    modify_data = KubernetesPodOperator(
        task_id="modify_data",
        image=IMAGE,
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
        cmds=["python", "/scripts/modify_data.py"],
    )
    push_data = KubernetesPodOperator(
        task_id="push_data",
        image=IMAGE,
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
        env_from=[git_credentials_secret],
        cmds=["bash", "-c"],
        arguments=[
            "cd /shared/repo && dvc add datasets/employee_attrition.csv"
            " && dvc push && git add . && git commit -m 'chore: update dataset'"
            " && git push"
        ],
    )

    pull_data >> modify_data >> push_data
Hands-On: Deploy on Kubernetes
This walkthrough deploys Airflow on EKS with the KubernetesExecutor and runs the DVC pipeline DAG. The same steps apply to any Kubernetes cluster — adjust provider-specific config (eksctl, IAM) to match your environment.
Prerequisite: the DAG's first task runs dvc pull to fetch an existing versioned dataset from S3. You need the DVC configuration and the initial dataset already in your S3 bucket.
Step 1: Pull Latest Repository Changes
Pull the latest MLOps repository changes to get the DAG, Helm values, and EKS configuration files:
git pull origin main
Step 2: Update the DAG with Your Repository Details
Open platform-tools/airflow/dags/dvc-pipeline.py and replace the GIT_REPO value with your forked repository URL:
GIT_REPO = "https://github.com/<your-username>/mlops-get-started.git"
Push the change to GitHub:
git add platform-tools/airflow/dags/dvc-pipeline.py
git commit -m "chore: set git repo for airflow dvc dag"
git push origin main
Step 3: Deploy an EKS Cluster
An eksctl cluster configuration file is included in the repository at platform-tools/eks/mlops-cluster.yaml. Update the VPC, subnets, region, and public key details to match your environment, then deploy:
cd platform-tools/eks
eksctl create cluster -f mlops-cluster.yaml
The cluster configuration enables the Pod Identity add-on, which allows Airflow worker pods to access S3 via IAM roles without storing credentials.
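In eksctl terms, enabling that add-on is a short entry in the cluster config (a fragment following the eksctl schema; the full file in the repo contains the rest of the cluster definition):

```yaml
# mlops-cluster.yaml (fragment): install the Pod Identity agent add-on
addons:
  - name: eks-pod-identity-agent
```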
Step 4: Enable S3 Access for Airflow Worker Pods
Run the setup script to create an IAM role (Airflow-DVCS3Role) and associate it with the Airflow worker service account via Pod Identity:
# Update AWS_REGION and DVC_BUCKET in script.sh first
chmod +x script.sh
./script.sh create
After this step, any pod associated with the Airflow worker service account can read and write to your DVC S3 bucket without any credentials in the pod spec or environment variables.
Step 5: Create the Git Credentials Secret
The push_data task needs to commit and push the updated .dvc pointer file to GitHub. Create a Kubernetes secret with your GitHub credentials:
kubectl -n airflow create secret generic git-credentials \
  --from-literal=GIT_SYNC_USERNAME=<your-github-username> \
  --from-literal=GIT_SYNC_PASSWORD=<your-github-token> \
  --from-literal=GITSYNC_USERNAME=<your-github-username> \
  --from-literal=GITSYNC_PASSWORD=<your-github-token>
The GitHub token must be a personal access token with the repo scope to push commits.
Step 6: Create the Shared PVC
All three tasks share a Persistent Volume Claim to pass the dataset between pods. Apply the manifest:
cd platform-tools/airflow/helm
kubectl apply -f pvc.yaml
Step 7: Configure Helm Values
The Helm values file at platform-tools/airflow/helm/custom-values.yaml configures Airflow to use the KubernetesExecutor and sets up GitSync to pull DAGs from your repository. Update the GitSync repository URL to point to your fork:
# custom-values.yaml (relevant section)
executor: KubernetesExecutor

dags:
  gitSync:
    enabled: true
    repo: https://github.com/<your-username>/mlops-get-started.git
    branch: main
    subPath: platform-tools/airflow/dags
Step 8: Deploy Airflow with Helm
cd platform-tools/airflow/helm
helm repo add apache-airflow https://airflow.apache.org
helm install airflow apache-airflow/airflow \
  -n airflow --create-namespace \
  -f custom-values.yaml
Wait for all pods to be ready:
kubectl -n airflow get po
You should see the Airflow scheduler, DAG processor, and triggerer pods running, along with the GitSync sidecar that continuously syncs DAGs from your GitHub repository.
Step 9: Access the Airflow UI
kubectl -n airflow port-forward svc/airflow-api-server 8080:8080
Open http://localhost:8080. Default credentials: username admin, password admin. In production, use LDAP or OAuth with SSL/TLS to protect the web server.
Navigate to DAGs. The dataset_pipeline DAG should appear automatically — GitSync pulled it from your repository. Click the DAG to open the Graph View and see the three tasks in sequence.
Step 10: Trigger the DAG
Click the Trigger button, select single run, and confirm. The KubernetesExecutor creates a pod for each task as it becomes ready. Monitor progress in the Task Instance view.
Verify DVC Versioning
After the pipeline completes, Task 3 has committed and pushed the updated .dvc pointer file to GitHub. Before running any local DVC checks, pull the latest changes — otherwise your local .dvc file will point to the previous version:
git pull origin main
Inspect the updated DVC pointer to see the new MD5 hash:
cat phase-1-local-dev/datasets/employee_attrition.csv.dvc
# Expected output:
outs:
- md5: 5911ebf0fa91033fb323989b7c6d7fbc
  size: 9983309
  path: employee_attrition.csv
In S3, DVC stores data using the MD5 hash. Files are organized into folders named with the first two characters of the hash (e.g., 59/), with the file named after the remaining characters. To verify the object exists:
aws s3 ls s3://<your-dvc-bucket>/store/59/ --recursive
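That object key is derived directly from the hash in the pointer file. A quick illustration (assuming the `store/` prefix used by this guide's remote configuration):

```python
# Deriving the S3 object key from the MD5 in the .dvc pointer (illustration)
md5 = "5911ebf0fa91033fb323989b7c6d7fbc"  # hash from the .dvc file above

# DVC splits the hash: first two characters become the folder, the rest the file
key = f"store/{md5[:2]}/{md5[2:]}"
print(key)  # store/59/11ebf0fa91033fb323989b7c6d7fbc
```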
Any data scientist who wants to work with this exact dataset version can now run:
git pull origin main # get the latest .dvc pointer
dvc pull # download the matching dataset from S3
python train.py # train on the pipeline-versioned data
Clean Up
When you're done with the hands-on, tear down the EKS cluster to avoid unexpected charges:
# Remove Pod Identity association and IAM role first
./script.sh delete
# Delete the EKS cluster (takes 10-15 minutes)
eksctl delete cluster -f mlops-cluster.yaml
That is the payoff of this setup: anyone on the team can run git checkout <commit> && dvc pull to reproduce the exact dataset used in any training run — months or years later. The pipeline runs on schedule, versions every output, and leaves a complete audit trail in Git.