MLOps Step 1: Building a Dataset Pipeline
Garbage in, garbage out. Data scientists love to say this, and it's true — no amount of model sophistication overcomes bad data. The dataset pipeline is the foundation of your entire MLOps system, and it's where DevOps engineers can add enormous value from day one.
In this guide we'll build a dataset pipeline using an employee attrition dataset — a classic HR analytics problem where we predict whether an employee is likely to leave the company. We generate a synthetic 1,470-row dataset that mirrors the IBM HR Analytics schema inline with a fixed random seed, so the pipeline produces identical results on every run. We validate the schema with Pandera, orchestrate it with Airflow, and version it with DVC.
Data Ingestion
The ingestion step produces the raw employee attrition data and writes it as a CSV to a known location for downstream pipeline stages. Even for a source this simple, formalizing the step gives you a clear audit point: you know exactly what data entered your pipeline, when, and what its shape was.
What the ingestion step does
It generates a 1,470-row synthetic dataset inline using numpy with a fixed random seed, prints basic diagnostics — shape, head, column types — and writes the output as raw_ingested.csv. This explicit handoff between stages is what makes the pipeline debuggable: if validation fails downstream, you can always inspect what ingestion actually produced.
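As a rough illustration, the ingestion step might look like the sketch below. The column list is a small, assumed subset of the IBM HR Analytics schema, and the seed value, value ranges, attrition rate, and function names are all placeholders chosen for this example rather than the notebook's actual code.

```python
import numpy as np
import pandas as pd

N_ROWS = 1470
SEED = 42  # assumed value; any fixed seed makes the output repeatable


def generate_raw_dataset(n_rows: int = N_ROWS, seed: int = SEED) -> pd.DataFrame:
    """Build a synthetic attrition table with a few IBM-HR-style columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            # integers() excludes the upper bound, so Age is 22..60
            "Age": rng.integers(22, 61, size=n_rows),
            "JobLevel": rng.integers(1, 6, size=n_rows),         # ordinal 1..5
            "WorkLifeBalance": rng.integers(1, 5, size=n_rows),  # ordinal 1..4
            "MonthlyIncome": rng.integers(1000, 20000, size=n_rows),
            # assumed class balance, roughly one leaver in six
            "Attrition": rng.choice(["Yes", "No"], size=n_rows, p=[0.16, 0.84]),
        }
    )


def ingest(output_path: str = "raw_ingested.csv") -> pd.DataFrame:
    """Generate the data, print basic diagnostics, and write the handoff file."""
    df = generate_raw_dataset()
    print("shape:", df.shape)
    print(df.head())
    print(df.dtypes)
    df.to_csv(output_path, index=False)
    return df
```

Calling ingest() twice writes byte-identical files, because the generator is re-seeded on every call.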
Data Validation with Pandera
Pandera is a lightweight Python library for defining DataFrame schemas and running them as validation checks. Unlike Great Expectations, which requires a full context setup, Pandera schemas are just Python code — easy to read, version-control, and run anywhere including Colab.
What the schema enforces
The employee attrition schema checks that every critical column exists, has the right type, and only contains valid values. For example: Attrition must be "Yes" or "No", Age must be ≥ 22, and categorical columns like Job Level and Work-Life Balance must only contain their allowed ordinal values. Running this after every ingestion catches upstream data issues before they silently corrupt your model.
Passing lazy=True to the schema's validate() call collects all schema violations before raising an error, instead of stopping at the first failure. This gives you a complete picture of data quality issues in a single run, which is much more useful for debugging than one-at-a-time errors.
Orchestrating with Airflow
Apache Airflow is the standard orchestrator for batch data pipelines. A DAG (Directed Acyclic Graph) defines the sequence of tasks and their dependencies. For the employee attrition pipeline, an Airflow DAG chains ingestion → validation → feature engineering → model training as dependent tasks, retrying on failure and recording each run in the Airflow metadata database.
Data Versioning with DVC
DVC (Data Version Control) lets you version large datasets alongside your code without storing the actual bytes in git. Instead, git tracks a small .dvc metadata file that points to the actual data in S3 (or any other remote).
The key insight: git checkout + dvc pull gives you an exact snapshot of both the code AND data at any point in history. This is the foundation of reproducible ML.
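The pointer-file idea can be illustrated in a few lines of plain Python. This is a simplified sketch of the concept, not DVC's implementation: the function names are hypothetical, and the dictionary only loosely mirrors the kind of fields (a content hash, a size, a path) that a .dvc file records.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Return the small, git-friendly record that stands in for the data."""
    return {
        "path": path.name,
        "md5": file_md5(path),          # identifies the exact file contents
        "size": path.stat().st_size,    # size in bytes
    }
```

Because the hash changes whenever the contents change, committing the pointer pins the commit to one exact version of the data, while the bytes themselves live in remote storage.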
DVC also has a pipeline feature (dvc run / dvc repro) that tracks dependencies between pipeline stages. If input data changes, only the downstream stages that depend on it are re-run. Think of it like Make for data pipelines.
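The Make-like behaviour can be sketched in plain Python (3.9+) to show the mechanism. This is a toy model under stated assumptions, not how DVC is implemented: files live in an in-memory dict, the stage tuples and the repro function are hypothetical, and stages are assumed to be listed in dependency order.

```python
import hashlib
from typing import Callable

Files = dict[str, bytes]
# (stage name, dependency names, output name, function from dep bytes to output)
Stage = tuple[str, list[str], str, Callable[..., bytes]]


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def repro(stages: list[Stage], files: Files, lock: dict[str, dict[str, str]]) -> list[str]:
    """Re-run stages whose dependency hashes changed; return the names that ran."""
    ran = []
    for name, deps, out, fn in stages:
        current = {dep: digest(files[dep]) for dep in deps}
        # A stage is stale if its inputs differ from the last recorded run
        # or if its output has never been produced.
        if lock.get(name) != current or out not in files:
            files[out] = fn(*(files[dep] for dep in deps))
            lock[name] = current
            ran.append(name)
    return ran


stages: list[Stage] = [
    ("validate", ["raw.csv"], "validated.csv", lambda raw: raw.upper()),
    ("featurize", ["validated.csv"], "features.csv", lambda v: v + b"!"),
    ("report", ["notes.txt"], "report.txt", lambda n: n[::-1]),
]
files: Files = {"raw.csv": b"a", "notes.txt": b"n"}
lock: dict[str, dict[str, str]] = {}

first_run = repro(stages, files, lock)    # every stage runs once
second_run = repro(stages, files, lock)   # nothing changed, nothing runs
files["raw.csv"] = b"b"                   # simulate new input data
third_run = repro(stages, files, lock)    # only the raw.csv chain re-runs
```

In the third call the report stage is skipped because notes.txt is unchanged, which is the same selective re-execution the paragraph above describes.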
With ingestion, validation, and versioning in place, you're ready to move to the next stage: data preparation and feature engineering.