MLOps Pipeline

MLOps Step 1: Building a Dataset Pipeline

● Intermediate ⏱ 45 min read

Garbage in, garbage out. Data scientists love to say this, and it's true — no amount of model sophistication overcomes bad data. The dataset pipeline is the foundation of your entire MLOps system, and it's where DevOps engineers can add enormous value from day one.

In this guide we'll build a production-grade dataset pipeline that ingests raw data, validates it, stores it in a structured S3 data lake, orchestrates it with Airflow, and versions it with DVC.

Data Ingestion Patterns

Raw data arrives from many sources: databases, APIs, event streams, file uploads, partner feeds. Your ingestion layer needs to handle all of these reliably.

Batch vs Stream Ingestion

Batch ingestion runs on a schedule — hourly, daily, weekly. It's simpler to implement and debug, and works for most ML use cases where training happens periodically. Stream ingestion processes data in real time via Kafka or Kinesis. Use this when you need features that reflect the last few minutes of behavior (fraud detection, recommendation systems).

For most teams starting out: batch first, streaming only when you genuinely need sub-hour freshness.

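A batch ingestion job in this style can be sketched as a small, dependency-free function that serializes one day's records into a dated partition key. The `raw/` layout and the `events` source name are illustrative; in production the returned bytes would go to S3 via `boto3.client("s3").put_object(...)`:

```python
import datetime
import json

def partition_key(source: str, run_date: datetime.date) -> str:
    """Build the dated S3 key for one batch of a given source."""
    return f"raw/{source}/{run_date.isoformat()}/{source}.jsonl"

def ingest_batch(records: list[dict], source: str, run_date: datetime.date):
    """Serialize one day's records as newline-delimited JSON.

    Returns the (key, body) pair; in production you would upload the
    body to S3 with boto3 instead of returning it.
    """
    key = partition_key(source, run_date)
    body = "\n".join(json.dumps(r) for r in records).encode()
    return key, body

key, body = ingest_batch(
    [{"user_id": 1, "amount": 9.99}], "events", datetime.date(2024, 1, 1)
)
print(key)  # raw/events/2024-01-01/events.jsonl
```

Keeping the key-building logic in a pure function makes it trivial to unit-test your partitioning scheme without touching S3.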

S3 Data Lake Structure

Your S3 data lake should have a clear zone structure that makes it easy to understand the state of any dataset at any point in time:

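One common three-zone convention looks like this (bucket and dataset names are illustrative; bronze/silver/gold is another popular naming scheme for the same idea):

```text
s3://my-lake/
├── raw/                      # immutable, as-received data
│   └── events/2024-01-01/events.jsonl
├── validated/                # passed data-quality checks
│   └── events/2024-01-01/events.parquet
└── curated/                  # cleaned, joined, ready for training
    └── training_sets/churn_v3/
```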
📋 Immutability Principle: The raw/ zone is write-once, never modified. If you discover a bug in your ingestion code, you fix the code and re-run; you don't edit the raw files. This gives you a reliable audit trail and lets you re-process data as your understanding evolves.

Data Validation with Great Expectations

Great Expectations is a Python library that lets you define "expectations" about your data — essentially schema + quality tests — and run them automatically in your pipeline.


Orchestrating with Airflow

Apache Airflow is the standard orchestrator for batch data pipelines. A DAG (Directed Acyclic Graph) defines the sequence of tasks and their dependencies.


Data Versioning with DVC

DVC (Data Version Control) tracks large datasets in git without storing the actual bytes there. It commits a small .dvc metadata file that points to the actual data in S3 (or any other remote).

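A typical session might look like this. The bucket, remote name, and file paths are illustrative, and the commands assume an existing git repo with DVC installed and S3 credentials configured:

```shell
# One-time setup: initialize DVC and point it at an S3 remote
dvc init
dvc remote add -d storage s3://my-bucket/dvc-store

# Track a dataset: DVC caches the bytes and writes a small pointer file
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data"

# Push the data to S3, the pointer to git
dvc push
git push

# Later, on any machine: restore an exact code + data snapshot
git checkout <commit>
dvc pull
```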

The key insight: git checkout + dvc pull gives you an exact snapshot of both the code AND data at any point in history. This is the foundation of reproducible ML.

DVC Pipelines: DVC also has a pipeline feature (dvc run / dvc repro; newer DVC releases spell the first command dvc stage add) that tracks dependencies between pipeline stages. If input data changes, only the downstream stages that depend on it are re-run. Think of it like Make for data pipelines.
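As a sketch, a single-stage dvc.yaml might look like this (stage, script, and path names are illustrative):

```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - data/raw/events      # re-run only if these inputs change
      - prepare.py
    outs:
      - data/prepared/events.parquet
```

Running dvc repro then rebuilds only the stages whose declared deps have changed since the last run.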

With ingestion, validation, and versioning in place, you're ready to move to the next stage: data preparation and feature engineering.