MLOps Step 1: Building a Dataset Pipeline
Garbage in, garbage out. Data scientists love to say this, and it's true — no amount of model sophistication overcomes bad data. The dataset pipeline is the foundation of your entire MLOps system, and it's where DevOps engineers can add enormous value from day one.
In this guide we'll build a production-grade dataset pipeline that ingests raw data, validates it, stores it in a structured S3 data lake, orchestrates it with Airflow, and versions it with DVC.
Data Ingestion Patterns
Raw data arrives from many sources: databases, APIs, event streams, file uploads, partner feeds. Your ingestion layer needs to handle all of these reliably.
Batch vs Stream Ingestion
Batch ingestion runs on a schedule — hourly, daily, weekly. It's simpler to implement and debug, and works for most ML use cases where training happens periodically. Stream ingestion processes data in real time via Kafka or Kinesis. Use this when you need features that reflect the last few minutes of behavior (fraud detection, recommendation systems).
For most teams starting out: batch first, streaming only when you genuinely need sub-hour freshness.
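To make the batch pattern concrete, here is a minimal sketch of a daily ingestion job using only the Python standard library. It pulls one day's rows from a source database (SQLite here as a stand-in) and writes them to a date-partitioned path that mirrors an S3 key layout; the table and column names are illustrative.

```python
import csv
import sqlite3
from datetime import date
from pathlib import Path

def ingest_batch(db_path: str, out_root: str, run_date: date) -> Path:
    """Pull one day's rows from a source database and write them to a
    date-partitioned path, mirroring an s3://bucket/raw/... key layout."""
    out_dir = Path(out_root) / "raw" / "events" / f"dt={run_date.isoformat()}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-0000.csv"

    conn = sqlite3.connect(db_path)
    try:
        # Select only the partition for this run; re-running the job for a
        # given date overwrites that date's partition and nothing else.
        rows = conn.execute(
            "SELECT user_id, event_type, created_at FROM events "
            "WHERE date(created_at) = ?",
            (run_date.isoformat(),),
        ).fetchall()
    finally:
        conn.close()

    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "event_type", "created_at"])
        writer.writerows(rows)
    return out_file
```

Partitioning by run date is the key design choice: it makes each daily run idempotent, so a failed or buggy run can simply be replayed for its date.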
S3 Data Lake Structure
Your S3 data lake should have a clear zone structure that makes it easy to understand the state of any dataset at any point in time:
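One common layout follows a raw → validated → curated progression; the bucket, zone, and dataset names below are illustrative, not a standard:

```
s3://ml-data-lake/
├── raw/        # immutable source data, exactly as ingested
│   └── events/dt=2024-05-01/part-0000.csv
├── validated/  # raw data that has passed quality checks
├── curated/    # cleaned, joined, analysis-ready tables
└── features/   # model-ready feature sets
```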
The raw/ zone is write-once and never modified. If you discover a bug in your ingestion code, you fix the code and re-run — you don't edit the raw files. This gives you a reliable audit trail and lets you re-process data as your understanding evolves.
Data Validation with Great Expectations
Great Expectations is a Python library that lets you define "expectations" about your data — essentially schema + quality tests — and run them automatically in your pipeline.
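An expectation suite is just a named set of checks with parameters. As a sketch, Great Expectations' JSON suite format for two common checks looks like this (the suite name, column names, and value set are illustrative):

```json
{
  "expectation_suite_name": "events_raw",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "user_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {"column": "event_type", "value_set": ["click", "view", "purchase"]}
    }
  ]
}
```

Run the suite against each new batch in your pipeline and route failing batches to quarantine rather than letting them flow into the validated zone.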
Orchestrating with Airflow
Apache Airflow is the standard orchestrator for batch data pipelines. A DAG (Directed Acyclic Graph) defines the sequence of tasks and their dependencies.
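Here is a minimal DAG sketch wiring ingestion, validation, and publishing into a daily run. It assumes Airflow 2.4+ (the `schedule` parameter), and the `pipeline.tasks` module and its callables are hypothetical placeholders for your own code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables -- replace with your own ingestion/validation code.
from pipeline.tasks import ingest_batch, validate_dataset, publish_to_validated_zone

with DAG(
    dag_id="daily_dataset_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_batch)
    validate = PythonOperator(task_id="validate", python_callable=validate_dataset)
    publish = PythonOperator(task_id="publish", python_callable=publish_to_validated_zone)

    # Validation gates publishing: a failed check stops the batch
    # before it reaches the validated zone.
    ingest >> validate >> publish
```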
Data Versioning with DVC
DVC (Data Version Control) tracks large datasets in git without storing the actual bytes in git. It stores a small .dvc metadata file in git that points to the actual data in S3 (or any other remote).
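The day-to-day workflow is a handful of commands; the remote name and bucket path below are illustrative:

```shell
# One-time setup: initialize DVC and point it at an S3 remote
dvc init
dvc remote add -d storage s3://ml-data-lake/dvc-cache

# Track a dataset: DVC writes data/events.csv.dvc and gitignores the real file
dvc add data/events.csv
git add data/events.csv.dvc data/.gitignore
git commit -m "Track events dataset with DVC"

# Upload the actual bytes to the remote
dvc push
```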
The key insight: git checkout + dvc pull gives you an exact snapshot of both the code AND data at any point in history. This is the foundation of reproducible ML.
DVC also includes a pipeline feature (dvc run / dvc repro) that tracks dependencies between pipeline stages. If input data changes, only the downstream stages that depend on it are re-run. Think of it like Make for data pipelines.
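Stages are declared in a dvc.yaml file; a minimal two-stage sketch (script and file names illustrative) shows the Make-like deps/outs structure:

```yaml
stages:
  ingest:
    cmd: python ingest.py
    deps:
      - ingest.py
    outs:
      - data/raw/events.csv
  validate:
    cmd: python validate.py data/raw/events.csv
    deps:
      - validate.py
      - data/raw/events.csv
    outs:
      - data/validated/events.csv
```

Running dvc repro compares each stage's declared deps against its recorded checksums and re-executes only the stages whose inputs actually changed.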
With ingestion, validation, and versioning in place, you're ready to move to the next stage: data preparation and feature engineering.