MLOps Pipeline

MLOps Step 2: Data Preparation

● Intermediate ⏱ 40 min read

Raw data almost never goes directly into a model. It's dirty, inconsistently formatted, and missing the derived features that would actually help the model learn. Data preparation is the process of transforming raw data into clean, model-ready features — and doing it in a way that's reproducible, automated, and consistent between training and serving.

Data Cleaning

Data cleaning is about removing or fixing records that would mislead the model. The specific steps depend on your data, but these are the most common issues:

Missing Values

You have three options when a value is missing: drop the row (if missing is rare and not systematic), impute with a statistic (median for numeric, mode for categorical), or add a missingness indicator feature (when "missing" itself is a signal). The worst option is to do nothing and let the model see NaN values — most models will fail or produce garbage.
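Here is a minimal sketch of the second and third options with pandas and scikit-learn (the toy values are made up for illustration). Note the fit-on-train-only pattern:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, hypothetical values for illustration only
train = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0]})
test = pd.DataFrame({"age": [np.nan, 52.0]})

# Option 3: add a missingness indicator BEFORE imputing,
# in case "missing" is itself a signal the model can use.
train["age_missing"] = train["age"].isna().astype(int)
test["age_missing"] = test["age"].isna().astype(int)

# Option 2: median imputation. Fit on train only, then transform
# val/test, so no test-set statistic leaks into the preprocessor.
imputer = SimpleImputer(strategy="median")
train["age"] = imputer.fit_transform(train[["age"]]).ravel()
test["age"] = imputer.transform(test[["age"]]).ravel()
```

The median used to fill the test set's NaN comes from the training data, which is exactly the behavior the leakage warning below the code is about.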

⚠️ Data Leakage Warning: Never fit your preprocessor (scaler, imputer, encoder) on the validation or test set. If you calculate the mean for imputation using the test set, you've leaked future information into your model. Always fit_transform on train only, then transform on val and test.

Feature Engineering

Feature engineering is creating new input variables from existing ones that better capture the patterns you want the model to learn. A domain expert with well-chosen features can often improve model performance more than switching to a fancier algorithm.

Common Feature Engineering Techniques

  • Aggregations: Count of events per user in last 7 days, average order value per category
  • Time-based features: Day of week, hour of day, days since last event, recency-frequency-monetary (RFM)
  • Ratios: Conversion rate (purchases / sessions), click-through rate (clicks / impressions)
  • Text features: TF-IDF, word count, sentiment score, topic models
  • Interaction features: Product of two features when their combination has meaning
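A sketch of a few of these techniques in pandas, using a hypothetical event log (column names like user_id and order_value are illustrative, not from a real schema):

```python
import pandas as pd

# Hypothetical event log
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-03 18:30",
        "2024-01-02 12:00", "2024-01-05 08:15", "2024-01-06 22:00",
    ]),
    "order_value": [20.0, 35.0, 10.0, 55.0, 15.0],
})

# Time-based features on each event
events["dow"] = events["ts"].dt.dayofweek
events["hour"] = events["ts"].dt.hour

# Aggregations per user: event count, average order value
features = events.groupby("user_id").agg(
    n_events=("ts", "count"),
    avg_order_value=("order_value", "mean"),
)

# Recency: days since the user's last event, relative to a cutoff date
cutoff = pd.Timestamp("2024-01-07")
features["days_since_last"] = (cutoff - events.groupby("user_id")["ts"].max()).dt.days
```

The resulting one-row-per-user table is what you would join against labels before training.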

Train/Val/Test Splits

How you split data matters enormously. Random splitting is the default but is wrong for time-series data.

Random Split (for non-temporal data)

Use sklearn.model_selection.train_test_split with stratify=y to maintain class balance across splits. Typical ratios: 70/15/15 or 80/10/10.
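A minimal sketch of a stratified 70/15/15 split on synthetic data. Two chained calls are needed because train_test_split only makes one cut at a time:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 negatives, 10 positives
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# 70/15/15: carve off the test set first, then split the remainder.
# stratify preserves the 90/10 class ratio in every split.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=15, stratify=y_tmp, random_state=42)
```

Passing integer sizes (rather than fractions of fractions) keeps the split counts exact.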

Temporal Split (for time-series data)

For behavioral data where you're predicting future behavior, you must split by time — never randomly. Random splitting causes data leakage because you'd be "predicting" the past with future data.
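A temporal split is just position-based slicing on time-sorted data. A sketch with toy daily data, using the same 70/15/15 ratios:

```python
import pandas as pd

# Hypothetical daily events, already sorted by timestamp
df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=20, freq="D"),
    "clicks": range(20),
})

# Oldest 70% -> train, next 15% -> val, newest 15% -> test
n = len(df)
train = df.iloc[: int(n * 0.70)]
val = df.iloc[int(n * 0.70): int(n * 0.85)]
test = df.iloc[int(n * 0.85):]

# Every val/test row is strictly later than every training row,
# so the model never trains on the "future" it is evaluated on.
assert train["ts"].max() < val["ts"].min() < test["ts"].min()
```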


Handling Class Imbalance

Many real-world classification problems are imbalanced: 99% of users don't churn, 0.1% of transactions are fraud. A model that always predicts the majority class gets 99% accuracy but is useless.

Common strategies:

  • class_weight="balanced" — Tell scikit-learn to weight the minority class more heavily. Easy win, try this first.
  • SMOTE — Synthetic Minority Over-sampling Technique. Generates synthetic minority class examples.
  • Threshold tuning — Instead of defaulting to 0.5, adjust the decision threshold to balance precision/recall for your use case.
  • Use better metrics — Don't use accuracy for imbalanced data. Use F1, precision-recall AUC, or ROC AUC.
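The class-weighting, threshold-tuning, and metric strategies can be sketched in a few lines of scikit-learn (the dataset here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Synthetic data: one informative feature, roughly 5-10% positives
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.6).astype(int)

# class_weight="balanced" reweights each class inversely to its frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Threshold tuning: predict_proba lets you pick a cutoff other than 0.5
proba = clf.predict_proba(X)[:, 1]
preds = (proba >= 0.3).astype(int)  # lower threshold favors recall

# Evaluate with F1 rather than accuracy
print(round(f1_score(y, preds), 3))
```

In practice you would pick the threshold on the validation set, sweeping it against a precision-recall curve rather than hardcoding a value.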

Feature Stores in Production

A feature store solves a nasty production problem: the features you compute in your offline training pipeline are often computed differently in your online serving code, causing subtle bugs (the "training-serving skew" problem).

Feast is the most widely used open-source feature store. It maintains a registry of feature definitions and provides both an offline store (for training) and an online store (for low-latency serving).
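As a rough sketch of what a Feast feature definition looks like (names like user_id and purchases_7d are hypothetical, and the exact API differs somewhat between Feast versions):

```python
# feature_repo/user_features.py -- illustrative Feast definitions
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

user = Entity(name="user", join_keys=["user_id"])

source = FileSource(
    path="data/user_stats.parquet",   # file backing the offline store
    timestamp_field="event_timestamp",
)

user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="purchases_7d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
    ],
    source=source,
)

# Training reads via store.get_historical_features(...) (offline store);
# serving reads via store.get_online_features(...) (online store).
```

Because both stores are populated from the same registered definition, training and serving see the same feature logic.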

Start Without a Feature Store: Feature stores add significant complexity. If you're just getting started, use pandas/SQL for feature computation and manually ensure your training code and serving code use identical logic. Add a feature store when training-serving skew becomes a documented, recurring problem.
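One lightweight way to follow that "identical logic" advice is to keep every feature computation in a single module that both the training pipeline and the serving code import (the function and field names below are illustrative, not from a real codebase):

```python
# features.py -- one function imported by BOTH the training pipeline
# and the online service, a minimal guard against training-serving skew.

def compute_features(user_events):
    """Derive model inputs from a user's raw event dicts."""
    n = len(user_events)
    total = sum(e["order_value"] for e in user_events)
    return {
        "n_events": n,
        "avg_order_value": total / n if n else 0.0,
    }

# Training job and serving endpoint call the identical code path:
row = compute_features([{"order_value": 20.0}, {"order_value": 30.0}])
print(row)  # {'n_events': 2, 'avg_order_value': 25.0}
```

As long as there is only one definition to drift, offline and online features cannot silently diverge.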