MLOps Pipeline

MLOps Step 2: Data Preparation

Intermediate · 40 min read

Raw data almost never goes directly into a model. It's inconsistently encoded, missing derived signals, and not split into training and test sets. Data preparation is the process of transforming validated data into model-ready features — and doing it in a way that's reproducible, automated, and consistent between training and serving.

We'll use the employee attrition dataset throughout. The task is to predict whether an employee will leave (Attrition="Yes") based on HR attributes like job level, tenure, income, and satisfaction scores.

Feature Engineering

Feature engineering turns raw columns into signals the model can actually use. The employee attrition dataset has 20 columns that need encoding, binning, and derived feature construction before a model can train on them.

What the feature engineering step does

  • Target encoding: Maps Attrition from "No"/"Yes" to 0/1
  • Binary encoding: Maps two-valued columns like OverTime (Yes/No) and Gender to 0/1
  • Ordinal encoding: Maps ordered categories (Job Satisfaction: Low → 1, Medium → 2, High → 3, Very High → 4) to integers that preserve rank
  • Aggregate features: Combines Work-Life Balance, Job Satisfaction, and Employee Recognition into a single OverallSatisfaction score
  • Income binning: Converts raw Monthly Income into 5 annual income bands, reducing noise from outliers
  • Derived features: Computes PromotionStagnation, CareerVelocity, LongCommute, and StableManager — composite signals that capture patterns correlated with attrition
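The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the pipeline's actual code: the column names, the Gender labels, the reuse of the Job Satisfaction scale for the other two satisfaction columns, the income band edges, and the LongCommute rule (including its "Distance from Home" column and 30-unit threshold) are all assumptions made for the example.

```python
import pandas as pd

YES_NO = {"No": 0, "Yes": 1}
# The Job Satisfaction scale comes from the text; reusing it for the other
# two satisfaction columns is an assumption of this sketch.
LEVELS = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}
SATISFACTION_COLS = ["Job Satisfaction", "Work-Life Balance", "Employee Recognition"]
# Hypothetical annual income band edges (5 bands).
INCOME_EDGES = [0, 40_000, 60_000, 80_000, 100_000, float("inf")]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with encoded, aggregated, and binned features."""
    out = df.copy()

    # Target and binary encoding.
    out["Attrition"] = out["Attrition"].map(YES_NO)
    out["OverTime"] = out["OverTime"].map(YES_NO)
    out["Gender"] = out["Gender"].map({"Female": 0, "Male": 1})  # assumed labels

    # Ordinal encoding: integers that preserve rank.
    for col in SATISFACTION_COLS:
        out[col] = out[col].map(LEVELS)

    # Aggregate feature: the mean of the three scores (assumed definition).
    out["OverallSatisfaction"] = out[SATISFACTION_COLS].mean(axis=1)

    # Income binning: annualize, then cut into 5 bands coded 0..4.
    annual_income = out["Monthly Income"] * 12
    out["IncomeBand"] = pd.cut(
        annual_income, bins=INCOME_EDGES, labels=False, include_lowest=True
    )

    # Derived feature: a hypothetical rule standing in for the real definition.
    out["LongCommute"] = (out["Distance from Home"] > 30).astype(int)
    return out
```

Fixed band edges are used here instead of quantiles so that the same cut points apply to any row, whether it arrives in a training batch or at serving time.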

Train/Test Split

After feature engineering, the dataset is split into training and test sets using stratified sampling, which preserves the proportion of Attrition=1 records in both splits. The employee attrition dataset is imbalanced (~16% attrition), so stratification matters: without it, a random split can leave the two partitions with different attrition rates, especially on small datasets, which distorts evaluation.
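A stratified split can be sketched with scikit-learn's train_test_split. The helper name stratified_split, the 80/20 ratio, and the seed are illustrative choices rather than values taken from the pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df: pd.DataFrame, target: str = "Attrition",
                     test_size: float = 0.2, seed: int = 42):
    """Split features and target while preserving the class ratio in both parts."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the share of each class the same in train and test.
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
```

Calling X_train, X_test, y_train, y_test = stratified_split(features) on an encoded dataframe should give y_train.mean() and y_test.mean() that are nearly identical, which is a quick sanity check worth logging in the pipeline.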

⚠️ Data Leakage Warning: Never fit your preprocessor (scaler, imputer, encoder) on the validation or test set. If you calculate the mean for imputation using the test set, you've leaked test-set information into your model. Always call fit_transform on the training set only, then transform on the validation and test sets.
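As a small illustration of the rule, the sketch below fits an imputer and a scaler on toy training data and reuses the learned statistics on the test data. The arrays are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocessor = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Toy data: one numeric feature with missing values.
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [100.0]])

# Statistics (imputation mean, scaling mean and std) come from train only.
X_train_prepared = preprocessor.fit_transform(X_train)

# The test set is transformed with those training statistics, never refit.
X_test_prepared = preprocessor.transform(X_test)

# The missing test value is filled with the training mean (3.0), not with
# anything derived from the test value 100.0.
print(preprocessor.named_steps["impute"].statistics_)
```

Wrapping the steps in a Pipeline makes the mistake harder to commit, because there is a single fit call to keep away from the test data.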

Handling Class Imbalance

Many real-world classification problems are imbalanced. In the employee attrition dataset, roughly 84% of employees have Attrition="No" and 16% have Attrition="Yes". A model that always predicts "No" gets 84% accuracy but has zero predictive value for the minority class — the one you actually care about.
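The arithmetic is easy to check by hand. The sketch below uses 100 hypothetical employees at the ratio quoted above and scores a "model" that always predicts "No".

```python
# 100 hypothetical employees: 16 leavers (1) and 84 stayers (0).
y_true = [1] * 16 + [0] * 84
y_pred = [0] * len(y_true)  # always predict "No"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# accuracy=0.84, recall=0.00
```

Recall on the minority class is zero: the model never identifies a single employee who leaves, despite the respectable-looking accuracy.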

Common strategies:

  • class_weight="balanced" — Tell scikit-learn to weight the minority class more heavily during training. Easy win, try this first.
  • SMOTE — Synthetic Minority Over-sampling Technique. Generates synthetic minority class examples.
  • Threshold tuning — Instead of defaulting to 0.5, adjust the decision threshold to balance precision/recall for your use case.
  • Use better metrics — Don't use accuracy for imbalanced data. Use F1, precision-recall AUC, or ROC AUC.
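Three of these strategies can be combined in a few lines. The sketch below runs on synthetic data from make_classification with roughly the same 84/16 imbalance, so the numbers it produces say nothing about the real attrition dataset. SMOTE is left out because it lives in the separate imbalanced-learn package; if you use it, resample the training set only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced dataset (about 16% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Strategy 1: reweight classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Strategy 2: a threshold-free metric that summarizes the precision-recall curve.
pr_auc = average_precision_score(y_val, proba)

# Strategy 3: choose the threshold that maximizes F1 on the validation set.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
# precision and recall have one more entry than thresholds; drop the last.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

y_pred = (proba >= best_threshold).astype(int)
print(f"PR AUC={pr_auc:.3f}, threshold={best_threshold:.3f}, "
      f"F1={f1_score(y_val, y_pred):.3f}")
```

The threshold is tuned on a validation split rather than the test set, for the same reason preprocessors are never fit on test data. Maximizing F1 is only one possible criterion; a use case that tolerates false alarms might instead fix a minimum recall and take the best precision available.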

Temporal Splits vs Random Splits

How you split data matters. Random splitting is the default and works for the employee attrition dataset because it's cross-sectional — each row is an employee snapshot, not a time-ordered event. For behavioral data where you're predicting future behavior from past actions (fraud detection, churn, recommendation), you must split by time — never randomly. Random splitting on temporal data causes leakage because you'd be training on future events to predict past ones.
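A time-based split is just a cutoff on a timestamp column. The sketch below uses a hypothetical events table; the event_time column name, the amounts, and the cutoff date are invented for the example.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, time_col: str, cutoff: str):
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff_ts]
    test = df[df[time_col] >= cutoff_ts]
    return train, test


# Hypothetical time-ordered events.
events = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-15", "2024-04-20"]
    ),
    "amount": [120.0, 75.5, 310.0, 42.0],
})

train, test = temporal_split(events, "event_time", "2024-03-01")

# Every training row precedes every test row, so no future data leaks in.
assert train["event_time"].max() < test["event_time"].min()
```

Any features aggregated over time (rolling counts, averages) need the same discipline: compute them only from events that precede the row they describe.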

Feature Stores in Production

A feature store solves a nasty production problem: the features you compute in your offline training pipeline are often computed differently in your online serving code, causing subtle bugs (the "training-serving skew" problem). For the employee attrition pipeline, PromotionStagnation or income binning computed differently at serving time would cause silent mispredictions.

Feast is a widely used open-source feature store. It maintains a registry of feature definitions and provides both an offline store (for training) and an online store (for low-latency serving).

Start Without a Feature Store: Feature stores add significant complexity. If you're just getting started, use pandas for feature computation and ensure your training and serving code use identical logic, ideally the same Python function. Add a feature store when training-serving skew becomes a documented, recurring problem.
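One way to share logic without a feature store is to route both the batch path and the single-record path through one function. The sketch below does this for income binning; the band edges, the "Monthly Income" column name, and the function names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical annual income band edges, defined once and imported by both paths.
INCOME_EDGES = [0, 40_000, 60_000, 80_000, 100_000, float("inf")]


def add_income_band(df: pd.DataFrame) -> pd.DataFrame:
    """Single source of truth for the IncomeBand feature."""
    out = df.copy()
    out["IncomeBand"] = pd.cut(
        out["Monthly Income"] * 12,
        bins=INCOME_EDGES,
        labels=False,
        include_lowest=True,
    )
    return out


def training_features(df: pd.DataFrame) -> pd.DataFrame:
    """Offline path: a whole dataframe at once."""
    return add_income_band(df)


def serving_features(record: dict) -> dict:
    """Online path: one request, wrapped in a one-row dataframe."""
    row = add_income_band(pd.DataFrame([record]))
    return row.iloc[0].to_dict()
```

Because serving_features calls the same add_income_band as training, a change to the band edges reaches both paths at once. The one-row dataframe costs some latency per request, which is usually acceptable until serving volume grows.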