MLOps Step 2: Data Preparation
Raw data almost never goes directly into a model. It's inconsistently encoded, missing derived signals, and not split into training and test sets. Data preparation is the process of transforming validated data into model-ready features — and doing it in a way that's reproducible, automated, and consistent between training and serving.
We'll use the employee attrition dataset throughout: the task is to predict whether an employee will leave (Attrition="Yes") based on HR attributes like job level, tenure, income, and satisfaction scores.
Feature Engineering
Feature engineering turns raw columns into signals the model can actually use. The employee attrition dataset has 20 columns that need encoding, binning, and derived feature construction before a model can train on them.
What the feature engineering step does
- Target encoding: Maps Attrition from "No"/"Yes" to 0/1
- Binary encoding: Maps yes/no columns like OverTime and Gender to 0/1
- Ordinal encoding: Maps ordered categories (Job Satisfaction: Low → 1, Medium → 2, High → 3, Very High → 4) to integers that preserve rank
- Aggregate features: Combines Work-Life Balance, Job Satisfaction, and Employee Recognition into a single OverallSatisfaction score
- Income binning: Converts raw Monthly Income into 5 annual income bands, reducing noise from outliers
- Derived features: Computes PromotionStagnation, CareerVelocity, LongCommute, and StableManager, composite signals that capture patterns correlated with attrition
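The encoding steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the pipeline's actual code. The toy rows, the shared four-level scale for all three satisfaction columns, the use of a mean for OverallSatisfaction, the income band edges, the "Distance from Home" column, and the LongCommute threshold are all assumptions made for the example.

```python
import pandas as pd

# Toy rows. Column names follow the description above; values are made up.
df = pd.DataFrame({
    "Attrition": ["No", "Yes", "No", "No"],
    "OverTime": ["Yes", "No", "No", "Yes"],
    "Job Satisfaction": ["Low", "High", "Very High", "Medium"],
    "Work-Life Balance": ["Medium", "Low", "High", "High"],
    "Employee Recognition": ["High", "Low", "Very High", "Medium"],
    "Monthly Income": [3200, 5400, 12100, 7800],
    "Distance from Home": [5, 42, 18, 60],  # hypothetical column
})

# Assumption: all three satisfaction columns share this four-level scale.
SATISFACTION_SCALE = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}
ORDINAL_COLS = ["Job Satisfaction", "Work-Life Balance", "Employee Recognition"]

# Hypothetical annual income edges that produce 5 bands.
INCOME_BIN_EDGES = [0, 40_000, 70_000, 100_000, 150_000, float("inf")]

# Hypothetical cutoff for what counts as a long commute.
LONG_COMMUTE_THRESHOLD = 30


def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=raw.index)

    # Target and binary encoding: "No"/"Yes" -> 0/1.
    out["Attrition"] = raw["Attrition"].map({"No": 0, "Yes": 1})
    out["OverTime"] = raw["OverTime"].map({"No": 0, "Yes": 1})

    # Ordinal encoding: integers that preserve category rank.
    for col in ORDINAL_COLS:
        out[col] = raw[col].map(SATISFACTION_SCALE)

    # Aggregate feature: here simply the mean of the three encoded scores.
    out["OverallSatisfaction"] = out[ORDINAL_COLS].mean(axis=1)

    # Income binning: monthly -> annual, then an integer band index 0..4.
    annual_income = raw["Monthly Income"] * 12
    out["IncomeBand"] = pd.cut(annual_income, bins=INCOME_BIN_EDGES, labels=False)

    # One derived feature as an example of a composite signal.
    out["LongCommute"] = (
        raw["Distance from Home"] > LONG_COMMUTE_THRESHOLD
    ).astype(int)
    return out


features = engineer_features(df)
print(features)
```

Keeping every mapping and bin edge in a named constant makes it easier to reuse the exact same definitions at serving time.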
Train/Test Split
After feature engineering, the dataset is split into training and test sets using stratified sampling, which preserves the proportion of Attrition=1 records in both splits. The employee attrition dataset is imbalanced (~16% attrition), so stratification matters: without it, a random split can leave the two partitions with noticeably different attrition rates, which skews both training and evaluation.
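Fit any learned transformation (scalers, encoders, bin edges) with fit_transform on the training set only, then apply transform to the validation and test sets. Fitting on the full dataset leaks statistics from held-out data into training.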
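A minimal sketch of a stratified split followed by train-only fitting, using scikit-learn's train_test_split and StandardScaler. The synthetic feature matrix, the 80/20 split ratio, and the choice of scaler are assumptions for illustration. The 16% positive rate mirrors the attrition rate described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features: 1000 rows, 16% positives.
rng = np.random.default_rng(42)
n_rows = 1000
X = rng.normal(loc=5.0, scale=2.0, size=(n_rows, 3))
y = np.zeros(n_rows, dtype=int)
y[:160] = 1

# stratify=y keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Learn scaling statistics from the training set only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse them unchanged on the held-out set.
X_test_scaled = scaler.transform(X_test)

print(f"train attrition rate: {y_train.mean():.3f}")
print(f"test attrition rate:  {y_test.mean():.3f}")
```

The same fitted scaler object is what would be persisted and loaded at serving time, so that inference applies the statistics learned during training.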
Handling Class Imbalance
Many real-world classification problems are imbalanced. In the employee attrition dataset, roughly 84% of employees have Attrition="No" and 16% have Attrition="Yes". A model that always predicts "No" gets 84% accuracy but has zero predictive value for the minority class — the one you actually care about.
Common strategies:
- class_weight="balanced": Tells scikit-learn to weight each class inversely to its frequency, so errors on the minority class cost more during training. Easy win, try this first.
- SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic minority-class examples by interpolating between existing minority samples and their nearest neighbors. Apply it to the training set only, never to validation or test data.
- Threshold tuning: Instead of defaulting to 0.5, adjust the decision threshold to balance precision and recall for your use case.
- Use better metrics: Don't rely on accuracy for imbalanced data. Use F1, precision-recall AUC, or ROC AUC.
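The sketch below combines three of these strategies on synthetic data: class weighting, threshold tuning, and imbalance-aware metrics. The dataset from make_classification, the choice of logistic regression, and tuning for maximum F1 are assumptions for illustration. The threshold is tuned on a validation split rather than the final test set, so the test set stays untouched for the final evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_recall_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly an 84/16 class balance.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predicting the majority class looks accurate but is useless.
baseline_accuracy = 1.0 - y_val.mean()

# class_weight="balanced" reweights classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Threshold tuning: pick the threshold that maximizes F1 on validation data.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1_per_threshold = 2 * precision * recall / (precision + recall + 1e-12)
# The final precision/recall pair has no matching threshold, so drop it.
best_threshold = thresholds[np.argmax(f1_per_threshold[:-1])]

f1_default = f1_score(y_val, (proba >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (proba >= best_threshold).astype(int))
pr_auc = average_precision_score(y_val, proba)

print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"F1 at 0.5: {f1_default:.3f}")
print(f"F1 at tuned threshold {best_threshold:.3f}: {f1_tuned:.3f}")
print(f"precision-recall AUC (average precision): {pr_auc:.3f}")
```

Maximizing F1 is only one possible objective. If missing a departing employee costs more than a false alarm, a threshold chosen for a minimum recall may fit better.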
Temporal Splits vs Random Splits
How you split data matters. Random splitting is the default and works for the employee attrition dataset because it's cross-sectional — each row is an employee snapshot, not a time-ordered event. For behavioral data where you're predicting future behavior from past actions (fraud detection, churn, recommendation), you must split by time — never randomly. Random splitting on temporal data causes leakage because you'd be training on future events to predict past ones.
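For time-ordered data, a split by time can be as simple as choosing a cutoff timestamp. The sketch below uses a hypothetical event log. The column names, rows, and cutoff date are made up for illustration.

```python
import pandas as pd

# Hypothetical event log, deliberately not in time order.
events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-15",
        "2024-01-20", "2024-04-02", "2024-02-28",
    ]),
    "user_id": [1, 2, 3, 4, 5, 6],
    "label": [0, 1, 0, 0, 1, 0],
})

# Sort by time, then train on everything before the cutoff
# and evaluate on everything at or after it.
events = events.sort_values("event_time").reset_index(drop=True)
cutoff = pd.Timestamp("2024-03-01")

train = events[events["event_time"] < cutoff]
test = events[events["event_time"] >= cutoff]

# Sanity check: no training row is later than any test row.
assert train["event_time"].max() < test["event_time"].min()
print(len(train), "training rows,", len(test), "test rows")
```

Any feature computed from history, such as a rolling count of past events, should also use only data available before each row's timestamp. Otherwise the leak reappears inside the features.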
Feature Stores in Production
A feature store solves a nasty production problem: the features you compute in your offline training pipeline are often computed differently in your online serving code, causing subtle bugs (the "training-serving skew" problem). For the employee attrition pipeline, PromotionStagnation or income binning computed differently at serving time would cause silent mispredictions.
Feast is one of the most widely used open-source feature stores. It maintains a registry of feature definitions and provides both an offline store (for training) and an online store (for low-latency serving).
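The core idea a feature store formalizes is a single feature definition shared by training and serving. The sketch below shows that principle in plain Python. It is not Feast's API, and the band edges, field names, and function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical upper edges for annual income bands (4 edges -> 5 bands).
INCOME_BAND_EDGES = (40_000, 70_000, 100_000, 150_000)


def income_band(monthly_income: float) -> int:
    """Return the band index (0-4) for the annualized income."""
    return bisect_left(INCOME_BAND_EDGES, monthly_income * 12)


def build_features(record: dict) -> dict:
    """Single source of truth for feature logic, used offline and online."""
    return {
        "OverTime": 1 if record["OverTime"] == "Yes" else 0,
        "IncomeBand": income_band(record["Monthly Income"]),
    }


# Offline: build training rows from historical records.
historical_records = [
    {"OverTime": "Yes", "Monthly Income": 3200},
    {"OverTime": "No", "Monthly Income": 12100},
]
training_rows = [build_features(r) for r in historical_records]


# Online: the serving path calls the same function on a single request.
def handle_request(payload: dict) -> dict:
    return build_features(payload)


# The same input yields the same features in both paths, so there is no skew.
assert handle_request(historical_records[0]) == training_rows[0]
print(training_rows)
```

A feature store adds storage, versioning, and low-latency lookup on top of this idea, but the guarantee it provides is the same: one definition, two consumers.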