MLOps Step 2: Data Preparation
Raw data almost never goes directly into a model. It's dirty, inconsistently formatted, and missing the derived features that would actually help the model learn. Data preparation is the process of transforming raw data into clean, model-ready features — and doing it in a way that's reproducible, automated, and consistent between training and serving.
Data Cleaning
Data cleaning is about removing or fixing records that would mislead the model. The specific steps depend on your data, but these are the most common issues:
Missing Values
You have three options when a value is missing: drop the row (if missing is rare and not systematic), impute with a statistic (median for numeric, mode for categorical), or add a missingness indicator feature (when "missing" itself is a signal). The worst option is to do nothing and let the model see NaN values — most models will fail or produce garbage.
Whichever option you choose, learn imputation statistics from the training split only: call fit_transform on train, then transform on val and test with the same fitted object. Fitting on the full dataset leaks validation and test information into training.
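A minimal sketch of leakage-safe imputation with scikit-learn's SimpleImputer (the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: one numeric column with missing values.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_val = np.array([[np.nan], [3.0]])

# Fit the imputer on the training split ONLY, so the median it
# learns carries no information from the val/test splits.
imputer = SimpleImputer(strategy="median")
X_train_clean = imputer.fit_transform(X_train)

# Reuse the SAME fitted imputer on validation (and later test).
X_val_clean = imputer.transform(X_val)

print(imputer.statistics_)  # median of the training column: [2.0]
```

The same fit-on-train-only rule applies to scalers, encoders, and any other fitted transformer.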
Feature Engineering
Feature engineering is creating new input variables from existing ones that better capture the patterns you want the model to learn. Well-designed features from a domain expert often improve performance more than any amount of model or hyperparameter tuning.
Common Feature Engineering Techniques
- Aggregations: Count of events per user in last 7 days, average order value per category
- Time-based features: Day of week, hour of day, days since last event, recency-frequency-monetary (RFM)
- Ratios: Conversion rate (purchases / sessions), click-through rate (clicks / impressions), average order value (revenue / orders)
- Text features: TF-IDF, word count, sentiment score, topic models
- Interaction features: Product of two features when their combination has meaning
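A few of the techniques above, sketched with pandas on a hypothetical event log (column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per session.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-03 14:00",
        "2024-01-02 10:00", "2024-01-04 20:00", "2024-01-05 08:00",
    ]),
    "purchases": [0, 1, 1, 0, 1],
    "sessions": [1, 1, 1, 1, 1],
})

# Time-based features derived per event.
events["day_of_week"] = events["ts"].dt.dayofweek
events["hour"] = events["ts"].dt.hour

# Aggregations per user, plus a ratio (conversion rate).
per_user = events.groupby("user_id").agg(
    n_sessions=("sessions", "sum"),
    n_purchases=("purchases", "sum"),
    last_seen=("ts", "max"),
)
per_user["conversion_rate"] = per_user["n_purchases"] / per_user["n_sessions"]
print(per_user)
```

In production, this transformation logic should live in a versioned pipeline, not a one-off notebook, so it can be rerun identically at training and serving time.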
Train/Val/Test Splits
How you split data matters enormously. Random splitting is the default but is wrong for time-series data.
Random Split (for non-temporal data)
Use sklearn.model_selection.train_test_split with stratify=y to maintain class balance across splits. Typical ratios: 70/15/15 or 80/10/10.
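A sketch of a stratified 70/15/15 split done in two stages (the synthetic dataset is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 1000 rows, roughly 80/20 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)

# Two-stage split for 70/15/15: carve off test first, then val.
# stratify keeps the class ratio consistent across all three splits.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=150, stratify=y_tmp, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Fixing random_state makes the split reproducible, which matters when you want to compare model versions on the same held-out data.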
Temporal Split (for time-series data)
For behavioral data where you're predicting future behavior, you must split by time — never randomly. A random split leaks data: the model trains on events from the future and is evaluated on events from the past, inflating metrics you will never see in production.
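A minimal temporal split: sort by timestamp and cut at a fixed date, so everything the model trains on strictly precedes everything it is evaluated on (the tiny DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]),
    "feature": [1, 2, 3, 4],
    "label": [0, 1, 0, 1],
})

# Sort by time, then cut at a date: no training row is later
# than any evaluation row.
df = df.sort_values("ts")
train = df[df["ts"] < "2024-03-01"]
test = df[df["ts"] >= "2024-03-01"]
```

The same idea extends to a train/val/test split with two cutoff dates, or to rolling-window backtests.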
Handling Class Imbalance
Many real-world classification problems are imbalanced: 99% of users don't churn, 0.1% of transactions are fraud. A model that always predicts the majority class gets 99% accuracy but is useless.
Common strategies:
- class_weight="balanced" — Tell scikit-learn to weight the minority class more heavily. Easy win, try this first.
- SMOTE — Synthetic Minority Over-sampling Technique. Generates synthetic minority class examples.
- Threshold tuning — Instead of defaulting to 0.5, adjust the decision threshold to balance precision/recall for your use case.
- Use better metrics — Don't use accuracy for imbalanced data. Use F1, precision-recall AUC, or ROC AUC.
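The first and third strategies above can be sketched together; this assumes a scikit-learn classifier, and the 0.3 threshold is an arbitrary illustration (in practice you'd pick it from a precision-recall curve on validation data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: ~5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the loss inversely to class frequency,
# so minority-class mistakes cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Threshold tuning: score probabilities, then pick a cutoff other than 0.5.
proba = clf.predict_proba(X_te)[:, 1]
preds_default = (proba >= 0.5).astype(int)
preds_tuned = (proba >= 0.3).astype(int)  # lower cutoff favors recall

print(preds_default.sum(), preds_tuned.sum())
```

Lowering the threshold can only add positive predictions, trading precision for recall; whether that trade is worth it depends on the relative cost of false negatives in your use case.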
Feature Stores in Production
A feature store solves a nasty production problem: the features you compute in your offline training pipeline are often computed differently in your online serving code, causing subtle bugs (the "training-serving skew" problem).
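Before reaching for a feature store, the skew problem can be partly mitigated by keeping one shared feature-computation function that both pipelines import. This is a hand-rolled sketch, not a library API; the function name, event schema, and feature names are all hypothetical:

```python
from datetime import datetime

def compute_user_features(events: list, now: datetime) -> dict:
    """Single source of truth for feature logic: imported by BOTH the
    offline training pipeline and the online serving code, so the two
    code paths cannot silently drift apart."""
    recent = [e for e in events if (now - e["ts"]).days <= 7]
    last = max((e["ts"] for e in events), default=None)
    return {
        "events_7d": len(recent),
        "days_since_last": (now - last).days if last else None,
    }

# Offline (training) and online (serving) call the same function:
events = [{"ts": datetime(2023, 12, 25)}, {"ts": datetime(2024, 1, 6)}]
feats = compute_user_features(events, now=datetime(2024, 1, 8))
print(feats)  # {'events_7d': 1, 'days_since_last': 2}
```

This removes skew from duplicated logic, but not from stale data or differing data sources — which is where a real feature store earns its keep.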
Feast is the most widely used open-source feature store. It maintains a registry of feature definitions and provides both an offline store (for training) and an online store (for low-latency serving).
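A feature definition in Feast looks roughly like the following. Feast's API changes between releases, so treat this as a hedged sketch in the style of recent versions; the entity, view, column names, and parquet path are hypothetical:

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity a feature row is keyed on.
user = Entity(name="user", join_keys=["user_id"])

# Offline source: a parquet file of precomputed user statistics.
user_stats_source = FileSource(
    path="data/user_stats.parquet",
    timestamp_field="event_timestamp",
)

# The feature view registers the schema in Feast's registry; the same
# definitions back both the offline store (training) and online store
# (low-latency serving).
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="events_7d", dtype=Int64),
        Field(name="conversion_rate", dtype=Float32),
    ],
    source=user_stats_source,
)
```

Training code then fetches point-in-time-correct historical features from the offline store, while the serving path reads the latest values from the online store — both resolved through this one registry of definitions.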