MLOps Pipeline

Feature Store Explained: Feast on Kubernetes

Intermediate · 35 min read

You've built the dataset pipeline, cleaned the data, and engineered your features into a tidy CSV. Now the model goes to production — and within weeks you discover that inference predictions are silently wrong. Not because the model is bad, but because the feature values at serving time don't match what the model saw during training.

This is the feature consistency problem, and it's one of the most common sources of production ML failures. A feature store solves it by acting as a single, versioned source of truth for features — shared between training and inference, so both pipelines always consume features the same way.

We'll use Feast, an open-source, Kubernetes-native feature store used in production by companies like Nvidia, Shopify, and Expedia.

The Problem Feature Stores Solve

In a naive MLOps setup, feature engineering happens twice:

  • Training time: a Python script reads the raw CSV, transforms it, and writes a featured dataset
  • Inference time: an API handler reads raw employee data, applies "the same" transformations, and feeds the model

The word "same" is doing a lot of work there. In practice, two separate implementations of the same logic drift apart. A column gets renamed. A normalization constant changes. A derived feature gets computed in a different order. The result is training-serving skew: the model predicts on different data than it was trained on, and it doesn't throw an error — it just silently degrades.
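
To make the drift concrete, here is a small sketch of two hand-written copies of "the same" transformation. The function names, the raw field names, and the constants (40 and 35 years) are invented for illustration; they are not taken from the attrition project.

```python
# Hypothetical example: two copies of "the same" feature logic that have drifted.

def training_features(raw: dict) -> list[float]:
    """Transformation used by the training script."""
    tenure = raw["tenure_years"] / 40.0          # normalized by a 40-year maximum
    overtime = 1.0 if raw["overtime"] == "Yes" else 0.0
    return [tenure, overtime]


def serving_features(raw: dict) -> list[float]:
    """Transformation re-implemented later in the API handler."""
    tenure = raw["tenure_years"] / 35.0          # constant was changed here only
    overtime = 1.0 if raw["overtime"] == "Yes" else 0.0
    return [tenure, overtime]


def skewed_positions(a: list[float], b: list[float], tol: float = 1e-9) -> list[int]:
    """Return the indices where two feature vectors disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tol]


employee = {"tenure_years": 14, "overtime": "Yes"}
train_vec = training_features(employee)
serve_vec = serving_features(employee)
print(skewed_positions(train_vec, serve_vec))
```

Neither function raises an error, and both return a plausible vector. Only a direct comparison reveals that position 0 differs.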

⚠️ Training-Serving Skew Is Silent: The model doesn't raise an exception when you pass mismatched features. It produces a number. That number might look plausible. You won't know it's wrong until you measure model accuracy against ground truth, which might take days or weeks.

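A feature store enforces a single definition of every feature. That definition is used both when building training data and when serving features at inference time, which removes the duplicated transformation logic behind most skew. It cannot catch everything (stale online data or a changed upstream source can still cause mismatches), but the two code paths can no longer drift apart on their own.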

Why Feature Order Is Critical

Before diving into the feature store architecture, you need to understand a subtle but important constraint: the model does not understand column names.

During training, your model sees a matrix of numbers. It learns patterns based on the position of each value: "if index 0 is high and index 2 is 1, predict attrition." The column header tenure is never stored in the model. Only the position matters.

featured.csv column order | What the model actually sees
--- | ---
tenure → 0.45 | index 0 → 0.45
salary → 0.78 | index 1 → 0.78
overtime → 1 | index 2 → 1

If at inference time you pass salary as the first column, the model treats it as tenure. It will still run. It will still return a probability. But that probability is based on completely wrong inputs, with no warning.
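
The effect is easy to reproduce with a toy model. The sketch below uses a hand-written logistic scorer; the weights and bias are made-up numbers standing in for whatever a real model would learn.

```python
# Hypothetical linear scorer: weights are tied to positions, not to names.
import math

WEIGHTS = [-2.0, -1.0, 1.5]   # assumed to be learned for [tenure, salary, overtime]
BIAS = 0.5


def predict_proba(vector: list[float]) -> float:
    """Logistic score over a positional feature vector."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, vector))
    return 1.0 / (1.0 + math.exp(-z))


correct = [0.45, 0.78, 1.0]    # tenure, salary, overtime (training order)
swapped = [0.78, 0.45, 1.0]    # salary passed first by mistake

p_correct = predict_proba(correct)
p_swapped = predict_proba(swapped)
print(round(p_correct, 3), round(p_swapped, 3))
```

Both calls succeed and both return a valid probability, but with these weights the swapped input lands on the other side of 0.5, so the predicted class changes without any error being raised.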

The Feature Store Guarantee: A feature service in the registry names the exact set of features a model consumes. Training and inference both request features through that one definition and select them by name, so the input vector is never hand-assembled in two different places.

What Is a Feature Store

A feature store is an infrastructure layer with four responsibilities:

Responsibility | What It Means
--- | ---
Define | Store feature definitions (schema, entities, metadata) in a central registry
Store | Persist feature data in both offline (historical) and online (real-time) stores
Serve | Retrieve features at low latency during inference, in the correct order
Share | Expose the same features across teams: model A and model B can reuse the same employee_tenure feature without recomputing it

In the employee attrition project, a feature is any derived signal: tenure, overtime, promotion_stagnation, career_velocity. These values are computed once, stored in the feature store, and served consistently to every model that needs them.

Offline vs Online Store

Training and inference have very different data access patterns, so a feature store uses two separate backends.

Offline Store — For Training

Training requires large volumes of historical feature data: hundreds of thousands of employee records spanning years. Speed is secondary; completeness and reproducibility matter. The offline store is backed by slower but scalable storage such as Parquet files on S3, BigQuery, or Redshift.

When you kick off a training job, the Feast SDK reads the feature definitions from the registry, queries the offline store directly, and returns a point-in-time-correct dataset: for each training example, it retrieves the most recent feature values recorded at or before that label's timestamp (within the feature view's TTL). This prevents future leakage, a subtle bug where training uses feature values that would not yet have been available at prediction time.
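
The lookup rule itself is small enough to sketch in plain Python. This is an illustration of the idea only, with invented data; it is not how Feast implements the join internally, and it ignores the TTL window.

```python
# Minimal point-in-time lookup in plain Python (illustrative data, not Feast internals).
from bisect import bisect_right
from datetime import datetime

# Feature history for one employee, sorted by event timestamp.
history = [
    (datetime(2023, 1, 1), {"tenure": 2.0}),
    (datetime(2023, 7, 1), {"tenure": 2.5}),
    (datetime(2024, 1, 1), {"tenure": 3.0}),
]


def features_as_of(history, label_time):
    """Return the latest feature row with event timestamp <= label_time, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_time)
    return history[idx - 1][1] if idx > 0 else None


# A label observed in September 2023 must not see the January 2024 value.
row = features_as_of(history, datetime(2023, 9, 15))
print(row)
```

A naive join on employee ID alone would pick up the January 2024 row for a September 2023 label, which is exactly the leakage the point-in-time rule prevents.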

Note on Training Path: The training job uses the Feast SDK to read the feature registry and query the offline store (S3) directly. The Feast Feature Server pod is not involved in the training path; it only serves online features during inference.

Online Store — For Inference

During inference, the model API receives an employee ID and needs to return a prediction in milliseconds. The offline store is far too slow for this: a single read from S3-backed storage typically takes 100 ms or more. The online store uses Redis or DynamoDB, which typically serve feature lookups in 1–5 ms.

Store | Backend | Typical Latency | Used For
--- | --- | --- | ---
Offline | S3 / Parquet / BigQuery | 100–500 ms | Training, batch scoring
Online | Redis / DynamoDB | 1–5 ms | Real-time inference

The online store only holds the latest feature values for active entities (e.g., current employees). It is not a historical archive — it's a cache of the most recent precomputed features, optimized for fast point lookups by entity key (e.g., employee_id).
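
One way to picture the difference between the two stores is to collapse a history into a latest-value map. The rows and field names below are illustrative, and a plain dict stands in for Redis.

```python
# Sketch: collapse historical rows into the "latest value per entity" view
# that an online store keeps. Data and field names are illustrative.

def latest_per_entity(rows):
    """Keep only the most recent row for each employee_id."""
    latest = {}
    for row in rows:
        key = row["employee_id"]
        # ISO-8601 date strings compare correctly as plain strings.
        if key not in latest or row["event_timestamp"] > latest[key]["event_timestamp"]:
            latest[key] = row
    return latest


offline_rows = [
    {"employee_id": 101, "event_timestamp": "2024-01-01", "tenure": 3.0},
    {"employee_id": 101, "event_timestamp": "2024-06-01", "tenure": 3.4},
    {"employee_id": 202, "event_timestamp": "2024-03-01", "tenure": 1.2},
]

online_view = latest_per_entity(offline_rows)
print(online_view[101]["tenure"])
```

The offline side keeps all three rows for time travel; the online side keeps one row per employee and answers a lookup by key in a single step.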

Materialization: How Data Gets Into Redis

You now have two stores. The offline store holds years of historical features in S3. The online store in Redis holds the latest values for current employees. But how does data move from S3 into Redis?

That process is called materialization. It reads the latest relevant feature values from the offline store and loads them into Redis. As a DevOps engineer, this is typically yours to own — materialization runs as a scheduled Kubernetes CronJob or Airflow DAG.

bash
# Materialize everything since the previous run, up to the current UTC time, into Redis
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

yaml
# Kubernetes CronJob: nightly materialization
apiVersion: batch/v1
kind: CronJob
metadata:
  name: feast-materialize
  namespace: mlops
spec:
  schedule: "0 2 * * *"   # 02:00 UTC daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: materialize
            image: ops4life/feast-worker:latest
            command:
            - bash
            - -c
            - cd "$FEAST_REPO_PATH" && feast materialize-incremental "$(date -u +%Y-%m-%dT%H:%M:%S)"
            env:
            - name: FEAST_REPO_PATH
              value: /feast
          restartPolicy: OnFailure

Materialization is incremental: each run only processes rows whose event timestamps fall after the end of the previous run, up to the end time passed on the command line. In the employee attrition project, only active employee records are materialized; that filtering is handled in the Feast feature view definition, before the CronJob runs.
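
The incremental behavior can be sketched with a watermark. This is a simplified model under stated assumptions (an in-memory list as the offline store, a dict as Redis, a caller that persists the watermark); it is not Feast's actual implementation.

```python
# Sketch of incremental materialization with a watermark (not Feast's actual code).
from datetime import datetime


def materialize_incremental(offline_rows, online_store, last_end, end):
    """Copy rows with last_end < event_timestamp <= end into online_store.

    Returns the new watermark to persist for the next run.
    """
    window = [r for r in offline_rows if last_end < r["event_timestamp"] <= end]
    for row in sorted(window, key=lambda r: r["event_timestamp"]):
        online_store[row["employee_id"]] = row   # later rows overwrite earlier ones
    return end


offline_rows = [
    {"employee_id": 101, "event_timestamp": datetime(2024, 6, 1), "tenure": 3.4},
    {"employee_id": 101, "event_timestamp": datetime(2024, 6, 8), "tenure": 3.5},
    {"employee_id": 202, "event_timestamp": datetime(2024, 5, 1), "tenure": 1.2},
]
redis_like = {}
watermark = materialize_incremental(
    offline_rows, redis_like, last_end=datetime(2024, 5, 15), end=datetime(2024, 6, 10)
)
print(sorted(redis_like), redis_like[101]["tenure"])
```

Employee 202 is skipped because its only row predates the previous watermark, which is why a run that fails and is never retried leaves a gap until the next successful run covers the same window.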

Feature Registry

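The feature registry is the central catalog of all feature definitions: their schema, the entities they're keyed on, and where the underlying data lives. In this setup the registry is backed by PostgreSQL through Feast's SQL registry; Feast can also keep the registry in a single file, which is its default for local projects.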

Instead of manually constructing the input vector for every inference call, the inference API asks for features by name: "give me the feature set defined for the attrition model, for employee 101." The SDK resolves that feature service through the registry and fetches the values from Redis.

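python
from feast import FeatureStore

store = FeatureStore(repo_path="/feast")

# Resolve the feature service registered for this model
feature_service = store.get_feature_service("attrition_model_v1")

# Fetch online features by entity key
features = store.get_online_features(
    features=feature_service,
    entity_rows=[{"employee_id": 101}],
).to_dict()

# to_dict() maps each name to a list (one value per entity row) and also
# includes the entity key, so select the model inputs by name
FEATURE_ORDER = [
    "tenure", "salary_band", "overtime",
    "promotion_stagnation", "career_velocity", "overall_satisfaction",
]
row = [features[name][0] for name in FEATURE_ORDER]

# model is the trained estimator, loaded elsewhere
prediction = model.predict([row])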

The training job uses the same registry to read feature definitions and construct the historical training dataset from S3. Because both paths resolve the same feature service and select columns by name, the training matrix and the inference vector are built from one schema.

What the Registry Stores

Object | What It Defines | Example
--- | --- | ---
Entity | The primary key that identifies a record | employee_id
Feature View | A group of related features sourced from one dataset | employee_features
Feature Service | A named set of feature views consumed by a specific model | attrition_model_v1
Data Source | Where raw feature data lives (S3 path, table name) | s3://mlops-features/employee_attrition.parquet

Feast Architecture on Kubernetes

On Kubernetes, Feast runs as a set of deployments. Here's the complete picture of how the components interact:

Component | Implementation | Role
--- | --- | ---
Feature Registry | PostgreSQL | Stores feature definitions, schema, and metadata
Online Store | Redis | Serves latest feature values at low latency (<5 ms)
Offline Store | S3 / Parquet | Historical features for training and batch scoring
Feature Server | Feast Feature Server pod | HTTP/gRPC API that serves online features to inference pods
Materialization | Kubernetes CronJob | Moves updated features from S3 → Redis on schedule

Training Data Flow

  1. Training job starts with a list of entity keys and timestamps
  2. Feast SDK reads feature view definitions from PostgreSQL registry
  3. SDK performs a point-in-time join against the offline store (S3)
  4. Returns a training dataset with time-correct features — no future leakage
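
The training steps above can be sketched with the Feast SDK roughly as follows. The repo path and the feature service name come from this article's setup; the `attrition` label column and the helper function names are assumptions made for the example.

```python
# Sketch of the training path; assumes a Feast repo at /feast and a feature
# service named "attrition_model_v1" (both taken from this article's setup).
from datetime import datetime, timezone


def build_entity_rows(labels):
    """Turn (employee_id, label_time, attrition_label) tuples into entity rows."""
    return [
        {"employee_id": emp_id, "event_timestamp": ts, "attrition": label}
        for emp_id, ts, label in labels
    ]


def fetch_training_frame(entity_rows, repo_path="/feast"):
    """Point-in-time join of the entity rows against the offline store."""
    import pandas as pd                 # imported lazily: only needed on this path
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    service = store.get_feature_service("attrition_model_v1")
    entity_df = pd.DataFrame(entity_rows)
    return store.get_historical_features(entity_df=entity_df, features=service).to_df()


labels = [
    (101, datetime(2023, 9, 15, tzinfo=timezone.utc), 0),
    (202, datetime(2023, 11, 1, tzinfo=timezone.utc), 1),
]
entity_rows = build_entity_rows(labels)
# training_df = fetch_training_frame(entity_rows)   # requires a configured Feast repo
```

The entity dataframe carries one row per training example with its own timestamp, which is what lets the SDK pick the feature values that were current for that example.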

Inference Data Flow

  1. HR user submits an employee ID to the inference API
  2. Inference API calls the Feast Feature Server: POST /get-online-features
  3. Feature Server reads definitions from PostgreSQL, fetches values from Redis
  4. Returns feature vector in the canonical order from the registry
  5. Inference API passes the vector to the model, returns prediction
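
On the client side, steps 4 and 5 amount to turning the server's JSON into a vector. The sketch below assumes the response carries a list of feature names plus one result column per name; treat that layout as an assumption and verify it against the Feast version you run. The helper name and sample values are invented.

```python
# Sketch: turn a feature server response into per-entity vectors selected by name.
# The response layout below is an assumption about the JSON the Python
# feature server returns; verify it against your Feast version.

FEATURE_ORDER = ["tenure", "salary_band", "overtime"]   # order used at training time


def vectors_from_response(response: dict, feature_order: list[str]) -> list[list]:
    """Build one feature vector per entity row, selecting columns by name."""
    names = response["metadata"]["feature_names"]
    columns = {name: result["values"] for name, result in zip(names, response["results"])}
    missing = [name for name in feature_order if name not in columns]
    if missing:
        raise KeyError(f"features missing from response: {missing}")
    n_rows = len(columns[feature_order[0]])
    return [[columns[name][i] for name in feature_order] for i in range(n_rows)]


sample_response = {
    "metadata": {"feature_names": ["employee_id", "overtime", "tenure", "salary_band"]},
    "results": [
        {"values": [101, 202]},
        {"values": [1, 0]},
        {"values": [0.45, 0.10]},
        {"values": [3, 2]},
    ],
}
vectors = vectors_from_response(sample_response, FEATURE_ORDER)
print(vectors)
```

Selecting by name means the vector stays correct even if the server returns columns in a different order, and a missing feature fails loudly instead of shifting every later position.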

Redis Memory Sizing: Redis holds only the latest feature values for active entities. For the employee attrition use case (thousands of employees, dozens of features), Redis memory usage is typically under 1 GB. Monitor redis_memory_used_bytes and eviction rate; evictions mean Redis is running out of memory and dropping feature data, which breaks inference.
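
A rough estimate shows why the figure stays well under 1 GB. The entity count, feature count, and the 100 bytes per stored value (which is meant to include key and serialization overhead) are all assumptions for illustration; measured usage is the only reliable number.

```python
# Back-of-the-envelope Redis sizing. The per-value byte cost is an assumption;
# measure real usage with `redis-cli info memory` before relying on it.

def estimate_redis_bytes(n_entities: int, n_features: int, bytes_per_value: int = 100) -> int:
    """Rough memory estimate: entities x features x bytes per stored value."""
    return n_entities * n_features * bytes_per_value


estimate = estimate_redis_bytes(n_entities=5_000, n_features=36)
print(f"{estimate / 1024**2:.1f} MiB")
```

Even with a generous per-value cost, a workload of this shape stays in the tens of megabytes, so memory pressure here more likely points to a misconfigured maxmemory limit or unrelated keys than to feature growth.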

Deploying the Stack

bash
# Deploy PostgreSQL for the feature registry
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install feast-postgres bitnami/postgresql \
  --namespace mlops --create-namespace \
  --set auth.database=feast \
  --set auth.username=feast \
  --set auth.password=feastpassword

# Deploy Redis for the online store
helm install feast-redis bitnami/redis \
  --namespace mlops \
  --set auth.enabled=false \
  --set architecture=standalone

yaml
# feast/feature_store.yaml — Feast configuration
project: employee_attrition
provider: local

registry:
  registry_type: sql
  path: postgresql://feast:feastpassword@feast-postgres:5432/feast

online_store:
  type: redis
  connection_string: "feast-redis-master:6379"

offline_store:
  type: file   # or use spark/bigquery/redshift for production scale

entity_key_serialization_version: 2

python
# feast/features.py: feature definitions, registered in PostgreSQL by `feast apply`
from datetime import timedelta
from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32, Int64

employee = Entity(name="employee_id", description="Employee identifier")

employee_source = FileSource(
    path="s3://mlops-features/employee_attrition.parquet",
    timestamp_field="event_timestamp",
)

employee_features = FeatureView(
    name="employee_features",
    entities=[employee],
    ttl=timedelta(days=7),
    schema=[
        Field(name="tenure",               dtype=Float32),
        Field(name="salary_band",          dtype=Int64),
        Field(name="overtime",             dtype=Int64),
        Field(name="promotion_stagnation", dtype=Float32),
        Field(name="career_velocity",      dtype=Float32),
        Field(name="overall_satisfaction", dtype=Float32),
    ],
    source=employee_source,
)

# Named feature set consumed by the attrition model (referenced at inference time)
attrition_model_v1 = FeatureService(
    name="attrition_model_v1",
    features=[employee_features],
)

bash
# Apply feature definitions to the registry
cd feast && feast apply

# Run initial materialization (loads Redis from S3)
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

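# Verify: retrieve features for employee 101 through the Python SDK
python -c '
from feast import FeatureStore
store = FeatureStore(repo_path=".")
print(store.get_online_features(
    features=["employee_features:tenure", "employee_features:salary_band"],
    entity_rows=[{"employee_id": 101}],
).to_dict())
'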

Smoke Testing the Feature Server

bash
# Port-forward the Feast Feature Server
kubectl -n mlops port-forward svc/feast-feature-server 6566:6566 &

# Request features over HTTP
curl -s http://localhost:6566/get-online-features \
  -H "Content-Type: application/json" \
  -d '{
    "feature_service": "attrition_model_v1",
    "entities": {"employee_id": [101, 202, 303]}
  }' | jq .

# Inspect the server's Prometheus metrics endpoint directly
# (port 8080 is assumed here; metric names vary by Feast version)
kubectl -n mlops exec deploy/feast-feature-server -- \
  curl -s localhost:8080/metrics | grep feast_

Your Role as a DevOps Engineer

ML engineers define features. DevOps engineers keep the feature serving infrastructure running reliably in production.

Responsibility | What You Own
--- | ---
Deploy & manage | Feast Feature Server, Redis, and PostgreSQL on Kubernetes. Helm charts, resource limits, PodDisruptionBudgets.
Availability & scaling | HPA on the Feature Server based on request rate. Redis in HA mode (Sentinel or Cluster) for production. PostgreSQL with read replicas.
Materialization pipeline | CronJob or Airflow DAG that runs feast materialize-incremental on schedule. Alert on failures: a failed materialization means stale features in Redis.
CI/CD integration | Run feast apply in CI after feature definition changes. Schema validation before merge: a breaking schema change breaks inference.
Observability | Monitor Redis memory, eviction rate, and Feature Server p99 latency. The Feature Server can expose Prometheus metrics when metrics are enabled; check which ones your Feast version emits.

Key Metrics to Watch

bash
# Redis: memory usage and evictions
redis-cli info memory | grep used_memory_human
redis-cli info stats | grep evicted_keys

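# Feast Feature Server: p99 latency (PromQL; the metric name depends on your
# Feast version and instrumentation, so treat this one as a placeholder)
# histogram_quantile(0.99, sum by (le) (rate(feast_feature_server_request_duration_seconds_bucket[5m])))

# Materialization: inspect the feature view, including its TTL and any
# recorded materialization intervals
feast feature-views describe employee_features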

⚠️ Alert on Materialization Failures: If the nightly CronJob fails, Redis holds stale features. The Feature Server keeps serving without errors, but the model predicts on old data. Alert on failed Jobs (kube_job_status_failed), on a CronJob whose last successful run is too old (kube_cronjob_status_last_successful_time), and on Redis key TTL expiry, so you catch this before it silently degrades model accuracy.
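
The freshness rule behind such an alert is simple. In the sketch below, the 26-hour threshold (a nightly schedule plus two hours of grace) and the function name are assumptions; in practice the same comparison would live in an alerting rule rather than in application code.

```python
# Sketch of a freshness check; the 26-hour threshold is an assumed value
# (nightly schedule plus a two-hour grace period).
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=26)


def is_stale(last_success: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True when the last successful materialization is older than max_age."""
    return now - last_success > max_age


now = datetime(2024, 6, 12, 9, 0, tzinfo=timezone.utc)
fresh = is_stale(datetime(2024, 6, 12, 2, 0, tzinfo=timezone.utc), now)
stale = is_stale(datetime(2024, 6, 10, 2, 0, tzinfo=timezone.utc), now)
print(fresh, stale)
```

Checking the age of the last success, rather than only counting failures, also catches the case where the CronJob was suspended or never scheduled.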

CI/CD: Safe Feature Deployments

Feature definitions are code. They must go through review and validation before they hit production. A schema change that removes a column will break any inference pipeline that references that column.
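
A breaking change of this kind can be detected mechanically before merge. The sketch below compares two schemas represented as plain name-to-type dicts; how those dicts are exported from the registry or from features.py is left open, and the function name is hypothetical.

```python
# Sketch of a breaking-change check between two feature schemas.
# Schemas are plain {name: dtype} dicts here; how you export them from the
# registry or from features.py is left open.

def breaking_changes(old: dict, new: dict) -> list[str]:
    """List removed or retyped features; additions are treated as non-breaking."""
    problems = []
    for name, dtype in old.items():
        if name not in new:
            problems.append(f"removed: {name}")
        elif new[name] != dtype:
            problems.append(f"retyped: {name} ({dtype} -> {new[name]})")
    return problems


deployed = {"tenure": "Float32", "salary_band": "Int64", "overtime": "Int64"}
proposed = {"tenure": "Float32", "salary_band": "Float32", "career_velocity": "Float32"}

issues = breaking_changes(deployed, proposed)
print(issues)
```

A CI step could fail the build whenever the returned list is non-empty, forcing removals and type changes through an explicit versioning step such as a new feature service name.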

yaml
# .github/workflows/feast-deploy.yml
name: Deploy Feature Definitions
on:
  push:
    paths: ['feast/**']

jobs:
  validate-and-apply:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Install Feast
      run: pip install "feast[postgres,redis]"
    - name: Validate feature definitions
      run: |
        cd feast
        python -c "from features import employee_features; print('schema OK')"
    - name: Apply to registry (production)
      if: github.ref == 'refs/heads/main'
      run: |
        cd feast
        feast apply
      env:
        # Used only if feature_store.yaml references ${FEAST_REGISTRY_DB_URL}
        FEAST_REGISTRY_DB_URL: ${{ secrets.FEAST_REGISTRY_DB_URL }}

With the feature store in place, the full data flow for production inference looks like this: HR records flow through the ETL pipeline into S3, nightly materialization pushes current employee features into Redis, and every inference call fetches a consistent, registry-governed feature vector — the same schema the model was trained on.