MLOps Pipeline

MLOps Step 4: Deploying with KServe

● Advanced ⏱ 50 min read

KServe (formerly KFServing) is a Kubernetes-native model serving platform. It provides standardized model serving, canary deployments, autoscaling to zero, and explainability — all through a single Kubernetes custom resource called InferenceService.

KServe Architecture

KServe runs on top of Knative Serving and Istio, which gives it powerful traffic management capabilities without you having to configure those layers directly. The architecture has three main layers:

  • Control Plane: The KServe controller watches InferenceService resources and creates the underlying Knative Services.
  • Data Plane: Each model is served by a container that implements the KServe v2 inference protocol — a standardized REST + gRPC API for model predictions.
  • Serving Runtime: Pre-built serving containers for common ML frameworks (sklearn, PyTorch, TensorFlow, Hugging Face). You pick one; KServe handles pulling the model artifact and starting the server.
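The v2 data-plane protocol is easiest to see in a concrete request payload. Below is a minimal Python sketch of building a v2-protocol inference request body; the tensor name `input-0` and the feature values are illustrative assumptions, not tied to any particular model:

```python
import json

def build_v2_infer_request(rows):
    """Build a request body for the KServe v2 inference protocol
    REST endpoint: POST /v2/models/<name>/infer."""
    return {
        "inputs": [
            {
                "name": "input-0",                          # tensor name (model-specific)
                "shape": [len(rows), len(rows[0])],         # batch_size x num_features
                "datatype": "FP32",
                "data": [x for row in rows for x in row],   # flattened, row-major
            }
        ]
    }

body = build_v2_infer_request([[12.0, 3.0, 1.0, 0.0]])
print(json.dumps(body))
```

The same JSON body works against any v2-compliant serving runtime; the gRPC side of the protocol carries an equivalent protobuf message.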
📋 KServe Prerequisites — KServe requires Knative Serving and either Istio or a lighter network layer such as Kourier. For production, use the full Istio installation; for development and testing, serverless mode with Knative is enough. Check kserve.io for the current installation guide.

InferenceService YAML

Deploying a model with KServe means writing an InferenceService custom resource. Here's a complete production example for our churn prediction model trained with scikit-learn:

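A representative sketch of such an InferenceService is below; the resource name, namespace, and S3 path are hypothetical placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model          # hypothetical name
  namespace: models          # hypothetical namespace
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    model:
      modelFormat:
        name: sklearn        # selects the pre-built sklearn serving runtime
      protocolVersion: v2    # serve the v2 inference protocol
      storageUri: s3://my-models/churn/v1   # hypothetical artifact location
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
```

Apply it with kubectl apply -f and KServe creates the underlying Knative Service, pulls the artifact from storageUri, and starts the sklearn runtime.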

Canary Deployments

KServe has first-class support for canary deployments — routing a percentage of traffic to a new model version while the majority still hits the current model. This is the safest way to roll out model updates.

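As a minimal sketch (artifact paths hypothetical): updating storageUri to the new artifact and setting canaryTrafficPercent sends 10% of traffic to the new revision while 90% stays on the previous one:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
  namespace: models
spec:
  predictor:
    canaryTrafficPercent: 10      # 10% to the new model, 90% to the current one
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-models/churn/v2   # hypothetical new artifact
```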

To promote the canary to 100%, update the YAML with canaryTrafficPercent: 100. To roll back, set it to 0 or remove the canary spec entirely.

📋 Shadow Deployments — KServe also supports shadow (mirror) deployments, where the new model receives a copy of live requests but its responses are discarded. Use this to evaluate a new model on real traffic with zero user-facing risk. How mirroring is enabled varies by KServe version, so check the documentation for your release.

Autoscaling

KServe uses Knative's request-based autoscaler (KPA) by default, which scales on in-flight request concurrency per pod (or, optionally, requests per second) and supports scale-to-zero. You can also switch to the Kubernetes HPA for CPU/memory-based scaling.

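A sketch of those knobs on the predictor spec (field names per the v1beta1 API; the values and the artifact path are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
  namespace: models
spec:
  predictor:
    minReplicas: 0             # allow scale-to-zero when idle
    maxReplicas: 10
    scaleMetric: concurrency   # KPA: scale on in-flight requests per pod
    scaleTarget: 5             # target 5 concurrent requests per pod
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-models/churn/v1   # hypothetical
```

Setting scaleMetric to cpu or memory switches the component to HPA-based scaling instead of the KPA.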

Monitoring and Rollback

KServe exposes Prometheus metrics out of the box. Wire these up to Grafana to get a serving dashboard.

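A few commands for checking on a rollout; the service name, namespace, and Prometheus service name are assumptions that depend on your installation:

```bash
# Status and traffic split of the InferenceService (hypothetical name/namespace)
kubectl get inferenceservice churn-model -n models

# Underlying Knative revisions and the traffic percentage each receives
kubectl get revisions -n models

# Open Prometheus locally (service name varies by installation)
kubectl port-forward -n monitoring svc/prometheus-operated 9090
```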

For automated rollback, create a Prometheus alerting rule that fires when the canary's error rate exceeds a threshold, route the alert to PagerDuty (or any webhook receiver) via Alertmanager, and wire it to a script that removes the canary spec from the InferenceService.
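As a sketch of such a rule, assuming the Prometheus Operator is installed and Knative's queue-proxy metrics are being scraped — the metric and label names below vary by setup and should be verified against your Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: churn-canary-errors     # hypothetical
  namespace: monitoring
spec:
  groups:
    - name: kserve-canary
      rules:
        - alert: CanaryHighErrorRate
          # 5xx responses as a fraction of all requests over 5 minutes
          expr: |
            sum(rate(revision_app_request_count{configuration_name="churn-model-predictor",response_code_class="5xx"}[5m]))
              /
            sum(rate(revision_app_request_count{configuration_name="churn-model-predictor"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical  # route this severity to PagerDuty in Alertmanager
```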