MLOps Step 4: Deploying with KServe
KServe (formerly KFServing) is a Kubernetes-native model serving platform. It provides standardized model serving, canary deployments, autoscaling to zero, and explainability — all through a simple Kubernetes custom resource called InferenceService.
KServe Architecture
KServe runs on top of Knative Serving and Istio, which gives it powerful traffic management capabilities without you having to configure those layers directly. The architecture has three main planes:
- Control Plane: The KServe controller watches InferenceService resources and creates the underlying Knative Services.
- Data Plane: Each model is served by a container that implements the KServe v2 inference protocol, a standardized REST + gRPC API for model predictions.
- Serving Runtime: Pre-built serving containers for common ML frameworks (sklearn, PyTorch, TensorFlow, Hugging Face). You pick one; KServe handles pulling the model artifact and starting the server.
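To make the v2 inference protocol concrete, here is a sketch of a predict request. The hostname and model name are illustrative; the path shape (/v2/models/{name}/infer) and the inputs envelope come from the protocol itself:

```shell
# Hypothetical endpoint for a deployed InferenceService named churn-predictor.
# The v2 protocol expects a JSON body with named, typed tensors.
curl -s -X POST \
  http://churn-predictor.ml-serving.example.com/v2/models/churn-predictor/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [{
          "name": "input-0",
          "shape": [1, 4],
          "datatype": "FP32",
          "data": [[12.0, 3.0, 0.0, 1.0]]
        }]
      }'
```

The response mirrors this structure: an outputs array of named tensors, so clients can stay framework-agnostic.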
InferenceService YAML
Deploying a model with KServe means writing an InferenceService custom resource. Here's a complete production example for our churn prediction model trained with scikit-learn:
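A minimal sketch of such a manifest, assuming KServe v1beta1 with the modelFormat-style predictor spec; the namespace, bucket path, and resource numbers are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor
  namespace: ml-serving        # assumed namespace
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    model:
      modelFormat:
        name: sklearn          # picks the sklearn serving runtime
      storageUri: s3://models/churn/v1   # assumed artifact location
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```

Applying this with kubectl creates the Knative Service; KServe pulls the model artifact from storageUri and starts the matching serving runtime.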
Canary Deployments
KServe has first-class support for canary deployments — routing a percentage of traffic to a new model version while the majority still hits the current model. This is the safest way to roll out model updates.
To promote the canary to 100%, update the YAML with canaryTrafficPercent: 100. To roll back, set it to 0 or remove the canary spec entirely.
To send mirrored (shadow) traffic to the canary without affecting client responses, add shadow: true to the canary spec.
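In KServe v1beta1, the canary is driven by a single field on the predictor. A sketch, continuing the assumed churn-predictor example: updating storageUri creates a new revision, and canaryTrafficPercent controls how much traffic that new revision receives while the previously rolled-out revision serves the rest:

```yaml
spec:
  predictor:
    canaryTrafficPercent: 10   # 10% of requests go to the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/churn/v2   # new model version (assumed path)
```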
Autoscaling
KServe uses Knative's request-based autoscaler (KPA) by default, which scales on concurrent in-flight requests per pod (it can also be configured to scale on requests per second). You can also use HPA for CPU/memory-based scaling.
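These knobs are exposed directly on the predictor spec in v1beta1. A sketch with assumed values:

```yaml
spec:
  predictor:
    minReplicas: 0            # scale to zero when idle
    maxReplicas: 10
    scaleMetric: concurrency  # KPA default; "rps", "cpu", "memory" also possible
    scaleTarget: 5            # target concurrent requests per pod
```

Setting minReplicas to 0 enables scale-to-zero, at the cost of a cold start on the first request after an idle period.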
Monitoring and Rollback
KServe exposes Prometheus metrics out of the box. Wire these up to Grafana to get a serving dashboard.
For automated rollback, create a Prometheus alerting rule that fires when the canary's error rate exceeds a threshold, then wire it to a PagerDuty incident and to a script that removes the canary spec from the InferenceService.
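Such a rule might look like the following sketch. The metric name and labels here are assumptions: Knative's queue-proxy request counters and their label names vary by Knative release and Prometheus setup, so check what your installation actually exports and substitute accordingly:

```yaml
groups:
  - name: kserve-canary
    rules:
      - alert: CanaryHighErrorRate
        # Ratio of 5xx responses to all responses over 5 minutes.
        # revision_app_request_count and its labels are assumed names;
        # adjust to the metrics your Knative installation exposes.
        expr: |
          sum(rate(revision_app_request_count{response_code_class="5xx",
                configuration_name="churn-predictor-predictor"}[5m]))
            /
          sum(rate(revision_app_request_count{
                configuration_name="churn-predictor-predictor"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Canary error rate above 5% for churn-predictor"
```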