DevOps & Cloud Orchestration – Bonus Module 2

K8s: Scaling AI Workflows – From Container to Cluster

Take your containerized automations from the previous bonus module and learn to orchestrate them at scale with Kubernetes. Handle thousands of requests, autoscale based on demand, and manage GPU resources efficiently.

Kubernetes basics
Auto-scaling
GPU management
Builds on Docker bonus

🔗 Knowledge graph – From Docker to Kubernetes

  • Docker Bonus → Containerized automations
  • Kubernetes → Orchestrates containers
  • Days 4, 8, 19 → AI APIs, chatbots, qualifiers
  • Day 20 → Optimization at scale
  • Day 21 → Portfolio system on K8s

The common thread: The Docker bonus module taught you to package automations as containers. Kubernetes teaches you to run them as a production-grade system – handling failures, scaling to meet demand, and using expensive GPU resources efficiently [citation:1][citation:7].

🚢 What is Kubernetes? The Orchestrator

📌 Kubernetes (K8s) = Container Orchestration Platform

If Docker is the shipping container, Kubernetes is the massive port that manages thousands of containers – loading, unloading, routing, and ensuring everything runs smoothly. It automates deployment, scaling, and management of containerized applications [citation:8].

For your AI workflows, Kubernetes handles:

  • Scheduling: Decides which server runs your container
  • Auto-healing: Restarts failed containers automatically
  • Auto-scaling: Adds more copies when traffic spikes
  • Load balancing: Distributes requests across containers
  • GPU management: Allocates GPU resources to containers that need them [citation:2][citation:5]
Analogy: Your containerized Day 19 chatbot is a single container. When 10,000 users chat simultaneously, you need 50 copies. Kubernetes spins them up, routes traffic, and replaces any that crash – all automatically [citation:7].

🏗️ Kubernetes Architecture – The 10,000ft View

  • Pod – The smallest deployable unit; runs your container (e.g., one chatbot instance)
  • Node – A worker machine (VM or physical) that runs pods
  • Cluster – A group of nodes; your entire system
  • Deployment – Describes desired state: "run 5 copies of my lead qualifier"
  • Service – A stable network endpoint to access your pods
  • HPA – Horizontal Pod Autoscaler; scales pods based on CPU/memory [citation:2]

🎮 GPU Management – Why K8s is Essential for AI

AI workloads need GPUs. Kubernetes manages them like a pro [citation:5][citation:7].

Without Kubernetes

  • GPUs idle 70% of the time
  • Manual allocation per server
  • No sharing between teams
  • Hard to scale

With Kubernetes

  • GPU sharing via device plugins [citation:2]
  • Schedule pods to GPU nodes
  • Multi-tenancy with quotas [citation:4]
  • Autoscaling GPU clusters
```yaml
# Request a GPU in your pod spec
apiVersion: v1
kind: Pod
metadata:
  name: my-ai-pod
spec:
  containers:
    - name: ai-container
      image: my-registry/lead-qualifier:latest
      resources:
        requests:
          nvidia.com/gpu: 1   # Request 1 GPU
        limits:
          nvidia.com/gpu: 1
```

How it works: The NVIDIA device plugin exposes GPUs to K8s. Your pod requests one, and K8s schedules it on a node with an available GPU [citation:2][citation:9].

📈 Case Study: Scaling the Day 19 Chatbot to 10,000 Users

1. Containerize the chatbot (from the Docker bonus)

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "chatbot_api.py"]
```
2. Create a Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot-deployment
spec:
  replicas: 3   # Start with 3 copies
  selector:
    matchLabels:
      app: chatbot
  template:
    metadata:
      labels:
        app: chatbot
    spec:
      containers:
        - name: chatbot
          image: my-registry/chatbot:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
```
3. Expose with a Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: chatbot-service
spec:
  selector:
    app: chatbot
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000
  type: LoadBalancer   # External access
```
4. Add autoscaling (HPA) [citation:2][citation:10]

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chatbot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chatbot-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale out when average CPU > 70%
```
Real result: Under normal load (50 req/min) → 3 pods. During spike (2000 req/min) → scales to 50 pods automatically. Cost savings: 92% compared to always-on servers [citation:7].
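The scaling behavior above follows the HPA's documented formula – desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch in Python (the function name is ours; the real controller also applies a tolerance and stabilization windows):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_cpu_pct: float,
                         target_cpu_pct: float,
                         min_replicas: int = 3,
                         max_replicas: int = 50) -> int:
    """Simplified HPA math: ceil(current * current/target), clamped to bounds."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# Normal load: 3 pods at ~60% average CPU stay at 3
print(hpa_desired_replicas(3, 60, 70))    # → 3
# Spike: 3 pods pegged at 95% CPU scale out
print(hpa_desired_replicas(3, 95, 70))    # → 5
# Extreme spike: result is clamped at maxReplicas
print(hpa_desired_replicas(3, 3000, 70))  # → 50
```

This is why the case study never exceeds 50 pods: `maxReplicas` caps the formula no matter how far CPU overshoots the target.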

⚙️ AI-Specific Tools on Kubernetes

Beyond basic scaling, Kubernetes hosts powerful AI workflow engines [citation:1].

Kubeflow

End-to-end ML platform on K8s. Manages the entire ML lifecycle: data prep, training, serving [citation:1].

  • Versioned experiments
  • Distributed training
  • Pipeline orchestration

Argo Workflows

Native K8s workflow engine for parallel jobs [citation:1].

  • DAG-based definitions
  • Runs everything as containers
  • Perfect for batch processing

KServe

Model serving platform (formerly KFServing). Handles inference at scale [citation:3][citation:6].

  • Canary rollouts
  • Autoscaling based on request rate
  • GPU-aware scheduling

Kueue

Batch job scheduling system. Queues and prioritizes AI jobs when GPUs are scarce [citation:1][citation:4].

Example: Kubeflow pipeline for Day 8 lead qualifier retraining:

  1. Data prep job (Spark)
  2. Training job (PyTorch on GPU)
  3. Model evaluation
  4. Deploy new version with KServe (canary 10%) [citation:3]
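The same four steps could also be expressed as an Argo Workflows DAG. A hypothetical sketch – the image names are illustrative placeholders, and a real pipeline would add volumes, resource requests, and the KServe deploy step:

```yaml
# Hypothetical Argo Workflows DAG mirroring the retraining steps above
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: qualifier-retrain-
spec:
  entrypoint: retrain
  templates:
    - name: retrain
      dag:
        tasks:
          - name: data-prep
            template: step
            arguments:
              parameters: [{name: image, value: my-registry/data-prep:latest}]
          - name: train            # would run on a GPU node in practice
            dependencies: [data-prep]
            template: step
            arguments:
              parameters: [{name: image, value: my-registry/train:latest}]
          - name: evaluate
            dependencies: [train]
            template: step
            arguments:
              parameters: [{name: image, value: my-registry/evaluate:latest}]
    - name: step                   # generic "run this container" template
      inputs:
        parameters:
          - name: image
      container:
        image: "{{inputs.parameters.image}}"
```

Each task is just a container – exactly the "runs everything as containers" property noted above.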

Event-Driven Autoscaling with KEDA

HPA scales on CPU/memory. KEDA scales on anything: queue length, Kafka messages, cron schedule [citation:3].

```yaml
# Scale chatbot based on RabbitMQ queue length
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: chatbot-scaledobject
spec:
  scaleTargetRef:
    name: chatbot-deployment
  triggers:
    - type: rabbitmq
      metadata:
        queueName: chatbot-requests
        queueLength: "5"   # Target ~5 messages per replica
        # (connection settings, e.g. the RabbitMQ host, omitted here)
```

Perfect for: Batch processing, async workloads, integration with Day 3 Make.com webhooks [citation:3].
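Under the hood, KEDA's queue trigger aims for roughly `queueLength` messages per replica. A simplified sketch of that math (the function name is ours; real KEDA feeds this through the HPA, which adds smoothing):

```python
import math

def keda_desired_replicas(queue_depth: int,
                          target_per_replica: int = 5,
                          min_replicas: int = 0,
                          max_replicas: int = 50) -> int:
    """Approximate KEDA queue scaling: one replica per target_per_replica messages."""
    desired = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))

print(keda_desired_replicas(0))       # → 0 (scale to zero when the queue is empty)
print(keda_desired_replicas(42))      # → 9
print(keda_desired_replicas(10_000))  # → 50 (clamped at maxReplicas)
```

Note the contrast with plain HPA: because KEDA watches the queue rather than CPU, it can scale all the way to zero between batches.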

💰 Cost Optimization – Spot Instances & Scale-to-Zero

Spot instances (60-90% cheaper)

Use preemptible VMs for batch training, non-critical workloads [citation:7].

```yaml
# Node pool with spot/preemptible instances
# K8s automatically drains and reschedules pods when an instance is reclaimed
```

Scale-to-zero with Knative

When no traffic, scale pods to zero. Cold start on first request [citation:3].

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: chatbot-serverless
spec:
  template:
    spec:
      containers:
        - image: my-registry/chatbot:latest
```
Real savings: Always-on 4x GPU instances = $2,400/month. K8s with spot + scaling = $180/month. 92.5% savings [citation:7].
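The quoted savings are simple arithmetic on the figures above:

```python
always_on = 2400  # $/month, 4x GPU instances running 24/7
k8s_spot = 180    # $/month, spot instances + autoscaling (figure from the example)

savings = (always_on - k8s_spot) / always_on
print(f"{savings:.1%}")  # → 92.5%
```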

📊 Observability – Know What's Happening

You can't scale what you can't measure. Standard stack: Prometheus + Grafana [citation:5][citation:7].

```
# Prometheus metrics for AI workloads
- container_gpu_usage: GPU utilization per pod
- model_inference_latency_seconds: Response time
- queue_length: Pending requests
- request_error_total: Failed inferences
```

Grafana Dashboard

  • Requests/sec: 450
  • GPU usage: 78%
  • P95 latency: 320ms
  • Active pods: 12

Alerting

  • Alert if error rate > 1%
  • Alert if queue length > 100
  • Alert if GPU memory > 90%
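In Prometheus these thresholds would be PromQL alerting rules; the evaluation logic can be sketched in plain Python (metric and rule names here are illustrative, not actual Prometheus series):

```python
# Hypothetical alert rules matching the three thresholds above
RULES = {
    "error_rate": lambda m: m["request_error_total"] / m["requests_total"] > 0.01,
    "queue_backlog": lambda m: m["queue_length"] > 100,
    "gpu_memory": lambda m: m["gpu_memory_pct"] > 90,
}

def firing_alerts(metrics: dict) -> list:
    """Return the names of all rules whose condition holds for these metrics."""
    return [name for name, rule in RULES.items() if rule(metrics)]

metrics = {"request_error_total": 9, "requests_total": 1000,
           "queue_length": 250, "gpu_memory_pct": 78}
print(firing_alerts(metrics))  # → ['queue_backlog']
```

Here the 0.9% error rate stays under the 1% threshold, so only the queue-backlog alert fires.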

8 hands-on Kubernetes exercises

☸️ Exercise 1: Setup local cluster

Install Minikube or Kind. Run `kubectl get nodes`.

🐳 Exercise 2: Deploy containerized Day 8

Take your containerized lead qualifier. Create a Deployment with 3 replicas.

🔌 Exercise 3: Expose as Service

Create a LoadBalancer Service. Test accessing your qualifier.

📈 Exercise 4: Configure HPA

Add HorizontalPodAutoscaler to scale at 50% CPU. Generate load to test.

🎮 Exercise 5: GPU pod

If you have a GPU, request `nvidia.com/gpu: 1` in a pod spec. Run `nvidia-smi` inside the container.

⚡ Exercise 6: KEDA scaling

Install KEDA. Create a ScaledObject triggered by a (simulated) RabbitMQ queue.

📊 Exercise 7: Prometheus + Grafana

Install kube-prometheus-stack. View GPU metrics dashboard.

🚀 Exercise 8: Canary deployment

Use KServe or Istio to roll out new version to 10% of traffic.

📄 Client Proposal – Kubernetes-Managed AI System

☸️ Kubernetes-Managed AI System – Proposal

What I'll deliver:

  • ✅ Your AI systems (chatbots, qualifiers, content engines) deployed on Kubernetes
  • ✅ Auto-scaling based on real traffic – pay only for what you use
  • ✅ GPU scheduling – efficient use of expensive hardware
  • ✅ Self-healing infrastructure – automatic recovery from failures
  • ✅ Canary deployments – test new versions safely
  • ✅ Complete monitoring (Prometheus/Grafana) with alerts

Business benefits:

  • Handle 10x traffic without manual intervention
  • Reduce infrastructure costs by 60-90% [citation:7]
  • 99.95% uptime SLA

Investment: $3,500 setup + $500/mo management

Bonus Module 2: You've mastered Kubernetes for AI

✔ Understood Kubernetes architecture and core concepts
✔ Learned GPU management for AI workloads [citation:5]
✔ Scaled Day 19 chatbot with HPA and KEDA [citation:7]
✔ Explored AI tools: Kubeflow, KServe, Argo [citation:1]
✔ Implemented cost optimization (spot, scale-to-zero) [citation:3]
✔ Set up observability with Prometheus/Grafana
✔ Ready to deploy production-grade AI systems
