K8s: Scaling AI Workflows – From Container to Cluster
Take your containerized automations from the previous bonus module and learn to orchestrate them at scale with Kubernetes. Handle thousands of requests, autoscale based on demand, and manage GPU resources efficiently.
🔗 Knowledge graph – From Docker to Kubernetes
- Docker Bonus → containerized automations
- Kubernetes → orchestrates containers
- Days 4/8/19 → AI APIs, chatbots, qualifiers
- Day 20 → optimization at scale
- Day 21 → portfolio system on K8s
🚢 What is Kubernetes? The Orchestrator
📌 Kubernetes (K8s) = Container Orchestration Platform
If Docker is the shipping container, Kubernetes is the massive port that manages thousands of containers – loading, unloading, routing, and ensuring everything runs smoothly. It automates the deployment, scaling, and management of containerized applications.
For your AI workflows, K8s handles:
- Scheduling: Decides which server runs your container
- Auto-healing: Restarts failed containers automatically
- Auto-scaling: Adds more copies when traffic spikes
- Load balancing: Distributes requests across containers
- GPU management: Allocates GPU resources to containers that need them
🏗️ Kubernetes Architecture – The 10,000ft View
- **Pod** – smallest unit; runs your container (e.g., one chatbot instance)
- **Node** – a worker machine (VM or physical) that runs pods
- **Cluster** – group of nodes; your entire system
- **Deployment** – describes desired state: "run 5 copies of my lead qualifier"
- **Service** – stable endpoint to access your pods
- **HPA** – Horizontal Pod Autoscaler; scales pods based on CPU/memory
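These pieces come together in a manifest. A minimal sketch of a Deployment plus Service for a lead qualifier might look like this (the `lead-qualifier` name, image tag, and ports are placeholders, not from the course repo):

```yaml
# Deployment: desired state – three replicas of the containerized app
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lead-qualifier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: lead-qualifier
  template:
    metadata:
      labels:
        app: lead-qualifier
    spec:
      containers:
        - name: app
          image: lead-qualifier:1.0   # placeholder image
          ports:
            - containerPort: 8000
---
# Service: stable endpoint that load-balances across the pods
apiVersion: v1
kind: Service
metadata:
  name: lead-qualifier
spec:
  selector:
    app: lead-qualifier
  ports:
    - port: 80
      targetPort: 8000
```

Apply both with `kubectl apply -f` and K8s reconciles the cluster toward this desired state.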
🎮 GPU Management – Why K8s is Essential for AI
AI workloads need GPUs. Kubernetes manages them like a pro.
Without Kubernetes
- GPUs idle 70% of the time
- Manual allocation per server
- No sharing between teams
- Hard to scale
With Kubernetes
- GPU sharing via device plugins
- Schedule pods to GPU nodes
- Multi-tenancy with quotas
- Autoscaling GPU clusters
How it works: the NVIDIA device plugin exposes GPUs to K8s. Your pod requests one, and K8s schedules it onto a node with an available GPU.
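As a sketch, a pod requesting a single GPU might look like this (assumes the NVIDIA device plugin is installed on the cluster; the CUDA image tag is only an example):

```yaml
# Pod that requests one NVIDIA GPU and prints the GPU status
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # K8s schedules this pod onto a node with a free GPU
```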
📈 Case Study: Scaling the Day 19 Chatbot to 10,000 Users
1. Containerize the chatbot (from the Docker bonus)
2. Create a Kubernetes Deployment
3. Expose it with a Service
4. Add autoscaling (HPA)
⚙️ AI-Specific Tools on Kubernetes
Beyond basic scaling, Kubernetes hosts powerful AI workflow engines.
Kubeflow
End-to-end ML platform on K8s. Manages the entire ML lifecycle: data prep, training, serving.
- Versioned experiments
- Distributed training
- Pipeline orchestration
Argo Workflows
Native K8s workflow engine for parallel jobs.
- DAG-based definitions
- Runs everything as containers
- Perfect for batch processing
KServe
Model serving platform (formerly KFServing). Handles inference at scale.
- Canary rollouts
- Autoscaling based on request rate
- GPU-aware scheduling
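A hedged sketch of a KServe InferenceService (the name, model format, and `storageUri` bucket path are placeholders):

```yaml
# KServe InferenceService – serves a model from object storage
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: chatbot-model
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn            # example format; swap for your framework
      storageUri: gs://my-models/chatbot   # placeholder bucket path
      resources:
        limits:
          nvidia.com/gpu: 1      # GPU-aware scheduling
```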
Kueue
Batch job scheduling system. Queues and prioritizes AI jobs when GPUs are scarce.
⚡ Event-Driven Autoscaling with KEDA
HPA scales on CPU/memory. KEDA scales on anything: queue length, Kafka messages, cron schedule.
Perfect for: Batch processing, async workloads, integration with Day 3 Make.com webhooks.
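A KEDA ScaledObject for a RabbitMQ-backed workload might look like this sketch (the target Deployment, queue name, and auth reference are assumptions):

```yaml
# KEDA ScaledObject – scales a Deployment on RabbitMQ queue length
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qualifier-scaler
spec:
  scaleTargetRef:
    name: lead-qualifier        # Deployment to scale (placeholder)
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: leads        # placeholder queue name
        mode: QueueLength
        value: "10"             # target messages per replica
      authenticationRef:
        name: rabbitmq-auth     # TriggerAuthentication holding the connection string
```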
💰 Cost Optimization – Spot Instances & Scale-to-Zero
Spot instances (60-90% cheaper)
Use preemptible VMs for batch training and non-critical workloads.
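Steering pods onto spot nodes is typically done with node selectors and tolerations. The label and taint keys below are GKE-style examples and vary by cloud provider:

```yaml
# Fragment of a pod spec targeting spot/preemptible nodes (GKE-style keys)
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule   # lets the pod land on tainted spot nodes
```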
Scale-to-zero with Knative
When there is no traffic, pods scale to zero; the first request triggers a cold start.
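A minimal Knative Service with scale-to-zero enabled might look like this (the service name and image are placeholders):

```yaml
# Knative Service – scales to zero when idle, back up on traffic
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: chatbot
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"    # allow scale-to-zero
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
        - image: chatbot:1.0   # placeholder image
```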
📊 Observability – Know What's Happening
You can't scale what you can't measure. Standard stack: Prometheus + Grafana.
Grafana Dashboard
Requests/sec: 450
GPU usage: 78%
P95 latency: 320ms
Active pods: 12
Alerting
Alert if error rate > 1%
Alert if queue length > 100
Alert if GPU memory > 90%
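With kube-prometheus-stack, the first alert above could be expressed as a PrometheusRule sketch (the `http_requests_total` metric name assumes your app exposes standard HTTP instrumentation):

```yaml
# PrometheusRule – fire when the 5xx error rate exceeds 1% for 5 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-workload-alerts
spec:
  groups:
    - name: ai-workloads
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
```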
8 hands-on Kubernetes exercises
☸️ Exercise 1: Setup local cluster
Install Minikube or Kind. Run `kubectl get nodes`.
🐳 Exercise 2: Deploy containerized Day 8
Take your containerized lead qualifier. Create a Deployment with 3 replicas.
🔌 Exercise 3: Expose as Service
Create a LoadBalancer Service. Test accessing your qualifier.
📈 Exercise 4: Configure HPA
Add HorizontalPodAutoscaler to scale at 50% CPU. Generate load to test.
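A sketch of the HPA for this exercise (assumes the Deployment from Exercise 2 is named `lead-qualifier`):

```yaml
# HPA – scale between 3 and 10 replicas, targeting 50% average CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lead-qualifier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lead-qualifier
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```

Generate load (e.g., with `kubectl run` and a loop of requests) and watch it scale with `kubectl get hpa -w`.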
🎮 Exercise 5: GPU pod
If you have a GPU, request `nvidia.com/gpu: 1` in a pod spec. Run `nvidia-smi` inside.
⚡ Exercise 6: KEDA scaling
Install KEDA. Create a ScaledObject driven by RabbitMQ queue length (simulated).
📊 Exercise 7: Prometheus + Grafana
Install kube-prometheus-stack. View GPU metrics dashboard.
🚀 Exercise 8: Canary deployment
Use KServe or Istio to roll out new version to 10% of traffic.
📄 Client Proposal – Kubernetes-Managed AI System
☸️ Kubernetes-Managed AI System – Proposal
What I'll deliver:
- ✅ Your AI systems (chatbots, qualifiers, content engines) deployed on Kubernetes
- ✅ Auto-scaling based on real traffic – pay only for what you use
- ✅ GPU scheduling – efficient use of expensive hardware
- ✅ Self-healing infrastructure – automatic recovery from failures
- ✅ Canary deployments – test new versions safely
- ✅ Complete monitoring (Prometheus/Grafana) with alerts
Business benefits:
- Handle 10x traffic without manual intervention
- Reduce infrastructure costs by up to 60-90% with spot instances and scale-to-zero
- 99.95% uptime SLA
Investment: $3,500 setup + $500/mo management
📚 Resources
Bonus Module 2: You've mastered Kubernetes for AI
✔ Understood Kubernetes architecture and core concepts
✔ Learned GPU management for AI workloads
✔ Scaled Day 19 chatbot with HPA and KEDA
✔ Explored AI tools: Kubeflow, KServe, Argo
✔ Implemented cost optimization (spot, scale-to-zero)
✔ Set up observability with Prometheus/Grafana
✔ Ready to deploy production-grade AI systems