AI Model Deployment: From Prototype to Production
Only 22% of ML models make it to production (Gartner 2023). This tutorial covers deployment patterns, serving architectures, and optimization techniques for successful AI model deployment across environments.
[Figure: Model Deployment Challenges (2023)]
1. Deployment Architectures
Patterns Comparison:
| Architecture | Throughput | Latency | Use Case |
|---|---|---|---|
| Real-time API | Medium | Low | User-facing apps |
| Batch Processing | High | High | ETL pipelines |
| Edge Deployment | Low | Ultra-low | IoT devices |
| Streaming | Variable | Medium | Real-time analytics |
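The throughput/latency trade-off in the table can be made concrete with a little arithmetic. The sketch below assumes a hypothetical cost model (a fixed per-call overhead plus a per-item cost, both invented for illustration rather than measured) and shows why batching tends to raise throughput at the price of per-request latency.

```python
def batch_stats(batch_size, overhead_ms=8.0, per_item_ms=2.0):
    """Return (latency_ms, throughput_per_s) for one batch.

    overhead_ms and per_item_ms are hypothetical costs, not measurements.
    """
    latency_ms = overhead_ms + per_item_ms * batch_size
    throughput_per_s = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_per_s


if __name__ == "__main__":
    for size in (1, 8, 64):
        latency, throughput = batch_stats(size)
        print(f"batch={size:>3}  latency={latency:6.1f} ms  throughput={throughput:7.1f} req/s")
```

Under these assumed costs, a batch of 1 finishes in 10 ms at 100 req/s, while a batch of 64 takes 136 ms but serves roughly 471 req/s. That is the same shape as the Real-time API and Batch Processing rows.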
Key Components:
- Model Serving: TorchServe, TF Serving (low-latency inference)
- Orchestration: Kubernetes, Docker (scalable deployment)
- Monitoring: Prometheus, Evidently (performance tracking)
2. Cloud Deployment
Major Cloud Services:
- AWS: SageMaker, Lambda, ECS
- Azure: ML Studio, AKS, Functions
- GCP: Vertex AI, Cloud Run, GKE
SageMaker Deployment Example:
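import boto3
import sagemaker
from sagemaker.deserializers import JSONDeserializer
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
# Package model artifacts
model = PyTorchModel(
    model_data='s3://bucket/model.tar.gz',
    role=sagemaker.get_execution_role(),
    framework_version='1.12.0',
    py_version='py38',
    entry_point='inference.py',  # the parameter is entry_point, not entry_script
    source_dir='src',
)
# Deploy endpoint (JSON in/out so a dict payload can be sent)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name='fraud-detection-v1',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
# Auto-scaling config: deploy() takes no auto-scaling arguments, so register the
# endpoint variant with Application Auto Scaling instead
autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/fraud-detection-v1/variant/AllTraffic'
dimension = 'sagemaker:variant:DesiredInstanceCount'
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker', ResourceId=resource_id,
    ScalableDimension=dimension, MinCapacity=1, MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName='fraud-detection-v1-scaling',
    ServiceNamespace='sagemaker', ResourceId=resource_id,
    ScalableDimension=dimension, PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        # 70 invocations per instance per minute; a CPU-based target would need
        # a CustomizedMetricSpecification instead
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
    },
)
# Invoke endpoint (transaction_data is your request payload)
response = predictor.predict({'transaction': transaction_data})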
Optimization Techniques:
Model Quantization
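import torch
# PyTorch dynamic quantization: Linear weights are stored as INT8 and
# activations are quantized on the fly at inference time.
# model is assumed to be a trained torch.nn.Module.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), 'quantized_model.pth')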
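To show what converting FP32 weights to 8-bit integers involves, here is a minimal pure-Python sketch of affine (scale plus zero-point) quantization. It illustrates the general idea and is not PyTorch's internal implementation. The function names are hypothetical, and it uses unsigned 8-bit codes for simplicity.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integer codes with an affine (scale + zero-point) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0.0 so that zero maps to an exact code
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
    codes, scale, zero_point = quantize(weights)
    restored = dequantize(codes, scale, zero_point)
    for w, q, r in zip(weights, codes, restored):
        print(f"{w:+.4f} -> code {q:3d} -> {r:+.4f}")
```

Each restored value differs from the original by at most about one quantization step (`scale`). That rounding error is the accuracy cost traded for a model roughly four times smaller.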
Container Optimization
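# Dockerfile for lightweight serving
FROM python:3.9-slim
# TorchServe runs on the JVM, which the slim image does not include
RUN apt-get update && apt-get install -y --no-install-recommends default-jre-headless \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir torch torchserve torch-model-archiver
WORKDIR /home/model-server
COPY model-store /home/model-server/model-store
# --foreground keeps TorchServe as the container's main process
CMD ["torchserve", "--start", "--foreground", "--model-store", "model-store"]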
Deployment Tool Comparison
| Tool | Best For | Latency | Max Model Size |
|---|---|---|---|
| TorchServe | PyTorch models | 5-15ms | 2GB |
| TF Serving | TensorFlow | 3-10ms | 1.5GB |
| Triton | Multi-framework | 2-8ms | 10GB+ |
| BentoML | Custom pipelines | 10-20ms | 5GB |
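Latency ranges like those in the table depend heavily on the model, hardware, and batch size, so it is worth measuring your own setup. The following is a minimal timing harness using only the standard library. `fake_predict` is a hypothetical stand-in for whatever client call you actually make, such as an HTTP request to a serving endpoint.

```python
import statistics
import time


def benchmark(predict, payload, warmup=5, runs=50):
    """Time predict(payload) and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm caches and connections before measuring
        predict(payload)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "mean": statistics.fmean(samples),
    }


def fake_predict(payload):
    """Hypothetical stand-in for a real inference call."""
    return sum(x * x for x in payload)


if __name__ == "__main__":
    print(benchmark(fake_predict, list(range(1000))))
```

Reporting p95 and p99 alongside the mean matters for user-facing services, because tail latency usually drives the perceived experience.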
3. Edge & Mobile Deployment
Optimization Pipeline:
- Pruning: Remove redundant neurons
- Quantization: FP32 → INT8 weights
- Compilation: Hardware-specific optimization
- Deployment: On-device inference
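The pruning step at the start of this pipeline can be sketched in a few lines. The example below uses unstructured magnitude pruning (zeroing the individual weights closest to zero), a simpler variant than removing whole neurons. The function name and the sample weights are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune weights closest to zero
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
    print(prune_by_magnitude(layer, sparsity=0.5))
    # prints [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

In practice a pruned model is usually fine-tuned afterwards to recover accuracy, and the zeros only save time or memory when the target runtime can exploit sparsity.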
TensorFlow Lite Implementation:
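import tensorflow as tf
def representative_dataset():
    # Yield real input samples (float32, with batch dimension) for INT8 calibration;
    # calibration_samples is a placeholder for your own data
    for sample in calibration_samples:
        yield [sample]
# Convert to TFLite with full-integer quantization
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset  # required for INT8-only ops
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
Deploy to Android (Java):
// tflite_model_buffer is a ByteBuffer holding the converted model
try (Interpreter interpreter = new Interpreter(tflite_model_buffer)) {
    interpreter.run(input, output);
}
Core ML for iOS:
import coremltools as ct
# coremltools converts from the original TensorFlow or PyTorch model,
# not from a TFLite flatbuffer, so pass the SavedModel directory
mlmodel = ct.convert(saved_model_dir,
                     source="tensorflow",
                     convert_to="mlprogram",
                     compute_units=ct.ComputeUnit.ALL)
mlmodel.save("model.mlpackage")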
Performance Benchmarks (chart data not preserved): iPhone 14, Raspberry Pi 4, Jetson Nano
4. Monitoring & Maintenance
Monitoring Stack:
- Data Drift: Evidently, Whylogs
- Model Performance: Fiddler, Arize
- Infrastructure: Prometheus, Grafana
- Business Metrics: Custom dashboards
Evidently Implementation:
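from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import DataDriftTable
# Written against the Report API of Evidently 0.4.x;
# ref_df and current_df are pandas DataFrames with the same columns
data_drift_report = Report(metrics=[DataDriftTable()])
column_mapping = ColumnMapping(
    numerical_features=['age', 'income'],
    categorical_features=['gender', 'city']
)
data_drift_report.run(
    reference_data=ref_df,
    current_data=current_df,
    column_mapping=column_mapping
)
# Generate alert if drift detected (send_alert is a placeholder for your notifier).
# Results are read from the report's dictionary output, not from the metric object.
drift_result = data_drift_report.as_dict()['metrics'][0]['result']
if drift_result['dataset_drift']:
    send_alert("Significant data drift detected!")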
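To make the idea of a drift check less of a black box, here is a standard-library sketch of one common drift score, the Population Stability Index (PSI). This is a swapped-in illustration and not what Evidently computes by default. The equal-width binning, the `eps` floor, and the sample data are assumptions chosen for brevity.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples (equal-width bins)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant reference data

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
        return [max(c / len(sample), eps) for c in counts]  # eps avoids log(0)

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    reference = [i / 100 for i in range(1000)]  # roughly uniform on [0, 10)
    shifted = [x + 3.0 for x in reference]      # same shape, shifted by 3
    print(f"no drift: {psi(reference, reference):.3f}")
    print(f"shifted:  {psi(reference, shifted):.3f}")
```

A frequently quoted rule of thumb treats PSI above roughly 0.2 as significant drift, but the threshold should be tuned per feature rather than taken as given.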
Canary Deployment Pattern:
1. Route 5% Traffic: New model receives small percentage
2. Compare Metrics: Accuracy, latency, business KPIs
3. Full Rollout: If performance meets thresholds
4. Rollback: Automatic if errors exceed limit
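The canary steps above can be sketched as two small functions: one that splits traffic and one that turns the compared metrics into a decision. The thresholds, metric names, and the extra "hold" outcome are hypothetical choices for illustration. A real rollout would usually do the routing in a load balancer or service mesh instead of application code.

```python
import random


def route(canary_share, rng=random):
    """Pick which model version serves a request."""
    return "canary" if rng.random() < canary_share else "stable"


def decide(stable, canary, error_limit=0.05, latency_slack=1.10):
    """Return 'rollback', 'promote', or 'hold' from per-version metrics.

    Each metrics dict holds 'error_rate' (fraction) and 'p95_ms'.
    error_limit and latency_slack are hypothetical thresholds.
    """
    if canary["error_rate"] > error_limit:
        return "rollback"  # errors exceed the limit: roll back automatically
    meets_thresholds = (
        canary["error_rate"] <= stable["error_rate"]
        and canary["p95_ms"] <= stable["p95_ms"] * latency_slack
    )
    return "promote" if meets_thresholds else "hold"


if __name__ == "__main__":
    rng = random.Random(0)
    served = [route(0.05, rng) for _ in range(10_000)]
    print("canary share:", served.count("canary") / len(served))
    print(decide({"error_rate": 0.01, "p95_ms": 100},
                 {"error_rate": 0.01, "p95_ms": 105}))
```

Keeping the decision a pure function of the metrics makes the rollout policy easy to unit-test before it is wired to live traffic.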
Conclusion & Next Steps
Effective model deployment requires infrastructure, monitoring, and optimization:
- Choose architecture based on latency/throughput needs
- Optimize models for target hardware
- Implement comprehensive monitoring
- Use progressive rollout strategies