
AI Model Deployment: From Prototype to Production

Only 22% of ML models make it to production (Gartner, 2023). This tutorial covers deployment patterns, serving architectures, and optimization techniques for taking models from prototype to production across cloud, edge, and mobile environments.

Model Deployment Challenges (2023):

  • Latency (32%)
  • Scalability (28%)
  • Monitoring (25%)
  • Security (15%)

1. Deployment Architectures

Patterns Comparison:

Architecture     | Throughput | Latency   | Use Case
Real-time API    | Medium     | Low       | User-facing apps
Batch Processing | High       | High      | ETL pipelines
Edge Deployment  | Low        | Ultra-low | IoT devices
Streaming        | Variable   | Medium    | Real-time analytics
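
As a concrete illustration of the real-time API pattern above, the sketch below wraps a model in a small HTTP endpoint. It assumes FastAPI and a TorchScript artifact named model.pt; both the framework choice and the file name are illustrative, not mandated by the comparison:

from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt")  # hypothetical TorchScript artifact
model.eval()

class PredictRequest(BaseModel):
    features: List[float]  # flat feature vector for a single example

@app.post("/predict")
def predict(req: PredictRequest):
    x = torch.tensor([req.features], dtype=torch.float32)
    with torch.no_grad():
        score = model(x).squeeze().item()
    return {"score": score}

Served with uvicorn behind a load balancer, this is the low-latency request/response loop; the batch and streaming patterns replace it with a scheduler or a message consumer.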

Key Components:

  • Model Serving: TorchServe, TF Serving (low-latency inference)
  • Orchestration: Kubernetes, Docker (scalable deployment)
  • Monitoring: Prometheus, Evidently (performance tracking)

2. Cloud Deployment

Major Cloud Services:

  • AWS: SageMaker, Lambda, ECS
  • Azure: ML Studio, AKS, Functions
  • GCP: Vertex AI, Cloud Run, GKE

SageMaker Deployment Example:


from sagemaker.pytorch import PyTorchModel
import sagemaker

# Package model artifacts
model = PyTorchModel(
    model_data='s3://bucket/model.tar.gz',
    role=sagemaker.get_execution_role(),
    framework_version='1.12.0',
    py_version='py38',
    entry_point='inference.py',
    source_dir='src'
)

# Deploy a real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name='fraud-detection-v1'
)

# Invoke endpoint
response = predictor.predict(data={'transaction': transaction_data})
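
Auto-scaling is not an argument of model.deploy(); it is attached to the endpoint variant afterwards through Application Auto Scaling. A minimal boto3 sketch for the 1-4 instance range above follows; the predefined invocations-per-instance metric with a target of 70 is used here because a 70% CPU target would require a customized CloudWatch metric:

import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/fraud-detection-v1/variant/AllTraffic'

# Register the endpoint variant as a scalable target (1-4 instances)
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4
)

# Target-tracking policy on the predefined invocations-per-instance metric
autoscaling.put_scaling_policy(
    PolicyName='fraud-detection-v1-scaling',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        }
    }
)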
        

Optimization Techniques:

Model Quantization


import torch

# PyTorch dynamic quantization: INT8 weights for Linear layers,
# activations quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), 'quantized_model.pth')
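
A quick sanity check is to compare the serialized size of the original and quantized models; a small sketch, assuming model and quantized_model from the snippet above:

import os
import torch

# Serialize both versions and compare on-disk size (a rough proxy for memory footprint)
torch.save(model.state_dict(), 'fp32_model.pth')
torch.save(quantized_model.state_dict(), 'quantized_model.pth')

for path in ('fp32_model.pth', 'quantized_model.pth'):
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")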

Container Optimization


# Dockerfile for lightweight serving
FROM python:3.9-slim
# TorchServe needs a Java runtime in addition to the Python packages
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-17-jre-headless && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir torch torchserve torch-model-archiver
COPY model-store /home/model-server/model-store
# Run in the foreground so the container keeps serving instead of exiting
CMD ["torchserve", "--start", "--foreground", "--model-store", "/home/model-server/model-store"]

Deployment Tool Comparison

Tool       | Best For         | Latency | Max Model Size
TorchServe | PyTorch models   | 5-15ms  | 2GB
TF Serving | TensorFlow       | 3-10ms  | 1.5GB
Triton     | Multi-framework  | 2-8ms   | 10GB+
BentoML    | Custom pipelines | 10-20ms | 5GB

3. Edge & Mobile Deployment

Optimization Pipeline:

  1. Pruning: Remove redundant weights and neurons (see the sketch after this list)
  2. Quantization: FP32 → INT8 weights
  3. Compilation: Hardware-specific optimization
  4. Deployment: On-device inference
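
Step 1 can be sketched with PyTorch's built-in pruning utilities; model stands for the trained network, and the layer type and 30% sparsity are illustrative choices, not fixed recommendations:

import torch
import torch.nn.utils.prune as prune

# L1-unstructured pruning: zero out the 30% smallest-magnitude weights in each Linear layer
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.3)
        prune.remove(module, 'weight')  # fold the pruning mask into the weights permanently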

TensorFlow Lite Implementation:


import tensorflow as tf

# Convert to TFLite with full-integer quantization
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen  # generator of calibration samples, required for full INT8
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

// Deploy to Android (Java, TFLite Interpreter API)
try (Interpreter interpreter = new Interpreter(tflite_model_buffer)) {
  interpreter.run(input, output);
}

# CoreML for iOS: convert from the original SavedModel (coremltools does not accept TFLite flatbuffers)
import coremltools as ct
mlmodel = ct.convert(saved_model_dir,
                     convert_to="mlprogram",
                     compute_units=ct.ComputeUnit.ALL)
mlmodel.save("model.mlpackage")
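
Before shipping the .tflite artifact, it is worth a smoke test on a workstation with the Python TFLite interpreter; the zero-filled input below is only a shape/dtype check, not a real sample:

import numpy as np
import tensorflow as tf

# Load the converted flatbuffer and run a single inference
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# The fully quantized model expects uint8 inputs with the declared shape
dummy_input = np.zeros(input_details[0]['shape'], dtype=np.uint8)
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]['index']))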
        

Performance Benchmarks:

  • iPhone 14: 90 FPS
  • Raspberry Pi 4: 45 FPS
  • Jetson Nano: 75 FPS

4. Monitoring & Maintenance

Monitoring Stack:

  • Data Drift: Evidently, Whylogs
  • Model Performance: Fiddler, Arize
  • Infrastructure: Prometheus, Grafana (see the sketch after this list)
  • Business Metrics: Custom dashboards
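
For the infrastructure layer, a common pattern is to expose request counts and latency histograms directly from the serving process so Prometheus can scrape them; a minimal sketch with the prometheus_client package (metric names and the port are illustrative):

import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; Prometheus scrapes /metrics and Grafana plots them
PREDICTIONS = Counter('model_predictions_total', 'Total prediction requests')
LATENCY = Histogram('model_inference_seconds', 'Inference latency in seconds')

start_http_server(8000)  # expose /metrics on port 8000

def predict_with_metrics(model, features):
    PREDICTIONS.inc()
    start = time.perf_counter()
    prediction = model(features)  # placeholder for the actual inference call
    LATENCY.observe(time.perf_counter() - start)
    return prediction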

Evidently Implementation:


from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import DataDriftTable

data_drift_report = Report(metrics=[DataDriftTable()])
column_mapping = ColumnMapping(
    numerical_features=['age', 'income'],
    categorical_features=['gender', 'city']
)

data_drift_report.run(
    reference_data=ref_df,
    current_data=current_df,
    column_mapping=column_mapping
)

# Generate alert if dataset-level drift is detected (send_alert is a placeholder)
drift_result = data_drift_report.as_dict()['metrics'][0]['result']
if drift_result['dataset_drift']:
    send_alert("Significant data drift detected!")
        

Canary Deployment Pattern:

  1. Route 5% Traffic: the new model receives a small share of live requests (see the routing sketch below)
  2. Compare Metrics: accuracy, latency, and business KPIs versus the current model
  3. Full Rollout: promote the new model if performance meets the thresholds
  4. Rollback: roll back automatically if errors exceed the limit
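
In its simplest form, step 1 is just a weighted random split in the serving layer; a minimal sketch (the 5% share and the model handles are placeholders):

import random

CANARY_SHARE = 0.05  # route 5% of traffic to the new model

def route_request(features, stable_model, canary_model):
    """Randomly send a small share of requests to the canary model."""
    if random.random() < CANARY_SHARE:
        return canary_model(features), 'canary'
    return stable_model(features), 'stable'

# Logging the returned variant label alongside accuracy, latency, and business KPIs
# makes the comparison (step 2) and automatic rollback (step 4) easy to automate.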

top-home