Deploying Models: Endpoints in Production
Your model is trained, validated, and registered. Now ship it. Learn to deploy to real-time and batch endpoints, test them, and implement progressive rollout with safe rollback.
Getting models into production
Training a model is like building a kitchen. Deploying it is opening the restaurant.
Your model needs to serve predictions to real users, reliably, at speed. Azure ML gives you two “service styles”:
- Real-time endpoints — like a drive-through: instant responses, one request at a time
- Batch endpoints — like a catering order: process a big batch overnight, results ready in the morning
You also need a way to update the restaurant without closing it — that’s progressive rollout (shift traffic gradually to the new model).
Real-time vs batch endpoints
| Endpoint Type | Latency | Input | Scaling | Best For |
|---|---|---|---|---|
| Managed Online Endpoint | Milliseconds | Single request (JSON) | Auto-scales 1-N instances | APIs, web apps, real-time decisions |
| Batch Endpoint | Minutes to hours | Large dataset (files/folders) | Cluster scales to process data | Nightly scoring, bulk processing, reports |
Deploying a real-time endpoint
# Step 1: Create the endpoint (the front door)
az ml online-endpoint create \
--name churn-predictor \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod \
--auth-mode key
# Step 2: Create a deployment (the model behind the door)
az ml online-deployment create \
--name blue \
--endpoint-name churn-predictor \
--model azureml:churn-predictor:3 \
--instance-type Standard_DS3_v2 \
--instance-count 2 \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod
# Step 3: Route all traffic to this deployment
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=100" \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod
What’s happening:
- Step 1 creates an endpoint — a stable URL that clients call. `--auth-mode key` means clients authenticate with an API key
- Step 2 creates a deployment named “blue” behind the endpoint — it runs model v3 on 2 instances
- Step 3 routes 100% of traffic to the “blue” deployment
Key concept: Endpoints vs deployments. An endpoint is the URL. A deployment is a specific model version running on specific compute behind that URL. One endpoint can have multiple deployments (for A/B testing).
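The endpoint-versus-deployment split can be pictured as a tiny data structure (a toy sketch for intuition, not an Azure ML API):

```python
# One endpoint (the stable URL), one or more deployments behind it,
# and a traffic split across them — percentages must sum to 100.
endpoint = {
    "name": "churn-predictor",
    "deployments": {
        "blue": {"model": "azureml:churn-predictor:3", "instances": 2},
    },
    "traffic": {"blue": 100},
}

assert sum(endpoint["traffic"].values()) == 100
```

Adding a second deployment later (for A/B testing or rollout) just adds an entry to `deployments` and re-splits `traffic`; the URL never changes.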
Testing endpoints
# Test with sample data
az ml online-endpoint invoke \
--name churn-predictor \
--request-file sample-request.json
# Check endpoint health
az ml online-endpoint show \
--name churn-predictor \
--output table
// sample-request.json
{
"input_data": {
"columns": ["tenure", "monthly_charges", "support_tickets"],
"data": [[24, 65.50, 3]]
}
}
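From client code, the same call is a plain HTTPS POST. Here is a sketch of assembling that request — the scoring URL and key are placeholders (fetch the real values from the endpoint's details in the CLI or the studio):

```python
import json

def build_scoring_request(scoring_url: str, api_key: str, columns, rows):
    """Assemble headers and body for a key-authenticated online endpoint.

    With --auth-mode key, the key is sent as a Bearer token; the body
    uses the same shape as sample-request.json.
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"input_data": {"columns": columns, "data": rows}})
    return scoring_url, headers, body

# Placeholders only — a real client would pass these to requests.post(...):
url, headers, body = build_scoring_request(
    "https://<endpoint-name>.<region>.inference.ml.azure.com/score",
    "<api-key>",
    ["tenure", "monthly_charges", "support_tickets"],
    [[24, 65.50, 3]],
)
```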
Exam tip: Troubleshooting endpoints
Common deployment failures and their fixes:
- Model loading errors — the environment is missing a dependency. Fix: check that `conda.yaml` matches the training environment
- Scoring script errors — `score.py`’s `init()` or `run()` fails. Fix: inspect the logs with `az ml online-deployment get-logs`
- Timeout errors — model inference is too slow. Fix: use a larger instance type or optimise the scoring function
- Out of memory — model too large for the instance. Fix: use a larger VM SKU
Always check deployment logs first: `az ml online-deployment get-logs --name blue --endpoint-name churn-predictor`
Progressive rollout (blue-green deployment)
Deploy a new model version safely by shifting traffic gradually:
# Step 1: Deploy new model as "green" (0% traffic initially)
az ml online-deployment create \
--name green \
--endpoint-name churn-predictor \
--model azureml:churn-predictor:4 \
--instance-type Standard_DS3_v2 \
--instance-count 2
# Step 2: Send 10% of traffic to green (canary)
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=90 green=10"
# Step 3: Monitor green metrics...
# If green looks good, increase traffic
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=50 green=50"
# Step 4: Full rollout
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=0 green=100"
# Step 5: Remove old deployment
az ml online-deployment delete \
--name blue --endpoint-name churn-predictor --yes
What’s happening:
- Blue = current production model (v3)
- Green = new model (v4), deployed alongside blue
- Traffic shifts: 0% → 10% → 50% → 100% — monitoring at each stage
- If green performs badly at 10%, roll back with a single command: `az ml online-endpoint update --name churn-predictor --traffic "blue=100 green=0"`
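The traffic splits used above can be sanity-checked client-side. A small hypothetical helper that validates a split and renders the CLI's `name=pct` format:

```python
def set_traffic(allocations: dict) -> str:
    """Validate a traffic split and render it in the CLI's key=value format.

    Azure ML rejects splits that don't sum to 100; checking client-side
    catches typos like "blue=90 green=20" before the update call.
    """
    total = sum(allocations.values())
    if total != 100:
        raise ValueError(f"traffic must sum to 100, got {total}")
    return " ".join(f"{name}={pct}" for name, pct in allocations.items())

# The canary stages from the rollout above:
for stage in [{"blue": 90, "green": 10},
              {"blue": 50, "green": 50},
              {"blue": 0, "green": 100}]:
    print(set_traffic(stage))
```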
Scenario: Kai ships model v4 without downtime
NeuralSpark’s churn model v4 has better accuracy, but Kai wants zero risk of customer-facing issues:
- Deploys v4 as “green” with 0% traffic
- Tests green with synthetic data (invoke with test requests)
- Shifts 10% of live traffic to green — monitors for 2 hours
- Green’s latency and accuracy look good — shifts to 50%
- After 24 hours at 50%, shifts to 100%
- Removes the old blue deployment
Total downtime: zero. If anything went wrong, one CLI command rolls back to blue.
Batch endpoints
For processing large datasets (not real-time):
# Create a batch endpoint
az ml batch-endpoint create \
--name churn-batch-scorer \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod
# Create a deployment
az ml batch-deployment create \
--name default \
--endpoint-name churn-batch-scorer \
--model azureml:churn-predictor:3 \
--compute azureml:cpu-cluster \
--mini-batch-size 100 \
--max-concurrency 4
# Invoke with a dataset
az ml batch-endpoint invoke \
--name churn-batch-scorer \
--input azureml:monthly-customers:latest
What’s happening:
- `--mini-batch-size 100` — each mini-batch processes 100 records
- `--max-concurrency 4` — up to 4 mini-batches are scored in parallel
- The input is a data asset (`azureml:monthly-customers:latest`) — the endpoint processes the entire dataset and writes results to storage
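These two settings decide how a run is divided up. A quick back-of-envelope helper (hypothetical) for, say, a 2-million-record job:

```python
import math

def batch_plan(total_records: int, mini_batch_size: int, max_concurrency: int):
    """How many mini-batches a run produces, and how many sequential waves
    of parallel work those mini-batches take at the given concurrency."""
    mini_batches = math.ceil(total_records / mini_batch_size)
    waves = math.ceil(mini_batches / max_concurrency)
    return mini_batches, waves

# 2 million records with the settings above: 20,000 mini-batches,
# scored 4 at a time, so 5,000 sequential waves.
print(batch_plan(2_000_000, 100, 4))  # → (20000, 5000)
```

A larger `--mini-batch-size` or higher concurrency shortens the run, at the cost of more memory per mini-batch and more cluster nodes.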
Knowledge check
Kai wants to deploy churn model v4 without any downtime or risk to live users. He currently has v3 in production. What strategy should he use?
Dr. Fatima needs to score 2 million customer records monthly to update churn risk flags. The process should run overnight without blocking real-time systems. What should she use?
Next up: Drift, Monitoring & Retraining — keeping models healthy after deployment.