Deploying Models: Endpoints in Production
Your model is trained, validated, and registered. Now ship it. Learn to deploy to real-time and batch endpoints, test them, and implement progressive rollout with safe rollback.
Getting models into production
Training a model is like building a kitchen. Deploying it is opening the restaurant.
Your model needs to serve predictions to real users, reliably, at speed. Azure ML gives you two “service styles”:
- Real-time endpoints — like a drive-through: instant responses, one request at a time
- Batch endpoints — like a catering order: process a big batch overnight, results ready in the morning
You also need a way to update the restaurant without closing it — that’s progressive rollout (shift traffic gradually to the new model).
Real-time vs batch endpoints
| Endpoint Type | Latency | Input | Scaling | Best For |
|---|---|---|---|---|
| Managed Online Endpoint | Milliseconds | Single request (JSON) | Auto-scales 1-N instances | APIs, web apps, real-time decisions |
| Batch Endpoint | Minutes to hours | Large dataset (files/folders) | Cluster scales to process data | Nightly scoring, bulk processing, reports |
Deploying a real-time endpoint
# Step 1: Create the endpoint (the front door)
az ml online-endpoint create \
--name churn-predictor \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod \
--auth-mode key
# Step 2: Create a deployment (the model behind the door)
az ml online-deployment create \
--name blue \
--endpoint-name churn-predictor \
--model azureml:churn-predictor:3 \
--instance-type Standard_DS3_v2 \
--instance-count 2 \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod
# Step 3: Route all traffic to this deployment
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=100" \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod
What’s happening:
- Step 1 creates an endpoint — a stable URL that clients call. `--auth-mode key` means clients authenticate with an API key
- Step 2 creates a deployment named “blue” behind the endpoint — it runs model v3 on 2 instances
- Step 3 routes 100% of traffic to the “blue” deployment
Key concept: Endpoints vs deployments. An endpoint is the URL. A deployment is a specific model version running on specific compute behind that URL. One endpoint can have multiple deployments (for A/B testing).
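The endpoint-versus-deployment split can be pictured as a tiny data structure (a toy sketch for intuition, not an Azure ML API):

```python
# One endpoint (the stable URL), one or more deployments behind it,
# and a traffic split across them — percentages must sum to 100.
endpoint = {
    "name": "churn-predictor",
    "deployments": {
        "blue": {"model": "azureml:churn-predictor:3", "instances": 2},
    },
    "traffic": {"blue": 100},
}

assert sum(endpoint["traffic"].values()) == 100
```

Adding a second deployment later (for A/B testing or rollout) just adds an entry to `deployments` and re-splits `traffic`; the URL never changes.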
Testing endpoints
# Test with sample data
az ml online-endpoint invoke \
--name churn-predictor \
--request-file sample-request.json
# Check endpoint health
az ml online-endpoint show \
--name churn-predictor \
--output table
// sample-request.json
{
"input_data": {
"columns": ["tenure", "monthly_charges", "support_tickets"],
"data": [[24, 65.50, 3]]
}
}
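From client code, the same call is a plain HTTPS POST. Here is a sketch of assembling that request — the scoring URL and key are placeholders (fetch the real values from the endpoint's details in the CLI or the studio):

```python
import json

def build_scoring_request(scoring_url: str, api_key: str, columns, rows):
    """Assemble headers and body for a key-authenticated online endpoint.

    With --auth-mode key, the key is sent as a Bearer token; the body
    uses the same shape as sample-request.json.
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"input_data": {"columns": columns, "data": rows}})
    return scoring_url, headers, body

# Placeholders only — a real client would pass these to requests.post(...):
url, headers, body = build_scoring_request(
    "https://<endpoint-name>.<region>.inference.ml.azure.com/score",
    "<api-key>",
    ["tenure", "monthly_charges", "support_tickets"],
    [[24, 65.50, 3]],
)
```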
Exam tip: Troubleshooting endpoints
Common deployment failures and their fixes:
- Model loading errors — the environment is missing a dependency. Fix: check that `conda.yaml` matches the training environment
- Scoring script errors — `score.py`’s `init()` or `run()` fails. Fix: inspect the logs with `az ml online-deployment get-logs`
- Timeout errors — model inference is too slow. Fix: use a larger instance type or optimise the scoring function
- Out of memory — model too large for the instance. Fix: use a larger VM SKU
Always check deployment logs first: `az ml online-deployment get-logs --name blue --endpoint-name churn-predictor`
Progressive rollout (blue-green deployment)
Deploy a new model version safely by shifting traffic gradually:
# Step 1: Deploy new model as "green" (0% traffic initially)
az ml online-deployment create \
--name green \
--endpoint-name churn-predictor \
--model azureml:churn-predictor:4 \
--instance-type Standard_DS3_v2 \
--instance-count 2
# Step 2: Send 10% of traffic to green (canary)
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=90 green=10"
# Step 3: Monitor green metrics...
# If green looks good, increase traffic
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=50 green=50"
# Step 4: Full rollout
az ml online-endpoint update \
--name churn-predictor \
--traffic "blue=0 green=100"
# Step 5: Remove old deployment
az ml online-deployment delete \
--name blue --endpoint-name churn-predictor --yes
What’s happening:
- Blue = current production model (v3)
- Green = new model (v4), deployed alongside blue
- Traffic shifts: 0% → 10% → 50% → 100% — monitoring at each stage
- If green performs badly at 10%, roll back with a single command: `az ml online-endpoint update --name churn-predictor --traffic "blue=100 green=0"`
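The traffic splits used above can be sanity-checked client-side. A small hypothetical helper that validates a split and renders the CLI's `name=pct` format:

```python
def set_traffic(allocations: dict) -> str:
    """Validate a traffic split and render it in the CLI's key=value format.

    Azure ML rejects splits that don't sum to 100; checking client-side
    catches typos like "blue=90 green=20" before the update call.
    """
    total = sum(allocations.values())
    if total != 100:
        raise ValueError(f"traffic must sum to 100, got {total}")
    return " ".join(f"{name}={pct}" for name, pct in allocations.items())

# The canary stages from the rollout above:
for stage in [{"blue": 90, "green": 10},
              {"blue": 50, "green": 50},
              {"blue": 0, "green": 100}]:
    print(set_traffic(stage))
```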
Scenario: Kai ships model v4 without downtime
NeuralSpark’s churn model v4 has better accuracy, but Kai wants zero risk of customer-facing issues:
- Deploys v4 as “green” with 0% traffic
- Tests green with synthetic data (invoke with test requests)
- Shifts 10% of live traffic to green — monitors for 2 hours
- Green’s latency and accuracy look good — shifts to 50%
- After 24 hours at 50%, shifts to 100%
- Removes the old blue deployment
Total downtime: zero. If anything went wrong, one CLI command rolls back to blue.
Batch endpoints
For processing large datasets (not real-time):
# Create a batch endpoint
az ml batch-endpoint create \
--name churn-batch-scorer \
--resource-group rg-ml-prod \
--workspace-name ml-workspace-prod
# Create a deployment
az ml batch-deployment create \
--name default \
--endpoint-name churn-batch-scorer \
--model azureml:churn-predictor:3 \
--compute azureml:cpu-cluster \
--mini-batch-size 100 \
--max-concurrency 4
# Invoke with a dataset
az ml batch-endpoint invoke \
--name churn-batch-scorer \
--input azureml:monthly-customers:latest
What’s happening:
- `--mini-batch-size 100` — each mini-batch processes 100 records
- `--max-concurrency 4` — up to 4 mini-batches are scored in parallel
- The input is a data asset (`azureml:monthly-customers:latest`) — the endpoint processes the entire dataset and writes results to storage
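These two settings decide how a run is divided up. A quick back-of-envelope helper (hypothetical) for, say, a 2-million-record job:

```python
import math

def batch_plan(total_records: int, mini_batch_size: int, max_concurrency: int):
    """How many mini-batches a run produces, and how many sequential waves
    of parallel work those mini-batches take at the given concurrency."""
    mini_batches = math.ceil(total_records / mini_batch_size)
    waves = math.ceil(mini_batches / max_concurrency)
    return mini_batches, waves

# 2 million records with the settings above: 20,000 mini-batches,
# scored 4 at a time, so 5,000 sequential waves.
print(batch_plan(2_000_000, 100, 4))  # → (20000, 5000)
```

A larger `--mini-batch-size` or higher concurrency shortens the run, at the cost of more memory per mini-batch and more cluster nodes.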
Knowledge check
Kai wants to deploy churn model v4 without any downtime or risk to live users. He currently has v3 in production. What strategy should he use?
Dr. Fatima needs to score 2 million customer records monthly to update churn risk flags. The process should run overnight without blocking real-time systems. What should she use?
Next up: Drift, Monitoring & Retraining — keeping models healthy after deployment.