
AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 2: Implement Machine Learning Model Lifecycle and Operations (~14 min read)

Deploying Models: Endpoints in Production

Your model is trained, validated, and registered. Now ship it. Learn to deploy to real-time and batch endpoints, test them, and implement progressive rollout with safe rollback.

Getting models into production

☕ Simple explanation

Training a model is like building a kitchen. Deploying it is opening the restaurant.

Your model needs to serve predictions to real users, reliably, at speed. Azure ML gives you two “service styles”:

  • Real-time endpoints — like a drive-through: instant responses, one request at a time
  • Batch endpoints — like a catering order: process a big batch overnight, results ready in the morning

You also need a way to update the restaurant without closing it — that’s progressive rollout (shift traffic gradually to the new model).

Azure ML provides two managed endpoint types for model serving:

  • Managed online endpoints — low-latency, real-time inference via REST API. Auto-scales based on request volume. Supports traffic splitting for A/B testing.
  • Batch endpoints — process large datasets asynchronously. Spins up compute clusters, processes data in parallel, returns results to storage.

Both are managed — Azure handles the infrastructure, scaling, and networking.

Real-time vs batch endpoints

  • Managed online endpoint — Latency: milliseconds. Input: single request (JSON). Scaling: auto-scales 1-N instances. Best for: APIs, web apps, real-time decisions.
  • Batch endpoint — Latency: minutes to hours. Input: large dataset (files/folders). Scaling: cluster scales to process the data. Best for: nightly scoring, bulk processing, reports.
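The comparison boils down to a simple decision rule. As an illustrative Python sketch (the function name and inputs are ours, not an Azure API):

```python
# Hypothetical helper mirroring the comparison above: pick an endpoint type
# from latency needs and input shape. Purely illustrative.
def choose_endpoint_type(needs_millisecond_latency: bool,
                         input_is_bulk_dataset: bool) -> str:
    """Return the endpoint type the comparison suggests."""
    if needs_millisecond_latency and not input_is_bulk_dataset:
        return "managed-online"
    return "batch"

print(choose_endpoint_type(True, False))   # managed-online: API serving
print(choose_endpoint_type(False, True))   # batch: nightly bulk scoring
```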

Deploying a real-time endpoint

# Step 1: Create the endpoint (the front door)
az ml online-endpoint create \
  --name churn-predictor \
  --resource-group rg-ml-prod \
  --workspace-name ml-workspace-prod \
  --auth-mode key

# Step 2: Create a deployment (the model behind the door)
az ml online-deployment create \
  --name blue \
  --endpoint-name churn-predictor \
  --model azureml:churn-predictor:3 \
  --instance-type Standard_DS3_v2 \
  --instance-count 2 \
  --resource-group rg-ml-prod \
  --workspace-name ml-workspace-prod

# Step 3: Route all traffic to this deployment
az ml online-endpoint update \
  --name churn-predictor \
  --traffic "blue=100" \
  --resource-group rg-ml-prod \
  --workspace-name ml-workspace-prod

What’s happening:

  • Step 1 creates an endpoint — a stable URL that clients call. --auth-mode key means clients authenticate with an API key
  • Step 2 creates a deployment named “blue” behind the endpoint — it runs model v3 on 2 instances
  • Step 3 sends 100% of traffic to the “blue” deployment

Key concept: Endpoints vs deployments. An endpoint is the URL. A deployment is a specific model version running on specific compute behind that URL. One endpoint can have multiple deployments (for A/B testing).
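To make the traffic-splitting idea concrete, here is a minimal sketch of weighted routing between deployments. This is illustrative only; Azure's load balancer does this for you, and the function and weights are our own invention:

```python
import random

# Illustrative sketch: an endpoint fans requests out to its deployments
# in proportion to their traffic weights.
def route_request(traffic: dict, rng: random.Random) -> str:
    """Pick a deployment name with probability proportional to its weight."""
    names = list(traffic)
    weights = [traffic[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the run is reproducible
traffic = {"blue": 90, "green": 10}
picks = [route_request(traffic, rng) for _ in range(10_000)]
print(round(picks.count("green") / len(picks), 3))  # close to 0.10
```

With "blue=90 green=10", roughly one request in ten reaches the new deployment, which is exactly the canary behaviour used in progressive rollout below.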

Testing endpoints

# Test with sample data
az ml online-endpoint invoke \
  --name churn-predictor \
  --request-file sample-request.json

# Check endpoint health
az ml online-endpoint show \
  --name churn-predictor \
  --output table

sample-request.json:
{
  "input_data": {
    "columns": ["tenure", "monthly_charges", "support_tickets"],
    "data": [[24, 65.50, 3]]
  }
}
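
The same invoke can also be done over plain REST from any client. A hedged Python sketch using only the standard library — the scoring URI and key below are placeholders; fetch the real values with az ml online-endpoint show and az ml online-endpoint get-credentials:

```python
import json
import urllib.request

SCORING_URI = "https://churn-predictor.example.inference.ml.azure.com/score"  # placeholder
API_KEY = "<your-endpoint-key>"  # placeholder

def build_request(rows):
    """Serialize feature rows into the same shape as sample-request.json."""
    payload = {
        "input_data": {
            "columns": ["tenure", "monthly_charges", "support_tickets"],
            "data": rows,
        }
    }
    return json.dumps(payload).encode("utf-8")

def score(rows):
    """POST the payload to the endpoint. Needs a live endpoint and real key."""
    req = urllib.request.Request(
        SCORING_URI,
        data=build_request(rows),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return json.loads(resp.read())

print(build_request([[24, 65.50, 3]]))
```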
💡 Exam tip: Troubleshooting endpoints

Common deployment failures and their fixes:

  • Model loading errors — environment missing a dependency. Fix: check conda.yaml matches training environment
  • Scoring script errors — score.py init() or run() fails. Fix: inspect the traceback with az ml online-deployment get-logs, or debug with a local deployment before pushing to Azure
  • Timeout errors — model inference too slow. Fix: use a larger instance type or optimise the scoring function
  • Out of memory — model too large for instance. Fix: use a larger VM SKU

Always check deployment logs first: az ml online-deployment get-logs --name blue --endpoint-name churn-predictor

Progressive rollout (blue-green deployment)

Deploy a new model version safely by shifting traffic gradually:

# Step 1: Deploy new model as "green" (0% traffic initially)
az ml online-deployment create \
  --name green \
  --endpoint-name churn-predictor \
  --model azureml:churn-predictor:4 \
  --instance-type Standard_DS3_v2 \
  --instance-count 2

# Step 2: Send 10% of traffic to green (canary)
az ml online-endpoint update \
  --name churn-predictor \
  --traffic "blue=90 green=10"

# Step 3: Monitor green metrics...
# If green looks good, increase traffic
az ml online-endpoint update \
  --name churn-predictor \
  --traffic "blue=50 green=50"

# Step 4: Full rollout
az ml online-endpoint update \
  --name churn-predictor \
  --traffic "blue=0 green=100"

# Step 5: Remove old deployment
az ml online-deployment delete \
  --name blue --endpoint-name churn-predictor --yes

What’s happening:

  • Blue = current production model (v3)
  • Green = new model (v4), deployed alongside blue
  • Traffic shifts: 0% → 10% → 50% → 100% — monitoring at each stage
  • If green performs badly at 10%, roll back: --traffic "blue=100 green=0"
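The “monitor at each stage” step can be automated with a simple promotion gate. A sketch with assumed thresholds (the function and limits are illustrative, not an Azure feature):

```python
# Illustrative canary gate: widen green's traffic share only if its
# monitored metrics clear both thresholds. Thresholds are assumptions.
def promote_canary(error_rate: float, p95_latency_ms: float,
                   max_error_rate: float = 0.01,
                   max_p95_latency_ms: float = 200.0) -> bool:
    """Return True if the canary's metrics are within both limits."""
    return error_rate <= max_error_rate and p95_latency_ms <= max_p95_latency_ms

print(promote_canary(0.004, 120.0))  # True  -> shift more traffic to green
print(promote_canary(0.030, 120.0))  # False -> roll back to blue=100
```

In a CI/CD pipeline, a gate like this would sit between each az ml online-endpoint update step.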

Scenario: Kai ships model v4 without downtime

NeuralSpark’s churn model v4 has better accuracy, but Kai wants zero risk of customer-facing issues:

  1. Deploys v4 as “green” with 0% traffic
  2. Tests green with synthetic data (invoke with test requests)
  3. Shifts 10% of live traffic to green — monitors for 2 hours
  4. Green’s latency and accuracy look good — shifts to 50%
  5. After 24 hours at 50%, shifts to 100%
  6. Removes the old blue deployment

Total downtime: zero. If anything went wrong, one CLI command rolls back to blue.

Batch endpoints

For processing large datasets (not real-time):

# Create a batch endpoint
az ml batch-endpoint create \
  --name churn-batch-scorer \
  --resource-group rg-ml-prod \
  --workspace-name ml-workspace-prod

# Create a deployment
az ml batch-deployment create \
  --name default \
  --endpoint-name churn-batch-scorer \
  --model azureml:churn-predictor:3 \
  --compute azureml:cpu-cluster \
  --mini-batch-size 100 \
  --max-concurrency 4

# Invoke with a dataset
az ml batch-endpoint invoke \
  --name churn-batch-scorer \
  --input azureml:monthly-customers:latest

What’s happening:

  • --mini-batch-size 100 — processes 100 records per mini-batch
  • --max-concurrency 4 — 4 parallel mini-batches (4 nodes scoring simultaneously)
  • The input is a data asset — the endpoint processes the entire dataset and writes results to storage
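A quick back-of-envelope for how these two settings shape a run (illustrative arithmetic only; actual runtime also depends on per-record scoring time):

```python
import math

# How mini-batch size and concurrency partition a batch scoring job.
def batch_plan(total_records: int, mini_batch_size: int, max_concurrency: int):
    """Return (number of mini-batches, number of fully parallel waves)."""
    mini_batches = math.ceil(total_records / mini_batch_size)
    waves = math.ceil(mini_batches / max_concurrency)
    return mini_batches, waves

# 2,000,000 records with the settings above: 20,000 mini-batches,
# processed 4 at a time.
print(batch_plan(2_000_000, 100, 4))  # (20000, 5000)
```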

Key terms flashcards

Question

Endpoint vs deployment — what's the difference?


Answer

An endpoint is the stable URL clients call. A deployment is a specific model version on specific compute behind that URL. One endpoint can have multiple deployments with traffic splitting.


Question

What is blue-green deployment for ML models?


Answer

Deploy a new model (green) alongside the current one (blue). Gradually shift traffic: 0% → 10% → 50% → 100%. Monitor at each stage. Roll back instantly by shifting traffic back to blue.


Question

Real-time vs batch endpoints?


Answer

Real-time: millisecond latency, single requests, auto-scaling instances. Batch: minutes-to-hours, large datasets, cluster compute. Use real-time for APIs; batch for nightly scoring.


Question

How do you troubleshoot a failed deployment?


Answer

Check deployment logs: az ml online-deployment get-logs. Common issues: missing environment dependencies, scoring script errors, timeout from slow inference, OOM from undersized VM.


Knowledge check

  1. Kai wants to deploy churn model v4 without any downtime or risk to live users. He currently has v3 in production. What strategy should he use? (Blue-green deployment: deploy v4 as a second deployment with 0% traffic, then shift traffic progressively while monitoring.)

  2. Dr. Fatima needs to score 2 million customer records monthly to update churn risk flags. The process should run overnight without blocking real-time systems. What should she use? (A batch endpoint backed by a compute cluster.)



Next up: Drift, Monitoring & Retraining — keeping models healthy after deployment.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.