AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 3: Design and Implement a GenAIOps Infrastructure (Premium, ~15 min read)

Deploying Foundation Models

GPT-4o, Llama, Phi: choose the right model and deploy it. Learn serverless API vs managed compute, model selection criteria, and provisioned throughput for high-volume workloads.

How foundation models get deployed

☕ Simple explanation

Three ways to get a ride:

  • Serverless API (taxi): hail a ride when you need one. Pay per trip. No car to maintain. Great when trips are unpredictable.
  • Managed compute (company car): your own dedicated vehicle. Always available, fixed monthly cost. Great when you drive every day.
  • Provisioned throughput (reserved lane): a guaranteed express lane on the motorway. No traffic, guaranteed speed. You pay for the lane whether you use it or not.

Each option trades flexibility for control. Serverless is the easiest to adopt; provisioned throughput gives the strongest guarantees.

Azure AI Foundry provides three deployment options for foundation models:

  • Serverless API (Models as a Service): pay-per-token. No infrastructure to manage. Azure handles scaling. Best for variable or bursty workloads.
  • Managed compute: deploy a model on dedicated VMs that you control. Choose the SKU, set the instance count. Best for consistent throughput with custom configuration.
  • Provisioned throughput (PTUs): reserve a fixed amount of model processing capacity. Guaranteed tokens per minute regardless of other tenants. Best for high-volume, latency-sensitive workloads.

Serverless API vs managed compute vs provisioned throughput

Three deployment options for foundation models:

Feature | Cost Model | Scaling | Latency | Best For
Serverless API | Pay-per-token (input + output) | Auto-scales, shared capacity | Variable; depends on load across all tenants | Prototyping, variable workloads, getting started fast
Managed Compute | Per-hour VM cost (regardless of usage) | Manual or auto-scale by instance count | Consistent; dedicated VMs | Steady workloads, custom model configurations, fine-tuned models
Provisioned Throughput (PTUs) | Per-hour per PTU (reserved capacity) | Fixed capacity, no scaling needed | Low and predictable; reserved capacity | High-volume production, SLA-critical workloads, guaranteed throughput
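
The cost trade-off in the comparison above can be sketched with a few lines of arithmetic. This is a minimal sketch under illustrative assumptions: every price below is made up for the example, not published Azure pricing, and real rates vary by model, region, and commitment.

```python
# Rough cost comparison between pay-per-token and reserved capacity.
# All prices are ILLUSTRATIVE assumptions, not real Azure rates.

HOURS_PER_MONTH = 730

def serverless_cost(tokens_per_month: float, price_per_1k: float = 0.01) -> float:
    """Pay-per-token: the bill scales linearly with usage."""
    return tokens_per_month / 1_000 * price_per_1k

def ptu_cost(ptus: int, price_per_ptu_hour: float = 1.0) -> float:
    """Reserved capacity: a flat bill whether or not the capacity is used."""
    return ptus * price_per_ptu_hour * HOURS_PER_MONTH

# Steady workload: 300,000 tokens every minute, all month long.
tokens = 300_000 * 60 * HOURS_PER_MONTH
print(f"serverless: ${serverless_cost(tokens):,.0f}/mo")
print(f"ptu (100):  ${ptu_cost(100):,.0f}/mo")
# At a high, sustained volume the flat reserved bill wins; at low or
# bursty volume, pay-per-token wins.
```

The shape of the result, not the exact figures, is the point: reserved capacity only pays off once utilisation is consistently high.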

Model selection criteria

Before deploying, choose the right model for the job:

Criterion | Questions to Ask
Task type | Text generation? Code? Embeddings? Multi-modal (text + image)?
Latency requirements | Real-time chat (under 500 ms first token)? Batch processing (minutes OK)?
Cost sensitivity | Can you afford GPT-4o? Or does GPT-4o-mini do the job at 1/10th the cost?
Data privacy | Must data stay in your region? Are you comfortable with a hosted API?
Language support | Does the model perform well in your target languages?
Context window | How much text does the model need to process in one call?
💡 Exam tip: Model selection decision tree

The exam tests your ability to pick the right model:

  • Complex reasoning, accuracy critical → GPT-4o
  • Simple tasks, cost sensitive → GPT-4o-mini
  • Embeddings for search/RAG → text-embedding-3-large (or small for cost savings)
  • On-premises or edge deployment → Phi models (small, can run on consumer hardware)
  • Open-source requirement → Llama, Mistral via model catalog
  • Multi-modal (images + text) → GPT-4o (supports vision)

Always consider: can a smaller, cheaper model do the job? Start small, upgrade only if quality is insufficient.
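
The "start small" advice above can be expressed as a simple routing function. This is only a sketch: the complexity heuristic is a placeholder assumption, and a real router would use evaluation results, prompt length budgets, or an intent classifier.

```python
# "Start small" model routing sketch for the decision tree above.
# The complexity heuristic is a placeholder assumption.

def pick_model(task: str, needs_vision: bool = False,
               needs_embeddings: bool = False,
               edge_deployment: bool = False) -> str:
    if needs_embeddings:
        return "text-embedding-3-large"   # embeddings for search/RAG
    if edge_deployment:
        return "phi-4"                    # small model for edge/on-prem
    if needs_vision:
        return "gpt-4o"                   # multi-modal (images + text)
    # Crude complexity check: long prompts or explicit reasoning requests
    # escalate to the larger model; everything else stays on the cheap one.
    if len(task.split()) > 200 or "step by step" in task.lower():
        return "gpt-4o"
    return "gpt-4o-mini"

print(pick_model("Summarise this support ticket in one sentence."))
```

Routing like this keeps the default path cheap and only pays for GPT-4o when the task demonstrably needs it.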

The model catalog

Foundry's model catalog gives you access to models beyond Azure OpenAI:

Category | Models | Deployed Via
Azure OpenAI | GPT-4o, GPT-4o-mini, GPT-4.1, o3, o4-mini | Serverless API or PTUs
Meta | Llama 3.1, Llama 3.2 | Serverless API or managed compute
Microsoft | Phi-3, Phi-3.5, Phi-4 | Serverless API or managed compute
Mistral | Mistral Large, Mistral Small | Serverless API
Cohere | Command R, Command R+, Embed | Serverless API

Deploying a serverless API endpoint

# Deploy GPT-4o as a serverless endpoint
az ml serverless-endpoint create \
  --name gpt4o-support-bot \
  --model-id azureml://registries/azure-openai/models/gpt-4o/versions/2024-11-20 \
  --resource-group rg-genai-prod \
  --workspace-name proj-support-bot

What's happening:

  • Line 2: Creates a serverless endpoint; no VMs to manage, pay per token
  • Line 3: Names the endpoint for its purpose (support bot)
  • Line 4: Specifies the exact model and version from the model registry
  • Lines 5-6: Deploys into a specific project workspace
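
Once deployed, clients call the endpoint over HTTPS. A sketch of what the request might look like, with placeholder URL and key (the exact hostname format is an assumption; the payload follows the OpenAI-style chat-completions shape these endpoints expose):

```python
# Building an OpenAI-style chat-completions request for a serverless
# endpoint. ENDPOINT and API_KEY are placeholders, not real values.
import json

ENDPOINT = "https://gpt4o-support-bot.<region>.models.ai.azure.com"  # placeholder
API_KEY = "<endpoint-key-from-portal>"                               # placeholder

def build_chat_request(user_message: str,
                       system: str = "You are a helpful support bot."):
    """Return (headers, json_body) for a POST to {ENDPOINT}/chat/completions."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    body = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    return headers, json.dumps(body)

headers, body = build_chat_request("How do I reset my password?")
# Send with, e.g., requests.post(f"{ENDPOINT}/chat/completions",
#                                headers=headers, data=body)
```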

Deploying with managed compute

# Deploy a fine-tuned Llama model on dedicated compute
az ml online-endpoint create \
  --name llama-doc-analyzer \
  --resource-group rg-genai-prod \
  --workspace-name proj-doc-analysis

az ml online-deployment create \
  --name v1 \
  --endpoint-name llama-doc-analyzer \
  --model azureml://registries/azureml-meta/models/Llama-3.1-8B-Instruct/versions/3 \
  --instance-type Standard_NC24ads_A100_v4 \
  --instance-count 2 \
  --resource-group rg-genai-prod \
  --workspace-name proj-doc-analysis

# Route all traffic to this deployment
az ml online-endpoint update \
  --name llama-doc-analyzer \
  --traffic "v1=100" \
  --resource-group rg-genai-prod \
  --workspace-name proj-doc-analysis

What's happening:

  • Lines 2-5: Creates the endpoint (the stable URL that clients call)
  • Lines 7-14: Creates a deployment with a specific model on GPU VMs (A100s for large language models)
  • Line 11: Standard_NC24ads_A100_v4, an A100 GPU SKU suited for LLM inference
  • Line 12: 2 instances for redundancy and throughput
  • Lines 17-21: Routes 100% of traffic to the v1 deployment
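
A tiny helper can guard the traffic map before it reaches the --traffic argument: percentages across deployments must sum to 100. This is a sketch; the single-pair form matches the command above, but the separator az expects between multiple pairs is an assumption here.

```python
# Validate a deployment traffic split and render it as "name=pct" pairs,
# the form used by the --traffic argument above. Deployment names are
# examples; the multi-pair separator is an assumption.

def traffic_arg(split: dict) -> str:
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"traffic percentages must sum to 100, got {total}")
    return " ".join(f"{name}={pct}" for name, pct in split.items())

print(traffic_arg({"v1": 100}))           # all traffic to v1, as above
print(traffic_arg({"v1": 90, "v2": 10}))  # a 90/10 canary split
```

Validating the split client-side fails fast instead of surfacing a confusing error from the service.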

Provisioned throughput units (PTUs)

For high-volume workloads, reserve capacity:

# Deploy GPT-4o with provisioned throughput
az cognitiveservices account deployment create \
  --name aoai-genai-prod \
  --resource-group rg-genai-prod \
  --deployment-name gpt4o-doc-processing \
  --model-name gpt-4o \
  --model-version 2024-11-20 \
  --model-format OpenAI \
  --sku-capacity 100 \
  --sku-name ProvisionedManaged

What's happening:

  • Lines 2-4: Targets the Azure OpenAI resource and resource group
  • Lines 5-8: Specifies the deployment name, model, version, and format
  • Line 9: --sku-capacity 100 reserves 100 PTUs of capacity (each PTU provides a fixed number of tokens per minute)
  • Line 10: ProvisionedManaged SKU type for guaranteed, dedicated capacity

💡 How many PTUs do you need?

PTU sizing depends on your workload:

  • PTU capacity varies by model and version. Use the Azure OpenAI capacity calculator to estimate the number of PTUs needed for your workload. The calculator takes into account model type, expected tokens per minute, and latency targets.
  • Rough estimate: peak requests per minute, multiplied by average tokens per request, divided by the tokens per minute that one PTU provides
  • Start conservatively: you can increase PTUs later, but decreasing takes time (commitment periods apply)

Exam questions often test: "The company processes 10,000 documents per hour, each requiring 2000 tokens. Should they use serverless or PTUs?" Answer: PTUs, because a high-volume, predictable load benefits from reserved capacity and guaranteed throughput.
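
The sizing rule of thumb above, applied to those exam-style numbers. The tokens-per-minute-per-PTU figure is a made-up placeholder: the real value varies by model and version, which is exactly why the capacity calculator exists.

```python
# PTU sizing estimate: peak tokens/minute divided by one PTU's throughput,
# rounded up with safety headroom. tokens_per_min_per_ptu is a PLACEHOLDER.
import math

def estimate_ptus(peak_requests_per_min: float,
                  avg_tokens_per_request: float,
                  tokens_per_min_per_ptu: float,
                  headroom: float = 0.2) -> int:
    tokens_per_min = peak_requests_per_min * avg_tokens_per_request
    return math.ceil(tokens_per_min / tokens_per_min_per_ptu * (1 + headroom))

# 10,000 documents/hour at 2,000 tokens each is roughly 167 requests/minute
print(estimate_ptus(10_000 / 60, 2_000, tokens_per_min_per_ptu=2_500))
```

Whatever the placeholder value, the structure is the same: convert the workload to tokens per minute, divide by per-PTU throughput, round up, and add headroom.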

Scenario: Kai deploys serverless for NeuralSpark's chatbot

Kai needs to deploy GPT-4o for NeuralSpark's customer support chatbot. The chatbot handles:

  • 50 requests/minute during business hours
  • 5 requests/minute overnight
  • Occasional spikes to 200 requests/minute during product launches

Kai chooses serverless API because:

  • Traffic is highly variable: a 40x difference between quiet and peak
  • Pay-per-token means quiet periods cost almost nothing
  • Auto-scaling handles the spikes without manual intervention
  • No infrastructure to maintain: Kai's small team can't afford to manage VMs

CTO Priya approves: cost-efficient for a startup with unpredictable traffic patterns.

Scenario: Dr. Fatima deploys PTUs for Meridian's document processing

Meridian Financial processes 50,000 loan documents daily through GPT-4o for extraction and summarisation. Dr. Fatima's requirements:

  • Predictable, high volume: 50K documents every business day
  • Latency SLA: each document must be processed within 30 seconds
  • Compliance: cannot be affected by noisy neighbours on shared infrastructure

Fatima chooses provisioned throughput (PTUs) because:

  • Volume is predictable and consistently high, so the PTU cost is justified
  • Reserved capacity guarantees throughput regardless of other tenants
  • SLA-critical workload needs predictable latency
  • She provisions 200 PTUs based on the Azure capacity calculator, with 20% headroom

CISO James Chen approves: dedicated capacity means no risk of throttling during regulatory deadlines.

Key terms flashcards

Question

Serverless API vs managed compute: when to use each?


Answer

Serverless API: variable/bursty workloads, pay per token, no infra management. Managed compute: steady workloads, need custom configuration (fine-tuned models, specific GPU SKUs), pay per hour regardless of usage.


Question

What are provisioned throughput units (PTUs)?


Answer

PTUs reserve a fixed amount of model processing capacity. You get guaranteed tokens per minute regardless of other tenants. Pay per PTU per hour whether you use them or not. Best for high-volume, latency-sensitive, SLA-critical workloads.


Question

When would you choose an open-source model like Llama over GPT-4o?


Answer

When you need: open-source licensing (no vendor lock-in), smaller model that can run on less expensive compute, fine-tuning capabilities, data sovereignty (host entirely on your own infrastructure), or cost savings for simpler tasks where GPT-4o is overkill.


Question

What is the Azure AI model catalog?


Answer

A curated collection of foundation models available in Foundry, including Azure OpenAI models (GPT-4o), Meta (Llama), Microsoft (Phi), Mistral, and Cohere. You can deploy them as serverless APIs or on managed compute directly from the catalog.


Knowledge check

Question 1: NeuralSpark's chatbot handles 50 requests/minute during business hours but drops to 5 requests/minute overnight, with occasional spikes to 200 during launches. Kai wants to minimise cost while handling the spikes. Which deployment option should he choose?

Question 2: Meridian Financial processes 50,000 loan documents daily with a 30-second latency SLA. Dr. Fatima needs guaranteed throughput that is not affected by other Azure tenants. Which deployment option is correct?

🎬 Video coming soon


Next up: Model Versioning and Production Strategies, covering model updates, rollbacks, and traffic splitting.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.