
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 1: AI Concepts and Capabilities (~12 min read)

Deploying AI Models: Options & Settings

You've picked the right model — now how do you deploy it? Learn about deployment options, configuration parameters like temperature and top-p, and how to tune your model's behaviour.

How do you deploy an AI model?

☕ Simple explanation

Deploying a model is like setting up a coffee machine.

You choose the machine (the model), plug it in (deploy it), then adjust the settings: how strong you want the coffee (temperature), how much to pour (max tokens), and whether you want consistent flavour or experimental blends (top-p).

In Azure, “deploying” a model means making it available through an API endpoint that your applications can call. You don’t download the model — it runs in the cloud, and you just send requests to it.

Model deployment in Microsoft Foundry creates an API endpoint that applications can call to interact with the model. Deployment involves selecting a model from the catalog, choosing a deployment type (which affects performance and cost), and configuring parameters that control the model’s output behaviour.

Different deployment types offer trade-offs between cost, performance, availability, and customisability. Configuration parameters allow you to fine-tune the model’s responses without retraining it.
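To make "an API endpoint your applications can call" concrete, here is a minimal sketch in pure Python of the request a deployed Azure OpenAI model expects. The resource name, deployment name, and API version below are placeholders, not real values; substitute your own, and check the Azure docs for the current API version.

```python
import json

# Hypothetical values -- substitute your own resource, deployment, and key.
RESOURCE = "my-resource"       # Azure resource name (placeholder)
DEPLOYMENT = "my-gpt4o"        # the name you gave the deployment (placeholder)
API_VERSION = "2024-06-01"     # check the current version in the Azure docs

def build_chat_request(messages, temperature=0.3, max_tokens=500):
    """Assemble the URL and JSON body for a chat completions call.

    Azure OpenAI routes requests by *deployment name*, not model name:
    POST https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions
    """
    url = (f"https://{RESOURCE}.openai.azure.com/openai/"
           f"deployments/{DEPLOYMENT}/chat/completions"
           f"?api-version={API_VERSION}")
    body = json.dumps({
        "messages": messages,
        "temperature": temperature,   # creativity dial, explained below
        "max_tokens": max_tokens,     # caps the *output* length only
    })
    return url, body

url, body = build_chat_request(
    [{"role": "user", "content": "Summarise this invoice."}]
)
```

Actually sending the request requires an `api-key` header carrying your resource key; in practice most applications use the official SDK rather than raw HTTP, but the shape of the call is the same.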

Deployment options in Microsoft Foundry

Model deployment types in Microsoft Foundry
Deployment Type | How It Works | Best For
Global Standard | Shared infrastructure, pay-per-token, automatic routing across regions | Getting started quickly; development; variable workloads
Standard | Pay-per-token in a specific Azure region (not shared across regions) | Production workloads needing regional data residency
Provisioned | Reserved compute capacity (PTUs), guaranteed consistent throughput | High-volume production with predictable costs and latency
Serverless API | Pay-per-token for non-OpenAI models (Meta, Mistral, etc.) | Trying models from different providers without managing infrastructure
ℹ️ What are PTUs (Provisioned Throughput Units)?

Provisioned deployments use PTUs — reserved compute capacity you purchase in advance.

Think of it like reserving a table at a restaurant:

  • Pay-per-token = walk in, pay per meal, might wait during peak hours
  • Provisioned (PTUs) = reserve a table, guaranteed seating, pay monthly regardless of how much you eat

When to use PTUs:

  • Predictable, high-volume workloads (1000+ requests/minute)
  • Need guaranteed latency (no queuing)
  • Cost optimisation at scale (PTUs can be cheaper than pay-per-token at high volumes)

When NOT to use PTUs:

  • Development/testing (pay-per-token is cheaper for low volume)
  • Variable or unpredictable workloads
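A back-of-the-envelope calculation makes the break-even logic concrete. All prices here are invented purely for illustration; real PTU and per-token rates vary by model and region, so check the Azure pricing page before drawing conclusions.

```python
# All prices are invented for illustration -- real rates differ.
TOKENS_PER_REQUEST = 1_500          # prompt + completion (assumed average)
PRICE_PER_1K_TOKENS = 0.01          # pay-per-token rate (hypothetical, USD)
PTU_MONTHLY_COST = 10_000.0         # reserved-capacity fee (hypothetical, USD)

def monthly_pay_per_token_cost(requests_per_minute):
    """Estimate a month of pay-per-token spend at a steady request rate."""
    tokens = requests_per_minute * 60 * 24 * 30 * TOKENS_PER_REQUEST
    return tokens / 1000 * PRICE_PER_1K_TOKENS

low = monthly_pay_per_token_cost(5)       # light, dev/test-style traffic
high = monthly_pay_per_token_cost(1_000)  # sustained production traffic

# At low volume, pay-per-token stays well under the reserved fee;
# at high volume, the fixed PTU cost wins.
print(low < PTU_MONTHLY_COST < high)
```

The crossover point depends entirely on the real rates, but the shape of the comparison (fixed monthly fee vs. cost that scales with tokens) is what the exam expects you to reason about.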

Configuration parameters: tuning your model

When you deploy a model, you can adjust these settings to control its behaviour:

Temperature

What it does: Controls how creative or predictable the model’s responses are.

Temperature | Behaviour | Use Case
0 | Deterministic — always picks the most likely token | Fact extraction, classification, data processing
0.3-0.5 | Mostly predictable with slight variation | Customer support, summarisation
0.7-0.9 | Creative and varied responses | Brainstorming, creative writing, marketing copy
1.0+ | Highly random, unpredictable | Experimental; not recommended for production

Analogy: Temperature is like a music DJ’s “experimental” dial. At 0, they play the most popular songs every time. At 1.0, they play random deep cuts nobody’s heard.

Top-p (nucleus sampling)

What it does: Limits which words the model samples from, based on cumulative probability rather than a fixed count.

  • Top-p = 0.1 → samples only from the most likely words whose probabilities add up to 10% (very focused)
  • Top-p = 0.9 → samples from the words covering 90% of the probability mass (more varied)
  • Top-p = 1.0 → every word stays in the running
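The selection step is easy to sketch: sort the candidates by probability and keep the smallest prefix whose cumulative probability reaches p; sampling then happens only among the survivors (after renormalising). The probabilities below are invented.

```python
def nucleus(probs, top_p):
    """Return the token indices kept by top-p (nucleus) filtering.

    Keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

probs = [0.50, 0.25, 0.15, 0.07, 0.03]   # made-up distribution over 5 tokens

print(len(nucleus(probs, 0.5)))   # 1 -- the top token alone reaches 50%
print(len(nucleus(probs, 0.9)))   # 3 -- takes 0.50 + 0.25 + 0.15
```

Note how the cutoff tracks probability mass, not a fixed number of words: a confident model might keep one token at top-p 0.9, while an uncertain one keeps dozens.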

Exam tip: Temperature and top-p both control “randomness” but work differently. Usually, you adjust one and leave the other at its default. Don’t set both to extreme values.

Max tokens

What it does: Sets the maximum length of the model’s response.

  • Lower values → shorter, cheaper responses
  • Higher values → longer, more detailed (and more expensive) responses
  • This does NOT affect input length — only the output

Stop sequences

What they do: Tell the model when to stop generating. For example, you could set a stop sequence of \n\n to make the model stop after a double line break.
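The observable effect is easy to mimic: truncate the text at the first occurrence of any stop sequence. (Real implementations stop generating tokens rather than trimming afterwards, but the result is the same; the sample text is invented.)

```python
def apply_stop_sequences(text, stops):
    """Cut text at the earliest occurrence of any stop sequence.

    The stop sequence itself is excluded from the result, matching
    how chat-completion APIs treat the `stop` parameter.
    """
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

draft = "Item: coffee machine\n\nUnrelated rambling after the blank line."
print(apply_stop_sequences(draft, ["\n\n"]))   # prints "Item: coffee machine"
```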

Frequency penalty and presence penalty

Parameter | What It Does | Effect
Frequency penalty | Reduces repetition of words already used | Higher = less repetitive
Presence penalty | Encourages the model to talk about new topics | Higher = more diverse topics
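OpenAI's API documentation describes both penalties as straight subtractions from each token's logit before sampling (other providers may differ); a sketch of that adjustment, with invented numbers:

```python
def penalised_logit(logit, count, frequency_penalty, presence_penalty):
    """Apply frequency/presence penalties to one token's logit.

    count = how many times this token already appears in the text so far.
    The frequency penalty scales with the count (curbs repetition);
    the presence penalty is a flat deduction once the token has appeared
    at all (nudges the model toward new words and topics).
    """
    return (logit
            - count * frequency_penalty
            - (1.0 if count > 0 else 0.0) * presence_penalty)

# A token already seen 3 times is penalised harder than an unseen one.
seen = penalised_logit(5.0, count=3, frequency_penalty=0.5, presence_penalty=0.4)
unseen = penalised_logit(5.0, count=0, frequency_penalty=0.5, presence_penalty=0.4)
print(seen, unseen)
```

This also explains the table above: frequency penalty grows with every repeat, while presence penalty is a one-off cost per distinct token.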

Putting it together: DataFlow Corp’s deployment

DataFlow Corp deploys three different models for three use cases:

Use Case | Model | Deployment | Temperature | Max Tokens | Why
Customer support chat | GPT-4o | Provisioned (PTUs) | 0.3 | 500 | High volume; needs consistent quality and latency
Internal report summaries | Phi-4 | Standard | 0.2 | 1000 | Cost-efficient; needs accuracy; moderate volume
Marketing copy generator | GPT-4o | Global Standard | 0.8 | 2000 | Creative; variable usage; doesn't need dedicated capacity

Content filtering

Azure AI includes built-in content filters that block harmful content:

Four harm categories (always on by default):

  • Hate and unfairness — blocks discriminatory content
  • Sexual content — blocks explicit material
  • Violence — blocks graphic violence
  • Self-harm — blocks content promoting self-harm

Additional protections (configurable):

  • Prompt shields — detects jailbreak and prompt injection attempts
  • Protected material detection — identifies copyrighted text/code

Content filters are enabled by default on all Azure OpenAI deployments. You can adjust severity thresholds (low, medium, high). Fully disabling core filters requires approval.

💡 Exam tip: Content filtering

Key facts for the exam:

  • Content filters are on by default — you don’t need to enable them
  • Filters apply to both input and output
  • You can configure severity thresholds but not disable core filters
  • This connects to the Reliability & Safety responsible AI principle

🎬 Video walkthrough

Video coming soon: Deploying AI Models — AI-901 Module 5 (~12 min)

Flashcards

Question

What does the temperature parameter control in an AI model?

Answer

How creative or predictable the model's responses are. Temperature 0 = deterministic (always the most likely answer). Temperature 1.0 = highly random and creative. For factual tasks, use a low temperature; for creative tasks, use a higher one.

Question

What is a Provisioned deployment (PTUs)?

Answer

Reserved compute capacity for AI models. You pay for guaranteed throughput upfront, getting consistent latency and performance. Best for high-volume production workloads. Think: reserving a restaurant table vs walking in.

Question

What does the max tokens parameter control?

Answer

The maximum length of the model's response (output only, not input). Lower values = shorter and cheaper responses. Higher values = longer and more detailed but more expensive.

Question

What are Azure AI content filters?

Answer

Built-in safety filters that block harmful content in both inputs and outputs. Core harm categories: hate, sexual, violence, self-harm. Prompt shields additionally detect jailbreak attempts. They're enabled by default on all Azure OpenAI deployments.

Question

What is the difference between Global Standard and Standard deployments?

Answer

Global Standard routes requests across regions automatically (flexible, easy). Standard deploys to a specific region (needed for data residency requirements). Both are pay-per-token.

Knowledge Check

  • GreenLeaf needs their AI model to extract invoice numbers from scanned documents. The responses must be consistent — the same document should always produce the same result. Which temperature setting is most appropriate?
  • DataFlow Corp processes 5,000 customer support queries per minute with strict latency requirements. Which deployment type should they choose?
  • Priya wants to deploy a model for a class project with minimal cost. She'll only use it occasionally for testing. Which deployment type is best?


Next up: AI Workloads at a Glance — a tour of the six types of AI workloads and when to use each one.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.