Deploying AI Models: Options & Settings
You've picked the right model — now how do you deploy it? Learn about deployment options, configuration parameters like temperature and top-p, and how to tune your model's behaviour.
How do you deploy an AI model?
Deploying a model is like setting up a coffee machine.
You choose the machine (the model), plug it in (deploy it), then adjust the settings: how strong you want the coffee (temperature), how much to pour (max tokens), and whether you want consistent flavour or experimental blends (top-p).
In Azure, “deploying” a model means making it available through an API endpoint that your applications can call. You don’t download the model — it runs in the cloud, and you just send requests to it.
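Once deployed, your application just builds a request and sends it to the endpoint. Below is a minimal sketch: the deployment name is a placeholder, and the helper function is illustrative, not part of any SDK. The commented lines show roughly how you would send it with the `openai` Python SDK (which requires real credentials).

```python
# Sketch: build the request body you would send to a deployed chat model.
# "my-gpt4o-deployment" is a placeholder deployment name, not a real one.

def build_chat_request(deployment: str, user_message: str,
                       temperature: float = 0.7, max_tokens: int = 500) -> dict:
    """Assemble the JSON body for a chat-completions call to a deployment."""
    return {
        "model": deployment,  # in Azure, this is your *deployment* name
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

request = build_chat_request("my-gpt4o-deployment", "Summarise this invoice.")

# With the openai SDK, sending it looks roughly like this (needs credentials):
#   from openai import AzureOpenAI
#   client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
#                        api_key="<key>", api_version="2024-02-01")
#   response = client.chat.completions.create(**request)

print(request["model"])
```

Note the key point from the analogy: the model itself never leaves the cloud; all you manage locally is the request and its settings.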
Deployment options in Microsoft Foundry
| Feature | How It Works | Best For |
|---|---|---|
| Global Standard | Shared infrastructure, pay-per-token, automatic routing across regions | Getting started quickly, development, variable workloads |
| Standard | Pay-per-token in a specific Azure region (not shared across regions) | Production workloads needing regional data residency |
| Provisioned | Reserved compute capacity (PTUs), consistent throughput guaranteed | High-volume production with predictable costs and latency |
| Serverless API | Pay-per-token for non-OpenAI models (Meta, Mistral, etc.) | Trying models from different providers without infrastructure |
What are PTUs (Provisioned Throughput Units)?
Provisioned deployments use PTUs — reserved compute capacity you purchase in advance.
Think of it like reserving a table at a restaurant:
- Pay-per-token = walk in, pay per meal, might wait during peak hours
- Provisioned (PTUs) = reserve a table, guaranteed seating, pay monthly regardless of how much you eat
When to use PTUs:
- Predictable, high-volume workloads (1000+ requests/minute)
- Need guaranteed latency (no queuing)
- Cost optimisation at scale (PTUs can be cheaper than pay-per-token at high volumes)
When NOT to use PTUs:
- Development/testing (pay-per-token is cheaper for low volume)
- Variable or unpredictable workloads
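The PTU decision is ultimately break-even arithmetic. The sketch below uses made-up prices (check the Azure pricing page for real numbers) to show how steady volume flips the answer from pay-per-token to provisioned.

```python
# Break-even sketch for PTUs vs pay-per-token. Both prices below are
# hypothetical placeholders, not real Azure rates.

PAY_PER_1K_TOKENS = 0.01      # hypothetical $ per 1K tokens
PTU_MONTHLY_COST = 10_000.0   # hypothetical $ per month for a reservation

def monthly_pay_per_token_cost(tokens_per_minute: float) -> float:
    """Pay-per-token bill for a steady workload, assuming a 30-day month."""
    tokens_per_month = tokens_per_minute * 60 * 24 * 30
    return tokens_per_month / 1000 * PAY_PER_1K_TOKENS

# Low, steady volume: pay-per-token wins. High volume: the reservation wins.
for tpm in (1_000, 50_000):
    ppt = monthly_pay_per_token_cost(tpm)
    better = "pay-per-token" if ppt < PTU_MONTHLY_COST else "provisioned (PTU)"
    print(f"{tpm:>7} tokens/min -> ${ppt:>9,.0f}/month pay-per-token -> {better}")
```

At these placeholder prices, 1,000 tokens/minute costs $432/month pay-per-token (far below the reservation), while 50,000 tokens/minute costs $21,600/month, so the PTU reservation wins.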
Configuration parameters: tuning your model
When you deploy a model, you can adjust these settings to control its behaviour:
Temperature
What it does: Controls how creative or predictable the model’s responses are.
| Temperature | Behaviour | Use Case |
|---|---|---|
| 0 | Deterministic — always picks the most likely token | Fact extraction, classification, data processing |
| 0.3-0.5 | Mostly predictable with slight variation | Customer support, summarisation |
| 0.7-0.9 | Creative and varied responses | Brainstorming, creative writing, marketing copy |
| 1.0+ | Highly random, unpredictable | Experimental, not recommended for production |
Analogy: Temperature is like a music DJ’s “experimental” dial. At 0, they play the most popular songs every time. At 1.0, they play random deep cuts nobody’s heard.
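Under the hood, temperature rescales the model's scores (logits) before they become probabilities. A minimal sketch of that mechanism (real serving stacks do this over tens of thousands of tokens, and treat temperature 0 as greedy argmax):

```python
import math

def temperature_probs(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature.
    Temperature 0 is treated as greedy (all mass on the top token);
    higher temperatures flatten the distribution, making output more random."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # toy scores for three tokens
print(temperature_probs(logits, 0))       # greedy: [1.0, 0.0, 0.0]
print(temperature_probs(logits, 0.3))     # sharply peaked on the first token
print(temperature_probs(logits, 1.5))     # flatter: more chance of variety
```

This is the DJ dial in code: at 0 the top token always wins; as the dial goes up, the probability mass spreads across the deep cuts.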
Top-p (nucleus sampling)
What it does: Limits which tokens the model samples from. The model keeps the smallest set of most-likely tokens whose cumulative probability reaches p, then samples only from that set (the "nucleus").
- Top-p = 0.1 → samples only from tokens covering the top 10% of probability mass (very focused)
- Top-p = 0.9 → samples from tokens covering 90% of probability mass (more varied)
- Top-p = 1.0 → considers all possible tokens
Exam tip: Temperature and top-p both control “randomness” but work differently. Usually, you adjust one and leave the other at its default. Don’t set both to extreme values.
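A toy sketch of the nucleus-sampling rule itself, using a hand-written probability list (real models do this over an entire vocabulary at every generation step):

```python
def top_p_filter(probs, p):
    """Return the indices of the smallest set of tokens whose cumulative
    probability reaches p; the model then samples only from this 'nucleus'."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return sorted(kept)

probs = [0.5, 0.3, 0.15, 0.05]            # toy distribution over four tokens
print(top_p_filter(probs, 0.5))           # [0]           - very focused
print(top_p_filter(probs, 0.9))           # [0, 1, 2]     - more varied
print(top_p_filter(probs, 1.0))           # [0, 1, 2, 3]  - everything eligible
```

Notice why the exam tip warns against extreme combinations: a tiny top-p already removes most tokens, so cranking temperature up at the same time has little left to randomise.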
Max tokens
What it does: Sets the maximum length of the model’s response.
- Lower values → shorter, cheaper responses
- Higher values → longer, more detailed (and more expensive) responses
- This does NOT affect input length — only the output
Stop sequences
What they do: Tell the model when to stop generating. For example, you could set a stop sequence of `\n\n` to make the model stop after a double line break.
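Max tokens and stop sequences interact, and a toy sketch makes the order of operations clear. Real endpoints apply both checks during generation, token by token; this simplified version post-processes a finished token stream instead.

```python
def truncate_output(tokens, max_tokens, stop_sequence=None):
    """Toy post-processor: cap the output at max_tokens, then cut at the
    first stop sequence (the stop sequence itself is not returned)."""
    out = tokens[:max_tokens]              # max_tokens caps the OUTPUT only
    text = "".join(out)
    if stop_sequence is not None:
        cut = text.find(stop_sequence)
        if cut != -1:
            return text[:cut]
    return text

tokens = ["First", " paragraph.", "\n\n", "Second", " paragraph."]
print(truncate_output(tokens, max_tokens=10, stop_sequence="\n\n"))
print(truncate_output(tokens, max_tokens=2))
```

Both calls return only "First paragraph.": the first because the `\n\n` stop sequence fires, the second because the two-token cap is reached first. The input prompt is never touched by either setting.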
Frequency penalty and presence penalty
| Parameter | What It Does | Effect |
|---|---|---|
| Frequency penalty | Reduces repetition of words already used | Higher = less repetitive |
| Presence penalty | Encourages the model to talk about new topics | Higher = more diverse topics |
Putting it together: DataFlow Corp’s deployment
DataFlow Corp deploys three different models for three use cases:
| Use Case | Model | Deployment | Temperature | Max Tokens | Why |
|---|---|---|---|---|---|
| Customer support chat | GPT-4o | Provisioned (PTUs) | 0.3 | 500 | High volume, needs consistent quality and latency |
| Internal report summaries | Phi-4 | Standard | 0.2 | 1000 | Cost-efficient, needs accuracy, moderate volume |
| Marketing copy generator | GPT-4o | Global Standard | 0.8 | 2000 | Creative, variable usage, doesn’t need dedicated capacity |
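Expressed as configuration data, the same three deployments might look like this. The deployment names are hypothetical; the models, types, and parameter values come straight from the scenario.

```python
# DataFlow Corp's three deployments as configuration data.
# Deployment names are made up; values follow the scenario above.
deployments = {
    "customer-support": {"model": "gpt-4o", "deployment_type": "Provisioned",
                         "temperature": 0.3, "max_tokens": 500},
    "report-summaries": {"model": "phi-4", "deployment_type": "Standard",
                         "temperature": 0.2, "max_tokens": 1000},
    "marketing-copy":   {"model": "gpt-4o", "deployment_type": "GlobalStandard",
                         "temperature": 0.8, "max_tokens": 2000},
}

def pick_deployment(use_case: str) -> dict:
    """Route a request to the deployment configured for its use case."""
    return deployments[use_case]

print(pick_deployment("marketing-copy")["temperature"])   # creative: 0.8
print(pick_deployment("customer-support")["temperature"]) # consistent: 0.3
```

The point of the pattern: the same base model (GPT-4o) serves two very different jobs purely through deployment type and parameter choices.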
Content filtering
Azure AI includes built-in content filters that block harmful content:
Four harm categories (always on by default):
- Hate and unfairness — blocks discriminatory content
- Sexual content — blocks explicit material
- Violence — blocks graphic violence
- Self-harm — blocks content promoting self-harm
Additional protections (configurable):
- Prompt shields — detects jailbreak and prompt injection attempts
- Protected material detection — identifies copyrighted text/code
Content filters are enabled by default on all Azure OpenAI deployments. You can adjust severity thresholds (low, medium, high). Fully disabling core filters requires approval.
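To make the threshold idea concrete, here is a purely illustrative sketch. The four category names match Azure's harm categories, but the scoring and gating logic below is invented for this example and is not how the real service is implemented.

```python
# Hypothetical sketch of severity-threshold gating. Category names match
# Azure's four harm categories; the logic itself is illustrative only.

THRESHOLD_ORDER = {"low": 0, "medium": 1, "high": 2}

def is_blocked(detected_severity: str, configured_threshold: str) -> bool:
    """Block content whose detected severity meets or exceeds the
    threshold configured for that category."""
    return (THRESHOLD_ORDER[detected_severity]
            >= THRESHOLD_ORDER[configured_threshold])

print(is_blocked("high", "medium"))   # blocked
print(is_blocked("low", "medium"))    # allowed through
print(is_blocked("low", "low"))       # strictest setting: blocked
```

The takeaway: raising the threshold lets more borderline content through; lowering it blocks more, and you cannot set it so high that the core filters switch off.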
Exam tip: Content filtering
Key facts for the exam:
- Content filters are on by default — you don’t need to enable them
- Filters apply to both input and output
- You can configure severity thresholds but not disable core filters
- This connects to the Reliability & Safety responsible AI principle
🎬 Video walkthrough
🎬 Video coming soon
Deploying AI Models — AI-901 Module 5 (~12 min)
Knowledge Check
GreenLeaf needs their AI model to extract invoice numbers from scanned documents. The responses must be consistent — the same document should always produce the same result. Which temperature setting is most appropriate?
DataFlow Corp processes 5,000 customer support queries per minute with strict latency requirements. Which deployment type should they choose?
Priya wants to deploy a model for a class project with minimal cost. She'll only use it occasionally for testing. Which deployment type is best?
Next up: AI Workloads at a Glance — a tour of the six types of AI workloads and when to use each one.