Deploying AI Models: Options & Settings
You've picked the right model — now how do you deploy it? Learn about deployment options, configuration parameters like temperature and top-p, and how to tune your model's behaviour.
How do you deploy an AI model?
Deploying a model is like setting up a coffee machine.
You choose the machine (the model), plug it in (deploy it), then adjust the settings: how strong you want the coffee (temperature), how much to pour (max tokens), and whether you want consistent flavour or experimental blends (top-p).
In Azure, “deploying” a model means making it available through an API endpoint that your applications can call. You don’t download the model — it runs in the cloud, and you just send requests to it.
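Once deployed, your application just builds a request and sends it to the endpoint. Below is a minimal sketch: the deployment name is a placeholder, and the helper function is illustrative, not part of any SDK. The commented lines show roughly how you would send it with the `openai` Python SDK (which requires real credentials).

```python
# Sketch: build the request body you would send to a deployed chat model.
# "my-gpt4o-deployment" is a placeholder deployment name, not a real one.

def build_chat_request(deployment: str, user_message: str,
                       temperature: float = 0.7, max_tokens: int = 500) -> dict:
    """Assemble the JSON body for a chat-completions call to a deployment."""
    return {
        "model": deployment,  # in Azure, this is your *deployment* name
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

request = build_chat_request("my-gpt4o-deployment", "Summarise this invoice.")

# With the openai SDK, sending it looks roughly like this (needs credentials):
#   from openai import AzureOpenAI
#   client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
#                        api_key="<key>", api_version="2024-02-01")
#   response = client.chat.completions.create(**request)

print(request["model"])
```

Note the key point from the analogy: the model itself never leaves the cloud; all you manage locally is the request and its settings.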
Deployment options in Microsoft Foundry
| Feature | How It Works | Best For |
|---|---|---|
| Global Standard | Shared infrastructure, pay-per-token, automatic routing across regions | Getting started quickly, development, variable workloads |
| Standard | Pay-per-token in a specific Azure region (not shared across regions) | Production workloads needing regional data residency |
| Provisioned | Reserved compute capacity (PTUs), consistent throughput guaranteed | High-volume production with predictable costs and latency |
| Serverless API | Pay-per-token for non-OpenAI models (Meta, Mistral, etc.) | Trying models from different providers without infrastructure |
What are PTUs (Provisioned Throughput Units)?
Provisioned deployments use PTUs — reserved compute capacity you purchase in advance.
Think of it like reserving a table at a restaurant:
- Pay-per-token = walk in, pay per meal, might wait during peak hours
- Provisioned (PTUs) = reserve a table, guaranteed seating, pay monthly regardless of how much you eat
When to use PTUs:
- Predictable, high-volume workloads (1000+ requests/minute)
- Need guaranteed latency (no queuing)
- Cost optimisation at scale (PTUs can be cheaper than pay-per-token at high volumes)
When NOT to use PTUs:
- Development/testing (pay-per-token is cheaper for low volume)
- Variable or unpredictable workloads
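The PTU decision is ultimately break-even arithmetic. The sketch below uses made-up prices (check the Azure pricing page for real numbers) to show how steady volume flips the answer from pay-per-token to provisioned.

```python
# Break-even sketch for PTUs vs pay-per-token. Both prices below are
# hypothetical placeholders, not real Azure rates.

PAY_PER_1K_TOKENS = 0.01      # hypothetical $ per 1K tokens
PTU_MONTHLY_COST = 10_000.0   # hypothetical $ per month for a reservation

def monthly_pay_per_token_cost(tokens_per_minute: float) -> float:
    """Pay-per-token bill for a steady workload, assuming a 30-day month."""
    tokens_per_month = tokens_per_minute * 60 * 24 * 30
    return tokens_per_month / 1000 * PAY_PER_1K_TOKENS

# Low, steady volume: pay-per-token wins. High volume: the reservation wins.
for tpm in (1_000, 50_000):
    ppt = monthly_pay_per_token_cost(tpm)
    better = "pay-per-token" if ppt < PTU_MONTHLY_COST else "provisioned (PTU)"
    print(f"{tpm:>7} tokens/min -> ${ppt:>9,.0f}/month pay-per-token -> {better}")
```

At these placeholder prices, 1,000 tokens/minute costs $432/month pay-per-token (far below the reservation), while 50,000 tokens/minute costs $21,600/month, so the PTU reservation wins.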
Configuration parameters: tuning your model
When you deploy a model, you can adjust these settings to control its behaviour:
Temperature
What it does: Controls how creative or predictable the model’s responses are.
| Temperature | Behaviour | Use Case |
|---|---|---|
| 0 | Deterministic — always picks the most likely token | Fact extraction, classification, data processing |
| 0.3-0.5 | Mostly predictable with slight variation | Customer support, summarisation |
| 0.7-0.9 | Creative and varied responses | Brainstorming, creative writing, marketing copy |
| 1.0+ | Highly random, unpredictable | Experimental, not recommended for production |
Analogy: Temperature is like a music DJ’s “experimental” dial. At 0, they play the most popular songs every time. At 1.0, they play random deep cuts nobody’s heard.
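Under the hood, temperature rescales the model's scores (logits) before they become probabilities. A minimal sketch of that mechanism (real serving stacks do this over tens of thousands of tokens, and treat temperature 0 as greedy argmax):

```python
import math

def temperature_probs(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature.
    Temperature 0 is treated as greedy (all mass on the top token);
    higher temperatures flatten the distribution, making output more random."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # toy scores for three tokens
print(temperature_probs(logits, 0))       # greedy: [1.0, 0.0, 0.0]
print(temperature_probs(logits, 0.3))     # sharply peaked on the first token
print(temperature_probs(logits, 1.5))     # flatter: more chance of variety
```

This is the DJ dial in code: at 0 the top token always wins; as the dial goes up, the probability mass spreads across the deep cuts.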
Top-p (nucleus sampling)
What it does: Limits which tokens the model samples from. The model keeps the smallest set of most-likely tokens whose cumulative probability reaches p, then samples only from that set (the "nucleus").
- Top-p = 0.1 → samples only from tokens covering the top 10% of probability mass (very focused)
- Top-p = 0.9 → samples from tokens covering 90% of probability mass (more varied)
- Top-p = 1.0 → considers all possible tokens
Exam tip: Temperature and top-p both control “randomness” but work differently. Usually, you adjust one and leave the other at its default. Don’t set both to extreme values.
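A toy sketch of the nucleus-sampling rule itself, using a hand-written probability list (real models do this over an entire vocabulary at every generation step):

```python
def top_p_filter(probs, p):
    """Return the indices of the smallest set of tokens whose cumulative
    probability reaches p; the model then samples only from this 'nucleus'."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return sorted(kept)

probs = [0.5, 0.3, 0.15, 0.05]            # toy distribution over four tokens
print(top_p_filter(probs, 0.5))           # [0]           - very focused
print(top_p_filter(probs, 0.9))           # [0, 1, 2]     - more varied
print(top_p_filter(probs, 1.0))           # [0, 1, 2, 3]  - everything eligible
```

Notice why the exam tip warns against extreme combinations: a tiny top-p already removes most tokens, so cranking temperature up at the same time has little left to randomise.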
Max tokens
What it does: Sets the maximum length of the model’s response.
- Lower values → shorter, cheaper responses
- Higher values → longer, more detailed (and more expensive) responses
- This does NOT affect input length — only the output
Stop sequences
What they do: Tell the model when to stop generating. For example, you could set a stop sequence of `\n\n` to make the model stop after a double line break.
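Max tokens and stop sequences interact, and a toy sketch makes the order of operations clear. Real endpoints apply both checks during generation, token by token; this simplified version post-processes a finished token stream instead.

```python
def truncate_output(tokens, max_tokens, stop_sequence=None):
    """Toy post-processor: cap the output at max_tokens, then cut at the
    first stop sequence (the stop sequence itself is not returned)."""
    out = tokens[:max_tokens]              # max_tokens caps the OUTPUT only
    text = "".join(out)
    if stop_sequence is not None:
        cut = text.find(stop_sequence)
        if cut != -1:
            return text[:cut]
    return text

tokens = ["First", " paragraph.", "\n\n", "Second", " paragraph."]
print(truncate_output(tokens, max_tokens=10, stop_sequence="\n\n"))
print(truncate_output(tokens, max_tokens=2))
```

Both calls return only "First paragraph.": the first because the `\n\n` stop sequence fires, the second because the two-token cap is reached first. The input prompt is never touched by either setting.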
Frequency penalty and presence penalty
| Parameter | What It Does | Effect |
|---|---|---|
| Frequency penalty | Reduces repetition of words already used | Higher = less repetitive |
| Presence penalty | Encourages the model to talk about new topics | Higher = more diverse topics |
Putting it together: DataFlow Corp’s deployment
DataFlow Corp deploys three different models for three use cases:
| Use Case | Model | Deployment | Temperature | Max Tokens | Why |
|---|---|---|---|---|---|
| Customer support chat | GPT-4o | Provisioned (PTUs) | 0.3 | 500 | High volume, needs consistent quality and latency |
| Internal report summaries | Phi-4 | Standard | 0.2 | 1000 | Cost-efficient, needs accuracy, moderate volume |
| Marketing copy generator | GPT-4o | Global Standard | 0.8 | 2000 | Creative, variable usage, doesn’t need dedicated capacity |
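Expressed as configuration data, the same three deployments might look like this. The deployment names are hypothetical; the models, types, and parameter values come straight from the scenario.

```python
# DataFlow Corp's three deployments as configuration data.
# Deployment names are made up; values follow the scenario above.
deployments = {
    "customer-support": {"model": "gpt-4o", "deployment_type": "Provisioned",
                         "temperature": 0.3, "max_tokens": 500},
    "report-summaries": {"model": "phi-4", "deployment_type": "Standard",
                         "temperature": 0.2, "max_tokens": 1000},
    "marketing-copy":   {"model": "gpt-4o", "deployment_type": "GlobalStandard",
                         "temperature": 0.8, "max_tokens": 2000},
}

def pick_deployment(use_case: str) -> dict:
    """Route a request to the deployment configured for its use case."""
    return deployments[use_case]

print(pick_deployment("marketing-copy")["temperature"])   # creative: 0.8
print(pick_deployment("customer-support")["temperature"]) # consistent: 0.3
```

The point of the pattern: the same base model (GPT-4o) serves two very different jobs purely through deployment type and parameter choices.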
Content filtering
Azure AI includes built-in content filters that block harmful content:
Four harm categories (always on by default):
- Hate and unfairness — blocks discriminatory content
- Sexual content — blocks explicit material
- Violence — blocks graphic violence
- Self-harm — blocks content promoting self-harm
Additional protections (configurable):
- Prompt shields — detects jailbreak and prompt injection attempts
- Protected material detection — identifies copyrighted text/code
Content filters are enabled by default on all Azure OpenAI deployments. You can adjust severity thresholds (low, medium, high). Fully disabling core filters requires approval.
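To make the threshold idea concrete, here is a purely illustrative sketch. The four category names match Azure's harm categories, but the scoring and gating logic below is invented for this example and is not how the real service is implemented.

```python
# Hypothetical sketch of severity-threshold gating. Category names match
# Azure's four harm categories; the logic itself is illustrative only.

THRESHOLD_ORDER = {"low": 0, "medium": 1, "high": 2}

def is_blocked(detected_severity: str, configured_threshold: str) -> bool:
    """Block content whose detected severity meets or exceeds the
    threshold configured for that category."""
    return (THRESHOLD_ORDER[detected_severity]
            >= THRESHOLD_ORDER[configured_threshold])

print(is_blocked("high", "medium"))   # blocked
print(is_blocked("low", "medium"))    # allowed through
print(is_blocked("low", "low"))       # strictest setting: blocked
```

The takeaway: raising the threshold lets more borderline content through; lowering it blocks more, and you cannot set it so high that the core filters switch off.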
Exam tip: Content filtering
Key facts for the exam:
- Content filters are on by default — you don’t need to enable them
- Filters apply to both input and output
- You can configure severity thresholds but not disable core filters
- This connects to the Reliability & Safety responsible AI principle
🎬 Video walkthrough
🎬 Video coming soon
Deploying AI Models — AI-901 Module 5 (~12 min)
Knowledge Check
GreenLeaf needs their AI model to extract invoice numbers from scanned documents. The responses must be consistent — the same document should always produce the same result. Which temperature setting is most appropriate?
DataFlow Corp processes 5,000 customer support queries per minute with strict latency requirements. Which deployment type should they choose?
Priya wants to deploy a model for a class project with minimal cost. She'll only use it occasionally for testing. Which deployment type is best?
Next up: AI Workloads at a Glance — a tour of the six types of AI workloads and when to use each one.