Deploying Foundation Models
GPT-4o, Llama, Phi: choose the right model and deploy it. Learn serverless API vs managed compute, model selection criteria, and provisioned throughput for high-volume workloads.
How foundation models get deployed
Three ways to get a ride:
- Serverless API (taxi): hail a ride when you need one. Pay per trip. No car to maintain. Great when trips are unpredictable.
- Managed compute (company car): your own dedicated vehicle. Always available, fixed monthly cost. Great when you drive every day.
- Provisioned throughput (reserved lane): a guaranteed express lane on the motorway. No traffic, guaranteed speed. You pay for the lane whether you use it or not.
Each option trades flexibility for control: serverless is the easiest to start with, while provisioned throughput gives the strongest guarantees.
Serverless API vs managed compute vs provisioned throughput
| Feature | Cost Model | Scaling | Latency | Best For |
|---|---|---|---|---|
| Serverless API | Pay-per-token (input + output) | Auto-scales, shared capacity | Variable: depends on load across all tenants | Prototyping, variable workloads, getting started fast |
| Managed Compute | Per-hour VM cost (regardless of usage) | Manual or auto-scale by instance count | Consistent β dedicated VMs | Steady workloads, custom model configurations, fine-tuned models |
| Provisioned Throughput (PTUs) | Per-hour per PTU (reserved capacity) | Fixed capacity, no scaling needed | Low and predictable: reserved capacity | High-volume production, SLA-critical workloads, guaranteed throughput |
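Where the cost crossover between pay-per-token and reserved capacity sits can be sketched with back-of-envelope arithmetic. The prices below are illustrative assumptions for the sketch, not published Azure rates:

```python
# Break-even sketch: serverless (pay-per-token) vs reserved hourly capacity.
# Both prices are assumed placeholders, not real Azure pricing.

SERVERLESS_COST_PER_1K_TOKENS = 0.01   # assumed blended input+output rate (USD)
RESERVED_COST_PER_HOUR = 50.0          # assumed hourly cost of reserved capacity (USD)

def monthly_cost_serverless(tokens_per_month: int) -> float:
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1_000 * SERVERLESS_COST_PER_1K_TOKENS

def monthly_cost_reserved(hours: float = 730) -> float:
    """Reserved capacity: flat cost whether you use it or not."""
    return RESERVED_COST_PER_HOUR * hours

def break_even_tokens(hours: float = 730) -> float:
    """Monthly token volume above which reserved capacity becomes cheaper."""
    return monthly_cost_reserved(hours) / SERVERLESS_COST_PER_1K_TOKENS * 1_000

print(f"serverless @100M tokens: ${monthly_cost_serverless(100_000_000):,.0f}/month")
print(f"reserved capacity:       ${monthly_cost_reserved():,.0f}/month")
print(f"break-even volume:       {break_even_tokens():,.0f} tokens/month")
```

With these assumed numbers, a light workload is far cheaper on pay-per-token; the flat reserved cost only wins at very high, sustained volume, which is the pattern the table above describes.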
Model selection criteria
Before deploying, choose the right model for the job:
| Criterion | Questions to Ask |
|---|---|
| Task type | Text generation? Code? Embeddings? Multi-modal (text + image)? |
| Latency requirements | Real-time chat (under 500ms first token)? Batch processing (minutes OK)? |
| Cost sensitivity | Can you afford GPT-4o? Or does GPT-4o-mini do the job at 1/10th the cost? |
| Data privacy | Must data stay in your region? Are you comfortable with a hosted API? |
| Language support | Does the model perform well in your target languages? |
| Context window | How much text does the model need to process in one call? |
Exam tip: Model selection decision tree
The exam tests your ability to pick the right model:
- Complex reasoning, accuracy critical → GPT-4o
- Simple tasks, cost sensitive → GPT-4o-mini
- Embeddings for search/RAG → text-embedding-3-large (or small for cost savings)
- On-premises or edge deployment → Phi models (small, can run on consumer hardware)
- Open-source requirement → Llama, Mistral via model catalog
- Multi-modal (images + text) → GPT-4o (supports vision)
Always consider: can a smaller, cheaper model do the job? Start small, upgrade only if quality is insufficient.
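As a study aid, the decision tree above can be encoded as a small function. This is an illustrative simplification: the model names come from the list, but the branching order and function signature are ours:

```python
# Illustrative encoding of the exam decision tree; not an official API.

def pick_model(task: str, *, cost_sensitive: bool = False,
               open_source: bool = False, edge: bool = False) -> str:
    """Map workload requirements to a model family, per the decision tree."""
    if edge:
        return "Phi-4"                 # small model, runs on consumer hardware
    if open_source:
        return "Llama-3.1"             # open models come via the model catalog
    if task == "embeddings":
        # large for quality, small for cost savings
        return "text-embedding-3-small" if cost_sensitive else "text-embedding-3-large"
    if task in ("multimodal", "complex-reasoning"):
        return "gpt-4o"                # accuracy-critical or vision workloads
    # Default text tasks: start small, upgrade only if quality is insufficient
    return "gpt-4o-mini" if cost_sensitive else "gpt-4o"

print(pick_model("chat", cost_sensitive=True))   # simple, cost-sensitive task
print(pick_model("multimodal"))                  # images + text
```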
The model catalog
Foundry's model catalog gives you access to models beyond Azure OpenAI:
| Category | Models | Deployed Via |
|---|---|---|
| Azure OpenAI | GPT-4o, GPT-4o-mini, GPT-4.1, o3, o4-mini | Serverless API or PTUs |
| Meta | Llama 3.1, Llama 3.2 | Serverless API or managed compute |
| Microsoft | Phi-3, Phi-3.5, Phi-4 | Serverless API or managed compute |
| Mistral | Mistral Large, Mistral Small | Serverless API |
| Cohere | Command R, Command R+, Embed | Serverless API |
Deploying a serverless API endpoint
```bash
# Deploy GPT-4o as a serverless endpoint
az ml serverless-endpoint create \
    --name gpt4o-support-bot \
    --model-id azureml://registries/azure-openai/models/gpt-4o/versions/2024-11-20 \
    --resource-group rg-genai-prod \
    --workspace-name proj-support-bot
```
What's happening:
- Line 2: Creates a serverless endpoint (no VMs to manage, pay per token)
- Line 3: Names the endpoint for its purpose (support bot)
- Line 4: Specifies the exact model and version from the model registry
- Lines 5-6: Deploys into a specific project workspace
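Once deployed, clients call the endpoint over HTTPS with an OpenAI-style chat completions payload. A minimal sketch of building that request body follows; the endpoint URL is a placeholder, and the exact URL and auth header come from your deployment's details page:

```python
import json

# Placeholder URL: copy the real one from the deployed endpoint's details.
ENDPOINT = "https://gpt4o-support-bot.<region>.models.ai.azure.com/chat/completions"

def build_chat_payload(user_message: str, *, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completions request body."""
    return {
        "messages": [
            {"role": "system", "content": "You are a customer support assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,   # low temperature for consistent support answers
    }

payload = build_chat_payload("How do I reset my password?")
print(json.dumps(payload, indent=2))
# An actual call would POST this JSON to ENDPOINT with an api-key header.
```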
Deploying with managed compute
```bash
# Deploy a fine-tuned Llama model on dedicated compute
az ml online-endpoint create \
    --name llama-doc-analyzer \
    --resource-group rg-genai-prod \
    --workspace-name proj-doc-analysis

az ml online-deployment create \
    --name v1 \
    --endpoint-name llama-doc-analyzer \
    --model azureml://registries/azureml-meta/models/Llama-3.1-8B-Instruct/versions/3 \
    --instance-type Standard_NC24ads_A100_v4 \
    --instance-count 2 \
    --resource-group rg-genai-prod \
    --workspace-name proj-doc-analysis

# Route all traffic to this deployment
az ml online-endpoint update \
    --name llama-doc-analyzer \
    --traffic "v1=100" \
    --resource-group rg-genai-prod \
    --workspace-name proj-doc-analysis
```
What's happening:
- Lines 2-5: Creates the endpoint (the stable URL that clients call)
- Lines 7-14: Creates a deployment with a specific model on GPU VMs (A100s for large language models)
- Line 11: `Standard_NC24ads_A100_v4` is an A100 GPU SKU suited for LLM inference
- Line 12: 2 instances for redundancy and throughput
- Lines 17-21: Routes 100% of traffic to the v1 deployment
Provisioned throughput units (PTUs)
For high-volume workloads, reserve capacity:
```bash
# Deploy GPT-4o with provisioned throughput
az cognitiveservices account deployment create \
    --name aoai-genai-prod \
    --resource-group rg-genai-prod \
    --deployment-name gpt4o-doc-processing \
    --model-name gpt-4o \
    --model-version 2024-11-20 \
    --model-format OpenAI \
    --sku-capacity 100 \
    --sku-name ProvisionedManaged
```
What's happening:
- Lines 2-4: Targets the Azure OpenAI resource and resource group
- Lines 5-8: Names the deployment and specifies the model, version, and format
- Line 9: `--sku-capacity 100` reserves 100 PTUs of capacity (each PTU provides a fixed number of tokens per minute)
- Line 10: `ProvisionedManaged` is the SKU type for guaranteed, dedicated capacity
How many PTUs do you need?
PTU sizing depends on your workload:
- PTU capacity varies by model and version. Use the Azure OpenAI capacity calculator to estimate the number of PTUs needed for your workload. The calculator takes into account model type, expected tokens per minute, and latency targets.
- Estimate: peak requests/minute multiplied by average tokens per request, divided by tokens per PTU
- Start conservatively: you can increase PTUs later, but decreasing takes time (commitment periods apply)
Exam questions often test: "The company processes 10,000 documents per hour, each requiring 2000 tokens. Should they use serverless or PTUs?" Answer: PTUs, because a high-volume, predictable load benefits from reserved capacity and guaranteed throughput.
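The rough estimate above (peak requests/minute times average tokens per request, divided by tokens per PTU) can be sketched numerically for that exam scenario. The tokens-per-PTU figure below is an assumed placeholder; real values vary by model and version, so always confirm with the Azure OpenAI capacity calculator:

```python
import math

# Assumed placeholder: real tokens-per-PTU figures are model- and
# version-specific and come from the Azure OpenAI capacity calculator.
TOKENS_PER_PTU_PER_MINUTE = 2_500

def estimate_ptus(requests_per_minute: float, tokens_per_request: float,
                  headroom: float = 0.2) -> int:
    """Peak req/min x tokens/req / tokens-per-PTU, plus headroom, rounded up."""
    tokens_per_minute = requests_per_minute * tokens_per_request
    raw = tokens_per_minute / TOKENS_PER_PTU_PER_MINUTE
    return math.ceil(raw * (1 + headroom))

# Exam scenario: 10,000 documents/hour at 2,000 tokens each
docs_per_minute = 10_000 / 60
ptus = estimate_ptus(docs_per_minute, 2_000)
print(f"~{docs_per_minute * 2_000:,.0f} tokens/minute -> {ptus} PTUs (20% headroom)")
```

The same headroom idea appears in Dr. Fatima's scenario below; padding the raw estimate avoids throttling at peak.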
Scenario: Kai deploys serverless for NeuralSpark's chatbot
Kai needs to deploy GPT-4o for NeuralSpark's customer support chatbot. The chatbot handles:
- 50 requests/minute during business hours
- 5 requests/minute overnight
- Occasional spikes to 200 requests/minute during product launches
Kai chooses serverless API because:
- Traffic is highly variable: a 40x difference between quiet and peak
- Pay-per-token means quiet periods cost almost nothing
- Auto-scaling handles the spikes without manual intervention
- No infrastructure to maintain: Kai's small team can't afford to manage VMs
CTO Priya approves: cost-efficient for a startup with unpredictable traffic patterns.
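Kai's reasoning can be checked with quick arithmetic using the request rates from the scenario. The tokens-per-request figure and per-token price below are assumed placeholders for illustration, not real GPT-4o rates:

```python
# Back-of-envelope daily volume for NeuralSpark's traffic profile.
# Both constants are assumed placeholders, not real pricing.

TOKENS_PER_REQUEST = 1_000     # assumed: prompt + completion per chat turn
PRICE_PER_1K_TOKENS = 0.01     # assumed blended rate (USD)

def daily_requests(busy_rpm: float, quiet_rpm: float, busy_hours: int = 10) -> float:
    """Requests/day given a business-hours rate and an off-hours rate."""
    quiet_hours = 24 - busy_hours
    return busy_rpm * 60 * busy_hours + quiet_rpm * 60 * quiet_hours

reqs = daily_requests(busy_rpm=50, quiet_rpm=5)
cost = reqs * TOKENS_PER_REQUEST / 1_000 * PRICE_PER_1K_TOKENS
print(f"{reqs:,.0f} requests/day -> ~${cost:,.2f}/day on pay-per-token")
```

The overnight hours contribute only a small fraction of the daily volume, which is exactly why pay-per-token beats an always-on VM here: quiet periods cost almost nothing.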
Scenario: Dr. Fatima deploys PTUs for Meridian's document processing
Meridian Financial processes 50,000 loan documents daily through GPT-4o for extraction and summarisation. Dr. Fatima's requirements:
- Predictable, high volume: 50K documents every business day
- Latency SLA: each document must be processed within 30 seconds
- Compliance: cannot be affected by noisy neighbors on shared infrastructure
Fatima chooses provisioned throughput (PTUs) because:
- Volume is predictable and consistently high, so the PTU cost is justified
- Reserved capacity guarantees throughput regardless of other tenants
- SLA-critical workload needs predictable latency
- She provisions 200 PTUs based on the Azure capacity calculator, with 20% headroom
CISO James Chen approves: dedicated capacity means no risk of throttling during regulatory deadlines.
Knowledge check
NeuralSpark's chatbot handles 50 requests/minute during business hours but drops to 5 requests/minute overnight, with occasional spikes to 200 during launches. Kai wants to minimise cost while handling the spikes. Which deployment option should he choose?
Meridian Financial processes 50,000 loan documents daily with a 30-second latency SLA. Dr. Fatima needs guaranteed throughput that is not affected by other Azure tenants. Which deployment option is correct?
Next up: Model Versioning and Production Strategies, covering model updates, rollbacks, and traffic splitting.