Deploying Foundation Models
GPT-4o, Llama, Phi: choose the right model and deploy it. Learn serverless API vs managed compute, model selection criteria, and provisioned throughput for high-volume workloads.
How foundation models get deployed
Three ways to get a ride:
- Serverless API (taxi): hail a ride when you need one. Pay per trip. No car to maintain. Great when trips are unpredictable.
- Managed compute (company car): your own dedicated vehicle. Always available, fixed monthly cost. Great when you drive every day.
- Provisioned throughput (reserved lane): a guaranteed express lane on the motorway. No traffic, guaranteed speed. You pay for the lane whether you use it or not.
Each option trades flexibility for control: serverless is the easiest to start with, while provisioned throughput gives the strongest guarantees.
Serverless API vs managed compute vs provisioned throughput
| Feature | Cost Model | Scaling | Latency | Best For |
|---|---|---|---|---|
| Serverless API | Pay-per-token (input + output) | Auto-scales, shared capacity | Variable: depends on load across all tenants | Prototyping, variable workloads, getting started fast |
| Managed Compute | Per-hour VM cost (regardless of usage) | Manual or auto-scale by instance count | Consistent β dedicated VMs | Steady workloads, custom model configurations, fine-tuned models |
| Provisioned Throughput (PTUs) | Per-hour per PTU (reserved capacity) | Fixed capacity, no scaling needed | Low and predictable: reserved capacity | High-volume production, SLA-critical workloads, guaranteed throughput |
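Where the cost crossover between pay-per-token and reserved capacity sits can be sketched with back-of-envelope arithmetic. The prices below are illustrative assumptions for the sketch, not published Azure rates:

```python
# Break-even sketch: serverless (pay-per-token) vs reserved hourly capacity.
# Both prices are assumed placeholders, not real Azure pricing.

SERVERLESS_COST_PER_1K_TOKENS = 0.01   # assumed blended input+output rate (USD)
RESERVED_COST_PER_HOUR = 50.0          # assumed hourly cost of reserved capacity (USD)

def monthly_cost_serverless(tokens_per_month: int) -> float:
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1_000 * SERVERLESS_COST_PER_1K_TOKENS

def monthly_cost_reserved(hours: float = 730) -> float:
    """Reserved capacity: flat cost whether you use it or not."""
    return RESERVED_COST_PER_HOUR * hours

def break_even_tokens(hours: float = 730) -> float:
    """Monthly token volume above which reserved capacity becomes cheaper."""
    return monthly_cost_reserved(hours) / SERVERLESS_COST_PER_1K_TOKENS * 1_000

print(f"serverless @100M tokens: ${monthly_cost_serverless(100_000_000):,.0f}/month")
print(f"reserved capacity:       ${monthly_cost_reserved():,.0f}/month")
print(f"break-even volume:       {break_even_tokens():,.0f} tokens/month")
```

With these assumed numbers, a light workload is far cheaper on pay-per-token; the flat reserved cost only wins at very high, sustained volume, which is the pattern the table above describes.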
Model selection criteria
Before deploying, choose the right model for the job:
| Criterion | Questions to Ask |
|---|---|
| Task type | Text generation? Code? Embeddings? Multi-modal (text + image)? |
| Latency requirements | Real-time chat (under 500ms first token)? Batch processing (minutes OK)? |
| Cost sensitivity | Can you afford GPT-4o? Or does GPT-4o-mini do the job at 1/10th the cost? |
| Data privacy | Must data stay in your region? Are you comfortable with a hosted API? |
| Language support | Does the model perform well in your target languages? |
| Context window | How much text does the model need to process in one call? |
Exam tip: Model selection decision tree
The exam tests your ability to pick the right model:
- Complex reasoning, accuracy critical → GPT-4o
- Simple tasks, cost sensitive → GPT-4o-mini
- Embeddings for search/RAG → text-embedding-3-large (or small for cost savings)
- On-premises or edge deployment → Phi models (small, can run on consumer hardware)
- Open-source requirement → Llama, Mistral via model catalog
- Multi-modal (images + text) → GPT-4o (supports vision)
Always consider: can a smaller, cheaper model do the job? Start small, upgrade only if quality is insufficient.
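As a study aid, the decision tree above can be encoded as a small function. This is an illustrative simplification: the model names come from the list, but the branching order and function signature are ours:

```python
# Illustrative encoding of the exam decision tree; not an official API.

def pick_model(task: str, *, cost_sensitive: bool = False,
               open_source: bool = False, edge: bool = False) -> str:
    """Map workload requirements to a model family, per the decision tree."""
    if edge:
        return "Phi-4"                 # small model, runs on consumer hardware
    if open_source:
        return "Llama-3.1"             # open models come via the model catalog
    if task == "embeddings":
        # large for quality, small for cost savings
        return "text-embedding-3-small" if cost_sensitive else "text-embedding-3-large"
    if task in ("multimodal", "complex-reasoning"):
        return "gpt-4o"                # accuracy-critical or vision workloads
    # Default text tasks: start small, upgrade only if quality is insufficient
    return "gpt-4o-mini" if cost_sensitive else "gpt-4o"

print(pick_model("chat", cost_sensitive=True))   # simple, cost-sensitive task
print(pick_model("multimodal"))                  # images + text
```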
The model catalog
Foundry's model catalog gives you access to models beyond Azure OpenAI:
| Category | Models | Deployed Via |
|---|---|---|
| Azure OpenAI | GPT-4o, GPT-4o-mini, GPT-4.1, o3, o4-mini | Serverless API or PTUs |
| Meta | Llama 3.1, Llama 3.2 | Serverless API or managed compute |
| Microsoft | Phi-3, Phi-3.5, Phi-4 | Serverless API or managed compute |
| Mistral | Mistral Large, Mistral Small | Serverless API |
| Cohere | Command R, Command R+, Embed | Serverless API |
Deploying a serverless API endpoint
```bash
# Deploy GPT-4o as a serverless endpoint
az ml serverless-endpoint create \
    --name gpt4o-support-bot \
    --model-id azureml://registries/azure-openai/models/gpt-4o/versions/2024-11-20 \
    --resource-group rg-genai-prod \
    --workspace-name proj-support-bot
```
What's happening:
- Line 2: Creates a serverless endpoint (no VMs to manage, pay per token)
- Line 3: Names the endpoint for its purpose (support bot)
- Line 4: Specifies the exact model and version from the model registry
- Lines 5-6: Deploys into a specific project workspace
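Once deployed, clients call the endpoint over HTTPS with an OpenAI-style chat completions payload. A minimal sketch of building that request body follows; the endpoint URL is a placeholder, and the exact URL and auth header come from your deployment's details page:

```python
import json

# Placeholder URL: copy the real one from the deployed endpoint's details.
ENDPOINT = "https://gpt4o-support-bot.<region>.models.ai.azure.com/chat/completions"

def build_chat_payload(user_message: str, *, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completions request body."""
    return {
        "messages": [
            {"role": "system", "content": "You are a customer support assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,   # low temperature for consistent support answers
    }

payload = build_chat_payload("How do I reset my password?")
print(json.dumps(payload, indent=2))
# An actual call would POST this JSON to ENDPOINT with an api-key header.
```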
Deploying with managed compute
```bash
# Deploy a fine-tuned Llama model on dedicated compute
az ml online-endpoint create \
    --name llama-doc-analyzer \
    --resource-group rg-genai-prod \
    --workspace-name proj-doc-analysis

az ml online-deployment create \
    --name v1 \
    --endpoint-name llama-doc-analyzer \
    --model azureml://registries/azureml-meta/models/Llama-3.1-8B-Instruct/versions/3 \
    --instance-type Standard_NC24ads_A100_v4 \
    --instance-count 2 \
    --resource-group rg-genai-prod \
    --workspace-name proj-doc-analysis

# Route all traffic to this deployment
az ml online-endpoint update \
    --name llama-doc-analyzer \
    --traffic "v1=100" \
    --resource-group rg-genai-prod \
    --workspace-name proj-doc-analysis
```
What's happening:
- Lines 2-5: Creates the endpoint (the stable URL that clients call)
- Lines 7-14: Creates a deployment with a specific model on GPU VMs (A100s for large language models)
- Line 11: `Standard_NC24ads_A100_v4` is an A100 GPU SKU suited for LLM inference
- Line 12: 2 instances for redundancy and throughput
- Lines 17-21: Routes 100% of traffic to the v1 deployment
Provisioned throughput units (PTUs)
For high-volume workloads, reserve capacity:
```bash
# Deploy GPT-4o with provisioned throughput
az cognitiveservices account deployment create \
    --name aoai-genai-prod \
    --resource-group rg-genai-prod \
    --deployment-name gpt4o-doc-processing \
    --model-name gpt-4o \
    --model-version 2024-11-20 \
    --model-format OpenAI \
    --sku-capacity 100 \
    --sku-name ProvisionedManaged
```
What's happening:
- Lines 2-4: Targets the Azure OpenAI resource and resource group
- Lines 5-8: Names the deployment and specifies the model, version, and format
- Line 9: `--sku-capacity 100` reserves 100 PTUs of capacity (each PTU provides a fixed number of tokens per minute)
- Line 10: `ProvisionedManaged` is the SKU type for guaranteed, dedicated capacity
How many PTUs do you need?
PTU sizing depends on your workload:
- PTU capacity varies by model and version. Use the Azure OpenAI capacity calculator to estimate the number of PTUs needed for your workload. The calculator takes into account model type, expected tokens per minute, and latency targets.
- Estimate: peak requests/minute multiplied by average tokens per request, divided by tokens per PTU
- Start conservatively: you can increase PTUs later, but decreasing takes time (commitment periods apply)
Exam questions often test: "The company processes 10,000 documents per hour, each requiring 2000 tokens. Should they use serverless or PTUs?" Answer: PTUs, because a high-volume, predictable load benefits from reserved capacity and guaranteed throughput.
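The rough estimate above (peak requests/minute times average tokens per request, divided by tokens per PTU) can be sketched numerically for that exam scenario. The tokens-per-PTU figure below is an assumed placeholder; real values vary by model and version, so always confirm with the Azure OpenAI capacity calculator:

```python
import math

# Assumed placeholder: real tokens-per-PTU figures are model- and
# version-specific and come from the Azure OpenAI capacity calculator.
TOKENS_PER_PTU_PER_MINUTE = 2_500

def estimate_ptus(requests_per_minute: float, tokens_per_request: float,
                  headroom: float = 0.2) -> int:
    """Peak req/min x tokens/req / tokens-per-PTU, plus headroom, rounded up."""
    tokens_per_minute = requests_per_minute * tokens_per_request
    raw = tokens_per_minute / TOKENS_PER_PTU_PER_MINUTE
    return math.ceil(raw * (1 + headroom))

# Exam scenario: 10,000 documents/hour at 2,000 tokens each
docs_per_minute = 10_000 / 60
ptus = estimate_ptus(docs_per_minute, 2_000)
print(f"~{docs_per_minute * 2_000:,.0f} tokens/minute -> {ptus} PTUs (20% headroom)")
```

The same headroom idea appears in Dr. Fatima's scenario below; padding the raw estimate avoids throttling at peak.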
Scenario: Kai deploys serverless for NeuralSpark's chatbot
Kai needs to deploy GPT-4o for NeuralSpark's customer support chatbot. The chatbot handles:
- 50 requests/minute during business hours
- 5 requests/minute overnight
- Occasional spikes to 200 requests/minute during product launches
Kai chooses serverless API because:
- Traffic is highly variable: a 40x difference between quiet and peak
- Pay-per-token means quiet periods cost almost nothing
- Auto-scaling handles the spikes without manual intervention
- No infrastructure to maintain: Kai's small team can't afford to manage VMs
CTO Priya approves: cost-efficient for a startup with unpredictable traffic patterns.
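Kai's reasoning can be checked with quick arithmetic using the request rates from the scenario. The tokens-per-request figure and per-token price below are assumed placeholders for illustration, not real GPT-4o rates:

```python
# Back-of-envelope daily volume for NeuralSpark's traffic profile.
# Both constants are assumed placeholders, not real pricing.

TOKENS_PER_REQUEST = 1_000     # assumed: prompt + completion per chat turn
PRICE_PER_1K_TOKENS = 0.01     # assumed blended rate (USD)

def daily_requests(busy_rpm: float, quiet_rpm: float, busy_hours: int = 10) -> float:
    """Requests/day given a business-hours rate and an off-hours rate."""
    quiet_hours = 24 - busy_hours
    return busy_rpm * 60 * busy_hours + quiet_rpm * 60 * quiet_hours

reqs = daily_requests(busy_rpm=50, quiet_rpm=5)
cost = reqs * TOKENS_PER_REQUEST / 1_000 * PRICE_PER_1K_TOKENS
print(f"{reqs:,.0f} requests/day -> ~${cost:,.2f}/day on pay-per-token")
```

The overnight hours contribute only a small fraction of the daily volume, which is exactly why pay-per-token beats an always-on VM here: quiet periods cost almost nothing.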
Scenario: Dr. Fatima deploys PTUs for Meridian's document processing
Meridian Financial processes 50,000 loan documents daily through GPT-4o for extraction and summarisation. Dr. Fatima's requirements:
- Predictable, high volume: 50K documents every business day
- Latency SLA: each document must be processed within 30 seconds
- Compliance: cannot be affected by noisy neighbors on shared infrastructure
Fatima chooses provisioned throughput (PTUs) because:
- Volume is predictable and consistently high, so the PTU cost is justified
- Reserved capacity guarantees throughput regardless of other tenants
- SLA-critical workload needs predictable latency
- She provisions 200 PTUs based on the Azure capacity calculator, with 20% headroom
CISO James Chen approves: dedicated capacity means no risk of throttling during regulatory deadlines.
Knowledge check
NeuralSpark's chatbot handles 50 requests/minute during business hours but drops to 5 requests/minute overnight, with occasional spikes to 200 during launches. Kai wants to minimise cost while handling the spikes. Which deployment option should he choose?
Meridian Financial processes 50,000 loan documents daily with a 30-second latency SLA. Dr. Fatima needs guaranteed throughput that is not affected by other Azure tenants. Which deployment option is correct?
Next up: Model Versioning and Production Strategies, covering model updates, rollbacks, and traffic splitting.