Model Versioning & Production Strategies
Foundation models get updated. Learn to version deployments, implement rollback strategies, and manage the transition from one model version to another in production.
Why versioning matters for GenAI
Versioning is like keeping the old menu while testing new recipes.
When a restaurant introduces a new dish, they don’t throw away the old menu. They test the new recipe with a few tables first. If customers love it, they add it to the main menu. If they hate it, the old dish is still there — no disruption.
Rollback is switching back to the old menu if the new recipe fails. You can only do this if you kept the old menu around.
GenAI models work the same way. When GPT-4o gets updated, your prompts may start behaving differently. If you pinned your deployment to a specific version, you control when to upgrade — not Microsoft.
Versioning strategies
| Strategy | How It Works | Risk Level | Best For |
|---|---|---|---|
| Version Pinning | Specify exact model version in deployment (e.g., gpt-4o-2024-11-20) | Low — no surprise changes | Production workloads where behaviour must be predictable |
| Auto-Update (Default) | Azure updates to latest stable version automatically | High — behaviour may change without warning | Non-critical workloads, prototyping only |
| Deployment Naming | Name deployments by version (gpt4o-v1120, gpt4o-v0513) | Low — clear which version is in use | Managing multiple versions side-by-side |
Exam tip: Always pin versions in production
If an exam question describes a production workload and asks about model versioning, the answer is almost always pin to a specific version.
- “Which approach ensures consistent behaviour?” → Version pinning
- “The application’s responses changed after a model update. How to prevent this?” → Pin the model version
- “How to test a new model version before production?” → Deploy the new version alongside the old one, split traffic
Auto-update is only acceptable for non-production, experimental workloads.
Pinning model versions in deployments
# Deploy GPT-4o pinned to a specific version
az cognitiveservices account deployment create \
--name aoai-genai-prod \
--resource-group rg-genai-prod \
--deployment-name gpt4o-prod-v1120 \
--model-name gpt-4o \
--model-version "2024-11-20" \
--model-format OpenAI \
--sku-capacity 50 \
--sku-name Standard
What’s happening:
- `--deployment-name gpt4o-prod-v1120` — the version date is folded into the deployment name, so it's obvious which version is running
- `--model-version "2024-11-20"` — pins to this exact version. Azure will NOT auto-update it
- `--sku-name Standard` — pay-per-token billing. Change to `ProvisionedManaged` for PTUs
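The naming convention used here (model name plus version date, e.g. `gpt4o-prod-v1120`) can be generated mechanically so it never drifts from the pinned version. A minimal sketch, assuming the `v<MMDD>` scheme this section uses — the helper name is illustrative, not an Azure API:

```python
def deployment_name(model: str, version: str, env: str = "prod") -> str:
    """Build a deployment name like 'gpt4o-prod-v1120' from a model and version date."""
    short_model = model.replace("-", "")    # "gpt-4o" -> "gpt4o"
    _year, month, day = version.split("-")  # "2024-11-20" -> month "11", day "20"
    return f"{short_model}-{env}-v{month}{day}"

print(deployment_name("gpt-4o", "2024-11-20"))       # gpt4o-prod-v1120
print(deployment_name("gpt-4o-mini", "2024-07-18"))  # gpt4omini-prod-v0718
```

Generating the name from the pinned `--model-version` keeps the deployment name and the actual version in lockstep, which matters once several versions run side by side.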
Blue-green deployment for model updates
When a new model version is available, deploy it alongside the current one and shift traffic gradually:
# Step 1: Deploy the new model version alongside the current one
az cognitiveservices account deployment create \
--name aoai-genai-prod \
--resource-group rg-genai-prod \
--deployment-name gpt4o-prod-v0513 \
--model-name gpt-4o \
--model-version "2025-05-13" \
--model-format OpenAI \
--sku-capacity 50 \
--sku-name Standard
# Step 2: Route traffic between versions in your application
import random
from openai import AzureOpenAI
# Configuration for traffic splitting
DEPLOYMENTS = {
"gpt4o-prod-v1120": 90, # 90% to current version
"gpt4o-prod-v0513": 10, # 10% to new version (canary)
}
def get_deployment():
"""Route requests based on traffic weights."""
roll = random.randint(1, 100)
cumulative = 0
for deployment, weight in DEPLOYMENTS.items():
cumulative += weight
if roll <= cumulative:
return deployment
return list(DEPLOYMENTS.keys())[0]
client = AzureOpenAI(
    azure_endpoint="https://aoai-genai-prod.openai.azure.com",
    api_version="2024-10-21",
    # api_key is omitted here; the SDK reads AZURE_OPENAI_API_KEY from the environment
)
# Each request goes to the selected deployment
response = client.chat.completions.create(
model=get_deployment(),
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarise this loan document..."}
]
)
What’s happening:
- `DEPLOYMENTS` defines the traffic weights — 90% goes to the old version, 10% to the new one
- `get_deployment()` randomly selects which deployment handles each request, based on the weights
- The selected deployment name is passed as the `model` parameter
- To shift traffic: change the weights (e.g. 50/50, then 0/100) and redeploy your application code
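Before shipping a new set of weights, it's worth sanity-checking them and simulating the resulting split. A small sketch reusing the `DEPLOYMENTS`/`get_deployment` shapes from above, with hypothetical weights for a 50/50 stage of the rollout:

```python
import random

# Hypothetical weights for a 50/50 stage of the rollout
DEPLOYMENTS = {
    "gpt4o-prod-v1120": 50,
    "gpt4o-prod-v0513": 50,
}

def validate_weights(weights: dict) -> None:
    """Fail fast if the traffic weights do not form a complete split."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"Traffic weights must sum to 100, got {total}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("Traffic weights must be non-negative")

def get_deployment() -> str:
    """Same weighted selection as the routing example above."""
    roll = random.randint(1, 100)
    cumulative = 0
    for deployment, weight in DEPLOYMENTS.items():
        cumulative += weight
        if roll <= cumulative:
            return deployment
    return next(iter(DEPLOYMENTS))

validate_weights(DEPLOYMENTS)

# Simulate 10,000 requests to confirm the split is roughly as configured
counts = {name: 0 for name in DEPLOYMENTS}
for _ in range(10_000):
    counts[get_deployment()] += 1
```

Running the validation in CI (or at application start-up) catches weight typos before they silently skew traffic in production.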
Rollback procedures
When the new version causes issues, roll back immediately:
# Option 1: Application-level rollback — change traffic weights
# Update your deployment config to send 100% to the old version
# DEPLOYMENTS = {"gpt4o-prod-v1120": 100, "gpt4o-prod-v0513": 0}
# Option 2: Delete the problematic deployment entirely
az cognitiveservices account deployment delete \
--name aoai-genai-prod \
--resource-group rg-genai-prod \
--deployment-name gpt4o-prod-v0513
What’s happening:
- Option 1: Fastest rollback — just change the traffic weights in your application config to send 100% to the old version. No infrastructure changes needed
- Option 2: Clean up the failed deployment after shifting traffic away. This frees up capacity
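The Option 1 rollback can also be automated: track the canary's error rate and flip the weights as soon as it crosses a threshold. A sketch under illustrative names and thresholds — none of this is an Azure API, and real systems would pull error counts from monitoring rather than in-process counters:

```python
CANARY = "gpt4o-prod-v0513"
STABLE = "gpt4o-prod-v1120"
ERROR_RATE_THRESHOLD = 0.05  # roll back if more than 5% of canary requests fail
MIN_REQUESTS = 100           # avoid deciding on too small a sample

stats = {"requests": 0, "errors": 0}

def record_result(ok: bool) -> None:
    """Record the outcome of one canary request."""
    stats["requests"] += 1
    if not ok:
        stats["errors"] += 1

def should_roll_back() -> bool:
    """True once the canary has enough traffic and its error rate is too high."""
    if stats["requests"] < MIN_REQUESTS:
        return False
    return stats["errors"] / stats["requests"] > ERROR_RATE_THRESHOLD

def rollback_weights() -> dict:
    """Option 1 from above: send 100% of traffic back to the stable deployment."""
    return {STABLE: 100, CANARY: 0}

# Example: 200 canary requests, 20 of which failed (10% error rate)
for i in range(200):
    record_result(ok=(i % 10 != 0))

if should_roll_back():
    new_weights = rollback_weights()  # push this config and redeploy
```

The minimum-sample guard matters: a single failed request out of ten would otherwise trigger a rollback on noise.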
A/B testing across model versions
Compare old vs new systematically:
| What to Compare | How to Measure | Decision Criteria |
|---|---|---|
| Response quality | Human evaluation scores, LLM-as-judge scores | New version scores equal or better |
| Latency | P50, P95, P99 response times | New version within 10% of old |
| Cost | Tokens per request (input + output) | New version not significantly more expensive |
| Safety | Content filter trigger rate, harmful response rate | New version same or lower trigger rate |
| Task accuracy | Ground truth comparison on test dataset | New version matches or beats accuracy |
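The latency row of the table can be checked mechanically from logged response times. A sketch with a simple linear-interpolation percentile and the "within 10% of old" rule; the sample latencies are illustrative — in practice they come from your telemetry:

```python
def percentile(values, p):
    """Linear-interpolated percentile (p in 0-100) of a list of numbers."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

def within_budget(old, new, p=95, tolerance=0.10):
    """Apply the table's rule: the new version's P95 must be within 10% of the old."""
    return percentile(new, p) <= percentile(old, p) * (1 + tolerance)

# Illustrative request latencies in seconds (in practice, pulled from request logs)
old_latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0]
new_latencies = [0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 1.6, 1.7, 2.1]

p95_old = percentile(old_latencies, 95)
p95_new = percentile(new_latencies, 95)
ok = within_budget(old_latencies, new_latencies)
```

The same `within_budget` shape works for the cost row (mean tokens per request) with a different tolerance; quality and safety rows need human or LLM-as-judge scores rather than raw logs.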
Scenario: Zara manages GPT-4o to GPT-4o-mini migration at Atlas
Zara wants to cut costs at Atlas Consulting. GPT-4o-mini costs roughly 1/10th of GPT-4o. Marcus Webb, her lead, approves testing it on lower-complexity tasks.
Zara’s migration plan:
- Identify candidate tasks: Client email summarisation (lower complexity) vs contract analysis (high complexity)
- Deploy GPT-4o-mini as a new deployment: `gpt4omini-prod-v0718`
- A/B test on email summarisation: 10% traffic to mini, 90% to GPT-4o. Compare quality scores from the evaluation pipeline
- Results after 1 week: Mini scores 4.2/5 (vs GPT-4o at 4.4/5) on summarisation quality — acceptable
- Progressive rollout: 50%, then 100% of email summarisation traffic to mini
- Contract analysis stays on GPT-4o: Quality drop was unacceptable (3.1/5 vs 4.6/5)
Result: 60% cost reduction on email summarisation, zero quality compromise on contract analysis. Both versions pinned and running in parallel.
Model deprecation planning
Azure OpenAI publishes a deprecation schedule. Plan ahead:
| Phase | What Happens | Action Required |
|---|---|---|
| Generally available | Model is fully supported | Deploy and use normally |
| Deprecation announced | Microsoft publishes retirement date (usually 6-12 months notice) | Start testing the replacement model |
| Legacy period | Model still works but no longer receives updates | Complete migration to replacement |
| Retired | Model is no longer available — API calls return errors | Must be off this version before this date |
Exam tip: Deprecation readiness
The exam may test deprecation scenarios:
- “GPT-4o version 2024-05-13 is being retired in 3 months. What should you do?” → Deploy the replacement version alongside, A/B test, migrate traffic progressively
- “A deployed model version was retired and the API is returning errors” → You should have pinned the version AND monitored deprecation announcements. Immediate action: deploy the latest supported version
Key practice: subscribe to Azure OpenAI model deprecation notifications and build version migration into your operational runbook.
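That runbook check can be as simple as a scheduled job that compares today's date against each pinned version's announced retirement date. A sketch mapping days-remaining onto the phases in the table above — the dates in `RETIREMENT_DATES` are placeholders, and in practice must come from Microsoft's published deprecation schedule:

```python
from datetime import date

# Placeholder retirement dates keyed by (model, version); fill these from the
# official Azure OpenAI deprecation schedule, not from this example.
RETIREMENT_DATES = {
    ("gpt-4o", "2024-05-13"): date(2025, 9, 1),  # hypothetical date
    ("gpt-4o", "2024-11-20"): None,              # no retirement announced
}

def migration_status(model: str, version: str, today: date) -> str:
    """Map days-until-retirement onto the lifecycle phases in the table above."""
    retired_on = RETIREMENT_DATES.get((model, version))
    if retired_on is None:
        return "generally available - no retirement announced"
    days_left = (retired_on - today).days
    if days_left < 0:
        return "retired - API calls will fail, migrate immediately"
    if days_left <= 90:
        return f"legacy - {days_left} days left, complete migration now"
    return f"deprecation announced - {days_left} days left, start testing the replacement"

print(migration_status("gpt-4o", "2024-05-13", date(2025, 7, 1)))
```

Running this daily against every production deployment turns a surprise retirement error into a ticket raised months in advance.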
Knowledge check
Zara's client-facing summarisation tool at Atlas Consulting suddenly starts producing lower-quality outputs. The team discovers Azure auto-updated the GPT-4o deployment to a newer version overnight. How should Zara prevent this in future?
Dr. Fatima is migrating Meridian's document processing from GPT-4o version 2024-05-13 (being deprecated) to 2024-11-20. She processes 50,000 documents daily and cannot tolerate quality regressions. What is the safest migration approach?
Next up: PromptOps — treating prompts as code with versioning, testing, and CI/CD.