Model Versioning & Production Strategies
Foundation models get updated. Learn to version deployments, implement rollback strategies, and manage the transition from one model version to another in production.
Why versioning matters for GenAI
Versioning is like keeping the old menu while testing new recipes.
When a restaurant introduces a new dish, they don’t throw away the old menu. They test the new recipe with a few tables first. If customers love it, they add it to the main menu. If they hate it, the old dish is still there — no disruption.
Rollback is switching back to the old menu if the new recipe fails. You can only do this if you kept the old menu around.
GenAI models work the same way. When GPT-4o gets updated, your prompts may start behaving differently. If you pinned your deployment to a specific version, you control when to upgrade — not Microsoft.
Versioning strategies
| Strategy | How It Works | Risk Level | Best For |
|---|---|---|---|
| Version Pinning | Specify exact model version in deployment (e.g., gpt-4o-2024-11-20) | Low — no surprise changes | Production workloads where behaviour must be predictable |
| Auto-Update (Default) | Azure updates to latest stable version automatically | High — behaviour may change without warning | Non-critical workloads, prototyping only |
| Deployment Naming | Name deployments by version (gpt4o-v1120, gpt4o-v0513) | Low — clear which version is in use | Managing multiple versions side-by-side |
Exam tip: Always pin versions in production
If an exam question describes a production workload and asks about model versioning, the answer is almost always pin to a specific version.
- “Which approach ensures consistent behaviour?” → Version pinning
- “The application’s responses changed after a model update. How to prevent this?” → Pin the model version
- “How to test a new model version before production?” → Deploy the new version alongside the old one, split traffic
Auto-update is only acceptable for non-production, experimental workloads.
Pinning model versions in deployments
# Deploy GPT-4o pinned to a specific version
az cognitiveservices account deployment create \
--name aoai-genai-prod \
--resource-group rg-genai-prod \
--deployment-name gpt4o-prod-v1120 \
--model-name gpt-4o \
--model-version "2024-11-20" \
--model-format OpenAI \
--sku-capacity 50 \
--sku-name Standard
What’s happening:
- `--deployment-name gpt4o-prod-v1120` — the version date is folded into the deployment name, so it's obvious which version is running
- `--model-version "2024-11-20"` — pins to this exact version. Azure will NOT auto-update it
- `--sku-name Standard` — pay-per-token billing. Change to `ProvisionedManaged` for PTUs
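The naming convention used here (model name plus version date, e.g. `gpt4o-prod-v1120`) can be generated mechanically so it never drifts from the pinned version. A minimal sketch, assuming the `v<MMDD>` scheme this section uses — the helper name is illustrative, not an Azure API:

```python
def deployment_name(model: str, version: str, env: str = "prod") -> str:
    """Build a deployment name like 'gpt4o-prod-v1120' from a model and version date."""
    short_model = model.replace("-", "")    # "gpt-4o" -> "gpt4o"
    _year, month, day = version.split("-")  # "2024-11-20" -> month "11", day "20"
    return f"{short_model}-{env}-v{month}{day}"

print(deployment_name("gpt-4o", "2024-11-20"))       # gpt4o-prod-v1120
print(deployment_name("gpt-4o-mini", "2024-07-18"))  # gpt4omini-prod-v0718
```

Generating the name from the pinned `--model-version` keeps the deployment name and the actual version in lockstep, which matters once several versions run side by side.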
Blue-green deployment for model updates
When a new model version is available, deploy it alongside the current one and shift traffic gradually:
# Step 1: Deploy the new model version alongside the current one
az cognitiveservices account deployment create \
--name aoai-genai-prod \
--resource-group rg-genai-prod \
--deployment-name gpt4o-prod-v0513 \
--model-name gpt-4o \
--model-version "2025-05-13" \
--model-format OpenAI \
--sku-capacity 50 \
--sku-name Standard
# Step 2: Route traffic between versions in your application
import random
from openai import AzureOpenAI
# Configuration for traffic splitting
DEPLOYMENTS = {
"gpt4o-prod-v1120": 90, # 90% to current version
"gpt4o-prod-v0513": 10, # 10% to new version (canary)
}
def get_deployment():
"""Route requests based on traffic weights."""
roll = random.randint(1, 100)
cumulative = 0
for deployment, weight in DEPLOYMENTS.items():
cumulative += weight
if roll <= cumulative:
return deployment
return list(DEPLOYMENTS.keys())[0]
client = AzureOpenAI(
    azure_endpoint="https://aoai-genai-prod.openai.azure.com",
    api_version="2024-10-21",
    # api_key is omitted here; the SDK reads AZURE_OPENAI_API_KEY from the environment
)
# Each request goes to the selected deployment
response = client.chat.completions.create(
model=get_deployment(),
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarise this loan document..."}
]
)
What’s happening:
- `DEPLOYMENTS` defines the traffic weights — 90% goes to the old version, 10% to the new one
- `get_deployment()` randomly selects which deployment handles each request, based on the weights
- The selected deployment name is passed as the `model` parameter
- To shift traffic: change the weights (e.g. 50/50, then 0/100) and redeploy your application code
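Before shipping a new set of weights, it's worth sanity-checking them and simulating the resulting split. A small sketch reusing the `DEPLOYMENTS`/`get_deployment` shapes from above, with hypothetical weights for a 50/50 stage of the rollout:

```python
import random

# Hypothetical weights for a 50/50 stage of the rollout
DEPLOYMENTS = {
    "gpt4o-prod-v1120": 50,
    "gpt4o-prod-v0513": 50,
}

def validate_weights(weights: dict) -> None:
    """Fail fast if the traffic weights do not form a complete split."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"Traffic weights must sum to 100, got {total}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("Traffic weights must be non-negative")

def get_deployment() -> str:
    """Same weighted selection as the routing example above."""
    roll = random.randint(1, 100)
    cumulative = 0
    for deployment, weight in DEPLOYMENTS.items():
        cumulative += weight
        if roll <= cumulative:
            return deployment
    return next(iter(DEPLOYMENTS))

validate_weights(DEPLOYMENTS)

# Simulate 10,000 requests to confirm the split is roughly as configured
counts = {name: 0 for name in DEPLOYMENTS}
for _ in range(10_000):
    counts[get_deployment()] += 1
```

Running the validation in CI (or at application start-up) catches weight typos before they silently skew traffic in production.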
Rollback procedures
When the new version causes issues, roll back immediately:
# Option 1: Application-level rollback — change traffic weights
# Update your deployment config to send 100% to the old version
# DEPLOYMENTS = {"gpt4o-prod-v1120": 100, "gpt4o-prod-v0513": 0}
# Option 2: Delete the problematic deployment entirely
az cognitiveservices account deployment delete \
--name aoai-genai-prod \
--resource-group rg-genai-prod \
--deployment-name gpt4o-prod-v0513
What’s happening:
- Option 1: Fastest rollback — just change the traffic weights in your application config to send 100% to the old version. No infrastructure changes needed
- Option 2: Clean up the failed deployment after shifting traffic away. This frees up capacity
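The Option 1 rollback can also be automated: track the canary's error rate and flip the weights as soon as it crosses a threshold. A sketch under illustrative names and thresholds — none of this is an Azure API, and real systems would pull error counts from monitoring rather than in-process counters:

```python
CANARY = "gpt4o-prod-v0513"
STABLE = "gpt4o-prod-v1120"
ERROR_RATE_THRESHOLD = 0.05  # roll back if more than 5% of canary requests fail
MIN_REQUESTS = 100           # avoid deciding on too small a sample

stats = {"requests": 0, "errors": 0}

def record_result(ok: bool) -> None:
    """Record the outcome of one canary request."""
    stats["requests"] += 1
    if not ok:
        stats["errors"] += 1

def should_roll_back() -> bool:
    """True once the canary has enough traffic and its error rate is too high."""
    if stats["requests"] < MIN_REQUESTS:
        return False
    return stats["errors"] / stats["requests"] > ERROR_RATE_THRESHOLD

def rollback_weights() -> dict:
    """Option 1 from above: send 100% of traffic back to the stable deployment."""
    return {STABLE: 100, CANARY: 0}

# Example: 200 canary requests, 20 of which failed (10% error rate)
for i in range(200):
    record_result(ok=(i % 10 != 0))

if should_roll_back():
    new_weights = rollback_weights()  # push this config and redeploy
```

The minimum-sample guard matters: a single failed request out of ten would otherwise trigger a rollback on noise.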
A/B testing across model versions
Compare old vs new systematically:
| What to Compare | How to Measure | Decision Criteria |
|---|---|---|
| Response quality | Human evaluation scores, LLM-as-judge scores | New version scores equal or better |
| Latency | P50, P95, P99 response times | New version within 10% of old |
| Cost | Tokens per request (input + output) | New version not significantly more expensive |
| Safety | Content filter trigger rate, harmful response rate | New version same or lower trigger rate |
| Task accuracy | Ground truth comparison on test dataset | New version matches or beats accuracy |
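The latency row of the table can be checked mechanically from logged response times. A sketch with a simple linear-interpolation percentile and the "within 10% of old" rule; the sample latencies are illustrative — in practice they come from your telemetry:

```python
def percentile(values, p):
    """Linear-interpolated percentile (p in 0-100) of a list of numbers."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

def within_budget(old, new, p=95, tolerance=0.10):
    """Apply the table's rule: the new version's P95 must be within 10% of the old."""
    return percentile(new, p) <= percentile(old, p) * (1 + tolerance)

# Illustrative request latencies in seconds (in practice, pulled from request logs)
old_latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0]
new_latencies = [0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 1.6, 1.7, 2.1]

p95_old = percentile(old_latencies, 95)
p95_new = percentile(new_latencies, 95)
ok = within_budget(old_latencies, new_latencies)
```

The same `within_budget` shape works for the cost row (mean tokens per request) with a different tolerance; quality and safety rows need human or LLM-as-judge scores rather than raw logs.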
Scenario: Zara manages GPT-4o to GPT-4o-mini migration at Atlas
Zara wants to cut costs at Atlas Consulting. GPT-4o-mini costs roughly 1/10th of GPT-4o. Marcus Webb, her lead, approves testing it on lower-complexity tasks.
Zara’s migration plan:
- Identify candidate tasks: Client email summarisation (lower complexity) vs contract analysis (high complexity)
- Deploy GPT-4o-mini as a new deployment: `gpt4omini-prod-v0718`
- A/B test on email summarisation: 10% traffic to mini, 90% to GPT-4o. Compare quality scores from the evaluation pipeline
- Results after 1 week: Mini scores 4.2/5 (vs GPT-4o at 4.4/5) on summarisation quality — acceptable
- Progressive rollout: 50%, then 100% of email summarisation traffic to mini
- Contract analysis stays on GPT-4o: Quality drop was unacceptable (3.1/5 vs 4.6/5)
Result: 60% cost reduction on email summarisation, zero quality compromise on contract analysis. Both versions pinned and running in parallel.
Model deprecation planning
Azure OpenAI publishes a deprecation schedule. Plan ahead:
| Phase | What Happens | Action Required |
|---|---|---|
| Generally available | Model is fully supported | Deploy and use normally |
| Deprecation announced | Microsoft publishes retirement date (usually 6-12 months notice) | Start testing the replacement model |
| Legacy period | Model still works but no longer receives updates | Complete migration to replacement |
| Retired | Model is no longer available — API calls return errors | Must be off this version before this date |
Exam tip: Deprecation readiness
The exam may test deprecation scenarios:
- “GPT-4o version 2024-05-13 is being retired in 3 months. What should you do?” → Deploy the replacement version alongside, A/B test, migrate traffic progressively
- “A deployed model version was retired and the API is returning errors” → You should have pinned the version AND monitored deprecation announcements. Immediate action: deploy the latest supported version
Key practice: subscribe to Azure OpenAI model deprecation notifications and build version migration into your operational runbook.
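That runbook check can be as simple as a scheduled job that compares today's date against each pinned version's announced retirement date. A sketch mapping days-remaining onto the phases in the table above — the dates in `RETIREMENT_DATES` are placeholders, and in practice must come from Microsoft's published deprecation schedule:

```python
from datetime import date

# Placeholder retirement dates keyed by (model, version); fill these from the
# official Azure OpenAI deprecation schedule, not from this example.
RETIREMENT_DATES = {
    ("gpt-4o", "2024-05-13"): date(2025, 9, 1),  # hypothetical date
    ("gpt-4o", "2024-11-20"): None,              # no retirement announced
}

def migration_status(model: str, version: str, today: date) -> str:
    """Map days-until-retirement onto the lifecycle phases in the table above."""
    retired_on = RETIREMENT_DATES.get((model, version))
    if retired_on is None:
        return "generally available - no retirement announced"
    days_left = (retired_on - today).days
    if days_left < 0:
        return "retired - API calls will fail, migrate immediately"
    if days_left <= 90:
        return f"legacy - {days_left} days left, complete migration now"
    return f"deprecation announced - {days_left} days left, start testing the replacement"

print(migration_status("gpt-4o", "2024-05-13", date(2025, 7, 1)))
```

Running this daily against every production deployment turns a surprise retirement error into a ticket raised months in advance.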
Knowledge check
Zara's client-facing summarisation tool at Atlas Consulting suddenly starts producing lower-quality outputs. The team discovers Azure auto-updated the GPT-4o deployment to a newer version overnight. How should Zara prevent this in future?
Dr. Fatima is migrating Meridian's document processing from GPT-4o version 2024-05-13 (being deprecated) to 2024-11-20. She processes 50,000 documents daily and cannot tolerate quality regressions. What is the safest migration approach?
Next up: PromptOps — treating prompts as code with versioning, testing, and CI/CD.