
AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production

Domain 5: Optimize Generative AI Systems and Model Performance · ⏱ ~16 min read

Fine-Tuning: Methods, Data & Production

When prompting isn't enough and RAG doesn't fit, fine-tune. Learn advanced fine-tuning methods, synthetic data generation, performance monitoring, and the critical decision: fine-tuning vs RAG vs prompt engineering.

When to fine-tune (and when not to)

☕ Simple explanation

Fine-tuning is like teaching a chef your restaurant’s specific recipes.

You hire a brilliant chef (the base model). They know how to cook everything. But your restaurant has a unique menu with secret recipes, specific plating styles, and house rules about ingredients.

You have three options:

  • Prompt engineering = giving the chef detailed instructions each time: “Make the pasta, but use our house sauce recipe, plate it on the blue dish, and garnish with basil from the left fridge.” Works, but you repeat yourself every order.
  • RAG = giving the chef a recipe book to look up each time: “Check page 47 for the house sauce.” Works, but they have to read the book every order — slower and depends on finding the right page.
  • Fine-tuning = training the chef so they memorise your recipes: “After 2 weeks of practice, they just KNOW the house sauce.” No instructions needed, no book lookup — but training takes time and money, and if you change the recipe, you have to retrain.

Fine-tuning modifies a pre-trained model’s weights using your domain-specific training data. The model learns patterns, terminology, style, and knowledge from your examples, becoming specialised for your use case. In Azure OpenAI, fine-tuning uses supervised fine-tuning (SFT) — you provide input-output pairs, and the model learns to produce similar outputs.

Fine-tuning is the most powerful customisation technique but also the most expensive and maintenance-intensive. The AI-300 exam heavily tests the decision of when to use each approach.

The decision matrix: Fine-tuning vs RAG vs Prompt Engineering

This is a critical exam topic. Know when to use each approach:

Fine-tuning vs RAG vs Prompt Engineering
| Feature | Data Freshness | Setup Cost | Per-Request Cost | Latency | Maintenance |
|---|---|---|---|---|---|
| Prompt Engineering | Real-time (data in prompt) | Minutes | Higher (long prompts = more tokens) | Moderate (prompt overhead) | Low (just update prompts) |
| RAG | Near real-time (index updates) | Days | Moderate (retrieval + generation) | Higher (retrieval step adds latency) | Medium (maintain index + docs) |
| Fine-Tuning | Stale (training data snapshot) | Weeks | Lowest (short prompts, no retrieval) | Lowest (no retrieval step) | High (retrain for updates) |

When to use each

| Approach | Use When | Example |
|---|---|---|
| Prompt Engineering | Task is well-defined, few-shot examples suffice, data fits in context window | Formatting output as JSON, adjusting tone, simple classification |
| RAG | Knowledge changes frequently, source docs are large, need citations/attribution | Customer support over product docs, legal research, internal knowledge base |
| Fine-Tuning | Need specific style/tone at scale, domain terminology is specialised, want lower latency and cost per request | Insurance-specific language, medical report generation, brand-voice chatbot |
| RAG + Fine-Tuning | Need both current knowledge AND specialised behaviour | Domain-specific chatbot that also needs access to fresh data |

💡 Exam tip: Fine-tuning is the LAST resort

The exam consistently favours this decision order:

  1. Try prompt engineering first — cheapest, fastest, easiest to iterate
  2. Try RAG second — if the model needs external knowledge or data changes frequently
  3. Fine-tune last — only when prompting and RAG aren’t sufficient

If a question asks “what should you try first to improve model output,” the answer is almost never fine-tuning. It’s expensive, time-consuming, and creates maintenance burden.

The exception: if the question specifies that prompt engineering has been tried and RAG doesn’t apply (e.g., the need is for a specific output style, not knowledge), then fine-tuning is correct.
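The decision order above can be sketched as a toy helper. This is a hypothetical function encoding the exam heuristic, not an Azure API:

```python
def choose_customisation(needs_external_knowledge: bool,
                         knowledge_changes_often: bool,
                         needs_specific_style_at_scale: bool,
                         prompting_tried_and_insufficient: bool) -> str:
    """Toy encoding of the exam heuristic: prompt -> RAG -> fine-tune."""
    if not prompting_tried_and_insufficient:
        return "prompt engineering"   # always the first thing to try
    if needs_external_knowledge or knowledge_changes_often:
        return "RAG"                  # external or frequently changing knowledge
    if needs_specific_style_at_scale:
        return "fine-tuning"          # last resort: style/terminology at scale
    return "prompt engineering"       # keep iterating on prompts

# Style at scale, prompting already tried, no external knowledge needed:
print(choose_customisation(False, False, True, True))  # → fine-tuning
```

The point of the sketch: fine-tuning is only reached after the two cheaper branches have been ruled out.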

Fine-tuning in Azure OpenAI

Preparing training data

Training data must be in JSONL format — one JSON object per line, each containing a conversation:

{"messages": [{"role": "system", "content": "You are an insurance claims assistant that uses precise industry terminology."}, {"role": "user", "content": "What does 'subrogation' mean in my claim?"}, {"role": "assistant", "content": "Subrogation is the process where your insurance company recovers costs from the at-fault party after paying your claim. For example, if another driver caused the accident, your insurer pays you first, then pursues the other driver's insurer for reimbursement. You don't need to take any action — it happens automatically."}]}
{"messages": [{"role": "system", "content": "You are an insurance claims assistant that uses precise industry terminology."}, {"role": "user", "content": "My claim was denied for 'material misrepresentation.' What does that mean?"}, {"role": "assistant", "content": "Material misrepresentation means information provided on your application or claim was inaccurate in a way that significantly affected the insurer's decision. This could include incorrect details about the incident, undisclosed pre-existing conditions, or inaccurate policy information. You have the right to appeal — I can guide you through the appeals process."}]}

What’s happening:

  • Each line is a complete conversation example with system prompt, user message, and ideal assistant response
  • The system message is consistent across examples (defines the model’s persona)
  • Assistant responses demonstrate the exact style, terminology, and behaviour you want

Data quality guidelines

| Guideline | Why | Minimum |
|---|---|---|
| Diverse examples | Covers edge cases, not just common patterns | 50+ examples (100-500 recommended) |
| Consistent format | Model learns the pattern, not noise | Same system prompt across examples |
| High-quality responses | Model mimics your examples — garbage in, garbage out | Expert-reviewed responses |
| Balanced categories | Prevents model bias toward overrepresented topics | Roughly equal representation |
| Realistic queries | Model should learn from production-like inputs | Use real user queries, anonymised |

💡 Exam tip: Quality over quantity

The exam may test whether you know that fine-tuning data quality matters more than quantity:

  • 100 expert-reviewed, diverse examples often outperform 10,000 noisy, inconsistent examples
  • The minimum for Azure OpenAI fine-tuning is 10 examples, but 50-100 is the practical minimum for useful results
  • Always split your data: ~80% training, ~20% validation
  • Consistent system prompts across examples are critical — the model learns the pattern
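A minimal sketch of the validate-and-split step. The expected structure (system, user, assistant roles) comes from the JSONL format above; the helper names are illustrative:

```python
import json
import random

def validate_example(line: str) -> dict:
    """Parse one JSONL line and check it has the expected chat structure."""
    example = json.loads(line)
    roles = [m["role"] for m in example["messages"]]
    assert roles[0] == "system" and "user" in roles and "assistant" in roles, roles
    return example

def split_dataset(lines, train_fraction=0.8, seed=42):
    """Shuffle and split examples into ~80% training / ~20% validation."""
    examples = [validate_example(l) for l in lines if l.strip()]
    random.Random(seed).shuffle(examples)  # fixed seed keeps the split reproducible
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]
```

Write each returned list back out with `json.dumps(example)` per line to produce the training and validation JSONL files.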

Synthetic data generation

When you don’t have enough real training data, use a larger model to generate it:

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com"
)

# Use GPT-4o to generate training data for fine-tuning a smaller model
def generate_training_example(topic):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Generate a realistic customer 
            support conversation about insurance claims. Include:
            - A natural customer question about the given topic
            - A professional response using correct insurance terminology
            - Empathetic tone, clear explanations, and next steps"""},
            {"role": "user", "content": f"Topic: {topic}"}
        ],
        temperature=0.8,  # Higher temperature for variety
    )
    return response.choices[0].message.content

# Generate examples for each topic
topics = [
    "claim denial reasons",
    "subrogation process",
    "deductible explanation",
    "coverage gap",
    "total loss valuation",
]

synthetic_examples = [generate_training_example(t) for t in topics]

What’s happening:

  • GPT-4o (the larger, more capable model) generates each training example from a structured system prompt
  • temperature=0.8 adds variety so the generated examples don't all read alike
  • Iterating over topics produces examples across different scenarios for diversity
  • The generated examples are then used to fine-tune a smaller, cheaper model (such as GPT-4o-mini)

The synthetic data workflow

  1. Define categories — list the topics/scenarios your model needs to handle
  2. Generate examples — use GPT-4o to create realistic conversations for each category
  3. Expert review — have domain experts validate and correct the synthetic examples
  4. Format as JSONL — convert to the required training format
  5. Fine-tune — train the smaller model on the reviewed synthetic data
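
Step 4 (format as JSONL) can be sketched as follows, assuming the expert-review step produces (question, answer) pairs. The helper names and file path are illustrative:

```python
import json

SYSTEM_PROMPT = ("You are an insurance claims assistant that uses "
                 "precise industry terminology.")

def to_jsonl_line(question: str, answer: str) -> str:
    """Format one reviewed Q/A pair as a JSONL training example."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # same persona every example
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    })

def write_training_file(pairs, path="train.jsonl"):
    """Write all reviewed pairs as one JSON object per line."""
    with open(path, "w") as f:
        for question, answer in pairs:
            f.write(to_jsonl_line(question, answer) + "\n")
```

Keeping the system prompt identical across every generated line is what lets the model learn the persona rather than noise.
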
Scenario: Zara creates synthetic data for Atlas's insurance bot

Atlas Consulting’s insurance client wants a chatbot that speaks like an experienced claims adjuster. Problem: they only have 30 real customer conversations — not enough to fine-tune.

Zara’s approach:

  1. Uses GPT-4o to generate 500 synthetic conversations across 25 insurance topics
  2. The insurance client’s claims team reviews and corrects 200 of them (quality over quantity)
  3. Combines 30 real conversations + 200 reviewed synthetic = 230 training examples
  4. Splits: 184 training, 46 validation
  5. Fine-tunes GPT-4o-mini with the combined dataset

Result: The fine-tuned GPT-4o-mini uses insurance terminology naturally, costs 15x less per request than GPT-4o, and runs 3x faster. Marcus Webb approves the cost savings.

Fine-tuning job configuration

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com"
)

# Upload the JSONL files prepared earlier (purpose must be "fine-tune")
training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)
validation_file = client.files.create(
    file=open("validation.jsonl", "rb"), purpose="fine-tune"
)

# Create a fine-tuning job
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    validation_file=validation_file.id,
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {
                "n_epochs": 3,
                "batch_size": "auto",
                "learning_rate_multiplier": "auto",
            }
        }
    },
    suffix="atlas-insurance-v1",
)

# Monitor training progress
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
# Statuses: created → pending → running → succeeded/failed

What’s happening:

  • model: the base model to fine-tune, specified with its full version string (GPT-4o-mini is common — cheap to fine-tune and deploy)
  • training_file / validation_file: IDs of the training and validation files uploaded with client.files.create(purpose="fine-tune")
  • The method parameter selects supervised fine-tuning and sets its hyperparameters
    • n_epochs: how many times to iterate through the data (2-4 is typical)
    • batch_size: set to "auto" to let Azure optimize, or specify a number (4-16 typical)
    • learning_rate_multiplier: set to "auto" for the default, or specify a float
  • suffix: appended to the fine-tuned model's name so versions stay identifiable

Key hyperparameters

| Parameter | Default | Increase When | Decrease When |
|---|---|---|---|
| n_epochs | Auto (usually 3) | Small dataset (under 100 examples) | Large dataset (1000+ examples), signs of overfitting |
| batch_size | Auto | Stable training, more data | Limited GPU memory |
| learning_rate_multiplier | 1.0 | Model underfitting (training loss still high) | Model overfitting (validation loss increases) |
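Fine-tuning jobs run asynchronously through the statuses listed earlier (created → pending → running → succeeded/failed), so in practice you poll until a terminal status. A sketch, using the SDK's jobs.retrieve call (the helper itself is hypothetical):

```python
import time

def wait_for_job(client, job_id, poll_seconds=60):
    """Poll a fine-tuning job until it reaches a terminal status."""
    terminal = {"succeeded", "failed", "cancelled"}
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"{job.id}: {job.status}")
        if job.status in terminal:
            return job
        time.sleep(poll_seconds)

# job = wait_for_job(client, job.id)
# On success, job.fine_tuned_model holds the name to deploy
```

A CI pipeline would call this after submitting the job and fail the build on "failed" or "cancelled".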

Evaluating fine-tuned model performance

Always compare your fine-tuned model against the base model:

from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator

# AI-assisted evaluators need a judge model configuration
model_config = {
    "azure_endpoint": "https://your-resource.openai.azure.com",
    "api_key": "your-key",
    "azure_deployment": "gpt-4o",
}

# Evaluate base model
base_results = evaluate(
    data="eval_dataset_base.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)

# Evaluate fine-tuned model (same test dataset, different model responses)
ft_results = evaluate(
    data="eval_dataset_finetuned.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)

print(f"Base model relevance:      {base_results['metrics']['relevance']}")
print(f"Fine-tuned relevance:      {ft_results['metrics']['relevance']}")
print(f"Base model coherence:      {base_results['metrics']['coherence']}")
print(f"Fine-tuned coherence:      {ft_results['metrics']['coherence']}")

What’s happening:

  • The first evaluate() call runs the standard evaluation suite against the base model’s responses
  • The second call runs the same evaluators against the fine-tuned model’s responses on the same test cases
  • The final print statements compare scores side by side — the fine-tuned model should score equal or higher

What to watch for

| Signal | Meaning | Action |
|---|---|---|
| Fine-tuned scores higher than base | Success — fine-tuning improved the model | Deploy to production |
| Fine-tuned scores similar to base | Fine-tuning didn’t help much | Revisit training data quality |
| Fine-tuned scores lower than base | Overfitting or bad training data | Reduce epochs, improve data diversity |
| Great on training topics, bad on others | Overfitting — model forgot general knowledge | Add diverse examples, reduce epochs |

Dev to production deployment

Fine-tuned models follow the same deployment pipeline as base models, with extra considerations:

Deployment stages

  1. Development: fine-tune and evaluate against test data
  2. Staging: deploy to a non-production endpoint, run integration tests
  3. A/B testing: serve a percentage of production traffic to the fine-tuned model alongside the base model
  4. Production: full rollout if A/B results are positive
  5. Monitoring: continuous quality and safety evaluation
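Stage 3 (A/B testing) can be sketched as deterministic, hash-based traffic splitting. The deployment names below are placeholders borrowed from the earlier scenario:

```python
import hashlib

def assign_deployment(user_id: str, finetuned_share: float = 0.10) -> str:
    """Deterministically route a share of users to the fine-tuned deployment."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # pseudo-uniform value in [0, 1]
    return "atlas-insurance-v1" if bucket < finetuned_share else "gpt-4o-mini-base"

# Hashing (rather than random choice) means the same user always hits the
# same model, so multi-turn conversations stay consistent mid-experiment:
assert assign_deployment("user-42") == assign_deployment("user-42")
```

Raise `finetuned_share` gradually as A/B quality metrics stay positive, up to full rollout.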

Monitoring fine-tuned models in production

Fine-tuned models can degrade over time as user queries drift from the training distribution:

| What to Monitor | Why | Alert When |
|---|---|---|
| Quality scores (sampled) | Detect accuracy degradation | Scores drop below deployment threshold |
| Out-of-domain queries | Fine-tuned models handle unfamiliar topics poorly | Percentage of low-confidence responses increases |
| Latency | Fine-tuned models should be faster (shorter prompts) | Latency exceeds base model performance |
| User feedback | Direct signal of model quality | Negative feedback rate increases |

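
The sampled quality-score check in the monitoring table can be sketched as a rolling-window alert. The class name and thresholds are illustrative, not an Azure service:

```python
from collections import deque

class QualityMonitor:
    """Alert when the rolling mean of sampled quality scores drops below a threshold."""

    def __init__(self, threshold=4.0, window=100):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score: float) -> bool:
        """Record one sampled score; return True if an alert should fire."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        # Only alert once the window is full, to avoid noise at startup
        return len(self.scores) == self.scores.maxlen and mean < self.threshold
```

Feeding it the sampled evaluator scores from production gives an early signal that queries have drifted away from the training distribution.
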
Scenario: Kai decides NOT to fine-tune

NeuralSpark’s customer support bot needs to handle 50 product categories. Priya (CTO) asks if fine-tuning would improve quality.

Kai evaluates:

  • Current approach: RAG over product documentation + system prompt with tone guidelines
  • Quality scores: Groundedness 4.3, Relevance 4.5, Coherence 4.6, Fluency 4.7
  • User satisfaction: 87% positive

Kai’s analysis:

| Factor | RAG (Current) | Fine-Tuning |
|---|---|---|
| Product info freshness | Real-time (docs updated weekly) | Stale (needs retraining after doc updates) |
| Cost to maintain | Low (update docs) | High (retrain monthly, $200-500 per run) |
| Quality improvement | Already 4.3+ across metrics | Maybe +0.2-0.3 improvement |
| Time to implement | Already done | 2-3 weeks of data prep + training |

Decision: Don’t fine-tune. RAG already delivers strong quality with fresh data. Fine-tuning would add maintenance burden for marginal improvement. If product docs change weekly, the fine-tuned model would constantly be stale.

Kai documents the decision and moves on. Not every problem needs fine-tuning.

Scenario: Zara fine-tunes for Atlas's insurance domain

In contrast to Kai’s decision, Zara HAS a strong case for fine-tuning:

  • Insurance terminology is specialised (general models stumble on “subrogation,” “indemnification,” “pro rata cancellation”)
  • The client requires a specific professional tone that’s hard to maintain with just prompting
  • Responses must follow a strict format: empathy statement, explanation, next steps
  • Cost matters at scale: 50,000 conversations/month, and the long system prompt adds token cost

Fine-tuned GPT-4o-mini results:

| Metric | Base GPT-4o-mini | Fine-tuned | Improvement |
|---|---|---|---|
| Relevance | 3.8 | 4.5 | +18% |
| Coherence | 4.0 | 4.7 | +18% |
| Tone compliance | 2.9 (custom) | 4.6 (custom) | +59% |
| Cost per request | $0.008 | $0.003 | -63% |
| Latency (P95) | 2.8s | 1.2s | -57% |

The cost reduction alone (from shorter prompts) pays for the fine-tuning job in 3 days.
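The payback claim follows from simple arithmetic on the scenario's numbers (50,000 conversations/month, $0.008 vs $0.003 per request). The one-off job cost below is an assumed placeholder, not a figure from the scenario:

```python
requests_per_month = 50_000
saving_per_request = 0.008 - 0.003                # base minus fine-tuned, in dollars
monthly_saving = requests_per_month * saving_per_request   # $250/month
daily_saving = monthly_saving / 30                         # ~$8.33/day

job_cost = 25.0  # ASSUMED one-off fine-tuning cost, for illustration only
payback_days = job_cost / daily_saving
print(f"Saves ${monthly_saving:.0f}/month; job pays back in {payback_days:.0f} days")
```

Plug in your actual job cost from the Azure billing page; the structure of the calculation stays the same.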

Key terms flashcards

Question

What is the recommended order for customising GenAI?

Answer

1. Prompt engineering (cheapest, fastest). 2. RAG (when you need external/changing knowledge). 3. Fine-tuning (last resort — expensive, high maintenance). Try each approach in order before escalating.

Question

What is synthetic data generation for fine-tuning?

Answer

Using a larger model (e.g., GPT-4o) to generate training examples for a smaller model (e.g., GPT-4o-mini). The synthetic examples must be expert-reviewed before use. Useful when you don't have enough real training data.

Question

What format does fine-tuning training data use?

Answer

JSONL (JSON Lines) — one JSON object per line, each containing a messages array with system, user, and assistant roles. The assistant messages demonstrate the exact output style and content you want.

Question

How do you detect overfitting in a fine-tuned model?

Answer

Training loss decreases but validation loss increases. The model performs well on training topics but poorly on new/unseen topics. Fix by: reducing epochs, adding more diverse examples, lowering learning rate.

Question

Why does fine-tuning reduce per-request cost?

Answer

Fine-tuned models internalise the knowledge and style from training data, so you need shorter system prompts (no few-shot examples or detailed instructions). Fewer input tokens = lower cost per request. At scale, this offsets the one-time fine-tuning cost.

Knowledge check

  1. Dr. Fatima's financial chatbot needs to reference current interest rates that change weekly. Her team suggests fine-tuning the model on recent financial data. What's wrong with this approach?

  2. Kai has 30 real customer conversations and needs at least 200 training examples for fine-tuning. Which approach should he use?



Congratulations! You’ve completed all 25 modules of the AI-300 study guide. Go back and review any modules where your quiz scores were low, then take a practice exam to test your readiness.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.