Fine-Tuning: Methods, Data & Production
When prompting isn't enough and RAG doesn't fit, fine-tune. Learn advanced fine-tuning methods, synthetic data generation, performance monitoring, and the critical decision: fine-tuning vs RAG vs prompt engineering.
When to fine-tune (and when not to)
Fine-tuning is like teaching a chef your restaurant’s specific recipes.
You hire a brilliant chef (the base model). They know how to cook everything. But your restaurant has a unique menu with secret recipes, specific plating styles, and house rules about ingredients.
You have three options:
- Prompt engineering = giving the chef detailed instructions each time: “Make the pasta, but use our house sauce recipe, plate it on the blue dish, and garnish with basil from the left fridge.” Works, but you repeat yourself every order.
- RAG = giving the chef a recipe book to look up each time: “Check page 47 for the house sauce.” Works, but they have to read the book every order — slower and depends on finding the right page.
- Fine-tuning = training the chef so they memorise your recipes: “After 2 weeks of practice, they just KNOW the house sauce.” No instructions needed, no book lookup — but training takes time and money, and if you change the recipe, you have to retrain.
The decision matrix: Fine-tuning vs RAG vs Prompt Engineering
This is a critical exam topic. Know when to use each approach:
| Approach | Data Freshness | Setup Cost | Per-Request Cost | Latency | Maintenance |
|---|---|---|---|---|---|
| Prompt Engineering | Real-time (data in prompt) | Minutes | Higher (long prompts = more tokens) | Moderate (prompt overhead) | Low (just update prompts) |
| RAG | Near real-time (index updates) | Days | Moderate (retrieval + generation) | Higher (retrieval step adds latency) | Medium (maintain index + docs) |
| Fine-Tuning | Stale (training data snapshot) | Weeks | Lowest (short prompts, no retrieval) | Lowest (no retrieval step) | High (retrain for updates) |
When to use each
| Approach | Use When | Example |
|---|---|---|
| Prompt Engineering | Task is well-defined, few-shot examples suffice, data fits in context window | Formatting output as JSON, adjusting tone, simple classification |
| RAG | Knowledge changes frequently, source docs are large, need citations/attribution | Customer support over product docs, legal research, internal knowledge base |
| Fine-Tuning | Need specific style/tone at scale, domain terminology is specialised, want lower latency and cost per request | Insurance-specific language, medical report generation, brand-voice chatbot |
| RAG + Fine-Tuning | Need both current knowledge AND specialised behaviour | Domain-specific chatbot that also needs access to fresh data |
Exam tip: Fine-tuning is the LAST resort
The exam consistently favours this decision order:
- Try prompt engineering first — cheapest, fastest, easiest to iterate
- Try RAG second — if the model needs external knowledge or data changes frequently
- Fine-tune last — only when prompting and RAG aren’t sufficient
If a question asks “what should you try first to improve model output,” the answer is almost never fine-tuning. It’s expensive, time-consuming, and creates maintenance burden.
The exception: if the question specifies that prompt engineering has been tried and RAG doesn’t apply (e.g., the need is for a specific output style, not knowledge), then fine-tuning is correct.
Fine-tuning in Azure OpenAI
Preparing training data
Training data must be in JSONL format — one JSON object per line, each containing a conversation:
```json
{"messages": [{"role": "system", "content": "You are an insurance claims assistant that uses precise industry terminology."}, {"role": "user", "content": "What does 'subrogation' mean in my claim?"}, {"role": "assistant", "content": "Subrogation is the process where your insurance company recovers costs from the at-fault party after paying your claim. For example, if another driver caused the accident, your insurer pays you first, then pursues the other driver's insurer for reimbursement. You don't need to take any action — it happens automatically."}]}
{"messages": [{"role": "system", "content": "You are an insurance claims assistant that uses precise industry terminology."}, {"role": "user", "content": "My claim was denied for 'material misrepresentation.' What does that mean?"}, {"role": "assistant", "content": "Material misrepresentation means information provided on your application or claim was inaccurate in a way that significantly affected the insurer's decision. This could include incorrect details about the incident, undisclosed pre-existing conditions, or inaccurate policy information. You have the right to appeal — I can guide you through the appeals process."}]}
```
What’s happening:
- Each line is a complete conversation example with system prompt, user message, and ideal assistant response
- The system message is consistent across examples (defines the model’s persona)
- Assistant responses demonstrate the exact style, terminology, and behaviour you want
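A malformed line anywhere in the file can fail the whole training job, so it's worth sanity-checking the JSONL before uploading. The sketch below is a minimal, hypothetical validator (not part of the Azure SDK) that checks each line parses as JSON and follows the pattern above: a system prompt first and at least one assistant reply.

```python
import json

def validate_jsonl(path):
    """Return a list of problems found in a fine-tuning JSONL file.

    Illustrative checks only: each non-blank line must be valid JSON,
    contain a non-empty "messages" list, start with a system message,
    and include an assistant response to learn from.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {i}: invalid JSON ({exc})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            roles = [m.get("role") for m in messages]
            if roles[0] != "system" or "assistant" not in roles:
                errors.append(f"line {i}: expected system prompt and assistant reply")
    return errors
```

An empty result means the file is structurally sound; it says nothing about response quality, which still needs expert review.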
Data quality guidelines
| Guideline | Why | Minimum |
|---|---|---|
| Diverse examples | Covers edge cases, not just common patterns | 50+ examples (100-500 recommended) |
| Consistent format | Model learns the pattern, not noise | Same system prompt across examples |
| High-quality responses | Model mimics your examples — garbage in, garbage out | Expert-reviewed responses |
| Balanced categories | Prevents model bias toward overrepresented topics | Roughly equal representation |
| Realistic queries | Model should learn from production-like inputs | Use real user queries, anonymised |
Exam tip: Quality over quantity
The exam may test whether you know that fine-tuning data quality matters more than quantity:
- 100 expert-reviewed, diverse examples often outperform 10,000 noisy, inconsistent examples
- The minimum for Azure OpenAI fine-tuning is 10 examples, but 50-100 is the practical minimum for useful results
- Always split your data: ~80% training, ~20% validation
- Consistent system prompts across examples are critical — the model learns the pattern
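The ~80/20 split from the tip above can be done with a few lines of standard-library Python. This is a generic sketch (not an Azure API); the fixed seed makes the split reproducible across runs.

```python
import random

def split_dataset(examples, train_fraction=0.8, seed=42):
    """Shuffle examples reproducibly, then split into (training, validation).

    Shuffling first prevents the validation set from being biased toward
    whatever order the examples were collected in.
    """
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 230 examples this yields 184 training and 46 validation examples, matching the split used in the Zara scenario later in this module.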
Synthetic data generation
When you don’t have enough real training data, use a larger model to generate it:
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Use GPT-4o to generate training data for fine-tuning a smaller model
def generate_training_example(topic):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Generate a realistic customer
support conversation about insurance claims. Include:
- A natural customer question about the given topic
- A professional response using correct insurance terminology
- Empathetic tone, clear explanations, and next steps"""},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
        temperature=0.8,  # Higher temperature for variety
    )
    return response.choices[0].message.content

# Generate examples for each topic
topics = [
    "claim denial reasons",
    "subrogation process",
    "deductible explanation",
    "coverage gap",
    "total loss valuation",
]
synthetic_examples = [generate_training_example(t) for t in topics]
```
What’s happening:
- GPT-4o (the larger, more capable model) generates the training examples
- The higher temperature (0.8) creates more variety in the generated examples
- Looping over a list of topics spreads the examples across different scenarios for diversity
- The generated examples will be used to fine-tune a smaller, cheaper model (like GPT-4o-mini)
The synthetic data workflow
- Define categories — list the topics/scenarios your model needs to handle
- Generate examples — use GPT-4o to create realistic conversations for each category
- Expert review — have domain experts validate and correct the synthetic examples
- Format as JSONL — convert to the required training format
- Fine-tune — train the smaller model on the reviewed synthetic data
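Step 4 of the workflow is mechanical once the reviewed examples exist. A minimal sketch, assuming the reviewed examples are available as (user, assistant) text pairs — the pair format and helper name are illustrative, not an SDK API:

```python
import json

SYSTEM_PROMPT = ("You are an insurance claims assistant that uses "
                 "precise industry terminology.")

def write_jsonl(pairs, path):
    """Write expert-reviewed (user_text, assistant_text) pairs as
    fine-tuning JSONL, one conversation per line, with the same
    system prompt on every example (consistency matters)."""
    with open(path, "w", encoding="utf-8") as f:
        for user_text, assistant_text in pairs:
            record = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Then upload for fine-tuning (assumption: `client` is the AzureOpenAI
# client from the earlier snippet):
# training_file = client.files.create(
#     file=open("train.jsonl", "rb"), purpose="fine-tune")
```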
Scenario: Zara creates synthetic data for Atlas's insurance bot
Atlas Consulting’s insurance client wants a chatbot that speaks like an experienced claims adjuster. Problem: they only have 30 real customer conversations — not enough to fine-tune.
Zara’s approach:
- Uses GPT-4o to generate 500 synthetic conversations across 25 insurance topics
- The insurance client’s claims team reviews and corrects 200 of them (quality over quantity)
- Combines 30 real conversations + 200 reviewed synthetic = 230 training examples
- Splits: 184 training, 46 validation
- Fine-tunes GPT-4o-mini with the combined dataset
Result: The fine-tuned GPT-4o-mini uses insurance terminology naturally, costs 15x less per request than GPT-4o, and runs 3x faster. Marcus Webb approves the cost savings.
Fine-tuning job configuration
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Create a fine-tuning job (training_file and validation_file were
# previously uploaded with client.files.create(purpose="fine-tune"))
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    validation_file=validation_file.id,
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {
                "n_epochs": 3,
                "batch_size": "auto",
                "learning_rate_multiplier": "auto",
            }
        }
    },
    suffix="atlas-insurance-v1",
)

# Monitor training progress
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
# Statuses: created → pending → running → succeeded/failed
```
What’s happening:
- model is the base model to fine-tune — use the full model version string (GPT-4o-mini is common: cheap to fine-tune and deploy)
- training_file and validation_file reference previously uploaded files (use client.files.create() to upload)
- The method parameter specifies supervised fine-tuning with hyperparameters:
  - n_epochs: how many times to iterate through the data (2-4 is typical)
  - batch_size: set to "auto" to let Azure optimise, or specify a number (4-16 typical)
  - learning_rate_multiplier: set to "auto" for the default, or specify a float
- suffix is appended to your fine-tuned model's name
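Training runs for minutes to hours, so in practice you poll the job until it reaches a terminal status. A minimal sketch — the retrieve callable is injected so the loop is testable; in production you would pass client.fine_tuning.jobs.retrieve from the snippet above:

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, interval=30, max_polls=240):
    """Poll a fine-tuning job until it reaches a terminal status.

    retrieve: callable taking a job ID and returning a job object with
    a .status attribute (e.g. client.fine_tuning.jobs.retrieve).
    """
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

# Usage sketch (assumption: `client` and `job` from the snippet above):
# job = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)
# if job.status == "succeeded":
#     print(job.fine_tuned_model)  # the model name you deploy
```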
Key hyperparameters
| Parameter | Default | Increase When | Decrease When |
|---|---|---|---|
| n_epochs | Auto (usually 3) | Small dataset (under 100 examples) | Large dataset (1000+ examples), signs of overfitting |
| batch_size | Auto | Stable training, more data | Limited GPU memory |
| learning_rate_multiplier | 1.0 | Model underfitting (training loss still high) | Model overfitting (validation loss increases) |
Evaluating fine-tuned model performance
Always compare your fine-tuned model against the base model:
```python
from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator

# Evaluate base model
base_results = evaluate(
    data="eval_dataset_base.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)

# Evaluate fine-tuned model (same test dataset, different model responses)
ft_results = evaluate(
    data="eval_dataset_finetuned.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)

print(f"Base model relevance: {base_results['metrics']['relevance']}")
print(f"Fine-tuned relevance: {ft_results['metrics']['relevance']}")
print(f"Base model coherence: {base_results['metrics']['coherence']}")
print(f"Fine-tuned coherence: {ft_results['metrics']['coherence']}")
```
What’s happening:
- The standard evaluation suite runs first against the base model's responses
- The same evaluators then score the fine-tuned model's responses on the same test cases
- Comparing the scores side by side shows whether fine-tuning helped — the fine-tuned model should score equal or higher
What to watch for
| Signal | Meaning | Action |
|---|---|---|
| Fine-tuned scores higher than base | Success — fine-tuning improved the model | Deploy to production |
| Fine-tuned scores similar to base | Fine-tuning didn’t help much | Revisit training data quality |
| Fine-tuned scores lower than base | Overfitting or bad training data | Reduce epochs, improve data diversity |
| Great on training topics, bad on others | Overfitting — model forgot general knowledge | Add diverse examples, reduce epochs |
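The table above can be turned into a simple per-metric decision rule. This is an illustrative sketch, not part of any SDK, and the 0.1 gain threshold is an assumption — set it to whatever difference is meaningful on your evaluation scale:

```python
def compare_metrics(base, fine_tuned, min_gain=0.1):
    """Map each metric's base-vs-fine-tuned delta to a follow-up action.

    base / fine_tuned: dicts of metric name -> score on the same test set.
    min_gain: smallest score improvement treated as a real gain (assumption).
    """
    actions = {}
    for metric, base_score in base.items():
        delta = fine_tuned[metric] - base_score
        if delta >= min_gain:
            actions[metric] = "deploy to production"
        elif delta > -min_gain:
            actions[metric] = "revisit training data quality"
        else:
            actions[metric] = "reduce epochs, improve data diversity"
    return actions
```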
Dev to production deployment
Fine-tuned models follow the same deployment pipeline as base models, with extra considerations:
Deployment stages
- Development: fine-tune and evaluate against test data
- Staging: deploy to a non-production endpoint, run integration tests
- A/B testing: serve a percentage of production traffic to the fine-tuned model alongside the base model
- Production: full rollout if A/B results are positive
- Monitoring: continuous quality and safety evaluation
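For the A/B testing stage, a deterministic hash of the user ID keeps each user on the same model arm across requests, which keeps the comparison clean. A minimal sketch — the deployment names are placeholders, and real systems usually layer this behind a gateway or feature-flag service:

```python
import hashlib

def route_model(user_id, ft_share=0.10,
                base_deployment="gpt-4o-mini",
                ft_deployment="gpt-4o-mini-atlas-insurance-v1"):
    """Send a fixed share of users to the fine-tuned deployment.

    Hashing the user ID (rather than random choice per request) gives
    sticky assignment: the same user always hits the same arm.
    """
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return ft_deployment if bucket < int(ft_share * 100) else base_deployment
```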
Monitoring fine-tuned models in production
Fine-tuned models can degrade over time as user queries drift from the training distribution:
| What to Monitor | Why | Alert When |
|---|---|---|
| Quality scores (sampled) | Detect accuracy degradation | Scores drop below deployment threshold |
| Out-of-domain queries | Fine-tuned models handle unfamiliar topics poorly | Percentage of low-confidence responses increases |
| Latency | Fine-tuned models should be faster (shorter prompts) | Latency exceeds base model performance |
| User feedback | Direct signal of model quality | Negative feedback rate increases |
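The monitoring table maps naturally onto a periodic check over a sampled window of production traffic. The field names and thresholds below are illustrative assumptions, not an Azure monitoring API:

```python
def check_alerts(window,
                 quality_threshold=4.0,
                 low_confidence_threshold=0.15,
                 negative_feedback_threshold=0.05):
    """Evaluate one sampled monitoring window against the alert
    conditions in the table above. All keys/thresholds are assumptions."""
    alerts = []
    if window["avg_quality_score"] < quality_threshold:
        alerts.append("quality scores below deployment threshold")
    if window["low_confidence_rate"] > low_confidence_threshold:
        alerts.append("rising out-of-domain / low-confidence responses")
    if window["p95_latency_s"] > window["base_p95_latency_s"]:
        alerts.append("latency exceeds base model performance")
    if window["negative_feedback_rate"] > negative_feedback_threshold:
        alerts.append("negative feedback rate increasing")
    return alerts
```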
Scenario: Kai decides NOT to fine-tune
NeuralSpark’s customer support bot needs to handle 50 product categories. Priya (CTO) asks if fine-tuning would improve quality.
Kai evaluates:
- Current approach: RAG over product documentation + system prompt with tone guidelines
- Quality scores: Groundedness 4.3, Relevance 4.5, Coherence 4.6, Fluency 4.7
- User satisfaction: 87% positive
Kai’s analysis:
| Factor | RAG (Current) | Fine-Tuning |
|---|---|---|
| Product info freshness | Real-time (docs updated weekly) | Stale (need retraining after doc updates) |
| Cost to maintain | Low (update docs) | High (retrain monthly, $200-500 per run) |
| Quality improvement | Already 4.3+ across metrics | Maybe +0.2-0.3 improvement |
| Time to implement | Already done | 2-3 weeks of data prep + training |
Decision: Don’t fine-tune. RAG already delivers strong quality with fresh data. Fine-tuning would add maintenance burden for marginal improvement. If product docs change weekly, the fine-tuned model would constantly be stale.
Kai documents the decision and moves on. Not every problem needs fine-tuning.
Scenario: Zara fine-tunes for Atlas's insurance domain
In contrast to Kai’s decision, Zara HAS a strong case for fine-tuning:
- Insurance terminology is specialised (general models stumble on “subrogation,” “indemnification,” “pro rata cancellation”)
- The client requires a specific professional tone that’s hard to maintain with just prompting
- Responses must follow a strict format: empathy statement, explanation, next steps
- Cost matters at scale: 50,000 conversations/month, and the long system prompt adds token cost
Fine-tuned GPT-4o-mini results:
| Metric | Base GPT-4o-mini | Fine-tuned | Improvement |
|---|---|---|---|
| Relevance | 3.8 | 4.5 | +18% |
| Coherence | 4.0 | 4.7 | +18% |
| Tone compliance | 2.9 (custom) | 4.6 (custom) | +59% |
| Cost per request | $0.008 | $0.003 | -63% |
| Latency (P95) | 2.8s | 1.2s | -57% |
The cost reduction alone (from shorter prompts) pays for the fine-tuning job in 3 days.
Knowledge check
Dr. Fatima's financial chatbot needs to reference current interest rates that change weekly. Her team suggests fine-tuning the model on recent financial data. What's wrong with this approach?
Kai has 30 real customer conversations and needs at least 200 training examples for fine-tuning. Which approach should he use?
Congratulations! You’ve completed all 25 modules of the AI-300 study guide. Go back and review any modules where your quiz scores were low, then take a practice exam to test your readiness.