Fine-Tuning: Methods, Data & Production
When prompting isn't enough and RAG doesn't fit, fine-tune. Learn advanced fine-tuning methods, synthetic data generation, performance monitoring, and the critical decision: fine-tuning vs RAG vs prompt engineering.
When to fine-tune (and when not to)
Fine-tuning is like teaching a chef your restaurant’s specific recipes.
You hire a brilliant chef (the base model). They know how to cook everything. But your restaurant has a unique menu with secret recipes, specific plating styles, and house rules about ingredients.
You have three options:
- Prompt engineering = giving the chef detailed instructions each time: “Make the pasta, but use our house sauce recipe, plate it on the blue dish, and garnish with basil from the left fridge.” Works, but you repeat yourself every order.
- RAG = giving the chef a recipe book to look up each time: “Check page 47 for the house sauce.” Works, but they have to read the book every order — slower and depends on finding the right page.
- Fine-tuning = training the chef so they memorise your recipes: “After 2 weeks of practice, they just KNOW the house sauce.” No instructions needed, no book lookup — but training takes time and money, and if you change the recipe, you have to retrain.
The decision matrix: Fine-tuning vs RAG vs Prompt Engineering
This is a critical exam topic. Know when to use each approach:
| Approach | Data Freshness | Setup Cost | Per-Request Cost | Latency | Maintenance |
|---|---|---|---|---|---|
| Prompt Engineering | Real-time (data in prompt) | Minutes | Higher (long prompts = more tokens) | Moderate (prompt overhead) | Low (just update prompts) |
| RAG | Near real-time (index updates) | Days | Moderate (retrieval + generation) | Higher (retrieval step adds latency) | Medium (maintain index + docs) |
| Fine-Tuning | Stale (training data snapshot) | Weeks | Lowest (short prompts, no retrieval) | Lowest (no retrieval step) | High (retrain for updates) |
When to use each
| Approach | Use When | Example |
|---|---|---|
| Prompt Engineering | Task is well-defined, few-shot examples suffice, data fits in context window | Formatting output as JSON, adjusting tone, simple classification |
| RAG | Knowledge changes frequently, source docs are large, need citations/attribution | Customer support over product docs, legal research, internal knowledge base |
| Fine-Tuning | Need specific style/tone at scale, domain terminology is specialised, want lower latency and cost per request | Insurance-specific language, medical report generation, brand-voice chatbot |
| RAG + Fine-Tuning | Need both current knowledge AND specialised behaviour | Domain-specific chatbot that also needs access to fresh data |
Exam tip: Fine-tuning is the LAST resort
The exam consistently favours this decision order:
- Try prompt engineering first — cheapest, fastest, easiest to iterate
- Try RAG second — if the model needs external knowledge or data changes frequently
- Fine-tune last — only when prompting and RAG aren’t sufficient
If a question asks “what should you try first to improve model output,” the answer is almost never fine-tuning. It’s expensive, time-consuming, and creates maintenance burden.
The exception: if the question specifies that prompt engineering has been tried and RAG doesn’t apply (e.g., the need is for a specific output style, not knowledge), then fine-tuning is correct.
Fine-tuning in Azure OpenAI
Preparing training data
Training data must be in JSONL format — one JSON object per line, each containing a conversation:
```json
{"messages": [{"role": "system", "content": "You are an insurance claims assistant that uses precise industry terminology."}, {"role": "user", "content": "What does 'subrogation' mean in my claim?"}, {"role": "assistant", "content": "Subrogation is the process where your insurance company recovers costs from the at-fault party after paying your claim. For example, if another driver caused the accident, your insurer pays you first, then pursues the other driver's insurer for reimbursement. You don't need to take any action — it happens automatically."}]}
{"messages": [{"role": "system", "content": "You are an insurance claims assistant that uses precise industry terminology."}, {"role": "user", "content": "My claim was denied for 'material misrepresentation.' What does that mean?"}, {"role": "assistant", "content": "Material misrepresentation means information provided on your application or claim was inaccurate in a way that significantly affected the insurer's decision. This could include incorrect details about the incident, undisclosed pre-existing conditions, or inaccurate policy information. You have the right to appeal — I can guide you through the appeals process."}]}
```
What’s happening:
- Each line is a complete conversation example with system prompt, user message, and ideal assistant response
- The system message is consistent across examples (defines the model’s persona)
- Assistant responses demonstrate the exact style, terminology, and behaviour you want
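A malformed line anywhere in the file can fail the whole training job, so it's worth sanity-checking the JSONL before uploading. The sketch below is a minimal, hypothetical validator (not part of the Azure SDK) that checks each line parses as JSON and follows the pattern above: a system prompt first and at least one assistant reply.

```python
import json

def validate_jsonl(path):
    """Return a list of problems found in a fine-tuning JSONL file.

    Illustrative checks only: each non-blank line must be valid JSON,
    contain a non-empty "messages" list, start with a system message,
    and include an assistant response to learn from.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {i}: invalid JSON ({exc})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            roles = [m.get("role") for m in messages]
            if roles[0] != "system" or "assistant" not in roles:
                errors.append(f"line {i}: expected system prompt and assistant reply")
    return errors
```

An empty result means the file is structurally sound; it says nothing about response quality, which still needs expert review.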
Data quality guidelines
| Guideline | Why | Minimum |
|---|---|---|
| Diverse examples | Covers edge cases, not just common patterns | 50+ examples (100-500 recommended) |
| Consistent format | Model learns the pattern, not noise | Same system prompt across examples |
| High-quality responses | Model mimics your examples — garbage in, garbage out | Expert-reviewed responses |
| Balanced categories | Prevents model bias toward overrepresented topics | Roughly equal representation |
| Realistic queries | Model should learn from production-like inputs | Use real user queries, anonymised |
Exam tip: Quality over quantity
The exam may test whether you know that fine-tuning data quality matters more than quantity:
- 100 expert-reviewed, diverse examples often outperform 10,000 noisy, inconsistent examples
- The minimum for Azure OpenAI fine-tuning is 10 examples, but 50-100 is the practical minimum for useful results
- Always split your data: ~80% training, ~20% validation
- Consistent system prompts across examples are critical — the model learns the pattern
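The ~80/20 split from the tip above can be done with a few lines of standard-library Python. This is a generic sketch (not an Azure API); the fixed seed makes the split reproducible across runs.

```python
import random

def split_dataset(examples, train_fraction=0.8, seed=42):
    """Shuffle examples reproducibly, then split into (training, validation).

    Shuffling first prevents the validation set from being biased toward
    whatever order the examples were collected in.
    """
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 230 examples this yields 184 training and 46 validation examples, matching the split used in the Zara scenario later in this module.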
Synthetic data generation
When you don’t have enough real training data, use a larger model to generate it:
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Use GPT-4o to generate training data for fine-tuning a smaller model
def generate_training_example(topic):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Generate a realistic customer
support conversation about insurance claims. Include:
- A natural customer question about the given topic
- A professional response using correct insurance terminology
- Empathetic tone, clear explanations, and next steps"""},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
        temperature=0.8,  # Higher temperature for variety
    )
    return response.choices[0].message.content

# Generate examples for each topic
topics = [
    "claim denial reasons",
    "subrogation process",
    "deductible explanation",
    "coverage gap",
    "total loss valuation",
]
synthetic_examples = [generate_training_example(t) for t in topics]
```
What’s happening:
- GPT-4o (the larger, more capable model) generates the training examples
- The higher temperature (0.8) creates more variety in the generated examples
- Looping over a list of topics spreads the examples across different scenarios for diversity
- The generated examples will be used to fine-tune a smaller, cheaper model (like GPT-4o-mini)
The synthetic data workflow
- Define categories — list the topics/scenarios your model needs to handle
- Generate examples — use GPT-4o to create realistic conversations for each category
- Expert review — have domain experts validate and correct the synthetic examples
- Format as JSONL — convert to the required training format
- Fine-tune — train the smaller model on the reviewed synthetic data
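Step 4 of the workflow is mechanical once the reviewed examples exist. A minimal sketch, assuming the reviewed examples are available as (user, assistant) text pairs — the pair format and helper name are illustrative, not an SDK API:

```python
import json

SYSTEM_PROMPT = ("You are an insurance claims assistant that uses "
                 "precise industry terminology.")

def write_jsonl(pairs, path):
    """Write expert-reviewed (user_text, assistant_text) pairs as
    fine-tuning JSONL, one conversation per line, with the same
    system prompt on every example (consistency matters)."""
    with open(path, "w", encoding="utf-8") as f:
        for user_text, assistant_text in pairs:
            record = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Then upload for fine-tuning (assumption: `client` is the AzureOpenAI
# client from the earlier snippet):
# training_file = client.files.create(
#     file=open("train.jsonl", "rb"), purpose="fine-tune")
```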
Scenario: Zara creates synthetic data for Atlas's insurance bot
Atlas Consulting’s insurance client wants a chatbot that speaks like an experienced claims adjuster. Problem: they only have 30 real customer conversations — not enough to fine-tune.
Zara’s approach:
- Uses GPT-4o to generate 500 synthetic conversations across 25 insurance topics
- The insurance client’s claims team reviews and corrects 200 of them (quality over quantity)
- Combines 30 real conversations + 200 reviewed synthetic = 230 training examples
- Splits: 184 training, 46 validation
- Fine-tunes GPT-4o-mini with the combined dataset
Result: The fine-tuned GPT-4o-mini uses insurance terminology naturally, costs 15x less per request than GPT-4o, and runs 3x faster. Marcus Webb approves the cost savings.
Fine-tuning job configuration
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Create a fine-tuning job (training_file and validation_file were
# previously uploaded with client.files.create(purpose="fine-tune"))
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    validation_file=validation_file.id,
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {
                "n_epochs": 3,
                "batch_size": "auto",
                "learning_rate_multiplier": "auto",
            }
        }
    },
    suffix="atlas-insurance-v1",
)

# Monitor training progress
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
# Statuses: created → pending → running → succeeded/failed
```
What’s happening:
- model is the base model to fine-tune — use the full model version string (GPT-4o-mini is common: cheap to fine-tune and deploy)
- training_file and validation_file reference previously uploaded files (use client.files.create() to upload)
- The method parameter specifies supervised fine-tuning with hyperparameters:
  - n_epochs: how many times to iterate through the data (2-4 is typical)
  - batch_size: set to "auto" to let Azure optimise, or specify a number (4-16 typical)
  - learning_rate_multiplier: set to "auto" for the default, or specify a float
- suffix is appended to your fine-tuned model's name
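Training runs for minutes to hours, so in practice you poll the job until it reaches a terminal status. A minimal sketch — the retrieve callable is injected so the loop is testable; in production you would pass client.fine_tuning.jobs.retrieve from the snippet above:

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, interval=30, max_polls=240):
    """Poll a fine-tuning job until it reaches a terminal status.

    retrieve: callable taking a job ID and returning a job object with
    a .status attribute (e.g. client.fine_tuning.jobs.retrieve).
    """
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

# Usage sketch (assumption: `client` and `job` from the snippet above):
# job = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)
# if job.status == "succeeded":
#     print(job.fine_tuned_model)  # the model name you deploy
```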
Key hyperparameters
| Parameter | Default | Increase When | Decrease When |
|---|---|---|---|
| n_epochs | Auto (usually 3) | Small dataset (under 100 examples) | Large dataset (1000+ examples), signs of overfitting |
| batch_size | Auto | Stable training, more data | Limited GPU memory |
| learning_rate_multiplier | 1.0 | Model underfitting (training loss still high) | Model overfitting (validation loss increases) |
Evaluating fine-tuned model performance
Always compare your fine-tuned model against the base model:
```python
from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator

# Evaluate base model
base_results = evaluate(
    data="eval_dataset_base.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)

# Evaluate fine-tuned model (same test dataset, different model responses)
ft_results = evaluate(
    data="eval_dataset_finetuned.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)

print(f"Base model relevance: {base_results['metrics']['relevance']}")
print(f"Fine-tuned relevance: {ft_results['metrics']['relevance']}")
print(f"Base model coherence: {base_results['metrics']['coherence']}")
print(f"Fine-tuned coherence: {ft_results['metrics']['coherence']}")
```
What’s happening:
- The standard evaluation suite runs first against the base model's responses
- The same evaluators then score the fine-tuned model's responses on the same test cases
- Comparing the scores side by side shows whether fine-tuning helped — the fine-tuned model should score equal or higher
What to watch for
| Signal | Meaning | Action |
|---|---|---|
| Fine-tuned scores higher than base | Success — fine-tuning improved the model | Deploy to production |
| Fine-tuned scores similar to base | Fine-tuning didn’t help much | Revisit training data quality |
| Fine-tuned scores lower than base | Overfitting or bad training data | Reduce epochs, improve data diversity |
| Great on training topics, bad on others | Overfitting — model forgot general knowledge | Add diverse examples, reduce epochs |
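The table above can be turned into a simple per-metric decision rule. This is an illustrative sketch, not part of any SDK, and the 0.1 gain threshold is an assumption — set it to whatever difference is meaningful on your evaluation scale:

```python
def compare_metrics(base, fine_tuned, min_gain=0.1):
    """Map each metric's base-vs-fine-tuned delta to a follow-up action.

    base / fine_tuned: dicts of metric name -> score on the same test set.
    min_gain: smallest score improvement treated as a real gain (assumption).
    """
    actions = {}
    for metric, base_score in base.items():
        delta = fine_tuned[metric] - base_score
        if delta >= min_gain:
            actions[metric] = "deploy to production"
        elif delta > -min_gain:
            actions[metric] = "revisit training data quality"
        else:
            actions[metric] = "reduce epochs, improve data diversity"
    return actions
```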
Dev to production deployment
Fine-tuned models follow the same deployment pipeline as base models, with extra considerations:
Deployment stages
- Development: fine-tune and evaluate against test data
- Staging: deploy to a non-production endpoint, run integration tests
- A/B testing: serve a percentage of production traffic to the fine-tuned model alongside the base model
- Production: full rollout if A/B results are positive
- Monitoring: continuous quality and safety evaluation
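For the A/B testing stage, a deterministic hash of the user ID keeps each user on the same model arm across requests, which keeps the comparison clean. A minimal sketch — the deployment names are placeholders, and real systems usually layer this behind a gateway or feature-flag service:

```python
import hashlib

def route_model(user_id, ft_share=0.10,
                base_deployment="gpt-4o-mini",
                ft_deployment="gpt-4o-mini-atlas-insurance-v1"):
    """Send a fixed share of users to the fine-tuned deployment.

    Hashing the user ID (rather than random choice per request) gives
    sticky assignment: the same user always hits the same arm.
    """
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return ft_deployment if bucket < int(ft_share * 100) else base_deployment
```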
Monitoring fine-tuned models in production
Fine-tuned models can degrade over time as user queries drift from the training distribution:
| What to Monitor | Why | Alert When |
|---|---|---|
| Quality scores (sampled) | Detect accuracy degradation | Scores drop below deployment threshold |
| Out-of-domain queries | Fine-tuned models handle unfamiliar topics poorly | Percentage of low-confidence responses increases |
| Latency | Fine-tuned models should be faster (shorter prompts) | Latency exceeds base model performance |
| User feedback | Direct signal of model quality | Negative feedback rate increases |
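The monitoring table maps naturally onto a periodic check over a sampled window of production traffic. The field names and thresholds below are illustrative assumptions, not an Azure monitoring API:

```python
def check_alerts(window,
                 quality_threshold=4.0,
                 low_confidence_threshold=0.15,
                 negative_feedback_threshold=0.05):
    """Evaluate one sampled monitoring window against the alert
    conditions in the table above. All keys/thresholds are assumptions."""
    alerts = []
    if window["avg_quality_score"] < quality_threshold:
        alerts.append("quality scores below deployment threshold")
    if window["low_confidence_rate"] > low_confidence_threshold:
        alerts.append("rising out-of-domain / low-confidence responses")
    if window["p95_latency_s"] > window["base_p95_latency_s"]:
        alerts.append("latency exceeds base model performance")
    if window["negative_feedback_rate"] > negative_feedback_threshold:
        alerts.append("negative feedback rate increasing")
    return alerts
```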
Scenario: Kai decides NOT to fine-tune
NeuralSpark’s customer support bot needs to handle 50 product categories. Priya (CTO) asks if fine-tuning would improve quality.
Kai evaluates:
- Current approach: RAG over product documentation + system prompt with tone guidelines
- Quality scores: Groundedness 4.3, Relevance 4.5, Coherence 4.6, Fluency 4.7
- User satisfaction: 87% positive
Kai’s analysis:
| Factor | RAG (Current) | Fine-Tuning |
|---|---|---|
| Product info freshness | Real-time (docs updated weekly) | Stale (need retraining after doc updates) |
| Cost to maintain | Low (update docs) | High (retrain monthly, $200-500 per run) |
| Quality improvement | Already 4.3+ across metrics | Maybe +0.2-0.3 improvement |
| Time to implement | Already done | 2-3 weeks of data prep + training |
Decision: Don’t fine-tune. RAG already delivers strong quality with fresh data. Fine-tuning would add maintenance burden for marginal improvement. If product docs change weekly, the fine-tuned model would constantly be stale.
Kai documents the decision and moves on. Not every problem needs fine-tuning.
Scenario: Zara fine-tunes for Atlas's insurance domain
In contrast to Kai’s decision, Zara HAS a strong case for fine-tuning:
- Insurance terminology is specialised (general models stumble on “subrogation,” “indemnification,” “pro rata cancellation”)
- The client requires a specific professional tone that’s hard to maintain with just prompting
- Responses must follow a strict format: empathy statement, explanation, next steps
- Cost matters at scale: 50,000 conversations/month, and the long system prompt adds token cost
Fine-tuned GPT-4o-mini results:
| Metric | Base GPT-4o-mini | Fine-tuned | Improvement |
|---|---|---|---|
| Relevance | 3.8 | 4.5 | +18% |
| Coherence | 4.0 | 4.7 | +18% |
| Tone compliance | 2.9 (custom) | 4.6 (custom) | +59% |
| Cost per request | $0.008 | $0.003 | -63% |
| Latency (P95) | 2.8s | 1.2s | -57% |
The cost reduction alone (from shorter prompts) pays for the fine-tuning job in 3 days.
Knowledge check
Dr. Fatima's financial chatbot needs to reference current interest rates that change weekly. Her team suggests fine-tuning the model on recent financial data. What's wrong with this approach?
Kai has 30 real customer conversations and needs at least 200 training examples for fine-tuning. Which approach should he use?
Congratulations! You’ve completed all 25 modules of the AI-300 study guide. Go back and review any modules where your quiz scores were low, then take a practice exam to test your readiness.