
AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 2: Implement Generative AI and Agentic Solutions

Evaluating AI Models & Apps

How do you know if your AI is actually good? Learn how to evaluate models for fabrications, relevance, quality, and safety — and build evaluation into your development workflow.

Why evaluation matters

☕ Simple explanation

Evaluation is like a quality inspection for your AI — you wouldn’t ship a product without testing it, and you shouldn’t deploy an AI model without evaluating it.

AI models can fabricate information (hallucinate), give irrelevant answers, produce unsafe content, or simply be bad at the task. Evaluation tells you which of these problems exist and how severe they are — before your users find out.

AI evaluation in Foundry is a systematic process of measuring model and application quality across multiple dimensions. The four key evaluation areas are:

  • Fabrications (hallucinations) — the model generates information not supported by source data
  • Relevance — the response actually addresses the user’s question
  • Quality — coherence, fluency, completeness, and usefulness
  • Safety — the response doesn’t contain harmful, biased, or inappropriate content

Evaluation dimensions

| Dimension | What It Measures | Evaluator | Score Range |
|---|---|---|---|
| Groundedness | Is the response based on provided context? | GroundednessEvaluator | 1-5 |
| Relevance | Does it answer the actual question? | RelevanceEvaluator | 1-5 |
| Coherence | Is the response logical and well-structured? | CoherenceEvaluator | 1-5 |
| Fluency | Is the language natural and readable? | FluencyEvaluator | 1-5 |
| Safety | Is the content free from harmful material? | ContentSafetyEvaluator | Pass/Fail |
| F1 Score | Does extraction output match expected fields? | F1ScoreEvaluator | 0-1 |
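
The F1 score is the one purely mechanical metric in the table: it compares a response to a ground-truth answer by token overlap rather than by an LLM judge. A minimal sketch of that computation in plain Python (illustrative only, not the Foundry evaluator itself, which also handles normalization and batching):

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 in [0, 1]: the harmonic mean of precision
    (overlap / predicted tokens) and recall (overlap / expected tokens)."""
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    # Count tokens the prediction and ground truth have in common
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("refunds within 30 days", "full refunds within 30 days")` matches 4 tokens against 4 predicted and 5 expected, scoring roughly 0.89.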

Detecting fabrications

Detecting fabrications (hallucinations) is the most exam-relevant evaluation area:

| Type | What Happens | Example | Detection Method |
|---|---|---|---|
| Factual fabrication | Model invents facts | "The policy was updated in March 2025" (it wasn't) | Groundedness evaluator against source docs |
| Citation fabrication | Model invents references | "According to Regulation 45.2.1" (doesn't exist) | Provenance checking against index |
| Confident fabrication | Model states guesses as facts | "This will definitely work because…" | Calibration evaluation: certainty vs. accuracy |
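
Citation fabrication is the easiest of the three to catch mechanically: every reference the model emits either exists in your index or it does not. A hypothetical provenance-check sketch (the regex and index format are assumptions, not a Foundry API):

```python
import re

def unverified_citations(response: str, index_refs: set) -> list:
    """Return citation strings in the response that are absent from the index.
    The 'Regulation N.N.N' pattern is purely illustrative; adapt it to
    whatever citation format your application actually emits."""
    cited = re.findall(r"Regulation \d+(?:\.\d+)*", response)
    return [ref for ref in cited if ref not in index_refs]
```

Any string this returns is a fabricated reference and can be flagged or stripped before the response reaches the user.
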
💡 Exam tip: Groundedness vs relevance

These are different evaluation dimensions:

  • Groundedness = “Is the answer based on the retrieved data?” (factual accuracy)
  • Relevance = “Does the answer address what the user asked?” (topical accuracy)

A response can be grounded (all facts from source docs) but irrelevant (answering a different question than what was asked). Both matter.
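
The independence of these two dimensions is easy to see with a toy example. Real evaluators use an LLM judge; the crude word-overlap proxy below is only there to show how an answer can score perfectly on one axis and zero on the other (all names and texts are made up):

```python
def _tokens(text: str) -> set:
    return {w.strip(".,?!").lower() for w in text.split()}

def support_ratio(answer: str, reference: str) -> float:
    """Crude proxy: fraction of answer words that also appear in the reference."""
    a, r = _tokens(answer), _tokens(reference)
    return len(a & r) / len(a) if a else 0.0

source = "Damaged goods can be returned within 30 days for a full refund."
question = "What is the shipping cost for international orders?"
answer = "Damaged goods can be returned within 30 days."

grounded = support_ratio(answer, source)    # every answer word comes from the source
relevant = support_ratio(answer, question)  # but none of it addresses the question
```

Here `grounded` is 1.0 and `relevant` is 0.0: a perfectly grounded answer to the wrong question.
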

Evaluation workflows

| When | Method | Purpose |
|---|---|---|
| Development | Manual evaluation with test datasets | Iterate on prompts and RAG configuration |
| CI/CD pipeline | Automated evaluators on every PR | Gate deployments on quality thresholds |
| Pre-launch | Red teaming with adversarial inputs | Find safety gaps before users do |
| Production | Continuous monitoring with sampling | Detect drift and emerging quality issues |
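
The CI/CD row is the one most teams automate first: run the evaluators on a fixed test set, then fail the pipeline if any average score drops below its threshold. A minimal sketch of such a gate (the evaluator names and threshold values are illustrative):

```python
def quality_gate(scores: dict, thresholds: dict) -> list:
    """Return the names of evaluators whose average score fell below the
    gate's threshold; an empty list means the deployment may proceed."""
    return [name for name, minimum in thresholds.items()
            if scores.get(name, 0.0) < minimum]

# Average scores from an evaluation run, checked against the gate
thresholds = {"groundedness": 4.0, "relevance": 4.0, "coherence": 4.0}
failures = quality_gate({"groundedness": 4.5, "relevance": 3.2, "coherence": 4.1},
                        thresholds)
```

In a pipeline, a non-empty `failures` list would fail the build step and block the deployment.
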

Building an evaluation dataset

| Component | What It Contains | Example |
|---|---|---|
| Input | User question | "What is the refund policy for damaged goods?" |
| Context | Retrieved documents (for RAG) | Refund policy document, Section 3.2 |
| Expected output | Ground truth answer | "Damaged goods can be returned within 30 days for a full refund" |
| Metadata | Difficulty, category, source | Difficulty: medium; Category: refund; Source: policy_v3.pdf |
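
Evaluation datasets are typically stored one test case per line (JSONL). The row above, written out as a record (the field names are illustrative; match whatever schema your evaluators expect):

```python
import json

# One test case covering all four components from the table
test_case = {
    "input": "What is the refund policy for damaged goods?",
    "context": "Refund policy document, Section 3.2",
    "expected_output": "Damaged goods can be returned within 30 days for a full refund",
    "metadata": {"difficulty": "medium", "category": "refund", "source": "policy_v3.pdf"},
}

# JSONL on disk: one JSON object per line, one line per test case
line = json.dumps(test_case)
```
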
ℹ️ Real-world example: MediaForge's evaluation pipeline

MediaForge evaluates their content generation model before every deployment:

  • Test dataset: 200 content briefs with expected outputs
  • Evaluators: Coherence (threshold: 4.0), Fluency (4.0), Relevance (4.0), Safety (must pass)
  • CI/CD gate: If any evaluator falls below threshold, deployment is blocked
  • Red teaming: Monthly adversarial testing — prompt injection, brand-unsafe content generation
  • Production monitoring: Sample 5% of responses daily, run all evaluators

Result: they caught a coherence regression when updating from GPT-4o to GPT-4.1 — the new model produced shorter responses that scored lower on completeness. They adjusted the prompt before deploying.
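
The 5% daily sample in the example above can be as simple as a hash-based filter, which keeps the selection deterministic: the same request is always in or out, so a run can be replayed exactly. A sketch (the rate and the choice of request ID as the hash key are assumptions):

```python
import hashlib

def in_daily_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select ~rate of requests for evaluation by hashing
    the request ID into a bucket in [0, 1)."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest()[:8], 16)
    return bucket / 0x100000000 < rate
```

Sampled responses would then be run through the same evaluators used in CI, with alerts when the daily averages drift below the gate thresholds.
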

Key terms

Question

What is a fabrication (hallucination) in AI?


Answer

When the model generates information that is not supported by the provided source data or is factually incorrect. This includes inventing facts, fabricating citations, and confidently stating guesses as certainties.


Question

What is the GroundednessEvaluator?


Answer

A Foundry evaluation tool that measures whether the model's response is factually based on the retrieved context documents. Scores 1-5, where 5 means fully grounded in provided data.


Question

What is an evaluation dataset?


Answer

A curated set of test cases containing input questions, expected outputs, and optionally retrieved context. Used to systematically measure model quality across multiple dimensions before deployment.


Question

What is red teaming for AI?


Answer

Adversarial testing where testers deliberately try to make the AI produce unsafe, incorrect, or unexpected outputs. Tests prompt injection, jailbreaks, edge cases, and bias. Run before launch and periodically in production.


Knowledge check

NeuralMed's patient chatbot scores 4.5 on fluency and 4.2 on coherence, but only 2.1 on groundedness. What does this tell you?

Atlas Financial wants to ensure no AI model deployment happens without passing quality checks. Where should they add evaluation?




© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.