
AI-901 Study Guide

Domain 1: AI Concepts and Capabilities

  • What is AI? Your First 10 Minutes Free
  • Responsible AI: The Six Principles Free
  • How Generative AI Actually Works Free
  • Choosing the Right AI Model Free
  • Deploying AI Models: Options & Settings
  • AI Workloads at a Glance
  • Text Analysis: Keywords, Entities & Sentiment
  • Speech: Recognition & Synthesis
  • Computer Vision: Seeing the World
  • Image Generation: Creating with AI
  • Information Extraction: From Chaos to Structure

Domain 2: Implement AI Solutions Using Foundry

  • Prompting Fundamentals: System & User Prompts
  • Microsoft Foundry: Your AI Command Center Free
  • Building a Chat App with the Foundry SDK
  • Agents in Foundry: Create & Test
  • Building an Agent Client App
  • Building a Text Analysis App
  • Multimodal: Responding to Speech
  • Azure Speech in Foundry Tools
  • Visual Prompts: Images as Input
  • Generating Images with AI
  • Building a Vision App
  • Content Understanding: Documents & Forms
  • Multimodal Extraction: Images, Audio & Video
  • Building an Extraction App
  • Exam Prep: Putting It All Together

Domain 1: AI Concepts and Capabilities · Free · ~14 min read

How Generative AI Actually Works

Generative AI creates new content — text, images, code — but how? This module demystifies tokens, transformers, training, and inference without requiring a maths degree.

How does generative AI create new content?

☕ Simple explanation

Generative AI is like the world’s best autocomplete.

You know how your phone suggests the next word when you type a message? Generative AI does the same thing — but at a massive scale. It’s been trained on billions of documents, and it predicts the most likely next word, sentence, or paragraph based on what you’ve written.

It doesn’t “understand” language the way you do. It’s incredibly good at recognising patterns — so good that its outputs look like they were written by a person.

The same idea applies to images: image generation models start with random noise and progressively refine it into a clear image, guided by your text description.

Generative AI models are neural networks trained on massive datasets to generate new content. The dominant architecture for text generation is the Transformer, which uses a mechanism called self-attention to understand relationships between words in a sequence.

During training, the model processes billions of text examples and learns statistical patterns — word associations, grammar, factual relationships, and reasoning patterns. During inference (when you use the model), it generates output by predicting one token at a time, each prediction based on the full context of everything before it.
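The "one token at a time" loop described above can be sketched with a toy model. Real LLMs use billions of learned parameters; here a simple bigram frequency table stands in for those learned patterns, and the tiny corpus is invented purely for illustration.

```python
from collections import Counter, defaultdict

# "Training": count which word follows which in a tiny corpus.
# Real training adjusts billions of parameters; this frequency
# table is a stand-in for those learned statistical patterns.
corpus = "the cat sat on the mat the cat ate the fish".split()
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# "Inference": repeatedly predict the most likely next word —
# the greedy one-token-at-a-time loop described above.
def generate(start, max_tokens=5):
    out = [start]
    for _ in range(max_tokens):
        candidates = follows.get(out[-1])
        if not candidates:  # nothing ever followed this word
            break
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(generate("mat"))  # "mat the cat sat on the"
```

Note that the model never "knows" what a cat is; it only reproduces the most frequent continuation it saw, which is exactly why scale (billions of examples instead of eleven words) matters so much.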

For image generation, models like GPT-image use diffusion processes — starting from random noise and gradually refining it into a coherent image guided by the text prompt.

The key concepts

Tokens: how AI reads text

AI models don’t read words — they read tokens. A token is a chunk of text, roughly 3-4 characters.

Text                            Tokens
"Hello"                         1 token
"Microsoft Foundry"             2 tokens
"MediSpark's diagnostic AI"     ~5 tokens
A 1-page document               ~500-700 tokens

Why tokens matter:

  • Models have a token limit (context window) — how much text they can process at once
  • Pricing is based on tokens processed (input + output)
  • Longer prompts = more tokens = higher cost
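The bullets above can be turned into a quick back-of-the-envelope calculator. This is a rough sketch: exact counts come from the model's own tokenizer, and the per-1K-token price below is a made-up placeholder, not any provider's actual rate.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb.
    A real tokenizer will give different (exact) counts."""
    return max(1, round(len(text) / 4))

def estimate_cost(input_text: str, output_tokens: int,
                  price_per_1k: float = 0.002) -> float:
    """Pricing is based on input + output tokens.
    The rate here is a placeholder for illustration only."""
    total = estimate_tokens(input_text) + output_tokens
    return total * price_per_1k / 1000

print(estimate_tokens("Microsoft Foundry"))  # ~4 with this heuristic
print(estimate_cost("Hello", output_tokens=100))
```

This makes the third bullet concrete: doubling the prompt length roughly doubles the input-token count, and therefore the cost of that part of the request.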

Training vs inference

Training vs inference — two phases of an AI model's life

Training (before the model is available)
  • What happens: the model learns patterns from massive datasets
  • Who does it: OpenAI, Microsoft, and other model providers
  • Cost: enormous — millions of dollars, weeks of compute
  • Analogy: teaching a student for years

Inference (when you use the model)
  • What happens: the model generates responses based on your input
  • Who does it: you — through the Foundry portal or SDK
  • Cost: per-token pricing — fractions of a cent
  • Analogy: asking the student a question

Large Language Models (LLMs)

The AI models you’ll use in this course are called Large Language Models (LLMs). They’re “large” because:

  • Billions of parameters — the internal numbers the model adjusts during training
  • Trained on massive datasets — books, websites, code, scientific papers
  • General-purpose — they can write, summarise, translate, code, and reason

Examples of LLMs:

Model Family     Provider     Used In
GPT-4o, GPT-4    OpenAI       Azure OpenAI, Microsoft Foundry
Phi-4            Microsoft    Microsoft Foundry (smaller, efficient)
Llama            Meta         Foundry model catalog
Mistral          Mistral AI   Foundry model catalog
ℹ️ What's a transformer?

The Transformer is the architecture behind modern LLMs. The key innovation is self-attention — the ability to look at every word in a sentence and understand how each word relates to every other word.

Before Transformers, AI models processed text one word at a time (left to right). Transformers can process the entire sentence at once, understanding context in both directions.

Example: In “The bank of the river was steep,” a Transformer understands that “bank” means riverbank (not a financial bank) because it looks at the surrounding words simultaneously.

You don’t need to understand the maths for the exam — just know that Transformers are what makes modern LLMs possible.
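For the curious, the self-attention idea sketched above can be shown in a few lines without any maths beyond a dot product. This is a toy illustration only: the "word" vectors are tiny and hand-picked, whereas real Transformers learn high-dimensional queries, keys, and values during training.

```python
import math

def softmax(xs):
    # Turn raw similarity scores into weights that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product self-attention: each word's output is a
    weighted mix of ALL words' value vectors (context in both
    directions), with weights from query-key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three "words"; word 0's query matches word 2's key most strongly,
# so word 0's output is pulled toward word 2's value vector.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.1, 0.0], [0.0, 0.1], [1.0, 0.9]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(q, k, v)[0])
```

The "bank of the river" example works the same way: the vector for "bank" ends up weighted toward "river", which is how the model resolves the ambiguity.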

Grounding and hallucination

Grounding: keeping AI honest

Grounding means connecting the AI’s responses to real, verifiable information. Without grounding, AI models generate responses based purely on their training data — which may be outdated or wrong.

Ways to ground AI responses:

  • System prompts — tell the model what data sources to use
  • RAG (Retrieval-Augmented Generation) — feed relevant documents to the model alongside the user’s question
  • Foundry IQ — Microsoft’s built-in knowledge integration for enterprise data
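The RAG bullet above can be sketched end to end. Production systems use vector embeddings and a search index (which is what services like Foundry IQ provide); this minimal sketch substitutes naive keyword overlap for retrieval, and the documents and prompt template are invented for illustration.

```python
# Invented knowledge base standing in for real enterprise documents.
documents = [
    "Refunds are processed within 5 business days of approval.",
    "Support hours are 9am to 5pm, Monday through Friday.",
    "Premium plans include priority phone support.",
]

def retrieve(question: str, docs: list[str]) -> str:
    # Naive retrieval: pick the document sharing the most words
    # with the question. Real RAG uses embedding similarity.
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_grounded_prompt(question: str) -> str:
    # Grounding: the retrieved document is fed to the model
    # alongside the user's question.
    context = retrieve(question, documents)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say 'I don't know'.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )

print(build_grounded_prompt("When are support hours?"))
```

The grounded prompt is then sent to the model as usual; the model answers from the supplied context instead of relying on (possibly outdated) training data.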

Hallucination: when AI makes things up

Hallucination is when an AI model generates confident-sounding but factually incorrect information. It happens because the model is predicting likely text, not looking up facts.

Priya scenario: Priya asks an AI model: “What year was Microsoft Foundry launched?” The model might confidently say “2023” (wrong — it was rebranded from Azure AI Studio to Foundry more recently). This is a hallucination.

Reducing hallucinations:

  • Use grounding (RAG, system prompts)
  • Lower the temperature setting (less creative = more predictable)
  • Add instructions to say “I don’t know” when uncertain
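The temperature bullet is worth seeing numerically. Temperature divides the model's raw scores (logits) before they are converted to probabilities, so low values concentrate almost all probability on the top token. The three candidate-token scores below are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Temperature scales the raw scores before converting them
    # to probabilities; values below 1.0 sharpen the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

creative = softmax_with_temperature(logits, temperature=1.5)
predictable = softmax_with_temperature(logits, temperature=0.2)

# At low temperature nearly all probability lands on the top token,
# which is why low-temperature output is more predictable.
print("T=1.5:", [round(p, 3) for p in creative])
print("T=0.2:", [round(p, 3) for p in predictable])
```

At T=0.2 the top token gets over 99% of the probability mass, while at T=1.5 the alternatives stay live — that residual randomness is what reads as "creativity".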
💡 Exam tip: Grounding vs hallucination

The exam loves asking about these concepts:

  • Grounding = connecting AI to real data sources → reduces hallucination
  • Hallucination = AI generating false information confidently
  • RAG = a specific technique for grounding (feed documents to the model)
  • If a question asks “how to improve accuracy of AI responses” → the answer is usually grounding or RAG

Multimodal models

Modern AI models aren’t limited to text. Multimodal models can work with multiple types of input and output:

Modality   Input Example             Output Example
Text       "Describe this image"     Written description
Image      A photo of a product      Product classification
Audio      A recorded meeting        Transcription
Video      A surveillance clip       Activity summary

GPT-4o (the “o” stands for “omni”) is a multimodal model — it can accept text, images, and audio as input and generate text and audio as output.

MediSpark scenario: MediSpark uses a multimodal model to analyse X-ray images. A doctor uploads the image and types “Identify any abnormalities.” The model processes both the image and the text instruction to generate a diagnostic summary.
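A request like the doctor's — an image plus a text instruction — is commonly sent using the OpenAI-style "content parts" message shape, where one message mixes several modalities. This is a sketch of that widely used structure; exact field names can vary by SDK version, and the URL below is a placeholder, so check your provider's documentation.

```python
def build_multimodal_message(instruction: str, image_url: str) -> dict:
    # One user message carrying two content parts: a text part
    # and an image part. The model processes both together.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = build_multimodal_message(
    "Identify any abnormalities.",    # text modality
    "https://example.com/xray.png",   # image modality (placeholder URL)
)
print(message["content"][0]["text"])
```

A text-only request would carry a plain string as `content`; the list-of-parts form is what distinguishes a multimodal call.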

🎬 Video walkthrough

🎬 Video coming soon: How Generative AI Works — AI-901 Module 3 (~14 min)

Flashcards

Question

What is a token in AI?

Answer

A chunk of text (roughly 3-4 characters) that AI models use to process language. Models have token limits for how much text they can handle, and pricing is based on tokens processed.

Question

What is the difference between training and inference?

Answer

Training is when the model learns patterns from massive datasets (done by the model provider). Inference is when you use the trained model to generate responses (done by you). Training costs millions; inference costs fractions of a cent per token.

Question

What is hallucination in AI?

Answer

When an AI model generates confident-sounding but factually incorrect information. It happens because the model predicts likely text rather than looking up facts. Grounding and RAG help reduce hallucinations.

Question

What is grounding in AI?

Answer

Connecting AI responses to real, verifiable information — such as documents, databases, or web content. Techniques include RAG (Retrieval-Augmented Generation), system prompts, and Foundry IQ.

Question

What is a multimodal model?

Answer

An AI model that can process multiple types of input (text, images, audio, video) and generate multiple types of output. Example: GPT-4o accepts text and images as input.

Knowledge Check

1. Priya is using a generative AI model to summarise a research paper. The model produces a summary that includes a statistic not found in the original paper. What is this called?

2. DataFlow Corp wants to reduce hallucinations in their customer support AI. Which approach is most effective?

3. Which of the following best describes what happens during AI model training?

Next up: Choosing the Right AI Model — not all models are equal. Learn when to use GPT-4o vs Phi-4 vs image generators.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.