How Generative AI Actually Works
Generative AI creates new content — text, images, code — but how? This module demystifies tokens, transformers, training, and inference without requiring a maths degree.
How does generative AI create new content?
Generative AI is like the world’s best autocomplete.
You know how your phone suggests the next word when you type a message? Generative AI does the same thing — but at a massive scale. It’s been trained on billions of documents, and it predicts the most likely next word, sentence, or paragraph based on what you’ve written.
It doesn’t “understand” language the way you do. It’s incredibly good at recognising patterns — so good that its outputs look like they were written by a person.
The same idea applies to images: image generation models start with random noise and progressively refine it into a clear image, guided by your text description.
The key concepts
Tokens: how AI reads text
AI models don’t read words — they read tokens. A token is a chunk of text — typically a short word or a piece of a longer word, averaging roughly 3-4 characters of English.
| Text | Tokens |
|---|---|
| “Hello” | 1 token |
| “Microsoft Foundry” | 2 tokens |
| “MediSpark’s diagnostic AI” | ~5 tokens |
| A 1-page document | ~500-700 tokens |
Why tokens matter:
- Models have a token limit (context window) — how much text they can process at once
- Pricing is based on tokens processed (input + output)
- Longer prompts = more tokens = higher cost
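The rule of thumb above can be sketched as a rough estimator. This is only the ~4-characters-per-token heuristic, not a real tokenizer — accurate counts require the model’s own tokenizer (e.g. the `tiktoken` library for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token
    rule of thumb for English. Real counts require the model's
    own tokenizer (e.g. tiktoken for OpenAI models)."""
    return max(1, round(len(text) / 4))

# A 1-page document (~2,500 characters) lands inside the
# 500-700 token range quoted above.
print(estimate_tokens("Hello"))       # -> 1
print(estimate_tokens("x" * 2500))    # -> 625
```

The heuristic drifts for code, non-English text, and unusual formatting, but it is good enough for quick cost and context-window sanity checks.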
Training vs inference
| Feature | Training | Inference |
|---|---|---|
| When | Before the model is available | When you use the model |
| What happens | Model learns patterns from massive datasets | Model generates responses based on your input |
| Who does it | OpenAI, Microsoft, model providers | You — through the Foundry portal or SDK |
| Cost | Enormous (millions of dollars, weeks of compute) | Per-token pricing (fractions of a cent) |
| Analogy | Teaching a student for years | Asking the student a question |
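Per-token inference pricing can be made concrete with a small calculator. The rates below are hypothetical placeholders, not real Foundry prices — input and output tokens are usually billed at different rates, so always check your provider’s current rate card:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Inference cost = tokens processed x per-token rate.
    Input and output are typically priced separately."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates: $0.005 per 1K input tokens, $0.015 per 1K output.
cost = inference_cost(input_tokens=600, output_tokens=200,
                      price_in_per_1k=0.005, price_out_per_1k=0.015)
print(f"${cost:.4f}")  # fractions of a cent per request
```

This is why prompt length matters in practice: every extra paragraph of input is billed on every single request.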
Large Language Models (LLMs)
The AI models you’ll use in this course are called Large Language Models (LLMs). They’re “large” because:
- Billions of parameters — the internal numbers the model adjusts during training
- Trained on massive datasets — books, websites, code, scientific papers
- General-purpose — they can write, summarise, translate, code, and reason
Examples of LLMs:
| Model Family | Provider | Used In |
|---|---|---|
| GPT-4o, GPT-4 | OpenAI | Azure OpenAI, Microsoft Foundry |
| Phi-4 | Microsoft | Microsoft Foundry (smaller, efficient) |
| Llama | Meta | Available in Foundry model catalog |
| Mistral | Mistral AI | Available in Foundry model catalog |
What's a transformer?
The Transformer is the architecture behind modern LLMs. The key innovation is self-attention — the ability to look at every word in a sentence and understand how each word relates to every other word.
Before Transformers, AI models processed text one word at a time (left to right). Transformers can process the entire sentence at once, understanding context in both directions.
Example: In “The bank of the river was steep,” a Transformer understands that “bank” means riverbank (not a financial bank) because it looks at the surrounding words simultaneously.
You don’t need to understand the maths for the exam — just know that Transformers are what makes modern LLMs possible.
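For the curious, self-attention can be sketched in a few lines of plain Python. The word “embeddings” below are made-up 2-dimensional toy vectors, not real model weights — the point is only that attention scores come from dot products between a query word and every other word:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: how strongly the query word
    'attends' to each word in the sentence."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy 2-D vectors (invented for illustration only).
words = ["bank", "river", "money"]
vecs = {"bank": [1.0, 1.0], "river": [1.0, 0.0], "money": [0.0, 1.0]}

weights = attention_weights(vecs["bank"], [vecs[w] for w in words])
for w, wt in zip(words, weights):
    print(f"{w}: {wt:.2f}")  # "bank" attends to every word at once
```

In a real Transformer this happens across many layers and attention heads simultaneously, which is how “bank” picks up the riverbank meaning from its neighbours.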
Grounding and hallucination
Grounding: keeping AI honest
Grounding means connecting the AI’s responses to real, verifiable information. Without grounding, AI models generate responses based purely on their training data — which may be outdated or wrong.
Ways to ground AI responses:
- System prompts — tell the model what data sources to use
- RAG (Retrieval-Augmented Generation) — feed relevant documents to the model alongside the user’s question
- Foundry IQ — Microsoft’s built-in knowledge integration for enterprise data
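The RAG idea above can be sketched end to end. This is a deliberately naive version — documents are ranked by word overlap with the question, whereas production systems use embedding (vector) search — but the grounded-prompt structure is the same:

```python
def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Naive retrieval: rank documents by word overlap with the
    question. Real systems use embedding-based vector search."""
    q_words = set(question.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:top_k]

def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Feed the retrieved documents to the model alongside the question."""
    context = "\n".join(retrieve(question, documents))
    return ("Answer using ONLY the context below. "
            "If the answer isn't there, say \"I don't know.\"\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

docs = [
    "Our support line is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of a return.",
    "The cafeteria menu changes every Monday.",
]
print(build_grounded_prompt("When is the support line open?", docs))
```

The final string is what actually gets sent to the model: grounded context first, then the instruction to refuse rather than guess.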
Hallucination: when AI makes things up
Hallucination is when an AI model generates confident-sounding but factually incorrect information. It happens because the model is predicting likely text, not looking up facts.
Priya scenario: Priya asks an AI model: “What year was Microsoft Foundry launched?” The model might confidently say “2023” (wrong — it was rebranded from Azure AI Studio to Foundry more recently). This is a hallucination.
Reducing hallucinations:
- Use grounding (RAG, system prompts)
- Lower the temperature setting (less creative = more predictable)
- Add instructions to say “I don’t know” when uncertain
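The temperature setting in the list above can be illustrated numerically. The logits below are made-up scores for three candidate tokens; what matters is how temperature reshapes them into probabilities:

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw model scores (logits) into probabilities.
    Low temperature sharpens the distribution (the top token
    dominates); high temperature flattens it (riskier, more
    'creative' picks become likely)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

low = apply_temperature(logits, 0.2)   # nearly deterministic
high = apply_temperature(logits, 2.0)  # much flatter
print([f"{p:.2f}" for p in low])
print([f"{p:.2f}" for p in high])
```

At low temperature the model almost always picks its top prediction, which is why lowering it reduces (but does not eliminate) hallucinations.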
Exam tip: Grounding vs hallucination
The exam loves asking about these concepts:
- Grounding = connecting AI to real data sources → reduces hallucination
- Hallucination = AI generating false information confidently
- RAG = a specific technique for grounding (feed documents to the model)
- If a question asks “how to improve accuracy of AI responses” → the answer is usually grounding or RAG
Multimodal models
Modern AI models aren’t limited to text. Multimodal models can work with multiple types of input and output:
| Modality | Input Example | Output Example |
|---|---|---|
| Text | “Describe this image” | Written description |
| Image | A photo of a product | Product classification |
| Audio | A recorded meeting | Transcription |
| Video | A surveillance clip | Activity summary |
GPT-4o (the “o” stands for “omni”) is a multimodal model — it can accept text, images, and audio as input and generate text and audio as output.
MediSpark scenario: MediSpark uses a multimodal model to analyse X-ray images. A doctor uploads the image and types “Identify any abnormalities.” The model processes both the image and the text instruction to generate a diagnostic summary.
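A multimodal request bundles both modalities into one message. The sketch below shows the OpenAI-style chat format with content parts — no API call is made, the URL is a placeholder, and exact field names can vary by SDK version, so check your SDK’s documentation:

```python
# Shape of a multimodal chat message (OpenAI-style content parts).
# No API call happens here; this only shows how a text instruction
# and an image travel together in a single user turn.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Identify any abnormalities."},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/xray.png"}},  # placeholder
    ],
}

kinds = [part["type"] for part in message["content"]]
print(kinds)  # the model receives both modalities in one turn
```

The model sees the image and the instruction together, which is what lets it answer questions *about* the image rather than just describing it.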
🎬 Video walkthrough
Video coming soon: How Generative AI Works — AI-901 Module 3 (~14 min)
Knowledge Check
Priya is using a generative AI model to summarise a research paper. The model produces a summary that includes a statistic not found in the original paper. What is this called?
DataFlow Corp wants to reduce hallucinations in their customer support AI. Which approach is most effective?
Which of the following best describes what happens during AI model training?
Next up: Choosing the Right AI Model — not all models are equal. Learn when to use GPT-4o vs Phi-4 vs image generators.