How Generative AI Actually Works
Generative AI creates new content — text, images, code — but how? This module demystifies tokens, transformers, training, and inference without requiring a maths degree.
How does generative AI create new content?
Generative AI is like the world’s best autocomplete.
You know how your phone suggests the next word when you type a message? Generative AI does the same thing — but at a massive scale. It’s been trained on billions of documents, and it predicts the most likely next word, sentence, or paragraph based on what you’ve written.
It doesn’t “understand” language the way you do. It’s incredibly good at recognising patterns — so good that its outputs look like they were written by a person.
The same idea applies to images: image generation models start with random noise and progressively refine it into a clear image, guided by your text description.
The key concepts
Tokens: how AI reads text
AI models don’t read words — they read tokens. A token is a chunk of text — typically a short word or a piece of a longer word, averaging roughly 3-4 characters of English.
| Text | Tokens |
|---|---|
| “Hello” | 1 token |
| “Microsoft Foundry” | 2 tokens |
| “MediSpark’s diagnostic AI” | ~5 tokens |
| A 1-page document | ~500-700 tokens |
Why tokens matter:
- Models have a token limit (context window) — how much text they can process at once
- Pricing is based on tokens processed (input + output)
- Longer prompts = more tokens = higher cost
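The rule of thumb above can be sketched as a rough estimator. This is only the ~4-characters-per-token heuristic, not a real tokenizer — accurate counts require the model’s own tokenizer (e.g. the `tiktoken` library for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token
    rule of thumb for English. Real counts require the model's
    own tokenizer (e.g. tiktoken for OpenAI models)."""
    return max(1, round(len(text) / 4))

# A 1-page document (~2,500 characters) lands inside the
# 500-700 token range quoted above.
print(estimate_tokens("Hello"))       # -> 1
print(estimate_tokens("x" * 2500))    # -> 625
```

The heuristic drifts for code, non-English text, and unusual formatting, but it is good enough for quick cost and context-window sanity checks.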
Training vs inference
| Feature | Training | Inference |
|---|---|---|
| When | Before the model is available | When you use the model |
| What happens | Model learns patterns from massive datasets | Model generates responses based on your input |
| Who does it | OpenAI, Microsoft, model providers | You — through the Foundry portal or SDK |
| Cost | Enormous (millions of dollars, weeks of compute) | Per-token pricing (fractions of a cent) |
| Analogy | Teaching a student for years | Asking the student a question |
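Per-token inference pricing can be made concrete with a small calculator. The rates below are hypothetical placeholders, not real Foundry prices — input and output tokens are usually billed at different rates, so always check your provider’s current rate card:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Inference cost = tokens processed x per-token rate.
    Input and output are typically priced separately."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates: $0.005 per 1K input tokens, $0.015 per 1K output.
cost = inference_cost(input_tokens=600, output_tokens=200,
                      price_in_per_1k=0.005, price_out_per_1k=0.015)
print(f"${cost:.4f}")  # fractions of a cent per request
```

This is why prompt length matters in practice: every extra paragraph of input is billed on every single request.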
Large Language Models (LLMs)
The AI models you’ll use in this course are called Large Language Models (LLMs). They’re “large” because:
- Billions of parameters — the internal numbers the model adjusts during training
- Trained on massive datasets — books, websites, code, scientific papers
- General-purpose — they can write, summarise, translate, code, and reason
Examples of LLMs:
| Model Family | Provider | Used In |
|---|---|---|
| GPT-4o, GPT-4 | OpenAI | Azure OpenAI, Microsoft Foundry |
| Phi-4 | Microsoft | Microsoft Foundry (smaller, efficient) |
| Llama | Meta | Available in Foundry model catalog |
| Mistral | Mistral AI | Available in Foundry model catalog |
What's a transformer?
The Transformer is the architecture behind modern LLMs. The key innovation is self-attention — the ability to look at every word in a sentence and understand how each word relates to every other word.
Before Transformers, AI models processed text one word at a time (left to right). Transformers can process the entire sentence at once, understanding context in both directions.
Example: In “The bank of the river was steep,” a Transformer understands that “bank” means riverbank (not a financial bank) because it looks at the surrounding words simultaneously.
You don’t need to understand the maths for the exam — just know that Transformers are what makes modern LLMs possible.
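For the curious, self-attention can be sketched in a few lines of plain Python. The word “embeddings” below are made-up 2-dimensional toy vectors, not real model weights — the point is only that attention scores come from dot products between a query word and every other word:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: how strongly the query word
    'attends' to each word in the sentence."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy 2-D vectors (invented for illustration only).
words = ["bank", "river", "money"]
vecs = {"bank": [1.0, 1.0], "river": [1.0, 0.0], "money": [0.0, 1.0]}

weights = attention_weights(vecs["bank"], [vecs[w] for w in words])
for w, wt in zip(words, weights):
    print(f"{w}: {wt:.2f}")  # "bank" attends to every word at once
```

In a real Transformer this happens across many layers and attention heads simultaneously, which is how “bank” picks up the riverbank meaning from its neighbours.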
Grounding and hallucination
Grounding: keeping AI honest
Grounding means connecting the AI’s responses to real, verifiable information. Without grounding, AI models generate responses based purely on their training data — which may be outdated or wrong.
Ways to ground AI responses:
- System prompts — tell the model what data sources to use
- RAG (Retrieval-Augmented Generation) — feed relevant documents to the model alongside the user’s question
- Foundry IQ — Microsoft’s built-in knowledge integration for enterprise data
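The RAG idea above can be sketched end to end. This is a deliberately naive version — documents are ranked by word overlap with the question, whereas production systems use embedding (vector) search — but the grounded-prompt structure is the same:

```python
def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Naive retrieval: rank documents by word overlap with the
    question. Real systems use embedding-based vector search."""
    q_words = set(question.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:top_k]

def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Feed the retrieved documents to the model alongside the question."""
    context = "\n".join(retrieve(question, documents))
    return ("Answer using ONLY the context below. "
            "If the answer isn't there, say \"I don't know.\"\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

docs = [
    "Our support line is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of a return.",
    "The cafeteria menu changes every Monday.",
]
print(build_grounded_prompt("When is the support line open?", docs))
```

The final string is what actually gets sent to the model: grounded context first, then the instruction to refuse rather than guess.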
Hallucination: when AI makes things up
Hallucination is when an AI model generates confident-sounding but factually incorrect information. It happens because the model is predicting likely text, not looking up facts.
Priya scenario: Priya asks an AI model: “What year was Microsoft Foundry launched?” The model might confidently say “2023” (wrong — it was rebranded from Azure AI Studio to Foundry more recently). This is a hallucination.
Reducing hallucinations:
- Use grounding (RAG, system prompts)
- Lower the temperature setting (less creative = more predictable)
- Add instructions to say “I don’t know” when uncertain
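The temperature setting in the list above can be illustrated numerically. The logits below are made-up scores for three candidate tokens; what matters is how temperature reshapes them into probabilities:

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw model scores (logits) into probabilities.
    Low temperature sharpens the distribution (the top token
    dominates); high temperature flattens it (riskier, more
    'creative' picks become likely)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

low = apply_temperature(logits, 0.2)   # nearly deterministic
high = apply_temperature(logits, 2.0)  # much flatter
print([f"{p:.2f}" for p in low])
print([f"{p:.2f}" for p in high])
```

At low temperature the model almost always picks its top prediction, which is why lowering it reduces (but does not eliminate) hallucinations.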
Exam tip: Grounding vs hallucination
The exam loves asking about these concepts:
- Grounding = connecting AI to real data sources → reduces hallucination
- Hallucination = AI generating false information confidently
- RAG = a specific technique for grounding (feed documents to the model)
- If a question asks “how to improve accuracy of AI responses” → the answer is usually grounding or RAG
Multimodal models
Modern AI models aren’t limited to text. Multimodal models can work with multiple types of input and output:
| Modality | Input Example | Output Example |
|---|---|---|
| Text | “Describe this image” | Written description |
| Image | A photo of a product | Product classification |
| Audio | A recorded meeting | Transcription |
| Video | A surveillance clip | Activity summary |
GPT-4o (the “o” stands for “omni”) is a multimodal model — it can accept text, images, and audio as input and generate text and audio as output.
MediSpark scenario: MediSpark uses a multimodal model to analyse X-ray images. A doctor uploads the image and types “Identify any abnormalities.” The model processes both the image and the text instruction to generate a diagnostic summary.
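A multimodal request bundles both modalities into one message. The sketch below shows the OpenAI-style chat format with content parts — no API call is made, the URL is a placeholder, and exact field names can vary by SDK version, so check your SDK’s documentation:

```python
# Shape of a multimodal chat message (OpenAI-style content parts).
# No API call happens here; this only shows how a text instruction
# and an image travel together in a single user turn.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Identify any abnormalities."},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/xray.png"}},  # placeholder
    ],
}

kinds = [part["type"] for part in message["content"]]
print(kinds)  # the model receives both modalities in one turn
```

The model sees the image and the instruction together, which is what lets it answer questions *about* the image rather than just describing it.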
🎬 Video walkthrough
Video coming soon: How Generative AI Works — AI-901 Module 3 (~14 min)
Knowledge Check
Priya is using a generative AI model to summarise a research paper. The model produces a summary that includes a statistic not found in the original paper. What is this called?
DataFlow Corp wants to reduce hallucinations in their customer support AI. Which approach is most effective?
Which of the following best describes what happens during AI model training?
Next up: Choosing the Right AI Model — not all models are equal. Learn when to use GPT-4o vs Phi-4 vs image generators.