
DP-800 Study Guide

Domain 1: Design and Develop Database Solutions

  • Table Design: Columns, Types, and Indexes Free
  • Constraints: Protecting Your Data Free
  • Specialised Tables and Graph Queries Free
  • JSON in SQL: Store, Query, and Index
  • Programmability: Views, Functions, Procedures, and Triggers
  • Advanced T-SQL: CTEs, Windows, and Correlated Queries
  • Pattern Power: Regular Expressions in T-SQL
  • Fuzzy String Matching: Finding Similar Text
  • AI-Assisted SQL with GitHub Copilot Free
  • MCP: Connecting AI to Your Database

Domain 2: Secure, Optimize, and Deploy Database Solutions

  • Encryption: Always Encrypted and Column-Level
  • Dynamic Data Masking and Row-Level Security
  • Permissions, Auditing, and Passwordless Access
  • Transaction Isolation and Concurrency
  • Query Performance: Plans, DMVs, and Query Store
  • SQL Database Projects: Build and Validate
  • CI/CD Pipelines for SQL Databases
  • Data API Builder: REST and GraphQL from SQL
  • Securing AI and API Endpoints
  • Change Detection: CES, CDC, and Change Tracking
  • Azure Integration and Monitoring

Domain 3: Implement AI Capabilities in Database Solutions

  • External AI Models: Choose, Create, Manage
  • Embeddings: Design, Chunk, and Generate
  • Choose Your Search Strategy
  • Vector Data: Types, Indexes, and Storage
  • Vector Search: Distance, ANN, and ENN
  • Hybrid Search and Reciprocal Rank Fusion
  • RAG with SQL: Prompt, Process, Respond

Domain 3: Implement AI Capabilities in Database Solutions ⏱ ~14 min read

Embeddings: Design, Chunk, and Generate

Turn your data into vectors that AI can search and compare — choose which columns to embed, design chunking strategies, select maintenance methods, and generate embeddings at scale.

Turning text into searchable numbers

☕ Simple explanation

Imagine every product in a store as a point on a map.

Similar products are close together: all the red shoes are in one area, all the running shoes in another, and red running shoes are where those areas overlap. That map is what embeddings create — except instead of physical location, each product is represented as a list of numbers (a vector) where similar items have similar numbers.

To create this map, you need to: (1) decide what information matters (product name? description? reviews?), (2) break long text into manageable pieces (chunks), and (3) send each piece to an AI model that converts it into a vector. This module covers all three steps.

Embeddings are dense vector representations of text (or other data) generated by AI models. They capture semantic meaning — similar content produces similar vectors. In SQL databases, embeddings enable vector search (finding semantically similar records) and are the foundation of RAG (Retrieval-Augmented Generation).
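The "similar content produces similar vectors" property is usually measured with cosine similarity. A minimal sketch, using toy 3-dimensional vectors as stand-ins for real embeddings (which typically have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
red_shoes     = [0.9, 0.8, 0.1]
running_shoes = [0.8, 0.9, 0.2]
laptop        = [0.1, 0.2, 0.9]

print(cosine_similarity(red_shoes, running_shoes))  # high: related products
print(cosine_similarity(red_shoes, laptop))         # low: unrelated
```

This is the same distance measure most vector search implementations use under the hood, just applied at much higher dimensionality.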

The DP-800 exam covers the full embedding pipeline: choosing which columns to embed, designing chunk strategies for long text, generating embeddings using external models, and maintaining embeddings as source data changes.

Which columns to embed

Not every column should be embedded. Choose columns that contain semantic meaning users would search for:

Column Type             | Embed? | Why
Product description     | Yes    | Rich semantic content users search with natural language
Customer review text    | Yes    | Contains opinions and experiences users want to find
Technical documentation | Yes    | Long-form content with searchable concepts
Product name            | Maybe  | Short text — useful if combined with description
Price, quantity         | No     | Numeric — use standard SQL comparison operators
Status codes, IDs       | No     | Categorical — use exact match, not semantic search
Timestamps              | No     | Temporal — use date range queries

Combine related columns for richer embeddings:

-- Concatenate meaningful columns before embedding.
-- CONCAT_WS skips NULLs, so a missing Category won't NULL out the whole string
-- the way the + operator would.
SELECT
    ProductId,
    CONCAT_WS(' | ', Name, Category, Description) AS EmbeddingInput
FROM Products;

Chunking: breaking long text into pieces

Embedding models have token limits (typically 512-8,192 tokens). Long documents must be split into chunks.

Chunking strategies

Strategy        | How It Works                                       | Pros                             | Cons
Fixed-size      | Split every N characters/tokens                    | Simple, predictable              | May split mid-sentence
Sentence-based  | Split on sentence boundaries                       | Preserves meaning                | Variable chunk sizes
Paragraph-based | Split on paragraph breaks                          | Natural boundaries               | Paragraphs vary in length
Semantic        | Split where topics change                          | Best meaning preservation        | Most complex to implement
Overlapping     | Each chunk includes some text from adjacent chunks | Preserves cross-boundary context | More chunks to store and search
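The fixed-size and overlapping strategies can be sketched in a few lines. This illustration splits on characters for simplicity; production chunkers count tokens, not characters:

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into fixed-size chunks. Each chunk repeats the last
    `overlap` characters of the previous one, preserving boundary context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij"
print(chunk_text(doc, chunk_size=4))             # ['abcd', 'efgh', 'ij']
print(chunk_text(doc, chunk_size=4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how, with overlap, the characters "cd", "ef", and "gh" each appear in two chunks: a concept straddling a chunk boundary still lands intact in at least one chunk.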

SQL Server 2025: AI_GENERATE_CHUNKS

-- Built-in chunking function
SELECT chunk_id, chunk_text
FROM AI_GENERATE_CHUNKS(
    @text,
    'text',       -- chunk type
    512           -- max chunk size (tokens)
);
💡 Exam tip: Chunk size affects search quality

Smaller chunks (256-512 tokens) give more precise search results but lose broader context. Larger chunks (1024-2048 tokens) preserve context but may dilute the relevance signal. The sweet spot depends on your content:

  • FAQ entries, short descriptions: no chunking needed (already small)
  • Product documentation: 512-token chunks with overlap
  • Legal documents: paragraph-based chunking to preserve clause boundaries
  • Conversations: message-level chunking (each message = one chunk)

The exam may present a scenario where search results are too broad — the fix is usually smaller chunks.

Generating embeddings

Using AI_GENERATE_EMBEDDINGS (SQL Server 2025)

-- Generate embeddings using a registered model
-- (the function takes the text first, then USE MODEL names the model)
SELECT
    ProductId,
    AI_GENERATE_EMBEDDINGS(Description USE MODEL EmbeddingModel) AS DescriptionVector
FROM Products;

-- Store embeddings in a vector column
ALTER TABLE Products ADD DescriptionEmbedding VECTOR(1536);

UPDATE Products
SET DescriptionEmbedding = AI_GENERATE_EMBEDDINGS(Description USE MODEL EmbeddingModel);
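At catalog scale, a single UPDATE that calls the model for every row is fragile: one timeout or rate limit fails the whole statement. A common alternative is to drive the backfill from application code in batches. A hedged sketch, with a stub `embed()` standing in for a real model call (the function names and row shapes here are illustrative, not from the module):

```python
def embed(texts):
    """Stub for a real embedding API call (e.g. an Azure OpenAI deployment).
    Returns one fake fixed-length vector per input text."""
    return [[float(len(t)), 0.0, 0.0] for t in texts]

def batched(items, size):
    """Yield successive slices of `items` no longer than `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

rows = [(1, "Red running shoes"), (2, "Blue hiking boots"), (3, "Laptop sleeve")]

updates = []
for batch in batched(rows, size=2):
    ids = [r[0] for r in batch]
    vectors = embed([r[1] for r in batch])  # one round trip per batch
    updates.extend(zip(ids, vectors))       # then UPDATE ... WHERE ProductId = ?

print(len(updates))  # 3 rows embedded across 2 batches
```

Batching keeps each API call small, makes retries cheap, and lets you checkpoint progress between batches.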

Maintaining embeddings as data changes

When source data changes, embeddings become stale. Choose a maintenance method based on your data change patterns:

Embedding maintenance methods — choose based on volume, latency, and architecture

Method                       | How It Detects Changes                            | Latency                  | Best For
Table triggers               | AFTER INSERT/UPDATE trigger regenerates embedding | Immediate (synchronous)  | Low-volume tables where latency matters
Change Tracking              | Lightweight change flags on rows                  | Near-real-time (polling) | Medium-volume, polling-based sync
Azure Functions SQL trigger  | Serverless function triggered by row changes      | Seconds to minutes       | Serverless architecture, event-driven
CDC (Change Data Capture)    | Full change history in change tables              | Near-real-time           | When you need the full change history, not just current state
CES (Change Event Streaming) | Streams changes to Event Hubs                     | Real-time streaming      | High-volume, event-driven architectures
Azure Logic Apps             | Workflow triggered by schedule or event           | Minutes to hours         | Low-code workflow, scheduled batch updates
Microsoft Foundry            | AI pipeline orchestration                         | Configurable             | Complex AI pipelines with multiple models
💡 Scenario: Leo's embedding pipeline at SearchWave

SearchWave's product catalog has 500,000 products. About 1,000 products change daily (new listings, updated descriptions). Leo designs the embedding maintenance strategy:

  • Initial load: batch process all 500K products overnight using a SQL Agent job
  • Ongoing updates: Azure Functions with SQL trigger binding — when a product's description changes, the function automatically regenerates its embedding
  • Why not triggers? Calling an external AI model synchronously inside a trigger would block the INSERT/UPDATE transaction. Azure Functions processes the embedding asynchronously.
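Whichever delivery mechanism carries the changes, the sync job needs a cheap way to decide which embeddings are actually stale. One common pattern (an illustration, not part of the module; the stored-hash column is hypothetical) is to keep a hash of the exact text that was embedded and regenerate only when the hash changes:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of the text that was sent to the embedding model."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Simulated stored state: ProductId -> hash of the text last embedded
embedded_hashes = {1: content_hash("Red running shoes"),
                   2: content_hash("Blue hiking boots")}

# Current source rows: product 2's description changed, product 3 is new
current = {1: "Red running shoes",
           2: "Blue waterproof hiking boots",
           3: "Laptop sleeve"}

stale = [pid for pid, text in current.items()
         if embedded_hashes.get(pid) != content_hash(text)]
print(stale)  # [2, 3] -- only these need new embeddings
```

This avoids re-embedding rows whose non-semantic columns (price, stock level) changed but whose embedded text did not, which keeps model-call costs proportional to real content churn.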
Question

Why should you not embed numeric columns like Price or Quantity?

Answer

Embeddings capture semantic meaning in text. Numeric values like prices do not have semantic meaning — 29.99 and 30.00 are semantically identical but would produce different vectors. Use standard SQL comparison operators (=, >, <, BETWEEN) for numeric filtering.

Question

What is the purpose of overlapping chunks?

Answer

Overlapping chunks include some text from adjacent chunks at the boundaries. This preserves context that would be lost if a concept spans two chunks. For example, with 50-token overlap, the last 50 tokens of chunk 1 are also the first 50 tokens of chunk 2.

Question

Why should you avoid using triggers for embedding maintenance on high-volume tables?

Answer

Triggers run synchronously inside the transaction. Calling an external AI model from a trigger blocks the INSERT/UPDATE until the model responds — adding seconds of latency to every write. For high-volume tables, use async methods (Azure Functions, CES, CDC) instead.

Knowledge Check

Priya at Vault Bank has a customer support database with 10 million tickets. Each ticket has a subject line and a 2,000-word description. She wants to enable semantic search. How should she prepare the data for embeddings?

Next up: Choose Your Search Strategy — compare full-text, vector, and hybrid search to pick the right approach for your data.


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.