Designing AI Infrastructure

Planning your AI infrastructure

Simple explanation

Building AI infrastructure is like setting up a restaurant kitchen before opening night.

You need to decide:

Where will the kitchen be? (region)
How many ovens do you need? (compute)
Should the kitchen be open to walk-ins or reservation-only? (networking)
Can you share equipment between branches? (resource topology)

Get these decisions wrong and you’ll spend more, go slower, or fail compliance audits. Get them right and everything else flows smoothly.

Region selection

Not all Azure regions offer the same AI services. Your region choice affects:

Factor	Impact	Example
Model availability	Not all models are in all regions	GPT-4o may be available in East US but not Australia East
Data residency	Regulated industries require data to stay in specific geographies	EU healthcare data must stay in EU regions
Latency	Closer regions = faster responses	An app serving users in Asia should use an Asia-Pacific region
Capacity	Popular regions may have longer queue times	East US 2 may have shorter wait times than East US

Exam tip: Region + model availability

The exam may present a scenario where the correct answer depends on model availability in a specific region. Key rule: always check model availability before choosing a region. A region that meets data residency requirements but doesn’t offer your required model is not a valid choice.
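
The rule above can be sketched as a filter that accepts a region only when it passes both checks. This is a minimal illustration: the `Region` type, the `CATALOGUE` contents, and the `valid_regions` helper are all hypothetical, and the availability data is invented for the example. Real availability changes over time and should be checked against current Azure documentation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Region:
    name: str
    geography: str            # e.g. "EU" or "US"
    models: FrozenSet[str]    # models assumed deployable in this region


# Hypothetical catalogue for illustration only; not real availability data.
CATALOGUE = [
    Region("westeurope", "EU", frozenset({"gpt-4o-mini"})),
    Region("swedencentral", "EU", frozenset({"gpt-4o", "gpt-4o-mini"})),
    Region("eastus2", "US", frozenset({"gpt-4o"})),
]


def valid_regions(
    catalogue: Iterable[Region],
    required_model: str,
    allowed_geographies: Set[str],
) -> List[str]:
    """Return regions that satisfy BOTH data residency and model availability."""
    return [
        region.name
        for region in catalogue
        if region.geography in allowed_geographies
        and required_model in region.models
    ]


# A region inside the EU that lacks the model is filtered out.
print(valid_regions(CATALOGUE, "gpt-4o", {"EU"}))
```

In this made-up catalogue, `westeurope` meets the residency requirement but is rejected because it does not list the required model, which mirrors the exam scenario.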

Deployment options

Serverless vs provisioned deployment

Feature	Serverless (Pay-per-token)	Provisioned Throughput
How it works	Pay only for tokens consumed	Reserve fixed compute capacity, measured in Provisioned Throughput Units (PTUs)
Cost model	Variable — scales with usage	Fixed — predictable monthly cost
Best for	Development, variable workloads, prototyping	Production with predictable, high-volume traffic
Latency	May queue during peak times	Guaranteed capacity, consistent latency
Rate limits	Shared pool, may be throttled	Dedicated capacity, higher limits
Setup	Deploy model, start calling	Reserve capacity, then deploy
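
The cost trade-off in the table can be explored with a rough calculation. The sketch below is an illustration only: every price and the tokens-per-minute figure per PTU are hypothetical placeholders, not real Azure pricing, and the function names are made up for this example. It sizes a provisioned deployment from peak traffic, then compares its fixed monthly cost with pay-per-token cost at two usage levels.

```python
import math


def ptus_needed(peak_tokens_per_minute: int, tpm_per_ptu: int) -> int:
    """Round up to whole units of reserved capacity that cover peak traffic."""
    return math.ceil(peak_tokens_per_minute / tpm_per_ptu)


def serverless_monthly_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Variable cost: scales linearly with tokens consumed."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def provisioned_monthly_cost(ptus: int, price_per_ptu_month: float) -> float:
    """Fixed cost: depends only on reserved capacity, not on usage."""
    return ptus * price_per_ptu_month


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
PRICE_PER_1K_TOKENS = 0.01
PRICE_PER_PTU_MONTH = 300.0
TPM_PER_PTU = 2_500

ptus = ptus_needed(peak_tokens_per_minute=50_000, tpm_per_ptu=TPM_PER_PTU)
fixed = provisioned_monthly_cost(ptus, PRICE_PER_PTU_MONTH)

for tokens in (100_000_000, 1_000_000_000):
    variable = serverless_monthly_cost(tokens, PRICE_PER_1K_TOKENS)
    cheaper = "serverless" if variable < fixed else "provisioned"
    print(f"{tokens:,} tokens/month: serverless {variable:,.0f} vs provisioned {fixed:,.0f} -> {cheaper}")
```

Cost is only one input. As the table notes, provisioned capacity also buys consistent latency, so a workload with strict SLAs may justify it even below the break-even point.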

Other deployment patterns

Pattern	When to Use
Managed compute	Default for most scenarios — Foundry manages the infrastructure
Connected compute (self-hosted)	When you need models on your own VMs or Kubernetes
Edge deployment	Small language models (SLMs) on IoT devices or local servers (for example, Phi-4-mini running on ONNX Runtime)
Global deployment	Route requests across regions for availability and latency

Resource topology

A typical AI solution connects multiple Azure resources:

Resource	Role	Connects To
Foundry Project	Central workspace for AI development	All other resources
Azure AI Search	Retrieval and grounding index	Foundry Project (data connection)
Azure Storage	Raw document storage, training data	Search (indexer source), Foundry
Azure Key Vault	Secrets and API keys	All services via managed identity
Azure Container Apps	Host custom agent code and orchestrators	Foundry Project (via SDK)
Azure Monitor / App Insights	Observability and tracing	All services
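
One way to reason about the table above is to treat it as a small directed graph. The sketch below is a hypothetical model of that table, not an Azure API: the names `RESOURCES`, `CONNECTS_TO`, `outbound`, and `inbound` are invented for illustration, and "All other resources" and "All services" are both expanded to every other resource in the list.

```python
from typing import Dict, List, Union

RESOURCES = [
    "Foundry Project",
    "Azure AI Search",
    "Azure Storage",
    "Azure Key Vault",
    "Azure Container Apps",
    "Azure Monitor",
]

ALL = "*"  # marker for "connects to every other resource"

# Transcribed from the "Connects To" column of the table.
CONNECTS_TO: Dict[str, Union[str, List[str]]] = {
    "Foundry Project": ALL,
    "Azure AI Search": ["Foundry Project"],
    "Azure Storage": ["Azure AI Search", "Foundry Project"],
    "Azure Key Vault": ALL,
    "Azure Container Apps": ["Foundry Project"],
    "Azure Monitor": ALL,
}


def outbound(resource: str) -> List[str]:
    """Resources that `resource` connects to, with the ALL marker expanded."""
    targets = CONNECTS_TO[resource]
    if targets == ALL:
        return [r for r in RESOURCES if r != resource]
    return list(targets)


def inbound(resource: str) -> List[str]:
    """Resources that declare a connection to `resource`."""
    return [r for r in RESOURCES if r != resource and resource in outbound(r)]


# Which resources depend on the search index being reachable?
print(inbound("Azure AI Search"))
```

Listing inbound connections like this can help when planning networking, since each one is a path that must stay open if a resource moves behind a private endpoint.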

Real-world example: Kai's infrastructure design

Kai is designing the infrastructure for the logistics platform’s AI features:

Region: East US 2 (GPT-4o available, closest to main user base)
Foundry Project: One project per environment (dev, staging, prod)
Model deployment: Serverless for dev (low cost), provisioned for prod (predictable latency)
Search: Azure AI Search Standard tier (handles 10,000 shipping documents)
Storage: Azure Blob Storage for raw shipment documents
Networking: Private endpoints for prod, public for dev
Identity: Managed identity everywhere — no API keys in code
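
Kai's choices could be captured as per-environment settings with a few sanity checks. This is a sketch under stated assumptions: the `EnvironmentConfig` type, its field names, and the `validate` rules are hypothetical, not part of any Azure SDK. Staging is left out because the design above does not say how it is deployed or networked.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EnvironmentConfig:
    region: str
    deployment_type: str      # "serverless" or "provisioned"
    private_endpoints: bool
    auth: str                 # "managed_identity" or "api_key"


# Values taken from Kai's design; the structure itself is an assumption.
ENVIRONMENTS: Dict[str, EnvironmentConfig] = {
    "dev": EnvironmentConfig("eastus2", "serverless", False, "managed_identity"),
    "prod": EnvironmentConfig("eastus2", "provisioned", True, "managed_identity"),
}


def validate(env_name: str, cfg: EnvironmentConfig) -> List[str]:
    """Return a list of rule violations; an empty list means the config passes."""
    problems = []
    if cfg.auth != "managed_identity":
        problems.append("use managed identity instead of API keys")
    if env_name == "prod" and not cfg.private_endpoints:
        problems.append("prod must use private endpoints")
    if env_name == "prod" and cfg.deployment_type != "provisioned":
        problems.append("prod should use provisioned throughput")
    return problems


for name, cfg in ENVIRONMENTS.items():
    print(name, validate(name, cfg) or "ok")
```

Keeping the rules in code means a misconfigured environment, such as a prod config with public networking, is caught before deployment and not during an audit.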

Key terms

Question

What is serverless model deployment?

Answer

A pay-per-token deployment where you only pay for tokens consumed. No reserved capacity. Best for development and variable workloads. May experience throttling during peak times.

Question

What is provisioned throughput?

Answer

Reserved compute capacity measured in Provisioned Throughput Units (PTU). Each PTU delivers a model-specific amount of Tokens Per Minute (TPM). Provides consistent latency and guaranteed capacity. Best for production workloads with predictable traffic.

Question

What is a Foundry Project in the new architecture?

Answer

A standalone Azure resource that serves as the workspace for AI development. Contains model deployments, agent definitions, data connections, and evaluations. No parent hub required in the new Foundry architecture.

Question

Why does region selection matter for AI solutions?

Answer

Regions differ in model availability, data residency compliance, latency to users, and available capacity. Always verify your required models are available in your chosen region before designing infrastructure.

Knowledge check

Atlas Financial is deploying a compliance review agent that processes 100,000 loan applications per month with strict SLA requirements. The workload is predictable and steady. Which deployment option should they choose?

NeuralMed must keep all patient data within the European Union due to GDPR requirements. They need GPT-4o for their diagnostic assistant. What should they verify FIRST when choosing an Azure region?