AI-103 Study Guide

Domain 1: Plan and Manage an Azure AI Solution

  • Choosing the Right AI Model Free
  • Foundry Services: Your AI Toolkit Free
  • Retrieval, Indexing & Agent Memory
  • Designing AI Infrastructure
  • Deploying Models & CI/CD
  • Quotas, Scaling & Cost
  • Monitoring & Security
  • Responsible AI: Filters, Auditing & Governance

Domain 2: Implement Generative AI and Agentic Solutions

  • Connecting Your App to Foundry Free
  • Building RAG Applications
  • Workflows & Reasoning Pipelines
  • Evaluating AI Models & Apps
  • Agent Fundamentals: Roles, Goals & Tools Free
  • Building Agents with Retrieval & Memory
  • Agent Tools & Knowledge Integration
  • Multi-Agent Orchestration & Safeguards
  • Agent Monitoring & Error Analysis
  • Prompt Engineering & Model Tuning
  • Observability & Production Operations

Domain 3: Implement Computer Vision Solutions

  • Image & Video Generation
  • Multimodal Visual Understanding
  • Responsible AI for Visual Content

Domain 4: Implement Text Analysis Solutions

  • Text Analysis with Language Models
  • Speech, Translation & Voice Agents

Domain 5: Implement Information Extraction Solutions

  • Ingestion, Indexing & Grounding Pipelines
  • Extracting Content with Content Understanding
  • Exam Prep: Putting It All Together

Domain 1: Plan and Manage an Azure AI Solution | Premium | ~12 min read

Quotas, Scaling & Cost

AI workloads can get expensive fast. Learn how to manage quotas, rate limits, scaling, and cost footprints, plus how to monitor model performance and detect drift before users notice.

Managing AI costs and limits

☕ Simple explanation

AI models are like electricity: powerful, but you pay for every unit you use, and there's a limit to how much you can draw at once.

Quotas set how much capacity you're allowed. Rate limits cap how fast you can use it. Scaling adjusts capacity up or down as demand changes. And cost management stops you from getting a surprise bill at the end of the month.

The exam tests whether you can keep AI workloads running smoothly without burning through budget or hitting walls.

Managing AI workloads requires balancing four interconnected concerns:

  • Quotas: subscription-level limits on total capacity (TPM for a model in a region, shared across all deployments)
  • Rate limits: per-deployment limits on requests per minute (RPM) and tokens per minute (TPM)
  • Scaling: adjusting provisioned capacity to match demand patterns
  • Cost optimisation: choosing the right deployment type, model, and configuration to minimise spend

These concerns interact: raising a deployment's rate limit draws on subscription quota, and scaling up provisioned throughput increases cost. The exam tests these trade-off decisions.
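To make the idea of a per-deployment TPM limit concrete, here is a minimal client-side sketch that tracks tokens used in a rolling 60-second window and refuses requests that would exceed the limit. The class name `TpmLimiter` and its methods are hypothetical, and the rolling window is an assumption made for illustration; the service's own enforcement may work differently.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class TpmLimiter:
    """Client-side guard that tracks tokens used in a rolling 60-second window.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """

    def __init__(self, tpm_limit: int, clock: Callable[[], float] = time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self._used = 0

    def _expire(self) -> None:
        # Drop usage records that are 60 seconds old or older.
        cutoff = self.clock() - 60.0
        while self._events and self._events[0][0] <= cutoff:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Record the usage and return True if the request fits in the window."""
        self._expire()
        if self._used + tokens > self.tpm_limit:
            return False
        self._events.append((self.clock(), tokens))
        self._used += tokens
        return True
```

A caller would check `try_acquire(estimated_tokens)` before sending a request and wait or queue the request when it returns `False`, rather than letting the service reject it.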

Quotas and rate limits

| Concept | Scope | What It Limits | How to Increase |
| --- | --- | --- | --- |
| Subscription quota | Entire Azure subscription | Total TPM available for a model in a region | Request increase via Azure portal |
| Deployment rate limit | Single model deployment | RPM and TPM for that specific deployment | Adjust within subscription quota |
| Provisioned capacity | Reserved deployment | Fixed compute capacity (PTU) guaranteeing a model-specific TPM rate | Purchase more PTU |

💡 Exam tip: Quota vs rate limit

The exam distinguishes between these:

  • Quota = your budget ceiling (subscription level). Example: 300K TPM for GPT-4o in East US.
  • Rate limit = how fast one deployment can spend. Example: Deployment "prod-chat" limited to 80K TPM.
  • Multiple deployments share the quota. If quota is 300K and you have 3 deployments, their combined rate limits can't exceed 300K.
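
The shared-quota rule can be checked with simple arithmetic. The sketch below uses the 300K quota and the "prod-chat" deployment from the tip; the other two deployment names and all TPM values besides 80K are made up for illustration, and `remaining_quota` is a hypothetical helper, not an Azure API.

```python
from typing import Dict


def remaining_quota(quota_tpm: int, deployments: Dict[str, int]) -> int:
    """Return unallocated TPM, raising if the deployments over-commit the quota."""
    allocated = sum(deployments.values())
    if allocated > quota_tpm:
        raise ValueError(
            f"Allocated {allocated} TPM exceeds quota of {quota_tpm} TPM"
        )
    return quota_tpm - allocated


# "staging-chat" and "batch-summaries" are hypothetical deployments.
deployments = {
    "prod-chat": 80_000,
    "staging-chat": 20_000,
    "batch-summaries": 100_000,
}
print(remaining_quota(300_000, deployments))  # 100000
```

Here 200K of the 300K quota is allocated, so "prod-chat" could be raised by up to 100K TPM before a quota increase is needed.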

Cost management strategies

| Strategy | How It Saves Money | Best For |
| --- | --- | --- |
| Right-size the model | Use SLMs for simple tasks instead of LLMs | High-volume, low-complexity workloads |
| Prompt caching | Reuse cached prompt prefixes for repeated system prompts | Apps with long, stable system prompts |
| Batch processing | Process requests in bulk at lower priority | Non-real-time workloads (report generation, analysis) |
| Token budgeting | Set max_tokens to prevent runaway responses | All deployments |
| Model Router | Auto-route to cheapest capable model | Variable complexity workloads |
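
A rough cost estimate shows why right-sizing and token budgeting matter. The sketch below computes an upper-bound monthly bill, assuming every response uses its full max_tokens budget. All prices, volumes, and token counts are hypothetical placeholders, not real Azure rates, and `ModelPrice` and `monthly_cost` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price in USD per one million tokens. Values used below are made up."""
    input_per_million: float
    output_per_million: float


def monthly_cost(
    price: ModelPrice,
    requests_per_month: int,
    input_tokens: int,
    max_output_tokens: int,
) -> float:
    """Upper-bound monthly cost if every response uses the full max_tokens budget."""
    input_cost = requests_per_month * input_tokens * price.input_per_million / 1_000_000
    output_cost = (
        requests_per_month * max_output_tokens * price.output_per_million / 1_000_000
    )
    return input_cost + output_cost


# Hypothetical prices for illustration only; check the current price list.
large_model = ModelPrice(input_per_million=2.50, output_per_million=10.00)
small_model = ModelPrice(input_per_million=0.10, output_per_million=0.40)

# Hypothetical workload: 1.5M requests/month, 400 input tokens, max_tokens=50.
large = monthly_cost(large_model, 1_500_000, 400, 50)
small = monthly_cost(small_model, 1_500_000, 400, 50)
print(f"large: ${large:.2f}, small: ${small:.2f}")  # large: $2250.00, small: $90.00
```

Because output cost scales with max_output_tokens, lowering the token budget reduces the worst-case bill directly, whichever model is chosen.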

Monitoring model performance

Beyond cost, you need to monitor whether your models are performing well:

| Metric | What to Watch | Red Flag |
| --- | --- | --- |
| Groundedness | Are responses based on retrieved data? | Responses contain information not in the source documents |
| Relevance | Do responses answer the actual question? | Users rephrase and retry frequently |
| Safety events | Are safety filters triggering? | Spike in blocked requests or user complaints |
| Drift | Has model behaviour changed over time? | Quality scores declining without code changes |
| Latency | Response time per request | P95 latency exceeding SLA thresholds |

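Two of these red flags are easy to compute from logged data. The sketch below shows one way to calculate P95 latency (nearest-rank method) and a very simple drift check that compares recent quality scores against a baseline. The function names and the 0.05 threshold are assumptions for illustration; production monitoring would normally rely on the platform's evaluation and metrics tooling.

```python
from statistics import mean
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def has_drifted(
    baseline_scores: Sequence[float],
    recent_scores: Sequence[float],
    max_drop: float = 0.05,  # assumed threshold; tune for your own metric
) -> bool:
    """Flag drift when the recent mean score falls more than max_drop below baseline."""
    return mean(baseline_scores) - mean(recent_scores) > max_drop


print(p95(list(range(1, 101))))              # 95
print(has_drifted([0.9, 0.9], [0.8, 0.8]))   # True
```

Comparing means is a deliberately crude test. It catches a sustained decline in quality scores but says nothing about the cause, which still has to be investigated.
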
ℹ️ Real-world example: Atlas Financial's cost controls

Atlas Financial processes 100,000 compliance reviews monthly. Their cost strategy:

  • Provisioned throughput for the compliance agent (predictable cost, guaranteed capacity)
  • Serverless for the internal FAQ chatbot (low, variable usage)
  • Phi-4-mini for email classification (50,000 emails/day; the SLM saves 80% vs GPT-4o)
  • Batch API for monthly regulatory report generation (not time-sensitive)
  • Token budget of 2,000 tokens max on all customer-facing responses

Result: 60% cost reduction compared to running everything on GPT-4o serverless.
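
The choices in this example follow a pattern that can be written down as a small decision function. The sketch below is a simplification for study purposes: the `Workload` fields, the order of the checks, and the returned labels are assumptions, and real decisions would also weigh quality requirements, quota, and budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    time_sensitive: bool      # must a user or system wait for the answer?
    predictable_volume: bool  # steady, forecastable traffic?
    simple_task: bool         # e.g. classification rather than open-ended reasoning


def recommend(workload: Workload) -> str:
    """Heuristic mapping from workload traits to a cost strategy (illustrative only)."""
    if not workload.time_sensitive:
        return "batch"
    if workload.simple_task:
        return "small language model"
    if workload.predictable_volume:
        return "provisioned throughput"
    return "serverless"


workloads = [
    Workload("compliance agent", True, True, False),
    Workload("internal FAQ chatbot", True, False, False),
    Workload("email classification", True, True, True),
    Workload("monthly regulatory report", False, True, False),
]
for w in workloads:
    print(f"{w.name}: {recommend(w)}")
```

Run against the four Atlas workloads, the function reproduces the choices listed above: provisioned throughput, serverless, a small language model, and batch.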

Key terms

Question: What is TPM (Tokens Per Minute)?

Answer: The rate limit unit for AI model deployments. It caps how many tokens (input + output) a deployment can process per minute. Both subscription quotas and deployment rate limits are measured in TPM.

Question: What is model drift?

Answer: A change in a model's behaviour over time without any code changes on your side. It can happen due to model updates, data distribution shifts, or changes in user query patterns, and it is detected through ongoing evaluation metrics.

Question: What is prompt caching?

Answer: A cost-saving feature where repeated system prompts are cached and reused, reducing token costs. Most effective when your application uses long, stable system prompts that don't change between requests.

Question: What is groundedness in AI evaluation?

Answer: A metric measuring whether the model's response is based on the retrieved source data. Low groundedness means the model is generating information not supported by the provided context, which is a form of hallucination.

Knowledge check

  • NeuralMed's patient chatbot is hitting rate limit errors during peak hours (9-11 AM) but usage is low overnight. Their subscription quota has available capacity. What should they do?
  • MediaForge notices their content generation agent's quality scores have declined over the past 2 weeks, but no code changes were deployed. What is the most likely cause?



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.