Quotas, Scaling & Cost
AI workloads can get expensive fast. Learn how to manage quotas, rate limits, scaling, and costs, plus how to monitor model performance and detect drift before users notice.
Managing AI costs and limits
AI models are like electricity: powerful, but you pay for every unit you use, and there's a limit to how much you can draw at once.
Quotas set how much capacity you're allowed. Rate limits cap how fast you can use it. Scaling adjusts capacity up or down as demand changes. And cost management stops you from getting a surprise bill at the end of the month.
The exam tests whether you can keep AI workloads running smoothly without burning through budget or hitting walls.
Quotas and rate limits
| Concept | Scope | What It Limits | How to Increase |
|---|---|---|---|
| Subscription quota | Entire Azure subscription | Total TPM (tokens per minute) available for a model in a region | Request increase via Azure portal |
| Deployment rate limit | Single model deployment | RPM (requests per minute) and TPM for that specific deployment | Adjust within subscription quota |
| Provisioned capacity | Reserved deployment | Fixed compute capacity in provisioned throughput units (PTUs), guaranteeing a model-specific TPM rate | Purchase more PTUs |
Exam tip: Quota vs rate limit
The exam distinguishes between these:
- Quota = your budget ceiling (subscription level). Example: 300K TPM for GPT-4o in East US.
- Rate limit = how fast one deployment can spend. Example: Deployment "prod-chat" limited to 80K TPM.
- Multiple deployments share the quota. If quota is 300K and you have 3 deployments, their combined rate limits can't exceed 300K.
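The quota arithmetic above can be sketched in a few lines. The deployment names and TPM figures below are illustrative assumptions, not real Azure limits:

```python
# Hypothetical sketch: check that per-deployment rate limits fit within
# the subscription quota. All numbers here are made up for illustration.

SUBSCRIPTION_QUOTA_TPM = 300_000  # e.g. quota for one model in one region

# Assumed deployments and their configured TPM rate limits
deployments = {
    "prod-chat": 80_000,
    "batch-summaries": 150_000,
    "dev-sandbox": 50_000,
}

allocated = sum(deployments.values())
remaining = SUBSCRIPTION_QUOTA_TPM - allocated

print(f"Allocated: {allocated:,} TPM, remaining: {remaining:,} TPM")
assert allocated <= SUBSCRIPTION_QUOTA_TPM, "Combined rate limits exceed quota"
```

Adding a fourth deployment would only succeed if its rate limit fits within the remaining 20K TPM; otherwise you would need a quota increase for the subscription.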
Cost management strategies
| Strategy | How It Saves Money | Best For |
|---|---|---|
| Right-size the model | Use SLMs (small language models) for simple tasks instead of LLMs | High-volume, low-complexity workloads |
| Prompt caching | Reuse cached prefills for repeated system prompts | Apps with long, stable system prompts |
| Batch processing | Process requests in bulk at lower priority | Non-real-time workloads (report generation, analysis) |
| Token budgeting | Set max_tokens to prevent runaway responses | All deployments |
| Model Router | Auto-route to cheapest capable model | Variable complexity workloads |
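To see how right-sizing plays out, here is a toy cost comparison. The per-token prices, request volume, and token counts are invented for illustration and are not actual Azure rates:

```python
# Hypothetical cost estimator; prices are illustrative assumptions only.
PRICE_PER_1K_TOKENS = {"gpt-4o": 0.0050, "phi-4-mini": 0.0005}

def monthly_cost(model: str, requests: int, avg_tokens: int) -> float:
    """Estimate monthly spend for a workload on a given model."""
    total_tokens = requests * avg_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# Assumed workload: 1.5M requests/month averaging 800 tokens each
llm = monthly_cost("gpt-4o", 1_500_000, 800)
slm = monthly_cost("phi-4-mini", 1_500_000, 800)
print(f"LLM: ${llm:,.0f}  SLM: ${slm:,.0f}  savings: {1 - slm / llm:.0%}")
```

The exact savings depend entirely on real pricing and workload shape, but the structure of the calculation (tokens × per-token price) is the same for any right-sizing decision.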
Monitoring model performance
Beyond cost, you need to monitor whether your models are performing well:
| Metric | What to Watch | Red Flag |
|---|---|---|
| Groundedness | Are responses based on retrieved data? | Responses contain information not in the source documents |
| Relevance | Do responses answer the actual question? | Users rephrase and retry frequently |
| Safety events | Are safety filters triggering? | Spike in blocked requests or user complaints |
| Drift | Has model behaviour changed over time? | Quality scores declining without code changes |
| Latency | Response time per request | P95 latency exceeding SLA thresholds |
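A minimal check for the Drift row might compare a recent window of quality scores against a baseline window; the window size and 0.05 threshold below are arbitrary assumptions for illustration:

```python
# Minimal drift-check sketch: flag a sustained decline in quality scores.
# Window size and threshold are assumed values, not recommended settings.
from statistics import mean

def drift_detected(scores: list[float], window: int = 7,
                   threshold: float = 0.05) -> bool:
    """Flag drift when the recent average drops below baseline by > threshold."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    return baseline - recent > threshold

history = [0.92, 0.91, 0.93, 0.92, 0.90, 0.92, 0.91,   # baseline week
           0.85, 0.84, 0.83, 0.86, 0.84, 0.82, 0.83]   # recent week
print(drift_detected(history))  # True: quality declined without code changes
```

In practice you would feed this from your evaluation pipeline's groundedness or relevance scores rather than hand-entered values, and alert before the decline becomes user-visible.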
Real-world example: Atlas Financial's cost controls
Atlas Financial processes 100,000 compliance reviews monthly. Their cost strategy:
- Provisioned throughput for the compliance agent (predictable cost, guaranteed capacity)
- Serverless for the internal FAQ chatbot (low, variable usage)
- Phi-4-mini for email classification (50,000 emails/day; the SLM cuts cost ~80% vs GPT-4o)
- Batch API for monthly regulatory report generation (not time-sensitive)
- Token budget of 2,000 tokens max on all customer-facing responses
Result: 60% cost reduction compared to running everything on GPT-4o serverless.
Key terms
Knowledge check
NeuralMed's patient chatbot is hitting rate limit errors during peak hours (9-11 AM) but usage is low overnight. Their subscription quota has available capacity. What should they do?
MediaForge notices their content generation agent's quality scores have declined over the past 2 weeks, but no code changes were deployed. What is the most likely cause?
🎬 Video coming soon