Quotas, Scaling & Cost
AI workloads can get expensive fast. Learn how to manage quotas, rate limits, scaling, and costs, plus how to monitor model performance and detect drift before users notice.
Managing AI costs and limits
AI models are like electricity: powerful, but you pay for every unit you use, and there's a limit to how much you can draw at once.
Quotas set how much capacity you're allowed. Rate limits cap how fast you can use it. Scaling adjusts capacity up or down as demand changes. And cost management stops you from getting a surprise bill at the end of the month.
The exam tests whether you can keep AI workloads running smoothly without burning through budget or hitting walls.
Quotas and rate limits
| Concept | Scope | What It Limits | How to Increase |
|---|---|---|---|
| Subscription quota | Entire Azure subscription | Total TPM (tokens per minute) available for a model in a region | Request increase via Azure portal |
| Deployment rate limit | Single model deployment | RPM (requests per minute) and TPM for that specific deployment | Adjust within subscription quota |
| Provisioned capacity | Reserved deployment | Fixed compute capacity in provisioned throughput units (PTUs), guaranteeing a model-specific TPM rate | Purchase more PTUs |
Exam tip: Quota vs rate limit
The exam distinguishes between these:
- Quota = your budget ceiling (subscription level). Example: 300K TPM for GPT-4o in East US.
- Rate limit = how fast one deployment can spend. Example: Deployment "prod-chat" limited to 80K TPM.
- Multiple deployments share the quota. If quota is 300K and you have 3 deployments, their combined rate limits can't exceed 300K.
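The quota arithmetic above can be sketched in a few lines. The deployment names and TPM figures below are illustrative assumptions, not real Azure limits:

```python
# Hypothetical sketch: check that per-deployment rate limits fit within
# the subscription quota. All numbers here are made up for illustration.

SUBSCRIPTION_QUOTA_TPM = 300_000  # e.g. quota for one model in one region

# Assumed deployments and their configured TPM rate limits
deployments = {
    "prod-chat": 80_000,
    "batch-summaries": 150_000,
    "dev-sandbox": 50_000,
}

allocated = sum(deployments.values())
remaining = SUBSCRIPTION_QUOTA_TPM - allocated

print(f"Allocated: {allocated:,} TPM, remaining: {remaining:,} TPM")
assert allocated <= SUBSCRIPTION_QUOTA_TPM, "Combined rate limits exceed quota"
```

Adding a fourth deployment would only succeed if its rate limit fits within the remaining 20K TPM; otherwise you would need a quota increase for the subscription.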
Cost management strategies
| Strategy | How It Saves Money | Best For |
|---|---|---|
| Right-size the model | Use SLMs (small language models) for simple tasks instead of LLMs | High-volume, low-complexity workloads |
| Prompt caching | Reuse cached prefills for repeated system prompts | Apps with long, stable system prompts |
| Batch processing | Process requests in bulk at lower priority | Non-real-time workloads (report generation, analysis) |
| Token budgeting | Set max_tokens to prevent runaway responses | All deployments |
| Model Router | Auto-route to cheapest capable model | Variable complexity workloads |
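To see how right-sizing plays out, here is a toy cost comparison. The per-token prices, request volume, and token counts are invented for illustration and are not actual Azure rates:

```python
# Hypothetical cost estimator; prices are illustrative assumptions only.
PRICE_PER_1K_TOKENS = {"gpt-4o": 0.0050, "phi-4-mini": 0.0005}

def monthly_cost(model: str, requests: int, avg_tokens: int) -> float:
    """Estimate monthly spend for a workload on a given model."""
    total_tokens = requests * avg_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# Assumed workload: 1.5M requests/month averaging 800 tokens each
llm = monthly_cost("gpt-4o", 1_500_000, 800)
slm = monthly_cost("phi-4-mini", 1_500_000, 800)
print(f"LLM: ${llm:,.0f}  SLM: ${slm:,.0f}  savings: {1 - slm / llm:.0%}")
```

The exact savings depend entirely on real pricing and workload shape, but the structure of the calculation (tokens × per-token price) is the same for any right-sizing decision.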
Monitoring model performance
Beyond cost, you need to monitor whether your models are performing well:
| Metric | What to Watch | Red Flag |
|---|---|---|
| Groundedness | Are responses based on retrieved data? | Responses contain information not in the source documents |
| Relevance | Do responses answer the actual question? | Users rephrase and retry frequently |
| Safety events | Are safety filters triggering? | Spike in blocked requests or user complaints |
| Drift | Has model behaviour changed over time? | Quality scores declining without code changes |
| Latency | Response time per request | P95 latency exceeding SLA thresholds |
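A minimal check for the Drift row might compare a recent window of quality scores against a baseline window; the window size and 0.05 threshold below are arbitrary assumptions for illustration:

```python
# Minimal drift-check sketch: flag a sustained decline in quality scores.
# Window size and threshold are assumed values, not recommended settings.
from statistics import mean

def drift_detected(scores: list[float], window: int = 7,
                   threshold: float = 0.05) -> bool:
    """Flag drift when the recent average drops below baseline by > threshold."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    return baseline - recent > threshold

history = [0.92, 0.91, 0.93, 0.92, 0.90, 0.92, 0.91,   # baseline week
           0.85, 0.84, 0.83, 0.86, 0.84, 0.82, 0.83]   # recent week
print(drift_detected(history))  # True: quality declined without code changes
```

In practice you would feed this from your evaluation pipeline's groundedness or relevance scores rather than hand-entered values, and alert before the decline becomes user-visible.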
Real-world example: Atlas Financial's cost controls
Atlas Financial processes 100,000 compliance reviews monthly. Their cost strategy:
- Provisioned throughput for the compliance agent (predictable cost, guaranteed capacity)
- Serverless for the internal FAQ chatbot (low, variable usage)
- Phi-4-mini for email classification (50,000 emails/day; the SLM cuts cost ~80% vs GPT-4o)
- Batch API for monthly regulatory report generation (not time-sensitive)
- Token budget of 2,000 tokens max on all customer-facing responses
Result: 60% cost reduction compared to running everything on GPT-4o serverless.
Key terms
Knowledge check
NeuralMed's patient chatbot is hitting rate limit errors during peak hours (9-11 AM) but usage is low overnight. Their subscription quota has available capacity. What should they do?
MediaForge notices their content generation agent's quality scores have declined over the past 2 weeks, but no code changes were deployed. What is the most likely cause?
🎬 Video coming soon