Small Language Models & Model Selection
Not every AI task needs GPT-4. Learn when small language models (SLMs) are the right choice, how model routers intelligently select the best model for each request, and how to design a model selection strategy for enterprise AI solutions.
Why bigger isn’t always better
Imagine you need someone to sort your mail. You wouldn’t hire a brain surgeon — you’d hire someone who’s fast, efficient, and cheap for that specific task.
Small language models (SLMs) are the efficient mail sorters of the AI world. They’re trained for specific tasks — classifying text, extracting data, answering domain-specific questions — and they do those tasks faster and cheaper than massive general-purpose models like GPT-4.
A model router is like a smart receptionist who looks at each incoming request and decides: “This is a simple question — send it to the small model. This needs deep reasoning — send it to the big model.” It optimises cost without sacrificing quality.
When to use small language models
SLMs shine in specific scenarios. The exam tests whether you can identify when an SLM is the right architectural choice.
| Feature | Best For | Example SLM | Why Not an LLM? |
|---|---|---|---|
| Edge/on-device inference | Manufacturing sensor log analysis, retail POS text recommendations | Phi-3-mini, Phi-3.5-mini | LLMs require cloud connectivity and have higher latency |
| High-volume, simple tasks | Email classification, sentiment analysis, intent detection | Phi-3-small, fine-tuned Phi models | Cost of running GPT-4 on millions of simple classifications is prohibitive |
| Domain-specific reasoning | Legal document analysis, medical coding, financial report parsing | Fine-tuned Phi or custom-trained models | After fine-tuning, SLMs match LLM quality on narrow domains at lower cost |
| Low-latency requirements | Real-time customer service routing, chatbot intent detection | Phi-3-mini, ONNX-optimised models | LLMs take 2-5 seconds; SLMs respond in milliseconds |
| Data sovereignty | Government or regulated industries where data cannot leave the premises | Self-hosted Phi models on Azure or on-premises | Cloud LLM APIs may not meet data residency requirements |
Exam tip: SLM decision signals
Look for these keywords in exam scenarios:
- “Edge,” “on-premises,” “limited connectivity” — SLM (can run locally)
- “Millions of requests,” “high volume,” “cost-sensitive” — SLM (cheaper per inference)
- “Under 1 second response time,” “real-time” — SLM (lower latency)
- “Data cannot leave the environment” — SLM (self-hosted)
- “Complex reasoning,” “multi-step analysis,” “creative generation” — LLM (SLMs lack depth for these)
- “General-purpose assistant across many topics” — LLM (SLMs are narrow specialists)
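The signals above can be sketched as a simple keyword scan. This is an illustrative helper only, not an official rubric or part of any Microsoft tooling; the keyword lists are assumptions lifted from the bullets above, and naive substring matching is deliberately crude.

```python
# Illustrative sketch: map exam-scenario keywords to an SLM-vs-LLM lean.
# Keyword lists mirror the bullets above; they are not an official rubric,
# and plain substring matching would need refinement in practice.

SLM_SIGNALS = [
    "edge", "on-premises", "limited connectivity",
    "millions of requests", "high volume", "cost-sensitive",
    "real-time", "cannot leave",
]
LLM_SIGNALS = [
    "complex reasoning", "multi-step", "creative generation",
    "general-purpose",
]

def recommend(scenario: str) -> str:
    """Count SLM vs LLM signal keywords and return the stronger lean."""
    text = scenario.lower()
    slm_hits = sum(1 for kw in SLM_SIGNALS if kw in text)
    llm_hits = sum(1 for kw in LLM_SIGNALS if kw in text)
    if llm_hits > slm_hits:
        return "LLM"
    if slm_hits > llm_hits:
        return "SLM"
    return "model router"  # ambiguous: let a router decide per request

print(recommend("Edge deployment with limited connectivity"))       # SLM
print(recommend("Complex reasoning across a multi-step analysis"))  # LLM
```

A real scenario often mixes signals, which is exactly the case where the default of "model router" is the safest answer.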
Model router: intelligent model selection
A model router in Microsoft Foundry is a deployed model that analyses each incoming prompt and routes it to the most suitable underlying LLM. You deploy it like any other model — one endpoint, one deployment — but behind the scenes it selects from multiple models.
How the model router works:
- You deploy a model router from the Foundry model catalogue
- You send requests to a single endpoint (just like calling GPT-4)
- The router analyses each prompt — complexity, task type, reasoning requirements
- It routes to the best model based on your selected routing mode
- The response includes a `model` field revealing which underlying model was selected
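In code, that last step is just reading the response payload: a router call looks like any single-model call, and the `model` field tells you which model was actually used. The response below is a hand-built placeholder (the model name and payload values are invented for illustration); the parsing helper is the runnable part.

```python
import json

# Hypothetical response from a model-router deployment: same shape as a
# normal chat completion, but "model" names the underlying model the
# router actually selected. All values here are placeholders.
sample_response = json.loads("""
{
  "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
  "model": "small-model-v1"
}
""")

def routed_model(response: dict) -> str:
    """Return which underlying model the router selected for this request."""
    return response["model"]

print(routed_model(sample_response))  # small-model-v1
```

Logging this field per request is how you verify the router is behaving as expected before trusting it with production traffic.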
Routing modes:
| Mode | Behaviour | Best For |
|---|---|---|
| Balanced (default) | Considers all models within a small quality range (1-2% of best) and picks the most cost-effective | Most enterprise workloads |
| Quality | Always picks the highest-quality model regardless of cost | Legal review, medical summaries, complex reasoning |
| Cost | Considers a larger quality band (5-6% of best) and picks the cheapest | High-volume classification, simple Q&A, content tagging |
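The three modes can be illustrated with a toy selection function. The quality bands (2% for Balanced, 6% for Cost) come from the table above, but the candidate models, their scores, and the selection logic are invented for illustration; the real router's scoring is internal to the service.

```python
def select_model(candidates, mode="balanced"):
    """Pick a model from (name, quality, cost_per_1k) tuples.

    Toy illustration of the routing modes in the table above;
    the actual router's scoring is internal to the service.
    """
    best_quality = max(q for _, q, _ in candidates)
    if mode == "quality":
        # Always the highest-quality model, cost ignored.
        return max(candidates, key=lambda m: m[1])[0]
    band = 0.02 if mode == "balanced" else 0.06   # Cost mode: wider band
    eligible = [m for m in candidates if m[1] >= best_quality * (1 - band)]
    return min(eligible, key=lambda m: m[2])[0]   # cheapest within the band

# Invented models, with quality normalised so the best model scores 1.0.
models = [
    ("big-llm",   1.00, 30.0),
    ("mid-llm",   0.99,  5.0),
    ("small-slm", 0.95,  0.5),
]
print(select_model(models, "quality"))    # big-llm
print(select_model(models, "balanced"))   # mid-llm (within 2%, far cheaper)
print(select_model(models, "cost"))       # small-slm (within the 6% band)
```

Note how widening the band from 2% to 6% is what lets the cheapest model qualify: that trade-off is the whole difference between Balanced and Cost mode.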
Scenario: Kai implements model routing for Apex Industries
Kai designs a model routing strategy for Apex’s AI platform:
Agent 1 — Customer FAQ bot: Handles thousands of simple product questions daily. Routing mode: Cost — most questions are straightforward; smaller models handle them fine.
Agent 2 — Quality inspection analyser: Reviews complex inspection reports and identifies potential compliance issues. Routing mode: Quality — accuracy is critical; regulatory compliance can’t tolerate errors.
Agent 3 — General supply chain assistant: A mix of simple lookups and complex analysis. Routing mode: Balanced — the router automatically sends simple queries to cheap models and complex ones to powerful models.
Cost impact: By using model routing instead of sending everything to GPT-4, Kai estimates a 40% reduction in inference costs with minimal quality degradation.
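A 40% figure like Kai's can be sanity-checked with back-of-envelope arithmetic. The per-request prices and the traffic split below are invented purely for illustration, not real Azure pricing.

```python
# Hypothetical numbers: price per 1K requests for each tier, and the
# share of traffic the router sends to the cheaper tier (all invented).
gpt4_price, cheap_price = 30.0, 6.0   # per 1K requests
cheap_share = 0.5                     # half the traffic is simple enough

baseline = gpt4_price                                        # all on GPT-4
routed = cheap_share * cheap_price + (1 - cheap_share) * gpt4_price
saving = 1 - routed / baseline
print(f"Routed cost per 1K requests: {routed:.2f} (saving {saving:.0%})")
# Routed cost per 1K requests: 18.00 (saving 40%)
```

The point of the arithmetic: savings depend entirely on what fraction of traffic is simple, which is why measuring the router's actual routing decisions matters.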
Deep dive: model router architecture details
Key architectural facts for the exam:
- Single deployment: You deploy model router once. Don’t deploy the underlying models separately (except Claude models, which need their own deployment)
- Content filters: Applied at the router level — one filter covers all underlying models
- Rate limits: Applied at the router level — one quota for all traffic
- Model subset: You can restrict which underlying models the router uses (useful if you need specific context window sizes or want to exclude certain models)
- Auto-update: Router versions can auto-update, which changes the underlying model set
- Automatic failover: If a routed model has issues, the router transparently redirects to the next best model
- Monitoring: Use Azure Monitor to track which underlying models are being selected and at what cost
Designing a model selection strategy
As an architect, you need a strategy that covers the full spectrum of AI tasks:
| Task Complexity | Recommended Approach | Cost |
|---|---|---|
| Simple classification, intent detection | SLM (Phi-3-mini) or model router in Cost mode | Very low |
| Standard Q&A, summarisation, content generation | Model router in Balanced mode | Low to medium |
| Complex reasoning, multi-step analysis | Model router in Quality mode or direct LLM (GPT-4) | Medium to high |
| Domain-specific with strict accuracy | Fine-tuned SLM or fine-tuned LLM with RAG | Variable (training cost upfront, low inference) |
| Edge/offline scenarios | Deployed SLM on edge device (ONNX runtime) | One-time deployment cost |
Exam tip: model selection hierarchy
The exam rewards architects who follow this cost-optimisation hierarchy:
1. Can a model router handle it? — Use model router first (it automatically optimises)
2. Is it a narrow, high-volume task? — Consider an SLM
3. Does it need domain expertise? — Fine-tune an SLM on your data
4. Does it need deep reasoning across broad topics? — Use a direct LLM
5. Does it need to run offline? — Deploy an SLM to edge
The wrong answer is almost always “use GPT-4 for everything.” The right answer considers cost, latency, and accuracy requirements for each specific task.
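The hierarchy above can be written as an ordered series of checks. The argument names below are invented for illustration, and the hard constraints (offline, deep reasoning) are tested first since they rule options out; the default falls through to the router, matching step 1.

```python
def choose_approach(offline=False, narrow_high_volume=False,
                    needs_domain_expertise=False, deep_broad_reasoning=False):
    """Walk the cost-optimisation hierarchy above, hard constraints first.

    Illustrative sketch only: the argument names are invented, and a real
    decision weighs cost, latency, and accuracy together.
    """
    if offline:
        return "SLM on edge (ONNX)"       # connectivity rules out cloud LLMs
    if deep_broad_reasoning:
        return "direct LLM"               # SLMs lack depth for these
    if needs_domain_expertise:
        return "fine-tuned SLM"
    if narrow_high_volume:
        return "SLM"
    return "model router"                 # default: let the router optimise

print(choose_approach())                         # model router
print(choose_approach(offline=True))             # SLM on edge (ONNX)
print(choose_approach(narrow_high_volume=True))  # SLM
```

Notice what the function never returns by default: "GPT-4 for everything" — which is exactly the exam's point.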
Flashcards
Knowledge check
Kai's manufacturing client needs an AI system that classifies equipment maintenance logs from text sensor outputs on the production floor. The factory has intermittent internet connectivity, and the classification must happen in under 500 milliseconds. Which approach should Kai recommend?
Adrienne's financial services company processes 2 million customer emails per month for intent classification (complaint, inquiry, request, compliment). The classification is straightforward — most emails clearly fall into one category. Which model strategy minimises cost while maintaining accuracy?
Which of the following is NOT a benefit of using a model router compared to deploying a single large language model?
🎬 Video coming soon
Next up: ROI, TCO & Business Case Analysis — building the financial case for AI investments, understanding total cost of ownership, and proving value to leadership.