Small Language Models & Model Selection
Not every AI task needs GPT-4. Learn when small language models (SLMs) are the right choice, how model routers intelligently select the best model for each request, and how to design a model selection strategy for enterprise AI solutions.
Why bigger isn’t always better
Imagine you need someone to sort your mail. You wouldn’t hire a brain surgeon — you’d hire someone who’s fast, efficient, and cheap for that specific task.
Small language models (SLMs) are the efficient mail sorters of the AI world. They’re trained for specific tasks — classifying text, extracting data, answering domain-specific questions — and they do those tasks faster and cheaper than massive general-purpose models like GPT-4.
A model router is like a smart receptionist who looks at each incoming request and decides: “This is a simple question — send it to the small model. This needs deep reasoning — send it to the big model.” It optimises cost without sacrificing quality.
When to use small language models
SLMs shine in specific scenarios. The exam tests whether you can identify when an SLM is the right architectural choice.
| Feature | Best For | Example SLM | Why Not an LLM? |
|---|---|---|---|
| Edge/on-device inference | Manufacturing sensor log analysis, retail POS text recommendations | Phi-3-mini, Phi-3.5-mini | LLMs require cloud connectivity and have higher latency |
| High-volume, simple tasks | Email classification, sentiment analysis, intent detection | Phi-3-small, fine-tuned Phi models | Cost of running GPT-4 on millions of simple classifications is prohibitive |
| Domain-specific reasoning | Legal document analysis, medical coding, financial report parsing | Fine-tuned Phi or custom-trained models | After fine-tuning, SLMs match LLM quality on narrow domains at lower cost |
| Low-latency requirements | Real-time customer service routing, chatbot intent detection | Phi-3-mini, ONNX-optimised models | LLMs take 2-5 seconds; SLMs respond in milliseconds |
| Data sovereignty | Government or regulated industries where data cannot leave the premises | Self-hosted Phi models on Azure or on-premises | Cloud LLM APIs may not meet data residency requirements |
Exam tip: SLM decision signals
Look for these keywords in exam scenarios:
- “Edge,” “on-premises,” “limited connectivity” — SLM (can run locally)
- “Millions of requests,” “high volume,” “cost-sensitive” — SLM (cheaper per inference)
- “Under 1 second response time,” “real-time” — SLM (lower latency)
- “Data cannot leave the environment” — SLM (self-hosted)
- “Complex reasoning,” “multi-step analysis,” “creative generation” — LLM (SLMs lack depth for these)
- “General-purpose assistant across many topics” — LLM (SLMs are narrow specialists)
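The signals above can be sketched as a simple keyword scan. This is an illustrative helper only, not an official rubric or part of any Microsoft tooling; the keyword lists are assumptions lifted from the bullets above, and naive substring matching is deliberately crude.

```python
# Illustrative sketch: map exam-scenario keywords to an SLM-vs-LLM lean.
# Keyword lists mirror the bullets above; they are not an official rubric,
# and plain substring matching would need refinement in practice.

SLM_SIGNALS = [
    "edge", "on-premises", "limited connectivity",
    "millions of requests", "high volume", "cost-sensitive",
    "real-time", "cannot leave",
]
LLM_SIGNALS = [
    "complex reasoning", "multi-step", "creative generation",
    "general-purpose",
]

def recommend(scenario: str) -> str:
    """Count SLM vs LLM signal keywords and return the stronger lean."""
    text = scenario.lower()
    slm_hits = sum(1 for kw in SLM_SIGNALS if kw in text)
    llm_hits = sum(1 for kw in LLM_SIGNALS if kw in text)
    if llm_hits > slm_hits:
        return "LLM"
    if slm_hits > llm_hits:
        return "SLM"
    return "model router"  # ambiguous: let a router decide per request

print(recommend("Edge deployment with limited connectivity"))       # SLM
print(recommend("Complex reasoning across a multi-step analysis"))  # LLM
```

A real scenario often mixes signals, which is exactly the case where the default of "model router" is the safest answer.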
Model router: intelligent model selection
A model router in Microsoft Foundry is a deployed model that analyses each incoming prompt and routes it to the most suitable underlying LLM. You deploy it like any other model — one endpoint, one deployment — but behind the scenes it selects from multiple models.
How the model router works:
- You deploy a model router from the Foundry model catalogue
- You send requests to a single endpoint (just like calling GPT-4)
- The router analyses each prompt — complexity, task type, reasoning requirements
- It routes to the best model based on your selected routing mode
- The response includes a `model` field revealing which underlying model was selected
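In code, that last step is just reading the response payload: a router call looks like any single-model call, and the `model` field tells you which model was actually used. The response below is a hand-built placeholder (the model name and payload values are invented for illustration); the parsing helper is the runnable part.

```python
import json

# Hypothetical response from a model-router deployment: same shape as a
# normal chat completion, but "model" names the underlying model the
# router actually selected. All values here are placeholders.
sample_response = json.loads("""
{
  "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
  "model": "small-model-v1"
}
""")

def routed_model(response: dict) -> str:
    """Return which underlying model the router selected for this request."""
    return response["model"]

print(routed_model(sample_response))  # small-model-v1
```

Logging this field per request is how you verify the router is behaving as expected before trusting it with production traffic.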
Routing modes:
| Mode | Behaviour | Best For |
|---|---|---|
| Balanced (default) | Considers all models within a small quality range (1-2% of best) and picks the most cost-effective | Most enterprise workloads |
| Quality | Always picks the highest-quality model regardless of cost | Legal review, medical summaries, complex reasoning |
| Cost | Considers a larger quality band (5-6% of best) and picks the cheapest | High-volume classification, simple Q&A, content tagging |
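The three modes can be illustrated with a toy selection function. The quality bands (2% for Balanced, 6% for Cost) come from the table above, but the candidate models, their scores, and the selection logic are invented for illustration; the real router's scoring is internal to the service.

```python
def select_model(candidates, mode="balanced"):
    """Pick a model from (name, quality, cost_per_1k) tuples.

    Toy illustration of the routing modes in the table above;
    the actual router's scoring is internal to the service.
    """
    best_quality = max(q for _, q, _ in candidates)
    if mode == "quality":
        # Always the highest-quality model, cost ignored.
        return max(candidates, key=lambda m: m[1])[0]
    band = 0.02 if mode == "balanced" else 0.06   # Cost mode: wider band
    eligible = [m for m in candidates if m[1] >= best_quality * (1 - band)]
    return min(eligible, key=lambda m: m[2])[0]   # cheapest within the band

# Invented models, with quality normalised so the best model scores 1.0.
models = [
    ("big-llm",   1.00, 30.0),
    ("mid-llm",   0.99,  5.0),
    ("small-slm", 0.95,  0.5),
]
print(select_model(models, "quality"))    # big-llm
print(select_model(models, "balanced"))   # mid-llm (within 2%, far cheaper)
print(select_model(models, "cost"))       # small-slm (within the 6% band)
```

Note how widening the band from 2% to 6% is what lets the cheapest model qualify: that trade-off is the whole difference between Balanced and Cost mode.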
Scenario: Kai implements model routing for Apex Industries
Kai designs a model routing strategy for Apex’s AI platform:
Agent 1 — Customer FAQ bot: Handles thousands of simple product questions daily. Routing mode: Cost — most questions are straightforward; smaller models handle them fine.
Agent 2 — Quality inspection analyser: Reviews complex inspection reports and identifies potential compliance issues. Routing mode: Quality — accuracy is critical; regulatory compliance can’t tolerate errors.
Agent 3 — General supply chain assistant: A mix of simple lookups and complex analysis. Routing mode: Balanced — the router automatically sends simple queries to cheap models and complex ones to powerful models.
Cost impact: By using model routing instead of sending everything to GPT-4, Kai estimates a 40% reduction in inference costs with minimal quality degradation.
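A 40% figure like Kai's can be sanity-checked with back-of-envelope arithmetic. The per-request prices and the traffic split below are invented purely for illustration, not real Azure pricing.

```python
# Hypothetical numbers: price per 1K requests for each tier, and the
# share of traffic the router sends to the cheaper tier (all invented).
gpt4_price, cheap_price = 30.0, 6.0   # per 1K requests
cheap_share = 0.5                     # half the traffic is simple enough

baseline = gpt4_price                                        # all on GPT-4
routed = cheap_share * cheap_price + (1 - cheap_share) * gpt4_price
saving = 1 - routed / baseline
print(f"Routed cost per 1K requests: {routed:.2f} (saving {saving:.0%})")
# Routed cost per 1K requests: 18.00 (saving 40%)
```

The point of the arithmetic: savings depend entirely on what fraction of traffic is simple, which is why measuring the router's actual routing decisions matters.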
Deep dive: model router architecture details
Key architectural facts for the exam:
- Single deployment: You deploy model router once. Don’t deploy the underlying models separately (except Claude models, which need their own deployment)
- Content filters: Applied at the router level — one filter covers all underlying models
- Rate limits: Applied at the router level — one quota for all traffic
- Model subset: You can restrict which underlying models the router uses (useful if you need specific context window sizes or want to exclude certain models)
- Auto-update: Router versions can auto-update, which changes the underlying model set
- Automatic failover: If a routed model has issues, the router transparently redirects to the next best model
- Monitoring: Use Azure Monitor to track which underlying models are being selected and at what cost
Designing a model selection strategy
As an architect, you need a strategy that covers the full spectrum of AI tasks:
| Task Complexity | Recommended Approach | Cost |
|---|---|---|
| Simple classification, intent detection | SLM (Phi-3-mini) or model router in Cost mode | Very low |
| Standard Q&A, summarisation, content generation | Model router in Balanced mode | Low to medium |
| Complex reasoning, multi-step analysis | Model router in Quality mode or direct LLM (GPT-4) | Medium to high |
| Domain-specific with strict accuracy | Fine-tuned SLM or fine-tuned LLM with RAG | Variable (training cost upfront, low inference) |
| Edge/offline scenarios | Deployed SLM on edge device (ONNX runtime) | One-time deployment cost |
Exam tip: model selection hierarchy
The exam rewards architects who follow this cost-optimisation hierarchy:
1. Can a model router handle it? — Use model router first (it automatically optimises)
2. Is it a narrow, high-volume task? — Consider an SLM
3. Does it need domain expertise? — Fine-tune an SLM on your data
4. Does it need deep reasoning across broad topics? — Use a direct LLM
5. Does it need to run offline? — Deploy an SLM to edge
The wrong answer is almost always “use GPT-4 for everything.” The right answer considers cost, latency, and accuracy requirements for each specific task.
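The hierarchy above can be written as an ordered series of checks. The argument names below are invented for illustration, and the hard constraints (offline, deep reasoning) are tested first since they rule options out; the default falls through to the router, matching step 1.

```python
def choose_approach(offline=False, narrow_high_volume=False,
                    needs_domain_expertise=False, deep_broad_reasoning=False):
    """Walk the cost-optimisation hierarchy above, hard constraints first.

    Illustrative sketch only: the argument names are invented, and a real
    decision weighs cost, latency, and accuracy together.
    """
    if offline:
        return "SLM on edge (ONNX)"       # connectivity rules out cloud LLMs
    if deep_broad_reasoning:
        return "direct LLM"               # SLMs lack depth for these
    if needs_domain_expertise:
        return "fine-tuned SLM"
    if narrow_high_volume:
        return "SLM"
    return "model router"                 # default: let the router optimise

print(choose_approach())                         # model router
print(choose_approach(offline=True))             # SLM on edge (ONNX)
print(choose_approach(narrow_high_volume=True))  # SLM
```

Notice what the function never returns by default: "GPT-4 for everything" — which is exactly the exam's point.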
Flashcards
Knowledge check
Kai's manufacturing client needs an AI system that classifies equipment maintenance logs from text sensor outputs on the production floor. The factory has intermittent internet connectivity, and the classification must happen in under 500 milliseconds. Which approach should Kai recommend?
Adrienne's financial services company processes 2 million customer emails per month for intent classification (complaint, inquiry, request, compliment). The classification is straightforward — most emails clearly fall into one category. Which model strategy minimises cost while maintaining accuracy?
Which of the following is NOT a benefit of using a model router compared to deploying a single large language model?
🎬 Video coming soon
Next up: ROI, TCO & Business Case Analysis — building the financial case for AI investments, understanding total cost of ownership, and proving value to leadership.