Embeddings & Hybrid Search
Vector search alone isn't enough. Learn to select embedding models, implement hybrid search combining semantic and keyword retrieval, and optimize for domain-specific accuracy.
How embeddings power search
Embeddings translate words into coordinates on a map.
Imagine a giant map where every word and sentence has a location. “Dog” and “puppy” are neighbours. “Dog” and “refrigerator” are on opposite sides of the map. “Bank” (money) and “bank” (river) are in completely different neighbourhoods.
When you search, the system finds your query’s location on the map and returns whatever is closest. “What’s the refund policy?” lands near documents about returns, refunds, and exchanges — even if those documents don’t use the exact word “refund.”
That’s the magic: embeddings understand meaning, not just matching words.
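The "closeness" idea is just vector math: each text becomes a list of numbers, and similarity is the cosine of the angle between two such lists. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot product / (product of the two magnitudes)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values only, not real model output
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.15]
refrigerator = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, puppy))         # close to 1.0 — neighbours on the map
print(cosine_similarity(dog, refrigerator))  # much lower — opposite sides of the map
```

A vector search engine does exactly this comparison, just at scale: it stores one vector per chunk and returns the chunks whose vectors are closest to the query's vector.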
Embedding model selection
Azure OpenAI offers several embedding models with different trade-offs:
| Model | Dimensions | Max Input | Relative Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 tokens | Good — general purpose | Lowest |
| text-embedding-3-large | 3072 | 8191 tokens | Best — highest accuracy | ~6x more than small |
| text-embedding-ada-002 | 1536 | 8191 tokens | Legacy — still works | Between small and large |
Choosing the right model
| Use Case | Recommended Model | Why |
|---|---|---|
| General-purpose chatbot | text-embedding-3-small | Good quality, lowest cost, fast |
| High-accuracy domain search | text-embedding-3-large | Best quality, worth the cost for critical apps |
| Budget-constrained, high volume | text-embedding-3-small (reduced dimensions) | Can reduce to 256 dims for cost savings |
| Scientific/medical domain | Domain-specific model or fine-tuned | General models miss specialised terminology |
Dimensionality trade-off
The text-embedding-3 models support dimension reduction — you can request fewer dimensions to save storage and speed up search:
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Full dimensions (1536)
response_full = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is the refund policy?",
)
# response_full.data[0].embedding → 1536-dim vector

# Reduced dimensions (256) — smaller, faster, slightly less accurate
response_small = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is the refund policy?",
    dimensions=256,
)
# response_small.data[0].embedding → 256-dim vector
```
What’s happening:
- The first call is a standard embedding request — it returns a 1536-dimensional vector
- The second call passes `dimensions=256` to the same model — it returns a 256-dimensional vector
- Fewer dimensions = smaller index, faster search, but slightly lower accuracy
- For most applications, 512-1024 dimensions provide a good balance
Exam tip: Dimensionality affects quality AND cost
The exam tests the trade-off:
- Higher dimensions = better quality, larger index, slower search, more storage cost
- Lower dimensions = slightly lower quality, smaller index, faster search, less storage
- text-embedding-3-small at 256 dimensions can be sufficient for many use cases while being significantly cheaper to store and search
- You CANNOT increase dimensions beyond the model’s maximum (1536 for small, 3072 for large)
If a question asks how to reduce search latency or storage cost without changing models, the answer is reduce embedding dimensions.
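The storage side of the trade-off is simple arithmetic. A back-of-the-envelope sketch, assuming 4-byte float32 values per dimension (typical, though real indexes add structural overhead on top of the raw vectors):

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    # Raw vector storage only — actual index overhead varies by search service
    return num_vectors * dimensions * bytes_per_value / 1024**3

docs = 1_000_000
print(f"1536 dims: {index_size_gb(docs, 1536):.2f} GB")  # ~5.72 GB
print(f"256 dims:  {index_size_gb(docs, 256):.2f} GB")   # ~0.95 GB
```

Same million documents, same model — dropping from 1536 to 256 dimensions cuts raw vector storage by a factor of six.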
Vector search vs keyword search
Neither vector search nor keyword search is universally better — they excel at different things:
| Feature | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Vector Search | Understands meaning and synonyms; finds semantically similar results even with different words | Misses exact terms (product codes, IDs, names); can match unrelated content with surface-level similarity | Natural language questions, conceptual queries |
| Keyword Search (BM25) | Exact term matching; great for codes, names, specific phrases; fast and well-understood | Misses synonyms and paraphrases; 'car' won't match 'automobile' | Specific lookups, product codes, legal citations |
Example of where each fails:
| Query | Vector Search | Keyword Search |
|---|---|---|
| “How do I return an item?” | Finds documents about refund policy, returns process, exchanges (correct) | Only finds docs containing “return” and “item” (misses docs about “refund process”) |
| “Policy ABC-2024-Q3” | Might return any policy document (wrong) | Finds the exact policy by ID (correct) |
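The keyword failure mode in the first row can be shown with a toy term-overlap scorer — a stand-in for BM25 (real BM25 also weights term rarity and document length):

```python
def keyword_score(query, document):
    # Count how many query terms literally appear in the document
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

query = "how do i return an item"
doc_a = "to return an item bring it to any store"
doc_b = "our refund process takes five business days"

print(keyword_score(query, doc_a))  # 3 shared terms — matches
print(keyword_score(query, doc_b))  # 0 — same topic, zero shared words
```

`doc_b` answers the user's question but shares no vocabulary with it, so any purely lexical scorer ranks it at zero — which is exactly the gap vector search fills.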
Hybrid search: the best of both worlds
Hybrid search combines vector (semantic) and keyword (BM25) retrieval, then merges the results. This almost always outperforms either approach alone.
How hybrid search works
- Vector search: embed the query, find the top N semantically similar chunks
- Keyword search (BM25): run the same query as a text search, find the top N keyword matches
- Merge results: combine both result sets using a fusion algorithm
- Return top K: the merged, re-ordered results become the context for the LLM
Reciprocal Rank Fusion (RRF)
RRF is the most common algorithm for merging hybrid search results. It scores each document based on its rank in each result set:
RRF score = sum of 1 / (k + rank) for each result set
Where k is a constant (typically 60). Documents that appear high in BOTH result sets get the best combined score.
Example:
| Document | Vector Rank | Keyword Rank | RRF Score |
|---|---|---|---|
| Doc A | 1 | 5 | 1/61 + 1/65 = 0.0318 |
| Doc B | 3 | 1 | 1/63 + 1/61 = 0.0323 |
| Doc C | 2 | 8 | 1/62 + 1/68 = 0.0308 |
Doc B wins — it ranked well in both searches. Doc A was the best vector match but only 5th in keywords. Doc B was the best keyword match AND 3rd in vector search, giving it the highest combined relevance.
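The fusion step can be sketched directly from the formula, reproducing the table above (with the typical k = 60):

```python
def rrf_scores(rankings, k=60):
    # rankings: one dict per result set, mapping document -> rank (1 = best)
    scores = {}
    for ranking in rankings:
        for doc, rank in ranking.items():
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank)
    return scores

vector_ranks = {"Doc A": 1, "Doc B": 3, "Doc C": 2}
keyword_ranks = {"Doc A": 5, "Doc B": 1, "Doc C": 8}

scores = rrf_scores([vector_ranks, keyword_ranks])
for doc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{doc}: {score:.4f}")  # Doc B first with 0.0323
```

Note that RRF only looks at ranks, never at the raw scores — which is why it can merge BM25 scores and cosine similarities, two numbers on completely different scales, without any normalisation.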
Configuring hybrid search in Azure AI Search
```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="documents-index",
    credential=AzureKeyCredential("your-search-key"),
)

# query_embedding: the query vector, computed beforehand with the same
# embedding model that was used to index the documents

# Hybrid search: vector + keyword combined
results = search_client.search(
    search_text="What is the refund policy?",  # Keyword (BM25) search
    vector_queries=[
        VectorizedQuery(
            vector=query_embedding,  # Vector search
            k_nearest_neighbors=10,
            fields="content_vector",
        )
    ],
    query_type="semantic",  # Enable semantic ranking
    semantic_configuration_name="semantic-config",
    top=5,
)

for result in results:
    print(f"Score: {result['@search.score']:.4f} | {result['title']}")
```
What’s happening:
- `search_text` triggers keyword (BM25) search
- `vector_queries` triggers vector search using the query embedding
- `query_type="semantic"` enables an additional semantic re-ranking layer
- Azure AI Search automatically fuses the vector and keyword results using RRF
- `top=5` returns the best 5 combined results
Scenario: Dr. Luca uses domain-specific embeddings for genomics
Dr. Luca Bianchi at GenomeVault is building a RAG system over 50,000 genomics research papers. The general-purpose text-embedding-3-large model misses critical matches:
Problem: Searching for “BRCA1 mutation pathogenicity” returns papers about general cancer genetics but misses papers that use the term “BRCA1 variant of uncertain significance (VUS)” — semantically related but using different terminology.
Root cause: General embeddings don’t understand that “pathogenicity” and “variant of uncertain significance” are closely related in genomics. In general English, these phrases have no connection.
Solution: Dr. Luca evaluates two approaches:
| Approach | Relevance Score | Latency | Cost |
|---|---|---|---|
| General embeddings (text-embedding-3-large) | 3.4 | 120ms | Baseline |
| Hybrid search (general embeddings + BM25) | 4.1 | 150ms | +25% |
Prof. Sarah Lin approves the hybrid approach as the best cost-quality balance. The keyword search catches exact gene names (BRCA1, TP53) that vector search sometimes misses, while vector search catches semantic relationships that keywords miss.
For future work, Dr. Luca may explore training a custom embedding model with sentence-transformers on GenomeVault’s paper corpus — but hybrid search provides immediate improvement without that investment.
Domain-specific embedding optimization
When general embeddings don’t understand your domain’s terminology, you have several options — but fine-tuning Azure OpenAI embedding models directly is NOT one of them.
Azure OpenAI embedding models (text-embedding-3-small, text-embedding-3-large) are pre-trained and NOT fine-tunable. To optimize for your domain, you can: (1) use a larger embedding model for better accuracy, (2) adjust the dimensions parameter to trade accuracy for cost, (3) train a custom embedding model outside Azure OpenAI using frameworks like sentence-transformers.
When to optimize embeddings
| Condition | Approach | Alternative |
|---|---|---|
| General domain, standard vocabulary | No optimization needed | Use text-embedding-3-small/large |
| Specialised terminology (medical, legal, scientific) | Hybrid search first | Combine vector + keyword to catch domain terms |
| Critical accuracy requirements AND specialised vocab | Custom model (sentence-transformers) | Train outside Azure OpenAI on domain pairs |
| Limited data (under 1000 examples) | Hybrid search + prompt engineering | Most cost-effective approach |
Exam tip: Hybrid search almost always outperforms pure approaches
Key exam takeaway: hybrid search (vector + keyword) almost always outperforms either approach alone. This is well-established in information retrieval research.
If a question asks how to improve retrieval quality, and hybrid search is an option, it’s very likely the correct answer. The only exception is if the question specifically asks about reducing latency or simplifying architecture — in those cases, pure vector search is simpler.
Also remember: embedding dimensionality affects both quality AND cost. Higher dimensions = better quality but more storage and slower search.
Knowledge check
Dr. Luca's search for 'TP53 loss of function' using pure vector search returns papers about general protein function loss but misses papers about 'TP53 tumour suppressor inactivation.' Adding keyword search fixes this. Why?
Zara needs to reduce the storage cost of Atlas's vector index by 60% without changing the embedding model. What should she do?
Next up: Fine-Tuning: Methods, Data & Production — the last resort when prompting and RAG aren’t enough.