Embeddings & Hybrid Search
Vector search alone isn't enough. Learn to select embedding models, implement hybrid search combining semantic and keyword retrieval, and optimize for domain-specific accuracy.
How embeddings power search
Embeddings translate words into coordinates on a map.
Imagine a giant map where every word and sentence has a location. “Dog” and “puppy” are neighbours. “Dog” and “refrigerator” are on opposite sides of the map. “Bank” (money) and “bank” (river) are in completely different neighbourhoods.
When you search, the system finds your query’s location on the map and returns whatever is closest. “What’s the refund policy?” lands near documents about returns, refunds, and exchanges — even if those documents don’t use the exact word “refund.”
That’s the magic: embeddings understand meaning, not just matching words.
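The "closeness" idea is just vector math: each text becomes a list of numbers, and similarity is the cosine of the angle between two such lists. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot product / (product of the two magnitudes)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values only, not real model output
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.15]
refrigerator = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, puppy))         # close to 1.0 — neighbours on the map
print(cosine_similarity(dog, refrigerator))  # much lower — opposite sides of the map
```

A vector search engine does exactly this comparison, just at scale: it stores one vector per chunk and returns the chunks whose vectors are closest to the query's vector.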
Embedding model selection
Azure OpenAI offers several embedding models with different trade-offs:
| Model | Dimensions | Max Input | Relative Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 tokens | Good — general purpose | Lowest |
| text-embedding-3-large | 3072 | 8191 tokens | Best — highest accuracy | ~6x more than small |
| text-embedding-ada-002 | 1536 | 8191 tokens | Legacy — still works | Between small and large |
Choosing the right model
| Use Case | Recommended Model | Why |
|---|---|---|
| General-purpose chatbot | text-embedding-3-small | Good quality, lowest cost, fast |
| High-accuracy domain search | text-embedding-3-large | Best quality, worth the cost for critical apps |
| Budget-constrained, high volume | text-embedding-3-small (reduced dimensions) | Can reduce to 256 dims for cost savings |
| Scientific/medical domain | Domain-specific model or fine-tuned | General models miss specialised terminology |
Dimensionality trade-off
The text-embedding-3 models support dimension reduction — you can request fewer dimensions to save storage and speed up search:
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Full dimensions (1536)
response_full = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is the refund policy?",
)
# response_full.data[0].embedding → 1536-dim vector

# Reduced dimensions (256) — smaller, faster, slightly less accurate
response_small = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is the refund policy?",
    dimensions=256,
)
# response_small.data[0].embedding → 256-dim vector
```
What’s happening:
- The first call is a standard embedding request — it returns a 1536-dimensional vector
- The second call passes `dimensions=256` to the same model — it returns a 256-dimensional vector
- Fewer dimensions = smaller index, faster search, but slightly lower accuracy
- For most applications, 512-1024 dimensions provide a good balance
Exam tip: Dimensionality affects quality AND cost
The exam tests the trade-off:
- Higher dimensions = better quality, larger index, slower search, more storage cost
- Lower dimensions = slightly lower quality, smaller index, faster search, less storage
- text-embedding-3-small at 256 dimensions can be sufficient for many use cases while being significantly cheaper to store and search
- You CANNOT increase dimensions beyond the model’s maximum (1536 for small, 3072 for large)
If a question asks how to reduce search latency or storage cost without changing models, the answer is reduce embedding dimensions.
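The storage side of the trade-off is simple arithmetic. A back-of-the-envelope sketch, assuming 4-byte float32 values per dimension (typical, though real indexes add structural overhead on top of the raw vectors):

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    # Raw vector storage only — actual index overhead varies by search service
    return num_vectors * dimensions * bytes_per_value / 1024**3

docs = 1_000_000
print(f"1536 dims: {index_size_gb(docs, 1536):.2f} GB")  # ~5.72 GB
print(f"256 dims:  {index_size_gb(docs, 256):.2f} GB")   # ~0.95 GB
```

Same million documents, same model — dropping from 1536 to 256 dimensions cuts raw vector storage by a factor of six.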
Vector search vs keyword search
Neither vector search nor keyword search is universally better — they excel at different things:
| Feature | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Vector Search | Understands meaning and synonyms; finds semantically similar results even with different words | Misses exact terms (product codes, IDs, names); can match unrelated content with surface-level similarity | Natural language questions, conceptual queries |
| Keyword Search (BM25) | Exact term matching; great for codes, names, specific phrases; fast and well-understood | Misses synonyms and paraphrases; 'car' won't match 'automobile' | Specific lookups, product codes, legal citations |
Example of where each fails:
| Query | Vector Search | Keyword Search |
|---|---|---|
| “How do I return an item?” | Finds documents about refund policy, returns process, exchanges (correct) | Only finds docs containing “return” and “item” (misses docs about “refund process”) |
| “Policy ABC-2024-Q3” | Might return any policy document (wrong) | Finds the exact policy by ID (correct) |
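The keyword failure mode in the first row can be shown with a toy term-overlap scorer — a stand-in for BM25 (real BM25 also weights term rarity and document length):

```python
def keyword_score(query, document):
    # Count how many query terms literally appear in the document
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

query = "how do i return an item"
doc_a = "to return an item bring it to any store"
doc_b = "our refund process takes five business days"

print(keyword_score(query, doc_a))  # 3 shared terms — matches
print(keyword_score(query, doc_b))  # 0 — same topic, zero shared words
```

`doc_b` answers the user's question but shares no vocabulary with it, so any purely lexical scorer ranks it at zero — which is exactly the gap vector search fills.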
Hybrid search: the best of both worlds
Hybrid search combines vector (semantic) and keyword (BM25) retrieval, then merges the results. This almost always outperforms either approach alone.
How hybrid search works
- Vector search: embed the query, find the top N semantically similar chunks
- Keyword search (BM25): run the same query as a text search, find the top N keyword matches
- Merge results: combine both result sets using a fusion algorithm
- Return top K: the merged, re-ordered results become the context for the LLM
Reciprocal Rank Fusion (RRF)
RRF is the most common algorithm for merging hybrid search results. It scores each document based on its rank in each result set:
RRF score = sum of 1 / (k + rank) for each result set
Where k is a constant (typically 60). Documents that appear high in BOTH result sets get the best combined score.
Example:
| Document | Vector Rank | Keyword Rank | RRF Score |
|---|---|---|---|
| Doc A | 1 | 5 | 1/61 + 1/65 = 0.0318 |
| Doc B | 3 | 1 | 1/63 + 1/61 = 0.0323 |
| Doc C | 2 | 8 | 1/62 + 1/68 = 0.0308 |
Doc B wins — it ranked well in both searches. Doc A was the best vector match but only 5th in keywords. Doc B was the best keyword match AND 3rd in vector search, giving it the highest combined relevance.
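The fusion step can be sketched directly from the formula, reproducing the table above (with the typical k = 60):

```python
def rrf_scores(rankings, k=60):
    # rankings: one dict per result set, mapping document -> rank (1 = best)
    scores = {}
    for ranking in rankings:
        for doc, rank in ranking.items():
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank)
    return scores

vector_ranks = {"Doc A": 1, "Doc B": 3, "Doc C": 2}
keyword_ranks = {"Doc A": 5, "Doc B": 1, "Doc C": 8}

scores = rrf_scores([vector_ranks, keyword_ranks])
for doc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{doc}: {score:.4f}")  # Doc B first with 0.0323
```

Note that RRF only looks at ranks, never at the raw scores — which is why it can merge BM25 scores and cosine similarities, two numbers on completely different scales, without any normalisation.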
Configuring hybrid search in Azure AI Search
```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="documents-index",
    credential=AzureKeyCredential("your-search-key"),
)

# query_embedding: the query vector, computed beforehand with the same
# embedding model that was used to index the documents

# Hybrid search: vector + keyword combined
results = search_client.search(
    search_text="What is the refund policy?",  # Keyword (BM25) search
    vector_queries=[
        VectorizedQuery(
            vector=query_embedding,  # Vector search
            k_nearest_neighbors=10,
            fields="content_vector",
        )
    ],
    query_type="semantic",  # Enable semantic ranking
    semantic_configuration_name="semantic-config",
    top=5,
)

for result in results:
    print(f"Score: {result['@search.score']:.4f} | {result['title']}")
```
What’s happening:
- `search_text` triggers keyword (BM25) search
- `vector_queries` triggers vector search using the query embedding
- `query_type="semantic"` enables an additional semantic re-ranking layer
- Azure AI Search automatically fuses the vector and keyword results using RRF
- `top=5` returns the best 5 combined results
Scenario: Dr. Luca uses domain-specific embeddings for genomics
Dr. Luca Bianchi at GenomeVault is building a RAG system over 50,000 genomics research papers. The general-purpose text-embedding-3-large model misses critical matches:
Problem: Searching for “BRCA1 mutation pathogenicity” returns papers about general cancer genetics but misses papers that use the term “BRCA1 variant of uncertain significance (VUS)” — semantically related but using different terminology.
Root cause: General embeddings don’t understand that “pathogenicity” and “variant of uncertain significance” are closely related in genomics. In general English, these phrases have no connection.
Solution: Dr. Luca evaluates two approaches:
| Approach | Relevance Score | Latency | Cost |
|---|---|---|---|
| General embeddings (text-embedding-3-large) | 3.4 | 120ms | Baseline |
| Hybrid search (general embeddings + BM25) | 4.1 | 150ms | +25% |
Prof. Sarah Lin approves the hybrid approach as the best cost-quality balance. The keyword search catches exact gene names (BRCA1, TP53) that vector search sometimes misses, while vector search catches semantic relationships that keywords miss.
For future work, Dr. Luca may explore training a custom embedding model with sentence-transformers on GenomeVault’s paper corpus — but hybrid search provides immediate improvement without that investment.
Domain-specific embedding optimization
When general embeddings don’t understand your domain’s terminology, you have several options — but fine-tuning Azure OpenAI embedding models directly is NOT one of them.
Azure OpenAI embedding models (text-embedding-3-small, text-embedding-3-large) are pre-trained and NOT fine-tunable. To optimize for your domain, you can: (1) use a larger embedding model for better accuracy, (2) adjust the dimensions parameter to trade accuracy for cost, (3) train a custom embedding model outside Azure OpenAI using frameworks like sentence-transformers.
When to optimize embeddings
| Condition | Approach | Alternative |
|---|---|---|
| General domain, standard vocabulary | No optimization needed | Use text-embedding-3-small/large |
| Specialised terminology (medical, legal, scientific) | Hybrid search first | Combine vector + keyword to catch domain terms |
| Critical accuracy requirements AND specialised vocab | Custom model (sentence-transformers) | Train outside Azure OpenAI on domain pairs |
| Limited data (under 1000 examples) | Hybrid search + prompt engineering | Most cost-effective approach |
Exam tip: Hybrid search almost always outperforms pure approaches
Key exam takeaway: hybrid search (vector + keyword) almost always outperforms either approach alone. This is well-established in information retrieval research.
If a question asks how to improve retrieval quality, and hybrid search is an option, it’s very likely the correct answer. The only exception is if the question specifically asks about reducing latency or simplifying architecture — in those cases, pure vector search is simpler.
Also remember: embedding dimensionality affects both quality AND cost. Higher dimensions = better quality but more storage and slower search.
Knowledge check
Dr. Luca's search for 'TP53 loss of function' using pure vector search returns papers about general protein function loss but misses papers about 'TP53 tumour suppressor inactivation.' Adding keyword search fixes this. Why?
Zara needs to reduce the storage cost of Atlas's vector index by 60% without changing the embedding model. What should she do?
Next up: Fine-Tuning: Methods, Data & Production — the last resort when prompting and RAG aren’t enough.