Vector Data: Types, Indexes, and Storage

Storing vectors in SQL

Simple explanation

Think of GPS coordinates for every piece of data.

Regular data has an address (a primary key). Vector data has GPS coordinates in a high-dimensional space — except instead of latitude/longitude (2 dimensions), embeddings have 256 to 3,072 dimensions. SQL Server 2025 adds a native VECTOR data type to store these coordinates efficiently, and vector indexes to find nearby points quickly.

The VECTOR data type

-- Declare a vector column with specific dimensions
CREATE TABLE Products (
    ProductId INT NOT NULL PRIMARY KEY,
    Name NVARCHAR(200) NOT NULL,
    DescriptionEmbedding VECTOR(1536) NULL  -- 1536-dimension vector
);

-- Insert a vector value (JSON array format)
INSERT INTO Products (ProductId, Name, DescriptionEmbedding)
VALUES (1, 'Wireless Mouse', '[0.0123, -0.0456, 0.0789, ...]');

Choosing vector dimensions

Embedding Model	Dimensions	Storage Per Vector	Quality
text-embedding-3-small	1536	~6 KB	Good for most use cases
text-embedding-3-large	3072	~12 KB	Higher quality, more storage
text-embedding-ada-002	1536	~6 KB	Legacy, still widely used
Custom/smaller models	256-768	1-3 KB	Faster search, lower quality

Note: Standard VECTOR supports max 1998 dimensions with float32. For 3072 dimensions (e.g., text-embedding-3-large), use the PARAMETERS option to request reduced dimensions from the model, or use float16 preview support.

Vector functions

VECTOR_NORMALIZE

Normalises a vector to unit length (magnitude = 1). Essential before using cosine distance.

-- Normalise vectors for consistent distance calculations
UPDATE Products
SET DescriptionEmbedding = VECTOR_NORMALIZE(DescriptionEmbedding, 'norm2');

VECTORPROPERTY

Inspects vector metadata:

-- Get the number of dimensions
SELECT VECTORPROPERTY(DescriptionEmbedding, 'Dimensions') AS Dims
FROM Products WHERE ProductId = 1;
-- Returns: 1536

VECTOR_DISTANCE

Calculates the distance between two vectors:

-- Find the 5 most similar products to a query vector
SELECT TOP 5 ProductId, Name,
    VECTOR_DISTANCE('cosine', DescriptionEmbedding, @queryVector) AS Distance
FROM Products
ORDER BY Distance ASC;  -- Lower distance = more similar

Distance metrics

Distance metrics — cosine is the default for text embeddings
Metric	Range	Interpretation	Best For
Cosine	0 to 2	0 = identical direction, 2 = opposite	Text embeddings (most common)
Euclidean (L2)	0 to infinity	0 = same point, larger = farther	When magnitude matters (image features)
Dot product	Depends on normalisation	Smaller = more similar (negative dot product in SQL Server)	Normalised vectors, high-performance search

Vector indexes

Without an index, vector search requires scanning every row (brute force). Vector indexes enable fast approximate nearest neighbour (ANN) search.

-- Create a vector index
CREATE VECTOR INDEX IX_Products_Embedding
ON Products (DescriptionEmbedding)
WITH (
    METRIC = 'cosine',
    TYPE = 'DISKANN'
);

DiskANN index

SQL Server uses DiskANN (Disk-based Approximate Nearest Neighbours) — Microsoft Research’s algorithm optimised for SSD-based storage. Key properties:

Handles billions of vectors
Works with data larger than memory
Configurable accuracy vs speed trade-off

Exam tip: Vector index sizing

Vector data is large. Plan storage carefully:

1 million rows x 1536 dimensions x 4 bytes = ~6 GB just for vectors
Add the DiskANN index overhead (~20-30% additional)
Total: ~8 GB for 1M vectors with 1536 dimensions

If the exam asks about storage planning for vector data, remember to account for both the vector column AND the index overhead. Smaller dimensions (256-768) significantly reduce storage requirements.

Question

What is the VECTOR data type in SQL Server 2025?

Click or press Enter to reveal answer

Answer

A native data type that stores fixed-dimension floating-point arrays (embeddings). Declared as VECTOR(N) where N is the number of dimensions (e.g., VECTOR(1536)). Values are stored in an optimised binary format and inserted as JSON arrays.

Click to flip back

Question

Why should you normalise vectors before searching?

Click or press Enter to reveal answer

Answer

VECTOR_NORMALIZE converts vectors to unit length (magnitude = 1). This ensures cosine distance works correctly — without normalisation, vectors with different magnitudes may produce misleading distance values even if they point in the same direction.

Click to flip back

Question

What distance metric should you use for text embeddings?

Click or press Enter to reveal answer

Answer

Cosine distance is the standard for text embeddings. It measures the angle between vectors (direction), ignoring magnitude. A cosine distance of 0 means identical direction (most similar), 2 means opposite. Most embedding models are optimised for cosine similarity.

Click to flip back

Question

What is DiskANN?

Click or press Enter to reveal answer

Answer

DiskANN (Disk-based Approximate Nearest Neighbours) is Microsoft Research's vector index algorithm used in SQL Server. It is optimised for SSD storage, handles datasets larger than memory, and provides configurable accuracy vs speed trade-offs. It is the default vector index type.

Click to flip back

Knowledge Check

Priya at Vault Bank is designing a vector search system for 5 million customer support tickets. Each ticket has a 1536-dimension embedding. She needs sub-second search latency. What should she create?

Next up: Vector Search: Distance, ANN, and ENN — implement vector search with approximate and exact nearest neighbours.