Fuzzy String Matching: Finding Similar Text

When exact matching is not enough

Simple explanation

Imagine searching for “Jon Smith” in a customer database.

An exact search finds nothing — but the customer exists as “John Smith.” A fuzzy match says: “those two strings are 90% similar — probably the same person.” That is fuzzy string matching: measuring how close two strings are, even when they are not identical.

SQL Server 2025 adds built-in functions that calculate the “distance” between strings — how many edits (insertions, deletions, swaps) it takes to turn one into the other. Fewer edits = more similar.

The fuzzy matching functions

Four fuzzy functions — two algorithms, each with distance and similarity variants
Function	Algorithm	Returns	Best For
EDIT_DISTANCE	Damerau-Levenshtein	Integer (number of edits)	Measuring exact number of changes needed
EDIT_DISTANCE_SIMILARITY	Damerau-Levenshtein	0-100 (percentage)	Comparing similarity as a normalised score
JARO_WINKLER_DISTANCE	Jaro-Winkler	Float 0.0-1.0 (distance)	Matching names and short strings where prefix matters
JARO_WINKLER_SIMILARITY	Jaro-Winkler	0-100 (percentage)	Similarity score favouring strings that match from the start

EDIT_DISTANCE: count the changes

EDIT_DISTANCE calculates the minimum number of insertions, deletions, substitutions, and transpositions needed to transform one string into another.

-- How many edits to go from "Colour" to "Color"?
SELECT EDIT_DISTANCE('Colour', 'Color') AS Distance;
-- Returns: 1 (delete the 'u')

-- Find customers whose name is within 2 edits of a search term
SELECT FullName, Email
FROM Customers
WHERE EDIT_DISTANCE(FullName, 'Jon Smith') <= 2;
-- Matches: "John Smith" (1 edit), "Jon Smyth" (2 edits)

EDIT_DISTANCE_SIMILARITY: normalised score

SELECT EDIT_DISTANCE_SIMILARITY('Colour', 'Color') AS Similarity;
-- Returns: 83  (83% similar)

-- Find products with names at least 80% similar to the search
SELECT Name, EDIT_DISTANCE_SIMILARITY(Name, @SearchTerm) AS Score
FROM Products
WHERE EDIT_DISTANCE_SIMILARITY(Name, @SearchTerm) >= 80
ORDER BY Score DESC;

The formula: (1 - (edit_distance / GREATEST(LEN(string1), LEN(string2)))) * 100

JARO_WINKLER: prefix-weighted matching

Jaro-Winkler gives extra weight to strings that match from the beginning — making it ideal for matching names where the first few characters are usually correct.

-- Compare "Martha" and "Marhta" (transposition)
SELECT JARO_WINKLER_DISTANCE('Martha', 'Marhta') AS Distance;
-- Returns: ~0.039 (very close — low distance = high similarity)

SELECT JARO_WINKLER_SIMILARITY('Martha', 'Marhta') AS Similarity;
-- Returns: 96 (96% similar)

When to use which algorithm?

Scenario	Best Algorithm	Why
Name matching (people, companies)	Jaro-Winkler	Prefix weighting helps — “John” vs “Jon” scores higher than “ohnJ” vs “Jon”
General typo detection	Edit Distance	Counts all edits equally — good for any position
Address matching	Edit Distance	Addresses have errors anywhere, not just at the end
Short strings (2-5 chars)	Jaro-Winkler	More sensitive to small differences
Long strings (50+ chars)	Edit Distance	More meaningful absolute counts
Deduplication	EDIT_DISTANCE_SIMILARITY	Normalised score makes threshold setting easier across varying string lengths

Exam tip: Preview features require configuration

Fuzzy matching functions are currently in preview. To use them:

ALTER DATABASE SCOPED CONFIGURATION SET PREVIEW_FEATURES = ON;

The exam may test whether you know this prerequisite. If a question mentions fuzzy functions failing, check whether PREVIEW_FEATURES is enabled.

Scenario: Priya's customer deduplication at Vault Bank

Vault Bank has 2 million customer records from three merged systems. Many are duplicates with slight name variations: “Robert Chen” / “Rob Chen” / “Robert C Chen.” Priya builds a deduplication query:

SELECT a.CustomerId, a.FullName, b.CustomerId, b.FullName,
       EDIT_DISTANCE_SIMILARITY(a.FullName, b.FullName) AS NameScore,
       JARO_WINKLER_SIMILARITY(a.Email, b.Email) AS EmailScore
FROM Customers a
INNER JOIN Customers b ON a.CustomerId < b.CustomerId
WHERE EDIT_DISTANCE_SIMILARITY(a.FullName, b.FullName) >= 85
  AND JARO_WINKLER_SIMILARITY(a.Email, b.Email) >= 90;

She uses both algorithms together — Edit Distance for name similarity and Jaro-Winkler for email prefix matching — to find likely duplicates with high confidence.

Question

What does EDIT_DISTANCE measure?

Click or press Enter to reveal answer

Answer

The minimum number of insertions, deletions, substitutions, and transpositions needed to transform one string into another (Damerau-Levenshtein algorithm). A distance of 0 means identical strings. Lower distance = more similar.

Click to flip back

Question

Why is Jaro-Winkler better than Edit Distance for name matching?

Click or press Enter to reveal answer

Answer

Jaro-Winkler gives extra weight to strings that match from the beginning (prefix weighting). Since people's names usually have the correct first few characters (even with typos), this makes it more accurate for name matching. 'John' vs 'Jon' scores higher with Jaro-Winkler than with Edit Distance.

Click to flip back

Question

What prerequisite is needed to use fuzzy matching functions in SQL Server 2025?

Click or press Enter to reveal answer

Answer

Fuzzy matching functions are in preview and require: ALTER DATABASE SCOPED CONFIGURATION SET PREVIEW_FEATURES = ON; Without this, the functions will not be available.

Click to flip back

Knowledge Check

Ingrid at Nordic Shield Insurance needs to match incoming claim names against the policyholder database. Names often have minor typos but the first few characters are usually correct. Which function is most appropriate?

Next up: AI-Assisted SQL with GitHub Copilot — use AI tools to write, debug, and optimise T-SQL code.