Fuzzy String Matching: Finding Similar Text
Use EDIT_DISTANCE and JARO_WINKLER functions to find strings that are close but not identical — catch typos, match misspelled names, and deduplicate messy data.
When exact matching is not enough
Imagine searching for “Jon Smith” in a customer database.
An exact search finds nothing — but the customer exists as “John Smith.” A fuzzy match says: “those two strings are 90% similar — probably the same person.” That is fuzzy string matching: measuring how close two strings are, even when they are not identical.
SQL Server 2025 adds built-in functions that calculate the “distance” between strings — how many edits (insertions, deletions, swaps) it takes to turn one into the other. Fewer edits = more similar.
The fuzzy matching functions
| Function | Algorithm | Returns | Best For |
|---|---|---|---|
| EDIT_DISTANCE | Damerau-Levenshtein | Integer (number of edits) | Measuring exact number of changes needed |
| EDIT_DISTANCE_SIMILARITY | Damerau-Levenshtein | 0-100 (percentage) | Comparing similarity as a normalised score |
| JARO_WINKLER_DISTANCE | Jaro-Winkler | Float 0.0-1.0 (distance) | Matching names and short strings where prefix matters |
| JARO_WINKLER_SIMILARITY | Jaro-Winkler | 0-100 (percentage) | Similarity score favouring strings that match from the start |
EDIT_DISTANCE: count the changes
EDIT_DISTANCE calculates the minimum number of insertions, deletions, substitutions, and transpositions needed to transform one string into another.
-- How many edits to go from "Colour" to "Color"?
SELECT EDIT_DISTANCE('Colour', 'Color') AS Distance;
-- Returns: 1 (delete the 'u')
-- Find customers whose name is within 2 edits of a search term
SELECT FullName, Email
FROM Customers
WHERE EDIT_DISTANCE(FullName, 'Jon Smith') <= 2;
-- Matches: "John Smith" (1 edit), "Jon Smyth" (2 edits)
EDIT_DISTANCE_SIMILARITY: normalised score
SELECT EDIT_DISTANCE_SIMILARITY('Colour', 'Color') AS Similarity;
-- Returns: 83 (83% similar)
-- Find products with names at least 80% similar to the search
SELECT Name, EDIT_DISTANCE_SIMILARITY(Name, @SearchTerm) AS Score
FROM Products
WHERE EDIT_DISTANCE_SIMILARITY(Name, @SearchTerm) >= 80
ORDER BY Score DESC;
The formula: (1 - (edit_distance / GREATEST(LEN(string1), LEN(string2)))) * 100
JARO_WINKLER: prefix-weighted matching
Jaro-Winkler gives extra weight to strings that match from the beginning — making it ideal for matching names where the first few characters are usually correct.
-- Compare "Martha" and "Marhta" (transposition)
SELECT JARO_WINKLER_DISTANCE('Martha', 'Marhta') AS Distance;
-- Returns: ~0.039 (very close — low distance = high similarity)
SELECT JARO_WINKLER_SIMILARITY('Martha', 'Marhta') AS Similarity;
-- Returns: 96 (96% similar)
When to use which algorithm?
| Scenario | Best Algorithm | Why |
|---|---|---|
| Name matching (people, companies) | Jaro-Winkler | Prefix weighting helps — “John” vs “Jon” scores higher than “ohnJ” vs “Jon” |
| General typo detection | Edit Distance | Counts all edits equally — good for any position |
| Address matching | Edit Distance | Addresses have errors anywhere, not just at the end |
| Short strings (2-5 chars) | Jaro-Winkler | More sensitive to small differences |
| Long strings (50+ chars) | Edit Distance | More meaningful absolute counts |
| Deduplication | EDIT_DISTANCE_SIMILARITY | Normalised score makes threshold setting easier across varying string lengths |
Exam tip: Preview features require configuration
Fuzzy matching functions are currently in preview. To use them:
ALTER DATABASE SCOPED CONFIGURATION SET PREVIEW_FEATURES = ON;The exam may test whether you know this prerequisite. If a question mentions fuzzy functions failing, check whether PREVIEW_FEATURES is enabled.
Scenario: Priya's customer deduplication at Vault Bank
Vault Bank has 2 million customer records from three merged systems. Many are duplicates with slight name variations: “Robert Chen” / “Rob Chen” / “Robert C Chen.” Priya builds a deduplication query:
SELECT a.CustomerId, a.FullName, b.CustomerId, b.FullName,
EDIT_DISTANCE_SIMILARITY(a.FullName, b.FullName) AS NameScore,
JARO_WINKLER_SIMILARITY(a.Email, b.Email) AS EmailScore
FROM Customers a
INNER JOIN Customers b ON a.CustomerId < b.CustomerId
WHERE EDIT_DISTANCE_SIMILARITY(a.FullName, b.FullName) >= 85
AND JARO_WINKLER_SIMILARITY(a.Email, b.Email) >= 90;She uses both algorithms together — Edit Distance for name similarity and Jaro-Winkler for email prefix matching — to find likely duplicates with high confidence.
Ingrid at Nordic Shield Insurance needs to match incoming claim names against the policyholder database. Names often have minor typos but the first few characters are usually correct. Which function is most appropriate?
🎬 Video coming soon
Next up: AI-Assisted SQL with GitHub Copilot — use AI tools to write, debug, and optimise T-SQL code.