Image on Internet
Introduction
In the world of search engines and data retrieval, achieving high accuracy and relevance in the results is a constant challenge. One of the techniques used to improve search results is Fuzzy Search.
This blog post will delve into the concept of fuzzy search, its implementation using the Levenshtein Distance, and how to test its effectiveness.
Understanding Fuzzy Search
Fuzzy search is a powerful tool for finding approximate matches in a dataset.
It’s particularly useful when dealing with human language, where small typos or variations in spelling can make exact matching impractical.
Understanding the Levenshtein Distance
The Levenshtein distance, named after Vladimir Levenshtein, is a measure of the difference between two strings,
It’s defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
Fuzzy Search Utilizing Levenshtein Distance
Fuzzy search becomes more powerful and flexible when combined with the Levenshtein distance. When performing a fuzzy search using the Levenshtein distance, we can specify a threshold of how “fuzzy” our search can be.
For instance, with a Levenshtein distance of 1, we can find all words in our dataset that can be obtained by changing the original word with at most one single-character edit.
Testing Fuzzy Search with a Sample
Sample 01: Fuzzy search for a dataset Composed of Single Words
Imagine we have a list of words: “brush,” “blush,” “crush,” “brash,” “bush,” “rush,” and “brunch.”
Now, let’s say we’re looking for the word “brush” using a fuzzy search, which means we’re okay with small changes in spelling.
We set a rule that allows only one change in the word we’re searching for, like replacing a letter, adding a letter, or removing a letter. This rule is called the Levenshtein distance of 1.
When we search for “brush” the fuzzy search algorithm finds similar words from our list that are within one change away. Here’s what it finds:
- “brush” – Exactly the same, no changes needed.
- “blush” – Replace ‘r’ with ‘l’ in “brush.”
- “crush” – Replace ‘b’ with ‘c’ in “brush.”
- “brash” – Replace ‘u’ with ‘a’ in “brush.”
- “bush” – Delete ‘r’ from “brush.”
- “rush” – Delete ‘b’ from “brush.”
But, it doesn’t include “brunch” because it requires more than one change to turn “brush” into “brunch.” We would need to change ‘s’ to ‘n’ and add a ‘c’ after ‘n,’ which is two changes, exceeding the allowed one change in the Levenshtein distance of 1.
So, “brunch” isn’t in the search results because it needs more changes than the fuzzy search allows (the Levenshtein distance of 1).
Sample 02: Fuzzy search for a dataset Composed of Two Words
Imagine we have a list of cool subjects we can learn about like [“data science”, “machine learning”, “artificial intelligence”, “natural language processing”].
Search keyword: “mahcine learnign”, with Levenshtein distance of 2.
To figure out if these two phrases are similar, we break them down into smaller parts, or tokens, like breaking a cookie into pieces. So, “machine learning” becomes [“machine”, “learning”].
For transforming “mahcine learnign” into “machine learning”, Here’s what it finds:
- Delete “h” from “mahcine” to get “macine”.
- Insert “h” after “c” in “macine” to get “machine”.
- Delete “g” from “learnign” to get “learnin”.
- Insert “g” after “n” in “learnin” to get “learning”.
So, it takes 4 operations (2 deletions and 2 insertions) to transform “mahcine learnign” into “machine learning”. This exceeds the maximum of 2 changes (Levenshtein distance of 2) we mentioned earlier. Therefore, “mahcine learnign” would not be a match for “machine learning” in a fuzzy search allowing a maximum of 2 changes.
Conclusion
Fuzzy search, especially when combined with techniques like the Levenshtein distance, can significantly enhance the user experience by returning more relevant search results and accommodating minor discrepancies in the search query. Like any algorithm, it’s important to continually test and refine the fuzzy search to ensure it meets the needs of the users.
Remember, the effectiveness of fuzzy search can depend on the specific use case and the nature of the data being searched. It’s always important to consider the context when deciding whether to use fuzzy search or another search method.
References
Fuzzy search – Azure AI Search | Microsoft Learn
What is a Fuzzy Search? | AddSearch
Understanding How Fuzzy Search Works: A Comprehensive Guide – Expertrec