NashTech Blog

Exploring Fuzzy Search and Levenshtein Distance in Data Retrieval

Table of Contents
Fuzzy Search

Image on Internet

Introduction

In the world of search engines and data retrieval, achieving high accuracy and relevance in the results is a constant challenge. One of the techniques used to improve search results is Fuzzy Search.

This blog post will delve into the concept of fuzzy search, its implementation using the Levenshtein Distance, and how to test its effectiveness.

Understanding Fuzzy Search

Fuzzy search is a powerful tool for finding approximate matches in a dataset.
It’s particularly useful when dealing with human language, where small typos or variations in spelling can make exact matching impractical.

Understanding the Levenshtein Distance

The Levenshtein distance, named after Vladimir Levenshtein, is a measure of the difference between two strings,
It’s defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

Fuzzy Search Utilizing Levenshtein Distance

Fuzzy search becomes more powerful and flexible when combined with the Levenshtein distance. When performing a fuzzy search using the Levenshtein distance, we can specify a threshold of how “fuzzy” our search can be.

For instance, with a Levenshtein distance of 1, we can find all words in our dataset that can be obtained by changing the original word with at most one single-character edit.

Testing Fuzzy Search with a Sample

Sample 01: Fuzzy search for a dataset  Composed of Single Words

Imagine we have a list of words: “brush,” “blush,” “crush,” “brash,” “bush,” “rush,” and “brunch.”

Now, let’s say we’re looking for the word “brush” using a fuzzy search, which means we’re okay with small changes in spelling.

We set a rule that allows only one change in the word we’re searching for, like replacing a letter, adding a letter, or removing a letter. This rule is called the Levenshtein distance of 1.

When we search for “brush” the fuzzy search algorithm finds similar words from our list that are within one change away. Here’s what it finds:

  • “brush” – Exactly the same, no changes needed.
  • “blush” – Replace ‘r’ with ‘l’ in “brush.”
  • “crush” – Replace ‘b’ with ‘c’ in “brush.”
  • “brash” – Replace ‘u’ with ‘a’ in “brush.”
  • “bush” – Delete ‘r’ from “brush.”
  • “rush” – Delete ‘b’ from “brush.”

But, it doesn’t include “brunch” because it requires more than one change to turn “brush” into “brunch.” We would need to change ‘s’ to ‘n’ and add a ‘c’ after ‘n,’ which is two changes, exceeding the allowed one change in the Levenshtein distance of 1.

So, “brunch” isn’t in the search results because it needs more changes than the fuzzy search allows (the Levenshtein distance of 1).

Sample 02: Fuzzy search for a dataset  Composed of Two Words

Imagine we have a list of cool subjects we can learn about like [“data science”, “machine learning”, “artificial intelligence”, “natural language processing”].

Search keyword: “mahcine learnign”, with Levenshtein distance of 2.

To figure out if these two phrases are similar, we break them down into smaller parts, or tokens, like breaking a cookie into pieces. So, “machine learning” becomes [“machine”, “learning”].

For transforming “mahcine learnign” into “machine learning”, Here’s what it finds:

  • Delete “h” from “mahcine” to get “macine”.
  • Insert “h” after “c” in “macine” to get “machine”.
  • Delete “g” from “learnign” to get “learnin”.
  • Insert “g” after “n” in “learnin” to get “learning”.

So, it takes 4 operations (2 deletions and 2 insertions) to transform “mahcine learnign” into “machine learning”. This exceeds the maximum of 2 changes (Levenshtein distance of 2) we mentioned earlier. Therefore, “mahcine learnign” would not be a match for “machine learning” in a fuzzy search allowing a maximum of 2 changes.

Conclusion

Fuzzy search, especially when combined with techniques like the Levenshtein distance, can significantly enhance the user experience by returning more relevant search results and accommodating minor discrepancies in the search query. Like any algorithm, it’s important to continually test and refine the fuzzy search to ensure it meets the needs of the users.

Remember, the effectiveness of fuzzy search can depend on the specific use case and the nature of the data being searched. It’s always important to consider the context when deciding whether to use fuzzy search or another search method.

References

Fuzzy search – Azure AI Search | Microsoft Learn

What is a Fuzzy Search? | AddSearch

Understanding How Fuzzy Search Works: A Comprehensive Guide – Expertrec

Picture of hoannguyen1

hoannguyen1

I am a Senior Test Engineer with over 9 years of experience in the software testing industry. I am responsible for managing the testing process within a software development project to ensure the quality standards and expected performance.

Leave a Comment

Your email address will not be published. Required fields are marked *

Suggested Article

Scroll to Top