In real-world searches, users sometimes mistype a word, and the search then returns nothing. In Elasticsearch, you can use the fuzziness parameter to run fuzzy queries that tolerate such errors.

The match query has a fuzziness parameter. It can be set to “0”, “1”, “2” or “AUTO”. “AUTO” is the recommended option; it derives the allowed edit distance from the length of each query term. By default, with AUTO the fuzziness is 0 for terms of 1 or 2 characters, 1 for terms of 3 to 5 characters, and 2 for terms longer than 5 characters.
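The length rule for AUTO can be written out as a small function. This is only a sketch of the rule, not Elasticsearch code: the function name auto_fuzziness and its parameters low and high are made up for illustration, with defaults chosen to mirror the default AUTO thresholds (3 and 6).

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Maximum edit distance allowed for `term` under fuzziness AUTO.

    Hypothetical helper for illustration only; `low` and `high`
    mirror the default AUTO thresholds (3 and 6).
    """
    length = len(term)
    if length < low:      # 0..2 characters: must match exactly
        return 0
    if length < high:     # 3..5 characters: one edit allowed
        return 1
    return 2              # more than 5 characters: two edits allowed


if __name__ == "__main__":
    for term in ["is", "ski", "bxxe", "search"]:
        print(term, auto_fuzziness(term))
```

Under this rule, a three-letter term such as "ski" gets one edit and a four-letter term such as "bxxe" also gets only one.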

Fuzzy query

Returns documents that contain terms similar to the search term, where similarity is measured by Levenshtein edit distance.

The edit distance is the number of one-character changes required to convert one term into another. These changes can include:

  • Changing a character (box→fox)
  • Deleting a character (black→lack)
  • Inserting a character (sic→sick)
  • Transposing two adjacent characters (act→cat)
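To make the four kinds of change concrete, here is a minimal sketch of an edit distance that counts each of them as one edit (the "optimal string alignment" variant of Damerau-Levenshtein distance). The function name edit_distance is chosen for illustration; this is not how Elasticsearch computes matches internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance where substitution, deletion, insertion and
    transposition of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters of a
    for j in range(cols):
        d[0][j] = j          # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for src, dst in [("box", "fox"), ("black", "lack"), ("sic", "sick"), ("act", "cat")]:
        print(src, "->", dst, edit_distance(src, dst))
```

Each of the four example pairs from the list above comes out at a distance of 1.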

To find similar terms, a fuzzy query creates the set of all possible variations, or expansions, of the search term within the specified edit distance. The query then returns exact matches for each expansion.
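The expansion idea can be illustrated with a brute-force sketch: generate every string one edit away from the search term, then keep only those that actually occur as terms in the index. The names edits1 and index_terms are hypothetical, the alphabet is restricted to lowercase ASCII for brevity, and a real search engine is expected to do this far more efficiently than by enumerating every variation.

```python
import string


def edits1(term: str) -> set:
    """All strings within one edit of `term` (lowercase ASCII only)."""
    letters = string.ascii_lowercase
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {left + c + right[1:] for left, right in splits if right for c in letters}
    inserts = {left + c + right for left, right in splits for c in letters}
    return deletes | transposes | replaces | inserts


if __name__ == "__main__":
    # Assumed set of indexed terms, e.g. from the text "I like blue sky"
    index_terms = {"i", "like", "blue", "sky"}
    # Expansions of "ski" that exist in the index
    print(edits1("ski") & index_terms)
```

Only the expansions that exist in the index contribute matches; here the single surviving expansion of "ski" is "sky".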

Example

We start by indexing the following document into the fuzzyindex index:

PUT fuzzyindex/_doc/1
{
  "content": "I like blue sky"
}

If we now run the following search:

GET fuzzyindex/_search
{
  "query": {
    "match": {
      "content": "ski"
    }
  }
}

There are no search results, because tokenizing “I like blue sky” does not produce the term ski.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

If we instead add a fuzziness of 1 to the search:

GET fuzzyindex/_search
{
  "query": {
    "match": {
      "content": {
        "query": "ski",
        "fuzziness": "1"
      }
    }
  }
}

The result is:

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.19178805,
    "hits" : [
      {
        "_index" : "fuzzyindex",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.19178805,
        "_source" : {
          "content" : "I like blue sky"
        }
      }
    ]
  }
}

This time we found the document we were looking for, because sky and ski differ by only one letter.

Now let’s see what happens if we choose “AUTO” instead:

GET fuzzyindex/_search
{
  "query": {
    "match": {
      "content": {
        "query": "ski",
        "fuzziness": "auto"
      }
    }
  }
}

This returns the same result as above: “ski” has three characters, so AUTO allows one edit, which is enough to match “sky”.

Now suppose we search as follows:

GET fuzzyindex/_search
{
  "query": {
    "match": {
      "content": {
        "query": "bxxe",
        "fuzziness": "auto"
      }
    }
  }
}

This search returns nothing: AUTO allows only one edit for a four-character term, and “bxxe” is two edits away from “blue”. But if we raise the fuzziness to 2:

GET fuzzyindex/_search
{
  "query": {
    "match": {
      "content": {
        "query": "bxxe",
        "fuzziness": "2"
      }
    }
  }
}

The same search can also be written as a fuzzy query:

GET /_search
{
    "query": {
        "fuzzy": {
            "content": {
                "value": "bxxe",
                "fuzziness": "2"
            }
        }
    }
}

With a fuzziness of 2, the search returns the document, because we now tolerate two edits.

Let’s do another experiment:

GET fuzzyindex/_search
{
  "query": {
    "match": {
      "content": {
        "query": "bluo ski",
        "fuzziness": 1
      }
    }
  }
}

The result of the search above is:

{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.40754962,
    "hits" : [
      {
        "_index" : "fuzzyindex",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.40754962,
        "_source" : {
          "content" : "I like blue sky"
        }
      }
    ]
  }
}

The search text “bluo ski” above contains two errors in total. You might wonder whether that exceeds “fuzziness”: 1. It does not: the fuzziness applies to each term separately, not to the total number of errors across the whole query.
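One way to picture the per-term behaviour is to think of the match query as being broken into one fuzzy clause per term. The sketch below builds such a request body as a plain Python dictionary. It is an approximation for illustration only: the helper name per_term_fuzzy_query is made up, splitting on whitespace and lowercasing is a stand-in for the field's real analyzer, and scoring may differ from the actual match query.

```python
import json


def per_term_fuzzy_query(field: str, text: str, fuzziness: int) -> dict:
    """Build a bool/should request body with one fuzzy clause per term.

    Hypothetical helper: whitespace splitting plus lowercasing only
    approximates what the field's analyzer would do.
    """
    terms = text.lower().split()
    clauses = [
        {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}}
        for term in terms
    ]
    return {"query": {"bool": {"should": clauses}}}


if __name__ == "__main__":
    body = per_term_fuzzy_query("content", "bluo ski", 1)
    print(json.dumps(body, indent=2))
```

Each clause carries its own edit budget of 1, so “bluo” can match “blue” and “ski” can match “sky” independently.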

Elasticsearch also has a dedicated fuzzy query, but it works on a single term only: it is a term-level query, so the search value is not analyzed. For one word it behaves like the match query above:

GET fuzzyindex/_search
{
  "query": {
    "fuzzy": {
      "content": {
        "value": "ski",
        "fuzziness": 1
      }
    }
  }
}

The above search returns:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.19178805,
    "hits" : [
      {
        "_index" : "fuzzyindex",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.19178805,
        "_source" : {
          "content" : "I like blue sky"
        }
      }
    ]
  }
}

 

Conclusion

Fuzziness is a simple way to handle spelling errors, but it comes with high CPU overhead and low accuracy, since it can also match terms the user did not intend.

 

Reference:

[1] www.elastic.co/guide/en/el…