How does Elasticsearch implement case-insensitive query/aggregation?

1. Real-world problems

Several case-sensitivity questions have come up in the community recently:

Question 1: How can ES queries and aggregations be made case-insensitive?

Question 2: How do I implement a case-insensitive fuzzy query in ES 7.6? Mainly, how should the analyzer and the mapping be set up to achieve this?

I also tried configuring the settings and mapping fields myself, but I was worried about getting them wrong.

Since many readers ask similar questions, it is worth laying out a complete line of reasoning and a solution.

This may be the mission of Mingyi World public account.

This is not a complicated issue, so this article will be brief and to the point!

2. Breaking down the problem

2.1 Breakdown 1: With the default analyzer, are searches case-insensitive?

Yes. The default analyzer is the standard analyzer, and searches that go through it are case-insensitive.

The relevant part of the official documentation explains why:

The core token filter of the standard analyzer is the Lower Case Token Filter.

What does that mean? Uppercase English characters are converted to lowercase.
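To make the lowercasing step concrete, here is a minimal Python sketch of what the analysis chain does to the text. It is only a model: the function name `standard_analyze` is hypothetical, and a simple `\w+` regex stands in for the real standard tokenizer, which follows Unicode text segmentation rules.

```python
import re


def standard_analyze(text: str) -> list:
    """Rough model of the standard analyzer: tokenize, then lowercase."""
    # Assumption: \w+ approximates the standard tokenizer for plain English text.
    tokens = re.findall(r"\w+", text)
    # This line plays the role of the Lower Case Token Filter.
    return [token.lower() for token in tokens]


for city in ["New York", "new York", "New york", "NEW YORK"]:
    print(city, "->", standard_analyze(city))
```

In this model every spelling ends up as the same two tokens, which is why a search on the analyzed field does not care about case.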

2.2 Breakdown 2: Verifying with a demo

DELETE test_003
PUT test_003
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "keyword": {
        "type": "keyword"
      }
    }
  }
}

POST test_003/_bulk
{"index": {"_id":1}}
{ "city": "New York"}
{"index": {"_id":2}}
{ "city": "new York"}
{"index": {"_id":3}}
{ "city": "New york"}
{"index": {"_id":4}}
{ "city": "NEW YORK"}
{"index": {"_id":5}}
{ "city": "Seattle"}


POST test_003/_analyze
{
  "text": "New york",
  "analyzer": "standard"
}

POST test_003/_search
{
  "query": {
    "match_phrase": {
      "city": "new york"
    }
  }
}

The match_phrase search returns a clear result: the documents with _id 1, 2, 3, and 4 are all matched.

The preliminary conclusion is that the default standard analyzer makes searches case-insensitive.

But what about aggregation?

GET test_003/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.keyword"
      }
    }
  }
}

The result is as follows:

"aggregations" : {
    "cities" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "NEW YORK",
          "doc_count" : 1
        },
        {
          "key" : "New York",
          "doc_count" : 1
        },
        {
          "key" : "New york",
          "doc_count" : 1
        },
        {
          "key" : "Seattle",
          "doc_count" : 1
        },
        {
          "key" : "new York",
          "doc_count" : 1
        }
      ]
    }
}

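Here are the key points:

The city field is not declared in the mapping, so dynamic mapping creates it as a multi-field: a text field with a keyword sub-field, city.keyword.
The aggregation runs on the keyword sub-field and does not go through the standard analyzer.

Since the keyword type has come up, let's take a closer look: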

POST test_003/_search
{
  "query": {
    "term": {
      "city.keyword": "new york"
    }
  }
}

With exact matching, nothing is returned.

How do you explain that? The keyword type is matched exactly, which means that the keyword type on its own cannot be case-insensitive.
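The behavior of a plain keyword field can be sketched in a few lines of Python. This is a toy model, not how Elasticsearch stores data: the names `raw_keyword_values`, `term_query`, and `terms_aggregation` are hypothetical and simply mimic exact-match lookup and bucket counting on unmodified values.

```python
from collections import Counter

# The five city values exactly as they were written, keyed by _id.
raw_keyword_values = {
    1: "New York",
    2: "new York",
    3: "New york",
    4: "NEW YORK",
    5: "Seattle",
}


def term_query(values: dict, wanted: str) -> list:
    """Return the ids whose stored term equals the query term exactly."""
    return [doc_id for doc_id, term in values.items() if term == wanted]


def terms_aggregation(values: dict) -> dict:
    """Count documents per distinct stored term."""
    return dict(Counter(values.values()))


print(term_query(raw_keyword_values, "new york"))
print(terms_aggregation(raw_keyword_values))
```

Because the stored terms keep their original case, the lowercase query term matches none of them, and each spelling becomes its own bucket.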

To summarize so far:

Does the multi-field approach above solve case-insensitivity for both search and aggregation? It handles search through the text field, but the aggregation on city.keyword is still case-sensitive.

If multi-fields are not enough, what else can you do? Don't worry, let's take it step by step.

The idea: the fix has to happen at the mapping stage.

Core principle: convert everything to lowercase by configuring a lowercase filter in the mapping.

This is done with a normalizer, a concept we did not cover in previous articles. It is worth reading up on and mastering.

3. Solutions

First the implementation, then the principle.

PUT caseinsensitive
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "city": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}
  
  
POST caseinsensitive/_bulk
{"index": {"_id":1}}
{ "city": "New York"}
{"index": {"_id":2}}
{ "city": "new York"}
{"index": {"_id":3}}
{ "city": "New york"}
{"index": {"_id":4}}
{ "city": "NEW YORK"}
{"index": {"_id":5}}
{ "city": "Seattle"}


GET caseinsensitive/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "city": "NEW YORK"
        }
      }
    }
  }
}

The result: the documents with _id 1, 2, 3, and 4 are returned.

Notice that we used a term query for the search.

GET caseinsensitive/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": {
        "field": "city"
      }
    }
  }
}

The return result is:

"aggregations" : {
    "cities" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "new york",
          "doc_count" : 4
        },
        {
          "key" : "seattle",
          "doc_count" : 1
        }
      ]
    }
}

The four differently cased spellings of New York are aggregated into a single bucket, which is the result we expected.

4. How the solution works

The heart of the solution is the normalizer.

This concept has not come up in our previous articles on analyzers, so let's introduce it here.

Here is the official description:

The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token.
The normalizer is applied prior to indexing the keyword, as well as at search-time when the keyword field is searched via a query parser such as the match query or via a term-level query such as the term query.

The core points are as follows:

First, normalizer is a property of keyword fields. It works like an analyzer, except that it guarantees a single token, so the single term produced for the keyword can be processed further.
Second, the normalizer is applied before keyword data is indexed, and also at search time when the field is queried with a match or term query.

The further processing mentioned above is what our solution relies on: a lowercase conversion.

Because the normalizer takes effect in both the write phase and the search phase, we get the case-insensitive results we want.
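The two-phase behavior can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not Elasticsearch internals: the class `NormalizedKeywordField` and the function `lowercase_normalizer` are hypothetical names, and the field is reduced to a dictionary of one term per document.

```python
from collections import Counter


def lowercase_normalizer(value: str) -> str:
    """Model of a normalizer with a lowercase filter: one value in, one term out."""
    return value.lower()


class NormalizedKeywordField:
    """Toy keyword field that normalizes at index time and at search time."""

    def __init__(self, normalizer):
        self.normalizer = normalizer
        self.terms = {}  # _id -> indexed term

    def index(self, doc_id: int, value: str) -> None:
        # Write phase: the stored term is the normalized value.
        self.terms[doc_id] = self.normalizer(value)

    def term_query(self, value: str) -> list:
        # Search phase: the query term is normalized the same way before matching.
        wanted = self.normalizer(value)
        return sorted(doc_id for doc_id, term in self.terms.items() if term == wanted)

    def terms_aggregation(self) -> dict:
        # Buckets are built from the normalized terms.
        return dict(Counter(self.terms.values()))


city = NormalizedKeywordField(lowercase_normalizer)
cities = ["New York", "new York", "New york", "NEW YORK", "Seattle"]
for doc_id, value in enumerate(cities, start=1):
    city.index(doc_id, value)

print(city.term_query("NEW YORK"))
print(city.terms_aggregation())
```

Normalizing on both sides is the design point: if only the indexed values were lowercased, a query for "NEW YORK" would still miss.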

5. Summary

If you are familiar with the official documentation, you will recognize our example as the normalizer example from the docs.

The filter here is set to lowercase. Other filters can be configured as well, depending on the needs of the business scenario.
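As an illustration of chaining more than one filter, the sketch below combines lowercasing with a rough stand-in for accent folding, in the spirit of Elasticsearch's asciifolding token filter. The helper names are hypothetical, and the folding step is only an approximation based on Unicode decomposition; the real filter covers more characters.

```python
import unicodedata


def lowercase(value: str) -> str:
    return value.lower()


def fold_accents(value: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_normalizer(*filters):
    """Build a normalizer that applies the given filters in order."""
    def normalizer(value: str) -> str:
        for token_filter in filters:
            value = token_filter(value)
        return value
    return normalizer


normalize = make_normalizer(lowercase, fold_accents)
print(normalize("Zürich"))
print(normalize("NEW YORK"))
```

Whatever filters are chosen, the same chain has to run at index time and at search time for the matching to stay consistent.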

Feel free to leave a comment with other approaches to similar problems.

Let's dig into Elasticsearch together!
