When you perform a search in Elasticsearch, results are sorted so that the documents most relevant to your query rank highest. However, results that are relevant for one application may be less relevant for another. Because Elasticsearch is highly flexible, you can fine-tune it to return the most relevant results for your specific use case. A relatively straightforward way to adjust the results is to add extra conditional clauses to the query sent to Elasticsearch.

In this blog post, I’ll run through some examples that show how you can use the bool query together with match and match_phrase queries to improve search relevance. Before you start, you can check out my previous article “Elastic: A Beginner’s Guide” to set up your own Elasticsearch cluster.

Create the sample documents in Elasticsearch

To demonstrate the concepts in this blog, we’ll start by indexing a few documents into Elasticsearch. These documents will be queried throughout the blog. Our demo documents can be written to Elasticsearch as follows:

POST _bulk
{ "index" : { "_index" : "demo_idx", "_id": 1} }
{"content":"Distributed nature, simple REST APIs, speed, and scalability"}
{ "index" : { "_index" : "demo_idx", "_id": 2} }
{"content":"Distributed nature, simple APIs, speed, and scalability"}
{ "index" : { "_index" : "demo_idx", "_id": 3} }
{"content":"Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."}

We entered the above command in Kibana’s Dev Tools. This creates the documents we need. We use only a small number of documents so that the behavior of each search is easier to see. After completing this tutorial, you’ll be able to apply the same techniques to larger data sets, but for now we’ll keep it simple.
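If you would rather generate the bulk request body from a script than type it into Dev Tools, the following is a minimal Python sketch. The helper name build_bulk_body is hypothetical, and the sketch only builds the newline-delimited JSON text; sending it to your cluster is left out because it depends on your client and setup.

```python
import json

# The three demo documents from the _bulk request above, keyed by _id.
DOCS = {
    1: "Distributed nature, simple REST APIs, speed, and scalability",
    2: "Distributed nature, simple APIs, speed, and scalability",
    3: ("Known for its simple REST APIs, distributed nature, speed, and "
        "scalability, Elasticsearch is the central component of the Elastic "
        "Stack, a set of open source tools for data ingestion, enrichment, "
        "storage, analysis, and visualization."),
}


def build_bulk_body(index_name, docs):
    """Return NDJSON text: one action line plus one source line per document.

    build_bulk_body is a hypothetical helper name used for illustration.
    """
    lines = []
    for doc_id, content in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"content": content}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body("demo_idx", DOCS)
print(bulk_body)
```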

How Elasticsearch ranks documents

For the rest of this blog, it’s helpful to have a basic understanding of how Elasticsearch calculates the score used to sort documents returned by a query.

Before scoring documents, Elasticsearch first reduces the collection of candidate documents by applying a Boolean test that only includes documents that match the query. A score is then calculated for each document in the set, which determines how the documents are ranked. The score represents the relevance of a given document to a particular query. The default scoring algorithm used by Elasticsearch is BM25. There are three main factors that determine a document’s score:

  • Term frequency (TF): the more times the search term appears in the field we are searching in a document, the more relevant the document is.
  • Inverse document frequency (IDF): the more documents that contain the search term in the field we are searching, the less important that term is.
  • Field length: documents containing the search terms in a very short field (that is, only a few words) are considered more relevant than documents containing the search terms in a longer field (that is, many words).
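To make the three factors concrete, here is a rough Python sketch of the textbook BM25 formula for a single term. It is an illustration under stated assumptions rather than Elasticsearch's actual implementation: Lucene's scoring differs in constant factors, and the function and parameter names are invented for this example. The values k1 = 1.2 and b = 0.75 are the defaults Elasticsearch uses.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field (illustrative only)."""
    # IDF: terms that appear in many documents contribute less.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Field length: fields longer than average are penalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # TF: more occurrences help, but with diminishing returns.
    tf = term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
    return idf * tf


# Same term and corpus statistics, but a short field versus a long field.
short_field = bm25_term_score(term_freq=1, doc_len=8, avg_doc_len=16,
                              num_docs=3, docs_with_term=2)
long_field = bm25_term_score(term_freq=1, doc_len=32, avg_doc_len=16,
                             num_docs=3, docs_with_term=2)
print(short_field > long_field)  # True
```

The comparison at the end mirrors the field length factor: with everything else equal, the term in the shorter field scores higher.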

If you want to learn more about ranking documents, see my previous article “Elasticsearch: Distributed Scoring”.

 

A basic match query

The match query is typically used to perform full-text searches. By default, a match query with multiple terms uses the OR operator, which returns documents that match any term in the query. This can result in many matching documents, even though some of them may be only marginally related. A search on the content field of the documents we just indexed looks like this:

GET demo_idx/_search
{
  "query": {
    "match": {
      "content": {
        "query": "simple rest apis distributed nature"
      }
    }
  }
}

The above query is interpreted as: simple OR rest OR apis OR distributed OR nature. When we execute it, the following results are returned:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.2689934,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2689934,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6970792,
        "_source" : {
          "content" : "Distributed nature, simple APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.69611007,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      }
    ]
  }
}

From the search results above, we can see that all three documents matched.
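The Boolean test behind the OR operator can be approximated in a few lines of Python. This is only a sketch: the tokenize function is a simplified stand-in for Elasticsearch's standard analyzer (lowercase, split on non-alphanumeric characters), and matches_any is a hypothetical helper name.

```python
import re

# The three demo documents, keyed by _id.
DOCS = {
    1: "Distributed nature, simple REST APIs, speed, and scalability",
    2: "Distributed nature, simple APIs, speed, and scalability",
    3: ("Known for its simple REST APIs, distributed nature, speed, and "
        "scalability, Elasticsearch is the central component of the Elastic "
        "Stack, a set of open source tools for data ingestion, enrichment, "
        "storage, analysis, and visualization."),
}
QUERY = "simple rest apis distributed nature"


def tokenize(text):
    """Simplified stand-in for the standard analyzer (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_any(query, text):
    """OR semantics: the document matches if it shares at least one term."""
    return bool(set(tokenize(query)) & set(tokenize(text)))


or_hits = [doc_id for doc_id, text in DOCS.items() if matches_any(QUERY, text)]
print(or_hits)  # [1, 2, 3]
```

Note that this only decides which documents match. The order in which they are returned is decided separately by the score.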

In many cases, the above ordering may be just what is needed. In other cases, adjustments may be required. What counts as acceptable depends on the specific requirements of a given application.

  • The first hit is great: it contains all the words we searched for, although not in the order we typed them.
  • The second hit is a reasonable match, but notice that it lacks the word “rest” and that its words are in a different order than in our search.
  • Finally, for some use cases, the third hit could be considered the best match because it contains all the words we searched for in the exact order in which we typed them.

The third hit is ranked below the first two for the following reasons:

  1. A match query using the OR operator does not consider the position of the terms. Therefore, even though the third hit (_id: 3) contains all the search terms in the order in which they were entered, this does not affect the score.
  2. The third hit has a longer content field than the other hits. The field length part of the scoring algorithm, which favors shorter fields, therefore gives it a lower score. In this example, the score of the third hit (_id: 3) drops more because of its long content field than the score of the second hit (_id: 2) drops because it lacks the word “rest”.

Let’s see what happens if we use the AND operator in a match query.

 

Use the AND operator in the match query

You can make the search more specific by using the AND operator in a match query. This returns only documents that contain all the search terms. For a given query, the AND operator returns fewer documents than a match query that uses the OR operator. This means that the result set may be missing some documents that users might consider relevant. The AND search on the content field in our index looks like this:

GET demo_idx/_search
{
  "query": {
    "match": {
      "content": {
        "query": "simple rest apis distributed nature",
        "operator": "and"
      }
    }
  }
}

The above query is interpreted as: simple AND rest AND apis AND distributed AND nature. When we execute it, the following results are returned:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.2689934,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2689934,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.69611007,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      }
    ]
  }
}

This query returns only two hits and excludes the second document we indexed (_id: 2). This is because the second document does not contain the word “rest” in its content field, which is required to satisfy the AND condition. We now get more precise results, but we have removed a match that might be relevant.
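The AND condition can be sketched in Python as a subset test. As before, tokenize is a simplified stand-in for the standard analyzer, and matches_all and missing_terms are hypothetical helper names used only for illustration.

```python
import re

# The three demo documents, keyed by _id.
DOCS = {
    1: "Distributed nature, simple REST APIs, speed, and scalability",
    2: "Distributed nature, simple APIs, speed, and scalability",
    3: ("Known for its simple REST APIs, distributed nature, speed, and "
        "scalability, Elasticsearch is the central component of the Elastic "
        "Stack, a set of open source tools for data ingestion, enrichment, "
        "storage, analysis, and visualization."),
}
QUERY = "simple rest apis distributed nature"


def tokenize(text):
    """Simplified stand-in for the standard analyzer (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all(query, text):
    """AND semantics: every query term must appear somewhere in the field."""
    return set(tokenize(query)).issubset(tokenize(text))


def missing_terms(query, text):
    """Query terms that do not occur in the field, sorted for stable output."""
    return sorted(set(tokenize(query)) - set(tokenize(text)))


and_hits = [doc_id for doc_id, text in DOCS.items() if matches_all(QUERY, text)]
print(and_hits)                       # [1, 3]
print(missing_terms(QUERY, DOCS[2]))  # ['rest']
```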

The second hit (_id: 3) could be considered more relevant than the first hit (_id: 1) because it contains the search terms in the exact order in which they were entered. However, like the OR operator, the AND operator does not consider the position of the terms. In addition, because the content field of the second hit is longer than that of the first hit, the field length part of the scoring algorithm, which favors shorter fields, gives it a lower score.

Let’s see what happens if we use a match_phrase query.

 

match_phrase query

More precise results can be obtained by using a match_phrase query, which returns only documents that exactly match the phrase the user is searching for. This is more stringent than a match query using the AND operator, and therefore returns fewer documents than either of the above two queries. A match_phrase query on the content field of our documents looks like the following:

GET demo_idx/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "simple rest apis distributed nature"
      }
    }
  }
}

The query above matches documents containing the phrase “simple rest apis distributed nature”. In other words, it only returns documents that contain all the search terms, adjacent to one another and in the same order as in the search. Executing the above query returns the following results:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6961101,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.6961101,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      }
    ]
  }
}

Note that this query returns only one hit. We now have a very specific result that matches exactly what the user is searching for, but at the expense of not returning other documents that might be relevant.

None of the approaches above may, on its own, give us the results we need. The rest of this blog focuses on how to get more relevant search results by combining all of the above queries into one query.
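The phrase condition can be approximated in Python by sliding a window over the document's tokens. This sketch assumes the default behavior with no slop (no gaps or reordering allowed), uses a simplified tokenize function in place of the standard analyzer, and the helper name contains_phrase is hypothetical.

```python
import re

# The three demo documents, keyed by _id.
DOCS = {
    1: "Distributed nature, simple REST APIs, speed, and scalability",
    2: "Distributed nature, simple APIs, speed, and scalability",
    3: ("Known for its simple REST APIs, distributed nature, speed, and "
        "scalability, Elasticsearch is the central component of the Elastic "
        "Stack, a set of open source tools for data ingestion, enrichment, "
        "storage, analysis, and visualization."),
}
QUERY = "simple rest apis distributed nature"


def tokenize(text):
    """Simplified stand-in for the standard analyzer (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(query, text):
    """Phrase semantics: the query terms must appear consecutively, in order."""
    wanted = tokenize(query)
    tokens = tokenize(text)
    # Compare the query against every window of the same length.
    return any(tokens[start:start + len(wanted)] == wanted
               for start in range(len(tokens) - len(wanted) + 1))


phrase_hits = [doc_id for doc_id, text in DOCS.items()
               if contains_phrase(QUERY, text)]
print(phrase_hits)  # [3]
```

Document 1 contains all five terms but in a different order, so it passes the AND test and fails the phrase test.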

 

Combine OR, AND, and match_phrase queries

We might want exact matches to rank high in the search results, but we might also want less relevant documents to appear further down in the results. Here we show how to use the should clause of a bool query to combine the OR, AND, and match_phrase queries to meet this requirement. The bool query takes a more-matches-is-better approach, so the score from each matching should clause contributes to the final _score of each document.

Previous searches can be combined into a single should clause, as follows:

GET demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "simple rest apis distributed nature"
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "simple rest apis distributed nature",
              "operator": "and"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "simple rest apis distributed nature"
            }
          }
        }
      ]
    }
  }
}

The above query evaluates each should clause and increases the score for each clause that matches. Any document that matches the match_phrase query will (by definition) also match the AND and OR match queries. Similarly, any document that matches the AND query will (by definition) also match the OR query. As a result, we might expect documents that match the match_phrase query to rank higher than documents that match only the AND or OR queries. However, the above query returns the following results, which may not be exactly what we expected:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.5379868,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.5379868,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.0883303,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6970792,
        "_source" : {
          "content" : "Distributed nature, simple APIs, speed, and scalability"
        }
      }
    ]
  }
}

That’s pretty good, but we may not consider it perfect. We got hits for all the relevant documents, but the order of the hits is not exactly what we expected. We might expect the second hit (_id: 3) to rank first. After all, the second hit exactly matches the phrase we are searching for (and therefore matches all three should clauses), while the first hit (_id: 1) matches only the AND and OR clauses. Why is the second hit (_id: 3) not ranked first?

The documents are sorted in this order because the content field of the second hit (_id: 3) is longer than that of the other hits, so the score assigned to the document by each should clause (OR, AND, and match_phrase) is reduced by the field length component of the scoring algorithm. In this case, the score increase from the matching match_phrase clause is not enough to offset the score decrease caused by the field length.

If we really want to ensure that exact phrase matches are shown before other matches, we can boost individual clauses as described in the next section.
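The arithmetic can be checked with a short Python sketch that adds up the per-clause scores reported in the three individual responses earlier in this post. The helper name combined_score is hypothetical, and small differences in the last digits are expected because Elasticsearch reports single-precision values.

```python
# Per-clause scores copied from the three individual responses above.
OR_SCORES = {"1": 1.2689934, "2": 0.6970792, "3": 0.69611007}
AND_SCORES = {"1": 1.2689934, "3": 0.69611007}
PHRASE_SCORES = {"3": 0.6961101}


def combined_score(doc_id, clauses):
    """Sum the scores of the should clauses that matched this document."""
    return sum(scores.get(doc_id, 0.0) for scores in clauses)


clauses = [OR_SCORES, AND_SCORES, PHRASE_SCORES]
totals = {doc_id: combined_score(doc_id, clauses) for doc_id in OR_SCORES}
ranking = sorted(totals, key=totals.get, reverse=True)

for doc_id in ranking:
    print(doc_id, round(totals[doc_id], 7))
print(ranking)  # ['1', '3', '2']
```

Document 3 collects a score from all three clauses, but each of those scores is small, so two larger scores for document 1 still add up to more.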

 

Boost individual clauses

A boost can be added to an individual clause to make it more important. In our case, we want to boost the match_phrase clause to ensure that documents that exactly match the phrase we are searching for are returned first. This can be done with the following query:

GET demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "simple rest apis distributed nature"
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "simple rest apis distributed nature",
              "operator": "and"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "simple rest apis distributed nature",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}

After executing the above query, we get the following results:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.7844405,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.7844405,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.5379868,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6970792,
        "_source" : {
          "content" : "Distributed nature, simple APIs, speed, and scalability"
        }
      }
    ]
  }
}

We have now received the results in the desired order. The document that contains the exact phrase we are searching for is the first hit. In addition, we still receive the other, less relevant documents, which appear lower in the results.
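The effect of the boost can be reproduced with a small Python sketch that multiplies each clause's reported score by its boost before summing. The per-clause scores are copied from the individual responses earlier in this post, and the helper name boosted_score is hypothetical.

```python
# Per-clause scores copied from the individual responses earlier in the post,
# each paired with the boost applied to that clause in the query above.
CLAUSES = [
    ({"1": 1.2689934, "2": 0.6970792, "3": 0.69611007}, 1.0),  # match (OR)
    ({"1": 1.2689934, "3": 0.69611007}, 1.0),                  # match (AND)
    ({"3": 0.6961101}, 2.0),                                   # match_phrase
]


def boosted_score(doc_id, clauses):
    """Sum each matching clause's score multiplied by that clause's boost."""
    return sum(boost * scores.get(doc_id, 0.0) for scores, boost in clauses)


totals = {doc_id: boosted_score(doc_id, CLAUSES) for doc_id in ("1", "2", "3")}
ranking = sorted(totals, key=totals.get, reverse=True)

for doc_id in ranking:
    print(doc_id, round(totals[doc_id], 7))
print(ranking)  # ['3', '1', '2']
```

Doubling the match_phrase contribution is just enough here to lift document 3 above document 1. On other data a different boost value may be needed.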

 

Use a search template

The query above keeps getting bigger. By using search templates, you can simplify the management of large or complex queries. The search template for the above query is as follows:

POST _scripts/demo_search_template
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "content": {
                  "query": "{{query_string}}"
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "{{query_string}}",
                  "operator": "and"
                }
              }
            },
            {
              "match_phrase": {
                "content": {
                  "query": "{{query_string}}",
                  "boost": 2
                }
              }
            }
          ]
        }
      }
    }
  }
}

The above search template can be executed with the following call:

GET _search/template
{
    "id": "demo_search_template", 
    "params": {
        "query_string": "simple rest apis distributed nature"
    }
}

It will return exactly the same results as we received earlier.
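To see what the template expands to, the substitution can be imitated in Python with plain string replacement. This is only a rough sketch of the idea, not Elasticsearch's Mustache engine: the render function is hypothetical, and it handles only simple string parameters of the {{name}} form.

```python
import json

# The template source from above, serialized to text so that placeholders
# can be substituted before the result is parsed back into a query.
TEMPLATE_SOURCE = json.dumps({
    "query": {
        "bool": {
            "should": [
                {"match": {"content": {"query": "{{query_string}}"}}},
                {"match": {"content": {"query": "{{query_string}}",
                                       "operator": "and"}}},
                {"match_phrase": {"content": {"query": "{{query_string}}",
                                              "boost": 2}}},
            ]
        }
    }
})


def render(template_source, params):
    """Replace each {{name}} placeholder with its (string) parameter value."""
    rendered = template_source
    for name, value in params.items():
        # json.dumps escapes quotes and backslashes; [1:-1] drops the
        # surrounding quotes because the placeholder is already quoted.
        escaped = json.dumps(value)[1:-1]
        rendered = rendered.replace("{{" + name + "}}", escaped)
    return json.loads(rendered)


query = render(TEMPLATE_SOURCE,
               {"query_string": "simple rest apis distributed nature"})
print(json.dumps(query, indent=2))
```

The rendered query has the same shape as the boosted bool query shown earlier, with the search text filled in for each clause.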

 

See details on score calculations

Elasticsearch provides an explain API and an explain query parameter that show how a score was calculated. For example, we could run our basic match (OR) query with explain enabled as follows:

GET demo_idx/_search
{
  "explain": true,
  "query": {
    "match": {
      "content": {
        "query": "simple rest apis distributed nature"
      }
    }
  }
}

This returns a large, detailed response showing the various components of the score calculated for each matching document. However, an analysis of the response is beyond the scope of this blog post.
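Although we won't analyze the response here, its nested structure is easier to read once flattened into an indented outline. The following Python sketch assumes each explanation node is a dictionary with value, description, and details keys, as in the _explanation object returned for each hit. The sample node is a made-up placeholder, not real Elasticsearch output, and the helper name outline is hypothetical.

```python
def outline(node, depth=0):
    """Return one indented line per node of an explanation tree."""
    lines = ["{}{} : {}".format("  " * depth, node["value"], node["description"])]
    for child in node.get("details", []):
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up placeholder tree with the same shape as an _explanation object.
# The values and descriptions are invented for illustration only.
sample_explanation = {
    "value": 3.0,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "placeholder clause A", "details": []},
        {"value": 2.0, "description": "placeholder clause B", "details": []},
    ],
}

print("\n".join(outline(sample_explanation)))
```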

 

Other relevance tuning resources

To assess the quality of search results more rigorously, the Ranking Evaluation API can be helpful. In addition, as described in the “Easier Relevance Tuning” section of the Elasticsearch 7.0 blog, more customized relevance scoring can be implemented.

 

Example project

A demonstration of the concepts presented in this blog can be found in the ES Local Indexer project. This is a simple Python-based desktop search application that indexes HTML documents into Elasticsearch and provides a browser-based interface for searching and paging through the ingested documents. Of particular interest in the project is the search body, which demonstrates many of the concepts discussed in this blog, as well as complex Boolean queries that search across multiple fields.