Elasticsearch master (9)

Fine-grained control of search term weights with boost

Requirements:

  • Search posts by title. Given one post whose title contains java and hadoop and another whose title contains java and elasticsearch, the post containing hadoop should rank ahead of the one containing elasticsearch

Search term weight boost

  • boost

    • A search term's weight can be raised with boost. When relevance scores are computed for a document matching that term and a document matching another term, the document matching the higher-weighted term gets the higher score and is therefore returned first
    • By default, every search term has the same weight, which is 1
 GET /forum/article/_search
 {
   "query": {
     "bool": {
       "must": [
         { "match": { "title": "blog" } }
       ],
       "should": [
         { "match": { "title": { "query": "java" } } },
         { "match": { "title": { "query": "hadoop" } } },
         { "match": { "title": { "query": "elasticsearch" } } },
         { "match": { "title": { "query": "spark", "boost": 5 } } }
       ]
     }
   }
 }

 {
   "took": 2,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 5,
     "max_score": 1.7260925,
     "hits": [
       { "_index": "waws", "_type": "article", "_id": "5", "_score": 1.7260925,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog" } },
       { "_index": "waws", "_type": "article", "_id": "4", "_score": 1.4930474,
         "_source": {
           "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true,
           "postDate": "2017-01-02", "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
           "title": "this is java, elasticsearch, hadoop blog" } },
       { "_index": "waws", "_type": "article", "_id": "1", "_score": 0.80226827,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false,
           "postDate": "2017-01-01", "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
           "title": "this is java and elasticsearch blog" } },
       { "_index": "waws", "_type": "article", "_id": "3", "_score": 0.5753642,
         "_source": {
           "articleID": "JODL-X-1937-#pV7", "userID": 2, "hidden": false,
           "postDate": "2017-01-01", "tag": ["hadoop"], "tag_cnt": 1, "view_cnt": 100,
           "title": "this is elasticsearch blog" } },
       { "_index": "waws", "_type": "article", "_id": "2", "_score": 0.3971361,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog" } }
     ]
   }
 }

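The request above boosts only spark. As a rough sketch of how per-term weights could be applied to the stated requirement (hadoop ahead of elasticsearch), the snippet below builds a bool query body from a term-to-boost mapping. The helper name build_boosted_query, the choice of java as the required term, and the boost values 3 and 1 are assumptions made for illustration, not part of the original example. A higher boost raises a clause's contribution to the score, but it does not by itself guarantee a particular ordering.

```python
import json


def build_boosted_query(required_term, term_boosts, field="title"):
    """Build a bool query body (hypothetical helper).

    `required_term` goes into `must`; every entry of `term_boosts`
    becomes a `should` clause carrying its own boost (default weight is 1).
    """
    should = [
        {"match": {field: {"query": term, "boost": boost}}}
        for term, boost in term_boosts.items()
    ]
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: required_term}}],
                "should": should,
            }
        }
    }


# Assumed weights: hadoop is given a larger boost than elasticsearch.
body = build_boosted_query("java", {"hadoop": 3, "elasticsearch": 1})
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /forum/article/_search request.
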
Elasticsearch master (10)

Relevance score in multi-shard scenarios

Root cause: each shard computes relevance scores locally

1. Why relevance scores can be off in the multi-shard scenario

If your index has more than one shard, the search results may be inaccurate

  • Personal Understanding:

    • An index may have many shards, so any one shard holds only part of the full document set. When the relevance score is calculated with TF/IDF, IDF is by default computed locally, from the documents on that shard alone. Say we have two shards, P0 and P1, and the word “Java” appears in many documents on P0 but in few on P1. Because the term is so common on P0, its IDF there is low, so matching documents on P0 score lower than comparable documents on P1, where the term looks rare. The ordering we end up with is not the ordering we want
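
To make the effect concrete, here is a small sketch that computes a per-shard IDF for “Java” on P0 and P1. It assumes a simplified IDF of the form 1 + ln(N / (df + 1)), in the style of Lucene's classic TF/IDF similarity (the exact formula differs between versions), and the shard statistics are invented for illustration.

```python
import math


def idf(doc_count, doc_freq):
    """Simplified classic-style IDF: the rarer the term, the higher the value."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical statistics for the term "java":
# shard name -> (documents on the shard, documents containing the term)
shards = {"P0": (1000, 500), "P1": (1000, 5)}

# Each shard only sees its own documents, so each gets a different IDF.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
```

With these numbers the same term is weighted much more heavily on P1 than on P0, so otherwise similar documents receive different scores depending on which shard they live on.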

2. How to solve the problem?

  • In the production environment, with a large amount of data, rely on the data being distributed evenly

    • With a large amount of data, ES will, in probabilistic terms, route documents evenly across the shards, because routing is load-balanced on the _id
    • For example, if 10 documents have a title containing Java and there are 5 shards, then with balanced routing each shard should hold about 2 documents whose title contains Java
    • If the data is evenly distributed, the problem described above does not arise
  • In the test environment, give the index a single primary shard: number_of_shards=1 in the index settings

    • If there is only one shard, all the documents are in that shard, so the problem cannot occur
  • In the test environment, search with search_type=dfs_query_then_fetch, which gathers the local IDF of every shard to calculate a global IDF

    • When a doc's relevance score is calculated, the local statistics of all shards are collected first and combined into a global IDF, so every doc is scored with the docs of all shards as context, which ensures accuracy. In the production environment, however, this parameter is not recommended because of its poor performance.
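
The idea behind dfs_query_then_fetch can be sketched as follows: sum the per-shard statistics first, then compute one IDF that every shard uses. This simulates the principle only and is not how Elasticsearch is implemented internally. The IDF formula 1 + ln(N / (df + 1)) is a simplified classic-style assumption, and the shard statistics are invented.

```python
import math


def idf(doc_count, doc_freq):
    """Simplified classic-style IDF: the rarer the term, the higher the value."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical statistics for one term:
# shard name -> (documents on the shard, documents containing the term)
shards = {"P0": (1000, 500), "P1": (1000, 5)}

# Default behaviour: every shard scores with its own local IDF.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# Extra "dfs" step: collect the statistics of all shards before scoring.
total_docs = sum(n for n, _ in shards.values())
total_freq = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_freq)

print("local idf:", {k: round(v, 3) for k, v in local_idf.items()})
print("global idf:", round(global_idf, 3))
```

The global value lies between the two local ones and is identical for every shard, which is why the scores become comparable. The price is the additional collection step before each query.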