Elasticsearch master (11)

Implementing the best fields strategy for multi-field search with dis_max

1. Add a content field to the post data

 POST /waws/article/_bulk
 { "update": { "_id": "1" } }
 { "doc": { "content": "i like to write best elasticsearch article" } }
 { "update": { "_id": "2" } }
 { "doc": { "content": "i think java is the best programming language" } }
 { "update": { "_id": "3" } }
 { "doc": { "content": "i am only an elasticsearch beginner" } }
 { "update": { "_id": "4" } }
 { "doc": { "content": "elasticsearch and hadoop are all very good solution, i am a beginner" } }
 { "update": { "_id": "5" } }
 { "doc": { "content": "spark is best big data solution based on scala ,an programming language similar to java" } }
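
The bulk API reads newline-delimited JSON: each update takes one action line followed by one line holding the partial document, and the body ends with a newline. A minimal sketch of building that body in Python, assuming the five content values from the request above (the helper name build_bulk_update_body is made up for illustration, and nothing is sent to a cluster):

```python
import json

# Content values from the bulk request above, keyed by document _id.
CONTENTS = {
    "1": "i like to write best elasticsearch article",
    "2": "i think java is the best programming language",
    "3": "i am only an elasticsearch beginner",
    "4": "elasticsearch and hadoop are all very good solution, i am a beginner",
    "5": "spark is best big data solution based on scala ,an programming language similar to java",
}


def build_bulk_update_body(contents):
    """Return an NDJSON body: an action line, then a partial-doc line, per update."""
    lines = []
    for doc_id, content in contents.items():
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        lines.append(json.dumps({"doc": {"content": content}}))
    # The last line of a bulk body must also end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_update_body(CONTENTS)
print(body)
```

The resulting string can be pasted under POST /waws/article/_bulk or sent with any HTTP client.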

2. Search for posts whose title or content contains java or solution

This is a multi-field search: the same keywords are matched against more than one field.

 GET /waws/article/_search
 {
   "query": {
     "bool": {
       "should": [
         { "match": { "title": "java solution" } },
         { "match": { "content": "java solution" } }
       ]
     }
   }
 }

Response:

 {
   "took": 1,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 4,
     "max_score": 0.8849759,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.8849759,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "4", "_score": 0.7120095,
         "_source": {
           "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true,
           "postDate": "2017-01-02", "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
           "title": "this is java, elasticsearch, hadoop blog",
           "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.56008905,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "1", "_score": 0.26742277,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false,
           "postDate": "2017-01-01", "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article"
         }
       }
     ]
   }
 }

3. Result analysis

  • We expected doc5 to rank near the top, because it is the only post with both java and solution in a single field (content), but doc2 and doc4 came out ahead of it

    • How each document's relevance score is calculated: add up the scores of the queries that match, multiply by the number of matching queries, and divide by the total number of queries
  • The doc4 score

    • { "match": { "title": "java solution" } } matches doc4 and produces a score
    • { "match": { "content": "java solution" } } also matches doc4 and produces a score
    • The two scores are added, for example 1.1 + 1.2 = 2.3
    • Number of matching queries = 2
    • Total number of queries = 2
    • 2.3 * 2 / 2 = 2.3
  • The doc5 score

    • { "match": { "title": "java solution" } } does not match doc5, so there is no score
    • { "match": { "content": "java solution" } } matches doc5 and produces a score
    • Only one query has a score, for example 2.3
    • Number of matching queries = 1
    • Total number of queries = 2
    • 2.3 * 1 / 2 = 1.15

So doc5 = 1.15 < doc4 = 2.3, and doc4 ranks ahead of doc5
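
The arithmetic above can be sketched in a few lines of Python. This only mirrors the rule as stated in this section, using the illustrative numbers 1.1, 1.2 and 2.3 rather than real Elasticsearch scores; the function name bool_should_score is made up for the example.

```python
def bool_should_score(clause_scores):
    """Sum of the matching clause scores * matching clauses / total clauses.

    clause_scores has one entry per should clause; None means the clause
    did not match the document.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


# Illustrative scores from the text, in the order [title, content].
doc4_score = bool_should_score([1.1, 1.2])   # both clauses match
doc5_score = bool_should_score([None, 2.3])  # only the content clause matches

print("doc4:", round(doc4_score, 2))
print("doc5:", round(doc5_score, 2))
```

Although doc5 has the strongest single-field match, it is penalized for matching only one of the two clauses.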

4. The best fields strategy with dis_max

  • Under the best fields strategy, a document that matches as many keywords as possible within one field should rank first, ahead of a document that matches only a few keywords spread across several fields

  • dis_max scores a document with the score of its single highest-scoring query, instead of adding the query scores together

For doc4, { "match": { "title": "java solution" } } scores 1.1 and { "match": { "content": "java solution" } } scores 1.2, so dis_max takes the maximum, 1.2

For doc5, { "match": { "title": "java solution" } } has no score and { "match": { "content": "java solution" } } scores 2.3, so dis_max takes the maximum, 2.3

So doc4 = 1.2 < doc5 = 2.3, and doc5 now ranks ahead of doc4, which is what we wanted
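
The same comparison as a small Python sketch, again with the illustrative numbers from the text rather than real Elasticsearch scores (best_query_score is a made-up name for the example):

```python
def best_query_score(query_scores):
    """Plain dis_max as described above: keep only the highest query score.

    query_scores has one entry per query; None means the query did not match.
    """
    matched = [s for s in query_scores if s is not None]
    return max(matched) if matched else 0.0


# Illustrative scores from the text, in the order [title, content].
scores = {
    "doc4": best_query_score([1.1, 1.2]),
    "doc5": best_query_score([None, 2.3]),
}

# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```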

 GET /waws/article/_search
 {
   "query": {
     "dis_max": {
       "queries": [
         { "match": { "title": "java solution" } },
         { "match": { "content": "java solution" } }
       ]
     }
   }
 }

Response:

 {
   "took": 24,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 4,
     "max_score": 0.68640786,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.56008905,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "4", "_score": 0.5565415,
         "_source": {
           "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true,
           "postDate": "2017-01-02", "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
           "title": "this is java, elasticsearch, hadoop blog",
           "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "1", "_score": 0.26742277,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false,
           "postDate": "2017-01-01", "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article"
         }
       }
     ]
   }
 }

Elasticsearch master (12)

Tuning dis_max search results with the tie_breaker parameter

1. Search for posts whose title or content contains java or beginner

 GET /waws/article/_search
 {
   "query": {
     "dis_max": {
       "queries": [
         { "match": { "title": "java beginner" } },
         { "match": { "body": "java beginner" } }
       ]
     }
   }
 }

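Some scenarios are hard to reproduce: you have to construct different texts, and then construct searches against them, until you get the effect you want. With the sample posts above, which store their text in content rather than body, only the title query contributes a score to this request.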

One possible scenario looks like this:

(1) In post doc1, the title contains java, and the content contains neither java nor beginner

(2) In post doc2, the content contains beginner, and the title contains neither keyword

(3) In post doc3, the title contains java and the content contains beginner

(4) In the final results, doc1 and doc2 may rank ahead of doc3, instead of doc3 ranking first as we expected

This happens because dis_max simply takes the score of the highest-scoring query.

2. dis_max takes only the highest query score and ignores the scores of the other queries

3. Use tie_breaker to take the other query scores into account

The tie_breaker parameter works as follows: the score of every query other than the highest-scoring one is multiplied by tie_breaker, and the results are added to the highest score to give the final score. The other queries therefore still contribute, but with less weight than the best one.

  • The value of tie_breaker is a decimal between 0 and 1
 GET /waws/article/_search
 {
   "query": {
     "dis_max": {
       "queries": [
         { "match": { "title": "java beginner" } },
         { "match": { "body": "java beginner" } }
       ],
       "tie_breaker": 0.3
     }
   }
 }

Response:

 {
   "took": 2,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 3,
     "max_score": 0.26742277,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "1", "_score": 0.26742277,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false,
           "postDate": "2017-01-01", "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.19856805,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "4", "_score": 0.155468,
         "_source": {
           "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true,
           "postDate": "2017-01-02", "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
           "title": "this is java, elasticsearch, hadoop blog",
           "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
         }
       }
     ]
   }
 }

Conclusion:

  • dis_max: only the score of the most relevant query (the highest TF/IDF score) is used as the document score, so the document that best matches a single query ranks first
  • tie_breaker: the final score is the highest query score plus the scores of the other matching queries multiplied by tie_breaker, which gives a more comprehensive ranking
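
Both rules can be sketched together in Python, using the doc1, doc2 and doc3 scenario described earlier in this part. The per-query scores below are hypothetical, picked only so that each single-field match scores slightly higher than either of doc3's two matches; dis_max_score and rank are made-up names for the example.

```python
def dis_max_score(query_scores, tie_breaker=0.0):
    """Highest query score plus tie_breaker times each of the other scores.

    With tie_breaker=0.0 this is plain dis_max. None means no match.
    """
    matched = sorted((s for s in query_scores if s is not None), reverse=True)
    if not matched:
        return 0.0
    best, others = matched[0], matched[1:]
    return best + tie_breaker * sum(others)


# Hypothetical scores in the order [title, content].
DOCS = {
    "doc1": [1.0, None],  # java in the title only
    "doc2": [None, 1.0],  # beginner in the content only
    "doc3": [0.9, 0.9],   # java in the title, beginner in the content
}


def rank(tie_breaker):
    """Return document names ordered from highest to lowest score."""
    return sorted(
        DOCS,
        key=lambda name: dis_max_score(DOCS[name], tie_breaker),
        reverse=True,
    )


plain_ranking = rank(0.0)  # plain dis_max
tie_ranking = rank(0.3)    # same tie_breaker value as the request above

print("dis_max only:    ", plain_ranking)
print("tie_breaker 0.3: ", tie_ranking)
```

Under these assumed scores, plain dis_max leaves doc3 last because only its best field counts, while tie_breaker 0.3 adds 0.3 * 0.9 for the second field and moves doc3 to the front.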