Elasticsearch master (19)

Deep search techniques: proximity matching with the slop parameter, how it works, and related experiments

slop

 GET /waws/article/_search
 {
     "query": {
         "match_phrase": {
             "title": {
                 "query": "java spark",
                 "slop": 1
             }
         }
     }
 }
  • The meaning of slop

    • slop is the number of moves that the terms of the query string (the search text) are allowed to make in order to match a document

For example, if a query string can match a document only after its terms have been moved a few positions, that number of moves is the slop to set.

hello world, java is very good, spark is also very good.

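A match_phrase query for java spark without slop finds nothing here, because the two terms are not adjacent in this doc.

If we specify slop, the terms of java spark are allowed to move in order to try to match the doc.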

java   is     very   good   spark
java   spark
java          spark
java                 spark
java                        spark

The slop needed here is 3, because in the phrase java spark, spark has to move three positions to match the doc.

Strictly speaking, slop is not the exact number of moves a match must take. It is the maximum number of moves the terms of the query string are allowed to make while trying to match a doc.
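
The counting above can be sketched in a few lines of Python. This is a simplified model and not Lucene's actual implementation: it assumes lowercase whitespace tokens, one position per token, and distinct query terms, and the function name min_slop is made up for this example. It measures how far each term sits from its place in the query and takes the spread of those shifts as the number of moves needed.

```python
from itertools import product
from typing import Optional


def min_slop(query: str, doc: str) -> Optional[int]:
    """Smallest slop at which `query` would phrase-match `doc` in this
    simplified model, or None if a query term is missing from the doc."""
    query_terms = query.lower().split()
    doc_terms = doc.lower().split()

    # For every query term, collect all positions where it occurs in the doc.
    candidates = []
    for term in query_terms:
        positions = [i for i, token in enumerate(doc_terms) if token == term]
        if not positions:
            return None  # match_phrase needs every term to be present
        candidates.append(positions)

    best = None
    # Try every way of assigning a doc position to each query term.
    for combo in product(*candidates):
        # How far each term sits from its position in the query.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        # Shifting the whole phrase is free; only the spread of shifts costs moves.
        needed = max(shifts) - min(shifts)
        if best is None or needed < best:
            best = needed
    return best


print(min_slop("java spark", "java is very good spark"))  # 3
print(min_slop("java spark", "hello world java"))         # None
```

In this model two adjacent terms in swapped order need 2 moves, so min_slop("spark java", "java spark") gives 2. A doc matches a given slop whenever the returned value is not None and is at most that slop.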

With slop set to 3, the query matches:

 GET /waws/article/_search
 {
     "query": {
         "match_phrase": {
             "content": {
                 "query": "java spark",
                 "slop": 3
             }
         }
     }
 }

Response:

 {
   "took": 19,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 1,
     "max_score": 0.5753642,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.5753642,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

The doc matches and is returned as a result.

If slop were set to 2 for a doc that needs three moves, spark could move at most twice, the phrase would not match, and the doc would not be returned.

Experiments to verify slop

 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "spark data",
         "slop": 3
       }
     }
   }
 }

Response:

 {
   "took": 1,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 1,
     "max_score": 0.21824157,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.21824157,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

spark  is     best   big    data   solution
spark  data
              data
                     data
                            data

 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "data spark",
         "slop": 5
       }
     }
   }
 }

Response:

 {
   "took": 1,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 1,
     "max_score": 0.154366,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.154366,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

 # Example with the terms in reverse order
 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "data spark",
         "slop": 5
       }
     }
   }
 }

  • Steps of the change

steps  spark       is     best   big    data
0      data        spark
1      data/spark
2      spark       data
3      spark              data
4      spark                     data
5      spark                            data

  • In a slop search, the closer together the keywords are, the higher the relevance score
 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "java best",
         "slop": 15
       }
     }
   }
 }

Response:

 {
   "took": 3,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 2,
     "max_score": 0.65380025,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.65380025,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith", "author_last_name": "Williams",
           "new_author_last_name": "Williams", "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.07111243,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

In fact, a phrase match with slop is a proximity match

  • java spark must appear as an exact phrase in the doc: phrase match
  • java spark may have some distance between the terms, and the closer they are, the earlier the doc appears in the results: proximity match
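
Why closer terms score higher can be illustrated with a small sketch. It assumes a weight of the form 1 / (distance + 1) for a sloppy match, which is the shape Lucene's similarity classes have used for sloppy phrase frequency; treat the exact formula, and the helper names phrase_distance and sloppy_weight, as assumptions of this example. The positions are counted by hand for the java best query above, assuming one position per word and no stop-word removal.

```python
def phrase_distance(query_positions, doc_positions):
    """Moves needed to line the query terms up with the given doc positions."""
    shifts = [d - q for q, d in zip(query_positions, doc_positions)]
    return max(shifts) - min(shifts)


def sloppy_weight(distance):
    """Assumed weight of a sloppy match: 1.0 when adjacent, smaller when far apart."""
    return 1.0 / (distance + 1)


query_positions = [0, 1]  # "java best": java at 0, best at 1

# doc 2: "i think java is the best programming language" -> java at 2, best at 5
near = phrase_distance(query_positions, [2, 5])
# doc 5: "spark is best ... similar to java spark" -> java at 14, best at 2
far = phrase_distance(query_positions, [14, 2])

print(near, round(sloppy_weight(near), 3))  # 2 0.333
print(far, round(sloppy_weight(far), 3))    # 13 0.071
```

The weight is only one ingredient of the final _score, so these numbers are not expected to reproduce the scores in the response; they only show the direction of the effect.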

Elasticsearch master (20)

Deep search techniques: mixing match and proximity match to balance recall and precision

Recall

  • For example, if you search for java spark and there are 100 docs in total, recall is about how many of those docs are returned as results

Precision

  • For example, if you search for java spark, precision is about whether the docs that contain both java and spark, with the two terms close together, are ranked as high as possible
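
For reference, the textbook set-based definitions of the two measures can be written down in a few lines. They are a little stricter than the informal descriptions above, which look at precision through ranking. The doc IDs below are invented purely for illustration and are not taken from the waws index.

```python
def recall(returned: set, relevant: set) -> float:
    """Share of the relevant docs that the search actually returned."""
    return len(returned & relevant) / len(relevant) if relevant else 0.0


def precision(returned: set, relevant: set) -> float:
    """Share of the returned docs that are actually relevant."""
    return len(returned & relevant) / len(returned) if returned else 0.0


relevant = {"1", "2", "3", "4"}  # hypothetical: docs a user would want
returned = {"1", "2", "9"}       # hypothetical: docs the query returned

print(recall(returned, relevant))               # 0.5
print(round(precision(returned, relevant), 3))  # 0.667
```

A strict query tends to push precision up and recall down, and a loose query does the opposite.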

Searching directly with match_phrase, or with proximity match, requires every term to be present in the doc field and, for proximity match, within the slop limit.

In other words, a doc must contain all the terms before it can be returned as a result. If even one term is missing, the doc cannot be returned.

  • java spark --> hello world java --> cannot be returned
  • java spark --> hello world, java spark --> can be returned

With proximity match alone, recall is low and precision is very high.

Requirements:

Sometimes, however, we want a doc to be returned when it matches only some of the terms, which improves recall. At the same time, we still want match_phrase to raise the score according to distance, so that the closer together the terms are, the higher the score and the earlier the doc is returned.

For the query java spark, a doc is returned if it contains java, if it contains spark, or if it contains both. For precision, docs that contain both java and spark, with the two terms closer together, come first.

Combining a match query and a match_phrase query in a bool query achieves this effect.
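
When the same pattern is needed for different fields or search texts, the request body can be built programmatically. The following is a minimal sketch that only constructs the JSON body as a Python dict; the helper name match_with_proximity_boost is made up for this example, and sending the body to Elasticsearch is left out.

```python
import json


def match_with_proximity_boost(field: str, text: str, slop: int) -> dict:
    """Build a bool query: match decides which docs are returned (recall),
    match_phrase with slop adds score to docs whose terms are close together."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = match_with_proximity_boost("content", "java spark", 50)
print(json.dumps(body, indent=2))
```

Because the match_phrase clause sits under should, a doc that fails it is still returned; it simply gets no extra score from that clause.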

  • An example on the title field, where the effect is not very clear
 GET /waws/article/_search
 {
   "query": {
     "bool": {
       "must": {
         "match": {
           "title": {
             "query": "java spark"
           }
         }
       },
       "should": {
         "match_phrase": {
           "title": {
             "query": "java spark",
             "slop": 50
           }
         }
       }
     }
   }
 }

Response:

 {
   "took": 5,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 4,
     "max_score": 0.2876821,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.2876821,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "1", "_score": 0.26742277,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false,
           "postDate": "2017-01-01", "tag": ["java", "hadoop"], "tag_cnt": 2, "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article",
           "sub_title": "learning more courses",
           "author_first_name": "Peter", "author_last_name": "Smith",
           "new_author_last_name": "Smith", "new_author_first_name": "Peter"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.19856805,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith", "author_last_name": "Williams",
           "new_author_last_name": "Williams", "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "4", "_score": 0.155468,
         "_source": {
           "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true,
           "postDate": "2017-01-02", "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 80,
           "title": "this is java, elasticsearch, hadoop blog",
           "content": "elasticsearch and hadoop are all very good solution, i am a beginner",
           "sub_title": "both of them are good",
           "author_first_name": "Robbin", "author_last_name": "Li",
           "new_author_last_name": "Li", "new_author_first_name": "Robbin"
         }
       }
     ]
   }
 }

Example on the content field

  • In the first case below, match alone returns every doc that contains either term, so recall is high but precision is low: a doc that contains only java ranks first
  • In the second case below, the same docs are returned, but the match_phrase clause adds to the score of the doc in which java and spark appear together, so that doc ranks first
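
The ranking change between the two cases comes down to simple addition, because a bool query adds up the scores of its matching must and should clauses. The sketch below uses the scores reported in the responses in this article. Reusing the slop 3 match_phrase score of doc 5 for the slop 50 clause is an assumption; it agrees with the combined score that the second case reports.

```python
# Scores copied from the responses in this article.
match_scores = {"2": 0.68640786, "5": 0.68324494}  # match on content (first case)
phrase_scores = {"5": 0.5753642}                   # match_phrase on content, only doc 5 matches


def combined_scores(must_scores, should_scores):
    """Add the should score, when a doc has one, to that doc's must score."""
    return {
        doc: score + should_scores.get(doc, 0.0)
        for doc, score in must_scores.items()
    }


combined = combined_scores(match_scores, phrase_scores)
ranking = sorted(combined, key=combined.get, reverse=True)

print(ranking)                   # ['5', '2']
print(round(combined["5"], 6))   # 1.258609
```

Doc 2 keeps its match score because it does not contain spark, while doc 5 gains the phrase score and moves from second place to first.
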
 GET /waws/article/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "content": "java spark"
           }
         }
       ]
     }
   }
 }

Response:

 {
   "took": 1,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 2,
     "max_score": 0.68640786,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith", "author_last_name": "Williams",
           "new_author_last_name": "Williams", "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 0.68324494,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

  • The second case
 GET /waws/article/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "content": "java spark"
           }
         }
       ],
       "should": [
         {
           "match_phrase": {
             "content": {
               "query": "java spark",
               "slop": 50
             }
           }
         }
       ]
     }
   }
 }

Response:

 {
   "took": 2,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 2,
     "max_score": 1.258609,
     "hits": [
       {
         "_index": "waws", "_type": "article", "_id": "5", "_score": 1.258609,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
         }
       },
       {
         "_index": "waws", "_type": "article", "_id": "2", "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith", "author_last_name": "Williams",
           "new_author_last_name": "Williams", "new_author_first_name": "Smith"
         }
       }
     ]
   }
 }