Elasticsearch master (21)

Using the rescoring mechanism to optimize proximity match performance

How match differs from phrase match (proximity match)

  • match

    • As soon as a single term matches, the doc containing that term is returned as a result. The inverted index is scanned, and the scan can stop once the term has been found.
  • phrase match

    • First, scan the doc lists of all terms and find the docs that contain every term. Then, for each of those docs, compute the position of each term and check whether the positions fall within the specified range. With slop, a more complex calculation is needed to decide whether the terms can be moved within the allowed slop to match the doc.
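
The two strategies above can be sketched in Python. This is a toy model under stated assumptions, not how Elasticsearch or Lucene is implemented: `build_positional_index`, `match`, and `phrase_match` are hypothetical helper names, tokenization is a plain whitespace split, and slop is left out so that only exact adjacency is checked.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (a toy positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match(index, query):
    """match: a doc qualifies as soon as it contains any one query term."""
    hits = set()
    for term in query.lower().split():
        hits |= set(index.get(term, {}))
    return hits


def phrase_match(index, query):
    """phrase match: all terms must be present and sit at consecutive positions."""
    terms = query.lower().split()
    postings = [index.get(term, {}) for term in terms]
    # Step 1: keep only the docs that contain every term.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Step 2: per candidate doc, compare term positions (the expensive part).
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits


# Made-up sample docs, for illustration only.
docs = {
    1: "java spark is fast",
    2: "spark is written in scala not java",
    3: "i think java is the best",
}
index = build_positional_index(docs)
print(match(index, "java spark"))         # every doc with either term
print(phrase_match(index, "java spark"))  # only docs with the exact phrase
```

The extra position check in step 2 runs once per candidate doc, which is the work that a plain match never has to do.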

The match query performs much better than phrase match and proximity match (phrase match with slop), because both of the latter have to compute distances between term positions.

As a rough figure, match query performance is about 10 times that of phrase match and about 20 times that of proximity match.

However, don't worry too much. ES queries generally complete in milliseconds: a match query takes a few milliseconds to tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds. That is usually acceptable.

Optimizing the performance of proximity match

The general approach is to reduce the number of documents that the proximity match has to examine.

  • The main idea is

    • First use a match query to filter out the data you need, then use a proximity match to boost doc scores according to term distance. The proximity match runs only on the top N docs of each shard and adjusts their scores. This process is called rescoring. Because users normally page through results and only look at the first few pages, there is no need to run the proximity match on every result.

As discussed earlier, match + proximity match achieves both recall and precision (important)

  • Suppose a match query matches 1000 docs. A plain proximity match would have to run the slop calculation on every one of them to decide whether the terms can be moved to match, and then contribute its score. In practice users page through results and look only at the first few pages, for example 10 results per page and at most 5 pages, which is 50 docs.
  • With rescoring, the proximity match only needs to run the slop calculation and contribute its score for the top 50 docs, not for all 1000 docs.

Rescore: re-scoring

Match: 1000 docs, each with a score. Proximity match: rescoring only the top 50 docs is sufficient. Among those 50 docs, the closer together the terms are, the higher the doc ranks.

 GET /waws/article/_search
 {
   "query": {
     "match": {
       "content": "java spark"
     }
   },
   "rescore": {
     "window_size": 50,
     "query": {
       "rescore_query": {
         "match_phrase": {
           "content": {
             "query": "java spark",
             "slop": 50
           }
         }
       }
     }
   }
 }

 {
   "took": 56,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 2,
     "max_score": 1.258609,
     "hits": [
       { "_index": "waws", "_type": "article", "_id": "5", "_score": 1.258609,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false,
           "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny", "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny" } },
       { "_index": "waws", "_type": "article", "_id": "2", "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false,
           "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith", "author_last_name": "Williams",
           "new_author_last_name": "Williams", "new_author_first_name": "Smith" } }
     ]
   }
 }

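The rescoring idea can also be sketched in Python. This is a minimal illustration under stated assumptions, not Elasticsearch's scoring: `proximity_bonus` and `rescore` are hypothetical helpers, the bonus formula (1 divided by the term gap) is invented for the example, the sample hits are made up, and the final score is simply the first-pass score plus the bonus.

```python
def proximity_bonus(text, term_a, term_b):
    """Hypothetical proximity score: the closer the two terms, the larger the bonus."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 0.0  # one term is missing, so there is no proximity to reward
    gap = min(abs(a - b) for a in pos_a for b in pos_b)
    return 1.0 / max(gap, 1)


def rescore(hits, window_size, bonus_fn):
    """Re-rank only the first `window_size` hits; the tail keeps its order and scores."""
    window, tail = hits[:window_size], hits[window_size:]
    rescored = [(doc, score + bonus_fn(doc)) for doc, score in window]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail


# First-pass results, already sorted by the cheap match score (made-up values).
hits = [
    ("java is great and so much later spark", 1.0),
    ("java spark tutorial", 0.9),
    ("java only", 0.8),
]
reranked = rescore(
    hits,
    window_size=2,
    bonus_fn=lambda text: proximity_bonus(text, "java", "spark"),
)
for doc, score in reranked:
    print(round(score, 3), doc)
```

Only the two docs inside the window pay for the position calculation; the third doc is returned untouched, which is the saving that `window_size` buys.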
Elasticsearch master (22)

In-depth search techniques in practice: prefix search, wildcard search, regexp search, etc.

Prefix search

  • C3D0-KD345
  • C3K5-DFG65
  • C4I8-UI365

C3 --> should find the first two IDs above --> search by the prefix of the string

 PUT /waws_index
 {
   "mappings": {
     "waws_type": {
       "properties": {
         "title": {
           "type": "keyword"
         }
       }
     }
   }
 }
  • Add test data
 PUT /waws_index/waws_type/1
 {
   "title": "C3D0-KD345"
 }

 PUT /waws_index/waws_type/2
 {
   "title": "C3K5-DFG65"
 }

 PUT /waws_index/waws_type/3
 {
   "title": "C4I8-UI365"
 }
  • Get data by prefix
 GET /waws_index/waws_type/_search
 {
   "query": {
     "prefix": {
       "title": {
         "value": "C3"
       }
     }
   }
 }

 {
   "took": 51,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 2,
     "max_score": 1,
     "hits": [
       { "_index": "waws_index", "_type": "waws_type", "_id": "2", "_score": 1, "_source": { "title": "C3K5-DFG65" } },
       { "_index": "waws_index", "_type": "waws_type", "_id": "1", "_score": 1, "_source": { "title": "C3D0-KD345" } }
     ]
   }
 }

How prefix search works

A prefix query does not compute a relevance score. Its only difference from a prefix filter is that the filter caches bitsets.

A prefix query scans the entire inverted index, as the example below shows.

  • The shorter the prefix, the more docs have to be processed and the worse the performance. Use long prefixes whenever possible.

How does prefix search work, and why does it perform poorly?

  • match
  • C3-D0-KD345
  • C3-K5-DFG65
  • C4-I8-UI365

Full-text retrieval (each string is split into terms)

  • c3 doc1,doc2
  • d0
  • kd345
  • k5
  • dfg65
  • c4
  • i8
  • ui365

C3 --> scan the inverted index --> once the scan reaches c3 it can stop, because the 2 docs containing c3 have already been found --> there is no need to keep checking other terms

So match performance tends to be high

  • Without tokenization
  • C3-D0-KD345
  • C3-K5-DFG65
  • C4-I8-UI365

Take C3-K5-DFG65: there may be many other strings that also start with c3 --> the scan cannot stop at the first term whose prefix matches c3. It must continue until the entire inverted index has been scanned
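
The difference between the two lookups described above can be sketched in Python. This follows the article's simplified picture only: the dict stands in for the inverted index, `term_lookup` and `prefix_scan` are hypothetical names, and Lucene's real term dictionary is organized differently.

```python
# Toy inverted index over untokenized values: term -> list of doc ids.
inverted_index = {
    "C3D0-KD345": [1],
    "C3K5-DFG65": [2],
    "C4I8-UI365": [3],
}


def term_lookup(index, term):
    """Exact term: one dictionary hit, then the search is over."""
    return index.get(term, [])


def prefix_scan(index, prefix):
    """Prefix: every term is tested, because any later term may also match."""
    doc_ids, terms_checked = [], 0
    for term, postings in index.items():
        terms_checked += 1
        if term.startswith(prefix):
            doc_ids.extend(postings)
    return sorted(doc_ids), terms_checked


print(term_lookup(inverted_index, "C3D0-KD345"))
doc_ids, terms_checked = prefix_scan(inverted_index, "C3")
print(doc_ids, "after checking", terms_checked, "terms")
```

In this model the exact lookup touches a single entry, while the prefix scan touches all of them regardless of how many actually match.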

In real scenarios, however, some cases cannot be solved by full-text retrieval

  • C3D0-KD345
  • C3K5-DFG65
  • C4I8-UI365

C3 --> match --> scans the inverted index, but can it find them? No, because no term is exactly C3

C3 --> only prefix search works

But prefix search performs poorly

Wildcard search

  • Similar to prefix search, but much more powerful
  • C3D0-KD345
  • C3K5-DFG65
  • C4I8-UI365

For example: a 5, then one character, then -D, then any number of characters, then a 5

5?-*5: wildcards express the more complex semantics of a fuzzy search

 GET /waws_index/waws_type/_search
 {
   "query": {
     "wildcard": {
       "title": {
         "value": "C?K*5"
       }
     }
   }
 }

 {
   "took": 8,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 1,
     "max_score": 1,
     "hits": [
       { "_index": "waws_index", "_type": "waws_type", "_id": "2", "_score": 1, "_source": { "title": "C3K5-DFG65" } }
     ]
   }
 }
  • ? : any single character
  • * : zero or more characters
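
The meaning of the two wildcards can be checked locally with Python's standard `fnmatch` module, which uses the same `?` and `*` conventions. This only illustrates the pattern semantics on the three sample titles; `wildcard_search` is a hypothetical helper and no Elasticsearch call is involved. One caveat: `fnmatch` also gives `[...]` a special meaning, so the comparison holds only for patterns built from `?` and `*`.

```python
from fnmatch import fnmatchcase

titles = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365"]


def wildcard_search(terms, pattern):
    """Return the terms that match the whole pattern, case-sensitively."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# "C", any one character, "K", any run of characters, then "5".
matches = wildcard_search(titles, "C?K*5")
print(matches)
```

Every title has to be tested against the pattern, which mirrors why the wildcard query is expensive.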

Performance is just as poor, because the entire inverted index must be scanned.

Regexp search

 GET /waws_index/waws_type/_search
 {
   "query": {
     "regexp": {
       "title": "C[0-9].+"
     }
   }
 }

 {
   "took": 11,
   "timed_out": false,
   "_shards": { "total": 5, "successful": 5, "failed": 0 },
   "hits": {
     "total": 3,
     "max_score": 1,
     "hits": [
       { "_index": "waws_index", "_type": "waws_type", "_id": "2", "_score": 1, "_source": { "title": "C3K5-DFG65" } },
       { "_index": "waws_index", "_type": "waws_type", "_id": "1", "_score": 1, "_source": { "title": "C3D0-KD345" } },
       { "_index": "waws_index", "_type": "waws_type", "_id": "3", "_score": 1, "_source": { "title": "C4I8-UI365" } }
     ]
   }
 }
  • C[0-9].+

    • [0-9] : a digit in the specified range
    • [a-z] : a letter in the specified range
    • . : any one character
    • + : the preceding expression occurs one or more times
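
The pattern can be tried out locally with Python's `re` module. This is only an approximation for illustration: Python's regex dialect is not identical to the one Elasticsearch uses, and `regexp_search` is a hypothetical helper. `fullmatch` is used because the regexp query has to match the whole term, not just a part of it.

```python
import re

titles = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365"]


def regexp_search(terms, pattern):
    """Return the terms that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


# "C", one digit, then one or more of any character.
print(regexp_search(titles, "C[0-9].+"))
# Without the leading "C" nothing matches, since the whole term must match.
print(regexp_search(titles, "[0-9].+"))
```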

Wildcard and regexp queries, like prefix queries, scan the entire inverted index and perform poorly.

These are introduced here as advanced search syntax. In practice, avoid them if you can, because the performance is too poor.