Elasticsearch master (3)

Structured search: an in-depth look at how filter execution works (the bitset mechanism and the caching mechanism)

(1) Look up the search string in the inverted index to obtain the document list

Take a date field as an example:

word doc1 doc2 doc3
2017-01-01 * *
2017-02-02 * *
2017-03-03 * * *

Filter: 2017-02-02

Looking up 2017-02-02 in the inverted index gives the document list doc2, doc3
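Step (1) can be sketched in Python as follows. This is only a minimal illustration, not Lucene's actual data structure: the `inverted_index` dict and the `lookup` helper are hypothetical names, and only the two terms whose document lists are unambiguous in the table above are filled in.

```python
# Hypothetical in-memory inverted index: term -> list of documents containing it.
inverted_index = {
    "2017-02-02": ["doc2", "doc3"],
    "2017-03-03": ["doc1", "doc2", "doc3"],
}


def lookup(term):
    """Return the document list for a term, or an empty list if the term is absent."""
    return inverted_index.get(term, [])


print(lookup("2017-02-02"))  # ['doc2', 'doc3']
```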

(2) Construct a bitset for each result found in the inverted index, for example [0, 0, 0, 1, 0, 1]

This step is very important.

From the doc list, construct a bitset: a binary array in which each element is 0 or 1 and indicates whether the corresponding doc matches the filter condition. A matching doc is marked 1; a non-matching doc is marked 0.

[0, 1, 1]

doc1 does not match the filter; doc2 and doc3 match.

Implementing complex functionality with the simplest possible data structure saves memory and improves performance
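The bitset construction described above might look like this in Python. It is a sketch under simplifying assumptions: `build_bitset` is a hypothetical helper, and a plain list of 0/1 integers stands in for the packed bit array a real engine would likely use.

```python
def build_bitset(all_docs, matching_docs):
    """Return one 0/1 flag per document, in the order given by all_docs."""
    matching = set(matching_docs)
    return [1 if doc in matching else 0 for doc in all_docs]


# The filter 2017-02-02 matched doc2 and doc3 out of doc1..doc3.
bitset = build_bitset(["doc1", "doc2", "doc3"], ["doc2", "doc3"])
print(bitset)  # [0, 1, 1]
```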

(3) Traverse the bitset of each filter condition, starting with the sparsest one, to find the documents that satisfy all the conditions

  • As will be explained later, a single search request can carry several filter conditions at once. Each filter condition has its own bitset, and these bitsets are traversed starting with the sparsest one.

[0, 0, 0, 1, 0, 0]: relatively sparse; [0, 1, 0, 1, 0, 1]: denser

Traversing the sparsest bitset first eliminates as many documents as possible up front.

All the bitsets are then traversed to find the docs that match every filter condition.

Request: filter, postDate=2017-01-01, userID=1

postDate: [0, 0, 1, 1, 0, 0]
userID: [0, 1, 0, 1, 0, 1]

After traversing the two bitsets, the only doc that matches all the criteria is doc4 (the fourth position is 1 in both bitsets).

That document can then be returned to the client as the result.
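The sparsest-first traversal in step (3) can be sketched as below. This is an assumption-laden toy: `intersect_sparsest_first` is a hypothetical function, "sparsest" is taken to mean "fewest 1 bits", and positions are mapped to doc names by counting from doc1.

```python
def intersect_sparsest_first(bitsets):
    """Return the 0-based positions that are set to 1 in every bitset."""
    if not bitsets:
        return []
    # Fewest 1 bits first, so the candidate list starts as small as possible.
    ordered = sorted(bitsets, key=sum)
    candidates = [i for i, bit in enumerate(ordered[0]) if bit]
    for bitset in ordered[1:]:
        candidates = [i for i in candidates if bitset[i]]
        if not candidates:
            break  # nothing can match any more
    return candidates


post_date = [0, 0, 1, 1, 0, 0]  # postDate=2017-01-01
user_id = [0, 1, 0, 1, 0, 1]    # userID=1
matches = intersect_sparsest_first([post_date, user_id])
print([f"doc{i + 1}" for i in matches])  # ['doc4']
```

Starting from the sparser postDate bitset means only two candidate positions have to be checked against the userID bitset.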

(4) Caching bitsets: queries are tracked, and if a filter condition is used more than a certain number of times within the last 256 queries, its bitset is cached. For small segments (fewer than 1000 documents, or less than 3% of the index size), bitsets are not cached.

  • For example, the bitset [0, 0, 1, 1, 0, 0] for postDate=2017-01-01 can be cached in memory. The next time the same condition arrives, there is no need to rescan the inverted index and regenerate the bitset, which can greatly improve performance.
  • If a filter appears more than a certain number of times among the latest 256 filters, the bitset corresponding to that filter is cached automatically.

Segments (covered earlier in the course): filter results obtained from small segments are not cached. A segment counts as small when it holds fewer than 1000 documents or is smaller than 3% of the total index size.

  • Why bitsets are not cached for small segments

    • A small segment holds very little data, so even a full scan is fast. Segments are also merged automatically in the background, so a small segment will soon be merged into a larger one and caching its bitset has little value.
    • Example bitset for a small segment: [0, 0, 1, 0]
  • The advantage of filter over query is that filter results can be cached, but until now it was unclear what exactly is cached. What is cached is not the complete doc list returned by a filter but the filter's bitset, so the inverted index does not need to be scanned the next time.
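The caching rule described in step (4) can be modelled roughly as follows. This is a toy model, not Elasticsearch's internal code: the `FilterCachePolicy` class and its method names are hypothetical, the `min_uses` default of 5 is an invented placeholder for the unspecified "certain number of times", and segment size is approximated by document count.

```python
from collections import Counter, deque


class FilterCachePolicy:
    """Toy model: cache a filter's bitset only if it is used often and the segment is not small."""

    def __init__(self, history_size=256, min_uses=5,
                 min_segment_docs=1000, min_segment_ratio=0.03):
        self.history = deque(maxlen=history_size)  # only the latest 256 filters are kept
        self.min_uses = min_uses                    # assumed threshold, not from the source
        self.min_segment_docs = min_segment_docs
        self.min_segment_ratio = min_segment_ratio

    def record(self, filter_key):
        """Remember that a filter was used in a query."""
        self.history.append(filter_key)

    def is_frequent(self, filter_key):
        return Counter(self.history)[filter_key] >= self.min_uses

    def segment_is_large_enough(self, segment_docs, index_docs):
        # Small segments (<1000 docs or <3% of the index) are never cached.
        if segment_docs < self.min_segment_docs:
            return False
        return index_docs > 0 and segment_docs / index_docs >= self.min_segment_ratio

    def should_cache(self, filter_key, segment_docs, index_docs):
        return (self.is_frequent(filter_key)
                and self.segment_is_large_enough(segment_docs, index_docs))


policy = FilterCachePolicy(min_uses=3)
for _ in range(3):
    policy.record("postDate=2017-01-01")
print(policy.should_cache("postDate=2017-01-01", segment_docs=5000, index_docs=100000))  # True
print(policy.should_cache("postDate=2017-01-01", segment_docs=500, index_docs=100000))   # False
```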

(5) In most cases the filter is executed before the query, so that as much data as possible is filtered out first

  • Query: calculates a relevance score for each doc against the search criteria and sorts the results by that score
  • Filter: simply filters out the desired data; it does not calculate a relevance score and does not sort

(6) If a document is added or modified, the cached bitset is updated automatically

postDate=2017-01-01, [0, 0, 1, 0]

  • A new document with id=5 and postDate=2017-01-01 is added automatically to the bitset of the filter postDate=2017-01-01, which becomes [0, 0, 1, 0, 1]
  • The document with id=1 and postDate=2016-12-30 is modified to postDate=2017-01-01; the bitset is again updated automatically and becomes [1, 0, 1, 0, 1]
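The effect of step (6) on a cached bitset can be illustrated with the sketch below. It only models the observable behavior described above and says nothing about how Elasticsearch performs the update internally; `on_document_indexed` is a hypothetical helper, and it assumes the document with id n occupies position n-1 in the bitset.

```python
# Cached bitset for the filter postDate=2017-01-01 over documents id=1..4.
cached_bitset = [0, 0, 1, 0]


def on_document_indexed(bitset, doc_id, post_date, filter_value="2017-01-01"):
    """Set or clear the bit for doc_id (1-based), growing the bitset for new documents."""
    position = doc_id - 1
    while len(bitset) <= position:
        bitset.append(0)
    bitset[position] = 1 if post_date == filter_value else 0
    return bitset


on_document_indexed(cached_bitset, 5, "2017-01-01")  # new document id=5
print(cached_bitset)  # [0, 0, 1, 0, 1]
on_document_indexed(cached_bitset, 1, "2017-01-01")  # id=1 changed from 2016-12-30
print(cached_bitset)  # [1, 0, 1, 0, 1]
```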

(7) The cached bitset corresponding to the same filter criterion will be used directly in the future

Elasticsearch master (4)

Structured search: hands-on practice of searching data by combining multiple filter conditions with bool

Example 1

  • Search for posts whose post date is 2017-01-01 or whose article ID is XHDK-A-1293-#fJ3, and whose post date is definitely not 2017-01-02
 select * from forum.article where (post_date='2017-01-01' or article_id='XHDK-A-1293-#fJ3') and post_date!='2017-01-02'

The syntax for using ES is as follows

 GET /waws/article/_search
 {
   "query": {
     "constant_score": {
       "filter": {
         "bool": {
           "should": [
             {"term": {"articleID.keyword": "XHDK-A-1293-#fJ3"}},
             {"term": {"post_date": "2017-01-01"}}
           ],
           "must_not": {
             "term": {"post_date": "2017-01-02"}
           }
         }
       }
     }
   }
 }

 {
   "took": 3,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 1,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "1",
         "_score": 1,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-01"
         }
       }
     ]
   }
 }

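bool clauses: must (the condition must match), should (matching any one of the conditions is enough), must_not (the condition must not match), filter (the condition must match, but it does not contribute to the score)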
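The matching semantics of these clauses can be approximated in plain Python over in-memory dicts. This is a toy evaluator, not Elasticsearch code: `term` and `bool_filter` are hypothetical helpers, the second sample document is made up, and the sketch requires at least one should clause to match whenever should clauses are given (the case relevant here, where no must clause sits beside should; in Elasticsearch this is governed by `minimum_should_match`).

```python
def term(field, value):
    """Exact-match predicate on one field."""
    return lambda doc: doc.get(field) == value


def bool_filter(must=(), should=(), must_not=()):
    """Combine predicates: all of must, none of must_not, at least one of should (if any)."""
    def matches(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        return not should or any(clause(doc) for clause in should)
    return matches


docs = [
    {"articleID": "XHDK-A-1293-#fJ3", "postDate": "2017-01-01"},  # from the response above
    {"articleID": "HYPOTHETICAL-1", "postDate": "2017-01-02"},    # made-up document
]
example1 = bool_filter(
    should=[term("articleID", "XHDK-A-1293-#fJ3"), term("postDate", "2017-01-01")],
    must_not=[term("postDate", "2017-01-02")],
)
print([d["articleID"] for d in docs if example1(d)])  # ['XHDK-A-1293-#fJ3']
```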

Example 2

  • Search for posts whose article ID is XHDK-A-1293-#fJ3, or whose article ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01
 select * from forum.article where article_id='XHDK-A-1293-#fJ3' or (article_id='JODL-X-1937-#pV7' and post_date='2017-01-01')

The syntax for using ES is as follows

 GET /waws/article/_search
 {
   "query": {
     "constant_score": {
       "filter": {
         "bool": {
           "should": [
             {"term": {"articleID.keyword": "XHDK-A-1293-#fJ3"}},
             {
               "bool": {
                 "must": [
                   {"term": {"articleID.keyword": "JODL-X-1937-#pV7"}},
                   {"term": {"postDate": "2017-01-01"}}
                 ]
               }
             }
           ]
         }
       }
     }
   }
 }

 {
   "took": 2,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 1,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "1",
         "_score": 1,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-01"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "3",
         "_score": 1,
         "_source": {
           "articleID": "JODL-X-1937-#pV7",
           "userID": 2,
           "hidden": false,
           "postDate": "2017-01-01"
         }
       }
     ]
   }
 }

Summary of what we have learned

  • bool: must, must_not, should
  • bool can be nested
  • Once you know the search syntax well, you can implement the equivalent of most common SQL conditions
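To make the SQL-to-bool mapping concrete, the request body of Example 2 can be assembled from small Python helpers. The helpers `term`, `bool_query` and `constant_score` are hypothetical names introduced for this sketch and are not part of any Elasticsearch client; the code only builds and prints the JSON body and does not send a request.

```python
import json


def term(field, value):
    """Build a term clause for an exact match on one field."""
    return {"term": {field: value}}


def bool_query(**clauses):
    """Build a bool query from must / should / must_not / filter lists, skipping empty ones."""
    return {"bool": {name: body for name, body in clauses.items() if body}}


def constant_score(filter_query):
    """Wrap a filter so that no relevance score is calculated."""
    return {"query": {"constant_score": {"filter": filter_query}}}


# article_id='XHDK-A-1293-#fJ3' or (article_id='JODL-X-1937-#pV7' and post_date='2017-01-01')
body = constant_score(
    bool_query(should=[
        term("articleID.keyword", "XHDK-A-1293-#fJ3"),
        bool_query(must=[
            term("articleID.keyword", "JODL-X-1937-#pV7"),
            term("postDate", "2017-01-01"),
        ]),
    ])
)
print(json.dumps(body, indent=2))
```

Each SQL `or` group becomes a should list and each `and` group becomes a nested bool with a must list, which is why nesting bool is enough to express this kind of condition.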