When an index is built, two structures are created. On the one hand, an inverted index is built for search;
on the other hand, a forward index, known as doc values, is built for sorting, aggregation, filtering, and so on.
Doc values are stored on disk. If there is enough memory, the OS automatically caches them in memory.
doc1: {"field_1": "value1", "field_2": "value1"}
doc2: {"field_1": "value2", "field_2": "value2"}
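To make the two structures concrete, here is a minimal in-memory sketch built from doc1 and doc2. The function name `build_indexes` and the nested-dictionary layout are assumptions made for illustration; they are not how Lucene stores either structure on disk.

```python
from collections import defaultdict

docs = {
    "doc1": {"field_1": "value1", "field_2": "value1"},
    "doc2": {"field_1": "value2", "field_2": "value2"},
}

def build_indexes(docs):
    """Build a toy inverted index and toy doc values from the same documents."""
    inverted = defaultdict(lambda: defaultdict(list))  # field -> term -> [doc ids]
    doc_values = defaultdict(dict)                     # field -> doc id -> value
    for doc_id, source in docs.items():
        for field, value in source.items():
            inverted[field][value].append(doc_id)      # term -> documents (search)
            doc_values[field][doc_id] = value          # document -> value (sort, aggregate)
    return inverted, doc_values

inverted, doc_values = build_indexes(docs)

# Search goes from a term to the documents that contain it.
hits = inverted["field_1"]["value2"]

# Sorting goes from each document to its field value.
ordered = sorted(docs, key=lambda doc_id: doc_values["field_2"][doc_id], reverse=True)
```

The inverted index answers "which documents contain this term", while doc values answer "what value does this document have", which is the lookup direction that sorting and aggregation need.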
- The search request is sent to a coordinating node, which builds a priority queue. The length of the queue is from + size, taken from the paging parameters. By default from is 0 and size is 10, so the length is 10.
- The coordinating node forwards the request to one copy (primary or replica) of every shard. Each shard searches locally and builds a local priority queue.
- Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.
- This process is called the query phase.
The coordinating node still has to sort, page, and aggregate the results, so part of the data returned by the shards may never be needed. To reduce network overhead, the priority queues contain only doc IDs and sort values, not whole documents.
After the coordinating node has built the global priority queue, it sends an mget request to the shards that hold the selected documents to fetch them. This is the fetch phase.
After each shard returns its documents to the coordinating node, the coordinating node returns the combined result to the client.
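The two phases described above can be sketched as a small simulation. Everything here is hypothetical: the `SHARDS` data, the `score` field, and the function names `query_shard` and `search` are invented for illustration, and the real mget call is replaced by a dictionary lookup.

```python
import heapq

# Hypothetical in-memory shards: doc id -> document with a numeric "score" field.
SHARDS = [
    {"a1": {"score": 6}, "a2": {"score": 3}, "a3": {"score": 1}},
    {"b1": {"score": 7}, "b2": {"score": 5}},
    {"c1": {"score": 9}, "c2": {"score": 4}, "c3": {"score": 2}},
]

def query_shard(shard_no, shard, from_, size):
    """Query phase on one shard: return only (sort value, shard number, doc id)
    for the local top from_ + size entries, never the documents themselves."""
    entries = [(doc["score"], shard_no, doc_id) for doc_id, doc in shard.items()]
    return heapq.nlargest(from_ + size, entries)

def search(shards, from_=0, size=10):
    # Query phase: collect the local queues and merge them into a global queue.
    local_queues = [query_shard(i, shard, from_, size) for i, shard in enumerate(shards)]
    merged = heapq.nlargest(from_ + size, (e for queue in local_queues for e in queue))
    page = merged[from_:from_ + size]
    # Fetch phase: look up full documents only for the ids on the requested page.
    return [(doc_id, shards[shard_no][doc_id]) for _, shard_no, doc_id in page]

results = search(SHARDS, from_=1, size=3)
```

Note that every shard has to return from + size entries, not just size, because the coordinating node cannot know in advance which shard holds the documents for the requested page.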
Suppose there are five shards and each shard returns a priority queue sorted by some field. We want the first four documents after sorting.
First, each shard returns its own top four documents, sorted internally, for example:
6 3 2 1
7 5 3 2
3 2 1 0
9 4 2 1
4 3 2 1
The doc IDs are then returned to the coordinating node for processing.
The coordinating node looks at the document at the head of each priority queue and orders the queues by that field value, giving:
9 4 2 1
7 5 3 2
6 3 2 1
4 3 2 1
3 2 1 0
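This eliminates about half of the data, because the top four results can only be among the remaining entries: an entry in position k of the j-th queue has at least (j - 1) + (k - 1) entries ranked ahead of it, so it can reach the top four only if j + k <= 5.
9 4 2 1
7 5 3
6 3
4
At this point, mget is used to fetch the data, which is then sorted to produce the final result (here 9, 7, 6, 5).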
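The merge in this example can be reproduced with a short sketch. It assumes the five queues hold the values 6 3 2 1, 7 5 3 2, 3 2 1 0, 9 4 2 1, and 4 3 2 1, each already sorted in descending order; the helper name `global_top` is made up for illustration. It uses a plain k-way merge rather than the elimination step, which gives the same answer.

```python
import heapq
from itertools import islice

# One descending priority queue of sort values per shard (values from the example).
shard_queues = [
    [6, 3, 2, 1],
    [7, 5, 3, 2],
    [3, 2, 1, 0],
    [9, 4, 2, 1],
    [4, 3, 2, 1],
]

def global_top(queues, size):
    """Merge already-sorted descending queues and keep the first `size` values."""
    merged = heapq.merge(*queues, reverse=True)  # lazy k-way merge, largest first
    return list(islice(merged, size))

top_four = global_top(shard_queues, 4)
```

Because `heapq.merge` is lazy, only the entries needed to produce the first four values are consumed, which mirrors the idea that most of the returned entries never have to be fetched.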
Suppose two documents have the same value in the sort field.
- On different copies of a shard, the two documents may be stored in a different order.
- Each request may be served by a different replica shard.
As a result, the same search can return its results in a different order from one request to the next.
Solution: use the preference parameter.
It determines which shard copies are used to execute the search.
Setting preference to an arbitrary string, such as the user's ID, makes every search by that user go to the same replica shards, so bouncing results do not occur.
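The effect of a preference string can be illustrated with a toy selection function. This is only a sketch of the idea "same string, same copy"; the function `pick_copy`, the copy names, and the use of MD5 are assumptions for illustration and do not reflect how Elasticsearch actually chooses a copy.

```python
import hashlib

def pick_copy(preference, copies):
    """Deterministically map a preference string to one shard copy (illustrative only)."""
    digest = hashlib.md5(preference.encode("utf-8")).digest()
    return copies[int.from_bytes(digest[:4], "big") % len(copies)]

copies = ["primary", "replica_1", "replica_2"]  # hypothetical copies of one shard

first = pick_copy("user_42", copies)
second = pick_copy("user_42", copies)  # same user, same string, same copy
```

Since the same user always lands on the same copy, documents with equal sort values keep the same relative order across that user's requests.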
By default, a document is routed by its _id. To decide which shard a document belongs to based on another field, pass that field's value in the routing parameter (routing=<field value>) when indexing the document.
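Routing boils down to hashing the routing value and taking it modulo the number of primary shards. The sketch below shows that idea in its simplified form; `shard_for` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch really uses (Murmur3), so the shard numbers it produces will not match a real cluster.

```python
import zlib

def shard_for(routing_value, number_of_primary_shards):
    """Simplified routing: hash(routing) % number_of_primary_shards.
    crc32 is a stand-in hash chosen for illustration."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

# Default behaviour: the routing value is the document's _id.
by_id = shard_for("doc1", 5)

# Custom routing: every document of the same user goes to the same shard.
order_a = shard_for("user_42", 5)
order_b = shard_for("user_42", 5)
```

The modulo also shows why the number of primary shards is fixed when the index is created: changing it would send existing routing values to different shards.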
ES supports a batched search mode, scroll, to avoid the performance problems caused by querying a massive amount of data in a single request.
With a scroll search, ES saves a snapshot of the current data and serves every subsequent batch from that old snapshot.
Data changes made during this period are not visible to the scroll.
The difference between scroll and paging is that scroll is meant for retrieving a large amount of data in batches for internal processing, while paging only retrieves the data for the current page.
Scroll also has the distributed sorting problem, so sorting by _doc is recommended for better performance.
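The snapshot behaviour of scroll can be mimicked with a small generator. This is a toy model, not the scroll API: `open_scroll` is a made-up name, the index is a plain dictionary, and insertion order stands in for the cheap _doc ordering.

```python
def open_scroll(index, batch_size):
    """Freeze the current contents of `index` and return an iterator of batches."""
    snapshot = list(index.items())  # taken once, when the scroll is opened

    def batches():
        for start in range(0, len(snapshot), batch_size):
            yield snapshot[start:start + batch_size]

    return batches()

index = {"1": "a", "2": "b", "3": "c", "4": "d", "5": "e"}
cursor = open_scroll(index, batch_size=2)

first_batch = next(cursor)
index["6"] = "f"  # written after the scroll was opened, so the scroll never sees it
rest = [doc for batch in cursor for doc in batch]
```

Every batch is served from the frozen snapshot, so the document written while the scroll is in progress does not appear in any later batch.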
If the data held in the OS cache is lost, for example because the machine crashes:
- All segment files up to the last commit point have already been synced to disk.
- The translog stores a record of every data change since the last flush.
After the node restarts, the changes in the translog file are replayed and the segments are written to the OS cache again, where they stay until the next commit point occurs.
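The recovery path can be sketched with a toy shard model. The class `Shard` and its attribute names are hypothetical, and the sketch assumes the translog itself has been persisted to disk so that it survives the crash.

```python
class Shard:
    """Toy model of a shard's durability layers (illustrative only)."""

    def __init__(self):
        self.disk_segments = {}  # segments synced to disk at the last commit point
        self.os_cache = {}       # searchable segments that are not yet synced
        self.translog = []       # assumed durable: every change since the last flush

    def index(self, doc_id, doc):
        self.translog.append((doc_id, doc))  # record the change first
        self.os_cache[doc_id] = doc

    def flush(self):
        """Commit point: sync cached segments to disk and start a fresh translog."""
        self.disk_segments.update(self.os_cache)
        self.os_cache.clear()
        self.translog.clear()

    def crash(self):
        self.os_cache.clear()    # everything that lived only in memory is lost

    def recover(self):
        """Replay the translog to rebuild the segments that were lost."""
        for doc_id, doc in self.translog:
            self.os_cache[doc_id] = doc

    def get(self, doc_id):
        return self.os_cache.get(doc_id, self.disk_segments.get(doc_id))


shard = Shard()
shard.index("doc1", {"field_1": "value1"})
shard.flush()                       # doc1 is now safely on disk
shard.index("doc2", {"field_1": "value2"})
shard.crash()
lost = shard.get("doc2")            # None: doc2 existed only in the OS cache
shard.recover()
restored = shard.get("doc2")        # rebuilt from the translog
```

Documents written before the last flush come back from the on-disk segments, and documents written after it come back from the translog replay.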