Elasticsearch core

doc values

When a document is indexed, Elasticsearch builds two structures. One is the inverted index, which is used for search.

The other is a forward index, also called doc values, which is used for sorting, aggregation, filtering, and so on.

Doc values are stored on disk. If there is enough memory, the OS automatically caches them in memory.

doc1: {"field_1": "value1", "field_2": "value1"}
doc2: {"field_1": "value2", "field_2": "value2"}
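To make the two structures concrete, here is a minimal Python sketch built from the two documents above. The dictionary layout is only an illustration of the idea, not how Lucene actually stores either index on disk.

```python
from collections import defaultdict

docs = {
    "doc1": {"field_1": "value1", "field_2": "value1"},
    "doc2": {"field_1": "value2", "field_2": "value2"},
}

# Inverted index: (field, term) -> doc ids. Answers "which docs contain this term?"
inverted = defaultdict(set)
# Forward index (doc values): field -> {doc id -> value}. Answers "what value does this doc hold?"
doc_values = defaultdict(dict)

for doc_id, source in docs.items():
    for field, value in source.items():
        inverted[(field, value)].add(doc_id)
        doc_values[field][doc_id] = value

# Searching goes term -> documents, so it reads the inverted index.
hits = inverted[("field_1", "value2")]

# Sorting goes document -> value, so it reads the forward index.
ordered = sorted(docs, key=lambda d: doc_values["field_1"][d], reverse=True)
```

Sorting with only the inverted index would mean scanning every term to find the value of each hit, which is why the second, document-oriented structure exists.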

query phase

The search request is sent to a coordinating node, which builds a priority queue. The length of the queue is from + size, as set by the paging parameters; with the defaults (from = 0, size = 10) it holds 10 entries
The coordinating node forwards the request to every shard; each shard searches locally and builds its own local priority queue
Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue
This process is called the query phase
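The steps above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch code: `shard_query` and `coordinate` are hypothetical names, and each shard is assumed to be a plain list of `(sort_value, doc_id)` pairs.

```python
import heapq

def shard_query(shard_docs, from_, size):
    """Local step on one shard: keep only the top (from_ + size) entries."""
    return heapq.nlargest(from_ + size, shard_docs)

def coordinate(shards, from_=0, size=10):
    """Coordinating node: collect the local queues and merge them globally."""
    local_queues = [shard_query(docs, from_, size) for docs in shards]
    merged = heapq.nlargest(
        from_ + size, (entry for queue in local_queues for entry in queue)
    )
    # Only the requested page is kept after the global merge.
    return merged[from_:from_ + size]

shards = [
    [(1.0, "a"), (3.0, "b")],
    [(2.0, "c"), (5.0, "d")],
]
page = coordinate(shards, from_=1, size=2)
```

Every shard has to return from + size entries, not just size, because the coordinating node cannot know in advance which shard holds the documents of the requested page. This is also why deep paging gets expensive.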

fetch phase

The documents that reach the coordinating node still have to be sorted, paged, and aggregated, so part of the data may turn out not to be needed. To reduce network overhead, the priority queues contain only doc IDs, not the full documents

After the coordinating node has built the global priority queue, it sends an mget request to the shards to fetch the corresponding documents

After each shard returns its documents to the coordinating node, the coordinating node returns the combined result to the client
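A rough Python sketch of the fetch phase follows. It assumes the global queue is a list of `(shard_id, doc_id)` pairs and that `shard_stores` is a hypothetical in-memory stand-in for the shards; neither is a real Elasticsearch structure.

```python
from collections import defaultdict

def fetch(global_queue, shard_stores):
    """Resolve the doc ids in the global queue to full documents."""
    # Group ids per shard so that each shard receives one multi-get style request.
    ids_by_shard = defaultdict(list)
    for shard_id, doc_id in global_queue:
        ids_by_shard[shard_id].append(doc_id)

    # Each shard returns the full documents for the ids it was asked for.
    fetched = {}
    for shard_id, ids in ids_by_shard.items():
        for doc_id in ids:
            fetched[(shard_id, doc_id)] = shard_stores[shard_id][doc_id]

    # Reassemble in the globally sorted order before replying to the client.
    return [fetched[key] for key in global_queue]

shard_stores = {
    0: {"a": {"title": "first"}},
    1: {"b": {"title": "second"}},
}
result = fetch([(1, "b"), (0, "a")], shard_stores)
```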

Suppose there are five shards, each returning a priority queue sorted by some field, and we need the first four documents after sorting.

First, each shard returns its first four documents in locally sorted order, for example

6 3 2 1

7 5 3 2

3 2 1 0

9 4 2 1

4 3 2 1

The doc IDs are then returned to the coordinating node for processing

The coordinating node looks at the document at the top of each priority queue and orders the queues by that field value, giving

9 4 2 1

7 5 3 2

6 3 2 1

4 3 2 1

3 2 1 0

A queue in position n of this ordering has n - 1 queue heads that rank at least as high as all of its entries, so it can contribute at most 4 - (n - 1) documents to the top four. This eliminates about half of the data; the top four results can only come from the remaining half

9 4 2 1

7 5 3

6 3

4

At this point, mget is used to fetch that data, which is then sorted to get the final result
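The elimination step in this example can be checked with a short Python sketch. It reads the five shard rows as 6 3 2 1, 7 5 3 2, 3 2 1 0, 9 4 2 1 and 4 3 2 1 (an assumption about the intended numbers), and it follows the procedure described in this post rather than the actual merge logic inside Elasticsearch.

```python
# Per-shard queues, each already sorted in descending order.
shard_queues = [
    [6, 3, 2, 1],
    [7, 5, 3, 2],
    [3, 2, 1, 0],
    [9, 4, 2, 1],
    [4, 3, 2, 1],
]
k = 4  # number of documents wanted

# Order the queues by their first (largest) entry.
by_head = sorted(shard_queues, key=lambda q: q[0], reverse=True)

# The queue at 0-based position `rank` has `rank` larger heads above it,
# so at most k - rank of its entries can still reach the top k.
candidates = [q[:max(k - rank, 0)] for rank, q in enumerate(by_head)]

# Sort what is left and keep the top k.
top_k = sorted((v for q in candidates for v in q), reverse=True)[:k]
```

Only 10 of the 20 entries survive the elimination, which matches the "about half" estimate above.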

bouncing results

Suppose two documents have the same value in the sort field

On different shard copies they may be stored in a different order
Each request may be served by a different replica shard, so you may see the search results in a different order from one request to the next

Solution: use preference

It determines which shard copies are used to perform the search

If preference is set to a string such as the user_id, every search by the same user goes to the same replica shards, so bouncing results do not occur.
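As a small illustration, the sketch below builds a search URL that carries a per-user preference string. The host, index name, and the `user_` prefix are assumptions made up for the example; only the `preference` query parameter itself comes from the search API.

```python
from urllib.parse import urlencode

def search_url(base_url, index, user_id):
    """Build a _search URL that pins one user to the same shard copies."""
    # Custom preference strings must not start with "_";
    # the "user_" prefix is just a hypothetical naming convention.
    params = urlencode({"preference": f"user_{user_id}"})
    return f"{base_url}/{index}/_search?{params}"

url = search_url("http://localhost:9200", "products", 42)
```

Any stable string works, for example a session id, as long as the same user keeps sending the same value.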

routing

By default a document is routed by its _id. If the shard a document belongs to should be decided by some other field, pass that field's value as the routing parameter (routing=<value>)
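The idea behind routing is that the shard is derived from a hash of the routing value modulo the number of primary shards. The Python sketch below shows that idea only: `pick_shard` is a hypothetical helper, and `zlib.crc32` stands in for the hash function Elasticsearch uses internally.

```python
import zlib

def pick_shard(routing_value, num_primary_shards):
    """Map a routing value to a shard number in [0, num_primary_shards)."""
    # crc32 is a stand-in hash chosen for this illustration only.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# Default behaviour: the document's _id is the routing value.
shard_for_id = pick_shard("doc-1001", 5)

# Custom routing: every document routed with the same value lands on one shard.
shard_for_user_a = pick_shard("user_42", 5)
shard_for_user_b = pick_shard("user_42", 5)
```

Because the result depends on the number of primary shards, that number cannot be changed freely after the index has been created.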

Scroll search

ES supports a batch search mode, scroll, to avoid the performance problems caused by querying a massive amount of data in one request.

With a scroll search, ES saves a snapshot of the current data and then serves every batch from that old snapshot.

Changes made to the data during this period are not visible to the user of the scroll.

The difference between scroll and paging is that scroll is meant for a system that reads a large amount of data batch by batch, while paging only fetches the data for the current page.

Scroll also has the distributed sorting problem, so sorting by _doc is recommended for better performance.
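A minimal sketch of the two requests involved in a scroll follows. The functions only build the path and JSON body and send nothing; the function names, the index name, and the one-minute keep-alive are assumptions chosen for the example.

```python
def first_scroll_request(index, batch_size, keep_alive="1m"):
    """Opening request: creates the snapshot and returns the first batch."""
    path = f"/{index}/_search?scroll={keep_alive}"
    body = {
        "size": batch_size,          # documents per batch
        "query": {"match_all": {}},
        "sort": ["_doc"],            # index order, the cheapest sort for scrolling
    }
    return path, body

def next_scroll_request(scroll_id, keep_alive="1m"):
    """Follow-up request: fetches the next batch from the same snapshot."""
    return "/_search/scroll", {"scroll": keep_alive, "scroll_id": scroll_id}

path, body = first_scroll_request("logs", 1000)
next_path, next_body = next_scroll_request("<scroll_id from the previous response>")
```

The client keeps sending the follow-up request with the scroll id from the latest response until a batch comes back empty.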

Process for restoring data after an outage

Suppose the node goes down and the data that was only in the OS cache is lost

The disk holds all segment files that were synced as of the last commit point.
The translog stores a record of every data change since the last flush.

After the node restarts, the changes recorded in the translog are replayed and the resulting segments are written to the OS cache again, where they stay until the next commit point.
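The recovery described above amounts to "last committed state plus replayed changes". Here is a toy Python model of it, assuming a hypothetical translog format of `(operation, doc_id, document)` tuples; the real translog is a binary file with more operation types.

```python
def recover(committed_segments, translog):
    """Rebuild the in-memory state after a restart.

    committed_segments: dict of doc_id -> document as of the last commit point.
    translog: ordered list of (op, doc_id, document) recorded since the last flush.
    """
    state = dict(committed_segments)  # start from what survived on disk
    for op, doc_id, document in translog:
        if op == "index":
            state[doc_id] = document  # create or overwrite
        elif op == "delete":
            state.pop(doc_id, None)   # ignore deletes of unknown ids
    return state

on_disk = {"1": {"v": 1}}
translog = [("index", "2", {"v": 2}), ("delete", "1", None), ("index", "2", {"v": 3})]
restored = recover(on_disk, translog)
```

Replaying in the original order matters: the second write to document 2 must win over the first.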
