
Data writing process

  1. The client sends the request to a node of its choosing; that node becomes the coordinating node (coordinate node).
  2. The coordinating node routes on the document (by hashing its id) and forwards the request to the node that holds the corresponding primary shard (a routing sketch follows this list).
  3. The primary shard on that node processes the request, then synchronizes the data to its replica shards.
  4. Once the coordinating node sees that the primary shard and all replica shards have completed the write, it returns the response to the client.
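
To make the routing concrete, here is a minimal sketch of how a coordinating node picks the primary shard for a write, plus the indexing call itself. It assumes the elasticsearch-py 8.x client, a local cluster at localhost:9200, and a hypothetical `articles` index; Python's built-in `hash()` stands in for the Murmur3 hash Elasticsearch actually uses.

```python
from elasticsearch import Elasticsearch

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    # Simplified form of Elasticsearch's routing formula:
    #   shard = hash(_routing) % number_of_primary_shards
    # where _routing defaults to the doc id. Python's hash() stands in
    # for the Murmur3 hash Elasticsearch really uses.
    return hash(doc_id) % num_primary_shards

es = Elasticsearch("http://localhost:9200")

# The coordinating node receives this request, computes the target shard
# from the id, forwards it to the primary shard, and only acknowledges
# once the primary and its replicas have the document.
es.index(index="articles", id="42", document={"title": "hello"})
```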

Data reading process

Hash the doc ID to determine which shard the document was assigned to when it was written, and query from that shard.

  1. The client sends a request to any node, which becomes the coordinate node (coordinating node).
  2. The coordinating node hashes the doc id to route the request and forwards it to the corresponding node. A round-robin random polling algorithm picks one copy from the primary shard and all of its replicas, balancing the read load across them (a get-by-id example follows this list).
  3. The node that receives the request returns the document to the coordinating node.
  4. The coordinating node returns the document to the client.
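
A get by id looks like this from the client side; the hashing and round-robin all happen inside the cluster. A minimal sketch, again assuming the elasticsearch-py 8.x client, a local cluster, and the hypothetical `articles` index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The coordinating node hashes "42" to find the owning shard, then
# round-robins the read across that shard's primary and replica copies.
doc = es.get(index="articles", id="42")
print(doc["_source"])
```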

Data search process

Full-text retrieval works in two phases:

  1. The client sends a request to a coordinating node.
  2. The coordinating node forwards the search request to all of the index's shards, hitting either the primary shard or a replica shard of each.
  3. Query phase: each shard returns its own matches (just the doc ids) to the coordinating node, which merges, sorts, and paginates them into the final result list.
  4. Fetch phase: the coordinating node then goes back to the nodes and pulls the actual document data by doc id, returning it to the client (a search example follows this list).
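
The query-then-fetch flow is visible in an ordinary paginated search. A hedged sketch with the same assumed client, cluster, and index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Query phase: every shard runs this query locally and sends back only
# lightweight (doc id, score) pairs; the coordinating node merges and
# sorts them into a single global top-10.
# Fetch phase: the coordinating node then pulls the full _source of the
# winning ids from the shards that hold them.
resp = es.search(
    index="articles",
    query={"match": {"title": "hello"}},
    from_=0,
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```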

Underlying principles of writing data

  1. Data is first written to an in-memory buffer; while it sits in the buffer it is not yet searchable. The same data is also appended to a translog log file.
  2. When the buffer is nearly full, or after a fixed interval, its contents are refreshed into a new segment file. The data does not go straight to a segment file on disk; it lands first in the os cache. This process is called refresh.
  3. By default, refresh runs every second: ES writes the buffer's data from the last second into a new segment file, so one new segment file is produced per second, held in the os cache.
  4. If the buffer holds no data, no refresh is performed; if it does, a refresh runs once per second by default.
  5. Operating systems keep a cache, the os cache, in front of disk files: data headed for a disk file first enters this memory cache at the operating-system level. As soon as the refresh operation flushes buffer data into the os cache, that data becomes searchable.
  6. This is why ES is called quasi-real-time, or NRT (near real-time): with the default one-second refresh, written data only becomes visible to search about one second later. You can also trigger a refresh manually through the ES RESTful API or the Java API, flushing the buffer into the os cache so the data is searchable immediately (a sketch follows this list). Once the data is in the os cache, the buffer is cleared; it is no longer needed, because the data has already been persisted to disk in the translog.
  7. As new data keeps entering the buffer and the translog, this cycle repeats: buffer data is written to one segment file after another, and each refresh empties the buffer while the translog is retained. The translog therefore grows and grows, and when it reaches a certain size, it fires a commit operation.
  8. The first step of the commit is to refresh whatever is still in the buffer into the os cache and empty the buffer. A commit point is then written to a disk file, identifying all the segment files this commit point covers, and all current data in the os cache is forcibly fsynced to disk files. Finally the existing translog is emptied and a new translog is started, completing the commit.
  9. This commit operation is called flush. By default a flush runs automatically every 30 minutes, and it also runs when the translog grows too large. Flush corresponds to the whole commit process, and you can trigger it manually through the ES API to fsync os-cache data to disk.
  10. What is the translog log file for? Before a commit, data lives either in the buffer or in the os cache, both of which are memory; once the machine dies, everything in memory is lost. Therefore each operation is also written to a dedicated log file, the translog, and when the machine restarts, ES automatically replays the translog to restore data into the buffer and the os cache.
  11. The translog itself is written to the os cache first and is fsynced to disk every 5 seconds by default, so by default about 5 seconds of data may exist only in memory, in the buffer or the translog's os cache. If the machine goes down at that point, those 5 seconds of data are lost. This default favors performance and bounds the loss at 5 seconds; you can instead configure the translog to fsync on every write operation, which loses nothing but performs much worse (see the settings sketch after this list).
  12. If an interviewer asks about data loss in ES, you can point out two things: first, ES is quasi-real-time, so data becomes searchable about 1 second after it is written; second, you can lose data, since up to 5 seconds of it sits only in the buffer, the translog os cache, and the segment file os cache rather than on disk, and a crash at that moment loses those 5 seconds.
  13. To summarize: data is written to the buffer and refreshed into the os cache every 1 s, at which point it becomes searchable. The translog is fsynced to disk every 5 s (so if the machine goes down, at most about 5 seconds of data that exists only in memory is lost). When the translog grows large enough, or every 30 minutes by default, a commit (flush) is triggered, fsyncing all os-cache segment data to disk.
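
The manual refresh and flush mentioned above map directly onto two API calls. A minimal sketch with the same assumed client, cluster, and index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Refresh: push the in-memory buffer into a new segment in the os cache
# right now, instead of waiting for the 1 s refresh interval, making
# the data searchable immediately.
es.indices.refresh(index="articles")

# Flush: the commit operation described above; fsync os-cache segments
# to disk, write a commit point, and start a fresh translog.
es.indices.flush(index="articles")
```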
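The 1-second refresh interval and the 5-second translog loss window are both tunable per index. The setting names below are real Elasticsearch index settings; the index name, values, and the exact `put_settings` keyword shape are illustrative assumptions for the 8.x client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_settings(
    index="articles",
    settings={
        # How often the buffer is refreshed into a searchable segment.
        "index.refresh_interval": "1s",
        # "async" fsyncs the translog every sync_interval (the 5 s
        # window discussed above); "request" fsyncs on every write,
        # losing nothing but performing much worse.
        "index.translog.durability": "async",
        "index.translog.sync_interval": "5s",
    },
)
```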

Underlying principles of deleting/updating data

  1. For a delete operation, a .del file is generated at commit time that marks the doc as being in a deleted state; searches then consult the .del file to know whether a doc has been deleted.
  2. For an update operation, the original doc is marked as deleted, and then a new piece of data is written.
  3. Every refresh of the buffer produces a segment file, so by default a new segment file appears every second and segment files keep accumulating. Merge operations therefore run periodically: each merge combines several segment files into one, physically deletes the docs that were marked deleted, writes the new segment file to disk, writes a commit point identifying all the new segment files, opens them for search, and removes the old segment files (an example follows this list).
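
At the API level, delete and update look ordinary; the mark-as-deleted bookkeeping and the segment merges all happen underneath. A sketch with the same assumed 8.x client, cluster, and index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Delete: the doc vanishes from results right away, but physically it
# is only marked deleted (.del) until a segment merge removes it.
es.delete(index="articles", id="42")

# Update: under the hood, the old doc is marked deleted and a new
# version of the document is indexed.
es.update(index="articles", id="43", doc={"title": "hello again"})
```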