How ES writes data
- The client selects a node and sends the request to it; this node becomes the coordinating node.
- The coordinating node routes the document (by hashing its id) and forwards the request to the node that holds the corresponding primary shard.
- The primary shard on that node processes the request and then synchronizes the data to the replica shards.
- Once it sees that the primary shard and all replica shards have acknowledged the write, the coordinating node returns the response to the client.
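As a concrete illustration, here is a minimal sketch of such a write using Python's requests library against the ES REST API; the host, index name, and document body are all placeholders:

```python
import requests

ES = "http://localhost:9200"  # assumed local node; any node can coordinate

# Index a document. Whichever node receives this request acts as the
# coordinating node: it routes the doc id to the primary shard, which
# replicates the write to its replicas before the call returns.
resp = requests.put(
    f"{ES}/my-index/_doc/1",
    json={"title": "hello es", "views": 42},
)
# The response's _shards block shows how many copies acknowledged the write.
print(resp.json()["_shards"])
```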
The underlying principles of writing data
The write first goes into the in-memory buffer; while the data sits in the buffer it is not yet searchable. At the same time, the data is appended to the translog log file.
When the memory buffer is nearly full, or after a certain interval, its contents are refreshed into a new segment file. The data does not land in the segment file on disk directly; it first enters the OS cache. This process is called refresh.
By default, every second ES writes the contents of the buffer into a new segment file, which therefore holds all the data written to the buffer during the previous second. If the buffer happens to be empty at that moment, the refresh is of course skipped; if it contains data, a refresh runs once per second and produces a new segment file.
Operating systems keep a cache for disk files, the OS cache: data written to a disk file first enters this operating-system-level memory cache. As soon as the refresh operation moves the buffered data into the OS cache, that data becomes searchable.
Why is ES called near real-time? NRT stands for Near Real-Time. With the default refresh interval of one second, data written to ES cannot be seen until about a second later, hence near real-time. You can also trigger a refresh manually through the ES RESTful API or the Java API, pushing the buffered data into the OS cache so it becomes searchable immediately. Once the data has entered the OS cache the buffer is cleared; nothing needs to be kept there, because the data has already been persisted in the translog.
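Refresh behavior is exposed through real APIs; a quick sketch, again with requests and a placeholder index name:

```python
import requests

ES = "http://localhost:9200"

# Force a refresh: pushes buffered data into the OS cache as a new
# segment, making it searchable immediately rather than ~1s later.
requests.post(f"{ES}/my-index/_refresh")

# The refresh interval is a dynamic index setting (1s by default).
requests.put(
    f"{ES}/my-index/_settings",
    json={"index": {"refresh_interval": "30s"}},  # trade freshness for throughput
)
```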
The cycle then repeats: new data keeps arriving in the buffer and the translog, and buffered data keeps being written into new segment files. After each refresh the buffer is cleared, but the translog remains. As this goes on the translog grows larger and larger, and when it reaches a certain size, a commit operation is triggered.
The first step of the commit is to refresh whatever is currently in the buffer into the OS cache, emptying the buffer. Next, a commit point is written to a disk file; it identifies all the segment files that exist as of this commit, and all data currently sitting in the OS cache is fsynced into the segment files on disk. Finally, the existing translog is cleared and a new one is started, which completes the commit.
This commit operation is called a flush. By default a flush runs automatically every 30 minutes, but it is also triggered if the translog grows too large. You can perform the full commit manually, flushing data from the OS cache to disk, via the ES API.
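A sketch of both the manual flush and the translog size threshold that triggers it, assuming the same placeholder index:

```python
import requests

ES = "http://localhost:9200"

# Trigger the commit (flush) by hand: fsync the segments sitting in the
# OS cache to disk, write a commit point, and start a fresh translog.
requests.post(f"{ES}/my-index/_flush")

# The translog size that forces a flush is configurable
# (index.translog.flush_threshold_size, 512mb by default).
requests.put(
    f"{ES}/my-index/_settings",
    json={"index": {"translog": {"flush_threshold_size": "1gb"}}},
)
```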
What is the translog for? Until a commit happens, data lives either in the buffer or in the OS cache, and both are memory: if the machine dies, the data in memory is lost. When the machine comes back up, ES automatically replays the translog, restoring the data into the memory buffer and the OS cache.
The translog itself is written to the OS cache first and, by default, flushed to disk every 5 seconds. So by default up to 5 seconds of data may exist only in the OS cache, whether in the buffer, the translog file, or a segment file; if the machine fails at that moment, those 5 seconds of data are lost. This gives better performance at the cost of losing at most 5 seconds of data. You can instead configure the translog so that every write is fsynced directly to disk, but performance will be much worse.
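The two modes correspond to the index.translog.durability setting (note that the default varies across ES versions); a sketch:

```python
import requests

ES = "http://localhost:9200"

# Async durability: fsync the translog on a timer
# (index.translog.sync_interval, 5s by default), accepting up to that
# much data loss on a crash in exchange for better write throughput.
requests.put(
    f"{ES}/my-index/_settings",
    json={"index": {"translog": {"durability": "async"}}},
)

# Per-request durability: fsync the translog on every write. No loss on
# crash, but each write pays the fsync cost.
requests.put(
    f"{ES}/my-index/_settings",
    json={"index": {"translog": {"durability": "request"}}},
)
```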
If you get this far in an interview and the interviewer has not asked about ES losing data, you can volunteer it: ES is, first of all, near real-time, so data becomes searchable about one second after it is written; and yes, you can lose data, because up to 5 seconds of it may live only in the buffer, the translog OS cache, or the segment-file OS cache rather than on disk, and a machine crash at that moment loses those 5 seconds.
To summarize:
- Data is written to the memory buffer, then refreshed into the OS cache every 1 second, at which point it becomes searchable (hence the roughly 1-second delay before written data is visible).
- The translog is flushed to disk every 5 seconds (so if the machine goes down and everything in memory is lost, at most 5 seconds of data disappears).
- When the translog grows large enough, or every 30 minutes, a flush commits the segment files in the OS cache to disk and starts a new translog.
After the data is written to a segment file, an inverted index is built for it.
How ES reads data
If you query by doc id, ES hashes the doc id to determine which shard that doc was assigned to.
Write requests go to the primary shard and are synchronized to all replica shards; read requests can be served by either the primary shard or a replica shard, selected by a round-robin algorithm.
- The client sends a request to any node; that node becomes the coordinating node.
- The coordinating node hashes the doc id and forwards the request to the corresponding node. Round-robin is used to pick one copy among the primary shard and all its replicas, which load-balances read requests.
- The node that receives the request returns the document to the coordinating node.
- The coordinating node returns the document to the client.
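The routing step can be sketched roughly as below; ES actually hashes the routing value (the doc id by default) with Murmur3, so Python's built-in hash is only a stand-in here:

```python
import requests

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    # ES computes shard_num = hash(_routing) % num_primary_shards, where
    # _routing defaults to the doc id; the real hash is Murmur3.
    return hash(doc_id) % num_primary_shards

print(pick_shard("1", 5))  # which shard (and its replicas) holds doc 1

# A get-by-id request; the coordinating node performs this routing and
# round-robins across the chosen shard's primary and replica copies.
resp = requests.get("http://localhost:9200/my-index/_doc/1")
print(resp.json()["_source"])
```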
The underlying principles of deleting and updating data
In the case of a delete, a .del file is generated at commit time that marks the doc as deleted, so a search can consult the .del file to tell whether a doc has been deleted.
In the case of an update, the original doc is marked as deleted and a new doc is written.
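Over the REST API the two operations look like this (index and field names are illustrative); note that neither rewrites existing segment files in place:

```python
import requests

ES = "http://localhost:9200"

# Update: the old version of the doc is marked deleted and a new doc is
# written; the partial "doc" body is merged into the existing source.
requests.post(f"{ES}/my-index/_update/1", json={"doc": {"views": 43}})

# Delete: the doc is only marked as deleted (the .del bookkeeping) and
# is physically removed later, during segment merging.
requests.delete(f"{ES}/my-index/_doc/1")
```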
Every refresh of the buffer produces a segment file, by default one per second, so segment files pile up over time. ES therefore merges segments periodically: during a merge, multiple segment files are combined into one, docs that were marked as deleted are physically removed, the new segment file is written to disk, a commit point is written identifying all the new segment files, the new segment files are opened for search, and the old ones are deleted.
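Merging normally happens in the background, but it can also be requested explicitly via the _forcemerge endpoint; a sketch:

```python
import requests

# Merge the index down to a single segment; docs marked as deleted are
# physically dropped along the way. Best reserved for indices that are
# no longer being written to.
requests.post("http://localhost:9200/my-index/_forcemerge?max_num_segments=1")
```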
Inverted index
In a search engine, each document has a corresponding document id, and the document's content is represented as a collection of keywords. For example, suppose document 1 yields 20 keywords after word segmentation, with each keyword recording how many times and where it occurs in the document.
An inverted index is then a mapping from keywords to document ids, each keyword corresponding to the set of documents in which it appears.
Take an example. Suppose we have a handful of short documents; after segmenting each one into words, we obtain an inverted index mapping every word to the documents that contain it, as in the sketch below.
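A toy sketch of the idea in Python, with two made-up documents; a real engine also records term frequencies and positions, which are omitted here:

```python
from collections import defaultdict

# Two made-up documents, keyed by doc id.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}

# Build the inverted index: map each term to the set of doc ids that
# contain it. Word segmentation here is just a whitespace split.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

for term, ids in inverted.items():
    print(term, sorted(ids))
# the [1, 2], quick [1], brown [1, 2], fox [1], lazy [2], dog [2]
```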
There are two important details to note about inverted indexes:
- Every term in the inverted index corresponds to one or more documents.
- Terms in an inverted index are arranged in ascending lexicographical order.
The above is just a simple example, and its terms are not kept in strict ascending dictionary order.
This article mainly records the working process and principles behind ES reads, writes, updates, and deletes. It does not go over the basic concepts; if you are not yet familiar with the basics of ES, spend half an hour getting acquainted with them and write a small demo to run.