Segment file: The basic unit of Elasticsearch storage is the shard. An ES index can be divided into multiple shards, and each shard is actually a Lucene index. Each Lucene index in turn consists of multiple segments, and each segment is essentially a set of inverted indexes. Newly created documents are written to a new segment; existing segments are never modified. Likewise, deleting a document only marks it as deleted in its segment; it is not physically removed right away. An ES index can therefore be understood as an abstract concept layered over shards and segments. By default, ES generates a new segment file every second when there is new data to write. When there are too many segment files, ES automatically merges them and physically removes the documents that were marked as deleted.
Commit: For data safety, every index change would ideally be persisted to disk immediately. A commit writes the segments to disk and records a commit point, which ensures that data held in memory is not lost. Flushing to disk is a heavy I/O operation, however, so for the sake of machine performance and near-real-time search it is not performed on every change.
Commit point: Every commit point maintains a .del file, because a delete in ES is not a physical delete. When ES deletes a document, it records in the .del file that the document has been deleted; the document itself stays in its segment. A query can therefore still match a deleted document inside a segment, but before the results are returned, deleted documents are filtered out using the .del file maintained by the commit point.
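The relationship between immutable segments, delete markers, and merging can be sketched with a toy model. This is not how Lucene is implemented; `ToyShard`, `Segment`, and their methods are hypothetical names, and the set of deleted IDs is kept per shard rather than per segment to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """An immutable batch of (doc_id, text) pairs."""
    docs: tuple


@dataclass
class ToyShard:
    segments: list = field(default_factory=list)
    deleted: set = field(default_factory=set)  # plays the role of the .del file

    def add_segment(self, docs):
        # New documents always go into a new segment; old segments are untouched.
        self.segments.append(Segment(tuple(docs.items())))

    def delete(self, doc_id):
        # A delete only records a marker; the segment still contains the document.
        self.deleted.add(doc_id)

    def physical_count(self):
        # Number of documents actually stored, including ones marked as deleted.
        return sum(len(seg.docs) for seg in self.segments)

    def search(self, term):
        # Match against every segment, then filter out documents marked as deleted.
        hits = [doc_id for seg in self.segments
                for doc_id, text in seg.docs if term in text.split()]
        return [doc_id for doc_id in hits if doc_id not in self.deleted]

    def merge(self):
        # Merging rewrites live documents into one segment and drops deleted ones.
        live = {doc_id: text for seg in self.segments
                for doc_id, text in seg.docs if doc_id not in self.deleted}
        self.segments = [Segment(tuple(live.items()))]
        self.deleted.clear()


shard = ToyShard()
shard.add_segment({1: "quick brown fox", 2: "lazy brown dog"})
shard.add_segment({3: "brown bear"})
shard.delete(2)
results_before_merge = shard.search("brown")   # document 2 is filtered out
stored_before_merge = shard.physical_count()   # document 2 is still stored
shard.merge()
stored_after_merge = shard.physical_count()    # document 2 is physically gone
```

In this model the search result is the same before and after the merge; only the amount of stored data changes.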
Translog: The translog provides a persistent record of all operations that have not yet been flushed to disk. When Elasticsearch starts, it uses the last commit point on disk to recover the known segments, and then replays all translog operations that occurred after that commit. To prevent data loss if Elasticsearch crashes and to ensure reliable storage, ES writes every operation to the translog as well: indexing a new document means the document is first written to the memory buffer and the operation is appended to the translog. Each shard has its own translog. The translog is flushed from cache to disk with fsync either after each request completes or asynchronously every 5 seconds. The per-request fsync is time-consuming. If durability requirements are not strict, the index's translog durability can be set to async, with the trade-off that up to 5 seconds of data can be lost if the node goes down.
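As a rough illustration, the asynchronous mode corresponds to the index settings `index.translog.durability` and `index.translog.sync_interval`. The snippet below only builds the JSON settings body; which index it applies to and how it is sent to the cluster are left out, and the exact behavior should be checked against the documentation for the ES version in use.

```python
import json

# Settings body that switches translog fsync from per-request to periodic.
translog_settings = {
    "index": {
        "translog": {
            "durability": "async",   # the alternative, "request", fsyncs after every request
            "sync_interval": "5s",   # how often the translog is fsynced in async mode
        }
    }
}

body = json.dumps(translog_settings, indent=2)
print(body)
```

The design choice here is throughput versus durability: with `async`, acknowledged writes from the last sync interval can be lost on a crash.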
Refresh: The lightweight process of writing and opening a new segment. When ES receives an indexing request, it first stores the data in an in-memory buffer. By default, once every second the buffered data is written to a new segment in the filesystem cache and the buffer is emptied. At this point the new documents become searchable; this process is called refresh.
Fsync: fsync is a Unix system call that forces data held in the operating system's in-memory file cache to be written to the storage device. ES uses it to flush all segments from the filesystem cache to disk.
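The same system call is available in most languages. As a minimal sketch in Python (the helper `durable_write` and the file name `segment.bin` are made up for illustration), a write goes through three stages: the program's own buffer, the filesystem cache, and finally the disk.

```python
import os
import tempfile


def durable_write(path, data: bytes) -> None:
    """Write data and force it to the storage device before returning."""
    with open(path, "wb") as f:
        f.write(data)          # data lands in Python's user-space buffer
        f.flush()              # push it to the OS (filesystem cache)
        os.fsync(f.fileno())   # ask the OS to write the cached data to disk


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.bin")  # hypothetical file name
    durable_write(path, b"segment data")
    with open(path, "rb") as f:
        content = f.read()
```

Without the `os.fsync` call the file would still be readable right away, because reads are served from the filesystem cache, but the data could be lost on a power failure.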
Flush: By default, a flush is triggered every 30 minutes or when the translog reaches 512 MB. During a flush, the contents of the memory buffer are written to a new segment, the buffer is emptied, and a commit point is written to disk. In addition, the data in the filesystem cache is flushed to disk through fsync, and the old translog is cleared.
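How the buffer, translog, refresh, and flush fit together can be sketched with a small in-memory simulation. `ToyShardWriter` and its methods are hypothetical names, not Elasticsearch APIs, and the translog is treated as always durable (as if it were fsynced on every request).

```python
class ToyShardWriter:
    def __init__(self):
        self.buffer = []              # in-memory indexing buffer, lost on crash
        self.translog = []            # operations since the last commit (assumed durable)
        self.cached_segments = []     # in filesystem cache: searchable, not yet fsynced
        self.committed_segments = []  # fsynced and covered by the last commit point

    def index(self, doc):
        # Every operation goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment; no fsync involved.
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Refresh, "fsync" cached segments, record the commit, clear the translog.
        self.refresh()
        self.committed_segments.extend(self.cached_segments)
        self.cached_segments.clear()
        self.translog.clear()

    def search(self):
        # Only documents that are in some segment are visible.
        segments = self.committed_segments + self.cached_segments
        return [doc for seg in segments for doc in seg]

    def crash_and_recover(self):
        # Everything not fsynced is lost, then the translog is replayed.
        self.buffer.clear()
        self.cached_segments.clear()
        self.buffer.extend(self.translog)
        self.refresh()


writer = ToyShardWriter()
writer.index("a")
visible_before_refresh = writer.search()   # "a" is only in the buffer
writer.refresh()
visible_after_refresh = writer.search()    # "a" is now searchable
writer.flush()
writer.index("b")
writer.refresh()
writer.crash_and_recover()
visible_after_recovery = writer.search()   # "b" is restored from the translog
```

In this model, a document is searchable after a refresh but only survives a crash because it is either in a committed segment or still recorded in the translog.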
Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk, so that data is not lost in the event of a power outage. But fsync is expensive; performing it every time a document is indexed would cause significant performance problems. What is needed is a lighter way to make a document searchable, which means taking fsync out of that path. Between ES and the disk sits the filesystem cache. As described earlier, documents in the memory buffer are written to a new segment. The new segment is written to the filesystem cache first, which is cheap, and only flushed to disk later, which is expensive. Once the file is in the filesystem cache, however, it can be opened and read like any other file. Lucene allows new segments to be written and opened, making the documents they contain visible to search without a full commit. This is much cheaper than a commit and can be done frequently without hurting performance. Search in ES relies on refresh. The default refresh interval is 1s, so search is not real-time but near real-time.
Refer to the following websites:
Blog.csdn.net/lsgqjh/arti…
www.jianshu.com/p/15837be98…
Blog.csdn.net/wx152815940…
Blog.csdn.net/u013129944/…
Developer.51cto.com/art/202009/…