So far, I have worked with a lot of middleware, such as Elasticsearch, Redis, HDFS, Kafka, and HBase.

As you can see, their persistence mechanisms are not all that different. Today I want to summarize them: partly to review these components, and partly to distill the common persistence “routine”, so that those of you who have not yet met these middleware will find them much easier to learn.

Persistence

Let’s take a quick look at the persistence mechanisms of each middleware/component and wrap it up.

Why persist? The reason is simple: data needs to be stored, and you don’t want to lose it if something goes wrong.

Elasticsearch

Elasticsearch is a full text search engine that is very good at fuzzy search.

Elasticsearch first writes data to an in-memory buffer and also appends it to the translog buffer; the translog buffer is then flushed to disk every 5s.

![](https://upload-images.jianshu.io/upload_images/24533109-bc9adce251d554cb?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
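As a rough sketch (not Elasticsearch’s real implementation), the double-buffer-plus-interval-flush idea can be modeled like this in Python, with a simulated clock passed in as `now`:

```python
class TranslogBuffer:
    """Toy model of the Elasticsearch write path described above:
    documents go into an in-memory index buffer and a translog buffer;
    the translog buffer is flushed to "disk" on a fixed interval."""

    def __init__(self, flush_interval=5.0):
        self.flush_interval = flush_interval
        self.index_buffer = []      # in-memory buffer (refresh not modeled)
        self.translog_buffer = []   # not yet durable
        self.translog_on_disk = []  # durable
        self.last_flush = 0.0

    def index(self, doc, now):
        self.index_buffer.append(doc)
        self.translog_buffer.append(doc)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        # flush the translog buffer to disk once the interval has elapsed
        if now - self.last_flush >= self.flush_interval:
            self.translog_on_disk.extend(self.translog_buffer)
            self.translog_buffer.clear()
            self.last_flush = now
```

A document indexed at `now=1.0` sits only in the two buffers; by the time another document arrives after the 5s interval, both are flushed to the durable translog.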

Kafka

Kafka is known to be a high-throughput message queue, but how does it persist?

Kafka relies heavily on the filesystem for storing and caching messages.

Yes, Kafka uses the filesystem for storage.

Kafka’s persistence relies on pagecache and sequential writes.

![](https://upload-images.jianshu.io/upload_images/24533109-77b208b9edb4afa5?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
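A minimal sketch of the sequential-append idea; `LogSegment` is an invented name, and real Kafka’s segment format and offset handling are far more involved:

```python
import os
import tempfile

class LogSegment:
    """Toy Kafka-style log segment: messages are appended sequentially
    to one file. Writes land in the OS page cache first; durability in
    real Kafka comes from sequential layout plus replication rather
    than fsync-ing every message."""

    def __init__(self, path):
        self.f = open(path, "ab")   # append-only, sequential writes
        self.next_offset = 0

    def append(self, payload: bytes) -> int:
        offset = self.next_offset
        # length-prefixed record; the bytes sit in buffers/page cache first
        self.f.write(len(payload).to_bytes(4, "big") + payload)
        self.next_offset += 1
        return offset

    def flush(self):
        self.f.flush()              # hand buffered bytes to the OS
        os.fsync(self.f.fileno())   # force the page cache to disk

path = os.path.join(tempfile.mkdtemp(), "00000000.log")
seg = LogSegment(path)
assert seg.append(b"hello") == 0
assert seg.append(b"kafka") == 1
seg.flush()
```

Because every record goes to the end of the same file, the disk only ever sees sequential writes, which is what makes this path fast even on spinning disks.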

HDFS

HDFS is a distributed file system that can store massive amounts of data. How does HDFS write data?

The HDFS client must go through the NameNode to read, add, delete, or modify file metadata; the NameNode is dedicated to maintaining that metadata.

Therefore, to write data to HDFS, the client first asks the NameNode which DataNodes the file’s split-up blocks should be written to.

To keep the NameNode efficient, it applies metadata changes in memory and then appends the sequence of changes to the EditLog file.

![](https://upload-images.jianshu.io/upload_images/24533109-65dde8a0e5092501?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
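The memory-first, log-second idea can be sketched like this (a toy `NameNode`, not Hadoop’s actual classes): every mutation is recorded in an edit log, so the in-memory metadata can be rebuilt by replaying it after a restart.

```python
class NameNode:
    """Toy model of the NameNode write path: metadata mutations are
    applied to an in-memory map and appended to an edit log that can
    be replayed during recovery."""

    def __init__(self):
        self.metadata = {}   # path -> list of block ids (in memory)
        self.edit_log = []   # sequential record of every mutation

    def create_file(self, path, blocks):
        self.edit_log.append(("create", path, blocks))  # log the change
        self.metadata[path] = blocks                    # apply in memory

    def delete_file(self, path):
        self.edit_log.append(("delete", path))
        self.metadata.pop(path, None)

    @classmethod
    def replay(cls, edit_log):
        # Recovery: rebuild the in-memory state from the edit log
        nn = cls()
        for record in edit_log:
            if record[0] == "create":
                nn.metadata[record[1]] = record[2]
            else:
                nn.metadata.pop(record[1], None)
        nn.edit_log = list(edit_log)
        return nn
```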

Redis

Redis is memory based; if you do not save the data to disk, then once Redis restarts (exit/failure), all the data in memory is lost.

We definitely don’t want to lose all the data in Redis because of some failure (which would send every request straight to MySQL). Even if there is a failure, we want to recover the data Redis originally held, so Redis provides the RDB and AOF persistence mechanisms.

RDB: Based on snapshots, all data at one point in time is saved to a single RDB file.

AOF (append-only file): when the Redis server executes a write command, it appends the command to the AOF file.

The implementation of AOF persistence can be divided into three steps:

Command appending: the command is written to the aof_buf buffer

File writing: flushAppendOnlyFile is called to decide whether to write the aof_buf buffer to the AOF file

File synchronization: decide whether the buffered data is actually synced to the disk

![](https://upload-images.jianshu.io/upload_images/24533109-32e90b59cc03da1a?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
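The three steps can be sketched as follows; `AofWriter` is a toy stand-in, while the `appendfsync` values mirror Redis’s real configuration options (`always`, `everysec`, `no`):

```python
import os
import tempfile

class AofWriter:
    """Toy model of the three AOF steps: commands accumulate in aof_buf,
    flush_append_only_file writes the buffer to the file, and the fsync
    policy decides when the data really hits the disk."""

    def __init__(self, path, appendfsync="everysec"):
        self.f = open(path, "a")
        self.aof_buf = []
        self.appendfsync = appendfsync
        self.synced = 0   # count of fsync calls, for illustration

    def feed_command(self, cmd):
        self.aof_buf.append(cmd)           # step 1: command appending

    def flush_append_only_file(self, second_elapsed=False):
        if self.aof_buf:                   # step 2: file writing
            self.f.write("\n".join(self.aof_buf) + "\n")
            self.aof_buf.clear()
        # step 3: file synchronization, governed by the fsync policy
        if self.appendfsync == "always" or (
            self.appendfsync == "everysec" and second_elapsed
        ):
            self.f.flush()
            os.fsync(self.f.fileno())
            self.synced += 1
```

With `everysec`, a crash can lose at most about one second of commands; with `always`, every write is synced at the cost of throughput.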

HBase

HBase is a database that stores massive data.

When HBase writes data, it writes data to the MemStore first. When the MemStore exceeds a certain threshold, data in the memory is written to hard disks to form a StoreFile. The StoreFile is stored in HFile format. HFile is the storage format of KeyValue data in HBase.

When we write data, it goes into memory first. If the machine goes down before the MemStore is flushed to disk, that in-memory data would die with it, so HBase also writes an HLog when it writes to the MemStore.

![](https://upload-images.jianshu.io/upload_images/24533109-e38894f249182de3?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
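A toy sketch of this write path, with invented names and a tiny flush threshold so the flush is easy to trigger:

```python
class Region:
    """Toy model of the HBase write path: every put is recorded in the
    HLog (write-ahead log) first, then in the MemStore; past a threshold
    the MemStore is flushed to an immutable StoreFile."""

    def __init__(self, memstore_limit=3):
        self.hlog = []         # write-ahead log (durable in real HBase)
        self.memstore = {}     # sorted in real HBase; a dict is enough here
        self.store_files = []  # flushed, immutable files on "disk"
        self.memstore_limit = memstore_limit

    def put(self, row, value):
        self.hlog.append((row, value))  # WAL first, so a crash is recoverable
        self.memstore[row] = value
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        # dump the MemStore as one immutable StoreFile (HFile in real HBase)
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()
```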

MySQL

How does InnoDB, the storage engine we use most with MySQL, persist data? It has a redo log to support persistence.

MySQL introduces the redo log: changes are made to pages in memory, and at the same time a redo log record is written describing what change was made on which page, so the change survives even if the modified page has not yet been flushed to disk.

![](https://upload-images.jianshu.io/upload_images/24533109-9353cda7359d0db8?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
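A minimal sketch of that idea (a toy model, nothing like InnoDB’s real page format), treating the buffer pool as a dict of pages and a redo record as a (page, key, value) tuple:

```python
class BufferPool:
    """Toy model of redo logging: changes are applied to in-memory
    pages, and each change is also appended to a redo log so it can be
    re-applied after a crash, even if the dirty page was never flushed."""

    def __init__(self):
        self.pages = {}     # page_id -> dict of values (in memory)
        self.redo_log = []  # sequential redo records

    def update(self, page_id, key, value):
        self.redo_log.append((page_id, key, value))  # what changed, where
        self.pages.setdefault(page_id, {})[key] = value

    def crash_recover(self):
        # lose all in-memory pages, then replay the redo log
        self.pages = {}
        for page_id, key, value in self.redo_log:
            self.pages.setdefault(page_id, {})[key] = value
```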

Conclusion

After seeing the persistence modes of these common middleware/components, I don’t need to say much more, right? The ideas are almost the same; they just give their logs different names.

Write to an in-memory buffer first, and write the log with sequential I/O. This prevents data loss when the machine goes down before the buffered data has been flushed to disk.

As for the log files, middleware cannot let them keep ballooning into huge files:

In Redis, the AOF file is periodically rewritten (AOF rewrite)

In HDFS, the EditLog is periodically merged into an fsimage

In Elasticsearch, the translog triggers a commit operation once it reaches a threshold
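The common trick behind all three is to replay the log into the current state and emit a compact equivalent. A toy sketch, using made-up `set`/`del` records:

```python
def rewrite_log(log):
    """Collapse a full command history into the minimal set of commands
    that reproduces the final state (like an AOF rewrite or an HDFS
    fsimage checkpoint); the old log can then be truncated."""
    state = {}
    for op, key, *val in log:
        if op == "set":
            state[key] = val[0]
        elif op == "del":
            state.pop(key, None)
    return [("set", k, v) for k, v in state.items()]

# four records collapse to one: overwritten and deleted keys drop out
long_log = [("set", "a", 1), ("set", "a", 2), ("set", "b", 9), ("del", "b")]
compact = rewrite_log(long_log)
```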
