Log-structured merge-tree, LSM for short.

Traditional RDBMSs such as MySQL and PostgreSQL use page-oriented storage engines based on B-trees. In a modern computer the biggest processing bottleneck is disk I/O, and a storage engine cannot avoid reading and writing the disk, except for a pure in-memory database; but because memory is volatile, we generally use memory only as a cache layer.

To improve the write performance of a database system, we observe that the sequential write performance of a disk is much better than its random write performance, and can even beat random writes to memory. This is why the LSM data structure was introduced in many write-heavy database systems: it trades read performance and write amplification for write throughput.

Designing a database engine

Let's design a database engine from scratch. The data model is very simple: the simplest key-value structure, where a piece of data has only one key and one value. The only operations are get and put, as follows:

get(key);

put(key, value);

Starting with the simplest design: one data.db file per database. We append each record to the end of the file, just as we would write a log.

key1,value1
key2,value2
key3,value3
key10,value10
key8,value8
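The append-only write described above can be sketched in a few lines of Python (a minimal illustration; the file name data.db comes from the text, the comma-separated record format is an assumption):

```python
# A minimal sketch of the append-only store: put() simply appends a
# "key,value" record to the end of data.db, like writing a log.
def put(path, key, value):
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

put("data.db", "key1", "value1")
put("data.db", "key2", "value2")
```

Because every write is a single append, the disk only ever sees sequential writes.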

With that we are about 80% done; now we need reads. To query key2, we can only scan from the beginning of the file until we find the key2 record:

for (row in rows) {
   if (row.key == "key2") {
      return row;
   }
}

There you have it, a simple database.

What? Finished? Yes, it's done, and no one can deny that it fulfills the most basic functions of a database system, even if it would be roundly criticized.

**This traversal is very expensive.** So how do we improve read performance? Build an in-memory index. The simplest way is to maintain a Map in memory that stores the file offset of each key. Reading a record then takes only one memory lookup plus one disk access.
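A minimal sketch of this hash-index idea (file name and record format are illustrative): the Map remembers the byte offset where each record was appended, so a read is one dict lookup plus one seek.

```python
# In-memory hash index: key -> byte offset of its record in the file.
index = {}

def put(f, key, value):
    index[key] = f.tell()          # remember where this record starts
    f.write(f"{key},{value}\n")

def get(f, key):
    offset = index.get(key)
    if offset is None:
        return None                # key never written
    f.seek(offset)                 # one disk seek...
    _, value = f.readline().rstrip("\n").split(",", 1)
    return value                   # ...one record read

with open("indexed.db", "w+") as f:
    put(f, "key1", "value1")
    put(f, "key2", "value2")
    print(get(f, "key2"))          # -> value2
```

This is essentially the design of Bitcask-style hash-indexed logs: fast point reads, but the index must hold every key.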

Why did the B-tree come into being? What is the disadvantage of the Map-structured index? The Map index solves the performance problem of random single-key reads, but not range queries, such as querying all data with a key between key1 and key200. Hence the B-tree: an ordered tree structure that can easily serve range queries.
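Why ordering helps can be shown without a full B-tree: once the keys are kept sorted, a range query is two binary searches plus a contiguous slice, instead of a scan over the whole hash map (a sketch with toy keys):

```python
# With sorted keys, a range query [lo, hi] is two binary searches
# plus a slice -- no full scan needed.
import bisect

sorted_keys = ["a", "c", "f", "k", "p", "z"]

def range_query(lo, hi):
    i = bisect.bisect_left(sorted_keys, lo)    # first key >= lo
    j = bisect.bisect_right(sorted_keys, hi)   # one past last key <= hi
    return sorted_keys[i:j]

print(range_query("c", "p"))   # -> ['c', 'f', 'k', 'p']
```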

But an index that covers every key must itself fit in memory. When the data grows without bound, such a large index file can no longer be held entirely in memory.

Let’s look at the implementation of LSM.

The LSM framework

SSTable: the on-disk file of the LSM, called an SSTable (Sorted String Table). The LSM stores files on disk sorted by key, which solves the problem above of being unable to index all the data in memory. Because the disk file is ordered, the in-memory index can be a "sparse index": it records one index entry per segment, logically dividing the data into multiple blocks. The sparse index only needs to record the offset of each block; any individual record can then be found by scanning its block. The index size is greatly reduced.
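A sketch of the sparse-index lookup (keys and offsets are made up for illustration): instead of one entry per key, we keep one entry per block and binary-search for the block that could contain the key.

```python
# Sparse index over a sorted file: one (first_key, offset) entry per
# block. Lookup binary-searches for the last block whose first key is
# <= the target, then only that block needs to be scanned.
import bisect

sparse_index = [("apple", 0), ("grape", 300), ("melon", 600)]
first_keys = [k for k, _ in sparse_index]

def block_for(key):
    i = bisect.bisect_right(first_keys, key) - 1
    return sparse_index[max(i, 0)][1]   # offset of the block to scan

print(block_for("kiwi"))   # "grape" <= "kiwi" < "melon" -> 300
```

With, say, one entry per 4 KB block, the index shrinks by orders of magnitude compared with indexing every key.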

Memtable: the in-memory structure of the LSM is called the Memtable. The Memtable is an ordered structure; it can be implemented as a balanced tree or as a skip list. When the LSM writes data, it only needs to write to the in-memory Memtable. When the Memtable reaches a certain size, it is asynchronously flushed to disk, producing the SSTable described above.
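The Memtable-to-SSTable flush can be sketched as follows (a plain dict sorted at flush time stands in for the skip list; threshold, file names, and format are illustrative):

```python
# Memtable sketch: writes go to an in-memory map; when it reaches a
# size threshold, its entries are written out in key order, which is
# exactly what makes the resulting file a Sorted String Table.
MEMTABLE_LIMIT = 3          # illustrative threshold
memtable = {}
sstable_count = 0

def put(key, value):
    global sstable_count
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        path = f"sstable_{sstable_count}.db"
        with open(path, "w") as f:
            for k in sorted(memtable):       # SSTable is key-ordered
                f.write(f"{k},{memtable[k]}\n")
        memtable.clear()
        sstable_count += 1

for i in range(4):
    put(f"key{i}", f"value{i}")
# keys 0-2 were flushed to sstable_0.db; key3 is still in memory
```

Note the flush here is synchronous for brevity; a real engine hands the frozen table to a background thread.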

Immutable Memtable: when data is flushed from the Memtable to an SSTable, the LSM first turns the current Memtable into an Immutable Memtable in memory, to avoid the performance cost of read/write locks. As the name implies, this structure can no longer be modified; new data is written only to a new Memtable. The Immutable Memtable is read by the flushing thread, and queries can also access it, so data still in memory can be served without touching the disk, which improves query efficiency.

WAL: for more information on the WAL, see my previous post, "What exactly is the WAL you often hear about?" In the LSM, before data is flushed to disk, the LSM writes it to the WAL first and only then into the Memtable and SSTables, to prevent data loss on a crash. When the system restarts, the LSM replays the WAL to recover any data that had not yet reached an SSTable, and it clears expired WAL logs to keep them from growing too large.
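The recovery idea can be sketched in a few lines (file name and record format are illustrative): every put is appended to the log before it touches the Memtable, and replaying the log rebuilds the Memtable after a crash.

```python
# WAL sketch: durable append first, then the in-memory write; replay()
# reconstructs the Memtable from the log after a restart.
def wal_put(memtable, key, value, wal_path="wal.log"):
    with open(wal_path, "a") as wal:
        wal.write(f"{key},{value}\n")   # durable first...
    memtable[key] = value               # ...then visible in memory

def replay(wal_path="wal.log"):
    memtable = {}
    try:
        with open(wal_path) as wal:
            for line in wal:
                key, value = line.rstrip("\n").split(",", 1)
                memtable[key] = value   # later entries overwrite earlier
    except FileNotFoundError:
        pass                            # no log yet: empty database
    return memtable

mem = {}
wal_put(mem, "key1", "value1")
wal_put(mem, "key2", "value2")
assert replay() == mem                  # recovered state matches memory
```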

Writing to the LSM

An LSM write consists of four steps:

  1. Write the WAL
  2. Write the Memtable
  3. When the Memtable reaches its threshold, it becomes an Immutable Memtable
  4. Asynchronously flush the Immutable Memtable to disk
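The four steps above can be sketched in one small class (all names are illustrative; the flush is synchronous here for simplicity, where a real engine would do it on a background thread):

```python
# Toy write path: WAL -> Memtable -> Immutable Memtable -> SSTable.
class LSM:
    def __init__(self, limit=2):
        self.limit = limit
        self.memtable = {}
        self.immutable = None
        self.wal = []            # stand-in for the on-disk WAL file
        self.sstables = []       # each flushed table: sorted (key, value) pairs

    def put(self, key, value):
        self.wal.append((key, value))          # 1. write the WAL
        self.memtable[key] = value             # 2. write the Memtable
        if len(self.memtable) >= self.limit:   # 3. threshold reached:
            self.immutable = self.memtable     #    table becomes immutable
            self.memtable = {}                 #    new writes go to a fresh one
            self._flush()                      # 4. flush to an SSTable

    def _flush(self):
        self.sstables.append(sorted(self.immutable.items()))
        self.immutable = None

db = LSM()
db.put("b", "2")
db.put("a", "1")                # hits the threshold and flushes
print(db.sstables)              # -> [[('a', '1'), ('b', '2')]]
```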

Deleting from the LSM

To preserve sequential disk writes, the LSM does not delete data in place. Instead it writes a delete marker (a "tombstone") indicating that the key is deleted; the data is physically removed only during compaction.
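A sketch of the tombstone idea (the sentinel value is illustrative): a delete is just another sequential write, reads treat the sentinel as "not found", and compaction drops it for real.

```python
# Deletion as a tombstone write: no in-place removal, just one more
# sequential record; compaction is where records actually disappear.
TOMBSTONE = "<deleted>"     # illustrative sentinel value

memtable = {}

def delete(key):
    memtable[key] = TOMBSTONE       # a write, not an erase

def get(key):
    value = memtable.get(key)
    return None if value == TOMBSTONE else value

def compact(table):
    # rewriting a table is when tombstoned keys are finally dropped
    return {k: v for k, v in table.items() if v != TOMBSTONE}

memtable["key1"] = "value1"
delete("key1")
print(get("key1"))          # -> None
print(compact(memtable))    # -> {}
```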

Reading from the LSM

The LSM reads in order from the Memtable, the Immutable Memtable, and then the SSTables, until it finds the key or has searched every level without a hit. So when a key does not exist, every layer of files must be read in turn. The LSM can introduce a Bloom filter to first check whether a key could possibly exist, avoiding pointless file scans.
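The layered read path can be sketched as follows (plain dicts stand in for the real structures): newer layers are checked first, so the freshest version of a key always wins.

```python
# Layered read: Memtable first, then the Immutable Memtable, then each
# SSTable from newest to oldest; a miss at every level means "absent".
def lsm_get(key, memtable, immutable, sstables):
    for layer in [memtable, immutable, *reversed(sstables)]:
        if layer is not None and key in layer:
            return layer[key]
    return None

memtable = {"a": "new"}
immutable = {"b": "mid"}
sstables = [{"a": "old", "c": "disk"}]

print(lsm_get("a", memtable, immutable, sstables))  # -> new (memtable wins)
print(lsm_get("c", memtable, immutable, sstables))  # -> disk
print(lsm_get("z", memtable, immutable, sstables))  # -> None
```

The "z" lookup shows why missing keys are the worst case: every layer is consulted, which is exactly what a Bloom filter short-circuits.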

LSM compaction (merging)

LSM’s merge strategy is an important part of LSM that we will cover separately in the next article.

The LSM architecture is widely used: Bigtable, HBase, LevelDB, SQLite4, Tarantool, RocksDB, WiredTiger, Apache Cassandra, and InfluxDB all build on it. In a later article we will go into detail on the LSM implementation in LevelDB or Cassandra.

Recommendation:

The Apache Druid storage system.