Abstract: Hash indexes have two obvious limitations: (1) when the number of keys is large, maintaining the Hash index puts a lot of pressure on memory; (2) range queries are inefficient. How can we overcome these two limitations? That brings us to the protagonist of this article: the LSM tree.

We implemented a key-value database with high write performance using the append-only log data structure. The append-only log owes its high write performance to sequential disk writes. This probably contradicts our intuition about disks, since we tend to assume that writing to disk is always slow. It is not. More precisely, random disk writes are slow, because each random write may first require seeking to a new location on disk. Pure sequential disk writes are very fast; as an ACM report shows, sequential disk writes can even outperform random memory writes!

Kafka, for example, is a high-performance message queue that makes extreme use of sequential disk writes. If producer and consumer rates are similar, messages can even be served from the operating system's page cache. So stop assuming that writing to disk is slow!

The append-only log improves write performance significantly, but it comes with very poor read performance. To address this, we introduced the Hash index, and the optimization was also very effective. However, Hash indexes have two significant limitations: (1) when the number of keys is large, maintaining the Hash index puts a lot of pressure on memory; (2) range queries are inefficient.

How do you optimize these two limitations?

This brings us to the main character of this article, the LSM tree.

The Log-Structured Merge Tree (LSM tree) is not so much a single data structure as a storage model, consisting of a MemTable, Immutable MemTables, and SSTables. It also exploits the append-only log to greatly improve write performance. At the same time, because keys are stored in order, it offers good read performance and overcomes the two limitations of Hash indexes mentioned above. Below, this article takes a step-by-step look at how the LSM tree does this.

SSTable

In the simplest database example, the data is stored unordered, so reading the value of a key requires traversing the entire data file, with O(n) complexity. To improve read performance, we had to maintain a Hash index of all keys in memory.

What if records were sorted by key when storing data?

If keys are stored in order, query efficiency is good even without a Hash index, because we can use binary search to quickly locate a key, with O(log n) complexity. The LSM tree organizes its data storage by key order, in a structure called the SSTable.

The Sorted String Table (SSTable) is the most basic storage structure of the LSM tree. It is stored on disk and sorted by key. The advantage of keeping keys in order is that a key can be found in O(log n) time, a great improvement over the pure append-only log. However, if all the data were stored in a single SSTable, query efficiency would degrade as the data volume grows. As a result, LSM trees typically spread data across multiple SSTables and record the minimum and maximum key of each SSTable, so that we can quickly determine which SSTables might contain a given key.
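To make the min/max-key idea concrete, here is a minimal sketch in Java (class and method names are illustrative, not the article's LsmKvDb source): each SSTable records its key range, and a lookup only binary-searches the SSTables whose range covers the target key.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: each SSTable records its min/max key, so a lookup can skip
// any file whose key range cannot contain the target key.
class SsTableRef {
    final String minKey;
    final String maxKey;

    SsTableRef(String minKey, String maxKey) {
        this.minKey = minKey;
        this.maxKey = maxKey;
    }

    // O(1) range check against the recorded min/max keys.
    boolean mayContain(String key) {
        return key.compareTo(minKey) >= 0 && key.compareTo(maxKey) <= 0;
    }
}

class SsTableLocator {
    // Only the returned SSTables need an actual O(log n) binary search.
    static List<SsTableRef> candidates(List<SsTableRef> tables, String key) {
        return tables.stream()
                     .filter(t -> t.mayContain(key))
                     .collect(Collectors.toList());
    }
}
```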

Example of searching for SSTable data

In LevelDB, for example, an SSTable contains a data storage area for the key-value data and a data management area for indexes and other metadata.

So how do we ensure that SSTables are ordered?

The most common on-disk storage structure for maintaining ordered data is the B/B+ tree. If the SSTable used a similar structure, the first problem would be that every write is accompanied by random disk writes, which hurts write performance and obviously defeats the original purpose of the LSM tree. To balance the ordering of SSTables with write performance, the LSM tree introduces the MemTable.

MemTable

Compared with maintaining an ordered data structure on disk, maintaining ordered data in memory is much simpler, with high read and write performance. Common ordered data structures include the red-black tree, the AVL tree, and the skip list. No matter what order keys are inserted in, they are always read back in sorted order. The MemTable is the ordered data structure the LSM tree maintains in memory. Let's look at how the LSM tree uses the MemTable to achieve both the ordering of SSTables and high write performance:

1. When a write request arrives, the record is first written to the MemTable. This keeps records in memory in order while preserving high write performance.

2. When the MemTable reaches a certain size threshold (usually a few MB), it is turned into an Immutable MemTable (a MemTable that is read-only), and a new MemTable is created to receive write requests.

This is similar to the segment file mechanism in "From Hash Index to LSM Tree (Part 1)", where only the current segment file receives write requests at any given time.

3. A background task periodically writes the data of Immutable MemTables into SSTables. Because an Immutable MemTable is ordered, the data in the SSTable is written sequentially, achieving both ordering and write performance.


The MemTable achieves both ordering and high write performance

The MemTable is usually implemented with a skip list (LevelDB and HBase both take this approach). Compared with AVL trees and red-black trees, a skip list avoids frequent tree-rebalancing operations on insert, so it is more efficient and easier to implement.
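As a rough illustration (not the LevelDB/HBase source), the JDK's ConcurrentSkipListMap is itself a skip list and can serve as a minimal MemTable: keys come back in sorted order no matter the insertion order, and a size threshold decides when to freeze the table. The threshold value below is an assumption for the sketch.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// A minimal MemTable sketch backed by a skip list: the JDK's
// ConcurrentSkipListMap keeps keys sorted regardless of insertion order.
class MemTable {
    private static final int THRESHOLD_BYTES = 4 * 1024 * 1024; // e.g. 4 MB

    private final ConcurrentSkipListMap<String, String> table =
            new ConcurrentSkipListMap<>();
    private long approximateSize = 0;

    void put(String key, String value) {
        table.put(key, value);
        approximateSize += key.length() + value.length();
    }

    String get(String key) {
        return table.get(key);
    }

    // Once large enough, the MemTable should be frozen into an Immutable
    // MemTable and flushed to an SSTable by a background task.
    boolean shouldFreeze() {
        return approximateSize >= THRESHOLD_BYTES;
    }
}
```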

LsmKvDb

When the LSM tree is used as a database storage engine, SSTables are usually managed in levels to facilitate queries and subsequent compact operations. In this article, LsmKvDb is likewise implemented with a leveled SSTable architecture.

LsmKvDb Storage architecture

We begin by abstracting levels, each consisting of multiple SSTables:
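The original code is not reproduced here; a minimal sketch of the level abstraction might look like this (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for an on-disk SSTable (illustrative only).
interface SsTable {
    String get(String key); // null if the key is not in this table
}

// A Level is simply an ordered collection of SSTables.
class Level {
    private final int levelNo;
    private final List<SsTable> ssTables = new ArrayList<>();

    Level(int levelNo) {
        this.levelNo = levelNo;
    }

    void addSsTable(SsTable ssTable) {
        ssTables.add(ssTable);
    }

    // Search newest-first: a later SSTable holds the more recent version.
    String get(String key) {
        for (int i = ssTables.size() - 1; i >= 0; i--) {
            String value = ssTables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```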

When data is written to LsmKvDb, it first goes into the MemTable; when the MemTable reaches the size threshold, it is converted into an Immutable MemTable. The Immutable MemTable has the same data structure as the MemTable; the only difference is that the Immutable MemTable is read-only, while the MemTable is both readable and writable.
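Continuing the sketch (again with illustrative names, not the original LsmKvDb source, and building on the MemTable sketch above), the write path looks roughly like this:

```java
// Sketch of the write path: writes go to the MemTable; when it is full,
// it becomes an Immutable MemTable and is flushed in the background.
class LsmKvDb {
    private volatile MemTable memTable = new MemTable();

    synchronized void put(String key, String value) {
        memTable.put(key, value);
        if (memTable.shouldFreeze()) {
            // Freeze the current MemTable (it becomes read-only) and swap
            // in a fresh one so writes are never blocked for long.
            MemTable immutable = memTable;
            memTable = new MemTable();
            flushInBackground(immutable);
        }
    }

    private void flushInBackground(MemTable immutable) {
        // Elided: a background task writes the sorted entries of
        // `immutable` sequentially into a new Level0 SSTable
        // (the minor compact described below).
    }
}
```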

Compaction

We explained the compact mechanism in "From Hash Index to LSM Tree (Part 1)": its goal is to remove records that have been overwritten or deleted, preventing data files from growing without bound. The same mechanism applies to LSM trees. As data is continuously added, updated, and deleted, overlapping keys or deleted keys inevitably appear across SSTables. Merging SSTables through compaction reclaims disk space.

In the previous article, compact operations on segment files relied primarily on the Hash index. Since the index covered all keys, it was easy to determine, via the new segment file's Hash index, whether a key had already been processed. For SSTables, however, there is no Hash index covering all keys. So how can compact be done efficiently?

Thanks to the ordering of SSTables, we can apply the merge-sort algorithm to compact!
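Here is a minimal k-way merge sketch (illustrative names, not the article's exact code): because every SSTable is sorted by key, a priority queue over the head entry of each table yields a globally sorted output, and duplicate keys resolve to the newest version.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.BiConsumer;

class SsTableMerger {

    // One cursor per input SSTable; higher `generation` means newer data.
    static final class Cursor {
        final Iterator<Map.Entry<String, String>> it;
        Map.Entry<String, String> head;
        final int generation;

        Cursor(Iterator<Map.Entry<String, String>> it, int generation) {
            this.it = it;
            this.head = it.next(); // caller guarantees a non-empty iterator
            this.generation = generation;
        }
    }

    // `inputs` are the sorted entry iterators of the SSTables to merge,
    // ordered oldest-first; `out` receives the merged, deduplicated entries.
    static void merge(List<Iterator<Map.Entry<String, String>>> inputs,
                      BiConsumer<String, String> out) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.head.getKey())
                          .thenComparing(c -> -c.generation)); // newest wins ties
        for (int i = 0; i < inputs.size(); i++) {
            if (inputs.get(i).hasNext()) {
                heap.add(new Cursor(inputs.get(i), i));
            }
        }
        String lastKey = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (!c.head.getKey().equals(lastKey)) { // first hit is the newest
                out.accept(c.head.getKey(), c.head.getValue());
                lastKey = c.head.getKey();
            }
            if (c.it.hasNext()) { // advance this cursor and re-insert it
                c.head = c.it.next();
                heap.add(c);
            }
        }
    }
}
```

The Major and Full Compacts below are both instances of this multi-way merge; only the choice of input SSTables differs.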

There are three types of compaction in an LSM tree: Minor Compact, Major Compact, and Full Compact.

Minor Compact

Minor Compact refers to dumping the data of an Immutable MemTable directly into a Level0 SSTable.

minor compact

In Level0, keys may overlap between SSTables, because the data of each Immutable MemTable is dumped directly into its own SSTable.

Major Compact

Major Compact refers to merging the SSTables of Level N into Level N+1.

Level0 -> Level1

The steps are as follows:

1. Select the oldest SSTable sst0 in Level0, then find all SSTables in Level0 whose keys overlap with sst0, giving sst0…sstn.

2. In Level1, select all SSTables whose keys overlap with sst0…sstn, giving sst'0…sst'm.

3. Merge sst0…sstn with sst'0…sst'm into sst''0…sst''k and store the result in Level1.

4. Delete sst0…sstn and sst'0…sst'm.

major compact Level0 -> Level1

Unlike in Level0, keys do not overlap between the SSTables within Level1 or Level2, so the Level1 -> Level2 merge is slightly simpler.

Level1 -> Level2

The steps are as follows:

1. Select the oldest SSTable sst0 in Level1.

2. In Level2, select all SSTables whose keys overlap with sst0, giving sst'0…sst'm.

3. Merge sst0 with sst'0…sst'm into sst''0…sst''k and store the result in Level2.

4. Delete sst0 and sst'0…sst'm.

major compact Level1 -> Level2

Full Compact

Full Compact refers to performing a compact over all SSTables in Level0, Level1, and Level2, again using the multi-way merge-sort algorithm.

full compact

A Full Compact usually takes a long time, so it is typically performed during low-traffic periods.

Optimize the LSM tree

Introduce blocks to the SSTable

So far, to search for a key in an SSTable, we first check whether the key falls within the SSTable's range according to its min key and max key; if so, we binary-search the SSTable. Binary search works in LsmKvDb because its SSTable implementation is simple: records are stored as strings separated by \n. In practice, to use disk space as efficiently as possible, data in an SSTable is usually stored in binary form without a \n separator between records. In that case, implementing an efficient binary search becomes much more complicated.

A common optimization is to organize the records of an SSTable into blocks of a certain size and compress each block. To quickly find the block a key belongs to, we maintain in memory the offset of each block's start key within the SSTable (a sparse index).

SSTables stored by block

The steps to find the key are as follows:

1. Locate the block to which the key belongs based on the index.

2. Load the block into memory and decompress it.

3. Binary-search the records in memory.

When choosing the block size, we should take advantage of disk locality so that the system can load an entire block into memory with a single disk I/O.
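The lookup steps above can be sketched as follows (illustrative names; block loading and decompression are elided). A TreeMap's floorEntry finds the one block whose start key is the greatest start key less than or equal to the target key.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the sparse block index: only the start key and file offset
// of each block are kept in memory.
class BlockIndex {
    // start key of a block -> offset of that block within the SSTable file
    private final TreeMap<String, Long> startKeyToOffset = new TreeMap<>();

    void addBlock(String startKey, long offset) {
        startKeyToOffset.put(startKey, offset);
    }

    // Step 1: the key can only live in the block whose start key is the
    // greatest one <= key. Steps 2-3 (load and decompress the block, then
    // binary-search it in memory) are elided here.
    Long locateBlock(String key) {
        Map.Entry<String, Long> e = startKeyToOffset.floorEntry(key);
        return e == null ? null : e.getValue();
    }
}
```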

Introduce Bloom Filter for SSTable

In fact, even if the target key falls within the key range of an SSTable, the key may not exist in that SSTable. Yet so far, LsmKvDb performs a lookup whenever the target key is within an SSTable's range. As the number of SSTables grows, query efficiency inevitably degrades.

A common optimization is to introduce a Bloom Filter for each SSTable.

A Bloom Filter is an in-memory data structure that can tell you that an element definitely does not exist, or that it might exist. It consists of a very long bit array and a series of Hash functions. When an element is added to the set, the Hash functions map it to a set of positions in the array, and the bits at those positions are set to 1. To check whether an element is in the set, we compute its Hash positions and inspect the bits: if any bit is 0, the element definitely does not exist; if all bits are 1, the element might exist, with some probability of a false positive.

Bloom Filter

With a Bloom Filter, we can quickly determine that a target key definitely does not exist in an SSTable and skip the lookup entirely, which improves read performance.

Google's Guava library provides a BloomFilter implementation that lets you set the false-positive rate as needed.
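For example, a Bloom filter sized for one million keys with a 1% false-positive rate can be created like this (a minimal usage sketch; the key format is hypothetical):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterExample {
    public static void main(String[] args) {
        // Expected insertions and the desired false-positive rate are
        // fixed up front; Guava sizes the bit array accordingly.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,  // expected number of keys in the SSTable
                0.01);      // acceptable false-positive rate: 1%

        filter.put("user:42");

        // "Definitely absent" answers let LsmKvDb skip the SSTable entirely.
        System.out.println(filter.mightContain("user:42")); // true
        System.out.println(filter.mightContain("user:43")); // false (very likely)
    }
}
```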

Conclusion

Following "From Hash Index to LSM Tree (Part 1)", this article introduced the basic principles of the LSM tree and implemented a simple LSM-tree-based key-value database, LsmKvDb, on top of the original append-only log. The LSM tree consists of the MemTable, Immutable MemTables, and SSTables. The MemTable and Immutable MemTables are kept in memory, while SSTables are stored on disk. The ordering of SSTables gives the LSM tree good read performance without a Hash index and supports range queries, while the MemTable gives the LSM tree high write performance.

In this series, we started from a simple key-value database and optimized it step by step with Hash indexes, LSM trees, Bloom filters, and other techniques, analyzing the implementation principles of each along the way. Database indexing technology goes far beyond these, though; the B/B+ tree, for example, the most commonly used of all, is also well worth in-depth study. We may analyze it further in the future.
