Original text: yetanotherdevblog.com/lsm/

A log-structured merge-tree (LSM tree) is a data structure typically used when handling a large volume of writes. It optimizes the write path by performing only sequential writes. The LSM tree is the core data structure behind many databases, including BigTable, Cassandra, Scylla, and RocksDB.

Sorted String Tables

The LSM tree is persisted to disk using a Sorted Strings Table (SSTable) format. As the name suggests, an SSTable is a format for storing key-value pairs in which the keys are in sorted order. An SSTable consists of multiple sorted files called segments. Once written to disk, these segments are immutable. A simplified example is as follows:
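As a small sketch of this layout, two hypothetical segments might look like the following in Python (the keys and values are made up for illustration; real segments are binary files on disk):

```python
# Hypothetical contents of two on-disk segments: key-value pairs, sorted by key.
segment_1 = [("apple", 5), ("banana", 2), ("cherry", 8)]
segment_2 = [("avocado", 3), ("banana", 7), ("dog", 52)]

# Within each individual segment, keys always appear in sorted order.
for segment in (segment_1, segment_2):
    keys = [key for key, _ in segment]
    assert keys == sorted(keys)
```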

We can see that the key-value pairs within each segment are sorted by key. Let’s take a look at what a segment is and how it is generated.

Writing data

Recall that LSM trees only perform sequential writes. You might wonder how we can write data in sorted order when values arrive in any order. This is solved with an in-memory tree structure, often referred to as a memtable; the underlying data structure is usually some kind of sorted tree, such as a red-black tree. When data is written, it is inserted into this red-black tree.

Our writes are stored in the red-black tree until the tree reaches a predefined size. Once the red-black tree has enough entries, it is flushed to disk as a segment, in sorted order. This allows us to write each segment file in a single sequential pass, even though the inserts can occur in any order.
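The write path above can be sketched in a few lines of Python. This is a minimal illustration, not a real storage engine: a plain dict sorted at flush time stands in for the red-black tree, and flushed segments are kept in a list instead of written to disk. The `Memtable` class and its threshold are invented for this example.

```python
class Memtable:
    """In-memory write buffer; a dict sorted at flush time stands in for a red-black tree."""

    def __init__(self, flush_threshold=3):
        self.entries = {}
        self.flush_threshold = flush_threshold
        self.segments = []  # flushed segments, oldest first (stand-in for segment files)

    def put(self, key, value):
        self.entries[key] = value
        if len(self.entries) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Emit the keys in sorted order, so the segment can be written sequentially.
        self.segments.append(sorted(self.entries.items()))
        self.entries = {}


mt = Memtable(flush_threshold=3)
for key, value in [("dog", 52), ("apple", 1), ("cat", 7)]:  # inserts in arbitrary order
    mt.put(key, value)
print(mt.segments[0])  # [('apple', 1), ('cat', 7), ('dog', 52)]
```

Note that the segment comes out sorted even though `dog` was inserted first.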

Reading data

So how do we find a value in an SSTable? A naive approach is to scan the segments for the desired key, starting with the newest segment and working back to the oldest until we find the key we want. Searching newest-first means we always retrieve the most recently written value for a key. A simple optimization is to keep a sparse index in memory.

We can use this index to quickly find the offsets just before and after the target key; then we only need to scan the small portion of the segment file between those bounds. Consider a scenario where we want to look up the key dollar in the segment above. A binary search on the sparse index tells us that dollar falls between dog and downgrade. Now we only need to scan from offset 17208 to offset 19504 to find the value (or determine that it is missing).
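The bounds lookup can be sketched with a binary search over the sparse index. The index contents below are hypothetical (only the dog/downgrade offsets come from the example above; the other entries are made up), and `scan_bounds` is an invented helper name.

```python
import bisect

# Hypothetical sparse index: every Nth key in the segment and its byte offset.
sparse_index = [("and", 0), ("dog", 17208), ("downgrade", 19504), ("yellow", 24204)]
index_keys = [key for key, _ in sparse_index]

def scan_bounds(target):
    """Return the byte-offset range of the segment that could contain `target`."""
    i = bisect.bisect_right(index_keys, target)
    lo = sparse_index[i - 1][1] if i > 0 else 0
    hi = sparse_index[i][1] if i < len(sparse_index) else None  # None = end of file
    return lo, hi

print(scan_bounds("dollar"))  # (17208, 19504): scan only this slice of the file
```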

This is a nice improvement, but what about looking up records that don’t exist? With the approach so far, we would still loop through every segment file and come up empty. Bloom filters can help us solve this problem. A Bloom filter is a space-efficient data structure that can tell us whether a key is definitely absent or possibly present. We add each key to the filter as it is written and check the filter at the start of a read, so requests for non-existent data can be answered efficiently.

Compaction

Over time, the system accumulates more and more segment files as it runs. These segment files need to be cleaned up and maintained to keep their number from getting out of control. This process is called compaction. Compaction is a background process that continuously merges old segments into newer ones.

As you can see in the example above, segments 1 and 2 both hold a value for the key dog. Newer segments contain the most recently written values, so the value from segment 2 is the one carried forward into segment 4. Once the compaction process has written a new segment for its input segments, the old segment files are deleted.
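The merge step can be sketched as follows. This is a simplified illustration (real engines stream-merge sorted files rather than building a dict in memory); the segment contents are hypothetical, with only the key dog echoing the example above.

```python
def compact(segments):
    """Merge sorted segments (listed oldest first) into one new sorted segment.

    For duplicate keys, the value from the newest segment wins.
    """
    merged = {}
    for segment in segments:          # oldest to newest: later writes overwrite earlier ones
        for key, value in segment:
            merged[key] = value
    return sorted(merged.items())


segment_1 = [("cat", 3), ("dog", 52)]     # older segment
segment_2 = [("dog", 84), ("zebra", 9)]   # newer segment
print(compact([segment_1, segment_2]))
# [('cat', 3), ('dog', 84), ('zebra', 9)]
```

After the merge, the duplicate entry for dog is gone and only the newest value survives, which is exactly how compaction reclaims space.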

Deleting data

We’ve talked about reading and writing data, but how do we delete data? How can data be removed from an SSTable when segment files are immutable? Deletion actually follows exactly the same path as a write: whenever a delete request is received, a unique marker called a tombstone is written for that key, marking it as deleted.

The example above shows that the key dog had the value 52 at some point in the past, but now carries a tombstone. This means that if we receive a request for the key dog, we return a response indicating that the key does not exist. It also means that deletes initially take up extra disk space, which may come as a surprise to many developers. Eventually, the tombstone and the values it shadows are removed during compaction, so the deleted value no longer exists on disk.
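The tombstone mechanism can be sketched like this. The `TOMBSTONE` sentinel and the `get` helper are invented names for illustration; segments are plain lists standing in for files, newest last.

```python
TOMBSTONE = object()  # hypothetical sentinel written in place of a value on delete

def get(segments, key):
    """Search segments newest-first; a tombstone means the key was deleted."""
    for segment in reversed(segments):  # newest segment first
        for k, v in segment:
            if k == key:
                return None if v is TOMBSTONE else v
    return None


segments = [
    [("cat", 3), ("dog", 52)],   # older segment
    [("dog", TOMBSTONE)],        # the delete was written just like a normal entry
]
print(get(segments, "dog"))  # None: the tombstone shadows the old value 52
print(get(segments, "cat"))  # 3: untouched keys are unaffected
```

Because the tombstone lives in a newer segment, it is found first and shadows the stale value, even though that value still physically exists on disk until compaction.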

Conclusion

We now understand how the LSM tree storage engine works:

  1. Writes are stored in an in-memory tree (also known as a memtable). Any supporting data structures (Bloom filters and the sparse index) are updated as necessary.
  2. When the tree grows too large, it is flushed to disk as a segment, with the keys in sorted order.
  3. When a read comes in, we check the Bloom filter first. If the filter indicates that the value does not exist, we tell the client the key could not be found. If the filter indicates that the value may exist, we iterate over the segment files from newest to oldest.
  4. For each segment file, we consult the sparse index and scan between the offsets that bound the target key. As soon as the key is found in a segment file, its value is returned.
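The read path in steps 3 and 4 can be tied together in one sketch. A plain set stands in for the Bloom filter here (so, unlike a real Bloom filter, it has no false positives), and the per-segment scan checks every entry rather than just the sparse-index slice; both simplifications are for brevity.

```python
def read(key, bloom_keys, segments):
    """Read-path sketch: filter check first, then newest-to-oldest segment scan."""
    if key not in bloom_keys:
        return None                        # definitely absent: answered without any scan
    for segment in reversed(segments):     # newest segment first
        for k, v in segment:               # real systems scan only the sparse-index slice
            if k == key:
                return v
    return None


segments = [[("dog", 52)], [("dog", 84), ("fish", 12)]]  # oldest first
bloom_keys = {"dog", "fish"}
print(read("dog", bloom_keys, segments))   # 84: the newest value wins
print(read("cat", bloom_keys, segments))   # None: rejected before touching any segment
```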