• Understanding LSM Trees: What Powers Write-Heavy Databases
  • By Braden Groom
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: samyu2000
  • Proofreader: Chzh9311, Eminlin

Understanding the LSM Tree: A Data Structure for Write-Heavy Databases

An LSM tree (log-structured merge-tree) is a data structure typically used to handle write-heavy workloads. It optimizes the write path by turning writes into sequential disk I/O. LSM trees are the core data structure behind many database systems, such as BigTable, Cassandra, Scylla, and RocksDB.

SSTables

The LSM tree is persisted on disk using the Sorted Strings Table (SSTable) format. As the name implies, an SSTable is a format for storing key-value pairs in which the keys are kept in sorted order. An SSTable is made up of several sorted files called segments, which are immutable once written to disk. Let’s look at a simple example:

As you can see, key-value pairs in each segment are sorted by key. We’ll discuss exactly what a segment is and how it is created in the next section.
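
To make the format concrete, here is a minimal sketch of writing a segment to disk, assuming a toy one-pair-per-line text encoding; real SSTable formats are binary and considerably more elaborate.

    # A minimal sketch of a segment writer. The one-pair-per-line text
    # encoding and the file name are illustrative assumptions; real
    # SSTable formats are binary and considerably more elaborate.
    def write_segment(path, pairs):
        """Write key-value pairs to `path`, sorted by key."""
        with open(path, "w") as f:
            for key, value in sorted(pairs.items()):
                f.write(f"{key}:{value}\n")

    write_segment("segment-1.sst", {"dog": "52", "cat": "3", "mango": "7"})
    # segment-1.sst now contains the pairs in key order: cat, dog, mango.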

Writing data

Recall that LSM trees only perform sequential writes. You might wonder how we can write data sequentially when incoming writes arrive in no particular order. This is solved with an in-memory tree structure, often called a memtable, which is essentially a sorted tree similar to a red-black tree. Incoming writes are first inserted into this red-black tree.

Writes accumulate in the red-black tree until the tree reaches a preset size threshold. Once it holds enough entries, it is flushed to disk as a sorted segment. Because the tree keeps its keys in order, the segment can be written as a single sequential pass, even though the inserts arrived out of order.
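
Below is a minimal sketch of that flush step, where a plain dict sorted at flush time stands in for the red-black tree; the class name and threshold are illustrative, not any particular engine’s API.

    # Memtable sketch: buffer writes in memory, then flush them to disk
    # as one sorted segment once a size threshold is reached. A dict plus
    # a sort at flush time stands in for a real red-black tree.
    FLUSH_THRESHOLD = 4  # deliberately tiny, for demonstration

    class Memtable:
        def __init__(self):
            self.entries = {}
            self.segment_id = 0

        def put(self, key, value):
            self.entries[key] = value
            if len(self.entries) >= FLUSH_THRESHOLD:
                self.flush()

        def flush(self):
            # One sequential write of all buffered entries, in key order.
            self.segment_id += 1
            with open(f"segment-{self.segment_id}.sst", "w") as f:
                for key in sorted(self.entries):
                    f.write(f"{key}:{self.entries[key]}\n")
            self.entries.clear()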

Reading data

So how do we find a value in the SSTable? A naive approach is to scan the segments for the key: start with the newest segment and work backwards to the oldest until we find it. This means recently written keys are retrieved faster, but a lookup may still have to scan many segments. A simple optimization is to maintain a sparse index of keys in memory.

With such an index, we can quickly find the offsets of the keys immediately before and after the desired key, and then scan only that small slice of the segment. For example, suppose we are looking for the key dollar in the segment shown above. A binary search on the sparse index tells us that dollar falls between dog and downgrade, so we only need to scan the data between offsets 17208 and 19504 to find the value.
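
A sparse-index lookup can be sketched as a binary search over the sampled keys, as below; the keys and offsets mirror the example above, and the remaining index contents are assumptions for illustration.

    import bisect

    # Sparse index sketch: a sorted list of (key, byte offset) samples.
    # The dog/downgrade entries mirror the example above; the third entry
    # and all offsets are illustrative.
    sparse_index = [("dog", 17208), ("downgrade", 19504), ("dusk", 21340)]

    def scan_range(key):
        """Return the (start, end) byte offsets that could contain `key`."""
        keys = [k for k, _ in sparse_index]
        i = bisect.bisect_right(keys, key)
        start = sparse_index[i - 1][1] if i > 0 else 0
        end = sparse_index[i][1] if i < len(sparse_index) else None  # None = scan to end of file
        return start, end

    print(scan_range("dollar"))  # (17208, 19504): scan only this slice of the segment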

This optimization helps, but what about looking up keys that don’t exist? With the approach above, we would still scan every segment file only to conclude that the key is missing. This is where a Bloom filter comes in. A Bloom filter is a space-efficient probabilistic data structure that can tell us whether a key is definitely absent from the data. We add each key to the Bloom filter as it is written; at read time, we check the filter first, so requests for non-existent keys can be answered cheaply.
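
Here is a minimal Bloom filter sketch using a bit array and a few hashes derived from hashlib; the bit-array size and hash count are arbitrary illustrative choices, whereas production systems size them from the expected key count and the target false-positive rate.

    import hashlib

    # Bloom filter sketch. The bit-array size and number of hash functions
    # are arbitrary here; real systems derive them from the expected number
    # of keys and the desired false-positive rate.
    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False means "definitely absent"; True means "possibly present".
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("dog")
    print(bf.might_contain("dog"))      # True
    print(bf.might_contain("unicorn"))  # False (with very high probability)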

Compaction

As the system runs, more and more segment files accumulate. To keep the number of segment files from growing out of control, they need to be cleaned up and merged. This is the job of compaction, a background process that continuously merges older segments with newer ones.

In the example shown above, you can see that segment 1 and segment 2 both contain a value for the key dog. Newer segments hold the most recently written values, so the value from segment 2 is the one carried forward into segment 4. Once the compaction process has written the merged data into the new segment, the old segment files are deleted.
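
A compaction pass can be sketched as a merge in which the newest value for each key wins, as below; this reuses the toy text segment format from earlier and loads whole segments into memory, whereas a real compactor would stream through the sorted files with a k-way merge.

    import os

    # Compaction sketch: merge old segments into one new segment, keeping
    # only the newest value for each key, then delete the old files.
    def read_segment(path):
        with open(path) as f:
            return dict(line.rstrip("\n").split(":", 1) for line in f)

    def compact(segment_paths, output_path):
        merged = {}
        # Apply segments oldest-first so newer values overwrite older ones.
        for path in segment_paths:
            merged.update(read_segment(path))
        with open(output_path, "w") as f:
            for key in sorted(merged):
                f.write(f"{key}:{merged[key]}\n")
        for path in segment_paths:
            os.remove(path)  # old segments are deleted once the new one has been written

    compact(["segment-1.sst", "segment-2.sst"], "segment-4.sst")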

Deleting data

We’ve talked about reading and writing data, but what about deleting it? Since segment files are immutable, how do we remove a key from the SSTable? In fact, deletion follows the same path as a write. Whenever a delete request is received, the key is written with a special marker called a tombstone.

As the example above shows, the key dog, which previously mapped to 52, is now marked with a tombstone. If we then receive a read request for the key dog, we should respond that the data does not exist. Note that this means delete requests actually take up more disk space at first, which may come as a surprise to many developers. Eventually, however, the tombstoned entry is compacted away and the old value is gone for good.
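
A sketch of how a tombstone could be recorded and then interpreted on read is shown below; the "<tombstone>" sentinel string is an illustrative assumption, as real engines mark deletions with a dedicated flag in the record format.

    # Tombstone sketch: a delete is just another write. Using the string
    # "<tombstone>" as the marker is an illustrative assumption; real
    # engines use a dedicated flag in the on-disk record format.
    TOMBSTONE = "<tombstone>"

    def delete(memtable_entries, key):
        memtable_entries[key] = TOMBSTONE  # goes through the normal write path

    def to_read_result(stored_value):
        """Translate a stored value into what the client should see."""
        if stored_value is None or stored_value == TOMBSTONE:
            return None  # key was never written, or has been deleted
        return stored_value

    entries = {"dog": "52"}
    delete(entries, "dog")
    print(to_read_result(entries.get("dog")))  # None: dog now reads as non-existent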

Conclusion

We now understand how a basic LSM tree storage engine works:

  1. Writes are stored in an in-memory tree structure (the memtable). Any supporting data structures (Bloom filters and sparse indexes) are updated as needed.
  2. When this tree becomes too large, it is flushed to disk as a sorted segment.
  3. When reading a value, we first check the Bloom filter. If the Bloom filter indicates the key is not present, we tell the client that the key does not exist. If the Bloom filter indicates the key may exist, we iterate through the segment files from newest to oldest.
  4. For each segment file, we check the sparse index and scan only the range of offsets where we expect the key to appear. As soon as we find the key, we return its value (see the read-path sketch after this list).
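
As a rough summary of that read path, here is a sketch that checks the Bloom filter, then the memtable, then each segment from newest to oldest; plain dicts stand in for the memtable and the (already index-narrowed) segments, the Bloom filter is the sketch from earlier, and all names are illustrative.

    # Read-path sketch tying the steps above together. Plain dicts stand
    # in for the memtable and the segments; the Bloom filter is the
    # sketch from earlier in this article. Names are illustrative.
    TOMBSTONE = "<tombstone>"

    def get(key, memtable, bloom_filter, segments):
        """`segments` is ordered newest-first; returns the value or None."""
        if not bloom_filter.might_contain(key):
            return None                          # definitely never written
        for table in [memtable] + segments:      # memtable first, then newest to oldest
            if key in table:
                value = table[key]
                return None if value == TOMBSTONE else value
        return None                              # Bloom filter false positive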
