Goal

Take full advantage of the performance of modern SSD storage and significantly reduce the read and write amplification of the LSM-tree, improving its performance while providing the same API.

Background

On traditional spinning disks, sequential I/O is roughly 100 times faster than random I/O. Building on this, the LSM-tree turns random reads and writes of massive numbers of KV pairs into random in-memory reads/writes plus sequential flushes and periodic merges (compaction), which improves read/write performance. This is especially well suited to workloads that write more than they read and that access the most recent data most frequently.


Read and write amplification. Write amplification: because the LSM-tree has many levels, it must keep merge-sorting and compacting data to keep reads fast, so each KV pair ends up being read and rewritten multiple times. Read amplification: finding a given key may require querying several levels vertically, and within a level a binary search is needed because each level holds multiple key ranges. Of course, to speed up lookups, each level can be equipped with a Bloom filter so that levels which cannot contain the key are skipped quickly. As the amount of data grows, read and write amplification both get worse.
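As a rough back-of-the-envelope model (not a figure from the paper): for a leveled LSM-tree with size ratio T between adjacent levels and L levels, compaction may rewrite each item on the order of T times per level, so worst-case write amplification grows roughly as:

```latex
% Rough worst-case write amplification of a leveled LSM-tree
% (T = size ratio between adjacent levels, L = number of levels).
\[
  \mathrm{WA} \;\approx\; T \cdot L,
  \qquad
  L \;\approx\; \log_T \!\left(\frac{\text{total data size}}{\text{memtable size}}\right)
\]
```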

SSDs are becoming cheaper and more widely used, and their parallel random read performance is very good, not much slower than sequential reads. Random writes, however, should still be avoided: they are not as uniformly fast as random reads, and they shorten SSD lifetime.

The core design

The core design of WiscKey is as follows:

  1. Keys and values are stored separately: keys stay in the LSM-tree, while values are stored in a separate log file (the vLog).
  2. Because the value data is unordered, SSD parallel random reads are used to speed up reading.
  3. A tailored crash-consistency and garbage-collection strategy manages the value log efficiently.
  4. The WAL is removed without sacrificing consistency, improving write performance for small writes.

Design details

Key-value separation

The Key is still stored in an LSM-tree. Since a Key usually takes far less space than its Value, WiscKey's LSM-tree has fewer levels and therefore much less read/write amplification. The Value is saved in a separate, append-only log file called the vLog. Of course, some metadata about the Value, such as its position in the vLog, is stored in the LSM-tree alongside the Key, but it occupies very little space.
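As a concrete (illustrative, not the paper's exact on-disk format) picture of this split: a vLog record carries the key and value together, while the LSM-tree stores only a small address record per key.

```go
package wisckey

// VLogEntry is one record appended to the value log.
// Illustrative layout: <ksize, vsize, key, value>.
type VLogEntry struct {
	Key   []byte
	Value []byte
}

// ValueMeta is what the LSM-tree stores against a key:
// the position of the value in the vLog plus its size.
type ValueMeta struct {
	VLogOffset int64 // byte offset of the entry in the vLog
	ValueSize  int64 // length of the value, so one read suffices
}
```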

Read. Although the Key and the Value must be fetched separately (a read breaks down into a lookup in the LSM-tree, which is very likely served from memory, plus one random read on the SSD), the total latency is no worse than LevelDB's, because each step is faster than its counterpart in LevelDB.

Write. The Value is first appended to the vLog, yielding its offset (vlog-offset). The Key, together with <vlog-offset, value-size>, is then written into the LSM-tree. One append plus one in-memory write: both are fast.

Delete. Deletion is asynchronous: only the Key in the LSM-tree is deleted; the corresponding Value in the vLog is reclaimed later by the garbage-collection process.
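To make these three operations concrete, here is a minimal sketch of the separated layout, with a plain Go map standing in for WiscKey's small LSM-tree and an append-only file as the vLog; the entry encoding, names, and omitted error handling are illustrative, not the paper's implementation.

```go
package wisckey

import (
	"encoding/binary"
	"os"
)

// meta is what the index stores for each key: the value's address in the vLog.
type meta struct {
	off  int64 // byte offset of the entry in the vLog
	size int64 // length of the value
}

// store is a toy stand-in for WiscKey: a map plays the role of the small
// LSM-tree, and an append-only file plays the role of the vLog.
type store struct {
	idx  map[string]meta
	vlog *os.File
	head int64 // current end of the vLog
}

// Put appends <ksize, vsize, key, value> to the vLog, then records the
// value's address in the index.
func (s *store) Put(key, value []byte) error {
	var hdr [16]byte
	binary.LittleEndian.PutUint64(hdr[0:], uint64(len(key)))
	binary.LittleEndian.PutUint64(hdr[8:], uint64(len(value)))
	entry := append(append(hdr[:], key...), value...)
	if _, err := s.vlog.Write(entry); err != nil {
		return err
	}
	off := s.head
	s.head += int64(len(entry))
	s.idx[string(key)] = meta{off: off, size: int64(len(value))}
	return nil
}

// Get looks the key up in the index, then performs one random read in the vLog.
func (s *store) Get(key []byte) ([]byte, bool) {
	m, ok := s.idx[string(key)]
	if !ok {
		return nil, false
	}
	buf := make([]byte, m.size)
	// the value starts after the 16-byte header and the key bytes
	if _, err := s.vlog.ReadAt(buf, m.off+16+int64(len(key))); err != nil {
		return nil, false
	}
	return buf, true
}

// Delete only removes the key from the index; the stale value left in the
// vLog is reclaimed later by garbage collection.
func (s *store) Delete(key []byte) {
	delete(s.idx, string(key))
}
```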

Despite these advantages, key-value separation brings several challenges of its own, such as range queries, garbage collection, and crash consistency.

Challenge 1: Range query

Range query (sequential traversal of the key-value pairs between a start key and an end key) is an important feature of modern KV stores. LevelDB stores key-value pairs in key order, so a range query can be served by traversing the relevant memtables and SSTables in sequence. In WiscKey, however, the values are unordered, so a range query requires many random reads. As shown in Figure 3, multi-threaded parallel random reads can saturate the SSD's bandwidth and greatly improve query speed.

Specifically, when serving a range query, WiscKey first reads the required keys in order from the LSM-tree, then uses multi-threaded random reads on the SSD to prefetch the corresponding values into a buffer. The keys and the buffered values are then combined in order and returned to the user, achieving high performance.
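A sketch of that prefetch step, assuming the keys and their vLog addresses have already been collected in order from the LSM-tree; a pool of worker goroutines issues the random reads in parallel so the SSD sees many outstanding requests (types and parameters are illustrative):

```go
package wisckey

import (
	"os"
	"sync"
)

type valueAddr struct {
	Key  []byte
	Off  int64 // offset of the value inside the vLog
	Size int64
}

// prefetchValues fills buf[i] with the value for addrs[i], reading the vLog
// with `workers` goroutines so the SSD sees many parallel random reads.
func prefetchValues(vlog *os.File, addrs []valueAddr, workers int) [][]byte {
	buf := make([][]byte, len(addrs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				b := make([]byte, addrs[i].Size)
				if _, err := vlog.ReadAt(b, addrs[i].Off); err == nil {
					buf[i] = b
				}
			}
		}()
	}
	for i := range addrs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return buf // values in key order, ready to merge with the keys
}
```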

Challenge 2: Garbage collection

LevelDB relies on the compaction mechanism for delayed garbage collection, and WiscKey uses the same mechanism to reclaim Keys. Values, however, live in the vLog, so an additional garbage-collection mechanism is needed for them.

The simplest approach would be to scan WiscKey's LSM-tree to obtain the set of Keys in use, then scan the vLog and reclaim every value not referenced by that Key set. Obviously, though, this is a heavyweight (and time-consuming) operation that might require stopping the external service for consistency, similar to stop-the-world pauses in early JVM GC.

Clearly something lighter is needed. WiscKey's basic idea is to treat the vLog as one contiguous strip of value data, keep all in-use data in the middle of the strip, and use two pointers to mark the head and tail of that valid region. New data can only be appended at the head, and garbage collection proceeds from the tail. So how is this valid middle region maintained?

When garbage collection is needed, a chunk of data entries (each entry containing the four fields <ksize, vsize, key, value>) is read from the tail into memory, one chunk at a time to reduce I/O. For each entry, if it is still in use it is appended back to the head of the vLog; otherwise it is discarded. The tail pointer is then advanced past the processed chunk.
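A sketch of one garbage-collection pass under these rules; the vLog is modelled as an in-memory slice for brevity, and the isLive / relocated callbacks stand in for checks and updates against the LSM-tree:

```go
package wisckey

// entry is one vLog record: <ksize, vsize, key, value>.
type entry struct {
	key, value []byte
}

// gcPass processes one chunk of entries starting at the tail of the vLog.
// Entries still referenced by the LSM-tree are re-appended at the head;
// dead ones are dropped. It returns the updated log and the new tail index.
func gcPass(
	vlog []entry,
	tail, chunkSize int,
	isLive func(key []byte) bool, // does the LSM-tree still point at this entry?
	relocated func(key []byte, newIdx int), // record the entry's new position
) ([]entry, int) {
	end := tail + chunkSize
	if end > len(vlog) {
		end = len(vlog)
	}
	for _, e := range vlog[tail:end] {
		if isLive(e.key) {
			vlog = append(vlog, e)        // move live entry to the head
			relocated(e.key, len(vlog)-1) // LSM-tree must learn the new offset
		}
		// dead entries are simply dropped
	}
	// The new tail must be persisted before the freed space is reclaimed;
	// WiscKey stores it in the LSM-tree, as described below.
	return vlog, end
}
```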

The tail pointer is a critical variable that must be persisted in case of a crash. WiscKey reuses the LSM-tree that stores the Keys: a special key (<'tail', tail-vlog-offset>) is stored alongside the ordinary Keys. The head pointer is simply the end of the vLog file and does not need to be saved. In addition, the timing of WiscKey's garbage collection can be configured flexibly: periodically, when a threshold is reached, when the system is idle, and so on.

Challenge 3: Crash consistency

An LSM-tree usually guarantees the atomicity of KV insertions and, after a crash, recovery of the inserted data in insertion order. WiscKey provides the same consistency guarantees, but the mechanism is slightly more involved (atomicity in particular is harder) because keys and values are stored separately.

For atomicity of insertion, consider what happens after the system recovers from a crash and a user queries a Key:

  1. If the Key cannot be found in the LSM-tree, the system treats it as nonexistent. Even if the Value was already appended to the vLog, it will simply be reclaimed later.
  2. If the Key is found in the LSM-tree, look up the corresponding data entry <ksize, vsize, key, value> in the vLog and check that the entry exists, that its location falls inside the valid middle segment, and that its key matches. If any check fails, delete the Key and tell the user it does not exist. To guard against half-written entries left behind by a crash in the middle of a write, a checksum can be added to each data entry.

By following this process, KV writes are atomic from the user's point of view: a key-value pair either exists or it does not.
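A sketch of that post-crash check for a single key, assuming a <ksize, vsize, key, value> entry layout and hypothetical readEntry / deleteKey callbacks (checksum verification omitted):

```go
package wisckey

import "bytes"

type addr struct {
	Off, Size int64
}

// lookupAfterCrash applies WiscKey's consistency rules for a single key.
// readEntry returns the <key, value> stored at a vLog offset (ok=false if
// nothing valid is there); validTail and validHead bound the live region.
func lookupAfterCrash(
	key []byte,
	meta *addr, // nil if the key is not in the LSM-tree
	validTail, validHead int64,
	readEntry func(off int64) (k, v []byte, ok bool),
	deleteKey func(key []byte),
) ([]byte, bool) {
	if meta == nil {
		return nil, false // not in the LSM-tree: treated as never inserted
	}
	if meta.Off < validTail || meta.Off >= validHead {
		deleteKey(key) // address points outside the valid segment
		return nil, false
	}
	k, v, ok := readEntry(meta.Off)
	if !ok || !bytes.Equal(k, key) {
		deleteKey(key) // dangling or mismatched entry: drop the key
		return nil, false
	}
	return v, true
}
```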

For ordered recovery of insertions: modern file systems (such as ext4, btrfs, XFS) guarantee that appends are sequential, so if entries D1, D2, D3, …, Dx, Dx+1, … are appended to the vLog and Dx was not durably appended before a crash, then no entry after Dx was appended either. This preserves the insertion order of the data.

As discussed below, WiscKey uses a write buffer to make appending small values efficient, so some buffered entries may be lost on a crash. For users who cannot tolerate this, WiscKey provides a synchronous-write switch that bypasses the buffer and forces values to be flushed to the vLog before the corresponding keys are written to the LSM-tree.

Optimization 1: vLog caching

For write-intensive workloads with small values, if every PUT(K, V) call invoked the write primitive to append one entry to the vLog, such frequent small I/O would hurt performance and leave SSD bandwidth underused, as shown in the following figure:

Therefore, WiscKey buffers written values in memory and only appends them to the vLog when the user requests a flush or a configured size threshold is reached. Queries also need a small change: each lookup first searches the buffer and then the vLog. The trade-off, as noted earlier, is that the buffered portion is lost if the system crashes.
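A sketch of such a buffer, with an illustrative size threshold and a sync flag corresponding to the synchronous-write switch mentioned above:

```go
package wisckey

import "os"

const flushThreshold = 4 << 20 // illustrative: flush after ~4 MB of buffered values

type bufferedVLog struct {
	f   *os.File
	buf []byte
}

// append stages an encoded entry in memory and only writes it to the vLog
// file when the buffer is large enough or when the caller asks for sync.
func (b *bufferedVLog) append(entry []byte, sync bool) error {
	b.buf = append(b.buf, entry...)
	if sync || len(b.buf) >= flushThreshold {
		return b.flush()
	}
	return nil
}

func (b *bufferedVLog) flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	if _, err := b.f.Write(b.buf); err != nil {
		return err
	}
	b.buf = b.buf[:0]
	// With the sync switch on, this flush (and fsync) completes before the
	// corresponding keys are written to the LSM-tree.
	return b.f.Sync()
}

// Lookups must check the in-memory buffer first, then the vLog file,
// since the newest values may not have reached the file yet.
```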

Optimization 2: Eliminate WAL

The WAL (Write-Ahead Log) is a common recovery mechanism in database systems. In a traditional LSM-tree, data is written directly into memory, so a log record is written before each operation for crash recovery; on recovery, the log is replayed record by record to rebuild the in-memory structures. However, every write request then incurs extra disk I/O, which reduces write performance.

Because the vLog entries record every inserted key in sequence, the vLog can be reused as the WAL of WiscKey's LSM-tree. As an optimization, a head pointer up to which keys are known to be persisted, <'head', head-vlog-offset>, can also be stored in the LSM-tree. On crash recovery, the LSM-tree is first restored up to this point, and then the keys in the vLog after this offset are replayed one by one to finish rebuilding the LSM-tree.
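A sketch of that recovery scan, assuming the same <ksize, vsize, key, value> entry layout as in the earlier sketches and a hypothetical lsmPut callback; it replays entries from the persisted head offset to the true end of the file and returns the new head:

```go
package wisckey

import (
	"encoding/binary"
	"io"
	"os"
)

// recoverFromVLog replays vLog entries written after the last persisted
// head offset, re-inserting their keys (with value addresses) into the
// LSM-tree. This is what lets the vLog double as the WAL.
func recoverFromVLog(vlog *os.File, persistedHead int64,
	lsmPut func(key []byte, off, size int64)) (int64, error) {

	off := persistedHead
	var hdr [16]byte
	for {
		if _, err := vlog.ReadAt(hdr[:], off); err != nil {
			if err == io.EOF || err == io.ErrUnexpectedEOF {
				return off, nil // reached the real end of the log
			}
			return off, err
		}
		ksize := int64(binary.LittleEndian.Uint64(hdr[0:]))
		vsize := int64(binary.LittleEndian.Uint64(hdr[8:]))
		key := make([]byte, ksize)
		if _, err := vlog.ReadAt(key, off+16); err != nil {
			return off, nil // torn tail entry: stop here
		}
		lsmPut(key, off, vsize)
		off += 16 + ksize + vsize // advance to the next entry; this is the new head
	}
}
```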
