Author: Hua Ting, senior technical expert of OceanBase Database of Ant Financial

This article introduces compaction as a critical operation when operating a storage system based on the LSM architecture.

Ten years ago, Google published a paper on BigTable, which promoted the popularity of LSM based system architecture. User data is written to WAL first and then to Memtable. When certain conditions are met, Memtable is frozen and dumped to form a data file SSTable. As more data is written, the number of dumps increases, and more sstables are dumped. However, too many Sstables cause IO queries, so the background tries to merge these Sstables over and over again in a process called Compaction. There are two types of Compaction: Minor Compaction and Major Compaction.

Minor Compaction occurs when one or more small, contiguous dump sstables and zero or more Frozen memtables are consolidated into a single, larger SSTable. A Minor Compaction results in fewer sstables and a larger SSTable.

Major Compaction consolidates all dumped Sstables with one or more Frozen memtables into one SSTable. This process compacts all deleted data. When a new Compaction occurs, it takes a long time to create a new Compaction process that consumes new system resources and affects operations.

Compaction amplification factor

A Compaction typically involves three factors of amplification, and a Compaction must decide between them.

Write amplification

Write Amplification. Assume that 10MB of data is written every second and 30MB/s of disk is written. Then Write Amplification is 3. Write operations can be either immediate or delayed. For example, when a redo log is written immediately, it generates a dirty page and LSM Compaction from a traditional B-Tree database. Use direct IO to write at least 512 bytes. If the log is 100 bytes, write 512 bytes to the disk.

Read the amplification

Read Amplification: Corresponds to the number of times a simple Query needs to Read a hard disk. For example, if a simple query reads 5 pages and 5 IO times occur, the read magnification is 5. If all the non-leaf nodes of the B-tree are cached in memory and the point read-amp is 1, the Leaf Block can be obtained by a single disk read. Short range read-amp indicates that 1~2, 1~2 disk reads can obtain the required Leaf Block.

Space amplification

Space Amplification: If I need to store 10MB of data, but the actual hard disk occupies 30MB, then Space Amplification is 3. There are a number of factors that can cause Compaction, including the need for temporary storage during Compaction, space fragmentation, a small percentage of valid data in a Block, and late deletion of new data.

In recent years, there have been a lot of studies on LSM writing and amplification in many papers. Writing and amplification itself is a very old problem. In a computer system, if the processing units of two adjacent layers are inconsistent or the application has special requirements for consistency, the problem of writing and amplification is likely to occur. Examples include CPU Cache and memory cells, file system blocks and disk sectors, database blocks and file system blocks, database redo/undo, file system Journal, etc. Below is the comparison of magnification factors of several popular KV tested in the PebblesDB paper:

RocksDB magnification factor up to 42 times, LevelDB also up to 27 times. Write amplification mean more reading and writing, can lead to fluctuations in performance of the system, such as for SSD will be accelerated life decay, in terms of cost and more power, the more optimal algorithm is proposed to avoid the write amplification can use cheaper SSD, write amplification also affect the bandwidth of the database system continue to write, if the disk bandwidth is 500 m/s, write amplification of 10, The maximum bandwidth for continuous write to the database is 50 MB /s, so it is important to solve the problem of write amplification.

Analysis of HDFS Under HBase: Some of the tests provided by A Facebook Messages Case Study concluded that in the Facebook Messages business system, the business read/write ratio was 99:1, but eventually reflected on disk, the read/write ratio was 36:64.

The essence of the amplification problem is how strongly a system needs “global order at all times.” At any time, is any write can not lead to the system disorder; The so-called global, that is, the system between any units to maintain order. B-tree series is a typical representative of global order at any time, while Fractal Tree breaks the global constraint, allows local disorder, and improves the random writing ability. The LSM series goes a step further by breaking the anytime rule and allowing sorting to occur from a background Compaction. In the LSM system which relies on background arrangement to maintain order, the stronger the order requirement of the system, the more serious the write amplification.

It is difficult to optimize the write, read, and spatial scaling at the same time, such as the internal fragmentation of SSD. The more the fragmentation, the more serious the space and read scaling. In order to improve the space and read scaling, the effective data in the page is copied to the new page (copy-out), which will bring write scaling. If the effective data rate in the page is lower, then the larger the space size, the lower the copy-out write size, and conversely, the smaller the space size, the greater the copy-out write size.

Compaction’s effects and side effects

As data writes increase, Minor freezes are triggered, and data dumps increase, a query may require more AND more IO operations, and the read latency increases. Minor Compaction stabilizes the number of dumped files, keeps IO Seek times stable, and delays stable. Minor Compaction, however, overwrites files, resulting in heavy bandwidth and short I/O pressure. Minor Compaction is a combination of short IO consumption and bandwidth consumption and low latency for subsequent inquiries.

In addition to the short read acceleration, Compaction can have a significant write impact, in exchange for low latency in subsequent inquiries. When a number of write operations occur, a Minor Compaction generates a dump SSTable. Multiple Minor freezes trigger a Major Freeze, causing a Major Compaction to occur. When Memtable memory is insufficient, user data writing is restricted. If Compaction consumes large amounts of I/O and bandwidth, it can cause read performance to deteriorate dramatically. To avoid this, the speed of write requests is limited when Memtable memory is tight.

Minor Compaction releases Memtable memory, clearing unnecessary multiversion data, and ensuring that the number of sstables dumped is stable, reducing read latency. Major Compaction clears deleted and expired data, reducing storage space and read latency. Compaction creates stable read latency, but it costs a lot of read latency burrs and some write congestion.

A compaction target

Compactions are designed to strike a balance between a Compaction that generates a new Compaction and a new Compaction that doesn’t cause severe IO stress. However, there is no single design strategy that works for all application scenarios or all data sets. Compaction strategy needs to be determined depending on business scenarios and data set characteristics. This Compaction strategy needs to be tailored to the business data. The goal is to reduce scaling. This Compaction strategy balances the following factors:

Properly control read amplification: Avoid the increase in read latency caused by Minor Freeze, and avoid the request for too many Sstables.

Control write magnifying: avoid compacting the same data over and over again;

Sizing up the storage space. When Compaction Compaction occurs, unnecessary data of multiple versions, deleted data, and expired data cannot fill storage space for a long time. When Compaction Compaction occurs, it releases storage space that is not needed.

Control Compaction resources: Use fewer resources during peak operations and more resources during peak operations.

Control Compaction smoothness. Response time and throughput are predictable when Compaction runs.

Compaction algorithm evolution

Memtable(incremental data) + L1(baseline data)

This is an ideal scenario where memory is large, disk space is small, and service data is small. Every change you make to a Memtable is consolidated with a daily Major Compaction.


Each table contains an incremental Memtable and a baseline SSTable. This daily Major Compaction algorithm creates a point read-AMP of 1, a short range read-AMP of 1, and a space Compaction of 1. Sizeof (database)/sizeof(memtable) + 1. This Compaction algorithm provides optimal read size and space size. Write size is acceptable when Memtable memory is large, disk space is small, and service data is small. But as Compaction operations continue to grow, and storage space grows rapidly, write size becomes extremely large. Every day Major Compaction gets slower and slower.

Memtable(incremental data)+L0(incremental dump data)+L1(baseline data)

With the development of the business, the data updated daily becomes larger and larger, and the Memtable can no longer store the data modified throughout the day. When the system memory is tight, Minor Freeze is triggered, and then SSTable is generated, as shown in the following figure:



Point read-amp = >=1 disk read and multiple BloomFilter reads, short range read-amp = L0+L1 SSTable Total number of disk reads, Sizeof (database)/sizeof(redo log, L0); Stand-alone memory volume has barely increased in recent years, but the amount of baseline data and daily incremental data has grown dramatically, making it extremely difficult to write a daily Major Compaction algorithm. When the number of sstables in L0 reaches a certain number of dumps, we create a Major Freeze and then launch a Major Compaction. This strategy results in multiple Major Compaction operations per day. As the baseline volume of data grows, the writes grow and Major Compaction slows down.

Select Compaction based on the scenario

How to choose between read, write and space magnification in different scenarios? It can be difficult to implement a single Compaction strategy across a system, but deploying a new Compaction strategy to a different table can help to ensure that this Compaction Compaction is implemented properly.

Increments rowKey INSERT/Update/DELETE, regardless of read/write ratio

Memtable + L1 mode is adopted for optimal write and space amplification, without read amplification. Delete range is marked in Memtable if there are many delete ranges.

Write only: A large number of random ROWkey INSERTS, Updates, and deletes

Optimized write Compaction. Memtable + L0 + L1 generates a SSTable per dump. If L0 has the same size as L1, write a Major Compaction when a Memtable + L0 + L1 generates a Major Compaction.

Write mostly, a large number of random rowkey insert/update/delete

The Memtable + L0 + L1 mode is adopted. L0 layer accumulates incremental SSTable data of multiple Major freezes. L0 layer contains multiple Sstables, and the number and size of L0 layer SSTable are controlled according to the optimized write amplification policy. Use Stripe Compaction and progressive consolidation strategies to reduce write sizing and temporary sizing of Major Compaction.

Read only, or write very little

Optimal write magnification, read magnification and space magnification, adopt Memtable + L1 mode, Memtable has no data or only a little data.

Read mostly, a few random rowkey inserts/updates/deletes

Optimize read amplification, adopt Memtable + L0 + L1 mode, each time Frozen Memtable is dumped and merged with the previous L0 layer SSTable to form a new L0 layer SSTable. If the delete range is large, Write a delete range in the Memtable and L0 SSTable. If L0 and L1 create a new Major Compaction, it will create a small Compaction.

Read mostly, lots of random rowkey insert/update/delete

Optimize write amplification and read amplification. Memtable + L0 + L1 mode is adopted. L0 layer accumulates incremental SSTable data of multiple Major Freeze. If there is a large number of delete ranges, write a delete range in the Memtable and L0 SSTable, divide L0 and L1 into multiple sub-ranges, and create a Major Compaction when every sub-range compacts. Each Major Freeze merges parts of the range data from L0 and L1, similar to a progressive merge. If there are too many deletes, the storage space cannot be enlarged properly, and the storage space of deleted data cannot be released.