The OceanBase database uses an LSM-Tree structure as its storage engine. Data is divided into baseline data (SSTable) and incremental data (MemTable). The baseline data is stored on disk and loaded into the database cache when it needs to be read. Incremental data is cached in memory as rows are continuously inserted or modified. When the incremental data in memory reaches a certain threshold, it is flushed to disk; when the flushed incremental data on disk in turn reaches a threshold, it is merged with the baseline data.
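To make the read/write split concrete, here is a minimal Python sketch (not OceanBase's actual implementation): writes go only to the in-memory MemTable, while reads check the MemTable first and fall back to the baseline data, so the newest version of a row always wins.

```python
class SimpleLSMStore:
    """Toy model of the baseline/incremental split described above."""

    def __init__(self):
        self.memtable = {}  # incremental data (MemTable), held in memory
        self.sstable = {}   # baseline data (SSTable); stands in for disk plus cache

    def put(self, key, value):
        # Inserts and updates only ever touch the MemTable.
        self.memtable[key] = value

    def get(self, key):
        # Newer data shadows older data: MemTable first, then baseline.
        if key in self.memtable:
            return self.memtable[key]
        return self.sstable.get(key)
```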

For a general LSM-Tree, storing MemTables across multiple levels brings a large space overhead. OceanBase simplifies the LSM-Tree structure, keeping only the C0 layer and the C1 layer (see the previous installment); that is, a MemTable in memory is written directly to disk as C0-layer data. This process is called a dump. After a certain number of dumps, the dumped MemTable data on disk is merged with the baseline data. Dump and merge are explained in detail below:

Dump: Because data in memory is modified continuously, more and more MemTables accumulate in memory. To release memory, OceanBase defines a threshold for MemTable memory usage. When this threshold is reached, some of the oldest MemTables are merge-sorted and saved to disk to form C0-layer data. This process is called a dump; OceanBase calls it a Minor Freeze. A rough illustration follows this paragraph.
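The hypothetical sketch below triggers a Minor Freeze when the active MemTable grows past a threshold; the row-count limit is a stand-in for OceanBase's actual memory-usage threshold.

```python
MEMTABLE_LIMIT = 1024  # hypothetical row-count limit; OceanBase tracks memory usage instead

class DumpingStore:
    def __init__(self):
        self.memtable = {}  # active MemTable
        self.c0_runs = []   # sorted runs produced by dumps (C0-layer data)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.minor_freeze()

    def minor_freeze(self):
        # Sort the frozen MemTable and persist it as one C0-layer run,
        # releasing the memory it occupied.
        self.c0_runs.append(sorted(self.memtable.items()))
        self.memtable = {}
```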

Merge: The merge operation (Major Freeze) combines dynamic and static data to generate new C1-layer data, and it can be time-consuming. When the incremental data produced by dumps accumulates to a certain extent, a Major Freeze performs a major-version merge. To guarantee data consistency, transactions on the data being merged must be suspended during the merge, which affects performance. OceanBase therefore refines the merge operation into incremental merge, rotating merge, and full merge, as sketched below.
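The essence of the merge is easiest to see in a full merge: every dumped run is folded into the baseline, with newer versions of a key overwriting older ones. The sketch below assumes each run is a list of (key, value) pairs and that later runs hold newer data; it is a simplification, not OceanBase's algorithm.

```python
def major_freeze(baseline, c0_runs):
    """Toy full merge: fold all dumped runs into the baseline and
    return a new, primary-key-sorted baseline (the new C1 layer)."""
    merged = dict(baseline)
    for run in c0_runs:          # oldest dump first
        for key, value in run:
            merged[key] = value  # a newer version overwrites an older one
    return sorted(merged.items())
```

Roughly speaking, the three refinements bound this cost: an incremental merge rewrites only the macro blocks whose data actually changed, a rotating merge processes one replica at a time so the remaining replicas keep serving queries, and a full merge rewrites everything.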

The following table summarizes the differences between a dump and a merge:

| | Dump (Minor Freeze) | Merge (Major Freeze) |
| --- | --- | --- |
| Input | The oldest MemTables in memory | Dumped incremental data plus the baseline data |
| Output | C0-layer data on disk | New C1-layer (baseline) data |
| Trigger | MemTable memory usage reaches a threshold | Dumped incremental data accumulates to a certain extent |
| Cost | Relatively lightweight; frees memory | Time-consuming; transactions on the merging data are suspended |

OceanBase also borrows from traditional relational databases and has the concept of a database block. In OceanBase, the unit of space allocation in data files is called a macro block. If you are familiar with Oracle, you can roughly think of a macro block as corresponding to an extent in Oracle. Each macro block is divided into several micro blocks of 16 KB, which are the smallest unit of a single database I/O (equivalent to the block of a traditional database). All kinds of data in the database are stored in these micro blocks. Since a macro block is 2 MB and OceanBase uses the LSM-Tree structure, data is sorted by the table's primary key; therefore OceanBase's macro blocks can be split, and adjacent macro blocks can be merged when data is deleted.
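The sketch below shows how primary-key-sorted, serialized rows might be packed into 16 KB micro blocks that are in turn grouped into 2 MB macro blocks; real macro blocks also carry headers, indexes, and checksums, which are omitted here.

```python
MACRO_BLOCK_SIZE = 2 * 1024 * 1024  # 2 MB: the unit of space allocation in data files
MICRO_BLOCK_SIZE = 16 * 1024        # 16 KB: the smallest unit of a single I/O

def pack_rows(rows):
    """Pack serialized rows (bytes), already sorted by primary key,
    into micro blocks, then group micro blocks into macro blocks."""
    micro_blocks, current = [], b""
    for row in rows:
        if current and len(current) + len(row) > MICRO_BLOCK_SIZE:
            micro_blocks.append(current)
            current = b""
        current += row
    if current:
        micro_blocks.append(current)

    macro_blocks, group = [], []
    for block in micro_blocks:
        if (len(group) + 1) * MICRO_BLOCK_SIZE > MACRO_BLOCK_SIZE:
            macro_blocks.append(group)
            group = []
        group.append(block)
    if group:
        macro_blocks.append(group)
    return macro_blocks
```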

The data in an SSTable is baseline data, which is static most of the time. Therefore, during a merge OceanBase analyzes the data by default and encodes each column according to its data distribution. The encoding methods currently supported include dictionary encoding, RLE encoding, constant encoding, difference encoding, prefix encoding, inter-column encoding, and so on. After the data is encoded, it can be further compressed by a general-purpose compression algorithm, which achieves a good compression ratio with no impact on read performance, and makes write performance better during the merge. The following image shows the basic process of encoding the rate_id column with a dictionary.
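Complementing that image, here is a sketch of the dictionary case (the rate_id name and the sample values are purely illustrative): each distinct value is stored once, the column becomes a list of small integer references, and a general-purpose compressor such as zlib can then shrink the references further.

```python
import zlib

def dictionary_encode(column):
    """Store each distinct value once; replace the column with
    small integer references into that dictionary."""
    dictionary = sorted(set(column))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[v] for v in column]

column = ["CNY", "USD", "CNY", "CNY", "EUR", "USD", "CNY"]
dictionary, refs = dictionary_encode(column)
print(dictionary)  # ['CNY', 'EUR', 'USD']
print(refs)        # [0, 2, 0, 0, 1, 2, 0]

# The compact references can then be fed to a general-purpose compressor.
compressed = zlib.compress(bytes(refs))
```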