Takeaway

The author of this article is Qu Shan, a senior technical expert on Alibaba's OLTP database team and the lead developer of X-Engine, a high-performance, low-cost storage engine. In this article, Qu Shan discusses what capabilities an excellent relational database storage engine should have.


The database kernel is divided into two layers: SQL and storage. The SQL layer takes the SQL statement you input through a series of steps (parse/resolve/rewrite/optimize...) and produces an execution plan, usually in the form of a tree. The leaf nodes (operators) of that tree are generally responsible for single-table data access, and these operators are executed at the storage layer.
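As a minimal illustration of this split, the sketch below builds a two-node plan tree in Python. The class and method names are invented for illustration and do not correspond to any real engine's API.

```python
# Toy plan tree: the leaf operator (TableScan) stands for work done by the
# storage layer; the inner operator (Filter) stands for the SQL layer.
class TableScan:                      # leaf: single-table access
    def __init__(self, rows):
        self.rows = rows
    def run(self):
        return list(self.rows)

class Filter:                         # inner node: applies a predicate
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def run(self):
        return [r for r in self.child.run() if self.pred(r)]

plan = Filter(TableScan([(1, "a"), (2, "b"), (3, "c")]), lambda r: r[0] > 1)
print(plan.run())   # [(2, 'b'), (3, 'c')]
```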

Therefore, the main job of a database storage engine is, simply put, to access data, while guaranteeing the database's ACID (atomicity/consistency/isolation/durability) semantics. The storage engine's external interface is relatively simple: mainly writing/modifying/querying data, transaction handling (begin transaction/commit/rollback...), modifying schema objects/data dictionaries (optional), data statistics, and some peripheral operations and data import/export functions.

It doesn't seem difficult to implement a storage engine in terms of functionality alone; many key-value stores have been turned into database storage engines simply by adding a transaction mechanism. However, as the chassis of a database, a mature storage engine must consider efficiency, and how to achieve efficient data access (maximizing performance while minimizing cost) becomes the central trade-off in its design. This can be discussed in terms of the storage engine's main components:

Data organization

The organization of data in memory and disk largely determines the efficiency of data access. Different application scenarios have different choices. Typical examples are:

  1. Row store (NSM), which keeps each tuple physically intact. This is transaction-friendly because transactions always write whole rows, so it is mostly used in OLTP scenarios.
  2. Column store (DSM), which physically stores the values of the same column from different tuples together, so that only the required columns need to be read, saving a lot of I/O during large scans. In addition, column stores compress better and suit OLAP scenarios, but transaction processing is less convenient, requiring row-to-column conversion. As a result, most AP databases are not very efficient at transaction processing, and some support only bulk import.
  3. Hybrid storage (FSM), which mixes row and column layouts. Either data is first grouped by row (Segment, SubPage) and organized as DSM within each group, as in PAX and RCFile; or data is first grouped by column (Column Group) and the specified columns within each group are organized as NSM, as in Peloton's Tile. This format tries to combine the advantages of NSM and DSM to handle mixed workloads (HTAP), but it also inherits the disadvantages of both.
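The row/column distinction above can be illustrated with a toy relation (the table and values here are made up):

```python
# The same three tuples laid out row-wise (NSM) vs column-wise (DSM).
rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]

# NSM: tuples stored contiguously -- a transaction writes one row in one place.
nsm = [value for row in rows for value in row]

# DSM: each column stored contiguously -- a scan over `age` touches one array.
dsm = {col: [row[i] for row in rows]
       for i, col in enumerate(("id", "name", "age"))}

print(nsm)          # [1, 'alice', 30, 2, 'bob', 25, 3, 'carol', 41]
print(dsm["age"])   # [30, 25, 41]
```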

So when you build a storage engine, you face the problem of choosing a storage format from the very beginning. Even after a category is chosen, each format still has countless details to work out: how fields of each data type are encoded (Encoding), how NULL/NOT NULL is stored in a row store, whether column indexes are needed to speed up project operations, whether column values need to be reordered, how a column store compresses data, and so on, each a balance between storage space and access speed.
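As one concrete, hypothetical encoding choice, the sketch below stores a row as a NULL bitmap followed by fixed-width fields, so a NULL costs one bit rather than a full field; real engines use far more elaborate formats.

```python
import struct

# Hypothetical row encoding: 1-byte NULL bitmap (up to 8 columns), then each
# non-NULL value as a little-endian int64.
def encode_row(values):
    bitmap = 0
    payload = b""
    for i, v in enumerate(values):
        if v is None:
            bitmap |= 1 << i               # mark column i as NULL
        else:
            payload += struct.pack("<q", v)
    return struct.pack("<B", bitmap) + payload

def decode_row(buf, ncols):
    bitmap = buf[0]
    out, off = [], 1
    for i in range(ncols):
        if bitmap & (1 << i):
            out.append(None)
        else:
            out.append(struct.unpack_from("<q", buf, off)[0])
            off += 8
    return out

encoded = encode_row([42, None, 7])
print(decode_row(encoded, 3))   # [42, None, 7]
```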

In order to cope with complex application scenarios, modern databases often use more than one storage format. For example, Oracle's In-Memory Column Store converts row-organized pages into column organization in memory to speed up complex queries.

Once you have selected the storage format, you must also decide how the data is aggregated on disk and in memory. Take row stores as an example: most storage engines use fixed-size pages to store consecutive rows, whether as heap tables (random order) or index-organized tables (index order). The increasingly popular LSM-style storage engines use variable-size pages (called DataBlocks) and support aggregation only in primary-key index order. The main difference is that the former is designed for in-place updates, leaving free space in each page, while the latter is read-only, storing data tightly without padding, which makes it easy to compress. This difference actually stems from a larger difference in transaction processing mechanisms, discussed later.
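A toy contrast between the two layouts, with an unrealistically small page size chosen purely for illustration:

```python
PAGE_SIZE = 64  # tiny page for illustration

# Updatable fixed-size page: rows plus reserved free space for future updates.
def make_page(rows):
    data = b"".join(rows)
    assert len(data) <= PAGE_SIZE
    return data + b"\x00" * (PAGE_SIZE - len(data))  # padding = room to grow

# Immutable LSM-style data block: rows packed tightly, no padding,
# which makes the block easy to compress.
def make_block(rows):
    return b"".join(rows)

rows = [b"row-one.", b"row-two."]
print(len(make_page(rows)))   # 64
print(len(make_block(rows)))  # 16
```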

For in-memory databases, data organization is quite different, because there is no need to exchange data between memory and persistent storage. Instead of using pages, in-memory index structures (such as a B+Tree) point directly at the records (tuples). Eliminating the indirection through the page layer reduces CPU cache misses.

Cache management

The granularity of the cache is generally the page, and the key issue is the cache replacement algorithm. LRU, LFU, ARC, and various variants of these algorithms are widely used in databases. Efficient memory management is also a concern, where fixed-length pages have an advantage over variable-length ones.

Of course, the influence of different query patterns on the cache should also be considered. If there are many single-row reads, a finer-grained cache (such as row granularity) is more efficient, but the eviction policy becomes more complex. Many recent studies have tried to introduce machine learning methods to optimize cache eviction algorithms and memory management.
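As a baseline for the replacement algorithms mentioned above, here is a minimal page-granularity LRU cache in Python; this is a sketch, not a production buffer pool design.

```python
from collections import OrderedDict

# Minimal LRU buffer cache: get/put promote a page to the MRU end;
# eviction pops the LRU end.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def get(self, page_id):
        if page_id not in self.pages:
            return None                       # miss: caller reads from disk
        self.pages.move_to_end(page_id)       # promote to most-recently-used
        return self.pages[page_id]

    def put(self, page_id, page):
        self.pages[page_id] = page
        self.pages.move_to_end(page_id)
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict least-recently-used page

cache = LRUCache(2)
cache.put(1, "page-1")
cache.put(2, "page-2")
cache.get(1)            # touch page 1 so page 2 becomes the eviction victim
cache.put(3, "page-3")
print(cache.get(2))     # None -- page 2 was evicted
print(cache.get(1))     # page-1
```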

Transaction processing

The core of the storage engine is guaranteeing ACID for the database. To guarantee D (durability), everyone does the same thing: write-ahead logging (WAL) for recovery. The key is how to efficiently implement A, C, and I, which comes down to the multi-version concurrency control (MVCC) mechanism.
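The WAL rule, log the change durably before touching the data page, can be sketched in a few lines; the log and buffer pool here are stand-in Python structures, not any real engine's data layout.

```python
# Write-ahead logging sketch: the redo record is appended (and in a real
# engine fsync'd) before the in-memory page is modified.
log = []        # stands in for the durable log file
pages = {}      # stands in for the buffer pool

def write(key, value):
    log.append(("SET", key, value))   # 1. append redo record first (WAL rule)
    pages[key] = value                # 2. only then apply to the data page

def recover():
    restored = {}
    for op, key, value in log:        # replay the log after a crash
        if op == "SET":
            restored[key] = value
    return restored

write("a", 1)
write("b", 2)
pages.clear()                         # simulate losing unflushed pages in a crash
print(recover())                      # {'a': 1, 'b': 2}
```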

A complete MVCC implementation is complicated and will not be elaborated here. The key lies in how to handle data races during concurrent execution, including write and read conflicts. To be efficient, read-only transactions must not be blocked by read-write transactions. This requires that writes not update the current data in place directly, but maintain multiple versions of the data. Current storage engines manage multiple versions in one of two ways:

  1. Update data in place, writing the displaced old version to an undo chain. The write cost is high and transaction processing is complex, but reclaiming old-version data is efficient.
  2. Do not update the original data directly, but append the new version. The write cost is small, but reads, especially scans, must merge many levels. The more serious problem is that reclaiming old versions requires compaction, which is expensive.

The former approach, ARIES-style, is used by most mainstream database storage engines, while the latter structure, the LSM-Tree, is used by many newer storage engines and is gaining more and more attention.
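The visibility rule common to both approaches, that a reader at a given snapshot sees the newest version committed at or before it, can be sketched independently of how versions are physically stored; the per-key version list here is purely illustrative.

```python
# Snapshot reads over multiple versions: each write appends (commit_ts, value);
# a reader at snapshot ts sees the newest version with commit_ts <= ts.
versions = {}     # key -> list of (commit_ts, value), oldest first

def write(key, value, commit_ts):
    versions.setdefault(key, []).append((commit_ts, value))

def read(key, snapshot_ts):
    visible = None
    for commit_ts, value in versions.get(key, []):
        if commit_ts <= snapshot_ts:
            visible = value           # keep newest committed-before-snapshot
    return visible

write("x", "v1", commit_ts=10)
write("x", "v2", commit_ts=20)
print(read("x", snapshot_ts=15))   # v1 -- old readers are never blocked
print(read("x", snapshot_ts=25))   # v2
```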

Catalog

Unlike a KV store, a database has a strict schema, so most records in a storage engine are structured. Many KV stores, when used as database storage engines, insert a conversion layer in the middle: tuples handed down from the upper layer are converted into binary key-values in a specific encoding and written to the KV store; after being read back, they are interpreted into tuple form according to the schema for the upper layer to process.

This approach certainly works, but many optimizations cannot be implemented: (a) data iteration must be over entire rows, so even if only one column is required, the serialization/deserialization overhead is unavoidable; (b) project and filter cannot be pushed down to the storage layer; (c) without column information, column encoding and compression are impossible; (d) a schema change can only be done by forcibly rebuilding the data. So, to be truly efficient, more and more storage engines choose to be fully schema-aware, storing structured records.
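The conversion layer described above might look like the sketch below. The delimiter-based value encoding and the schema are invented for illustration; real engines use more careful, type-aware formats.

```python
import struct

# Tuple <-> opaque key-value: the whole tuple is serialized into one value,
# so reading any single column pays full deserialization cost.
SCHEMA = ("id", "name", "age")   # hypothetical table schema

def tuple_to_kv(row):
    key = struct.pack(">q", row[0])          # big-endian pk preserves sort order
    value = "\x1f".join(str(v) for v in row[1:]).encode()
    return key, value

def kv_to_tuple(key, value):
    pk = struct.unpack(">q", key)[0]
    return (pk, *value.decode().split("\x1f"))

k, v = tuple_to_kv((7, "alice", "30"))
print(kv_to_tuple(k, v))   # (7, 'alice', '30')
```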

Conclusion

The discussion above covers only a few major concerns of single-machine database storage engines. Modern databases place higher demands on the storage engine: scalability and high availability have become standard, and the question now is how to give the storage engine distributed capabilities. This involves consistency guarantees for high availability, automatic scaling, distributed transactions, and a number of more complex issues that are well beyond the scope of this article and deserve a chapter of their own.

Finally, a brief introduction to X-DB, the distributed database developed by Alibaba. Its storage engine is X-Engine, developed in-house at Alibaba. X-Engine uses a tiered storage architecture, as its goal is to provide high concurrent transaction throughput and low cost for massive data storage.

We divide the data into multiple tiers according to access frequency (hot and cold), design the storage structure of each tier according to its access characteristics, and place each tier on appropriate storage devices. X-Engine uses the LSM-Tree as the architectural foundation for tiered storage and has been substantially redesigned on top of it.

Simply put, the hot data tier and data updates live in memory, using techniques from in-memory databases (lock-free index structures, append-only writes) to improve transaction processing performance. We designed a pipelined transaction processing mechanism that parallelizes the stages of transaction processing, greatly improving throughput. Cold (and warm) data with low access frequency is gradually evicted or merged into the persistent storage tiers, mapped onto today's rich hierarchy of storage devices (NVM/SSD/HDD).

We have heavily optimized the compaction process, which has the greatest impact on performance: splitting data at a finer granularity and exploiting the fact that updates concentrate in hot spots, so that data can be reused as much as possible during merges; finely controlling the shape of the LSM to reduce I/O and computation cost; and at the same time greatly reducing space amplification during merges. Fine-grained access control and caching mechanisms are used to optimize read performance. Optimization, of course, is endless, and thanks to rich application scenarios we have accumulated a great deal of engineering experience.
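Generic LSM compaction, merging sorted runs and keeping only the newest version of each key, can be sketched as follows. This is not X-Engine's actual algorithm, just the baseline process that such optimizations target.

```python
import heapq

# Merge sorted runs; runs listed first are newer, and on duplicate keys
# the newest run's version wins while older versions are dropped.
def compact(*runs):
    out = []
    tagged = [[(k, i, v) for k, v in run] for i, run in enumerate(runs)]
    for key, run_idx, value in heapq.merge(*tagged):
        if not out or out[-1][0] != key:
            out.append((key, value))        # first hit = newest run's version
    return out

new_run = [("a", 2), ("c", 9)]
old_run = [("a", 1), ("b", 5)]
print(compact(new_run, old_run))   # [('a', 2), ('b', 5), ('c', 9)]
```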

X-Engine is now more than just a single-machine database storage engine. Combined with our X-Paxos (a distributed strong-consistency high-availability framework), GMS (distributed management services), and X-TRX (a distributed transaction processing framework), it has evolved into a distributed database storage system.