Author | Ma Haoxiang, Volcano Engine system development engineer
Hello everyone, I am Ma Haoxiang, a system development engineer at ByteDance's Volcano Engine. I currently focus on the development of distributed storage and distributed databases, and I know a little about both computing and storage. Today I will introduce how our distributed database was applied in practice during the Douyin Spring Festival Gala.
Introduction to distributed database architecture
I believe those of you interested in databases will be familiar with this chart: the DB-Engines ranking, or more precisely, its relational database ranking. In the April 2021 list, MySQL and PostgreSQL are both top-five relational databases. This means that anyone building a database product can hardly bypass the MySQL and PostgreSQL ecosystems, so rather than designing a completely independent stack, compatibility with MySQL and PostgreSQL should be a high priority.
At this point, some of you may ask: since open-source MySQL and PostgreSQL are so well developed, with mature ecosystems, huge user bases, and high rankings, why do we still need to develop a distributed database? The answer is simple: the traditional architecture could not meet the needs of our internal applications well, so we went looking for another way.
The diagram above shows the architecture of existing mainstream large-scale database systems, which is divided into three layers:
- The top layer is the application layer: Toutiao, Douyin, and Xigua Video are all applications.
- The middle layer is database middleware.
- The bottom layer is the database layer, with standalone storage beneath it.
This has long been the mainstream architecture for large back-end database systems, but what is wrong with it?
The first problem is the database middleware itself. Middleware has inherent limitations and is not very user-friendly. For example, it may require users to be aware of sharding keys; if a query does not specify the sharding key, reads and writes may be amplified and performance degrades.
The second problem is the capacity limit of local disks. In the traditional architecture, a standalone database runs on a single node and is naturally bounded by local disk capacity: if a node holds a dozen or so disks, total capacity is capped by those disks. You might suggest a cluster architecture with primary/replica replication, or sharding databases and tables, but that brings us back to the first problem: the limitations of middleware.
The third problem is that traditional standalone databases face cross-datacenter issues in deployment and operation, forcing a trade-off between RPO and performance.
Given these problems with traditional large-scale database architectures, it is natural to look for another solution. Is a distributed database the answer? It does seem that we are going further down this road.
Two mainstream distributed database architectures
There are two main types of distributed database architecture:
- Shared-nothing architecture: some of the earliest products built on shared nothing were the MPP databases. Users who choose an MPP database are likely more concerned with overall system throughput than with query latency. MPP databases mainly serve reporting and analytical workloads and often use columnar storage, although column versus row storage is not absolute; this is merely a summary of existing product characteristics.
- Shared-storage architecture: the mainstream products built on shared storage today mostly handle real-time online transactions. With databases of this architecture, users care more about transaction processing latency, often with millisecond or even microsecond requirements. These products mainly serve online transactional applications, and because there is no analytical or reporting requirement, they typically use row storage rather than column storage.
Again, it is hard to say that one architecture beats the other; users need to choose based on their business. Taking a closer look at shared storage, below is a simplified architecture diagram of a distributed database built on the shared-storage architecture.
As you can see, the system is divided into three layers:
- At the top is the proxy layer;
- In the middle is the computing layer;
- At the bottom is the distributed storage layer.
Nodes across the three layers are interconnected through a high-speed network, while the computing nodes in the middle layer have no direct network interaction with one another. The distributed storage layer at the bottom is a shared storage pool that uses a variety of media for final data storage. Such a database system has the following characteristics:
- Strong elasticity: because compute and storage are separated on the shared-storage architecture, the computing layer and the storage layer are loosely coupled; when capacity needs to grow or shrink, each can be scaled independently, which is very flexible.
- Good compatibility: the DB instance is 100% compatible with the MySQL and PostgreSQL kernels.
- High availability: data is kept in multiple replicas in the storage layer's distributed storage pool and can be deployed across multiple datacenters to improve system availability.
- High performance: the system can be deployed in cluster mode, where a cluster performs far better than a single machine.
- Low cost: compute nodes and storage nodes can be scaled independently without affecting each other, so there is no need to expand both at the same time, which keeps cost down. We have also built high-compression-ratio technology into the storage layer, described in more detail later.
- Large capacity: supports data tables at TB or even PB scale.
The data computing engine in depth
Having seen the overall architecture, let's look at the computing engine. The database computing engine handles the computation logic and transaction logic. Its core modules include:
- Access layer
- Query Engine
- Buffer Pool
- Logging subsystem
- Transaction subsystem
- Lock subsystem
Suffice it to say, it would be difficult to build a relational database with full ACID properties if any of these modules were missing. With the key submodules in mind, let's look at the computing layer's data model. To a user or a back-end application developer, a database may simply be a collection of users, databases, and tables. But to a database developer, a database is essentially a complex combination of an in-memory data model and an on-disk data model. Let's look at each.
In-memory structures: first, there must be a page/block-based LRU cache; there is also a page-organized tree structure for data, indexes, and so on; and there is a global log buffer, or perhaps thread-local log buffers, for flushing logs.
On-disk structures: there are certainly redo logs, tablespaces, and temporary tables. Together, these in-memory and on-disk structures form the data model of the computing engine.
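To make the in-memory side concrete, here is a minimal Go sketch of a page-granular LRU buffer pool. It is an illustration only: the 16 KB page size matches MySQL's default, and readPageFromStorage/flushPage are hypothetical stubs standing in for the real storage IO path.

```go
// A minimal page-granular LRU buffer pool sketch. pageSize, PageID, and
// the stubbed IO functions are illustrative assumptions, not the real kernel.
package bufferpool

import "container/list"

const pageSize = 16 * 1024 // MySQL's default InnoDB page size

type PageID uint64

type Page struct {
	id    PageID
	data  [pageSize]byte
	dirty bool
}

// LRUCache keeps hot pages in memory and evicts the least recently
// used page once capacity is reached.
type LRUCache struct {
	capacity int
	order    *list.List               // front = most recently used
	pages    map[PageID]*list.Element // page id -> element holding *Page
}

func NewLRUCache(capacity int) *LRUCache {
	return &LRUCache{
		capacity: capacity,
		order:    list.New(),
		pages:    make(map[PageID]*list.Element),
	}
}

// Get returns the cached page and promotes it to most recently used;
// on a miss it loads the page and may evict the coldest one.
func (c *LRUCache) Get(id PageID) *Page {
	if el, ok := c.pages[id]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*Page)
	}
	p := readPageFromStorage(id)
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		victim := oldest.Value.(*Page)
		if victim.dirty {
			flushPage(victim) // write back before eviction
		}
		delete(c.pages, victim.id)
		c.order.Remove(oldest)
	}
	c.pages[id] = c.order.PushFront(p)
	return p
}

// Hypothetical stand-ins for the storage-layer IO path.
func readPageFromStorage(id PageID) *Page { return &Page{id: id} }
func flushPage(p *Page)                   {}
```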
The life cycle of a SQL statement
Now that we know how the data is organized, a more interesting question is: when a user sends a SQL statement to the database system, how does the system process it, query the tables, and return the results? Here is a quick look at the full life cycle of a SQL statement.
Suppose the user sends a SQL query that selects data from two tables and adds some constraints, such as filters in the WHERE clause. When this SQL enters the database system, we will:
- Split the raw SQL string into valid tokens. Here, SELECT, T1, WHERE, and so on are all valid tokens.
- Organize these tokens into an abstract syntax tree (AST) according to the grammar rules, then walk the tree.
- From the tree structure and semantic rules, build a query plan (at this stage called the logical plan). We then optimize the logical plan to improve query performance, and finally generate a physical plan from it, which describes how we actually interact with storage: which data to fetch and which concrete operations to perform.
- Hand the physical plan to the execution engine (the dominant model today is the Volcano model). During execution it interacts with the storage layer to fetch data, runs the computation logic in each operator, and returns the computed results to the user in batches.
This is the full life cycle of a SQL statement.
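To illustrate the execution step, here is a minimal Go sketch of the Volcano (iterator) model: every operator exposes a Next() call that pulls one row at a time from its child, so a plan tree such as Filter-over-Scan executes as a pulled pipeline. The Row type and operators are simplified assumptions, not the engine's actual interfaces.

```go
// A minimal Volcano-model sketch: operators form a tree and rows are
// pulled one at a time from the root. Types here are illustrative.
package volcano

type Row []string

// Operator is the Volcano interface: each Next call pulls one row
// from this operator (and, transitively, from its children).
type Operator interface {
	Next() (Row, bool) // second value is false once input is exhausted
}

// Scan produces rows from an in-memory "table".
type Scan struct {
	rows []Row
	pos  int
}

func (s *Scan) Next() (Row, bool) {
	if s.pos >= len(s.rows) {
		return nil, false
	}
	r := s.rows[s.pos]
	s.pos++
	return r, true
}

// Filter passes through only rows matching its predicate,
// e.g. the WHERE clause compiled into the physical plan.
type Filter struct {
	child Operator
	pred  func(Row) bool
}

func (f *Filter) Next() (Row, bool) {
	for {
		r, ok := f.child.Next()
		if !ok {
			return nil, false
		}
		if f.pred(r) {
			return r, true
		}
	}
}

// runPlan drains the operator tree, the way an executor returns
// result batches to the client.
func runPlan(root Operator) []Row {
	var out []Row
	for r, ok := root.Next(); ok; r, ok = root.Next() {
		out = append(out, r)
	}
	return out
}
```

For example, `runPlan(&Filter{child: &Scan{rows: table}, pred: p})` evaluates something like `SELECT * FROM t1 WHERE p(row)` row by row, without materializing intermediate results.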
Computing engine kernel optimization
Now that you have a clear picture of how the data is organized and how SQL is executed, let me show what optimizations we have made at the kernel level.
- Deep optimization, even transformation, of the logging subsystem. We replaced parts of the native flushing mechanism, built an efficient append-only model on new hardware, and enriched the redo log types and semantics to support the whole system.
- An Extent Data Cache implemented on shared memory. If the database process exits unexpectedly, no hot data remains in process memory after restart; the Extent Data Cache preserves the hot data from before the interruption and avoids a cold start.
- DDL optimization. Combined with the capabilities of the distributed storage layer, we made deeper DDL optimizations to improve DDL efficiency and performance.
- Operator pushdown. Combined with the capabilities of the distributed storage layer, we push some simple operators down to the storage layer and use the idle CPU on storage nodes for computation, improving the performance of complex queries.
- A lock-free ReadView (see the sketch after this list). It removes the global lock bottleneck under concurrent transactions and improves multi-transaction concurrency.
- Merging the redo log and the binlog. The goal is to eliminate problems with the native binlog/redo log XA mechanism while also improving write performance by combining the two into a single stream.
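The lock-free ReadView deserves a slightly more concrete picture. As an assumption about how such a design can work in general (not a description of our exact kernel change), the mutex-protected construction of a snapshot can be replaced with atomic sequence numbers, so that taking a snapshot is a single atomic load:

```go
// A minimal lock-free snapshot sketch: commit order is published through
// one atomic counter, so no global transaction-system mutex is needed.
// This shows the general idea only, not the actual kernel implementation.
package mvcc

import "sync/atomic"

var commitSeq atomic.Uint64 // advanced once per committed transaction

// ReadView captures the snapshot point at transaction start.
type ReadView struct {
	snapshotSeq uint64
}

// NewReadView is a single atomic load; many transactions can take
// snapshots concurrently without contending on a lock.
func NewReadView() ReadView {
	return ReadView{snapshotSeq: commitSeq.Load()}
}

// Visible reports whether a row version committed at versionSeq
// should be seen by this snapshot.
func (v ReadView) Visible(versionSeq uint64) bool {
	return versionSeq <= v.snapshotSeq
}

// Commit assigns the committing transaction its sequence number.
func Commit() uint64 {
	return commitSeq.Add(1)
}
```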
The distributed storage system in depth
After the computing engine, let's look at the second core subsystem: the distributed storage layer. From the discussion of the computing layer, you are already familiar with the data types involved. The two most critical kinds of database data are redo logs and pages; with these two alone, the database can run. Our entire storage layer is in fact built around storing redo logs and pages.

At the storage layer we have two problems to solve. The first concerns pages: in a distributed storage system, how do we store a table in the storage layer? A traditional standalone database uses local storage; the file system is visible on the node, and the database persists data simply by writing it to the local disk through the file-system interface. In a distributed architecture, however, storage is a remote distributed storage pool. How do we save a table that used to be written to a local file system into this remote pool? The answer, simply put, is to construct a set of distributed mapping rules.
As the figure above shows, the computing layer holds many tables, and each table is in fact a tablespace composed of pages. What we need to do is map the computing layer's pages to the storage layer's segments. The mapping rule can be hash-based or round-robin as shown in the figure, or any custom rule, as long as addressing is deterministic and unique.
Once pages are mapped to segments, each segment can be replicated to multiple physical nodes in the storage pool.
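A minimal Go sketch of such a mapping, with an illustrative segment count and a toy placement rule standing in for the real one:

```go
// Page-to-segment mapping sketch. The modulo rule and the consecutive-node
// placement are illustrative; the only hard requirements are that the
// mapping is deterministic and addresses are unique.
package mapping

const segmentsPerTablespace = 16 // illustrative, not a real constant

type PageNo uint64
type SegmentID uint32

// SegmentFor deterministically routes a page to a segment. Any rule
// (hash, round-robin, custom) works as long as the same page always
// resolves to the same segment.
func SegmentFor(page PageNo) SegmentID {
	return SegmentID(page % segmentsPerTablespace)
}

// Replicas places a segment's copies on distinct physical nodes;
// a toy rule standing in for the real placement policy.
func Replicas(seg SegmentID, nodeCount, replicaCount int) []int {
	nodes := make([]int, 0, replicaCount)
	for i := 0; i < replicaCount; i++ {
		nodes = append(nodes, (int(seg)+i)%nodeCount)
	}
	return nodes
}
```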
What are the advantages of this model?
- First, high availability and reliability: multiple replicas can be stored across datacenters.
- Second, better computing performance, because we can compute on multiple segments in parallel (see the sketch after this list). As a simple example, suppose we must scan all pages from beginning to end. The naive approach is a sequential, linear scan, which is inefficient. With our data model, the scan can be dispatched to multiple segments at once: pages are already partitioned by segment, and segments are replicated to different physical nodes, so the computing resources of multiple nodes can all be put to work and the scan completes quickly.
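To make the parallel scan concrete, here is a minimal Go sketch that fans the scan out across segments with goroutines; scanSegment is a hypothetical stand-in for the RPC to the storage node holding a segment replica:

```go
// Fan-out scan sketch: each segment is scanned concurrently, using the
// compute resources of the different physical nodes holding the replicas.
package scan

import "sync"

type Row []byte

// scanSegment is a hypothetical stub for a storage-node RPC.
func scanSegment(seg int) []Row { return nil }

// ParallelScan scans all segments concurrently and merges the results.
func ParallelScan(segments []int) []Row {
	var (
		mu  sync.Mutex
		out []Row
		wg  sync.WaitGroup
	)
	for _, seg := range segments {
		wg.Add(1)
		go func(seg int) {
			defer wg.Done()
			rows := scanSegment(seg) // executes on a remote node
			mu.Lock()
			out = append(out, rows...)
			mu.Unlock()
		}(seg)
	}
	wg.Wait()
	return out
}
```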
The second problem revolves around the log. Now that we know how pages are stored, where do pages come from? The answer is simple. One very important idea we followed throughout building this distributed database is: the log is the database. The data that ultimately lands on disk is pages, and those pages come from the log.
The computing engine generates many redo logs during transaction processing. These are first written to the engine's redo log buffer, then parsed into page-based log buffers, which are in turn split into segment-based log buffers using the same mapping rules, and finally written to the remote storage pool. The storage pool organizes a log chain per segment; each page consumes only the log records that modify it, continuously generating new versions, and page reads can then serve requests for different versions. That is the whole path from log to page.
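A minimal Go sketch of this log-to-page path, with illustrative types: each segment keeps an LSN-ordered chain of redo records per page, and a page version is materialized by replaying the chain up to the requested LSN.

```go
// "Log is the database" sketch: pages are materialized from redo records.
// Record layout and the byte-copy apply rule are illustrative assumptions.
package replay

type LSN uint64

// RedoRecord describes one physical modification to a page.
type RedoRecord struct {
	lsn    LSN
	offset int    // where in the page the change applies
	data   []byte // the new bytes
}

type Page struct {
	data [16 * 1024]byte
	lsn  LSN // version of this page = LSN of the last applied record
}

// Materialize replays the per-page log chain onto a base image up to
// the requested LSN, so readers at different LSNs each get a consistent
// version without the compute layer ever writing pages itself.
func Materialize(base Page, chain []RedoRecord, upTo LSN) Page {
	p := base
	for _, rec := range chain {
		if rec.lsn > upTo {
			break // the chain is ordered by LSN
		}
		copy(p.data[rec.offset:], rec.data)
		p.lsn = rec.lsn
	}
	return p
}
```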
At this point you probably have another question: logs, pages, and so much other data are stored in multiple replicas, so how do we keep the storage cost under control? This is a very real problem, and we used the following key techniques:
- Tiered data storage across a variety of media. For example, persistent memory holds the hottest data, high-performance SSDs hold warm data, and HDDs hold archived cold data. Using media at different price points solves, or at least mitigates, the cost problem at the physical hardware level.
- A standalone storage engine running on each storage node, in which we implemented an efficient compression algorithm that compresses data without giving up too much performance. This mitigates the cost problem at the software level.
- Smart replica policies: our storage system uses multiple replicas, but some scenarios do not require full copies; for example, EC (erasure coding) and lazy-replica policies can be used to reduce cost.
Combining these cost-reduction techniques, even with some data redundancy we can keep the cost of the storage tier under control.
The practical application of the distributed database in the Douyin Spring Festival Gala
Now that you have a picture of our distributed database system, let's reveal the technology behind supporting the massive traffic surge of the event.
Business supported by the distributed database in the Douyin Spring Festival Gala
The red-envelope activity during the Douyin Spring Festival Gala generated enormous traffic, and the distributed database supported a large number of online and offline businesses, including device push, novels, wallets, and more. The most representative is device push: it operates at a very large scale, with about one billion users, a peak read QPS above 6 million, a peak write QPS above 3.6 million, and 20+ TB of stored data. Under such high traffic and large data volume, we carried out the following joint work with the business:
- Query pattern analysis: in general, a mature business does not make naive mistakes such as missing or poor-quality indexes, so the first thing we did was analyze the query patterns directly to see where transaction performance could be improved. The analysis showed that the workload was dominated by updates and that transaction sizes might be inappropriate, so we split transactions, reduced IO sizes, removed a global lock, and optimized update performance to make transactions run faster.
- Compute/storage performance mapping: we first set a minimum specification for virtual and physical machines, obtained a benchmark result at that minimum, and then gradually increased specifications to check whether compute and storage performance scaled roughly linearly. After five or six rounds of baseline testing, the QPS requirements were finally met while satisfying the business SLA.
- Performance stress testing: the business required a read QPS above 6 million and a write QPS above 3.6 million, and our stress tests had to exceed those numbers to ensure no surprises on the day of the event. So on top of the baseline tests we ran multiple rounds of stress testing, during which we found some cache policies that were incorrect or ineffective; we then optimized the storage layer's page IO cache to improve read and write performance.
- Fault drills: once performance testing is complete, the most important things are fault drills and disaster recovery plans. On the night of the Gala, the real risk is unavailability, not merely poor performance. We therefore prepared many contingency plans and test cases for failure drills, including:
- Network failure tests: for example, using iptables to simulate total network packet loss, and randomly killing storage nodes to see whether their failure would trigger a chain reaction.
- Disk failure drills: injecting single-node disk failures, and killing some storage nodes, metadata nodes, and single-node disks to simulate storage disk failures.
After the network and storage failure drills, we prepared emergency disaster recovery plans and tools based on the results. Fortunately, none of the plans were needed on the night of the Gala, and the system successfully supported the entire evening's business activities.
Conclusion and outlook
That concludes the introduction to the Volcano Engine distributed database and its application in the Douyin Spring Festival Gala. Finally, I would like to share our team's thoughts on the future evolution of distributed databases.
- Take advantage of new hardware acceleration. New hardware includes not only RDMA networks and persistent memory, but also computational storage devices.
- Multi-model computing engines: beyond relational databases, there are other popular data models and databases, such as document databases and graph databases. Imagine an all-in-one computing engine in the future that executes transactions across multiple engines directly through a single SQL or other query statement.
- Continuous optimization of the relational database kernel, even retrofitting it or starting from scratch:
- First, today's B+ trees are single-node B+ trees inside a single storage engine. In the future we may try distributed B+ trees and the corresponding distributed algorithms.
- Beyond pessimistic transactions, we can explore optimistic transactions and see whether the two can be mixed.
- Although I said relatively little about locks above, there is in fact a lot that can be done there: for example, trying more diverse lock-scheduling algorithms, or introducing predicate locks to enrich the lock subsystem.
- Finally, AI has become a hot trend, but the combination of AI and databases is still relatively rare and niche. We can imagine doing AI for DB or DB for AI in the future: using AI for automatic parameter tuning and automatic index-quality diagnosis, or even bringing AI into the storage layer for online format diagnosis and row-storage format conversion.
Q&A
Q: At which stage do you determine whether data is stored as hot or cold?
A: As mentioned earlier, our architecture separates compute and storage; the computing layer and the storage layer are decoupled and unaware of each other, so the hot/cold judgment has to be made in the storage layer. What is the rule? First, we can make a very simple judgment based on access frequency. Second, we can also judge based on IO pressure: data may be accessed frequently, but if each access generates little IO pressure, it may still be classified as warm or cold data.
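As a rough illustration of that rule, here is a minimal Go sketch combining the two signals; the tiers mirror the media tiers described earlier, and the thresholds are placeholders, not production values:

```go
// Hot/cold classification sketch: combines access frequency with IO
// pressure. Thresholds and signals are illustrative placeholders.
package tiering

type Tier int

const (
	Hot  Tier = iota // e.g. persistent memory
	Warm             // e.g. high-performance SSD
	Cold             // e.g. HDD archive
)

// Classify demotes data that is touched often but moves little IO,
// matching the "frequent access, low IO pressure" case above.
func Classify(accessesPerMin, bytesPerAccess float64) Tier {
	ioPressure := accessesPerMin * bytesPerAccess // rough bytes/minute
	switch {
	case accessesPerMin > 1000 && ioPressure > 1<<20:
		return Hot
	case accessesPerMin > 10:
		return Warm
	default:
		return Cold
	}
}
```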
Q: How do you support both MySQL and PG?
A: Right now we support only MySQL and PG, but in theory we can support any database built on a log-and-page mechanism. The principle is that we built a unified, generalized abstraction in the storage layer around the idea that the log is the database, with a set of common log-to-page interfaces. As long as a computing engine's redo log and page data can be adapted to this unified interface, that engine can plug into the whole system, letting us become compatible with more database products and computing engines and grow into a database family.