Chuanhui Yang, also known as Rizhao, is currently CTO and a founding member of the OceanBase team. This article is based on the live broadcast "Integrated Design of Next Generation Distributed Database" held on April 28.

Hi, my name is Yang Chuanhui. Last month, I discussed the three generations of distributed database evolution: from the first generation of NoSQL systems, through the second generation of scalable SQL processing, to the next generation of transparently scalable enterprise databases. Compared with the second generation, the next generation of enterprise-level distributed databases adopts an integrated design, absorbing the refinement of classical database design and pursuing ultimate performance. This live broadcast is a continuation of the last one, introducing the integrated design of the next generation enterprise distributed database.

It is much easier to make a stable system efficient than to make an efficient system stable

Before I begin, let me share a few thoughts on system design philosophy. Every system must be designed with architecture, stability, and performance in mind. What is the relationship among the three? A classic rule is that "it is much easier to make a stable system efficient than to make an efficient system stable."

The hardest part is stabilizing the system from zero to one. With a stable system, it is usually straightforward to tune performance step by step until you hit the performance ceiling of the system architecture. Therefore, before designing the architecture, we should first consider the goals and the performance ceiling of the system, then stabilize the system on top of the correct architecture, and finally optimize performance. If the goal is to build a world-class enterprise-level distributed database, then the pursuit of extreme performance must be treated as a hard requirement from the very beginning of the architecture design, and this fundamentally changes how an enterprise-level distributed database is designed.

A review of three generations of distributed databases:

To address scalability and concurrency performance issues, distributed databases have gone through three generations:

The first generation is the distributed storage system, also known as NoSQL, which was popular before 2013. The basic idea is to sacrifice SQL, transactions, consistency, and enterprise-level features, supporting only simple key-value (KV) operations in exchange for scalability.

The second generation, represented by Google Spanner, supports scalable SQL. On top of the first-generation NoSQL systems, SQL and distributed transactions are introduced to ensure strong consistency, but these systems pay little attention to SQL compatibility and cost-effectiveness, and their single-machine performance is often poor.

The third generation is the transparently scalable enterprise-level database, which is what I call "the next generation of enterprise distributed database", represented by OceanBase. Its distributed architecture is transparent to the business, it provides comprehensive enterprise-level features compatible with MySQL and Oracle, it supports HTAP mixed workloads, it is implemented in C for high single-machine performance, and its architecture has a high performance ceiling on which the pursuit of ultimate performance can be built.

An enterprise distributed database must not only do distribution well, but also stay true to the essence of the database

A classical database contains several core engines, such as storage, transactions, and SQL, and builds enterprise-level capabilities on top of them: database features and performance, cost, security, and reliability. On this foundation, an enterprise-level distributed database introduces distribution to achieve high availability and scalability. Although distributed databases have produced many innovations on top of native distributed architectures, they still have a long way to go in functionality, performance, and refinement.

Besides doing distribution well, an enterprise distributed database must also stay true to the essence of the database: improving functional compatibility and single-machine performance. Key technologies of classical databases, such as the SQL standard and transaction models, have stood the test of time, and enterprise-level distributed databases need to learn SQL compatibility from them. Some of the hottest technologies in the distributed world today, such as storage-compute separation and HTAP, also came from classical databases, where they tend to be much more refined.

Storage-compute separation

Most people today have heard of storage-compute separation, but it varies greatly from system to system. There are three storage-compute separation solutions in the industry:

The first is essentially database-and-table sharding middleware: the back-end databases act as storage and the middleware acts as compute. Is this real storage-compute separation? Obviously not; I call this scheme "pseudo storage-compute separation";

In the second, distributed KV acts as storage while SQL and transactions act as compute, with a loosely coupled design between the two. This scheme solves scalable SQL processing, but every operation involves remote access between compute and storage, and transaction-related metadata is stored in distributed KV tables, which adds extra transaction-processing overhead and sacrifices performance.

In the third design, SQL, transactions, and KV form the compute layer, while a distributed file system holding the KV layer's data blocks forms the storage layer. This solution derives from the IOE architecture of classical databases, in which the SQL, transaction, and KV layers are compute and the underlying layer relies on EMC's SAN storage. Amazon Aurora borrows this storage-compute separation architecture from classical databases and has thrived in the cloud-native environment, becoming the de facto standard.

To pursue ultimate performance, I believe enterprise distributed databases should learn native storage-compute separation from classical databases, and I recommend the third solution.

HTAP

HTAP is also a hot concept in the distributed database space. There are two ways to do this:

The first is physical isolation between a primary and a standby database: the primary handles OLTP, the standby handles OLAP, and the two are synchronized through redo logs. The second is to handle mixed OLTP and OLAP workloads in the same engine, placing OLTP and OLAP requests into different resource groups and isolating the resource groups logically.

The first scheme is relatively simple to implement but is less cost-effective because of the extra data redundancy. The second scheme is more complex, but its integrated design makes it more cost-effective. The second solution comes from classical databases such as Oracle and SQL Server.

In my opinion, enterprise distributed databases should learn HTAP from classical databases and adopt the second approach.

Extreme performance

The performance bottleneck of modern databases is usually the CPU, not disk I/O. To optimize CPU usage, classical databases pursue localization:

Minimize cross-server operations and achieve server localization;

Try to complete all SQL operations in the same thread to achieve CPU core localization;

Try to place data that is accessed together close in memory so that it hits the CPU L2 cache, achieving CPU cache localization.

In pursuit of refinement at the code level, the main code is implemented in C, and a small amount of critical-path code is even written in assembly to minimize the number of CPU instructions per SQL operator. Memory is managed manually to take full advantage of the database's memory life-cycle characteristics, improving memory utilization and avoiding the extra overhead of language-level garbage collection.

Enterprise-class distributed databases should also learn refined design from classical databases to achieve extreme performance. Database optimization comes down to the nitty-gritty. How should we understand the ceiling of a system design? For example, if a distributed database is not designed for extreme performance, then no matter what you do afterwards you can only produce the equivalent of a 28nm process chip; only with extreme performance in mind from the start is it possible to reach 7nm, 5nm, or even 3nm. Of course, the architectural complexity of 7nm and 5nm is much higher than that of 28nm. Progress will be relatively slow in the first few years, but the technology dividend is gradually released once the architecture stabilizes.

Integrated design of an enterprise distributed database

An enterprise-level distributed database is an MPP architecture that supports dynamic scaling and lossless disaster recovery.

At the SQL layer, it follows the HTAP design of enterprise databases such as Oracle, supporting mixed OLTP and OLAP workloads in the same engine. At the transaction layer, it implements automatically fault-tolerant distributed transactions and multi-version concurrency control, providing the same strong consistency as a classical centralized database. At the storage layer, partition-level Paxos log synchronization enables high availability and automatic scaling, and flexible storage architectures are supported, including local storage and storage-compute separation.

Only when an enterprise-level distributed database combines scalability with good functional compatibility and cost-effectiveness, and achieves breakthroughs in database benchmarks such as TPC-C and TPC-H as well as in core application scenarios, is it truly the next generation of distributed database.

Paxos for enterprise-class distributed databases

An enterprise-class distributed database supports partition-level Paxos log synchronization and achieves high availability and automatic fault tolerance through the Paxos election protocol, with RPO=0 and RTO under 30 seconds. Partition-level log synchronization is scalable: servers can be added dynamically to increase system capacity without human intervention.

In 2014, OceanBase introduced the Paxos protocol into the database and achieved RPO=0 in a core financial system for the first time. It was precisely this technology that made OceanBase the final choice of Alipay. Based on the Paxos protocol, OceanBase went on to develop flexible disaster recovery deployments such as three data centers within one city, three data centers across two regions, and five data centers across three regions.

A classical centralized database usually adopts primary/standby synchronization, with two modes. The first is strong synchronization: each transaction must be synchronously replicated to the standby before the user is answered. This way no data is lost when a server fails, but the service must stop, so availability cannot be guaranteed. The other is asynchronous replication: each transaction answers the user as soon as it succeeds on the primary. This achieves high availability, but the primary and standby data can be inconsistent, and data may be lost when the standby takes over as the new primary.

Strong consistency and high availability, the two most important properties of a database, cannot both be achieved under primary/standby synchronization. Paxos is a majority-based distributed voting protocol in which each transaction must be successfully written to more than half of the servers before the user is answered. As the saying goes, "two heads are better than one": suppose there are three servers and each transaction must succeed on at least two of them; then no matter which single server fails, the system still has a server containing all the data and can keep working normally. No data is lost, the RPO is 0, and the RTO is under 30 seconds. Every protocol that guarantees RPO=0 is Paxos or a variant of it, and Raft is one of them.
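
To make the majority rule concrete, here is a minimal sketch in Python (illustrative only, with a hypothetical replica API; real Paxos also handles leader election, log recovery, and membership changes):

```python
# Illustrative sketch of majority-based log replication (not a full Paxos implementation).
# A redo-log entry is acknowledged to the user only after more than half of the
# replicas have durably written it, so losing any single replica loses no data.

def replicate(entry: bytes, replicas: list) -> bool:
    """Return True once the entry is durable on a strict majority of replicas."""
    acks = 0
    majority = len(replicas) // 2 + 1          # e.g. 2 out of 3, 3 out of 5
    for replica in replicas:
        try:
            replica.append_and_fsync(entry)    # hypothetical replica API
            acks += 1
        except ConnectionError:
            continue                           # a minority of failures is tolerated
        if acks >= majority:
            return True                        # safe to answer the user: RPO = 0
    return False                               # no majority: reject, but lose no data
```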

Raft is a simplification of the Paxos protocol: it adds the restriction that voting must be sequential, which means the database redo log must be synchronized in order. Raft is simpler to implement, but each partition can only synchronize sequentially, so concurrent synchronization performance is poor and the probability of transaction delays is high in weak network environments. Paxos supports out-of-order synchronization; although the implementation is more complex, concurrent synchronization performance is good and the probability of transaction delays in weak networks is low. Raft-style systems invented a concept called Multi-Raft, in which each shard runs its own Raft protocol. This is not equivalent to classic Multi-Paxos: Multi-Paxos refers to concurrency within a shard, whereas Multi-Raft refers to concurrency across shards, which still becomes a performance bottleneck when a single shard receives heavy writes.

Both Paxos and Raft are reasonable approaches. I personally recommend the native Paxos protocol for its higher concurrency performance. This is also what leading Internet companies do in their large-scale distributed storage systems, including Google Spanner, Microsoft Azure, Amazon DynamoDB, and OceanBase.

Classical databases usually use a server-level static log stream: there is only a single log stream between the primary and the standby, and this approach generally does not support dynamic scaling. The key to the scalability of an enterprise-level distributed database is partition-level dynamic log streams, which naturally enable partition-level load balancing for dynamic scaling. For storage-compute separation, one can learn from Oracle's Automatic Storage Management (ASM) technology, which manages local disks and remote distributed storage through a unified interface: each data block can be written to a local disk or to remote distributed storage. Since data blocks can usually be cached on the SQL compute nodes through the buffer pool, most accesses stay local and perform well. This is also the solution used in Oracle RAC.
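
The unified-interface idea can be sketched as follows (hypothetical interfaces, not the actual ASM or OceanBase API):

```python
# Hypothetical sketch of an ASM-style unified storage interface: every data block goes
# through the same read/write API whether it lives on the local disk or in a remote
# distributed store, and a buffer pool on the SQL compute node keeps hot blocks local.

from abc import ABC, abstractmethod

class BlockStore(ABC):
    @abstractmethod
    def read_block(self, block_id: int) -> bytes: ...

    @abstractmethod
    def write_block(self, block_id: int, data: bytes) -> None: ...

class LocalDiskStore(BlockStore):
    """Blocks on the server's own disk (fast path); a dict stands in for the disk."""
    def __init__(self):
        self.blocks = {}
    def read_block(self, block_id):
        return self.blocks[block_id]
    def write_block(self, block_id, data):
        self.blocks[block_id] = data

class RemoteStore(LocalDiskStore):
    """Blocks in a shared distributed file system or cloud store (slow path)."""
    pass

class BufferedStorage:
    """SQL compute nodes read through a buffer pool, so most accesses stay local."""
    def __init__(self, backend: BlockStore):
        self.backend = backend
        self.buffer_pool = {}
    def read_block(self, block_id):
        if block_id not in self.buffer_pool:        # cache miss: fetch from the backend
            self.buffer_pool[block_id] = self.backend.read_block(block_id)
        return self.buffer_pool[block_id]
```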

Many of you here today are DBAs, and many DBAs I have worked with in the past take it for granted that strong synchronization means a significant performance hit, for example when strong synchronization is enabled in Oracle Data Guard.

Enterprise-class distributed databases are designed around Paxos strong synchronization and use asynchronous commit and group commit techniques to avoid its impact on system performance. After a transaction submits its log-writing task, the worker thread is released immediately and can process other concurrent transactions; once the log is strongly synchronized, an asynchronous callback responds to the user. In addition, the logs of multiple concurrent transactions can be merged into one batch, written into the same buffer, and strongly synchronized to the standby machines together. OceanBase's practical experience shows that, with these optimizations, the loss in system concurrency caused by strong synchronization can be kept within 5% to 10%.
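
As an illustration of the asynchronous commit and group commit idea, here is a minimal sketch (the sync_to_majority callback and the batching policy are hypothetical; a real engine's thread model and callback handling are far more involved):

```python
# Illustrative group-commit sketch: worker threads enqueue a redo entry plus a callback,
# then move on to other transactions; a single flusher thread merges many entries into
# one strongly synchronized write and fires the callbacks afterwards.

import queue
import threading

log_queue = queue.Queue()

def commit_async(redo_entry: bytes, on_durable) -> None:
    """Called by a worker thread; returns immediately after queuing the log task."""
    log_queue.put((redo_entry, on_durable))

def flusher_loop(sync_to_majority) -> None:
    """Single background thread: drain the queue, write one batch, ack all at once."""
    while True:
        batch = [log_queue.get()]                        # block for at least one entry
        while not log_queue.empty() and len(batch) < 64:
            batch.append(log_queue.get_nowait())         # merge concurrent transactions
        sync_to_majority(b"".join(e for e, _ in batch))  # one strong sync for the batch
        for _, on_durable in batch:
            on_durable()                                 # respond to each user via callback

# Hypothetical usage:
# threading.Thread(target=flusher_loop, args=(paxos_sync,), daemon=True).start()
```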

Enterprise-level distributed databases support automatically fault-tolerant distributed transactions and multi-version concurrency control. The difficulty of a distributed database lies in fault recovery and concurrency control in distributed scenarios, where a large share of the code is devoted to exception handling. Many people ask: isn't a distributed database just a standalone database plus sharding, so why is it so hard? The root cause is fault tolerance and extreme performance: synchronous operations become asynchronous, error handling must be added to every operation, and these problems are coupled together and must be addressed at the architecture level.

Distributed transactions are required in many scenarios, and there are many different solutions. The XA middleware approach relies on the underlying databases: only while the underlying databases remain continuously available can exceptions be handled reasonably, and it does not solve the classic blocking problem caused by coordinator failure. NoSQL systems simply gave up strong consistency and distributed transactions, supporting only single-row or single-machine transactions.

For an enterprise-level distributed database, we must face the problem head-on and adopt Paxos plus the two-phase commit protocol to implement automatically fault-tolerant distributed transactions:

The Paxos protocol provides automatic fault tolerance, and the two-phase commit protocol guarantees the atomicity of distributed transactions. Many sharding schemes also claim to support distributed transactions but often lack fault tolerance, requiring manual handling of in-doubt distributed transactions when a server fails. Such solutions may pass PoC tests but cannot be taken to production.

The two-phase commit protocol divides transaction commit into a prepare phase and a commit phase, one phase more than a classical database commit. To avoid the extra transaction latency introduced by two-phase commit, one-phase optimization techniques can be used. Unlike the classic implementation of two-phase commit, the coordinator does not need to persist its state; instead, the outcome is jointly negotiated by the participants during failure recovery. With this scheme, the user can be answered after the first (prepare) phase succeeds rather than after the second (commit) phase. In the end there is only one log-write latency plus one RPC latency, and the log write and the RPC can proceed in parallel. Of course, this approach increases the complexity of failure recovery.
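
A greatly simplified sketch of this optimized two-phase commit follows (the participant API is hypothetical; a real implementation must also handle participant failure, coordinator takeover, and recovery by negotiation among the participants):

```python
# Simplified sketch of two-phase commit with the "answer after prepare" optimization:
# once every participant has durably prepared (its prepare log persisted via Paxos),
# the outcome is already decided, so the user can be answered before the commit phase.

def commit_distributed_txn(participants, answer_user):
    # Phase 1: prepare on all participants; each persists its prepare log via Paxos.
    for p in participants:
        if not p.prepare():                 # hypothetical participant API
            for q in participants:
                q.abort()
            answer_user("aborted")
            return
    # All participants are prepared: the decision is durable, answer the user now.
    answer_user("committed")
    # Phase 2: commit asynchronously; on failure, the participants re-negotiate the
    # outcome among themselves instead of relying on persisted coordinator state.
    for p in participants:
        p.commit()
```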

Multi-version concurrency control comes from classical databases. Early databases often implemented concurrency control with locks, and the four transaction isolation levels, such as Repeatable Read, were tailored to the locking mechanism. However, locking makes reads and writes mutually exclusive, so concurrency is poor. Today, except for IBM DB2, which still uses fine-grained locks for concurrency control, the other mainstream databases, including Oracle, SQL Server, and OceanBase, all adopt multi-version concurrency control with the snapshot isolation level. The advantage of this scheme is that writes do not block reads, which suits distributed databases with ever-growing concurrency. Beyond business factors, the choice of several core technologies has also been critical, such as choosing multi-version concurrency control over fine-grained locking, integrated cluster technology, and integrated mixed-workload technology.

There are two classic implementations of multi-version concurrency control: Read View, used in MySQL databases, and Read Version, used in commercial databases such as Oracle.

In the Read View scheme, transactions do not carry a commit version number, and every read must obtain the list of all in-flight transactions, which scales poorly. In the Read Version scheme, each transaction has a unique commit version number (equivalent to the SCN in an Oracle database), and each read only needs the latest commit version number.

An enterprise-level distributed database adopts the Read Version scheme: the changes produced while a transaction executes are stored in the system as uncommitted data. A read selects historical data according to its snapshot version and never sees data that is still being modified, which preserves transaction atomicity. For example, if account A is on machine 1, account B is on machine 2, and A transfers 50 to B, a snapshot read sees either the data before the transfer or the data after it, never a mixture.
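
A minimal sketch of snapshot reads under the Read Version scheme, using the transfer example above (the data structures are hypothetical; real engines manage versions, uncommitted data, and garbage collection far more carefully):

```python
# Illustrative multi-version read: each committed change carries a commit version number;
# a read with snapshot version S sees the latest version <= S and ignores uncommitted data.

class MultiVersionRow:
    def __init__(self, initial_value, initial_version=0):
        self.versions = [(initial_version, initial_value)]  # (commit_version, value), ascending
        self.uncommitted = None                              # value written by an in-flight txn

    def write_uncommitted(self, value):
        self.uncommitted = value                             # not visible to any reader yet

    def commit(self, commit_version):
        self.versions.append((commit_version, self.uncommitted))
        self.uncommitted = None

    def read(self, snapshot_version):
        visible = [v for cv, v in self.versions if cv <= snapshot_version]
        return visible[-1]                                   # newest version within the snapshot

# Transfer example: A on machine 1, B on machine 2, both read at the same snapshot,
# so a reader sees either both balances before the transfer or both after it.
a, b = MultiVersionRow(100), MultiVersionRow(0)
a.write_uncommitted(50); b.write_uncommitted(50)
snapshot = 5                                                 # snapshot taken before commit
assert (a.read(snapshot), b.read(snapshot)) == (100, 0)      # uncommitted data is invisible
a.commit(10); b.commit(10)                                   # both commit with version 10
assert (a.read(12), b.read(12)) == (50, 50)                  # later snapshot sees both changes
```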

To obtain a globally consistent read version, there are two implementations: the classic global version number generator, and TrueTime, which is used in Google Spanner. The benefit of TrueTime is that it scales globally across geographies, but the drawback is that it adds an extra 7 to 10 milliseconds of latency to every transaction. Enterprise-level distributed databases should adopt a global version number generator to reduce transaction latency and application adaptation cost. Of course, the global version number generator is essentially a single point, so its high availability and automatic fault tolerance, as well as the performance of version number generation, must be addressed.
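
Conceptually, a global version number generator is just a monotonically increasing counter exposed as a service; the sketch below ignores the replication, batching, and fault tolerance that a production timestamp service needs:

```python
# Conceptual sketch of a global version number (timestamp) generator: a single logical
# counter hands out commit versions and read snapshots. In practice it must itself be
# made highly available (e.g. replicated behind Paxos) and requests must be batched so
# that it does not become a performance bottleneck.

import itertools
import threading

class GlobalTimestampService:
    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_version(self) -> int:
        """Commit version for a transaction, or snapshot version for a read."""
        with self._lock:
            return next(self._counter)

gts = GlobalTimestampService()
snapshot = gts.next_version()        # consistent read snapshot
commit_version = gts.next_version()  # strictly larger than any earlier snapshot
```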

HTAP for enterprise databases

Drawing on the HTAP technology of classical databases, an enterprise distributed database handles mixed OLTP and OLAP workloads in the same engine. HTAP is not entirely new: classical databases such as Oracle have always handled the mixed workloads of enterprise customers with a high degree of refinement. Whether for mixed workloads or for storage-compute separation, an enterprise distributed database should learn from the experience of classical databases, stick to the essence of the database, and innovate standing on the shoulders of giants.

Classical databases support HTAP mixed workloads with hybrid row-column storage, which divides a table into row groups and stores data column by column within each row group, striking a balance between row storage and column storage.

Each row group is a compression unit that supports intelligent indexing and efficient compression, for example by storing index metadata such as count and sum inside the row group. In addition, both Oracle and SQL Server support in-memory column stores: OLTP queries access the row-store tables on disk, while an additional column-store table is kept in memory exclusively for OLAP queries. HTAP usually means supporting OLTP and real-time OLAP at the same time, so extremes should be avoided.
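
The row-group idea can be sketched as follows (an illustrative layout with hypothetical names and a numeric column; real formats such as Oracle's hybrid columnar compression are far more elaborate):

```python
# Illustrative hybrid row-column layout: a table is split into row groups, each stored
# column by column with precomputed metadata (count, min, max, sum) so analytical
# queries can skip whole row groups or answer aggregates from the metadata alone.

class RowGroup:
    def __init__(self, rows, columns):
        # Column-wise storage inside the row group (assumes numeric values here).
        self.columns = {c: [r[i] for r in rows] for i, c in enumerate(columns)}
        # "Intelligent index" metadata kept per column.
        self.meta = {c: {"count": len(v), "min": min(v), "max": max(v), "sum": sum(v)}
                     for c, v in self.columns.items()}

def sum_where(row_groups, col, lo, hi):
    """Sum `col` over rows whose value lies in [lo, hi]."""
    total = 0
    for rg in row_groups:
        m = rg.meta[col]
        if m["max"] < lo or m["min"] > hi:
            continue                          # metadata lets us skip the whole row group
        if lo <= m["min"] and m["max"] <= hi:
            total += m["sum"]                 # fully covered: answer from metadata alone
        else:
            total += sum(v for v in rg.columns[col] if lo <= v <= hi)
    return total

# Hypothetical usage: two row groups over an integer "amount" column.
groups = [RowGroup([(5,), (20,)], ["amount"]), RowGroup([(30,), (40,)], ["amount"])]
assert sum_where(groups, "amount", 10, 35) == 50
```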

Because HTAP uses hybrid row-column storage, its performance on purely offline OLAP workloads may not match that of dedicated OLAP and big data systems. The value of HTAP lies in its simplicity and versatility: for most medium-sized customers the data volume is not particularly large and a single system is enough. Of course, for super-large Internet enterprises one system is certainly not enough, and multiple systems must be deployed. We should keep exploring a more general HTAP scheme while avoiding one-size-fits-all thinking; the final answer is more likely to be one size fits a bunch.

Another technical challenge for HTAP is resource isolation.

The lazy approach is physical isolation via multiple replicas: OLTP queries go to the primary replica and OLAP queries go to a secondary replica. The other approach is the resource group scheme used in classical databases. Both Oracle and SQL Server support defining different resource groups and limiting the physical resources each group can use. Resource-group isolation can be achieved either inside the database, such as SQLVM in SQL Server, or through the operating system's underlying cgroup mechanism.

Database services usually have three types of users: OLTP users for online transactions, batch users for batch processing, and DSS users for reports. Each type of user is assigned CPU, I/O, and network resources through resource groups, so that batch and reporting work does not impact the OLTP service. The latest Oracle versions use the cgroup mechanism for multi-tenant isolation, and OceanBase has adopted a similar mechanism with good results.
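
On Linux, resource groups can be mapped onto cgroups. The sketch below illustrates the idea with the cgroup v2 filesystem interface; the group names and CPU quotas are purely illustrative, and a real database integrates this with its own tenant and resource-group management:

```python
# Illustrative mapping of database resource groups onto Linux cgroup v2.
# Creating a directory under /sys/fs/cgroup defines a group; writing to cpu.max caps
# its CPU usage; writing a process id to cgroup.procs places work in the group.
# (Requires root privileges and a cgroup v2 mount; paths and values are examples.)

import os

CGROUP_ROOT = "/sys/fs/cgroup"   # assumes cgroup v2 is mounted here

def create_resource_group(name: str, cpu_quota_pct: int, period_us: int = 100000) -> str:
    path = os.path.join(CGROUP_ROOT, name)
    os.makedirs(path, exist_ok=True)
    quota_us = period_us * cpu_quota_pct // 100
    with open(os.path.join(path, "cpu.max"), "w") as f:
        f.write(f"{quota_us} {period_us}")   # e.g. "20000 100000" = 20% of one CPU
    return path

def attach_pid(group_path: str, pid: int) -> None:
    with open(os.path.join(group_path, "cgroup.procs"), "w") as f:
        f.write(str(pid))

# Hypothetical usage: keep batch and reporting work from starving OLTP.
# oltp  = create_resource_group("db_oltp",  cpu_quota_pct=70)
# batch = create_resource_group("db_batch", cpu_quota_pct=20)
# dss   = create_resource_group("db_dss",   cpu_quota_pct=10)
```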

Compared with MySQL and other open source databases, the greatest value of classical commercial databases lies in their ability to handle complex queries and mixed loads. Enterprise-level distributed databases should learn SQL optimization and SQL execution techniques from classical databases.

Classical databases adopt cost-based optimization and support cost-based rewriting. Many distributed database optimizers rely on the open-source Calcite framework; constrained by a general-purpose framework, they struggle to do cost-based rewriting and their support for complex queries is limited. The Oracle optimizer works in two stages, serial optimization first and parallel optimization second, so distributed and parallel scenarios receive relatively little consideration. Enterprise-level distributed databases can innovate on top of the Oracle optimizer, strengthening distributed and parallel capabilities, supporting one-stage optimization, and considering both standalone and distributed costs in the cost model.

Classical databases also excel at SQL execution: whether in TPC-H and TPC-DS benchmarks or in real business scenarios, their single-core execution performance is very good. Enterprise-level distributed databases should learn SQL execution techniques from classical databases, including compiled execution, vectorized execution, the push model, SIMD processing, and a strongly typed system based on structured data.
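
To illustrate the difference between row-at-a-time and vectorized execution in a push model, here is a conceptual sketch (the column names and operators are hypothetical; real engines combine this with compiled code and SIMD over typed column vectors):

```python
# Conceptual sketch of vectorized execution in a push model: instead of pulling one row
# at a time through the operator tree, each operator processes a whole batch of column
# values and pushes the result batch to its parent, amortizing per-row overhead.

def scan(batches, next_op):
    for batch in batches:                 # batch = dict of column name -> list of values
        next_op(batch)

def filter_op(predicate, next_op):
    def consume(batch):
        keep = [i for i, v in enumerate(batch["amount"]) if predicate(v)]
        next_op({col: [vals[i] for i in keep] for col, vals in batch.items()})
    return consume

def sum_op(result):
    def consume(batch):
        result["total"] += sum(batch["amount"])  # tight loop over a column vector
    return consume

# Hypothetical usage over two batches of an "amount" column.
result = {"total": 0}
pipeline = filter_op(lambda v: v > 10, sum_op(result))
scan([{"amount": [5, 20, 30]}, {"amount": [15, 8]}], pipeline)
assert result["total"] == 65
```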

Enterprise distributed database = classic database + distributed

To summarize: an enterprise distributed database introduces distribution on top of the classical database, gaining high availability and scalability.

Besides doing distribution well, an enterprise distributed database must also stay true to the essence of the database and work hard on functional compatibility and cost-effectiveness. Today, some of the popular technologies in the distributed database field, including storage-compute separation and HTAP, originally came from classical databases, which have refined them through more than 50 years of practice and application-driven innovation.

I believe enterprise-level distributed databases are the major trend of the future. Whether we can finally get there depends on whether we can stand on the shoulders of the giants of classical databases, dig into their core technologies and concepts, and innovate on the basis of digesting and understanding them. Enterprise distributed databases have already made many innovations in distributed architecture, including partition-level Paxos log synchronization for high availability and scalability, and support for fault-tolerant distributed transactions and multi-version concurrency control, but our understanding of the core technologies of classical databases still needs to be strengthened, for example a deeper understanding of the HTAP and storage-compute separation technologies of enterprise databases such as Oracle and SQL Server.

I also want to take this opportunity to appeal to practitioners in the distributed database field: learn more from the integrated design philosophy of classical databases, build the next generation of distributed databases, and stay true to the essence of the database.

That concludes my sharing. This talk is a continuation of "Design Thinking for the Next Generation Distributed Database". The OceanBase team will continue to strengthen its technology sharing in the future. Thank you for your attention; I look forward to discussing distributed database architecture design with you and hearing your suggestions. Thank you very much!

Appendix: audience Q&A from the "Integrated Design of Next Generation Distributed Database" live broadcast

Why did OceanBase implement HTAP based on the same engine?

A: Classical databases such as Oracle and SQL Server actually support HTAP mixed workloads within one engine. This approach is more refined and cost-effective, and classical databases have solved TP/AP isolation through technologies such as cgroups. OceanBase can learn a great deal from the more than 50 years of classical relational database development.

What is the standard SQL support in the latest release of OceanBase?

A: In practice, most MySQL services can be migrated smoothly, and common Oracle features are supported in Oracle mode, so migration from Oracle to OceanBase requires few changes.

As a TP database, how strong is OceanBase's AP computing capability?

A: For AP, OceanBase mainly adopts technologies such as hybrid row-column storage, compiled execution, a vectorized engine, and cost-based query rewriting and optimization, and it scales well. OceanBase is an industry leader in real-time analysis within the AP field. Of course, for offline big data processing, Spark and other big data solutions may be more suitable.

How much data can OceanBase handle for AP workloads?

A: It depends on business needs; anywhere from 100 GB to the PB level, there is no hard limit. AP is a capability OceanBase provides: with a small amount of data you can use a lower degree of parallelism and fewer computing resources, and with larger data volumes more computing resources help. There is no inherent "quantitative" limit; the AP capability lies in fully mobilizing the machine's hardware resources and generating efficient parallel plans.

What is the relationship between OceanBase’s performance and the number of machines?

A: You can refer to OceanBase's TPC-C report: with more than 1,500 machines, the TPC-C workload scales almost linearly.

For more questions and answers, you can visit the "Developers" section of OceanBase's official website and interact directly with OceanBase CTO Chuanhui Yang in the "Question and Answer" section.

Click the link www.oceanbase.com/community/a… for one-click access to the Q&A area.