Yan Ran, one of the initial members of OceanBase, is currently responsible for transaction engine, high availability architecture, load balancing, performance optimization and other work.
The introduction
As we all know, transaction feature is an important function in database, which is the key to ensure data consistency. In the recent Open source technology live broadcast of OceanBase, Yan Ran, the r&d director of OceanBase distributed database transaction, took the essence of database transaction as the theme and deeply shared the solution of OceanBase on distributed transaction in the past and present seasons. This paper will describe database transactions from the following three aspects: 1, the history of transactions 2, challenges of transactions 3, distributed transactions
The body of the
The preexistence of business
Almost no one has an App called a “database” installed on their phone or computer, but almost every App uses a lower-level system called a “database” inside to store data. Everyone goes about their daily lives booking tickets, shopping and paying bills, but they may not realize that these operations are supported by a “database” system. So what exactly is a “database”? What does “database” have to do with “transactions”?
There have been many important events in the evolution of computer systems, but since the 1960s there has been a major revolution in computer systems that has reshaped the entire world digitally. The revolution was the mass use of the “disk”. Before disks, computers used “magnetic tapes” to store data. The tapes that computers use, the tapes that you use to listen to music, or the tapes that you use to watch movies, are essentially the same thing. It’s an age-revealing metaphor, since both of these things are now only seen in museums. The biggest problem with tape is that it is not easy to find things, and rewinding is an extremely patient task. This decided that the data processed by computers at that time were transcribed to tape by later transcribing, and computers were only used for statistics and reporting.
The original disk was a big thing. You can’t tell the size from the picture below, but it was actually the size of a microwave oven. This first hard drive, maybe not a hard drive, can only store 3.75MB of data. As technology continues to iterate, hard drives get smaller and bigger. To the back again, there was an upgraded version of flash media, hard disk is small to pervasive. But regardless of size, the random-access nature of these storage devices is the same.
The most important feature of the disk is “random access”, that is, what the user is looking for can be easily located. The rapid processing power of computers, combined with the ability of disks to quickly locate and modify data, opened up a new world of computer systems. Since then, computers began to participate in all aspects of human society. Sabre, the airline’s first ticketing system, for example, uses computers and disks to store ticket information and the buyer information directly for each flight that is up for sale. The information is stored on hard drives, which are accessed and modified in real time by ticket sellers, directly replacing the paper way of recording ticket information. After that, banking, communications, transportation and other industries used disks to store their business information and computers to respond to various “real-time processing” needs.
For the rapid expansion of data processing requirements, specifically for data storage and access to the “database management system” has emerged, and has been widely accepted as an independent system. The reason why a software system can stand on its own and be accepted by the market is its reusability, that is, in many business scenarios, using the software is more convenient and efficient than not using it, and the world will be better for it.
Database management system helps users to solve the universality of the main problems:
- Data storage and management
- The data access
- Data changes
- High availability
First of all, data storage and management solves the problem of how to manage data. It is essentially an abstraction of storage devices, so that users do not need to care about how data is placed on the hard disk, but only care about the model of data. From the early days to the present, data models emerge in an endless stream. Before the relational model, there are network models and hierarchical models, and after the relational model, there are object models, document models, graph models and so on.
Secondly, the data access to solve how to do query and analysis in the data, on the data model, the database system will provide abstract interface for users to use easily, such as SQL statement is on the relational model has super expression ability of data access interface.
Then, data change is to solve the problem of data modification, and the core of data modification lies in how to ensure the correctness and consistency of data, data correction is the most users do not want to see the situation. The “transaction” function is the core of the “data change” function. Transaction functionality provides users with 4 well-known features: Atomicity, Consistency, Isolatation, Durability (ACID). When a consumer commits multiple data changes to the DATABASE management system as a single transaction, the database management system ensures that the data changes made by the consumer have ACID properties.
Finally, high availability is a requirement that has been placed on systems in recent years. Before, systems tended to emphasize reliability, but now systems are more about availability. The reason is that as the Internet gradually deeps into people’s life, the fast forward of online projects, the impact of service unavailability on people’s life becomes more and more serious, so people’s requirements for the system also change from reliability to availability. This is a qualitative change where availability equals reliability plus continuity of service. High availability is also closely related to the transactional nature, as discussed later in this section.
Let’s go back to transaction processing. Database management systems are intended for use by other software developers to store, query, modify business data and express business logic. Now that you have a relational model, you have SQL statements for developers to use. As a developer writing a program, to make some changes to the data stored on the hard disk, so can not directly change, why also need “transaction” function? The difficulty lies in the computer system is not very good operation, imagine in the ticketing system ticketing process, if we modify the reservation seat information, but has not received money, the computer is broken, when the computer is restored, want to check out the wrong information and repair, that can not be simple. Transaction functionality, on the other hand, can address this concern.
Transaction challenges
So, what problems do transactions solve to make the world a better place for users?
Transaction functionality essentially does two important things, namely “failover” and “concurrency control.” The former guarantees atomicity and persistence of data modifications in the event of a computer system failure. The latter ensures that operations remain isolated from operations while data within the system is in concurrent operations. The common purpose of these two is to ensure that the database management system to operate on the data, the data itself to maintain integrity and consistency. For example, when Zhang SAN transfers to Li Si, we should modify zhang SAN’s account and Li Si’s account at the same time. This transfer operation will not cause more money or less money out of thin air because of failure, nor will it cause Zhang SAN’s money to become negative because of the transfer operation of Zhang SAN and Wang Wu at the same time.
Are “failover” and “concurrency control” challenging? Yes, very high.
First let’s take a look at “failover.” What could possibly go wrong with the computer system? Will the hard disk where the data is stored be damaged? Could the motherboard suddenly burn out and shut down the system? Will the machine lose power? Does the network cable go on and off? Switching machines may be shut down, or even the whole machine room may be without power during the summer peak, but the diesel generators in reserve can only last for a few hours? All of these situations are real.
The “fail-over” function of a transaction is to prepare a manual for the computer to deal with these exceptions and ensure atomicity and persistence through the process of exception handling. The most common failover solution currently used by OceanBase is the use of logging.
A log is a “running book” that records operations. In order for many data changes to succeed at once, the hard drive that records the changes cannot allow multiple writes to complete at once. The method of the transaction function is to convert all the changes to be made in a transaction into a record in the log system. After all the changes are recorded, an end mark “done” is recorded. Whether the whole transaction is completed atomically depends on whether the “done” mark is finally completed. Once the final marking is complete, it means that all changes in the transaction have been recorded. Even if the actual data change was not completed due to a failure, when the system is restored, the log records all the things that the transaction needs to do and then redo.
The secret to ensuring that just-completed transactions are not lost lies in this log. The database records this log on more than one hard drive, ensuring that if one of the hard drives fails unexpectedly, there are logs stored on other hard drives that can be used to recover changes to the transaction. To improve the security capability, for example, the logs are not lost when the machine is damaged, the logs are not lost when the equipment room is faulty, and the logs are not lost when multiple equipment rooms in the city are faulty. Therefore, the logs need to be stored in more places. In OceanBase’s three-place and five-center architecture, logs are stored in five data centers in three cities. Thanks to the support of Paxos protocol, transactions can be successfully submitted if any two cities are successfully stored. In addition, transaction data will not be lost if all the computer rooms in any city fail.
As an inside tip, there is currently no database for disaster recovery of earth failure. If the earth is lost, the data will be lost. It seems OceanBase has some work to do.
The high availability of databases is also closely related to the multiple copy processing mechanism described earlier. Produced by the latest log by synchronous Paxos agreement to other copies, according to the deployment of different specifications, may be the same city in the other room, also may be another city in the room, then log and the corresponding data in multiple rooms or more cities have backup, when there is a machine or computer failure, according to the Paxos agreement, Another machine room with a good copy will be activated to continue providing database services. This is also the cornerstone of the high availability of OceanBase databases.
Again, let’s talk about concurrency control. If the database system only handles one transaction for one user at a time, there is no need for “concurrency control.” It is safest to do one transaction after another. But this is too inefficient, put on THE CPU, hard disk, network these resources can not be used. As a kind of system software, database management system’s internal mission is to bring the hardware capability of computer system into full play as much as possible, so that the same hardware can better meet the needs of actual business. “Efficiency” is the inherent mission of system software, and “efficiency” is also the origin of complexity.
The “concurrency control” of transactions fundamentally addresses the concurrency problem of data reads and writes.
First, the database system decides the granularity of concurrency control. Common concurrency control granularity includes data storage pages and data behavior units. In general, the smaller the granularity, the better. The smaller the granularity, the less the interaction. OceanBase uses a concurrency control granularity of “rows”. Second, solve the concurrency problem of reading and writing. Here are some examples: First, the concurrent operation between read and read; Second, concurrent operations between writes; Third, concurrent operations between read and write.
Between reading and reading, no special processing is required, no matter how many times a piece of data is read, the data is still there, no increase or decrease.
Do not write and write at the same time, otherwise the data will be corrupted. OceanBase uses a row-level mutex mechanism, meaning that rows modified by one transaction are locked and cannot be modified by other transactions until the end of the transaction.
Multi-version concurrency control between read and write is the general trend at present, and OceanBase also adopts multi-version concurrency control. Data changed during the execution of a transaction is stored in the system in the form of the new version to ensure that the data before the update is still in the system. Before the transaction is advanced, the data changed by the transaction is not effective. In this case, when a read operation reads these rows, the previous data can be directly read. This is the biggest beauty of multi-version concurrency control, where a row of data can be modified without affecting the data read at the same time. By keeping reads and writes separate, the system can squeeze as much power out of the hardware as possible.
Distributed transaction
Distributed environments bring new challenges to transaction functionality, all due to “latency” in distributed systems. What latency issues in distributed systems pose such a challenge?
The thing about latency is that every time a computer system communicates, communication is the basis of all computer operations.
CPU access to memory takes 50 nanoseconds, while the network latency in the same machine room is 100 microseconds. In other words, an operation of the CPU to fetch data from native memory takes 50 nanoseconds from the time the instruction is initiated to the time the data is fetched. If the data to be retrieved is on another machine, even in the same machine room, it takes 100 microseconds, a difference of 2,000 times.
This delay difference brings two effects. One is to consider the delay of communication and design special algorithms. Second, when a machine is faulty, other machines do not know about it immediately. Therefore, disaster recovery (Dr) needs to consider this new fault scenario.
What are the new challenges in a distributed environment, considering the “failover” and “concurrency control” that transaction functionality addresses?
Start with concurrency control. The multi-version concurrency control mechanism requires a way to express the global snapshot Version. There are usually two ways: one is called Read View and the other is called Read Version. The difference between the two is whether the version number of the transaction is determined at commit time. If a transaction is not versioned at commit time, the snapshot version needs to contain a list of all active transactions, which is a difficult architecture to scale. OceanBase uses the latter, that is, a snapshot version is a timestamp, hence the name ReadVersion. This pattern is much more scalable, but still requires a global single point to provide the timestamp.
In OceanBase, the global timestamp service is called THE Global Timestamp Service (GTS), and although this timestamp service can handle millions of requests per second, it should be able to meet the business needs of users, we still expect OceanBase’s processing power to be fully scalable. In the OceanBase TPC-C test, 1.5 billion transactions were executed per minute, and the timestamp service was not a problem at all.
Compare with CockRoachDB, another distributed database in the industry. CRDB uses the same snapshot mode as OceanBase, which is Read Version. However, CRDB uses a hybrid logical clock HLC instead of a global timestamp service.
In fact, the advantage of HLC is that it is a logical clock in nature, and the communication-based way ensures that the clock values between causal events are in strict order. This mode does not need a single point to generate time stamps. However, HLC has a huge defect, in the consistency of data can not achieve the same global ordered transaction processing as traditional database, according to the description of CRDB system, they can only guarantee the linear consistency of a single line.
So what about linear consistency for a single row? The database only guarantees that if two transactions change the same row, the sequence is clear. Conversely, if two transactions modify different rows, the database does not guarantee the order in which their data is modified. Do we go around or not? It is easy to understand the problems encountered in this scenario: a common scenario in a trading business, such as a user placing an order to buy an item, has one table to store the status of whether the order has been paid for and another table to store the status of whether the order has been shipped. The payment system processes the user’s payment action and changes the order status to paid, and then notifies the shipping system. After the shipping system confirms that the order has been paid, it initiates the shipment action and changes the order status to shipped. Paid and shipped are revised data successively, but in two different tables. If there is another query for these two rows of the order, it is possible to see the shipped status and the unpaid status in the CRDB. That is, seeing a later change, but not seeing the previous change, can put a lot of burden on the database consumer, which is why OceanBase did not choose this approach.
Now look at “Failover.” In a distributed environment, updates to transactions occur on multiple machines, and the transaction logs used to perform “failover” are recorded on different machines. In this case, the logs are recorded in multiple places, and a new process is required to ensure that all logging operations are recorded. This new process is called the Two-phase Commit Protocol. The new process of two-phase commit is to allow each machine to log its own operations, but not to mark final success until the coordinator has confirmed that all operations were logged successfully, and then to record the transaction commit success, which guarantees the atomic success of logging.
This is the flow of the classic two-phase commit protocol, and OceanBase’s extreme optimization allows the two-phase commit protocol to be completed in a much shorter time. OceanBase strategy is not dependent on coordinator for the final record of successful transaction log, but depend on each participant operation log records, in the execution of a two stage if the system failure, then by the participants check if the transaction log all each other, finally still can confirm all of the participants to remember all the respective operating logs, To confirm the status of the transaction’s final commit. If everyone does not remember everything, the transaction is not committed and will end up in a rollback state.
OceanBase’s two-phase commit protocol reduces the delay of transaction submission from three log synchronization delays in traditional protocols to only one log synchronization delay, greatly improving the efficiency of transaction submission.
Continue to compare CRDB. A big difference from traditional database and OceanBase is that the whole transaction persistence model of CRDB is completed by KV system, and transactions no longer operate independent log system, so CRDB loses the opportunity to record system changes with log system and completely depends on the operation of KV system. From the CRDB design principles, this simplifies the complexity of distributed systems, but this simplification also introduces a number of burdens. In CRDB, transactions are committed to ensure that changes within the transaction are atomic in case of failure. However, different modifications within the transaction directly operate the corresponding rows in KV. CRDB needs to ensure that all row atoms modified within the transaction take effect, which means that each row is a participant of CRDB transaction. Although transaction “fault recovery” can be completed eventually, the cost is very high.
conclusion
Today, transaction features are not only used in database management systems, but also used in other systems to describe similar requirements. For example, the latest Intel server CPUS support transaction features to access memory, which brings atomicity and concurrency control to in-memory systems. This has the advantage of making it easier for developers to use memory to express complex change logic.
The kernel of the transaction is to hide the complex operation logic in the system, exposing the simple interface to the user, so that the user is more convenient. Keeping the complexity in the system and leaving the simplicity to the users is the biggest value of the database system, and it is also the principle that OceanBase adheres to in the development of various functions.
Q&A
Eggs link
According to the user’s concern about the question, Teacher Yan Ran made the answer, see the specific questions and answers, I hope to give you a little help.
Q. As a distributed database, how does OceanBase define whether a transaction belongs to distributed or stand-alone transaction internally? For example, in the following situations: A. A transaction involves multiple copies on the same observer node b. A transaction involves multiple copies on different observer nodes
A: OB determines whether A transaction is distributed or not by partition. OB will automatically track the changes in a transaction. When the transaction finally executes the COMMIT command, if the changes are found in one partition, the transaction atomic COMMIT will be completed through the one-phase COMMIT logic in the region. If the changes appear in multiple partitions, the transaction atomic COMMIT will be performed in two phases.
Q. How does OceanBase, as a distributed database, quickly recover services after a switch failure?
A: OB uses multiple copies for Dr. To accommodate switch faults, you only need to deploy multiple copies of OB across switches to ensure consistency through the Paxos protocol. If the switch fails, as long as most copies remain outside the failed switch, these copies will be re-selected and the database service restored.
Q, the two-stage submission optimization of OB distributed transaction is one drop disk and two RPC. Could you tell us in detail how to ensure consistency in case of failure?
A: Transaction all changes are still dependent on log for persistence, OB optimization is to let all participants logs at the same time, as long as the log record success this time, all the transaction change is complete, based on the change of these log records, you can resume within a transaction all changes, if there is no record of success, because failure has participants then identify this kind of situation, All changes within the transaction are rolled back using the rollback logic of the transaction. So, the whole thing only needs to rely on a log persistence by the participant to guarantee the previous logic. Two RPC communications are used to confirm that all participants have persisted successfully and to tell all participants whether the transaction was ultimately successful.
Q, may I ask OB multi-version control? How is the data of different versions of the same primary key organized? Is it stored in the same tree or how to achieve it?
A: There are two types of organization: in memory and on hard disk. The memory MemTable is a row structure of Btree. The multi-version in a row is a linked list structure. The hard disk SSTable stores row data and multi-version in a row consecutively.
Q, OB distributed transaction internally resembles an optimized XA scheme. Are there any other optimizations for transaction queries? For example, after prepare, the query can be read directly, but the write operation needs to be completed asynchronously. Is distributed transaction coordination at the microservice level available? External XA calls? Or some other transaction solution?
A: OB also supports joint transactions with other systems through XA protocol. In this usage mode, OB is just an RM in XA, but OB still guarantees atomicity of distributed transactions within the RM through two-phase protocol. This is a two-tier, two-phase commit in a cascade. The XA protocol itself does not solve the problem of concurrency control, only atomic commits. If the data is read after XA prepare, it means that there is no concurrency control, and some transaction data problems may occur. For example, even if one RM’s XA prepare succeeds, other RM may fail, and the transaction will be rolled back, that is, the data that was just read is rolled back again. Microservices using XA to address transaction requirements is a preferred direction, as X is the most business-friendly compared to external transaction systems.