OB Jun: This article is part of the “OceanBase 2.0 Technical Analysis Series”. Today we’ll continue discussing the distributed architecture, focusing on the “globally consistent snapshot” feature that everyone has been asking about in 2.0. For more, follow the OceanBase official account and subscribe to this series!

Preface

First of all, some of you may look at this headline and ask:

  • What is a Global Consistency Snapshot?
  • What role does it play in the OceanBase database?
  • Why did OceanBase Database introduce this in version 2.0?

In fact, the story stems from two traditional database concepts: “Snapshot Isolation” and “Multi-Version Concurrency Control” (MVCC). The general idea of both techniques is to maintain multiple versions of data in the database (i.e., multiple snapshots). When data is modified, the different versions distinguish the content before the modification from the content after it, enabling concurrent access to multiple versions of the same data and avoiding the read-write conflicts caused by the classic “lock”-based mechanisms.

As a result, these two techniques are adopted by many database products, such as Oracle, SQL Server, MySQL, and PostgreSQL. The OceanBase database also uses them to improve execution efficiency under concurrency. However, unlike traditional databases with a single-point, shared-everything architecture, OceanBase is a native distributed database with a multi-point, shared-nothing architecture. Achieving the “snapshot isolation level” and “multi-version concurrency control” globally (across machines) poses technical challenges specific to distributed architectures (more on this later).

To address these challenges, the OceanBase database introduced the “global consistency snapshot” technology in version 2.0. This article describes the concepts and basic implementation principles behind it.

(Note: all descriptions in this article apply to implementations that use the “snapshot isolation level” and “multi-version concurrency control” techniques. The classic pattern of implementing traditional “Isolation Levels” with a “lock” mechanism is beyond the scope of this article.)

The implementation principle in traditional databases

First, let’s look at how “snapshot isolation levels” and “multi-version concurrency control” are implemented in traditional databases.

In the classic Oracle database, when a data change is committed, Oracle assigns a “System Change Number” (SCN) as its version number. The SCN is strongly correlated with the system clock and can be loosely understood as the system timestamp. Different SCNs represent the Committed Versions of data at different points in time, thereby achieving the snapshot isolation level.

Suppose a record was originally inserted with version number SCN0, and transaction T1 is now modifying the record but has not yet committed (note: T1’s commit SCN has not been generated yet; it will be assigned when T1 commits). Before changing the data, Oracle saves the SCN0 version in the “Undo Segments”. If transaction T2 wants to read the record, Oracle assigns it an SCN2 based on the current system timestamp and looks for data that meets two criteria:

1) The data must be Committed;

2) Among the Committed Versions of the data, take the one with the largest SCN that is less than or equal to SCN2.

Based on these conditions, transaction T2 obtains the SCN0 version of the data from the rollback segment, unaffected by transaction T1, which is modifying the same record. This approach not only avoids “Dirty Reads”, but also causes no lock conflicts between concurrent read and write operations, realizing multi-version concurrency control of the data. The whole process is shown in the figure below:
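The read rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Oracle’s actual implementation; the names `Version`, `Record`, and `snapshot_read` are invented for this example. Each record keeps a chain of versions; uncommitted versions have no SCN and are invisible, and a reader sees the committed version with the largest SCN not exceeding its own read SCN.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Version:
    scn: Optional[int]   # None while the writing transaction is uncommitted
    value: str

@dataclass
class Record:
    versions: list = field(default_factory=list)  # append-only version chain

    def write(self, value):
        v = Version(scn=None, value=value)  # invisible until committed
        self.versions.append(v)
        return v

    def commit(self, version, commit_scn):
        version.scn = commit_scn            # version becomes visible at this SCN

    def snapshot_read(self, read_scn):
        # Rule 1: only committed versions; Rule 2: largest SCN <= read_scn
        visible = [v for v in self.versions
                   if v.scn is not None and v.scn <= read_scn]
        return max(visible, key=lambda v: v.scn).value if visible else None

r = Record()
v0 = r.write("initial")
r.commit(v0, 100)            # record committed with SCN0 = 100
r.write("t1-change")         # T1's change, not yet committed
print(r.snapshot_read(200))  # T2 reads at SCN2 = 200 -> initial
```

T2’s read returns the old value without blocking on T1’s in-flight write, which is exactly the read/write non-interference the text describes.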

The implementation mechanisms for the “snapshot isolation level” and “multi-version concurrency control” vary from database to database, but most follow these principles:

  • Each time a data change is committed, a new version number is assigned to the data.
  • The version number must move monotonically forward.
  • The version number is taken from the current timestamp of the system clock, or from a value strongly correlated with it.
  • A query also obtains a latest version number (again, the current timestamp or a value strongly correlated with it) and reads the most recently committed data whose version is less than or equal to that number.

The description of “multi-version concurrency control” above seems perfect, but it hides an implicit precondition: the order in which version numbers change in the database must be consistent with the real-world time order of the transactions, namely:

  • A transaction that occurs earlier in the real world must acquire a smaller (or equal) version number;
  • A transaction that occurs later in the real world must acquire a larger (or equal) version number.

What happens if this consistency is not met? Take the following scenario as an example:

1) Record R1 is inserted and committed in transaction T1 first, with SCN1 = 10010;

2) Record R2 is then inserted and committed in transaction T2, with SCN2 = 10030;

3) Transaction T3 then tries to read both records; it gets SCN3 = 10020, so it can only see record R1 (SCN2 > SCN3 does not satisfy the condition). The schematic is as follows:

For the application, this is a logical error: I inserted two records into the database and committed both successfully, yet I can read only one of them. The cause is that this scenario violates the consistency described above; that is, the order of SCN (version number) changes does not match the real-world time order of the transactions.

In fact, a violation of this consistency can lead to even more extreme cases. Consider the following scenario:

1) Record R1 is inserted and committed in transaction T1 first, with SCN1 = 10030;

2) Record R2 is then inserted and committed in transaction T2, with SCN2 = 10010;

3) Transaction T3 then tries to read both records; it gets SCN3 = 10020, so it can only see record R2 (SCN1 > SCN3 does not satisfy the condition). The schematic is as follows:

For the application, this result is even harder to understand logically: the record inserted first cannot be found, while the record inserted later can, which makes no sense at all.
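The second anomaly can be reproduced with the snapshot-read rule alone. This is a toy illustration with made-up record names: the commit SCNs come from two different machines whose clocks disagree, so the SCN order contradicts the real-world insertion order.

```python
# Commit SCNs as assigned by two machines' local clocks. In real time
# R1 was inserted and committed BEFORE R2, but T1 ran on a fast clock.
committed = {"R1": 10030, "R2": 10010}

def visible_at(read_scn):
    # The standard snapshot-read rule: committed and SCN <= read_scn.
    return sorted(name for name, scn in committed.items() if scn <= read_scn)

print(visible_at(10020))  # ['R2'] -- the later insert is seen, the earlier is not
```

The rule itself is correct; the anomaly comes entirely from feeding it version numbers that do not respect real-world order.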

Some readers might say that none of this happens in practice, because the system timestamp always moves monotonically forward, so in the real world the transaction committed first must get the smaller version number. That is true for a traditional database: with a single system clock source, the timestamp (i.e., the version number) changes monotonically forward and always agrees with the real-world time order.

However, for a distributed database such as OceanBase, the shared-nothing architecture means that data is distributed across and managed by different physical machines, and the system clocks of those machines inevitably differ. If the local system timestamp is used as the version number, there is no guarantee that version numbers obtained on different machines are consistent with real-world time order. In the two scenarios above, if T1, T2, and T3 run on different physical machines, each using its local system timestamp as the version number, then clock differences between the machines make both anomalies entirely possible.

To solve these problems, two concepts were introduced in the distributed domain: “External Consistency” and “Causal Consistency”. Take the two scenarios above as an example: in the real world, the transactions occur in the order T1 -> T2 -> T3. If the SCN changes can guarantee the order SCN1 < SCN2 < SCN3, completely regardless of which physical machines the transactions occur on, then the SCN changes are considered to satisfy “external consistency”.

“Causal consistency” is a special case of “external consistency”: the transactions not only occur in sequence, but are also causally related. “External consistency” therefore covers a wider range, and “causal consistency” is just one case of it: if “external consistency” is satisfied, “causal consistency” must be satisfied as well. Since OceanBase’s implementation satisfies “external consistency”, it also satisfies “causal consistency”; the rest of this article focuses on “external consistency”.

Commonly used solutions in the industry

So how does a distributed database maintain external consistency at a global (cross-machine) scope, achieving globally consistent “snapshot isolation level” and “multi-version concurrency control”? In general, the industry has two approaches:

1) Use special hardware devices, such as GPS and atomic clocks, to keep the system clocks of multiple machines so closely synchronized that the error is too small for applications to perceive. In this case, the local system timestamp can continue to serve as the version number while still satisfying external consistency globally.

2) The version number no longer depends on each machine’s local system clock. All database transactions obtain a globally consistent version number through a centralized service, which guarantees that version numbers move monotonically forward. In this way, the logical model for obtaining version numbers in a distributed architecture is the same as in a single-point architecture, completely eliminating clock differences between machines as a factor.

The classic example of the first approach is Google’s Spanner database. It uses GPS to keep the clocks of data centers around the world synchronized, and atomic clocks to keep each local clock within a small margin of error, ensuring that the clocks of multiple data centers worldwide agree to a very high precision. This technique is known in Spanner as TrueTime. On this basis, the Spanner database can use the local system timestamp as the version number in the traditional way without breaking global external consistency.

The benefits of this approach are that the software implementation is simpler and the performance bottlenecks of a centralized service are avoided. However, it also has disadvantages. First, it significantly raises the hardware requirements of the data center. Second, the “GPS + atomic clock” method cannot guarantee that the system clocks of multiple machines are 100% identical: according to Google Spanner’s paper, the probability of clock drift is very small, but not zero. The figure below, from Google Spanner’s paper, shows the clock error range across thousands of machines:

OceanBase uses the second approach: a centralized service that provides globally unified version numbers. This choice is based on the following considerations:

  • This problem can be avoided completely by logically eliminating clock differences between machines.
  • Avoid strong dependence on special hardware. This is especially important for a general-purpose database product, and we can’t assume that all users using OceanBase databases have “GPS+ atomic clock” devices deployed in their equipment room.

Global Consistency Snapshot technology of OceanBase

As described in the preceding section, the OceanBase database uses a centralized service to provide globally consistent version numbers. Whenever a transaction modifies or queries data, it obtains a version number from this centralized service, regardless of which physical machine the request comes from. OceanBase guarantees that all version numbers move monotonically forward and agree with the real-world time order.

With such globally consistent version numbers, OceanBase can take consistent snapshots of data globally (across machines) based on the version number, hence the name “global consistency snapshot”. With this technology, the globally consistent “snapshot isolation level” and “multi-version concurrency control” can be achieved without fear of breaking external consistency.
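The core property of such a centralized service, that version numbers come from one clock and never move backwards, can be sketched as follows. This is a minimal illustration under stated assumptions, not OceanBase’s actual GTS code; the class name `TimestampOracle` and method `next_ts` are invented for this example.

```python
import threading
import time

class TimestampOracle:
    """Minimal sketch of a centralized timestamp service: a single clock
    source behind a lock, with monotonicity enforced even if the system
    clock ever steps backwards."""
    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0

    def next_ts(self):
        with self._lock:
            now = time.time_ns() // 1000           # microseconds, one clock
            self._last = max(self._last + 1, now)  # never moves backwards
            return self._last

gts = TimestampOracle()
a, b = gts.next_ts(), gts.next_ts()
assert a < b  # every caller, wherever it runs, sees increasing values
```

Because all machines ask the same service, the “which machine’s clock?” question disappears: order among handed-out timestamps is decided at a single point.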

However, I believe some readers will have questions at this point, such as:

  • How is the unified version number generated in this centralized service? How can you keep moving forward monotonously?
  • What is the scope of this centralized service? Do transactions across the OceanBase cluster use the same service?
  • What is the performance of this centralized service? Is this a performance bottleneck, especially with high concurrent access?
  • What if this centralized service is interrupted?
  • If a network exception (such as transient network jitter) occurs in the process of obtaining the global version number of the transaction, will the “external consistency” be broken?

Here are the answers to these questions one by one.

First, the version number generated by this centralized service is a local system timestamp, but it serves not only local transactions but all transactions in the global scope. Therefore, in OceanBase this service is called the “Global Timestamp Service”, or GTS for short. Because the GTS service is centralized and takes timestamps from only one system clock, the timestamps (i.e., version numbers) it hands out are guaranteed to move monotonically forward and to agree with real-world time order.

Next, is there only one GTS service in an OceanBase database cluster, from which all transactions in the cluster get their timestamps? Those familiar with the OceanBase database know that the “tenant” is the basic unit of resource isolation in OceanBase, similar to the concept of an instance in a traditional database. Data of different tenants is completely isolated, with no consistency requirements between them. Therefore, each tenant in an OceanBase cluster has its own GTS service. This not only allows GTS services to be managed more flexibly (on a per-tenant basis), but also spreads version number requests within the cluster across multiple GTS services, greatly reducing the chance of a performance bottleneck at a single point of service. The following is a simple schematic diagram of the GTS services in an OceanBase database cluster:

When it comes to performance, a single GTS service has been measured to respond to 2 million timestamp (i.e., version number) requests per second, so as long as the QPS within a tenant does not exceed 2 million, GTS will not become a bottleneck, and actual business workloads rarely reach that level. Although GTS performance is not a problem, fetching a timestamp from GTS does cost more than reading a local timestamp; at a minimum, network latency is unavoidable. In our measurements under full-load stress testing with a normally functioning network, the performance impact of using GTS, compared with using local timestamps as version numbers, is no more than 5%, which most applications will not notice.

Besides ensuring normal GTS processing performance, the OceanBase database also optimizes the process of obtaining GTS timestamps without affecting external consistency. For example:

  • Convert some GTS requests into local requests and obtain the version number from the local control file, avoiding the overhead of network transmission;
  • Combine multiple GTS requests into one batch request to improve the overall throughput of GTS;
  • Use “local GTS caching” to reduce the number of GTS requests.
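The batching idea can be sketched as follows. This is a hypothetical single-threaded simulation, not OceanBase’s actual GTS protocol; the names `RemoteGts` and `BatchingGtsClient` are invented. Requests that pile up while a round trip would be in flight are answered together: one RPC fetches a contiguous range of timestamps, and each queued caller receives a distinct, monotonically increasing value.

```python
class RemoteGts:
    """Stand-in for the GTS leader: hands out contiguous ranges of
    monotonically increasing timestamps and counts network round trips."""
    def __init__(self):
        self.next_ts = 1
        self.rpc_count = 0

    def fetch_range(self, n):
        self.rpc_count += 1          # one simulated RPC, whatever the batch size
        start = self.next_ts
        self.next_ts += n
        return start

class BatchingGtsClient:
    """Queues timestamp requests instead of sending each immediately;
    flush() satisfies the whole queue with a single round trip."""
    def __init__(self, remote):
        self.remote = remote
        self.queue = []              # callbacks waiting for a timestamp

    def request(self, callback):
        self.queue.append(callback)

    def flush(self):
        if not self.queue:
            return
        start = self.remote.fetch_range(len(self.queue))
        for i, cb in enumerate(self.queue):
            cb(start + i)            # distinct timestamp per caller
        self.queue.clear()

remote = RemoteGts()
client = BatchingGtsClient(remote)
results = []
for _ in range(5):
    client.request(results.append)  # five callers queue up...
client.flush()                      # ...but only one RPC is issued
print(remote.rpc_count, results)    # 1 [1, 2, 3, 4, 5]
```

The design choice is the usual throughput/latency trade: each caller may wait slightly longer, but the number of round trips to the centralized service drops by the batch size.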

These optimizations further improve the efficiency of GTS processing, so users need not worry about GTS performance at all.

Everything above describes normal operation, but a centralized service must also account for exceptions. The first issue is high availability. Like the basic data service in OceanBase, the GTS service achieves high availability through the Paxos protocol. If the GTS service is interrupted by an exception (such as a machine failure), OceanBase automatically elects a new service node via Paxos. The whole process is automatic and fast (1 to 15 seconds), with no human intervention required.

What if a network exception occurs? For example, if the network jitters for 10 seconds, will that affect the global consistency of version numbers, and thus external consistency? For databases, external consistency reflects the real-world order of “complete transactions”, such as T1 -> T2 -> T3, i.e., T1’s Begin -> T1’s End -> T2’s Begin -> T2’s End -> T3’s Begin -> T3’s End. If the transactions overlap in time, there is no true “order” between them, and external consistency does not apply.
Therefore, no matter how abnormal the network is, as long as the “complete transactions” in the real world occur in this order, the global version numbers are guaranteed to satisfy external consistency.
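The “complete transaction” definition above can be stated as a small check. This is an illustrative sketch with an invented function name, `externally_consistent`: it only constrains pairs of transactions that do not overlap in real time, which is exactly the point of the definition.

```python
def externally_consistent(txns):
    """txns: list of (begin, end, commit_version) tuples in real-world time.
    External consistency constrains only non-overlapping transactions:
    if T1 ends before T2 begins, T1's version must be smaller."""
    for b1, e1, v1 in txns:
        for b2, e2, v2 in txns:
            if e1 < b2 and not (v1 < v2):
                return False
    return True

# T1 ends before T2 begins, yet T2 got a smaller version (clock skew): violated.
print(externally_consistent([(0, 1, 10030), (2, 3, 10010)]))  # False
# The same versions, but the transactions overlap: no ordering is required.
print(externally_consistent([(0, 5, 10030), (2, 3, 10010)]))  # True
```

Note that the overlapping case passes regardless of version order; external consistency simply has nothing to say about transactions that are concurrent in the real world.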

Finally, if a transaction finds that GTS responds too slowly, it resends the GTS request, so that a GTS request never gets stuck due to special circumstances (such as network packet loss). In short, GTS was designed and developed with many exception cases in mind, ensuring stable and reliable service.

To sum up, with the “global consistency snapshot” technology, the OceanBase database has the “snapshot isolation level” and “multi-version concurrency control” capabilities at a global (cross-machine) scope and can guarantee “external consistency” globally. On this basis, many features involving global data consistency can be implemented, such as globally consistent reads and global indexes.

In this way, compared with traditional single-point databases, OceanBase retains the advantages of the distributed architecture without any compromise in global data consistency. Application developers can use OceanBase just as they would a single-point database, without worrying about data consistency between the underlying machines. With the help of the “global consistency snapshot” technology, the OceanBase database achieves global data consistency under a distributed architecture!

Reference

1. Snapshot isolation
2. Multiversion concurrency control
3. Isolation (database systems)
4. Spanner (database)
5. Cloud Spanner: TrueTime and External Consistency
6. Causal consistency

  • Next Generation OceanBase Cloud Platform (1)
  • As smooth as silk! Front-line Operation and maintenance Personnel talk about how to achieve smooth online upgrade of database (2)
  • Key Infrastructure for OceanBase — DBReplay (3)
  • The Allure of OceanBase Load Balancing (4)

OceanBase TechTalk · Beijing Station

The second OceanBase TechTalk offline technology meetup will be held in Beijing on October 27 (Saturday). Three OceanBase guests, Chen Mengmeng (wine man), Han Fusheng (Yan Ran), and Qiao Guozhi (kenteng), will talk about the new features and major technical innovations of OceanBase 2.0, the version supporting this year’s Tmall Double 11. See you in Zhongguancun, Beijing!

Registration link: 2018 second OceanBase TechTalk Technology Salon · Beijing Station