Speaker profile: Mei Qing (Name: Qing Tao)

As a technical expert on the OceanBase team at Ant Financial, he has supported Alibaba Cloud database services and Tmall Double 11 promotion businesses, and has rich experience in distributed database development and architecture. He is currently mainly engaged in solution design and technology promotion for OceanBase.

The following content is compiled from the speaker's live video sharing and PPT. The sharing focuses on the following three aspects:

  1. OceanBase basic concepts
  2. OceanBase distributed design
  3. OceanBase performance tuning

1. OceanBase basic concepts

The OceanBase cluster. OceanBase is a general-purpose distributed relational database developed independently at Ant. It is deployed as a cluster. Taking a three-replica deployment as an example, the cluster is divided into three zones, and placing the three zones in three machine rooms is the best choice. All servers in an OceanBase cluster are ordinary commodity servers running RedHat or CentOS Linux, and ordinary SSDs are used in the production environment. In addition, the OceanBase cluster does not rely on shared storage or fibre-channel devices.

The following figure details the internal architecture of the cluster. The roles of the nine machines are basically the same, which makes an OceanBase cluster very easy to set up: you only need to install the OBServer software on each machine. OBServer is a single-process program divided into several modules: the SQL engine, the storage engine, and the master control service (optional). The master control service is responsible for metadata management and scheduling for the entire cluster. It only needs to run on one machine in each zone, three in total: one of them actively provides the service, and the other two are its replicas, so that if the active master control service becomes unavailable, a new one can quickly be elected on one of the other two machines to continue providing the service. The figure also shows that data is stored in partitions, which are accessed through the storage engine. Data is kept in three replicas, so each partition has three copies, and the three copies must be in three different zones; they can never be in the same zone or on the same OBServer. The OBServer has another characteristic: when the OBServer process runs on a machine, it takes over most of the machine's resources, such as CPU, memory, and disk space. In this way, cluster resources are aggregated into one large resource pool, from which resources are then allocated and multi-tenant management is implemented.

Multi-tenant resource management: resource specifications (resource units). A tenant is an abstraction of resources. The entire cluster may have hundreds of CPUs, several terabytes of memory, and dozens of terabytes of disk space. An application does not need that much when it first goes online, so a portion of the resources can be allocated to it as a tenant. Before allocating resources, you define resource specifications, which describe the number of CPUs, the amount of memory, and so on. After defining the specifications, you create resource pools by specifying which specification to use and how many units to create. Once the command is executed, the corresponding resources in the cluster are actually set aside.
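As a rough sketch (the exact parameter names and size units vary across OceanBase versions, and all object names below are illustrative), resource specifications and resource pools are created with DDL along these lines:

```sql
-- Define a resource specification (unit config); the sizes here are illustrative.
CREATE RESOURCE UNIT unit_4c16g
    MAX_CPU = 4, MIN_CPU = 4,
    MAX_MEMORY = '16G', MIN_MEMORY = '16G',
    MAX_IOPS = 10000, MIN_IOPS = 10000,
    MAX_DISK_SIZE = '500G', MAX_SESSION_NUM = 10000;

-- Carve resources out of the cluster: two units of this specification in each zone
-- (this is the unit_num=2 case discussed below).
CREATE RESOURCE POOL pool_trade
    UNIT = 'unit_4c16g',
    UNIT_NUM = 2,
    ZONE_LIST = ('zone1', 'zone2', 'zone3');
```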

Tenants. A resource pool can be used only after it is associated with a tenant, so a tenant needs to be created. The concept of a tenant is the same as an instance in a traditional database: giving a tenant to a developer is like giving them a database instance. The largest box in the figure is the abstraction of the cluster's resources, and tenants are carved out of that large resource pool. Tenants vary in size depending on their resource specifications, much like hotel rooms of different sizes. Even if a tenant's specification is set small at the beginning, it is very easy to resize later. After R&D engineers get a tenant, they can create databases in it. Although an OceanBase instance looks very similar to MySQL, it is not MySQL; for example, it has an extra database named oceanbase. You can also create a number of user databases in which tables can be built. An ordinary table in OceanBase is a single partition, while a partitioned table can contain many partitions.
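A minimal sketch of creating a tenant on top of that resource pool, run from the sys tenant (the tenant name, charset, and whitelist value are illustrative):

```sql
-- Associate the resource pool with a new tenant; until this step the pool is unused.
CREATE TENANT tenant_trade
    CHARSET = 'utf8mb4',
    PRIMARY_ZONE = 'zone1',
    RESOURCE_POOL_LIST = ('pool_trade')
    SET ob_tcp_invited_nodes = '%';   -- client IP whitelist; '%' allows all, for testing only
```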

Unit balancing. The following figure shows the view the R&D engineer sees after getting a tenant; they cannot see which machine the data is on (and do not need to know). The diagram explains how resources are allocated from another perspective. After a resource pool is created, each zone is allocated two units of the same resource specification (because unit_num=2). Which machine should the green unit be placed on? In the figure below there are four machines in each zone, and OceanBase chooses one that is relatively idle. This is where load balancing comes in: when allocating resources (creating units), OceanBase tries to keep the utilization of each machine balanced. A tenant is an abstraction of resources, decoupled from the actual machines. O&M staff do not need to care which machine the data is on, only that there are free resources on the machines. In addition, because the distribution of units is decoupled from the actual machines by design, tenants have very good elastic scalability.

Partition

Table groups. After getting a tenant, R&D engineers start creating tables. The partition is a very important concept in OceanBase: it is the smallest granularity of data distribution. A partition can be an ordinary table or one partition of a partitioned table, and it is placed inside one of the tenant's units. If the tenant's resource pool has multiple units in a zone, which unit should a new partition go into? This is the second scenario of OceanBase load balancing; the default policy is to keep the resource usage of each unit as balanced as possible. A partition cannot span nodes and must live entirely inside one unit, but different partitions of a partitioned table can be in different units. This is a characteristic of OceanBase's distribution. Some businesses care about whether a query stays on one machine: if a primary table and a child table are joined across different nodes, query performance will not be the best; ideally they are on the same machine, or even in the same block of memory. For this, OceanBase provides a strategy to declare two tables as related, so that when partition locations are assigned at the lower level, the partitions of related tables are gathered into the same unit. This strategy is set through table groups (Tablegroup), similar in idea to the table family concept in the Hadoop ecosystem. A table group can be further subdivided into partition groups: if a partitioned table has two partitions, the table group can be divided into two partition groups, Partitiongroup 0 and Partitiongroup 1.

Partitiongroup. In the figure below, the orange dotted box is a table group and the black dotted box is a partition group. A partition group gathers its partitions into the same unit on the same machine, ensuring that these partitions do not cross nodes. Although a single partition never spans nodes, different partitions of a partitioned table can span machines; so when a table grows too large for one machine, we can make it a partitioned table so that it can be spread across different machines.
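A sketch of how a table group binds related tables together (table and column names are made up; the syntax follows OceanBase's MySQL mode and details may differ between versions):

```sql
CREATE TABLEGROUP tg_order;

-- Two related tables share the same partitioning and the same table group,
-- so partition i of orders and partition i of order_items land in the same Unit.
CREATE TABLE orders (
    user_id   BIGINT,
    order_id  BIGINT,
    amount    DECIMAL(16, 2),
    PRIMARY KEY (user_id, order_id)
) TABLEGROUP = tg_order
  PARTITION BY HASH(user_id) PARTITIONS 2;

CREATE TABLE order_items (
    user_id   BIGINT,
    order_id  BIGINT,
    item_id   BIGINT,
    PRIMARY KEY (user_id, order_id, item_id)
) TABLEGROUP = tg_order
  PARTITION BY HASH(user_id) PARTITIONS 2;
```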

It is worth noting that the partition is also the smallest unit of data migration. If a distributed system does not control the distribution of partitions, machine resources may end up unevenly used. OceanBase may therefore move a partition from one unit to another, or move a unit to another machine. Data migration is always done at the partition granularity; it is entirely an internal operation and has little impact on business reads and writes.

Three replicas

The following figure shows a cluster across three machine rooms, IDC1, IDC2, and IDC3. Each zone has two machines, OBServer 1-6, a layout referred to as a 2-2-2 cluster. Partitions P1, P2, … each have three copies; each copy is called a replica, and their contents are identical. Which replica should the business access? Of the three replicas, one is the Leader replica and two are Follower replicas; the Leader replica is shown in pink. By default, the business reads and writes the Leader replica. We usually do not call them master replicas, because OceanBase's Leader and Follower replicas are not exactly the same concept as traditional master/standby replicas. Every machine holds both Leaders and Followers, so it is impossible to say which machine is the master and which is the standby; all six machines serve traffic. The business does not know which machine holds the Leader replica of the data it wants to access. This is usually handled by OBProxy, a reverse proxy that is aware of partition locations: OBProxy knows where the Leader replica of the requested data is, and the business only needs to access OBProxy.

The partition is also the smallest unit of high availability. In a traditional setup, when a machine goes down, O&M staff switch the standby database to become the primary. In the OceanBase scenario there is no traditional master/standby concept: after a machine goes down, only the Leader replicas on it are affected, while the Follower replicas are unaffected (since they do not serve traffic). If a Leader replica becomes inaccessible, OceanBase quickly elects a new Leader from the other two Follower replicas to continue providing service.

The role of the three replicas

1. Strong consistency across the three replicas. The three-replica design ensures that every change made on the Leader replica is synchronized to the Follower replicas at commit time. In a traditional database, the transaction (redo) log is flushed to disk when a transaction commits. In OceanBase, the transaction log is generated first and then the data is modified; at commit time the transaction log is persisted locally as well as on the Follower replicas. In traditional strong synchronization with one primary and two standbys, a commit on the primary can return as long as one of the two standbys has received and persisted the transaction log. OceanBase chooses a different strategy: each partition has three replicas, and once more than half of the three members have successfully received and persisted the transaction log, the Leader considers the log reliable and can return directly, without waiting for every replica to confirm success. With this strategy, performance is good and the reliability of the transaction log is still guaranteed.

2. No manual handling of machine failures. If a Leader becomes unavailable, a new Leader is elected for each affected partition. The election takes about 14-20 seconds and is automatic, requiring no intervention from operations staff or external tools. The application senses the Leader switch through OBProxy, so the business does not need to change its connection string. The overall effect is that nobody has to handle the downtime of any machine in an OceanBase cluster.

SQL compatibility

Data types. OceanBase was initially compatible with MySQL's connection protocol and implements most of MySQL's syntax and data types. But compatibility with MySQL is not OceanBase's main goal; its main goal is compatibility with Oracle.

SQL layer functionality. OceanBase currently supports Oracle-style INSERT, DELETE, UPDATE, and SELECT, and has internally implemented stored procedures, window functions, and hierarchical queries. DB Link and foreign keys are still in development.

The transaction layer. OceanBase supports the Read Committed isolation level and is working on the Serializable isolation level and flashback capability. OceanBase also supports distributed transactions (the XA protocol).

Internal views. OceanBase implements most of the internal views in Oracle, including some of the ALL_/DBA_/USER_ views. For indexes, global indexes and function-based indexes are supported.

Partitioning. OceanBase supports level-1 and level-2 partitions. Since there are many combinations of partition types, support is still being completed; hash, range, and list partitions are supported today.
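For illustration, a level-1 range partition combined with a level-2 hash subpartition might look like this in MySQL-mode syntax (table and column names are made up):

```sql
-- Range partition by date at level 1, hash subpartition by user at level 2.
CREATE TABLE payments (
    pay_id    BIGINT,
    user_id   BIGINT,
    pay_date  DATE,
    PRIMARY KEY (pay_id, user_id, pay_date)
)
PARTITION BY RANGE COLUMNS (pay_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 4
(
    PARTITION p202301 VALUES LESS THAN ('2023-02-01'),
    PARTITION p202302 VALUES LESS THAN ('2023-03-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
```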

Pseudo columns. Oracle has useful pseudo columns such as ROWNUM, sequences, and virtual columns. OceanBase is developing these pseudo columns as well.

Stored procedures. This is the most important point: much traditional business logic is written in stored procedures, and if such businesses were to switch to OceanBase, they might not want to rewrite their stored procedures. Clients can connect to OceanBase through OBProxy, and Java programs can use the Java driver that is provided.

2. OceanBase distributed design

Split design. When doing a distributed design, the first questions are usually whether to split and along which dimension. There are several ways to split.

1. Vertical split. A large business is usually all in one database; a vertical split divides it by business module and puts different modules into different tenants.

2. Horizontal split. There are several ways to split horizontally.

Splitting databases and tables (sharding). The business table is split into multiple physical tables with the same structure: the business sees a single table, Order, but in the database it is stored as Order00, Order01, and so on. A middleware layer solves the problem of SQL routing: it can locate the data, provided the SQL carries the split key it was configured with in advance. If the SQL does not contain the split condition, the middleware has no way of knowing where the data lives, so it has to scan all the physical tables, and performance suffers.

Partitioned tables. As shown in the figure below, Order00 is one table as stored in the database. OceanBase divides the table into many small partitions at storage time, and the partitions are then distributed across different machines. OceanBase chose the partitioned-table approach, which has the advantage that the business can control the split strategy and decide along which dimension to split.
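As a sketch of the idea (assuming a hypothetical order table partitioned by user_id): the business chooses user_id as the split dimension, and queries that carry user_id can be pruned to a single partition.

```sql
-- The business decides the split dimension (here: user_id) and the number of partitions.
CREATE TABLE order_main (
    user_id   BIGINT,
    order_id  BIGINT,
    status    VARCHAR(16),
    PRIMARY KEY (user_id, order_id)
) PARTITION BY HASH(user_id) PARTITIONS 100;

-- Carries the split key: only one partition needs to be read.
SELECT * FROM order_main WHERE user_id = 12345;

-- Does not carry the split key: all 100 partitions have to be scanned.
SELECT * FROM order_main WHERE status = 'PAID';
```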

Splitting into fixed-size blocks at the storage layer. A large table is defined first, then at the storage layer it is cut into many small blocks of a fixed size, and the small blocks are stored on different machines. With this approach the split is completely transparent to the business, which has the advantage that the business does not need to care about how the data is split, but the disadvantage that it has no idea where its data is, which can lead to a lot of cross-machine access.

Locality settings

Because partition placement in a distributed database is otherwise random, businesses often want to control the distribution of partitions at the business level. OceanBase provides policies to control partition distribution.

Table groups (Tablegroup). The first strategy is to group tables into table groups. Add the Tablegroup attribute when creating a table, and give related tables the same table group. In this way their corresponding partitions fall into the same partition group and are bound within the same unit.

Tenant groups. A broader scope of control is the tenant group. When the business volume is large, the business is split vertically into different services allocated to different tenants, but some of those services call each other within one business flow, so related services are expected to sit in the same machine room. Tenants can be grouped so that an entire business request completes within the same machine room.

Primary Zone. Another option is to set a Primary Zone for different services, which controls which zone the Leader replicas reside in.

Detailed Locality settings. The last command in the figure below sets Locality in detail from top to bottom, which you normally do not need to do. The largest scope is the tenant, followed by the database, then the table. If the tenant is set, the database and table can be left unset and will inherit the settings from the level above. The Locality concept in OceanBase is used to design unitized access, avoiding distributed transactions and cross-region SQL requests.
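As a sketch of the tenant-level commands (run from the sys tenant; the exact attribute syntax differs across OceanBase versions, and database- and table-level settings take similar attributes in the versions that support them):

```sql
-- Put Leader replicas of this tenant in zone1.
ALTER TENANT tenant_trade PRIMARY_ZONE = 'zone1';

-- Three full-function (F) replicas, one per zone.
ALTER TENANT tenant_trade LOCALITY = 'F@zone1, F@zone2, F@zone3';
```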

Multi-site active-active

Multi-site active-active is a concept also discussed for traditional databases. Below are five forms, which get more and more difficult from top to bottom.

· The first is application active-active with database disaster recovery. Since applications are stateless, deploying them in both machine rooms is called application active-active; the database, however, can be read and written on only one side, and the other side can neither read nor write. This is called backup disaster recovery (DR).

· The second is application active-active with read-write splitting. The database is still written on one side, but a read-only replica can be opened on the other side, for example with Oracle Active Data Guard, which keeps the standby in sync and opens it read-only in the other machine room. This is called read-write splitting.

· The third is application active-active with the database active on both sides, reading and writing different tables at the same time. Both machine rooms can accept writes; at the business level it is a double write, but each side writes different tables, so the written data does not conflict.

· The fourth is application active-active with the database active on both sides, reading and writing the same table at the same time, but with different records on each side: the two sides stagger their writes.

· In the last form, both sides write the same records, and an error is reported when a conflict occurs. Although this can be handled at design time, data collisions are inevitable, and one side's write will be discarded.

OceanBase adopts the third form, staggering writes across different tables. Business data is split into different tables by database-and-table sharding, and the two sides write different tables; for example, Table 1 is written in machine room A and Table 2 in machine room B.

In the unitized mode, applications read and write data locally. The requirement is that both machine rooms write their data locally at the same time with no cross-region requests; that is, each unit is self-contained.

Unitization

The following figure shows the multi-active disaster-recovery solution with three regions and five centers. This solution keeps five replicas of the data. Applications in the five machine rooms can all write data at the same time; the stateless applications in each machine room also have local data, but the write points and links differ. In this case the application and the database must cooperate to ensure that the traffic-splitting rules at the application layer are consistent with those at the data layer. This is the key to understanding unitization at Alibaba and Ant. OceanBase can influence the data-splitting rules and can set where Leader replicas are placed. Keeping the data in sync is handled inside the database and does not require external products, so there is no need to worry about data loss, data consistency, or availability when problems occur.

3. OceanBase performance tuning

Similar to Oracle, OceanBase distinguishes soft parsing and hard parsing when executing SQL. OceanBase supports execution plan caching, soft parsing, and various join methods.

In addition, OceanBase supports very rich hints, such as hints that change the table join order, index-related hints, and hints for debugging statements. These are the hints you will need when debugging SQL performance; they also include parallelism hints, SQL rewriting hints, and so on.

Outline is a concept borrowed from Oracle that lets you change a SQL execution plan online. If a SQL statement's execution plan has a problem, you can define an outline online to change the plan; this includes pinning an execution plan during testing before going live and then carrying it into production. Another good use of outlines is SQL throttling: if one SQL statement in the business performs very badly, you can use an outline to limit its concurrency.
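A hedged sketch of both uses of an outline (the statement texts, table, and index names are made up; the MAX_CONCURRENT hint is the throttling mechanism mentioned above):

```sql
-- Bind an index hint to a statement online, changing its plan without touching the application.
CREATE OUTLINE otl_use_idx
    ON SELECT /*+ INDEX(order_main idx_user) */ * FROM order_main WHERE user_id = 1;

-- Throttle a badly performing statement: at most 10 instances of it may run concurrently.
CREATE OUTLINE otl_throttle
    ON SELECT /*+ MAX_CONCURRENT(10) */ * FROM order_main WHERE status = 'PAID';
```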

There are generally two strategies for OceanBase performance tuning. The first is tuning SQL response time, which includes optimizing access paths, sort and aggregation operations, partition pruning, query parallelism, and joins. If tuning SQL response time does not achieve the goal, the database throughput needs to be tuned, mainly by optimizing the traffic distribution of SQL and the balanced distribution of partitions.
