Introduction: Based on PolarDB storage computing separation architecture, we developed shared storage based MPP architecture with HTAP capability, supporting two sets of execution engines for a set of TP data: stand-alone execution engine for processing high concurrency OLTP; The MPP cross-machine distributed execution engine is used for complex OLAP queries, leveraging the computing power and IO throughput of multiple RO nodes in a cluster.
Author: Beixia, Ali Cloud senior technical expert, Ali Cloud PolarDB PostgreSQL cloud native database HTAP business and technical director.
On the basis of PolarDB storage computing separation architecture, we developed shared storage based MPP architecture with HTAP capability, supporting two sets of execution engines for a set of TP data:
- Stand-alone execution engines are used to handle high-concurrency OLTP
- The MPP cross-machine distributed execution engine is used for complex OLAP queries, leveraging the computing power and IO throughput of multiple RO nodes in a cluster
PolarDB for PostgreSQL: An introduction to HTAP Architecture
Storage computing separation architecture
First of all, let’s take a look at PolarDB’s architecture. From the figure above, we can see that the integration of computing and storage is on the left. The storage of traditional databases is local. On the right is the PolarDB storage computing separation architecture, with the underlying shared storage that can be attached to any number of computing nodes. The compute nodes are stateless, which makes for a good scale-up and a cost reduction, such as users scaling up to 16 nodes, but the underlying storage is still one storage (3 copies).
Distributed storage is a more mature storage solution, with high availability of storage, second backup, like Ceph, PolarStorage, are more mature storage solutions. PolarDB is a PolarDB database that runs directly on a shared storage device. The answer is no, the root reason is that there is one storage in this architecture, but there are N compute nodes, which need to be coordinated.
The architecture of storage computing separation needs to solve the problem, the first is the consistency problem, 1 storage +N computing. Second, read/write separation, low latency replication on this architecture. Third, high availability, how to do fast recovery. Fourth, the IO model has changed. The distributed file system does not implement cache, so the memory saved can be used directly by the BufferPool of the database.
HTAP Architecture – The challenge of storing computing separately for AP queries
Under this architecture, if the user needs to run some analytical queries, for example, a telecom billing system that handles user recharge and various points settlement during the day, such requests will have userIDS and can be indexed to the modified page precisely. In the evening, I will do some batch analysis, such as reconciliation, statistics of provinces and cities in different dimensions, statistics of the overall sales situation. The architecture of separation of storage and computation processes large queries by separating SQL from read and write and dynamically loading SQL to nodes with low load.
When this node processes complex SQL, PG database has the capability of stand-alone parallel processing. Although the stand-alone parallel processing of complex SQL is greatly improved compared with the serial processing of stand-alone, memory and CPU still have certain limitations under stand-alone parallel processing. In this architecture, the processing of complex SQL can only be accelerated by Scale Up. That is, if you find that SQL processing is slow, you have to increase CPU, increase memory, and find a higher configuration machine as read-only node. Moreover, a single node processing a complex SQL cannot take full advantage of the large bandwidth of the entire storage pool.
Distributed storage has multiple disks, and each disk has read and write capabilities. If compute nodes become the bottleneck, the capacity of each disk in the underlying shared storage pool cannot be utilized. Another problem, when only one node to deal with complex SQL, other nodes may be free, because often AP concurrency is very low, is likely to run a few fixed statements are just a few nodes in SQL, while the rest of the node is in the idle state, the CPU, memory and network is also have no way to use up.
HTAP Architecture – SHARED storage based MPP
PolarDB’s solution is to connect multiple read-only nodes together to implement a distributed parallel execution engine based on shared storage, allowing users to use the entire system flexibly. For example, if some nodes are used to run TP queries, the code path goes to stand-alone queries. The advantage of single-machine query is that it is faster to process point-check and write, because it does not involve distributed transactions and can be quickly processed by single-machine. When complex SQL needs to be evaluated, multiple read-only nodes can be used to execute a SINGLE SQL in parallel, known as the distributed parallel execution engine MPP scheme.
PolarDB’s MPP is fundamentally different from shard based MPPS in traditional databases such as Greenplum. For example, PolarDB can quickly increase the number of read-only nodes at a point in time when computing power is low, and the entire underlying shared storage data does not need to be redistributed. If you’ve used a traditional share nothing MPP with Greenplum, you know that scaling up or downsizing is a very big operation.
PolarDB is storage computing separation, the compute nodes are stateless, you can quickly add nodes to make computing power more powerful. In addition, TP and AP can be physically isolated to ensure that the AP and AP are not affected during TP execution.
This scheme actually has a set of data. Some traditional schemes support two sets of data. For example, the data of TP is exported to the system of another AP, and its data needs to be copied. Moreover, it is a waste of resources, such as running TP during the day and AP at night. In fact, only one of the two clusters is functioning. PolarDB is an all-in-one solution that supports two computing engines, one stand-alone engine and one distributed parallel execution engine, with one set of data on shared storage. This can be done in milliseconds with the shared storage feature and the latency between read and write nodes. Compared to the traditional TP data to AP system, data freshness can be millisecond delay.
HTAP architecture principles
How to implement a parallel database? The core idea is to introduce Shuffle operator into the plan tree, through which the underlying data distribution characteristics can be shielded. In fact, it is also the working principle of MPP.
So what will happen to shared storage based on PolarDB? Because the underlying data is A shared state, for example the plan tree actually joins B by A and connt(*) the results. If the parallel mode of Greenplum is directly implemented in PolarDB, A set of traditional MPP is implemented. The two nodes join AB at the same time. Since A and B are shared by the two nodes, all the data can be seen. You end up with twice the real number. At the same time, the amount of data processed by A and B did not decrease, which did not accelerate the whole process.
So you have to solve the problem of how to dynamically split any table. Parallelization of parallel operators is required. All Scan operators and index Scan operators in the original PG database are parallelized. Parallelization means that any table can be logically shred according to some fixed strategy. After the shard, for the whole plan number of the upper operator, is not aware of the bottom is shared storage. Similarly, Shuffle operator is used to shield data distribution features. PolarDB uses a series of PXScan parallel scan operators to shield sharing features of underlying data. This is how HTAP works architecturally.
From the database module, based on shared storage implementation MPP, what needs to be done?
- First, distributed actuators. Because we need to parallelize all the scan operators. And then you introduce networks, because you have to interact with the data, you have to Shuffle, you have to introduce schedule management.
- Second, transaction consistency. Since the PG database query was limited to the single machine, the single machine query needs to achieve transaction consistency through the snapshot of multiple versions of MVCC. But now it is to spread the SQL to different stages to execute, different nodes in the playback of master database data, is fast and slow, need to do one-time control, in order to let all nodes data can be centralized in unity.
- Third, distributed optimizer. Distributed optimizer is a secondary architecture extension based on community GPORCA. The GPORCA optimizer is modular in design. Since the underlying data is now not sharded, some rules need to be added to the optimizer to tell the optimizer that the underlying data is shared.
- Fourth, SQL is fully compatible. If you want to support a new execution mode, then in the SQL standard, all aspects have to do compatibility. For example, Left Join has different methods in standalone and distributed mode. It is problematic to put the native PG community operator directly into distribution, and some of the behavior does not conform to THE SQL standard.
HTAP – Actuator
HTAP actuator is the general MPP approach, as a whole divided into control link and data link. There are two kinds of roles, PX Coordinato and PX Worker. A PX coordinator executes part of the optimizer, generates a distributed plan number, and slices the plan out. The data may be distributed to other RO nodes in the Polar DB cluster. These RO nodes have a number of sub-plans that are collected through data links and sent to PX coordinators to return the data to customers.
HTAP – Elastic extension
What are the advantages of doing MPP based on shared storage?
First, PolarDB is more resilient than traditional SHARE noth-based MPPS. In the right part of the figure above, all the states dependent on the execution path of the whole MPP, such as the state of metadata and the state of each Worker’s running period, are stored in shared storage. Make each worker of distributed computing Stateless. Its state, on the one hand, is read from shared storage and on the other is sent from the coordinator over the network. This enables stateless and distributed execution. In PolarDB, data is stored on shared storage and the original data is stored in tables in shared storage. For example, when workers are connected to RO1 by a CERTAIN SQL, 8 workers need to be started to work, 8 workers are distributed to RO2 and RO3, and 4 workers do not know any information at the beginning of startup. RO1 uses the information related to this SQL, It is sent to 8 workers through the network, and these 8 workers can execute it. This is the idea of making a fully flexible MPP distributed engine. The Coordinator node becomes stateless. RO1 can be used as either a centralized coordination node or RO2 as a coordination node, eliminating the single point problem with traditional Greenplum architectures.
Second, the elastic expansion of computing force. In the figure above, there are four nodes whose business involves some SQL. These SQL are complex queries that can be queried in RO1 and RO2. Another business domain can split its business into two parts, part of the business can run on RO3 and RO4, which can be adjusted dynamically.
PolarDB performance
The figure above compares the distributed parallel performance of Polar DB with the standalone parallel performance. The first figure shows TPCH’s 22 SQL speedup ratios, with three SQL speedup ratios exceeding 60 times and most SQL speedups exceeding 10 times. The second test will share storage on 1TB of TPCH data, 16 compute nodes, to see how well it performs by adding cpus. In the second test chart, from 16 cores to 256 cores, the performance is basically linear, but the bottleneck is reached at 256 cores. This is because limited by the storage bandwidth, if you increase the bandwidth, the overall performance will improve. The bottom graph shows the performance of the 22 SQL columns from 16core to 256core, and shows a linear improvement from 16core to 128core.
Another group is PolarDB and Greenplum. The test environment uses the same hardware, 16 compute nodes, and 1TB TPCH. As you can see from the figure above, Greenplum has 16 cores and 16 cpus doing SQL processing. At the same degree of parallelism, PolarO performed 89% of Greenplum’s performance. Why does Polar not perform as well as Greenplum on a single core? This is because data has no data characteristics on shared storage. When Greenplum creates a table, data is hashed by default. When two tables are joined, the join Key and distribution Key are the same, and data Shuffle is not required. Polar has only one table, which has no data characteristics and is a randomly distributed data format. When any two tables join, shuffle is required. Due to network factors, Polar’s single-core performance can only reach 89% of Greenplum’s. To solve this problem, we will optimize it by way of partition table of PG.
Although the underlying data of Polar DB is shared, a partitioned table can still be hashed. You can align Polar DB’s HTAP MPB mode with Greenplum’s mode. Once this function is implemented, Polar’s single core performance is the same as Greenplum’s. In the red box of the figure, we ran four more tests. Polar DB supports elastic scaling of computing power, so data does not need to be redistributed. This is the benefit of random distribution of data. When you do a distributed execution engine, the first priority is not extreme performance, but scalability of the system, which means you can quickly add nodes to speed up the computation when you run out of computing power.
Traditional MPP databases, such as Greenplum, have fixed nodes, while Polar is stateless and can be adjusted to compute cpus at any time. In this group of tests, only one GUC parameter is needed to change Polar from 16core to 256core, and the calculation power is linear expansion.
What else can Polar DB do once it supports MPP? After a large amount of data is imported into a newly launched service, indexes need to be performed. The idea is to sort the data, organize it in memory into index pages, and then write those pages directly to disk. If Polar DB supports parallelism, the gameplay is different. As you can see from the figure above, the nodes RO1, RO2, and RO3 can scan data on shared storage in parallel and then sort data locally in parallel. After sorting, the data is passed to the RW node over the network. The RW node merges and sorts the sorted data into an index page in memory and gives it to the BtBuild process. In memory, through the index page, to update the index page between the pointing relationship, to build the index tree instruction relationship, and then began to write disk.
This scheme makes use of the computing power of multiple nodes and RO capability to accelerate in the sorting stage. At the same time, a QC node, namely the central node, is transmitted to MPP through the network. This node is then sent to the BtBuild process via shared memory. In tests using 500 gigabytes of data to build the index, the performance can be improved to about five times.
Accelerated spatiotemporal database
The spatio-temporal database is a computationally intensive, coarse filter indexed by RTree. First through RTree, then through space treads to locate an area, within this area, further precise filtering. In the process of index scan of shared storage, RTree scan, NestLoopIndex Join can only be used, because hash Join cannot be done, because the two-dimensional space of RTree cannot be completely segmented. For spatiotemporal services, NestLoopIndex Join is used to fetch data from one table and scan the RTree in another table, which cannot be done in Greenplum because its index tree is split. However, in PolarDB, the index tree of RTree is in the shared state, so no matter whether the worker is on node 1 or node 2, the index tree is complete in the shared storage concept. At this point, the two workers can directly use their appearance to make coordinated segmentation. Because it is computationally intensive, the acceleration is even better. After testing, in an 80 CPU environment, the overall improvement can reach 71 times.
This is the introduction to the HTAP architecture, and more implementation details such as optimizers, actuators, and distributed consistency will be shared in the future.
The original link
This article is the original content of Aliyun and shall not be reproduced without permission.