Introduction: This article analyzes the high-availability challenges facing real-time data warehouses, and the designs that address them, through the lens of Alibaba's Double 11 scenario.

Alibaba's 2021 Singles' Day has come to a successful close. It is a shopping feast for consumers and an annual exam for the technicians supporting the business behind it. In this exam, Hologres, a one-stop real-time data warehouse, reached a peak write throughput of 1.12 billion records per second and a peak query rate of 110 million queries per second (including point lookups and OLAP queries), delivering satisfying results and supporting the core application scenarios of Alibaba's Double 11 stably and efficiently.

2021 is the fifth year since Hologres, the one-stop real-time data warehouse, was created, and the fourth year it has supported the core scenarios of Alibaba's Singles' Day. This is evidence that real-time data warehouse technology has moved from behind the scenes into more online production systems, and has withstood tough tests of performance, stability, and high availability.

This article analyzes the high-availability challenges of real-time data warehouses, and the designs that address them, from the perspective of Alibaba's Double 11.

The combined challenges of availability, cost, and complexity

Traditionally, a real-time data warehouse is a non-production system, serving mainly internal customers in scenarios such as real-time dashboards and real-time reports. Users' requirements for its stability and availability are much weaker than for production systems such as orders and shopping carts: as long as it remains broadly usable, a temporary outage or significant fluctuation in the real-time warehouse has little impact.

As real-time data warehouses began to serve external users, businesses have placed higher and stricter requirements on system availability and stability. This is especially true when providing to-C (consumer-facing) services.

A few examples of Hologres being used in Alibaba's production systems:

  1. Alibaba's CCO (Chief Customer Officer) team provides query services to C-end consumers through the AliMe intelligent customer-service assistant.
  2. Alimama provides advertising audience-selection services to advertisers (B-end).
  3. DAMO Academy's autonomous delivery vehicles provide parcel-delivery services.


The biggest characteristic of these systems is that they are all production systems, and if the system is unstable or unavailable, the impact can be very large.

In an extreme scenario like Double 11, delivering high service quality and high availability under peak traffic is a great challenge for any system. Traditional distributed systems use replicas and isolation mechanisms to achieve availability and consistency, but reaching production-grade high availability involves trade-offs and challenges:

  • Scalability for peak flow
  • The ability to quickly recover from system downtime due to an accident or failure
  • Data consistency during the active/standby switchover
  • Resource isolation without sacrificing performance
  • Resource costs associated with multiple copy isolation

In this article, we introduce the high-availability architecture design and practice of Hologres, a one-stop real-time data warehouse, showing how it achieves low-cost, scalable, highly available production service capabilities.

Hologres high availability architecture design

1. Storage-compute separation improves the flexibility of system scaling

Real-time data warehouses serve both analytical and serving scenarios, and the data volumes and scenario complexity they handle are far beyond those of traditional databases. In industries such as the Internet and e-commerce in particular, traffic during promotional events is completely different from daily traffic, which severely tests a system's ability to scale resources.

In a traditional distributed system, the storage computing architecture is as follows:

  1. Shared Disk/Storage: There is a distributed storage cluster, and each compute node accesses data on the shared storage as if it were local. The storage layer of this architecture is easy to expand, but the scalability of compute nodes is limited by the distributed coordination mechanism introduced to ensure data synchronization and consistency.
  2. Shared Nothing: Each compute node mounts its own storage and processes only its own partition of the data; nodes communicate with each other, and a summary node aggregates the results. This architecture scales easily, but it has drawbacks: after a node failover, service can resume only once data loading completes; storage and compute must be expanded together, which is inflexible; and capacity expansion is followed by a long data rebalance.
  3. Storage Disaggregation: The storage layer is a distributed shared storage cluster, as in Shared Storage, while the compute layer processes data as in Shared Nothing: data is sharded, and each compute node processes only the data in its own shard.

The benefits of this storage-computing separation architecture are:

  • Consistency handling is simple: the computing layer only needs to ensure that, at any moment, only one compute node writes to a given shard.
  • Flexible scalability: Compute and storage can be scaled independently: compute can be expanded without expanding storage, and vice versa. This is very flexible in scenarios such as big promotions. There is no need for a data rebalance as in Shared Nothing, and no single-node storage capacity bottleneck either.
  • Fast recovery from compute node failures: After a compute node failover, data can be pulled asynchronously and on demand from the distributed shared storage, so failover is very fast.

Architecturally, Hologres adopts the third approach, storage-compute separation. For storage, Hologres uses Pangu, Alibaba's in-house distributed file system (similar to HDFS). Users can flexibly scale capacity up and down according to business requirements, to cope with the different traffic peaks of online systems.
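As a rough illustration of why this architecture recovers quickly, the sketch below (illustrative Python, not Hologres code; all class and method names are invented) models shards as an assignment of data to compute nodes while the bytes stay in shared storage, so a node failure is handled by reassigning shards rather than copying data:

```python
class SharedStorage:
    """Stands in for a distributed file system such as Pangu."""
    def __init__(self):
        self.files = {}  # shard_id -> data, durable and shared by all nodes

class Cluster:
    def __init__(self, storage, nodes):
        self.storage = storage
        self.nodes = set(nodes)
        self.assignment = {}  # shard_id -> compute node currently serving it

    def assign(self, shard_id, node):
        self.assignment[shard_id] = node

    def fail_node(self, node):
        # Failover: reassign the failed node's shards to survivors.
        # No data moves; the bytes are already in shared storage.
        self.nodes.discard(node)
        survivors = sorted(self.nodes)
        for i, (shard, owner) in enumerate(sorted(self.assignment.items())):
            if owner == node:
                self.assignment[shard] = survivors[i % len(survivors)]

    def read(self, shard_id):
        # The owning node reads straight from shared storage on demand.
        return self.storage.files[shard_id]
```

Scaling compute is just adding entries to `nodes` and reassigning shards; scaling storage is invisible to the compute layer.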

2. Multi-form Replication enables read/write separation

Replication is an essential technique for high availability. Different types of Replication designs quickly replicate data between nodes and clusters, achieving isolation and SLA protection.

Hologres supports both logical and physical Replication; both are described in detail below.

1) Binlog-based logical Replication

Similar to the concept of Binlog in the traditional MySQL database, Binlog in Hologres records the changes to table data in the database, such as Insert/Delete/Update operations. The main application scenarios include:

  1. Real-time data replication and synchronization. A typical case is copying a Hologres row-store table into a column-store table: row store supports point lookups and point writes, while column store supports multi-dimensional analysis; the synchronization logic is usually implemented with Flink. This was a typical Hologres usage prior to V1.1. With Hologres V1.1's support for row-column hybrid tables, both requirements can now be met by a single table.
  2. Fully event-driven data links. By consuming the Hologres Binlog with Flink, event-driven processing can be developed to complete fully real-time processing from ODS to DWD and from DWD to DWS.
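A minimal sketch of the consumption side, assuming simplified record shapes of (operation, primary key, row) rather than the actual Hologres Binlog format:

```python
def apply_binlog(replica, records):
    """Replay ordered Binlog change events onto a replica table.

    replica: dict keyed by primary key (stands in for the target table)
    records: iterable of (op, pk, row) tuples in commit order
    """
    for op, pk, row in records:
        if op == "INSERT":
            replica[pk] = row
        elif op == "UPDATE":
            replica[pk] = row        # upsert semantics: latest row wins
        elif op == "DELETE":
            replica.pop(pk, None)    # idempotent delete
    return replica
```

Because events are applied in order, the replica converges to the source table's state, which is exactly the eventual-consistency property discussed below.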

In Hologres, logical Replication relies on the Binlog: changed tables publish their change events (the publication side), and the processing logic lives on the subscription side. A user can subscribe to a table's Binlog, turn it into Records, and write them into another table, achieving logical replication. This approach naturally isolates different workloads, but it has two problems: it is an eventually consistent model, so strong consistency is hard to achieve; and it doubles resource consumption, requiring two copies of storage plus the resources of the write link.

For this reason, Hologres also implements physical Replication.

2) Physical Replication

In Hologres, physical Replication is WAL-log-based Replication. We can regard the storage engine as a state machine with the WAL as its input. When Replication is enabled on a Shard, a Follower Shard reads the latest WAL and replays it, while new WAL entries are generated on the Leader Shard; the Follower Shard subscribes to the latest WAL from the Leader Shard and replays it to keep pace with the Leader. To guarantee visibility on the Follower Shard, a read request can carry a strong-consistency option, which requires the Follower Shard to first close its WAL replay gap with the Leader Shard and only then return query results.

During replay, the Follower Shard can either copy the data files referenced by the WAL, which is called the non-shared storage mode, or simply reference them in place, which is called the shared storage mode.
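The state-machine view above can be sketched as follows (a toy model, not Hologres internals; the LSN scheme, WAL format, and catch-up strategy are simplified assumptions). A strongly consistent read on the follower first replays up to the leader's latest WAL position:

```python
class Shard:
    """The storage engine as a state machine driven by WAL entries."""
    def __init__(self):
        self.wal = []          # list of (lsn, key, value)
        self.state = {}
        self.applied_lsn = 0

    def write(self, key, value):
        # Leader-side write: append to WAL, then apply to local state.
        lsn = len(self.wal) + 1
        self.wal.append((lsn, key, value))
        self.apply(lsn, key, value)

    def apply(self, lsn, key, value):
        self.state[key] = value
        self.applied_lsn = lsn

class FollowerShard(Shard):
    def __init__(self, leader):
        super().__init__()
        self.leader = leader

    def replay(self, up_to_lsn):
        # Replay any WAL entries not yet applied locally.
        for lsn, key, value in self.leader.wal[self.applied_lsn:up_to_lsn]:
            self.apply(lsn, key, value)

    def read(self, key, strong=False):
        if strong:
            # Strong-consistency option: close the WAL gap first.
            self.replay(len(self.leader.wal))
        return self.state.get(key)
```

A plain read may return stale (or no) data; the strong read trades latency for visibility of the latest writes.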

Based on this, Hologres implements three forms of physical Replication:

1. Single-instance multi-replica: Shard-level multi-replica within one instance, which provides cross-process high availability and read/write separation. Because replicas can be added dynamically, it also easily supports high-concurrency reads.

2. Multi-instance read/write separation: Different instances share one copy of storage, achieving high availability across machine rooms. This mode is usually used for read/write separation and supports high-concurrency read scenarios.

3. Multi-instance disaster recovery: Multiple instances do not share storage resources, achieving cross-machine-room high availability of both compute and storage. It supports read/write separation, high-concurrency reads, hot version upgrades, and storage system migration.

  • Single-instance multi-replica

The unit of data sharding in Hologres is the Shard. A Shard can have multiple replicas while keeping only one copy of the data in storage. In normal operation, query traffic can be evenly distributed among the replicas to achieve high QPS; when a replica fails over, traffic can be quickly diverted to the other replicas. A Shard failover is very lightweight: only part of the WAL needs to be replayed, with no data replication. This single-instance multi-replica mechanism makes compute easy to scale and quickly resolves the failover of a single physical machine.

Application Scenarios:

  1. High availability of queries within a single instance: When a Shard's Worker fails, the Frontend can redirect the request to a replica's Worker via a retry, so the application does not perceive the failure.
  2. Multiple Shard replicas serve the same data: Different queries are routed to different Shard replicas, balancing the load across replicas and significantly increasing QPS.
  3. Fast failover: Since Hologres V1.1, a new recovery mechanism brings Shard recovery time below one minute, further enhancing availability.
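The retry-and-balance behavior can be sketched as follows (illustrative only; in Hologres the real routing lives in the Frontend/Coordinator and is more involved):

```python
class Replica:
    """Stands in for one Shard replica hosted by a worker."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def query(self, q):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{q} served by {self.name}"

def route(replicas, q, start=0):
    """Try replicas round-robin from `start`; retry on connection failure.

    Varying `start` per query spreads load across replicas for QPS; the
    retry loop hides a single replica failure from the caller.
    """
    last_err = None
    for i in range(len(replicas)):
        r = replicas[(start + i) % len(replicas)]
        try:
            return r.query(q)
        except ConnectionError as e:
            last_err = e
    raise last_err
```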

  • Multi-instance read/write separation

Compared with multi-replica Replication within a single instance, Replication across instances additionally requires physical Replication of the metadata (Meta).

Hologres V1.1 supports multi-instance deployment over shared storage. In this scheme, the primary instance has full capabilities: data can be read and written, and permissions and system parameters can be configured. The secondary instances are read-only, and all changes go through the primary instance. The instances share one copy of the data, and data is synchronized between instances asynchronously in near real time.

Application Scenarios:

1. Read/write separation: This scheme provides complete read/write separation and guarantees the SLAs of different business scenarios. In scenarios such as high-throughput data writing, complex data processing, OLAP, ad-hoc queries, and online serving, workloads are physically isolated from each other, eliminating query jitter caused by writes.

2. Fine-grained resource allocation for multiple workload types: A primary instance can be configured with multiple read-only secondary instances, each with a different specification depending on the business, for example a 256-core instance for writing and processing, a 512-core read-only OLAP instance, a 128-core online serving instance, and a 32-core development and test instance.

  • Multi-instance cross-city disaster recovery

Replication across multiple instances with non-shared storage can be understood as a traditional disaster recovery (DR) capability. It supports disaster recovery, remote replication, read/write separation, and high-concurrency reads, and can also provide high read availability across multiple instances. In addition, it enables hot version upgrades and storage system migration.

Application Scenarios:

  1. Disaster recovery: Different storage clusters (Pangu) are deployed in different regions, and synchronized data is kept in each. When one region becomes unavailable, the remote backup instance continues to provide services.
  2. Cluster migration: Machine-room capacity is limited, and for reasons beyond users' control, instances sometimes must be migrated from one machine room, or even one region, to another. Users expect the migration to be as transparent as possible, so instance state can be migrated between clusters through the Replication mechanism.
  3. Hot upgrade: During a hot upgrade, service capability must not be interrupted, which is like changing the engine of a car on the highway and requires the system to support rolling upgrades. Using Replication, an instance can be cloned, the software upgraded on the new instance, and then the related configuration, such as network routing, switched over to the new instance, completing the hot upgrade without downtime.

3. The scheduling system improves node failover and recovery

In a distributed environment, failover is inevitable. When it occurs, failures must be detected efficiently and recovered from quickly; this is the job of scheduling.

A Hologres instance has multiple HoloWorkers. When a HoloWorker fails, crashes, or is failed over, the Hologres scheduling system quickly detects the node abnormality and reschedules the faulty node's services, such as Frontend, Coordinator, and Shard, onto another healthy HoloWorker, while the SLB diverts traffic to the new healthy Frontend.

Scheduling is divided into computing unit scheduling and traffic scheduling:

1) Scheduling of computing units

Compute-unit scheduling is divided into Pod scheduling, in-Pod sub-process scheduling, and Actor scheduling:

  • Pod scheduling uses the capabilities of K8s; the unit scheduled by K8s in Hologres is the HoloWorker.
  • Scheduling of sub-processes and Actors within a Pod is a capability provided by HoloFlow, Hologres's distributed scheduling module.

Two types of units in Hologres need scheduling: services provided as child processes, such as the Frontend, and services provided in Actor mode, such as the Shards that serve data. HoloFlow provides health detection and scheduling for both.

2) Traffic scheduling

Traffic scheduling is divided into external traffic and internal traffic scheduling.

  • External traffic: Hologres periodically monitors the health of each Frontend; once a Frontend becomes unhealthy, it is removed from the SLB.
  • Internal traffic: Hologres provides built-in health detection and service discovery mechanisms. For example, the StoreMaster provides health detection and service discovery for Shards; if a Shard is unhealthy, the Coordinator directs traffic to a healthy replica of that Shard.
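The external path can be reduced to a small sketch (assumed interfaces; the real SLB and probe protocol are of course more elaborate): a periodic health probe determines which Frontends remain in the load balancer's pool, so traffic drains away from an unhealthy Frontend automatically.

```python
def refresh_pool(frontends, probe):
    """Rebuild the load balancer's pool from a health probe.

    frontends: list of Frontend addresses
    probe: callable(addr) -> bool, True if the Frontend is healthy
    """
    return [fe for fe in frontends if probe(fe)]
```

Running this on a timer gives the "unhealthy Frontend is removed from the SLB" behavior described above.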

Through its scheduling system, Hologres achieves rapid detection of node failures, fast failover, and automatic rescheduling and recovery, meeting business stability needs and improving system availability.

4. Multi-level isolation to meet the SLAs of different services

As real-time data warehouses are used more and more widely in production systems, different businesses also have different SLA demands. During Double 11, for example, executives and operations staff have the highest requirements on queries over transaction data; logistics wants order status to refresh in real time; and developers want data to be written quickly without affecting downstream queries and analysis.

Specific to Hologres, a single instance supports different workloads, including point lookups and writes, batch import, and interactive analysis, so the SLA of each workload must be guaranteed: for example, batch import must not affect the latency of interactive analysis, and interactive analysis requests must not affect the timeliness of real-time writes. Hologres also supports multiple tenants using the system simultaneously, and different tenants must not interfere with each other.

All the scenarios above are isolation problems. Generally speaking, the higher the isolation level, the higher the cost and the lower the resource utilization; achieving low-cost, usable isolation within a single process is a technical challenge.

Hologres implements multiple levels of isolation. Replication essentially serves the same data (or a copy of it) on different machines or containers, so it is a form of physical isolation. Beyond physical isolation, Hologres also supports resource group isolation and SchedulingGroup isolation, so users can trade off cost against SLA to meet different isolation needs.

1) Physical machine and container isolation

For physical machine and container isolation, Hologres is deployed on K8s and uses K8s Node Selector/Affinity and Taints/Tolerations to isolate containers between instances. For customers with very high SLA requirements, individual machines can also be labeled so that only one instance's containers may be scheduled onto them, achieving machine-level isolation and preventing interference from other instances.

2) Resource group isolation

In Hologres, multi-tenant isolation is implemented through resource groups. Hologres's resource group isolation is essentially thread-level isolation: the Workers in an instance are divided into different resource groups by CPU, memory, and IO, and users are assigned to different resource groups, capping the resources each user may use so that users' work does not interfere.

For example, suppose resource group 1 has 50% of the resources, resource group 2 has 30%, and resource group 3 has 20%. We bind user A to resource group 1, user B to resource group 2, and users C and D to resource group 3. In this way, requests initiated by users A, B, C, and D are scheduled to their respective resource groups.

Resource group isolation partitions resources within an instance. Its advantage is that different users within one instance are isolated, so users' jobs do not affect each other. It is a soft isolation, less thorough than Replication-based physical isolation, so resource group isolation is better suited to scenarios such as isolating the OLAP queries of different users, while Replication-based physical isolation is better suited to online services.
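The 50/30/20 example above can be sketched as a simple quota computation (illustrative only; Hologres configures resource groups through instance settings, not through an API like this):

```python
def build_quotas(total_cores, shares_pct, membership):
    """Derive per-user core caps from resource-group shares.

    shares_pct: group -> percentage of the instance's cores
    membership: user -> group (users in a group share its cap)
    Returns user -> core cap.
    """
    group_cap = {g: total_cores * p // 100 for g, p in shares_pct.items()}
    return {u: group_cap[g] for u, g in membership.items()}
```

With 100 cores and the shares from the example, users A, B, C, and D are capped at 50, 30, 20, and 20 cores respectively (C and D share resource group 3's cap).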

3) SchedulingGroup isolation

In general, the thread-level isolation model in 2) has the following problems:

  • At the operating-system level, thread switching is a significant overhead. To exploit the CPU left idle while threads wait on IO, many threads are created, and much CPU is wasted on thread switches; tests found that, in severe cases, thread switching can waste up to half the CPU.
  • The right number of threads is hard to choose. Different queries, different data, and different cache hit ratios make the likelihood of IO blocking vary so much that the required number of threads varies greatly. A thread pool with a fixed number of threads is awkward here: too many threads cause redundant switching and increase overhead, while too few may leave spare CPU unused. And because thread creation and destruction are even more expensive than switching, dynamically creating threads cannot maintain the right count either.

The ideal solution is a lightweight scheduling unit that works like a thread but is far cheaper to create/destroy and schedule/switch. With such a unit:

  • We can create as many "threads" as the business logic needs for concurrent CPU use, without worrying about switching overhead or failing to use spare CPU.
  • When the business logic needs CPU, it creates N such "threads" according to the required concurrency and destroys them when done, so it can flexibly control task parallelism without being constrained by the underlying framework.

Following this design, Hologres implements a lightweight scheduling unit called EC in its self-developed scheduling system HOS.

SchedulingGroup isolation builds on HOS's EC scheduling capability: the multiple ECs executing the same Query can be grouped into one SchedulingGroup, and different SchedulingGroups divide time slices under a fair policy.

SchedulingGroup isolation ensures that when a large Query (analysis) and a small Query (serving) run in the system at the same time, the small Query is not blocked by the large one through being unable to obtain CPU. SchedulingGroup isolation is essentially coroutine-level isolation and is one of Hologres's core capabilities.
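The fairness property can be illustrated with cooperative generator "ECs" (a toy model of the HOS idea, with invented names): each task yields at its preemption points, and the scheduler hands out time slices round-robin across groups, so a short query finishes long before a big one monopolizes the CPU.

```python
def task(name, steps, log):
    """A cooperative 'EC': does one unit of work per time slice."""
    for _ in range(steps):
        log.append(name)  # one slice of work
        yield             # preemption point: give the CPU back

def fair_run(groups):
    """groups: dict group_name -> list of tasks.

    Round-robin across groups: every group gets one time slice per
    pass, regardless of how much work its tasks still have queued.
    """
    while any(groups.values()):
        for tasks in groups.values():
            if tasks:
                try:
                    next(tasks[0])
                except StopIteration:
                    tasks.pop(0)  # task finished; drop it
```

Running a 6-step "big query" group against a 2-step "small query" group interleaves their slices, so the small query completes within the first few slices instead of waiting for the big one to drain.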

Hologres high availability in Double 11 practice

Hologres's high-availability technology also steadily supported the core business scenarios of this year's Alibaba Double 11. A brief introduction follows.

1) Dual-link service + read/write instance separation (DT team practice)

The DT team runs a typical data middle platform of Alibaba Group. It is responsible for the data needs of Tmall, Taobao, Juhuasuan, and other businesses, both during promotions and day to day; through its products it analyzes Tmall/Taobao marketing activities and supports decision makers and operations staff in data analysis and decision making during big promotions.

With business growth and continuous product iteration, the data team faces increasingly complex analysis requirements. While ensuring the real-time freshness and accuracy of data, the technical team also faces higher write pressure, and stricter tests of the stability of the overall data link and the high availability of the overall product.

For high availability, DT introduced Hologres's read/write instance separation this year and combined it with a primary/backup dual data link, reducing the probability of single-point failures. While building cross-region disaster recovery for the big promotion, the team also established a "resurrection" plan for core product metrics: with second-level DR switchover, high-throughput writes and flexible queries no longer interfere with each other. Analytical query QPS increased by 80% while query jitter dropped significantly, giving the business the confidence to face uncontrollable risks that may appear at any time and providing stable support for product-wide and business decision analysis.

2) Dual-link DR + read/write separation (CCO team practice)

CCO is Alibaba Group's customer experience division, supporting scenarios including service resource scheduling and real-time data monitoring. As the organizational guarantee of the "customer first" value, it is the neural network of customer experience across the Alibaba economy and the front line for reaching consumers and merchants.

With the development of the business and the overall trend of the industry, merchants and consumers demand ever more real-time and stable services. Last year, high availability was achieved through dual-link writes and redundant storage. For this year's Double 11, CCO adopted Hologres's native read-only instances and disaster recovery high-availability scheme, removing the business-maintained dual link. This saved the manpower of developing and maintaining real-time jobs and comparing data on the standby link, reduced data inconsistency during link switchover, and cut overall development effort by 200 person-days, more than 50% lower than last year; 100+ backup-link jobs for real-time assurance were removed, and real-time computing resources were reduced by 2,000 CU.

Conclusion

In the past year, Hologres has introduced multi-replica, resource isolation, read/write separation, and other capabilities, which were well applied in this year's Alibaba core application scenarios, achieving high availability in production.

As real-time data warehouse technology is used more widely in production systems, the demand for high availability grows ever more stringent. By sharing Hologres's high-availability design principles and application practices, we hope to exchange ideas with the industry and contribute to its development.

Hologres is a one-stop real-time data warehouse developed by Alibaba Cloud. This cloud-native system unifies real-time serving and big data analysis scenarios, is fully compatible with the PostgreSQL protocol, and integrates seamlessly with the big data ecosystem. It supports real-time writes, real-time queries, and real-time/offline federated analysis over a single data architecture. It simplifies business architectures, provides real-time decision-making capability, and lets big data deliver greater business value. From its birth inside Alibaba Group to its commercialization on the cloud, Hologres has continuously strengthened its core technology as the business developed and the technology evolved. To help everyone understand Hologres better, we plan to publish a continuing series revealing Hologres's underlying technical principles, from the high-performance storage engine to the efficient query engine, from high-throughput writes to high-QPS queries, interpreting Hologres in full. Stay tuned!


This article is original content of Alibaba Cloud and may not be reproduced without permission.