Introduction: In this evaluation, Hologres is the largest MPP data warehouse product to have passed the large-scale performance evaluation for distributed analytical databases, part of the big data product evaluation run by the China Academy of Information and Communications Technology (CAICT). The evaluation shows that Hologres can serve as the infrastructure for data warehouses and big data platforms, meets users' needs for building large-scale data warehouses and data platforms, and can support the core business data platforms of key industries.

Author: Owen | Source: Alibaba Technology public account

From November 23 to December 3, 2021, the China Academy of Information and Communications Technology (CAICT) evaluated the big data product capabilities of the 27 products in its 13th batch of distributed analytical databases. Hologres (formerly Alibaba Cloud Interactive Analytics) passed CAICT's large-scale performance evaluation for distributed analytical databases, which covers report tasks, interactive queries, stress tests, and stability, and set a new scale record for this evaluation with 8,192 nodes.

In building the cloud-native scheduling and operation and maintenance (O&M) system for Hologres instances, the team worked with the Alibaba Cloud cloud-native team and other teams to solve the problems of ultra-large clusters. On the O&M side, the team solved instance deployment and stability assurance by building an automated, intelligent O&M system.

First, the challenges of ultra-large-scale deployment

With the development of the Internet, data volumes have grown exponentially, and a single database can no longer meet business needs. This is especially true in analytics, where a single query may need to process a large part of the data or even all of it, so the pressure created by massive data is particularly acute. At the same time, as enterprise digital transformation accelerates, data timeliness matters more and more, and how to use data to better empower the business has become the key to that transformation.

Compared with a database, a big data real-time data warehouse is often many times larger in scale: data volumes grow (to TB, PB, or even EB level), data processing is more complex, performance requirements are higher, and serving and analytical workloads must be supported at the same time.

Users of open source OLAP systems such as ClickHouse and Druid, especially those who run self-built clusters, have run into difficulties in deployment and O&M:

  • How to deliver clusters quickly and scale them elastically
  • How to define service availability metrics and an SLA system
  • With storage and compute coupled, machine model selection and capacity planning are difficult
  • Monitoring is weak, fault recovery is slow, and self-healing is missing

At the same time, as scale grows and the system must sustain high performance and high throughput, the difficulty of deploying and operating a real-time data warehouse increases exponentially, and the system faces challenges in scheduling, deployment, and O&M:

  • How to provide enough scheduling capacity to start service instances within seconds and scale them elastically in a single cluster of 10,000 machines;
  • How to handle capacity planning, stability assurance, and machine self-healing for large-scale clusters, and improve the related O&M efficiency;
  • How to make the monitoring of instances and clusters both timely and accurate, including how to discover problems and resolve them within minutes

Thanks to Alibaba Cloud's strong R&D capabilities in cloud-native infrastructure services, the Hologres real-time data warehouse addresses these challenges through its architectural design and through core capabilities such as Alibaba Cloud's intelligent big data O&M middle platform. It gives users a real-time data warehouse product with strong performance, excellent scalability, high reliability, and no O&M burden.

This article starts from the construction of the ultra-large-scale deployment and O&M system. It analyzes the challenges that an ultra-large-scale real-time data warehouse faces, and the targeted designs and solutions that let it sustain high load and high throughput with high performance while achieving production-grade high availability.

Second, cloud-native large-scale scheduling architecture design

With the rise of cloud technology, more and more systems have begun to use Kubernetes as the cluster management system for containerized applications. It provides automatic resource scheduling, container deployment, dynamic scaling, rolling upgrades, load balancing, service discovery, and other functions.

Hologres was optimized for this from the start of its architectural design. It adopts cloud-native containerized deployment and uses Kubernetes as its resource scheduling system to meet the very large node counts and scheduling demands of real-time data warehouse scenarios. The cloud-native clusters that Hologres relies on can support more than 10,000 servers, and a single Hologres instance can reach 8,192 nodes or more.

1 Kubernetes scheduling at the scale of 10,000 nodes

Kubernetes officially states a maximum cluster size of 5,000 nodes. In Alibaba Cloud scenarios, however, the cloud-native cluster must reach 10,000 nodes to meet requirements such as business scale and resource utilization. Kubernetes is a centralized service that depends heavily on etcd and kube-apiserver. These components are where the performance bottleneck lies, and breaking through the 10,000-node scale requires in-depth optimization of them. The speed of single-point failover must also be addressed to improve the availability of the cloud-native cluster.

Stress tests that simulated the load of 10,000 nodes and one million pods revealed serious response latency problems, including:

  1. etcd showed high read and write latency and even denial of service. Because of its storage space limit, it also could not hold the large number of objects that Kubernetes stores.
  2. API server query latency was very high, and concurrent query requests could cause the back-end etcd to run out of memory (OOM).
  3. The controller had high processing latency and took a long time to recover. After an abnormal restart, the service took several minutes to recover.
  4. The scheduler had high latency and low throughput. It could not keep up with day-to-day business O&M, let alone extreme scenarios such as major sales promotions.

To break through the bottleneck in K8s cluster scale, the relevant teams investigated in detail and found the following causes:

  1. One performance bottleneck is the kubelet, which reports its full state to K8s every 10 seconds as a heartbeat. Each report ranges from a few KB to more than 10 KB. Once the cluster reaches 5,000 nodes, these reports put heavy write pressure on kube-apiserver and etcd.
  2. The recommended storage capacity of etcd is only 2 GB, but the object storage needs of a 10,000-node K8s cluster far exceed this limit, and performance must not degrade.
  3. When multiple API servers are deployed for cluster high availability, load becomes unbalanced across them, which reduces overall throughput.
  4. The native scheduler has poor performance and limited capabilities, and cannot meet the needs of co-located workloads and major sales promotions.
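
To make the heartbeat write pressure in item 1 concrete, here is a back-of-the-envelope sketch. It assumes that every heartbeat results in exactly one write and uses the 10 KB upper figure quoted above; the function name and the one-write-per-heartbeat simplification are illustrative assumptions, not measurements from the cluster.

```python
def heartbeat_write_load(nodes: int, payload_kb: float, interval_s: float = 10.0):
    """Estimate heartbeat load as (writes per second, KB written per second).

    Assumption: each node sends one full-state heartbeat per interval and
    each heartbeat becomes one write to kube-apiserver and etcd.
    """
    writes_per_s = nodes / interval_s
    kb_per_s = writes_per_s * payload_kb
    return writes_per_s, kb_per_s


if __name__ == "__main__":
    for nodes in (5_000, 10_000):
        writes, kb = heartbeat_write_load(nodes, payload_kb=10.0)
        print(f"{nodes} nodes: {writes:.0f} writes/s, {kb / 1024:.1f} MB/s")
```

Under these assumptions the load grows linearly with the node count, so doubling the cluster from 5,000 to 10,000 nodes doubles the steady-state write rate before any real workload is scheduled.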

In view of this, the following optimizations were made to achieve scheduling at the scale of 10,000 nodes:

  1. A new free-page management algorithm was designed for etcd, which greatly improves etcd performance;
  2. Lightweight Kubernetes heartbeats were rolled out and load balancing across the multiple API server nodes of the HA cluster was improved, which removes the API server performance bottleneck;
  3. Hot standby greatly shortens the service interruption of the controller and scheduler during an active/standby switchover and improves the availability of the whole cluster;
  4. Equivalence class processing and a random relaxation algorithm were introduced to improve scheduler performance
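
The scheduler idea in item 4 can be sketched as follows. This is a minimal illustration under our own assumptions, not the production scheduler: "equivalence class" is read as caching the node-filtering result for pods with identical resource requests, and "random relaxation" is read as scoring only a random sample of feasible nodes instead of all of them. The names `PodSpec`, `Node`, and `Scheduler` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSpec:
    cpu: int        # requested CPU cores
    memory_gb: int  # requested memory in GB


@dataclass
class Node:
    name: str
    free_cpu: int
    free_memory_gb: int


def fits(pod: PodSpec, node: Node) -> bool:
    return node.free_cpu >= pod.cpu and node.free_memory_gb >= pod.memory_gb


class Scheduler:
    def __init__(self, nodes, sample_size=3, seed=0):
        self.nodes = {n.name: n for n in nodes}
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        self.feasible_cache = {}  # equivalence class (PodSpec) -> node names
        self.filter_runs = 0      # how many full filtering passes were needed

    def _feasible(self, pod):
        # Pods with the same spec share one filtering pass over all nodes.
        if pod not in self.feasible_cache:
            self.filter_runs += 1
            self.feasible_cache[pod] = [
                n.name for n in self.nodes.values() if fits(pod, n)
            ]
        return self.feasible_cache[pod]

    def schedule(self, pod):
        candidates = self._feasible(pod)
        # Score only a random sample, re-checking fit because the cache can be stale.
        k = min(self.sample_size, len(candidates))
        sample = [n for n in self.rng.sample(candidates, k) if fits(pod, self.nodes[n])]
        if not sample:
            # Cached result is stale or empty: recompute once over all nodes.
            self.feasible_cache.pop(pod, None)
            sample = self._feasible(pod)
            if not sample:
                return None
        best = max(sample, key=lambda name: self.nodes[name].free_cpu)
        self.nodes[best].free_cpu -= pod.cpu
        self.nodes[best].free_memory_gb -= pod.memory_gb
        return best
```

With 10 identical nodes and 5 identical pods, the full filtering pass in this sketch runs once rather than five times, which is where the saving comes from when many pods of one instance share the same spec.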

Third, building the Hologres operation and maintenance system

1 Overview of the Hologres operation and maintenance system

To address the problems and pain points of OLAP systems and the O&M challenges of ultra-large-scale deployment, and relying on Alibaba Cloud's big data O&M middle platform, we designed the Hologres O&M system. It solves automated delivery of resources and clusters, real-time observability at the cluster and instance level, and intelligent self-healing, raising the Hologres SLA to a production-grade level of availability.

2 Cluster automated delivery

Hologres is designed and implemented entirely on a cloud-native basis. It decouples compute resources from storage resources by separating storage and compute, and its compute nodes are deployed and started in a K8s cluster. With ABM, our self-developed O&M management system, we abstracted cluster delivery and separated the concepts of resource cluster and business cluster. To deliver a resource cluster, ABM connects to the underlying platform to create the cluster and manage its capacity. To deliver a business cluster, ABM provides deployment templates based on K8s concepts, so that management nodes can be started quickly on the resource cluster to complete delivery.

3 Observability system

System observability helps the business manage cluster utilization levels and troubleshoot problems, improving enterprise-grade management and control. Good observability requires not only monitoring metrics that are easy to use, but also a mature log collection system, so that O&M becomes simpler and users only need to take care of business problems. Based on Alibaba Cloud's monitoring products and Hologres's observability requirements, we designed the real-time monitoring capabilities of Hologres.

Metric monitoring system

To support detailed observation of system capabilities, performance monitoring, and fast problem location and debugging, Hologres exposes a very rich set of metrics, which places high demands on the collection, storage, and querying of the whole metric pipeline. For this pipeline, Hologres chose Alibaba's own Emon platform. In addition to supporting 100 million metric writes per second, Emon supports automatic downsampling, aggregation optimization, and other capabilities. We also send core metrics to Cloud Monitor over a real-time link on the back end, so that users can monitor and observe their instances and locate problems themselves.
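
As a generic illustration of what downsampling does to a metric series (Emon's internal implementation is not described here, so this is a sketch under our own assumptions), the function below groups raw `(timestamp, value)` points into fixed-size time buckets and keeps the average and maximum of each bucket. The function name and the choice of aggregates are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_s=60):
    """Aggregate (unix_ts, value) points into fixed time buckets.

    Returns a list of (bucket_start, average, maximum), sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its bucket.
        buckets[ts - ts % bucket_s].append(value)
    return [(start, mean(vals), max(vals)) for start, vals in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
    for row in downsample(raw):
        print(row)
```

Keeping the maximum next to the average is a common choice because short spikes, which matter for troubleshooting, would otherwise disappear from the coarser series.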

Log collection and monitoring

For log collection, Hologres uses SLS, a mature cloud product that supports centralized log search and filtering. Because Hologres produces a very large volume of logs, logs are collected by module and by category, which serves troubleshooting and auditing needs while keeping costs under control. SLS also provides keyword-based monitoring that raises alarms on critical errors so that they can be handled promptly.
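
The idea of keyword-based alerting on categorized logs can be sketched in a few lines. This does not use the SLS API; the rule names, patterns, and module names below are hypothetical and only illustrate the mechanism of matching critical keywords per module.

```python
import re
from collections import Counter

# Hypothetical alert rules: rule name -> pattern that marks a critical error.
ALERT_RULES = {
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "crash": re.compile(r"core dump|segfault", re.IGNORECASE),
}


def scan(records, rules=ALERT_RULES):
    """Count rule matches per (module, rule).

    records: iterable of (module, log_line) pairs.
    """
    hits = Counter()
    for module, line in records:
        for rule, pattern in rules.items():
            if pattern.search(line):
                hits[(module, rule)] += 1
    return hits


if __name__ == "__main__":
    sample = [
        ("query_engine", "ERROR query failed: Out of memory"),
        ("storage", "INFO compaction finished"),
        ("query_engine", "FATAL worker exited, core dump written"),
    ]
    for (module, rule), count in scan(sample).items():
        print(f"ALERT module={module} rule={rule} count={count}")
```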

Metadata warehouse-based availability monitoring

Metric and log collection and alerting focus on individual modules and cannot fully answer whether a particular instance is available. For this reason, we built a Hologres O&M metadata warehouse that judges whether an instance is working properly from multi-dimensional events and states. The metadata warehouse collects and maintains data along several dimensions: instance metadata, the availability criteria of each Hologres module, the status of each module of each instance, and an event center that covers O&M events, customer events, system events, and so on. Beyond judging instance availability, the metadata warehouse also supplies data for instance diagnosis, instance inspection, and similar tasks. Its capabilities have also been productized and released as the slow query log, which users can use for self-service problem diagnosis and tuning.

4 Intelligent O&M improves product SLA

On top of solid observability, and in order to locate problems faster and shorten instance recovery time, that is, to reduce the MTTR of Hologres, we built a complete Hologres SLA management system and a fault diagnosis and self-healing system. Both are based on the basic capabilities and intelligent O&M solutions provided by Alibaba Cloud's big data O&M middle platform.

The SLA system

Based on the data in the Hologres O&M metadata warehouse and its definition of instance availability, we established an instance-level availability management system for Hologres. Instance availability data flows into ABM's SLI database, and the SLI triggers instance availability monitoring according to the data and the configured conditions. When an alert fires, instance diagnosis is triggered and the system decides whether self-healing is possible. If the problem is a known one that can be recovered automatically, the system triggers self-healing to recover from the fault. If the problem is unknown, the system creates a manual work order, which people follow up through the work order system, and the resolution is gradually turned into a self-healing action.
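
The decision flow described above can be summarized in a short sketch. It is not ABM's API: the threshold, the `diagnose` rules, and the `KNOWN_REMEDIES` table are hypothetical placeholders that only show how an SLI breach leads either to self-healing or to a manual work order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Outcome:
    instance: str
    action: str      # "none", "self_heal", or "ticket"
    detail: str = ""


# Hypothetical table of known causes and their automatic recovery actions.
KNOWN_REMEDIES: Dict[str, Callable[[str], str]] = {
    "node_unreachable": lambda inst: f"rescheduled pods of {inst} to healthy nodes",
}


def diagnose(signals: dict) -> Optional[str]:
    """Toy diagnosis: map raw signals to a cause name, or None if unknown."""
    if signals.get("unreachable_nodes", 0) > 0:
        return "node_unreachable"
    if signals.get("disk_full"):
        return "disk_full"
    return None


def handle_sli(instance: str, availability: float, signals: dict,
               threshold: float = 0.999) -> Outcome:
    # 1. The SLI is within target: no alert fires.
    if availability >= threshold:
        return Outcome(instance, "none")
    # 2. The alert fires and triggers diagnosis.
    cause = diagnose(signals)
    remedy = KNOWN_REMEDIES.get(cause)
    # 3. Known and automatable cause: self-heal. Otherwise open a work order.
    if remedy is not None:
        return Outcome(instance, "self_heal", remedy(instance))
    return Outcome(instance, "ticket", f"manual follow-up, cause={cause or 'unknown'}")
```

In this sketch, turning a manual resolution into a self-healing action amounts to adding an entry to `KNOWN_REMEDIES` once the fix for a cause such as `disk_full` is well understood.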

Intelligent inspection

Intelligent inspection handles hidden, non-urgent problems in clusters or instances, so that small problems do not pile up until they affect online stability. In addition to clearly defined inspection items, intelligent inspection uses a clustering algorithm to analyze system metrics. This helps us find outlier nodes in a cluster and deal with them in time, which prevents faulty nodes from affecting the availability of the whole instance.
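
A minimal sketch of outlier-node detection follows. The clustering algorithm actually used is not specified here, so this example swaps in a simpler, well-known stand-in: a robust z-score based on the median and the median absolute deviation (MAD). The function name, the threshold of 3.5, and the sample data are assumptions.

```python
from statistics import median


def outlier_nodes(metric_by_node, threshold=3.5):
    """Return the names of nodes whose metric deviates strongly from the rest.

    Uses the modified z-score 0.6745 * |x - median| / MAD as a stand-in
    for the clustering-based analysis.
    """
    if not metric_by_node:
        return []
    values = list(metric_by_node.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the nodes report the same value,
        # so any node that differs from it stands out.
        return sorted(n for n, v in metric_by_node.items() if v != med)
    return sorted(
        n for n, v in metric_by_node.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


if __name__ == "__main__":
    # Hypothetical per-node query latency in milliseconds.
    latency_ms = {"n1": 20.0, "n2": 22.0, "n3": 21.0, "n4": 19.0, "n5": 95.0}
    print(outlier_nodes(latency_ms))
```

Median-based statistics are used instead of the mean and standard deviation because a single faulty node would otherwise distort the baseline it is compared against.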

Intelligent diagnosis and self-healing

Intelligent diagnosis relies not only on the data in the O&M metadata warehouse but also on diagnosis-related algorithms, including log clustering and root cause analysis, which cluster error logs and label the resulting clusters. Supported by the algorithms and engineering capabilities that ABM provides, instance diagnosis already helps the business locate problems quickly, resolve them more efficiently, and shorten instance MTTR.
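
Log clustering can be illustrated with a simple template-based approach: variable parts of each line (numbers and hex values) are masked, and lines that share the resulting template fall into one cluster. This is a simplified stand-in for the algorithms mentioned above, and the masking rules and function names are assumptions.

```python
import re
from collections import defaultdict

_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUMBER = re.compile(r"\b\d+\b")


def template(line: str) -> str:
    """Mask variable tokens so that similar log lines share one template."""
    line = _HEX.sub("<HEX>", line)
    return _NUMBER.sub("<NUM>", line)


def cluster_logs(lines):
    """Group lines by template, largest cluster first.

    Returns a list of (template, [original lines]).
    """
    groups = defaultdict(list)
    for line in lines:
        groups[template(line)].append(line)
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    errors = [
        "timeout reading shard 12 after 3000 ms",
        "timeout reading shard 7 after 3000 ms",
        "connection reset by peer 10.0.0.5",
    ]
    for tpl, members in cluster_logs(errors):
        print(len(members), tpl)
```

Sorting clusters by size puts the most frequent error pattern first, which is usually the best starting point for root cause analysis.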

Fourth, product-level operation and maintenance capabilities of Hologres

In addition to the service-side O&M and stability assurance described above, Hologres improves system stability on the product side in several ways:

1. Highly available architecture

The high-availability architecture has stably supported Alibaba Group's peak Double 11 traffic and has been proven in large-scale production:

  • The storage-compute separation architecture improves the system's scalability and flexibility
  • Replication in several forms provides read/write separation: multiple replicas improve throughput, resource groups within a single instance provide isolation, and multiple instances that share storage provide high availability
  • The scheduling system improves the failover and recovery capability of nodes

2. Diversified system observability indicators

Beyond its own architectural design, Hologres also gives users a range of observability metrics so that they can monitor cluster status in real time and review it afterwards, without complex operations, and focus only on their business:

  • Multi-dimensional monitoring metrics: CPU, memory, connection count, I/O, and other metrics can be queried in real time, with real-time alerting;
  • Slow query log: slow or failed queries can be diagnosed and analyzed by time, plan, CPU consumption, and other indicators, and then optimized, which improves users' self-service diagnosis capability;
  • Execution plan visualization: the plan analysis and execution analysis of a query are presented visually, with detailed explanations of each operator and optimization suggestions. This avoids blind tuning, lowers the barrier to performance tuning, and helps users reach their tuning goals quickly.

Fifth, summary

By analyzing and optimizing the scheduling performance bottlenecks that appear under large-scale scheduling, Hologres can deliver instances of 8,192 nodes or more and scale them out or in. In addition, the cloud-native intelligent O&M system built for Hologres solves the O&M efficiency and stability problems that large-scale clusters and instances face. After years of Double 11 production tests in Alibaba's core internal scenarios, Hologres achieves high performance and production-grade high availability under high load and high throughput, which lets it better support the business and the digital transformation of enterprises.

This article is the original content of Aliyun and shall not be reproduced without permission.