With the growth of Alibaba's big data business, the number of servers keeps increasing, and the pressure on IT operations grows proportionally. Service interruptions caused by hardware and software failures are one of the major factors affecting stability. This article explains in detail how Alibaba implements hardware failure prediction, automatic server offlining, service self-healing, and self-balancing cluster rebuilding, forming an automated closed loop that handles hardware failures before they affect the business and resolves common hardware failures without manual intervention.

1. Background

1.1. Challenges

MaxCompute, the offline computing platform that carries 95% of Alibaba Group's data storage and computation, has grown to hundreds of thousands of servers as the business has expanded. The nature of offline jobs makes hardware failures hard to notice at the software level, and the group-wide unified hardware fault-reporting thresholds often miss hardware failures that do affect applications. Every missed failure poses a significant challenge to cluster stability.

In view of these challenges, we face two problems: discovering hardware failures in time, and migrating services off the failed machines. Below, we analyze these two problems and introduce DAM, our automated hardware self-healing platform, in detail. Before that, let us first look at the application management system of the Apsara operating system: Tianji.

1.2. Tianji application management

MaxCompute is built on top of Apsara, Alibaba's data center operating system, where all applications are managed by Tianji. Tianji is an automated data center management system that manages the hardware life cycle and various static resources (programs, configurations, operating system images, data, etc.) in the data center. Our hardware self-healing system works closely with Tianji's Healing mechanism to build a closed loop of hardware fault discovery and self-healing repair for complex businesses.



Through Tianji, we can issue instructions (restart, reinstall, repair) to physical machines. Tianji translates each instruction for every application running on the machine, and each application decides how to respond according to its own business characteristics and self-healing scenarios.
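To make this interaction concrete, here is a minimal Python sketch (not Tianji's actual API; all names such as MachineAction, Application, and on_action are hypothetical) of how one machine-level instruction could be fanned out to the applications on a machine and their decisions collected.

from dataclasses import dataclass
from enum import Enum

class MachineAction(Enum):
    # Hypothetical action set, mirroring the restart/reinstall/repair instructions.
    RESTART = "restart"
    REINSTALL = "reinstall"
    REPAIR = "repair"

@dataclass
class Application:
    name: str
    drains_slowly: bool   # e.g. a storage app must move data replicas away first

    def on_action(self, action: MachineAction) -> str:
        # Each application decides how to respond based on its own business logic.
        if action in (MachineAction.REINSTALL, MachineAction.REPAIR) and self.drains_slowly:
            return "migrate-data-then-approve"
        return "approve"

def fan_out(action: MachineAction, apps: list) -> dict:
    # The management layer translates one machine-level instruction into
    # per-application decisions and proceeds only once every app has answered.
    return {app.name: app.on_action(action) for app in apps}

apps = [Application("storage-app", True), Application("compute-app", False)]
print(fan_out(MachineAction.REPAIR, apps))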

2. Discovering hardware faults

2.1. How faults are discovered

We focus on the following hardware components: hard disk, memory, CPU, NIC, and power supply. The following lists the methods and main tools for discovering common hardware problems:



Hard disk faults account for more than 50% of all hardware faults. The most common one is the hard disk media fault, which usually shows up as file reads and writes that fail, hang, or become slow. However, read/write problems are not necessarily caused by media failures, so we need to describe how a media failure manifests at each level.



A. An error message similar to the following appears in /var/log/messages:



Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]
... a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507

B. Changes in tsar IO metrics refer to changes or jumps in rs/ws/await/svctm/util. Because reads and writes pause while errors are being reported, the change usually shows up in iostat first and is then collected into tsar.

● The tsar IO metrics give us a rule for judging whether a hard drive is working properly. When the rule is triggered and there is no large-scale kernel problem, the anomaly is usually caused by a hard drive failure.

C. System-level indicator changes are usually side effects of the IO changes, for example a load increase caused by processes stuck in the D (uninterruptible sleep) state.

D. Changes in SMART values refer to changes in attribute 197 (Current_Pending_Sector) and attribute 5 (Reallocated_Sector_Ct). These two values relate to read/write exceptions as follows:

● After a media read/write exception occurs, 197 (Pending) increases by 1 in SMART, indicating that there is a sector waiting to be confirmed.

● Later, when the hard drive is idle, it re-checks the sectors recorded under 197 (Pending). If a sector can be read and written, 197 (Pending) decreases by 1; if it cannot, 197 (Pending) decreases by 1 and 5 (Reallocated) increases by 1.

In conclusion, observing only one stage of the whole error-reporting chain is not enough; multiple stages must be analyzed together to prove a hardware problem. Conversely, since we can rigorously prove a media failure, we can also work backwards and quickly distinguish software problems from hardware problems when an unknown problem appears.
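As an illustration of this multi-stage reasoning, here is a minimal sketch, assuming hypothetical collector outputs, that combines the kernel-log signatures (A), the tsar IO pattern (B), and the SMART counters (D) into a single media-fault verdict. It is illustrative only, not DAM's actual implementation.

import re

MEDIA_ERROR_PATTERNS = [
    re.compile(r"Sense Key\s*:\s*Medium Error"),
    re.compile(r"Buffer I/O error on device"),
]

def kernel_log_hits(lines):
    # Stage A: look for media-error signatures in /var/log/messages lines.
    return sum(1 for line in lines if any(p.search(line) for p in MEDIA_ERROR_PATTERNS))

def io_metrics_abnormal(samples, util_th=90, iops_th=30):
    # Stage B: a sustained "busy but idle" pattern in tsar/iostat samples,
    # i.e. util stays high while rs+ws collapses for the whole window.
    return bool(samples) and all(
        s["util"] > util_th and (s["rs"] + s["ws"]) < iops_th for s in samples
    )

def smart_abnormal(smart):
    # Stage D: pending or reallocated sectors present.
    return smart.get("197_Current_Pending_Sector", 0) > 0 or \
           smart.get("5_Reallocated_Sector_Ct", 0) > 0

def is_media_fault(log_lines, io_samples, smart):
    # Require evidence from more than one stage before declaring a hardware fault.
    evidence = [kernel_log_hits(log_lines) > 0,
                io_metrics_abnormal(io_samples),
                smart_abnormal(smart)]
    return sum(evidence) >= 2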

The above tools were accumulated from operations experience and fault scenarios. We also know that a single source is not enough, so we introduce additional sources of hardware fault discovery and combine the various inspection results to reach a final hardware fault diagnosis.

2.2. How to converge

The previous section listed many tools and paths for discovering hardware faults, but not every signal we find indicates a real hardware fault. When converging on hardware problems, we follow these principles:

● Metrics should be as independent of applications and services as possible: some application metrics are closely related to hardware faults, but we only monitor them. For example, an IO util above 90% means the disk is extremely busy, but it does not mean the disk is faulty; only a disk with IO util > 90 and IOPS < 30 sustained for more than 10 minutes may have a hardware problem.

● Collect sensitively, converge cautiously: we collect every feature that might indicate a hardware fault, but in the final automated convergence analysis most of the collected items serve only as references, not as grounds for filing a repair ticket. Taking the same example, if a disk's IO util is above 90 and its IOPS is below 30, we do not automatically file a repair ticket, because a kernel problem could produce the same pattern. A disk fault is diagnosed only when a definitive fault item appears, such as a smartctl timeout or bad sectors; otherwise the disk is merely isolated and not reported for repair. A sketch of this convergence logic is shown below.
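A minimal sketch of that convergence policy, using the thresholds quoted above; the field names are hypothetical placeholders for the collected signals.

from dataclasses import dataclass

@dataclass
class DiskSignals:
    util: float                  # tsar/iostat util, sustained over the window
    iops: float                  # rs + ws
    busy_minutes: int            # how long the busy-but-idle pattern has lasted
    smartctl_timeout: bool       # definitive fault item
    pending_sectors: int         # SMART 197
    reallocated_sectors: int     # SMART 5

def converge(d: DiskSignals) -> str:
    # Definitive items are the only grounds for an automatic repair ticket.
    if d.smartctl_timeout or d.pending_sectors > 0 or d.reallocated_sectors > 0:
        return "report-for-repair"
    # Reference-only items: suspicious, but a kernel problem could look the same,
    # so the disk is isolated instead of being reported.
    if d.util > 90 and d.iops < 30 and d.busy_minutes >= 10:
        return "isolate-only"
    return "healthy"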

2.3. Coverage

Taking the IDC repair work orders of one production cluster in 20xx as an example, the statistics of hardware faults and work orders are as follows:



Excluding out-of-band failures, our hardware failure detection rate is 97.6%.

3. Repairing hardware faults

3.1. Self-healing process

For each machine's hardware problems, we issue an automatically rotated work order to follow up. There are currently two self-healing workflows: the [application-maintenance workflow] and the [non-application-maintenance workflow]. The former handles hot-pluggable hard disk failures; the latter handles all other whole-machine hardware failures.
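A rough sketch of the routing, with hypothetical names; the real workflows are driven by DAM's work-order system.

def choose_workflow(component: str, hot_pluggable: bool) -> str:
    # Hot-pluggable hard disk failures can be repaired while applications keep
    # running, so they follow the application-maintenance workflow; all other
    # whole-machine hardware faults follow the non-application workflow.
    if component == "disk" and hot_pluggable:
        return "application-maintenance"
    return "non-application-maintenance"

print(choose_workflow("disk", True))    # application-maintenance
print(choose_workflow("memory", False)) # non-application-maintenance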



In our automated process, there are a few clever designs:

A. Diskless (RamOS) diagnosis

● For a machine that has gone down, if it cannot boot into the diskless environment (RamOS), an [unexplained downtime] repair work order is opened. This greatly reduces false positives and lightens the load on the service desk staff.

● Stress testing in the diskless environment completely removes the influence of the current kernel or software version and truly determines whether the hardware has a performance problem.

B. Determining the scope of impact and escalating it when needed

● For application-maintenance repairs, we also check for processes stuck in the D state. If a process stays in the D state for more than 10 minutes, the disk fault is considered to affect the whole machine and a restart is required.

● During the restart, if the machine fails to come back up, no manual intervention is needed: the impact is escalated directly, and the [application-maintenance workflow] is upgraded to the [non-application-maintenance workflow]. A sketch of this escalation logic appears after this list.

C. Automatic fallback for unknown problems

● In practice, some machines that went down can enter the diskless environment, yet the stress test finds no hardware problem. In that case the only option is to reinstall the machine. A small fraction of these machines do reveal hardware problems during reinstallation and are then repaired.

D. Downtime analysis

● The design of the whole process lets us handle hardware failures while also gaining the ability to analyze downtime.

● However, the whole process is problem-oriented; downtime analysis is only a by-product.

● At the same time, we automatically bring in the group-wide downtime diagnosis results for analysis, achieving a 1+1>2 effect.
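Here is a minimal sketch of the escalation described in item B, using the 10-minute D-state threshold quoted above; the state names are hypothetical.

def escalate(d_state_minutes: int, reboot_attempted: bool, reboot_ok: bool) -> str:
    # A process stuck in D state for more than 10 minutes means the disk fault
    # affects the whole machine, so the machine must be rebooted.
    if d_state_minutes > 10 and not reboot_attempted:
        return "reboot"
    # If the reboot fails, escalate automatically: the application-maintenance
    # workflow is upgraded to the non-application-maintenance workflow.
    if reboot_attempted and not reboot_ok:
        return "upgrade-to-non-application-workflow"
    return "stay-in-application-workflow"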

3.2. Process statistical analysis

If the same hardware problem repeatedly triggers self-healing, it can be spotted in the work-order statistics of the process. Take the Lenovo RD640 virtual serial port problem as an example: before locating the root cause, statistics showed that machines of the same model kept going down and self-healing repeatedly, and the problem reappeared even after reinstallation. We then quarantined those machines, keeping the cluster stable and buying time for the investigation.
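A minimal sketch of this kind of aggregation, assuming a hypothetical work-order record format.

from collections import Counter

def repeat_offender_models(work_orders, threshold=3):
    # Count self-healing work orders per (model, fault type); models whose
    # machines keep coming back are candidates for quarantine and deeper
    # root-cause analysis.
    counts = Counter((wo["model"], wo["fault"]) for wo in work_orders)
    return [key for key, n in counts.items() if n >= threshold]

# Example: repeated downtime self-healing on one model surfaces here.
orders = [{"model": "RD640", "fault": "unexplained-downtime"}] * 4
print(repeat_offender_models(orders))   # [('RD640', 'unexplained-downtime')]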

3.3. The pitfall of coupling with business problems

In fact, with the complete self-healing system above, some business, kernel, or software problems that need handling can also flow into the system and end up in the unknown-problem branch. But using hardware self-healing to solve business problems is like drinking poison to quench thirst: it makes it easy to stop thinking problems through and to fall back on this bottom-layer fix for more and more of them.



At present we are gradually removing the handling of non-hardware problems and returning to hardware-oriented self-healing scenarios (software-oriented general self-healing is also carried by the system, but such scenarios are tightly coupled to the business and cannot be generalized across the group). This makes it easier to separate hardware problems from software problems and to discover unknown problems.

4. Architecture evolution

4.1. Moving to the cloud

The initial version of the self-healing architecture ran on the controller of each cluster, because that was where the operations team originally handled problems. However, as automation deepened, we found that this architecture seriously hindered data openness. We therefore rebuilt it as a centralized architecture, but a centralized architecture runs into massive data-processing problems that a handful of servers simply cannot handle.

We therefore further rebuilt the system as distributed services to support massive business scenarios, broke the architecture into modules, and introduced Alibaba Cloud Log Service (SLS), Alibaba Cloud stream computing (Blink), and Alibaba Cloud analytic database (ADS), letting cloud products share the collection and analysis tasks. Only the core hardware fault analysis and decision-making functions remain on our own servers.

Here is an architectural comparison of DAM1 and DAM3:



4.2. Digitalization

As the self-healing system has matured, each stage now produces stable data output. Higher-dimensional analysis of these data reveals more valuable and clearer information. We also reduce the dimensionality of the analysis results and give each machine a health score. Through health scores, operations engineers can quickly understand the hardware condition of a single machine, a rack, or a cluster.
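A minimal sketch of such a health score and its roll-up; the scoring weights and signal names are entirely hypothetical.

from statistics import mean

def machine_health(signals: dict) -> float:
    # Hypothetical scoring: start from 100 and subtract weighted penalties for
    # each observed hardware anomaly on the machine.
    penalties = {"bad_sector": 30, "smartctl_timeout": 40, "ecc_error": 20, "nic_flap": 10}
    score = 100 - sum(w for k, w in penalties.items() if signals.get(k))
    return max(score, 0)

def rollup(machines: dict) -> float:
    # A rack or cluster score is simply the mean of its machines' scores,
    # which lets operators spot an unhealthy rack at a glance.
    return mean(machine_health(s) for s in machines.values())

print(rollup({"m1": {"bad_sector": True}, "m2": {}}))  # e.g. 85.0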

4.3. As a service

Based on control over the whole data link, we offer the entire fault self-healing system to different product lines as a standardized service covering the whole hardware life cycle. With the decision logic sufficiently abstracted, the self-healing system exposes various perception thresholds, supports per-product-line customization, and forms a full life-cycle service that fits individual needs.
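A minimal sketch of per-product-line threshold customization; the field names and default values are hypothetical and only mirror the thresholds mentioned earlier.

from dataclasses import dataclass

@dataclass
class SelfHealingPolicy:
    io_util_threshold: float = 90.0    # tsar util level treated as "busy"
    io_iops_threshold: float = 30.0    # IOPS level treated as "idle"
    busy_window_minutes: int = 10      # how long the pattern must persist
    d_state_minutes: int = 10          # D-state duration before a reboot
    auto_reinstall: bool = True        # allow the unknown-problem fallback

# Different product lines tune the same abstracted decision points differently.
POLICIES = {
    "maxcompute": SelfHealingPolicy(),
    "latency-sensitive-line": SelfHealingPolicy(busy_window_minutes=5, auto_reinstall=False),
}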

5. Fault self-healing closed-loop system

In the AIOps closed loop of perception, decision, and execution, fault self-healing of software and hardware is the most common application scenario, and the industry generally chooses fault self-healing as the first scenario in which to land AIOps. In our view, a universal fault self-healing closed-loop system is the cornerstone of AIOps and even NoOps (unattended operations), and an intelligent self-healing closed loop is especially important for operating systems at massive scale.

5.1. Necessity

In a complex distributed system, operational conflicts inevitably arise between the various architectures, and the essence of these conflicts is information asymmetry. The asymmetry exists because every distributed software architecture is designed as its own closed loop. Today these conflicts can be smoothed over with assorted mechanisms and operations tools, but that approach amounts to patching: as the architecture keeps evolving, the patches never end and keep multiplying. It is therefore necessary to abstract this behavior into self-healing, declare it explicitly at the architecture level, let every piece of software participate in the whole self-healing process, and in this way turn the original conflicts into collaboration.

At present we focus on the biggest conflict point in the operations scene, the conflict between hardware and software, to design the architecture and the product, and we use self-healing to improve the overall robustness of the complex distributed system.

5.2. Generality

Through hardware self-healing rotations across a large number of machines, we have found that:

● The side effects of the operations tools incorporated into the self-healing system gradually diminish (because the tools are used so heavily, the operations inside them become more and more refined).

● The manual operations incorporated into the self-healing system gradually become automated.

● Every operations action carries a stable SLA commitment time, instead of being an ad-hoc script that may error out at any time.

Self-healing, then, is a closed-loop layer built on top of a complex distributed system after operations automation has been fully abstracted, so that the architectural ecosystem achieves greater coordination and unity.



This article is original content of the Yunqi Community and may not be reproduced without permission.