About the author

Ma Bo is an operations and maintenance (O&M) development engineer on the Ping An Technology database team. Ma Bo has taken part in the team's AIOps implementation projects, mainly trend prediction, anomaly detection, the automated O&M platform, and logging and alarms, and is currently working on building an intelligent database O&M system on Ping An Cloud.

 

Looking back at the history of O&M, it began with system administration, moved to basic script-based O&M, then to automated O&M, and finally to intelligent O&M. Over these years, the day-to-day work of O&M engineers has changed dramatically:

 

More than a decade ago, we knew neither where a failure would occur nor when. Only after a failure had happened could we look for the root cause and fix it, which was a very passive approach.

Later, with the large-scale introduction of scripts, our way of handling problems became more systematic and considerably faster, but it was still passive in nature. Building on that experience, many companies introduced monitoring systems and developed their own automated O&M platforms, designed to resolve problems automatically when they occur or are about to occur. This approach finally broke away from the earlier “passive operations” and can nip problems in the bud. However, with a large number of alarms and massive volumes of monitoring data, handling problems more efficiently has become the challenge we must now solve.

 

The arrival of the AI era addresses exactly these problems. Building on existing O&M data (logs, monitoring information, application information, and so on), AIOps aims to use machine learning to solve the problems that automated O&M cannot.

 

We are now actively moving database O&M from automation to intelligence. Data mining and machine learning depend on large volumes of data, and through several years of automated O&M Ping An Technology has accumulated massive multidimensional database performance data, log data, and host data.

 

Using this data and machine learning methods, we can extract the information we need in multiple application scenarios, such as time series anomaly detection, root cause analysis, alarm email convergence, and capacity prediction, so that faults are discovered, diagnosed, and resolved automatically.

 

1. Time series anomaly detection

 

Time series data is the basic data of AIOps. It is large in scale, varied in type, and subject to diverse requirements. In the automated O&M stage, we mostly used constant thresholds.

 

This approach is simple and easy to implement, but it has obvious disadvantages: it is inflexible, and it does not detect faults quickly enough to meet current alarm requirements. As shown in the figure below, a traditional threshold alarm misses two abnormal fluctuations:

 

Constant threshold method

 

Dynamic thresholds emerged in response. They are based on traditional statistics, typically comparing a value with the same period in an earlier cycle or with the immediately preceding period. These comparison methods are easy to interpret and easy to implement, but they are inflexible and strongly affected by holidays. In the figure below, September 24 was the Mid-Autumn Festival: traffic dropped significantly compared with the previous week, so the period comparison does not apply. As a result, problems are not identified in time.
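
As a minimal sketch of a period-over-period comparison, the following Python flags any point whose relative change from the value one period earlier exceeds a tolerance. The function name, the 30% tolerance, and the sample traffic values are illustrative assumptions, not the production rules. The last sample stands in for a holiday and shows how an expected drop still raises an alert.

```python
def ratio_alerts(series, period, tolerance=0.3):
    """Flag points that deviate from the value one period earlier.

    series: metric values sampled at a fixed interval.
    period: number of samples in one comparison period
            (for example 7 for daily data compared week over week).
    tolerance: allowed relative change; 0.3 means +/-30% (assumed value).
    Returns the indices whose relative change exceeds the tolerance.
    """
    alerts = []
    for i in range(period, len(series)):
        baseline = series[i - period]
        if baseline == 0:
            continue  # relative change is undefined; skipped in this sketch
        change = (series[i] - baseline) / baseline
        if abs(change) > tolerance:
            alerts.append(i)
    return alerts


# Two weeks of made-up daily traffic; the final day is a holiday.
traffic = [100, 102, 98, 101, 99, 60, 55,
           101, 100, 99, 102, 100, 61, 20]
print(ratio_alerts(traffic, period=7))  # [13]
```

The only alert is the holiday itself, which is a false alarm from an operational point of view.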

 

Many companies also use a weighted moving average to build dynamic thresholds. The assumption is that, within the same dimension, the value at a given point is related to its values over the preceding period, so the expected value is computed as a weighted average of the most recent observations.
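
A weighted moving average threshold might look like the sketch below. The weights, the 30% tolerance, and the function name are assumptions chosen for illustration; the expected value at each point is the weighted mean of the preceding window, and a point is flagged when it strays too far from that expectation.

```python
def wma_alerts(series, weights, tolerance=0.3):
    """Flag points that differ from the weighted average of the previous window.

    weights: one weight per point in the window, oldest first, so the
             last weight applies to the most recent point. They do not
             need to sum to 1 because the result is normalized.
    tolerance: allowed relative deviation from the expected value (assumed).
    """
    window = len(weights)
    total = sum(weights)
    alerts = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        expected = sum(w * x for w, x in zip(weights, recent)) / total
        if abs(series[i] - expected) > tolerance * abs(expected):
            alerts.append(i)
    return alerts


values = [10, 10, 11, 10, 10, 18, 10, 10]
print(wma_alerts(values, weights=[1, 2, 3]))  # [5]
```

Note that the spike at index 5 also inflates the expected value for the next few points, so a tighter tolerance would flag the return to normal as well. That sensitivity is one reason such thresholds need tuning.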

 

9/18-9/25 Index data graph

 

 

We are currently applying machine learning to anomaly detection on time series data. It is more accurate than the methods above, but also more costly.

 

In essence, time series anomaly detection can be treated as a binary classification problem: “normal” versus “abnormal”. By labeling historical monitoring data and combining supervised and unsupervised algorithms to build a model, we can judge whether the current time series is normal.
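
One very small way to combine the two ideas, shown purely as an illustration and not as the model described here, is to compute an unsupervised deviation score for each point and then use the labeled history to choose the score cutoff. The scoring rule (distance from the rolling median, scaled by the median absolute deviation), the F1 criterion, and the sample data are all assumptions.

```python
from statistics import median


def deviation_scores(series, window=5):
    """Unsupervised step: score each point by its distance from the
    median of the preceding window, scaled by that window's MAD."""
    scores = []
    for i in range(len(series)):
        history = series[max(0, i - window):i]
        if len(history) < 2:
            scores.append(0.0)  # not enough history to judge
            continue
        med = median(history)
        mad = median(abs(x - med) for x in history) or 1.0  # avoid division by zero
        scores.append(abs(series[i] - med) / mad)
    return scores


def best_threshold(scores, labels):
    """Supervised step: choose the cutoff with the highest F1 on labeled data."""
    best_cut, best_f1 = None, -1.0
    for cut in sorted(set(scores)):
        predicted = [s >= cut for s in scores]
        tp = sum(1 for p, l in zip(predicted, labels) if p and l)
        fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
        fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
        if tp == 0:
            continue
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best_f1:
            best_cut, best_f1 = cut, f1
    return best_cut


history = [10, 11, 10, 12, 11, 10, 30, 11, 10, 12, 11, 2, 10]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # 1 = labeled abnormal
cutoff = best_threshold(deviation_scores(history), labels)
print(cutoff)  # 9.0
```

New points would then be judged abnormal when their score reaches the learned cutoff. A production model would use richer features and a real classifier.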

 

 

2. Root cause analysis

 

In most cases, because monitoring indicators are correlated, an anomaly in one indicator is accompanied by anomalies in many related indicators. Analyzing and handling every alarming indicator at once wastes a great deal of manpower. To solve this, we need root cause analysis so that treatment can be targeted.

 

In general, root cause analysis can be performed on the data in three ways:

 

  • Correlated indicator retrieval: find the indicators that behave similarly to the abnormal indicator over a specific period of time.

  • Frequent pattern mining: across a large number of samples, find abnormal indicators that often appear together. This turns the task into a frequent pattern mining problem; implementations include association rules, Apriori, FP-Growth, and so on.

  • Decision trees: relying on their strong interpretability, classify the positive and negative samples, then read the frequently abnormal indicator sets off the paths of the resulting classification tree.
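
The first approach can be sketched with a plain Pearson correlation. The metric names and values below are made up, and ranking by the absolute value of the correlation (so that mirror-image curves also count as related) is an assumption of this sketch.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat series carries no shape information
    return cov / sqrt(var_a * var_b)


def top_related(target, candidates, n=2):
    """Return the names of the n candidates most correlated with target."""
    ranked = sorted(candidates,
                    key=lambda name: abs(pearson(target, candidates[name])),
                    reverse=True)
    return ranked[:n]


abnormal = [1, 2, 8, 9, 3]
metrics = {
    "metric_a": [10, 20, 80, 90, 30],  # same shape as the abnormal metric
    "metric_b": [5, 5, 5, 5, 5],       # flat
    "metric_c": [9, 8, 3, 1, 6],       # roughly a mirror image
    "metric_d": [3, 1, 4, 1, 5],       # unrelated
}
print(top_related(abnormal, metrics, n=2))  # ['metric_a', 'metric_c']
```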

 

For example, suppose the DB_TIME of an Oracle database is too high.

 

  • The first method finds the indicators whose curves are most similar to DB_TIME over the current period and takes the top N most similar indicators as root cause candidates.

  • The second method looks at historical data: each time DB_TIME was abnormal, the other indicators that were abnormal at the same time are collected into an item set. Association rules are then mined from these item sets to find strongly correlated combinations, and the other indicators in those combinations are treated as root causes.

  • The third method splits the historical data into positive and negative samples according to whether DB_TIME was abnormal, then trains a decision tree model to arrive at the root cause.
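
The second method can be illustrated with single-item association rules of the form "metric abnormal implies DB_TIME abnormal". This is a simplified sketch: a full Apriori or FP-Growth implementation would also consider larger item sets. Apart from DB_TIME, the metric names, the sample windows, and the support and confidence thresholds are hypothetical.

```python
from collections import Counter


def related_by_rules(windows, target, min_support=0.3, min_confidence=0.7):
    """Find metrics that are frequently abnormal together with target.

    windows: list of sets, each holding the metrics abnormal in one time window.
    Returns {metric: (support, confidence)} for the rule metric -> target.
    """
    total = len(windows)
    metric_count = Counter()  # windows in which the metric was abnormal
    joint_count = Counter()   # windows in which metric and target were both abnormal
    for abnormal in windows:
        for metric in abnormal:
            if metric == target:
                continue
            metric_count[metric] += 1
            if target in abnormal:
                joint_count[metric] += 1
    rules = {}
    for metric, joint in joint_count.items():
        support = joint / total
        confidence = joint / metric_count[metric]
        if support >= min_support and confidence >= min_confidence:
            rules[metric] = (support, confidence)
    return rules


windows = [
    {"DB_TIME", "LOGICAL_READS", "CPU_USAGE"},
    {"DB_TIME", "LOGICAL_READS"},
    {"CPU_USAGE"},
    {"DB_TIME", "LOGICAL_READS", "REDO_SIZE"},
    {"CPU_USAGE", "REDO_SIZE"},
]
print(related_by_rules(windows, "DB_TIME"))  # {'LOGICAL_READS': (0.6, 1.0)}
```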

 

Root cause analysis method 1

 

Root cause analysis method 2

 

Root cause analysis method 3

 

3. Alarm convergence

 

Once monitoring grows to a certain scale, the number of alarm emails received each day increases exponentially, especially when frequently checked monitoring items fail.

 

To deal with this, we first limited the alarm frequency so that the same alarm is sent only once within a given period.
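
A frequency limit of this kind can be sketched as follows. The ten-minute window and the alarm key format are assumptions for illustration.

```python
def suppress_repeats(alarms, window_seconds=600):
    """Keep an alarm only if the same key was not sent within the window.

    alarms: iterable of (timestamp_seconds, alarm_key), sorted by time.
    """
    last_sent = {}
    kept = []
    for ts, key in alarms:
        previous = last_sent.get(key)
        if previous is None or ts - previous >= window_seconds:
            kept.append((ts, key))
            last_sent[key] = ts
    return kept


alarms = [(0, "db01:ping"), (60, "db01:ping"),
          (120, "db02:cpu"), (700, "db01:ping")]
print(suppress_repeats(alarms))
# [(0, 'db01:ping'), (120, 'db02:cpu'), (700, 'db01:ping')]
```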

 

This reduces some alarms, but many obviously related alarms can be converged further with rules. For example, if the databases in the same cluster cannot be pinged, or alarms from all IP addresses on the same network segment suddenly surge, the alarms can be merged and sent as one.
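
As an illustration of such a rule, the sketch below groups alarms that carry the same message by network segment, using Python's standard `ipaddress` module. The /24 prefix and the sample addresses are assumptions; grouping by cluster would follow the same pattern with a cluster name as the key.

```python
from collections import defaultdict
from ipaddress import ip_network


def merge_by_segment(alarms, prefix=24):
    """Group alarms with the same message by network segment.

    alarms: list of (ip_address, message).
    Returns {(segment, message): [ip_address, ...]}.
    """
    groups = defaultdict(list)
    for ip, message in alarms:
        # strict=False lets a host address stand in for its network
        segment = ip_network(f"{ip}/{prefix}", strict=False)
        groups[(str(segment), message)].append(ip)
    return dict(groups)


alarms = [("10.0.1.5", "ping failed"),
          ("10.0.1.9", "ping failed"),
          ("10.0.2.7", "ping failed")]
for (segment, message), hosts in merge_by_segment(alarms).items():
    print(segment, message, hosts)
```

Each group can then be sent as a single email listing the affected hosts.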

 

 

 

In the AIOps era, alarm convergence and root cause analysis are often carried out together.

 

As in root cause analysis method 2, we can collect alarm item sets and extract the frequent items. If alarm A and alarm B appear together in a frequent alarm set and A occurs earlier than B, we can drop alarm B from the email and push only alarm A to the O&M engineers.

 

Different scenarios place different requirements on alarm convergence. Compared with AIOps, traditional alarm convergence is simpler and more efficient, and rule-based methods are easy to extend and to interpret. AIOps, on the other hand, can mine associations that common sense and experience would not find, and then converge alarms on that basis.
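
A simplified version of this idea is sketched below: learn pairs of alarms that repeatedly occur together in a fixed order, then drop the later alarm whenever its earlier partner is present. The alarm names, the `min_count` threshold, and the requirement that the order never be reversed in the history are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations


def learn_order_rules(episodes, min_count=2):
    """Learn (earlier, later) alarm pairs from past incidents.

    episodes: list of incidents, each a list of (timestamp, alarm_name).
    A pair qualifies if it occurred in that order at least min_count
    times and never in the reverse order.
    """
    ordered = Counter()
    for episode in episodes:
        first_seen = {}
        for ts, name in episode:
            first_seen[name] = min(ts, first_seen.get(name, ts))
        for a, b in combinations(first_seen, 2):
            if first_seen[a] < first_seen[b]:
                ordered[(a, b)] += 1
            elif first_seen[b] < first_seen[a]:
                ordered[(b, a)] += 1
    return {(a, b) for (a, b), n in ordered.items()
            if n >= min_count and ordered[(b, a)] == 0}


def converge(batch, rules):
    """Drop alarms whose learned earlier partner is also in the batch."""
    present = {name for _, name in batch}
    return [(ts, name) for ts, name in batch
            if not any((leader, name) in rules for leader in present)]


episodes = [
    [(0, "disk_full"), (5, "db_write_failed")],
    [(10, "disk_full"), (12, "db_write_failed"), (13, "slow_query")],
    [(3, "slow_query")],
]
rules = learn_order_rules(episodes)
batch = [(100, "disk_full"), (101, "db_write_failed"), (102, "slow_query")]
print(converge(batch, rules))  # [(100, 'disk_full'), (102, 'slow_query')]
```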

 

4. Capacity prediction

 

Capacity prediction is used in many areas of database O&M. Different scenarios have different characteristics, so it is difficult to find one model that fits all the data.

 

Our typical application is predicting database size. Database capacity tends to rise overall, but irregularly and with large fluctuations. In the short term, a reasonable prediction lets us detect likely faults in advance and prevent or resolve them proactively, instead of reacting after problems occur. In the long term, it supports sound capacity planning and resource allocation.

 

At first we tried linear regression with simple data preprocessing, but the results were far from ideal. Because business scale varies, capacity differs greatly from one database to another, and neither linear nor nonlinear fitting works well when operations such as table imports or database expansion take place.

 

Traditional linear regression is simple, but on its own its predictions are too poor to meet our requirements. To address this, we classify capacity data into two types: periodic, and sudden rise and fall. The classification can be done with statistical, clustering, or classification methods.

 

Periodic data can be treated as approximately linear, because within the overall upward trend its growth over each period is roughly linear. For this type of data, we can predict database capacity with linear regression.
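
A least-squares trend line over daily sizes is enough to illustrate the idea. The sample sizes and function names are made up, and the sketch fits a single straight line against the day index without any preprocessing.

```python
def fit_line(values):
    """Ordinary least squares of value against day index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, days_ahead):
    """Extend the fitted line days_ahead days past the last observation."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(days_ahead)]


sizes_gb = [100, 102, 104, 106, 108, 110, 112]
print(forecast(sizes_gb, 3))  # [114.0, 116.0, 118.0]
```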

 

 

Periodic data

 

For data with sudden rises and falls, however, linear fitting performs poorly. Here we use a sequential increment summation method: from the historical data we compute a weighted average of the daily increment for each day of the week, Monday through Sunday, and then apply those increments to the forecast. Compared with simple linear fitting, this is considerably more accurate, and the mean squared error of the predictions falls by nearly a factor of two.
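
The sketch below shows one possible reading of this method. It averages the day-over-day increment separately for each weekday, giving more weight to recent weeks, and then adds those increments one day at a time. The exponential weighting, the `decay` parameter, and the sample data are assumptions; the weighting actually used is not specified here.

```python
def weekday_increments(sizes, first_weekday=0, decay=0.5):
    """Weighted average daily increment for each weekday (0 = Monday).

    sizes: daily sizes; sizes[0] falls on first_weekday.
    decay: weight multiplier per week going back in time (assumed),
           so recent weeks count more.
    """
    sums = [0.0] * 7
    weights = [0.0] * 7
    n = len(sizes)
    for i in range(1, n):
        weekday = (first_weekday + i) % 7
        weeks_ago = (n - 1 - i) // 7
        w = decay ** weeks_ago
        sums[weekday] += w * (sizes[i] - sizes[i - 1])
        weights[weekday] += w
    return [s / w if w else 0.0 for s, w in zip(sums, weights)]


def forecast_sizes(sizes, days_ahead, first_weekday=0, decay=0.5):
    """Add the per-weekday increments to the last observed size."""
    increments = weekday_increments(sizes, first_weekday, decay)
    n = len(sizes)
    current = sizes[-1]
    result = []
    for k in range(days_ahead):
        current += increments[(first_weekday + n + k) % 7]
        result.append(current)
    return result


# Fifteen made-up daily sizes starting on a Monday, with a jump every Friday.
sizes = [100, 101, 102, 103, 113, 113, 113,
         114, 115, 116, 117, 127, 127, 127, 128]
print(forecast_sizes(sizes, 5))  # [129.0, 130.0, 131.0, 141.0, 141.0]
```

The forecast reproduces the Friday jump, which a single straight line through the same data would smooth away.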

 

Sudden rise and fall type data

 

Across all four scenarios, the goal is to use AI to make O&M more efficient, so that more faults are found and resolved in advance. There is much more to try and explore in AIOps, such as an intelligent question-answering bot and a centralized log analysis platform. We will share the results later.