1. Background

In Ant Group's intelligent monitoring field, time series anomaly detection is a critical link. During anomaly detection, the business side monitors various indicators of different businesses, applications, interfaces, and clusters via industry-standard Metrics data, including business Metrics (total volume, failures, and latency) and system Metrics (CPU, MEM, DISK, JVM, and IO). Early time series anomaly detection was done by SREs configuring expert rules based on long-term operations experience. With the spread of AI technology, anomaly detection has gradually become AI-based. In real scenarios, AI algorithms often face the following challenges:

  • In different time periods of the day, the business time series curve exhibits different local mean/variance characteristics;
  • On special days, such as major holidays and promotion days, time series data differ greatly from ordinary days, and even from the same period in previous years;
  • Some periodic events occur at random moments within a fixed time window, every day or every few days;
  • With massive numbers of monitored services, it is impractical to model each indicator individually.

The following figure shows a group of time series data for a latency indicator, sampled by minute and showing obvious daily periodicity. Within each day, the mean/variance varies significantly across periods. Hard-coding per-period thresholds from expert experience is laborious and inaccurate; fitting the data distribution with a regression model is accurate but hard to generalize to other indicators.

Figure 1: Time series curve of a latency indicator

This paper explores a CNN-based approach that maintains detection precision and recall while generalizing well across indicators.

2. Algorithm research

The figure below summarizes some of the algorithms involved in time series anomaly detection; they are not detailed here, and interested readers can look up the relevant principles on their own.

Figure 2: Algorithms related to anomaly detection

From the perspective of whether annotated training samples are required, algorithms mainly divide into supervised and unsupervised directions (semi-supervised methods are not covered here). Unsupervised algorithms save a great deal of labeling manpower and suit cold starts, but the algorithm developer must keep tuning parameters to find the optimal decision boundary, taking the characteristics of each monitored service into account along the way. Supervised algorithms are the opposite, but their models are often poorly interpretable: in daily operations, users often ask why an alarm did or did not fire at a given moment, and O&M personnel may be unable to explain. In addition, different service owners apply different criteria when judging anomalies; if they cannot agree on a single standard, a supervised approach requires maintaining a separate sample set for each standard.

The convolutional layers of a CNN have clear advantages in extracting anomalous waveform features, and a fully connected layer of appropriate complexity can, in theory, fit any nonlinear relation. Moreover, network architecture design is flexible work that leaves algorithm engineers plenty of room to create, rather than merely tuning parameters.

3. Algorithm principle

This chapter introduces the CNN-based scheme in three parts: feature engineering, sample augmentation, and neural network design, taking the latency indicator as an example.

3.1 Feature Engineering

The mean/variance/trend of different samples varies significantly, so the raw time series data must first be mapped into a unified dimensionless space.

The raw input to the model consists of 5 input channels:

  1. Same-day data: time series from n minutes before the current prediction time up to the current moment;
  2. Reference data (1 day ago): with the same clock time one day ago as the reference point, time series from n minutes before to m minutes after;
  3. Reference data (2 days ago): with the same clock time two days ago as the reference point, time series from n minutes before to m minutes after;
  4. Reference data (7 days ago): with the same clock time seven days ago as the reference point, time series from n minutes before to m minutes after;
  5. Reference data (14 days ago): with the same clock time fourteen days ago as the reference point, time series from n minutes before to m minutes after;

The window from n minutes before to m minutes after the reference point is used because some periodic events do not occur at a fixed moment but at a random moment within a fixed window. In the author's practice, n = 60 and m = 30.
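
As a concrete illustration, here is a minimal sketch of assembling the five channels from a minute-granularity series, assuming series is a 1-D numpy array indexed by minute and t is the index of the current prediction time; the helper and its names are illustrative, not the production code:

```python
import numpy as np

def build_channels(series: np.ndarray, t: int, n: int = 60, m: int = 30) -> list:
    """Assemble the 5 raw input channels around prediction index t."""
    day = 1440  # minutes per day
    channels = [series[t - n : t + 1]]  # same-day data: n minutes up to now
    # Reference windows 1, 2, 7, and 14 days ago, from n minutes before to
    # m minutes after the same clock time.
    for d in (1, 2, 7, 14):
        ref = t - d * day
        channels.append(series[ref - n : ref + m + 1])
    return channels
```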

Main problems solved:

  1. Eliminates the influence of differing water levels (baseline levels) of the time series on different dates;
  2. Eliminates the influence of differing jitter amplitudes on different dates;
  3. Eliminates the influence of differing value ranges across indicators.

The data processing pipeline proceeds in the following order.

3.1.1 Variance standardization

Variance reflects the jitter intensity of time series data within the statistical period. In real samples, the variance of the current day may differ greatly from that of earlier reference periods. Without standardization, time series with severe jitter easily produce false positives.

Figure 3: Comparison before and after variance standardization
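
A minimal sketch of this step, assuming each channel is already a numpy array; the epsilon is an illustrative guard against zero-variance windows:

```python
import numpy as np

def scale_variance(channel: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Divide by the channel's own standard deviation so heavy-jitter and
    # quiet series end up with comparable jitter intensity.
    return channel / (np.std(channel) + eps)
```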

3.1.2 De-meaning

The mean reflects the water level of time series data within the statistical period. In real samples, the mean of the current day generally does not equal that of earlier reference periods, so the water levels need to be aligned. The median of the time series in each input channel is taken, and the series is translated so that the median sits at 0. Note that this is the median, not the statistical mean.

Figure 4: Comparison before and after de-meaning

As for why the median is used instead of the statistical mean: the figure below shows the misalignment that results from using the mean.

Figure 5: Difference between the median and the statistical mean
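
A minimal sketch of the median-centering step described above:

```python
import numpy as np

def center_median(channel: np.ndarray) -> np.ndarray:
    # Shift the channel so its median sits at 0; the median is preferred to
    # the mean because anomalous spikes drag the mean off the true water level.
    return channel - np.median(channel)
```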

3.1.3 Extracting the trend baseline

Moving average: a rolling window extracts the trend baseline within the reference period; the data inside each window must be denoised at a certain rate.

Figure 6: Extracting the trend baseline
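
A minimal sketch of the trend-baseline extraction, here denoising each rolling window by trimming a fraction of its most extreme points before averaging; the window size and trim rate are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def trend_baseline(channel: np.ndarray, window: int = 30, trim: float = 0.1) -> np.ndarray:
    # Pad at the front so every point has a full trailing window.
    padded = np.pad(channel, (window - 1, 0), mode="edge")
    # The trimmed mean drops the top/bottom `trim` fraction of each window,
    # denoising the window content before averaging.
    return np.array([
        stats.trim_mean(padded[i : i + window], trim)
        for i in range(len(channel))
    ])
```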

3.1.4 Detrending

A simple mapping is applied (true value minus trend-baseline value), extracting the residual series left after removing the trend.

Figure 7: Detrending

3.1.5 Standardization

Apply a standardization operation to the residual series.

Figure 8: Standardization
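
A minimal sketch covering this step together with the detrending step in 3.1.4: subtract the baseline, then z-score the residual. trend_baseline is the helper from the previous sketch.

```python
import numpy as np

def detrend_and_standardize(channel: np.ndarray, baseline: np.ndarray,
                            eps: float = 1e-8) -> np.ndarray:
    residual = channel - baseline  # true value minus trend-baseline value
    return (residual - np.mean(residual)) / (np.std(residual) + eps)
```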

3.1.6 Data interception

Through the above steps, the time series data of all five channels are mapped into the new space. When performing anomaly detection, the same-day input fed into the neural network only needs the latest c minutes, since only the current moment is being judged. In the author's practice, c = 7.
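
A minimal sketch of the interception step; only the same-day channel is trimmed, while the four reference channels keep their full n + m + 1 length:

```python
def intercept_today(today_channel, c: int = 7):
    # Keep only the latest c minutes of the (already preprocessed) same-day
    # channel, since only the current moment is being judged.
    return today_channel[-c:]
```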

3.2 Data Augmentation

Before model training, an appropriate amount of data augmentation not only improves the generality of the model but also makes anomalous waveform features easier to extract as training converges, greatly improving precision/recall. Data augmentation is performed after feature engineering.

  • Swap the inputs of two reference-day channels. In the example below, the inputs of channels Y7 and Y14 are exchanged;

Figure 9: Swapping channel inputs

  • Modify the current-time value of an anomalous sample, setting it randomly below a specified threshold, turning the anomalous sample into a normal one;

Figure 10: Modifying the current-time value

  • Modify the same-day input vector of an anomalous sample, shifting the whole vector downward by a large amount, turning the anomalous sample into a normal one;

Figure 11: Modifying the same-day input vector

  • Simulate periodic events: randomly select several reference channels and generate waveforms similar to today's anomalous waveform in their input series (see the sketch after this list).

Figure 12: Simulating periodic events
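
The four operations above can be sketched as follows, operating on the post-feature-engineering channel list [today, y1, y2, y7, y14]; thresholds, shift sizes, and helper names are illustrative assumptions:

```python
import random
import numpy as np

def swap_reference_days(channels):
    # Swap two reference channels (here Y7 and Y14); the label is unchanged.
    channels[3], channels[4] = channels[4], channels[3]
    return channels

def suppress_current_point(channels, threshold=1.0):
    # Pull the current-time value of an anomalous sample below a threshold,
    # turning the anomalous sample into a normal one.
    channels[0][-1] = random.uniform(-threshold, threshold)
    return channels

def shift_today_down(channels, shift=3.0):
    # Shift the whole same-day vector downward by a large amount, turning an
    # anomalous sample into a normal one.
    channels[0] = channels[0] - shift
    return channels

def inject_periodic_event(channels, waveform):
    # Copy a waveform similar to today's anomaly into randomly chosen
    # reference channels, simulating a periodic event (label becomes normal).
    for i in random.sample([1, 2, 3, 4], k=random.randint(1, 3)):
        pos = random.randint(0, len(channels[i]) - len(waveform))
        channels[i][pos : pos + len(waveform)] += waveform
    return channels
```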

3.3 Neural network design

Compared with complex image recognition, the image features of anomalous waveforms are much simpler. Provided the recall requirement is met, fewer hidden layers and fewer parameters should be used to solve the problem. There are two key points in the model structure:

  1. All input channels share the same convolutional layer. The waveform features to be extracted are the same for every channel, so sharing the convolutional layer saves parameters and computation.
  2. The MaxPool layer essentially takes the maximum element along the time dimension for each channel. As a result, no matter how the length of the input vector varies at the Input layer, the output shape of the MaxPool layer is fixed, so at prediction time each input channel can flexibly accept time series of different lengths.

Keras definition of the model:
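
What follows is a minimal sketch consistent with the two points above, assuming Keras with the TensorFlow backend; the filter count, kernel size, and dense-layer width are illustrative assumptions, not the exact production configuration. GlobalMaxPooling1D plays the role of the MaxPool layer described in point 2.

```python
from tensorflow.keras import layers, models

def build_model(n_channels: int = 5) -> models.Model:
    # One input per channel; the time dimension is None so each channel can
    # carry a different length at prediction time (see point 2 above).
    inputs = [layers.Input(shape=(None, 1), name=f"ch{i}") for i in range(n_channels)]

    # A single Conv1D instance shared by all channels (see point 1 above):
    # the same anomalous-waveform filters are applied to every channel.
    shared_conv = layers.Conv1D(filters=16, kernel_size=3, activation="relu")

    # Global max pooling keeps the strongest response per filter, so the
    # output shape is fixed no matter how long the input is.
    pooled = [layers.GlobalMaxPooling1D()(shared_conv(x)) for x in inputs]

    merged = layers.Concatenate()(pooled)
    hidden = layers.Dense(32, activation="relu")(merged)
    output = layers.Dense(1, activation="sigmoid", name="anomaly_prob")(hidden)

    model = models.Model(inputs=inputs, outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```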

Printed model structure:

Figure 13: Network structure

4. Effect evaluation

4.1 Evaluation results on the annotated sample set

At present, accuracy reaches 98.9% on a training sample set of 10,000+ labeled samples. The labeled data contain some ambiguous annotations on which different business personnel struggle to agree, so forcibly fitting the training set by increasing model complexity may hurt the generality of the model.

4.2 Online prediction evaluation results

Latest review data: 78% precision, 96% recall. Main causes of false positives:

  • Some business personnel judge short-lived glitches as false positives, yet such glitch anomalies are labeled as anomalies in the training set. A simple post-filter rule on anomaly duration can handle this (see the sketch after this list);
  • When roughly 1 hour of raw same-day input is viewed, the anomalous waveform is obvious, but when the time axis is stretched the fluctuation is within the normal range. Such false positives can be reduced by feeding longer raw time series, or by filtering with an appropriate minimum threshold computed offline from historical data;
  • The model performs poorly on small/sparse data;
  • Periodicity differences: the trend of the current day differs greatly from history.
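
A minimal sketch of the duration post-filter mentioned in the first bullet; the window length is an illustrative assumption:

```python
def should_alarm(recent_flags, min_duration: int = 3) -> bool:
    # recent_flags: per-minute model verdicts (True = anomalous), newest last.
    # Alarm only when the anomaly has persisted for min_duration minutes.
    return len(recent_flags) >= min_duration and all(recent_flags[-min_duration:])
```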

Some of the detected anomalies are marked in red below.

Figure 14: Detected anomalies

5. Current problems and thoughts

As more and more monitoring services are onboarded, it is difficult for different services to fully unify their definition of an anomaly. An anomaly in A's eyes may be a normal phenomenon in B's, which means a supervised scheme requires maintaining multiple training sets, which is unrealistic in practice. At the same time, this scheme generates a large number of time series query requests during real-time prediction and needs strong platform support. In our practical exploration, we found that no single algorithm can solve all problems: different algorithms have their own strengths and weaknesses, scenes they fit and scenes they handle awkwardly. The appropriate method is the best method.

About the author

Wang Rui, alias Biannan, is a technical expert at Ant Group who has long worked on AIOps algorithm research. He currently leads the algorithm group of Ant Group's intelligent monitoring team.

About us

Welcome to the world of Ant intelligent operation and maintenance. This public account is produced by Ant Group's technology risk middle-platform team. For readers who follow intelligent O&M, technical risk, and related technologies, we will share from time to time Ant Group's thinking and practice on the architecture design and innovation of intelligent O&M in the cloud-native era.

The Ant technology risk middle-platform team is responsible for building Ant Group's technical risk base platform, covering intelligent monitoring, funds verification, performance, capacity, and full-link stress testing, as well as the risk data infrastructure platform and business capabilities. It solves world-class distributed processing problems, identifies and resolves potential technical risks, and, during Ant's first-class large-scale events such as Double 11, uses platform capabilities to ensure the high availability and funds security of the whole Ant system under extreme request loads.

If there is any topic about "intelligent operation and maintenance" you would like to see, please leave a comment and let us know.

PS: The technology risk middle platform is recruiting technical experts. Welcome to join us; if interested, please contact [email protected]

Public account: Ant Intelligent Operation and Maintenance