Traditional KPI monitoring is usually rule-based: alarms are generated according to preset thresholds. Although this approach is simple, it adapts poorly and is prone to a large number of false alarms, which creates great challenges for actual operation and maintenance work. This paper therefore develops a new KPI anomaly detection method based on machine learning. The method uses a logistic regression model, a wavelet analysis model, a random forest model, and a BiLSTM model as base models and identifies outliers through weighted voting. In the first AIOps competition, this method ranked second in the final with an F1-score of 0.771397.
Keywords: anomaly detection, sequence data, stream data, real-time detection, KPI
1.1 Background
With the development of the Internet, especially the mobile Internet, web services have penetrated many fields. The stability of web services is mainly guaranteed by operation and maintenance (O&M). O&M personnel judge whether services are stable by monitoring various O&M KPIs. However, in practice there are hundreds of O&M indicators, such as CPU usage, memory usage, and disk I/O, which cannot be monitored manually. Automatic tools or programs are therefore needed to assist O&M monitoring. The machine learning-based KPI anomaly detection model proposed in this paper is modular, general, and efficient; it is suitable for online real-time KPI anomaly detection and can effectively reduce the workload of O&M personnel.
1.2 Data Exploration
For the preliminary dataset officially provided by AIOps, this paper carried out KPI daily-mean analysis, KPI similarity analysis, macro trend analysis, anomaly detail analysis, and multi-scale wavelet analysis. The comparison shows that some KPIs share similar shapes and outlier characteristics, while the differences between other KPIs can be very large. No single model can handle anomaly detection for all KPIs once and for all.
Figure 1.1 Data exploration
Figure 2.1 shows the basic framework of the anomaly detection algorithm based on weighted voting, which is divided into offline model training and online detection. This paper tried a wide variety of anomaly detection models; after weighing model validity against competition time, it finally selected the logistic regression model, the wavelet analysis model, the random forest model, and the bidirectional LSTM (BiLSTM) neural network model as the base models of the weighted scheme. During real-time single-point detection, abnormal points are determined by weighted voting.
Figure 2.1 Algorithm framework
3.1 Data preprocessing
Data preprocessing mainly involves three aspects:
(1) Abnormal data elimination. Outliers are removed using the Grubbs criterion; note that the outliers found this way are not necessarily the officially labeled anomalies.
(2) Data standardization. To generate different features, the z-score and min-max methods were selectively used to standardize and normalize the data.
(3) Missing value processing. After trying completion methods such as elimination, mean imputation, and interpolation, mean imputation was finally selected.
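The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the competition code; the iterative two-sided Grubbs test and the helper names are our own.

```python
import numpy as np
from scipy import stats

def grubbs_outliers(x, alpha=0.05):
    """Flag outliers with an iterative two-sided Grubbs test:
    repeatedly remove the most extreme point while the Grubbs
    statistic exceeds its critical value."""
    x = np.asarray(x, dtype=float)
    mask = np.ones(len(x), dtype=bool)   # True = still considered normal
    while mask.sum() > 2:
        sub = x[mask]
        n = len(sub)
        z = np.abs(sub - sub.mean()) / sub.std(ddof=1)
        idx = np.argmax(z)
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        if z[idx] > g_crit:
            mask[np.flatnonzero(mask)[idx]] = False
        else:
            break
    return ~mask  # True where the point is an outlier

def z_score(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def fill_missing_with_mean(x):
    """Replace NaN gaps with the series mean (the method chosen in (3))."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)
```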
3.2 Feature Extraction
Feature extraction is the most important step before model training; its quality determines the upper limit of what machine learning can achieve. This paper follows the feature extraction method proposed by Dan Pei's group in the Opprentice framework and also draws on general time series feature extraction methods, finally selecting raw value features, statistical features, fitting features, and wavelet analysis features. By adjusting the time window size and the parameters of the extraction methods, the model extracts 61 time series features. The effect of some features is shown in Figure 3.2.
Figure 3.1 Summary of feature extraction
Figure 3.2 Sample features
For the training set, each data point has 61-dimensional features that can be used for model training. In the online detection stage, the 61-dimensional features of each newly acquired data point must be computed from the historical window data.
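The windowed extraction can be sketched as below. The window size and the handful of features shown are illustrative, not the competition's exact 61-feature set; `window_features` is a hypothetical helper name.

```python
import numpy as np

def window_features(series, t, win=60):
    """Features for the point at index t, computed from the trailing
    window series[t-win : t+1], mirroring the online setting where
    only history is available."""
    w = np.asarray(series[max(0, t - win): t + 1], dtype=float)
    diff = np.diff(w) if len(w) > 1 else np.array([0.0])
    return {
        "value": w[-1],                  # raw value feature
        "mean": w.mean(),                # statistical features
        "std": w.std(),
        "min": w.min(),
        "max": w.max(),
        "last_diff": w[-1] - w[-2] if len(w) > 1 else 0.0,
        "mean_diff": diff.mean(),
        # a simple fitting feature: residual of the point against a
        # linear trend fitted over the window
        "trend_resid": w[-1] - np.polyval(
            np.polyfit(np.arange(len(w)), w, 1), len(w) - 1),
    }
```

Varying `win` and adding further aggregations (as the paper does with its window sizes and extraction parameters) is how the feature count grows to 61.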
3.3 Data Balancing
In the sample data of KPI anomaly detection in this competition, positive and negative samples are extremely unbalanced, which will have a great impact on the training effect of the model. Therefore, sample data should be balanced before the training of the neural network model and other models.
This paper adopts two data balancing schemes together: under-sampling the non-abnormal data (random sampling) and over-sampling the abnormal data, mainly with the SMOTE and ADASYN methods. This ensures that the sample data fed to the models has a 1:1 ratio of positive and negative samples.
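A minimal sketch of the combined scheme follows. In practice the SMOTE/ADASYN implementations in imbalanced-learn would be used; the hand-rolled interpolation and the `balance` helper here only illustrate the idea.

```python
import numpy as np

def balance(X, y, rng=None):
    """Balance to a 1:1 ratio: randomly under-sample the majority
    (normal, y=0) class and over-sample the minority (anomaly, y=1)
    class with SMOTE-style interpolation between random minority pairs."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X, y = np.asarray(X, float), np.asarray(y)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    target = (len(pos) + len(neg)) // 2        # per-class size after balancing
    keep_neg = rng.choice(neg, size=min(target, len(neg)), replace=False)
    # synthesize minority points on segments between random minority pairs
    need = max(0, target - len(pos))
    a = X[rng.choice(pos, size=need)]
    b = X[rng.choice(pos, size=need)]
    synth = a + rng.random((need, 1)) * (b - a)
    X_bal = np.vstack([X[pos], synth, X[keep_neg]])
    y_bal = np.concatenate([np.ones(len(pos) + need), np.zeros(len(keep_neg))])
    return X_bal, y_bal
```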
In the over-sampling scheme, this paper does not take all abnormal samples as the initial sample set, but excludes abnormal data points whose features are not obvious, such as the data points in Figure 3.3. According to the competition rules, as long as any of the first seven points of an abnormal segment is recognized as abnormal, the entire segment counts as identified; if only points after the seventh are recognized, the segment is marked as unidentified. Recognizing anomalies after the seventh point therefore contributes nothing to the F-score, but may increase the error rate by introducing additional samples. Accordingly, the eighth and subsequent points of each abnormal segment are excluded from model training.
Figure 3.3 Example outlier data
3.4 Anomaly detection based on logistic regression
Logistic regression builds on linear regression, using the sigmoid function to map the linear output to a nonlinear one, and is widely used in fitting and classification. KPI anomaly detection is itself a binary classification problem. Experience shows that as long as feature extraction is accurate, a logistic regression model can achieve good results, and it is simple, efficient, and easy to understand.
In this paper, the generalized linear model (glm) in R is used for model training and prediction. For threshold selection, it is common to determine the threshold from the receiver operating characteristic (ROC) curve or the precision-recall curve (PRC). The ROC curve remains stable when the positive/negative sample distribution changes, but its optimum is not consistent with maximizing the F1-score. The PRC focuses on the F1-score: the closer precision and recall are to each other, the larger the corresponding F1-score. However, when the sample distribution is highly uneven, the PRC cannot reflect classifier performance well. Therefore, grid search is finally used to determine the classification threshold. The performance of the logistic regression model is retrospectively reviewed on the final dataset; its performance on the training and test sets is shown in Figure 3.4. Under the official standard, its total score on the test set is 0.693049243.
Figure 3.4 Score of logistic regression anomaly detection on the final dataset
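The combination of model fitting and F1-oriented threshold grid search can be sketched as follows. The paper uses R's glm(); scikit-learn's LogisticRegression is shown here as the Python analogue, and the threshold grid is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_and_pick_threshold(X_train, y_train, thresholds=None):
    """Fit logistic regression, then grid-search the classification
    threshold that maximizes F1 on the training set."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_train)[:, 1]
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)   # illustrative grid
    best_t = max(thresholds, key=lambda t: f1_score(y_train, proba >= t))
    return model, best_t
```

A point is then flagged abnormal online when `predict_proba(x)[:, 1] >= best_t`.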
3.5 Anomaly detection based on wavelet analysis
Wavelet analysis has been widely used in signal processing, image compression, pattern recognition, and other nonlinear scientific fields. In time series research it is mainly used for denoising and filtering, and it can reveal features of sequence data that cannot be observed at the conventional scale. Unlike Fourier analysis, wavelet analysis offers time-frequency multi-resolution analysis. Based on this, this paper decomposes the original KPI sequence into multiple signal sequences by the discrete wavelet transform (DWT) and reconstructs the decomposed sequences. As shown in Figure 3.5a, the original sequence is decomposed by a 5-level wavelet transform, with the low-frequency signals (approximation) on the left and the high-frequency signals (detail) on the right.
Figure 3.5a 5-level wavelet decomposition
Figure 3.5b Distribution of outliers under wavelet decomposition
As can be seen from Figure 3.5b, the distribution of outliers is more obvious in the detail sequences: outliers fall on the high-frequency detail components with greater probability. Thus, outlier detection on a KPI time series can be transformed into outlier detection on its detail series.
To achieve online KPI anomaly detection, this paper chooses an appropriate time window, performs wavelet decomposition and reconstruction on the data within the window, extracts the detail sequence, and applies the Grubbs criterion to the window to identify abnormal points. If the current time point falls in the detected anomaly set, the current moment is judged abnormal.
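A minimal sketch of this step follows. The paper performs multi-level decomposition with a wavelet library (PyWavelets is the usual choice in Python); to stay self-contained, only a single Haar level is shown here, and both helper names are our own.

```python
import numpy as np
from scipy import stats

def haar_detail(x):
    """Single-level Haar DWT: return the high-frequency (detail)
    coefficients of adjacent sample pairs."""
    x = np.asarray(x, dtype=float)
    n = len(x) - len(x) % 2
    return (x[:n:2] - x[1:n:2]) / np.sqrt(2)

def grubbs_flags_current(window, alpha=0.05):
    """Apply the Grubbs criterion to the detail coefficients of the
    trailing window; report whether the newest point is anomalous."""
    d = haar_detail(window)
    n = len(d)
    z = np.abs(d - d.mean()) / d.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    # the last detail coefficient covers the current time point
    return z[-1] > g_crit
```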
In wavelet analysis, the wavelet basis function, the signal extension mode, the number of decomposition levels, and the time window size all affect the final detection effect. Therefore, during KPI training, this paper takes the highest training set score as the objective, extracts the best wavelet analysis parameters for each KPI by grid search, and applies them to online detection.
Figure 3.6 Parameter optimization of wavelet analysis
The scores of wavelet analysis on the final KPIs are shown in Figure 3.7. Under the official scoring standard, its total score on the test set is 0.649976063. As the figure shows, wavelet analysis performs differently on different KPIs and is therefore not applicable to all of them.
Figure 3.7 Anomaly detection score of wavelet analysis on the final dataset
3.6 Anomaly detection algorithm based on random forest
Random forest is a combined classifier whose basic unit is the decision tree. It can be used not only for data cleaning but also for classification and recognition, and it can effectively handle data imbalance. In this paper, the RandomForestClassifier provided in Python (scikit-learn) is used to detect outliers; its input is a subset of the features produced by the feature extraction algorithm. Due to the competition time limit, the number of trees, maximum depth, and maximum number of features were not tuned much, so the random forest still has considerable room for improvement.
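The detector can be set up as below. The hyperparameter values are illustrative defaults, not the competition's settings (which, as noted, were not heavily tuned).

```python
from sklearn.ensemble import RandomForestClassifier

def train_rf(X_train, y_train):
    """Random-forest detector trained on the extracted window features."""
    rf = RandomForestClassifier(
        n_estimators=200,          # number of trees
        max_depth=None,            # grow trees fully
        max_features="sqrt",       # features considered per split
        class_weight="balanced",   # helps with residual class imbalance
        n_jobs=-1,
        random_state=0,
    )
    return rf.fit(X_train, y_train)
```

Online, each new point's feature vector is passed to `rf.predict` (or `rf.predict_proba` if a tunable threshold is wanted, as in the logistic regression model).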
The performance of the random forest anomaly detection algorithm in the final is shown in Figure 3.8, and its total score in the test set is 0.701142416, which is the highest among the four basic algorithms.
Figure 3.8 Score of random forest anomaly detection on the final dataset
3.7 Anomaly detection algorithm based on BiLSTM neural network
In this paper, a bidirectional LSTM neural network is used for anomaly detection. LSTM is an effective variant of the recurrent neural network: it inherits most of the characteristics of recurrent models while mitigating the vanishing gradient problem. A neural network can automatically learn the correlation between features and targets, and LSTM is well suited to problems that are highly time-dependent.
The bidirectional LSTM performs better, so we use it to build the anomaly detection model. The difficulty lies in effective selection of the hyperparameters. Due to limited computing resources, we tuned them manually; training and testing performed well in scenarios with full data coverage.
Figure 3.9 BiLSTM neural network structure
The performance of the BiLSTM-based neural network algorithm in the final is shown in Figure 3.10, and its total score on the final test set is 0.691123881.
Figure 3.10 BiLSTM anomaly detection score on the final dataset
The above four models perform differently on different KPIs. To obtain more accurate anomaly detection, this paper adopts a weighted voting model to combine the four models.
The combined result is computed as

P = (w1·p1 + w2·p2 + w3·p3 + w4·p4) / (w1 + w2 + w3 + w4)

where wi is the weight of model i, with a value in [0, 1] taken as the model's training-set score; pi is the prediction of model i, which can be 0 or 1; and P is the combined prediction: if P < 0.5 the point is predicted normal, otherwise abnormal.
For weight determination, there are two cases:
(1) If at least one model's training-set score is greater than 0.5, any model whose score is below 0.5 has its weight reset to 0.
(2) If the scores of all four models are below 0.5, the model with the highest score is taken as the prediction model, and the weights of the other models are reset to 0.
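The weighted voting rules can be sketched as a short combiner; the function name and array layout are our own.

```python
import numpy as np

def weighted_vote(preds, scores):
    """Combine the four base detectors by weighted voting.

    preds  : 0/1 predictions p_i from each model for the current point
    scores : training-set scores in [0, 1], used as the weights w_i
    Per the rules above: weights below 0.5 are zeroed; if every score
    is below 0.5, only the best-scoring model votes.
    """
    p = np.asarray(preds, float)
    w = np.asarray(scores, float).copy()
    if (w >= 0.5).any():
        w[w < 0.5] = 0.0              # case (1): drop weak models
    else:
        best = np.argmax(w)           # case (2): keep only the best model
        w = np.zeros_like(w)
        w[best] = 1.0
    P = (w * p).sum() / w.sum()
    return 1 if P >= 0.5 else 0       # 1 = abnormal, 0 = normal
```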
Reviewing the final results, the single-model scores of the four models ranged from about 0.65 to 0.70, while the combined score reached 0.771397. Model combination thus achieved clear gains and significantly reduced misjudgment.
Table 4.1 Comparison of model scores
Among the many anomaly detection models tried, this paper finally chose four, the logistic regression model, the wavelet analysis model, the random forest model, and the bidirectional LSTM neural network model, for preliminary anomaly detection. Due to the diversity of KPIs, these four models perform differently, so a weighted combination method is used to integrate them; the results show that the integration achieved remarkable gains. KPI anomaly detection models are diverse, and it is difficult to find or develop a universal solution in practical application scenarios. Therefore, this paper combines currently effective anomaly detection methods to obtain a relatively general detection scheme.
Each sub-model of the detection model can work independently or cooperatively, and sub-models can be added or removed in a pluggable way. In addition, the competition time was limited and much work remains unfinished: the model still has considerable room for improvement, such as automated feature selection, hyperparameter optimization of the neural networks, and better model integration strategies.