Author | Van Den, Bai Yu

Review & proofread: white VERIFICATION

Editing & Typesetting: Wen Yan

Background

Anomaly detection is a basic but important function of an intelligent operations and maintenance (AIOps) system. It uses algorithms to automatically detect abnormal fluctuations in KPI time series data, and it provides the decision basis for subsequent alerting, automatic loss mitigation, root cause analysis, and so on. So what is anomaly detection, and how do we use it in real scenarios? This article explains both in depth.

What is anomaly detection?

Before we get started, we need to understand what anomaly detection is. Anomaly detection means identifying abnormal events and phenomena in time series or event logs. In this article it refers specifically to anomaly detection on time series: by jointly judging the values of a time series and the shape of its curve, we can find anomalies in the curve. An anomaly generally shows up as a rise, fall, or fluctuation in the time series that does not match expectations.

For example, a machine's memory usage suddenly jumps from around 40% to 100%; the number of connections to a Redis database, which is always around 100, suddenly drops to 0; or the number of users online for a business, which fluctuates around 100,000, suddenly drops to 50,000.

What is a time series?

A time series is a sequence of data points arranged in the order in which they occurred. The interval between points in a given series is usually constant (for example, 1 minute or 5 minutes).

How does open source Prometheus currently detect anomalies?

The current open source version of Prometheus detects anomalies with threshold rules, and this reliance on manually set thresholds raises the following questions.

Q&A

Question 1: With tens of thousands of indicators, how can detection be configured quickly and sensibly?

Because different indicators mean very different things, their appropriate thresholds also differ. Even indicators of the same type cannot share one threshold, because the services behind them behave differently. Each threshold therefore has to be chosen according to the needs of the service. Operations staff also differ in knowledge and experience, so different people configure different thresholds. In addition, many indicators have no clearly defined reasonable range, so many thresholds end up being set by guesswork and are largely arbitrary.

For example, for an online user count indicator, a reasonable threshold can be set only after carefully observing and analyzing the value distribution and trend of its historical curve.

Question 2: How do I maintain detection rules as services evolve?

For relatively stable services, indicators stay stable for a long time, so the configured thresholds remain effective for a long time. For a constantly changing business, however, the level and trend of the indicators change as the business evolves. A threshold that fit when it was first set may therefore stop matching reality after a while. O&M experts then have to check periodically whether each detection threshold still meets current requirements, and fix the configurations that no longer do. The static threshold method therefore carries a high maintenance cost.

For example, suppose the I/O throughput initially fluctuates around 10,000 and the alarm threshold is set to 20,000. As the service grows, the I/O throughput stabilizes at around 25,000. The original threshold now triggers continuously, producing a constant stream of noisy alarms.
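The I/O throughput example can be reproduced with a small sketch. It compares a fixed threshold with a simple rolling 3-sigma baseline on synthetic data. The rolling rule, the function names, and the data are illustrative assumptions, not the algorithm used by the intelligent detection operator.

```python
from statistics import mean, stdev


def static_alerts(values, threshold):
    """Indices of points above a fixed upper threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def rolling_alerts(values, window, k=3.0):
    """Indices of points more than k standard deviations away from
    the mean of the preceding `window` points (illustrative rule)."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Synthetic throughput: about 10,000 for 30 points, then about 25,000 for 30 points.
ripple = [-200, -100, 0, 100, 200]
series = [10000 + ripple[i % 5] for i in range(30)]
series += [25000 + ripple[i % 5] for i in range(30)]

static = static_alerts(series, threshold=20000)
adaptive = rolling_alerts(series, window=10)
print(len(static), adaptive)
```

On this data the fixed threshold fires on all 30 points after the level change, while the rolling baseline flags only index 30, the moment of the change, and then adapts to the new level.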

Question 3: What can be done about poor data quality?

Poor data quality shows up in several specific ways: long collection delays, many missing values, and many data burrs (spikes that make the curve insufficiently smooth). The first two are better addressed through targeted optimization on the collection and aggregation side, and ARMS Prometheus continues to optimize its collection capabilities. The static threshold method, however, cannot effectively cope with data that has many burrs. The intelligent detection operator in the managed version of ARMS Prometheus identifies burr points effectively, so that they do not produce invalid alarms, which reduces the noise for users and operators.
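To illustrate why burrs are a problem for a plain threshold, here is a minimal sketch of one simple mitigation: alert only when the threshold is exceeded for several consecutive points. The function name, the `min_run` parameter, and the data are hypothetical, and this persistence rule is only a stand-in for whatever burr identification the operator actually performs.

```python
def sustained_alerts(values, threshold, min_run=3):
    """Start indices of runs with at least `min_run` consecutive
    points above `threshold`; shorter runs are treated as burrs."""
    starts, run = [], 0
    for i, v in enumerate(values):
        run = run + 1 if v > threshold else 0
        if run == min_run:
            starts.append(i - min_run + 1)
    return starts


# One isolated burr at index 2, then a sustained rise starting at index 6.
cpu = [50, 52, 95, 51, 49, 50, 96, 97, 95, 98, 52]

naive = [i for i, v in enumerate(cpu) if v > 90]
filtered = sustained_alerts(cpu, threshold=90, min_run=3)
print(naive, filtered)
```

The naive rule fires on the isolated burr as well as on every point of the sustained rise, while the persistence rule reports only the start of the sustained rise.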

How does Alibaba Cloud Prometheus monitoring solve these problems?

To address the problems above, the detection configuration in Alibaba Cloud Prometheus monitoring supports not only the original threshold-based detection but also template-based threshold detection and detection with the intelligent detection operator.

Business value 1: Efficient, high-quality alarm configuration

(1) For well-defined application scenarios, configure detection rules from templates. Alibaba Cloud Prometheus provides mature alarm configuration templates, so users do not need to set thresholds manually and only need to select the corresponding template. For example, for machine metrics you can apply a template such as machine CPU usage > 80%. Templates address the pain points of scenarios where the anomaly is well defined and the service is stable.

(2) For scenarios where the indicator has no clear normal range, or for business indicators whose thresholds are hard to set, the intelligent detection operator is recommended.

For example, setting a threshold for an online user count indicator requires observing the historical curve for a long time before a proper value can be chosen. In this scenario, the user can select the intelligent detection operator directly.

Business value 2: Adaptive tracking of service changes, greatly reducing the cost of maintaining detection thresholds

With the intelligent detection operator in Alibaba Cloud Prometheus monitoring, the user sets a parameter that specifies how much historical data to reference, and the model adaptively tracks changes in the indicator's trend. No periodic manual review of the configured rules is needed.

Business value 3: Intelligent detection also works for indicators with poor data quality, missing values, or many burrs

If the historical data has missing values, the intelligent detection operator can fill them in automatically using linear interpolation, polynomial interpolation, or other methods.

For indicator curves that are not smooth, the intelligent detection operator also adaptively selects the model best suited to the scenario, to ensure the overall detection quality.
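As an illustration of the linear interpolation mentioned above, the following sketch fills gaps (represented here as `None`) from the nearest known neighbours. The function name and the handling of leading and trailing gaps are assumptions made for this example; the operator's own filling logic is not documented here.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest
    known neighbours. Leading or trailing gaps copy the nearest
    known value (an assumption for this sketch)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


series = [10.0, None, None, None, 18.0, None]
print(fill_linear(series))  # [10.0, 12.0, 14.0, 16.0, 18.0, 18.0]
```

The three missing points between 10.0 and 18.0 are placed evenly on the straight line between them, and the trailing gap repeats the last known value.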

How to apply it to specific business scenarios

Level surge/drop indicators: the QPS indicator of a service

When the threshold is first set for the service, observation may suggest that QPS will not exceed 150. As the business iterates, however, the QPS indicator changes in various ways. On the curve, this appears as a sudden increase to a certain value followed by a new stable state. In this case, a static threshold can hardly keep meeting the detection requirements. Conversely, a stable indicator can also drop suddenly, and a static threshold with only an upper limit cannot detect that. The intelligent detection operator can adaptively track changes in the business level and identify both sudden increases and sudden drops.

Cyclical indicators:

In the indicator profiling module, if the current indicator is found to be periodic, the period length, the period offset, and the overall trend curve are extracted from it. After the periodicity and trend are removed from the original time series, the residual is used for anomaly detection. For the periodic indicator in the figure above, the cycle at around 11:30 is clearly different from the other cycles. A traditional static threshold can hardly handle detection in such scenarios, but the intelligent detection operator can identify the anomaly.
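The idea of detecting on the residual can be sketched as follows. This is a deliberately simplified illustration, not the operator's algorithm: the period is assumed to be known, the trend is approximated by one median level per cycle, the seasonal profile is the per-phase median, and the 3-sigma rule on the residual is an assumption chosen for brevity.

```python
from statistics import mean, median, stdev


def residual_anomalies(values, period, k=3.0):
    """Indices whose residual is an outlier after removing a
    per-cycle level (trend) and a per-phase profile (seasonality).
    Only full cycles are considered."""
    cycles = [values[i:i + period]
              for i in range(0, len(values) - period + 1, period)]
    # Trend: one robust level (median) per cycle.
    detrended = [[v - median(cycle) for v in cycle] for cycle in cycles]
    # Seasonality: median of the detrended values at each phase.
    profile = [median(c[p] for c in detrended) for p in range(period)]
    residuals = [c[p] - profile[p] for c in detrended for p in range(period)]
    mu, sigma = mean(residuals), stdev(residuals)
    if sigma == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r - mu) > k * sigma]


# Four cycles of a 6-point pattern on a level that rises by 2 per cycle.
pattern = [0, 5, 10, 5, 0, -5]
series = [100 + 2 * c + s for c in range(4) for s in pattern]
series[15] += 30  # inject an anomaly into the third cycle

print(residual_anomalies(series, period=6))  # [15]
```

The injected point at index 15 peaks at 139, and a static threshold low enough to catch it would depend on knowing the seasonal shape in advance. After the level and the seasonal profile are removed, every normal point has a residual of zero and the injected point stands out on its own.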

Trend-breaking indicators:

Another common type of anomaly occurs while an indicator is rising (or falling) steadily: at some point the trend suddenly breaks, and the local trend differs from the overall trend. This type of anomaly is common, but it is difficult to set a static threshold that catches it. The intelligent detection operator can accurately identify this type of anomaly.

Best practices

Alibaba Cloud Prometheus monitoring usage process

Alibaba Cloud Prometheus monitoring now supports the intelligent detection operator. Just log in to ARMS Prometheus or Grafana and enter the corresponding PromQL.

Operator definition

"anomaly_detect": {
    Name:       "anomaly_detect",
    ArgTypes:   []ValueType{ValueTypeMatrix, ValueTypeScalar},
    ReturnType: ValueTypeVector,
},

Input: the indicator time series, as a range vector, plus a detection parameter (the default value of 3 can be used).
Output: 1 for an anomaly, 0 for normal.

Use case:

anomaly_detect(node_memory_free_bytes[20m],3)

The first argument must be a range vector, so append a time range such as [180m] to the indicator name. The default time range is 180m, and the default parameter is 3. To run detection on an aggregated expression, which is an instant vector, use a subquery range instead:
anomaly_detect(sum(node_memory_free_bytes)[180m:],3)

Example:

Step 1: Log in to ARMS Prometheus or Grafana and select the corresponding Prometheus data source

Select the corresponding data source:

Step 2: Select indicators and view them

Step 3: Enter the anomaly detection operator

About the Prometheus intelligent detection operator

The intelligent detection operator in Alibaba Cloud Prometheus monitoring draws on dozens of leading algorithms from the industry. It builds an indicator profile for common indicator types and adaptively selects the best model for the detection computation. After an indicator's data is fed into the model, the model first builds a profile of that indicator, covering stability, jitter, trend, periodicity, whether a special holiday or event is involved, and so on. Based on these features, the model adaptively selects the best single algorithm or combination of algorithms for the indicator, to ensure the best overall result. Currently supported functions include surge detection, burr detection, and period identification (identifying periodicity and period offset).

By integrating the intelligent detection operator into Alibaba Cloud Prometheus monitoring, we hope to provide users with an out-of-the-box intelligent detection service that is continuously updated. Users can already view and use the operator in Alibaba Cloud Prometheus monitoring. An intelligent detection alarm function based on native ARMS configuration, along with dynamic display in Grafana, will be introduced in the near future.


How do you use anomaly detection in real scenarios? The Alibaba Cloud Prometheus intelligent detection operator is here