Dong Shan-dong & Bai Yu

For most people, information is an abstract concept. We often say that something contains a lot of information or very little, but it is hard to say exactly how much, for example, how much information a help document or an article really carries. It was not until 1948 that C. E. Shannon proposed the concept of "information entropy", which solved the problem of measuring information quantitatively. Shannon borrowed the term "entropy" from thermodynamics, where thermal entropy is a physical quantity representing the degree of disorder of molecular states. Shannon used information entropy to describe the uncertainty of an information source.

Shannon's information entropy is essentially a mathematical measure of the "uncertainty" we encounter all the time. For example, if the weather report says "it is almost certain to rain this afternoon," we all think of taking an umbrella. If the forecast says "there is a 50% chance of rain this afternoon," we hesitate: the umbrella is a burden to carry around all day if it turns out to be useless. Clearly, there is little uncertainty about rain in the first forecast, and far more in the second.
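
To make this concrete, here is a minimal Python sketch (not from the original article) of Shannon's formula for a two-outcome event, H = -p·log2(p) - (1-p)·log2(1-p): a near-certain forecast carries almost no uncertainty, while a 50/50 forecast is maximally uncertain.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a two-outcome event with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.95))  # ~0.29 bits: a near-certain forecast
print(binary_entropy(0.5))   # 1.0 bit: a 50/50 forecast, maximal uncertainty
```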

Although information entropy is a rather abstract mathematical concept, it can be understood in terms of the probability with which a piece of information occurs. Information entropy and thermodynamic entropy are closely related: according to Charles H. Bennett's reinterpretation of Maxwell's demon, destroying information is an irreversible process, so the destruction of information conforms to the second law of thermodynamics, while generating information introduces negative (thermodynamic) entropy into a system. The more likely a piece of information is to appear, the more widely it tends to be disseminated or cited. From the perspective of information transmission, then, information entropy can represent the value of information, which gives us a standard for measuring that value.

In our daily operations and maintenance (O&M) work, alarm events are the most typical form of information. Faced with a large volume of alarm events every day, evaluating the information value of each alarm has become an important problem.

Monitoring platforms and tools generally use two methods to identify metric anomalies and trigger alarm events. The first is the common approach of setting static or dynamic thresholds. The second is to configure preset rules that fire on specific system events, such as a machine restart. At the same time, O&M teams do not rely on a single monitoring tool and often need to configure corresponding alarms in tools at different levels. A rough sketch of both triggering styles follows.
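
The snippet below is a hypothetical illustration of the two triggering styles just described; the function names, metric names, and rule names are invented for this sketch, not a real monitoring tool's API.

```python
def threshold_alarm(metric_name, value, threshold):
    """Method 1: fire an alarm event when a metric crosses a static threshold."""
    if value > threshold:
        return f"[ALARM] {metric_name}={value} exceeds threshold {threshold}"
    return None

# Method 2: fire an alarm event when a preset system rule matches.
PRESET_RULES = {"machine_restart", "disk_full", "process_crash"}  # illustrative

def preset_rule_alarm(event_type):
    if event_type in PRESET_RULES:
        return f"[ALARM] preset rule triggered: {event_type}"
    return None

print(threshold_alarm("cpu_usage", 93.0, 90.0))
print(preset_rule_alarm("machine_restart"))
```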

In this context, multiple monitoring sources and tools may generate a large number of redundant alarms for the same root cause under different monitoring rules. When a widespread failure occurs, this can even escalate into an alarm storm. It is difficult for O&M personnel to quickly and reliably pick out the important, accurate alarms from such a flood, so valid alarms are often drowned out. This leaves O&M teams and alarm products with several pain points:

  • Multiple monitoring alarm sources and frequent false positives produce a large number of repeated, redundant, and low-value events, in which important events are submerged and cannot be identified effectively;
  • Alarm storms caused by large-scale faults;
  • Dirty data, such as test events, mixed in with real events.

What is ARMS intelligent noise reduction?

The intelligent noise reduction function of ARMS relies on NLP algorithms and information entropy theory to build a model that mines pattern rules from a large number of historical alarm events. When a real-time event is triggered, the model labels it with an information entropy value and a noise tag, helping users quickly judge the importance of the event.

Introduction to the implementation principle of intelligent noise reduction

A large number of historical events accumulate in the event center, and it is difficult to abstract event patterns and value from them. The intelligent noise reduction capability of the ARMS ITSM product in the real-time monitoring service collects different alarm sources onto a unified platform for alarm event processing. It performs pattern recognition on these historical events to mine their internal correlations and builds a machine learning model based on information entropy to help users identify the importance of events. The core steps of the model are as follows (a rough sketch of steps 1-3 follows the list):

  • Step 1: Based on natural language processing and a domain vocabulary, segment and vectorize the event content, so that events can be measured at their minimum granularity (the word level);
  • Step 2: Based on the concept of information entropy in information theory, combined with a TF-IDF model, construct an information entropy value and importance measurement model over the word vectors;
  • Step 3: Use a sigmoid function to complete a nonlinear, normalized "information entropy" measurement of each event;
  • Step 4: Combine the processing records and feedback on historical events to iteratively train and validate the model.
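
The following simplified Python sketch is an assumed illustration of steps 1-3, not the actual ARMS implementation: crude whitespace tokenization stands in for NLP word segmentation, per-word self-information -log2 p(w) over historical word frequencies stands in for the TF-IDF-based entropy measure, and a sigmoid squashes the result into (0, 1). All parameter values are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Step 1: crude word segmentation; a real system would use an NLP
    # tokenizer plus a domain vocabulary.
    return text.lower().split()

def build_word_probs(historical_events):
    # Learn word frequencies from historical alarm events.
    counts = Counter()
    for event in historical_events:
        counts.update(tokenize(event))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def event_entropy(event, word_probs, unseen_prob=1e-6):
    # Step 2: average self-information -log2 p(w) of the event's words.
    # Rare words carry a lot of information; words repeated across many
    # historical events carry little.
    tokens = tokenize(event)
    if not tokens:
        return 0.0
    info = sum(-math.log2(word_probs.get(w, unseen_prob)) for w in tokens)
    return info / len(tokens)

def sigmoid_score(entropy, center=10.0, scale=3.0):
    # Step 3: nonlinear, normalized score in (0, 1); center and scale
    # are assumed values, not ARMS parameters.
    return 1.0 / (1.0 + math.exp(-(entropy - center) / scale))

# Step 4 would periodically rebuild word_probs from new events and
# user feedback (the weekly iteration described below).
```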

Based on the concepts of information content and information entropy from information theory, this approach uses natural language processing to represent the importance of events. By training on a large number of historical events, it helps users iterate a model that identifies event importance, so that when a new real-time event fires, its importance can be identified quickly. Combined with a configurable information entropy threshold, noise events can be filtered and shielded. As time passes and event types and content change, the model adaptively updates itself through periodic iteration (once a week); users do not need to do anything, and the accuracy of the model is maintained.

Business value of intelligent noise reduction

Business value 1: Intelligently identify repetitive, low-value events and mine novel events

(1) Identification of a large number of repeated and similar events

For events that are repeated and highly similar, and that keep appearing in large numbers among the alarm events, the model continually lowers the information entropy assigned to them, so that their entropy becomes lower and lower and eventually approaches 0. The rationale is that the model expects users to focus their responses on important events; if an event is triggered repeatedly and in large numbers, it usually indicates that users do not care about it, which supports this mechanism from the standpoint of business logic. A small demonstration follows subsection (2) below.

(2) Mining novel events

For events that have rarely or never appeared among historical events, the model pays special attention, identifies them as novel events, and assigns them a large information entropy value, in the expectation that users will pay more attention to them. In this way, the ARMS intelligent noise reduction model also helps users identify important events.
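
Continuing the hypothetical sketch from the implementation section, the toy run below (invented data) shows both behaviors at once: a heavily repeated event scores near 0, while a never-before-seen event scores high.

```python
# Toy history: one event repeated 500 times plus a single rare event.
history = ["cpu usage high on host-1"] * 500 + ["disk io latency spike on host-2"]
probs = build_word_probs(history)

repeated = "cpu usage high on host-1"
novel = "kernel panic detected on database primary"
print(round(sigmoid_score(event_entropy(repeated, probs)), 2))  # ~0.07 -> likely noise
print(round(sigmoid_score(event_entropy(novel, probs)), 2))     # ~0.91 -> novel event
```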

Business value 2: Support for customized requirements and settings

For user test events or events with specific fields, we often want to handle them in a customized way. For example, test events may be triggered only to verify the end-to-end alert process and require no action at all, while events containing certain critically important field information need to be handled first.

Business value 3: The model keeps growing

For users with a small number of historical events (fewer than 1,000), enabling this function is generally not recommended, because with too few historical events the model is difficult to train fully and cannot reliably recognize the internal patterns and rules. However, once the function is enabled, the model iterates weekly on the newly arrived events. Without requiring any attention from users, the model both adaptively tracks changes in event patterns and, where the original event count was insufficient, keeps iterating until it is fully trained.

Best practices

Usage process

Step 0: Entry point

Step 1: Enable the function

You can enable intelligent noise reduction when you find that there are too many events, or too many repeated, low-value, or invalid events.

Step 2: Use

After the function is enabled, one month of historical event data is pulled for model training (if there are too many events within the month, currently only a subset is pulled). Click intelligent noise reduction to enter the details page.

Step 3: Parameter setting

Once you have a deeper understanding of this function, you can start setting keywords for event prioritization and masking. For details about priority words and shielding words, see the term explanations below.

Term explanations

  • Noise event threshold: After intelligent noise reduction is enabled, an information entropy value is calculated for each new event. The noise event threshold is the dividing line between noise and non-noise events.
  • Noise event: An event whose information entropy is lower than the configured entropy threshold is called a noise event.
  • Non-noise event: Events whose information entropy is greater than or equal to the configured entropy threshold are collectively called non-noise events.
  • Priority words: In the keyword settings, users can configure words they want to see first, such as "important" or "critical". When the event name or event content contains a configured priority word, the priority of the event is raised accordingly so that it is not identified as a noise event.
  • Shielding words: In the keyword settings, users can configure words they consider unimportant, such as "test". When the event name or event content contains a configured shielding word, the information entropy of the event is directly set to 0 (if the entropy threshold is set above 0, the event is identified as a noise event). A sketch of how these keywords interact with the threshold follows this list.
  • Top 50 common words: Based on statistical learning over historical events, the model maintains a word frequency table of event words. The table is sorted by frequency of occurrence, and the top 50 words are selected for display.
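
The snippet below is a hypothetical sketch (not the ARMS implementation) of how priority words, shielding words, and the noise event threshold could interact; the threshold value, word lists, and check order are assumptions.

```python
PRIORITY_WORDS = {"important", "critical"}   # words the user wants surfaced
SHIELD_WORDS = {"test"}                      # words the user considers noise
NOISE_THRESHOLD = 0.3                        # assumed entropy threshold

def classify_event(name, content, entropy):
    text = f"{name} {content}".lower()
    if any(w in text for w in SHIELD_WORDS):
        # Shielding word hit: entropy is forced to 0, so the event falls
        # below any threshold > 0 and is treated as noise.
        entropy = 0.0
    elif any(w in text for w in PRIORITY_WORDS):
        # Priority word hit: lift the entropy to at least the threshold
        # so the event cannot be identified as noise.
        entropy = max(entropy, NOISE_THRESHOLD)
    label = "noise" if entropy < NOISE_THRESHOLD else "non-noise"
    return entropy, label

print(classify_event("CPU alert", "test run, please ignore", 0.8))        # noise
print(classify_event("Disk alert", "critical: volume nearly full", 0.1))  # non-noise
```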

Q&A

When should this function be enabled?

For users with more than 1,000 historical events, ARMS intelligent noise reduction is enabled automatically.

For users with only a small number of historical events, the function can be enabled manually, but the model needs a period of iterative tuning before its results stabilize.

Do model parameters need to be modified?

In the initial stage, you are advised to keep the default settings.

After you become familiar with the function, you can try setting priority words, shielding words, and the information entropy threshold to meet more customized requirements.
