For a background service running 24/7, monitoring and alarms are the cornerstone of stable operation. Many developers have had this experience: to avoid missing a problem, they configure strict monitoring and alarms on every metric of the service, and then find themselves receiving a flood of invalid alarms every day. The flood gradually numbs their vigilance, the early signs of real problems get lost in the noise, and eventually a serious fault occurs.
How to improve the effectiveness of alarms, identify problems accurately, and avoid drowning in a flood of invalid alarms is the topic of this article.
Alarms are the basis of reliability
Let’s start by looking at why alarms matter and why they deserve so much optimization effort. Although we all hope a service will run without trouble, the fact is that no system is 100% free of problems; we can only continuously improve the reliability of the service. What we aim for is:
- Be aware of and in control of the current state of the service
- Be able to find the problem immediately and locate the cause of the problem quickly
To achieve these two points, we can only rely on sound monitoring and alarms to present the complete running state of the service. However, it is impossible to stare at dashboards all the time and watch every aspect, so the only way to learn about the system state without actively watching it is to detect anomalies automatically through alarms. Alarms are therefore one of the most important means for a team to monitor service quality and availability. MTBF, MTTF, and MTTR are commonly used to describe the time metrics related to system failures.
- MTTF (Mean Time To Failure): the average time a system runs normally before a failure occurs, i.e. the average length of a trouble-free operating interval. MTTF = ∑T1 / N
- MTTR (Mean Time To Repair): the average time from the occurrence of a fault to the completion of repair. MTTR = ∑(T2 + T3) / N
- MTBF (Mean Time Between Failures): the average time between two successive failures. MTBF = ∑(T1 + T2 + T3) / N

Here T1 is the normal-operation time before a failure, T2 is roughly the time from the failure occurring until repair begins (detection and response), and T3 is the repair time, summed over N failures.
Reliability is the pursuit of a higher MTTF and a lower MTTR. The best case is no failure at all, but since there is no 100% reliable system, when a failure or anomaly does occur we need to minimize the MTTR. The point of an alarm is to minimize the T2 + T3 time.
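To make the three metrics concrete, here is a minimal Python sketch that computes them from a list of failure records; the record fields and the example numbers are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    t1: float  # normal-operation time before this failure
    t2: float  # time from the failure occurring until repair begins
    t3: float  # repair time

def reliability_metrics(records):
    """Compute MTTF, MTTR and MTBF over N recorded failures."""
    n = len(records)
    mttf = sum(r.t1 for r in records) / n                # MTTF = sum(T1) / N
    mttr = sum(r.t2 + r.t3 for r in records) / n         # MTTR = sum(T2 + T3) / N
    mtbf = sum(r.t1 + r.t2 + r.t3 for r in records) / n  # MTBF = sum(T1 + T2 + T3) / N
    return {"MTTF": mttf, "MTTR": mttr, "MTBF": mtbf}

# Example: two failures (times in hours); a better alarm shrinks T2 and thus MTTR.
print(reliability_metrics([FailureRecord(100, 0.5, 1.5), FailureRecord(80, 0.2, 0.8)]))
```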
Practical problems faced by alarms
An ideal alarm system has no false positives (a normal state reported as abnormal) and no missed alarms (an abnormal state mistaken for normal). The ideal model therefore satisfies three points:
- Zero false positives: every alarm is a problem that needs handling
- Zero missed alarms: every abnormal problem is detected
- Timely detection: problems are discovered at the first moment, even before they develop into faults
In practice, the ideal cannot be reached. Reducing missed alarms is not difficult: monitor every possible scenario and configure an alarm for it. The real problem is that alarms then become too many rather than too few, and a great deal of time goes into handling invalid ones. With so many alarms, it is easy to overlook the useful ones, so anomalies are discovered late or potential risks are ignored. The biggest problem for alarming is therefore how to reduce invalid alarms and improve alarm efficiency.
First, look at the cause of invalid alarms.
A monitoring system should answer two questions: what is broken, and why. “Something is broken” is the symptom, and the “why” is the cause (possibly an intermediate cause). Distinguishing symptom from cause is the most important concept in building a high signal-to-noise monitoring system.
In practice, it is almost impossible to do either of these perfectly, but we can keep moving closer to the ideal model.
Alarms are generally triggered by symptoms, while faults are determined by the causes behind those symptoms. The fact that the same symptom can stem from different causes is the core source of false alarms.
For example, a request-failure alarm may be caused by a problem in the request content itself, an exception on an upstream machine, or a processing exception in our own service. Ideally each alarm would correspond to a single cause, but in practice, because reality is complex, an accurate distinction cannot always be made.
The idea behind reducing false alarms is therefore to narrow down the possible causes of a symptom as much as possible; if a symptom can be narrowed down to a single cause, the problem can be identified directly.
Alarm classification
An alarm that the CPU is running full needs immediate attention. A single-machine health alarm usually does not: in the normal case the abnormal machine is replaced automatically (in abnormal cases the replacement may fail), and even if nobody handles it for a while there is no impact, because the load balancer removes the unhealthy machine automatically.
Alarms can of course be divided into many levels, but for the purpose of reducing invalid alarms we can group them into three categories according to whether we must stop what we are doing and handle them at once:
- Urgent: an action must be taken immediately when the alarm is received, for example the CPU or memory is exhausted. Criteria: the issue affects the business, or there are potential unknown risks
- Not urgent: the system cannot resolve the situation automatically, but leaving it unhandled for a while causes no impact, for example an access failure on a single machine. Criteria: no impact on the service and essentially no potential risk, but manual intervention is eventually required
- No handling needed: a known kind of fault that the system recovers from automatically, without manual intervention, for example an abnormal machine that the operations platform replaces automatically after a while
For each kind of exception, determine whether it needs to be handled immediately and optimize it accordingly.
Exceptions that need no handling do not need alarms at all. If you still want to be aware of such events, have them sent by email; there is no need to interrupt work through alarm channels.
For non-urgent alarms, if the tooling supports it, push them periodically as work orders and handle them in batches; real-time alarms are unnecessary, and avoiding them reduces interruptions to normal work. Where the tooling does not support this, adjust the alarm interval and the convergence policy for repeated alarms: single-machine anomalies, for example, can be aggregated into an hourly alarm to avoid repeated interruptions. Of course, if more than a certain percentage of machines are abnormal at the same time, the alarm should be upgraded to an urgent one and delivered in real time, as in the sketch below.
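As a rough illustration of such a convergence and escalation policy, here is a minimal sketch in Python; the fleet size, the 10% escalation ratio, the hourly window, and all of the names are assumptions for illustration, not any particular platform's API.

```python
import time
from collections import defaultdict

FLEET_SIZE = 100          # assumed fleet size
ESCALATE_RATIO = 0.10     # assumed: >10% of machines abnormal at once is urgent
DIGEST_INTERVAL_S = 3600  # assumed: at most one digest entry per host per hour

_abnormal_hosts = set()             # hosts currently reporting anomalies
_last_digest = defaultdict(float)   # host -> last time it was added to a digest

def on_host_anomaly(host):
    """Decide how to handle a single-machine anomaly report."""
    now = time.time()
    _abnormal_hosts.add(host)  # (removal on recovery is omitted in this sketch)

    # Fleet-wide escalation: many machines abnormal at the same time is urgent.
    if len(_abnormal_hosts) / FLEET_SIZE > ESCALATE_RATIO:
        return "urgent: deliver in real time (page the on-call engineer)"

    # Otherwise converge repeated single-host alarms into an hourly digest.
    if now - _last_digest[host] >= DIGEST_INTERVAL_S:
        _last_digest[host] = now
        return f"non-urgent: add {host} to the hourly digest / work order"
    return "suppressed: this host was already reported within the last hour"
```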
For urgent alarms, improve their timeliness and accuracy and eliminate invalid alarms as much as possible. How, then, do we identify and judge invalid alarms? Let's look next at the principles for setting alarms.
Alarm Setting Principles
Whenever an alarm fires, the engineer on duty has to stop the work at hand and look into it. This kind of interruption greatly affects efficiency and increases development cost, and the impact is especially severe for engineers in the middle of developing and debugging. So whenever we receive an alarm, we want it to reflect a real anomaly; that is, the alarm should avoid false positives (alarming on a normal state) as far as possible. And whenever an anomaly occurs, an alarm should be issued promptly; that is, alarms should not be missed. False positives and missed alarms are always conflicting indicators.
The following are principles for setting alarms:
- An alarm must be real: it must report a real symptom and indicate that your service is experiencing, or is about to experience, a problem
- An alarm must be descriptive: it must describe the symptom in detail, for example what exception occurred at what point in time
- An alarm must be actionable: receiving it should require you to do something, and alarms that require no action should be removed. Notify if and only if something needs to be done
- Use conservative thresholds for new alarms: when first configuring alarms, expand coverage as much as possible and choose conservative thresholds to avoid missed alarms
- Optimize alarms continuously: keep doing statistical analysis of alarms and reduce false alarms by suppressing, merging, adjusting thresholds, and making alarms reflect causes more accurately. This is a long-term process
For example, suppose an alarm fires when the number of failed requests exceeds a certain threshold, and some maliciously constructed requests also count as failures and trigger it. Such an alarm is neither real nor actionable, because nothing actually needs to be done. For cases like this, we should try to identify the offending requests by their features, so that the cause of the alarm can be distinguished more accurately; a sketch of this idea follows.
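A minimal sketch of that idea, assuming the failure features and the threshold below purely for illustration:

```python
FAILURE_THRESHOLD = 50  # assumed: alarm when more than 50 real failures per window

def is_client_induced(req):
    """Heuristic features of failed requests that are not our service's fault."""
    return (
        req.get("status") in (400, 413)                     # malformed or oversized input
        or len(req.get("body", "")) > 1_000_000             # absurdly large payload
        or req.get("user_agent", "").startswith("BadBot")   # known abusive client
    )

def should_alarm(failed_requests):
    """Alarm only on failures that genuinely need handling on our side."""
    real_failures = [r for r in failed_requests if not is_client_induced(r)]
    return len(real_failures) > FAILURE_THRESHOLD
```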
Make good use of tools
Another essential condition for alarm optimization is being familiar with the alarm platform. If you do not know what the platform can do or how flexibly thresholds can be set, it is hard to optimize alarms sensibly.
- What the monitoring platform can provide: basic service metrics, basic system metrics, and statistics across various dimensions
- Threshold setting: how alarm thresholds are configured and how alarms are delivered, for example by phone or SMS
- Alarm statistics and trends: these help analyze the data and optimize alarms
Take request failures as an example: can the platform distinguish alarms by cause type, can it alarm on success rate, can a minimum duration be required before an alarm fires, and can conditions be based on comparisons with the previous period? Can different thresholds route to different channels, such as SMS versus phone under different conditions; can an SMS alarm that stays unhandled for a while be escalated to a phone call; and can repeated alarms be suppressed? All of these capabilities help us set precise alarm conditions; a rule sketch follows.
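Conceptually, a precise rule that combines several of these capabilities might look like the following sketch; the field names and example values are assumptions, not any specific platform's configuration format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlarmRule:
    metric: str                   # what is measured, e.g. a success rate
    threshold: float              # value that counts as abnormal
    duration_s: int               # condition must hold this long before firing
    channel: str                  # initial notification channel
    escalate_after_s: Optional[int] = None  # unhandled alarm escalates after this long
    escalate_channel: Optional[str] = None  # e.g. SMS escalates to a phone call
    dedup_window_s: int = 600     # suppress identical alarms within this window

# Example: alarm when the success rate stays below 99.5% for 5 minutes,
# notify by SMS first, and escalate to a phone call if unhandled for 15 minutes.
success_rate_rule = AlarmRule(
    metric="http_success_rate",
    threshold=0.995,
    duration_s=300,
    channel="sms",
    escalate_after_s=900,
    escalate_channel="phone",
)
```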
In addition, the statistics and trends the platform provides make targeted optimization possible: check the top-N alarms and the overall trend every day and every week, and optimize the worst offenders first.
Alarm Handling Process
As mentioned earlier, alarm optimization is an ongoing process with no once-and-for-all solution. It does require a certain investment, with someone on duty by day or by week. The engineer on duty should pay attention to the validity of alarms, analyze the causes of invalid ones, and continuously tune thresholds and alarm policies.
A reasonable alarm handling flow:
After an alarm is triggered, the person on duty claims it and handles it. To avoid a flood of repeated invalid alarms, the same alarm is not triggered again until the handling is complete. After handling, confirm the cause and record it back into the system.
The alarm platform may not support this flow exactly (for example, it may not support claiming an alarm, many repeated alarms may be generated during handling, or there may be no mechanism for recording the cause), but the engineer on duty needs to keep such a process in mind, so that it is always clear which stage every alarm is in, with no ambiguity.
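Even without platform support, the flow can be kept in mind as a simple state machine; the state names below are assumptions for illustration.

```python
from enum import Enum, auto

class AlarmState(Enum):
    TRIGGERED = auto()  # the alarm has fired and nobody has claimed it yet
    CLAIMED = auto()    # the on-duty engineer has claimed it; repeats are suppressed
    RESOLVED = auto()   # handled, with the confirmed cause recorded

class Alarm:
    def __init__(self, name):
        self.name = name
        self.state = AlarmState.TRIGGERED
        self.cause = None

    def claim(self):
        # Claiming suppresses duplicate notifications while handling is in progress.
        self.state = AlarmState.CLAIMED

    def resolve(self, cause):
        # Resolving requires feeding the confirmed cause back into the record.
        self.cause = cause
        self.state = AlarmState.RESOLVED

    def should_notify(self):
        # Only unclaimed alarms keep notifying the on-duty engineer.
        return self.state == AlarmState.TRIGGERED
```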
In addition, the engineer on duty should keep a few points in mind:
- Make sure every alarm can be received: solve this with a receiver group, so that all on-duty engineers are in the same group and membership is easy to update when people change
- Escalation path: judge the seriousness of the problem and, when necessary, bring in more people and resources to eliminate it quickly while it is still in the bud
- Identify the root cause: make sure you understand why the problem happened rather than merely making the symptom go away. For example, if an alarm says a single machine's CPU is maxed out and you simply restart the process, the real problem may be covered up and surface later as a large-scale fault
Finally, when handling alarms, keep one principle in mind: when in doubt, treat the alarm as real and investigate it, rather than assuming it is nothing.
References
- "Site Reliability Engineering: How Google Runs Production Systems" (SRE: Google运维解密)
- MTTR/MTTF/MTBF illustrated: blog.csdn.net/starshinnin…
- "One article to understand monitoring and alarming": zhuanlan.zhihu.com/p/60416209
- Accuracy, precision and recall: www.cnblogs.com/xuexuefirst…
- Alarm configuration principles and experience: wsfdl.com/devops/2018…