Alarm convergence
Alertmanager provides four methods for convergence:
- Grouping
- Inhibition
- Silencing
- Delay
Grouping (group)
Grouping: combines alarms of a similar nature into a single notification.
- Advantage: Suppose you have a large number of MySQL alarms but want to analyze the problem instance by instance. You can group by instance: each alarm is placed in its instance's group, and each group is merged into one message and sent to the recipient. O&M personnel therefore receive one email per instance, which reduces the number of alarm messages and makes troubleshooting easier.
Example: Suppose MySQL A raises a high-CPU alarm while another instance, MySQL B, goes down, and the monitoring system also detects that B's IO thread and SQL thread have stopped. The alarms are assigned to different groups by instance ID, so O&M ends up receiving two messages: one about the high CPU usage of MySQL A, and one reporting that MySQL B has failed.
group_by: ['alertname']   # labels used to group alarms
group_wait: 10s           # how long to wait before a new group's first notification
group_interval: 10s       # alarm sending interval
repeat_interval: 1h       # interval for sending repeated alarms
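The MySQL example above groups alarms per instance rather than per alarm name. A minimal sketch of such a route follows. It assumes every alarm carries an `instance` label, and the receiver name `ops-email` is hypothetical; both are illustrations, not taken from the original setup.

```yaml
route:
  receiver: 'ops-email'      # hypothetical receiver name
  group_by: ['instance']     # one group, and so one message, per instance
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

receivers:
  - name: 'ops-email'        # notification settings (email, webhook, ...) omitted
```

With this grouping, the alarms for MySQL A and MySQL B would land in two separate groups, so O&M would receive two messages.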
Inhibition (inhibition)
Inhibition: once an alarm has fired, stop sending the other alarms that it causes, so that redundant alarms are eliminated.
- Advantage: Suppose a host running the MySQL service fails. The host-failure alarm and the MySQL-failure alarm may reach Alertmanager in either order, so O&M personnel may also receive them in either order. If the "MySQL service is down" alarm arrives first, you may start troubleshooting in the wrong direction, because it is not the root cause. With an inhibition rule you can suppress the MySQL alarm for that host, which removes the redundant information and points to the real cause of the failure.
Example: Assume that server A runs the MySQL service. When the server suddenly goes down, both alarms are generated. Because you have configured an inhibition rule that suppresses the MySQL alarm, you receive only the alarm indicating that the server is down.
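inhibit_rules:
  - source_match:               # alarms that trigger the inhibition
      severity: 'critical'
    target_match:               # alarms that are suppressed
      severity: 'warning'
    equal: ['id', 'instance']   # labels that must have the same value in both alarms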
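The host and MySQL scenario described above could be expressed as a rule that matches on alarm names instead of severity. This is only a sketch: the alarm names `HostDown` and `MySQLDown` are hypothetical, and it assumes both alarms carry the same `id` label for the affected machine.

```yaml
inhibit_rules:
  - source_match:
      alertname: 'HostDown'     # hypothetical name of the host-failure alarm
    target_match:
      alertname: 'MySQLDown'    # hypothetical name of the MySQL-failure alarm
    equal: ['id']               # suppress only when both alarms refer to the same machine
```

The `equal` list matters here: without it, a single failed host would suppress the MySQL alarms of every other machine as well.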
Silence (silences)
Silence: a simple mechanism for muting alarms during a specific period.
- Advantage: For example, if you expect a bunch of alarms about MySQL instances 1, 2, and 3, you can silence them for that period. You then stop receiving alarms about those instances but still receive alarms about all other instances. In this way you can prevent the system from sending alarms that you already expect.
Example: Suppose you want to run a batch task on MySQL A. The task consumes a lot of system resources and triggers alarms. Your system also has a MySQL B server that provides external services, and you do not want to silence its alarms. You can therefore configure a silence rule for MySQL A only: its alarms are no longer delivered, and other services are not affected.
Configure the silence under Silence in the Alertmanager web UI:
- Select Create Silence to create a new silence configuration.
- Enter the silence period in start and end.
- Name the silence in name. Example: (172.16.214.141:9100
Delay time (Delay)
If the system sent an alarm every minute for as long as a fault lasts, the flood of messages would be very frustrating. The first parameter Alertmanager provides is the repeat interval (repeat_interval), which spaces repeated alarms further apart. Using this parameter alone, however, causes two problems.
- Problem 1: New alarms are not received in time. Suppose an alarm has just been sent and the next one is not due for an hour. If a new alarm fires within that hour, it cannot be sent promptly. Alertmanager therefore provides a second parameter, the group interval (group_interval), so that new alarms are sent in time.
- Problem 2: When a fault occurs, the alarm conditions are met one after another and the alarms reach Alertmanager at different times, so several messages may be received at the start. Alertmanager therefore provides a third parameter, the group wait (group_wait): after a group receives its first alarm, it waits for group_wait to collect the other alarms caused by the fault and sends them together as one message.
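The way the three parameters work together can be sketched in one route. The values and the receiver name `ops-email` are illustrative assumptions, and the timeline in the comments is approximate.

```yaml
route:
  receiver: 'ops-email'     # hypothetical receiver name
  group_by: ['instance']
  group_wait: 30s           # collect alarms for 30s before a new group's first message
  group_interval: 5m        # wait before sending a message about new alarms in that group
  repeat_interval: 1h       # wait before re-sending a message when nothing has changed

# Approximate timeline for one group:
#   0:00:00  first alarm arrives, group is created
#   0:00:30  first message, containing every alarm collected during group_wait
#   0:02:00  a new alarm joins the group
#   0:05:30  next message, which now includes the new alarm (group_interval)
#   later    if nothing changes, the message is repeated about repeat_interval
#            after the last one

receivers:
  - name: 'ops-email'       # notification settings omitted
```

A short group_wait keeps the first message timely, while a long repeat_interval keeps an unresolved fault from flooding the mailbox.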