I. Prerequisites
For every target that Prometheus monitors, there is an up metric that tells us whether the service is online.
up == 0 means the service is offline.
up == 1 means the service is online.
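As a rough illustration of how the expression up == 0 behaves, the sketch below mimics a filtering comparison in plain Python: only series whose value equals 0 survive, so a non-empty result means something is offline. The sample targets and the helper name filter_down are made up for the example, and this is not how Prometheus evaluates queries internally.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot of the up metric: (labels, value) per monitored target.
Sample = Tuple[Dict[str, str], float]

UP_SAMPLES: List[Sample] = [
    ({"job": "demo", "instance": "192.168.1.1:9080"}, 0.0),
    ({"job": "demo", "instance": "192.168.1.2:9080"}, 1.0),
]


def filter_down(samples: List[Sample]) -> List[Sample]:
    """Keep only the series whose value is 0, like the expression up == 0."""
    return [(labels, value) for labels, value in samples if value == 0]


down = filter_down(UP_SAMPLES)
for labels, _ in down:
    print(f"{labels['instance']} (job {labels['job']}) is offline")
```

An empty list from filter_down corresponds to the case where every service is online and no alarm is needed.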
II. Requirement
Generate an alarm for any service that has been offline for more than one minute.
III. Implementation steps
1. Write alarm rules
groups:
  - name: Test-Group-001  # the group name must be unique in this file
    rules:
      - alert: InstanceDown  # name of the alarm (must be a valid label value)
        expr: up == 0  # an alarm is required when this expression returns a result
        for: 1m  # how long the condition must hold (i.e., the duration of up == 0) before the alarm fires
        labels:
          severity: warning  # user-defined label
        annotations:
          summary: "Service {{ $labels.instance }} is offline."
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
Note:
1. for specifies how long the alarm condition must keep holding, after the threshold is first reached, before the alarm data is sent.
2. labels specifies user-defined labels to attach to the alarm. If a label with the same name already exists, it is overwritten. Templates can be used in label values.
3. annotations can also use templates: $labels represents the labels of the alarm data, and {{ $value }} represents the value of the time series.
2. Set the location of the alarm rules in prometheus.yml
rule_files:
  - "rules/*_rule.yaml"
rules/*_rule.yaml: loads all files ending in _rule.yaml under the rules directory, which sits alongside prometheus.yml.
Note: ./promtool check config prometheus.yml checks whether the configuration file is written correctly.
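To make the pattern concrete, here is a small sketch of which file names rules/*_rule.yaml would pick up, assuming shell-style globbing in which * does not cross a directory separator. The helper matches_rule_glob and the file names are hypothetical; the snippet only imitates the matching and does not call Prometheus.

```python
from fnmatch import fnmatchcase


def matches_rule_glob(path: str, pattern: str = "rules/*_rule.yaml") -> bool:
    """Return True if path matches pattern, with * never crossing a '/'."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    # Compare component by component so a wildcard stays inside one directory level.
    return all(fnmatchcase(p, pat) for p, pat in zip(path_parts, pattern_parts))


# Hypothetical file names, relative to the directory that holds prometheus.yml.
for candidate in [
    "rules/instance_rule.yaml",
    "rules/instance_rule.yml",
    "rules/nested/disk_rule.yaml",
]:
    print(candidate, matches_rule_glob(candidate))
```

Under this assumption, a file with the .yml extension or one placed in a subdirectory of rules would be skipped silently, which is worth checking when a rule does not show up.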
3. Screenshot of the configuration file
4. View the alarm data on the page
In the figure above, you can see that alarm data has three states: Inactive, Pending, and Firing.
5. Query the alarm data generated by Prometheus
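Prometheus records pending and firing alarms in a synthetic time series named ALERTS, labelled with alertname and alertstate, so the alarm data can be queried like any other metric. The sketch below only builds an instant-query URL for that series; the base address http://localhost:9090 and the helper name alerts_query_url are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed address of a local Prometheus server; adjust for your setup.
BASE_URL = "http://localhost:9090"


def alerts_query_url(alertname: str, state: str = "firing") -> str:
    """Build an instant-query URL for the ALERTS series of one alarm."""
    expr = f'ALERTS{{alertname="{alertname}", alertstate="{state}"}}'
    query = urlencode({"query": expr})  # percent-encodes braces and quotes
    return f"{BASE_URL}/api/v1/query?{query}"


url = alerts_query_url("InstanceDown")
print(url)
```

The same expression, ALERTS{alertname="InstanceDown"}, can also be typed directly into the query box of the Prometheus web page.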
IV. States of alarm data
1. Inactive
The expr expression returns no result, that is, the alarm threshold has not been reached.
2. Pending
The expr expression holds, but the alarm duration (the value of for) has not yet been met.
3. Firing
Both the threshold and the duration of the alarm have been reached. In practice, while an alarm is Firing, the same alarm is not generated again until it has been resolved. For example, if the service at 192.168.1.1:9080 is down for more than 1 minute, an alarm enters the Firing state. As long as the machine does not recover, the same alarm is not generated a second time.
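The three states can be sketched as a small state machine. The code below is a simplified model, not Prometheus's implementation: it assumes one up sample per rule evaluation and a 30-second evaluation interval (both assumptions), and uses the 1-minute for value from the rule above.

```python
from typing import List, Optional

FOR_SECONDS = 60  # mirrors "for: 1m" in the rule


def alert_states(up_values: List[int], interval: int = 30) -> List[str]:
    """Return the alarm state at each evaluation, given one up sample per evaluation."""
    states: List[str] = []
    active_since: Optional[int] = None  # time at which up == 0 first held
    for i, up in enumerate(up_values):
        now = i * interval
        if up != 0:
            active_since = None  # condition no longer holds, alarm is resolved
            states.append("Inactive")
            continue
        if active_since is None:
            active_since = now  # threshold reached for the first time
        if now - active_since >= FOR_SECONDS:
            states.append("Firing")
        else:
            states.append("Pending")
    return states


print(alert_states([1, 0, 0, 0, 0, 1]))
```

Note that the alarm stays in a single Firing state across consecutive evaluations instead of producing a new alarm each time, which matches the behaviour described for 192.168.1.1:9080.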
V. References
1. prometheus.io/docs/promet…