First, background knowledge

For every service monitored by Prometheus, there is an up metric that tells us whether the service is online.

up == 0 means the service is offline.
up == 1 means the service is online.

Second, the requirement

Generate an alarm when a service has been offline for more than one minute.

Third, implementation steps

1. Write alarm rules

groups:
- name: Test-Group-001 # The group name must be unique in this file
  rules:
  - alert: InstanceDown # Alarm name; keep it unique within the group
    expr: up == 0 # PromQL condition; an alarm is needed when it returns a result
    for: 1m # How long the condition must hold (i.e., the duration of up == 0) before the alarm fires
    labels:
      severity: warning # user-defined label
    annotations:
      summary: "Service {{ $labels.instance }} is offline."
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

Note:
1. for specifies how long the condition must keep holding after the alarm threshold is reached before the alarm data is sent.
2. labels specifies user-defined labels to attach to the alarm. If a label with the same name already exists, it is overwritten. Templates can be used.
3. Templates can also be used in annotations: $labels holds the labels of the alarm data, and {{ $value }} is the value of the time series.
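The template variables mentioned in the note can be sketched as follows. This is a minimal illustration, not part of the original rule file: the group name Test-Group-002 and the alert name InstanceDownWithValue are hypothetical, chosen only to avoid clashing with the rule above.

```yaml
groups:
- name: Test-Group-002            # hypothetical group name
  rules:
  - alert: InstanceDownWithValue  # hypothetical alert name
    expr: up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      # $labels.<name> reads a label of the series that triggered the alarm;
      # $value is the sample value of that series (always 0 for up == 0).
      description: "{{ $labels.instance }} (job {{ $labels.job }}) reports up = {{ $value }}."
```

With this expression $value is not very informative, because it can only be 0. It becomes more useful when the expression compares against a threshold.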

2. Set the location of the alarm rules in prometheus.yml

rule_files:
  - "rules/*_rule.yaml"

rules/*_rule.yaml: load all files ending in _rule.yaml in the rules directory. A relative path is resolved against the directory that contains prometheus.yml.

Note: running ./promtool check config prometheus.yml checks whether the configuration file is written correctly.

3. Screenshot of the configuration file

4. View the alarm data on the page

In the figure above, you can see the alarm data in three states: Inactive, Pending, and Firing.

5. Query the alarm data generated by Prometheus

Fourth, status of alarm data

1. Inactive

The expr expression does not hold, so the alarm threshold has not been reached.

2. Pending

The expr expression holds, but the condition has not yet lasted as long as the alarm duration, that is, the value of for.

3. Firing

Both the alarm threshold and the alarm duration have been reached. In testing, once an alarm is Firing, no additional alarm data is generated for it until the alarm is resolved. For example, if the service at 192.168.1.1:9080 is down for more than 1 minute, an alarm is generated and enters Firing. As long as the machine does not recover, the same alarm is not generated again.
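One way to check the Pending-to-Firing behaviour without taking a real service down is a rule unit test run with promtool test rules. The file below is only a sketch under assumptions. The rule file name rules/instance_rule.yaml and the job name demo are hypothetical. The expected annotation strings must match whatever your rule actually renders, so adjust them to your own wording.

```yaml
# Hypothetical test file, e.g. instance_rule_test.yaml
rule_files:
  - rules/instance_rule.yaml   # assumed name of the file holding InstanceDown

evaluation_interval: 1m

tests:
  - interval: 1m               # spacing between the input samples below
    input_series:
      # up stays at 0 for the samples at 0m, 1m, 2m and 3m
      - series: 'up{job="demo", instance="192.168.1.1:9080"}'
        values: '0 0 0 0'
    alert_rule_test:
      # By 3m the condition has held longer than "for: 1m", so the alarm should be Firing
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: 192.168.1.1:9080
              job: demo
            exp_annotations:
              # Must equal the rendered annotations of the rule verbatim
              summary: "Service 192.168.1.1:9080 is offline."
              description: "192.168.1.1:9080 of job demo has been down for more than 1 minute."
```

Assuming the test file is saved under that name, it would be run as ./promtool test rules instance_rule_test.yaml.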

Fifth, reference documents

1. prometheus.io/docs/promet…