1. Installation of Alertmanager

1. Download

2. Installation

# Download the installation package for your platform
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.darwin-amd64.tar.gz
# Decompress
tar zxvf alertmanager-0.21.0.darwin-amd64.tar.gz
# Rename the extracted directory
mv alertmanager-0.21.0.darwin-amd64 alertmanager

3. Start

# Specify the configuration file and the listen port
./alertmanager --config.file=alertmanager.yml --web.listen-address=":9093"
# Display help information
./alertmanager --help
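To confirm the process came up correctly, Alertmanager's health endpoint can be queried. A quick sketch, assuming the default port 9093 configured above:

# Should return "OK" when Alertmanager is up
curl http://localhost:9093/-/healthy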

4. Integration of Alertmanager and Prometheus

Modify the prometheus.yml configuration file

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093 # Alertmanager address (the port configured above)
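After changing prometheus.yml, Prometheus must re-read its configuration. A minimal sketch of two common ways to trigger a reload (assumes Prometheus runs locally; the HTTP variant requires starting Prometheus with --web.enable-lifecycle):

# Option 1: send SIGHUP to the Prometheus process
kill -HUP $(pgrep prometheus)
# Option 2: use the reload endpoint (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload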

Prometheus integration reference link: prometheus.io/docs/promet…

2. Alarm groups

The grouping mechanism combines alarms of a certain type into one alarm to avoid sending too many alarm emails.

**For example:** Suppose Prometheus monitors 3 servers and they all go down at the same time. Without grouping, 3 separate alarms would be sent; with grouping, they are consolidated into one combined alarm.

1. Alarm rules

If the server is down for more than one minute, an alarm email is sent.

groups:
- name: Test-Group-001 # The group name; must be unique within this file
  rules:
  - alert: InstanceDown # The alarm name; must be unique within the group
    expr: up == 0 # If the expression evaluates to true, the alarm should fire
    for: 1m # How long the condition (up == 0) must hold before the alarm fires
    labels:
      severity: warning # Custom label
    annotations:
      summary: "Service {{ $labels.instance }} has gone down."
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
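These rules only take effect if Prometheus loads the rule file. A minimal sketch of the corresponding prometheus.yml entry (the file name rules/alert-rules.yml is an assumption):

rule_files:
  - "rules/alert-rules.yml"   # path to the file containing the alarm rules above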

2. Configure alertmanager.yml

global:
  resolve_timeout: 5m
  # Integrate QQ mail
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_identity: 'xxxxxx'
  smtp_auth_password: 'xxxxxx'
  smtp_require_tls: false
# routing
route:
  group_by: ['alertname'] # Group alarms by the alertname label
  group_wait: 10s # When a new group is created, wait group_wait before sending the first notification
  group_interval: 10s # After a notification has been sent for a group, wait at least group_interval before notifying about new alarms added to that group
  repeat_interval: 120s # If an alarm was sent successfully and is still not resolved, wait repeat_interval before sending it again
  receiver: 'email' # Must match the name of one of the receivers (receivers[n].name)
receivers:
- name: 'email'
  email_configs:
  - to: '[email protected]'
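Before restarting Alertmanager, the configuration can be validated with amtool, which ships in the same release package downloaded above (a quick sketch):

./amtool check-config alertmanager.yml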

3. Grouping-related configuration in alertmanager.yml

route:
  group_by: ['alertname'] 
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 120s 

For details about group_wait, group_interval, and repeat_interval, see the comments above and this link: www.robustperception.io/whats-the-d…
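As a rough worked example of how the three timers interact with the values configured above (timings are approximate and assume the group did not already exist):

# t=0s    first InstanceDown alarm arrives, a new group is created
# t=10s   group_wait has elapsed -> first email is sent
# t=12s   another alarm joins the same group
# t=22s   group_interval since the last send has elapsed -> email with the updated group
# t=142s  alarms still unresolved -> repeat_interval has elapsed -> email is sent again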

4. Email sending results

3. Alarm suppression

When an alarm of a certain type is generated, other related alarms do not need to be sent.

** For example: ** If we monitor the CPU usage of a particular machine, say 80% and 90%, we might want to stop sending 80% emails if the CPU usage reaches 90%.

1. Alarm rules

An alarm (Cpu01) fires if the average CPU usage over the past 5 minutes exceeds 80%.

An alarm (Cpu02) fires if the average CPU usage over the past 5 minutes exceeds 90%.

groups:
- name: Cpu
  rules:
    - alert: Cpu01
      expr: "(1 - avg(irate(node_cpu_seconds_total{mode='idle'}[5m])) by (instance,job)) * 100 > 80"
      for: 1m
      labels:
        severity: info # Custom label, info level
      annotations:
        summary: "Service {{ $labels.instance }} CPU usage is too high"
        description: "CPU usage of {{ $labels.instance }} of job {{ $labels.job }} has been too high over the past 5 minutes: {{ humanize $value }}."
    - alert: Cpu02
      expr: "(1 - avg(irate(node_cpu_seconds_total{mode='idle'}[5m])) by (instance,job)) * 100 > 90"
      for: 1m
      labels:
        severity: warning # Custom label, warning level
      annotations:
        summary: "Service {{ $labels.instance }} CPU usage is too high"
        description: "CPU usage of {{ $labels.instance }} of job {{ $labels.job }} has been too high over the past 5 minutes: {{ humanize $value }}."
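Rule files can be checked for syntax errors before reloading Prometheus, using promtool from the Prometheus release (a quick sketch; the file name cpu-rules.yml is an assumption):

./promtool check rules cpu-rules.yml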

2. Configure suppression rules in alertmanager.yml

Suppression rules:

If an alarm with alertname = Cpu02 and severity = warning is firing, suppress alarms whose severity label is info, provided that the source and target alarms have the same value for the instance label.

# Suppression rules reduce the volume of alarm notifications
inhibit_rules:
- source_match: # When an alarm matching these labels is firing, alarms matching target_match are suppressed
    alertname: Cpu02 # The alertname label is Cpu02
    severity: warning # The custom severity label is warning
  target_match: # The alarms to be suppressed
    severity: info # Suppress alarms whose severity label is info
  equal:
  - instance # The instance label must have the same value in the source and target alarms

Suppression Rule Configuration

3. Email sending results

Only the warning alarm (Cpu02) is sent; the info alarm (Cpu01) is suppressed and no email is sent for it.
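To double-check what Alertmanager currently sees, active and inhibited alarms can be listed with amtool (a sketch; the --inhibited flag and the URL are assumptions based on amtool 0.21):

# List active alarms
./amtool alert query --alertmanager.url=http://localhost:9093
# Include inhibited alarms as well
./amtool alert query --inhibited --alertmanager.url=http://localhost:9093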

4. Alarm silence

Silencing defines a quiet period during which no alarm notifications are sent.

**For example:** When the system is taken down for maintenance for a period of time, many alarms may be generated. Since these alarms are meaningless during the maintenance window, a silence rule can be configured to filter them out.

1. Configure silence rules

Silences are created from the Alertmanager web console or with the amtool command-line tool.
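A minimal sketch of managing a silence with amtool (the matcher, duration, and comment are assumptions for illustration):

# Silence InstanceDown alarms for 2 hours
./amtool silence add alertname=InstanceDown --duration=2h --author="huan" --comment="planned maintenance" --alertmanager.url=http://localhost:9093
# List existing silences
./amtool silence query --alertmanager.url=http://localhost:9093
# Expire a silence early
./amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093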

After the preceding configuration, the alarm information cannot be received.

5. Alarm routing

1. Prepare the alertmanager.yml configuration file

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_identity: 'xxxxx'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: false

# The root route must not contain match or match_re; any alarm that does not match a child route is handled by the root route.
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 120s
  receiver: 'default-receiver'
  routes:
  - match_re:
      alertname: 'Cpu.*'  # If the alarm name starts with Cpu, send the alarm to receiver-01
    receiver: 'receiver-01'
  - match:
      alertname: 'InstanceDown' # If the name of the alarm is InstanceDown, it is sent to receiver-02
    receiver: 'receiver-02'
    group_by: ['instance'] # Group by the instance label
    continue: true  # true: after this route matches, continue matching the routes that follow it
    routes:
    - match:
        alertname: 'InstanceDown' # If the alarm name is InstanceDown, it is sent to receiver-03
      receiver: 'receiver-03'

# Define the 4 receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  - name: 'receiver-01'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  - name: 'receiver-02'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  - name: 'receiver-03'
    email_configs:
      - to: '[email protected]'
        send_resolved: true

inhibit_rules:
  - source_match:
      alertname: Cpu02
      severity: warning
    target_match:
      severity: info
    equal:
      - instance

Alarm result:

1. Alarms whose name starts with Cpu are sent to receiver-01 ([email protected]).

2. Alarms named InstanceDown are sent to receiver-02 and receiver-03 ([email protected] and [email protected]).

If continue is true, matching continues against the routes that follow; if continue is false, matching stops at the first matching route.

If an alarm does not match any route, the root route handles it.

Visit www.prometheus.io/webtools/al… to view the routing tree.
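The routing can also be checked locally with amtool before reloading Alertmanager (a sketch; assumes the amtool binary from the Alertmanager release and the configuration file above):

# Show the routing tree
./amtool config routes show --config.file=alertmanager.yml
# Which receiver would an InstanceDown alarm be routed to?
./amtool config routes test --config.file=alertmanager.yml alertname=InstanceDown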

2. Route matching

The alarm data enters the routing tree through the top-level route. The root route must match all alarm data, so match and match_re cannot be configured on it.

Each route can have its own child routes. **For example:** If an alarm's severity is normal, user A is notified; if the alarm persists for a period of time and becomes serious, additional users are notified as well.

By default, after an alarm enters from the root route, it traverses all the child routes below it.

If continue = false on a route, matching stops at the first matching route.

If continue = true, matching continues.

If none is matched, the root route is used.
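A minimal sketch of the "normal vs. serious" example above, assuming the severity label is raised to critical when the alarm becomes serious; the receiver names user-a and user-bc are hypothetical:

route:
  receiver: 'user-a'        # ordinary alarms are sent to user A
  routes:
  - match:
      severity: critical    # the alarm has become serious
    receiver: 'user-bc'     # notify the additional users instead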

6. Customize mail templates

1. Define an alarm template

cat email.template.tmpl
{{ define "email.template.tmpl" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range .Alerts }}
Alarm name: {{ .Labels.alertname }} <br>
Instance name: {{ .Labels.instance }} <br>
{{ .Annotations.description }} <br>
Level: {{ .Labels.severity }} <br>
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range .Alerts }}
Resolved - the alarm has recovered. <br>
Alarm name: {{ .Labels.alertname }} <br>
Instance name: {{ .Labels.instance }} <br>
{{ .Annotations.description }} <br>
Level: {{ .Labels.severity }} <br>
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
End time: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br>
{{ end }}{{ end -}}
{{- end }}

2. Modify the alertmanager.yml configuration file

1. Configure where the alarm templates are loaded from

global:
  resolve_timeout: 5m
templates:
- '/Users/huan/soft/prometheus/alertmanager-0.21.0/templates/*.tmpl'

Configure the Templates option

2. Recipients use mail templates

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
        html: '{{template "email.template.tmpl" . }}'

Note:

The template name referenced in html: '{{ template "email.template.tmpl" . }}' must match the name declared by {{ define "email.template.tmpl" }} in the template file.

7. Reference links

1. Download AlertManager

2. Official document of AlertManager

3. Alertmanager and Prometheus integration reference link

4. Explanation of group_wait, group_interval, and repeat_interval in alarm grouping

5. Configure suppression rules

6. The old value is included after the alarm is cleared

7. aleiwu.com/post/promet…

8. View the alarm tree