Alerting methods in Prometheus

Prometheus is an open-source monitoring tool that has become popular over the last few years. It integrates natively with Kubernetes, which is handy if you run a Kubernetes cluster. Prometheus also ships with Alertmanager, its alerting component. Alerting is a separate part of the Prometheus architecture: you define a set of alert rules with trigger conditions, and Prometheus evaluates them periodically. When a condition is met, an alert is generated and sent to Alertmanager for processing. So how does Alertmanager deliver alerts? There are many ways, for example:

  • Email alerts
  • WeCom (Enterprise WeChat) alerts
  • DingTalk alerts
  • Slack alerts
  • Webhook alerts

There are other channels as well, but the channel is only a tool; what matters is how you use it. This article uses the webhook method: Alertmanager calls an HTTP endpoint with a POST request to push the alert, and the endpoint can then forward it by email, WeChat, DingTalk, and so on.

Calling an endpoint to send alerts by email

The general process is as follows. We first define a set of alert rules. When a rule's condition is met, Alertmanager pushes the alert data to our endpoint. The endpoint then aggregates, summarizes, and tidies up the data, and sends the processed alert by email to the designated person or group. In short: alert rules -> Alertmanager -> webhook endpoint -> email.

The focus here is on how to write this webhook and what to watch out for when writing it. The following sections cover this step by step.

This article assumes you already have a Prometheus monitoring system with alert rules configured.

Configuring Alertmanager

First, configure Alertmanager to call the endpoint. The configuration is as follows:

receivers:
- name: webhook  # appears as 'receiver' in the data sent to the endpoint
  webhook_configs:
  - url: http://10.127.34.107:5000/webhook
    send_resolved: true

That's it! A receiver can have several notification methods configured. Once this is in place, Alertmanager calls the endpoint with a POST request.

Writing a simple endpoint

Here we use Flask to write the simplest possible endpoint:

import json
from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    # Alertmanager sends the alert group as a JSON request body
    prometheus_data = json.loads(request.data)
    print(prometheus_data)
    return "test"

if __name__ == '__main__':
    # Serve the Flask app with gevent on port 5000
    WSGIServer(('0.0.0.0', 5000), app).serve_forever()

The modules imported above need to be installed first:

pip install flask
pip install gevent

Now run the code; the machine listens on port 5000. When Prometheus fires an alert, we can see the data format that Alertmanager sends to the endpoint. Here is an example:

{
    'receiver': 'webhook',
    'status': 'firing',
    'alerts': [
        {
            'status': 'firing',
            'labels': {'alertname': 'Memory Usage', 'instance': '10.127.92.100', 'job': 'sentry', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'Memory usage has exceeded 55%, memory usage: 58%', 'summary': 'memory usage'},
            'startsAt': '2020-12-30T07:20:08.775177336Z',
            'endsAt': '0001-01-01T00:00:00Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28%281+-+%28node_memory_MemAvailable_bytes%7Bjob%3D%22sentry%22%7D+%2F+%28node_memory_MemTotal_bytes%7Bjob%3D%22sentry%22%7D%29%29%29+%2A+100%29+%3E+55&g0.tab=1',
            'fingerprint': '09f94bd1aa7da54f'
        },
        {
            'status': 'firing',
            'labels': {'alertname': 'Memory Usage', 'instance': '10.127.92.101', 'job': 'sentry', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'Memory usage has exceeded 55%, memory usage: 58%', 'summary': 'memory usage'},
            'startsAt': '2020-12-30T07:20:08.775177336Z',
            'endsAt': '0001-01-01T00:00:00Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28%281+-+%28node_memory_MemAvailable_bytes%7Bjob%3D%22sentry%22%7D+%2F+%28node_memory_MemTotal_bytes%7Bjob%3D%22sentry%22%7D%29%29%29+%2A+100%29+%3E+55&g0.tab=1',
            'fingerprint': '8a972e4907cf2c60'
        }
    ],
    'groupLabels': {'alertname': 'Memory Usage'},
    'commonLabels': {'alertname': 'Memory Usage', 'job': 'sentry', 'severity': 'warning', 'team': 'ops'},
    'commonAnnotations': {'summary': 'memory usage'},
    'externalURL': 'http://alertmanager-server:9093',
    'version': '4',
    'groupKey': '{}:{alertname="Memory Usage"}',
    'truncatedAlerts': 0
}
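
To try the endpoint without waiting for a real alert, one option is to POST a hand-written payload to it. The sketch below is only an illustration under a few assumptions: the payload is a trimmed-down imitation of the structure shown above, the instance address 10.0.0.1 and the function names build_test_request and send are made up for this example, and the URL is the one used in the Alertmanager configuration.

```python
import json
import urllib.request

# Assumed address: the webhook URL from the Alertmanager configuration.
WEBHOOK_URL = "http://10.127.34.107:5000/webhook"


def build_test_request(url, alertname, instance, status="firing"):
    """Build a POST request carrying a minimal, hand-written payload."""
    payload = {
        "receiver": "webhook",
        "status": status,
        "alerts": [{
            "status": status,
            "labels": {"alertname": alertname, "instance": instance,
                       "severity": "warning", "team": "ops"},
            "annotations": {"summary": "test alert",
                            "description": "sent by hand"},
            "startsAt": "2020-12-30T07:20:08.775177336Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }],
        "commonLabels": {"alertname": alertname, "team": "ops"},
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request; only call this while the Flask endpoint is running."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status, response.read().decode("utf-8")


# 10.0.0.1 is a made-up instance address for the test alert.
req = build_test_request(WEBHOOK_URL, "Memory Usage", "10.0.0.1")
print(req.get_method(), req.full_url)
```

Calling send(req) while the Flask process is running should make the endpoint print the dictionary, which is a quick way to check the wiring before Alertmanager is involved.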

The json module converts the JSON string into a Python dictionary. The data gives us the following information (very important):

  • Each JSON payload contains alerts of the same type, for example, all memory alerts
  • status: the alert status, which is either firing or resolved
  • alerts: a list of dictionaries, each of which is one specific alert
  • commonLabels: the labels shared by every alert in the payload
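
The keys listed above can be read out of the dictionary directly. The following is a minimal sketch, not part of the original code: describe_payload is a hypothetical helper, and the sample payload is a shortened stand-in with made-up host names.

```python
def describe_payload(payload):
    """Pull out the keys discussed above from one webhook payload (a dict)."""
    alerts = payload.get("alerts", [])
    return {
        "group_status": payload.get("status"),        # 'firing' or 'resolved'
        "alert_count": len(alerts),                   # one entry per alert
        "common_labels": payload.get("commonLabels", {}),
        "instances": [a.get("labels", {}).get("instance") for a in alerts],
    }


# Shortened stand-in payload; host-a and host-b are made-up instance names.
sample_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "Memory Usage", "instance": "host-a"}},
        {"status": "firing",
         "labels": {"alertname": "Memory Usage", "instance": "host-b"}},
    ],
    "commonLabels": {"alertname": "Memory Usage", "team": "ops"},
}

info = describe_payload(sample_payload)
print(info["group_status"], info["alert_count"], info["instances"])
```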

The remaining keys are easy to understand. Looking at the Prometheus rule below shows why this alert was generated.

# cat system.yaml
groups:
    - name: sentry
      rules:
      - alert: "Memory Usage"
        expr: round((1-(node_memory_MemAvailable_bytes{job='sentry'} / (node_memory_MemTotal_bytes{job='sentry'})))* 100) > 85
        for: 5m
        labels:
          team: ops
          severity: warning
          cloud: yizhuang
        annotations:
          summary: "Memory usage is too high and over 85% for 5min"
          description: "The current host {{$labels.instance}}' memory usage is {{ $value }}%"
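
To make the rule expression concrete, here is the same arithmetic as a small sketch. It is only an illustration: the function names are made up, the byte values are invented, and the `for: 5m` condition (the expression must stay true for five minutes before the alert fires) is not modeled.

```python
import math


def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Mirror round((1 - MemAvailable / MemTotal) * 100) from the rule."""
    ratio_used = 1 - mem_available_bytes / mem_total_bytes
    # PromQL's round() resolves ties by rounding up; floor(x + 0.5) does too.
    return math.floor(ratio_used * 100 + 0.5)


def breaches_threshold(mem_available_bytes, mem_total_bytes, threshold=85):
    """True when the rounded usage is above the rule's threshold."""
    return memory_usage_percent(mem_available_bytes, mem_total_bytes) > threshold


GIB = 1024 ** 3
print(memory_usage_percent(2 * GIB, 16 * GIB))  # 2 GiB free out of 16 GiB
print(breaches_threshold(2 * GIB, 16 * GIB))
print(breaches_threshold(8 * GIB, 16 * GIB))
```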

This is the configured alert rule that tells Prometheus when to generate an alert. It is referenced in the Prometheus configuration as follows:

# cat prometheus.yml
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['10.10.10.111:9093']

# The rule files are referenced here
rule_files:
  - "/alertmanager/rule/*.yaml"  # the directory is up to you
.
.
.
# The rest of the configuration is omitted here

Now you know what the alert rule looks like, and why the alert data looks the way it does.

Processing the raw alert data and sending email alerts

The raw alert data is quite regular, so it only needs to be stitched together. There is one problem, though: startsAt and endsAt in each alert are in UTC and need to be converted. Another thing to note is that when the outermost status is firing, the status of each entry in alerts is not necessarily firing; it may also be resolved, as the following payload shows:

{
    'receiver': 'webhook',
    'status': 'firing',
    'alerts': [
        {
            'status': 'resolved',  # resolved, although the outermost status is firing
            'labels': {'alertname': 'CPU Usage', 'instance': '10.127.91.26', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'CPU usage has exceeded 35%, CPU usage: 38%', 'summary': 'CPU usage'},
            'startsAt': '2020-12-30T07:38:38.775177336Z',
            'endsAt': '2020-12-30T07:38:53.775177336Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28100+-+%28avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22sentry%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29+%2A+100%29%29+%3E+35&g0.tab=1',
            'fingerprint': '58393b2abd2c6987'
        },
        {
            'status': 'resolved',
            'labels': {'alertname': 'CPU Usage', 'instance': '10.127.92.101', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'CPU usage has exceeded 35%, CPU usage: 38%', 'summary': 'CPU usage'},
            'startsAt': '2020-12-30T07:42:08.775177336Z',
            'endsAt': '2020-12-30T07:42:38.775177336Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28100+-+%28avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22sentry%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29+%2A+100%29%29+%3E+35&g0.tab=1',
            'fingerprint': 'eaca600142f9716c'
        }
    ],
    'groupLabels': {'alertname': 'CPU Usage'},
    'commonLabels': {'alertname': 'CPU Usage', 'severity': 'warning', 'team': 'ops'},
    'commonAnnotations': {'summary': 'CPU usage'},
    'externalURL': 'http://alertmanager-server:9093',
    'version': '4',
    'groupKey': '{}:{alertname="CPU Usage"}',
    'truncatedAlerts': 0
}
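
Because the outer status cannot be trusted on its own, it can help to look at the per-alert statuses before deciding how to present the email. The sketch below is a hypothetical helper, not part of the original code; the three outcomes it returns match the cases "all firing", "firing and resolved mixed", and "all resolved".

```python
def classify_group(payload):
    """Classify a payload by the statuses of its individual alerts."""
    statuses = {alert["status"] for alert in payload.get("alerts", [])}
    if not statuses:
        return "empty"
    if statuses == {"firing"}:
        return "all firing"
    if statuses == {"resolved"}:
        return "all resolved"
    return "mixed"


def email_subject(payload):
    """Build a subject line from the shared alert name and the classification."""
    name = payload.get("commonLabels", {}).get("alertname", "unknown alert")
    return "[{}] {}".format(classify_group(payload), name)


# The outer status says firing, but both alerts are already resolved.
cpu_payload = {
    "status": "firing",
    "alerts": [{"status": "resolved"}, {"status": "resolved"}],
    "commonLabels": {"alertname": "CPU Usage"},
}
print(email_subject(cpu_payload))
```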

Now that the caveats are clear, let's get started. First, here is the end result I want to achieve:

  • Time zone conversion
  • Different types of alarm information are pushed to different people
  • The alarm content is displayed in a table using HTML

Time zone conversion

Time zone conversion is the easier part. The code is as follows:

import datetime
from dateutil import parser

def time_zone_conversion(utctime):
    format_time = parser.parse(utctime).strftime('%Y-%m-%dT%H:%M:%SZ')
    time_format = datetime.datetime.strptime(format_time, "%Y-%m-%dT%H:%M:%SZ")
    return str(time_format + datetime.timedelta(hours=8))
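
For comparison, the same conversion can be sketched with only the standard library and timezone-aware datetime objects. This is an alternative illustration, not the author's function: to_utc8 is a made-up name, fractional seconds are simply dropped, and the value 0001-01-01T00:00:00Z (the endsAt seen on firing alerts in the sample payloads) is treated as "no end time yet".

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))
ZERO_TIME = "0001-01-01T00:00:00Z"  # endsAt of firing alerts in the samples


def to_utc8(utctime):
    """Convert a UTC timestamp string such as
    '2020-12-30T07:20:08.775177336Z' to a UTC+8 'YYYY-MM-DD HH:MM:SS' string.
    Returns None for the zero timestamp."""
    if utctime == ZERO_TIME:
        return None
    # Drop the trailing 'Z' and the fractional seconds.
    seconds_part = utctime.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(UTC_PLUS_8).strftime("%Y-%m-%d %H:%M:%S")


print(to_utc8("2020-12-30T07:20:08.775177336Z"))
print(to_utc8(ZERO_TIME))
```

Keeping the offset in a timezone object rather than adding a bare timedelta makes it harder to apply the shift twice by accident.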

Sending email

Sending email is also simple. The code is as follows:

import smtplib
from email.mime.text import MIMEText

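def sendEmail(title, content, receivers=None):
    if receivers is None:
        receivers = ['[email protected]']
    # SMTP server address, login user, password, and sender address
    mail_host = "xxx"
    mail_user = "xxx"
    mail_pass = "xxx"
    sender = "xxx"
    # Send the content as HTML, encoded as UTF-8
    msg = MIMEText(content, 'html', 'utf-8')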
    msg['From'] = "{}".format(sender)
    msg['To'] = ",".join(receivers)
    msg['Subject'] = title
    try:
        smtpObj = smtplib.SMTP_SSL(mail_host, 465)
        smtpObj.login(mail_user, mail_pass)
        smtpObj.sendmail(sender, receivers, msg.as_string())
        print('mail send successful.')
    except smtplib.SMTPException as e:
        print(e)
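
When debugging, it can be useful to build the message separately from sending it, so the headers and body can be inspected without an SMTP server. The sketch below assumes a hypothetical helper build_message and made-up example.com addresses; it uses the same MIMEText call as the function above.

```python
from email.mime.text import MIMEText


def build_message(title, html_content, sender, receivers):
    """Build the HTML email without sending it."""
    msg = MIMEText(html_content, "html", "utf-8")
    msg["From"] = sender
    msg["To"] = ",".join(receivers)
    msg["Subject"] = title
    return msg


# The addresses below are placeholders for illustration only.
msg = build_message(
    "Prometheus Monitor",
    "<p>Memory Usage is firing</p>",
    "alerts@example.com",
    ["ops-1@example.com", "ops-2@example.com"],
)
print(msg["Subject"])
print(msg["To"])
print(msg.get_content_type())
```

The resulting object can be passed to sendmail via msg.as_string(), exactly as in the function above.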

Generating an Alarm Template

Building a table in HTML is easy, but the table content changes with every alert, so a template language is needed to render it dynamically. Jinja2 is a common template language for Python, and Ansible templates use it as well. If you have not used it before, the official documentation is straightforward: http://docs.jinkan.org/docs/jinja2/. After some trial and error, I ended up with the HTML below. I am not a front-end developer, so it is basic, but it meets my needs.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<html align='left'>
    <body>
        <h2 style="font-size: x-large;">{{ prometheus_monitor_info['commonLabels']['cloud'] }} -- Monitoring alert notification</h2><br/>
        <br>
    <table border="1" width = "70%" cellspacing='0' cellpadding='0' align='left'>
    <tr>
        <!-- Monitoring category: system level, business level, service level, etc. -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Monitoring category</th>
        <!-- Status: alert notification or recovery notification -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Status</th>
        <!-- Level: alert severity -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Level</th>
        <!-- Instance: machine address -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Instance</th>
        <!-- Summary: short alert description -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Summary</th>
        <!-- Description: detailed alert description -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Description</th>
        <!-- Start time: when the alert started -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Start time</th>
        <!-- End time: when the alert ended -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">End time</th>
    </tr>
    {% for items in prometheus_monitor_info['alerts'] %}
    <tr align='center'>
        {% if loop.first %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #F3AE60" rowspan="{{ loop.length }}">{{ prometheus_monitor_info['commonLabels']['alertname'] }}</td>
        {% endif %}
        {% if items['status'] == 'firing' %}
        <td style="font-size: 16px; padding: 3px; background-color: red; word-wrap: break-word">Firing</td>
        {% else %}
        <td style="font-size: 16px; padding: 3px; background-color: green; word-wrap: break-word">Resolved</td>
        {% endif %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['labels']['severity'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['labels']['instance'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['annotations']['summary'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['annotations']['description'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['startsAt'] }}</td>
        {% if items['endsAt'] == '0001-01-01T00:00:00Z' %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">00 00:00:00:</td>
        {% else %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #3DE869">{{ items['endsAt'] }}</td>
        {% endif %}
    </tr>
    {% endfor %}
    </table>
    </body>
</html>

Well... it is mostly a bunch of for loops and if conditions. Merging table cells properly is a bit hard for me, so I only merge the monitoring category into a single cell (with rowspan on the first loop iteration) and leave the rest of the table unmerged.

Each <tr> is one row of alert data. Inside it there is a check on whether the alert is firing or resolved, and the status cell gets a different color accordingly, which management tends to appreciate.

One more important point:

In {% for items in prometheus_monitor_info['alerts'] %} ... {% endfor %}, prometheus_monitor_info is the variable holding the Python dictionary that was converted from the JSON sent by Alertmanager, after the time zone conversion has been applied. prometheus_monitor_info['alerts'] is the list of alerts, and the for loop iterates over it. items is one specific alert, itself a dictionary, from which the template reads the values it needs. It is simple once you think it through.
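
If Jinja syntax is unfamiliar, it may help to see the same loop logic written in plain Python. The sketch below is only an illustration and deliberately does not use Jinja: render_rows is a made-up function, the styling is reduced to the status color, and only three of the columns are produced.

```python
from html import escape


def render_rows(payload):
    """Return one '<tr>...</tr>' string per alert, mimicking the template loop."""
    alerts = payload["alerts"]
    rows = []
    for index, alert in enumerate(alerts):
        cells = []
        if index == 0:
            # Same role as `loop.first`: one category cell spanning all rows.
            name = escape(payload["commonLabels"]["alertname"])
            cells.append('<td rowspan="{}">{}</td>'.format(len(alerts), name))
        if alert["status"] == "firing":
            cells.append('<td style="background-color: red">Firing</td>')
        else:
            cells.append('<td style="background-color: green">Resolved</td>')
        cells.append("<td>{}</td>".format(escape(alert["labels"]["instance"])))
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return rows


# host-a and host-b are made-up instance names.
table_payload = {
    "commonLabels": {"alertname": "Memory Usage"},
    "alerts": [
        {"status": "firing", "labels": {"instance": "host-a"}},
        {"status": "resolved", "labels": {"instance": "host-b"}},
    ],
}
for row in render_rows(table_payload):
    print(row)
```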

Now that the HTML template is ready, how do we use it? Here is a class that loads the template and renders it with the given parameters:

from jinja2 import Environment, FileSystemLoader

class ParseingTemplate:
    def __init__(self, templatefile):
        self.templatefile = templatefile

    def template(self, **kwargs):
        try:
            env = Environment(loader=FileSystemLoader('templates'))
            template = env.get_template(self.templatefile)
            template_content = template.render(kwargs)
            return template_content
        except Exception as error:
            raise error

Basically, this class takes the alert data, loads the HTML template, and returns the rendered HTML, which is then sent out by email. That's it.

Precise alerting: routing to specific people

If you look carefully at the alert rule above, you will notice custom key-value pairs in its labels:

groups:
    - name: sentry  # The name acts as a category to tell rule groups apart
      rules:
      - alert: "Memory Usage"
        expr: round((1-(node_memory_MemAvailable_bytes{job='sentry'} / (node_memory_MemTotal_bytes{job='sentry'})))* 100) > 85
        for: 5m
        labels:
          team: ops   # This is where I define a group to send messages to
          severity: warning
          cloud: yizhuang
.
.

When parsing the raw JSON, I read the value of team, look up the email addresses of that group, and send the alert to those people. That gives me the addresses, but how do I know which environment or application these people correspond to? That is what the following is for:

groups:
    - name: sentry
.
.

This name corresponds to a job_name in the Prometheus configuration:

# cat prometheus.yml
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['10.127.92.105:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/alertmanager/rule/*.yaml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['10.127.92.105:9090']

  - job_name: 'cadvisor-app'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - /etc/prometheus/file-sd-configs/cadvisor-metrics.json

  - job_name: 'sentry'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - /etc/prometheus/file-sd-configs/system-metrics.json

  - job_name: 'kafka-monitor'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - /etc/prometheus/file-sd-configs/kafka-metrics.json
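
The routing idea described above (team label in, recipient list out) can be sketched in a few lines. This is a simplified illustration, not the author's get_email_conf: the configuration is an in-memory dictionary instead of a YAML file, the addresses are example.com placeholders, and receivers_for and DEFAULT_RECEIVERS are made-up names.

```python
# Assumed shape: a list of named groups, each with its recipient addresses.
EMAIL_CONF = {
    "email": [
        {"name": "kafka-monitor", "receive_addr": ["kafka-team@example.com"]},
        {"name": "ops", "receive_addr": ["ops-1@example.com", "ops-2@example.com"]},
    ]
}
DEFAULT_RECEIVERS = ["fallback@example.com"]


def receivers_for(payload, conf, default=DEFAULT_RECEIVERS):
    """Map the `team` label shared by the alerts to a recipient list."""
    # `team` only appears in commonLabels if every alert in the payload has
    # the same value for it; otherwise fall back to the default list.
    team = payload.get("commonLabels", {}).get("team")
    for group in conf["email"]:
        if group["name"] == team:
            return group["receive_addr"]
    return default


print(receivers_for({"commonLabels": {"team": "ops"}}, EMAIL_CONF))
print(receivers_for({"commonLabels": {}}, EMAIL_CONF))
```

Returning a default list instead of None keeps an alert with an unknown or missing team label from being silently dropped.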

Does it all connect now? Think it over, then refer to the complete code below.

Complete code reference

Reference code

from flask import Flask, request
from dateutil import parser
import json
import yaml
import datetime
import smtplib
from email.mime.text import MIMEText
from jinja2 import Environment, FileSystemLoader
from gevent.pywsgi import WSGIServer


def time_zone_conversion(utctime):
    format_time = parser.parse(utctime).strftime('%Y-%m-%dT%H:%M:%SZ')
    time_format = datetime.datetime.strptime(format_time, "%Y-%m-%dT%H:%M:%SZ")
    return str(time_format + datetime.timedelta(hours=8))


def get_email_conf(file, email_name=None, action=0):
    """
    :param file: path to the configuration file, in YAML format
    :param email_name: name of the email list to send to
    :param action: operation type: 0 returns the recipient addresses of email_name,
                   1 returns the names of all email lists, 2 returns the sender settings
    :return: a different data structure depending on the value of action
    """
    try:
        with open(file, 'r', encoding='utf-8') as fr:
            read_conf = yaml.safe_load(fr)
            if action == 0:
                for email in read_conf['email']:
                    if email['name'] == email_name:
                        return email['receive_addr']
                    else:
                        print("%s does not match for %s" % (email_name, file))
                else:
                    print("No recipient address configured")
            elif action == 1:
                return [items['name'] for items in read_conf['email']]
            elif action == 2:
                return read_conf['send']
    except KeyError:
        print("%s not exist" % email_name)
        exit(-1)
    except FileNotFoundError:
        print("%s file not found" % file)
        exit(-2)
    except Exception as e:
        raise e


def sendEmail(title, content, receivers=None):
    if receivers is None:
        receivers = ['[email protected]']
    # Read the SMTP settings from the `send` section of email.yaml
    send_dict = get_email_conf('email.yaml', action=2)
    mail_host = send_dict['smtp_host']
    mail_user = send_dict['send_user']
    mail_pass = send_dict['send_pass']
    sender = send_dict['send_addr']
    msg = MIMEText(content, 'html', 'utf-8')
    msg['From'] = "{}".format(sender)
    msg['To'] = ",".join(receivers)
    msg['Subject'] = title
    try:
        smtpObj = smtplib.SMTP_SSL(mail_host, 465)
        smtpObj.login(mail_user, mail_pass)
        smtpObj.sendmail(sender, receivers, msg.as_string())
        print('mail send successful.')
    except smtplib.SMTPException as e:
        print(e)


class ParseingTemplate:
    def __init__(self, templatefile):
        self.templatefile = templatefile

    def template(self, **kwargs):
        try:
            env = Environment(loader=FileSystemLoader('templates'))
            template = env.get_template(self.templatefile)
            template_content = template.render(kwargs)
            return template_content
        except Exception as error:
            raise error


app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def webhook():
    try:
        prometheus_data = json.loads(request.data)
        # Convert the timestamps from UTC to UTC+8
        for k, v in prometheus_data.items():
            if k == 'alerts':
                for items in v:
                    if items['status'] == 'firing':
                        # A firing alert has no real end time yet, so only startsAt is converted
                        items['startsAt'] = time_zone_conversion(items['startsAt'])
                    else:
                        items['startsAt'] = time_zone_conversion(items['startsAt'])
                        items['endsAt'] = time_zone_conversion(items['endsAt'])
        team_name = prometheus_data["commonLabels"]["team"]
        generate_html_template_subj = ParseingTemplate('email_template_firing.html')
        html_template_content = generate_html_template_subj.template(
            prometheus_monitor_info=prometheus_data
        )
        # Get the recipient mailing list
        email_list = get_email_conf('email.yaml', email_name=team_name, action=0)
        sendEmail(
            'Prometheus Monitor',
            html_template_content,
            receivers=email_list
        )
        return "prometheus monitor"
    except Exception as e:
        raise e


if __name__ == '__main__':
    WSGIServer(('0.0.0.0', 5000), app).serve_forever()

Configuration File Reference

send:
  smtp_host: smtp.163.com
  send_user: [email protected]
  send_addr: [email protected]
  send_pass: BRxxxxxxxZPUZEK
email:
  - name: kafka-monitor   # Corresponds to the team label
    receive_addr:
      - Email Address 1
      - Email Address 2
      - Email Address 3
  - name: ops
    receive_addr:
      - Email Address 1
      - Email Address 2

Final rendering

1) All alerts are firing

2) Both firing and resolved alerts

3) All alerts are resolved