Alerting modes of Prometheus
Prometheus is an open-source monitoring tool that has become very popular over the last few years. It supports Kubernetes natively, which is handy if you run a Kubernetes cluster. Prometheus also provides Alertmanager, a dedicated alerting component. Alerting is actually a separate part of the Prometheus architecture: you define a set of rules with trigger conditions, Prometheus evaluates those rules periodically, and whenever a rule fires, an alert is sent to Alertmanager for processing. So how does Alertmanager deliver alerts? There are many ways, for example:
- Email alerts
- Enterprise WeChat (WeCom) alerts
- DingTalk alerts
- Slack alerts
- Webhook alerts
There are others too, but these are just delivery channels; what matters is how you use them. Below we use the webhook mode to have Alertmanager call an HTTP interface with a POST request, and that interface completes the alert push, which can be email, WeChat, DingTalk, or anything else.
Calling an interface to send alerts by email
The general flow: first we define a pile of alert rules; when a trigger condition is met, Alertmanager pushes the alert data to our interface; the interface then does things like aggregation, summarization, and cleanup; finally the processed alert is sent by email to the designated people or group. In short: Prometheus -> Alertmanager -> webhook interface -> email to recipients.
Our focus here is how to write this webhook and what to watch out for while writing it. Let's go through it step by step.
Suppose you already have a Prometheus monitoring system with alert rules configured.
Configuring Alertmanager
You need to configure Alertmanager to call the interface. The configuration is as follows:
receivers:
  - name: webhook
    webhook_configs:
      - url: http://10.127.34.107:5000/webhook
        send_resolved: true
That's it! You can configure multiple receivers and delivery modes here. Once the configuration is complete, Alertmanager invokes the interface with a POST request.
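For orientation, here is what a minimal alertmanager.yml around that receiver might look like. The route section below is my assumption (tune the grouping and intervals to your needs); only the receiver block is required for this article:

route:
  receiver: webhook           # default receiver; must match a name below
  group_by: ['alertname']     # batch alerts that share these labels
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: webhook
    webhook_configs:
      - url: http://10.127.34.107:5000/webhook
        send_resolved: true   # also push recovery notifications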
Write a simple interface
Here is a simple interface written with Flask:
import json

from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    # Alertmanager POSTs a JSON body; parse it into a Python dict
    prometheus_data = json.loads(request.data)
    print(prometheus_data)
    return "test"

if __name__ == '__main__':
    # listen on all interfaces, port 5000
    WSGIServer(('0.0.0.0', 5000), app).serve_forever()
Remember to install the modules imported above first:
pip install flask
pip install gevent
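Incidentally, you don't have to wait for a real alert to exercise the endpoint. Here is a minimal smoke test with requests; the payload is a trimmed-down stand-in I made up, not the full Alertmanager schema:

import requests

# hypothetical minimal payload mimicking an Alertmanager push
payload = {
    'receiver': 'webhook',
    'status': 'firing',
    'alerts': [{
        'status': 'firing',
        'labels': {'alertname': 'Memory Usage', 'instance': '10.127.92.100', 'team': 'ops'},
        'annotations': {'summary': 'memory usage', 'description': 'memory usage over 55%'},
        'startsAt': '2020-12-30T07:20:08.775177336Z',
        'endsAt': '0001-01-01T00:00:00Z',
    }],
    'commonLabels': {'alertname': 'Memory Usage', 'team': 'ops'},
}

resp = requests.post('http://127.0.0.1:5000/webhook', json=payload)
print(resp.status_code, resp.text)  # expect: 200 test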
With the code running, the machine listens on port 5000, and when a real alert fires we can see exactly what data format Alertmanager sends. Here's an example:
{
    'receiver': 'webhook',
    'status': 'firing',
    'alerts': [
        {
            'status': 'firing',
            'labels': {'alertname': 'Memory Usage', 'instance': '10.127.92.100', 'job': 'sentry', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'Memory usage has exceeded 55%, current usage: 58%', 'summary': 'memory usage'},
            'startsAt': '2020-12-30T07:20:08.775177336Z',
            'endsAt': '0001-01-01T00:00:00Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28%281+-+%28node_memory_MemAvailable_bytes%7Bjob%3D%22sentry%22%7D+%2F+%28node_memory_MemTotal_bytes%7Bjob%3D%22sentry%22%7D%29%29%29+%2A+100%29+%3E+55&g0.tab=1',
            'fingerprint': '09f94bd1aa7da54f'
        },
        {
            'status': 'firing',
            'labels': {'alertname': 'Memory Usage', 'instance': '10.127.92.101', 'job': 'sentry', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'Memory usage has exceeded 55%, current usage: 58%', 'summary': 'memory usage'},
            'startsAt': '2020-12-30T07:20:08.775177336Z',
            'endsAt': '0001-01-01T00:00:00Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28%281+-+%28node_memory_MemAvailable_bytes%7Bjob%3D%22sentry%22%7D+%2F+%28node_memory_MemTotal_bytes%7Bjob%3D%22sentry%22%7D%29%29%29+%2A+100%29+%3E+55&g0.tab=1',
            'fingerprint': '8a972e4907cf2c60'
        }
    ],
    'groupLabels': {'alertname': 'Memory Usage'},
    'commonLabels': {'alertname': 'Memory Usage', 'job': 'sentry', 'severity': 'warning', 'team': 'ops'},
    'commonAnnotations': {'summary': 'memory usage'},
    'externalURL': 'http://alertmanager-server:9093',
    'version': '4',
    'groupKey': '{}:{alertname="Memory Usage"}',
    'truncatedAlerts': 0
}
We use the json module to convert this JSON string into a Python dictionary, which gives us the following information (very important):
- Each push of JSON data contains alerts of the same kind, for example all of them about memory
- status: the overall alert status; there are two values, firing and resolved
- alerts: a list whose elements are dictionaries, each one a specific alert
- commonLabels: the labels shared by every alert in the batch
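As a quick sketch of how these keys are typically consumed in the webhook (the variable names are mine):

# prometheus_data is the dict parsed from the POST body
overall_status = prometheus_data['status']        # 'firing' or 'resolved'
shared_labels = prometheus_data['commonLabels']   # labels common to the whole batch

for alert in prometheus_data['alerts']:
    # each element carries its own status, labels, annotations and timestamps
    print(alert['status'], alert['labels']['instance'], alert['annotations']['summary'])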
The remaining keys are easy to understand. They come from the Prometheus rules; to see why this alert was generated, look at the rule below.
# cat system.yaml
groups:
  - name: sentry
    rules:
      - alert: "Memory Usage"
        expr: round((1 - (node_memory_MemAvailable_bytes{job='sentry'} / node_memory_MemTotal_bytes{job='sentry'})) * 100) > 85
        for: 5m
        labels:
          team: ops
          severity: warning
          cloud: yizhuang
        annotations:
          summary: "Memory usage is too high and over 85% for 5min"
          description: "The current host {{ $labels.instance }} memory usage is {{ $value }}%"
This is the configured alert rule that tells Prometheus when to generate alerts. It is referenced in the Prometheus configuration like so:
# cat prometheus.yml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['10.10.10.111:9093']

# Here it is, look here
rule_files:
  - "/alertmanager/rule/*.yaml"  # the rule file directory; set it however you like

...
# a bunch of other configuration omitted here
So now you know what the alert rule looks like, and therefore why the alert payload looks the way it does.
Processing the raw alert data and sending email alerts
The raw alert data looks fairly regular and mostly just needs to be stitched together. There is one problem, though: the startsAt and endsAt timestamps in alerts are in the UTC time zone and need converting. Another thing to note: even if the outermost status is firing, that does not mean every entry in alerts is firing; some may already be resolved, as in the JSON below:
{
    'receiver': 'webhook',
    'status': 'firing',
    'alerts': [
        {
            'status': 'resolved',   # note: a resolved alert inside a firing batch
            'labels': {'alertname': 'CPU Usage', 'instance': '10.127.91.26', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'CPU usage has exceeded 35%, current usage: 38%', 'summary': 'CPU usage'},
            'startsAt': '2020-12-30T07:38:38.775177336Z',
            'endsAt': '2020-12-30T07:38:53.775177336Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28100+-+%28avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22sentry%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29+%2A+100%29%29+%3E+35&g0.tab=1',
            'fingerprint': '58393b2abd2c6987'
        },
        {
            'status': 'resolved',
            'labels': {'alertname': 'CPU Usage', 'instance': '10.127.92.101', 'severity': 'warning', 'team': 'ops'},
            'annotations': {'description': 'CPU usage has exceeded 35%, current usage: 38%', 'summary': 'CPU usage'},
            'startsAt': '2020-12-30T07:42:08.775177336Z',
            'endsAt': '2020-12-30T07:42:38.775177336Z',
            'generatorURL': 'http://prometheus-server:9090/graph?g0.expr=round%28100+-+%28avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22sentry%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29+%2A+100%29%29+%3E+35&g0.tab=1',
            'fingerprint': 'eaca600142f9716c'
        }
    ],
    'groupLabels': {'alertname': 'CPU Usage'},
    'commonLabels': {'alertname': 'CPU Usage', 'severity': 'warning', 'team': 'ops'},
    'commonAnnotations': {'summary': 'CPU usage'},
    'externalURL': 'http://alertmanager-server:9093',
    'version': '4',
    'groupKey': '{}:{alertname="CPU Usage"}',
    'truncatedAlerts': 0
}
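So each entry's own status has to be checked individually. A small sketch of splitting one batch (the variable names are illustrative):

firing = [a for a in prometheus_data['alerts'] if a['status'] == 'firing']
resolved = [a for a in prometheus_data['alerts'] if a['status'] == 'resolved']
print('%d firing, %d resolved in this batch' % (len(firing), len(resolved)))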
Now that all the caveats are covered, let's get started. First, the end result I want to achieve:
- Time zone conversion
- Different types of alerts are pushed to different people
- The alert content is rendered as an HTML table
Time zone conversion
Let's look at the time zone conversion first; this one is easy to solve. The code is as follows:
import datetime

from dateutil import parser

def time_zone_conversion(utctime):
    # normalize the UTC timestamp, drop sub-second precision,
    # then shift to UTC+8 (Beijing time)
    format_time = parser.parse(utctime).strftime('%Y-%m-%dT%H:%M:%SZ')
    time_format = datetime.datetime.strptime(format_time, "%Y-%m-%dT%H:%M:%SZ")
    return str(time_format + datetime.timedelta(hours=8))
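A quick check with a timestamp from the payload above:

print(time_zone_conversion('2020-12-30T07:20:08.775177336Z'))
# -> 2020-12-30 15:20:08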
Send E-mail
Next, email sending, which is also very simple:
import smtplib
from email.mime.text import MIMEText

def sendEmail(title, content, receivers=None):
    if receivers is None:
        receivers = ['[email protected]']
    mail_host = "xxx"   # SMTP server address
    mail_user = "xxx"   # SMTP login user
    mail_pass = "xxx"   # SMTP password / auth token
    sender = "xxx"      # From address
    msg = MIMEText(content, 'html', 'utf-8')
    msg['From'] = "{}".format(sender)
    msg['To'] = ",".join(receivers)
    msg['Subject'] = title
    try:
        smtpObj = smtplib.SMTP_SSL(mail_host, 465)
        smtpObj.login(mail_user, mail_pass)
        smtpObj.sendmail(sender, receivers, msg.as_string())
        print('mail send successful.')
    except smtplib.SMTPException as e:
        print(e)
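A quick sanity check before wiring it into the webhook (the address is a placeholder):

sendEmail('Prometheus Monitor', '<h3>test alert body</h3>', receivers=['[email protected]'])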
Generating an alert template
Building a table in HTML is easy, but the table contents change with every alert, so we need a template language to support this dynamic content: Jinja2, a common Python template language (Ansible templates also use Jinja2). Don't be put off, it's simple; see the official docs at http://docs.jinkan.org/docs/jinja2/. The HTML I ended up with looks like this. I'm no front-end developer, but it fulfills my needs:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<html align='left'>
<body>
<h2 style="font-size: x-large;">{{ prometheus_monitor_info['commonLabels']['cloud'] }} -- monitoring alert notification</h2><br/>
<br>
<table border="1" width="70%" cellspacing='0' cellpadding='0' align='left'>
    <tr>
        <!-- Category: system level, business level, service level, etc. -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Category</th>
        <!-- Status: firing or resolved -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Status</th>
        <!-- Severity: the alert level -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Severity</th>
        <!-- Instance: the machine address -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Instance</th>
        <!-- Summary: short alert description -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Summary</th>
        <!-- Description: detailed alert description -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Description</th>
        <!-- Start time of the alert -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">Start time</th>
        <!-- End time of the alert -->
        <th style="font-size: 20px; padding: 5px; background-color: #F3AE60">End time</th>
    </tr>
    {% for items in prometheus_monitor_info['alerts'] %}
    <tr align='center'>
        {% if loop.first %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #F3AE60" rowspan="{{ loop.length }}">{{ prometheus_monitor_info['commonLabels']['alertname'] }}</td>
        {% endif %}
        {% if items['status'] == 'firing' %}
        <td style="font-size: 16px; padding: 3px; background-color: red; word-wrap: break-word">Firing</td>
        {% else %}
        <td style="font-size: 16px; padding: 3px; background-color: green; word-wrap: break-word">Resolved</td>
        {% endif %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['labels']['severity'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['labels']['instance'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['annotations']['summary'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['annotations']['description'] }}</td>
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">{{ items['startsAt'] }}</td>
        {% if items['endsAt'] == '0001-01-01T00:00:00Z' %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #EBE4D3">0000-00-00 00:00:00</td>
        {% else %}
        <td style="font-size: 16px; padding: 3px; word-wrap: break-word; background-color: #3DE869">{{ items['endsAt'] }}</td>
        {% endif %}
    </tr>
    {% endfor %}
</table>
</body>
</html>
En... it's a bunch of for loops and if branches, plus merged table cells. Merging cells properly is a bit hard for me, so I simply merge the category column into a single cell and leave the rest of the table unmerged. Each loop iteration renders one row of alert information, and an if check decides whether that alert is firing or resolved, coloring the cell red or green accordingly, so the bosses will find it really sweet.
One more important thing about the template: in {% for items in prometheus_monitor_info['alerts'] %}, prometheus_monitor_info is the variable holding the JSON from Alertmanager after it has been converted into a Python dictionary and the time zones converted. prometheus_monitor_info['alerts'] retrieves the list of alerts, the for loop iterates over it, and items is one specific alert, itself a dictionary from which the individual values are pulled. Easy when you think about it.
So now that I have my HTML template, how do I use it? Here I write a small class that loads the template and renders it with the given parameters:
from jinja2 import Environment, FileSystemLoader

class ParseingTemplate:
    def __init__(self, templatefile):
        self.templatefile = templatefile

    def template(self, **kwargs):
        try:
            # templates are looked up in the local 'templates' directory
            env = Environment(loader=FileSystemLoader('templates'))
            template = env.get_template(self.templatefile)
            template_content = template.render(kwargs)
            return template_content
        except Exception as error:
            raise error
Basically, this class takes the alert data, reads the HTML template, and returns the rendered HTML, which then goes out by email. That's it.
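Putting the pieces together so far, assuming the template is saved as templates/email_template_firing.html (the same filename used in the complete code below):

generate_html = ParseingTemplate('email_template_firing.html')
html_content = generate_html.template(prometheus_monitor_info=prometheus_data)
sendEmail('Prometheus Monitor', html_content, receivers=['[email protected]'])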
Precise alerting: routing to specific people
If you look carefully at the alert rule above, you will notice it contains some custom key-values:
groups:
  - name: sentry              # this name can be seen as a classification, to tell rule groups apart
    rules:
      - alert: "Memory Usage"
        expr: round((1 - (node_memory_MemAvailable_bytes{job='sentry'} / node_memory_MemTotal_bytes{job='sentry'})) * 100) > 85
        for: 5m
        labels:
          team: ops            # this is where I define which group gets notified
          severity: warning
          cloud: yizhuang
...
When I parse the raw JSON I pull out the value of team, look up the email addresses configured for that group, and finally send the mail to those people. The addresses themselves live in a separate config file. A stripped-down sketch of the lookup (the in-memory mapping below is hypothetical; the real version reads email.yaml, shown at the end):
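# hypothetical in-memory mapping from the team label to recipients
TEAM_RECEIVERS = {
    'ops': ['[email protected]', '[email protected]'],
    'kafka-monitor': ['[email protected]'],
}

team = prometheus_data['commonLabels']['team']              # e.g. 'ops'
receivers = TEAM_RECEIVERS.get(team, ['[email protected]'])

But how do I know which environment or application each group of people corresponds to? That comes from the group name: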
groups:
  - name: sentry
...
Remember name: sentry? The rule's expr filters on job='sentry', and that name lines up with a job_name in the Prometheus configuration:
# cat prometheus.yml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['10.127.92.105:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/alertmanager/rule/*.yaml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['10.127.92.105:9090']

  - job_name: 'cadvisor-app'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - /etc/prometheus/file-sd-configs/cadvisor-metrics.json

  - job_name: 'sentry'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - /etc/prometheus/file-sd-configs/system-metrics.json

  - job_name: 'kafka-monitor'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - /etc/prometheus/file-sd-configs/kafka-metrics.json
Does it all string together now? Think it through, then refer to my complete code below.
Complete code reference
from flask import Flask, request
from dateutil import parser
import json
import yaml
import datetime
import smtplib
from email.mime.text import MIMEText
from jinja2 import Environment, FileSystemLoader
from gevent.pywsgi import WSGIServer


def time_zone_conversion(utctime):
    # normalize the UTC timestamp and shift it to UTC+8
    format_time = parser.parse(utctime).strftime('%Y-%m-%dT%H:%M:%SZ')
    time_format = datetime.datetime.strptime(format_time, "%Y-%m-%dT%H:%M:%SZ")
    return str(time_format + datetime.timedelta(hours=8))


def get_email_conf(file, email_name=None, action=0):
    """
    :param file: path to the YAML config file
    :param email_name: name of the mailing list to send to
    :param action: operation type, determines what is returned
    :return: action == 0 -> recipient addresses for email_name,
             action == 1 -> all configured list names,
             action == 2 -> the sender (SMTP) settings
    """
    try:
        with open(file, 'r', encoding='utf-8') as fr:
            read_conf = yaml.safe_load(fr)
        if action == 0:
            for email in read_conf['email']:
                if email['name'] == email_name:
                    return email['receive_addr']
                else:
                    print("%s does not match for %s" % (email_name, file))
            else:
                print("No recipient address configured")
        elif action == 1:
            return [items['name'] for items in read_conf['email']]
        elif action == 2:
            return read_conf['send']
    except KeyError:
        print("%s not exist" % email_name)
        exit(-1)
    except FileNotFoundError:
        print("%s file not found" % file)
        exit(-2)
    except Exception as e:
        raise e


def sendEmail(title, content, receivers=None):
    if receivers is None:
        receivers = ['[email protected]']
    send_dict = get_email_conf('email.yaml', action=2)
    mail_host = send_dict['smtp_host']
    mail_user = send_dict['send_user']
    mail_pass = send_dict['send_pass']
    sender = send_dict['send_addr']
    msg = MIMEText(content, 'html', 'utf-8')
    msg['From'] = "{}".format(sender)
    msg['To'] = ",".join(receivers)
    msg['Subject'] = title
    try:
        smtpObj = smtplib.SMTP_SSL(mail_host, 465)
        smtpObj.login(mail_user, mail_pass)
        smtpObj.sendmail(sender, receivers, msg.as_string())
        print('mail send successful.')
    except smtplib.SMTPException as e:
        print(e)


class ParseingTemplate:
    def __init__(self, templatefile):
        self.templatefile = templatefile

    def template(self, **kwargs):
        try:
            env = Environment(loader=FileSystemLoader('templates'))
            template = env.get_template(self.templatefile)
            template_content = template.render(kwargs)
            return template_content
        except Exception as error:
            raise error


app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def webhook():
    try:
        prometheus_data = json.loads(request.data)
        # convert timestamps to UTC+8; endsAt is only meaningful for resolved alerts
        for k, v in prometheus_data.items():
            if k == 'alerts':
                for items in v:
                    if items['status'] == 'firing':
                        items['startsAt'] = time_zone_conversion(items['startsAt'])
                    else:
                        items['startsAt'] = time_zone_conversion(items['startsAt'])
                        items['endsAt'] = time_zone_conversion(items['endsAt'])
        team_name = prometheus_data["commonLabels"]["team"]
        generate_html_template_subj = ParseingTemplate('email_template_firing.html')
        html_template_content = generate_html_template_subj.template(
            prometheus_monitor_info=prometheus_data
        )
        # get the recipient mailing list for this team
        email_list = get_email_conf('email.yaml', email_name=team_name, action=0)
        sendEmail(
            'Prometheus Monitor',
            html_template_content,
            receivers=email_list
        )
        return "prometheus monitor"
    except Exception as e:
        raise e


if __name__ == '__main__':
    WSGIServer(('0.0.0.0', 5000), app).serve_forever()
Configuration file reference
# cat email.yaml
send:
  smtp_host: smtp.163.com
  send_user: [email protected]
  send_addr: [email protected]
  send_pass: BRxxxxxxxZPUZEK
email:
  - name: kafka-monitor      # corresponds to the team label in the alert rules
    receive_addr:
      - Email Address 1
      - Email Address 2
      - Email Address 3
  - name: ops
    receive_addr:
      - Email Address 1
      - Email Address 2
Final renderings
1) All alerts firing
2) A mix of firing and resolved
3) All resolved