Preface
A recent project called for a complete monitoring and alerting system, and we chose Prometheus, an open-source monitoring and alerting toolkit. It is powerful, easy to extend, and easy to install and use. This article first introduces the overall monitoring workflow of Prometheus, then describes how to collect monitoring data, display it, and trigger alerts, and finally shows a monitoring demo for a business system.
Monitoring architecture
The entire architectural flow of Prometheus can be seen in the following figure:
The whole process is roughly divided into collecting data, storing data, displaying monitoring data, and alerting. The core components include Exporters, Prometheus Server, AlertManager, and PushGateway.
- Exporters: monitoring data collectors that expose data to Prometheus Server over HTTP.
- Prometheus Server: obtains, stores, and queries monitoring data. The scraped data must be in the specified metrics format so that it can be processed. Prometheus provides PromQL for querying and aggregating the data, and also ships with a built-in Web UI.
- AlertManager: Prometheus supports alert rules written in PromQL. When a rule fires, an alert is created and handed off to AlertManager, which provides various notification methods including email and webhook.
- PushGateway: normally Prometheus Server can reach Exporters directly and pull their data; when the network does not allow this, PushGateway can serve as an intermediary.
Collecting data
The main function of an Exporter is to collect data and expose it to Prometheus via HTTP; Prometheus then obtains the monitoring data through periodic pulls. Data comes from a variety of sources: node-level monitoring data such as CPU and IO, middleware such as MySQL and MQ, process-level monitoring data such as the JVM, and business monitoring data. Apart from business data, which may differ from system to system, the other monitoring data is virtually the same everywhere, so Exporters fall into two categories: community-provided and user-defined.
Exporter sources
- Community-provided
| Scope | Commonly used Exporters |
| --- | --- |
| Database | MySQL, Redis, MongoDB, etc. |
| Hardware | Node Exporter, etc. |
| Message queue | Kafka Exporter, RabbitMQ Exporter, etc. |
| HTTP service | Apache, Nginx, etc. |
| Storage | HDFS Exporter, etc. |
| API service | Docker Hub Exporter, GitHub Exporter, etc. |
| Other | JIRA Exporter, Jenkins Exporter, Confluence Exporter, etc. |
Official list of third-party Exporters: Exporters
- User-defined
In addition to the third-party Exporters listed above, users can also write their own Exporters using the client libraries provided by Prometheus, which support a variety of languages: Go, Java/Scala, Python, Ruby, etc.
Exporter modes
In terms of how they run, Exporters fall into two types: running independently and integrated into the application.
- Running independently
Middleware such as MySQL, Redis, and MQ does not natively support Prometheus, so an independent Exporter can use the monitoring API provided by the middleware to obtain monitoring data and convert it into a format that Prometheus recognizes.
- Integrated into the application
For systems that require custom monitoring metrics, monitoring data can be exposed to Prometheus through the client library provided by Prometheus, as in the sketch below.
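As an illustration only, here is a minimal sketch of a custom Exporter written with the official Prometheus Java client (assuming the simpleclient and simpleclient_httpserver dependencies are on the classpath); the metric names and port are hypothetical:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

public class OrderExporter {

    // Business counter: total number of orders created
    static final Counter ordersCreated = Counter.build()
            .name("orders_created_total")
            .help("Total number of orders created.")
            .labelNames("channel")
            .register();

    // Business gauge: number of orders currently being processed
    static final Gauge ordersInProgress = Gauge.build()
            .name("orders_in_progress")
            .help("Number of orders currently being processed.")
            .register();

    public static void main(String[] args) throws Exception {
        // Expose all registered metrics at http://localhost:9091/metrics for Prometheus to pull
        HTTPServer server = new HTTPServer(9091);

        // Somewhere in the business code:
        ordersCreated.labels("web").inc();
        ordersInProgress.inc();
        // ... and when an order finishes processing:
        ordersInProgress.dec();
    }
}
```

The same registration calls can also live inside an existing application; the HTTPServer is only needed when the Exporter runs as a standalone process.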
Data format
Prometheus obtains monitoring data from Exporters by polling, and the data must be in a metrics format that Prometheus can recognize:
```
<metric name>{<label name>=<label value>, ... }
```
It is divided into three parts, and each part must conform to the corresponding regular expression:
- Metric name: the name of the monitored sample; it must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*
- Label name: a label reflects a characteristic dimension of the sample; it must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*
- Label value: the value of a label; its format is not restricted
We can look at the monitoring data for a JVM:
```
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Eden Space",} 1.033895936E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Survivor Space",} 2621440.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Old Gen",} 2.09190912E9
```
More: data_model
Data types
Prometheus defines four different metric types: Counter, Gauge, Histogram, and Summary.
- Counter
A monotonically increasing counter, such as the number of times certain events occur in an application; a common metric of this type is http_requests_total.
```
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 6.3123664E9
```
- Gauge
Reflects the current state of the system and can go up or down; common metrics of this type are node_memory_MemFree and node_memory_MemAvailable.
```
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="springboot-actuator-prometheus-test",} 20.0
```
- Histogram and Summary
Mainly used for statistical analysis of the distribution of samples.
```
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.008
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 38.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 0.134
jvm_gc_pause_seconds_count{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.073
```
More: metric_types
Displaying data
Prometheus can display data through its built-in Prometheus UI or through Grafana. The Prometheus UI is the Web UI that ships with Prometheus and makes it easy to test PromQL; Grafana is an open-source visualization application written in Go that can pull data from Elasticsearch, Prometheus, Graphite, InfluxDB, and other sources and present it with beautiful graphics.
- Prometheus UI
The main interface is roughly as follows:
All registered Exporters can be viewed in the UI, alerts can be viewed on the Alerts page, and PromQL queries can be executed to query and display monitoring data.
- Grafana
In Grafana, each monitoring query can be made into a panel and presented in a variety of ways, for example:
PromQL overview
PromQL is Prometheus's built-in data query language, analogous to SQL; it provides rich query capabilities, logical operators, aggregation functions, and so on.
- Operators
Operators include mathematical operators, logical operators, Boolean operators, and so on; for example:
```
rabbitmq_queue_messages>0
```
- Aggregation functions
PromQL provides a number of built-in aggregation functions, such as sum, min, max, avg, etc.; for example:
```
sum(rabbitmq_queue_messages)>0
```
More: PromQL
Alerting
The alerting process is as follows: alert rules are configured in Prometheus using PromQL; when a rule fires, a message is sent to the receiver (AlertManager), and AlertManager can be configured with various notification methods such as email and webhook.
Custom alert rules
Alert rules in Prometheus let you define trigger conditions based on PromQL expressions. The Prometheus back end evaluates these rules periodically and triggers alert notifications when the conditions are met.
For example, the following alert rule:
```yaml
groups:
  - name: queue-messages-warning
    rules:
      - alert: queue-messages-warning
        expr: sum(rabbitmq_queue_messages{job='rabbit-state-metrics'}) > 500
        labels:
          team: webhook-warning
        annotations:
          summary: High queue-messages usage detected
          threshold: 500
          current: '{{ $value }}'
```
- alert: the name of the alert rule;
- expr: the alert trigger condition, a PromQL expression;
- labels: custom labels, used to match a specific AlertManager route;
- annotations: a set of additional information, such as text describing the alert details;
AlertManager
AlertManager is the alert manager; it provides various notification methods, including email, PagerDuty, OpsGenie, and webhook. When one of the alert rule expressions above fires, the alert is sent to AlertManager, which notifies developers in a more flexible way. The AlertManager configuration used here is as follows:
```yaml
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 5m
  group_by:
    - alertname
  routes:
    - receiver: webhook
      group_wait: 10s
      match:
        team: webhook-warning
receivers:
  - name: webhook
    webhook_configs:
      - url: 'http://ip:port/api/v1/monitor/alert-receiver'
        send_resolved: true
```
The above configures a webhook route and receiver in AlertManager. More: alerting
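The webhook URL above points to an HTTP interface of our own service that receives notifications pushed by AlertManager. As an illustration only, here is a minimal sketch of such a receiver, assuming a Spring Boot web application; the class name and handling logic are hypothetical, and AlertManager posts a JSON body whose fields include status and an alerts array carrying labels and annotations:

```java
import java.util.Map;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Receives alert notifications pushed by AlertManager's webhook_configs
@RestController
@RequestMapping("/api/v1/monitor")
public class AlertReceiverController {

    @PostMapping("/alert-receiver")
    public String receive(@RequestBody Map<String, Object> notification) {
        // "status" is firing or resolved; "alerts" is the list of individual alerts
        Object status = notification.get("status");
        Object alerts = notification.get("alerts");
        System.out.println("alert status=" + status + ", alerts=" + alerts);
        // forward to mail, SMS, DingTalk, etc. as needed
        return "ok";
    }
}
```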
Installation and Configuration
Here is a look at the installation of several core components: Prometheus, AlertManager, Exporter, and Grafana. All components are installed on the Kubernetes (K8s) platform.
Prometheus and AlertManager
Prometheus and AlertManager are installed with the following YAML file:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '18'
  generation: 18
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - image: 'prom/prometheus:latest'
          imagePullPolicy: Always
          name: prometheus-0
          ports:
            - containerPort: 9090
              name: p-port
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/prometheus
              name: config-volume
        - image: 'prom/alertmanager:latest'
          imagePullPolicy: Always
          name: prometheus-1
          ports:
            - containerPort: 9093
              name: a-port
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/alertmanager
              name: alertcfg
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: monitoring-nfs-pvc
        - configMap:
            defaultMode: 420
            name: prometheus-config
          name: config-volume
        - configMap:
            defaultMode: 420
            name: alert-config
          name: alertcfg
```
The prom/prometheus:latest and prom/alertmanager:latest images and their exposed ports are specified. The two containers start with the prometheus.yml and alertmanager.yml configuration files, which are provided by the prometheus-config and alert-config ConfigMaps mounted as volumes.
The configuration of prometheus.yml is as follows:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - 'rabbitmq_warn.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']
scrape_configs:
  - job_name: 'rabbit-state-metrics'
    static_configs:
      - targets: ['ip:port']
```
This configuration specifies the AlertManager address and the rabbitmq_warn.yml rule file, and configures the Exporter (the rabbit-state-metrics job) from which monitoring data is collected.
- Check the Exporters
After starting Prometheus, you can view the related Exporters and alert rules in the Prometheus Web UI:
All registered Exporters can be viewed under Status/Targets. If their state is UP, Prometheus can pull monitoring data from them, such as the RabbitMQ monitoring data configured here.
- Check the Alerts
You can also view the configured alerts in the Prometheus Web UI:
If an alert rule fires, it is displayed in red and a message is sent to AlertManager.
Grafana
The Grafana installation YAML file is as follows:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '1'
  generation: 1
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - image: grafana/grafana
          imagePullPolicy: Always
          name: grafana
          ports:
            - containerPort: 3000
              protocol: TCP
          resources: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
```
After installation, Grafana can be used. Grafana needs to obtain data from Prometheus, so a data source must be configured:
Then you can create a monitoring dashboard and use PromQL directly:
Exporter
The Exporters for most of the middleware we use run independently, such as the rabbitmq-exporter used here:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  labels:
    k8s-app: rabbitmq-exporter
  name: rabbitmq-exporter
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      k8s-app: rabbitmq-exporter
  template:
    metadata:
      labels:
        k8s-app: rabbitmq-exporter
    spec:
      containers:
        - env:
            - name: PUBLISH_PORT
              value: '9098'
            - name: RABBIT_CAPABILITIES
              value: 'bert,no_sort'
            - name: RABBIT_USER
              value: xxxx
            - name: RABBIT_PASSWORD
              value: xxxx
            - name: RABBIT_URL
              value: 'http://ip:15672'
          image: kbudde/rabbitmq-exporter
          imagePullPolicy: IfNotPresent
          name: rabbitmq-exporter
          ports:
            - containerPort: 9098
              protocol: TCP
```
This starts a rabbitmq-exporter service on port 9098, which reads RabbitMQ's management API on port 15672 and converts the data into metrics that Prometheus can recognize. If you need to monitor business services, custom instrumentation is required.
Micrometer
Spring Boot itself provides health checks, metrics collection, and monitoring; this data can be exposed to Prometheus using Micrometer, which provides a common API for collecting performance data on the Java platform. An application only needs to use Micrometer's generic API to collect performance metrics, and Micrometer takes care of integrating with the different monitoring systems.
Add the dependency
```xml
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
After the above dependency is added, Spring Boot automatically configures a PrometheusMeterRegistry and a CollectorRegistry to collect and export metric data in a format that Prometheus can scrape.
All relevant data is exposed at the Actuator's /actuator/prometheus endpoint, which Prometheus can scrape periodically to obtain metric data. (Depending on the Spring Boot version, the endpoint may also need to be exposed via management.endpoints.web.exposure.include.)
Prometheus endpoint
After starting the Spring Boot service, you can access http://ip:8080/actuator/prometheus directly and see that Spring Boot already provides some common application monitoring data, such as JVM metrics:
```
# HELP tomcat_sessions_created_sessions_total
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total{application="springboot-actuator-prometheus-test",} 1782.0
# HELP tomcat_sessions_active_current_sessions
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions{application="springboot-actuator-prometheus-test",} 365.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads{application="springboot-actuator-prometheus-test",} ...
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage{application="springboot-actuator-prometheus-test",} 0.0102880658436214
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 9.13812704E8
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="mapped",} 0.0
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="direct",} 10.0
...
```
Configure the Prometheus target
Add the following job under scrape_configs in prometheus.yml:
```yaml
- job_name: 'springboot-actuator-prometheus-test'
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s
  basic_auth:
    username: 'actuator'
    password: 'actuator'
  static_configs:
    - targets: ['ip:8080']
```
After adding it, you can reload the configuration (this requires Prometheus to have been started with the --web.enable-lifecycle flag):
```bash
curl -X POST http://ip:9090/-/reload
```
Look again at Prometheus’s target:
Grafana
You can add a dashboard for the JVM as follows:
Business instrumentation
Micrometer provides a set of native meter types, including Timer, Counter, Gauge, DistributionSummary, LongTaskTimer, and more. Different meter types produce different time series: for example, a single value is represented by a Gauge, while the count and total time of timed events are represented by a Timer. The meter types are summarized below, followed by a short instrumentation sketch.
- Counter: allows incrementing by a fixed amount, which must be positive;
- Gauge: a handle for obtaining the current value; typical examples are the size of a collection or map, or the number of running threads;
- Timer: measures the latency and frequency of short events; all Timer implementations report at least the total time and the number of events as separate time series;
- LongTaskTimer: tracks the total duration of all currently running long-running tasks and the number of such tasks;
- DistributionSummary: used to track the distribution of events;
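As an illustration only, here is a minimal sketch of business instrumentation in a Spring Boot service using Micrometer's MeterRegistry; the metric names and the OrderService class are hypothetical:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final Counter ordersCreated;
    private final Timer orderTimer;

    public OrderService(MeterRegistry registry) {
        // Counter: total number of orders created, tagged with the channel dimension
        this.ordersCreated = Counter.builder("business_orders_created_total")
                .description("Total number of orders created")
                .tag("channel", "web")
                .register(registry);
        // Timer: records the count and total time of order creation
        this.orderTimer = Timer.builder("business_order_create_seconds")
                .description("Time taken to create an order")
                .register(registry);
    }

    public void createOrder() {
        orderTimer.record(() -> {
            // ... business logic ...
            ordersCreated.increment();
        });
    }
}
```

Metrics registered this way appear automatically on the /actuator/prometheus endpoint alongside the built-in ones.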
More: Micrometer
Conclusion
This article walks through the whole process of monitoring services with Prometheus, from principles to examples, and can serve as a starting point. Much of Prometheus's power lies in PromQL, which is worth studying on its own. In addition, Micrometer's instrumentation interface is a wrapper over the Prometheus client API (simpleclient), which developers can use and learn as needed.