Monitoring System – Prometheus


1. Introduction

1.1 Official introduction

Prometheus is an open-source system monitoring and alerting toolkit originally developed at SoundCloud by former Google engineers starting in 2012, and it has since been adopted by many companies and organizations for monitoring and alerting. Prometheus has an active developer and user community and is now a standalone open-source project maintained independently of any single company. Underscoring this, Prometheus joined the CNCF in May 2016, becoming the second CNCF-hosted project after Kubernetes.

The architecture diagram in the original post is dated; Prometheus has since been updated to version 2.25.

1.2 Main Functions/Advantages

  • Multidimensional data model (time series consisting of metric names and K/V labels)
  • Powerful query language (PromQL)
    For example, to query the average user-mode CPU usage over the last 30 seconds for machines whose instance matches 10.224.192.\d{3}:9100:

        avg(rate(node_cpu_seconds_total{instance=~"10.224.192.\\d{3}:9100", mode="user"}[30s])) * 100
  • No dependence on external storage; supports both local and remote storage models, and each server node is autonomous
  • Uses HTTP and a pull model to collect data, which is easy to understand and operate; data can also be pushed through an intermediate gateway (Pushgateway)
  • Monitoring targets can be defined through service discovery or static configuration (a minimal scrape configuration is sketched after this list)
  • Supports a variety of data visualization options and is dashboard-friendly
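
As a rough illustration of the pull model and the two ways of defining targets, here is a minimal prometheus.yml sketch; the IP addresses, file paths, and job names are placeholders rather than values from this article's environment.

    # prometheus.yml (minimal sketch; addresses and paths are placeholders)
    global:
      scrape_interval: 15s            # how often Prometheus pulls each target

    scrape_configs:
      # statically configured targets, e.g. node_exporter instances on port 9100
      - job_name: node
        static_configs:
          - targets:
              - 10.224.192.101:9100
              - 10.224.192.102:9100

      # targets discovered dynamically through file-based service discovery
      - job_name: app
        file_sd_configs:
          - files:
              - /etc/prometheus/targets/app-*.json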

1.3 Comparison of monitoring systems

Zabbix and Nagios are both long-established monitoring systems, while Open-Falcon is an open-source monitoring system developed by Xiaomi.

  1. In terms of maturity, Zabbix and Nagios are long-established systems with stable features and high maturity, while Prometheus and Open-Falcon have only been around for a few years and are still iterating rapidly.
  2. In terms of extensibility, Prometheus extends its collection and storage capabilities through well-defined interfaces to a wide range of exporters and storage back ends, as described later.
  3. In terms of community activity, Prometheus has the most active community and is backed by the CNCF.
  4. In terms of performance, the main difference lies in storage: Prometheus uses a high-performance time-series database, Zabbix uses a relational database, and Nagios and Open-Falcon use RRD storage.
  5. In terms of container support, Zabbix and Nagios predate containers, so their container support is poor; Open-Falcon has limited support for container monitoring; Prometheus's dynamic discovery mechanism supports monitoring of all kinds of container clusters and is currently the best option for container monitoring.

1.4 Summary

  • Prometheus is a one-stop monitoring and alerting platform with few dependencies and comprehensive functionality.

  • Prometheus supports cloud and container monitoring, while the other systems focus mainly on host monitoring.

  • Prometheus has more expressive data query statements and more powerful statistical functions built in.

  • Prometheus is not as good as InfluxDB, OpenTSDB, Sensu in data storage scalability and persistence.

  • Applicable scenario

    Prometheus is well suited to recording purely numeric time series, for both machine-centric monitoring and monitoring of highly dynamic service-oriented architectures. In a microservices world, its support for multidimensional data collection and querying is a particular strength. Prometheus is designed for reliability: it is the system you turn to during an outage to quickly diagnose problems. Each Prometheus server is standalone and does not depend on network storage or other remote services, so when other parts of the infrastructure fail you can still use Prometheus to quickly locate the fault without needing significant additional infrastructure.

  • Not applicable scenarios

    Prometheus places a high value on reliability: you can always view the statistics that are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good fit, because the data it collects may not be complete enough. In that case you are better off using another system to collect and analyze the billing data, and using Prometheus to monitor the rest of the system.

2. Architecture design

2.1 Overall Architecture

Prometheus consists of several key components: Prometheus Server, Pushgateway, Alertmanager, and the associated web UI.

Prometheus Server periodically pulls data from target nodes, found through service discovery or static configuration, and stores it. A target can be any exporter that exposes its data directly over an HTTP endpoint, or a Pushgateway dedicated to receiving pushed data.

Prometheus can also be configured with alerting rules: it evaluates them periodically and pushes alerts to the configured Alertmanager when the trigger conditions are met. The Alertmanager then groups, deduplicates, and silences the alerts as configured, and finally sends out the notifications.

Prometheus also supports collecting metrics from other Prometheus instances. If there are multiple data centers, a Prometheus instance can be deployed in each data center, and a central Prometheus Server can then aggregate the monitoring data from all data centers through federation.
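
A hedged sketch of what the central server's federation job might look like; the /federate endpoint and the match[] parameter are standard Prometheus features, while the host names below are placeholders.

    # Federation job on the central Prometheus Server (sketch; host names are placeholders)
    scrape_configs:
      - job_name: federate
        honor_labels: true            # keep the original job/instance labels from the data centers
        metrics_path: /federate
        params:
          'match[]':
            - '{job=~".+"}'           # selector deciding which series to pull from each data center
        static_configs:
          - targets:
              - prometheus-dc1:9090
              - prometheus-dc2:9090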

2.2 Main Modules

2.2.1 prometheus server

Internally, Prometheus discovers target nodes and scrapes their metrics via the Scrape Manager, then writes the samples through Fanout Storage, a storage proxy layer that forwards the data to the configured local or remote storage.

Local storage: a custom storage format is used to keep sample data on local disk. Local storage has limited reliability, so it is recommended only when data-persistence requirements are not high.

Remote Storage: For flexibility, Prometheus defines two standard interfaces (remote_write/remote_read) that allow users to store data to any third-party storage service based on these interfaces.
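
For illustration, a minimal sketch of how these interfaces are wired up in prometheus.yml; the endpoint URLs are placeholders for whatever third-party storage adapter is used.

    # prometheus.yml (sketch; the URLs are placeholders for a real storage adapter)
    remote_write:
      - url: http://remote-storage.example.com/api/v1/write
    remote_read:
      - url: http://remote-storage.example.com/api/v1/read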

The RuleManager periodically evaluates the configured rules. When a rule's condition is triggered, it writes the result back to storage and, for alerting rules, sends alert messages to the Alertmanager through the Notifier.
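
To make the RuleManager's role concrete, here is a minimal sketch of a rule file and how it is referenced; the rule name and file paths are illustrative, and the expression reuses the CPU metric from the earlier example.

    # prometheus.yml: which rule files the RuleManager evaluates
    rule_files:
      - /etc/prometheus/rules/*.yml

    # /etc/prometheus/rules/recording.yml: a recording rule whose result is written back to storage
    groups:
      - name: node-recording
        rules:
          - record: instance:node_cpu_user:rate30s
            expr: rate(node_cpu_seconds_total{mode="user"}[30s])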

2.2.2 exporter

Exporters are simple to deploy: there are many official and third-party exporters, and you can also write your own. They expose common sets of metrics such as database metrics, hardware metrics, and so on. See the official website for the full list.

2.2.3 alertmanager

2.2.3.1 overview

In the Prometheus architecture, alerting is split into two separate parts. Alarm rules (AlertRule) are defined in Prometheus Server; Prometheus evaluates them periodically and sends alarm information to the Alertmanager when the trigger conditions are met.

Components of an alarm rule:

  • Alarm name

    Each alarm rule needs a name that directly conveys what the alarm is about

  • Alarm rule

    The alarm condition itself is defined by a PromQL expression plus a duration: the alarm fires when the PromQL query keeps returning results for the configured period (the for clause). A complete example rule is shown at the end of this subsection.

    Alertmanager is an independent component that receives and processes alarm information from Prometheus Server (or any other client program). It can perform further operations on these alarms, such as deduplication, grouping, and routing. Alertmanager has built-in support for email, Slack, and other notification channels, and can reach other channels, such as Popo or SMS, via webhooks.

    As background, alarms in the Prometheus ecosystem are produced by evaluating alerting rules in Prometheus Server, which periodically executes each rule's PromQL expression and records the result as a synthetic time series ALERTS{alertname="<alert name>", <alert labels>} for subsequent processing. Prometheus Server only computes alarms; it does not deliver the notifications itself but hands the alarms to the Alertmanager, which sends them out. This is partly a matter of single responsibility, and partly because sending alarms well is genuinely not "simple" and deserves a dedicated system. Suffice it to say, the goal of Alertmanager is not simply to "send out alerts", but to "send out high-quality alerts".
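
A minimal alerting-rule sketch tying the pieces above together (name, PromQL condition, and duration); the alert name, threshold, and labels are illustrative assumptions, not rules taken from this article.

    # alerting rule file, loaded via rule_files (sketch; name and threshold are illustrative)
    groups:
      - name: node-alerts
        rules:
          - alert: HighUserCpu
            expr: avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[30s])) * 100 > 90
            for: 5m                   # the condition must hold for 5 minutes before the alarm fires
            labels:
              severity: warning
            annotations:
              summary: "User CPU above 90% on {{ $labels.instance }}"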

2.2.3.2 features

In addition to basic alarm notification capabilities, Alertmanager provides alarm features such as grouping, suppression, and silence:

  • Grouping: the grouping mechanism combines related alarms into a single notification

    The grouping mechanism combines detailed alarm information into a single notification. In some cases, such as a system outage, a large number of alarms are triggered at the same time; grouping combines them into one notification, so you are not flooded with individual notifications and unable to quickly locate the fault.

    For example, suppose a cluster runs hundreds of service instances and an alarm rule is set for each instance. If a network failure occurs, a large number of instances may fail to connect to the database, and hundreds of alarms would be sent to the Alertmanager. As a user, you probably only want a single notification showing which service instances are affected. In this case, you can group alarms by service cluster or alarm name so that they are collected into one notification.

    Alarm grouping, the timing of group notifications, and the receivers are all configured in the Alertmanager configuration file (see the routing sketch after this list).

  • Suppression: After an alarm is generated, you can stop sending other alarms caused by this alarm

    Suppression is a mechanism that stops notifications for other alarms that are caused by an alarm that has already fired.

    For example, if an alarm is generated when a cluster is inaccessible, you can configure Alertmanager to ignore all other alarms related to the cluster. In this way, you can avoid receiving a large number of alarm notifications unrelated to the actual problem. The suppression mechanism is also set in the Alertmanager configuration file.

  • Silence: provides a simple mechanism to quickly mute alarms based on labels; if an incoming alarm matches a silence, the Alertmanager does not send a notification for it

    Silences provide a simple mechanism for quickly muting alarms based on labels. If an incoming alarm matches the silence configuration, the Alertmanager sends no notification for it. Silences are set on the web page of the Alertmanager.
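
A hedged sketch of the corresponding alertmanager.yml routing section; the receiver name and email address are placeholders.

    # alertmanager.yml routing sketch (receiver details are placeholders)
    route:
      group_by: ['alertname', 'cluster']   # alarms sharing these labels are merged into one notification
      group_wait: 30s                      # wait before sending the first notification for a new group
      group_interval: 5m                   # wait before notifying about new alarms added to an existing group
      repeat_interval: 4h                  # wait before re-sending an unresolved notification
      receiver: default-email

    receivers:
      - name: default-email
        email_configs:
          - to: oncall@example.com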

2.2.3.3 architecture

  1. Starting from the top left, alarm rules are configured in Prometheus Server. By default these rules are evaluated every minute, and when the trigger conditions are met an alarm is generated and pushed to the Alertmanager through its API (the corresponding configuration is sketched after this list).
  2. The Alertmanager receives alarm records through the API, validates and transforms them, and stores the alarm information in the AlertProvider. Currently the Alertmanager keeps this information in memory; other storage back ends can be plugged in through the interface.
  3. The Dispatcher listens for new alarms and routes them to the corresponding groups according to the routing-tree configuration, so that alarms with the same attributes can be managed together (group_by defines the grouping rules based on the labels carried by the alarms). This can greatly reduce alarm storms.
  4. Each group periodically runs the Notification Pipeline at its configured interval.
  5. The Notification Pipeline executes the suppression logic, silence logic, waiting for data aggregation, deduplication, and the send-and-retry logic, which finally delivers the alarm. After a notification is sent, its record is synchronized to the other cluster members for subsequent silencing and deduplication across the cluster.
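
For step 1 above, a hedged sketch of the prometheus.yml section that tells the server where to push fired alarms; the Alertmanager host names are placeholders.

    # prometheus.yml: where Prometheus Server pushes fired alarms (host names are placeholders)
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager-1:9093
                - alertmanager-2:9093
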
  • Alarm inhibition

    This prevents users from receiving a large number of other alarm notifications caused by a certain problem. For example, when a cluster is unavailable, a user may want to receive only an alarm telling him that the cluster is faulty, rather than a large number of alarm notifications such as application exceptions in the cluster or middleware service exceptions.

    inhibit_rules:
    - source_match:
        alertname: NodeDown
        severity: critical
      target_match:
        severity: warning
      equal:
      - node

    For example, when a host node in the cluster goes down unexpectedly, the NodeDown alarm is triggered, and the alarm rule defines severity=critical for it. Because of the host outage, all services and middleware deployed on that host become unavailable and trigger their own alarms. According to the suppression rule above, any new alarm with severity=warning whose node label matches that of the NodeDown alarm is considered to be caused by NodeDown, so the suppression mechanism kicks in and stops sending those notifications to receivers.

  • Alarm silencing

    Users can temporarily mute specific alarm notifications through the web UI or the API. Silences are defined as label matching rules (exact strings or regular expressions); if a new alarm matches a silence, the Alertmanager stops sending notifications for it to the receivers.

2.2.4 pushgateway

The Pushgateway exists so that ephemeral and batch jobs can expose their metrics to Prometheus. Because the lifetime of such jobs may be too short for Prometheus to scrape them, they instead push their metrics to the Pushgateway, which then exposes them for Prometheus to scrape.
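
A hedged sketch of the scrape job that pulls from the Pushgateway; honor_labels: true keeps the job/instance labels that the batch jobs pushed, and the host name is a placeholder.

    # prometheus.yml: scrape the Pushgateway itself (sketch; the host name is a placeholder)
    scrape_configs:
      - job_name: pushgateway
        honor_labels: true            # keep the job/instance labels pushed by the batch jobs
        static_configs:
          - targets:
              - pushgateway.example.com:9091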

The main reasons for using it are:

  • Prometheus uses a pull model; when Prometheus and the targets are not in the same subnet or are separated by a firewall, Prometheus cannot pull the target data directly.
  • When monitoring business data, metrics from different sources need to be aggregated and then collected by Prometheus in one place.
  • Collecting data from short-lived, temporary jobs.

Disadvantages:

  • If the Pushgateway goes down, the impact is larger than losing a single target, since multiple jobs push through it.
  • The up status that Prometheus obtains from pulling reflects only the Pushgateway itself; it cannot indicate the health of every pushing node.
  • The Pushgateway keeps all monitoring data that has been pushed to it.

As a result, Prometheus keeps pulling old monitoring data even after the monitored job has gone offline, so stale data has to be cleaned up from the Pushgateway manually.

3. How to integrate

See the follow-up article: juejin.cn/post/694340…