This is the first day of my participation in Gwen Challenge

1 Introduction and architecture of Prometheus

Prometheus is an open-source monitoring and alerting solution originally developed at SoundCloud. The architecture diagram is as follows:

As shown above, Prometheus consists of the following components:

  • Prometheus Server: scrapes and stores time series data
  • Exporters: plug-ins that expose metrics over HTTP for the Prometheus Server to pull
  • Pushgateway: a plug-in that accepts metrics pushed by short-lived jobs and exposes them for the Prometheus Server to pull
  • Alertmanager: the module that sends alert notifications
  • Prometheus Web UI: the interface for data display and alert viewing, which can be combined with Grafana

Prometheus itself follows a client/server (C/S) model: the server starts as a process and handles monitoring data collection, computation, querying, updating, and storage concurrently.

2 Installation

# Create the configuration file
[root@VM-10-48-centos ~]# mkdir -p /data/config/prometheus && cd /data/config/prometheus
[root@VM-10-48-centos prometheus]# vim prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
  - job_name: linux
    static_configs:
      - targets: ['10.1.10.48:9100']
        labels:
          instance: localhost
  - job_name: 'Nginx'
    static_configs:
      - targets: ['10.1.10.48:9145']

Start the containers (set up the Docker environment and a registry mirror in advance):

docker run -it --name prometheus -d -p 9090:9090 -v /data/config/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
docker run -d -p 9100:9100 -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro"   --net="host" prom/node-exporter

Visit http://localhost:9100/metrics to see all the monitoring data that node exporter collects from the current host, as shown below:

3 PromQL

3.1 What is PromQL

PromQL (Prometheus Query Language) is the query language built into Prometheus. It supports querying, aggregating, and applying logical operations to time series data, and it is used throughout everyday Prometheus work, including data queries, visualization, and alerting.

In short, in a monitoring system built around Prometheus, PromQL is used wherever data needs to be filtered, for example when defining monitoring metrics and alert conditions.

3.2 Basic usage of PromQL

After Prometheus collects sample data through its exporters, we can query that data with PromQL.

You can query all time series of a metric by its name. Start the Prometheus server and open http://localhost:9090/graph. In the query box, enter prometheus_http_requests_total and click Execute.

The result lists every time series whose metric name is prometheus_http_requests_total.

PromQL lets users filter time series by matching on their labels. It supports two kinds of matching: exact matching and regular expression matching.

3.3 Exact Matching

PromQL supports two exact matching operators, = and !=.

  • Equal. label=value selects the time series whose label has exactly that value.
  • Not equal. label!=value excludes the time series whose label has that value.

For example, the query above returned all prometheus_http_requests_total data. Suppose we only want to look at failed requests, that is, keep only the series whose code label is not 200. The expression is prometheus_http_requests_total{code!="200"}.

As the figure above shows, the result contains only the series whose code is not 200.

3.4 Regular Expression Matching

PromQL can also use regular expressions as matching conditions, and several matching conditions can be combined.

  • Positive matching. label=~regex selects the time series whose label value matches the regular expression.
  • Negative matching. label!~regex excludes the time series whose label value matches the regular expression.

For example, to query the prometheus_http_requests_total series whose handler starts with /api/v1, use prometheus_http_requests_total{handler=~"/api/v1/.*"}.
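The four label matchers can be modelled in a few lines of Python. This is only a sketch of the semantics, not how Prometheus implements them: the functions `matches` and `select` and the sample label sets are made up for illustration, and Python's `re` module stands in for the RE2 syntax Prometheus uses. The one detail worth noticing is that regular expression matchers are fully anchored.

```python
import re

def matches(labels, name, op, value):
    """Return True if a series' labels satisfy one PromQL-style matcher."""
    actual = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    # Regex matchers are fully anchored: the pattern must match the whole value.
    hit = re.fullmatch(value, actual) is not None
    if op == "=~":
        return hit
    if op == "!~":
        return not hit
    raise ValueError(f"unknown matcher operator: {op}")

def select(series, *matchers):
    """Keep only the series whose labels satisfy every matcher."""
    return [s for s in series if all(matches(s, *m) for m in matchers)]

# Hypothetical label sets of three prometheus_http_requests_total series.
series = [
    {"handler": "/api/v1/query", "code": "200"},
    {"handler": "/api/v1/query", "code": "400"},
    {"handler": "/metrics", "code": "200"},
]

errors = select(series, ("code", "!=", "200"))               # like {code!="200"}
api_calls = select(series, ("handler", "=~", "/api/v1/.*"))  # like {handler=~"/api/v1/.*"}
```

Because of the anchoring, a pattern such as /api on its own would not match /api/v1/query; the trailing .* is what makes it a prefix match.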

3.5 Range Query

When you query with an expression like prometheus_http_requests_total, only the most recent sample is returned for each time series. Such an expression is called an instant vector expression, and its result is called an instant vector (also rendered as instantaneous vector).

To query the samples within a period of time, use a range vector expression; its result is called a range vector (also rendered as interval vector). The time range is defined by the range selector []. For example, the following expression selects all samples from the last 5 minutes:

prometheus_http_requests_total{}[5m]

As the query result shows, each series now returns all of its samples from that period rather than a single value.

In addition to using m for minutes, PromQL’s time range selector supports other units of time:

s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years
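To make the units concrete, the sketch below converts a single-unit duration into seconds. The helper name `duration_seconds` is hypothetical, and the sketch only handles one number followed by one unit; it relies on the fixed unit lengths PromQL assumes, where a day is always 24 hours, a week 7 days, and a year 365 days.

```python
import re

# Seconds per unit, using the fixed lengths PromQL assumes (a year is 365 days).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def duration_seconds(text):
    """Convert a single-unit duration such as '5m' or '1d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]

five_minutes = duration_seconds("5m")  # the range used in [5m]
```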

3.6 Time offset operation

Both instant vector expressions and range vector expressions are evaluated relative to the current time:

prometheus_http_requests_total{}      # instant vector: the latest sample
prometheus_http_requests_total{}[5m]  # range vector: the samples from the last 5 minutes

What if you want the instant sample from 5 minutes ago, or the samples from yesterday? For that, use the offset operation, whose keyword is offset.

# Latest sample as of 5 minutes ago
http_request_total{} offset 5m
# One day of samples, shifted one day into the past
http_request_total{}[1d] offset 1d
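The effect of offset on the time window can be worked out by hand. The sketch below does the arithmetic, assuming a hypothetical helper `query_window` and a made-up evaluation time: the offset first moves the evaluation time into the past, and the range selector then reaches back from there.

```python
from datetime import datetime, timedelta

def query_window(now, range_=timedelta(0), offset=timedelta(0)):
    """Return the (start, end) of the period a selector looks at."""
    end = now - offset    # offset moves the evaluation time into the past
    start = end - range_  # the range selector reaches back from there
    return start, end

now = datetime(2020, 10, 7)  # hypothetical evaluation time

# Like http_request_total{}[1d] offset 1d
start, end = query_window(now, range_=timedelta(days=1), offset=timedelta(days=1))
```

With this evaluation time, the window runs from 2020-10-05 00:00 to 2020-10-06 00:00, so the query reads a full day of samples that ended one day ago.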

3.7 Aggregation Operations

In general, the data we query with PromQL is quite large. PromQL provides aggregation operations that can be used to process these time series to form a new time series.

Take prometheus_http_requests_total{code!="200"} as an example. Without any aggregation, the query returns:

As can be seen from the figure above, there are three pieces of data, and the value of the three pieces of data is: 26

To count these series, use count(prometheus_http_requests_total{code!="200"}).

To add up their values, use sum(prometheus_http_requests_total{code!="200"}).

3.8 Scalars

In PromQL, a scalar is a single floating-point value with no time series attached. For example: 26.

Note that an expression such as count(http_requests_total) still returns an instant vector. A single-element instant vector can be converted to a scalar with the built-in function scalar(); if the vector does not contain exactly one element, scalar() returns NaN.

As shown above, wrapping the sum operation in scalar() turns the result into a scalar.
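The difference between a one-element instant vector and a scalar can be sketched as follows. The instant vector is modelled as a plain dictionary from label set to value, and the three series and their values are invented for the example; only the rule itself (one element gives its value, anything else gives NaN) comes from PromQL.

```python
import math

def scalar(instant_vector):
    """Mimic scalar(): a one-element vector becomes its value, anything else NaN."""
    if len(instant_vector) != 1:
        return math.nan
    (value,) = instant_vector.values()
    return float(value)

# An instant vector modelled as {label set: value}; the label sets are made up.
requests = {("code", "400"): 10.0, ("code", "404"): 9.0, ("code", "503"): 7.0}

# sum(...) collapses the series but still yields a vector with one element.
total = {(): sum(requests.values())}

as_scalar = scalar(total)         # a plain float
not_a_scalar = scalar(requests)   # NaN, because there are three elements
```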

4 PromQL operators

PromQL also supports a rich set of operators for further processing time series. They include mathematical operators, logical (set) operators, Boolean (comparison) operators, and so on.

4.1 Mathematical operators

Mathematical operators are simple: addition, subtraction, multiplication, division, and so on.

For example, prometheus_http_response_size_bytes_sum returns the total size of the HTTP responses served by Prometheus. That value is in bytes; to display it in MB, divide by 1024 twice: prometheus_http_response_size_bytes_sum / 1024 / 1024

The resulting data is displayed in MB.

All of the mathematical operators supported by PromQL are as follows:

  • + (addition)
  • - (subtraction)
  • * (multiplication)
  • / (division)
  • % (remainder)
  • ^ (power operation)

4.2 Boolean operators

Boolean operators allow users to filter time series based on the values of samples in the time series. For example, prometheus_http_requests_total can be used to query the number of requests per interface, but what if we want to keep only the interfaces with more than 100 requests?

At this point we can use the following PromQL expression:

prometheus_http_requests_total > 100

As you can see, only the series whose value is greater than 100 remain. The labels of each series show which interface it belongs to.

As the figure above shows, each value is still the actual sample value. Suppose instead we want the value to become 1 for series that satisfy the condition and 0 for series that do not. For that we can use the bool modifier.

We use the following PromQL expression:

prometheus_http_requests_total > bool 100

As the result shows, no series is filtered out; each value simply becomes 1 or 0.
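The two behaviours of the comparison operator can be put side by side in a small sketch. The function `greater_than` and the sample values are assumptions made for illustration, and series are identified only by a handler string to keep the example short.

```python
def greater_than(vector, threshold, bool_modifier=False):
    """Model `vector > threshold` and `vector > bool threshold`."""
    if bool_modifier:
        # Every series is kept; its value becomes 1 or 0.
        return {labels: float(value > threshold) for labels, value in vector.items()}
    # Without bool, series that fail the comparison are dropped.
    return {labels: value for labels, value in vector.items() if value > threshold}

# Hypothetical request counts per handler.
vector = {"/api/v1/query": 250.0, "/metrics": 120.0, "/graph": 3.0}

filtered = greater_than(vector, 100)                    # like ... > 100
flagged = greater_than(vector, 100, bool_modifier=True) # like ... > bool 100
```

The filtering form is the usual choice for alert conditions, while the bool form is handy when the 0/1 result feeds into further arithmetic.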

Currently, Prometheus supports the following Boolean operators:

  • == (equal)
  • != (not equal)
  • > (greater than)
  • < (less than)
  • >= (greater than or equal to)
  • <= (less than or equal to)

4.3 Set operators

Set operators combine two instant vectors. Currently, Prometheus supports the following set operators:

  • and (intersection)
  • or (union)
  • unless (exclusion)

4.3.1 and operation

vector1 and vector2 produces a new set that contains the elements of vector1 that are also present in vector2.

For example, if vector1 is A, B, C and vector2 is B, C, D, then the result of vector1 and vector2 is: B, C.

4.3.2 or operation

vector1 or vector2 produces a new set that contains all elements of vector1, plus the elements of vector2 that are not already in vector1.

For example, if vector1 is A, B, C and vector2 is B, C, D, then the result of vector1 or vector2 is: A, B, C, D.

4.3.3 unless operation

vector1 unless vector2 produces a new set that takes all elements of vector1 and then excludes those that are also present in vector2.

For example, if vector1 is A, B, C and vector2 is B, C, D, then the result of vector1 unless vector2 is: A.
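The A, B, C and B, C, D examples can be replayed in Python. In this sketch each instant vector is a dictionary keyed by a single letter that stands in for the full label set, and the values are made up; the function names are hypothetical. Elements are matched by key only, and where both sides have the same key the value from vector1 is the one that is kept.

```python
def vector_and(v1, v2):
    """Elements of v1 whose label set also appears in v2."""
    return {k: v for k, v in v1.items() if k in v2}

def vector_or(v1, v2):
    """All elements of v1, plus the elements of v2 missing from v1."""
    result = dict(v1)
    for k, v in v2.items():
        result.setdefault(k, v)
    return result

def vector_unless(v1, v2):
    """Elements of v1 whose label set does not appear in v2."""
    return {k: v for k, v in v1.items() if k not in v2}

# The letters stand in for label sets; the values are arbitrary.
vector1 = {"A": 1.0, "B": 2.0, "C": 3.0}
vector2 = {"B": 20.0, "C": 30.0, "D": 40.0}

both = vector_and(vector1, vector2)
either = vector_or(vector1, vector2)
only_first = vector_unless(vector1, vector2)
```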

5 PromQL aggregation operations

Prometheus also provides aggregation operators that act on instant vectors. They aggregate the samples returned by an instant vector expression into a new time series. The currently supported aggregation operators are:

  • sum (sum)
  • min (minimum)
  • max (maximum)
  • avg (average)
  • stddev (standard deviation)
  • stdvar (standard variance)
  • count (number of series)
  • count_values (number of series with each value)
  • bottomk (the k series with the smallest values)
  • topk (the k series with the largest values)
  • quantile (quantile of the values)

5.1 sum

Adds up the values of all records.

For example, sum(prometheus_http_requests_total) returns the total number of HTTP requests across all series.

5.2 min

Returns the minimum value of all records.

The figure shows all data of the prometheus_http_requests_total metric:

When we run the following PromQL, the smallest value is returned.

min(prometheus_http_requests_total)

5.3 max

Returns the maximum value of all records.

When we run the following PromQL, the largest value is returned.

max(prometheus_http_requests_total)

5.4 avg

The avg function returns the average of all records.

When we run the following PromQL, the average value is returned.

avg(prometheus_http_requests_total)

5.5 count

The count function returns the number of records (that is, how many time series there are).

For example, count(prometheus_http_requests_total) returns the number of time series recorded for HTTP requests.

5.6 bottomk

bottomk sorts the samples by value and returns the k time series with the smallest values.

For example, to get the 5 HTTP request series with the smallest values, use the expression:

bottomk(5, prometheus_http_requests_total)

5.7 topk

topk sorts the samples by value and returns the k time series with the largest values.

For example, to get the 5 HTTP request series with the largest values, use the expression:

topk(5, prometheus_http_requests_total)
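The aggregation operators in this section can be tried out on a small, invented instant vector. In the sketch below the handler names and counts are made up, and `topk` and `bottomk` are plain Python functions that only imitate the PromQL operators. One difference is visible in the result types: sum, min, max, avg, and count reduce the vector to a single number, while topk and bottomk return some of the original series unchanged.

```python
# Hypothetical request counts per handler.
samples = {"/api/v1/query": 40.0, "/metrics": 25.0, "/graph": 10.0, "/-/ready": 5.0}

def topk(k, vector):
    """The k series with the largest values, keeping their identity."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

def bottomk(k, vector):
    """The k series with the smallest values, keeping their identity."""
    ranked = sorted(vector.items(), key=lambda item: item[1])
    return dict(ranked[:k])

values = list(samples.values())
summary = {
    "sum": sum(values),
    "min": min(values),
    "max": max(values),
    "avg": sum(values) / len(values),
    "count": len(values),
}

busiest = topk(2, samples)
quietest = bottomk(2, samples)
```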

6 PromQL built-in functions

PromQL provides a number of built-in functions for processing time series data. For example, the irate() function calculates the growth rate of a metric so that we do not have to compute it by hand.

6.1 rate: growth rate

Counter metrics only increase and never decrease: unless they are reset, their sample values keep growing. To see how a counter changes over a period, we need to calculate its growth rate.

increase(v range-vector) is one of the built-in functions of PromQL. The parameter v is a range vector; increase takes the first and last samples of the range vector and returns the growth between them. We can therefore calculate the growth rate of a counter metric with the following expression:

increase(node_cpu[2m]) / 120

Here, node_cpu[2m] selects all samples of the time series from the last two minutes. increase calculates the growth over those two minutes, and dividing by 120 seconds gives the average per-second growth rate of node_cpu over the last two minutes. This value also approximates the average CPU usage of the host in that period.

Besides increase, PromQL has a built-in rate(v range-vector) function that directly calculates the average growth rate of the range vector v over the time window. The following expression therefore gives the same result as the increase expression above:

rate(node_cpu[2m])

Note that the average growth rate calculated by rate or increase is prone to the “long tail problem”: it cannot reflect sudden changes of the sample data within the time window.

For example, in a 2-minute window for a host, there may be a situation where the CPU usage is 100% due to traffic or other problems, but the problem is not reflected by calculating the average growth rate over the time window.

To solve this problem, PromQL provides a more sensitive function, irate(v range-vector). irate also calculates the growth rate of a range vector, but it reflects the instantaneous growth rate: it uses only the last two samples in the range vector.

This avoids the “long tail problem” within the time window and gives better sensitivity. Graphs drawn with irate better reflect the instantaneous changes of the sample data.

irate(node_cpu[2m])

irate is more sensitive than rate, but that sensitivity becomes noise when analyzing long-term trends or evaluating alert rules. For long-term trend analysis and alerting, the rate function is therefore recommended.
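The difference between the three functions shows up clearly on a counter that stays almost flat and then jumps. The sketch below is a simplified model with invented sample data: it works on a list of (timestamp, value) pairs and ignores two things the real functions handle, namely counter resets and extrapolation to the edges of the time window.

```python
def increase(samples):
    """Growth between the first and last sample of a window of (timestamp, value) pairs."""
    return samples[-1][1] - samples[0][1]

def rate(samples):
    """Average per-second growth across the whole window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

def irate(samples):
    """Per-second growth between the last two samples only."""
    (t1, v1), (t2, v2) = samples[-2:]
    return (v2 - v1) / (t2 - t1)

# Hypothetical counter scraped every 30 s: almost flat, then a burst at the end.
window = [(0, 100.0), (30, 101.0), (60, 102.0), (90, 103.0), (120, 133.0)]

average_rate = rate(window)   # the burst is averaged over two minutes
latest_rate = irate(window)   # only the last 30 seconds count
```

On this data the burst in the last 30 seconds dominates irate, while rate spreads it over the full two minutes, which is the smoothing effect described above.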

6.2 predict_linear: growth prediction

System administrators commonly set alert thresholds on server resources to ensure service continuity, for example sending an alert when only 512MB of disk space remains. This threshold-based approach works well when resource usage grows smoothly.

But what if resources don’t change smoothly? For example, when some services grow, the growth rate of storage space increases several times. In this case, if an alarm is triggered based on the original threshold, the system may become unavailable before the system administrator can handle the fault.

A fixed threshold is therefore not enough: it has to be adjusted periodically to stay effective. Is there a better way?

The predict_linear(v range-vector, t scalar) function built into PromQL handles such situations better. predict_linear predicts the value of the time series v after t seconds.

It uses simple linear regression over the samples in the time window to predict the trend of the time series. For example, to predict from 2 hours of samples whether the available disk space of the host will run out within 4 hours, use the following expression:

predict_linear(node_filesystem_free{job="node"}[2h], 4 * 3600) < 0
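The idea behind the prediction is ordinary least-squares regression, which can be sketched in plain Python. This is an illustration under stated assumptions rather than the Prometheus implementation: the function below extrapolates from the timestamp of the last sample, whereas Prometheus predicts relative to the evaluation time, and the free-space readings are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) pairs and extrapolate it."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)

# Hypothetical free-space readings in bytes, one every 30 minutes over 2 hours:
# the disk starts at 10 GiB free and loses 1 GiB per reading.
GIB = 1024 ** 3
history = [(i * 1800, (10 - i) * GIB) for i in range(5)]

in_four_hours = predict_linear(history, 4 * 3600)
will_fill_up = in_four_hours < 0  # the same test as `predict_linear(...) < 0`
```

A negative prediction means the fitted line crosses zero before the four hours are over, so the alert condition fires even though 6 GiB are still free at the time of the last reading.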

7 Metrics

In Prometheus, all of our information is stored in the form of metrics.

A metric consists of a metric name and a set of labels.

<metric name>{<label name>=<label value>, ...}

For example, in the following expression api_http_requests_total is the metric name, and method and handler are label names. The metric name plus its labels identifies one complete metric.

api_http_requests_total{method="POST", handler="/messages"}
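To show how the name and the labels fit together, the sketch below splits such an identifier into its two parts. `parse_metric` is a hypothetical helper written for this article, not part of any Prometheus client library, and it deliberately ignores details such as escaped quotes inside label values.

```python
import re

def parse_metric(text):
    """Split 'name{label="value", ...}' into the metric name and a label dict."""
    match = re.fullmatch(r"\s*([a-zA-Z_:][a-zA-Z0-9_:]*)\s*(?:\{(.*)\})?\s*", text)
    if match is None:
        raise ValueError(f"not a metric identifier: {text!r}")
    name, body = match.groups()
    # Collect every label="value" pair found between the braces.
    labels = dict(re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"([^"]*)"', body or ""))
    return name, labels

name, labels = parse_metric('api_http_requests_total{method="POST", handler="/messages"}')
```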

7.1 Metric types

Prometheus has four main metric types, each suited to a different kind of measurement.

  • Counter
  • Gauge
  • Histogram
  • Summary

Counter

The value starts accumulating from zero and, ideally, only grows or stays the same.

This suits values such as machine uptime and the number of HTTP requests.

Gauge

A gauge records the current value, whatever it is. The value can go up or down.

This suits values such as disk capacity and CPU and memory usage.

Histogram

Counters and gauges report single values, while a histogram shows how the values are distributed.

Histogram charts reflect the interval distribution of samples and are often used to represent information such as request duration and response size.

Let’s say we have 1000 HTTP requests per minute, and we want to know how long most of them take. The average duration may be misleading here, but a histogram shows the time range in which most of these requests fall.

Histogram and Summary

The easiest way to tell average slowness from long-tail slowness is to group requests by latency range, for example by counting how many requests take 0 to 10 ms and how many take 10 to 20 ms. This makes it quick to analyze why a system is slow. Both Histogram and Summary are designed for this kind of problem: their metrics let us quickly understand the distribution of the monitored samples.

A Histogram metric directly reports the number of samples in each range; the ranges are defined by the label le. A Summary reports quantiles of the samples, such as the median.
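How a histogram groups samples by the le label can be sketched with a handful of invented request durations. The bucket bounds and the function `observe_all` are assumptions made for this example. Each bucket counts the samples less than or equal to its upper bound, so the counts are cumulative and the last bucket, with an infinite bound, always equals the total number of samples.

```python
from bisect import bisect_left

# Hypothetical `le` upper bounds in seconds; the last bucket catches everything.
BOUNDS = [0.01, 0.02, 0.05, 0.1, float("inf")]

def observe_all(durations, bounds=BOUNDS):
    """Return cumulative bucket counts keyed by upper bound."""
    counts = [0] * len(bounds)
    for d in durations:
        counts[bisect_left(bounds, d)] += 1  # first bucket whose bound is >= d
    cumulative, running = {}, 0
    for bound, c in zip(bounds, counts):
        running += c
        cumulative[bound] = running
    return cumulative

# Five made-up request durations in seconds.
buckets = observe_all([0.004, 0.008, 0.015, 0.03, 0.2])
```

Reading the result, two of the five requests finished within 10 ms and four within 100 ms, which is exactly the kind of distribution information an average would hide.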

In this article we introduced several key Prometheus concepts:

  • Metrics
  • Metric types
  • Jobs and instances

The relationship between them is shown below:

A Job can have multiple instances, and an Instance can have multiple metrics. A Metric can have only one Metric Type.