Monitor hosts and containers
Node Exporter monitoring
node_exporter is an exporter that collects various host metrics, including CPU, memory, and disk data. It is installed on each node to be monitored.
Download and install
Download the package from prometheus.io/download/#n…, decompress it, and install it on the node.
Port number
The default port is 9100. The listen address and the metrics path can be overridden:
node_exporter --web.listen-address=":9600" --web.telemetry-path="/node_metrics"
Collector list
node_exporter --no-collector.arp
This disables the arp collector.
The full list of collectors is available at github.com/prometheus/…
Textfile collector
The textfile collector exposes custom metrics that are written to files. For example, the file /var/lib/node_exporter/textfile_collector/metadata.prom contains:
metadata{role="docker_server",datacenter="NJ"} 1
Specify the directory with the flag:
--collector.textfile.directory
Systemd collector
- docker.service: the Docker daemon
- ssh.service: the SSH daemon
- rsyslog.service: the rsyslog daemon
node_exporter --collector.textfile.directory /var/lib/node_exporter/textfile_collector --collector.systemd --collector.systemd.unit-whitelist="(docker|ssh|rsyslog).service"
Scraping
9100 is the port number of node_exporter.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['ip1:9100', 'ip2:9100']
Filtering collectors
The collect[] parameter restricts a scrape to the listed collectors.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['ip1:9100', 'ip2:9100']
    params:
      collect[]:
        - cpu
        - meminfo
Monitor Docker containers
cAdvisor is started with docker run. It listens on port 8080, and its UI is at /containers.
docker run \
...
--publish=8080:8080 \
google/cadvisor:latest
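Once cAdvisor is running, its metrics can be scraped like any other target. The fragment below is a minimal sketch: the job name 'docker' and the ip1/ip2 hosts are placeholders rather than values from these notes, and the port is the 8080 published above.

```yaml
scrape_configs:
  - job_name: 'docker'            # hypothetical job name
    static_configs:
      # cAdvisor serves its metrics on the published port 8080
      - targets: ['ip1:8080', 'ip2:8080']
```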
Scrape lifecycle
Service discovery -> configuration -> relabeling -> scrape -> relabeling again (I still don't understand why relabeling happens twice)
Labels
Changing or adding a label creates a new time series.
Label types
- Topological labels
  - job: the type of thing being monitored
  - instance: the IP address and port of the target
- Schematic labels
  - url, error_code, user
Relabeling
The easiest way to remember the two phases: relabel_configs is applied before scraping, and metric_relabel_configs is applied after scraping. Relabeling can be used to:
- Drop unnecessary metrics
- Remove sensitive or unwanted labels from metrics
- Add, edit, or modify the label value or format of a metric
Dropping metrics
Use action: drop.
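A minimal sketch of dropping metrics after the scrape, using the 'node' job from earlier. The regex is only an illustrative choice of metric names to discard, not one taken from these notes.

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['ip1:9100', 'ip2:9100']
    metric_relabel_configs:
      # __name__ holds the metric name; samples whose name matches are discarded
      - source_labels: [__name__]
        regex: 'node_scrape_collector_.*'   # illustrative pattern (assumption)
        action: drop
```

Because the rule sits under metric_relabel_configs, the metrics are still scraped but are discarded before they are stored.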
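Replacing label values
The default action is replace: if no action is specified, Prometheus assumes that you want to replace. A replace rule writes the value of replacement, by default $1 (the first capture group of regex), into the target label. Separately, honor_labels is false by default, so when a scraped label conflicts with a label Prometheus attaches itself, Prometheus renames the scraped label by adding the exported_ prefix.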
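As a hedged illustration of a replace rule, the fragment below copies part of one label into a new label. The source label id, the path pattern, and the target label container_id are assumptions chosen for the example (modeled on the kind of labels cAdvisor exposes), not values from these notes.

```yaml
metric_relabel_configs:
  - source_labels: [id]              # assumed source label
    regex: '/docker/([a-z0-9]+)'     # fully anchored; group 1 captures the container ID
    target_label: container_id       # hypothetical new label
    replacement: '$1'                # the default value, written out for clarity
    # no action is given, so the default (replace) applies
```

If the regex does not match, the rule leaves the series unchanged.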
Dropping labels
Use action: labeldrop. Labels are what uniquely identify a time series, so if dropping labels makes two time series identical, the system may have problems!
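A minimal labeldrop sketch. The label name kernelVersion is a hypothetical example of a label that might be unwanted.

```yaml
metric_relabel_configs:
  # labeldrop matches the regex against label names, not label values,
  # so no source_labels entry is needed
  - regex: 'kernelVersion'   # hypothetical label name
    action: labeldrop
```

Before applying a rule like this, check that the remaining labels still distinguish every series.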
The USE method
CPU utilization
100 - avg(irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) by (instance) * 100
CPU saturation
It is usually normal for the load average to be less than the number of CPUs; exceeding that number for an extended period indicates CPU saturation.
- Number of CPUs:
count by (instance) (node_cpu_seconds_total{mode="idle"})
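- node_load:
node_load1, node_load5, and node_load15 show the 1-, 5-, and 15-minute load averages. The query below uses the 1-minute load average, node_load1, and matches hosts where it exceeds twice the number of CPUs.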
node_load1 > on (instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})
Memory usage
Unit: bytes
- node_memory_MemTotal_bytes: total memory on the host
- node_memory_MemFree_bytes: free memory on the host
- node_memory_Buffers_bytes: memory in the buffer cache
- node_memory_Cached_bytes: memory in the page cache
The last three add up to the total available memory. Subtracting them from the total and dividing by the total gives the percentage of memory in use:
(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100
Memory saturation
1024 * sum by (instance) (
(rate(node_vmstat_pswpin[1m]) + rate(node_vmstat_pswpout[1m]))
)
node_exporter collects these two counters from /proc/vmstat; they accumulate since the last boot. The values are treated here as KB, so the query multiplies the per-second rate by 1024 to get bytes per second.
- node_vmstat_pswpin: data read from disk into memory (swapped in)
- node_vmstat_pswpout: data written from memory to disk (swapped out)
Disk usage
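For disks, we only measure disk space usage, not utilization, saturation, or errors.
(1 - node_filesystem_free_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) * 100
predict_linear can also flag filesystems that are projected to fill up. The query below matches filesystems expected to run out of free space within four hours, based on the last hour of data.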
predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4*3600) < 0
Service status
node_systemd_unit_state
The metadata metric
node_systemd_unit_state{name="docker.service"} == 1
and on (instance, job)
metadata{datacenter="SF"}
Persisting queries
Rules are loaded from the files listed under rule_files and are evaluated at the interval set by evaluation_interval.
rule_files:
  - "rules/node_rules.yml"
- Recording rules: create a new metric from a query.
  They generate aggregates across multiple time series and precompute expensive queries.
- Alerting rules: generate alerts from queries
- Visualization: visualize queries using dashboards such as Grafana
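To make the first two options concrete, here is a hedged sketch of what rules/node_rules.yml might contain. It reuses the CPU utilization and disk prediction queries from earlier in these notes; the group name, the recorded metric name, the alert name, the for duration, and the severity label are all illustrative assumptions.

```yaml
groups:
  - name: node_rules                        # hypothetical group name
    rules:
      # Recording rule: stores the CPU utilization query as a new metric
      - record: instance:node_cpu:avg_rate5m   # hypothetical metric name
        expr: 100 - avg(irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) by (instance) * 100
      # Alerting rule: fires when a filesystem is predicted to fill within 4 hours
      - alert: DiskWillFillIn4Hours            # hypothetical alert name
        expr: predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4*3600) < 0
        for: 5m                                # assumed hold time before firing
        labels:
          severity: critical                   # assumed label
```

Once recorded, the new metric can be queried by name instead of re-running the full expression each time.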