Monitor hosts and containers

Monitoring with node_exporter

node_exporter is an exporter that collects a variety of host metrics, including CPU, memory, and disk data. It is installed on each host (node) to be monitored.

Download and install

Download the package from prometheus.io/download/#n…, decompress it, and install it on the node.

Port number

The default port is 9100. To listen on a different port and metrics path:

node_exporter --web.listen-address=":9600" --web.telemetry-path="/node_metrics"

Collectors

node_exporter --no-collector.arp

This disables the arp collector.

The full list of collectors is available at github.com/prometheus/…

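Textfile collector

The textfile collector exposes metrics read from files, for example a metadata metric that describes the host:

metadata{role="docker_server",datacenter="NJ"} 1

Write the metric to a file such as /var/lib/node_exporter/textfile_collector/metadata.prom, and point node_exporter at that directory with:

--collector.textfile.directory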

Systemd collector

The example below monitors three units:

  • docker.service: the Docker daemon
  • ssh.service: the SSH daemon
  • rsyslog.service: the rsyslog daemon

node_exporter --collector.textfile.directory /var/lib/node_exporter/textfile_collector --collector.systemd --collector.systemd.unit-whitelist="(docker|ssh|rsyslog).service"

Scraping

Add a job to the Prometheus configuration that scrapes node_exporter on port 9100:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['ip1:9100', 'ip2:9100']

Filtering collectors

To scrape only selected collectors, list them in the collect[] parameter of the job:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['ip1:9100', 'ip2:9100']
    params:
      collect[]:
        - cpu
        - meminfo

Monitor Docker containers

cAdvisor runs as a container started with docker run. It listens on port 8080, and its web UI is at /containers.

docker run \
...
--publish=8080:8080 \
google/cadvisor:latest
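
To have Prometheus collect the container metrics, cAdvisor needs its own scrape job. The following is a minimal sketch, not a definitive configuration: the job name 'docker' and the hosts ip1 and ip2 are assumptions for illustration, and 8080 is the port published by the docker run command above.

```yaml
scrape_configs:
  - job_name: 'docker'            # hypothetical job name
    static_configs:
      # Assumed hosts; cAdvisor serves its metrics on the published port 8080.
      - targets: ['ip1:8080', 'ip2:8080']
```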

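Scrape life cycle

Service discovery -> configuration -> relabeling -> scrape -> metric relabeling

Relabeling appears twice because it runs in two phases: the first pass rewrites the labels of the targets before the scrape, and the second pass rewrites the scraped samples before they are stored.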

Labels

Changing or adding a label creates a new time series.

Label types

  • Topological labels
  1. job: the type of thing being monitored
  2. instance: the IP address and port of the target
  • Schematic labels, such as url, error_code, and user

Relabeling

The easiest way to remember the two phases: relabel_configs is applied before the scrape, and metric_relabel_configs is applied after it. Relabeling can be used to:

  • Drop unnecessary metrics
  • Remove sensitive or unwanted labels from metrics
  • Add, edit, or modify the label value or format of a metric
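
As a rough sketch of where the two phases live in a job definition, the fragment below places both keys under the 'node' job. The targets and both regexes are illustrative assumptions, not part of the original configuration.

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['ip1:9100', 'ip2:9100']
    # Phase 1: applied to the targets before the scrape.
    relabel_configs:
      - source_labels: [__address__]
        regex: 'ip1:9100'         # assumed address, for illustration only
        action: keep              # scrape only targets whose address matches
    # Phase 2: applied to the scraped samples before they are stored.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_.*'          # assumed pattern
        action: keep              # store only metrics whose name matches
```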

Dropping metrics

Use action: drop to discard metrics whose source labels match the regex.
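
A drop rule might look like the following sketch. It assumes a cAdvisor job named 'docker' on a hypothetical host ip1, and it picks two cAdvisor metric names purely as examples of metrics one might not need.

```yaml
scrape_configs:
  - job_name: 'docker'            # hypothetical job name
    static_configs:
      - targets: ['ip1:8080']     # assumed cAdvisor address
    metric_relabel_configs:
      # __name__ holds the metric name; matching samples are discarded.
      - source_labels: [__name__]
        regex: '(container_tasks_state|container_memory_failures_total)'
        action: drop
```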

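Replacing label values

No action needs to be specified for a replace, because replace is the default action: if no action is given, Prometheus assumes that you want to replace. The replacement field (for example, replacement: $1) refers to the capture groups of the regex.

By default, honor_labels is false. When a scraped label conflicts with a label that Prometheus attaches itself, Prometheus renames the scraped label by adding the exported_ prefix to it.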
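
The sketch below shows a replace rule that copies part of one label into a new label. The job name, the target, and the target label name container_id are assumptions; the rule relies on the id label that cAdvisor attaches to container metrics.

```yaml
scrape_configs:
  - job_name: 'docker'            # hypothetical job name
    static_configs:
      - targets: ['ip1:8080']     # assumed cAdvisor address
    metric_relabel_configs:
      # No action is given, so the default (replace) applies.
      - source_labels: [id]
        regex: '/docker/([a-z0-9]+)'
        replacement: '$1'         # first capture group: the container ID
        target_label: container_id
```

With this rule, a sample whose id label is /docker/ followed by a container ID gains a container_id label holding just the ID.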

Dropping labels

Use action: labeldrop. Labels are what make a time series unique: if dropping labels leaves two time series with identical label sets, the system may run into problems!
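
A labeldrop rule matches label names rather than label values, so it takes only a regex. The sketch below assumes a cAdvisor job (hypothetical name and target) and removes the kernelVersion label as an example.

```yaml
scrape_configs:
  - job_name: 'docker'            # hypothetical job name
    static_configs:
      - targets: ['ip1:8080']     # assumed cAdvisor address
    metric_relabel_configs:
      # The regex is matched against label names; matching labels are removed.
      - regex: 'kernelVersion'
        action: labeldrop
```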

The USE method

USE stands for Utilization, Saturation, and Errors.

CPU utilization

100 - avg(irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) by (instance) * 100

CPU saturation

A load average below the number of CPUs is usually normal; a load average that stays above that number for an extended period indicates CPU saturation.

  • Number of CPUs:
count by (instance) (node_cpu_seconds_total{mode="idle"})

  • Load average: the node_load metrics show the 1-, 5-, and 15-minute load averages; node_load1 is the 1-minute average. The following expression matches hosts whose 1-minute load exceeds twice the number of CPUs:

node_load1 > on (instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})

Memory usage

All values are in bytes.

  • node_memory_MemTotal_bytes: total memory on the host
  • node_memory_MemFree_bytes: free memory on the host
  • node_memory_Buffers_bytes: memory in the buffer cache
  • node_memory_Cached_bytes: memory in the page cache

The last three add up to the memory that is available. Subtracting them from the total gives the memory in use, as a percentage:

(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100

Memory saturation

1024 * sum by (instance) (
(rate(node_vmstat_pswpin[1m]) + rate(node_vmstat_pswpout[1m]))
)

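node_exporter collects these counters from /proc/vmstat. They are cumulative since the last boot, and rate() turns them into per-second values. The factor of 1024 assumes the counters are in KB and converts the result to bytes. (The kernel reports pswpin and pswpout as page counts, so treat the absolute value with care.)

  • node_vmstat_pswpin: data swapped in from disk to memory
  • node_vmstat_pswpout: data swapped out from memory to disk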

Disk usage

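For disks, we only measure disk space usage, not utilization, saturation, or errors. The percentage of space used on the /data mount point is:

(1 - node_filesystem_free_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) * 100

To check whether a filesystem is predicted to run out of free space within the next four hours, based on the last hour of data:

predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4*3600) < 0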

Service status

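The systemd collector reports the state of each unit in the node_systemd_unit_state metric.

The metadata metric

Joining on the metadata metric restricts a query to hosts with particular attributes, for example the Docker service on hosts in the SF datacenter: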

node_systemd_unit_state{name="docker.service"} == 1
and on (instance, job)
metadata{datacenter="SF"}

Persisting queries

Rule files are listed under rule_files in the Prometheus configuration, and the rules in them are evaluated every evaluation_interval:

rule_files:
  - "rules/node_rules.yml"

Queries can be persisted in three ways:

  • Recording rules: create a new metric from a query. They are useful for generating aggregates across multiple time series and for precomputing expensive queries.
  • Alerting rules: generate alerts from queries.
  • Visualization: visualize queries using dashboards such as Grafana.
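
As an illustration of the first two options, a rule file such as rules/node_rules.yml could contain something like the sketch below. The group name, the recorded metric name, the alert name, the 80% threshold, the 60m duration, and the severity label are all assumptions; the recording expression is the CPU utilization query from earlier in these notes.

```yaml
groups:
  - name: node_rules              # hypothetical group name
    rules:
      # Recording rule: precompute CPU utilization per instance.
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - avg(irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) by (instance) * 100
      # Alerting rule: fire when the recorded value stays high.
      - alert: HighNodeCPU        # hypothetical alert name
        expr: instance:node_cpu:avg_rate5m > 80
        for: 60m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```

Recording the expression once means the alert and any dashboard panel can read the cheaper precomputed series instead of re-running the full query.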