Adding monitoring items and querying exporter data
node_exporter introduction
Accessing the data directly with curl
[root@prome_master_01 tgzs]# curl -s localhost:9100/metrics | grep node_ | head -20
# HELP node_arp_entries ARP entries by device
# TYPE node_arp_entries gauge
node_arp_entries{device="eth0"} 3
# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1.616987084e+09
# HELP node_context_switches_total Total number of context switches.
# TYPE node_context_switches_total counter
node_context_switches_total 2.105979e+06
# HELP node_cooling_device_cur_state Current throttle state of the cooling device
# TYPE node_cooling_device_cur_state gauge
node_cooling_device_cur_state{name="0",type="Processor"} 0
node_cooling_device_cur_state{name="1",type="Processor"} 0
node_cooling_device_cur_state{name="2",type="Processor"} 0
node_cooling_device_cur_state{name="3",type="Processor"} 0
# HELP node_cooling_device_max_state Maximum throttle state of the cooling device
# TYPE node_cooling_device_max_state gauge
node_cooling_device_max_state{name="0",type="Processor"} 0
node_cooling_device_max_state{name="1",type="Processor"} 0
node_cooling_device_max_state{name="2",type="Processor"} 0
Project address
- node_exporter
Viewing the startup logs
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.315Z msg="Starting node_exporter" version="(version=1.1.2, branch=HEAD, revision=b597c1244d7bef49e6f3359c87a56dd7707f6719)"
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.315Z caller=node_exporter.go:50 msg="Build context" build_context="(go=go1.15.8, user=root@f07de8ca602a, date=20210305-09:29:10)"
Mar 29 15:38:51 prome_master_01 node_exporter: level=warn ts=2021-03-29T07:38:51.315Z caller=node_exporter.go:181 msg="Node Exporter is running as root user. This exporter is designed to run as unpriviledged user, root is not required."
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=filesystem_common.go:74 collector=filesystem msg="Parsed flag --collector.filesystem.ignored-mount-points" flag=^/(dev|proc|sys|var/lib/docker/.+)($|/)
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=filesystem_common.go:76 collector=filesystem msg="Parsed flag --collector.filesystem.ignored-fs-types" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:106 msg="Enabled collectors"
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=arp
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=bcache
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=bonding
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=btrfs
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=conntrack
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=cpu
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=cpufreq
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=diskstats
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=edac
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=entropy
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=fibrechannel
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=filefd
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=filesystem
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=hwmon
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=infiniband
Mar 29 15:38:51 prome_master_01 node_exporter: level=info ts=2021-03-29T07:38:51.316Z caller=node_exporter.go:113 collector=ipvs
Controlling the collectors that are enabled by default
- Blacklist: disable a collector that is enabled by default
--no-collector.<name>

# before disabling the cpu collector
[root@prome_master_01 node_exporter]# curl -s localhost:9100/metrics | grep node_cpu
# HELP node_cpu_guest_seconds_total Seconds the CPUs spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="2",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="2",mode="user"} 0
node_cpu_guest_seconds_total{cpu="3",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="3",mode="user"} 0
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 17691.27
node_cpu_seconds_total{cpu="0",mode="iowait"} 8.9
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0.32
node_cpu_seconds_total{cpu="0",mode="softirq"} 0.28
node_cpu_seconds_total{cpu="0",mode="steal"} ...

# after restarting with the cpu collector disabled, the same grep returns nothing
[root@prome_master_01 node_exporter]# curl -s localhost:9100/metrics | grep node_cpu
- Whitelist: disable all default collectors and enable only the ones you specify
--collector.disable-defaults --collector.<name>

# enable only the meminfo collector
./node_exporter --collector.disable-defaults --collector.meminfo
# enable only the meminfo and cpu collectors
./node_exporter --collector.disable-defaults --collector.meminfo --collector.cpu
Why some collectors are disabled by default
- Too heavy: high cardinality
- Prolonged runtime: exceeds the Prometheus scrape_interval or scrape_timeout
- Significant resource demands on the host
Disabling the Golang SDK self-metrics
- Use the flag
--web.disable-exporter-metrics
- promhttp_ represents the HTTP handling of requests to /metrics itself
[root@prome_master_01 tgzs]# curl -s localhost:9100/metrics | grep promhttp_
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 8
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
- go_ represents Go runtime information (goroutines, GC, memory stats, etc.)
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.15.8"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.781752e+06
- process_ represents information about the exporter process itself
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.54
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.5720448e+07
Reporting custom data from the node (textfile collector)
--collector.textfile.directory=""
- Configure a local collection directory and create files with the .prom extension in it; the files must follow the Prometheus text exposition format
# create a .prom file in the textfile directory
cat <<EOF > ./text_file_dir/test.prom
# HELP nyy_test_metric just test
# TYPE nyy_test_metric gauge
nyy_test_metric{method="post",code="200"} 1027
EOF

# start the service pointing at the directory
./node_exporter --collector.textfile.directory=./text_file_dir

# curl to view the data
[root@prome_master_01 tgzs]# curl -s localhost:9100/metrics | grep nyy
# HELP nyy_test_metric just test
# TYPE nyy_test_metric gauge
nyy_test_metric{code="200",method="post"} 1027
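The same file can also be produced from code. Below is a minimal Go sketch (the directory and metric are the illustrative ones from the example above) that writes the .prom file through a temp file plus an atomic rename, so node_exporter never scrapes a half-written file:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir := "./text_file_dir" // must match --collector.textfile.directory
	body := "# HELP nyy_test_metric just test\n" +
		"# TYPE nyy_test_metric gauge\n" +
		"nyy_test_metric{method=\"post\",code=\"200\"} 1027\n"

	// Write to a temp file first, then rename: rename is atomic on the same
	// filesystem, so the collector never reads a partially written file.
	tmp := filepath.Join(dir, "test.prom.tmp")
	if err := os.WriteFile(tmp, []byte(body), 0o644); err != nil {
		fmt.Println("write:", err)
		return
	}
	if err := os.Rename(tmp, filepath.Join(dir, "test.prom")); err != nil {
		fmt.Println("rename:", err)
	}
}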
Filtering collectors via HTTP request parameters
- Principle: the handler filters collectors based on the collect[] HTTP request parameters
func (h *handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	filters := r.URL.Query()["collect[]"]
	level.Debug(h.logger).Log("msg", "collect query:", "filters", filters)

	if len(filters) == 0 {
		// No filters, use the prepared unfiltered handler.
		h.unfilteredHandler.ServeHTTP(w, r)
		return
	}
	// To serve filtered metrics, we create a filtering handler on the fly.
	filteredHandler, err := h.innerHandler(filters...)
	if err != nil {
		level.Warn(h.logger).Log("msg", "Couldn't create filtered metrics handler:", "err", err)
		w.WriteHeader(http.StatusBadRequest)
		w.Write([]byte(fmt.Sprintf("Couldn't create filtered metrics handler: %s", err)))
		return
	}
	filteredHandler.ServeHTTP(w, r)
}
- HTTP access examples
# query only the cpu collector's metrics
http://192.168.0.112:9100/metrics?collect[]=cpu
# query only the cpu and meminfo collectors' metrics
http://192.168.0.112:9100/metrics?collect[]=cpu&collect[]=meminfo
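The same filtered request can be issued from code. A minimal Go sketch (the address and the cpu/meminfo collectors are the ones used above); url.Values takes care of repeating the collect[] key:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Repeat the collect[] key once per collector to keep.
	q := url.Values{}
	q.Add("collect[]", "cpu")
	q.Add("collect[]", "meminfo")

	resp, err := http.Get("http://192.168.0.112:9100/metrics?" + q.Encode())
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body)) // only cpu and meminfo metrics (plus exporter self-metrics)
}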
- Prometheus configuration
params:
  collect[]:
    - cpu
    - meminfo
- Difference from Prometheus relabel_config: collect[] filters by collector, while relabel_config filters by metric_name or label
Importing a node_exporter template from the Grafana dashboard marketplace
- Address grafana.com/grafana/das…
- Two import modes
  - Import by URL
  - Import by JSON file
- grafana.com/grafana/das…
Query the data on the Prometheus Graph page
# example queries
node_cpu_seconds_total
node_cpu_seconds_total{mode="user"}

# results
node_cpu_seconds_total{cpu="0", instance="172.20.70.205:9100", job="prometheus", mode="user"}    53.43
node_cpu_seconds_total{cpu="0", instance="172.20.70.25:9100", job="prometheus", mode="user"}     8.17
node_cpu_seconds_total{cpu="1", instance="172.20.70.205:9100", job="prometheus", mode="user"}    28.96
node_cpu_seconds_total{cpu="1", instance="172.20.70.25:9100", job="prometheus", mode="user"}     12.32
node_cpu_seconds_total{cpu="2", instance="172.20.70.205:9100", job="prometheus", mode="user"}    31.54
node_cpu_seconds_total{cpu="2", instance="172.20.70.25:9100", job="prometheus", mode="user"}     8.32
node_cpu_seconds_total{cpu="3", instance="172.20.70.205:9100", job="prometheus", mode="user"}    53.88
node_cpu_seconds_total{cpu="3", instance="172.20.70.25:9100", job="prometheus", mode="user"}     6.38
Querying data in Prometheus and basic data concepts
Basic concepts of Prometheus
The sample data point
type sample struct {
t int64
v float64
}
- Sample represents a data point
- Size: 16 bytes, one 8-byte int64 timestamp plus one 8-byte float64 value (verified in the sketch below)
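The 16-byte size is easy to verify. A minimal Go sketch that mirrors the struct above and prints its size with unsafe.Sizeof:

package main

import (
	"fmt"
	"unsafe"
)

// mirrors the sample struct above
type sample struct {
	t int64   // timestamp, 8 bytes
	v float64 // value, 8 bytes
}

func main() {
	fmt.Println(unsafe.Sizeof(sample{})) // prints 16
}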
The Label type
type Label struct {
Name, Value string
}
- A label is a name/value pair, for example:
  cpu="0"
  mode="user"
The Labels label set
type Labels []Label
- Labels is the complete set of label name/value pairs that identifies one series of a metric (see the sketch below)
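A minimal sketch tying the two types together: building the label set that identifies one series of node_cpu_seconds_total, including the __name__ label that Prometheus stores internally (the struct definitions mirror the ones above):

package main

import "fmt"

type Label struct {
	Name, Value string
}

type Labels []Label

func main() {
	// the full label set identifying one series of node_cpu_seconds_total
	series := Labels{
		{Name: "__name__", Value: "node_cpu_seconds_total"},
		{Name: "cpu", Value: "0"},
		{Name: "mode", Value: "user"},
	}
	for _, l := range series {
		fmt.Printf("%s=%q\n", l.Name, l.Value)
	}
}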
The four Prometheus query types
- Documentation address
- Instant vector: a set of time series, each containing a single sample, with all samples sharing the same timestamp
  - On the Prometheus page this is the Table query, served by the /api/v1/query interface (both endpoints are exercised in the sketch after this list)
The Vector type
type Vector []Sample
- Vector is an alias for a slice of samples in which all samples share the same timestamp; it is typically the result of an instant query
- Range vector: a set of time series, each containing a range of data points over time
  - On the Prometheus page this is the Graph query, served by the /api/v1/query_range interface
The Matrix type
type Matrix []Series
- Matrix is a slice of series; it is the result returned by a range query
- Scalar: a simple numeric floating-point value
- String: a simple string value; not currently used
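Both query endpoints are easy to exercise directly. A minimal Go sketch (the Prometheus address, queries, and step are assumptions) that issues one instant query and one range query; the JSON responses report resultType "vector" and "matrix" respectively:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

const base = "http://192.168.0.112:9090" // hypothetical Prometheus address

func get(path string, q url.Values) {
	resp, err := http.Get(base + path + "?" + q.Encode())
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}

func main() {
	// instant query -> resultType "vector": one sample per series, same timestamp
	get("/api/v1/query", url.Values{
		"query": {`node_cpu_seconds_total{mode="user"}`},
	})

	// range query -> resultType "matrix": a range of samples per series
	now := time.Now()
	get("/api/v1/query_range", url.Values{
		"query": {`rate(node_cpu_seconds_total{mode="user"}[1m])`},
		"start": {fmt.Sprint(now.Add(-10 * time.Minute).Unix())},
		"end":   {fmt.Sprint(now.Unix())},
		"step":  {"30"},
	})
}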
The four label matching operators
- = equals
  - Example: node_cpu_seconds_total{mode="user", cpu="0"}
- != not equal
  - Query the bytes received on non-loopback NICs: node_network_receive_bytes_total{device!="lo"}
- =~ regex match
  - Query the available bytes on filesystems mounted under /run: node_filesystem_avail_bytes{mountpoint=~"^/run.*"}
- !~ negative regex match
  - Query the bytes read from block devices whose names do not contain vda: node_disk_read_bytes_total{device!~".*vda.*"}
The four metric types
gauge
Represents a current value that can go up or down, for example the amount of free memory:
node_memory_MemFree_bytes
counter
A counter is a cumulative metric representing a monotonically increasing count whose value can only increase or be reset to zero on restart. For example, use a counter for the number of requests served, tasks completed, or errors.
http_request_total
histogram
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
# average latency: sum / count
prometheus_http_request_duration_seconds_sum / prometheus_http_request_duration_seconds_count
# quantiles: 95th percentile, by handler and overall
histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (le, handler))
histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[1m])) by (le))
# bucket rate for the range_query handler only
sum(rate(prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query_range"}[5m])) by (le)
summary
A summary samples observations (usually things like request durations and response sizes). Like a histogram it provides a total count of observations and a sum of all observed values, but it also calculates configurable quantiles over a sliding time window.
# GC pause duration
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000135743
go_gc_duration_seconds{quantile="0.25"} 0.000872805
go_gc_duration_seconds{quantile="0.5"} 0.000965516
go_gc_duration_seconds{quantile="0.75"} 0.001055636
go_gc_duration_seconds{quantile="1"} 0.006464756
# average value from a summary
go_gc_duration_seconds_sum /go_gc_duration_seconds_count
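To see where these four types come from on the producing side, here is a minimal client_golang sketch (metric names, port, and the simulated workload are assumptions, not part of node_exporter) that registers one metric of each type and serves them on /metrics:

package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// counter: monotonically increasing, resets only on restart
	reqTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total number of handled requests.",
	})
	// gauge: a current value that can go up or down
	inFlight = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "myapp_requests_in_flight",
		Help: "Current number of in-flight requests.",
	})
	// histogram: observations counted into buckets, plus _sum and _count
	reqDur = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "myapp_request_duration_seconds",
		Help:    "Request duration in seconds.",
		Buckets: prometheus.DefBuckets,
	})
	// summary: client-side quantiles over a sliding window, plus _sum and _count
	taskDur = prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "myapp_task_duration_seconds",
		Help:       "Task duration in seconds.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
)

func main() {
	prometheus.MustRegister(reqTotal, inFlight, reqDur, taskDur)

	// simulate some activity so every metric type has data
	go func() {
		for {
			inFlight.Inc()
			d := rand.Float64()
			time.Sleep(time.Duration(d*100) * time.Millisecond)
			reqTotal.Inc()
			reqDur.Observe(d)
			taskDur.Observe(d)
			inFlight.Dec()
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}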
Range Vector Selectors
- Range vector selectors work like instant vector selectors, except that they select a range of samples back from the current instant. Syntactically, a duration is appended in square brackets [] at the end of a vector selector to specify how far back in time values should be fetched for each element of the resulting range vector.
- Can only be applied to counter metrics
Time range
ms - milliseconds
s - seconds
m - minutes
h - hours
d - days, assuming a day always has 24h
w - weeks, assuming a week always has 7d
y - years, assuming a year always has 365d
node_network_receive_bytes_total{device!="lo"}[1m]

Error executing query: invalid expression type "range vector" for range query, must be Scalar or instant Vector
A function such as rate, irate, delta, or idelta (optionally combined with an aggregation such as sum) must be applied on top of the range vector
- Calculate the inbound NIC traffic rate: rate(node_network_receive_bytes_total{device!="lo"}[1m])
The time range must not be shorter than the scrape interval
- If data is scraped every 30 seconds but queried over a 10-second window, the window holds at most one sample and no rate can be computed:
rate(node_network_receive_bytes_total{device!="lo"}[10s])