Introduction
Prometheus Operator provides Kubernetes-native deployment and management of Prometheus and its associated monitoring components. The purpose of the project is to simplify and automate the configuration of Prometheus-based monitoring stacks for Kubernetes clusters, including the following functions:
- Kubernetes custom resources: use Kubernetes CRDs to deploy and manage Prometheus, Alertmanager, and related components.
- Simplified deployment configuration: configure Prometheus fundamentals such as version, persistence, replicas, and retention policy directly from Kubernetes resource manifests.
- Prometheus monitoring target configuration: automatically generate monitoring target (scrape) configuration from familiar Kubernetes label queries, without having to learn Prometheus-specific configuration syntax.
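For example, a minimal sketch of a Prometheus custom resource looks like this; the field values are illustrative rather than taken from a specific environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  version: v2.26.0              # Prometheus server version to run (illustrative)
  replicas: 2                   # number of Prometheus server replicas
  retention: 7d                 # how long to keep time-series data
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector: {}    # an empty selector picks up all ServiceMonitor objects
```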
Architecture diagram for Prometheus Operator:
Kube-prometheus is a monitoring solution built on this operator pattern. It registers multiple CRDs in the Kubernetes cluster, including the four major CRDs described below, and ships a set of manifests for commonly used monitoring components that the operator manages.
Prometheus: this CRD describes a Prometheus server deployment. For each Prometheus resource, the operator creates a StatefulSet in the same namespace that runs the Prometheus server cluster described by the resource; the custom resource can therefore be thought of as a StatefulSet dedicated to managing Prometheus servers.
ServiceMonitor: this CRD declares the dynamic set of services to be monitored. It binds to Services (SVC) by label, and the metrics route of the corresponding Endpoints (EP) resources (the URL that exposes the target metrics) is then scraped for pod monitoring data. The operator watches ServiceMonitor resources for changes, updates the /etc/prometheus/config_out/prometheus.env.yaml configuration file accordingly, and triggers a reload of the Prometheus server.
Alertmanager: this CRD defines an Alertmanager deployment running in the Kubernetes cluster and provides a variety of options, including persistent storage. For each Alertmanager resource, the operator deploys a correspondingly configured StatefulSet in the same namespace; the Alertmanager pods mount a Secret that uses alertmanager.yaml as the key holding the configuration file.
PrometheusRule: this CRD defines the alerting (and recording) rules to be loaded. The Prometheus resource has a very important property, ruleSelector, that filters which rules apply; in kube-prometheus a PrometheusRule object must carry the labels prometheus: k8s and role: alert-rules to be matched. Matching PrometheusRule objects are rendered into the ConfigMap prometheus-k8s-rulefiles-0, which is mounted into the Prometheus server under the path /etc/prometheus/rules/prometheus-k8s-rulefiles-0/.
The prometheus-operator controller itself runs in the cluster as a Deployment. It continuously lists/watches the four CRD resources above and reconciles the corresponding Kubernetes resources in a control loop, so that the monitoring stack is managed and adjusted automatically.
Daily management and use
The process for creating a monitor:
- Expose a metrics interface in the application during development so that it provides monitoring data.
- Create a Service (SVC) so that the metrics of the Endpoints (EP) behind it can be reached through the SVC.
- Create a ServiceMonitor bound to the SVC labels, and set jobLabel (which label on the SVC/EP becomes the job name). If the metrics interface is not /metrics, set path to override the default; a minimal sketch follows this list.
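A minimal ServiceMonitor sketch following these steps (the names, the namespaces, and the non-default path are illustrative assumptions, not taken from the examples below):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                    # illustrative name
  namespace: monitoring
  labels:
    k8s-app: my-app
spec:
  jobLabel: k8s-app               # use the Service's k8s-app label value as the job name
  selector:
    matchLabels:
      k8s-app: my-app             # must match the labels on the Service
  namespaceSelector:
    matchNames:
    - default                     # namespace in which the Service lives
  endpoints:
  - port: metrics                 # name of the Service port to scrape
    interval: 15s
    path: /actuator/prometheus    # only needed when the metrics path is not /metrics
```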
Example: monitoring etcd
Change the listening address of ETCD metrics
This environment was deployed with kubeadm; inspect the etcd static Pod configuration with kubectl:
`[root@dm01 ~]# kubectl get po etcd-dm01 -n kube-system -o yaml| grep -C 10 metrics`
- command:
- etcd
- --advertise-client-urls=https://115.238.100.73:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://115.238.100.73:2380
- --initial-cluster=dm01=https://115.238.100.73:2380,dm02=https://192.168.1.12:2380,dm03=https://192.168.1.13:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://115.238.100.73:2379
- --listen-metrics-urls=http://127.0.0.1:2381 # note that the default is to listen on 127.0.0.1 only
- --listen-peer-urls=https://115.238.100.73:2380
- --name=dm01
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: registry.aliyuncs.com/k8sxio/etcd:3.4.13-0
imagePullPolicy: IfNotPresent
You can see the listen-metrics-urls=http://127.0.0.1:2381 flag among the startup parameters, which means the metrics interface listens on port 2381 over plain HTTP. No certificate configuration is therefore required, which is much easier than older versions where metrics were only reachable over HTTPS and the corresponding certificates had to be configured.
Modify etcd.yaml under /etc/kubernetes/manifests/ (the default static Pod directory) and change --listen-metrics-urls to http://0.0.0.0:2381 so that the metrics port is exposed on all interfaces; otherwise the scrapes made through the ServiceMonitor later will be refused.
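After the change, the relevant fragment of /etc/kubernetes/manifests/etcd.yaml would look roughly like this (only the metrics flag differs from the kubeadm-generated file shown above):

```yaml
spec:
  containers:
  - command:
    - etcd
    - --listen-metrics-urls=http://0.0.0.0:2381   # was http://127.0.0.1:2381
    # ... all other flags stay unchanged ...
```

Because the kubelet watches the static Pod directory, saving the file is enough; the etcd Pod is recreated with the new flag automatically.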
Create the ServiceMonitor resource
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app           # which Service label value to use as the job name
  endpoints:
  - interval: 15s
    port: port
  namespaceSelector:
    matchNames:
    - kube-system             # which namespace to look for the Service in
  selector:
    matchLabels:
      k8s-app: etcd-demo      # matches the Service labels
Create the Service and Endpoints
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: etcd-demo
name: etcd-k8s
namespace: kube-system
spec:
clusterIP: None
ports:
- name: port
port: 2381
protocol: TCP
targetPort: 2381
type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
labels:
k8s-app: etcd-demo
name: etcd-k8s
namespace: kube-system
subsets:
- addresses:
- ip: 192.168.1.11
nodeName: dm01
- ip: 192.168.1.12
nodeName: dm02
- ip: 192.168.1.13
nodeName: dm03
ports:
- name: port
port: 2381
protocol: TCP
Go to the Prometheus UI and check whether the etcd monitoring data is being collected.
Once data is being collected, Grafana dashboard 3070 can be imported to get etcd monitoring charts.
Create PrometheusRule
etcd alert rule example
We define a rule that triggers an alert if fewer than three etcd cluster members are alive:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: k8s
role: alert-rules
name: etcd-rules
namespace: monitoring
spec:
groups:
- name: etcd
rules:
- alert: EtcdClusterUnavailable
annotations:
description: the number of live etcd members is less than three
summary: too few etcd cluster members
expr: 'sum(up{job="etcd-demo"}) < 3'
for: 1m
labels:
severity: critical
Key Configuration Description
1. The following two labels must be present on every rule you create; they are defined by the ruleSelector field of the Prometheus custom resource:
`[root@dm01 /opt/softs]# kubectl get Prometheus k8s -o yaml -n monitoring`
...
ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules
...
2. expr is a PromQL expression that determines whether the configured threshold is met. for specifies how long the condition must keep holding before the alert fires; while the condition is met but the for duration has not yet elapsed, the alert is in the pending state.
expr: "sum(up{job="etcd-demo"}) < 3"
for: 1m
You can see this rule on the Rules page for Prometheus when it is created
Example of Node memory monitoring indicators
`[root@dm01 /opt/softs]# cat node-mem.yaml `
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: k8s
role: alert-rules
name: node-mem
namespace: monitoring
spec:
groups:
- name: cronjob
rules:
- alert: node-mem-limit
annotations:
description: node-exporter {{ $labels.instance }} mem only {{ printf "%.2f" $value}}% < 15%
summary: node-mem-limit alert
expr: " (node_memory_MemAvailable_bytes{job='node-exporter'} / node_memory_MemTotal_bytes{job='node-exporter'}) * 100 < 15 "
for: 1m
labels:
severity: warning
Description:
{{ $labels.instance }} refers to the instance key under the labels of the monitored target. {{ printf "%.2f" $value }} formats the sample value to two decimal places.
Effect achieved:
The rendered description reads like: node-exporter <instance> mem only 11.50% < 15%
The container CPU usage exceeds 200%
Recording rule: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
The concrete expression:
sum by(cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="", job="kubelet", metrics_path="/metrics/cadvisor"}[5m])) * on(cluster, namespace, pod) group_left(node) topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) * 100 > 200
Container memory usage exceeds 1 GB
container_memory_working_set_bytes{pod!=""} / (1024*1024*1024) > 1
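As with the etcd and node-memory rules above, these two expressions can be wrapped in a PrometheusRule object so that Prometheus loads them. A minimal sketch follows; the metadata name, alert names, and for durations are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s              # required by the ruleSelector described earlier
    role: alert-rules
  name: container-usage          # illustrative name
  namespace: monitoring
spec:
  groups:
  - name: container-usage
    rules:
    - alert: ContainerCPUOver200Percent
      expr: |
        sum by(cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="", job="kubelet", metrics_path="/metrics/cadvisor"}[5m]))
          * on(cluster, namespace, pod) group_left(node)
          topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) * 100 > 200
      for: 5m
      labels:
        severity: warning
    - alert: ContainerMemoryOver1Gi
      expr: container_memory_working_set_bytes{pod!=""} / (1024*1024*1024) > 1
      for: 5m
      labels:
        severity: warning
```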
Alertmanager enterprise WeChat (WeCom) alerting
In this test, Alertmanager together with PrometheusAlert is used to call an enterprise WeChat robot to deliver alerts.
Deploy the PrometheusAlert service and modify its app.conf configuration file
kubectl apply -n monitoring -f https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/example/kubernetes/PrometheusAlert-Deployment.yaml
The PrometheusAlert configuration file is mounted as a ConfigMap. The service provides many alert channels; here we only use the enterprise WeChat channel.
Modifying Key Configurations
`[root@dm01 /etc/kubernetes]# kubectl edit configmap prometheus-alert-center-conf -n monitoring -o yaml`
...
# Whether to enable the WeChat alert channel; multiple channels can be enabled at the same time. 0 disables, 1 enables.
open-weixin=1
# Default robot webhook address
wxurl=xxxxx
...
Create an Ingress with a domain name to access its UI.
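A minimal sketch of such an Ingress; the host name and ingress class are assumptions, while the Service name and port come from the PrometheusAlert manifest used above:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-alert-center
  namespace: monitoring
spec:
  ingressClassName: nginx                   # assumes an nginx ingress controller is installed
  rules:
  - host: prometheusalert.example.com       # illustrative domain name
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-alert-center   # Service exposed by the PrometheusAlert deployment
            port:
              number: 8080
```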
You can run an alert test in the UI to check whether anything is wrong.
Configure the Alertmanager service
The Alertmanager configuration is stored in the alertmanager-main Secret; edit the configuration file:
[root@dm01 /etc/kubernetes]# cat /root/kube-prometheus/manifests/alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 30s        # example value: after a new alert group is created, wait at least group_wait before the initial notification so that multiple alerts of the same group can be sent together
  group_interval: 10s    # interval between notifications within the same group
  repeat_interval: 10m   # if an alert has already been sent successfully, wait repeat_interval before sending it again; tune per alert type
  receiver: 'web.hook.prometheusalert'   # default receiver: alerts not matched by any route go here; all attributes above are inherited by subroutes and can be overridden per subroute
  routes:
  - receiver: 'web.hook.prometheusalert'
    group_wait: 10s
    match:
      level: '1'
receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  # the URL is the SVC address of the PrometheusAlert service deployed above; a template, robot address and @-mention
  # can also be passed on the query string, e.g.
  # http://prometheus-alert-center.monitoring.svc.cluster.local:8080/prometheusalert?type=wx&tpl=aliyun-prometheus&wxurl=xxxxx&at=zhangsan
  - url: 'http://prometheus-alert-center.monitoring.svc.cluster.local:8080/prometheus/alert'
Re-create the alertmanager-main Secret to apply the update
[root@dm01 /etc/kubernetes]# kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
[root@dm01 /etc/kubernetes]# kubectl create secret generic alertmanager-main --from-file=/root/kube-prometheus/manifests/alertmanager.yaml -n monitoring
Example of creating a PrometheusAlert template
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{ if eq $v.status "resolved" }}
#### [Prometheus recovery information]({{ $v.generatorURL }})
> <font color="info">Alert name</font>: [{{ $v.labels.alertname }}]({{ $var }})
> <font color="info">Alert level</font>: {{ $v.labels.severity }}
> <font color="info">Start time</font>: {{ GetCSTtime $v.startsAt }}
> <font color="info">End time</font>: {{ GetCSTtime $v.endsAt }}
> <font color="info">Host name</font>: {{ $v.labels.instance }}
**{{ $v.annotations.description }}**
{{ else }}
#### [Prometheus alert]({{ $v.generatorURL }})
> <font color="#FF0000">Alert name</font>: [{{ $v.labels.alertname }}]({{ $var }})
> <font color="#FF0000">Alert level</font>: {{ $v.labels.severity }}
> <font color="#FF0000">Start time</font>: {{ GetCSTtime $v.startsAt }}
> <font color="#FF0000">End time</font>: {{ GetCSTtime $v.endsAt }}
> <font color="#FF0000">Host name</font>: {{ $v.labels.instance }}
**{{ $v.annotations.description }}**
{{ end }}
{{ end }}
Trigger alarm test
Take etcd down on one node
In a kubeadm environment, simply mv the etcd static Pod manifest out of the manifests directory:
[root@dm01 /etc/kubernetes/manifests]# mv etcd.yaml /opt/softs/
[root@dm01 /etc/kubernetes/manifests]# docker ps | grep -i etcd
Check the alert status after about a minute
The enterprise WeChat robot receives the alert.
Golang basic monitoring implementation
Start with a basic piece of code:
package main
import (
"net/http"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// expose the default Prometheus registry on /metrics
	http.Handle("/metrics", promhttp.Handler())
	// listen on :2121 so the port matches the Service targetPort / containerPort in the manifests below
	http.ListenAndServe(":2121", nil)
}
Package via Dockerfile
### Build the manager binary
FROM golang:1.15 as builder
WORKDIR /workspace
ENV GOPROXY="https://goproxy.cn,direct"
# Copy the go source
COPY . .
# Build
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=on go build -o gotest main.go
# Runtime
FROM alpine
WORKDIR /usr/local/bin
COPY --from=builder /workspace/gotest /usr/local/bin/gotest
CMD ["gotest"]
Deploy to the Kubernetes environment and create the ServiceMonitor resource
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: go-demo
namespace: monitoring
labels:
k8s-app: go-demo
spec:
jobLabel: k8s-app
endpoints:
- port: port
interval: 15s
selector:
matchLabels:
k8s-app: go-demo
namespaceSelector:
matchNames:
- test008
---
apiVersion: v1
kind: Service
metadata:
name: go-demo
namespace: test008
labels:
k8s-app: go-demo
spec:
type: ClusterIP
ports:
- name: port
port: 2121
selector:
k8s-app: go-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: go-demo
namespace: test008
labels:
k8s-app: go-demo
spec:
replicas: 1
selector:
matchLabels:
k8s-app: go-demo
template:
metadata:
labels:
k8s-app: go-demo
spec:
containers:
- name: go-demo
image: xxx
ports:
- containerPort: 2121
The problem
At this point, checking the Prometheus console, we find that the target for the service we just deployed does not actually appear, and the Prometheus logs show that Prometheus has no permission to access resources in the test008 namespace.
The cause
The default kube-prometheus deployment creates Role and RoleBinding objects that grant permissions in only three namespaces: default, kube-system, and monitoring. To collect data from any other namespace, a corresponding Role and RoleBinding must be created for it.
Create roles for the specified namespace
apiVersion: rbac.authorization.k8s.io/v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: default
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: kube-system
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: monitoring
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: test008
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
kind: RoleList
Create Rolebinding
apiVersion: rbac.authorization.k8s.io/v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: test008
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
kind: RoleBindingList
View the target in the Prometheus console
Change the default data retention duration of Prometheus
Modify the retention period managed by the Prometheus Operator by setting this parameter under spec in the Prometheus custom resource:
retention: 7d
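For context, the field sits under spec of the Prometheus custom resource; with kube-prometheus the object is named k8s in the monitoring namespace, as shown in the ruleSelector example earlier:

```yaml
# fragment of the Prometheus custom resource (e.g. kubectl edit prometheus k8s -n monitoring)
spec:
  retention: 7d        # keep time-series data for 7 days
```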