Introduction
Prometheus Operator provides Kubernetes-native deployment and management of Prometheus and its associated monitoring components. The purpose of the project is to simplify and automate the configuration of Prometheus-based monitoring stacks for Kubernetes clusters, including the following functions:
- Kubernetes custom resources: use Kubernetes CRDs to deploy and manage Prometheus, Alertmanager, and related components.
- Simplified deployment configuration: configure Prometheus fundamentals such as version, persistence, replicas, and retention policy directly from Kubernetes resource manifests.
- Prometheus monitoring target configuration: automatically generate monitoring target (scrape) configuration from familiar Kubernetes label queries, without having to learn Prometheus-specific configuration syntax.
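For example, a minimal sketch of a Prometheus custom resource looks like this; the field values are illustrative rather than taken from a specific environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  version: v2.26.0              # Prometheus server version to run (illustrative)
  replicas: 2                   # number of Prometheus server replicas
  retention: 7d                 # how long to keep time-series data
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector: {}    # an empty selector picks up all ServiceMonitor objects
```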
Architecture diagram for Prometheus Operator:
Kube-prometheus is a monitoring solution built on this operator pattern. It registers multiple CRDs in the Kubernetes cluster, including the four major CRDs described below, and ships a set of manifests for commonly used monitoring components that the operator manages.
Prometheus: this CRD describes a Prometheus server deployment. For each Prometheus resource, the operator creates a StatefulSet in the same namespace that runs the Prometheus server cluster described by the resource; the custom resource can therefore be thought of as a StatefulSet dedicated to managing Prometheus servers.
ServiceMonitor: this CRD declares the dynamic set of services to be monitored. It binds to Services (SVC) by label, and the metrics route of the corresponding Endpoints (EP) resources (the URL that exposes the target metrics) is then scraped for pod monitoring data. The operator watches ServiceMonitor resources for changes, updates the /etc/prometheus/config_out/prometheus.env.yaml configuration file accordingly, and triggers a reload of the Prometheus server.
Alertmanager: this CRD defines an Alertmanager deployment running in the Kubernetes cluster and provides a variety of options, including persistent storage. For each Alertmanager resource, the operator deploys a correspondingly configured StatefulSet in the same namespace; the Alertmanager pods mount a Secret that uses alertmanager.yaml as the key holding the configuration file.
PrometheusRule: this CRD defines the alerting (and recording) rules to be loaded. The Prometheus resource has a very important property, ruleSelector, that filters which rules apply; in kube-prometheus a PrometheusRule object must carry the labels prometheus: k8s and role: alert-rules to be matched. Matching PrometheusRule objects are rendered into the ConfigMap prometheus-k8s-rulefiles-0, which is mounted into the Prometheus server under the path /etc/prometheus/rules/prometheus-k8s-rulefiles-0/.
The prometheus-operator controller itself runs in the cluster as a Deployment. It continuously lists/watches the four CRD resources above and reconciles the corresponding Kubernetes resources in a control loop, so that the monitoring stack is managed and adjusted automatically.
Daily management and use
The process for creating a monitor:
- Expose a metrics interface in the application during development so that it provides monitoring data.
- Create a Service (SVC) so that the metrics of the Endpoints (EP) behind it can be reached through the SVC.
- Create a ServiceMonitor bound to the SVC labels, and set jobLabel (which label on the SVC/EP becomes the job name). If the metrics interface is not /metrics, set path to override the default; a minimal sketch follows this list.
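A minimal ServiceMonitor sketch following these steps (the names, the namespaces, and the non-default path are illustrative assumptions, not taken from the examples below):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                    # illustrative name
  namespace: monitoring
  labels:
    k8s-app: my-app
spec:
  jobLabel: k8s-app               # use the Service's k8s-app label value as the job name
  selector:
    matchLabels:
      k8s-app: my-app             # must match the labels on the Service
  namespaceSelector:
    matchNames:
    - default                     # namespace in which the Service lives
  endpoints:
  - port: metrics                 # name of the Service port to scrape
    interval: 15s
    path: /actuator/prometheus    # only needed when the metrics path is not /metrics
```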
Example: monitoring etcd
Change the listening address of ETCD metrics
This environment was deployed with kubeadm; inspect the etcd static Pod configuration with kubectl:
`[root@dm01 ~]# kubectl get po etcd-dm01 -n kube-system -o yaml| grep -C 10 metrics`
- command:
- etcd
- --advertise-client-urls=https://115.238.100.73:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://115.238.100.73:2380
- --initial-cluster=dm01=https://115.238.100.73:2380,dm02=https://192.168.1.12:2380,dm03=https://192.168.1.13:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://115.238.100.73:2379
- --listen-metrics-urls=http://127.0.0.1:2381 # note that the default is to listen on 127.0.0.1 only
- --listen-peer-urls=https://115.238.100.73:2380
- --name=dm01
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: registry.aliyuncs.com/k8sxio/etcd:3.4.13-0
imagePullPolicy: IfNotPresent
You can see the listen-metrics-urls=http://127.0.0.1:2381 flag among the startup parameters, which means the metrics interface listens on port 2381 over plain HTTP. No certificate configuration is therefore required, which is much easier than older versions where metrics were only reachable over HTTPS and the corresponding certificates had to be configured.
Modify etcd.yaml under /etc/kubernetes/manifests/ (the default static Pod directory) and change --listen-metrics-urls to http://0.0.0.0:2381 so that the metrics port is exposed on all interfaces; otherwise the scrapes made through the ServiceMonitor later will be refused.
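After the change, the relevant fragment of /etc/kubernetes/manifests/etcd.yaml would look roughly like this (only the metrics flag differs from the kubeadm-generated file shown above):

```yaml
spec:
  containers:
  - command:
    - etcd
    - --listen-metrics-urls=http://0.0.0.0:2381   # was http://127.0.0.1:2381
    # ... all other flags stay unchanged ...
```

Because the kubelet watches the static Pod directory, saving the file is enough; the etcd Pod is recreated with the new flag automatically.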
Create the ServiceMonitor resource
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app           # which Service label value to use as the job name
  endpoints:
  - interval: 15s
    port: port
  namespaceSelector:
    matchNames:
    - kube-system             # which namespace to look for the Service in
  selector:
    matchLabels:
      k8s-app: etcd-demo      # matches the Service labels
Create the Service and Endpoints
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: etcd-demo
name: etcd-k8s
namespace: kube-system
spec:
clusterIP: None
ports:
- name: port
port: 2381
protocol: TCP
targetPort: 2381
type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
labels:
k8s-app: etcd-demo
name: etcd-k8s
namespace: kube-system
subsets:
- addresses:
- ip: 192.168.1.11
nodeName: dm01
- ip: 192.168.1.12
nodeName: dm02
- ip: 192.168.1.13
nodeName: dm03
ports:
- name: port
port: 2381
protocol: TCP
Go to the Prometheus UI and check whether the etcd monitoring data is being collected.
Once data is being collected, Grafana dashboard 3070 can be imported to get etcd monitoring charts.
Create PrometheusRule
etcd alert rule example
We define a rule that triggers an alert if fewer than three etcd cluster members are alive:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: k8s
role: alert-rules
name: etcd-rules
namespace: monitoring
spec:
groups:
- name: etcd
rules:
- alert: EtcdClusterUnavailable
annotations:
description: the number of live etcd members is less than three
summary: too few etcd cluster members
expr: 'sum(up{job="etcd-demo"}) < 3'
for: 1m
labels:
severity: critical
Key Configuration Description
1. The following two labels must be present on every rule you create; they are defined by the ruleSelector field of the Prometheus custom resource:
`[root@dm01 /opt/softs]# kubectl get Prometheus k8s -o yaml -n monitoring`
...
ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules
...
2. expr is a PromQL expression that determines whether the configured threshold is met. for specifies how long the condition must keep holding before the alert fires; while the condition is met but the for duration has not yet elapsed, the alert is in the pending state.
expr: "sum(up{job="etcd-demo"}) < 3"
for: 1m
You can see this rule on the Rules page for Prometheus when it is created
Example of Node memory monitoring indicators
`[root@dm01 /opt/softs]# cat node-mem.yaml `
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: k8s
role: alert-rules
name: node-mem
namespace: monitoring
spec:
groups:
- name: cronjob
rules:
- alert: node-mem-limit
annotations:
description: node-exporter {{ $labels.instance }} mem only {{ printf "%.2f" $value}}% < 15%
summary: node-mem-limit alert
expr: " (node_memory_MemAvailable_bytes{job='node-exporter'} / node_memory_MemTotal_bytes{job='node-exporter'}) * 100 < 15 "
for: 1m
labels:
severity: warning
Description:
{{ $labels.instance }} refers to the instance key under the labels of the monitored target. {{ printf "%.2f" $value }} formats the sample value to two decimal places.
Effect achieved:
The rendered description reads like: node-exporter <instance> mem only 11.50% < 15%
The container CPU usage exceeds 200%
Recording rule: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
The concrete expression:
sum by(cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="", job="kubelet", metrics_path="/metrics/cadvisor"}[5m])) * on(cluster, namespace, pod) group_left(node) topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) * 100 > 200
Container memory usage exceeds 1 GB
container_memory_working_set_bytes{pod!=""} / (1024*1024*1024) > 1
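As with the etcd and node-memory rules above, these two expressions can be wrapped in a PrometheusRule object so that Prometheus loads them. A minimal sketch follows; the metadata name, alert names, and for durations are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s              # required by the ruleSelector described earlier
    role: alert-rules
  name: container-usage          # illustrative name
  namespace: monitoring
spec:
  groups:
  - name: container-usage
    rules:
    - alert: ContainerCPUOver200Percent
      expr: |
        sum by(cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="", job="kubelet", metrics_path="/metrics/cadvisor"}[5m]))
          * on(cluster, namespace, pod) group_left(node)
          topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) * 100 > 200
      for: 5m
      labels:
        severity: warning
    - alert: ContainerMemoryOver1Gi
      expr: container_memory_working_set_bytes{pod!=""} / (1024*1024*1024) > 1
      for: 5m
      labels:
        severity: warning
```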
Alertmanager enterprise WeChat (WeCom) alerting
In this test, Alertmanager together with PrometheusAlert is used to call an enterprise WeChat robot to deliver alerts.
Deploy the PrometheusAlert service and modify its app.conf configuration file
kubectl apply -n monitoring -f https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/example/kubernetes/PrometheusAlert-Deployment.yaml
The PrometheusAlert configuration file is mounted as a ConfigMap. The service provides many alert channels; here we only use the enterprise WeChat channel.
Modifying Key Configurations
`[root@dm01 /etc/kubernetes]# kubectl edit configmap prometheus-alert-center-conf -n monitoring -o yaml`
...
# Whether to enable the WeChat alert channel; multiple channels can be enabled at the same time. 0 disables, 1 enables.
open-weixin=1
# Default robot webhook address
wxurl=xxxxx
...
Create an Ingress with a domain name to access its UI.
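A minimal sketch of such an Ingress; the host name and ingress class are assumptions, while the Service name and port come from the PrometheusAlert manifest used above:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-alert-center
  namespace: monitoring
spec:
  ingressClassName: nginx                   # assumes an nginx ingress controller is installed
  rules:
  - host: prometheusalert.example.com       # illustrative domain name
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-alert-center   # Service exposed by the PrometheusAlert deployment
            port:
              number: 8080
```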
You can run an alert test in the UI to check whether anything is wrong.
Configure the Alertmanager service
The Alertmanager configuration is stored in the alertmanager-main Secret; edit the configuration file:
[root@dm01 /etc/kubernetes]# cat /root/kube-prometheus/manifests/alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 30s        # example value: after a new alert group is created, wait at least group_wait before the initial notification so that multiple alerts of the same group can be sent together
  group_interval: 10s    # interval between notifications within the same group
  repeat_interval: 10m   # if an alert has already been sent successfully, wait repeat_interval before sending it again; tune per alert type
  receiver: 'web.hook.prometheusalert'   # default receiver: alerts not matched by any route go here; all attributes above are inherited by subroutes and can be overridden per subroute
  routes:
  - receiver: 'web.hook.prometheusalert'
    group_wait: 10s
    match:
      level: '1'
receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  # the URL is the SVC address of the PrometheusAlert service deployed above; a template, robot address and @-mention
  # can also be passed on the query string, e.g.
  # http://prometheus-alert-center.monitoring.svc.cluster.local:8080/prometheusalert?type=wx&tpl=aliyun-prometheus&wxurl=xxxxx&at=zhangsan
  - url: 'http://prometheus-alert-center.monitoring.svc.cluster.local:8080/prometheus/alert'
Re-create the alertmanager-main Secret to apply the update
[root@dm01 /etc/kubernetes]# kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
[root@dm01 /etc/kubernetes]# kubectl create secret generic alertmanager-main --from-file=/root/kube-prometheus/manifests/alertmanager.yaml -n monitoring
Example of creating a PrometheusAlert template
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{ if eq $v.status "resolved" }}
#### [Prometheus recovery information]({{ $v.generatorURL }})
> <font color="info">Alert name</font>: [{{ $v.labels.alertname }}]({{ $var }})
> <font color="info">Alert level</font>: {{ $v.labels.severity }}
> <font color="info">Start time</font>: {{ GetCSTtime $v.startsAt }}
> <font color="info">End time</font>: {{ GetCSTtime $v.endsAt }}
> <font color="info">Host name</font>: {{ $v.labels.instance }}
**{{ $v.annotations.description }}**
{{ else }}
#### [Prometheus alert]({{ $v.generatorURL }})
> <font color="#FF0000">Alert name</font>: [{{ $v.labels.alertname }}]({{ $var }})
> <font color="#FF0000">Alert level</font>: {{ $v.labels.severity }}
> <font color="#FF0000">Start time</font>: {{ GetCSTtime $v.startsAt }}
> <font color="#FF0000">End time</font>: {{ GetCSTtime $v.endsAt }}
> <font color="#FF0000">Host name</font>: {{ $v.labels.instance }}
**{{ $v.annotations.description }}**
{{ end }}
{{ end }}
Trigger alarm test
Take etcd down on one node
In a kubeadm environment, simply mv the etcd static Pod manifest out of the manifests directory:
[root@dm01 /etc/kubernetes/manifests]# mv etcd.yaml /opt/softs/
[root@dm01 /etc/kubernetes/manifests]# docker ps | grep -i etcd
Check the alert status after about a minute
The enterprise WeChat robot receives the alert.
Golang basic monitoring implementation
Start with a basic piece of code:
package main
import (
"net/http"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// expose the default Prometheus registry on /metrics
	http.Handle("/metrics", promhttp.Handler())
	// listen on :2121 so the port matches the Service targetPort / containerPort in the manifests below
	http.ListenAndServe(":2121", nil)
}
Package via Dockerfile
### Build the manager binary
FROM golang:1.15 as builder
WORKDIR /workspace
ENV GOPROXY="https://goproxy.cn,direct"
# Copy the go source
COPY . .
# Build
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=on go build -o gotest main.go
# Runtime
FROM alpine
WORKDIR /usr/local/bin
COPY --from=builder /workspace/gotest /usr/local/bin/gotest
CMD ["gotest"]
Deploy to the Kubernetes environment and create the ServiceMonitor resource
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: go-demo
namespace: monitoring
labels:
k8s-app: go-demo
spec:
jobLabel: k8s-app
endpoints:
- port: port
interval: 15s
selector:
matchLabels:
k8s-app: go-demo
namespaceSelector:
matchNames:
- test008
---
apiVersion: v1
kind: Service
metadata:
name: go-demo
namespace: test008
labels:
k8s-app: go-demo
spec:
type: ClusterIP
ports:
- name: port
port: 2121
selector:
k8s-app: go-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: go-demo
namespace: test008
labels:
k8s-app: go-demo
spec:
replicas: 1
selector:
matchLabels:
k8s-app: go-demo
template:
metadata:
labels:
k8s-app: go-demo
spec:
containers:
- name: go-demo
image: xxx
ports:
- containerPort: 2121
The problem
At this point, checking the Prometheus console, we find that the target for the service we just deployed does not actually appear, and the Prometheus logs show that Prometheus has no permission to access resources in the test008 namespace.
The cause
The default kube-prometheus deployment creates Role and RoleBinding objects that grant permissions in only three namespaces: default, kube-system, and monitoring. To collect data from any other namespace, a corresponding Role and RoleBinding must be created for it.
Create roles for the specified namespace
apiVersion: rbac.authorization.k8s.io/v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: default
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: kube-system
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: monitoring
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: prometheus-k8s
namespace: test008
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
kind: RoleList
Create Rolebinding
apiVersion: rbac.authorization.k8s.io/v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: prometheus-k8s
namespace: test008
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: prometheus-k8s
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: monitoring
kind: RoleBindingList
View the target in the Prometheus console
Change the default data retention duration of Prometheus
Modify the retention period managed by the Prometheus Operator by setting this parameter under spec in the Prometheus custom resource:
retention: 7d
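For context, the field sits under spec of the Prometheus custom resource; with kube-prometheus the object is named k8s in the monitoring namespace, as shown in the ruleSelector example earlier:

```yaml
# fragment of the Prometheus custom resource (e.g. kubectl edit prometheus k8s -n monitoring)
spec:
  retention: 7d        # keep time-series data for 7 days
```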