The content is from the official Longhorn 1.1.2 English technical manual.

This article is part of a series:

  • What is Longhorn?
  • Longhorn enterprise cloud native container distributed storage – design architecture and concepts
  • Longhorn enterprise cloud native container distributed storage – deployment
  • Longhorn enterprise cloud native container distributed storage – volumes and nodes
  • Longhorn enterprise cloud native container distributed storage – K8S resource configuration examples

Contents

  1. Setting up Prometheus and Grafana to monitor Longhorn
  2. Integrating Longhorn metrics into the Rancher monitoring system
  3. Longhorn metrics
  4. Kubelet Volume metrics support
  5. Example Longhorn alert rules

Setting up Prometheus and Grafana to monitor Longhorn

Overview

Longhorn natively exposes metrics in Prometheus text format at REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics. For a description of all the metrics available, see Longhorn’s Metrics. You can capture these metrics using any collection tool such as Prometheus, Graphite, Telegraf, etc., and then visualize the collected data using tools such as Grafana.
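Since the endpoint serves plain Prometheus text format, any tool can consume it. As an illustration, here is a minimal sketch of parsing such output in Python; the sample lines are modeled on Longhorn's metric names with illustrative values, not captured from a live endpoint:

```python
import re

# Sample exposition text modeled on Longhorn's metrics (illustrative values).
SAMPLE = """\
# HELP longhorn_volume_capacity_bytes Configured size of this volume in bytes
# TYPE longhorn_volume_capacity_bytes gauge
longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09
longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08
"""

# Matches simple labeled gauge lines: name{label="value",...} number
LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples, skipping comments."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        name, labelstr, value = m.groups()
        labels = dict(
            (k, v.strip('"'))
            for k, v in (pair.split('=', 1) for pair in labelstr.split(','))
        )
        out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice you would let Prometheus scrape the endpoint rather than parse it yourself; the sketch only shows what the exposition format contains.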

This document provides a sample setup for monitoring Longhorn. The monitoring system uses Prometheus to collect data and generate alerts, and Grafana to visualize/dashboard the collected data. At a high level, the monitoring system includes:

  • The Prometheus server scrapes and stores time series data from Longhorn's metric endpoints. Prometheus is also responsible for generating alerts based on configured rules and the collected data. The Prometheus server then sends alerts to the Alertmanager.
  • The Alertmanager then manages these alerts, including silencing, inhibition, aggregation, and sending notifications via email, on-call notification systems, and chat platforms.
  • Grafana queries the Prometheus server for data and draws dashboards for visualization.

The following figure describes the detailed architecture of the monitoring system.

There are two components not shown in the figure above:

  • The Longhorn backend service is the service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed by the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
  • The Prometheus Operator makes running Prometheus on Kubernetes very easy. The operator watches three custom resources: ServiceMonitor, Prometheus, and AlertManager. When users create these custom resources, the Prometheus Operator deploys and manages the Prometheus server and AlertManager using the user-specified configuration.

The installation

Follow these instructions to install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.

Create the monitoring namespace

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

Install the Prometheus Operator

Deploy the Prometheus Operator and its required ClusterRole, ClusterRoleBinding, and Service Account.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator

Install the Longhorn ServiceMonitor

The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service. Later, the Prometheus CRD can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

Install and configure Prometheus Alertmanager

  1. Create a highly available Alertmanager deployment with three instances:

    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: longhorn
      namespace: monitoring
    spec:
      replicas: 3
  2. An Alertmanager instance will not start unless a valid configuration is provided. For more instructions on Alertmanager configuration, see here. The following code shows an example configuration:

    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname]
      receiver: email_and_slack
    receivers:
    - name: email_and_slack
      email_configs:
      - to: <the email address to send notifications to>
        from: <the sender address>
        smarthost: <the SMTP host through which emails are sent>
        # SMTP authentication information.
        auth_username: <the username>
        auth_identity: <the identity>
        auth_password: <the password>
        headers:
          subject: 'Longhorn-Alert'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
      slack_configs:
      - api_url: <the Slack webhook URL>
        channel: <the channel or user to send notifications to>
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}

    Save the above Alertmanager configuration in a file named alertmanager.yaml and create a secret from it using kubectl.

    The Alertmanager instance requires that the secret resource name follow the alertmanager-{ALERTMANAGER_NAME} format. In the previous step, the name of the Alertmanager was longhorn, so the secret name must be alertmanager-longhorn.

    $ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
  3. To be able to view the Web UI of the Alertmanager, expose it through a Service. A simple way to do this is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager-longhorn
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30903
        port: 9093
        protocol: TCP
        targetPort: web
      selector:
        alertmanager: longhorn

    After the preceding services are created, you can access the Web UI of Alertmanager using the IP address and port 30903 of the node.

    Use the NodePort service above only for quick validation, since it does not communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an ingress controller to expose the Alertmanager web UI over a TLS connection.

Install and configure the Prometheus server

  1. Create a custom PrometheusRule resource that defines alert conditions.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: longhorn
        role: alert-rules
      name: prometheus-longhorn-rules
      namespace: monitoring
    spec:
      groups:
      - name: longhorn.rules
        rules:
        - alert: LonghornVolumeUsageCritical
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for
              more than 5 minutes.
            summary: Longhorn volume capacity is over 90% used.
          expr: 100 * (longhorn_volume_usage_bytes / longhorn_volume_capacity_bytes) > 90
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
            severity: critical

    For more information on how to define alert rules, see Prometheus. IO/docs/promet…

  2. If RBAC authorization is enabled, create a ClusterRole and ClusterRoleBinding for Prometheus Pod:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRole
    metadata:
      name: prometheus
      namespace: monitoring
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources:
      - configmaps
      verbs: ["get"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
  3. Create a Prometheus custom resource. Note that we selected the Longhorn Service Monitor and Longhorn rules in the spec.

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      alerting:
        alertmanagers:
          - namespace: monitoring
            name: alertmanager-longhorn
            port: web
      serviceMonitorSelector:
        matchLabels:
          name: longhorn-prometheus-servicemonitor
      ruleSelector:
        matchLabels:
          prometheus: longhorn
          role: alert-rules
  4. To be able to view the Web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30904
        port: 9090
        protocol: TCP
        targetPort: web
      selector:
        prometheus: prometheus

    After the service is created, you can access the Web UI of Prometheus Server using the IP address and port 30904 of the node.

    At this point, you should be able to see all Longhorn Manager Targets and Longhorn Rules in the Targets and Rules section of the Prometheus Server UI.

    Use the above NodePort service only for quick validation, since it does not communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an ingress controller to expose the Prometheus server's web UI over a TLS connection.

Install Grafana

  1. Create Grafana data source configuration:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-datasources
      namespace: monitoring
    data:
      prometheus.yaml: |-
        {
          "apiVersion": 1,
          "datasources": [
            {
              "access": "proxy",
              "editable": true,
              "name": "prometheus",
              "orgId": 1,
              "type": "prometheus",
              "url": "http://prometheus:9090",
              "version": 1
            }
          ]
        }
  2. Creating a Grafana deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
      labels:
        app: grafana
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          name: grafana
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:7.1.5
            ports:
            - name: grafana
              containerPort: 3000
            resources:
              limits:
                memory: "500Mi"
                cpu: "300m"
              requests:
                memory: "500Mi"
                cpu: "200m"
            volumeMounts:
              - mountPath: /var/lib/grafana
                name: grafana-storage
              - mountPath: /etc/grafana/provisioning/datasources
                name: grafana-datasources
                readOnly: false
          volumes:
            - name: grafana-storage
              emptyDir: {}
            - name: grafana-datasources
              configMap:
                  defaultMode: 420
                  name: grafana-datasources
  3. Expose Grafana on NodePort 32000:

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      selector:
        app: grafana
      type: NodePort
      ports:
        - port: 3000
          targetPort: 3000
          nodePort: 32000

    Use the above NodePort service only for quick validation, since it does not communicate over a TLS connection. You might want to change the service type to ClusterIP and set up an ingress controller to expose Grafana over a TLS connection.

  4. Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

    User: admin
    Pass: admin
  5. Install the Longhorn dashboard

    After logging in to Grafana, import the prebuilt Longhorn dashboard: grafana.com/grafana/das…

    For instructions on how to import the Grafana Dashboard, see grafana.com/docs/grafan…

    Upon success, you should see the following dashboards:

Integrating Longhorn metrics into the Rancher monitoring system

About the Rancher monitoring system

With Rancher, you can monitor the status and progress of cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open source monitoring solution.

For instructions on how to deploy/enable the Rancher monitoring system, see rancher.com/docs/ranche…

Adding Longhorn metrics to the Rancher monitoring system

If you use Rancher to manage your Kubernetes and have Rancher monitoring enabled, you can add Longhorn metrics to Rancher monitoring by simply deploying the following ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

After the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics.

You can then set up the Grafana dashboard for visualization.

Longhorn metrics

Volume

  • longhorn_volume_actual_size_bytes: The actual space used by each replica of the volume on the corresponding node.
    Example: longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08
  • longhorn_volume_capacity_bytes: The configured size of this volume, in bytes.
    Example: longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09
  • longhorn_volume_state: The state of this volume: 1=creating, 2=attached, 3=detached, 4=attaching, 5=detaching, 6=deleting.
    Example: longhorn_volume_state{node="worker-2",volume="testvol"} 2
  • longhorn_volume_robustness: The robustness of this volume: 0=unknown, 1=healthy, 2=degraded, 3=faulted.
    Example: longhorn_volume_robustness{node="worker-2",volume="testvol"} 1
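The two enum-valued gauges above are easier to read once decoded. A small helper, using only the value mappings listed in this section (a sketch, not part of any Longhorn client library):

```python
# Value mappings taken from the longhorn_volume_state and
# longhorn_volume_robustness descriptions above.
VOLUME_STATE = {1: "creating", 2: "attached", 3: "detached",
                4: "attaching", 5: "detaching", 6: "deleting"}
VOLUME_ROBUSTNESS = {0: "unknown", 1: "healthy", 2: "degraded", 3: "faulted"}

def decode_volume_sample(state, robustness):
    """Translate raw gauge values into human-readable status strings."""
    return (VOLUME_STATE.get(int(state), "invalid"),
            VOLUME_ROBUSTNESS.get(int(robustness), "invalid"))

# The sample values shown above: state 2, robustness 1.
print(decode_volume_sample(2, 1))  # -> ('attached', 'healthy')
```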

Node

  • longhorn_node_status: The status of this node: 1=true, 0=false.
    Example: longhorn_node_status{condition="ready",condition_reason="",node="worker-2"} 1
  • longhorn_node_count_total: The total number of nodes in the Longhorn system.
    Example: longhorn_node_count_total 4
  • longhorn_node_cpu_capacity_millicpu: The maximum allocatable CPU on this node.
    Example: longhorn_node_cpu_capacity_millicpu{node="worker-2"} 2000
  • longhorn_node_cpu_usage_millicpu: The CPU usage on this node.
    Example: longhorn_node_cpu_usage_millicpu{node="worker-2"} 186
  • longhorn_node_memory_capacity_bytes: The maximum allocatable memory on this node.
    Example: longhorn_node_memory_capacity_bytes{node="worker-2"} 4.031229952e+09
  • longhorn_node_memory_usage_bytes: The memory usage on this node.
    Example: longhorn_node_memory_usage_bytes{node="worker-2"} 1.833582592e+09
  • longhorn_node_storage_capacity_bytes: The storage capacity of this node.
    Example: longhorn_node_storage_capacity_bytes{node="worker-3"} 8.3987283968e+10
  • longhorn_node_storage_usage_bytes: The used storage on this node.
    Example: longhorn_node_storage_usage_bytes{node="worker-3"} 9.060941824e+09
  • longhorn_node_storage_reservation_bytes: The storage reserved for other applications and the system on this node.
    Example: longhorn_node_storage_reservation_bytes{node="worker-3"} 2.519618519e+10

Disk

  • longhorn_disk_capacity_bytes: The storage capacity of this disk.
    Example: longhorn_disk_capacity_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 8.3987283968e+10
  • longhorn_disk_usage_bytes: The used storage on this disk.
    Example: longhorn_disk_usage_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 9.060941824e+09
  • longhorn_disk_reservation_bytes: The storage reserved for other applications and the system on this disk.
    Example: longhorn_disk_reservation_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 2.519618519e+10

Instance Manager

  • longhorn_instance_manager_cpu_usage_millicpu: The CPU usage of this Longhorn instance manager.
    Example: longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 80
  • longhorn_instance_manager_cpu_requests_millicpu: The CPU resources requested in Kubernetes by this Longhorn instance manager.
    Example: longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 250
  • longhorn_instance_manager_memory_usage_bytes: The memory usage of this Longhorn instance manager.
    Example: longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 2.4072192e+07
  • longhorn_instance_manager_memory_requests_bytes: The memory requested in Kubernetes by this Longhorn instance manager.
    Example: longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 0

Manager

  • longhorn_manager_cpu_usage_millicpu: The CPU usage of this Longhorn manager.
    Example: longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-5rx2n",node="worker-2"} 27
  • longhorn_manager_memory_usage_bytes: The memory usage of this Longhorn manager.
    Example: longhorn_manager_memory_usage_bytes{manager="longhorn-manager-5rx2n",node="worker-2"} 2.6144768e+07

Kubelet Volume metrics support

About Kubelet Volume metrics

The kubelet exposes the following metrics:

  1. kubelet_volume_stats_capacity_bytes
  2. kubelet_volume_stats_available_bytes
  3. kubelet_volume_stats_used_bytes
  4. kubelet_volume_stats_inodes
  5. kubelet_volume_stats_inodes_free
  6. kubelet_volume_stats_inodes_used

These metrics measure information about the PVC filesystem inside the Longhorn block device.

They differ from the longhorn_volume_* metrics, which measure information specific to the Longhorn block device.

You can set up a monitoring system that scrapes the kubelet metric endpoints to get the status of a PVC and set up alerts for abnormal events, such as a PVC running out of storage space.

A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes the kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
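The kind of "PVC running out of space" alert described above reduces to simple arithmetic over these gauges. A sketch with illustrative values (not taken from a real cluster):

```python
def pvc_used_percent(used_bytes, capacity_bytes):
    """Percentage of the PVC filesystem in use, as an alert rule would compute it."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used_bytes / capacity_bytes

# Illustrative kubelet_volume_stats_* values for one PVC (assumed, not real data).
capacity = 10 * 1024**3   # kubelet_volume_stats_capacity_bytes
used = 9.2 * 1024**3      # kubelet_volume_stats_used_bytes

pct = pvc_used_percent(used, capacity)
print(f"{pct:.1f}% used")  # prints 92.0% used
if pct > 90:
    print("would fire an out-of-space alert")
```

The equivalent PromQL in an alert rule would divide kubelet_volume_stats_used_bytes by kubelet_volume_stats_capacity_bytes and compare against a threshold.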

Longhorn CSI plugin support

In v1.1.0, the Longhorn CSI plugin supports the NodeGetVolumeStats RPC according to the CSI spec.

This allows the kubelet to query the Longhorn CSI plugin for the status of a PVC.

The kubelet then exposes this information in the kubelet_volume_stats_* metrics.

Example Longhorn alert rules

We provide several example Longhorn alert rules below for your reference. See here for a list of all available Longhorn metrics, from which you can build your own alert rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is faulted for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is faulted
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is faulted.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary:  The used storage of node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary:  The used storage of disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_count_total - (count(longhorn_node_status{condition="ready"}==1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has CPU usage / CPU capacity at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning

See Prometheus. IO/docs/promet… for more information on how to define alert rules.
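The LonghornNodeDown expression above subtracts the count of nodes whose "ready" status gauge equals 1 from the total node count; the OR on() vector(0) guard makes the count default to 0 when no node is ready (count() over an empty vector returns no data). A sketch of the same logic in Python, with illustrative values:

```python
# Sketch of the LonghornNodeDown expression: total nodes minus the number of
# nodes whose longhorn_node_status{condition="ready"} gauge is 1.
# Sample values are illustrative, not taken from a real cluster.
def nodes_down(total_nodes, ready_status_by_node):
    """Number of offline nodes; sum() over an empty dict plays the role
    of the 'OR on() vector(0)' fallback in the PromQL expression."""
    ready = sum(1 for v in ready_status_by_node.values() if v == 1)
    return total_nodes - ready

status = {"worker-1": 1, "worker-2": 1, "worker-3": 0}
print(nodes_down(3, status))  # -> 1 (one node offline)
print(nodes_down(2, {}))      # -> 2 (no node reports ready)
```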