The content is from the official Longhorn 1.1.2 English technical manual.

This article is part of a series:

  • What is Longhorn?
  • Longhorn enterprise cloud native container distributed storage – design architecture and concepts
  • Longhorn enterprise cloud native container distributed storage – deployment
  • Longhorn enterprise cloud native container distributed storage – volumes and nodes
  • Longhorn enterprise cloud native container distributed storage – K8S resource configuration examples

Contents

  1. Setting up Prometheus and Grafana to monitor Longhorn
  2. Integrating Longhorn metrics into the Rancher monitoring system
  3. Longhorn metrics
  4. Kubelet Volume metrics support
  5. Example Longhorn alert rules

Setting up Prometheus and Grafana to monitor Longhorn

Overview

Longhorn natively exposes metrics in Prometheus text format at REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics. For a description of all the metrics available, see Longhorn’s Metrics. You can capture these metrics using any collection tool such as Prometheus, Graphite, Telegraf, etc., and then visualize the collected data using tools such as Grafana.
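Since the endpoint serves plain Prometheus text format, any tool can consume it. As an illustration, here is a minimal sketch of parsing such output in Python; the sample lines are modeled on Longhorn's metric names with illustrative values, not captured from a live endpoint:

```python
import re

# Sample exposition text modeled on Longhorn's metrics (illustrative values).
SAMPLE = """\
# HELP longhorn_volume_capacity_bytes Configured size of this volume in bytes
# TYPE longhorn_volume_capacity_bytes gauge
longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09
longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08
"""

# Matches simple labeled gauge lines: name{label="value",...} number
LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples, skipping comments."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        name, labelstr, value = m.groups()
        labels = dict(
            (k, v.strip('"'))
            for k, v in (pair.split('=', 1) for pair in labelstr.split(','))
        )
        out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice you would let Prometheus scrape the endpoint rather than parse it yourself; the sketch only shows what the exposition format contains.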

This document provides a sample setup for monitoring Longhorn. The monitoring system uses Prometheus to collect data and generate alerts, and Grafana to visualize/dashboard the collected data. At a high level, the monitoring system includes:

  • The Prometheus server scrapes and stores time series data from Longhorn's metric endpoints. Prometheus is also responsible for generating alerts based on configured rules and the collected data. The Prometheus server then sends alerts to the Alertmanager.
  • The Alertmanager then manages these alerts, including silencing, inhibition, aggregation, and sending notifications via email, on-call notification systems, and chat platforms.
  • Grafana queries the Prometheus server for data and draws dashboards for visualization.

The following figure describes the detailed architecture of the monitoring system.

There are two components not shown in the figure above:

  • The Longhorn backend service is the service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed by the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
  • The Prometheus Operator makes running Prometheus on Kubernetes very easy. The operator watches three custom resources: ServiceMonitor, Prometheus, and AlertManager. When users create these custom resources, the Prometheus Operator deploys and manages the Prometheus server and AlertManager using the user-specified configuration.

The installation

Follow these instructions to install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.

Create the monitoring namespace

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

Install the Prometheus Operator

Deploy the Prometheus Operator and its required ClusterRole, ClusterRoleBinding, and Service Account.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator

Install the Longhorn ServiceMonitor

The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service. Later, the Prometheus CRD can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

Install and configure Prometheus Alertmanager

  1. Create a highly available Alertmanager deployment with three instances:

    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: longhorn
      namespace: monitoring
    spec:
      replicas: 3
  2. An Alertmanager instance will not start unless a valid configuration is provided. For more instructions on Alertmanager configuration, see here. The following code shows an example configuration:

    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname]
      receiver: email_and_slack
    receivers:
    - name: email_and_slack
      email_configs:
      - to: <the email address to send notifications to>
        from: <the sender address>
        smarthost: <the SMTP host through which emails are sent>
        # SMTP authentication information.
        auth_username: <the username>
        auth_identity: <the identity>
        auth_password: <the password>
        headers:
          subject: 'Longhorn-Alert'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
      slack_configs:
      - api_url: <the Slack webhook URL>
        channel: <the channel or user to send notifications to>
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}

    Save the above Alertmanager configuration in a file named alertmanager.yaml and create a secret from it using kubectl.

    The Alertmanager instance requires that the secret resource name follow the alertmanager-{ALERTMANAGER_NAME} format. In the previous step, the name of the Alertmanager was longhorn, so the secret name must be alertmanager-longhorn.

    $ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
  3. To be able to view the Web UI of the Alertmanager, expose it through a Service. A simple way to do this is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager-longhorn
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30903
        port: 9093
        protocol: TCP
        targetPort: web
      selector:
        alertmanager: longhorn

    After the preceding services are created, you can access the Web UI of Alertmanager using the IP address and port 30903 of the node.

    Use the NodePort service above only for quick validation, since it does not communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an ingress controller to expose the Alertmanager web UI over a TLS connection.

Install and configure the Prometheus server

  1. Create a custom PrometheusRule resource that defines alert conditions.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: longhorn
        role: alert-rules
      name: prometheus-longhorn-rules
      namespace: monitoring
    spec:
      groups:
      - name: longhorn.rules
        rules:
        - alert: LonghornVolumeUsageCritical
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for
              more than 5 minutes.
            summary: Longhorn volume capacity is over 90% used.
          expr: 100 * (longhorn_volume_usage_bytes / longhorn_volume_capacity_bytes) > 90
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
            severity: critical

    For more information on how to define alert rules, see Prometheus. IO/docs/promet…

  2. If RBAC authorization is enabled, create a ClusterRole and ClusterRoleBinding for Prometheus Pod:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRole
    metadata:
      name: prometheus
      namespace: monitoring
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources:
      - configmaps
      verbs: ["get"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
  3. Create a Prometheus custom resource. Note that we selected the Longhorn Service Monitor and Longhorn rules in the spec.

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      alerting:
        alertmanagers:
          - namespace: monitoring
            name: alertmanager-longhorn
            port: web
      serviceMonitorSelector:
        matchLabels:
          name: longhorn-prometheus-servicemonitor
      ruleSelector:
        matchLabels:
          prometheus: longhorn
          role: alert-rules
  4. To be able to view the Web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30904
        port: 9090
        protocol: TCP
        targetPort: web
      selector:
        prometheus: prometheus

    After the service is created, you can access the Web UI of Prometheus Server using the IP address and port 30904 of the node.

    At this point, you should be able to see all Longhorn Manager Targets and Longhorn Rules in the Targets and Rules section of the Prometheus Server UI.

    Use the above NodePort service only for quick validation, since it does not communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an ingress controller to expose the Prometheus server's web UI over a TLS connection.

Install Grafana

  1. Create Grafana data source configuration:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-datasources
      namespace: monitoring
    data:
      prometheus.yaml: |-
        {
          "apiVersion": 1,
          "datasources": [
            {
              "access": "proxy",
              "editable": true,
              "name": "prometheus",
              "orgId": 1,
              "type": "prometheus",
              "url": "http://prometheus:9090",
              "version": 1
            }
          ]
        }
  2. Creating a Grafana deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
      labels:
        app: grafana
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          name: grafana
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:7.1.5
            ports:
            - name: grafana
              containerPort: 3000
            resources:
              limits:
                memory: "500Mi"
                cpu: "300m"
              requests:
                memory: "500Mi"
                cpu: "200m"
            volumeMounts:
              - mountPath: /var/lib/grafana
                name: grafana-storage
              - mountPath: /etc/grafana/provisioning/datasources
                name: grafana-datasources
                readOnly: false
          volumes:
            - name: grafana-storage
              emptyDir: {}
            - name: grafana-datasources
              configMap:
                  defaultMode: 420
                  name: grafana-datasources
  3. Expose Grafana on NodePort 32000:

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      selector:
        app: grafana
      type: NodePort
      ports:
        - port: 3000
          targetPort: 3000
          nodePort: 32000

    Use the above NodePort service only for quick validation, since it does not communicate over a TLS connection. You might want to change the service type to ClusterIP and set up an ingress controller to expose Grafana over a TLS connection.

  4. Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

    User: admin
    Pass: admin
  5. Install the Longhorn dashboard

    After logging in to Grafana, import the prebuilt Longhorn dashboard: grafana.com/grafana/das…

    For instructions on how to import the Grafana Dashboard, see grafana.com/docs/grafan…

    Upon success, you should see the following dashboards:

Integrating Longhorn metrics into the Rancher monitoring system

About the Rancher monitoring system

With Rancher, you can monitor the status and progress of cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open source monitoring solution.

For instructions on how to deploy/enable the Rancher monitoring system, see rancher.com/docs/ranche…

Adding Longhorn metrics to the Rancher monitoring system

If you use Rancher to manage your Kubernetes and have Rancher monitoring enabled, you can add Longhorn metrics to Rancher monitoring by simply deploying the following ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

After the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics.

You can then set up the Grafana dashboard for visualization.

Longhorn metrics

Volume

  • longhorn_volume_actual_size_bytes: The actual space used by each replica of the volume on the corresponding node.
    Example: longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08
  • longhorn_volume_capacity_bytes: The configured size of this volume, in bytes.
    Example: longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09
  • longhorn_volume_state: The state of this volume: 1=creating, 2=attached, 3=detached, 4=attaching, 5=detaching, 6=deleting.
    Example: longhorn_volume_state{node="worker-2",volume="testvol"} 2
  • longhorn_volume_robustness: The robustness of this volume: 0=unknown, 1=healthy, 2=degraded, 3=faulted.
    Example: longhorn_volume_robustness{node="worker-2",volume="testvol"} 1
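The two enum-valued gauges above are easier to read once decoded. A small helper, using only the value mappings listed in this section (a sketch, not part of any Longhorn client library):

```python
# Value mappings taken from the longhorn_volume_state and
# longhorn_volume_robustness descriptions above.
VOLUME_STATE = {1: "creating", 2: "attached", 3: "detached",
                4: "attaching", 5: "detaching", 6: "deleting"}
VOLUME_ROBUSTNESS = {0: "unknown", 1: "healthy", 2: "degraded", 3: "faulted"}

def decode_volume_sample(state, robustness):
    """Translate raw gauge values into human-readable status strings."""
    return (VOLUME_STATE.get(int(state), "invalid"),
            VOLUME_ROBUSTNESS.get(int(robustness), "invalid"))

# The sample values shown above: state 2, robustness 1.
print(decode_volume_sample(2, 1))  # -> ('attached', 'healthy')
```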

Node

  • longhorn_node_status: The status of this node: 1=true, 0=false.
    Example: longhorn_node_status{condition="ready",condition_reason="",node="worker-2"} 1
  • longhorn_node_count_total: The total number of nodes in the Longhorn system.
    Example: longhorn_node_count_total 4
  • longhorn_node_cpu_capacity_millicpu: The maximum allocatable CPU on this node.
    Example: longhorn_node_cpu_capacity_millicpu{node="worker-2"} 2000
  • longhorn_node_cpu_usage_millicpu: The CPU usage on this node.
    Example: longhorn_node_cpu_usage_millicpu{node="worker-2"} 186
  • longhorn_node_memory_capacity_bytes: The maximum allocatable memory on this node.
    Example: longhorn_node_memory_capacity_bytes{node="worker-2"} 4.031229952e+09
  • longhorn_node_memory_usage_bytes: The memory usage on this node.
    Example: longhorn_node_memory_usage_bytes{node="worker-2"} 1.833582592e+09
  • longhorn_node_storage_capacity_bytes: The storage capacity of this node.
    Example: longhorn_node_storage_capacity_bytes{node="worker-3"} 8.3987283968e+10
  • longhorn_node_storage_usage_bytes: The used storage on this node.
    Example: longhorn_node_storage_usage_bytes{node="worker-3"} 9.060941824e+09
  • longhorn_node_storage_reservation_bytes: The storage reserved for other applications and the system on this node.
    Example: longhorn_node_storage_reservation_bytes{node="worker-3"} 2.519618519e+10

Disk

  • longhorn_disk_capacity_bytes: The storage capacity of this disk.
    Example: longhorn_disk_capacity_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 8.3987283968e+10
  • longhorn_disk_usage_bytes: The used storage on this disk.
    Example: longhorn_disk_usage_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 9.060941824e+09
  • longhorn_disk_reservation_bytes: The storage reserved for other applications and the system on this disk.
    Example: longhorn_disk_reservation_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 2.519618519e+10

Instance Manager

  • longhorn_instance_manager_cpu_usage_millicpu: The CPU usage of this Longhorn instance manager.
    Example: longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 80
  • longhorn_instance_manager_cpu_requests_millicpu: The CPU resources requested in Kubernetes by this Longhorn instance manager.
    Example: longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 250
  • longhorn_instance_manager_memory_usage_bytes: The memory usage of this Longhorn instance manager.
    Example: longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 2.4072192e+07
  • longhorn_instance_manager_memory_requests_bytes: The memory requested in Kubernetes by this Longhorn instance manager.
    Example: longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 0

Manager

  • longhorn_manager_cpu_usage_millicpu: The CPU usage of this Longhorn manager.
    Example: longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-5rx2n",node="worker-2"} 27
  • longhorn_manager_memory_usage_bytes: The memory usage of this Longhorn manager.
    Example: longhorn_manager_memory_usage_bytes{manager="longhorn-manager-5rx2n",node="worker-2"} 2.6144768e+07

Kubelet Volume metrics support

About Kubelet Volume metrics

The kubelet exposes the following metrics:

  1. kubelet_volume_stats_capacity_bytes
  2. kubelet_volume_stats_available_bytes
  3. kubelet_volume_stats_used_bytes
  4. kubelet_volume_stats_inodes
  5. kubelet_volume_stats_inodes_free
  6. kubelet_volume_stats_inodes_used

These metrics measure information about the PVC filesystem inside the Longhorn block device.

They differ from the longhorn_volume_* metrics, which measure information specific to the Longhorn block device.

You can set up a monitoring system that scrapes the kubelet metric endpoints to get the status of a PVC and set up alerts for abnormal events, such as a PVC running out of storage space.

A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes the kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
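The kind of "PVC running out of space" alert described above reduces to simple arithmetic over these gauges. A sketch with illustrative values (not taken from a real cluster):

```python
def pvc_used_percent(used_bytes, capacity_bytes):
    """Percentage of the PVC filesystem in use, as an alert rule would compute it."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used_bytes / capacity_bytes

# Illustrative kubelet_volume_stats_* values for one PVC (assumed, not real data).
capacity = 10 * 1024**3   # kubelet_volume_stats_capacity_bytes
used = 9.2 * 1024**3      # kubelet_volume_stats_used_bytes

pct = pvc_used_percent(used, capacity)
print(f"{pct:.1f}% used")  # prints 92.0% used
if pct > 90:
    print("would fire an out-of-space alert")
```

The equivalent PromQL in an alert rule would divide kubelet_volume_stats_used_bytes by kubelet_volume_stats_capacity_bytes and compare against a threshold.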

Longhorn CSI plugin support

In v1.1.0, the Longhorn CSI plugin supports the NodeGetVolumeStats RPC according to the CSI spec.

This allows the kubelet to query the Longhorn CSI plugin for the status of a PVC.

The kubelet then exposes this information in the kubelet_volume_stats_* metrics.

Example Longhorn alert rules

We provide several example Longhorn alert rules below for your reference. See here for a list of all available Longhorn metrics, from which you can build your own alert rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is faulted for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is faulted
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is faulted.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary:  The used storage of node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary:  The used storage of disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_count_total - (count(longhorn_node_status{condition="ready"}==1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has CPU usage / CPU capacity at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning

See Prometheus. IO/docs/promet… for more information on how to define alert rules.
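The LonghornNodeDown expression above subtracts the count of nodes whose "ready" status gauge equals 1 from the total node count; the OR on() vector(0) guard makes the count default to 0 when no node is ready (count() over an empty vector returns no data). A sketch of the same logic in Python, with illustrative values:

```python
# Sketch of the LonghornNodeDown expression: total nodes minus the number of
# nodes whose longhorn_node_status{condition="ready"} gauge is 1.
# Sample values are illustrative, not taken from a real cluster.
def nodes_down(total_nodes, ready_status_by_node):
    """Number of offline nodes; sum() over an empty dict plays the role
    of the 'OR on() vector(0)' fallback in the PromQL expression."""
    ready = sum(1 for v in ready_status_by_node.values() if v == 1)
    return total_nodes - ready

status = {"worker-1": 1, "worker-2": 1, "worker-3": 0}
print(nodes_down(3, status))  # -> 1 (one node offline)
print(nodes_down(2, {}))      # -> 2 (no node reports ready)
```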