Of all the objects in the Kubernetes API, Events is one of the most overlooked types. Compared with other objects, events are produced very frequently and are not kept in etcd for long; by default an event is retained for only one hour. When we run kubectl describe against an object, we may no longer be able to see its historical events because they have already expired, which is unfriendly to cluster users. Besides simply viewing cluster events, we may also need to track specific Warning events, such as those for Pod lifecycles, ReplicaSets, or worker node status, and raise alerts on them. Before diving into this topic, let's first look at the structure of a Kubernetes Event. The following are the important fields as described in the official documentation:
- Message: A human-readable description of the status of this operation
- Involved Object: The object that the event is about, like Pod, Deployment, Node, etc.
- Reason: A short, machine-understandable string, in other words, an enum
- Source: The component reporting this event, a short machine-understandable string, e.g., kube-scheduler
- Type: Currently holds only Normal & Warning, but custom types can be given if desired.
- Count: The number of times the event has occurred
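These fields can be inspected directly with kubectl; for example, Warning events can be filtered with a field selector:

```bash
# Show recent events in the current namespace; the TYPE, REASON and MESSAGE
# columns map to the fields described above
kubectl get events

# Only Warning events; field selectors also work on reason, involvedObject.kind, etc.
kubectl get events --field-selector type=Warning
```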
For these events, we need a collection tool that ships them to a persistent store for later analysis. In the past we typically exported Kubernetes events to Elasticsearch for indexing and analysis.
Since this article is about analyzing Kubernetes events with Loki, we roughly follow this pipeline for event processing:
```
kubernetes-api --> event-exporter --> fluentd --> loki --> grafana
```
At present, Kubernetes events can be collected with kube-eventer from Alibaba Cloud or kubernetes-event-exporter from Opsgenie. (KubeSphere also has kube-events, but it has to be used together with the CRDs of its own components, so it is not covered here.)
Once events are written to Loki, they can be queried visually in Grafana using LogQL v2. For example, we can break Kubernetes events down statistically by level and type, and the dashboard lets us quickly spot cluster anomalies.
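As a sketch, a LogQL v2 query that counts events per minute grouped by their type could look like the following; the job label and the JSON field names are assumptions that depend on how your pipeline labels the exporter output:

```logql
# Events per minute, grouped by the event type (Normal / Warning).
# Assumes the exporter's JSON output is shipped with {job="kubernetes-event-exporter"}.
sum by (type) (
  count_over_time({job="kubernetes-event-exporter"} | json | __error__="" [1m])
)
```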
kubernetes-event-exporter
The first step is to deploy kubernetes-event-exporter, which prints cluster events to the container's stdout so that they can be picked up by the log collector:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: event-exporter
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    namespace: kube-system
    name: event-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: kube-system
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
    receivers:
      - name: "dump"
        file:
          path: "/dev/stdout"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-exporter
      version: v1
  template:
    metadata:
      labels:
        app: event-exporter
        version: v1
    spec:
      serviceAccountName: event-exporter
      containers:
        - name: event-exporter
          image: opsgenie/kubernetes-event-exporter:0.9
          imagePullPolicy: IfNotPresent
          args:
            - -conf=/data/config.yaml
          volumeMounts:
            - mountPath: /data
              name: cfg
      volumes:
        - name: cfg
          configMap:
            name: event-exporter-cfg
```
Once the container is running, kubectl logs shows the cluster events that the event-exporter container prints in JSON format.
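For example (names as in the manifest above):

```bash
# Tail the exporter's output; each line is one cluster event serialized as JSON
kubectl -n kube-system logs deploy/event-exporter --tail=5
```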
Fluentd and Fluent Bit, which usually already run on a Kubernetes cluster, collect container logs by default, so all we need to do is forward these logs to Loki.
For details on connecting Fluentd with Loki, see the earlier article "Fluentd and Loki".
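If Fluentd is the collector, a minimal output sketch using the fluent-plugin-grafana-loki plugin could look like the following; the match tag, the Loki URL, and the job label are assumptions for illustration and should target only the event-exporter container's logs in your setup:

```
<match kubernetes.**>
  @type loki
  url "http://loki:3100"
  # Attach a static label so the exporter's events are easy to find in Grafana;
  # which tag to match depends on how your Kubernetes log collection is configured
  extra_labels {"job":"kubernetes-event-exporter"}
  <buffer>
    flush_interval 10s
  </buffer>
</match>
```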
Finally, we can query the ingested Kubernetes events in Grafana.
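In Grafana Explore, a simple stream query is enough to verify that the events are arriving, for example (again assuming the job label above):

```logql
# All Warning events, parsed from the exporter's JSON output
{job="kubernetes-event-exporter"} | json | type="Warning"
```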
Event extension: Node Problem Detector
Kubernetes itself does not produce many events about nodes, and it cannot surface lower-level node states (such as kernel deadlocks or an unresponsive container runtime) as events. The Node Problem Detector is a good complement here: it reports more detailed node problems to Kubernetes as NodeConditions and Events.
Installing the Node Problem Detector is very simple and can be done with just two Helm commands:
```bash
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector deliveryhero/node-problem-detector
```
The Node Problem Detector lets users run custom scripts to generate events. In this article, in addition to the default configuration, we define a network monitor that runs a conntrack check on each node:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
  namespace: kube-system
data:
  network_problem.sh: |
    #!/bin/bash
    readonly OK=0
    readonly NONOK=1
    readonly UNKNOWN=2

    readonly NF_CT_COUNT_PATH='/proc/sys/net/netfilter/nf_conntrack_count'
    readonly NF_CT_MAX_PATH='/proc/sys/net/netfilter/nf_conntrack_max'
    readonly IP_CT_COUNT_PATH='/proc/sys/net/ipv4/netfilter/ip_conntrack_count'
    readonly IP_CT_MAX_PATH='/proc/sys/net/ipv4/netfilter/ip_conntrack_max'

    if [[ -f $NF_CT_COUNT_PATH ]] && [[ -f $NF_CT_MAX_PATH ]]; then
      readonly CT_COUNT_PATH=$NF_CT_COUNT_PATH
      readonly CT_MAX_PATH=$NF_CT_MAX_PATH
    elif [[ -f $IP_CT_COUNT_PATH ]] && [[ -f $IP_CT_MAX_PATH ]]; then
      readonly CT_COUNT_PATH=$IP_CT_COUNT_PATH
      readonly CT_MAX_PATH=$IP_CT_MAX_PATH
    else
      exit $UNKNOWN
    fi

    readonly conntrack_count=$(< $CT_COUNT_PATH) || exit $UNKNOWN
    readonly conntrack_max=$(< $CT_MAX_PATH) || exit $UNKNOWN
    readonly conntrack_usage_msg="${conntrack_count} out of ${conntrack_max}"

    if (( conntrack_count > conntrack_max * 9 / 10 )); then
      echo "Conntrack table usage over 90%: ${conntrack_usage_msg}"
      exit $NONOK
    else
      echo "Conntrack table usage: ${conntrack_usage_msg}"
      exit $OK
    fi
  network-problem-monitor.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "30s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 3
      },
      "source": "network-plugin-monitor",
      "metricsReporting": true,
      "conditions": [],
      "rules": [
        {
          "type": "temporary",
          "reason": "ConntrackFull",
          "path": "/config/network_problem.sh",
          "timeout": "5s"
        }
      ]
    }
```
Then edit the DaemonSet of node-problem-detector to mount the custom script and rules defined above:
```yaml
...
containers:
  - name: node-problem-detector
    command:
      - /node-problem-detector
      - --logtostderr
      - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json
      - --config.custom-plugin-monitor=/config/network-problem-monitor.json
      - --prometheus-address=0.0.0.0
      - --prometheus-port=20258
      - --k8s-exporter-heartbeat-period=5m0s
...
volumes:
  - name: config
    configMap:
      defaultMode: 0777
      name: node-problem-detector-config
      items:
        - key: kernel-monitor.json
          path: kernel-monitor.json
        - key: docker-monitor.json
          path: docker-monitor.json
        - key: network-problem-monitor.json
          path: network-problem-monitor.json
        - key: network_problem.sh
          path: network_problem.sh
```
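Once the DaemonSet pods restart with the new configuration, problems reported by the custom plugin show up as ordinary events and can be checked with kubectl, for example:

```bash
# Events raised by the custom conntrack check (reason as defined in the rules above)
kubectl get events -A --field-selector reason=ConntrackFull
```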
Grafana analysis panel
Xiaobai has contributed a Loki-based Kubernetes event analysis dashboard to Grafana Labs. We can download the dashboard from the following page:
Grafana.com/grafana/das…
After importing the dashboard into Grafana, we need to modify its log queries and replace {job="kubernetes-event-exporter"} with the label attached by our own exporter pipeline.
After that, we get the following analysis panel.
How about it? Pretty handy, right?
Follow "Cloud Native Xiaobai" to join the Loki learning group.