Why KubeEye was open-sourced

Kubernetes is the de facto standard for container orchestration. Despite its elegant architecture and powerful capabilities, day-to-day operation exposes a number of problems, some of them hidden, that cluster administrators and YAML engineers struggle to handle:

  • Infrastructure daemon problems: for example, the NTP service is interrupted.
  • Hardware problems: for example, the CPU, memory, or disk is abnormal.
  • Kernel problems: kernel deadlock, corrupted file system.
  • Container runtime problems: the runtime daemon is unresponsive.

There are many more such problems. These hidden faults are invisible to the cluster's control plane, so Kubernetes keeps scheduling Pods onto the faulty nodes, putting the security and stability of the cluster and its running applications at serious risk.

What is KubeEye

KubeEye is an open source, automated cluster inspection tool designed to detect problems in Kubernetes clusters, such as application misconfigurations, unhealthy cluster components, and node issues. Written in Go and built on Polaris and Node-Problem-Detector, KubeEye ships with a set of built-in exception detection rules. In addition to the predefined rules, it supports custom rules.

What can KubeEye do

  • Detect problems in the Kubernetes cluster control plane, including kube-apiserver, kube-controller-manager, etcd, and so on.
  • Detect Kubernetes node problems, including memory, CPU, and disk pressure, unexpected kernel error logs, and so on.
  • Validate your workload YAML specifications against industry best practices to help keep your cluster stable.

Architecture diagram

KubeEye obtains cluster diagnostic data by calling the Kubernetes API, matching key error messages in logs against regular expressions, and matching container specifications against rules. See Architecture.
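The log-matching idea can be illustrated with a minimal Python sketch. This is not KubeEye's implementation (KubeEye itself is written in Go). The reasons reuse names from the built-in check list, while the patterns and sample log lines are assumptions made up for illustration.

```python
import re

# Hypothetical rules: reason -> regular expression for a key error message.
# The reasons reuse names from the built-in check list; the patterns are
# illustrative assumptions, not KubeEye's real rules.
RULES = {
    "NodeOOM": re.compile(r"Out of memory: Kill process \d+"),
    "NodeTaskHung": re.compile(r"task \S+ blocked for more than \d+ seconds"),
    "NodeExt4Error": re.compile(r"EXT4-fs error"),
}


def match_log_lines(lines):
    """Return (reason, line) pairs for every line that matches a rule."""
    findings = []
    for line in lines:
        for reason, pattern in RULES.items():
            if pattern.search(line):
                findings.append((reason, line))
    return findings


# Made-up kernel log lines for the demonstration.
sample_log = [
    "kernel: Out of memory: Kill process 16711 (stress) score 205 or sacrifice child",
    "kernel: INFO: task dockerd:1234 blocked for more than 120 seconds.",
    "kernel: eth0: link up",
]

findings = match_log_lines(sample_log)
for reason, line in findings:
    print(f"{reason}: {line}")
```

Each rule is just a named pattern, which is why adding a custom rule later comes down to supplying a new regular expression.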

Built-in check items

| Supported | Check item | Description |
|---|---|---|
| √ | ETCDHealthStatus | Whether etcd is up and running |
| √ | ControllerManagerHealthStatus | Whether kube-controller-manager is up and running |
| √ | SchedulerHealthStatus | Whether kube-scheduler is up and running |
| √ | NodeMemory | Whether node memory usage exceeds the threshold |
| √ | DockerHealthStatus | Whether Docker is working properly |
| √ | NodeDisk | Whether node disk usage exceeds the threshold |
| √ | KubeletHealthStatus | Whether the kubelet is active and running properly |
| √ | NodeCPU | Whether node CPU usage exceeds the threshold |
| √ | NodeCorruptOverlay2 | Overlay2 is unavailable |
| √ | NodeKernelNULLPointer | The node shows NotReady |
| √ | NodeDeadlock | A deadlock: two or more processes wait for each other while competing for resources |
| √ | NodeOOM | Monitors processes that consume too much memory, especially those whose usage grows very quickly; the kernel kills them to keep the node from running out of memory |
| √ | NodeExt4Error | Ext4 mount failed |
| √ | NodeTaskHung | Whether any process has stayed in state D for more than 120s |
| √ | NodeUnregisterNetDevice | Check the corresponding network |
| √ | NodeCorruptDockerImage | Check the Docker image |
| √ | NodeAUFSUmountHung | Check the storage |
| √ | NodeDockerHung | Docker hangs |
| √ | PodSetLivenessProbe | Whether a livenessProbe is set for every container in the Pod |
| √ | PodSetTagNotSpecified | The image address does not declare a tag, or the tag is latest |
| √ | PodSetRunAsPrivileged | Running a Pod in privileged mode means that the Pod can access the host's resources and kernel capabilities |
| √ | PodSetImagePullBackOff | The Pod cannot pull the image; you can pull it manually on the corresponding node |
| √ | PodSetImageRegistry | Whether the image comes from the appropriate registry |
| √ | PodSetCpuLimitsMissing | No CPU resource limit declared |
| √ | PodNoSuchFileOrDirectory | Enter the container to check whether the corresponding file exists |
| √ | PodIOError | Usually caused by file I/O performance bottlenecks |
| √ | PodNoSuchDeviceOrAddress | Check the corresponding network |
| √ | PodInvalidArgument | Check the corresponding storage |
| √ | PodDeviceOrResourceBusy | Check the corresponding directory and PID |
| √ | PodFileExists | Check for existing files |
| √ | PodTooManyOpenFiles | The number of open files or socket connections exceeds the system limit |
| √ | PodNoSpaceLeftOnDevice | Check disk and inode usage |
| √ | NodeApiServerExpiredPeriod | Checks whether the API server certificate expires in less than 30 days |
| √ | PodSetCpuRequestsMissing | No CPU resource request declared |
| √ | PodSetHostIPCSet | hostIPC is set |
| √ | PodSetHostNetworkSet | hostNetwork is set |
| √ | PodHostPIDSet | hostPID is set |
| √ | PodMemoryRequestsMiss | No memory resource request declared |
| √ | PodSetHostPort | hostPort is set |
| √ | PodSetMemoryLimitsMissing | No memory resource limit declared |
| √ | PodNotReadOnlyRootFiles | The root file system is not set to read-only |
| √ | PodSetPullPolicyNotAlways | The image pull policy is not Always |
| √ | PodSetRunAsRootAllowed | Running as the root user is allowed |
| √ | PodDangerousCapabilities | Dangerous options such as ALL, SYS_ADMIN, or NET_ADMIN are set in capabilities |
| √ | PodlivenessProbeMissing | No readinessProbe declared |
| √ | privilegeEscalationAllowed | Privilege escalation is allowed |
|  | NodeNotReadyAndUseOfClosedNetworkConnection | http2-max-streams-per-connection |
|  | NodeNotReady | Cannot start ContainerManager: cannot set property TasksAccounting, or unknown property |

Note: Unmarked items are still under development.
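To make the workload checks more concrete, here is a minimal Python sketch of four of them applied to a single container spec. The check names are taken from KubeEye's report output; the function itself, the dictionary layout (mirroring a Kubernetes container spec), and the exact conditions are assumptions for illustration, not KubeEye's actual Go code.

```python
def check_container(container):
    """Return the names of the best-practice checks this container fails.

    `container` is a dict shaped like one entry of spec.containers.
    """
    failed = []

    # No CPU limit declared under resources.limits.
    limits = (container.get("resources") or {}).get("limits") or {}
    if "cpu" not in limits:
        failed.append("cpuLimitsMissing")

    # No livenessProbe key on the container.
    if "livenessProbe" not in container:
        failed.append("livenessProbeMissing")

    # The tag is whatever follows ":" in the last path segment, so a
    # registry port such as "registry.local:5000/app" is not mistaken for one.
    last_segment = container.get("image", "").rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
    if tag in ("", "latest"):
        failed.append("tagNotSpecified")

    # Privileged mode requested in the security context.
    if (container.get("securityContext") or {}).get("privileged", False):
        failed.append("runAsPrivileged")

    return failed


nginx = {"name": "nginx", "image": "nginx"}
print(check_container(nginx))
```

A bare `nginx` container fails the CPU limit, liveness probe, and tag checks, which matches the kind of findings KubeEye lists for workloads.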

How to use

  • Install KubeEye on the machine

    • Download the pre-built executable from Releases.
    • Or build from source:
    git clone https://github.com/kubesphere/kubeeye.git
    cd kubeeye 
    make install
  • [Optional] Install Node-Problem-Detector (NPD)

Note: This command installs NPD on your cluster. It is only needed if you want detailed reports.
    ke install npd

  • Run KubeEye to inspect the cluster automatically:
root@node1:# ke diag
NODENAME  SEVERITY  HEARTBEATTIME              REASON             MESSAGE
node18    Fatal     2020-11-19T10:32:03+08:00  NodeStatusUnknown  Kubelet stopped posting node status.
node19    Fatal     2020-11-19T10:31:37+08:00  NodeStatusUnknown  Kubelet stopped posting node status.
node2     Fatal     2020-11-19T10:31:14+08:00  NodeStatusUnknown  Kubelet stopped posting node status.
node3     Fatal     2020-11-27T17:36:53+08:00  KubeletNotReady    Container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

NAME       SEVERITY  TIME                       MESSAGE
scheduler  Fatal     2020-11-27T17:09:59+08:00  Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
etcd       Fatal     2020-11-27T17:0...+08:00   Get https://192.168.13.8:2379/health: dial tcp 192.168.13.8:2379: connect: connection refused

NAMESPACE  SEVERITY  PODNAME  EVENTTIME  REASON  MESSAGE
default  Warning  node3.164b53d23ea79fc7  2020-11-27T17:37:34+08:00  ContainerGCFailed  rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
default  Warning  node3.164b553ca5740aae  2020-11-27T18:03:31+08:00  FreeDiskSpaceFailed  failed to garbage collect required amount of images. Wanted to free 5399374233 bytes, but freed 416077545 bytes
default  Warning  nginx-b8ffcf679-q4n9v.16491643e6b68cd7  2020-11-27T17:09:24+08:00  Failed  Error: ImagePullBackOff
default  Warning  node3.164b5861e041a60e  2020-11-27T19:01:09+08:00  SystemOOM  System OOM encountered, victim process: stress, pid: 16713
default  Warning  node3.164b58660f8d4590  2020-11-27T19:01:27+08:00  OOMKilling  Out of memory: Kill process 16711 (stress) score 205 or sacrifice child Killed process 16711 (stress), UID 0, total-vm:826516kB, anon-rss:819296kB, file-rss:0kB, shmem-rss:0kB
insights-agent  Warning  workloads-1606467120.164b519ca8c67416  2020-11-27T16:57:05+08:00  DeadlineExceeded  Job was active longer than specified deadline
kube-system  Warning  calico-node-zvl9t.164b3dc50580845d  2020-11-27T17:09:35+08:00  DNSConfigForming  Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 100.64.11.3 114.114.114.114 119.29.29.29
kube-system  Warning  kube-proxy-4bnn7.164b3dc4f4c4125d  2020-11-27T17:09:09+08:00  DNSConfigForming  Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 100.64.11.3 114.114.114.114 119.29.29.29
kube-system  Warning  nodelocaldns-2zbhh.164b3dc4f42d358b  2020-11-27T17:09:14+08:00  DNSConfigForming  Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 100.64.11.3 114.114.114.114 119.29.29.29

NAMESPACE       SEVERITY  NAME                     KIND        TIME                       MESSAGE
kube-system     Warning   node-problem-detector    DaemonSet   2020-11-27T17:09:59+08:00  [livenessProbeMissing runAsPrivileged]
kube-system     Warning   calico-node              DaemonSet   2020-11-27T17:09:59+08:00  [runAsPrivileged cpuLimitsMissing]
kube-system     Warning   nodelocaldns             DaemonSet   2020-11-27T17:09:59+08:00  [cpuLimitsMissing runAsPrivileged]
default         Warning   nginx                    Deployment  2020-11-27T17:09:59+08:00  [cpuLimitsMissing livenessProbeMissing tagNotSpecified]
insights-agent  Warning   workloads                CronJob     2020-11-27T17:09:59+08:00  [livenessProbeMissing]
insights-agent  Warning   cronjob-executor         Job         2020-11-27T17:09:59+08:00  [livenessProbeMissing]
kube-system     Warning   calico-kube-controllers  Deployment  2020-11-27T17:09:59+08:00  [cpuLimitsMissing livenessProbeMissing]
kube-system     Warning   coredns                  Deployment  2020-11-27T17:09:59+08:00  [cpuLimitsMissing]

Refer to the FAQ to optimize your cluster.
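When a report gets long, it can help to group findings before working through them. The following Python sketch parses the node section of a report and counts findings per reason. It assumes the output is whitespace-separated with a header row and that only the last column contains spaces, as in the sample output; `parse_rows` is a hypothetical helper, not part of KubeEye.

```python
from collections import Counter

NODE_COLUMNS = ["NODENAME", "SEVERITY", "HEARTBEATTIME", "REASON", "MESSAGE"]


def parse_rows(text, columns):
    """Split a whitespace-separated report section into dicts.

    The last column (MESSAGE) may contain spaces, so only the first
    len(columns) - 1 fields are split off.
    """
    rows = []
    lines = [line for line in text.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, len(columns) - 1)
        fields += [""] * (len(columns) - len(fields))  # pad a missing MESSAGE
        rows.append(dict(zip(columns, fields)))
    return rows


# Rows copied from the sample output; the node3 message is shortened.
report = """\
NODENAME  SEVERITY  HEARTBEATTIME              REASON             MESSAGE
node18    Fatal     2020-11-19T10:32:03+08:00  NodeStatusUnknown  Kubelet stopped posting node status.
node19    Fatal     2020-11-19T10:31:37+08:00  NodeStatusUnknown  Kubelet stopped posting node status.
node3     Fatal     2020-11-27T17:36:53+08:00  KubeletNotReady    Container runtime not ready
"""

rows = parse_rows(report, NODE_COLUMNS)
by_reason = Counter(row["REASON"] for row in rows)
print(by_reason.most_common())
```

Grouping by reason makes it easier to look up one FAQ entry per problem type instead of one per node.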

Add a custom check rule

In addition to the pre-defined inspection items and rules mentioned above, KubeEye also supports custom inspection rules. Here is an example:

Add a custom NPD check rule

  • Install NPD:
ke install npd
  • Edit the ConfigMap kube-system/node-problem-detector-config:
kubectl edit cm -n kube-system node-problem-detector-config
  • Add the abnormal log patterns you want to detect under the rules in the ConfigMap. The patterns are regular expressions.
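Before editing the ConfigMap, it can be worth trying a candidate pattern against a few sample log lines. The sketch below assumes the rules are entries with type, reason, and pattern fields, which is the layout used by NPD's kernel monitor configuration. The reason name, the pattern, and the sample lines are hypothetical. Python's `re` is also not the regular expression engine NPD uses (NPD is written in Go) and NPD may anchor patterns differently, so treat this as a rough pre-check only.

```python
import json
import re

# Hypothetical rule entry. The type/reason/pattern field names follow the
# layout of NPD's kernel monitor rules; the values are made up.
rule = {
    "type": "temporary",
    "reason": "NodeNfsServerNotResponding",
    "pattern": r"nfs: server \S+ not responding.*",
}


def pattern_hits(pattern, lines):
    """Return the lines that the regular expression matches."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]


# Made-up log lines: one that should match and one that should not.
samples = [
    "nfs: server 10.0.0.5 not responding, still trying",
    "nfs: server 10.0.0.5 OK",
]

hits = pattern_hits(rule["pattern"], samples)
print(hits)

# Print the entry as JSON so it can be reviewed before it is added to the rules.
print(json.dumps(rule, indent=2))
```

Checking both a matching and a non-matching line helps catch patterns that are too broad before they start generating noise in reports.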

Customize best practice rules

  • Prepare a rule YAML file. For example, the following rule validates your Pod specification to ensure that images only come from an authorized registry.
checks:
  imageFromUnauthorizedRegistry: warning

customChecks:
  imageFromUnauthorizedRegistry:
    promptMessage: When the corresponding rule does not match. Show that image from an unauthorized registry.
    category: Images
    target: Container
    schema:
      '$schema': http://json-schema.org/draft-07/schema
      type: object
      properties:
        image:
          type: string
          not:
            pattern: ^quay.io
  • Save the above rule as a YAML file, for example rule.yaml.

  • Run KubeEye with rule.yaml.

root:# ke diag -f rule.yaml --kubeconfig ~/.kube/config
NAMESPACE    SEVERITY  NAME                     KIND        TIME                       MESSAGE
default      Warning   nginx                    Deployment  2020-11-27T17:18:31+08:00  [imageFromUnauthorizedRegistry]
kube-system  Warning   node-problem-detector    DaemonSet   2020-11-27T17:18:31+08:00  [livenessProbeMissing runAsPrivileged]
kube-system  Warning   calico-node              DaemonSet   2020-11-27T17:18:31+08:00  [cpuLimitsMissing runAsPrivileged]
kube-system  Warning   calico-kube-controllers  Deployment  2020-11-27T17:18:31+08:00  [cpuLimitsMissing livenessProbeMissing]
kube-system  Warning   nodelocaldns             DaemonSet   2020-11-27T17:18:31+08:00  [runAsPrivileged cpuLimitsMissing]
default      Warning   nginx                    Deployment  2020-11-27T17:18:31+08:00  [livenessProbeMissing cpuLimitsMissing]
kube-system  Warning   coredns                  Deployment  2020-11-27T17:18:31+08:00  [cpuLimitsMissing]
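The schema part of the custom rule uses standard JSON Schema keywords. The Python sketch below shows how `not` and `pattern` combine, using the standard `re` module in place of a JSON Schema validator (an assumption; regular expression dialects differ slightly). It only evaluates the schema fragment. How KubeEye maps the validation result to a warning is outside this sketch, and the image names are made up.

```python
import re


def passes_not_pattern(value, pattern):
    """Evaluate the JSON Schema fragment {"not": {"pattern": pattern}}.

    JSON Schema's `pattern` is an unanchored search, so re.search is the
    closest standard-library equivalent. `not` inverts the result.
    """
    return re.search(pattern, value) is None


images = ["quay.io/coreos/etcd:v3.4.13", "nginx", "quayXio/app:1.0"]

# Pattern as written in the rule: the unescaped "." matches any character.
for image in images:
    print(image, passes_not_pattern(image, r"^quay.io"))

# Stricter variant: escape the dot and require the "/" after the host name.
for image in images:
    print(image, passes_not_pattern(image, r"^quay\.io/"))
```

With the unescaped pattern, a host such as `quayXio` is treated the same as `quay.io`, so escaping the dot is worth considering when writing registry rules.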

Roadmap

  • Support fine-grained inspection items, for example, slow cluster response
  • Generate cluster inspection reports based on inspection results
  • Export cluster inspection reports to CSV or HTML files

What other features would you like KubeEye to offer? Please submit your suggestions or requests on GitHub.

GitHub: github.com/kubesphere/…

Reference links

KubeEye Release: github.com/kubesphere/…

KubeEye FAQ documentation: github.com/kubesphere/…

Node-Problem-Detector: github.com/kubernetes/…

About KubeSphere

KubeSphere is a container platform for hybrid cloud built on top of Kubernetes. It provides full-stack IT automation capabilities and simplifies DevOps workflows for enterprises.

KubeSphere has been adopted by thousands of enterprises at home and abroad, such as Aqara Smart Home, Benlai Life, Sina, PICC Life Insurance, Huaxia Bank, SPD Silicon Valley Bank, Sichuan Airlines, Sinopharm Group, WeBank, Zijin Insurance, Radore, and ZaloPay. KubeSphere provides an operations-friendly, wizard-style interface and rich enterprise-grade features, including multi-cloud and multi-cluster management, Kubernetes resource management, DevOps (CI/CD), application lifecycle management, service mesh, multi-tenant management, monitoring and logging, alerting and notification, storage and network management, and GPU support, helping enterprises quickly build a powerful, feature-rich container cloud platform.

KubeSphere GitHub: github.com/kubesphere/…