Why open source KubeEye
Kubernetes is the de facto standard for container orchestration. Despite its elegant architecture and powerful capabilities, day-to-day operation of a Kubernetes cluster exposes a number of problems, some of them hidden, that are hard for cluster administrators and YAML engineers to deal with:
- Infrastructure daemon problems: for example, the NTP service is interrupted;
- Hardware problems: for example, the CPU, memory, or disk is faulty;
- Kernel problems: kernel deadlock, corrupted file system;
- Container runtime problems: the runtime daemon is unresponsive;
- …
There are many more such problems. Because these hidden faults are invisible to the cluster's control plane, Kubernetes keeps scheduling Pods to the faulty nodes, which puts the security and stability of the cluster and its running applications at risk.
What is KubeEye
KubeEye is an open source, automated cluster inspection tool designed to detect problems in Kubernetes clusters, such as application misconfigurations, unhealthy cluster components, and node problems. It is written in Go, builds on Polaris and Node-Problem-Detector, and ships with a set of built-in exception detection rules. In addition to the predefined rules, it supports custom rules.
What can KubeEye do
- Detect problems in the Kubernetes cluster control plane, including kube-apiserver, kube-controller-manager, etcd, etc.
- Detect Kubernetes node problems, including memory, CPU, and disk pressure, unexpected kernel error logs, etc.
- Validate the YAML specifications of your workloads against industry best practices to help you keep your cluster stable.
Architecture diagram
KubeEye obtains cluster diagnostic data in three ways: by calling the Kubernetes API, by matching key error messages in logs against regular expressions, and by matching container specifications against rules. See Architecture.
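To make the API-based part of this concrete, here is a minimal sketch of how node health could be derived from Node objects. It is not KubeEye's code. It assumes input shaped like the output of `kubectl get nodes -o json`, and the function name `unhealthy_nodes` and the sample data are made up for illustration.

```python
from typing import Dict, List


def unhealthy_nodes(node_list: Dict) -> List[Dict[str, str]]:
    """Return one finding per node whose Ready condition is not "True"."""
    findings = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            # A node is healthy only if its Ready condition has status "True";
            # "False" and "Unknown" both indicate a problem.
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                findings.append({
                    "node": name,
                    "heartbeat": cond.get("lastHeartbeatTime", ""),
                    "reason": cond.get("reason", ""),
                    "message": cond.get("message", ""),
                })
    return findings


# Hand-written sample (an assumption), shaped like `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "node1"},
         "status": {"conditions": [
             {"type": "Ready", "status": "True", "reason": "KubeletReady",
              "message": "kubelet is posting ready status"}]}},
        {"metadata": {"name": "node18"},
         "status": {"conditions": [
             {"type": "Ready", "status": "Unknown",
              "lastHeartbeatTime": "2020-11-19T10:32:03+08:00",
              "reason": "NodeStatusUnknown",
              "message": "Kubelet stopped posting node status."}]}},
    ]
}

for finding in unhealthy_nodes(sample):
    print(finding["node"], finding["reason"], finding["message"])
```

The fields it reads (`type`, `status`, `reason`, `message`, `lastHeartbeatTime`) are standard parts of a Node's `status.conditions`.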
Built-in check items
| Supported | Check item | Description |
|---|---|---|
| ✓ | ETCDHealthStatus | Whether etcd is up and running |
| ✓ | ControllerManagerHealthStatus | Whether kube-controller-manager is up and running |
| ✓ | SchedulerHealthStatus | Whether kube-scheduler is up and running |
| ✓ | NodeMemory | Whether the node memory usage exceeds the threshold |
| ✓ | DockerHealthStatus | Whether Docker is working properly |
| ✓ | NodeDisk | Whether the node disk usage exceeds the threshold |
| ✓ | KubeletHealthStatus | Whether kubelet is active and running properly |
| ✓ | NodeCPU | Whether the node CPU usage exceeds the threshold |
| ✓ | NodeCorruptOverlay2 | Overlay2 is unavailable |
| ✓ | NodeKernelNULLPointer | The node shows NotReady |
| ✓ | NodeDeadlock | A deadlock is a situation in which two or more processes wait for each other while competing for resources |
| ✓ | NodeOOM | Monitors processes that consume too much memory, especially those whose memory usage grows very fast; the kernel kills them to keep the node from running out of memory |
| ✓ | NodeExt4Error | Failed to mount Ext4 |
| ✓ | NodeTaskHung | Checks whether any process has been in state D for more than 120s |
| ✓ | NodeUnregisterNetDevice | Check the corresponding network |
| ✓ | NodeCorruptDockerImage | Check the Docker image |
| ✓ | NodeAUFSUmountHung | Check the storage |
| ✓ | NodeDockerHung | Docker hangs |
| ✓ | PodSetLivenessProbe | Whether a livenessProbe is set for each container in the Pod |
| ✓ | PodSetTagNotSpecified | The image address does not declare a tag, or the tag is latest |
| ✓ | PodSetRunAsPrivileged | Running a Pod in privileged mode means that the Pod can access the host’s resources and kernel capabilities |
| ✓ | PodSetImagePullBackOff | The Pod cannot pull the image correctly; you can pull the image manually on the corresponding node |
| ✓ | PodSetImageRegistry | Checks whether the image comes from the appropriate registry |
| ✓ | PodSetCpuLimitsMissing | No CPU resource limit declared |
| ✓ | PodNoSuchFileOrDirectory | Enter the container to check whether the corresponding file exists |
| ✓ | PodIOError | Usually caused by file I/O performance bottlenecks |
| ✓ | PodNoSuchDeviceOrAddress | Check the corresponding network |
| ✓ | PodInvalidArgument | Check the corresponding storage |
| ✓ | PodDeviceOrResourceBusy | Check the corresponding directory and PID |
| ✓ | PodFileExists | Check for existing files |
| ✓ | PodTooManyOpenFiles | The number of open files or socket connections exceeds the system setting |
| ✓ | PodNoSpaceLeftOnDevice | Check the usage of disks and inodes |
| ✓ | NodeApiServerExpiredPeriod | Checks whether the ApiServer certificate expires in less than 30 days |
| ✓ | PodSetCpuRequestsMissing | No CPU resource request declared |
| ✓ | PodSetHostIPCSet | hostIPC is set |
| ✓ | PodSetHostNetworkSet | hostNetwork is set |
| ✓ | PodHostPIDSet | hostPID is set |
| ✓ | PodMemoryRequestsMiss | No memory resource request declared |
| ✓ | PodSetHostPort | hostPort is set |
| ✓ | PodSetMemoryLimitsMissing | No memory resource limit declared |
| ✓ | PodNotReadOnlyRootFiles | The root file system is not set to read-only |
| ✓ | PodSetPullPolicyNotAlways | The image pull policy is not Always |
| ✓ | PodSetRunAsRootAllowed | Runs as the root user |
| ✓ | PodDangerousCapabilities | Dangerous options such as ALL, SYS_ADMIN, or NET_ADMIN are set in capabilities |
| ✓ | PodlivenessProbeMissing | No readinessProbe declared |
| ✓ | privilegeEscalationAllowed | Privilege escalation is allowed |
| | NodeNotReadyAndUseOfClosedNetworkConnection | http2-max-streams-per-connection |
| | NodeNotReady | Cannot start ContainerManager; Cannot set property TasksAccounting, or unknown property |
Note: Unmarked items are under development.
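To give a feel for what the Pod-level items in the table look at, the following sketch applies a few of them to a single container spec represented as a plain dictionary. It is an illustration only, not KubeEye's implementation. The function names and sample containers are hypothetical, and the result labels reuse the names that appear in the `ke diag` output (`tagNotSpecified`, `cpuLimitsMissing`, `livenessProbeMissing`, `runAsPrivileged`).

```python
from typing import Dict, List


def tag_not_specified(image: str) -> bool:
    """True if the image reference has no tag or uses the `latest` tag."""
    if "@" in image:  # pinned by digest, e.g. nginx@sha256:...
        return False
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry host:port
    _, sep, tag = last_segment.partition(":")
    return sep == "" or tag in ("", "latest")


def check_container(container: Dict) -> List[str]:
    """Return the names of the illustrative checks that this container fails."""
    failed = []
    if tag_not_specified(container.get("image", "")):
        failed.append("tagNotSpecified")
    limits = (container.get("resources") or {}).get("limits") or {}
    if "cpu" not in limits:
        failed.append("cpuLimitsMissing")
    if "livenessProbe" not in container:
        failed.append("livenessProbeMissing")
    if (container.get("securityContext") or {}).get("privileged") is True:
        failed.append("runAsPrivileged")
    return failed


# Hypothetical container specs, written as they would appear under
# spec.containers in a Pod template.
bare = {"name": "nginx", "image": "nginx"}
hardened = {
    "name": "web",
    "image": "registry.example.com:5000/web:1.4.2",
    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
}

print(check_container(bare))
print(check_container(hardened))
```

The bare `nginx` container fails the tag, CPU limit, and liveness probe checks, while the second container passes all four.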
How to use
- Install KubeEye on the machine
  - Download the pre-built executable from Releases.
  - Or build it from source:

        git clone https://github.com/kubesphere/kubeeye.git
        cd kubeeye
        make install

- [Optional] Install the Node-problem-detector

  Note: This command installs NPD on your cluster. It is only needed if you want detailed reports.

        ke install npd

- Run KubeEye to inspect the cluster automatically:

        root@node1:# ke diag
        NODENAME   SEVERITY   HEARTBEATTIME               REASON              MESSAGE
        node18     Fatal      2020-11-19T10:32:03+08:00   NodeStatusUnknown   Kubelet stopped posting node status.
        node19     Fatal      2020-11-19T10:31:37+08:00   NodeStatusUnknown   Kubelet stopped posting node status.
        node2      Fatal      2020-11-19T10:31:14+08:00   NodeStatusUnknown   Kubelet stopped posting node status.
        node3      Fatal      2020-11-27T17:36:53+08:00   KubeletNotReady     Container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

        NAME        SEVERITY   TIME                        MESSAGE
        scheduler   Fatal      2020-11-27T17:09:59+08:00   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
        etcd        Fatal      2020-11-27T17:0...+08:00    Get https://192.168.13.8:2379/health: dial tcp 192.168.13.8:2379: connect: connection refused

        NAMESPACE        SEVERITY   PODNAME                                  EVENTTIME                   REASON                MESSAGE
        default          Warning    node3.164b53d23ea79fc7                   2020-11-27T17:37:34+08:00   ContainerGCFailed     rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
        default          Warning    node3.164b553ca5740aae                   2020-11-27T18:03:31+08:00   FreeDiskSpaceFailed   failed to garbage collect required amount of images. Wanted to free 5399374233 bytes, but freed 416077545 bytes
        default          Warning    nginx-b8ffcf679-q4n9v.16491643e6b68cd7   2020-11-27T17:09:24+08:00   Failed                Error: ImagePullBackOff
        default          Warning    node3.164b5861e041a60e                   2020-11-27T19:01:09+08:00   SystemOOM             System OOM encountered, victim process: stress, pid: 16713
        default          Warning    node3.164b58660f8d4590                   2020-11-27T19:01:27+08:00   OOMKilling            Out of memory: Kill process 16711 (stress) score 205 or sacrifice child Killed process 16711 (stress), UID 0, total-vm:826516kB, anon-rss:819296kB, file-rss:0kB, shmem-rss:0kB
        insights-agent   Warning    workloads-1606467120.164b519ca8c67416    2020-11-27T16:57:05+08:00   DeadlineExceeded      Job was active longer than specified deadline
        kube-system      Warning    calico-node-zvl9t.164b3dc50580845d       2020-11-27T17:09:35+08:00   DNSConfigForming      Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 100.64.11.3 114.114.114.114 119.29.29.29
        kube-system      Warning    kube-proxy-4bnn7.164b3dc4f4c4125d        2020-11-27T17:09:09+08:00   DNSConfigForming      Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 100.64.11.3 114.114.114.114 119.29.29.29
        kube-system      Warning    nodelocaldns-2zbhh.164b3dc4f42d358b      2020-11-27T17:09:14+08:00   DNSConfigForming      Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 100.64.11.3 114.114.114.114 119.29.29.29

        NAMESPACE        SEVERITY   NAME                      KIND         TIME                        MESSAGE
        kube-system      Warning    node-problem-detector     DaemonSet    2020-11-27T17:09:59+08:00   [livenessProbeMissing runAsPrivileged]
        kube-system      Warning    calico-node               DaemonSet    2020-11-27T17:09:59+08:00   [runAsPrivileged cpuLimitsMissing]
        kube-system      Warning    nodelocaldns              DaemonSet    2020-11-27T17:09:59+08:00   [cpuLimitsMissing runAsPrivileged]
        default          Warning    nginx                     Deployment   2020-11-27T17:09:59+08:00   [cpuLimitsMissing livenessProbeMissing tagNotSpecified]
        insights-agent   Warning    workloads                 CronJob      2020-11-27T17:09:59+08:00   [livenessProbeMissing]
        insights-agent   Warning    cronjob-executor          Job          2020-11-27T17:09:59+08:00   [livenessProbeMissing]
        kube-system      Warning    calico-kube-controllers   Deployment   2020-11-27T17:09:59+08:00   [cpuLimitsMissing livenessProbeMissing]
        kube-system      Warning    coredns                   Deployment   2020-11-27T17:09:59+08:00   [cpuLimitsMissing]
Refer to the FAQ to optimize your cluster.
Add a custom check rule
In addition to the pre-defined inspection items and rules mentioned above, KubeEye also supports custom inspection rules. Here is an example:
Add a custom NPD check rule
- Install NPD:

      ke install npd

- Edit the ConfigMap kube-system/node-problem-detector-config:

      kubectl edit cm -n kube-system node-problem-detector-config

- Add abnormal log patterns under the rules of the ConfigMap. The patterns are regular expressions.
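Before pasting a new pattern into the ConfigMap, it can help to try it against a sample log line. The sketch below does that in Python. It assumes the rules are stored as JSON entries with `type`, `reason`, and `pattern` fields, as in the upstream node-problem-detector log monitor configs. The rule itself is a hypothetical example, not one shipped by KubeEye. Python's `re` and the Go regular expression engine used by NPD differ in some syntax, and NPD applies patterns in its own way, so treat this as a sanity check only.

```python
import json
import re
from typing import Dict, List, Optional

# Hypothetical rule, in the shape assumed for NPD log monitor rules.
rules: List[Dict[str, str]] = [
    {
        "type": "temporary",
        "reason": "OOMKilling",
        "pattern": r"Out of memory: Kill process \d+ \(.+\) score \d+ or sacrifice child",
    },
]


def first_match(rules: List[Dict[str, str]], line: str) -> Optional[str]:
    """Return the reason of the first rule whose pattern is found in the line."""
    for rule in rules:
        if re.search(rule["pattern"], line):
            return rule["reason"]
    return None


sample_line = "Out of memory: Kill process 16711 (stress) score 205 or sacrifice child"
print(first_match(rules, sample_line))

# json.dumps shows the pattern with the doubled backslashes that a JSON
# config file requires.
print(json.dumps(rules[0], indent=2))
```

The second `print` is there because backslashes must be escaped inside JSON strings, which is easy to get wrong when editing the ConfigMap by hand.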
Customize best practice rules
- Prepare a rule YAML file. For example, the following rule validates your Pod specifications to ensure that images come only from an authorized registry.

      checks:
        imageFromUnauthorizedRegistry: warning

      customChecks:
        imageFromUnauthorizedRegistry:
          promptMessage: When the corresponding rule does not match. Show that image from an unauthorized registry.
          category: Images
          target: Container
          # JSON Schema (draft-07) applied to each container specification
          schema:
            '$schema': http://json-schema.org/draft-07/schema
            type: object
            properties:
              image:
                type: string
                not:
                  pattern: ^quay.io

- Save the above rule as a YAML file, for example rule.yaml.

- Run KubeEye with rule.yaml:

      root:# ke diag -f rule.yaml --kubeconfig ~/.kube/config
      NAMESPACE     SEVERITY   NAME                      KIND         TIME                        MESSAGE
      default       Warning    nginx                     Deployment   2020-11-27T17:18:31+08:00   [imageFromUnauthorizedRegistry]
      kube-system   Warning    node-problem-detector     DaemonSet    2020-11-27T17:18:31+08:00   [livenessProbeMissing runAsPrivileged]
      kube-system   Warning    calico-node               DaemonSet    2020-11-27T17:18:31+08:00   [cpuLimitsMissing runAsPrivileged]
      kube-system   Warning    calico-kube-controllers   Deployment   2020-11-27T17:18:31+08:00   [cpuLimitsMissing livenessProbeMissing]
      kube-system   Warning    nodelocaldns              DaemonSet    2020-11-27T17:18:31+08:00   [runAsPrivileged cpuLimitsMissing]
      default       Warning    nginx                     Deployment   2020-11-27T17:18:31+08:00   [livenessProbeMissing cpuLimitsMissing]
      kube-system   Warning    coredns                   Deployment   2020-11-27T17:18:31+08:00   [cpuLimitsMissing]
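When adapting the schema in the rule above, it is worth being clear about what the `not` / `pattern` pair accepts under standard JSON Schema semantics. The sketch below mimics just that part with Python's `re` module, so it needs no JSON Schema library. The function name `validates` and the sample images are made up for illustration. How KubeEye turns the validation outcome into a warning is not spelled out in this article, so the sketch only shows which image strings the schema itself accepts.

```python
import re


def validates(image: object, forbidden_pattern: str = r"^quay.io") -> bool:
    """True if `image` satisfies {type: string, not: {pattern: ...}}."""
    # JSON Schema `pattern` is a search, not a full match; the leading `^`
    # in the pattern is what ties it to the start of the string.
    return isinstance(image, str) and re.search(forbidden_pattern, image) is None


for image in ["nginx", "quay.io/calico/node:v3.16", "quayxio/tool:1.0"]:
    print(image, validates(image))

# The unescaped dot matches any character, so "quayxio/..." is also caught.
# Escaping the dot and adding the slash makes the pattern stricter.
print(validates("quayxio/tool:1.0", r"^quay\.io/"))
```

Under these semantics, strings that start with `quay.io` fail validation and everything else passes, so it is worth confirming against your own cluster that the rule flags the images you intend.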
Roadmap
- Support finer-grained inspection items, such as slow cluster response
- Generate cluster inspection reports based on inspection results
- Export cluster inspection reports to CSV or HTML files
What other features would you like KubeEye to offer? Please submit your suggestions or requests on GitHub.
GitHub: github.com/kubesphere/…
Reference links
KubeEye Release: github.com/kubesphere/…
KubeEye FAQ documentation: github.com/kubesphere/…
Node-Problem-Detector: github.com/kubernetes/…
About KubeSphere
KubeSphere is a container hybrid cloud built on top of Kubernetes to provide full-stack IT automation capabilities and simplify DevOps workflows for enterprises.
KubeSphere has been adopted by thousands of enterprises at home and abroad, such as Aqara Smart Home, Benlai, Sina, PICC Life Insurance, Huaxia Bank, SPD Silicon Valley Bank, Sichuan Airlines, Sinopharm Group, WeBank, Zijin Insurance, Radore, and ZaloPay. KubeSphere provides an operations-friendly, wizard-style interface and rich enterprise-grade functionality, including multi-cloud and multi-cluster management, Kubernetes resource management, DevOps (CI/CD), application lifecycle management, Service Mesh, multi-tenant management, monitoring and logging, alerting and notification, storage and network management, and GPU support, helping enterprises quickly build a powerful, feature-rich container cloud platform.
KubeSphere GitHub: github.com/kubesphere/…