WeChat public account: Operation and Development Story; author: Teacher Xia

Kubernetes generates millions of new metrics every day. One of the most challenging aspects of monitoring cluster health is filtering out which metrics are important to collect and act on. In this article, I'll define 16 key Kubernetes metrics and conditions that should be monitored and alerted on. The exact list will vary slightly from organization to organization, but these 16 are the best indicators of cluster state to start from when formulating an organization's Kubernetes monitoring policy.

1. Crash Loops

A crash loop occurs when a Pod starts, crashes, and then keeps trying to restart, only to crash again in a loop. While this is happening, the application does not run.

  • It can be caused by the application inside the Pod crashing.
  • It can be caused by a misconfiguration of the Pod or of its deployment.
  • When a crash loop occurs, you need to look at the logs to diagnose the problem.
  • You can use the open source component kube-eventer to push events.
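
If you collect metrics with Prometheus and kube-state-metrics (an assumption, not something this article prescribes), a minimal sketch of a crash-loop alert might look like this; the alert name and thresholds are illustrative:

```yaml
# Hypothetical Prometheus alert rule, assuming kube-state-metrics is scraped.
- alert: PodCrashLooping
  # Fires when a container has restarted more than 3 times in the last 15 minutes.
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```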

2. CPU Utilization

CPU utilization is how much of the node's CPU is in use. Monitoring it is important for two reasons:

  • You don't want your application to exhaust the CPU allocated to it. If your application is CPU-constrained, you need to increase its CPU allocation or increase the number of Pods; eventually you may need to add nodes.
  • You also don't want CPU sitting idle. If server CPU usage is consistently low, you may have overallocated resources and be wasting money.
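
As a sketch, assuming node_exporter metrics are available in Prometheus, node CPU utilization can be derived from idle CPU time; the 80% threshold below is illustrative:

```yaml
- alert: NodeHighCpuUtilization
  # Average CPU utilization per node over 5 minutes, derived from idle time.
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} CPU utilization is above 80%"
```

A mirror-image rule with a low threshold (for example `< 20`) can flag the persistently idle nodes described in the second bullet.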

3. Disk Pressure

Disk pressure is a node condition indicating that the node is using disk space too quickly or has used too much of it, relative to the thresholds set in the Kubernetes configuration.

  • If the application legitimately needs more space, this may mean adding more disk capacity.
  • It may also mean the application is misbehaving and filling the disk prematurely in unexpected ways.
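
A node's DiskPressure condition is exposed by kube-state-metrics (assumed here), so a sketch of an alert might be:

```yaml
- alert: NodeDiskPressure
  # kube_node_status_condition is 1 when the node reports the given condition/status.
  expr: kube_node_status_condition{condition="DiskPressure", status="true"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} is reporting DiskPressure"
```

The same pattern works for the other node conditions below (MemoryPressure, PIDPressure, NetworkUnavailable) by changing the condition label.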

4. Memory Pressure

Memory pressure is another node condition, indicating that the node is running out of memory.

  • Be aware of this situation because it can mean a memory leak in your application.
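
Besides alerting on the MemoryPressure condition itself (same pattern as the DiskPressure rule above), a node_exporter-based sketch can warn before the node gets into trouble; the 10% threshold is illustrative:

```yaml
- alert: NodeLowMemory
  # Fires when less than 10% of a node's memory is still available.
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} has less than 10% memory available"
```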

5. PID Pressure

PID pressure is a rare condition in which a Pod or container spawns so many processes that the node runs short of available process IDs.

  • Each node has a limited number of process IDs to hand out to running processes;
  • if the node runs out of IDs, no further processes can be started.
  • Kubernetes lets you set PID limits on Pods to contain runaway process creation; a PID pressure condition means that one or more Pods are exhausting their allocated PIDs and need to be examined.
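
A minimal sketch, again assuming kube-state-metrics, that surfaces the PIDPressure condition:

```yaml
# Same pattern as the DiskPressure rule above.
- alert: NodePIDPressure
  expr: kube_node_status_condition{condition="PIDPressure", status="true"} == 1
  for: 5m
```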

6. Network Unavailable

All nodes need a network connection. The NetworkUnavailable condition indicates that a node's network connection is not working.

  • Either the network is set up incorrectly (for example, because of route exhaustion or misconfiguration), or there is a physical problem with the hardware's network connection.
  • You can use the open source component KubeNurse for cluster network monitoring.
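
The NetworkUnavailable condition follows the same pattern; a sketch assuming kube-state-metrics:

```yaml
- alert: NodeNetworkUnavailable
  expr: kube_node_status_condition{condition="NetworkUnavailable", status="true"} == 1
  for: 2m
  labels:
    severity: critical
```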

7. Job Failures

A Job is designed to run a Pod for a finite amount of time and release it once the expected work is complete.

  • If a Job fails to complete successfully because a node crashed or rebooted, or because resources were exhausted, you need to know that the Job failed.
  • A failed Job does not usually mean your application is unreachable, but if it is left unfixed it can lead to problems later.
  • You can use the open source component kube-eventer to push events.
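
A sketch of a failed-Job alert, assuming kube-state-metrics is scraped:

```yaml
- alert: JobFailed
  # kube_job_status_failed counts failed Pods per Job; any non-zero value means a failure.
  expr: kube_job_status_failed > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed"
```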

8. Persistent Volume Failures

A PersistentVolume is a storage resource provisioned in the cluster that can be used as persistent storage by any Pod that claims it. Over its lifecycle, it is bound to a Pod and then reclaimed when that Pod no longer needs it.

  • If this process fails for any reason, you need to know that there is a problem with persistent storage.
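
kube-state-metrics (assumed here) exposes the phase of each PersistentVolume, so a sketch might alert on volumes stuck in the Failed phase:

```yaml
- alert: PersistentVolumeFailed
  # 1 when the PersistentVolume is in the given phase.
  expr: kube_persistentvolume_status_phase{phase="Failed"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "PersistentVolume {{ $labels.persistentvolume }} is in Failed phase"
```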

9. Pod Pending Delays

During the lifetime of a Pod, while it is waiting to be scheduled onto a node, its state is Pending. If it stays in the Pending state, it usually means there are not enough resources to schedule and run the Pod.

  • You will need to update CPU and memory allocations, remove pods, or add more nodes to the cluster.
  • You can use the open source component kube-eventer to push events.
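
A sketch that flags Pods stuck in Pending, assuming kube-state-metrics; the 15-minute window is illustrative:

```yaml
- alert: PodStuckPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for more than 15 minutes"
```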

10. Deployment Glitches

A Deployment is used to manage stateless applications, where Pods are interchangeable and clients do not need to reach any particular individual Pod, only a Pod of a given type.

  • You need to pay close attention to Deployments to ensure they roll out correctly. The best approach is to check that the observed state of each Deployment matches its desired state, for example that the number of available replicas equals the number of desired replicas; if they do not match, one or more Deployments are failing, as sketched below.
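
A sketch of that desired-versus-observed check, assuming kube-state-metrics:

```yaml
- alert: DeploymentReplicasMismatch
  # Fires when the number of available replicas differs from the desired replica count.
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has a replica mismatch"
```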

11. StatefulSets Not Ready

StatefulSets are used to manage stateful applications, where Pods have specific roles and need to reach other specific Pods, rather than just any Pod of a given type as with a Deployment.

  • Ensure that the observed state of each StatefulSet matches its desired state, i.e. that the number of ready replicas equals the number of desired replicas. If they do not match, one or more StatefulSets are failing.
  • You can use the open source component kube-eventer to push events.
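
The equivalent check for StatefulSets, again assuming kube-state-metrics:

```yaml
- alert: StatefulSetReplicasMismatch
  expr: kube_statefulset_status_replicas_ready != kube_statefulset_replicas
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has ready replicas != desired"
```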

12. DaemonSets Not Ready

DaemonSets are used to manage services or applications that need to run on every node in the cluster, such as a log-collection daemon (for example Filebeat) or a monitoring agent on each node.

  • Ensure that the observed number of ready DaemonSet Pods matches the desired number; if they do not match, one or more DaemonSets are failing, as sketched below.
  • You can use the open source component kube-eventer to push events.
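
And for DaemonSets, comparing ready Pods with the number that should be scheduled (kube-state-metrics assumed):

```yaml
- alert: DaemonSetNotReady
  expr: kube_daemonset_status_number_ready != kube_daemonset_status_desired_number_scheduled
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has unready Pods"
```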

13. etcd Leaders

An etcd cluster should always have a leader (except briefly while the leadership is changing, which should happen rarely).

  • etcd_server_has_leader reports whether the etcd member currently has a leader.
  • etcd_server_leader_changes_seen_total counts leader changes; a rising number may indicate connectivity or resource problems in the etcd cluster.
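
A sketch of alerts on these two etcd metrics, assuming Prometheus scrapes the etcd members; the change-rate threshold is illustrative:

```yaml
- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "etcd member {{ $labels.instance }} has no leader"

- alert: EtcdFrequentLeaderChanges
  # More than 3 leader changes in an hour suggests connectivity or resource problems.
  expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
  labels:
    severity: warning
  annotations:
    summary: "etcd has had frequent leader changes in the last hour"
```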

14. Scheduler Problems

There are two areas of concern regarding the scheduler:

  • First, watch scheduler_schedule_attempts_total{result="unschedulable"}: an increase in unschedulable Pods can mean that your cluster has a resource problem.
  • Second, keep an eye on scheduling latency using the scheduler's duration metrics. Increased latency in Pod scheduling can cause additional problems, and can also indicate resource issues in the cluster.
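
A sketch of an alert on the unschedulable-Pod counter mentioned above (the scheduler's latency histograms vary by Kubernetes version, so they are omitted here):

```yaml
- alert: PodsUnschedulable
  # A sustained rate of unschedulable scheduling attempts points to a resource problem.
  expr: sum(rate(scheduler_schedule_attempts_total{result="unschedulable"}[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "The scheduler is failing to place one or more Pods"
```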

15. Events

In addition to collecting numerical metrics from a Kubernetes cluster, it is also useful to collect and track its events. Cluster events let you follow the Pod lifecycle and catch major Pod failures, and the rate of events flowing out of the cluster can be a good early-warning indicator: a sudden or significant change in the event rate may signal a problem.

  • You can use the open source component kube-eventer to push events.

16. Application Metrics

Unlike the other metrics and events examined above, application metrics are not emitted by Kubernetes itself but by the workloads running in the cluster. From the application's perspective, this telemetry can be anything significant: error responses, request latency, processing times, and so on. There are two philosophies of how to collect application metrics.

  • The first, which has been widely adopted until recently, is that metrics should be “pushed” from the application to the collection endpoint.
  • The second philosophy, which is increasingly being adopted, is that metrics should be "pulled" (scraped) from applications by a collection agent. This makes applications easier to write: all they have to do is expose their metrics appropriately, and they do not need to worry about how those metrics are extracted or shipped. This is how OpenMetrics works and how Kubernetes cluster metrics are collected. Combined with service discovery by the collection agent, this technique is a powerful way to gather whatever metrics you need from applications in the cluster.
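
As an illustration of the pull model, a Pod can expose an OpenMetrics/Prometheus endpoint and be discovered by the collection agent. The annotations below follow a common convention, but they only take effect if your Prometheus scrape configuration honors them; the Pod name, image, and port are all hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                     # hypothetical application Pod
  annotations:
    prometheus.io/scrape: "true"     # conventional hint for annotation-based scrape configs
    prometheus.io/port: "8080"       # port where the app serves its metrics
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: demo-app
      image: example.com/demo-app:latest   # hypothetical image that exposes /metrics
      ports:
        - containerPort: 8080
```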

Conclusion:

Like most aspects of Kubernetes, monitoring cluster health can be a complex and challenging process, and it is easy to get overwhelmed. By understanding the high-value metrics that deserve the most attention, you can at least begin to develop a strategy that filters out the large amount of noise your clusters generate and be more confident that the most important issues will be caught and addressed, ensuring a good experience.