The symptom

After running the Prometheus Operator chart for some time, we decided to enable the alert rules provided by the kubernetes-mixin project and to integrate Alertmanager with Opsgenie. Over the next few days, a KubeCPUOvercommit alert would appear every evening around 6 PM.

The troubleshooting process

Initial speculation

We opened the Grafana dashboard Compute Resources / Cluster and looked at the CPU Requests Commitment panel. At that point, the figure was indeed above 100%.

However, since Cluster Autoscaler is installed in our cluster, CPU resource requests should in theory never outgrow the cluster, and when the alert fired we confirmed that there were no unschedulable Pods.

Was it a false positive? We opened the panel details and extracted the expression:

sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"})
/
sum(kube_node_status_allocatable_cpu_cores{cluster="$cluster"})

We checked the kube_pod_container_resource_requests_cpu_cores metric in the Grafana Explore panel. It carries no label describing the Pod’s status, so we began to wonder: does this metric also include Pods that are already Failed or Completed?
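For reference, this is roughly what we looked at in Explore (a minimal sketch; the exact label set depends on the kube-state-metrics version):

# Per-container CPU requests exposed by kube-state-metrics; the labels identify
# namespace/pod/container, but say nothing about the Pod's phase
kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}

# The phase lives in a separate metric, which is what made us suspicious
kube_pod_status_phase{phase=~"Failed|Succeeded"} == 1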

Verifying the guess

After we manually deleted some Failed and Completed Pods, the alert closed and the figure on the Grafana dashboard dropped back below 100%. We thought we had found the cause of the problem. But when we went to fix the panel’s expression and adjust the alert rule, we found that the alert rule’s expression (source) is:

sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
  /
sum(kube_node_status_allocatable_cpu_cores)
  >
(count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores)

In words: alert when (CPU cores requested by Pods) / (allocatable CPU cores) > (number of nodes - 1) / (number of nodes).

The numerator here is namespace:kube_pod_container_resource_requests_cpu_cores:sum rather than the raw kube_pod_container_resource_requests_cpu_cores metric we had been looking at. So was my guess wrong?

Debugging again

It turns out that namespace:kube_pod_container_resource_requests_cpu_cores:sum is a custom recording rule defined by the Prometheus Operator chart.

Recording rules are mainly used to pre-compute frequently used or computationally expensive expressions and to save the results as a new set of time series. This not only speeds up the evaluation of alert rules but also allows complex expressions to be reused.

level:metric:operations is the officially recommended naming convention for recording rules.
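For example, a recording rule that follows this convention might look like the following in a Prometheus rule file (a simplified sketch; the chart’s real expression is shown below):

groups:
  - name: k8s.rules
    rules:
      # level = namespace, metric = kube_pod_container_resource_requests_cpu_cores, operation = sum
      - record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
        expr: |
          sum by (namespace) (
            kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
          )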

After some searching, we found the expression of this recording rule (source):

sum by (namespace) (
    sum by (namespace, pod) (
        max by (namespace, pod, container) (
            kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
        ) * on(namespace, pod) group_left() max by (namespace, pod) (
            kube_pod_status_phase{phase=~"Pending|Running"} == 1
        )
    )
)

It’s a bit long, so let’s break it down, starting with the inner layer.

max by (namespace, pod, container) (
    kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
)

* on(namespace, pod) group_left()

max by (namespace, pod) (
  kube_pod_status_phase{phase=~"Pending|Running"} == 1
)

on(namespace, pod) group_left() is PromQL’s vector matching syntax. In this example it means: for each sample on the left-hand side of the * operator, look for a right-hand sample with the same namespace and pod labels. If a match is found, the two samples are multiplied into a new time series; if not, the left-hand sample is simply discarded.

In addition, the condition kube_pod_status_phase{phase=~"Pending|Running"} == 1 means the Pod must be in the Pending or Running phase; otherwise its samples are discarded.
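As a minimal illustration with made-up samples, suppose the left-hand side has CPU requests for Pods a and b, but only Pod a is Running:

# left-hand side (requests):
#   {namespace="default", pod="a", container="app"}  0.5
#   {namespace="default", pod="b", container="app"}  0.5
# right-hand side (phase filter, after "== 1"):
#   {namespace="default", pod="a"}  1
#
# left * on(namespace, pod) group_left() right:
#   {namespace="default", pod="a", container="app"}  0.5   (Pod b has no match and is dropped)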

The outer layers of the recording rule namespace:kube_pod_container_resource_requests_cpu_cores:sum then simply aggregate the result, first per Pod and then per namespace:

sum by (namespace) (
    sum by (namespace, pod) (
        ...
    )
)

This verification rejected our conjecture: the time series already excludes Pods that are not in the Pending or Running phase, so their resource requests never make it into the final result.

New speculation and confirmation

After confirming that namespace:kube_pod_container_resource_requests_cpu_cores:sum is fine, we read the alert’s description carefully:

Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.

In other words: the cluster has overcommitted Pod CPU resource requests and cannot tolerate a node failure (node crash, drain, taint-based eviction, etc.).

This explains why the CPU requests ratio is compared with (number of nodes - 1) / (number of nodes): the current workload cannot afford to lose a single node. If one node fails, the Pods would request more CPU than the remaining cluster can allocate.
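A worked example, assuming a hypothetical cluster of 4 identical nodes with 4 allocatable cores each:

threshold           = (4 - 1) / 4 = 75% of allocatable CPU
total allocatable   = 4 nodes * 4 cores = 16 cores
Pod CPU requests    = 13 cores -> 13 / 16 = 81% > 75%, so the alert fires
after losing 1 node = 12 allocatable cores < 13 requested cores -> some Pods can no longer be scheduled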

Explaining the problem

To sum up, this alert is not really a false positive. It is a reminder that the cluster should be scaled out to stay highly available, not a report of Pods that are already unschedulable.

Since Cluster Autoscaler is in place, that scale-out does not need to be done manually. The “things went back to normal after manually deleting some Failed Pods” episode at the beginning of this article was therefore just a coincidence: a new node happened to start up at the same time.

The solution

To keep the cluster highly available (able to tolerate at least one node failure), we need overprovisioning. Cluster Autoscaler does not provide this directly as a parameter, but its FAQ offers a tip: How can I configure overprovisioning with Cluster Autoscaler.

The general idea is as follows (a sketch of the manifests follows the list):

  1. Create a low-priority PriorityClass, for example named overprovisioning, with a value of -1.
  2. Create a Deployment as a “placeholder”: set spec.template.spec.priorityClassName to overprovisioning, and set the containers’ resource requests to the amount of resources you want to reserve.
  3. Cluster Autoscaler will then provision extra nodes to satisfy the placeholder Pods’ resource requests.
  4. Because the placeholder Pods have a lower priority, they are evicted whenever higher-priority Pods need resources, and the reserved capacity becomes available to those Pods.
  5. The evicted placeholder Pods become unschedulable again, which makes Cluster Autoscaler provision new nodes once more.
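Here is a minimal sketch of what the placeholder resources might look like (names, namespace, image and sizes are illustrative, adapted from the Cluster Autoscaler FAQ rather than copied from our setup):

# PriorityClass with a negative priority so the placeholder Pods are always evicted first
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder priority class used to overprovision the cluster"
---
# "Placeholder" Deployment that reserves capacity with do-nothing pause containers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: overprovisioning   # dedicated namespace for placeholders (create it first)
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"        # amount of CPU to keep in reserve
              memory: 500Mi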

Finally, we adjusted the kubernetes-mixin alert rules via the ignoringOverprovisionedWorkloadSelector configuration, using the selector to exclude the overprovisioning placeholder Pods from the alert’s workload calculation. Combined with the placeholder Deployment, Cluster Autoscaler now always keeps one more node than the real workload needs, which keeps the cluster highly available.
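A sketch of what that mixin configuration might look like (jsonnet; ignoringOverprovisionedWorkloadSelector is the kubernetes-mixin option mentioned above, but the label and value here are assumptions based on the placeholders running in a dedicated overprovisioning namespace, so adjust them to your setup):

// Assumed wiring: exclude everything in the overprovisioning namespace from the overcommit alerts
{
  _config+:: {
    ignoringOverprovisionedWorkloadSelector: 'namespace!="overprovisioning"',
  },
}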

Conclusion

That is how a “false positive” led us to discover a cluster availability problem and, in the end, to a solution that improves availability. If you have a better practice, feel free to share it with us in the comments.


Welcome to follow our WeChat official account “RightCapital”.