The symptom
After running the Prometheus Operator Chart for some time, we decided to enable the alert rules provided by the kubernetes-mixin project and to integrate Alertmanager with Opsgenie. Over the following days, a KubeCPUOvercommit alert fired every evening around 6 PM.
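For context, the Opsgenie side of this setup boils down to an Alertmanager receiver along the following lines (a minimal sketch with placeholder values, not our actual configuration):

# Minimal Alertmanager sketch: route all alerts to an Opsgenie receiver.
# The API key below is a placeholder.
route:
  receiver: opsgenie
receivers:
  - name: opsgenie
    opsgenie_configs:
      - api_key: <opsgenie-api-key>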
The investigation
Initial speculation
We first looked at the "CPU Requests Commitment" panel on the Grafana "Compute Resources / Cluster" dashboard. At that point the figure was indeed above 100%.
However, since Cluster Autoscaler is installed in our cluster, CPU resource requests should theoretically never overflow, and we confirmed when the alert fired that there were no unschedulable Pods.
Was it a false positive? We opened the panel details and extracted the expression:
sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"})
/
sum(kube_node_status_allocatable_cpu_cores{cluster="$cluster"})
We checked the kube_pod_container_resource_requests_cpu_cores metric in the Grafana Explore panel. It carries no label for Pod status, so we began to wonder: does this metric also count Pods that are already Failed or Completed?
Verifying the guess
After we manually deleted some Failed and Completed Pods, the alert resolved and the value on the Grafana dashboard dropped below 100%. We thought we had found the cause. But when we went to fix the panel expression and adjust the alert rule, we found that the alert rule's expression (source) is:
sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
/
sum(kube_node_status_allocatable_cpu_cores)
>
(count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores)
In plain terms: the alert fires when (CPU cores requested by Pods) / (allocatable CPU cores) > (number of nodes - 1) / (number of nodes).
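For reference, rendered into a PrometheusRule resource as the Prometheus Operator consumes it, the rule looks roughly like this (a sketch built only from the expression and description quoted in this article; the metadata, for duration, and severity label are assumptions, so check the mixin source for exact values):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-resources        # name is illustrative
spec:
  groups:
    - name: kubernetes-resources
      rules:
        - alert: KubeCPUOvercommit
          expr: |
            sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
              /
            sum(kube_node_status_allocatable_cpu_cores)
              >
            (count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores)
          for: 5m                    # assumed
          labels:
            severity: warning        # assumed
          annotations:
            # the annotation key may be `message` or `description` depending on the mixin version
            description: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.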
The numerator here is not the raw kube_pod_container_resource_requests_cpu_cores metric but namespace:kube_pod_container_resource_requests_cpu_cores:sum, so perhaps my guess was wrong after all.
Debugging again
We found that namespace:kube_pod_container_resource_requests_cpu_cores:sum is a recording rule defined by the Prometheus Operator Chart.
Recording rules pre-compute frequently used or expensive expressions and save the results as a new set of time series. This speeds up the evaluation of alert rules and also lets complex expressions be reused.
level:metric:operations is the naming convention officially recommended for recording rules.
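As an illustration, a recording rule like this is declared in a Prometheus rule group roughly as follows (a sketch only; the group name is an assumption, and the chart's actual expression is the longer one shown below):

groups:
  - name: k8s.rules                  # group name is an assumption
    rules:
      # level:metric:operations -> namespace : kube_pod_container_resource_requests_cpu_cores : sum
      - record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
        expr: |
          # simplified; the chart's real expression follows below in this article
          sum by (namespace) (
            kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
          )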
After some digging, the expression of this recording rule turned out to be (source):
sum by (namespace) (
  sum by (namespace, pod) (
    max by (namespace, pod, container) (
      kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
    ) * on(namespace, pod) group_left() max by (namespace, pod) (
      kube_pod_status_phase{phase=~"Pending|Running"} == 1
    )
  )
)
It is a bit long, so let's break it down, starting from the inner part:
max by (namespace, pod, container) (
  kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
)
* on(namespace, pod) group_left()
max by (namespace, pod) (
  kube_pod_status_phase{phase=~"Pending|Running"} == 1
)
* on(namespace, pod) group_left() is PromQL vector-matching syntax. In this example it means: for each sample on the left-hand side of the operator (*), look for a sample on the right-hand side with matching namespace and pod labels; if one is found, the two take part in the operation and produce a new series, otherwise the left-hand sample is discarded.
In addition, the condition kube_pod_status_phase{phase=~"Pending|Running"} == 1 requires the Pod to be in the Pending or Running phase; otherwise its samples are discarded.
The outer layer then aggregates by namespace, producing the recording rule namespace:kube_pod_container_resource_requests_cpu_cores:sum:
sum by (namespace) (
  sum by (namespace, pod) (
    ...
  )
)
This disproved our conjecture: the recording rule already excludes Pods that are not Pending or Running, so their resource requests never make it into the final result.
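As a quick sanity check, one can roughly measure how much CPU those excluded Pods request by subtracting the recording rule from the raw metric in Grafana Explore (roughly, since duplicate series from multiple kube-state-metrics instances are not deduplicated here):

sum(kube_pod_container_resource_requests_cpu_cores)
-
sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)

Any non-trivial difference is precisely the CPU requested by the Failed and Completed Pods that the alert rule ignores.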
New speculation and confirmation
Having confirmed that namespace:kube_pod_container_resource_requests_cpu_cores:sum was fine, we read the alert's description carefully:
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
In other words: the cluster's Pod CPU resource requests are overcommitted, and it cannot tolerate the failure of a node (a node crashing, being drained, or evicting Pods due to a taint, for example).
This explains why the CPU request ratio is compared with (number of nodes - 1) / (number of nodes): the current workload leaves no room for losing a single node. For example, with four equally sized nodes the threshold is 3/4 = 75%; above that, the Pods' CPU requests would exceed the cluster's allocatable CPU if one node failed.
Explaining the problem
To sum up, this alert does not report an existing failure. It is a reminder that the cluster should be scaled out to preserve high availability, not a sign that there are already unschedulable Pods.
Because Cluster Autoscaler is in place, scaling out does not need to be done manually. So the "back to normal after manually deleting some Failed Pods" mentioned at the beginning was really a coincidence: a new node happened to start up at the same time.
The solution
To keep the cluster highly available (able to tolerate at least one node failure), we need overprovisioning. Cluster Autoscaler does not offer this as a direct parameter, but we found a tip in its FAQ: "How can I configure overprovisioning with Cluster Autoscaler?"
The general idea is as follows:
- Create a low-priority PriorityClass, for example named overprovisioning, with a value of -1.
- Create a Deployment to act as a "placeholder": set spec.template.spec.priorityClassName to overprovisioning, and set the containers' resource requests to the amount of resources you want to keep in reserve (see the manifest sketch after this list).
- Cluster Autoscaler will provision extra nodes to satisfy the resource requests of the placeholder Pods.
- Because the placeholder Pods have a lower priority, whenever other, higher-priority Pods need resources, the placeholder Pods are preempted and the reserved resources are handed over to those Pods.
- The preempted placeholder Pods become unschedulable, so Cluster Autoscaler provisions a new node again.
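A minimal sketch of that placeholder setup, following the FAQ's approach (the namespace, replica count, image, and resource amounts below are illustrative, not values from our cluster):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Low-priority class for placeholder Pods that reserve spare capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system             # namespace is illustrative
spec:
  replicas: 1                        # adjust to how much headroom you want
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"               # CPU reserved per placeholder Pod
              memory: 200Mi

With this in place, the cluster always carries roughly one node's worth of idle capacity, and the placeholder Pods give it up the moment real workloads need it.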
Finally, we adjusted the kubernetes-mixin alert rule through its ignoringOverprovisionedWorkloadSelector configuration, using a selector to exclude the overprovisioned placeholder Pods from the workload calculation. This way the Cluster Autoscaler always keeps one more node than the real workload needs, ensuring high availability.
Conclusion
That is how we discovered a cluster availability problem through what looked like a false positive, and eventually arrived at a solution that improves availability. If you have a better practice, feel free to share it with us in the comments.
Welcome to follow our WeChat official account "RightCapital".