Introduction
When operating a Kubernetes cluster, high CPU and memory utilization on nodes is a frequent problem: it affects the stability of the Pods running on those nodes and increases the probability of node failure. To cope with high node load and to balance resource utilization across nodes, the following two strategies should be adopted, based on the nodes' actual resource utilization as reported by monitoring:
- During the Pod scheduling phase, preferentially schedule Pods to nodes with low resource utilization rather than to nodes with high resource utilization
- When a node's resource utilization is high, automatically intervene and migrate some of the Pods on that node to nodes with lower utilization
To this end, we provide a dynamic scheduler + Descheduler solution. On public-cloud TKE clusters, the installation entries for these two add-ons are available under [Component Management] – [Scheduling]. The end of this article also provides a best-practice example based on a real customer case.
Dynamic scheduler
The native Kubernetes scheduler does have scheduling policies that deal with uneven resource distribution across nodes, such as BalancedResourceAllocation, but they are based on resource requests (allocations), which are static and do not reflect actual resource usage, so CPU/memory utilization across nodes often remains unbalanced. A policy that schedules based on the actual resource utilization of nodes is therefore required. The dynamic scheduler does exactly that.
Technical principle
The native Kubernetes scheduler provides the Scheduler Extender mechanism for extending scheduling capabilities. Compared with modifying the native scheduler code to add policies, or implementing a completely custom scheduler, the Scheduler Extender approach is less intrusive and more flexible. We therefore chose the Scheduler Extender approach to add scheduling policies based on the actual resource utilization of nodes.
A Scheduler Extender can add custom logic to the pre-selection (filter) and optimization (prioritize) phases of the native scheduler, achieving the same effect as the scheduler's built-in policies, as sketched below.
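To make the mechanism concrete, here is a minimal sketch of a Scheduler Extender using the types from k8s.io/kube-scheduler/extender/v1: an HTTP service exposing a filter and a prioritize endpoint that kube-scheduler can be configured to call (via the extender's urlPrefix, filterVerb, prioritizeVerb and weight settings). The endpoint paths and port are assumptions for illustration, and this is not the TKE component itself; it simply passes all nodes through with a neutral score.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filter handles the pre-selection call: it receives the Pod and the
// candidate nodes and returns the nodes that pass. This sketch lets every
// node through; a real implementation would drop overloaded nodes.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(extenderv1.ExtenderFilterResult{
		Nodes:       args.Nodes,
		FailedNodes: extenderv1.FailedNodesMap{},
	})
}

// prioritize handles the optimization call: it returns a score per node.
// This sketch gives every node the same neutral score; a real implementation
// would score nodes by their actual utilization.
func prioritize(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	scores := extenderv1.HostPriorityList{}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			scores = append(scores, extenderv1.HostPriority{Host: node.Name, Score: 5})
		}
	}
	json.NewEncoder(w).Encode(scores)
}

func main() {
	http.HandleFunc("/filter", filter)
	http.HandleFunc("/prioritize", prioritize)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```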
Architecture
- Node-annotator: pulls monitoring data from Prometheus and periodically syncs it to the nodes' annotations; it is also responsible for other logic, such as measuring the dynamic scheduler's scheduling effectiveness and preventing scheduling hot spots (see the sketch after this list).
- Dynamic-scheduler: implements the Scheduler Extender's filter (pre-selection) and prioritize (optimization) interfaces. In the pre-selection phase it filters out nodes whose resource utilization exceeds the configured threshold, and in the optimization phase it gives nodes with lower resource utilization higher scores so that they are scheduled preferentially.
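As a rough sketch of the node-annotator idea (not the actual implementation), the snippet below queries Prometheus for a node's 5-minute average CPU utilization from a node-exporter metric and patches the value into a node annotation, so the extender can read it without calling Prometheus on the scheduling path. The annotation key, the PromQL instance label, and the client setup are assumptions for illustration.

```go
package annotator

import (
	"context"
	"fmt"
	"time"

	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// annotateNodeCPU queries Prometheus for a node's 5-minute average CPU
// utilization and writes it into a node annotation (key is hypothetical),
// where the scheduler extender can read it cheaply at scheduling time.
func annotateNodeCPU(ctx context.Context, kube kubernetes.Interface, prom promv1.API, nodeName string) error {
	// node-exporter based CPU utilization over the last 5 minutes (0-1).
	query := fmt.Sprintf(
		`1 - avg(rate(node_cpu_seconds_total{mode="idle",instance=%q}[5m]))`, nodeName)
	value, _, err := prom.Query(ctx, query, time.Now())
	if err != nil {
		return err
	}
	vec, ok := value.(model.Vector)
	if !ok || len(vec) == 0 {
		return fmt.Errorf("no CPU utilization samples for node %s", nodeName)
	}
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"example.io/cpu-usage-avg-5m":"%.3f"}}}`,
		float64(vec[0].Value)))
	_, err = kube.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```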
Implementation details
- How is the policy weight of the dynamic scheduler configured in the optimization phase?
Each scheduling policy in the native scheduler's optimization stage has a weight; a policy's score is multiplied by its weight and the results are summed to give a node's total score, so a policy with a higher weight has more influence on which nodes are selected. By default, every policy has a weight of 1. To strengthen the effect of the dynamic scheduler's policy, we set the weight of its optimization policy to 2.
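A tiny worked example of the weighted-sum idea, with made-up per-policy scores:

```go
package main

import "fmt"

// weightedTotal sums weight * score across policies, mirroring how the
// scheduler combines the per-policy scores for a node.
func weightedTotal(scores, weights []int64) int64 {
	var total int64
	for i := range scores {
		total += scores[i] * weights[i]
	}
	return total
}

func main() {
	// One native policy with weight 1 and the dynamic scheduler's policy
	// with weight 2: a node that scores well on actual utilization now has
	// twice the influence it would have had with the default weight.
	fmt.Println(weightedTotal([]int64{6, 8}, []int64{1, 2})) // 6*1 + 8*2 = 22
}
```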
- How does the dynamic scheduler prevent scheduling hot spots?
When a new node is added to the cluster, we want to prevent too many Pods from being scheduled onto it at once. To do this, we watch the scheduler's scheduling-success events to obtain scheduling results and record the number of Pods scheduled to each node over several recent time windows, for example within the last 1 minute, 5 minutes and 30 minutes. These counts measure the node's hot-spot value, which is then used to compensate (lower) the node's optimization score, as sketched below.
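A minimal sketch of this hot-spot bookkeeping, assuming scheduling-success events are already being watched; the per-window penalty factors are made up for illustration and are not the component's actual values.

```go
package hotspot

import (
	"sync"
	"time"
)

// tracker records the timestamps of successful scheduling events per node.
type tracker struct {
	mu     sync.Mutex
	events map[string][]time.Time // node name -> scheduling timestamps
}

func newTracker() *tracker {
	return &tracker{events: map[string][]time.Time{}}
}

// Record is called when a scheduling-success event for the node is observed.
func (t *tracker) Record(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.events[node] = append(t.events[node], time.Now())
}

// countSince returns how many Pods were scheduled to the node in the window.
func (t *tracker) countSince(node string, window time.Duration) int64 {
	cutoff := time.Now().Add(-window)
	var n int64
	for _, ts := range t.events[node] {
		if ts.After(cutoff) {
			n++
		}
	}
	return n
}

// Penalty turns recent scheduling activity into a score deduction that is
// subtracted from the node's optimization score. The window sizes match the
// ones mentioned above; the per-Pod penalty factors are purely illustrative.
func (t *tracker) Penalty(node string) int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.countSince(node, time.Minute)*3 +
		t.countSince(node, 5*time.Minute)*2 +
		t.countSince(node, 30*time.Minute)
}
```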
Product capabilities
Component dependencies
The component has few dependencies, relying only on the basic node monitoring components node-exporter and Prometheus. Prometheus can be either managed or self-built: with managed Prometheus the dynamic scheduler can be installed with one click, and for self-built Prometheus we provide instructions for configuring the monitoring metrics.
Component configuration
Scheduling policies can currently be based on CPU and memory utilization.
Pre-selection phase
Four thresholds can be configured: average CPU utilization over 5 minutes, maximum CPU utilization over 1 hour, average memory utilization over 5 minutes, and maximum memory utilization over 1 hour. Nodes that exceed any of these thresholds are filtered out in the pre-selection phase.
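A minimal sketch of that check, with hypothetical struct names; the utilization values would come from the monitoring data synced to node annotations.

```go
package extender

// FilterThresholds mirrors the four pre-selection thresholds (values 0-1).
type FilterThresholds struct {
	CPUAvg5m float64 // average CPU utilization over 5 minutes
	CPUMax1h float64 // maximum CPU utilization over 1 hour
	MemAvg5m float64 // average memory utilization over 5 minutes
	MemMax1h float64 // maximum memory utilization over 1 hour
}

// NodeUsage holds a node's monitored values for the same four indicators,
// e.g. as read from the annotations written by node-annotator.
type NodeUsage struct {
	CPUAvg5m, CPUMax1h, MemAvg5m, MemMax1h float64
}

// passesFilter reports whether a node survives the pre-selection phase:
// exceeding any one of the configured thresholds filters the node out.
func passesFilter(u NodeUsage, t FilterThresholds) bool {
	if u.CPUAvg5m > t.CPUAvg5m || u.CPUMax1h > t.CPUMax1h {
		return false
	}
	if u.MemAvg5m > t.MemAvg5m || u.MemMax1h > t.MemMax1h {
		return false
	}
	return true
}
```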
Optimization phase
The score in the optimization phase is a composite of the six indicators shown in the configuration screenshot, and the weight of each indicator expresses how much that indicator is emphasized during optimization. The point of also using the maximum utilization over 1 hour and over 1 day is to record the node's peak utilization in those windows: because some business Pods have usage peaks on an hourly or daily cycle, this prevents newly scheduled Pods from pushing the node's load even higher at peak times.
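A minimal sketch of how such a six-indicator score could be combined, assuming utilization values in the 0-1 range and a 0-10 score; the field names and the exact formula are assumptions, not the component's actual implementation.

```go
package extender

// UsageWeights holds the weight configured for each of the six indicators.
type UsageWeights struct {
	CPUAvg5m, CPUMax1h, CPUMax1d float64
	MemAvg5m, MemMax1h, MemMax1d float64
}

// NodeUsage6 holds a node's measured values (0-1) for the same six indicators.
type NodeUsage6 struct {
	CPUAvg5m, CPUMax1h, CPUMax1d float64
	MemAvg5m, MemMax1h, MemMax1d float64
}

// score turns weighted utilization into a 0-10 priority: the lower a node's
// weighted utilization, the higher its score, so lightly loaded nodes are
// preferred in the optimization phase.
func score(u NodeUsage6, w UsageWeights) int64 {
	weighted := u.CPUAvg5m*w.CPUAvg5m + u.CPUMax1h*w.CPUMax1h + u.CPUMax1d*w.CPUMax1d +
		u.MemAvg5m*w.MemAvg5m + u.MemMax1h*w.MemMax1h + u.MemMax1d*w.MemMax1d
	totalWeight := w.CPUAvg5m + w.CPUMax1h + w.CPUMax1d + w.MemAvg5m + w.MemMax1h + w.MemMax1d
	if totalWeight == 0 {
		return 0
	}
	return int64((1 - weighted/totalWeight) * 10)
}
```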
Product effect
To measure how much the dynamic scheduler improves the scheduling of Pods onto low-load nodes, we take the CPU/memory utilization of all candidate nodes at scheduling time, combine it with the scheduler's actual scheduling results, and compute the following indicators:
- cpu_utilization_total_avg: average CPU utilization of the nodes that Pods were scheduled to.
- memory_utilization_total_avg: average memory utilization of the nodes that Pods were scheduled to.
- effective_dynamic_schedule_count: number of effective schedulings. When a Pod is scheduled to a node whose CPU utilization is lower than the median CPU utilization of all candidate nodes, this counts as an effective scheduling and effective_dynamic_schedule_count is increased by 0.5; the same rule applies to memory.
- total_schedule_count: total number of schedulings; each new scheduling increases it by 1.
- effective_schedule_ratio: effective_dynamic_schedule_count / total_schedule_count. The table below shows how these indicators change in the same cluster over one week with dynamic scheduling disabled and one week with it enabled; the improvement in cluster scheduling is clearly visible (a sketch of how these counters could be accumulated follows the table).
Indicator | Dynamic scheduling disabled | Dynamic scheduling enabled |
---|---|---|
cpu_utilization_total_avg | 0.30 | 0.17 |
memory_utilization_total_avg | 0.28 | 0.23 |
effective_dynamic_schedule_count | 2160 | 3620 |
total_schedule_count | 7860 | 7470 |
effective_schedule_ratio | 0.273 | 0.486 |
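For reference, a minimal sketch of how these counters could be accumulated from individual scheduling results, following the definitions above (median comparison, 0.5-point increments); names and structure are illustrative.

```go
package stats

import "sort"

// scheduleStats accumulates the scheduling-effectiveness indicators
// described above (field names mirror the indicator names).
type scheduleStats struct {
	effectiveDynamicScheduleCount float64
	totalScheduleCount            float64
}

// median returns the median of the given utilization values.
func median(vals []float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// observe records one scheduling decision: the chosen node's CPU and memory
// utilization compared with the utilization of all candidate nodes at
// scheduling time. Each comparison below the median adds 0.5 points.
func (st *scheduleStats) observe(chosenCPU, chosenMem float64, allCPU, allMem []float64) {
	st.totalScheduleCount++
	if chosenCPU < median(allCPU) {
		st.effectiveDynamicScheduleCount += 0.5
	}
	if chosenMem < median(allMem) {
		st.effectiveDynamicScheduleCount += 0.5
	}
}

// effectiveScheduleRatio is effective_dynamic_schedule_count / total_schedule_count.
func (st *scheduleStats) effectiveScheduleRatio() float64 {
	if st.totalScheduleCount == 0 {
		return 0
	}
	return st.effectiveDynamicScheduleCount / st.totalScheduleCount
}
```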
Descheduler
Scheduling in existing clusters is one-shot: once a Pod is placed, its distribution is not adjusted automatically when node CPU and memory utilization become too high, unless the kubelet's eviction manager is triggered or someone intervenes manually. As a result, when a node's CPU/memory utilization is high, the stability of every Pod on that node suffers, while the resources of lightly loaded nodes go to waste.
For this scenario, we drew on the design of the Descheduler rescheduling project in the Kubernetes community and added eviction policies based on each node's actual CPU/memory utilization.
Architecture
Descheduler obtains Node and Pod information from the apiserver and Node and Pod monitoring data from Prometheus, then evicts Pods according to its eviction policies. We also strengthened Descheduler's sorting and safety checks when evicting Pods, so that services do not fail during eviction, as sketched below. After eviction, the Pods are rescheduled by the dynamic scheduler onto nodes with a low water level, which reduces the failure rate of high-water-level nodes and improves overall resource utilization.
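A minimal sketch of the eviction step itself, using the standard Kubernetes Eviction API, which refuses evictions that would violate a PodDisruptionBudget; the sorting and safety checks that decide which Pods to evict are omitted here.

```go
package descheduler

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod asks the API server to evict a Pod via the Eviction subresource.
// Unlike a plain delete, an eviction is rejected if it would violate a
// PodDisruptionBudget, which helps keep services available while Pods are
// being moved off a high-load node.
func evictPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
	}
	return client.PolicyV1().Evictions(namespace).Evict(ctx, eviction)
}
```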
Product capabilities
Component dependencies
Like the dynamic scheduler, Descheduler depends only on the basic node monitoring components node-exporter and Prometheus. Prometheus can be managed or self-built: with managed Prometheus, Descheduler can be installed with one click, and for self-built Prometheus we also provide instructions for configuring the monitoring metrics.
Component configuration
Descheduler evicts Pods according to user-configured utilization thresholds and targets: on nodes whose load is above the threshold water level, Pods are evicted so that the node's load is reduced, as far as possible, below the target utilization water level.
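A minimal sketch of that threshold/target logic, assuming per-Pod usage is available from monitoring; the real component additionally sorts candidate Pods and applies safety checks before evicting.

```go
package descheduler

// podUsage is a Pod running on the node together with its measured memory
// usage expressed as a share (0-1) of the node's capacity; in practice this
// would come from the Prometheus monitoring data.
type podUsage struct {
	Name  string
	Usage float64
}

// selectVictims picks Pods to evict from a node whose utilization exceeds
// the threshold water level, stopping once the projected utilization is
// expected to fall below the target water level. Sorting and safety checks
// (e.g. skipping critical Pods) are omitted from this sketch.
func selectVictims(nodeUsage, threshold, target float64, pods []podUsage) []string {
	if nodeUsage <= threshold {
		return nil // node is below the threshold, nothing to evict
	}
	var victims []string
	projected := nodeUsage
	for _, p := range pods {
		if projected <= target {
			break
		}
		victims = append(victims, p.Name)
		projected -= p.Usage
	}
	return victims
}
```

For example, with a threshold of 0.8 and a target of 0.6 (the values used in the best-practice example below), Pods would be selected until the node's projected memory utilization drops below 60%.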
Product effect
Through K8s events
Pod rescheduling can be observed through Kubernetes events, so you can enable cluster event persistence to review the history of Pod evictions.
Node load variation
In a node CPU utilization monitoring view similar to the following, you can see that the node's CPU utilization decreases after evictions start.
Best practices
State of the cluster
Take one customer cluster as an example. Because most of this customer's services are memory-intensive, nodes with high memory utilization are common and memory utilization across nodes is very uneven. The per-node monitoring before the dynamic scheduler was enabled is shown below:
Dynamic scheduler configuration
The parameters for configuring pre-selection and optimization are as follows:
In the pre-selection phase, nodes whose average memory utilization over 5 minutes exceeds 60%, or whose maximum memory utilization over 1 hour exceeds 70%, are filtered out; Pods will not be scheduled to these nodes.
In the optimization phase, the weight of the 5-minute average memory utilization is set to 0.8, the weights of the maximum memory utilization over 1 hour and over 1 day are both set to 0.2, and the weights of the CPU indicators are set to 0.1. Nodes with low memory utilization are then scheduled to preferentially.
Descheduler configuration
The Descheduler parameters are as follows: when a node's memory utilization exceeds the 80% threshold, Descheduler evicts Pods from that node and tries to bring its memory utilization back down to 60%.
Cluster status after optimization
After running for a period of time, the memory utilization of the cluster's nodes is much more balanced, as shown below: