Summary: CPU throttling degrades the performance of running containers, and people sometimes sacrifice container deployment density to avoid it. The CPU Burst technique described in this article can help preserve your containers' quality of service without reducing deployment density. The second part of this two-part article will look at how CPU Burst differs from other ways of avoiding throttling, and how to configure CPU Burst for the best results.
In Kubernetes (K8s) container scheduling, the CPU cap of a container is specified by its CPU limits parameter. Setting a CPU cap restricts the amount of CPU time an individual container can consume and ensures that other containers get enough CPU resources. CPU limits are implemented in the Linux kernel by the CPU Bandwidth Controller, which restricts the resource consumption of cgroups by throttling their CPU time. So when the processes in a container use more CPU than the limit allows, they are throttled, their CPU usage is capped, and key latency metrics of the service get worse.
What should we do in this situation? Normally, the limit is set to the container’s daily peak CPU usage multiplied by a relatively safe factor. In this way, the limit neither hurts the container’s service quality through throttling nor wastes too much CPU. As a simple example, if a container’s daily peak CPU usage is around 250%, setting the limit to 400% should keep it from being throttled while still giving a CPU utilization of 62.5% (250%/400%).
But is life really that good? Obviously not! CPU throttling happens much more frequently than expected. What can we do? It seems we just have to keep increasing the CPU limit. In many cases, service quality is only guaranteed once the container’s limit has been raised 5 to 10 times, at which point the container’s overall CPU utilization is only 10 to 20 percent. So, in order to cope with possible spikes in container CPU usage, deployment density must be significantly reduced.
Historically, several bugs in the CPU Bandwidth Controller that caused unnecessary throttling have been fixed. We found that the unexpected throttling still seen today is caused by bursts of CPU usage at the 100ms level, and we proposed the CPU Burst technique, which allows a certain amount of bursty CPU usage and avoids throttling as long as the average CPU usage stays below the limit. In cloud computing scenarios, the value of the CPU Burst technique is as follows:
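The sizing arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only a sketch: the `utilization` helper is a hypothetical name, and the second loop assumes that "raised 5 to 10 times" means the limit is 5 to 10 times the CPU the container actually uses.

```python
def utilization(usage_pct: float, limit_pct: float) -> float:
    """Fraction of the configured CPU limit that the container actually uses."""
    return usage_pct / limit_pct


# Daily peak of 250% against a limit of 400%, as in the example above.
print(f"{utilization(250, 400):.1%}")  # 62.5%

# Assumption: a limit 5x to 10x larger than real usage (usage normalised to 100%).
for factor in (5, 10):
    print(factor, f"{utilization(100, 100 * factor):.0%}")  # 5 20%, then 10 10%
```

The larger the safety factor needed to escape throttling, the smaller the share of reserved CPU that does useful work, which is the density cost described above.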
- Improve CPU resource service quality without increasing CPU configuration.
- Allow the resource owner to reduce the CPU resource configuration and improve CPU utilization without sacrificing service quality.
- Reduce TCO (Total Cost of Ownership).
The CPU utilization you see is not the whole truth
CPU usage measured at one-second granularity does not reflect the CPU usage within the 100ms periods on which the Bandwidth Controller operates. This is the cause of unexpected CPU throttling.
The Bandwidth Controller applies to CFS tasks and uses a period and a quota to manage the CPU time consumed by a cgroup. If a cgroup’s period is 100ms and its quota is 50ms, the processes in the cgroup can use at most 50ms of CPU time in every 100ms. Once the CPU time used in a 100ms period reaches 50ms, the processes are throttled for the rest of that period, and the cgroup’s CPU usage is capped at 50%.
CPU usage is an average over a period of time. When it is collected at a coarse granularity, it tends to look stable; as the observation granularity becomes finer, the bursty nature of CPU usage becomes more obvious. When the load of a running container is observed at 1s and 100ms granularity at the same time, the average CPU utilization at 1s granularity is about 250%, while at the 100ms granularity on which the Bandwidth Controller operates, peak CPU usage exceeds 400%.
Suppose the container’s quota is set to 400ms and its period to 100ms, based on the 250% CPU usage observed at one-second granularity. The fine-grained bursts of the container’s processes are then throttled by the Bandwidth Controller, which restricts the CPU usage of the container’s processes.
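The period and quota rule can be illustrated with a toy model. The sketch below is not the kernel implementation: `run_with_quota` and the demand values are made up for illustration, and demand that exceeds the quota is simply cut off rather than carried into the next period.

```python
from typing import List, Tuple


def run_with_quota(demand_ms: List[int], quota_ms: int) -> Tuple[List[int], int]:
    """Grant at most quota_ms of CPU time in each period.

    Returns the CPU time granted per period and the number of throttled periods.
    """
    granted = []
    throttled_periods = 0
    for want in demand_ms:
        if want > quota_ms:
            throttled_periods += 1  # the cgroup hits its quota in this period
        granted.append(min(want, quota_ms))
    return granted, throttled_periods


# Four consecutive 100ms periods with a 50ms quota (hypothetical demand).
granted, throttled = run_with_quota([20, 50, 80, 30], quota_ms=50)
print(granted)    # [20, 50, 50, 30]
print(throttled)  # 1
```

Only the third period asks for more than 50ms, so only that period is throttled, even though the average demand over the four periods (45ms) is below the quota.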
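To make the granularity effect concrete, the sketch below uses an invented one-second trace of ten 100ms periods. The numbers are chosen only so that the one-second average is 250%; they are not measurements from the article.

```python
PERIOD_MS = 100
QUOTA_MS = 400

# Hypothetical CPU time (ms) consumed in each of ten consecutive 100ms periods.
usage_ms = [150, 150, 500, 150, 450, 150, 150, 500, 150, 150]

# What a one-second monitor sees: the average over all ten periods.
avg_pct = 100 * sum(usage_ms) / (PERIOD_MS * len(usage_ms))

# What the Bandwidth Controller sees: each 100ms period on its own.
peak_pct = 100 * max(usage_ms) / PERIOD_MS
over_quota = [i for i, used in enumerate(usage_ms) if used > QUOTA_MS]

print(avg_pct)     # 250.0
print(peak_pct)    # 500.0
print(over_quota)  # [2, 4, 7]
```

With a 400ms quota, three of the ten periods would hit the limit, although the one-second view shows the container comfortably below it.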
How to improve
We meet this need for fine-grained bursts with the CPU Burst technique, which adds the concept of a burst on top of the traditional CPU Bandwidth Controller quota and period. When the container’s CPU usage is below quota, resources available for bursting accumulate; when the container’s CPU usage exceeds quota, it is allowed to spend the accumulated burst resources. The end result is that the container’s average CPU consumption stays within its quota over longer time spans, while its CPU usage may exceed the quota for short periods.
As an analogy, suppose the Bandwidth Controller algorithm were used to manage vacation: the period is one year, and the quota is the number of vacation days per year. With CPU Burst, vacation days that are not used this year can be taken later.
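The accumulate-and-spend behaviour can be sketched as a small simulation. This is a simplified model rather than the kernel code: `run_with_burst`, the `burst_ms` cap on carried-over time, and the trace values are assumptions made for illustration.

```python
from typing import List, Tuple


def run_with_burst(demand_ms: List[int], quota_ms: int, burst_ms: int) -> Tuple[List[int], int]:
    """Per-period quota plus unused time carried over from earlier periods.

    At most burst_ms of unused CPU time may be carried into the next period.
    Returns the CPU time granted per period and the number of throttled periods.
    """
    carried = 0
    granted = []
    throttled_periods = 0
    for want in demand_ms:
        available = quota_ms + carried
        if want > available:
            throttled_periods += 1
        got = min(want, available)
        granted.append(got)
        # Unused time accumulates, but never beyond the burst cap.
        carried = min(burst_ms, available - got)
    return granted, throttled_periods


# Hypothetical trace: ten 100ms periods averaging 250%, with peaks above 400%.
trace = [150, 150, 500, 150, 450, 150, 150, 500, 150, 150]

print(run_with_burst(trace, quota_ms=400, burst_ms=0)[1])    # 3 (no burst allowed)
print(run_with_burst(trace, quota_ms=400, burst_ms=100)[1])  # 0 (peaks absorbed)
```

In this model, a burst cap of 0 behaves like the plain Bandwidth Controller, while a modest cap lets the quiet periods pay for the short peaks without raising the quota.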
After CPU Burst
After enabling CPU Burst in the container scenario, the quality of service of the test containers improved significantly. Mean response time (RT) dropped by 68% (from 30+ms to 9.6ms), and 99th-percentile RT dropped by 94.5% (from 500+ms to 27.37ms).
If your container runs a latency-sensitive workload and suffers CPU throttling caused by its quota configuration, you may want to try CPU Burst to optimize latency. The CPU Burst changes have been merged into Linux 5.14, and Alibaba Cloud Linux already supports CPU Burst.
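As a rough sketch of what the configuration might look like, the Python below computes the values for the upstream cgroup v2 interface files `cpu.max` and `cpu.max.burst` (the cgroup v1 counterparts are `cpu.cfs_quota_us`, `cpu.cfs_period_us`, and `cpu.cfs_burst_us`). The helper names and the example burst size are assumptions, and the files exposed on a given distribution or kernel may differ, so check your system before writing anything.

```python
from pathlib import Path
from typing import Dict


def burst_settings(cpu_limit: float, burst_ms: float, period_us: int = 100_000) -> Dict[str, str]:
    """Build cgroup v2 file contents for a CPU limit (in CPUs) and a burst (in ms)."""
    quota_us = round(cpu_limit * period_us)
    burst_us = round(burst_ms * 1000)
    return {
        "cpu.max": f"{quota_us} {period_us}",  # "<quota> <period>" in microseconds
        "cpu.max.burst": str(burst_us),        # maximum accumulated burst in microseconds
    }


def apply_settings(cgroup_dir: Path, settings: Dict[str, str]) -> None:
    """Write the settings into a cgroup directory (needs suitable privileges)."""
    for name, value in settings.items():
        (cgroup_dir / name).write_text(value)


# A limit of 4 CPUs (400%) with a hypothetical 400ms burst allowance.
settings = burst_settings(cpu_limit=4.0, burst_ms=400)
print(settings)  # {'cpu.max': '400000 100000', 'cpu.max.burst': '400000'}
```

`apply_settings` is deliberately not called here, because the cgroup directory of a container depends on the runtime and is not specified in this article.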
About the author
Chang Huaixin (Yizhai) is an engineer in Alibaba Cloud’s kernel group, specializing in CPU scheduling.
This article is original Alibaba Cloud content and may not be reproduced without permission.