Author: Xiaoyu (Zhongyuan), technical expert, Alibaba Cloud Container Platform team

Follow the "Alibaba Cloud Native" official account and reply with the keyword "1010" to get the slides (PPT) for this article.

Resource utilization has always been a topic of concern for platform operators and developers alike. Drawing on the Alibaba Container Platform team's work in this area, the author has put together a set of solutions for improving resource utilization, in the hope of prompting some discussion and reflection.

Introduction

Have you ever faced this question: you have a Kubernetes cluster, you start deploying your application, and you must decide how many resources to allocate to each container?

It is hard to say. Because of how Kubernetes itself works, a container's resources are essentially a static configuration.

If we find the resources insufficient, we have to rebuild the Pod to allocate more resources to the container;

If we allocate too much, each worker node can host only a few containers.

Can we allocate container resources on demand? In this post, we’ll explore the answer to that question.
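
To make "static configuration" concrete, below is a minimal Go sketch using the k8s.io/api types (the resource values are illustrative). The requests and limits are fixed at Pod creation; raising them through the API means recreating the Pod.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	container := corev1.Container{
		Name:  "app",
		Image: "nginx:1.17",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("500m"),
				corev1.ResourceMemory: resource.MustParse("512Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("1"),
				corev1.ResourceMemory: resource.MustParse("1Gi"),
			},
		},
	}
	// The kubelet translates these values into cgroup settings once, at
	// container start; changing them via the API implies rebuilding the Pod.
	fmt.Printf("cpu limit: %s, memory limit: %s\n",
		container.Resources.Limits.Cpu(), container.Resources.Limits.Memory())
}
```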

Real challenges in production environments

Let us start by laying out the challenges of our actual production environment. You may remember Tmall's Double 11 event in 2018, which saw a total transaction volume of 213.5 billion yuan. From that single figure you can infer, as through a keyhole, the scale and the sheer number of applications in the system that supports such a transaction volume.

At this scale, the container scheduling, load balancing, cluster scaling, cluster upgrades, application releases, application canarying, and so on that we hear about so often are no longer easy to deal with once the term "super-scale cluster" applies. Scale itself is our biggest challenge. Operating and managing such a massive system, while living up to the industry's DevOps expectations, is like asking an elephant to dance. As Jack Ma once put it, though: an elephant should do what an elephant does; why should it dance?

Help from Kubernetes


Can the elephant dance? To answer that, we need to start with the systems behind Taobao, Tmall, and the other apps.

The deployment of Internet applications like these can be roughly divided into three phases: traditional deployment, virtual machine deployment, and container deployment. Compared with traditional deployment, virtual machines provide better isolation and security, but inevitably incur significant performance loss. Container deployment, set against virtual machine deployment, offers a lighter-weight way to achieve isolation and security. Our systems have traveled along this same course. If the underlying system is like a huge ship facing a vast number of containers, we need an excellent captain to arrange them, so that the ship can steer around obstacle after obstacle, reduce the difficulty of operation, gain more flexibility, and ultimately reach its destination.

Ideal and reality

At the beginning, picturing all the wonderful scenarios of containerization and Kubernetes, our ideal of container orchestration looked something like this:

Calm: our engineers face complex challenges with composure, frowning less and smiling more with confidence;

Elegant: every online change can be executed calmly, with a graceful press of the Enter key;

Orderly: from development to testing to canary release, everything is seamless and flowing;

Stable: the system is robust, with N nines of annual availability;

Efficient: more manpower is saved, realizing "work happily, live seriously".

However, the ideal is plump and the reality is bony. What greeted us instead was disorder and embarrassment in all its forms.

It's messy because, with a new technology stack, many of the tools and workflows are still in the early stages of being built. Tools that work well in a demo expose all kinds of hidden problems, one after another, when rolled out at scale in the real world. From development to operations, everyone is forced into reactive firefighting. On top of that, "scale" also means dealing directly with a wide variety of production environments: heterogeneous machine configurations, complex requirements, and even users' established usage habits.


Beyond the nerve-racking chaos, the system also faced all manner of application container failures: OOM kills due to insufficient memory, process throttling due to CPU quota limits, insufficient bandwidth, and dramatically increased response latency. At peak access times, transaction volume could even fall off a cliff because the system could not keep up. All of this gave us extensive experience in running Kubernetes commercially at scale.

Facing the problems

Stability

The problems have to be faced. As someone wise once said: if something feels wrong, something is wrong. So we had to analyze what exactly was wrong. From the OOM kills and the CPU throttling, we can infer that the initial resources allocated to the containers were insufficient.


Insufficient resources inevitably undermine the stability of the whole application service. For example, in the scenario shown in the figure above, replicas of the same application are allocated resources of the same numeric value, but those identical numbers do not carry identical meaning for each replica: perhaps the load balancing is not even enough, perhaps the application itself is heterogeneous, or perhaps the machines themselves are. Numerically the replicas appear to receive the same resources, but the actual workload each one faces is likely to be uneven.


In the overcommit scenario, serious resource competition occurs when the entire node, or the CPU share pool the node belongs to, runs short. Resource contention is one of the biggest threats to application stability, so we try to eliminate all such threats from the production environment.

We all know that stability is critically important, especially for front-line developers who hold the fate of millions of containers in their hands. One careless operation can cause a production incident with wide impact.

Therefore, following the usual process, we also built systematic prevention and safety-net measures.

On the prevention side, we can run full-link stress tests and use scientific means to predict in advance the number of replicas and the resources an application will need. When resources cannot be budgeted accurately, we simply allocate them redundantly.

On the safety-net side, as the flood of traffic arrives, we can degrade non-critical services and at the same time temporarily scale out the major applications.

But for a traffic burst that lasts only a few minutes, the cost of such a combination of punches seems uneconomical. Perhaps we can come up with solutions that better match our expectations.

Resource utilization

Reviewing our application deployments: the containers on a node are typically spread across multiple applications, and those applications do not necessarily, and usually do not, reach their access peaks at the same time. For a host running a mix of applications, letting the running containers draw on the host's resources across staggered peaks may be the more scientific approach.


An application's resource requirements ebb and flow like the tides. Online business, and trading business in particular, uses resources with a clear periodicity: in the early morning and the morning, for example, resource usage is not high, while at noon and in the afternoon it is.

For example: at a moment that is important for application A, application B may matter less; appropriately suppressing application B and transferring its resources to application A is a good choice. This sounds a bit like time-division multiplexing. But if we always allocate resources for peak-traffic demand, a great deal is wasted.


In addition to online applications, which demand high real-time responsiveness, we also have offline applications and real-time computing applications. Offline computing is not sensitive to CPU, memory, or network resource usage or to timing, so it can run at almost any time; real-time computing, by contrast, can be very sensitive to timing.

In the early days, our workloads were deployed on separate nodes by application type. Looking at the figure above, if the workloads share resources through time-division multiplexing, we find that the actual peak resource demand is not 2+2+1=5 but the maximum demand of the critical, urgent applications at any single moment, namely 3 (a toy calculation follows below). If we can monitor each application's actual usage and assign it a reasonable value, we achieve a real improvement in resource utilization.
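
The arithmetic can be illustrated with a small Go program; the hourly usage numbers are made up to match the 2+2+1 example above.

```go
package main

import "fmt"

// peak returns the maximum of a usage series.
func peak(xs []float64) float64 {
	m := xs[0]
	for _, x := range xs[1:] {
		if x > m {
			m = x
		}
	}
	return m
}

func main() {
	// Made-up CPU-core usage of three colocated workloads over four periods:
	online := []float64{1, 2, 2, 1}   // peaks midday
	offline := []float64{2, 1, 0, 1}  // peaks off-hours
	realtime := []float64{0, 0, 1, 0} // short, time-sensitive burst

	sumOfPeaks := peak(online) + peak(offline) + peak(realtime) // 2+2+1 = 5
	combined := make([]float64, len(online))
	for i := range combined {
		combined[i] = online[i] + offline[i] + realtime[i]
	}
	// The capacity actually needed is the peak of the combined series: 3.
	fmt.Printf("sum of peaks: %v, peak of sum: %v\n", sumOfPeaks, peak(combined))
}
```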

For e-commerce applications, that is, Web applications built on heavyweight Java frameworks and their associated technology stacks, neither HPA nor VPA is easy to apply in the short term.

Start with HPA: we may be able to pull up a Pod and create a new container within seconds, but is that container actually usable? From creation to readiness can take a long while. For a flash-sale promotion, the traffic "flood peak" may last only a few minutes or ten-odd minutes; if we wait until all the HPA replicas are ready, the sales event may already be over.

As for the community's current VPA approach, its logic of killing the old Pod and creating a new one is even harder to accept. We therefore need a more practical solution to fill the gap that HPA and VPA leave in single-node resource scheduling.

The solution

Delivery standards

We first set a delivery standard for the solution: "stability, utilization, automation; and if it can be intelligent, all the better". We then refined that standard:

Safe and stable: the tool itself is highly available, and the algorithms and implementation methods it uses must be controllable;

Business containers get resources on demand: the tool can predict near-future resource consumption from the business's real-time consumption, so that users understand the application's true future resource needs;

Low tool overhead: the tool's own resource consumption must be as small as possible, so that it never becomes a burden on O&M;

Easy to operate, highly extensible: it can be used without training, and it is extensible enough for users to customize;

Quick discovery and timely response: real-time response is its most important characteristic, and the one that distinguishes it from HPA and VPA in solving resource scheduling problems.

Design and Implementation


Here is our initial tool flow design: when an application faces high business-access demand, reflected in rising demand for CPU, memory, or other resources, the Data Aggregator uses the real-time basic data gathered by the Data Collector to generate a portrait of a container or of the whole application, and feeds it to the Policy Engine. The Policy Engine then immediately modifies the parameters in the container's cgroup directory.
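
To make the flow concrete, here is a minimal Go sketch with hypothetical types and a toy decision rule; the real portrait formats and engine interfaces are internal and not published.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one raw measurement emitted by the Data Collector.
type Sample struct {
	ContainerID string
	CPUCores    float64
	At          time.Time
}

// Portrait is the Data Aggregator's rolling summary for one container.
type Portrait struct {
	ContainerID  string
	P95CPUCores  float64 // 95th percentile of recent CPU usage
	PredictedCPU float64 // short-term forecast, in cores
}

// decide stands in for the Policy Engine: given a portrait, it returns the
// CPU quota (in cores) to write into the container's cgroup.
func decide(p Portrait) float64 {
	const headroom = 1.2 // arbitrary safety margin
	if p.PredictedCPU > p.P95CPUCores {
		return p.PredictedCPU * headroom
	}
	return p.P95CPUCores * headroom
}

func main() {
	p := Portrait{ContainerID: "app-1", P95CPUCores: 1.8, PredictedCPU: 2.1}
	fmt.Printf("new CPU quota for %s: %.2f cores\n", p.ContainerID, decide(p))
}
```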

Our original architecture was as plain as our thinking at the time: we made intrusive modifications inside Kubelet. Although we only added a few interfaces, it was not elegant, and every Kubernetes upgrade brought a fresh challenge for upgrading the Policy Engine components along with it.


To iterate quickly and decouple from Kubelet, we evolved the implementation again, packaging the key functionality into its own containers. This achieves the following:

No intrusive modification of Kubernetes core components;

Easy iteration and release;

With the help of the Kubernetes QoS Class mechanism, both the containers' resource configuration and the resource cost stay controllable.

Of course, in the subsequent evolution we are also trying to connect with HPA and VPA; after all, they complement the Policy Engine. Our architecture therefore evolved further into the following form: when the Policy Engine cannot handle a more complex scenario by itself, it reports an event so that the central side can make a more global decision and scale resources horizontally or vertically.


Let's look at the design of the Policy Engine in detail. The Policy Engine is the core component for intelligent scheduling and Pod resource adjustment on a single node. It consists of an API Server, a Command Center, and an Executor.

The API Server serves external requests to query and set the Policy Engine's running status;

The Command Center makes Pod resource adjustment decisions based on the real-time container portraits and the physical machine's own load and resource usage;

The Executor adjusts the containers' resource limits according to the Command Center's decisions. It also persists every revision of the adjustment information, so that a change can be rolled back in the event of a failure.

The Command Center periodically obtains real-time container portraits from the Data Aggregator, including aggregated statistics and forecast data. It first checks the node's status: if, for example, the node's disk or network is abnormal, the node itself is deemed abnormal, the scene must be preserved for diagnosis, and any further operation or adjustment would only interfere. If the node status is normal, the Command Center filters the container data against the policy rules, for example: a container's CPU usage has spiked, or a container's response time has exceeded the safety threshold. For the set of containers that meet a rule's conditions, it issues resource adjustment suggestions and passes them to the Executor.
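
A simplified sketch of that decision flow follows; the type names, rules, and threshold are hypothetical stand-ins for the real policy rules.

```go
package main

import "fmt"

// Node captures the health checks the Command Center runs first.
type Node struct{ DiskOK, NetworkOK bool }

// ContainerStats is a per-container slice of the real-time portrait.
type ContainerStats struct {
	ID        string
	CPUSpike  bool    // CPU usage rose sharply
	LatencyMs float64 // recent response time
}

const latencyThresholdMs = 250 // illustrative safety threshold

// plan returns the IDs of containers whose resources should be adjusted.
func plan(n Node, stats []ContainerStats) []string {
	// Protect the scene: an abnormal node is left alone for diagnosis.
	if !n.DiskOK || !n.NetworkOK {
		return nil
	}
	var out []string
	for _, c := range stats {
		if c.CPUSpike || c.LatencyMs > latencyThresholdMs {
			out = append(out, c.ID)
		}
	}
	return out
}

func main() {
	node := Node{DiskOK: true, NetworkOK: true}
	stats := []ContainerStats{
		{ID: "a", CPUSpike: false, LatencyMs: 120},
		{ID: "b", CPUSpike: true, LatencyMs: 310},
	}
	// In the real system these would become adjustment suggestions handed
	// to the Executor; here we just print them.
	for _, id := range plan(node, stats) {
		fmt.Println("adjust:", id)
	}
}
```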

In the architectural design, we followed these principles:

Plug-in design: all rules and policies can be modified through configuration files, decoupled as far as possible from the code of the core control flow and from the updates and releases of other components such as the Data Collector and Data Aggregator, to improve extensibility;

Stability, which includes the following aspects:

Controller stability: the Command Center makes its decisions on the premise of not harming single-node, or even global, stability, covering both container performance and resource allocation. For example, each controller is responsible for controlling only one type of cgroup resource, and the Policy Engine avoids adjusting multiple resources in the same time window, preventing resource-allocation oscillation from interfering with the effect of an adjustment;

Trigger-rule stability: for example, a rule's original trigger condition might be that a container's performance metric exceeds the safety threshold, but to avoid the control action being fired by a momentary spike, the rule is customized to: the low percentile of the performance metric over a recent period exceeds the safety threshold. If that holds, it means most of the metric's values exceeded the threshold during that period, and the control action genuinely needs to fire (see the sketch after this list);

Also, unlike the Vertical Pod Autoscaler, the Policy Engine does not actively evict and rebuild containers; it directly modifies the containers' cgroup files;

Self-healing: executing actions such as resource adjustments may produce anomalies, so we added a self-healing rollback mechanism to each controller to ensure the stability of the whole system;

Independence from prior application knowledge: individually stress-testing all the different applications, customizing per-application strategies, or stress-testing applications that might be co-located in advance would all cause significant overhead and reduce extensibility. Our policies are therefore designed to be as general as possible, adopting metrics and control strategies that are as independent as possible of the specific platform, operating system, and application.
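
Here is the sketch of the percentile-based trigger rule promised above; the window, percentile, and threshold values are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// shouldTrigger reports whether the pct-th percentile of the sample window
// exceeds the safety threshold, i.e. most of the window was over the line.
func shouldTrigger(window []float64, pct, threshold float64) bool {
	if len(window) == 0 {
		return false
	}
	sorted := append([]float64(nil), window...)
	sort.Float64s(sorted)
	idx := int(pct / 100 * float64(len(sorted)-1))
	return sorted[idx] > threshold
}

func main() {
	// One momentary spike does not fire the rule...
	spike := []float64{120, 130, 125, 900, 128, 122, 131, 119, 127, 124}
	// ...but a window that is persistently high does.
	sustained := []float64{260, 270, 280, 290, 300, 265, 275, 285, 295, 305}
	fmt.Println(shouldTrigger(spike, 10, 250))     // false
	fmt.Println(shouldTrigger(sustained, 10, 250)) // true
}
```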

For resource adjustment, cgroups allow us to isolate and limit each container's CPU, memory, network, and disk-IO bandwidth. At present we mainly adjust containers' CPU resources, while exploring, in testing, the feasibility of dynamically adjusting the memory limit and swap usage to avoid OOM. In the future we will also support dynamic adjustment of containers' network and disk IO.
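
As a concrete illustration of direct cgroup manipulation combined with the rollback the Executor keeps, here is a hedged Go sketch; the cgroup v1 path and quota values are assumptions for illustration, not the production implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// applyWithRollback writes newVal into a cgroup control file and returns a
// function that restores the previous value, mirroring the revision
// persistence described for the Executor.
func applyWithRollback(cgroupFile, newVal string) (rollback func() error, err error) {
	old, err := os.ReadFile(cgroupFile)
	if err != nil {
		return nil, err
	}
	oldVal := strings.TrimSpace(string(old))
	if err := os.WriteFile(cgroupFile, []byte(newVal), 0o644); err != nil {
		return nil, err
	}
	return func() error {
		return os.WriteFile(cgroupFile, []byte(oldVal), 0o644)
	}, nil
}

func main() {
	// Hypothetical cgroup v1 path for one container's CPU controller.
	file := filepath.Join("/sys/fs/cgroup/cpu/kubepods/pod-xyz", "cpu.cfs_quota_us")
	rollback, err := applyWithRollback(file, "200000") // 2 CPUs at a 100ms period
	if err != nil {
		fmt.Println("apply failed:", err)
		return
	}
	// ... observe the container; if performance regresses, undo the change:
	_ = rollback()
}
```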

Adjustment results


The figure above shows some of our experimental results from a test cluster, in which we deployed a mix of high-priority online applications and low-priority offline applications. The SLO is 250ms: we expect the online application's 95th-percentile latency to stay below that threshold.

The experimental results show:

Before 90s, the online application's load is very low; both the mean and the percentile latency are below 250ms;

After 90s, we put pressure on the online application, increasing its traffic and load, and the 95th-percentile latency rises above the SLO;

At about 150s, our "small steps, run fast" control strategy is triggered, progressively throttling the offline applications that compete with the online application for resources;

At about 200s, the online application's performance returns to normal, with the 95th-percentile latency back below the SLO.

This illustrates the effectiveness of our control strategy.

Lessons learned

Let’s summarize some of the lessons we learned during the entire project, which we hope will be helpful to others who encounter similar problems and scenarios.

Avoid hard-coding: splitting components into microservices not only enables rapid evolution and iteration, but also makes it easy to circuit-break an abnormal service.

If possible, do not call library interfaces that are still in alpha or beta. We used to call the CRI interface directly, for example, to read container information or perform update operations, but as its fields and methods changed, some of our functionality broke. Sometimes it may even be better to read an application's printed output directly than to call an unstable interface.

QoS-based dynamic resource adjustment: as mentioned before, there are tens of thousands of applications within Alibaba Group, and the call chains between them are complex. An anomaly in application A's container performance may not be caused by resource shortage or contention on a single node; it may come from downstream application B or C, or from database or cache access latency. Because such information is invisible on a standalone node, resource adjustment based on single-node information alone can only follow a "best effort" strategy. In the future, we plan to link the single-node and central resource control loops: the central side will combine the nodes' performance information and resource adjustment requests, redistribute resources globally, re-schedule containers, or trigger HPA, forming a closed, cluster-level intelligent resource control loop. This will greatly improve the stability and overall resource utilization of the whole cluster.

Resource vs. performance model: some of you may have noticed that our adjustment strategy does not explicitly build a "resource vs. performance" model for containers. Such models are common in academic papers: one stress-tests a few applications offline or online, varies their resource allocation, measures their performance metrics, derives curves of performance as a function of resources, and finally feeds the curves into a real-time resource control algorithm. When the number of applications is small, the call chains are simple, and the cluster's physical machine configurations are few, such stress-test-based methods can enumerate the possibilities to find optimal or near-optimal resource adjustments and thus achieve good performance. But at Alibaba Group we have tens of thousands of applications, and many key applications release new versions very frequently; once a new version ships, the old stress-test data, that is, the old resource-performance model, often no longer applies. Moreover, many of our clusters are heterogeneous, and performance data measured on one physical machine does not reproduce on another machine of a different model. All of this hinders directly applying the resource control algorithms from academic papers. So, for Alibaba's internal scenario, we adopted this strategy: do not run offline stress tests to obtain an explicit resource-performance model; instead, build a real-time, dynamically updated container portrait, using statistics of the container's resource usage over the recent past as the prediction for the near future. On top of this dynamic portrait, execute a resource adjustment strategy of small steps taken quickly, adjusting as we observe, doing the best we can.
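
A toy sketch of such a dynamic portrait: keep a sliding window of recent usage and use a conservative statistic over it as the short-horizon prediction. The window size and the choice of statistic are illustrative, not the production algorithm.

```go
package main

import "fmt"

// portrait keeps the last n CPU samples for one container.
type portrait struct {
	samples []float64
	n       int
}

func (p *portrait) observe(cpu float64) {
	p.samples = append(p.samples, cpu)
	if len(p.samples) > p.n {
		p.samples = p.samples[1:]
	}
}

// predict returns a short-term CPU forecast: here simply the window maximum,
// which is conservative and keeps updating as the load pattern shifts.
func (p *portrait) predict() float64 {
	max := 0.0
	for _, s := range p.samples {
		if s > max {
			max = s
		}
	}
	return max
}

func main() {
	p := &portrait{n: 5}
	for _, cpu := range []float64{0.8, 1.1, 0.9, 1.6, 1.4, 1.2} {
		p.observe(cpu)
	}
	fmt.Printf("predicted CPU need: %.1f cores\n", p.predict()) // 1.6
}
```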

Summary and outlook

To sum up, our work has mainly achieved the following benefits:

Through time-sharing multiplexing, mixed deployment of containers with different priorities (i.e., online and offline tasks), and dynamic adjustment of container resource limits, online applications get enough resources under different load conditions, improving the cluster's overall resource utilization;

Intelligent, dynamic adjustment of container resources on a single node reduces performance interference between applications and guarantees the performance stability of high-priority applications;

Deployed as a DaemonSet, the various resource adjustment strategies run automatically and intelligently on each node, reducing manual intervention and the labor cost of operations.

Looking ahead, we hope to strengthen and expand our work in the following areas:

Closed-loop control link: as mentioned above, resource adjustment on a single node is limited by the lack of global information and can only be best-effort. In the future we hope to open up the path to HPA and VPA, so that the single nodes and the central side adjust resources in concert and maximize the benefits of elastic scaling;

Container re-scheduling: even for the same application, the load and physical environment of different containers change dynamically, and adjusting Pod resources on a single machine may not satisfy dynamic demands. We hope the real-time container portraits on each node can provide the central side with more effective information and help the central scheduler make smarter re-scheduling decisions;

Smarter strategies: our current resource adjustment strategy is still coarse-grained, and the resources it can adjust are limited. In the future we hope to make the strategy more intelligent and bring more resources into scope, such as disk and network IO bandwidth, to improve its effectiveness;

Refining the container portrait: the current portrait is also relatively rough, relying only on statistics and linear prediction, and the kinds of metrics describing container performance are limited. We hope to find more accurate, more general metrics of container performance, so as to characterize a container's state and its demand for different resources in finer detail;

Locating interference sources: when an application's performance degrades, we hope to find an effective single-node method to accurately locate the source of the interference; this also matters greatly for making the strategies intelligent.

Q & A

Q1: If you modify a container's cgroup directly, does it actually get the resources?

A1: Containers are isolated at the cgroup level; that is the technical basis of container isolation. If the host machine has enough free resources, setting a larger value in a container's cgroup lets it obtain more resources; likewise, setting a lower cgroup resource value for a low-priority application suppresses that container's operation.

Q2: How does the underlying layer distinguish online and offline priorities?

A2: There is no automatic awareness of who is online, who is offline, or whose priority is high or low. We can achieve this through various Kubernetes extension mechanisms; the simplest is to identify workloads by labels or annotations. Extending the QoS class is another option: the community's QoS class settings are too conservative and leave users little room, and we have enhanced them in these areas as well; this may be introduced to the community in due course. Automatic awareness is the direction: awareness of who is an interference source, and awareness of which kind of resource an application is bound by. We are still working on this. To be truly dynamic, the system must be intelligent enough to perceive these things automatically.
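
A minimal sketch of the label-based identification mentioned above; the label key and value are hypothetical, not a real Alibaba or Kubernetes convention.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isOffline reports whether a Pod is marked as a low-priority offline task
// via a (hypothetical) label.
func isOffline(pod *corev1.Pod) bool {
	return pod.Labels["example.com/workload-type"] == "offline"
}

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "batch-job-1",
			Labels: map[string]string{"example.com/workload-type": "offline"},
		},
	}
	fmt.Println(isOffline(pod)) // true
}
```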

Q3: "Unlike the Vertical Pod Autoscaler, the Policy Engine does not actively rebuild containers; it directly modifies the container's cgroup files." How does that work in practice?

A3: That's a good question. For CPU resources, we can lower the CPU quota in the cgroup of a lower-priority container so that it stops competing for CPU, and then appropriately raise the resource values of the higher-priority containers. For memory resources, however, we cannot simply reduce a low-priority container's cgroup value, or we would cause an OOM; adjusting memory resources requires special techniques, which we will discuss in other shares.

Q4: If you only change cgroups, how do you ensure that K8s can schedule more containers onto a single physical machine?

A4: As this talk has shown, a container's resource consumption is not constant; in many cases it is tidal. Deploying more applications under the same resource conditions is what maximizes resource utilization: overselling resources is the greatest value of this whole topic.

Q5: So for low-priority containers, the request is set much smaller than the limit, and you then dynamically adjust the cgroup?

A5: Within the existing QoS model, you can think of the adjusted Pods as Burstable. We do not directly adjust the limit in the Pod's metadata; we adjust the limit as reflected in the cgroup, and adjust it back once resource contention eases. We do not recommend letting the cgroup data on a node diverge for too long from the central data in etcd; if it deviates for an extended period, we raise an alarm to VPA and adjust in coordination with it. Of course, any attempt to rebuild a container at peak time would be unwise.
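
A small illustrative check for the divergence guard mentioned above; the types, fields, and threshold are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

type containerState struct {
	SpecCPULimit   float64   // cores, as declared in the Pod spec (etcd)
	CgroupCPULimit float64   // cores, as currently set in the cgroup
	DivergedSince  time.Time // zero value if the two agree
}

const maxDivergence = 30 * time.Minute // illustrative tolerance

// shouldAlarmVPA reports whether the live cgroup value has strayed from the
// declared spec for longer than the tolerated window.
func shouldAlarmVPA(c containerState, now time.Time) bool {
	if c.SpecCPULimit == c.CgroupCPULimit {
		return false
	}
	return now.Sub(c.DivergedSince) > maxDivergence
}

func main() {
	c := containerState{
		SpecCPULimit:   2.0,
		CgroupCPULimit: 3.0,
		DivergedSince:  time.Now().Add(-time.Hour),
	}
	// Diverged for an hour: hand the adjustment over to VPA.
	fmt.Println(shouldAlarmVPA(c, time.Now())) // true
}
```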

Q6: So the overall picture is: the physical machine starts out with a certain proportion of Pods, and you then dynamically adjust the containers' cgroup values through the strategy?

A6: This dynamic adjustment is meaningful even when resources are entirely abundant. In practice, once the host's CPU reaches a certain level, say 50%, application latency increases. Sacrificing some low-priority CPU to fully guarantee the SLO of a high-priority application is valuable in itself.

Q7: Is open-sourcing the Policy Engine being considered?

A7: There is a plan to open-source it. The Policy Engine is closely tied to the attributes of the applications themselves; the strategies for e-commerce applications and big-data processing applications are different. We will open-source the framework first, with some simple strategies attached; users can customize more strategies on top.

Q8: Most of the applications I have encountered cannot correctly perceive their cgroup configuration, so in many cases parameters have to be set according to CPU or memory in the startup arguments. In other words, even if the cgroup is changed, it has no effect on them, so the usage scenarios are limited.

A8: Limiting a container's resource usage is still valuable. Limiting low-priority applications by itself improves the SLO of high-priority applications, even if the effect is less dramatic. Stability considerations are also important.

Q9: How is the Policy Engine currently used inside Alibaba? How large is the production scale adjusted dynamically in this way? Does it work together with the community's HPA and VPA?

A9: The Policy Engine is already in use in some clusters inside Alibaba. The scale cannot be disclosed. Linkage among many components is involved, and the community's HPA and VPA are currently insufficient for our needs, so Alibaba's HPA and VPA were developed in-house, though they stay consistent with the community's principles. For Alibaba's HPA open-sourcing, follow the OpenKruise community; I have no concrete information about plans to open-source the VPA.

Q10: When a single node lacks the resources for container expansion, can HPA or VPA expansion still be performed?

A10: When a single node is short of resources, an application can add replicas through HPA, but VPA cannot update in place on the original node; it can only schedule onto other nodes with spare resources. Under a sudden traffic surge, a rebuilt container may not be ready in time, which can trigger an avalanche: during the rebuild, the application's other, not-yet-updated replicas receive more traffic and are OOM-killed, and the newly started containers are then OOM-killed the moment they come up. So restarting containers demands caution. Rapid horizontal expansion (HPA), or rapidly boosting high-priority containers' resources while suppressing low-priority ones, is more effective.



Author: a green boat

Original link: yq.aliyun.com/articles/72…

This article is original content from the Yunqi Community and may not be reproduced without permission.