Takeaway: Serverless and autoscaling have drawn great attention from developers in recent years. Some call Serverless "Container 2.0" and predict that containers and Serverless will one day face off. In fact, containers and Serverless can coexist and complement each other, especially in autoscaling scenarios, where Serverless integrates well with containers and makes up for the shortcomings of container-based scaling in simplicity, speed, and cost. In this article, you will learn about the principles, solutions, and challenges of container autoscaling, and how Serverless helps containers solve these problems.


What we talk about when we talk about elastic scaling

What are we talking about when we talk about "elastic scaling"? Elasticity means different things to different roles on a team, and that is part of its appeal.

Start with a resource graph

This diagram is often used to illustrate the elastic scaling problem. It shows the relationship between the actual resource capacity of a cluster and the capacity required by an application.

  • The red curve represents the capacity the application actually requires. Because an application's resource footprint is much smaller than a node's, the curve is relatively smooth.
  • The green broken line represents the actual resource capacity of the cluster. Each inflection point indicates a manual capacity adjustment, such as adding or removing nodes. Because the capacity of a single node is fixed and relatively large, the line changes in steps rather than smoothly.


First, look at the leftmost yellow grid area. Here the cluster's capacity cannot meet the capacity the business requires; in practice, this usually shows up as Pods that cannot be scheduled due to insufficient resources.

In the middle grid area, the cluster's capacity is much higher than the resources actually required, which wastes resources. This usually shows up as uneven load distribution across nodes: some nodes carry no scheduled load, while others are relatively heavily loaded.

The grid area on the right shows a capacity surge: the curve just before the peak is very steep. This scenario usually arises from inadequate capacity planning, traffic spikes, or large bursts of batch tasks. A sudden traffic peak leaves the operations team very little reaction time, and mishandling it can cause an outage.

Elastic scaling means different things to different roles:

  • Developers hope elastic scaling guarantees the high availability of their applications;
  • Operations engineers hope elastic scaling reduces the cost of managing infrastructure;
  • Architects hope elastic scaling yields an architecture that is resilient to sudden spikes.

There are many different elastic scaling components and solutions, and selecting one that suits your business needs is the first step before implementation.

Interpreting Kubernetes elastic scaling capabilities

Kubernetes elastic scaling related components



Kubernetes elastic scaling components can be interpreted in two dimensions: one is the scaling direction, and the other is the scaling object.

In terms of direction, scaling is divided into horizontal and vertical; in terms of object, into nodes and Pods. Expanding this quadrant yields the following three categories of components:

  1. Cluster-Autoscaler: horizontal scaling of nodes;
  2. HPA & Cluster-Proportional-Autoscaler: horizontal scaling of Pods;
  3. Vertical-Pod-Autoscaler & addon-resizer: vertical scaling of Pods.

Among them, HPA and Cluster-Autoscaler are the combination developers use most often: HPA handles horizontal scaling of containers, and Cluster-Autoscaler handles horizontal scaling of nodes. Many developers ask: why does elastic scaling need to be split into so many components? Can't we just set a threshold and get automatic water-level management for the cluster?

The challenges of Kubernetes elastic scaling



Understanding how Kubernetes schedules workloads helps developers better understand the design philosophy of Kubernetes elastic scaling. In Kubernetes, the minimum scheduling unit is the Pod, which is scheduled onto a node that satisfies the scheduling policy, such as resource matching and affinity/anti-affinity rules; among these, resource matching is the core element of scheduling.

There are four common concepts associated with resources:

  • Capacity: the total allocatable capacity of a node.
  • Limit: the upper bound on the resources a Pod may use.
  • Request: the resource space a Pod occupies during scheduling.
  • Used: the resources a Pod actually consumes.
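As a concrete illustration of Request and Limit, a minimal Pod spec might declare them as follows. This is a hypothetical sketch (the Pod name and image are placeholders, not from the article):

```yaml
# Hypothetical Pod spec: the scheduler reserves the `requests` values
# on a node (counted against its allocatable Capacity), while `limits`
# caps what the container may actually use at runtime.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
      resources:
        requests:
          cpu: "500m"          # Request: space occupied in scheduling
          memory: "512Mi"
        limits:
          cpu: "1"             # Limit: upper bound of actual usage
          memory: "1Gi"
```

Note that Used is not declared anywhere: it is the live consumption reported by the metrics pipeline, while Capacity comes from the node itself (visible via `kubectl describe node`).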

With these four basic concepts and usage scenarios in mind, let’s look at the three challenges of Kubernetes elastic scaling:

1. The capacity planning bomb. Remember how capacity planning was done before containers? For example, application A needs two 4C8G machines, application B needs four 8C16G machines, and the machines of A and B are independent and do not interfere with each other. In the container scenario, most developers no longer need to care about the underlying resources, so where does capacity planning go?

In Kubernetes, this is configured through Request and Limit: Request is the requested value of a resource, and Limit is its upper bound. Since Request and Limit are the capacity-planning equivalents, resource accounting is more accurately based on them. As for the reserved-resource threshold of each node, a small node's reservation may be too small to satisfy scheduling, while a large node's reservation may never be fully consumed by scheduling.

2. The percentage fragmentation trap. A Kubernetes cluster usually contains machines of more than one size. Because machine configurations and capacities can vary greatly across scenarios and requirements, percentage-based cluster scaling can be very confusing.

Suppose our cluster contains two specifications, 4C8G machines and 16C32G machines. A 10% resource reservation means something completely different on each. In scale-down scenarios especially, nodes are usually removed one by one to keep the cluster from oscillating, so judging by percentage whether the current node should be scaled down is particularly error-prone. If a large machine with low utilization is chosen for scale-down, the rescheduled containers are likely to cause contention and starvation on the remaining nodes. If instead we add a condition to scale down small nodes first, we may end up with a lot of redundant resources after scale-down and a cluster left with only large nodes.
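To make the trap concrete, here is a small back-of-the-envelope sketch (using the two hypothetical node sizes above) showing that the same 10% reservation means very different absolute capacity on the two machine types:

```python
# Rough illustration: a percentage-based reservation translates into
# very different absolute capacity on differently sized nodes.

def reserved_cores(total_cores: float, reserve_ratio: float) -> float:
    """Absolute CPU cores held back by a percentage-based reservation."""
    return total_cores * reserve_ratio

small = reserved_cores(4, 0.10)    # 4C8G node  -> 0.4 cores reserved
large = reserved_cores(16, 0.10)   # 16C32G node -> 1.6 cores reserved

# A Pod requesting 1 core fits inside the large node's "10% headroom"
# but dwarfs the small node's, so the same percentage threshold
# behaves completely differently on the two machine types.
print(f"small node reserve: {small} cores, large node reserve: {large} cores")
```

This is why a single cluster-wide percentage cannot drive scale-down decisions safely in a mixed-specification cluster.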

3. The resource utilization dilemma. Does a cluster's resource utilization really represent its current state? When a Pod's resource utilization is low, it does not mean its requested resources can be encroached upon by others. In most production clusters, resource utilization is not kept at a very high water level, but scheduled resources (requests against allocatable capacity) should be. This ensures the cluster's stable availability without wasting too many resources.

What does it mean if Request and Limit are not set and the cluster's overall resource utilization is high? It means every Pod is scheduled by its real load and Pods compete fiercely with one another, and simply adding nodes cannot solve the problem: apart from manual rescheduling and eviction, there is no way to move an already scheduled Pod off an overloaded node. So what if we do set Request and Limit and the nodes still reach very high utilization? Unfortunately, this is impossible in most scenarios, because different applications and workloads have different resource utilization at different times, and the cluster will most likely become unable to schedule Pods well before the configured threshold is reached.

Having understood these three problems of Kubernetes elastic scaling, let's look at how Kubernetes solves them.

The design philosophy of Kubernetes elastic scaling



Kubernetes is designed to divide elastic scaling into scheduling layer scaling and resource layer scaling. The scheduling layer is responsible for scaling out the scheduling unit according to indicators and thresholds, while the resource layer is responsible for meeting the resource requirements of the scheduling unit.

At the scheduling layer, Pods are usually scaled horizontally by HPA. Using HPA is very close to elastic scaling in the traditional sense: you define metrics and thresholds, and the workload is scaled horizontally against them.

At the resource layer, the current mainstream solution is to scale nodes horizontally through Cluster-Autoscaler. When Pods cannot be scheduled due to insufficient resources, Cluster-Autoscaler tries to select a scaling group that can satisfy the scheduling requirements and automatically adds instances to it. Once an instance registers with Kubernetes, kube-scheduler re-triggers scheduling, and the previously unschedulable Pods are placed onto the newly created nodes, completing the scale-up of the whole link.

During scale-down, the scheduling layer compares resource usage with the preset threshold to scale down at the Pod level. When the scheduled resources on a node reach the resource-layer scale-down threshold, Cluster-Autoscaler drains the node with the low scheduling percentage and then removes it, completing the scale-down of the whole link.

The Achilles heel of the Kubernetes elastic scaling scheme

Classic Kubernetes elastic scaling case



This diagram is a classic elastic scaling example that represents most online business scenarios. The application's initial architecture is a Deployment with two Pods underneath, and the application's access layer is exposed through an Ingress Controller. We set the application's scaling policy as follows: when the QPS of a single Pod reaches 100, scale out, with a maximum of 10 Pods and a minimum of 2.



The HPA controller continuously polls alibaba-cloud-metrics-adapter to obtain the QPS of the current Ingress gateway route. When the Ingress gateway traffic exceeds the QPS threshold, the HPA controller triggers a change in the Deployment's replica count. When the requested Pod capacity exceeds the cluster's total capacity, Cluster-Autoscaler selects an appropriate scaling group and pops out new nodes to host the unschedulable Pods.
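Under the policy above, the HPA object might look roughly like the following sketch. The external metric name, selector, and Deployment name depend on how alibaba-cloud-metrics-adapter is deployed and should be treated as placeholders, not a verified configuration:

```yaml
# Sketch of an HPA driven by Ingress QPS through an external metrics
# adapter. Metric name and target names are assumptions for illustration.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-qps-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app              # placeholder Deployment name
  minReplicas: 2                # minimum of 2 Pods
  maxReplicas: 10               # maximum of 10 Pods
  metrics:
    - type: External
      external:
        metric:
          name: sls_ingress_qps # assumed adapter metric name
        target:
          type: AverageValue
          averageValue: "100"   # scale out at 100 QPS per Pod
```

The HPA controller then adjusts `spec.replicas` on the Deployment; node capacity is the resource layer's concern, handled by Cluster-Autoscaler as described above.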

That resolves the classic elastic scaling case. So what problems arise in actual development?

Shortcomings and solutions of classic Kubernetes elastic scaling



The standard community mode creates and releases ECS instances, with a scale-up delay of about 2 to 2.5 minutes. Alibaba Cloud's independent extreme mode instead creates, stops, and starts instances, charging only for storage rather than compute while an instance is stopped. It improves elastic efficiency by more than 50% at a very low price.


In addition, complexity is a problem surrounding Cluster-Autoscaler. To use it well, you need a deep understanding of its internal mechanisms; otherwise, it is easy to end up in scenarios where nodes cannot be popped out or cannot be scaled down.



For most developers, Cluster-Autoscaler works as a black box, and the best way to troubleshoot it is still to read its logs. Once Cluster-Autoscaler runs abnormally, or fails to scale as expected because of a configuration error, more than 80% of developers find it hard to correct the mistake on their own.

The Alibaba Cloud Container Service team therefore developed a kubectl plugin that provides deeper observability into Cluster-Autoscaler, such as viewing its current scaling stage and automatically correcting elastic scaling errors.

Although none of the core problems encountered so far is fatal on its own, we kept asking: is there another way to make elasticity simpler and more efficient?

Achilles’ Martin boot – Serverless Autoscaling



The core problems of resource-layer scaling are high learning cost, difficult troubleshooting, and poor timeliness. Looking back at Serverless, we find that these pain points are exactly where Serverless shines. Is there a way to make Serverless the elastic solution for the Kubernetes resource layer?

Serverless Autoscaling component – Virtual-Kubelet-AutoScaler



The Alibaba Cloud Container Service team developed virtual-kubelet-autoscaler, a component that implements Serverless autoscaling in Kubernetes.



When there are Pods that cannot be scheduled, virtual-kubelet carries the real load; it can be understood as a virtual node with infinite capacity. When a Pod is scheduled onto the virtual-kubelet, the Pod is started as a lightweight ECI instance. Currently, ECI startup takes less than 30 seconds, and a program generally goes from scheduled to running within 1 minute.
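To land on the virtual node, a Pod typically has to opt in via a node selector and a toleration. The label and taint values below are the ones commonly used for ECI virtual nodes and are assumptions for illustration, not taken from this article:

```yaml
# Sketch: scheduling a Pod onto a virtual-kubelet (ECI) node.
# Label and taint values are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: burst-job
spec:
  nodeSelector:
    type: virtual-kubelet            # assumed virtual-node label
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
      effect: NoSchedule             # tolerate the virtual node's taint
  containers:
    - name: job
      image: busybox:1.36            # placeholder image
      command: ["sh", "-c", "echo hello && sleep 30"]
```

virtual-kubelet-autoscaler's role is precisely to make this binding automatic for Pods that the regular nodes cannot accommodate.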



Like Cluster-Autoscaler, virtual-kubelet-autoscaler also uses a simulated scheduling mechanism to determine whether a Pod can be handled and loaded. Compared with Cluster-Autoscaler, however, it differs in the following ways:

  1. virtual-kubelet-autoscaler simulates scheduling by adding a Pod Template to the scheduling policy rather than a Node Template;
  2. The core of virtual-kubelet-autoscaler is selecting a virtual-kubelet to carry the load. Once a Pod is successfully bound to a virtual-kubelet through simulated scheduling, its lifecycle management and problem diagnosis are no different from those of a traditional Pod; troubleshooting is no longer a black box.

Virtual-kubelet-autoscaler is not a “silver bullet”

virtual-kubelet-autoscaler is not meant to replace Cluster-Autoscaler. Its advantages are simplicity, high elasticity, high concurrency, and pay-as-you-go billing, at the cost of some compatibility: support for mechanisms such as cluster-PI and CoreDNS is currently imperfect. With a little configuration, virtual-kubelet-autoscaler can work alongside Cluster-Autoscaler. It is especially suitable for scenarios such as offline big data tasks, CI/CD jobs, and bursty online workloads.

Final thoughts

Serverless autoscaling has gradually become an important part of Kubernetes elastic scaling. As its compatibility work is completed, Serverless's ease of use, freedom from operations, and cost savings will complement Kubernetes perfectly and bring a new leap in Kubernetes elastic scaling.



This article is original content of the Yunqi Community and may not be reproduced without permission.