Abstract: In 2018, the “Double 11” transaction volume reached a new historical high of 213.5 billion yuan. Compared with ten years ago, the transaction volume has increased more than 360-fold and the peak transaction rate more than 1,200-fold. Correspondingly, the number of systems has grown explosively, and the complexity and difficulty of supporting “Double 11” have increased exponentially.
As Alibaba’s group-wide container scheduling system, Sigma successfully supported the deployment of all of the group’s containers (transaction-line middleware, databases, advertising, and more than 20 other businesses) during the “Double 11” period, and it is a key piece of the underlying infrastructure of Alibaba’s operation and maintenance system. Sigma plays the core role in managing and controlling online services across all of Alibaba’s data centers, with millions of host machines under its control, so its importance is self-evident: the quality of its algorithms directly affects the group’s overall business stability and resource utilization.
When a user applies to the scheduling system for the computing resources (such as CPUs, memory, and disks) required by containers, the scheduler selects physical machines that meet the specification requirements to deploy these containers. Under the same resource demand, the scheduling strategy determines the utilization level of the cluster’s computing resources. This article briefly introduces the application of a swarm reinforcement learning algorithm to scheduling policy optimization.
Authors: Wang Mengchang, algorithm expert, Machine Intelligence Technology Lab, DAMO Academy; Han Tang, technology expert, Alibaba Group
1. Computing resource scheduling and online policies
When users apply to Sigma for the computing resources (such as CPUs, memory, and disks) required by containers, the scheduler selects physical machines that meet the specification requirements to deploy the containers. In general, more than one physical machine satisfies the requirements, and their water levels (utilization) differ, so different allocation choices ultimately yield different allocation rates. A core task of the scheduler is therefore to select, according to some strategy, the most suitable physical machine from the many candidates.
In the literature, computing resource scheduling is generally formulated as a vector bin packing problem. If the number of containers of each application is known in advance (as in the big-promotion scenario), the scheduler can generate an optimized layout for all containers at once; in this case the problem can be expressed as an integer program and solved with a general-purpose solver or a specially developed algorithm. If, instead, the requests of each application arrive at Sigma one after another (as in daily scenarios), the scheduler must generate a deployment decision online as each request arrives. The problem can then be expressed as a Markov Decision Process (MDP), and in principle the optimal strategy can be obtained through value iteration or policy iteration.
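For the offline (“big promotion”) case, the following is a minimal sketch, not Sigma’s actual model, of a tiny vector bin-packing instance formulated as an integer program with the open-source PuLP solver; the container demands and host capacity are made-up example data.

```python
# A minimal sketch (not Sigma's actual model): offline vector bin packing
# formulated as an integer program and solved with the open-source PuLP library.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

containers = [(4, 8), (8, 16), (4, 8), (8, 16)]   # (cpu, mem) demands -- example data
capacity = (96, 512)                               # host spec, e.g. 96C512G
hosts = range(len(containers))                     # one potential host per container (upper bound)

prob = LpProblem("vector_bin_packing", LpMinimize)
x = LpVariable.dicts("x", [(i, j) for i in range(len(containers)) for j in hosts], cat=LpBinary)
y = LpVariable.dicts("y", hosts, cat=LpBinary)     # y[j] = 1 if host j is used

prob += lpSum(y[j] for j in hosts)                 # objective: minimize the number of hosts used
for i in range(len(containers)):
    prob += lpSum(x[i, j] for j in hosts) == 1     # each container is placed exactly once
for j in hosts:
    for d in range(2):                             # respect capacity in every resource dimension
        prob += lpSum(containers[i][d] * x[i, j] for i in range(len(containers))) <= capacity[d] * y[j]

prob.solve(PULP_CBC_CMD(msg=False))
print("hosts used:", int(sum(y[j].value() for j in hosts)))
```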
The most commonly used scheduling policies are first-fit (FF) and best-fit (BF). Under first-fit, the scheduler deploys the container on the first physical machine encountered during the traversal that meets all requirements; under best-fit, it selects, among the physical machines that meet the requirements, the one with the highest water level. For the classic bin packing problem (the one-dimensional vector packing problem), the approximation ratio of both first-fit and best-fit is 1.7; that is, both guarantee that the number of machines used is no more than 170% of the optimal solution. For vector packing problems of two or more dimensions, however, there is no known polynomial-time algorithm with a fixed approximation-ratio guarantee. When one resource dimension of the physical machines is clearly the bottleneck, leaving the other dimensions in surplus, the effective dimensionality can be regarded as 1, and first-fit or best-fit generally achieves a good allocation rate. Once the bottleneck is not concentrated in a single dimension, the effectiveness of these two strategies becomes questionable.
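To make the two heuristics concrete, here is a minimal sketch of first-fit and best-fit for a single request; the host representation and the water-level measure are simplified for illustration, and the disaster-recovery and affinity constraints discussed below are omitted.

```python
# Minimal sketch of first-fit and best-fit for one container request.
# Each host is a dict of free resources; the "water level" is approximated
# by how little free capacity remains (a simplification for illustration).

def fits(free, demand):
    return all(free[d] >= demand[d] for d in demand)

def first_fit(hosts, demand):
    # Deploy on the first host encountered that satisfies every dimension.
    for host in hosts:
        if fits(host, demand):
            return host
    return None

def best_fit(hosts, demand):
    # Among feasible hosts, pick the one with the least free capacity left,
    # i.e. the highest water level (dimensions are summed naively here).
    feasible = [h for h in hosts if fits(h, demand)]
    if not feasible:
        return None
    return min(feasible, key=lambda h: sum(h.values()))

hosts = [{"cpu": 96, "mem": 512}, {"cpu": 40, "mem": 100}]
demand = {"cpu": 8, "mem": 16}
chosen = best_fit(hosts, demand)
if chosen:
    for d in demand:
        chosen[d] -= demand[d]      # commit the allocation on the chosen host
```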
In addition to resource requirements, actual scheduling must also consider disaster recovery and interference isolation. For example, containers of the same application cannot all be deployed on the same physical machine, and many applications may have only one instance per machine; some applications are mutually exclusive (for example, because of resource contention that seriously degrades performance) and therefore cannot be deployed on the same physical machine. These constraints make the usual strategies even less suitable. Through repeated manual trial and error, the team barely withstood the pressure of building out sites for several big promotions. However, as the various businesses expand, the scale of online containers keeps growing and resources become increasingly tight, so the efficiency of manual tuning can no longer keep up.
To free the scheduling engineers from this manual tuning and let limited resources carry more load, the decision intelligence algorithm team of DAMO Academy’s Machine Intelligence Technology Lab worked closely with the Sigma scheduling team to study the online scheduling policy problem, and developed an algorithm based on swarm reinforcement learning (SwarmRL).
2. Online scheduling model
A scheduling policy can be expressed as a function π: S×P→A (π∈Π): given the specification vector p∈P of the container currently to be deployed, the cluster resource-allocation state vector s∈S, and the set of candidate physical machines A, the policy selects a physical machine a=π(s,p) on which to deploy the container. The immediate cost of this choice is R(a), and the new cluster state s' is determined by the state s, the specification p, and the action a, written s'=L(s,p,a). Let p' denote the specification of the next container to arrive; for online scheduling, p' is a random quantity. Introducing a discount factor γ∈[0,1], the Bellman equation of the system is:

V^π(s, p) = R(π(s, p)) + γ·E_p'[V^π(s', p')],  where s' = L(s, p, π(s, p))
The optimal scheduling policy can then be expressed as:

π* = argmin_{π∈Π} V^π(s, p)
In theory, we can search the policy space Π for better strategies through stochastic gradient descent, but other methods are needed to optimize further, or even to reach a globally optimal strategy, especially when the objective over policies may be multi-modal.
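One common way to make the policy space searchable, used here purely for illustration and not necessarily the parameterization used in Sigma, is to score every candidate machine with a weighted combination of features and pick the best-scoring one; the weight vector then plays the role of the “position” that SwarmRL moves through policy space in the next section.

```python
import numpy as np

def features(host_free, container_spec):
    # Hypothetical per-host features (illustrative only): leftover CPU and memory
    # after placement, plus the imbalance between the two normalized dimensions.
    cpu_left = host_free[0] - container_spec[0]
    mem_left = host_free[1] - container_spec[1]
    return np.array([cpu_left, mem_left, abs(cpu_left / 96.0 - mem_left / 512.0)])

def scored_policy(weights):
    """A policy pi(s, p): pick the feasible machine with the lowest weighted score."""
    def pi(candidates, container_spec):
        feasible = [h for h in candidates
                    if h[0] >= container_spec[0] and h[1] >= container_spec[1]]
        if not feasible:
            return None
        scores = [float(weights @ features(h, container_spec)) for h in feasible]
        return feasible[int(np.argmin(scores))]
    return pi

# Example: each weight vector is one "position" in policy space that SwarmRL explores.
pi = scored_policy(np.array([1.0, 0.2, 5.0]))
choice = pi([(96, 512), (40, 100)], (8, 16))
```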
3. Swarm reinforcement learning (SwarmRL)
To keep the strategy from falling into a poor local optimum while still converging quickly, we designed the algorithm within a swarm reinforcement learning framework. Compared with traditional reinforcement learning methods, the algorithm uses multiple agents to explore the policy space of the problem, and a mutual learning mechanism among the agents gives the algorithm the ability to escape local traps. To estimate the value V^π of each state, an accurate Sigma simulator is indispensable, so the team developed Cerebro, a “full-fidelity” simulator built on the Sigma scheduler.
The algorithm first randomly initializes the policies of a group of agents and, for each policy, obtains the corresponding state-value estimate through the simulator, recording the current global optimal policy. In each subsequent iteration, every agent updates its own local optimal policy, then updates its current policy by referring to both its local optimum and the group’s current global optimum. Simulation is then carried out to obtain the state-value estimate of the new policy and to update the global optimum. This repeats until the convergence condition is satisfied.
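The iteration just described can be summarized with the following skeleton; the helper callables evaluate and update_velocity are hypothetical placeholders for the value-estimation and velocity-update steps detailed below, not part of any actual SwarmRL library.

```python
import numpy as np

def swarm_rl(num_agents, dim, iters, evaluate, update_velocity, rng=np.random.default_rng(0)):
    # Hypothetical skeleton of the swarm iteration described above.
    pos = rng.uniform(-1, 1, size=(num_agents, dim))   # each row: one agent's policy parameters
    vel = np.zeros_like(pos)
    cost = np.array([evaluate(p) for p in pos])        # simulated state-value estimate per agent
    local_best, local_cost = pos.copy(), cost.copy()
    g = int(np.argmin(cost))
    global_best, global_cost = pos[g].copy(), cost[g]

    for _ in range(iters):
        for i in range(num_agents):
            vel[i], pos[i] = update_velocity(pos[i], vel[i], local_best[i], global_best, rng)
            c = evaluate(pos[i])
            if c < local_cost[i]:                      # refresh the agent's local optimum
                local_best[i], local_cost[i] = pos[i].copy(), c
            if c < global_cost:                        # refresh the swarm's global optimum
                global_best, global_cost = pos[i].copy(), c
    return global_best, global_cost
```

Here evaluate would wrap the simulator-based value estimation sketched after the next paragraph (for example via functools.partial), and update_velocity is the rule sketched at the end of this section.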
To estimate the state value for each agent, samples (several randomly selected cluster snapshots and capacity-expansion request sequences) together with the agent’s current policy are fed into the simulator Cerebro, which tracks the trajectory of the cluster state during simulation and yields the total cost of that trajectory. Averaging the total trajectory cost over multiple samples gives the state-value estimate under the corresponding policy.
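A minimal sketch of this value estimation, assuming a hypothetical Cerebro-like callable simulate(policy, snapshot, requests) that replays one request sequence under the given policy and returns the trajectory’s total cost (the real Cerebro interface is not shown in this article):

```python
import numpy as np

def estimate_value(policy, samples, simulate):
    """Monte Carlo estimate of the state value under `policy`.

    `samples` is a list of (cluster_snapshot, request_sequence) pairs drawn at
    random; `simulate` is a hypothetical Cerebro-like callable that replays the
    requests through the scheduler under `policy` and returns the trajectory's
    total (discounted) cost.
    """
    costs = [simulate(policy, snapshot, requests) for snapshot, requests in samples]
    return float(np.mean(costs))
```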
In SwarmRL, the evolution direction and step size of a policy are represented by a velocity (v), which involves the differences between the local optimal policy (πL), the global optimal policy (πG), and the agent’s current policy (π), and is regulated by the policy inertia factor w, the local learning factor c1 (self-learning), the group learning factor c2 (social-learning), and other parameters:

v ← w·v + c1·ξ1·(πL − π) + c2·ξ2·(πG − π),  π ← φ(π + v)
Here ξ1, ξ2 ∈ [0,1] are random quantities, and φ is the feasibility-keeping mapping used to “pull back” agents that escape the feasible region. During the iterations, each agent’s local optimal policy (πL) and the group’s global optimal policy (πG) are updated continuously:

πL ← π  if V^π < V^πL,    πG ← the best (lowest-value) policy among all agents’ πL
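Putting the update rule into code, under the assumption (for illustration only) that a policy is encoded as a real-valued parameter vector and that φ is implemented as a simple clip back into a feasible box:

```python
import numpy as np

def update_velocity(pi, v, pi_local, pi_global, rng, w=0.7, c1=1.5, c2=1.5, lo=-1.0, hi=1.0):
    # v <- w*v + c1*xi1*(pi_L - pi) + c2*xi2*(pi_G - pi)
    xi1, xi2 = rng.uniform(0, 1, size=2)
    v_new = w * v + c1 * xi1 * (pi_local - pi) + c2 * xi2 * (pi_global - pi)
    # phi: feasibility-keeping map, here a simple clip back into the feasible box.
    pi_new = np.clip(pi + v_new, lo, hi)
    return v_new, pi_new
```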
4. Algorithm application
Let us compare the effect of the algorithm on a small randomly generated example. The example involves 30 applications (see the table below), whose container specifications are mainly 4C8G and 8C16G, and the host machine specification is 96C512G.
If both the order and the number of requests were known at scheduling time (a “God’s-eye view”), so that the layout could be planned after the fact, the allocation rate corresponding to the optimal solution obtained by integer programming is 94.44% (which is also the upper bound on the allocation rate achievable by any scheduling strategy in this example), using 15 hosts in total. The specific arrangement is as follows:
In the real scenario, the order and the container count of each request are revealed only when it reaches Sigma. If best-fit is used for dynamic scheduling, the allocation rate is 70.83% and 20 hosts are used in total. The specific arrangement is as follows:
If the strategy learned by SwarmRL is used for dynamic allocation, the allocation rate is 94.44% and 15 hosts are used in total. The final container layout is as follows:
In this example, the performance of SwarmRL’s learned strategy (94.44%) matches that of the optimal arrangement under the “God’s-eye view” (the upper bound) and is significantly better than best-fit (70.83%), an improvement of 23.61 percentage points.
We then randomly generated large-scale request data: 3K requests and 5K containers in total, whose specification distribution is shown below.
Because the integer programming model in this scenario has too many variables, the “God’s-eye view” optimal solution cannot be obtained directly within a reasonable time. Compared with best-fit (and with the manual strategy), the new strategy learned by the algorithm performs as follows:
Compared with best-fit, the new strategy saves 13 hosts (4.48%) and improves the allocation rate by 4.30%; compared with the manual strategy, it saves 7 hosts (2.46%) and improves the allocation rate by 2.36%.
Considering that in actual scenarios the arrival order of application requests is random, we randomly shuffled the requests to generate multiple different arrival orders and then applied the three strategies to allocate the requests dynamically under each order:
Across the different request orders, the range of the number of hosts used by best-fit was 39, relatively stable compared with the manual strategy’s range of 84 hosts; its fluctuation was roughly half that of the manual strategy. The manual strategy’s average allocation rate dropped to as low as 81.85%, versus 93.44% under the original order, indicating that its performance is unstable and fluctuates severely. The learned strategy, by contrast, is remarkably stable: its range of host counts is only 3, about 1/30 of the manual strategy’s. Its allocation rate is on average 13.78% higher than the manual strategy’s and 3.02% higher than best-fit’s.
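A sketch of this robustness check, assuming a hypothetical run_policy(order) callable per strategy that returns the number of hosts used and the allocation rate for one arrival order:

```python
import random

def robustness_check(policies, requests, trials=20, seed=0):
    # Shuffle the arrival order several times and record each policy's results.
    rng = random.Random(seed)
    results = {name: [] for name in policies}
    for _ in range(trials):
        order = requests[:]
        rng.shuffle(order)
        for name, run_policy in policies.items():
            hosts_used, alloc_rate = run_policy(order)   # hypothetical evaluation call
            results[name].append((hosts_used, alloc_rate))
    # The range (max - min) of hosts used indicates stability across arrival orders.
    ranges = {name: max(h for h, _ in rs) - min(h for h, _ in rs)
              for name, rs in results.items()}
    return ranges, results
```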
5. Summary and outlook
In terms of improving the allocation rate and saving resources, SwarmRL produces better results than the common (and manual) strategies, and its performance is consistently more stable. After the algorithm was deployed to the online environment, the peak allocation rate of the common resource pool was significantly higher than before.
With CPU sharing and co-location (mixed deployment) being rolled out, the new scenarios involve more objectives than allocation rate alone, such as fragmentation and load balancing, and these objectives may even conflict with one another. SwarmRL is naturally suited to multi-objective optimization: a Pareto front can easily be constructed in the policy space. We will therefore continue to study online scheduling strategies for the new scenarios, to fully tap the potential of SwarmRL and further improve Sigma’s scheduling capability.