Brief introduction: Recently, at USENIX ATC 2021, a top international conference on computer systems, the paper "Scaling Large Production Clusters with Partitioned Synchronization" by the Feitian Fuxi team of Alibaba Cloud and the Chinese University of Hong Kong was not only accepted by the conference, but was also awarded one of its three Best Paper Awards by the program committee.
Authors | Yihui Feng, Yunjian Zhao, Yidi Wu, et al. (Feitian Fuxi team, Alibaba Cloud, and the Chinese University of Hong Kong) Source | Alibaba Tech official account
Introduction
ATC is highly influential in the field of computer systems. Since 1992, ATC has been successfully held 31 times, attracting top universities such as Princeton, Stanford, UC Berkeley, and Cornell, Chinese universities such as Tsinghua and Peking University, and technology giants such as Microsoft, Intel, and Samsung to publish research results. ATC requires papers to make fundamental contributions, have forward-looking impact, and include solid system implementation. The USENIX ATC 2021 program committee accepted 64 papers (an 18% acceptance rate), and only three best papers were selected worldwide (the other two are from Stanford University and Columbia University). This is also the first time a Chinese company has won an ATC best paper award.
At this conference, we presented in detail the latest results of the Fuxi 2.0 project, the decentralized scheduling architecture for hyper-scale distributed clusters, and disclosed for the first time the implementation details of Alibaba Cloud's hyper-scale cluster scheduling, another successful demonstration of the core capabilities of the Feitian operating system.
I. Background
In AI and big data computing scenarios, computing demand is growing rapidly and individual cloud clusters keep expanding beyond 10,000 machines: a cluster may contain 100,000 machines and execute billions of tasks, most of them short-lived, every day. Achieving high utilization and low cost at this scale is of great significance. As the core component of a large production cluster, the resource scheduler is responsible for efficiently matching multi-dimensional resource requests with the cluster's machine resources. However, a larger cluster means more concurrent requests, and this "product" effect makes scheduling complexity grow sharply. How to keep a cluster scalable while maintaining good scheduling quality, high concurrency, and low latency is therefore recognized as a very hard problem in the industry. Traditional central schedulers, limited by single-point scheduling capacity, mostly cannot handle production-level scale, nor can they guarantee the stability and robustness needed to make upgrades transparent to users.
II. Status Analysis
1. Job Load
At Alibaba, a single computing cluster runs millions of jobs per day. Figure 1a (solid curve) plots the daily number of jobs processed by a randomly sampled cluster in a given month, ranging from 3.34 million to 4.36 million; each job consists of many tasks. Figure 1a (dotted line) shows that the number of tasks per day ranges roughly from 3.1 billion to 4.4 billion. Most of these are short tasks: as Figure 1b shows, 87% complete in less than 10 seconds. The scheduling load of a large cluster is therefore very heavy.
2. Necessity of upgrading the scheduling architecture
In Fuxi 1.0, the scheduler follows a typical master-worker architecture. The FuxiMaster manages and schedules all resources in the cluster, while an agent named Tubo on each machine periodically synchronizes its status to the FuxiMaster through heartbeat messages. Each job submitted by a user carries the quota group it belongs to, and the maximum and minimum resources each quota group may use are set by SREs. Our quota mechanism not only ensures fairness between quota groups when the cluster is under high load, but also allows groups to borrow idle capacity when the cluster is relatively idle, so that cluster resources are fully used.
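As a toy sketch of how such a min/max quota arbitration might work (our own simplification, not Fuxi's actual algorithm; all names here are hypothetical): every group is first guaranteed its minimum, and any idle capacity is then handed out up to each group's maximum.

```python
def admissible_allocation(demand, min_quota, max_quota, capacity):
    """Toy two-phase quota arbitration: guarantee each group its min,
    then opportunistically fill demand from spare capacity up to its max."""
    # Phase 1: every group gets min(its demand, its guaranteed minimum)
    alloc = {g: min(demand[g], min_quota[g]) for g in demand}
    spare = capacity - sum(alloc.values())
    # Phase 2: distribute the remaining capacity, capped by demand and max quota
    for g in sorted(demand):
        extra = min(demand[g] - alloc[g], max_quota[g] - alloc[g], spare)
        alloc[g] += extra
        spare -= extra
    return alloc
```

For example, two groups demanding 50 and 100 units against a 100-unit cluster each receive their guaranteed minimum first, then split the idle remainder without exceeding their maxima.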
The size of computing clusters has grown significantly in recent years and is likely to exceed 100,000 machines in the foreseeable future. Facing such hyper-scale clusters, one option is to statically divide the cluster into several small clusters, but this has obvious limitations. First, some very large jobs may require more resources than a single small cluster can provide. Second, splitting the cluster fragments its resources, and a local view cannot guarantee globally optimal scheduling. Finally, there are non-technical factors: projects depend on one another, and different projects within the same business unit need to access each other's data, so deploying them in one cluster (rather than splitting them into smaller ones) greatly reduces operation and management costs.
However, a single-master architecture cannot handle a cluster of 100,000 machines, for two main reasons: 1) as the cluster grows, heartbeat delays between the master and workers increase because of the limited processing capacity of the single scheduler, so scheduling information is not delivered in time and cluster utilization drops; 2) larger scale means higher task concurrency, which sharply increases scheduling complexity until it exceeds the capacity of the single scheduler.
3. Scheduling Objectives and Challenges
In addition to the scalability challenge, the scheduler should also make trade-offs among the following scheduling goals. The main goals we focus on include:
- Scheduling efficiency (or latency): how long a task must wait for resources; a good scheduler keeps resources flowing quickly.
- Scheduling quality: whether resource constraints are satisfied, such as data locality, larger memory, or a faster CPU model.
- Fairness and priority: In a multi-tenant shared production environment, you need to ensure the fairness of resource usage among tenants and provide a guarantee mechanism for high-priority jobs.
- Resource utilization: an extremely important goal; low cluster utilization brings many problems, especially financial ones.
However, these goals are often in conflict with each other. For example, better scheduling results often mean longer scheduling delays, and absolute fairness sometimes leads to under-utilization of resources, resulting in a decrease in cluster utilization.
After years of accumulation, the Fuxi resource scheduler has, through a variety of strategies, achieved a good balance among the major goals above. However, considering the application components built by other teams around the resource scheduler, a new scheduler design should change as little as possible, to preserve the system's robustness and backward compatibility. System upgrades introduced by the scheduler architecture change should be transparent to users, both internal and external.
III. Theoretical Overview
For the scaling problem of schedulers, we surveyed the industry's existing scheduling models extensively (see the paper) and selected the one most suitable for our scenario, Omega, for further analysis. The shared-state multi-scheduler architecture represented by Omega satisfies the two constraints mentioned earlier: backward compatibility and user transparency. However, a shared-state solution inevitably introduces scheduling conflicts, so we wanted to clarify the following questions:
- What factors affect conflicts, and how much does each contribute?
- How bad can the scheduling delays get?
- How can conflicts be avoided or mitigated?
First, we model conflicts and derive the expected number of conflicts (see the paper for the formula and its derivation).
In the formula, Yi is the expected number of conflicts among schedulers on a slot, N is the number of schedulers, K is the processing capacity of a single scheduler, and S is the number of schedulable slots. To reduce the probability of conflict, one can therefore increase S or N. Increasing S is intuitive: providing extra resources reduces contention. Increasing N seems counterintuitive, because more schedulers would appear to make conflicts more likely. Although more conflicts do occur during a scheduling round, each scheduler is initially assigned a smaller share of the scheduling load, leaving it more time to resolve conflicts, which ultimately reduces the conflict probability. In short, with the total load held constant, increasing N reduces conflicts by reducing the pressure on each scheduler.
In addition, we also prove through the formula that conflicts cannot be completely eliminated when the number of schedulers is greater than 1.
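As a toy illustration of the S lever (our own Monte Carlo simplification, not the paper's model; function and variable names are hypothetical), the sketch below lets N schedulers commit tasks against a shared pool of S slots and counts failed commits:

```python
import random

def count_conflicts(n_schedulers, total_tasks, n_slots, seed=42):
    """Each scheduler places its share of tasks by picking slots from a
    stale shared view; a conflict is a commit to an already-taken slot."""
    rng = random.Random(seed)
    taken = set()
    conflicts = 0
    for s in range(n_schedulers):
        for _ in range(total_tasks // n_schedulers):
            slot = rng.randrange(n_slots)
            if slot in taken:
                conflicts += 1  # commit failed; the task would need rescheduling
            else:
                taken.add(slot)
    return conflicts
```

With the total load fixed, raising S (adding extra resources) drives the conflict count down sharply. The N effect is not captured here, since it depends on each scheduler's spare capacity to retry conflicting commits, which this sketch does not model.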
The following experiments show how different factors affect conflicts:
- Figure A considers the impact of task load on conflicts. R represents the rate at which a scheduler receives tasks. With the same number of schedulers, as R increases, more additional slots are needed to keep the number of conflicts unchanged. Conversely, with R fixed, as the number of schedulers increases, the scheduling pressure borne by each scheduler decreases, and fewer additional slots are needed.
- Figure B shows the impact of resource view synchronization frequency changes on conflicts. G represents the synchronization delay. It can be seen that with the same number of schedulers, with the increase of G, in order to keep the number of conflicts unchanged, more additional slots need to be added. Conversely, under the condition that G is unchanged, the scheduling pressure of each scheduler decreases with the increase of the number of schedulers, and the number of additional slots that need to be supplemented decreases.
- Figure C reflects the impact of machine score (such as better hardware performance) on the conflict, and V represents the variance of machine score. It can be seen that with the same number of schedulers, with the increase of V, more additional slots are needed to keep the conflict unchanged. Conversely, under the condition that V remains unchanged, with the increase of the number of schedulers, the scheduling pressure borne by each scheduler decreases, and the number of additional slots that need to be supplemented decreases.
- Figure D shows the effect of the number of machine partitions on conflicts. This factor has little effect: regardless of the partition count, each scheduler always schedules using its own internal view, so even when parts of that view are stale, the number of partitions does not significantly change the conflict rate.
From the above analysis, it is easy to see that in a shared-state architecture, conflicts can be reduced either by adding extra resources or by adding schedulers. In a real production environment, however, neither is practical: the cluster size is relatively fixed and introducing new slots would significantly increase cluster cost, while adding schedulers brings significant maintenance and upgrade costs.
IV. Scheme Implementation
Since our goal is to reduce conflicts, let us first briefly introduce a strategy that eliminates conflicts completely: pessimistic locking. Under pessimistic locking, the machines each scheduler may schedule are statically partitioned and exclusively owned, which obviously eliminates conflicts but hurts utilization, since resources that other schedulers could have used are wasted. Another strategy is lock-based preemptive scheduling implemented with components such as ZooKeeper: while a batch of machines is locked by one scheduler, the others cannot schedule on them because they cannot acquire the machine lock; when the lock holder releases it, other schedulers can compete for the lock and try to schedule. In large-scale, highly concurrent scheduling scenarios, however, such high-frequency interactions severely hurt scheduling efficiency.
From the previous analysis we know that reducing the resource-synchronization delay effectively reduces conflicts, so we propose a strategy based on "partitioned synchronization" (hereafter ParSync): the machines in the cluster are divided into P partitions, with P > N. A round-robin policy lets each scheduler synchronize the resource view of a different partition at the same time. This ensures that every scheduler updates the resources of all partitions over successive rounds, while P > N guarantees that no two schedulers synchronize the same partition simultaneously.
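The round-robin assignment can be sketched in a few lines (an illustration under our own naming, not Fuxi's actual code): scheduler i refreshes partition (i + round) mod P in each round, and P > N guarantees that no two schedulers refresh the same partition in the same round.

```python
def partition_to_sync(scheduler_id, round_no, n_partitions):
    """Which partition's view scheduler `scheduler_id` refreshes this round."""
    return (scheduler_id + round_no) % n_partitions

def round_assignment(n_schedulers, round_no, n_partitions):
    """Partitions refreshed by all schedulers in one round; distinct when P > N."""
    assert n_partitions > n_schedulers, "ParSync requires P > N"
    return [partition_to_sync(i, round_no, n_partitions)
            for i in range(n_schedulers)]
```

After P consecutive rounds, every scheduler has refreshed every partition exactly once, so no partition's view ever grows staler than P rounds.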
As mentioned above, in large-scale, highly concurrent scheduling scenarios the scheduler's priority is often speed rather than locality, so each scheduler preferentially schedules resources within its most recently synchronized partition. Under this policy, resource conflicts do not occur between schedulers. ParSync effectively reduces each partition's synchronization delay as seen by the scheduler currently syncing it: from that scheduler's perspective, the partition's synchronization delay is lower than it is for the other schedulers, which is the theoretical reason ParSync effectively reduces conflicts.
We provide three scheduling policies: latency-first, quality-first, and adaptive. Latency-first schedules resources in the most recently synchronized partition first; quality-first schedules on the best-scoring machines first; adaptive starts with quality-first and switches to latency-first once a request's waiting time exceeds a threshold. We verified the effect of the three policies with a set of experiments. The schedulers are divided into two classes, A and B. In phase 1, both classes receive scheduling requests equal to 2/3 of their scheduling capacity. In phase 2, class A schedulers receive requests equal to their full capacity while class B is unchanged. In phase 3, both classes receive requests equal to their full capacity.
- From (a) and (b), in phases 1, 2, and 3 the scheduling quality is very good, but in phases 2 and 3, as scheduling pressure grows, scheduling latency increases linearly, which matches intuition.
- From (c) and (d), it can be seen that when the scheduler pressure in phase 1 is only 2/3, both scheduling latency and quality have good performance. As scheduling pressure increased, quality began to decline, but latency increased only slightly, which was intuitively expected.
- From (e) and (f), in phase 2 the class B schedulers' load is still only 2/3, so they stay with the quality-first policy, while class A schedulers switch to latency-first as their scheduling latency grows. A careful look at the quality plot shows that the class B schedulers' quality (light gray) is superior to that of the latency-first schedulers. In phase 3, both class A and class B schedulers keep scheduling latency within the threshold (1.5 s), in line with design expectations.
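The three policies above can be sketched as follows (a minimal illustration; the candidate representation and the 1.5 s threshold are taken loosely from the experiment description, and all names are ours, not Fuxi's):

```python
def pick_slot(candidates, waited_seconds, latency_threshold=1.5):
    """Adaptive policy sketch. `candidates` is a list of
    (staleness_rounds, quality_score) tuples for schedulable slots:
    quality-first while the request has waited less than the threshold,
    then latency-first (freshest partition view, i.e. lowest staleness)."""
    if waited_seconds <= latency_threshold:
        return max(candidates, key=lambda c: c[1])  # quality-first
    return min(candidates, key=lambda c: c[0])      # latency-first
```

A request that has waited only 0.5 s still chases the best-scoring machine; once it has waited past 1.5 s, it settles for the freshest view to get placed quickly.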
V. Experimental Analysis
We validated the entire scheduling framework with a "wind tunnel" system. In the wind-tunnel environment, the scheduler is real while the per-machine agents and AMs are simulated by programs: they interact with the scheduler's real resource protocol and emulate job execution with sleep, so that one physical node can simulate 500 nodes (a 1:500 ratio).
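The sleep-based emulation could look roughly like this (purely illustrative; the names and the fixed ratio are our assumptions, not the wind tunnel's actual code):

```python
import time

SIM_RATIO = 500  # one physical machine emulates 500 simulated nodes

def simulate_task(real_runtime_seconds):
    """Emulate a task's execution by sleeping for its wall-clock runtime
    compressed by the simulation ratio, then reporting completion."""
    time.sleep(real_runtime_seconds / SIM_RATIO)
    return "finished"
```

Because the scheduler-facing protocol is unchanged, it cannot tell a compressed-time simulated task from a real one, which is what makes 1:500 scale testing possible on modest hardware.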
Test environment:
- 20 schedulers, 2 Resource Managers
- The cluster contains 200,000 slots
- Each scheduler has a scheduling capacity of 40K tasks/s
- Scheduling runs through five phases, with load at 50%, 80%, 95%, 80%, and 50% of scheduler capacity respectively
As Figure (a) shows, across the changes in scheduling pressure, latency-first outperforms adaptive in scheduling latency, and quality-first outperforms StateSync (the view-synchronization strategy represented by Omega, in which a scheduler synchronizes the entire view each time). The latency-first strategy keeps latency at a very good level, whereas StateSync's latency is already out of control, a good demonstration of ParSync's effectiveness at controlling conflicts. The quality-first strategy's latency is also out of control, a side effect of schedulers trying to schedule on the highest-scoring machines. The adaptive strategy strikes a good compromise between latency-first and quality-first.
As Figure (b) shows, across the changes in scheduling pressure, the latency-first and adaptive strategies show a clear decline in quality, in line with expectations. Quality-first performs roughly on par with StateSync.
To sum up, ParSync’s latency was much better than StateSync’s when its quality performance was equal to that of StateSync. More detailed data analysis can be found in the paper.
VI. Summary
This article first introduced the common industry practice for solving the scale problem of distributed schedulers, joint scheduling by multiple schedulers, and then introduced StateSync, a common resource-provisioning model for multiple schedulers. In StateSync mode, serious scheduling conflicts occur between schedulers, hurting the cluster's scheduling efficiency and utilization. Through theoretical analysis, the paper identified two ways to mitigate conflicts, adding slots or adding schedulers, but neither is acceptable in practice.
This article then proposed a new resource-provisioning model, ParSync, in which different schedulers update machine resources in round-robin fashion. ParSync provides three scheduling policies, latency-first, quality-first, and adaptive, to meet the latency/quality requirements of schedulers in different scenarios. Large-scale experiments show that ParSync matches other schedulers in scheduling quality but is far superior in scheduling latency.
Appendix
Resource scheduling architecture of a production cluster
This article is original content from Alibaba Cloud and may not be reproduced without permission.