Authors | Huang Tao, Wang Menghai    Source | Alibaba Cloud Native public account

As a further evolution of cloud computing, cloud native is becoming the new technical standard of the cloud era and, by reshaping the entire software life cycle, the shortest path to unlocking the value of the cloud.

Within enterprises, adopting cloud native infrastructure as a unified architecture has become a trend. At the same time, integrating the various existing base platforms inevitably brings compatibility problems, and for enterprises with larger scale and more historical accumulation this "technical debt" is even more pronounced.

The experience shared in this article comes from Alibaba's production practice of colocated scheduling accumulated over the past few years, and therefore has strong practical value. Starting from a brief introduction, the content gradually goes deeper into the internals of the scheduler, explaining how the scheduler of ASI (Alibaba Serverless Infrastructure), Alibaba's unified infrastructure designed for cloud native applications, manages Alibaba's complex and busy resource scheduling tasks in large-scale container scheduling scenarios, and uses concrete cases to help you understand it fully. We believe it will open up design ideas for readers facing similar problems and serve as a reference for putting them into practice. Through this article, you should gain a systematic understanding of how Alibaba schedules resources for its complex mix of task scenarios.

Scheduler overview

ASI leads the implementation of the container cloud within Alibaba Group, taking on the responsibilities of evolving the lightweight container architecture and making the operation and maintenance system cloud native, and further accelerating the adoption of emerging technologies such as Mesh, Serverless and FaaS within the group. It supports nearly all of the group's internal businesses, such as Taobao, Tmall, Youku, AutoNavi, Ele.me, UC and Kaola, as well as the numerous scenarios and ecosystems of Alibaba Cloud products.

The core of ASI is based on Kubernetes, and it provides a complete cloud native technology stack. ASI has now successfully converged with Alibaba Cloud Container Service for Kubernetes (ACK); ACK not only retains its various capabilities on the cloud, but also successfully handles the complex business environment of Alibaba Group.

The ASI scheduler is one of the core components of ASI cloud native, and its role has been critical throughout ASI's cloud native evolution. The most intuitive way to see this is that Alibaba's massive online e-commerce transaction containers, such as shopping cart, order and Taobao product details, are all placed by the scheduler, which handles container placement and the allocation of single-machine CPU and memory resources. Especially at the Double 11 zero-hour peak, even a few container placement errors can have a fatal impact on the business, so the scheduler, which controls the compute quality of every container at the peak, is critically important.

The ASI scheduler originated in the scheduling of online e-commerce transaction containers in 2015. The earliest scheduler of that year covered only the online transaction T4 containers (Alibaba's early container technology, customized on LXC and the Linux kernel) and the AliDocker scenario, so it carried heavy responsibilities from birth, and it did its part in withstanding the Double 11 peak traffic of 2015.

The evolution of the ASI scheduler has followed the whole course of cloud native development. It has gone through the earliest online transaction container scheduler, the Sigma scheduler, the Cerebulum scheduler and the ASI scheduler; we are now building the next-generation Unified Scheduler, which will further absorb and integrate the advanced experience accumulated over the past few years in fields such as Alibaba ODPS (Fuxi), Hippo search scheduling and online scheduling. The scheduler at each stage is interpreted as follows:

Many challenges needed to be addressed during the evolution of the ASI scheduler, including:

  • The scheduler handles a wide variety of tasks, including a large number of online long-life-cycle containers and Pod instances, Batch tasks, numerous forms of BestEffort tasks, and other tasks at different SLO levels; it also handles many resource types, such as compute, storage, network and heterogeneous resources. The demands and scenarios of different tasks vary greatly.

  • The host resources being scheduled vary widely. The scheduler manages a large number of host resources within Alibaba Group, including legacy non-cloud physical machines, X-Dragon (Shenlong) bare-metal servers, ECS, and heterogeneous models such as GPU/FPGA.

  • The scheduler serves a wide range of business scenarios, for example: the most typical pan-transaction scenario; the most complex middleware scenario; emerging computing scenarios such as FaaS, Serverless, Mesh and Job; emerging ecosystem scenarios such as Ele.me and Kaola; the public cloud, with its demands for multi-tenant security isolation; and the globally challenging unified scheduling scenarios spanning ODPS (Fuxi), Hippo, Ant and ASI.

  • The responsibilities at the infrastructure level are numerous. The scheduler is responsible for defining the infrastructure resource model, integrating compute, storage and network resources, converging hardware forms, keeping the infrastructure transparent to the business, and many other duties.

For the detailed development of Alibaba's cloud native journey, interested readers can refer to the article "A box that changes the world". Below, we focus on how the ASI scheduler manages Alibaba's huge, complex and busy computing resource scheduling tasks.

A first look at the scheduler

1. What is a scheduler

The scheduler sits at the center of ASI's many components; it is one of the core components of the scheduling system in a cloud native container platform, the cornerstone of resource delivery, and the brain of ASI cloud native. The value of the scheduler is mainly reflected in:

  • Powerful & Scenario-rich resource delivery (computing, storage)

  • Cost-effective resource delivery

  • Stable optimal resource delivery while the business is running

More generally, what the scheduler does is:

  • Optimal job scheduling: select the most suitable host in the cluster and run the computing jobs submitted by users on it with the best resource usage and the least mutual interference (for example in CPU allocation and IO contention).

  • Optimal global cluster scheduling: ensure globally optimal resource placement (for example minimal fragmentation), the most stable resource operation, and the lowest global cost.

In the ASI cloud native system, the position of the central scheduler is shown in the figure below (the box marked in red):

2. Generalized scheduler

Most of the time, what the industry refers to is the "central scheduler", such as the community's K8s kube-scheduler. But real scheduling scenarios are complex, and every scheduling decision is a complex and flexible piece of coordination: after a job is submitted, it needs the cooperation of the central scheduler, the single-machine scheduler, the kernel scheduler and multi-layer schedulers, and the execution is then completed together with K8s components such as Kubelet and the controllers. In online scheduling scenarios there is also batch-style orchestration, and multiple rounds of rescheduling keep the cluster optimal over time.

The ASI scheduler in its broad sense should be understood as a composite of central scheduling, single-machine scheduling, kernel scheduling, rescheduling, large-scale orchestration and multi-layer scheduling.

1) Central scheduler

The central scheduler is responsible for computing the resource placement for each job (or each batch of jobs), ensuring an optimal one-time scheduling decision. It works out the cluster, the region and the execution node (host) for a given task, and further refines the allocation of CPU, storage and network resources on that node.

The central scheduler manages the life cycle of most tasks in collaboration with the K8s ecosystem components.

In the cloud native evolution of ASI, the central scheduler has been, successively, the Sigma scheduler, the Cerebulum scheduler and the ASI scheduler described above.
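
To make the central scheduler's job more concrete, here is a minimal, self-contained Go sketch of the classic filter-then-score flow it performs for each Pod: filter out nodes that cannot fit the request, then score the remaining nodes, for example preferring the tightest fit to reduce fragmentation. The types and the scoring rule are illustrative assumptions, not ASI's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Node and PodRequest are illustrative structures, not ASI's real types.
type Node struct {
	Name          string
	FreeCPUMilli  int64 // unallocated CPU in millicores
	FreeMemoryMiB int64
}

type PodRequest struct {
	CPUMilli  int64
	MemoryMiB int64
}

// fits is the "filter" phase: can this node host the Pod at all?
func fits(n Node, p PodRequest) bool {
	return n.FreeCPUMilli >= p.CPUMilli && n.FreeMemoryMiB >= p.MemoryMiB
}

// score is the "score" phase: prefer the node that leaves the least
// leftover capacity (a simple bin-packing heuristic against fragmentation).
func score(n Node, p PodRequest) int64 {
	leftCPU := n.FreeCPUMilli - p.CPUMilli
	leftMem := n.FreeMemoryMiB - p.MemoryMiB
	return -(leftCPU + leftMem) // higher is better
}

// schedule picks the best node for one Pod, or reports failure.
func schedule(nodes []Node, p PodRequest) (string, error) {
	bestIdx, bestScore := -1, int64(0)
	for i, n := range nodes {
		if !fits(n, p) {
			continue
		}
		s := score(n, p)
		if bestIdx == -1 || s > bestScore {
			bestIdx, bestScore = i, s
		}
	}
	if bestIdx == -1 {
		return "", errors.New("no feasible node")
	}
	return nodes[bestIdx].Name, nil
}

func main() {
	nodes := []Node{
		{"node-a", 4000, 8192},
		{"node-b", 16000, 65536},
	}
	target, err := schedule(nodes, PodRequest{CPUMilli: 2000, MemoryMiB: 4096})
	fmt.Println(target, err) // node-a <nil>: tighter fit, less fragmentation
}
```

The real central scheduler of course goes further on the chosen node, refining CPU topology, storage and network allocation, which this sketch omits.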

2) Single machine scheduling

Single-machine scheduling is mainly responsible for two types of duties:

The first is to coordinate the optimal operation of multiple Pods on a single machine. After the central scheduler issues its node selection decision, ASI dispatches the task to that specific node for execution, and single-machine scheduling then takes over:

  • Single-machine scheduling dynamically ensures the optimal operation of the Pods on the machine, whether in real time, periodically, or through operational actions. This means coordinating the resources within the machine, such as continuously tuning the CPU core allocation of each Pod towards the optimum.
  • Based on real-time Pod operating metrics such as load and QPS, it performs in-place VPA expansion or shrinking of certain runtime resources on the machine, or evicts low-priority tasks; for example, dynamically expanding a Pod's CPU capacity.

The second is collecting, reporting and aggregating single-machine resource information to provide the basis for central scheduling decisions. In ASI, the single-machine scheduling components mainly refer to SLO-Agent and the enhanced parts of Kubelet; in the Unified Scheduler under construction, they mainly refer to SLO-Agent, Task-Agent and the enhanced parts of Kubelet.
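
As a concrete illustration of the first duty above, the following Go sketch (assumed names and thresholds; this is not SLO-Agent code) shows a node-local decision that reacts to CPU pressure by evicting BestEffort Pods first, leaving Pods with a higher SLO untouched:

```go
package main

import "fmt"

// podOnNode is an illustrative view of a running Pod held by a node agent.
type podOnNode struct {
	Name     string
	QoS      string  // "LSE", "LSR", "LS", "BE" as described later in this article
	CPUUsage float64 // cores actually used
}

// pickEvictions walks the node's Pods and selects BE Pods to evict until the
// projected CPU usage falls back under the watermark.
func pickEvictions(pods []podOnNode, nodeUsage, capacity, watermark float64) []string {
	var victims []string
	for _, p := range pods {
		if nodeUsage/capacity <= watermark {
			break
		}
		if p.QoS == "BE" { // only best-effort work is sacrificed here
			victims = append(victims, p.Name)
			nodeUsage -= p.CPUUsage
		}
	}
	return victims
}

func main() {
	pods := []podOnNode{
		{"carts-1", "LSR", 20},
		{"ut-job-7", "BE", 18},
		{"batch-3", "BE", 10},
	}
	fmt.Println(pickEvictions(pods, 88, 96, 0.75)) // [ut-job-7]
}
```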

3) Kernel scheduling

Single-machine scheduling coordinates the optimal operation of multiple Pods on a machine from the resource perspective, but the actual running state of the tasks is controlled by the kernel. This is where kernel scheduling comes in.

4) Rescheduling

The central scheduler ensures optimal scheduling for each task, that is, it solves the one-time scheduling problem; but it cannot by itself achieve global optimization at the cluster level, and that is what rescheduling is for.

5) Large-scale scheduling

Large-scale orchestration is a scenario unique to Alibaba's large-scale online scheduling. Built since 2017, it has become very mature and is still being continuously enhanced.

With large-scale orchestration, we can schedule tens of thousands or even hundreds of thousands of containers at a time, ensuring globally optimal placement for all containers at the cluster level in one pass. It cleverly compensates for the shortcomings of one-at-a-time central scheduling and avoids the complexity of repeated rescheduling in large-scale site-building scenarios.

We will elaborate on kernel scheduling, rescheduling, and scale scheduling in the following sections.

6) Scheduling stratification

In another dimension, we also define scheduling layers, such as layer-1 scheduling, layer-2 scheduling, layer-3 scheduling, and so on; Sigma even introduced the concept of layer-0 scheduling for online-offline colocation scenarios. Each scheduling system has its own understanding and definition of these layers. For example, in the earlier Sigma system, scheduling was divided into layers 0, 1 and 2:

  • The layer-0 scheduler was responsible for the global resource view and its management, arbitrating between the layer-1 schedulers, and carrying out the actual execution. Layer-1 scheduling mainly corresponded to the Sigma scheduler and the Fuxi scheduler (and could also include other schedulers).
  • In the Sigma system, the Sigma scheduler acted as the layer-1 scheduler and was responsible for resource-layer allocation.
  • Layer-2 scheduling was implemented by the various consuming services (such as e-commerce transaction, the advertising Captain system, the database AliDB, and so on). Layer-2 scheduling stayed fully close to, and understood, its own business, and built scheduling capabilities with numerous optimizations from the business's global perspective, such as business-aware eviction and automated operation of stateful applications, thereby providing tailored services.

The fatal drawback of Sigma's layered scheduling system was that the technical capability and investment of the different layer-2 schedulers were uneven. The layer-2 scheduling system for advertising, for example, was excellent, but by no means were all layer-2 schedulers built that well for their businesses. ASI learned this lesson, sank many capabilities down into ASI itself, and further standardized the upper-layer PaaS, simplifying the upper layers while strengthening their capabilities.

The next-generation scheduler now under construction is also conceptually divided into multiple layers, such as the compute workload layer (mainly Workload scheduling management), the compute scheduling layer (such as DAG scheduling and MR scheduling), and the business layer (the same concept as Sigma's layer 2).

3. Schedule resource types

To make this easier to understand, I will use the Unified Scheduler currently being built as the example. In the Unified Scheduler, three tiers of resources are scheduled: Product resources, Batch resources, and BE (Best Effort) computing resources.

Different schedulers have different definitions of hierarchical resource forms, but they are essentially the same. To give you a better understanding of this essence, I describe the ASI scheduler in detail in subsequent chapters.

1) Product (online) resources

Product (online) resources have a Quota budget, and the scheduler must guarantee their highest level of resource availability. Typical examples are the long-life-cycle Pod instances of core online e-commerce transactions; the most classic are the Pods of core trading businesses on the Double 11 critical path, such as shopping cart (Carts2) and order (tradeplatform2). These resources demand strict guarantees of computing power, high priority, real-time responsiveness, low response latency, freedom from interference, and so on.

Take the long-life-cycle Pods of online transactions as an example: they exist for a long time, from days to months or even years. Most applications need to request several long-life-cycle instances after they are built, and these are Product resources. Taobao, Tmall, Juhuasuan, AutoNavi, Umeng, Heyi, Cainiao, the international businesses, Xianyu... the Pod (or container) instances requested by the development teams of these many businesses are mostly Product resources.

Product resources are not only online long-life-cycle Pods; any resource request that meets the above definition is a Product resource. However, not every long-life-cycle Pod is a Product resource. For example, the Pods used by Alibaba's internal "Aone Lab" to run CI build tasks can live for a long time, but they can be preempted and evicted at low cost.

2) Batch resources

The Gap between the Allocated amount and the actual Usage of the Product resources used by online businesses is relatively stable over a period of time; this Gap, together with the resources not yet allocated to Prod, is treated as Batch resources and sold to businesses that are less latency-sensitive but still have certain requirements for resource stability. Batch resources have a Quota budget, but only guarantee resource availability with a certain probability (say 90%) over a certain period (say 10 minutes).

In other words, a Product (online) resource request books resources on the account, but judging from load utilization metrics a lot of that computing power may in fact never be used. In this case, the scheduler's differentiated-SLO hierarchical scheduling capability comes into play, and the portion that is not fully used is over-sold as Batch resources.

3) Best Effort (BE) resources

BE resources have no Quota budget and no guarantee of resource availability; they can be suppressed or preempted at any time. On a node, the portion of allocated resources whose Usage stays below a certain watermark is regarded by the scheduler as "less stable / non-billable" resources, and these are called BE resources.

To put it figuratively, Product and Batch resources carve up the meat, while BE resources live off the scraps left over by Product and Batch. For example, in daily development work, engineers need to run large numbers of UT test tasks, which have low requirements on the quality of computing resources, a high tolerance for latency, and a budget that is hard to estimate. Buying a large amount of Product or Batch resources for such scenarios would be very uneconomical, but using the cheapest BE resources pays off handsomely. In this case, the BE resources are precisely the resources that Product/Batch are not using at runtime.

It is easy to see that it is precisely this hierarchical resource scheduling capability that allows the Unified Scheduler to squeeze the most out of a physical node's resources at the technical level.
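
To make the three tiers concrete, here is a simplified Go sketch of how a node's sellable capacity per tier could be derived from the definitions above: Product capacity comes from what is still unallocated, Batch capacity is over-sold from the gap between Product allocation and Product usage, and BE capacity is whatever real usage still leaves free below a safety watermark. The formulas and field names are illustrative assumptions, not Unified Scheduler internals.

```go
package main

import "fmt"

// nodeStats is an illustrative snapshot of one node.
type nodeStats struct {
	Allocatable      float64 // CPU cores the node can hand out to Product
	ProductAllocated float64 // cores already booked by Product Pods
	ProductUsage     float64 // cores Product Pods actually use (smoothed)
	TotalUsage       float64 // cores used by everything on the node
}

// tierCapacity derives per-tier sellable CPU following the article's definitions.
func tierCapacity(s nodeStats, beWatermark float64) (product, batch, be float64) {
	// Product: allocatable capacity not yet booked by Product Pods.
	product = s.Allocatable - s.ProductAllocated

	// Batch: the stable gap between Product allocation and Product usage,
	// plus unallocated capacity, over-sold with a weaker SLO.
	batch = (s.ProductAllocated - s.ProductUsage) + product

	// BE: only the room left under a safety watermark of real usage,
	// with no availability guarantee at all.
	be = s.Allocatable*beWatermark - s.TotalUsage
	if be < 0 {
		be = 0
	}
	return product, batch, be
}

func main() {
	s := nodeStats{Allocatable: 96, ProductAllocated: 80, ProductUsage: 35, TotalUsage: 50}
	p, b, e := tierCapacity(s, 0.65)
	fmt.Printf("product=%.0f batch=%.0f be=%.1f\n", p, b, e) // product=16 batch=61 be=12.4
}
```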

Overview of scheduler capabilities

The following figure shows an overview of ASI's scheduling capabilities, organized around the responsibilities covered by the generalized scheduler and the diverse business scenarios at different resource levels. From this diagram you can get a technical overview of the ASI scheduler.

Typical online scheduling capabilities

1. Service requirements for online scheduling

On the ASI cloud native container platform, the online side serves the scheduling scenarios of dozens of BUs such as transaction, shopping guide, live streaming, video, local life, Cainiao, AutoNavi, Heyi, Umeng and the overseas businesses. Among them, scheduling of the highest-level "Product resources" accounts for the largest share.

Online business scheduling differs in typical ways from offline scheduling and job-style scheduling. (As you read about the online scenarios, keep in mind that the world of offline scheduling is just as rich.)

1) Life cycle

  • Long running: the containers of online applications tend to live a long time, at least a few days, mostly months, with some long-tail applications even living for years.
  • Long startup time: application images are large and take a long time to download, and service startup requires memory warm-up, so starting an application can take anywhere from several seconds to tens of minutes.

This long life cycle is fundamentally different from typical short-life-cycle task scheduling (such as FaaS function computing), and the technical challenges behind the two differ greatly. For example, the challenges of the relatively short-lived function computing scenario are extreme scheduling efficiency, hundred-millisecond execution, high scheduling throughput, Pod runtime performance, and so on. The differentiated challenges brought by long-life-cycle Pods are different: globally optimal scheduling must rely on rescheduling for continuous, iterative optimization, and optimal behavior at runtime must rely on the continuous optimization of single-machine rescheduling. As you can imagine, in the pre-cloud-native era many services could not be migrated at all, which was a nightmare for scheduling; it means the scheduler faces not only the technical problems of scheduling capability but also the enormous difficulty of driving the governance of legacy workloads. The long startup time of online applications further reduces the flexibility of rescheduling and adds more complexity.

2) Container runtime

  • Containers must support real-time service interaction, fast response, and low service RT. Most online containers handle real-time interactions at runtime and are extremely latency-sensitive; even a small amount of added latency makes the user experience noticeably worse.

  • Resource characteristics are pronounced, such as network-intensive, IO-intensive or compute-intensive consumption. When instances with the same characteristics are co-located, obvious resource contention easily arises between them.

The runtime of online container is very sensitive to both business and computing power, so it poses a very demanding challenge to scheduling quality.

3) Coping with the complex business models unique to Alibaba's online applications

  • Pronounced traffic peaks and troughs: online services generally have obvious peaks and troughs. For example, Ele.me peaks at noon and in the evening, and Taobao also has clear troughs and peaks.

  • Burst traffic: because the business is complex, burst traffic is not necessarily regular. For example, a live-streaming service may see a traffic surge because of an unexpected event. The technical demand behind burst traffic is usually elasticity; the most classic case is DingTalk's demand for elasticity during the epidemic in 2020.

  • Resource redundancy: online businesses have kept redundant resources from the day they were born, mainly for disaster recovery. However, looking at Alibaba as a whole, quite a number of long-tail applications, because of their small scale, are insensitive to cost and utilization, which is a huge waste of computing power.

4) Unique demands for large-scale operation and maintenance

  • Complex deployment models: for example, applications require unitized deployment, multi-datacenter disaster recovery, and multi-environment deployment across small-traffic, gray-release and production environments, all of which are complex scheduling requirements.

  • Large promotions & flash-sale peaks: Alibaba's various promotions run throughout the year, such as the familiar Double 11, Double 12 and Spring Festival red envelopes. Application pressure and resource consumption along the whole critical path grow exponentially as peak traffic rises, which demands powerful large-scale scheduling capabilities from the scheduler.

  • Promotion site building: promotion dates are planned in advance. To save on the procurement cost of cloud resources, the time for which those resources are held must be kept as short as possible. The scheduler needs to finish building the promotion site as quickly as possible before the event and return the resources to Alibaba Cloud quickly afterwards. This translates into extremely demanding requirements on efficiency at scale, so that more time is left for the business.

2. One-time scheduling: basic scheduling capabilities

The following table details the most common scheduling capabilities for online scheduling:

  • Basic application demands refer to the basic demands of application scale-out, such as Pod specifications and OS. Inside the ASI scheduler, these are abstracted as ordinary label-matching scheduling.

  • Locality scheduling: ASI has learned many details of the underlying infrastructure through various means, such as the network core switches and ASWs (access switches) shown in the preceding figure.

  • Advanced policies: ASI standardizes and generalizes business demands as far as possible, but some businesses inevitably have specific requirements on resources and runtime, such as a particular infrastructure environment (hardware) or particular container capabilities (HostConfig parameters, kernel parameters, and so on).

  • The scheduling rule center: the businesses' specific policy requirements imply a powerful scheduling policy center behind the scheduler, which guides it to apply the correct scheduling rules; the data in the rule center comes from machine learning or from expert operations experience. The scheduler takes these rules and applies them to the resource allocation of each Pod.

3. Scheduling policies between applications

Because the number of cluster nodes is limited, many applications that may interfere with each other have to coexist on the same node. This calls for inter-application orchestration policies that ensure the optimal runtime of every host node and every Pod.

In actual scheduling production practice, "business stability" always comes first, yet resources are always limited, and it is difficult to balance "optimal resource cost" against "business stability". For the most part, inter-application orchestration resolves this balance well: by defining co-location policies between applications (such as CPU-intensive, network-intensive, IO-intensive, or peak-pattern characteristics), applications are spread as widely as possible across the cluster, or constrained by policy on the same node, so as to minimize the probability of interference between different Pods.
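
A minimal sketch, under hypothetical labels and limits, of what such a co-location constraint check could look like in Go: each application declares its dominant resource characteristic, and a policy caps how many instances with the same characteristic may share one node.

```go
package main

import "fmt"

// Characteristic is an application's dominant resource trait as declared
// to the scheduler (illustrative labels, not ASI's real taxonomy).
type Characteristic string

const (
	NetworkHeavy Characteristic = "network-heavy"
	IOHeavy      Characteristic = "io-heavy"
	CPUHeavy     Characteristic = "cpu-heavy"
)

// maxPerNode caps how many Pods of the same characteristic may co-locate.
var maxPerNode = map[Characteristic]int{
	NetworkHeavy: 2,
	IOHeavy:      1,
	CPUHeavy:     4,
}

// allows reports whether adding one more Pod with trait c to a node that
// already hosts the given Pods would violate the co-location policy.
func allows(existing []Characteristic, c Characteristic) bool {
	count := 0
	for _, e := range existing {
		if e == c {
			count++
		}
	}
	return count+1 <= maxPerNode[c]
}

func main() {
	node := []Characteristic{NetworkHeavy, CPUHeavy, NetworkHeavy}
	fmt.Println(allows(node, NetworkHeavy)) // false: two network-heavy Pods already there
	fmt.Println(allows(node, IOHeavy))      // true
}
```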

Furthermore, the scheduler is optimized by more technical means at runtime, such as network priority control and CPU fine-choreography control, to avoid the potential impact of runtime between applications as much as possible.

Another challenge posed by the inter-application choreography strategy is that the scheduler needs to fully understand the operational characteristics of every business running on it, in addition to building its own inter-application choreography capabilities.

4. Fine-grained CPU orchestration

CPU orchestration is a very interesting topic in the online scheduling domain, covering both CpuSet scheduling and CpuShare scheduling. In other scheduling domains, such as offline scheduling, it is not considered so important, or is even hard to relate to; but in online trading scenarios, whether from theoretical reasoning, laboratory experiments or countless large-scale stress tests, precise CPU scheduling turns out to be just that important.

In one sentence, fine-grained CPU orchestration means: arranging cores so that CPU cores are used to the maximum extent and in the most stable way.

CPU orchestration is so important that ASI has studied and exploited its rules to the fullest over the years. After reading the table below (which covers only fine-grained CpuSet scheduling), you may well marvel at the tricks ASI has managed to get out of it.

Take a 96-core x86 physical machine or X-Dragon server as an example: it has 2 sockets, each socket has 24 physical cores, and each physical core has 2 logical cores, for 96 logical cores in total. (ARM's architecture, of course, differs from x86's.)

Because of the L1/L2/L3 cache design of the CPU architecture, the optimal allocation is: of the two logical cores under the same physical core, one is given to a core online trading application such as Carts2 (shopping cart), and the other is given to another non-core application that is not busy. Carts2 can then make full use of the physical core at the daily peak or the Double 11 zero-hour peak. This kind of allocation has proved successful time and again, both in the production environment and in stress-test drills.

If we instead gave both logical cores of the same physical core to Carts2 at the same time, their peaks would coincide (especially within the same Pod instance), and the maximum utilization of the resource would be compromised.

In theory we should also avoid letting two applications that are both core to trading, such as Carts2 (shopping cart) and tradePlatform2 (order), share the two logical cores of one physical core. In practice, however, at the micro level the peaks of Carts2 and tradePlatform2 are not identical, so the actual impact is small. This CPU allocation may look like a "compromise", but physical resources are always limited, and we can only live with this "compromise".

When NUMA-aware scheduling is enabled, in order to make the most of the L3 cache and improve computing performance, the cores of the same Pod should be kept on the same socket as far as possible, avoiding cross-socket allocation.
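
The following self-contained Go sketch illustrates the two rules above in code (it is an illustration under these assumptions, not ASI's CpuSet allocator): prefer logical cores from a single socket to protect L3-cache locality, and never hand both hyper-thread siblings of one physical core to the same latency-critical Pod.

```go
package main

import "fmt"

// logicalCPU describes one hyper-thread on the machine.
type logicalCPU struct {
	ID, Core, Socket int  // logical id, physical core id, socket id
	Free             bool // not yet pinned to any CpuSet Pod
}

// allocate picks `want` logical CPUs for a latency-critical Pod:
// same socket first (NUMA/L3 locality), at most one sibling per physical core.
func allocate(cpus []logicalCPU, want, sockets int) []int {
	for s := 0; s < sockets; s++ { // try to satisfy the request inside one socket
		var picked []int
		usedCore := map[int]bool{}
		for _, c := range cpus {
			if len(picked) == want {
				break
			}
			if c.Free && c.Socket == s && !usedCore[c.Core] {
				picked = append(picked, c.ID)
				usedCore[c.Core] = true // skip the hyper-thread sibling
			}
		}
		if len(picked) == want {
			return picked
		}
	}
	return nil // would need cross-socket or sibling sharing; handled elsewhere
}

func main() {
	// A tiny 2-socket, 4-physical-core, 8-logical-core machine for illustration.
	var cpus []logicalCPU
	for id := 0; id < 8; id++ {
		cpus = append(cpus, logicalCPU{ID: id, Core: id / 2, Socket: id / 4, Free: true})
	}
	fmt.Println(allocate(cpus, 2, 2)) // [0 2]: one thread each from two cores on socket 0
}
```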

When CPUShare is used, how to set Request and Limit is also a subtle art. When CPUSet and CPUShare coexist, scheduling becomes even more complicated (for example, scaling out, shrinking or removing a CPUSet container may require re-arranging the CPUs of all Pods on the machine). In the emerging heterogeneous GPU scheduling scenario, the joint allocation of CPU and GPU also takes a certain amount of skill.

5. Large-scale scheduling

Large-scale orchestration is mainly applied to site building, site relocation or large-scale migration scenarios, such as Alibaba's frequent promotion site building and the super-large-scale migrations driven by datacenter moves. For cost reasons, we need to create hundreds of thousands of Pods in the shortest possible time and with minimal labor cost.

The randomness and unpredictability of many independent task requests mean that central scheduling has many shortcomings at this scale. Before large-scale orchestration existed, Alibaba's large-scale site building often had to go through a complicated cycle of "business self-expansion -> repeated rescheduling", costing several weeks of heavy manual effort. Fortunately, we now have large-scale orchestration, which guarantees a resource allocation rate of over 99% while delivering at an hourly scale.

Universal scheduling capability

1. Rescheduling

The central scheduler achieves one-time optimal scheduling, but that is completely different from globally optimal scheduling at the cluster level. Rescheduling includes both global central rescheduling and single-machine rescheduling.

Why must central rescheduling exist as a complement to one-time scheduling? Let's look at a few examples:

  • There are many long-life-cycle Pod instances in ASI clusters. As time goes by, resource fragmentation and uneven CPU utilization at the cluster level are inevitable.

  • Allocating a Pod with many cores requires either dynamic "make-room" scheduling (evicting some small-core Pods in real time to free up resources) or pre-planned global rescheduling that frees large blocks of cores on many nodes in advance.

  • Resource supply is always tight. A one-time scheduling decision for a Pod may involve some "compromise", that is, some kind of flaw or imperfection; but cluster resources change dynamically, and at some later point we can initiate a live migration of the Pod, in other words a rescheduling, which gives the business a better runtime experience.

The algorithms and engineering of central rescheduling are often very complex. We need to deeply understand and fully cover the various rescheduling scenarios, define a clear rescheduling DAG, execute it dynamically, and guarantee the success rate of execution.

Many scenarios also require single-machine rescheduling, for example SLO optimization based on fine-grained CPU orchestration, or single-machine rescheduling optimization driven by QoS data.

It must be emphasized that single-machine rescheduling must first solve the problem of safety and risk control, to avoid an uncontrollable blast radius. While the risk-control capability on the node side is still insufficient, we recommend not adopting the node-autonomy mode for the time being, and instead triggering actions centrally under strict protective controls. In fact, there are many unavoidable node-autonomy scenarios in the K8s domain (for example, when a Pod's YAML changes, Kubelet performs the corresponding change), and ASI has spent years continually combing through every potential risk point, iteratively building the Defender system of graded risk control (nuclear button, high risk, medium risk, and so on). For potentially risky actions, the node interacts with the central Defender before performing any single-machine operation, preventing disasters through this safety control. We suggest that any scheduler reach a similarly strict safety level before allowing nodes to act autonomously.
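
A minimal Go sketch of the centrally gated pattern described above. The RiskGate interface, the action names and the thresholds are hypothetical; the point is simply "ask the central risk controller, such as Defender, before any node-level rescheduling action, and refuse to act locally if the answer is no".

```go
package main

import (
	"errors"
	"fmt"
)

// RiskGate is a hypothetical stand-in for a central risk-control service
// such as the Defender system mentioned in the text.
type RiskGate interface {
	// Approve returns nil if the action may proceed on the given node.
	Approve(node, action string, affectedPods int) error
}

// conservativeGate rejects anything that touches too many Pods at once.
type conservativeGate struct{ maxBlastRadius int }

func (g conservativeGate) Approve(node, action string, affectedPods int) error {
	if affectedPods > g.maxBlastRadius {
		return fmt.Errorf("%s on %s touches %d pods, above limit %d",
			action, node, affectedPods, g.maxBlastRadius)
	}
	return nil
}

// rebalanceNode performs a single-machine rescheduling action only after
// the central gate approves it; otherwise it does nothing locally.
func rebalanceNode(gate RiskGate, node string, podsToMove int) error {
	if err := gate.Approve(node, "cpu-rebalance", podsToMove); err != nil {
		return errors.New("aborted by risk control: " + err.Error())
	}
	// ... perform the actual CPU re-arrangement here ...
	fmt.Printf("rebalanced %d pods on %s\n", podsToMove, node)
	return nil
}

func main() {
	gate := conservativeGate{maxBlastRadius: 3}
	fmt.Println(rebalanceNode(gate, "node-17", 2)) // allowed
	fmt.Println(rebalanceNode(gate, "node-17", 9)) // aborted by risk control
}
```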

2. Kernel scheduling

Kernel scheduling exists in the following context: on a busy host, a multitude of tasks run in parallel. Even if central scheduling and single-machine scheduling together ensure the optimal allocation of resources (such as CPU distribution and IO spreading), at actual runtime the tasks inevitably compete for resources in kernel mode; in the familiar online-offline colocation scenario this competition is especially fierce. This requires central scheduling, single-machine scheduling and kernel scheduling to work together, agreeing on the resource priorities of the various tasks and handing them over to kernel scheduling to enforce at execution time.

This corresponds to a range of kernel isolation techniques: for CPU, scheduling priorities such as BVT and the Noise Clean mechanism; for memory, memory reclamation and OOM priorities; for network, gold/silver/copper network priorities, IO priorities, and so on.
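
The kernel mechanisms named above, such as BVT and Noise Clean, are features of Alibaba's own kernel, so no example of them is attempted here. As a rough analogy built only on stock Linux interfaces, the Go sketch below shows the kind of per-task priority plumbing a single-machine agent performs: lowering a best-effort cgroup's cpu.weight (cgroup v2) and raising the OOM-kill likelihood of its processes via oom_score_adj. The paths and values are illustrative and the program needs the appropriate privileges.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// demoteBestEffort lowers the CPU weight of a cgroup-v2 group and makes its
// processes the preferred OOM victims. This mirrors, with stock Linux knobs,
// the "low priority in kernel mode" idea behind BVT/OOM-priority mechanisms.
func demoteBestEffort(cgroupPath string, pids []int) error {
	// cgroup v2: cpu.weight ranges 1..10000, default 100; 1 is "barely any share".
	if err := os.WriteFile(filepath.Join(cgroupPath, "cpu.weight"), []byte("1"), 0o644); err != nil {
		return fmt.Errorf("set cpu.weight: %w", err)
	}
	for _, pid := range pids {
		// oom_score_adj 1000 makes the process the first OOM-kill candidate.
		p := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
		if err := os.WriteFile(p, []byte("1000"), 0o644); err != nil {
			return fmt.Errorf("set oom_score_adj for %d: %w", pid, err)
		}
	}
	return nil
}

func main() {
	// Illustrative only: the cgroup path and PID are placeholders.
	err := demoteBestEffort("/sys/fs/cgroup/besteffort", []int{4321})
	fmt.Println("demote result:", err)
}
```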

Today we also have secure containers. Based on the isolation between the Guest Kernel and the Host Kernel, we can avoid kernel-mode contention even more elegantly.

3. Elastic scheduling and time-sharing scheduling

Both elastic scheduling and time-sharing scheduling are about better resource reuse, just in different dimensions.

The ASI scheduler works closely with Alibaba Cloud's infrastructure layer and makes use of the strong elasticity of ECS: in the Ele.me scenario, resources are returned to the cloud during off-peak periods and re-requested during peak periods.

We can either use the elastic buffer built into ASI's large resource pool (note: the host resources of the ASI pool come from Alibaba Cloud) or directly use the elasticity of the Alibaba Cloud IaaS layer. How to balance the two is a much-debated topic, and something of an art.

ASI's time-sharing scheduling takes resource reuse to the extreme and brings huge cost savings. Every night, Pod instances of online transactions are shut down in large numbers and the released resources are used for ODPS offline tasks; every morning the online applications are brought back up again. This classic scenario realizes the maximum value of online-offline colocation technology.

The essence of time-sharing scheduling is resource reuse plus the construction and operation of the large resource pool it depends on; it is a synthesis of resource operation and scheduling technology. It requires the scheduler to pool as many workload forms and as many tasks as possible.

4. Vertical scaling scheduling /X+1/VPA/HPA

Vertical scaling scheduling is a second-level delivery technology that neatly solves the problem of burst traffic; it is also a trump card for reducing the risk of the zero-hour pressure peak. By vertically adjusting the resources of existing Pods, with precise and reliable CPU orchestration and re-arrangement algorithms, it achieves second-level delivery of computing resources. Vertical scaling is also one of the VPA scenarios.
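
As a simplified illustration of the vertical-scaling idea (this is not ASI's algorithm), the Go sketch below computes a new CPU request for a running Pod from a high percentile of its recent usage plus a safety margin, which is the general shape of VPA-style recommenders:

```go
package main

import (
	"fmt"
	"sort"
)

// recommendCPU returns a new CPU request (in millicores) for a Pod based on
// recent usage samples: take a high percentile and add a safety margin.
func recommendCPU(samplesMilli []int, percentile, margin float64) int {
	if len(samplesMilli) == 0 {
		return 0
	}
	s := append([]int(nil), samplesMilli...)
	sort.Ints(s)
	idx := int(percentile * float64(len(s)-1))
	return int(float64(s[idx]) * (1 + margin))
}

func main() {
	// One sample per minute over the last 10 minutes, in millicores.
	usage := []int{800, 900, 950, 1200, 2400, 2600, 2500, 1400, 1000, 900}
	// 95th percentile plus 15% headroom for the next burst.
	fmt.Println(recommendCPU(usage, 0.95, 0.15), "m") // 2875 m
}
```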

X+1 horizontal scaling can also be regarded as one of the HPA scenarios, except that it is triggered manually. "X+1" is all about ultimate resource delivery efficiency, behind which lies a great improvement in R&D efficiency: an online Pod can start and serve business traffic within "X" minutes, and every step other than application startup must finish within 1 minute.

Vertical scaling scheduling and "X+1" horizontal scaling complement each other and together safeguard all kinds of traffic peaks.

ASI is also implementing VPA and HPA in more scenarios. For example, with VPA we can provide extra, otherwise idle computing power for the Ant Spring Festival red-envelope campaign, which translates into very large cost savings.

Pushing scheduling technologies such as VPA/HPA to the extreme across more scenarios is something Alibaba will keep pursuing in the future.

5. Hierarchical [differentiated SLO] resource scheduling

Differentiated SLO scheduling is one of the essences of the scheduler. This section overlaps somewhat with the "Schedule resource types" section above; because differentiated SLO is so complex, it is deliberately placed in the last section of this chapter.

SLO (Service Level Objectives), QoS and Priority all have very precise definitions in the ASI scheduler.

1) SLO

SLO describes the service-level objective. ASI provides differentiated SLOs through different QoS and Priority levels, and different SLOs are priced differently. Users can decide which SLO-guaranteed resources to "subscribe to" according to the characteristics of their services: for offline data-analysis tasks, for example, a lower-grade SLO can be used to enjoy a lower price, while for critical business scenarios a higher-grade SLO can be used at a higher price.

2) QoS

QoS describes the quality of the resource guarantee. The QoS defined by the K8s community includes Guaranteed, Burstable and BestEffort. The QoS defined in ASI does not map fully onto the community's (the community derives it from Request/Limit). In order to describe the group's scenarios clearly (such as CPUShare and online-offline colocation), ASI defines QoS along another dimension, with the classes LSE, LSR, LS and BE, to clearly distinguish different levels of resource guarantee; each business chooses a QoS according to its own latency sensitivity.

3) PriorityClass

PriorityClass and QoS are two dimensional concepts. PriorityClass describes the importance of the task.

Resource guarantee policy and task importance (that is, QoS and PriorityClass) come in different combinations, and of course they need a corresponding relationship. For example, we can define a PriorityClass named Preemptible, most of whose tasks correspond to the BestEffort QoS.
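
A minimal Go sketch of such a correspondence table, using the level names mentioned in this article; the exact mapping shown here is illustrative, not ASI's production table.

```go
package main

import "fmt"

// PriorityClass describes how important a task is; QoS describes how firmly
// its resources are guaranteed (see the definitions above).
type (
	PriorityClass string
	QoS           string
)

const (
	System      PriorityClass = "System"
	Production  PriorityClass = "Production"
	Preemptible PriorityClass = "Preemptible"

	LSE QoS = "LSE" // exclusive, latency-sensitive
	LSR QoS = "LSR"
	LS  QoS = "LS"
	BE  QoS = "BE" // best effort
)

// defaultQoS lists an illustrative set of acceptable QoS classes per
// PriorityClass; a real system allows several valid combinations per class.
var defaultQoS = map[PriorityClass][]QoS{
	System:      {LSE, LSR},
	Production:  {LSR, LS},
	Preemptible: {BE},
}

func main() {
	for _, pc := range []PriorityClass{System, Production, Preemptible} {
		fmt.Println(pc, "->", defaultQoS[pc])
	}
}
```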

Each scheduling system has different definitions of PriorityClass. Such as:

  • In ASI, the PriorityClasses are defined as System, Production and Preemptible, among others. The details of each level are not covered here.
  • Hippo (search) defines more types with finer granularity: System, ServiceHigh, ServiceMedium, ServiceLow, JobHigh, JobMedium and JobLow. The details of each level are not covered here.

Globally optimal scheduling

1. Scheduling simulator

The scheduling simulator is somewhat similar to Alibaba's full-link stress-testing system. By replaying real online traffic or simulated traffic in a simulated environment, it verifies new scheduling capabilities, continuously honing the various scheduling algorithms and optimizing the various metrics.

Another common use of the scheduling simulator is offline reproduction of thorny online problems, so that all kinds of issues can be diagnosed harmlessly and efficiently.

To some extent, the scheduling simulator is the foundation of global scheduling optimization. With it, we can repeatedly exercise various algorithms, technical frameworks and technical chains in a simulated environment, and then optimize global metrics such as the global allocation rate, scheduling performance in different scenarios, and scheduling stability.
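
Conceptually, the simulator is a replay loop: feed recorded (or synthetic) Pod requests into a candidate scheduling algorithm running against a snapshot of the cluster, and measure global metrics such as the allocation rate. A toy Go sketch of that loop, with illustrative names and a trivial first-fit algorithm as the candidate, might look like this:

```go
package main

import "fmt"

// Scheduler is the interface a candidate scheduling algorithm implements
// so it can be exercised by the simulator (illustrative, not ASI's API).
type Scheduler interface {
	Place(freeCPU []int, reqCPU int) (nodeIdx int, ok bool)
}

// firstFit is a trivial baseline algorithm used here only as an example.
type firstFit struct{}

func (firstFit) Place(freeCPU []int, reqCPU int) (int, bool) {
	for i, f := range freeCPU {
		if f >= reqCPU {
			return i, true
		}
	}
	return -1, false
}

// simulate replays a recorded request trace against a cluster snapshot and
// reports the allocation rate, one of the global metrics mentioned above.
func simulate(s Scheduler, freeCPU []int, trace []int) float64 {
	placed := 0
	for _, req := range trace {
		if i, ok := s.Place(freeCPU, req); ok {
			freeCPU[i] -= req
			placed++
		}
	}
	return float64(placed) / float64(len(trace))
}

func main() {
	snapshot := []int{16, 16, 32}          // free cores per node
	trace := []int{8, 8, 8, 8, 8, 8, 8, 8} // replayed Pod CPU requests
	fmt.Printf("allocation rate: %.0f%%\n", simulate(firstFit{}, snapshot, trace)*100)
}
```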

2. Elastic Scheduling Platform (ESP)

To achieve globally optimal scheduling, ASI has built a new Elastic Scheduling Platform (ESP) around the scheduler, creating a one-stop, self-closed-loop scheduling effectiveness system based on scheduling data guidance, core scheduling capabilities and productized scheduling operations.

In the past we built many similar modules, such as scheduling SLO inspection, numerous scheduling tools, and layer-2 scheduling platforms for different scenarios. Based on the ESP platform, more layer-2 scheduling capabilities will be integrated, providing ASI with globally optimal scheduling quality and offering customers more considerate services around business stability, resource cost and user efficiency.

More scheduling capabilities

This article has tried to systematically explain the basic concepts, principles and various scenarios of the ASI scheduler, and to lead you into the beautiful and wonderful world of the ASI scheduler. The field of scheduling is broad and deep, and unfortunately, constrained by space, much content could not be expanded on. There are many deeper scheduling topics, such as heterogeneous machine scheduling, scheduling profiling, fairness scheduling, priority scheduling, shift scheduling, preemptive scheduling, disk scheduling, Quota, CPU normalization, Gang scheduling, scheduling tracing and scheduling diagnosis, none of which are described in this article. Also for reasons of space, this article does not describe ASI's powerful scheduling framework and its optimizations, scheduling performance optimization, or other deeper technical internals.

As early as 2019, ASI had already optimized single K8s clusters to the industry-leading scale of ten thousand nodes. Thanks to the powerful K8s operation and maintenance system of Alibaba Cloud ACK, Alibaba Group continues to maintain a large number of large-scale computing clusters and has also accumulated industry-leading production practice with multi-cluster K8s. It is in these large-scale K8s clusters that ASI, with its mature container scheduling technology, continuously provides computing power for many complex workloads.

In the past few years, taking the group's full migration to the cloud as an opportunity, Alibaba Group has completed the migration and evolution of the scheduling control plane from ASI to Alibaba Cloud Container Service ACK. The complex, rich and large-scale business scenarios within Alibaba Group will continue to feed back into, strengthen and temper the technical capabilities of the cloud.