Abstract: Starting from the most basic scheduling algorithm, this article analyzes the principles of the mainstream scheduling algorithms one by one and explores the mysteries of CPU scheduling.

This article is shared from the Huawei Cloud community post "Exploring CPU Scheduling Principles" by Yuan Runzi.

Preface

Software engineers tend to treat the Operating System (OS) as a trustworthy steward: we simply hand our programs to the OS to run, but rarely dig into how the OS itself works. Indeed, as a general-purpose software system, the OS performs well enough in most scenarios. But there are still special scenarios where we need to tune the OS so that the business system can do its job more efficiently. That requires a deep understanding of how the OS works: not just telling the steward what to do, but knowing how to make the steward do it better.

The OS is a very large software system; this article explores only the tip of the iceberg: the principles of CPU scheduling.

When CPU scheduling comes up, many people's first reaction is time-slice-based scheduling: each process gets a time slice of CPU to run, and when the slice is used up, the CPU is handed to another process. But few people understand the deeper mechanics, such as how the OS determines that a time slice has run out, or how it actually switches to another process.

In fact, time-slice-based scheduling is just one class among many CPU scheduling algorithms. Starting from the most basic scheduling algorithm, this article analyzes the principles of the mainstream scheduling algorithms one by one and takes you into the mysteries of CPU scheduling.

CPU context switch

Before exploring the principles of CPU scheduling, let’s take a look at CPU context switching, which is fundamental to CPU scheduling.

Almost all modern OSes can run far more tasks "simultaneously" than they have CPUs, by allocating the CPUs to tasks in turn. This requires the OS to know where to load each task from and where to resume running it. This information is stored in the CPU's registers; in particular, the address of the next instruction to execute is kept in a special register, the program counter (PC). We call this register state the CPU context, also known as the hardware context.

When the OS switches the running task, it saves the context of the outgoing task and loads the context of the incoming task into the CPU registers; this is called a CPU context switch.

The CPU context is part of the process context, which consists of the following two parts:

  • User-level context: Contains information about the process’s runtime stack, data blocks, code blocks, and so on.

  • System-level context: contains process identification information, the saved process state (the CPU context), and process control information.

This raises two questions: (1) How is the CPU context of the previous task saved? (2) When does a context switch take place?

Question 1: How is the CPU context of the previous task saved?

The CPU context is stored in the process's kernel space. When the OS allocates virtual memory to each process, it sets aside a kernel space that only kernel code can access. Before switching the CPU context, the OS saves the current task's information, such as the general registers and the PC, into this kernel space. At the next switch back, the OS loads this information into the CPU registers to resume the task.
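
To make this concrete, here is a minimal, purely illustrative Python sketch (not real kernel code) of what "save the outgoing task's registers into its kernel-space save area, then load the incoming task's saved registers" amounts to; the register fields and the Process structure are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class CPUContext:
    # A tiny subset of the hardware context: general registers and the PC.
    general_regs: dict = field(default_factory=dict)
    program_counter: int = 0

@dataclass
class Process:
    pid: int
    # The saved context lives in the process's kernel-space area.
    saved_context: CPUContext = field(default_factory=CPUContext)

def context_switch(cpu: CPUContext, prev: Process, nxt: Process) -> None:
    """Save the outgoing task's context, then load the incoming task's context."""
    # 1. Save the current register state into prev's kernel-space save area.
    prev.saved_context = CPUContext(dict(cpu.general_regs), cpu.program_counter)
    # 2. Restore nxt's previously saved state into the CPU registers.
    cpu.general_regs = dict(nxt.saved_context.general_regs)
    cpu.program_counter = nxt.saved_context.program_counter
```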

Question 2: When is a context switch performed?

For the OS to perform a task context switch, it must itself occupy the CPU to run the switching logic. However, while a user program is running, the CPU is occupied by that program, meaning the OS is not running at that moment and therefore cannot perform a context switch. Two strategies solve this problem: the cooperative strategy and the preemptive strategy.

Cooperative strategies rely on the user program actively relinquishing the CPU, for example by making a system call or triggering an exception such as division by zero. However, this strategy is not reliable: if the user program never gives up the CPU, or even loops maliciously, it will occupy the CPU forever, and the only way to recover is to restart the system.

Preemptive strategies rely on the hardware's timer interrupt mechanism: at initialization, the OS registers an interrupt-handling callback with the hardware. When a timer interrupt fires, the hardware hands the CPU to the OS, and the OS can switch the CPU context inside the interrupt callback.

Scheduling metrics

The quality of a CPU scheduling algorithm is generally measured by the following two indicators:

  • Turnaround time: T_{turnaround} = T_{completion} - T_{arrival}
  • Response time: T_{response} = T_{firstrun} - T_{arrival}

The two metrics are somewhat at odds: driving down the average turnaround time tends to worsen the average response time, and vice versa. Which one to target depends on the type of task. For batch tasks such as program compilation, turnaround time should be small so that compilation finishes as quickly as possible; for interactive tasks, response time must be small to avoid hurting the user experience.
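
As a tiny sanity check of the two formulas above (the helper names are mine, not from the article):

```python
def turnaround_time(arrival, completion):
    # T_turnaround = T_completion - T_arrival
    return completion - arrival

def response_time(arrival, first_run):
    # T_response = T_firstrun - T_arrival
    return first_run - arrival
```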

Workload assumptions

The workload on an OS is always changing. To better understand the principles of the various CPU scheduling algorithms, let's make the following assumptions about the workload:

  • Assumption 1: All tasks run for the same duration.

  • Assumption 2: All tasks arrive at the same time.

  • Assumption 3: Once started, a task runs until it completes.

  • Assumption 4: All tasks use only the CPU (no I/O operations).

  • Assumption 5: The running time of every task is known in advance.

Now that we’re ready, let’s dive into the wonderful world of CPU scheduling algorithms.

FIFO: First in, first out

The FIFO (First In, First Out) scheduling algorithm is known for its simplicity and ease of implementation: it schedules the task that arrived first until it finishes, then schedules the next one, and so on. If several tasks arrive at the same time, one is picked at random.

FIFO works well under our assumed workload. For example, suppose three tasks A, B and C satisfy all of the load assumptions above, each runs for 10s, and all arrive at t=0. The scheduling looks like this:

According to the FIFO scheduling principle, A, B and C finish at 10s, 20s and 30s respectively, giving an average turnaround time of 20s ((10+20+30)/3).

However, reality is always cruel. If assumption 1 is broken, for example A's running time becomes 100s while B and C stay at 10s, the scheduling looks like this:

Under FIFO, because A runs for so long, B and C cannot be scheduled for a long time, and the average turnaround time deteriorates to 110s ((100+110+120)/3).
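
Both averages above can be reproduced with a small simulation sketch of FIFO; the (name, arrival, runtime) task format is my own convention for illustration:

```python
def fifo_average_turnaround(tasks):
    """FIFO: run tasks to completion in arrival order.
    tasks: list of (name, arrival, runtime) tuples."""
    clock, total = 0.0, 0.0
    for _, arrival, runtime in sorted(tasks, key=lambda t: t[1]):
        clock = max(clock, arrival) + runtime   # run the task to completion
        total += clock - arrival                # turnaround = completion - arrival
    return total / len(tasks)

print(fifo_average_turnaround([("A", 0, 10), ("B", 0, 10), ("C", 0, 10)]))   # 20.0
print(fifo_average_turnaround([("A", 0, 100), ("B", 0, 10), ("C", 0, 10)]))  # 110.0
```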

Therefore, in scenarios where task running times differ greatly, the FIFO scheduling strategy easily leads to the task starvation problem!

For this problem, if B and C, which have shorter running times, are scheduled first, the problem is solved. This is exactly the idea behind the SJF scheduling algorithm.

SJF: Shortest task first

SJF (Shortest Job First) selects the shortest task among those that arrive at the same time, schedules it, then schedules the next shortest, and so on.

For the workload in the previous section, SJF schedules as follows: the average turnaround time becomes 50s ((10+20+120)/3), less than half of FIFO's 110s.

Let's now break assumption 2 as well: A arrives at t=0, while B and C arrive at t=10. The scheduling then looks like this:

Because B and C arrive later than A, they must wait until A finishes before they get a chance to be scheduled, even though A runs for a very long time. The average turnaround time deteriorates to 103.33s ((100+(110-10)+(120-10))/3), and once again tasks are starved!
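
A sketch of non-preemptive SJF under the same conventions reproduces both results: 50s when all tasks arrive together, and about 103.33s when B and C arrive late.

```python
def sjf_average_turnaround(tasks):
    """Non-preemptive SJF: among the tasks that have already arrived,
    run the one with the shortest runtime to completion.
    tasks: list of (name, arrival, runtime) tuples."""
    pending = sorted(tasks, key=lambda t: t[1])          # order by arrival time
    clock, total = 0.0, 0.0
    while pending:
        ready = [t for t in pending if t[1] <= clock] or [pending[0]]
        job = min(ready, key=lambda t: t[2])             # shortest runtime wins
        pending.remove(job)
        clock = max(clock, job[1]) + job[2]
        total += clock - job[1]
    return total / len(tasks)

print(sjf_average_turnaround([("A", 0, 100), ("B", 0, 10), ("C", 0, 10)]))    # 50.0
print(sjf_average_turnaround([("A", 0, 100), ("B", 10, 10), ("C", 10, 10)]))  # ~103.33
```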

STCF: Shortest Time-to-Completion First

To solve SJF's task starvation problem, we need to break assumption 3, that is, allow tasks to be interrupted while running. If B and C are scheduled immediately upon arrival, the problem is solved. This is preemptive scheduling, as mentioned in the CPU context switch section: when the timer interrupt fires, the OS can perform a context switch between tasks A and B.

Adding preemption to the cooperative SJF algorithm gives the STCF algorithm (Shortest Time-to-Completion First). Its principle: whenever a task with a shorter remaining running time arrives, interrupt the current task and schedule the shorter one first.

Scheduling this workload with STCF brings the average turnaround time back down to 50s ((120+(20-10)+(30-10))/3), again solving the task starvation problem!
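
A rough tick-by-tick sketch of STCF (preemption is modeled by re-deciding at every tick; the tick size is an assumption for illustration) reproduces the 50s figure:

```python
def stcf_average_turnaround(tasks, tick=1):
    """Preemptive STCF, simulated in fixed ticks: at every tick, run the arrived
    task with the least remaining time. tasks: list of (name, arrival, runtime)."""
    remaining = {name: runtime for name, _, runtime in tasks}
    arrival = {name: arr for name, arr, _ in tasks}
    clock, total, done = 0, 0.0, 0
    while done < len(tasks):
        ready = [n for n in remaining if arrival[n] <= clock and remaining[n] > 0]
        if not ready:
            clock += tick
            continue
        job = min(ready, key=lambda n: remaining[n])     # shortest time-to-completion
        remaining[job] -= tick
        clock += tick
        if remaining[job] == 0:
            total += clock - arrival[job]
            done += 1
    return total / len(tasks)

print(stcf_average_turnaround([("A", 0, 100), ("B", 10, 10), ("C", 10, 10)]))  # 50.0
```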

So far we have only cared about turnaround time. What about the response time of FIFO, SJF and STCF?

Suppose tasks A, B and C all arrive at t=0 and each runs for 5s. Under these three algorithms the scheduling looks like this, and the average response time is 5s ((0+(5-0)+(10-0))/3):

To make matters worse, the average response time grows with the task running time, which is disastrous for interactive tasks and severely hurts the user experience. The root cause is that when tasks arrive at the same time and each runs to completion, the last task must wait for all the others to finish before it even starts.

To optimize response time, the familiar time-slice-based scheduling emerged.

RR: time-slice-based round-robin scheduling

The Round Robin (RR) algorithm assigns a time slice to each task. When the time slice of a task is used up, the scheduler interrupts the current task and switches to the next task, and so on.

Note that the length of the time slice must be an integer multiple of the timer-interrupt interval. For example, if the timer interrupt fires every 2ms, a task's time slice can be set to 2ms, 4ms, 6ms, and so on. Otherwise, a task's time slice could expire between two timer interrupts, and the OS would have no chance to switch tasks at that moment.

Assume again that A, B and C arrive at t=0 and each runs for 5s. With RR and a 1s time slice, the average response time is optimized to 1s ((0+(1-0)+(2-0))/3):

Under RR, the smaller the time slice, the smaller the average response time. But as the time slice shrinks, tasks switch more often and the cost of context switching grows. Setting the time slice is therefore a trade-off: response time should not be pursued at the expense of excessive context switching.

The cost of a CPU context switch is more than just saving and restoring registers. While a program runs, it gradually builds up its own warm state in the CPU caches, TLB, branch predictor and other hardware. Every task switch means this state has to be warmed up again, which can be expensive.

In addition, the average turnaround time under RR is 14s (((13-0)+(14-0)+(15-0))/3), much worse than the 10s (((5-0)+(10-0)+(15-0))/3) achieved by FIFO, SJF and STCF. This also confirms the earlier statement that turnaround time and response time are somewhat opposed: to optimize turnaround time, SJF or STCF is recommended; to optimize response time, RR is recommended.
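
A small RR sketch (assuming, as in the example, that all tasks arrive at t=0) reproduces both the 1s average response time and the 14s average turnaround time:

```python
from collections import deque

def rr_metrics(tasks, slice_len=1):
    """Round robin with a fixed time slice.
    tasks: list of (name, arrival, runtime); all assumed to arrive at t=0 here."""
    queue = deque((name, runtime) for name, _, runtime in tasks)
    first_run, completion, clock = {}, {}, 0
    while queue:
        name, remaining = queue.popleft()
        first_run.setdefault(name, clock)        # first moment this task gets the CPU
        run = min(slice_len, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))      # back to the tail of the queue
        else:
            completion[name] = clock
    n = len(tasks)
    avg_response = sum(first_run[name] for name, _, _ in tasks) / n
    avg_turnaround = sum(completion[name] for name, _, _ in tasks) / n
    return avg_response, avg_turnaround

print(rr_metrics([("A", 0, 5), ("B", 0, 5), ("C", 0, 5)], slice_len=1))  # (1.0, 14.0)
```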

Impact of I/O operations on scheduling

So far we have not considered any I/O. We know that when a process triggers an I/O operation, it does not occupy the CPU but blocks waiting for the I/O to complete. Let's now break assumption 4: tasks A and B both arrive at t=0 and each needs 50ms of CPU time, but A performs a 10ms blocking I/O after every 10ms of running, while B performs no I/O.

If STCF is used for scheduling, the situation looks like this:

As the figure shows, scheduling tasks A and B takes 140ms in total, more than their combined 100ms of actual running time. Moreover, while A is blocked on I/O the scheduler does not switch to B, so the CPU sits idle!

To solve this, use the RR scheduling algorithm and give tasks A and B a 10ms time slice each. Then, when A blocks on I/O, B can be scheduled; when B uses up its time slice, A has returned from its I/O, and so on. The total scheduling time is reduced to 100ms.
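
The two totals can be checked with simple arithmetic (assuming, to match the 140ms figure, that A's 50ms of CPU time is split into five 10ms bursts with four 10ms I/O waits in between):

```python
a_cpu = 5 * 10          # A's five 10ms CPU bursts
a_io = 4 * 10           # A's I/O waits between bursts
b_cpu = 50              # B's CPU time

# Naive STCF picture: the CPU idles during A's I/O, then B runs afterwards.
print(a_cpu + a_io + b_cpu)   # 140ms of wall-clock time

# RR with a 10ms slice: B runs whenever A is blocked, so A's I/O overlaps with
# useful work and only the real CPU demand remains.
print(a_cpu + b_cpu)          # 100ms
```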

This scheduling scheme relies on assumption 5: the scheduler must know in advance how long A and B run, how long their I/O operations take, and so on, in order to use the CPU fully. Reality is far messier: I/O blocking times vary from one operation to the next, and the scheduler does not know exactly what A and B will do. When assumption 5 is also broken, how can the scheduler still achieve high CPU utilization and reasonable scheduling?

Next, we’ll introduce a CPU scheduling algorithm used by many modern operating systems, MLFQ, that performs well even when all workload assumptions are broken.

MLFQ: multi-level feedback queue

MLFQ (Multi-level Feedback Queue) scheduling algorithm has the following objectives:

  1. Optimize turnaround time.
  2. Reduce the response time of interactive tasks to improve the user experience.

From the previous analysis we know that to optimize turnaround time we should prioritize tasks with short running times (as SJF and STCF do), while to optimize response time we need time-slice-based scheduling like RR. These two goals seem contradictory: reducing response time necessarily increases turnaround time.

Then for MLFQ, the following two problems need to be solved:

  1. How do you balance turnaround time and response time without knowing the task’s running time, I/O operations, and so on up front?
  2. How can we learn from historical scheduling to make better decisions in the future?

Task priorities

The most significant difference between MLFQ and the algorithms introduced above is that MLFQ adds multiple priority queues to hold tasks of different priorities, with the following two rules:

  • Rule 1: If Priority(A)> Priority(B), schedule A

  • Rule 2: If Priority(A)= Priority(B), schedule A and B according to the RR algorithm

Changes in priorities

MLFQ must also allow a task's priority to change. Otherwise, by rules 1 and 2, task C in the figure above would not get a chance to run until A and B had finished, giving C a very long response time. So the following priority-change rules are added:

  • Rule 3: When a new task arrives, put it on the highest priority queue

  • Rule 4a: If task A uses up an entire time slice without actively relinquishing the CPU (for example, for I/O), its priority is reduced by one level

  • Rule 4b: If task A voluntarily gives up the CPU before its time slice runs out, its priority remains unchanged

Rule 3 mainly ensures that newly arrived tasks get a chance to run, avoiding task starvation.

Rules 4a and 4b reflect the observation that most interactive tasks run briefly and frequently yield the CPU, so they keep their priority to preserve response time. CPU-intensive tasks, on the other hand, care less about response time, so their priority can be lowered.

Under these rules, when a long-running task A arrives, the scheduling looks like this:

If short-time task B arrives when task A runs to t=100, the scheduling situation looks like this:

As can be seen from the above scheduling situation, MLFQ has the advantages of STCF, that is, it can prioritize the scheduling of short-running tasks and shorten the turnaround time.

If task A runs to t=100 and interactive task C arrives, the scheduling situation looks like this:

When a task blocks, MLFQ picks another task by priority to avoid idling the CPU. So in the figure above, task A uses the remaining CPU time while task C is blocked on I/O, and is preempted again when C returns from its I/O, and so on. In addition, because C voluntarily gives up the CPU within its time slice, its priority stays unchanged, which effectively improves the experience of interactive tasks.

Starvation of CPU-intensive tasks

So far, MLFQ seems to be able to accommodate both turnaround time and response time for interactive tasks. Is it really perfect?

Consider this scenario: when task A has run to t=100, interactive tasks C and D arrive at the same time. The scheduling then looks like this:

So if the system has many interactive tasks, CPU-intensive tasks will starve!

To solve this problem, the following rules can be established:

  • Rule 5: After every period S, move all tasks to the highest priority queue (Priority Boost)

After adding this rule, and assuming that S is set to 50ms, the scheduling situation looks like this, and the starvation problem is solved!

Malicious task problem

Consider the following malicious task E. To hog the CPU, E deliberately performs an I/O operation when 1% of its time slice remains and returns from it quickly. By rule 4b, E stays on the highest-priority queue and therefore keeps winning the next scheduling round:

To solve this problem, we need to adjust rule 4 to the following rule:

  • Rule 4: Each priority level has a time-slice allotment. Once a task uses up its allotment at that level, no matter how many times it voluntarily gives up the CPU, its priority is reduced by one level

With the application of the new rule 4, the same workload is scheduled as follows, and malicious task E no longer consumes a large amount of CPU.

So far, the fundamentals of MLFQ have been covered. Finally, let’s summarize the five key rules of MLFQ:

  • Rule 1: If Priority(A)> Priority(B), schedule A

  • Rule 2: If Priority(A)= Priority(B), schedule A and B according to the RR algorithm

  • Rule 3: When a new task arrives, put it on the highest priority queue

  • Rule 4: Each priority level has a time-slice allotment. Once a task uses up its allotment at that level, no matter how many times it voluntarily gives up the CPU, its priority is reduced by one level

  • Rule 5: After every period S, move all tasks to the highest priority queue (Priority Boost)
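
Putting the five rules together, here is a toy MLFQ sketch; the number of queues, the per-level allotment and the boost period S are illustrative values I chose, not values from the article:

```python
from collections import deque

class MLFQ:
    """Toy sketch of the five MLFQ rules above."""

    def __init__(self, levels=3, allotment=10, boost_period=50):
        self.queues = [deque() for _ in range(levels)]   # index 0 = highest priority
        self.allotment = allotment        # time budget per priority level (rule 4)
        self.boost_period = boost_period  # Priority Boost interval S (rule 5)
        self.used = {}                    # time a task has used at its current level

    def add_task(self, task):
        # Rule 3: a newly arrived task enters the highest-priority queue.
        self.queues[0].append(task)
        self.used[task] = 0

    def priority_boost(self):
        # Rule 5: every period S, move every task back to the top queue.
        for queue in self.queues[1:]:
            while queue:
                task = queue.popleft()
                self.queues[0].append(task)
                self.used[task] = 0

    def pick_next(self):
        # Rules 1 and 2: take from the highest non-empty queue; tasks within
        # one queue take turns (round robin).
        for level, queue in enumerate(self.queues):
            if queue:
                return queue.popleft(), level
        return None, None

    def account(self, task, level, ran, finished=False):
        # Rule 4: once a task has used up its allotment at this level, demote it,
        # no matter how often it voluntarily yielded the CPU in between.
        if finished:
            self.used.pop(task, None)
            return
        self.used[task] += ran
        if self.used[task] >= self.allotment:
            level = min(level + 1, len(self.queues) - 1)
            self.used[task] = 0
        self.queues[level].append(task)

sched = MLFQ()
sched.add_task("A")
sched.add_task("B")
task, level = sched.pick_next()      # ("A", 0): A and B share the top queue
sched.account(task, level, ran=10)   # A used its whole allotment -> demoted to level 1
```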

Now, to return to the two questions raised at the beginning of this section:

1. How does MLFQ balance turnaround time and response time without knowing the operation information of the task in advance (including running time, I/O operations, etc.)?

Without knowing in advance whether a task is long-running or short-running, MLFQ first assumes it is short-running. If the assumption is right, the task completes quickly and both turnaround time and response time are optimized. If the assumption is wrong, the task's priority is gradually lowered, handing more scheduling opportunities to other short-running tasks.

2. How can MLFQ learn from historical scheduling to make better decisions in the future?

MLFQ judges whether a task is interactive by whether it voluntarily gives up the CPU. If it does, its current priority is maintained, ensuring it is scheduled promptly and improving the responsiveness of interactive tasks.

Of course, MLFQ is not a perfect scheduling algorithm. Its most troublesome aspect is parameter tuning, such as the number of priority queues, the length of the time slices, and the Priority Boost interval. There are no perfect reference values for these parameters; they can only be set according to the workload.

For example, we can set the time slice of tasks on a low-priority queue to be longer, because low-priority tasks tend to be CPU-intensive and don’t care much about response time, and longer time slices reduce the consumption of context switches.

CFS: Completely fair scheduling for Linux

In this section we introduce the scheduling algorithm we encounter most often: CFS (Completely Fair Scheduler) in Linux. Unlike the MLFQ of the previous section, CFS does not aim to optimize turnaround time or response time; its goal is to divide the CPU completely fairly among tasks.

Of course, CFS also provides the ability to prioritize processes, allowing users/administrators to decide which processes need more scheduling time.

Basic principle

While most scheduling algorithms are based on a fixed time slice, CFS takes a different approach: it schedules based on an accounting value called the virtual runtime (vruntime).

CFS maintains a vruntime value for each task and accumulates it as the task runs. For example, after task A runs a 5ms time slice, its counter is updated as vruntime += 5ms. At the next scheduling decision, CFS selects the task with the smallest vruntime, for example:

When should CFS switch tasks? Switching more often makes scheduling fairer across tasks, but context switching costs more. CFS therefore provides a tunable parameter, sched_latency, to govern this. CFS sets each task's time slice to time_slice = sched_latency / n (where n is the current number of tasks), ensuring that within one sched_latency period the CPU is divided evenly and fairly among the tasks. For example, with sched_latency set to 48ms and four tasks A, B, C and D, each task is assigned a 12ms time slice; after C and D finish, the time slices of A and B are updated to 24ms:

Seen this way, with sched_latency fixed, the time slice assigned to each task shrinks as the number of tasks grows, and the cost of task switching rises. To avoid excessive switching overhead, CFS provides another configurable parameter, min_granularity, the minimum time slice a task can receive. For example, if sched_latency is 48ms and min_granularity is 6ms, then even with 12 runnable tasks each task gets a 6ms slice rather than 4ms.
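
A minimal sketch of these two ideas, using the article's example values of sched_latency = 48ms and min_granularity = 6ms (the function names are my own):

```python
def cfs_time_slice(num_tasks, sched_latency=48.0, min_granularity=6.0):
    """Time slice per task under plain (unweighted) CFS: sched_latency is split
    evenly across runnable tasks, but never below min_granularity."""
    return max(sched_latency / num_tasks, min_granularity)

def pick_next(vruntimes):
    """Pick the runnable task with the smallest vruntime.
    vruntimes: dict mapping task name -> accumulated vruntime (ms)."""
    return min(vruntimes, key=vruntimes.get)

print(cfs_time_slice(4))    # 12.0
print(cfs_time_slice(12))   # 6.0, clamped by min_granularity instead of 4.0
print(pick_next({"A": 24.0, "B": 12.0, "C": 36.0}))  # 'B'
```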

Assign weight to tasks

Sometimes we want to give an important business process more CPU time and other, less important processes less. But under the basic principle described in the previous section, CFS divides the CPU evenly among all tasks; how can this be achieved?

You can assign weights to tasks so that tasks with higher weights use more CPU!

After adding the weight mechanism, the calculation method of the task time slice becomes as follows:

time\_slice_k = \frac{weight_k}{\sum_{i=0}^{n-1} weight_i} \times sched\_latency

For example, with sched_latency still set to 48ms and two tasks A and B, where A's weight is 1024 and B's weight is 3072, the time slices of A and B are 12ms and 36ms respectively.

As the previous section showed, CFS always picks the task with the smallest vruntime to schedule, and after each run the accounting rule is vruntime += runtime. So changing only the time-slice calculation has no effect; the vruntime accumulation must also be scaled by the weight:

vruntime_i = vruntime_i + \frac{weight_0}{weight_i} \times runtime_i

Again assuming that neither A nor B performs I/O, after updating the vruntime rule the scheduling looks like this: task B gets more CPU time than task A.
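
A small sketch of the two weighted formulas, using the article's example (weights 1024 and 3072, sched_latency = 48ms); weight_0 is taken here to be 1024, the default nice-0 weight in Linux:

```python
def weighted_time_slice(weight, all_weights, sched_latency=48.0):
    # time_slice_k = weight_k / sum(all weights) * sched_latency
    return weight / sum(all_weights) * sched_latency

def update_vruntime(vruntime, weight, runtime, weight_0=1024):
    # vruntime_i += weight_0 / weight_i * runtime_i
    return vruntime + weight_0 / weight * runtime

# The article's example: A has weight 1024, B has weight 3072.
print(weighted_time_slice(1024, [1024, 3072]))  # 12.0 ms for A
print(weighted_time_slice(3072, [1024, 3072]))  # 36.0 ms for B
# After each runs its slice, both advance by the same 12ms of vruntime,
# so B keeps receiving three times as much CPU as A.
print(update_vruntime(0.0, 1024, 12.0))  # 12.0
print(update_vruntime(0.0, 3072, 36.0))  # 12.0
```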

Using a red-black tree to improve vruntime lookup efficiency

Every time CFS switches tasks, it selects the task with the smallest vruntime, so it needs a data structure that stores each task together with its vruntime.

The most intuitive choice is an ordered linked list sorted by vruntime. Then, when switching tasks, CFS only needs to take the task at the head of the list, which is O(1). For example, ten tasks might be kept as the ordered list [1, 5, 9, 10, 14, 17, 18, 21, 22, 24]. However, inserting or deleting a task is O(N), and that cost grows linearly with the number of tasks!

To keep query, insert and delete all efficient, CFS stores tasks and their vruntime in a red-black tree. The complexity of query, insert and delete then becomes O(log N) and no longer grows linearly with the number of tasks, which greatly improves efficiency.

In addition, to improve storage efficiency, CFS keeps only tasks in the running (runnable) state in the red-black tree.
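
Python has no built-in red-black tree, so as a rough stand-in for this illustration a binary heap gives the same O(log N) insert and pop-min behavior; note that a real red-black tree is preferred in the kernel partly because it also supports O(log N) removal of an arbitrary task, for example when a task blocks.

```python
import heapq

run_queue = []                                    # holds (vruntime, task_name) pairs
for vruntime, name in [(24.0, "A"), (12.0, "B"), (36.0, "C")]:
    heapq.heappush(run_queue, (vruntime, name))   # O(log N) insert

vruntime, name = heapq.heappop(run_queue)         # O(log N) removal of smallest vruntime
print(name)  # 'B'
```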

Dealing with I/O and sleep

Under the rule of always picking the task with the smallest vruntime, tasks can still starve. Consider two tasks A and B with a 1s time slice. At first A and B share the CPU evenly in turn; then at some point B sleeps for 10 seconds. When B wakes up, vruntime_B is about 10 seconds smaller than vruntime_A, so for roughly the next 10 seconds B keeps being scheduled and task A is starved.

To solve this, CFS specifies that when a task returns from sleep or I/O, its vruntime is set to the minimum vruntime value currently in the red-black tree. In the example above, vruntime_B is set to 11 after B wakes from sleep, and therefore it no longer starves task A.

There are drawbacks to this approach. If a task has a short sleep period, it will still be prioritized when it wakes up, which is unfair to other tasks.
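
A minimal sketch of the wake-up rule described above (real CFS applies a more nuanced adjustment; the plain dict standing in for the red-black tree is just for illustration):

```python
def wake_up(task_vruntime, runnable_vruntimes):
    """Sketch of the rule above: a task returning from sleep or I/O adopts the
    smallest vruntime currently in the tree (a dict stands in for it here)."""
    return min(runnable_vruntimes.values(), default=task_vruntime)

# Toy version of the A/B example: A kept running while B slept.
runnable = {"A": 11.0}      # A accumulated 11s of vruntime during B's 10s sleep
vruntime_b = 1.0            # B's stale vruntime from before it slept
vruntime_b = wake_up(vruntime_b, runnable)
print(vruntime_b)           # 11.0 -> B no longer starves A
```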

Final words

This article has spent considerable space explaining the principles of several common CPU scheduling algorithms. Each has its own strengths and weaknesses; there is no perfect scheduling strategy. In practice, we need to choose an appropriate scheduling algorithm for the actual workload and configure reasonable scheduling parameters, balancing turnaround time against response time, and task fairness against switching overhead. All of this confirms the adage from Fundamentals of Software Architecture: everything in software architecture is a trade-off.

The scheduling algorithms described in this article are all analyzed for single-core processors. Scheduling on multi-core processors is considerably more complex, involving issues such as synchronizing shared data between processors and cache affinity, but it still builds on the basic scheduling algorithms described here.

References

  1. Operating Systems: Three Easy Pieces, Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

  2. Fundamentals of Computer Systems (3): Exceptions, Interrupts and Input/Output, Yuan Chunfeng, Nanjing University
