
Most modern operating systems try to extract the best possible performance from the underlying hardware. They do this mainly by virtualizing two key hardware resources: the CPU and memory. Modern operating systems provide a multitasking environment, essentially giving each task its own virtual CPU. A task is usually unaware that it does not have exclusive use of the CPU.

Memory virtualization works the same way: each task gets its own virtual address space, which is then mapped onto the system’s physical memory. A task is likewise usually unaware that its virtual addresses may not correspond to the same physical addresses in real memory.

CPU virtualization is achieved by “sharing” the CPU among multiple tasks: each runnable task periodically gets a small slice of CPU time. The component that selects one task at a time from the set of runnable tasks is called the scheduler, and the act of selecting the next task is called scheduling.

The scheduler is one of the most important components of any operating system. Implementing a good scheduling algorithm is difficult for several reasons. First, an acceptable algorithm must allocate CPU time so that higher-priority tasks (interactive applications such as Web browsers) take precedence over lower-priority tasks (non-interactive batch jobs such as program compilation). At the same time, the scheduler must not starve low-priority processes: no matter how many high-priority processes compete for the CPU, low-priority processes must eventually be allowed to run. The scheduler must also be carefully designed so that processes appear to run simultaneously, without hurting system throughput too much.

For interactive processes such as GUIs, an ideal scheduler would give each process a small slice of CPU time and cycle among processes quickly. Users expect interactive processes to respond to input immediately, so the delay between user input and execution should ideally be imperceptible to humans, roughly 50 to 150 milliseconds at most.

For non-interactive processes, the opposite is true. Switching between processes (a context switch) is a relatively expensive operation, so giving each process a longer time slice on the processor and performing fewer context switches improves system performance and throughput. A scheduling algorithm must balance all of these competing requirements.

Like most modern operating systems, Linux is a multitasking operating system, so it has a scheduler. Linux schedulers have evolved over time.

1. O(1) Scheduler

With the release of kernel 2.6, the Linux scheduler was overhauled. The new scheduler is called the O(1) scheduler, where O(…) refers to big O notation. The name was chosen because the scheduler’s algorithm takes constant time to make a scheduling decision, regardless of the number of tasks. The O(1) scheduler achieves this constant scheduling time by keeping two arrays of processes, active and expired. Each process gets a fixed time slice, after which it is preempted and moved to the expired array. Once all tasks in the active array have exhausted their time slices and been moved to the expired array, an array switch takes place: the expired array becomes the new active array, and the now-empty active array becomes the expired array.
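To make the array switch concrete, here is a minimal user-space sketch of the idea. The structure and counts are purely illustrative, not the actual kernel data structures, which keep one run list per priority level plus a bitmap of occupied levels.

```c
#include <stdio.h>

/* Simplified stand-in for the O(1) scheduler's two priority arrays. */
struct prio_array {
    int nr_active;               /* number of runnable tasks in this array */
};

int main(void)
{
    struct prio_array a = { 3 }, b = { 0 };
    struct prio_array *active = &a, *expired = &b;

    /* Each task that uses up its time slice is moved to the expired array. */
    while (active->nr_active > 0) {
        active->nr_active--;
        expired->nr_active++;
    }

    /* Once the active array is empty, the two pointers are simply swapped:
     * constant time, no matter how many tasks there are. */
    struct prio_array *tmp = active;
    active = expired;
    expired = tmp;

    printf("after the switch: active=%d tasks, expired=%d tasks\n",
           active->nr_active, expired->nr_active);
    return 0;
}
```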

The main problem with this algorithm was the complex heuristics used to mark a task as interactive or non-interactive. The algorithm tried to identify interactive processes by analyzing their average sleep time (the amount of time a process spends waiting for input). Processes that sleep for long periods are probably waiting for user input, so the scheduler assumed they were interactive. It rewarded interactive tasks with a priority boost (for better responsiveness) and penalized non-interactive tasks by lowering their priority. The calculations used to determine a task’s interactivity were complex and prone to miscalculations that could make an interactive process behave non-interactively.
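As a rough illustration of the kind of heuristic being described (this is not the O(1) scheduler’s actual code, and the threshold and bonus values are made up), a sleep-time-based classifier might look like this:

```c
#include <stdio.h>

/* Illustrative only: the real O(1) scheduler used far more involved heuristics.
 * Here, a task that sleeps a lot is assumed to be waiting for user input. */
#define SLEEP_THRESHOLD_NS  5000000ULL   /* hypothetical cutoff: 5 ms average sleep */
#define MAX_BONUS           5            /* hypothetical priority bonus/penalty */

struct task {
    const char *name;
    unsigned long long avg_sleep_ns;     /* average time spent sleeping (waiting for input) */
    int priority;                        /* lower value = higher priority, as in the kernel */
};

static void adjust_priority(struct task *t)
{
    if (t->avg_sleep_ns > SLEEP_THRESHOLD_NS)
        t->priority -= MAX_BONUS;        /* looks interactive: boost it */
    else
        t->priority += MAX_BONUS;        /* looks like a batch job: penalize it */
}

int main(void)
{
    struct task editor   = { "editor",   20000000ULL, 120 };  /* sleeps between keystrokes */
    struct task compiler = { "compiler",   100000ULL, 120 };  /* rarely sleeps */

    adjust_priority(&editor);
    adjust_priority(&compiler);
    printf("%s -> prio %d, %s -> prio %d\n",
           editor.name, editor.priority, compiler.name, compiler.priority);
    return 0;
}
```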

As I explain later in this article, CFS is not affected by any such calculations and simply tries to be “fair” to each task running on the system.

2. Completely Fair Scheduler (CFS)

According to CFS author Ingo Molnar, its core design can be summed up in one sentence: “CFS basically simulates an ‘ideal, accurate multitasking CPU’ on real hardware.”

Let’s try to understand what an “ideal, precise, multitasking CPU” is, since CFS tries to emulate one. Such a CPU is a hardware CPU that can run multiple processes simultaneously (in parallel), giving each process equal processing power (not equal time, but equal power). If a single process is running, it gets 100% of the processor’s power. With two processes, each gets exactly 50%, in parallel. Likewise, with four running processes, each gets exactly 25% of the physical CPU’s power in parallel, and so on. Such a CPU would therefore be “fair” to all tasks running on the system (Figure 1).

Figure 1. Ideal, precise, multitasking CPU

Obviously, such an ideal CPU does not exist, but CFS tries to emulate one in software. On a real processor, only one task can be assigned to the CPU at any given time, and all other tasks must wait. So while the currently running task gets 100% of the CPU’s power, all other tasks get 0%. This is clearly unfair (Figure 2).

Figure 2. The actual hardware CPU

CFS tries to eliminate this unfairness from the system. It does so by keeping track of the fair share of the CPU that each process in the system is entitled to. To that end, CFS runs a fair clock at a fraction of the real CPU clock speed. The fair clock’s rate of advance is calculated by dividing the elapsed wall-clock time (in nanoseconds) by the total number of waiting processes. The resulting value is the amount of CPU time each process is entitled to.

While a process waits for the CPU, the scheduler keeps track of how much time it would have used on the ideal processor. This wait time, held in the per-task wait_runtime variable, is used both to rank processes for scheduling and to determine how long each process is allowed to run before being preempted. The process that has waited the longest (that is, the one with the greatest need for the CPU) is picked by the scheduler and assigned to the CPU. While that process runs, its wait time decreases, and the wait times of the other tasks increase (because they are waiting). This means that after some time another task will have the longest wait time (the greatest need for the CPU), and the currently running task will be preempted. Using this principle, CFS tries to be fair to all tasks and to keep each process’s wait time close to zero, so that each process gets an equal share of the CPU, which is exactly what an “ideal, precise, multitasking CPU” would provide.

Kernel 2.6.23

For CFS to emulate an “ideal, precise, multitasking CPU” by giving each runnable process an equal slice of execution time, CFS needs the following (a simplified sketch of both mechanisms follows the list):

  1. A mechanism to calculate each process’s fair share of the CPU. This is done with the system-wide runqueue variable fair_clock (cfs_rq->fair_clock). This fair clock runs at a fraction of real time, so that it advances at the ideal pace for a single task when there are N runnable tasks in the system. For example, with four runnable tasks, fair_clock increases at one quarter of the speed of wall-clock time (which corresponds to a 25% fair share of CPU power).
  2. A mechanism to keep track of how long each process has waited while the CPU was allocated to the currently running task. This wait time accumulates in the per-process variable wait_runtime (p->wait_runtime).
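The following user-space sketch ties the two mechanisms together. The variable names echo fair_clock and wait_runtime from the text, but the bookkeeping is simplified to show only the direction of the updates, not the kernel’s exact arithmetic:

```c
#include <stdio.h>

#define NR_TASKS 4

struct task {
    const char *name;
    long wait_runtime;   /* time the task is "owed": grows while it waits, shrinks while it runs */
};

int main(void)
{
    struct task tasks[NR_TASKS] = { {"A", 0}, {"B", 0}, {"C", 0}, {"D", 0} };
    long fair_clock = 0;            /* stands in for the system-wide cfs_rq->fair_clock */
    long wall_delta = 4000000;      /* pretend 4 ms of wall-clock time just elapsed */
    int running = 0;                /* index of the task currently on the CPU */

    /* The fair clock advances at 1/N of wall time: that slice is each task's entitlement. */
    fair_clock += wall_delta / NR_TASKS;

    for (int i = 0; i < NR_TASKS; i++) {
        if (i == running)
            /* The running task consumed the whole interval but was only entitled to 1/N of it. */
            tasks[i].wait_runtime -= wall_delta - wall_delta / NR_TASKS;
        else
            /* Waiting tasks are credited the share they were entitled to but did not get. */
            tasks[i].wait_runtime += wall_delta / NR_TASKS;
    }

    printf("fair_clock=%ld\n", fair_clock);
    for (int i = 0; i < NR_TASKS; i++)
        printf("%s: wait_runtime=%ld\n", tasks[i].name, tasks[i].wait_runtime);
    return 0;
}
```

After this interval, the waiting tasks are each owed time, the running task owes time, and the task that is owed the most becomes the next one to run.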

3. Red-Black Tree

A red-black tree is a self-balancing binary search tree, a data structure commonly used to implement associative arrays. It is complex, but its operations have good worst-case running time, and it is efficient in practice: it can search, insert, and delete in O(log n) time, where n is the number of elements in the tree. In a red-black tree, the leaf nodes are not relevant and contain no data. These leaves need not be explicit in memory; a null child pointer can encode the fact that the child is a leaf. However, representing leaves as explicit nodes simplifies some of the algorithms that operate on red-black trees. To save memory, a single sentinel node sometimes plays the role of all leaves, and all references from internal nodes to leaves point to that sentinel. (Source: Wikipedia.)

CFS uses fair_clock and wait_runtime to keep all runnable tasks sorted in an rbtree (see the Red-Black Tree sidebar) by the key rq->fair_clock - p->wait_runtime. The leftmost task in the tree is therefore the one most in need of the CPU, and CFS picks the leftmost task to run. As the system makes progress, newly woken tasks are put into the tree further and further to the right, slowly but surely giving every task a chance to become the leftmost task and thus get onto the CPU within a bounded amount of time.
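A toy sketch of this selection rule is shown below, with a linear scan standing in for the kernel’s red-black tree (the tree only makes finding this minimum-key task an O(log n) operation); the task names and numbers are invented for illustration:

```c
#include <stdio.h>

struct task {
    const char *name;
    long wait_runtime;   /* time the task is owed */
};

/* In the kernel, tasks sit in an rbtree keyed by rq->fair_clock - p->wait_runtime;
 * the leftmost node has the smallest key.  A linear scan gives the same answer here. */
static struct task *pick_leftmost(struct task *t, int n, long fair_clock)
{
    struct task *best = &t[0];
    for (int i = 1; i < n; i++)
        if (fair_clock - t[i].wait_runtime < fair_clock - best->wait_runtime)
            best = &t[i];
    return best;
}

int main(void)
{
    struct task tasks[] = { {"editor", 900}, {"compiler", 100}, {"browser", 2500} };
    long fair_clock = 10000;

    /* The browser has waited longest, so it has the smallest key and runs next. */
    printf("next task: %s\n", pick_leftmost(tasks, 3, fair_clock)->name);
    return 0;
}
```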

Because of this simple design, CFS no longer uses active and expired arrays, nor does it use complex heuristics to mark tasks as interactive and non-interactive.

CFS implements priorities by using weighted tasks: each task is assigned a weight based on its static priority. While running, a task with a lower weight (lower priority) sees its time pass faster than a task with a higher priority. Its wait_runtime is therefore exhausted more quickly than that of a higher-priority task, so a low-priority task gets less CPU time than a high-priority task.

Kernel 2.6.24

CFS was further modified in kernel 2.6.24. While the basic concept of fairness remains, some implementation details have changed. Instead of chasing a global fair clock (rq->fair_clock), tasks now chase each other. A per-task clock, vruntime, was introduced, and it is initialized for a new task using an approximate average of the existing tasks’ clocks. Each task tracks its own run time in this variable and is queued in the rbtree using it. So the task that has run the least (the one with the greatest need for the CPU) is the leftmost node of the rbtree and will be picked by the scheduler. See Resources for more details on this implementation.
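A simplified sketch of the 2.6.24 idea follows. The nice-0 weight constant and the scaling mirror the commonly described vruntime formula, but treat the exact values as illustrative rather than the kernel’s precise arithmetic:

```c
#include <stdio.h>

#define NICE_0_WEIGHT 1024   /* weight of a nice-0 task; heavier tasks age their clock more slowly */

struct task {
    const char *name;
    unsigned long weight;    /* derived from the task's static priority */
    unsigned long long vruntime;
};

/* Charge delta_exec nanoseconds of real execution to the task's virtual clock.
 * High-weight (high-priority) tasks accumulate vruntime more slowly, so they
 * stay further to the left in the rbtree and get more CPU time. */
static void update_vruntime(struct task *t, unsigned long long delta_exec)
{
    t->vruntime += delta_exec * NICE_0_WEIGHT / t->weight;
}

int main(void)
{
    struct task hi = { "high-prio", 2048, 0 };
    struct task lo = { "low-prio",   512, 0 };

    update_vruntime(&hi, 1000000);   /* both run for 1 ms of real time */
    update_vruntime(&lo, 1000000);

    /* The scheduler would next pick the task with the smaller vruntime. */
    printf("%s vruntime=%llu, %s vruntime=%llu\n",
           hi.name, hi.vruntime, lo.name, lo.vruntime);
    return 0;
}
```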

Another major addition to CFS in kernel 2.6.24 is group scheduling. Plain CFS tries to be fair to all the tasks running on the system. Suppose, for example, that there are 25 runnable processes on the system. CFS tries to be fair by allocating 4% of the CPU to each of them. However, suppose that of those 25 processes, 20 belong to user A and 5 belong to user B. User B is then at an inherent disadvantage, because user A gets more CPU power than user B. Group scheduling tries to eliminate this problem: it is fair to groups first and then to the individual tasks within each group. With group scheduling enabled, CFS allocates 50% of the CPU to each of users A and B. User A’s 50% share is divided equally among A’s 20 tasks, while the other 50% of CPU time is divided equally among B’s 5 tasks.
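Using the numbers from the example above, the two-level split can be checked with a few lines of arithmetic (the code is purely illustrative, not the kernel’s group-scheduling implementation):

```c
#include <stdio.h>

int main(void)
{
    /* Example from the text: 25 runnable tasks, 20 owned by user A, 5 by user B. */
    int tasks_a = 20, tasks_b = 5;
    double total_tasks = tasks_a + tasks_b;

    /* Plain CFS: every task gets an equal share of the CPU. */
    printf("plain CFS:  each task gets %.1f%%\n", 100.0 / total_tasks);

    /* Group scheduling: first split the CPU between the two groups,
     * then split each group's share among its own tasks. */
    double group_share = 100.0 / 2;
    printf("group CFS:  A's tasks get %.1f%% each, B's tasks get %.1f%% each\n",
           group_share / tasks_a, group_share / tasks_b);
    return 0;
}
```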

In kernel 2.6.23, the Linux scheduler was also modularized. Each scheduling policy (SCHED_FIFO, SCHED_RR, SCHED_OTHER, and so on) can be implemented independently of the base scheduler code. This modularity is similar to an object-oriented class hierarchy (Figure 3).

Figure 3. Modular scheduler

The core scheduler does not need to know the implementation details of the individual scheduling policies. In kernel 2.6.23, sched.c (the “scheduler” of older kernels) was split into the following files to make the scheduler modular:

kernel/sched.c: contains the code of the generic scheduler, exposing functions such as schedule(). The specific scheduling policies are implemented in separate files.

kernel/sched_fair.c: the main file implementing the CFS scheduler; it provides the SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE scheduling policies.

kernel/sched_rt.c: provides the SCHED_RR and SCHED_FIFO policies used by real-time (RT) threads.

Each of these scheduling policies (fair and RT) registers its function pointers with the core scheduler. The core scheduler invokes the appropriate scheduler (fair or RT) based on the scheduling policy of the particular process. As with the O(1) scheduler, real-time processes have higher priority than normal processes. CFS mainly addresses non-real-time processes, and the RT scheduler remains more or less the same as before (apart from some changes to how the active/expired arrays are maintained).
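A hedged sketch of the “register your function pointers” idea is shown below. The struct loosely echoes the kernel’s scheduling-class interface but is simplified; the names and behavior here are illustrative, not the exact kernel API:

```c
#include <stdio.h>

/* Simplified stand-in for the kernel's scheduling-class interface: each policy
 * module fills in its own function pointers and the core scheduler calls them. */
struct sched_class_sketch {
    const char *name;
    void (*enqueue_task)(void);
    void (*pick_next_task)(void);
};

static void fair_enqueue(void)  { printf("CFS: insert task into the rbtree\n"); }
static void fair_pick(void)     { printf("CFS: pick the leftmost rbtree node\n"); }

static void rt_enqueue(void)    { printf("RT: add task to its priority queue\n"); }
static void rt_pick(void)       { printf("RT: pick the highest-priority runnable task\n"); }

static const struct sched_class_sketch fair_class = { "fair", fair_enqueue, fair_pick };
static const struct sched_class_sketch rt_class   = { "rt",   rt_enqueue,   rt_pick   };

/* The core scheduler only sees the class of the current policy, never its internals. */
static void schedule_with(const struct sched_class_sketch *class)
{
    class->enqueue_task();
    class->pick_next_task();
}

int main(void)
{
    schedule_with(&rt_class);    /* a SCHED_FIFO/SCHED_RR task goes through the RT class  */
    schedule_with(&fair_class);  /* a SCHED_NORMAL task goes through the CFS (fair) class */
    return 0;
}
```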

With this new modular scheduler, people who want to write new schedulers for a particular policy can do so simply by registering these new policy functions with the core scheduler.

The CFS design is a fairly radical and innovative generalization of the scheduler, and features such as the modular scheduler make it simple to integrate new scheduler types into the core scheduler.

Resources

Linux kernel source code (2.6.23/2.6.24) : sched-design-cfs.txt by Ingo Molnar

IBM Developer Works article Multiprocessing with the Completely Fair Scheduler: kerneltrap.org/node/8059

CFS related blog posts: immike.net/blog/2007/0…


About the author

Hi, I’m Xiaoliang. I won a medal at an ICPC Asia regional contest and now work at Tencent. Knowing how important algorithms and computer-science fundamentals are, I started the WeChat public account “Xiaoliang programming”, which focuses on exactly this kind of foundational knowledge, and I look forward to studying together with you. Reprint notice: reprinting without permission is prohibited.