The Linux process scheduler has evolved through the following stages:

- Time-slice-based round-robin scheduling (versions prior to 2.6).
- The O(1) scheduling algorithm (versions prior to 2.6.23).
- The completely fair scheduling algorithm (2.6.23 and later).
Earlier I wrote an analysis of the O(1) scheduling algorithm: O(1) scheduling algorithms. This article focuses on the completely fair scheduling algorithm (CFS) that Linux uses today.
Before analyzing its implementation, let's first understand the basic principles of the completely fair scheduling algorithm.
Basic principles of completely fair scheduling algorithm
The completely fair scheduling algorithm aims to treat every process fairly. So how can complete fairness be achieved? A relatively simple method is to let each process run for a time slice of the same length; this is the time-slice-based round-robin scheduling algorithm. As shown below:
As shown in the figure above, process 1 first gets the CPU from time0 to time1. When its time slice is used up, process 2 gets the CPU from time1 to time2. When process 2's time slice is used up, process 3 gets the CPU from time2 to time3.
And so on. Since every time slice is equal, each process theoretically gets the same amount of CPU time. This algorithm looks reasonable, but it has two problems:
- It cannot assign different amounts of running time by weight; for example, some important processes should get more running time than others.
- On every scheduling decision, all processes in the run queue must be traversed to find the one with the highest priority, which takes O(n) time. This is unacceptable for a high-performance operating system.
To solve these two problems, the developers of the Linux kernel created a completely fair scheduling algorithm.
The two times of completely fair scheduling
To implement the completely fair scheduling algorithm, two times are defined for each process:
- Actual running time:

  Actual running time = scheduling period × process weight / total weight of all processes

  where:

  - Scheduling period: the length of time within which all runnable processes run once.
  - Process weight: a value assigned to each process according to its importance.
For example, if the scheduling period is 30ms, the weight of process A is 1, and the weight of process B is 2, then:
The actual running time of process A is 30ms × 1 / (1 + 2) = 10ms.
The actual running time of process B is 30ms × 2 / (1 + 2) = 20ms.
- Virtual running time:

  Virtual running time = actual running time × 1024 / process weight
                       = (scheduling period × process weight / total weight of all processes) × 1024 / process weight
                       = scheduling period × 1024 / total weight of all processes
As the formula shows, within one scheduling period the virtual running time of every process is the same. Therefore, when scheduling, the kernel only needs to find the process with the smallest virtual running time and run it.
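The two formulas above can be sketched in plain C. This is an illustrative sketch, not kernel code: the helper names `actual_runtime` and `virtual_runtime` are hypothetical, and 1024 is the kernel's NICE_0_LOAD constant.

```c
#include <assert.h>

/* NICE_0_LOAD is the kernel's weight for a nice-0 process. */
#define NICE_0_LOAD 1024UL

/* Actual running time = scheduling period * process weight / total weight. */
static unsigned long actual_runtime(unsigned long period,
                                    unsigned long weight,
                                    unsigned long total_weight)
{
    return period * weight / total_weight;
}

/* Virtual running time = actual running time * 1024 / process weight. */
static unsigned long virtual_runtime(unsigned long actual,
                                     unsigned long weight)
{
    return actual * NICE_0_LOAD / weight;
}
```

With the 30ms example above, `actual_runtime(30, 1, 3)` gives 10 and `actual_runtime(30, 2, 3)` gives 20, yet both processes end up with the same virtual running time (10240), which is exactly the property the scheduler exploits.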
In order to quickly find the process with the smallest virtual running time, the Linux kernel uses a red-black tree to store runnable processes, and the key for comparison is the virtual running time of the process.
If you don’t know about red-black trees, you can just think of it as an auto-sorting container. As shown below:
As shown in the figure above, the left node of a red-black tree is smaller than the parent, while the right node is larger than the parent. Therefore, when searching for the smallest node, we only need to obtain the leftmost node of the red-black tree, and the time complexity is O(logN).
The two objects of completely fair scheduling
To implement the completely fair scheduling algorithm, the Linux kernel defines two objects: cfs_rq (the runnable process queue) and sched_entity (the scheduling entity).

- cfs_rq (runnable process queue): uses a red-black tree structure to hold the runnable processes in the kernel.
- sched_entity (scheduling entity): an entity that the kernel can schedule; if group scheduling is ignored (it is not covered in this article), a scheduling entity can be regarded as a process.
cfs_rq object definition

struct cfs_rq {
    struct load_weight load;
    unsigned long nr_running;

    u64 exec_clock;                 // current clock
    u64 min_vruntime;

    struct rb_root tasks_timeline;  // root of the red-black tree of runnable processes
    struct rb_node *rb_leftmost;    // cached leftmost node of the red-black tree, used to quickly get the smallest node
    ...
};

For the cfs_rq object, the member to focus on for now is tasks_timeline, which holds the root node of the red-black tree of the runnable process queue.
sched_entity object definition

struct sched_entity {
    struct load_weight load;
    struct rb_node run_node;        // links the entity into the run queue's red-black tree
    struct list_head group_node;
    unsigned int on_rq;

    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;

    u64 last_wakeup;
    u64 avg_overlap;
    ...
};

For the sched_entity object, the member to focus on for now is run_node, which links the scheduling entity into the runnable process queue.
In addition, the process descriptor task_struct contains a member variable se of type sched_entity, as follows:

struct task_struct {
    ...
    struct sched_entity se;
    ...
};
So, the relationship between them is as follows:
As you can see from the figure above, all sched_entity objects are linked into a red-black tree through their run_node members, and the root of the tree is stored in the tasks_timeline member of the cfs_rq object.
In addition, since the process descriptor task_struct contains a member variable se of type sched_entity, a process can be added to the runnable queue through its se field.
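Going from a sched_entity pointer back to its enclosing task_struct relies on the container_of() idiom, which the rb_entry() macro used later in this article is built on. Here is a minimal sketch with simplified stand-in structures; the task_of() helper and the field layout are illustrative, not the kernel's exact definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures discussed above. */
struct sched_entity {
    unsigned long vruntime;
};

struct task_struct {
    int pid;
    struct sched_entity se;   /* embedded in the process descriptor */
};

/* The container_of() idea: given a pointer to an embedded member,
 * subtract the member's offset to recover the enclosing object. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct task_struct *task_of(struct sched_entity *se)
{
    return container_of(se, struct task_struct, se);
}
```

This is why the red-black tree can store bare rb_node/sched_entity pointers and still hand the scheduler back a full process descriptor.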
Completely fair scheduling algorithm implementation
With that background, it's time to analyze how the completely fair scheduling algorithm is implemented in the Linux kernel.
Let’s start by looking at how to update the virtual running time of a process.
1. Update the virtual running time of the process
Updating the virtual running time of a process is done by the __update_curr() function, which is implemented as follows:
/src/kernel/sched_fair.c

static inline void __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
                                 unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    curr->sum_exec_runtime += delta_exec;   // update the total actual running time

    delta_exec_weighted = delta_exec;
    if (unlikely(curr->load.weight != NICE_0_LOAD)) {
        // scale the running time by the process's weight
        delta_exec_weighted = calc_delta_fair(delta_exec_weighted, &curr->load);
    }

    curr->vruntime += delta_exec_weighted;  // update the virtual running time of the process
}
The arguments of __update_curr() have the following meanings:

- cfs_rq: the runnable queue object.
- curr: the scheduling entity of the current process.
- delta_exec: the actual running time.
The __update_curr() function does the following:

- Updates the total actual running time of the process scheduling entity.
- Calculates the virtual running time consumed, based on the weight of the scheduling entity.
- Adds the calculated virtual running time to the vruntime field of the scheduling entity.
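The steps above can be sketched outside the kernel as follows. The helper names (weighted_delta, update_curr_sketch) are hypothetical, and the scaling in weighted_delta() merely stands in for what calc_delta_fair() computes; NICE_0_LOAD is the kernel's default weight, 1024.

```c
#include <assert.h>

#define NICE_0_LOAD 1024UL

/* A process at the default weight accrues vruntime at wall-clock rate;
 * a heavier process accrues it more slowly, a lighter one faster. */
static unsigned long weighted_delta(unsigned long delta_exec,
                                    unsigned long weight)
{
    if (weight == NICE_0_LOAD)
        return delta_exec;                     /* fast path, no division */
    return delta_exec * NICE_0_LOAD / weight;  /* the calc_delta_fair() idea */
}

/* Mirrors the two updates done by __update_curr(). */
static void update_curr_sketch(unsigned long *sum_exec_runtime,
                               unsigned long *vruntime,
                               unsigned long delta_exec,
                               unsigned long weight)
{
    *sum_exec_runtime += delta_exec;                 /* total actual time */
    *vruntime += weighted_delta(delta_exec, weight); /* virtual time */
}
```

So a process with twice the default weight accumulates vruntime at half the rate, letting it run roughly twice as long before another process's smaller vruntime wins the next pick.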
We then examine how processes are added to the run queue.
2. Add the process scheduling entity to the run queue
To add a process scheduling entity to the run queue, the kernel calls the __enqueue_entity() function, which is implemented as follows:
/src/kernel/sched_fair.c
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node; // root of the red-black tree
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se); // virtual running time of the entity being inserted
    int leftmost = 1;

    while (*link) { // find the insertion point in the run queue's red-black tree
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        if (key < entity_key(cfs_rq, entry)) { // compare virtual running times
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    if (leftmost) {
        cfs_rq->rb_leftmost = &se->run_node; // cache the leftmost node of the tree
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, se->vruntime);
    }

    // link the node into the tree and rebalance it (see the red-black tree implementation)
    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}
The __enqueue_entity() function does the following:

- Gets the root node of the run queue's red-black tree.
- Gets the virtual running time of the current process scheduling entity.
- Inserts the scheduling entity into the red-black tree (see the red-black tree insertion algorithm).
- Caches the leftmost node of the red-black tree.
- Rebalances the red-black tree (see the red-black tree balancing algorithm).
After __enqueue_entity() is called, the process scheduling entity has been inserted into the run queue's red-black tree. At the same time, the leftmost node of the tree is cached in the rb_leftmost field of the run queue, which is used to quickly obtain the next runnable process.
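The caching trick can be sketched with an ordinary (unbalanced) binary search tree standing in for the red-black tree; the struct and function names here are illustrative, not kernel code. The key observation carries over directly: if the descent never turns right, the new entity has the smallest vruntime, so the cached leftmost pointer must be updated.

```c
#include <assert.h>
#include <stddef.h>

/* A stand-in run queue: a plain BST keyed by vruntime, plus a cached
 * pointer to the smallest node, mirroring cfs_rq->rb_leftmost. */
struct entity {
    unsigned long vruntime;
    struct entity *left, *right;
};

struct run_queue {
    struct entity *root;
    struct entity *leftmost;  /* cache of the smallest-vruntime entity */
};

static void enqueue(struct run_queue *rq, struct entity *se)
{
    struct entity **link = &rq->root;
    int leftmost = 1;

    while (*link) {           /* descend, like __enqueue_entity() */
        if (se->vruntime < (*link)->vruntime) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            leftmost = 0;     /* turned right at least once */
        }
    }
    *link = se;

    if (leftmost)             /* new smallest vruntime: refresh the cache */
        rq->leftmost = se;
}

/* Thanks to the cache, picking the next entity is O(1). */
static struct entity *pick_next(struct run_queue *rq)
{
    return rq->leftmost;
}
```

pick_next() returns the cached pointer in constant time, which mirrors how the kernel reads cfs_rq->rb_leftmost instead of walking the tree on every pick.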
3. Obtain the next runnable process from the runnable queue
To get the next runnable process from the run queue, the kernel calls the __pick_next_entity() function, which is implemented as follows:
/src/kernel/sched_fair.c

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    // 1. Call first_fair() to get the leftmost node of the red-black tree.
    // 2. Call rb_entry() to get the scheduling entity that contains the node.
    return rb_entry(first_fair(cfs_rq), struct sched_entity, run_node);
}

static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)
{
    return cfs_rq->rb_leftmost; // return the cached leftmost node of the red-black tree
}
As mentioned earlier, the leftmost node of the red-black tree is the process with the smallest virtual running time, so the flow of __pick_next_entity() is very simple:

- first_fair() is called to get the leftmost node of the red-black tree.
- rb_entry() is then called to get the scheduling entity that contains the node.
Scheduling time
Now that we know how the completely fair scheduling algorithm adds processes to the runnable queue and gets the next runnable process from it, when does the Linux kernel actually schedule processes?
The answer: scheduling is triggered by the clock interrupt in the Linux kernel.
What is a clock interrupt? If you are not familiar with operating system internals, you may not have come across this term, but that doesn't matter: you can simply think of it as a timer that calls a callback function at a fixed interval to handle an event.
Like the heartbeat of Linux, the clock interrupt triggers a call to the scheduler_tick() function at regular intervals, as follows:
void scheduler_tick(void)
{
    ...
    curr->sched_class->task_tick(rq, curr, 0); // for CFS this calls task_tick_fair()
    ...
}
scheduler_tick() calls task_tick_fair(), which in turn calls entity_tick() to do the scheduling work. The call chain is as follows:
scheduler_tick() -> task_tick_fair() -> entity_tick()
The entity_tick() function is implemented as follows:
static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    // update the virtual running time of the current process
    update_curr(cfs_rq);
    ...
    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr); // check whether process scheduling is required
}
The entity_tick() function does the following:

- Calls update_curr() to update the virtual running time of the process, as described earlier.
- Calls check_preempt_tick() to determine whether process scheduling is required.
We then examine the implementation of the check_preempt_tick() function:
static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;

    // the time slice the current process is entitled to
    ideal_runtime = sched_slice(cfs_rq, curr);

    // the actual running time of the process in the current scheduling period
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    // if the actual running time exceeds the time slice, reschedule
    if (delta_exec > ideal_runtime)
        resched_task(rq_of(cfs_rq)->curr);
}
The check_preempt_tick() function does the following:

- Calls sched_slice() to calculate the time slice available to the current process.
- Gets the actual running time of the current process in the current scheduling period.
- Calls resched_task() to trigger process scheduling if the actual running time is greater than the available time slice.
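The steps above can be sketched as plain arithmetic. The helper names (sched_slice_sketch, should_resched) are hypothetical, and sched_slice_sketch() stands in for the weight-proportional slice that the kernel's sched_slice() computes.

```c
#include <assert.h>

/* Stand-in for sched_slice(): the process's share of the period,
 * proportional to its weight. */
static unsigned long sched_slice_sketch(unsigned long period,
                                        unsigned long weight,
                                        unsigned long total_weight)
{
    return period * weight / total_weight;
}

/* Stand-in for the check in check_preempt_tick(): returns nonzero if
 * the process has exhausted its slice in the current period. */
static int should_resched(unsigned long sum_exec_runtime,
                          unsigned long prev_sum_exec_runtime,
                          unsigned long ideal_runtime)
{
    unsigned long delta_exec = sum_exec_runtime - prev_sum_exec_runtime;
    return delta_exec > ideal_runtime;
}
```

With the earlier 30ms example, process B's slice is 20ms; after it has run 15ms in the current period no preemption is requested, but past 20ms the next tick marks it for rescheduling.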
As you can see, process scheduling can be triggered from the clock interrupt handler. Besides the clock interrupt, operations that voluntarily give up the CPU (such as some I/O operations) also trigger process scheduling; that path is not analyzed in detail here, and interested readers can explore it on their own.