By: no dishes studio – Marklux


Reference: marklux.cn/blog/64

Copyright belongs to the author; please credit the source when reposting.

This article is a set of reading notes on chapter 4 of "Linux Kernel Design and Implementation"; the code is excerpted from version 4.6 of the Linux source on GitHub.

Multitasking

Concurrency and parallelism

Linux, as a multitasking operating system, must support concurrent execution of programs.

Classification

  1. Non-preemptive multitasking

    The task will continue until it ends on its own.

  2. Preemptive multitasking (Linux)

    Here it is the scheduler that decides when a process stops running, and this forced suspension is known as "preemption." Preemptive multitasking rests on a time-slice mechanism that allocates each process a unit of time in which it may run.

Linux Process Scheduling

Development history

Linux introduced the O(1) scheduler in version 2.5, and then brought the idea of fair scheduling into the 2.6 series, replacing the previous scheduler with what is called CFS (the Completely Fair Scheduler).

Strategy

I/O-bound and processor-bound processes

I/O-bound processes spend most of their time waiting on I/O requests, while processor-bound processes spend most of their time executing code and typically keep running unless preempted.

To ensure the responsiveness of interactive applications and desktop systems, Linux generally prefers to schedule I/O-bound processes first.

Process priority

Linux uses two separate priority ranges.

  1. Nice values: a larger nice value means a lower priority (the range is -20 to +19).

  2. Real-time priority: configurable (0 to 99 by default); a higher value means a higher process priority.

    Any real-time process has a higher priority than any normal process, so the two priority ranges do not overlap.
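As a quick user-space illustration of the two ranges, here is a hedged sketch (error handling omitted; the exact bounds printed depend on the system, and the nice(2) and sched_get_priority_*(2) calls are standard POSIX/Linux APIs):

#include <stdio.h>
#include <sched.h>
#include <unistd.h>

int main(void)
{
	/* Normal range: raise our own nice value by 5 (lower priority). */
	int n = nice(5);
	printf("nice value is now %d\n", n);

	/* Real-time range: query the SCHED_FIFO priority bounds. */
	printf("SCHED_FIFO real-time priority: %d..%d\n",
	       sched_get_priority_min(SCHED_FIFO),
	       sched_get_priority_max(SCHED_FIFO));
	return 0;
}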

A note on time slices: Linux does not allocate time slices as a fixed quantum (such as 10 ms); instead it allocates each process a proportion of the processor as its "time slice." The actual CPU time a process receives is therefore closely tied to the system load.

Preemption timing in Linux depends on the proportion of the processor a newly runnable process has consumed: if it has consumed a smaller proportion than the currently running process, it preempts and runs immediately; otherwise its execution is deferred.

An example

Now let's look at a simple example. Assume our system has only two running processes: a text editor (I/O-bound) and a video decoder (processor-bound).

Ideally, the text editor should get more processor time, and at least the processor should be allocated to it immediately when it needs it (so that the user can interact with it), which means that when the text editor is awakened, it should preempt the video decoder.

Normally, an OS would assign the text editor a higher priority and more time slice, but in Linux both are normal processes with the same nice value, so they get the same proportion of the processor (50% each).

But what happens in practice? CFS notices that the text editor uses far less processor time than it has been allocated (because it spends most of its time waiting for I/O). To share the processor "fairly" between the two, CFS lets the text editor preempt the video decoder whenever it needs to run (every time).

Linux Scheduling Algorithm

The scheduler class

The Linux scheduler is provided as a module that allows different types of processes to choose different scheduling algorithms according to their needs.

The CFS algorithm described above is the scheduler class for normal processes. The base scheduler iterates over the scheduler classes in priority order; the highest-priority class that has a runnable process wins and selects the next process to execute.

Process scheduling in Unix

Existing problems:

  1. Nice values must be mapped onto absolute processor time. Two processes of the same priority split the processor evenly, but how often they context-switch depends on the absolute size of their time slices, which can vary wildly. Lower-priority processes are allocated smaller time slices, yet in practice they often run large background computations, so this allocation makes little sense.

  2. Problems caused by relative nice values: the time slices two processes receive depend not on the difference between their nice values but on the absolute values. For example, processes with nice values 18 and 19 might be allocated 10 ms and 5 ms (a 2:1 split), while processes with nice values 0 and 1 get 100 ms and 95 ms (nearly equal). This mapping is not reasonable.

  3. If nice values are mapped to time slices, the time slice must be an "absolute" quantity the kernel can measure (this relates to timers and ticks), which in practice means a multiple of the timer tick. The time slice therefore changes with the timer tick period, and the same nice value can map to different amounts of processor time.

  4. To get freshly woken processes running sooner, schedulers tend to boost the priority of new wakeups, but doing so can break "fairness."

To address these issues, CFS radically redesigned time-slice allocation, abandoning fixed time slices in favor of assigning each process a weighted proportion of the processor.

Fair Scheduling (CFS)

Starting point: process scheduling should behave as if the system had an ideal multitasking processor, one on which we could run every process for an infinitesimally small period, so that over any measurable interval each of the n runnable processes gets an equal 1/n share of processor time.

An example that distinguishes Unix scheduling from CFS: two processes run at the same priority. Under Unix each might get 5 ms, fully occupying the processor while it runs; in the "ideal" case, both would run simultaneously over the whole 10 ms, each using half of the processor's power.

CFS approximates this by computing how long a process should run from the total number of runnable processes. The nice value is no longer the basis for time-slice allocation; instead it acts as a weight that adjusts the proportion of processor time a process receives.

Next we consider the scheduling period. In theory, the smaller the scheduling period, the closer we get to "perfect scheduling," but in practice a tiny period inevitably incurs heavy context-switch costs. CFS therefore sets an approximate target for the smallest achievable scheduling period, called the "target latency," and sets a floor on the time slice any process can receive, the minimum granularity (typically 1 ms), to keep context-switch overhead acceptable.
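To make this concrete (the numbers here are purely illustrative), within one scheduling period each process's slice is its weight share of the period:

	slice(i) = period × w(i) / W(queue)

where w(i) is the process's weight and W(queue) is the sum of all runnable weights (both defined more precisely below). With a 20 ms target latency and five runnable nice-0 processes, each gets 20 ms × 1/5 = 4 ms. With 40 such processes the naive slice (0.5 ms) would fall below the 1 ms minimum granularity, so the period is stretched (to roughly 40 ms) rather than letting slices shrink further.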

CFS is fair whenever each process's average slice stays above the minimum granularity. The nice value determines the percentage of processor time a process receives within the current scheduling period, so two processes whose nice values differ by the same amount always split the period in the same ratio. Assume a scheduling period of 20 ms and two processes whose nice values differ by 5:

  • The nice values of the two processes are 0 and 5, respectively. The latter gets 1/3 of the time slice of the former, so it finally gets 15ms and 5ms, respectively
  • The nice values of the two processes are 10 and 15, respectively. The latter gets 1/3 of the time slice of the former, and the final results are also 15ms and 5ms

This corollary may be a little hard to grasp, so let's dig deeper and see how the nice difference determines the processor-share ratio under the hood.

First, before actually computing a process's processor share, the kernel converts its nice value into a weight:

weight = 1024 / 1.25^nice

For example, the process with the default nice value gets a weight of 1024/(1.25^0) = 1024/1 = 1024.

This conversion guarantees positive weights and makes the effect of nice on the weight exponential.
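A throwaway user-space sketch to see the exponential effect; note that the real kernel avoids floating point and uses a precomputed integer table (sched_prio_to_weight[] in kernel/sched/core.c) instead:

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* weight = 1024 / 1.25^nice, per the formula above */
	for (int nice = -5; nice <= 5; nice++)
		printf("nice %3d -> weight %6.0f\n",
		       nice, 1024.0 / pow(1.25, nice));
	return 0;	/* compile with -lm */
}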

Ok, now suppose there are n processes in the runnable queue, and call the sum of their weights w(1) + w(2) + … + w(n) the queue weight W(queue); then any process i ends up with the share w(i) / W(queue).

Then it is not difficult to see that the ratio of processor time allocated to any two processes i and j is w(i) / w(j), and a little algebra shows this equals 1.25^(nice(j) − nice(i)). In other words, as long as the difference between two nice values is the same, the two processes always receive their processor shares in the same ratio, which resolves the relative-nice-value problem above.
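Spelling out that algebra:

	w(i) / w(j) = (1024 / 1.25^nice(i)) / (1024 / 1.25^nice(j)) = 1.25^(nice(j) − nice(i))

The 1024 factors cancel, leaving a ratio that depends only on the nice difference.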

The conversion formula is taken from: https://oakbytes.wordpress.com/2012/06/06/linux-scheduler-cfs-and-nice

In summary, the amount of processor time any process receives is determined only by the relative difference between its nice value and those of all other runnable processes, so we can say CFS at least guarantees each process a fair proportion of processor usage, a near-perfect approach to multitasking scheduling.

Implementation of Linux scheduling

Let's take a look at how CFS is implemented; the discussion falls into four main parts.

Time accounting

Every scheduler must account for process runtime; in other words, how much of its slice a process has used up in the current scheduling period (an important criterion for preemption decisions).

1. Scheduler entity structure

The data structure CFS uses to record a process's runtime is the "scheduler entity," defined in <linux/sched.h>:

struct sched_entity {
	struct load_weight	load;		/* weight, used for load balancing */
	unsigned long		runnable_weight;
	struct rb_node		run_node;	/* red-black tree node */
	struct list_head	group_node;
	unsigned int		on_rq;		/* whether on a runqueue */

	u64			exec_start;
	u64			sum_exec_runtime;
	u64			vruntime;
	u64			prev_sum_exec_runtime;

	u64			nr_migrations;

	struct sched_statistics	statistics;

	/* fields enabled only under certain config macros are omitted below */
};

Note: All the Linux source code used in this article is from the official Linux Git repository on Github (2018.01).

2. Virtual runtime (vruntime)

Now let's look at what the vruntime field in the structure above means. We call it the "virtual runtime": the process's actual runtime normalized (in other words, weighted) by the total number of runnable processes. It is measured in nanoseconds and is decoupled from timer ticks.

You can think of it as a new clock that CFS virtualizes in order to approximate ideal multitasking. Concretely, a process's vruntime increases as it runs, but the rate of increase is determined by its load weight.

The result: the higher the weight, the more slowly vruntime grows, and the smaller the accumulated virtual runtime. CFS uses vruntime to record how long a process has run and, from that, how much longer it ought to run.
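Conceptually the weighting looks like the following simplified sketch (this is a model, not the kernel's exact fixed-point code; calc_delta_fair() in kernel/sched/fair.c is the real implementation):

/* Simplified model: virtual runtime advances inversely to weight. */
static inline unsigned long long
vruntime_delta(unsigned long long delta_exec, unsigned long weight)
{
	const unsigned long NICE_0_LOAD = 1024;	/* weight of a nice-0 task */

	/* a nice-0 task accrues vruntime 1:1 with real time;
	 * heavier (higher-priority) tasks accrue it more slowly */
	return delta_exec * NICE_0_LOAD / weight;
}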

Let's look at the accounting function in the source (kernel/sched/fair.c):

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	/* How long the current entity has run since it last started */
	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	/* Convert the real runtime into weighted virtual runtime */
	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}

This function computes the current process's execution time and stores it in delta_exec, then calls calc_delta_fair() to convert it into the corresponding virtual runtime and update vruntime.

This function is called periodically by the system timer (regardless of the process's state), so vruntime always measures a process's weighted runtime accurately and indicates which process should run next.

Process selection

This is the core of scheduling. The CFS algorithm in one sentence: pick the process with the smallest vruntime as the next process to schedule.

To do this, CFS must of course maintain a queue of runnable processes (the textbook "ready queue"), which it organizes as a red-black tree.

Red-black trees are a well-known data structure. We won't go into the implementation and its many properties here (they are fairly involved); just remember: a red-black tree is a self-balancing binary search tree. It stores data in tree nodes, each indexed by a key, and retrieving a node by key takes time logarithmic in the number of nodes in the tree.

1. Find the next task node

Suppose a red-black tree holds all runnable processes in the system, keyed by their vruntime. To find the next process to schedule, CFS simply needs the node with the smallest key in the tree: the leftmost leaf node.

kernel/sched/fair.c:

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	struct sched_entity *left = __pick_first_entity(cfs_rq);
	struct sched_entity *se;

	/*
	 * If curr is set we have to see if its left of the leftmost entity
	 * still in the tree, provided there was anything in the tree at all.
	 */
	if (!left || (curr && entity_before(curr, left)))
		left = curr;

	se = left; /* ideally we run the leftmost entity */

	/*
	 * What follows mainly handles some special cases (the skip/last/next
	 * buddies); we will not discuss them in detail here.
	 */
	if (cfs_rq->skip == se) {
		struct sched_entity *second;

		if (se == curr) {
			second = __pick_first_entity(cfs_rq);
		} else {
			second = __pick_next_entity(se);
			if (!second || (curr && entity_before(curr, second)))
				second = curr;
		}

		if (second && wakeup_preempt_entity(second, left) < 1)
			se = second;
	}

	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	clear_buddies(cfs_rq, se);

	return se;
}
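The leftmost-node lookup itself is a small helper. In the same era of sources it looks like this; rb_first_cached() returns a cached pointer to the leftmost node, so the lookup is effectively O(1):

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, run_node);
}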

2. Add a new process to the queue

Inserting a new node into the runnable queue means a process has just become runnable, either because it was woken from a blocked state or because it was newly forked.

The process of enqueuing it is essentially the process of inserting a new node into a red-black tree:

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
	bool curr = cfs_rq->curr == se;

	/*
	 * If the entity being enqueued is the one currently running,
	 * re-normalize its vruntime first, then update the runtime
	 * statistics of the current task.
	 */
	if (renorm && curr)
		se->vruntime += cfs_rq->min_vruntime;

	update_curr(cfs_rq);

	/*
	 * Otherwise, renormalise after, such that we're placed at the current
	 * moment in time, instead of some random moment in the past. Being
	 * placed in the past could significantly boost this task to the
	 * fairness detriment of existing tasks.
	 */
	if (renorm && !curr)
		se->vruntime += cfs_rq->min_vruntime;

	/*
	 * Update the load averages and statistics of the scheduler entity.
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
	update_cfs_group(se);
	enqueue_runnable_load_avg(cfs_rq, se);
	account_entity_enqueue(cfs_rq, se);

	if (flags & ENQUEUE_WAKEUP)
		place_entity(cfs_rq, se, 0);

	check_schedstat_required();
	update_stats_enqueue(cfs_rq, se, flags);
	check_spread(cfs_rq, se);
	if (!curr)
		__enqueue_entity(cfs_rq, se);
	se->on_rq = 1;

	if (cfs_rq->nr_running == 1) {
		list_add_leaf_cfs_rq(cfs_rq);
		check_enqueue_throttle(cfs_rq);
	}
}

The above function mainly updates the running time and various statistics, and then calls __enqueue_entity() to actually insert the data into the red-black tree:

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	bool leftmost = true;

	/*
	 * Search for the appropriate position in the red-black tree.
	 */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		/*
		 * Nodes with the same key value are grouped together.
		 */
		if (entity_before(se, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;
		}
	}

	rb_link_node(&se->run_node, parent, link);
	rb_insert_color_cached(&se->run_node,
			       &cfs_rq->tasks_timeline, leftmost);
}

The while() loop walks down the tree looking for the insertion point, i.e. a standard binary-search descent. The node is inserted by calling rb_link_node() on the parent at the insertion position, and rb_insert_color_cached() then restores the tree's self-balancing properties.

3. Remove the process from the queue

A node can be removed from the queue either if the process exits or if it is blocked.

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update the runtime statistics of the current task.
	 */
	update_curr(cfs_rq);

	/*
	 * When dequeuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Substract its load from the cfs_rq->runnable_avg.
	 *   - Substract its previous weight from cfs_rq->load.weight.
	 *   - For group entity, update its weight to reflect the new share
	 *     of its group cfs_rq.
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG);
	dequeue_runnable_load_avg(cfs_rq, se);

	update_stats_dequeue(cfs_rq, se, flags);

	clear_buddies(cfs_rq, se);

	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);
	se->on_rq = 0;
	account_entity_dequeue(cfs_rq, se);

	/*
	 * Re-normalize vruntime.
	 */
	if (!(flags & DEQUEUE_SLEEP))
		se->vruntime -= cfs_rq->min_vruntime;

	/* return excess runtime on last dequeue */
	return_cfs_rq_runtime(cfs_rq);

	update_cfs_group(se);

	/*
	 * Now advance min_vruntime if @se was the entity holding it back,
	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
	 * put back on, and if we advance min_vruntime, we'll be placed back
	 * further than we started -- ie. we'll be penalized.
	 */
	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
		update_min_vruntime(cfs_rq);
}

As with insertion, the actual work on the tree node is done by __dequeue_entity():

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}


You can see that deleting a node is much simpler than inserting one, thanks to rb_erase_cached(), provided by the red-black tree implementation itself.

Scheduler entry

As mentioned above, whenever process scheduling happens there is a single entry point, from which the scheduler class that actually needs to run is selected.

That entry point is the kernel function schedule(), which finds the highest-priority scheduler class that has a runnable process of its own and asks it which process should run next.

The only important thing this function does is call pick_next_task() (defined in kernel/sched/core.c), which checks each scheduler class in priority order and selects the highest-priority process from the highest-priority class that has one.

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: if all runnable processes are normal processes,
	 * go straight to the normal-process scheduler class (CFS).
	 */
	if (likely((prev->sched_class == &idle_sched_class ||
		    prev->sched_class == &fair_sched_class) &&
		   rq->nr_running == rq->cfs.h_nr_running)) {

		p = fair_sched_class.pick_next_task(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
			goto again;

		/* Assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev, rf);

		return p;
	}

again:
	/* Check each scheduler class in priority order. */
	for_each_class(class) {
		p = class->pick_next_task(rq, prev, rf);
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}

	/* The idle class should always have a runnable task: */
	BUG();
}

Each scheduler class implements pick_next_task(), which returns a pointer to the next runnable process, or NULL if it has none. The scheduler entry point takes the answer from the first class that returns non-NULL.
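For reference, a scheduler class is essentially a struct of function pointers. An abbreviated sketch (most hooks omitted; see kernel/sched/sched.h in the same sources for the full definition):

struct sched_class {
	const struct sched_class *next;	/* next class, in priority order */

	void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

	/* returns the next task to run, or NULL if this class has none */
	struct task_struct *(*pick_next_task)(struct rq *rq,
					      struct task_struct *prev,
					      struct rq_flags *rf);

	/* ... many more hooks omitted ... */
};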

Sleep and wake up

The sleep and wake flow in Linux looks like this:

  • Sleep: the process marks itself as sleeping, removes itself from the runnable red-black tree, puts itself on a wait queue, and then calls schedule() to pick and run another process.
  • Wake up: The process is set to the executable state and then moved from the wait queue to the executable red-black tree.

Linux has two sleep states: TASK_UNINTERRUPTIBLE, which ignores signals, and TASK_INTERRUPTIBLE, which wakes up early and responds when it receives a signal. Processes in both states sit on the same wait queue.

1. Wait queues

Unlike the elaborate structure of the runnable queue, a wait queue in Linux is implemented as a simple linked list. All wait-queue data structures are defined in include/linux/wait.h, with the implementation in kernel/sched/wait.c.

The kernel represents a wait queue with the wait_queue_head_t structure, which is essentially the head node of a linked list plus a spin lock to keep the list consistent (wait queues can be modified at any time, including from interrupt context).

struct wait_queue_head {
	spinlock_t		lock;
	struct list_head	head;
};
typedef struct wait_queue_head wait_queue_head_t;

To sleep, a process puts itself on a wait queue, which is done through the kernel's recommended helper functions.

One possible flow is as follows:

  1. Call the macro DEFINE_WAIT() to create a wait-queue entry (a list node)
  2. Call add_wait_queue() to add itself to the queue. The queue will wake the process when the condition it waits for is satisfied, but the wakeup action itself must be defined elsewhere (think of it as a callback).
  3. Call prepare_to_wait() to change the process state to one of the two sleep states described above.
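Putting the three steps together, a minimal sketch of the canonical sleep loop (q is a pointer to a wait queue head and condition is the awaited event; both are placeholders of this example):

DEFINE_WAIT(wait);

add_wait_queue(q, &wait);
while (!condition) {
	prepare_to_wait(q, &wait, TASK_INTERRUPTIBLE);
	if (signal_pending(current))
		break;			/* a signal woke us: handle it */
	schedule();			/* sleep until woken up */
}
finish_wait(q, &wait);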

Here is the source code for the method mentioned above:

void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
	unsigned long flags;

	wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue(wq_head, wq_entry);
	spin_unlock_irqrestore(&wq_head->lock, flags);
}

static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
	list_add(&wq_entry->entry, &wq_head->head);
}
void
prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
	unsigned long flags;

	wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&wq_head->lock, flags);
	if (list_empty(&wq_entry->entry))
		__add_wait_queue(wq_head, wq_entry);
	/* mark our own process state */
	set_current_state(state);
	spin_unlock_irqrestore(&wq_head->lock, flags);
}

2. Wake up

Wakeups go mainly through wake_up(), which wakes every process on the given wait queue. Internally, try_to_wake_up() marks the process TASK_RUNNING, and enqueue_task() then adds it back to the red-black tree.
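A hypothetical usage sketch of the waking side (my_queue and data_ready are inventions of this example; the sleeper would test data_ready in its sleep loop):

static DECLARE_WAIT_QUEUE_HEAD(my_queue);
static int data_ready;

/* Producer side: make the condition true, then wake the sleepers. */
static void producer(void)
{
	data_ready = 1;
	wake_up(&my_queue);
}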

wake_up() is defined as a macro; internally it generally lands in the following function:

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key,
			wait_queue_entry_t *bookmark)
{
	wait_queue_entry_t *curr, *next;
	int cnt = 0;

	if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
		curr = list_next_entry(bookmark, entry);

		list_del(&bookmark->entry);
		bookmark->flags = 0;
	} else
		curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

	if (&curr->entry == &wq_head->head)
		return nr_exclusive;

	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret;

		if (flags & WQ_FLAG_BOOKMARK)
			continue;

		ret = curr->func(curr, mode, wake_flags, key);
		if (ret < 0)
			break;
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;

		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
				(&next->entry != &wq_head->head)) {
			bookmark->flags = WQ_FLAG_BOOKMARK;
			list_add_tail(&bookmark->entry, &next->entry);
			break;
		}
	}

	return nr_exclusive;
}

Preemption and context switching

Context switch

A context switch is the act of switching from one runnable process to another. It is handled by context_switch(), defined in kernel/sched/core.c:

static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);

	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * Switch the virtual memory mapping from the previous process
	 * to the new one (kernel threads have no mm and borrow oldmm).
	 */
	if (!mm) {
		next->active_mm = oldmm;
		mmgrab(oldmm);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm_irqs_off(oldmm, mm, next);

	if (!prev->mm) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	/*
	 * Since the runqueue lock will be released by the next
	 * task (which is an invalid locking op but in the case
	 * of the scheduler it's an obvious special-case), so we
	 * do an early lockdep release here:
	 */
	rq_unpin_lock(rq, rf);
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

	/*
	 * Switch the processor state: save and restore the register
	 * and stack information.
	 */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}

Context switching is invoked by schedule() when it switches processes. But the kernel must know when to call schedule(): if schedule() were only ever called explicitly by user code, code could run forever without yielding.

For this purpose the kernel keeps a need_resched flag per process to indicate whether rescheduling is needed. scheduler_tick() sets the flag when a process should be preempted, and try_to_wake_up() also sets it when a process with a higher priority than the current one becomes runnable; the kernel checks the flag and calls schedule() to reschedule.
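The flag's consumers follow a simple pattern; a sketch (need_resched() is the real helper that tests the flag for the current task):

	/* at a point where rescheduling is safe: */
	if (need_resched())
		schedule();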

User preemption

User preemption occurs when the kernel is about to return to user space: if need_resched is set at that point, schedule() is called and the kernel picks a more appropriate process to run.

To put it simply, there are two situations in which user preemption occurs:

  • Returning to user space from a system call
  • Returning to user space from an interrupt handler

Kernel preemption

Linux differs from most other Unix variants in that it supports full kernel preemption.

A system without kernel preemption means kernel code runs to completion: kernel-level tasks cannot be rescheduled while they execute; kernel work proceeds cooperatively and cannot be preempted.

In Linux, the kernel can preempt the task in progress at any time, as long as rescheduling is safe, and "safe" means that no lock is held.

To support kernel preemption, Linux has made the following changes:

  • Each process's thread_info gained a preempt_count counter that counts the locks held; when it is zero, the process is preemptible.
  • On return from an interrupt to kernel space, the kernel checks need_resched and preempt_count: if need_resched is set and preempt_count is zero, a more important process needs to run and it is safe to preempt, so the scheduler is invoked.
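The counter itself is driven by preempt_disable()/preempt_enable() (real kernel APIs); a sketch of the usage pattern, with an illustrative body:

	preempt_disable();	/* preempt_count++: kernel preemption off */
	/* ... touch per-CPU state that must not be preempted ... */
	preempt_enable();	/* preempt_count--: may reschedule if
				 * need_resched was set in the meantime */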

Besides returning from an interrupt, kernel preemption also happens when a task in the kernel explicitly calls schedule(), for example because it blocks. Since the task calls the scheduler explicitly, it clearly knows it is safe to be preempted, so no extra safety-check logic is needed.

The following is a list of possible kernel preemption scenarios:

  • An interrupt handler finishes, before returning to kernel space
  • Kernel code becomes preemptible again
  • A task in the kernel explicitly calls schedule()
  • A task in the kernel blocks (which itself leads to a schedule() call)