Almost every operating system runs more processes than it has CPUs, so the CPUs must be time-shared among the processes. Ideally this sharing is transparent to user processes. A common approach is to give each process the illusion that it has its own virtual CPU by multiplexing the processes onto the hardware CPUs.
Multiplexing
XV6 multiplexes by switching each CPU from one process to another in two situations. First, XV6's sleep and wakeup mechanism switches when a process waits for device or pipe I/O to complete, waits for a child to exit, or waits in the sleep system call. Second, XV6 periodically forces a switch away from a process that has been computing for a long time without sleeping. This multiplexing creates the illusion that each process has its own CPU, just as XV6's page tables create the illusion that each process has its own memory.
Implementing multiplexing poses a few challenges. First, how do we switch from one process to another? Although the idea of a context switch is simple, its implementation is some of the most opaque code in XV6. Second, how do we force switches in a way that is transparent to user processes? XV6 uses the standard technique of driving context switches with timer interrupts. Third, many CPUs may be switching among processes in parallel, so a locking plan is needed to avoid races. Fourth, a process's memory and other resources must be freed when it exits, but the process cannot do all of this itself; for example, it cannot free its own kernel stack while it is still using it. Fifth, each core of a multi-core machine must remember which process it is executing so that system calls affect the correct process's kernel state. Finally, sleep and wakeup allow a process to give up the CPU and sleep waiting for an event, and allow another process to wake the first process up; care is needed to avoid races that would cause wakeup notifications to be lost. XV6 tries to solve these problems as simply as possible, but the resulting code is still tricky.
Code: context switch
Switching from one user process to another in XV6 involves these steps: a user-to-kernel transition (system call or interrupt) into the old process's kernel thread, a context switch to the current CPU's scheduler thread, a context switch to the new process's kernel thread, and a trap return to the new user process. The XV6 scheduler has a dedicated thread (its own saved registers and stack) on each CPU because it would not be safe for the scheduler to execute on the old process's kernel stack: some other core might wake the process up and run it, and using the same stack on two different cores would be a recipe for disaster. This section describes the mechanics of switching between a kernel thread and the scheduler thread.
Switching from one thread to another involves saving the old thread's CPU registers and restoring the registers previously saved by the new thread. The fact that the stack pointer (sp) and program counter are saved and restored means that the CPU switches stacks and switches what code it is executing.
The function swtch performs the saves and restores for such a kernel thread switch. swtch doesn't know about threads directly; it just saves and restores sets of registers, called contexts. When it is time for a process to give up the CPU, the process's kernel thread calls swtch to save its own context and return in the scheduler's context. Each context is contained in a struct context, itself stored in a process's struct proc or a CPU's struct cpu. swtch takes two arguments, struct context *old and struct context *new; it saves the current registers in old, loads registers from new, and returns.
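To make this concrete, here is a sketch of the context structure and the swtch declaration, modeled on xv6's kernel/proc.h and kernel/defs.h (uint64 is xv6's 64-bit integer typedef; the exact field list is illustrative):

// Saved registers for kernel context switches. Only ra, sp, and the
// callee-saved registers need to be kept here; caller-saved registers
// are saved on the stack by the C caller of swtch if it needs them.
struct context {
  uint64 ra;    // return address: where swtch will "return" to
  uint64 sp;    // stack pointer: selects which kernel stack to run on
  // callee-saved registers
  uint64 s0;
  uint64 s1;
  uint64 s2;
  uint64 s3;
  uint64 s4;
  uint64 s5;
  uint64 s6;
  uint64 s7;
  uint64 s8;
  uint64 s9;
  uint64 s10;
  uint64 s11;
};

// Save current registers in *old, load registers from *new, and return.
// Implemented in assembly (swtch.S) because it manipulates ra and sp directly.
void swtch(struct context *old, struct context *new);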
Let's follow a process through swtch into the scheduler. A timer interrupt's handling in usertrap ends up calling yield. yield in turn calls sched, which calls swtch to save the current context in p->context and switch to the scheduler context previously saved in cpu->scheduler.
swtch saves only the callee-saved registers; the C compiler generates code in the caller to save caller-saved registers on the stack if it needs them. swtch knows the offset of each register's field in struct context. It does not save the program counter; instead it saves the ra register, which holds the return address from which swtch was called. swtch then restores registers from the new context, which holds register values saved by a previous swtch. When swtch returns, it returns to the instruction pointed to by the restored ra register, that is, the instruction at which the new thread previously called swtch. In addition, it returns on the new thread's stack.
In our example, sched called swtch to switch to cpu->scheduler, the per-CPU scheduler context. That context had been saved by swtch when the scheduler last called it. When the swtch we are tracing returns, it returns not to sched but to scheduler, and the stack pointer now points at the current CPU's scheduler stack.
Code: Scheduling
The previous section looked at the low-level details of swtch; now let's take swtch as a given and examine switching from one process's kernel thread, through the scheduler, to another process. The scheduler exists in the form of a dedicated thread per CPU, each running the scheduler function; this function is in charge of choosing which process to run next. A process that wants to give up the CPU must acquire its own process lock p->lock, release any other locks it is holding, update its state p->state, and then call sched. The yield function follows this convention, as do sleep and exit. sched double-checks those requirements and an implication of them: since a lock is held, interrupts must be disabled. Finally, sched calls swtch to save the current context in p->context and switch to the scheduler context. The scheduler continues its for loop, finds a process to run, switches to it, and the cycle repeats. A sketch of yield and sched appears below.
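The following is a simplified sketch of yield and sched, modeled on xv6's kernel/proc.c; it assumes the per-CPU scheduler context is stored in a field named scheduler, as this text does, and omits sched's bookkeeping of the spinlock nesting level and interrupt-enable state.

// Give up the CPU for one scheduling round.
void
yield(void)
{
  struct proc *p = myproc();
  acquire(&p->lock);       // hold p->lock across the switch
  p->state = RUNNABLE;     // give up the CPU but remain runnable
  sched();                 // switch to the scheduler thread
  release(&p->lock);       // released only once we are running again
}

// Check that the caller obeyed the convention, then switch to the
// per-CPU scheduler context.
void
sched(void)
{
  struct proc *p = myproc();

  if(!holding(&p->lock))
    panic("sched p->lock");
  if(p->state == RUNNING)
    panic("sched running");
  if(intr_get())
    panic("sched interruptible");

  swtch(&p->context, &mycpu()->scheduler);
}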
XV6 holds p->lock across the call to swtch: the caller of swtch must already hold the lock, and control of the lock passes to the code being switched to. This convention is unusual; normally the thread that acquires a lock is also responsible for releasing it, which makes it easier to reason about correctness. For context switching the convention must be broken, because p->lock protects invariants on the process's state and context fields that do not hold while swtch executes. For example, if p->lock were not held during swtch, a different CPU might decide to run the process after yield had set its state to RUNNABLE but before swtch had stopped using the process's kernel stack; the result would be two CPUs running on the same stack, which cannot be right.
A kernel thread always gives up its CPU in sched and always switches to the same location in the scheduler, which (almost) always switches to some kernel thread that previously called sched. Thus, if one were to print out the line numbers where XV6 switches threads, one would observe the following simple pattern: kernel/proc.c:475, kernel/proc.c:509, kernel/proc.c:475, kernel/proc.c:509, and so on. This kind of stylized switching between two threads is sometimes called coroutines; in this example, sched and scheduler are co-routines of each other.
There is one case when the scheduler's call to swtch does not end up in sched. When a new process is first scheduled, it begins at forkret. forkret exists to release p->lock; otherwise, the new process could start directly at usertrapret.
scheduler runs a simple loop: find a process to run, run it until it yields, repeat. The scheduler loops over the process table looking for a runnable process, one with p->state == RUNNABLE. Once it finds one, it sets the per-CPU current process variable c->proc, marks the process as RUNNING, and then calls swtch to start running it, as sketched below.
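Here is a simplified sketch of the per-CPU scheduler loop, modeled on xv6's kernel/proc.c; it assumes a global proc[NPROC] process table and a scheduler field in struct cpu, and it omits details such as re-enabling interrupts on each iteration.

void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    for(p = proc; p < &proc[NPROC]; p++){
      acquire(&p->lock);
      if(p->state == RUNNABLE){
        // Switch to the chosen process. It is the process's job
        // to release its lock and then reacquire it before
        // jumping back to us.
        p->state = RUNNING;
        c->proc = p;
        swtch(&c->scheduler, &p->context);

        // Process is done running for now; it should have
        // changed its p->state before coming back.
        c->proc = 0;
      }
      release(&p->lock);
    }
  }
}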
One way to think about the structure of the scheduling code is that it enforces a set of invariants about each process and holds p->lock whenever those invariants are not true. One invariant is that if a process is RUNNING, a timer interrupt's yield must be able to safely switch away from it; this means that the CPU registers must hold the process's register values (i.e., swtch has not moved them into a context), and that c->proc must refer to the process. Another invariant is that if a process is RUNNABLE, an idle CPU's scheduler must be able to safely run it; this means that p->context must hold the process's registers (i.e., they are not in the real registers), that no CPU is executing on the process's kernel stack, and that no CPU's c->proc refers to the process. Observe that these properties are often not true while p->lock is held.
Maintaining the above invariants is the reason why XV6 often acquires p->lock in one thread and releases it in another, for example acquiring it in yield and releasing it in scheduler. Once yield has started to change a RUNNING process's state to RUNNABLE, the lock must remain held until the invariants are restored: the earliest correct release point is after scheduler (running on its own stack) clears c->proc. Similarly, once scheduler starts to convert a RUNNABLE process to RUNNING, the lock cannot be released until the kernel thread is completely running (after the swtch, for example in yield).
p->lock protects other things as well: the interplay between exit and wait, the machinery to avoid lost wakeups, and the avoidance of races between a process exiting and other processes reading or writing its state (for example, the kill system call looking at p->pid and setting p->killed).
Code: mycpu and myproc
XV6 often needs a pointer to the current process's proc structure. On a uniprocessor one could use a global variable pointing to the current proc, but this does not work on a multi-core machine, since each core may be executing a different process. The way to solve this problem is to exploit the fact that each core has its own set of registers; we can use one of those registers to help find per-core information.
XV6 maintains a struct cpu for each CPU, which records the process currently running on that CPU (if any), the saved context of the CPU's scheduler thread, and the count of nested spinlocks. The function mycpu returns a pointer to the current CPU's struct cpu. RISC-V numbers its CPUs, giving each a hartid. XV6 ensures that each CPU's hartid is stored in that CPU's tp register while in the kernel. This allows mycpu to use tp to index an array of cpu structures to find the right one, as sketched below.
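A minimal sketch of cpuid and mycpu, modeled on xv6's kernel/proc.c; r_tp is assumed to be the inline-assembly helper from kernel/riscv.h that reads the tp register, and cpus is assumed to be the per-CPU array of struct cpu.

// Return this CPU's hartid. Must be called with interrupts disabled,
// to prevent a race with the process being moved to a different CPU.
int
cpuid(void)
{
  return r_tp();          // tp holds the hartid while in the kernel
}

// Return this CPU's struct cpu. Interrupts must be disabled.
struct cpu*
mycpu(void)
{
  int id = cpuid();
  return &cpus[id];
}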
Ensuring that a CPU's tp always holds that CPU's hartid is a bit involved. mstart sets the tp register early in the CPU's boot sequence, while still in machine mode. usertrapret saves tp in the trampoline page, because the user process might modify tp. Finally, uservec restores the saved tp when entering the kernel from user space. The compiler guarantees never to use the tp register. It would be more convenient if RISC-V allowed the hartid to be read directly, but that is allowed only in machine mode, not in supervisor mode.
The return values of cpuid and mycpu are fragile: if a timer interrupt were to cause the thread to yield and then move to a different CPU, a previously returned value would no longer be correct. To avoid this problem, XV6 requires that callers disable interrupts, and re-enable them only after they have finished using the returned struct cpu.
The function myproc returns a pointer to the struct proc for the process running on the current CPU. myproc disables interrupts, invokes mycpu, fetches the current process pointer (c->proc) out of the struct cpu, and then re-enables interrupts. The return value of myproc is safe to use even if interrupts are enabled: if a timer interrupt moves the calling process to a different CPU, its struct proc pointer stays the same.
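A minimal sketch of myproc, modeled on xv6's kernel/proc.c; push_off and pop_off are xv6's nestable routines for disabling and re-enabling interrupts.

struct proc*
myproc(void)
{
  push_off();                    // disable interrupts
  struct cpu *c = mycpu();
  struct proc *p = c->proc;      // process currently running on this CPU
  pop_off();                     // re-enable interrupts (if they were on)
  return p;
}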
Sleep and wakeup
Scheduling and locks help conceal the existence of one process from another, but so far we have no abstractions that help processes intentionally interact. Many mechanisms have been invented to solve this problem. XV6 uses one called sleep and wakeup, which lets one process sleep while waiting for an event and lets another process wake it up once the event has happened. Sleep and wakeup are often called sequence coordination or conditional synchronization mechanisms.
To illustrate, let's consider a synchronization mechanism called a semaphore that coordinates producers and consumers. A semaphore maintains a count and provides two operations. The V operation (for the producer) increments the count. The P operation (for the consumer) waits until the count is non-zero, then decrements it and returns. If there were only one producer thread and one consumer thread, they ran on different CPUs, and the compiler didn't optimize too aggressively, the following implementation would be correct:
struct semaphore
{
struct spinlock lock;
int count;
};
void V(struct semaphore *s)
{
acquire(&s->lock);
s->count += 1;
release(&s->lock);
}
void P(struct semaphore *s)
{
  while (s->count == 0)
    ;
  acquire(&s->lock);
  s->count -= 1;
release(&s->lock);
}
This implementation is expensive. If the producer acts rarely, the consumer will spend most of its time spinning in the while loop, hoping for a non-zero count. The consumer's CPU could find more productive work than busy waiting by repeatedly polling s->count. Avoiding busy waiting requires a way for the consumer to yield the CPU and resume only after V increments the count.
Here is a step in that direction, although as we will see it is not enough. Imagine a pair of calls, sleep and wakeup, that work as follows. sleep(chan) sleeps on the arbitrary value chan, called the wait channel; sleep puts the calling process to sleep, releasing the CPU for other work. wakeup(chan) wakes all processes sleeping on chan (if any), causing their sleep calls to return. If no processes are waiting on chan, wakeup does nothing. We can use sleep and wakeup to implement the semaphore:
void V(struct semaphore *s)
{
acquire(&s->lock);
s->count += 1;
wakeup(s); // change this
release(&s->lock);
}
void P(struct semaphore *s)
{
while (s->count == 0)
sleep(s); // change this
acquire(&s->lock);
s->count -= 1;
release(&s->lock);
}
Now P gives up the CPU instead of spinning, which is nice. However, it turns out not to be straightforward to design sleep and wakeup with this interface without suffering from what is known as the lost wake-up problem. Suppose that P finds s->count == 0, and before P can call sleep, V runs on another CPU: it changes s->count to a non-zero value and calls wakeup, which finds no processes sleeping and thus does nothing. Now P continues: it calls sleep and goes to sleep. This causes a problem: P is asleep waiting for a V call that has already happened. Unless we are lucky and the producer calls V again, the consumer will wait forever even though the count is non-zero.
The root of this problem is that the invariant that P sleeps only when s->count == 0 is violated by V running at just the wrong moment. An incorrect way of protecting the invariant would be to move the lock acquisition in P so that its check of the count and its call to sleep are atomic:
void V(struct semaphore *s)
{
acquire(&s->lock);
s->count += 1;
wakeup(s);
release(&s->lock);
}
void P(struct semaphore *s)
{
acquire(&s->lock); // change this
while (s->count == 0)
sleep(s);
s->count -= 1;
release(&s->lock);
}
One might hope that this version of P would avoid the lost wakeup because the lock prevents V from executing between P's check of s->count and its call to sleep. It does that, but it also deadlocks: P holds the lock while it sleeps, so V will block forever waiting for the lock.
We fix this by changing sleep's interface: the caller must pass the condition lock to sleep so that sleep can release the lock after the calling process has been marked asleep and is waiting on the sleep channel. The lock forces a concurrent V to wait until P has finished putting itself to sleep, so that the subsequent wakeup will find the sleeping consumer and wake it up. Once the consumer is awake again, sleep reacquires the lock before returning. Our new, correct sleep/wakeup scheme looks like this:
void V(struct semaphore *s)
{
acquire(&s->lock);
s->count += 1;
wakeup(s);
release(&s->lock);
}
void P(struct semaphore *s)
{
acquire(&s->lock);
while (s->count == 0)
sleep(s, &s->lock); // change this
s->count -= 1;
release(&s->lock);
}
The fact that P holds s->lock prevents V from trying to wake it up between P's check of s->count and its call to sleep. Note, however, that we need sleep to atomically release s->lock and put the consuming process to sleep.
Code: sleep and wakeup
Let's look at the implementation of sleep and wakeup. The basic idea is to have sleep mark the current process as SLEEPING and then call sched to release the CPU; wakeup looks for processes sleeping on the given wait channel and marks them as RUNNABLE. Callers of sleep and wakeup can use any mutually convenient number as the channel. XV6 often uses the address of a kernel data structure involved in the waiting.
sleep begins by acquiring p->lock. Now the process going to sleep holds both p->lock and lk. Holding lk was necessary in the caller (in the example, P): it ensured that no other process (in the example, one running V) could start a call to wakeup(chan). Now that sleep holds p->lock, it is safe to release lk: some other process may start a call to wakeup(chan), but wakeup will wait to acquire p->lock, and thus will wait until sleep has finished putting the process to sleep, keeping the wakeup from missing the sleep.
There is a minor complication: if lk is the same lock as p->lock, sleep would deadlock with itself when trying to acquire p->lock. But if the process calling sleep already holds p->lock, it doesn't need to do anything more to avoid missing a concurrent wakeup. This case arises when wait calls sleep with p->lock as the condition lock.
Now that sleep holds p->lock and no other locks, it can put the process to sleep by recording the sleep channel, changing the process state to SLEEPING, and calling sched. In a moment it will be clear why it is critical that p->lock not be released (by the scheduler) until after the process is marked SLEEPING.
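A simplified sketch of sleep, modeled on the version of xv6's kernel/proc.c that this text describes, in which the condition lock may be p->lock itself:

void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = myproc();

  // Must acquire p->lock in order to change p->state and then call
  // sched. Once we hold p->lock, we are guaranteed not to miss any
  // wakeup (wakeup locks p->lock), so it's okay to release lk.
  if(lk != &p->lock){
    acquire(&p->lock);
    release(lk);
  }

  // Go to sleep.
  p->chan = chan;
  p->state = SLEEPING;

  sched();

  // Tidy up.
  p->chan = 0;

  // Reacquire the original lock.
  if(lk != &p->lock){
    release(&p->lock);
    acquire(lk);
  }
}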
At some point, a process will acquire the condition lock, set the condition that the sleeper is waiting for, and call wakeup(chan). It is important that wakeup is called while holding the condition lock. wakeup loops over the process table, acquiring the p->lock of each process it inspects, both because it may manipulate that process's state and because p->lock ensures that sleep and wakeup do not miss each other. When wakeup finds a process in state SLEEPING with a matching chan, it changes that process's state to RUNNABLE. The next time the scheduler runs, it will see that the process is ready to be run.
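A corresponding sketch of wakeup, also modeled on xv6's kernel/proc.c and assuming a global proc[NPROC] table:

// Wake up all processes sleeping on chan.
// Must be called while holding the condition lock.
void
wakeup(void *chan)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++){
    acquire(&p->lock);
    if(p->state == SLEEPING && p->chan == chan){
      p->state = RUNNABLE;
    }
    release(&p->lock);
  }
}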
Why do the locking rules for sleep and wakeup ensure that a sleeping process won't miss a wakeup? The sleeping process holds either the condition lock or its own p->lock or both from a point before it checks the condition to a point after it is marked SLEEPING. The process calling wakeup holds both of those locks in wakeup's loop. Thus the waker either makes the condition true before the consuming thread checks the condition, or the waker's wakeup examines the sleeping thread strictly after it has been marked SLEEPING; then wakeup sees the sleeping process and wakes it up.
It sometimes happens that multiple processes are sleeping on the same channel; for example, more than one process reading from a pipe. A single call to wakeup will wake them all. One of them will run first, acquire the lock that sleep was called with, and (in the case of pipes) read whatever data is waiting. The other processes will find that, despite being woken up, there is no data to read; from their point of view the wakeup was "spurious," and they must sleep again. This is why sleep is always called inside a loop that checks the condition.
No harm is done if two uses of sleep/wakeup accidentally choose the same channel: they will see spurious wakeups, but the loop just described tolerates this. Much of the charm of sleep/wakeup is that it is both lightweight (no need to create special data structures to act as sleep channels) and provides a layer of indirection (callers need not know which specific process they are interacting with).
Code: Pipe
A more complex example that uses sleep and wakeup to synchronize producers and consumers is XV6's pipe implementation. Bytes written to one end of a pipe are copied into an in-kernel buffer and can then be read from the other end of the pipe. The next chapter will examine the file-descriptor support surrounding pipes, but for now let's look at the implementations of pipewrite and piperead.
Each pipe is represented by a struct pipe, which contains a lock and a data buffer. The fields nread and nwrite count the total number of bytes read from and written to the buffer. The buffer wraps around: the next byte written after buf[PIPESIZE-1] is buf[0]. The counts do not wrap. This convention makes it possible to distinguish a full buffer (nwrite == nread + PIPESIZE) from an empty buffer (nwrite == nread), but it means that indexing into the buffer must use buf[nread % PIPESIZE] instead of just buf[nread] (and similarly for nwrite).
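A sketch of the pipe structure, modeled on xv6's kernel/pipe.c (the buffer field is named buf here to match the text; uint is xv6's unsigned-integer typedef):

#define PIPESIZE 512

struct pipe {
  struct spinlock lock;
  char buf[PIPESIZE];    // ring buffer
  uint nread;            // total bytes read
  uint nwrite;           // total bytes written
  int readopen;          // read fd is still open
  int writeopen;         // write fd is still open
};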
Suppose that calls to piperead and pipewrite happen simultaneously on two different CPUs. pipewrite begins by acquiring the pipe's lock, which protects the counts, the data, and their associated invariants. piperead then tries to acquire the lock too, but cannot; it spins in acquire waiting for the lock. While piperead waits, pipewrite loops over the bytes being written, adding each one to the pipe. During this loop the buffer may fill up. In that case, pipewrite calls wakeup to alert any sleeping readers that there is data waiting in the buffer, and then sleeps on &pi->nwrite to wait for a reader to take some bytes out of the buffer. sleep releases pi->lock as part of putting pipewrite's process to sleep.
Now that pi->lock is available, piperead manages to acquire it and enters its critical section: it finds that pi->nread != pi->nwrite, so it falls into the for loop, copies data out of the buffer, and increments the nread counter. Some bytes are now available for writing, so piperead calls wakeup to wake any sleeping writers before it returns. wakeup finds a process sleeping on &pi->nwrite, the one that was running pipewrite but stopped when the buffer filled, and marks that process as RUNNABLE.
The pipe code uses separate sleep channels for readers and writers, which makes the system a bit more efficient when multiple readers and writers are waiting on the same pipe at the same time. The pipe code sleeps inside a loop that checks the sleep condition; if there are multiple readers or writers, all but the first process to wake up will find the condition still false and sleep again, as illustrated by the sketch below.
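A simplified sketch of piperead, modeled on xv6's kernel/pipe.c; error handling is abbreviated and the struct fields match the pipe sketch above.

int
piperead(struct pipe *pi, uint64 addr, int n)
{
  int i;
  struct proc *pr = myproc();
  char ch;

  acquire(&pi->lock);
  while(pi->nread == pi->nwrite && pi->writeopen){  // pipe is empty
    if(pr->killed){
      release(&pi->lock);
      return -1;
    }
    sleep(&pi->nread, &pi->lock);   // wait for writers; releases pi->lock
  }
  for(i = 0; i < n; i++){           // copy out up to n bytes
    if(pi->nread == pi->nwrite)
      break;
    ch = pi->buf[pi->nread++ % PIPESIZE];
    if(copyout(pr->pagetable, addr + i, &ch, 1) == -1)
      break;
  }
  wakeup(&pi->nwrite);              // wake any sleeping writer
  release(&pi->lock);
  return i;
}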
Code: wait, exit, and kill
Sleep and wakeup can be used for many kinds of waiting. An interesting example is the interaction between a child's exit and its parent's wait. At the time of the child's death, the parent may already be sleeping in wait, or may be doing something else; in the latter case, a subsequent call to wait must observe the child's death, perhaps long after the child calls exit. XV6 records the child's death by having exit put the caller into the ZOMBIE state. The process stays in that state until the parent's wait notices it, changes the child's state to UNUSED, copies the child's exit status, and returns the child's process ID to the parent. If the parent exits before the child, the child is re-parented to the init process, which perpetually calls wait; thus every child has a parent to clean up after it. The main implementation challenges are the possibility of races and deadlock between the parent's wait and the child's exit, as well as between exit and exit.
wait uses the calling process's p->lock as the condition lock to avoid lost wakeups, and it acquires that lock at the start. Then it scans the process table. If it finds a child in the ZOMBIE state, it frees that child's resources and its proc structure, copies the child's exit status to the address supplied to wait (if it is not 0), and returns the child's process ID. If wait finds children but none have exited, it calls sleep to wait for one of them to exit, then scans again. The condition lock released in this sleep is the waiting process's p->lock, the special case mentioned above. Note that wait often holds two locks; that it acquires its own lock before trying to acquire any child's lock; and that therefore all of XV6 must obey the same locking order (parent, then child) to avoid deadlock. A simplified sketch of wait appears below.
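This sketch of wait is modeled on the version of xv6's kernel/proc.c that this text describes; freeproc is xv6's helper that releases a proc structure, and copying the exit status out to user space is abbreviated to a comment.

int
wait(uint64 addr)
{
  struct proc *np;
  int havekids, pid;
  struct proc *p = myproc();

  acquire(&p->lock);              // p->lock is the condition lock for sleep
  for(;;){
    havekids = 0;
    for(np = proc; np < &proc[NPROC]; np++){
      if(np->parent == p){        // read without np->lock; see text below
        acquire(&np->lock);       // parent's lock first, then child's
        havekids = 1;
        if(np->state == ZOMBIE){
          pid = np->pid;
          // ... copy np's exit status to addr (if addr != 0) ...
          freeproc(np);
          release(&np->lock);
          release(&p->lock);
          return pid;
        }
        release(&np->lock);
      }
    }
    if(!havekids || p->killed){   // no children, or we were killed
      release(&p->lock);
      return -1;
    }
    sleep(p, &p->lock);           // wait for a child to exit
  }
}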
wait uses np->parent to find each process's children. It reads np->parent without holding np->lock, which violates the usual rule that shared variables must be protected by locks. The reason is that np might be an ancestor of the current process, in which case acquiring np->lock could violate the locking order above and cause a deadlock. Examining np->parent without a lock is safe in this case: a process's parent field is changed only by its parent, so if np->parent == p holds, the value cannot change unless the current process itself changes it.
exit records the exit status, frees some resources, gives any children to the init process, wakes up the parent in case it is in wait, marks the caller as a ZOMBIE, and permanently yields the CPU. The final sequence is a little tricky. The exiting process must hold its parent's lock while it sets its own state to ZOMBIE and wakes its parent up, because the parent's lock is the condition lock that guards against lost wakeups in wait. The child must also hold its own p->lock, since otherwise the parent might see its state as ZOMBIE and free it while it is still running. The lock acquisition order is important to avoid deadlock: since wait acquires the parent's lock before the child's, exit must use the same order.
exit calls a specialized wakeup function, wakeup1, which wakes only the parent, and only if it is sleeping in wait. It may look incorrect for the child to wake the parent before setting its own state to ZOMBIE, but it is safe: although wakeup1 may cause the parent to run, the loop in wait cannot examine the child until the child's p->lock has been released by the scheduler, so wait cannot look at the exiting process until well after exit has set its state to ZOMBIE.
While exit allows a process to terminate itself, kill lets one process request the termination of another. It would be too complex for kill to directly destroy the victim process, since the victim might be executing on another CPU, perhaps in the middle of updating sensitive kernel data structures. Thus kill does very little: it just sets the victim's p->killed flag and, if the victim is sleeping, wakes it up. Eventually the victim will enter or leave the kernel, at which point code in usertrap will call exit if p->killed is set. If the victim is running in user space, it will soon enter the kernel by making a system call or because the timer (or some other device) interrupts.
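A simplified sketch of kill, modeled on xv6's kernel/proc.c; note that it only records the request, leaving the victim to notice p->killed and call exit on its own behalf.

int
kill(int pid)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++){
    acquire(&p->lock);
    if(p->pid == pid){
      p->killed = 1;
      if(p->state == SLEEPING){
        // Wake the process from sleep() so its loop can notice p->killed.
        p->state = RUNNABLE;
      }
      release(&p->lock);
      return 0;
    }
    release(&p->lock);
  }
  return -1;   // no such pid
}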
If the victim process is sleeping, kill's wakeup causes it to return from sleep. This is potentially dangerous, because the condition being waited for may not yet be true. However, XV6 calls to sleep are always wrapped in a while loop that re-tests the condition after sleep returns. Some calls to sleep also test p->killed in the loop and abandon the current activity if it is set. This is only done when such abandonment would be correct. For example, the pipe read and write code returns if the killed flag is set; eventually the code returns to trap, which checks the flag again and exits.
Some XV6 sleep loops do not check p->killed because the code is in the middle of a multi-step system call that should be atomic. The virtio driver is an example: it does not check p->killed because a disk operation may be one of a set of writes that all need to complete in order for the file system to be left in a correct state. A process waiting for disk I/O won't be killed until it completes the current system call, after which usertrap sees the killed flag.
Real operating systems
The XV6 scheduler implements a simple policy that runs each process in turn; this policy is called round robin. Real operating systems implement more sophisticated policies that, for example, allow processes to have priorities: a runnable high-priority process is scheduled ahead of a runnable low-priority one. These policies can become complex quickly because there are often competing goals: for example, the operating system might want to guarantee both fairness and high throughput. In addition, complex policies can lead to unintended interactions such as priority inversion and convoys. Priority inversion can happen when a low-priority process and a high-priority process share a lock; if the low-priority process acquires the lock, it can prevent the high-priority process from making progress. A long convoy of waiting processes can form when many high-priority processes are waiting for a low-priority process that holds a shared lock; once a convoy has formed, it can persist for a long time. To avoid these kinds of problems, additional mechanisms are necessary in sophisticated schedulers.
Sleep and wakeup are a simple and effective synchronization method, but there are many others. The first challenge in all of them is to avoid the lost-wakeup problem. The original Unix kernel's sleep simply disabled interrupts, which sufficed because Unix ran on a single-CPU system. Because XV6 runs on multiprocessors, it adds an explicit lock to sleep. FreeBSD's msleep takes the same approach. Plan 9's sleep uses a callback function that runs, with the scheduling lock held, just before going to sleep; the function serves as a last-minute check of the sleep condition, to avoid lost wakeups. The Linux kernel's sleep uses an explicit process queue, called a wait queue, instead of a wait channel; the queue has its own internal lock.
Scanning the entire process list in wakeup to find processes with a matching chan is inefficient. A better solution is to replace the chan in both sleep and wakeup with a data structure that holds a list of the processes sleeping on it, such as Linux's wait queue. Many thread libraries refer to the same structure as a condition variable; in that context, the operations sleep and wakeup are called wait and signal. All of these mechanisms share the same flavor: the sleep condition is protected by some kind of lock that is atomically dropped during the sleep.
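For comparison, here is an illustration (not xv6 code) of the same semaphore written with POSIX threads; pthread_cond_wait atomically releases the mutex while the caller sleeps and reacquires it before returning, playing the same role as the condition lock passed to xv6's sleep.

#include <pthread.h>

struct semaphore {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  int count;
};

void semaphore_init(struct semaphore *s, int count) {
  pthread_mutex_init(&s->lock, 0);
  pthread_cond_init(&s->cond, 0);
  s->count = count;
}

void V(struct semaphore *s) {
  pthread_mutex_lock(&s->lock);
  s->count += 1;
  pthread_cond_signal(&s->cond);   // wake one waiter (broadcast would wake all)
  pthread_mutex_unlock(&s->lock);
}

void P(struct semaphore *s) {
  pthread_mutex_lock(&s->lock);
  while (s->count == 0)
    pthread_cond_wait(&s->cond, &s->lock);  // drops the lock while waiting
  s->count -= 1;
  pthread_mutex_unlock(&s->lock);
}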
XV6's wakeup wakes all processes waiting on a channel, and it may be that many processes are waiting for the same channel; the operating system will schedule all of them, and they will race to check the sleep condition. Processes behaving in this way are sometimes called a thundering herd, and it is best avoided. Most condition variables have two primitives for wakeup: signal, which wakes one process, and broadcast, which wakes all waiting processes.
Semaphores are often used for synchronization. The count typically corresponds to something like the number of bytes available in a pipe buffer or the number of zombie children a process has. Using an explicit count as part of the abstraction avoids the lost-wakeup problem: there is an explicit count of the number of wakeups that have occurred. The count also avoids the spurious wakeup and thundering herd problems.
Terminating processes and cleaning them up introduces much complexity in XV6. In most operating systems it is even more complex, because, for example, the victim process may be sleeping deep inside the kernel, and unwinding its stack requires careful programming. Many operating systems unwind the stack using explicit mechanisms for exception handling, such as longjmp. Moreover, other events can cause a sleeping process to be woken up even though the event it is waiting for has not yet happened. For example, when a UNIX process is sleeping, another process may send it a signal; in this case the process returns -1 from the interrupted system call with the error code set to EINTR. The application can check these values and decide what to do. XV6 does not support signals, so this complexity does not arise.
XV6's support for kill is not entirely satisfactory: there are sleep loops that probably should check p->killed. A related problem is that, even for sleep loops that do check p->killed, there is a race between sleep and kill; kill may set p->killed and try to wake the victim just after the victim's loop has checked p->killed but before it has called sleep. If this happens, the victim won't notice p->killed until the condition it is waiting for occurs, which may be much later or may never happen.
A real operating system would find free proc structures on an explicit free list in constant time, rather than with a linear-time scan of the process table; XV6 uses the linear scan for simplicity.