The process is one of the great inventions of the operating system. It shields the application from CPU scheduling, memory management, and other hardware details, abstracting them behind the concept of a process so that the application can concentrate on its own business logic and run many tasks “simultaneously” on a limited number of CPUs. But while the process brings convenience to users, it also introduces some extra overhead. In the figure below, the CPU is busy in the middle of switching processes but completes no user work; that is the additional overhead of the process mechanism.

Before switching from process A to process B, the kernel saves process A’s context so that when A resumes, it knows which instruction to execute next. The context of process B is then restored into the registers. This whole procedure is called a context switch. Context switch overhead is not a problem in scenarios with few processes and infrequent switching. But Linux is now widely used as the backend of highly concurrent network applications, and when a single machine handles tens of thousands of user requests, this overhead starts to matter, because a context switch is triggered whenever a user process blocks on network IO (for example, fetching data from Redis or MySQL) or whenever its time slice expires.

A simple process context switch overhead test experiment

Without further ado, let’s measure the CPU time a context switch takes with an experiment. The experiment creates two processes and passes a token back and forth between them. One process blocks while reading the token; the other blocks while sending the token and waiting for it to come back. Repeat this a certain number of times, then compute the average switching overhead.
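The article does not include the source of main.c, so here is a minimal sketch of what such a program might look like; the pipe-based token passing and the LOOPS constant are my own assumptions:

/* A rough sketch of the experiment: two processes bounce a one-byte token
 * between each other over a pair of pipes. Each read blocks until the peer
 * writes, so every round trip forces two context switches. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define LOOPS 100000   /* assumed iteration count */

int main(void)
{
    int p2c[2], c2p[2];              /* parent->child and child->parent pipes */
    char token = 'x';
    struct timeval start, end;
    pid_t pid;

    if (pipe(p2c) < 0 || pipe(c2p) < 0) {
        perror("pipe");
        exit(1);
    }

    pid = fork();
    if (pid == 0) {                  /* child: wait for the token, send it back */
        for (int i = 0; i < LOOPS; i++) {
            read(p2c[0], &token, 1);
            write(c2p[1], &token, 1);
        }
        exit(0);
    }

    gettimeofday(&start, NULL);
    printf("Before Context Switch Time %ld s, %ld us\n",
           (long)start.tv_sec, (long)start.tv_usec);

    for (int i = 0; i < LOOPS; i++) { /* parent: send the token, wait for it back */
        write(p2c[1], &token, 1);
        read(c2p[0], &token, 1);
    }

    gettimeofday(&end, NULL);
    printf("After Context Switch Time %ld s, %ld us\n",
           (long)end.tv_sec, (long)end.tv_usec);

    /* each iteration costs roughly two switches, so the average is
     * (elapsed time) / (2 * LOOPS) */
    wait(NULL);
    return 0;
}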

# gcc main.c -o main
# ./main
Before Context Switch Time 1565352257 s, 774767 us
After Context Switch Time 1565352257 s, 842852 us

The time varies from run to run; averaged over multiple runs, each context switch costs about 3.5 us. Of course, this number varies from machine to machine, so it is recommended to test it on your own hardware.

When we tested system calls earlier, the lowest value was around 200 ns. As you can see, context switch overhead is higher than system call overhead. A system call only switches from user mode to kernel mode and back within the same process, whereas a context switch switches from process A to a different process B. Obviously, a context switch requires more work.

What does a process context switch cost?

So what exactly does the CPU spend its time on during a context switch? The overhead falls into two categories: direct overhead and indirect overhead.

Direct overhead is what the CPU must do when switching, including:

  • 1. Switch the page table global directory
  • 2. Switch the kernel stack
  • 3. Switch the hardware context (the data that must be loaded into registers before the process resumes is collectively called the hardware context)
    • Instruction pointer (IP): the next instruction to execute
    • BP (base pointer): the bottom address of the stack frame of the currently executing function
    • SP (stack pointer): the top address of the stack frame of the currently executing function
    • CR3: the page directory base register, which holds the physical address of the page directory table
    • ……
  • 4. Flush the TLB
  • 5. Run the kernel scheduler code

The indirect overhead mainly means that, even after switching to the new process, it runs more slowly at first because the various caches are not warm. Things are better if the process keeps being scheduled on the same CPU. If it migrates across CPUs, the previously warm TLB and L1/L2/L3 caches are useless because the running process has changed, so the code and data cached thanks to locality go to waste and the new process has to go to memory more often. In fact, our experiment above does not capture this effect well, so the real context switch overhead may be larger than 3.5 us.
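As an aside, one way to keep the cross-CPU portion of this indirect cost down is to pin a process to a fixed core so that its TLB and L1/L2/L3 contents stay warm. A minimal illustration with the standard taskset tool; the core number and PID below are just placeholders:

# start ./main pinned to core 0
# taskset -c 0 ./main

# or pin an already-running process (32583 is just an example PID)
# taskset -cp 0 32583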

For more details, see chapters 3 and 9 in Understanding the Linux Kernel.

A more professional testing tool – LMbench

lmbench is a multi-platform open source benchmark suite used to evaluate the overall performance of a system. It can test file read/write performance, memory operations, process creation and destruction overhead, networking, and so on. It is simple to use but a little slow to run; interested readers can give it a try. The advantage of this tool is that it runs multiple groups of experiments, with 2, 8, and 16 processes each, and the amount of data each process touches also varies, fully simulating the effect of cache misses. I tested it with the following results:

-------------------------------------------------------------------------
Host                 OS   2p/0K  2p/16K  2p/64K  8p/16K  8p/64K  16p/16K  16p/64K
                          ctxsw   ctxsw   ctxsw   ctxsw   ctxsw    ctxsw    ctxsw
--------- -------------  ------  ------  ------  ------  ------  -------  -------
bjzw_46_7 Linux 2.6.32-  2.7800  2.7800  2.7000  4.3800  4.0400  4.75000  5.48000

lmbench shows process context switch times ranging from about 2.7 us to 5.48 us.
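If you only care about the context switch numbers, you can also invoke lat_ctx, the micro-benchmark inside lmbench that produces the table above, directly. The working-set size (-s, in KB) and the list of process counts below are just example parameters:

# context switch latency with a 16 KB working set, for 2, 8 and 16 processes
# ./lat_ctx -s 16 2 8 16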

Thread context switch time

Having tested the overhead of process context switches, let’s continue with threads on Linux and see whether they are faster than processes, and if so, by how much.

In fact, there are no true threads in Linux; to cater to developers’ tastes, it provides lightweight processes and calls them threads. A lightweight process, like an ordinary process, has its own task_struct process descriptor and its own independent PID. From the operating system’s point of view, scheduling it is no different from scheduling a process: the scheduler just picks a task_struct off the doubly linked wait queue and switches it into the running state. The difference between a lightweight process and an ordinary process is that it can share the same memory address space, code segment, global variables, and set of open files.

All threads in the same process see the same PID when they call getpid(), even though each has its own task_struct; this is because task_struct also has a tgid field, and for a multithreaded program the getpid() system call actually returns this tgid, so multiple threads belonging to the same process appear to share one PID.
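A small illustration of this (not from the original article): each thread prints getpid() and its kernel task id obtained via syscall(SYS_gettid); the former is the shared tgid, the latter differs per thread.

/* Demonstrates that threads in one process share getpid() (the tgid)
 * while each has its own kernel task id. Compile with: gcc -lpthread */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/syscall.h>

static void *show_ids(void *arg)
{
    (void)arg;
    printf("getpid() = %d, gettid() = %ld\n",
           getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, show_ids, NULL);
    pthread_create(&t2, NULL, show_ids, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}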

Let’s verify this with an experiment that works much like the process test: create 20 threads and pass a signal between them through pipes. A thread wakes up when it receives the signal, passes the signal on to the next thread, and goes back to sleep. The extra cost of passing the signal through the pipe is measured separately in the first step of the experiment so it can be accounted for.
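The thread test source is not included in the article either, so here is a minimal sketch of the idea; the ring-of-pipes layout, the LOOPS constant, and the omission of the separate pipe-cost measurement are my own simplifications:

/* A rough sketch: THREADS threads are connected in a ring by pipes. Each
 * thread sleeps in read() until its predecessor writes the token, then
 * wakes up, passes the token to the next pipe, and sleeps again, so every
 * hop forces one thread context switch. Compile with:
 *   gcc -lpthread main.c -o main
 */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>

#define THREADS 20        /* thread count from the article */
#define LOOPS   10000     /* assumed number of laps around the ring */

static int pipes[THREADS][2];

static void *worker(void *arg)
{
    long id = (long)arg;
    char token;
    for (int i = 0; i < LOOPS; i++) {
        read(pipes[id][0], &token, 1);                   /* sleep until woken */
        write(pipes[(id + 1) % THREADS][1], &token, 1);  /* wake the next thread */
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[THREADS];
    struct timeval start, end;
    char token = 'x';

    for (int i = 0; i < THREADS; i++)
        pipe(pipes[i]);
    for (long i = 0; i < THREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);

    gettimeofday(&start, NULL);
    write(pipes[0][1], &token, 1);        /* inject the token into the ring */
    for (int i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);
    gettimeofday(&end, NULL);

    double total_us = (end.tv_sec - start.tv_sec) * 1e6 +
                      (end.tv_usec - start.tv_usec);
    printf("%f\n", total_us / ((double)THREADS * LOOPS)); /* avg us per switch */
    return 0;
}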

# gcc -lpthread main.c -o main
0.508250  
4.363495  

The results differ a bit from run to run; the numbers above were averaged over several runs, giving roughly 3.8 us per thread switch. In terms of context switch time, Linux threads (lightweight processes) are not that different from processes.

Observing context switches with Linux commands

Now that we know how much CPU time a context switch consumes, what tools can we use to see how much switching is actually happening on a Linux machine? And if context switching is hurting overall system performance, is there a way to track down the problematic processes and optimize them?

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
 2  0      0 595504   5724 190884    0    0   295   297     0     0 14  6 75  0  4
 5  0      0 593016   5732 193288    0    0     0    92 19889 29104 20  6 67  0  7
 3  0      0 591292   5732 195476    0    0     0     0 20151 28487 20  6 66  0  8
 4  0      0 589296   5732 196800    0    0   116   384 19326 27693 20  7 67  0  7
 4  0      0 586956   5740 199496    0    0   216    24 18321 24018 22  8 62  0  8

Or alternatively:

# sar -w 1
proc/s     Total number of tasks created per second.
cswch/s    Total number of context switches per second.
11:19:20 AM    proc/s   cswch/s
11:19:21 AM    110.28  23468.22
11:19:22 AM    128.85  33910.58
11:19:23 AM     47.52  40733.66
11:19:24 AM     35.85  30972.64
11:19:25 AM     47.62  24951.43
11:19:26 AM     47.52  42950.50
......

The environment above is a production machine: an 8-core, 8 GB KVM virtual machine running Nginx + PHP-FPM, with 1,000 FPM worker processes handling about 100 user requests per second on average. The cs column shows the number of context switches in the system per second, roughly 40,000. As a rough estimate, each core performs about 5,000 switches per second, so each core spends approximately 20 ms per second on context switching alone. Remember that this is a virtual machine, which carries some virtualization overhead of its own, and the CPU is also genuinely consumed by user-space request logic, kernel-side system call handling, network connection handling, and soft interrupts, so a 20 ms overhead is actually not low.
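For reference, the back-of-the-envelope calculation behind that estimate, using the roughly 3.5-5 us per switch measured earlier:

40,000 switches/s ÷ 8 cores            ≈ 5,000 switches per core per second
5,000 switches × ~4 us per switch      ≈ 20 ms of switching time per core per second (about 2% of each core)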

Further, what processes are responsible for the frequent context switches?

# pidstat -w 1
11:07:56 AM       PID   cswch/s nvcswch/s  Command
11:07:56 AM     32316      4.00      0.00  php-fpm
11:07:56 AM     32508    160.00     34.00  php-fpm
11:07:56 AM     32726    131.00      8.00  php-fpm
......

Since PHP-FPM works in synchronous blocking mode, every request to Redis, Memcache, or MySQL blocks and causes a voluntary context switch (cswch/s), whereas an involuntary switch (nvcswch/s) is triggered only when the time slice expires. As you can see, most of the switches in the FPM processes are voluntary and very few are involuntary.

If you want to see the total number of context switches for a particular process, you can read it directly from the /proc interface, but note that these are cumulative totals since the process started.

grep ctxt /proc/32583/status  
voluntary_ctxt_switches:        573066  
nonvoluntary_ctxt_switches:     89260  
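Since these counters only ever grow, a simple way to eyeball the switching rate of that process (reusing the same example PID) is to refresh them periodically and highlight the differences:

# watch -d -n 1 'grep ctxt /proc/32583/status'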

Conclusion of this section

There is no need to memorize every step of a context switch; just remember the conclusion: on the author’s development machine, a context switch is estimated to cost about 2.7-5.48 us. You can measure your own machine with the code or tools provided above. lmbench is the more accurate of the two, because it accounts for the extra overhead caused by cache misses after the switch.

Extension: from operating system theory we are all familiar with the CPU time slice: when a process’s time slice runs out, it is taken off the CPU and another process runs. In practice, however, in the network-IO-intensive applications that dominate the Internet, involuntary switches caused by time slice expiry are rare; most switches are voluntary ones caused by waiting on network IO. As the example above shows, one of my FPM processes had about 570K voluntary switches and fewer than 90K involuntary ones. In the synchronous blocking development model, therefore, it is network IO that is responsible for frequent context switches.



CPU articles in the “Developing Internal Skills” series

  • 1. Do you think all the cores of your multicore CPU are real cores? The multicore “illusion”
  • 2. I heard you only know about memory and not the cache? The CPU is very sad!
  • 3. What is the TLB cache? How do you check TLB misses?
  • 4. What is the overhead of a process/thread switch?
  • 5. What makes coroutines better than threads?
  • 6. How much CPU do soft interrupts eat up?
  • 7. How much does a system call cost?
  • 8. What is the overhead of a simple PHP request to Redis?
  • 9. Do too many function calls cause performance problems?

My public account is “Developing Internal Skills and Practicing”. There I do not simply present technical theory, nor only share hands-on experience, but combine the two: using practice to deepen the understanding of theory, and using theory to improve your practical skills. You are welcome to follow my account, and please share it with your friends~~~