This post draws mainly on two Geek Time columns and doubles as my own notes: sections 02 and 03 of Ni Pengfei's “Linux Performance Optimization in Practice” and section 04 of Tao Hui's “System Performance Tuning Must-Knows”.
This is one of the most frequent interview questions, usually asked as a follow-up to Kafka, and I had never summarized it in detail. Many readers and colleagues have asked me about it, and at the time I could only give a vague answer. Recently I have been reading more of other authors' work, and after going through the relevant material and coming back to the question, I think I understand it better.
How would you implement file transfer?
The server provides a file transfer function: it reads files from disk and sends them to clients over a network protocol. If you had to code this function yourself, how would you do it? Usually, you would take the most direct approach. After finding the path of the file on disk from the network request, if the file is large, say 320MB, you allocate a 32KB buffer in memory and divide the file into 10,000 pieces of 32KB each. Starting from the beginning of the file, you read 32KB into the buffer and then send those 32KB to the client through the network API. You repeat this 10,000 times until the entire file has been sent. As shown below:
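A minimal sketch of this direct approach, assuming a connected stream socket. The helper name `send_file_naive` is made up for illustration, and the file is opened unbuffered so that each `read` in the loop maps to one read system call.

```python
import socket

CHUNK_SIZE = 32 * 1024  # the 32KB user buffer from the example


def send_file_naive(path: str, sock: socket.socket) -> int:
    """Send a file with one read and one write per 32KB chunk.

    Returns the number of read/write rounds that carried data.
    """
    rounds = 0
    # buffering=0 makes every f.read() a real read system call
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(CHUNK_SIZE)  # kernel copies data into the user buffer
            if not chunk:               # empty result means end of file
                break
            # user buffer is copied into the socket buffer; sendall may need
            # more than one send call if the socket buffer is currently full
            sock.sendall(chunk)
            rounds += 1
    return rounds
```

For a 320MB file this loop runs about 10,000 rounds, which is where the counts discussed next come from.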
However, this solution does not perform well, for two main reasons.
First, it goes through at least 40,000 context switches between user mode and kernel mode. Each 32KB chunk is handled with one read call and one write call, and every system call has to switch from user mode to kernel mode first, and then switch back to user mode after the kernel completes its task. As you can see, there are four context switches for every 32KB processed, and 40,000 context switches after 10,000 repetitions.
These system calls are easier to understand together with the following three supplementary sections on context switching, which cover process, thread, and interrupt context switches.
Supplement: process context switch
Linux divides the running space of a process into kernel space and user space according to privilege level, corresponding to Ring 0 and Ring 3 of the CPU privilege levels in the figure below. Kernel space (Ring 0) has the highest privilege and can access all resources directly, while user space (Ring 3) can access only restricted resources and cannot directly access hardware such as memory. To reach these privileged resources, a process must trap into the kernel through a system call.
In other words, a process can run in both user space and kernel space.
When a process is running in user space, it is said to be in user mode, and when it has trapped into kernel space, it is said to be in kernel mode. The switch from user mode to kernel mode is made through a system call. For example, viewing the contents of a file takes several system calls: first open() to open the file, then read() to read it, then write() to write the content to standard output, and finally close() to close the file.
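The open/read/write/close sequence above can be sketched with Python's thin wrappers around those system calls. This is an illustration only; the helper name `cat` and the 4096-byte buffer are assumptions, and the function counts the wrapper calls it makes so the number of kernel round trips is visible.

```python
import os


def cat(path: str, out_fd: int = 1, bufsize: int = 4096) -> int:
    """Copy a file to out_fd (standard output by default); return the call count."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)  # open()
    calls += 1
    try:
        while True:
            data = os.read(fd, bufsize)  # read()
            calls += 1
            if not data:  # empty result means end of file
                break
            view = memoryview(data)
            while view:  # write() may accept fewer bytes than offered
                written = os.write(out_fd, view)  # write()
                calls += 1
                view = view[written:]
    finally:
        os.close(fd)  # close()
        calls += 1
    return calls
```

Each of these calls enters the kernel and returns, so even a tiny file costs several user/kernel round trips.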
So how does a system call switch the CPU context? Two more concepts first:
- A CPU register is a small but very fast piece of memory built into the CPU.
- The program counter stores the location of the instruction the CPU is executing, or the location of the next instruction to be executed.
These are the environment the CPU must depend on before running any task, and are therefore also called the CPU context.
Now that you know what the CPU context is, CPU context switching is easy to understand. A CPU context switch first saves the CPU context of the previous task (that is, the CPU registers and program counter), then loads the context of the new task into the registers and program counter, and finally jumps to the new location indicated by the program counter to run the new task. The saved context is stored in the kernel and loaded again when the task is rescheduled. This ensures that the original state of the task remains intact and the task appears to run continuously.
Going back to system calls: the location of the user-mode instruction held in the CPU registers has to be saved first. Then, in order to execute kernel code, the CPU registers are updated to the new location of the kernel instruction. The last step is to jump to kernel mode and run the kernel task. After the system call finishes, the CPU registers are restored to the saved user-mode state, and execution switches back to user space to continue running the process. So a single system call actually involves two CPU context switches.
However, it should be noted that a system call does not touch user-mode resources of the process such as virtual memory, nor does it switch processes. This is different from what we would normally call a process context switch:
- A process context switch is a switch from one process to another.
- During a system call, the same process is running the whole time.
So what is the difference between a process context switch and a system call? First, processes are managed and scheduled by the kernel, and process switching can only happen in kernel mode. The process context therefore includes not only user-space resources such as virtual memory, the stack, and global variables, but also kernel-space state such as the kernel stack and registers. As a result, a process context switch takes one more step than a system call: before the kernel state and CPU registers of the current process are saved, its virtual memory, stack, and so on must be saved; and after the kernel state of the next process is loaded, that process's virtual memory and user stack must be refreshed. Saving and restoring context is not “free”: it requires the kernel to run on the CPU.
Each context switch requires tens of nanoseconds to several microseconds of CPU time. This is considerable, especially when there are many process context switches: the CPU can easily end up spending a lot of time saving and restoring registers, kernel stacks, virtual memory, and other resources, which greatly shortens the time the processes actually spend running.
Linux relies on the TLB (Translation Lookaside Buffer) to manage the mapping between virtual memory and physical memory. When the virtual memory is updated, the TLB also needs to be flushed, which slows down memory access. On multiprocessor systems in particular, where the cache is shared by multiple processors, flushing the cache affects not only the processes on the current processor but also those on other processors that share the cache.
The material on the TLB is hard to follow. I searched around and found it full of specialist terminology, so I will not go deep here; it is not too late to dig in when we actually need it. For now, the overview below should be enough.
The TLB is a cache that the memory management hardware uses to speed up the translation of virtual addresses to physical addresses. All current desktop, laptop, and server processors use a TLB for this translation. With the TLB, the physical address behind a virtual address can be found quickly, without going to RAM to fetch the mapping between the two.
As for virtual versus physical addresses, roughly: each process has its own 4GB address space (the 32-bit case), and every process's address space has a similar layout.
When a new process is created, it sets up its own address space, and the process's data and code are copied from disk into that space. Which data lives where is recorded in the process control block, task_struct. It holds a linked list that records how the address space is allocated: which addresses hold data and which do not, which can be read and which can be written.
The memory allocated to each process is mapped to corresponding disk space, but the computer does not have that much memory (n x 4GB for n processes). In addition, creating a process would mean copying the program file from disk into the process's memory, which is unnecessary when several processes run the same program.
Therefore, each process's 4GB address space is only virtual memory, and every time an address in it is accessed, the address has to be translated into a physical memory address. All processes share the same physical memory, and each process maps into physical memory only the part of its virtual address space that it currently needs.
To know which addresses are in physical memory, which are not, and where they are, a page table is required. Each entry in the page table has two parts: the first records whether the page is in physical memory, and the second records the address of the physical page (if it is). When a process accesses a virtual address and the page table shows that the data is not in physical memory, a page fault occurs. Handling the page fault means copying the data the process needs from disk into physical memory.
Now that we know the potential performance problems of process context switching, let's look at exactly when process context switching happens.
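Before moving on, the page-table lookup described above can be sketched as a toy model. This is not how the kernel stores page tables; the class names, the 4096-byte page size, and the "hand out the next free frame" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PAGE_SIZE = 4096  # assumed page size for this toy model


@dataclass
class PageTableEntry:
    present: bool = False        # part 1: is the page in physical memory?
    frame: Optional[int] = None  # part 2: physical frame number, if present


class ToyAddressSpace:
    """A toy per-process page table with page-fault counting."""

    def __init__(self) -> None:
        self.page_table: Dict[int, PageTableEntry] = {}
        self.next_free_frame = 0
        self.page_faults = 0

    def translate(self, virtual_addr: int) -> int:
        """Translate a virtual address to a physical one, faulting if needed."""
        page, offset = divmod(virtual_addr, PAGE_SIZE)
        entry = self.page_table.setdefault(page, PageTableEntry())
        if not entry.present:
            # Page fault: a real kernel would load the page from disk here.
            self.page_faults += 1
            entry.frame = self.next_free_frame
            entry.present = True
            self.next_free_frame += 1
        return entry.frame * PAGE_SIZE + offset
```

The first access to a page faults and assigns a frame; later accesses to the same page translate without a fault.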
Obviously, a context switch is needed only when a process is scheduled. Linux maintains a ready queue for each CPU, sorts the active processes (that is, the processes that are running or waiting for the CPU) by priority and by how long they have waited for the CPU, and then picks the process that needs the CPU most, that is, the one with the highest priority and the longest wait.
So when does a process get scheduled onto the CPU? The easiest case to think of is when a process terminates: the CPU it was using is freed, and a new process is taken from the ready queue to run. In fact, many other scenarios trigger process scheduling, and I'll sort them out for you here.
First, to ensure that all processes are scheduled fairly, CPU time is divided into time slices, which are allocated to the processes in turn. When a process uses up its time slice, the system suspends it and switches to another process that is waiting for the CPU.
Second, when system resources are insufficient (for example, not enough memory), the process cannot run until the resources are available. It is suspended in the meantime, and the system schedules other processes to run.
Third, when a process suspends itself through a call such as sleep, it is naturally rescheduled.
Fourth, when a process with a higher priority becomes runnable, the current process is suspended so that the higher-priority process can run.
Finally, when a hardware interrupt occurs, the process on the CPU is suspended by the interrupt, and the kernel runs the interrupt service routine instead.
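Some of these triggers can be observed from inside a process. The sketch below assumes a Unix system and reads the `ru_nvcsw` and `ru_nivcsw` fields of `resource.getrusage`, which report voluntary and involuntary context switches of the process; the helper names are made up. Note that these counters track the process being switched off the CPU, not the user/kernel mode switches of individual system calls.

```python
import resource
import time


def context_switches() -> tuple:
    """Return (voluntary, involuntary) context switches of this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def switches_during_sleep(naps: int = 5, seconds: float = 0.01) -> tuple:
    """Count the context switches that occur while the process sleeps `naps` times."""
    vol_before, invol_before = context_switches()
    for _ in range(naps):
        time.sleep(seconds)  # the process gives up the CPU on its own
    vol_after, invol_after = context_switches()
    return vol_after - vol_before, invol_after - invol_before
```

On a typical Linux machine the voluntary count grows with each sleep, while the involuntary count grows when a time slice runs out or a higher-priority task preempts the process.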
Thread context switch
The main difference between threads and processes is that threads are the basic unit of scheduling, while processes are the basic unit of resource ownership. To put it bluntly, what the kernel calls task scheduling actually schedules threads; the process only provides resources such as virtual memory and global variables to its threads. So, threads and processes can be understood as follows:
- When a process has only one thread, it can be considered equal to a thread.
- When a process has multiple threads, these threads share resources such as virtual memory and global variables. These resources do not need to be modified during context switching.
- In addition, threads also have their own private data, such as stacks and registers, which need to be saved during context switches.
Thread context switching therefore falls into two cases. In the first, the two threads belong to different processes. Since they share no resources, the switch is the same as a process context switch. In the second, both threads belong to the same process. Because the virtual memory is shared, it stays in place during the switch, and only the thread-private data, registers, and other non-shared data need to be switched. As you can see, switching between threads of the same process consumes fewer resources than switching between processes, which is one advantage of using multiple threads instead of multiple processes.
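The sharing described above can be made concrete with a small sketch: a module-level variable is visible to every thread of the process, while a local variable lives on each thread's own stack. The names `shared_counter`, `worker`, and `run_threads` are made up for illustration.

```python
import threading

shared_counter = 0            # belongs to the process; visible to every thread
lock = threading.Lock()


def worker(increments: int, results: list, index: int) -> None:
    global shared_counter
    local_total = 0           # private to this thread (lives on its own stack)
    for _ in range(increments):
        local_total += 1
        with lock:            # shared data needs synchronization
            shared_counter += 1
    results[index] = local_total


def run_threads(n_threads: int = 4, increments: int = 1000) -> tuple:
    """Run the workers; return (final shared counter, per-thread local totals)."""
    global shared_counter
    shared_counter = 0
    results = [0] * n_threads
    threads = [threading.Thread(target=worker, args=(increments, results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_counter, results
```

Every thread sees and updates the same `shared_counter`, but each keeps its own `local_total`.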
Interrupt context switch
Interrupts are another scenario that switches the CPU context. To respond quickly to hardware events, an interrupt breaks into the normal scheduling and execution of a process and invokes the interrupt handler to respond to the device event. When another process is interrupted, its current state has to be saved so that, once the interrupt is over, the process can resume from where it left off.
Unlike a process context switch, an interrupt context switch does not involve the user-mode state of the process. So even if the interrupt breaks into a process running in user mode, there is no need to save and restore user-mode resources such as the process's virtual memory and global variables. The interrupt context only includes the state needed to run the kernel-mode interrupt service routine, including CPU registers, the kernel stack, and hardware interrupt parameters.
On a given CPU, interrupt handling has higher priority than processes, so an interrupt context switch does not happen at the same time as a process context switch. Likewise, because interrupts break into the scheduling and execution of normal processes, most interrupt handlers are kept short so that they finish as quickly as possible.
Like process context switches, interrupt context switches also cost CPU time, and too many of them consume a large amount of CPU and can seriously reduce overall system performance.
To sum up, CPU context switching is one of the core functions that keep a Linux system running normally, and it generally does not need our special attention. However, excessive context switching spends CPU time on saving and restoring registers, kernel stacks, virtual memory, and other data, which shortens the time processes actually spend running and leads to a significant drop in overall system performance.
Back to zero copy
In our scenario, there were four context switches for every 32KB processed, and 40,000 after 10,000 repetitions. Context switching is not trivial: although a single switch costs only tens of nanoseconds to several microseconds, a high-concurrency service magnifies this cost.
Second, this scheme performs 40,000 memory copies, and for a 320MB file the number of bytes copied is quadrupled to 1280MB. Obviously, excessive memory copying consumes CPU resources unnecessarily and reduces the concurrent processing capacity of the system.
Therefore, to improve file transfer performance, we need to reduce both the number of context switches and the number of memory copies.
How does zero-copy improve file transfer performance?
By the way, why does reading a disk file require a context switch? Because reading disks and operating network cards are done by the operating system kernel. The kernel manages all processes on the system, has the highest privilege, and works in an environment completely different from that of user processes. Whenever our code makes a system call such as read or write, there must be two context switches: first from user mode to kernel mode, and then, when the kernel is done, back to user mode for the process code to continue.
Therefore, to reduce the number of context switches, you must reduce the number of system calls. The solution is to merge the read and write system calls into one and complete the data exchange between disk and network card inside the kernel.
Next, consider how to reduce the number of memory copies. Each round makes four memory copies. The two that involve physical devices are essential: copying disk content into memory, and copying memory to the NIC. But the other two, which involve the user buffer, are not necessary, because there is no reason for the user buffer to exist when a disk file is simply being sent to the network.
If the kernel copies the contents of PageCache directly to the socket buffer after reading the file, and notifies the process once the NIC has finished sending, there will be only 2 context switches and 3 memory copies.
If the network adapter supports Scatter-Gather Direct Memory Access (SG-DMA), the copy into the socket buffer can also be removed, leaving only two memory copies in total.
In an ordinary DMA transfer, the source physical addresses and the destination physical addresses must each be contiguous. However, addresses that are contiguous in virtual memory are not necessarily contiguous physically, so the transfer has to be split into several: after each physically contiguous block is transferred, an interrupt is raised, and the host then starts the transfer of the next physically contiguous block. Scatter-gather DMA takes a different approach. It uses a linked list to describe the physically discontiguous memory regions and then tells the DMA master the address of the head of the list. Instead of raising an interrupt after each physically contiguous block, the DMA master follows the linked list to transfer the next block, and raises a single interrupt only when the whole transfer is complete.
In effect, this is zero-copy technology. It is a new function provided by the operating system that accepts both a file descriptor and a TCP socket as input parameters, such as this
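A minimal sketch, assuming Linux, where `sendfile` is one function of this kind and Python exposes it as `os.sendfile(out_fd, in_fd, offset, count)`. The helper name `send_file_zero_copy` is made up for illustration, and the loop is there because a single call may send fewer bytes than requested.

```python
import os
import socket


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file with sendfile(); return the number of sendfile calls."""
    calls = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # Ask for everything that is left; the kernel sends what it can
            # and reports how many bytes went out.
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            calls += 1
            if sent == 0:  # 0 means end of file, e.g. the file shrank meanwhile
                break
            offset += sent
    return calls
```

The file contents never pass through a buffer owned by the Python process: the function only hands the kernel two descriptors, an offset, and a byte count.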
During execution, the memory copies are completed entirely in kernel mode, which reduces both the number of memory copies and the number of context switches. Furthermore, by eliminating the user buffer, zero copy not only reduces the memory consumed in user space, but also, by making the most of the memory in the socket buffer, indirectly reduces the number of system calls once more, and with it drastically reduces the number of context switches.
You may recall that, to transfer the 320MB file without zero copy, a 32KB user buffer was allocated and the file was divided into 10,000 pieces. But where did 32KB come from? Why not 32MB or 32 bytes? Because, without zero copy, we want the highest memory utilization. If the user buffer is too large, its contents cannot be copied into the socket buffer in one go (the socket buffer size is the limit here); if the user buffer is too small, it results in too many read/write system calls.
Why not make the user buffer the same size as the socket buffer? Because the available space in the socket buffer changes dynamically: it is used both for the TCP sliding window and as the application's buffer, and it is also affected by overall system memory. On long fat networks in particular, its range of variation is especially large.
Zero copy makes it unnecessary to care about the socket buffer size. For example, when calling the zero-copy send method, you can set the number of bytes to send to all the unsent bytes of the file. If the socket buffer then has 1.4MB available, 1.4MB is sent to the client in one go instead of 32KB. This means that sending 1.4MB with zero copy costs only 2 context switches, compared with 176 (4 * 1.4MB/32KB) context switches without zero copy and with a 32KB user buffer.
In summary, for the 320MB file transfer mentioned at the beginning of the article, when the socket buffer is about 1.4MB, zero copy needs only a little over 400 context switches and a little over 400 memory copies, and only 640MB of data is copied. Not only is the request latency lower, but each request also consumes less CPU, so the server can support more concurrent requests.
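The figures for the two schemes can be re-derived with a few lines of arithmetic. The sketch assumes 1MB = 1000KB (which is what makes 320MB / 32KB come out at exactly 10,000), a socket buffer with 1.4MB free on every call, and an SG-DMA capable NIC for the zero-copy case; the constant and function names are made up.

```python
FILE_KB = 320_000      # 320MB, counting 1MB as 1000KB
USER_BUF_KB = 32       # user buffer in the read/write scheme
SOCK_BUF_KB = 1_400    # about 1.4MB of usable socket buffer (assumed constant)


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def read_write_cost() -> dict:
    rounds = ceil_div(FILE_KB, USER_BUF_KB)
    return {
        "context_switches": 4 * rounds,    # read + write, 2 switches each
        "memory_copies": 4 * rounds,       # disk, PageCache, user buffer, socket buffer, NIC
        "copied_mb": 4 * FILE_KB // 1000,  # every byte is copied 4 times
    }


def zero_copy_cost() -> dict:
    calls = ceil_div(FILE_KB, SOCK_BUF_KB)
    return {
        "context_switches": 2 * calls,     # one zero-copy call per round
        "memory_copies": 2 * calls,        # disk to PageCache, PageCache to NIC
        "copied_mb": 2 * FILE_KB // 1000,  # every byte is copied twice
    }
```

Under these assumptions the read/write scheme costs 40,000 context switches and 1280MB copied, while zero copy takes 229 calls, that is 458 context switches and 458 copies, with 640MB copied.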
In addition, zero copy uses PageCache technology.
PageCache, disk cache
If you recall from above, when reading a file, the disk file is copied to PageCache first and then to the process. Why? There are two reasons.
First, since the disk is much slower than memory, we should try to replace disk reads and writes with memory reads and writes. For example, we can copy data from the disk into memory and then read memory instead of the disk. However, memory is much smaller than disk space, so memory can only ever hold a fraction of the data on disk. In general, data that has just been accessed has a high probability of being accessed again within a short period. PageCache therefore caches the most recently accessed data, and when space runs out it evicts the cached data that has gone longest without being accessed (that is, the LRU algorithm). When reading from disk, PageCache is searched first, and if the data is there it is returned directly, which greatly improves read performance.
Second, to read data from a disk, the location of the data has to be found first. On a mechanical disk, this means moving the head to the sector where the data is located and then reading the data sequentially. To reduce the impact of moving the head, which takes a long time, PageCache uses readahead. That is, although the read call asks only for bytes 0-32KB, the kernel also reads the following 32-64KB into PageCache, which is cheap to do. If the process reads the 32-64KB range before it is evicted from PageCache, the benefit is very large. That is exactly what happens in the file transfer scenario in this lecture.
In summary, PageCache improves disk performance in more than 90% of scenarios. However, in some cases PageCache does not help, or even reduces disk performance because of the extra memory copy. In these scenarios, zero copy, which uses PageCache, also loses performance. Specifically, this happens when transferring large files.
For example, suppose you have many gigabyte-sized files to transfer. Each time a user accesses one of these large files, the kernel loads it into PageCache, and the large files quickly fill up the limited PageCache. Yet because the files are so large, the probability that any given part of a file will be accessed again is actually very low. This causes two problems. First, because PageCache is occupied by large files for a long time, hot small files cannot make full use of PageCache and are read slowly. Second, the large files in PageCache get no benefit from caching, yet cost the CPU one extra copy into PageCache.
Therefore, in high-concurrency scenarios, to prevent PageCache from being taken over by large files and losing its effect for small files, large files should not use PageCache, and hence should not be handled with zero copy. It is like watching a movie: I only want to watch the first 10 minutes, but I am made to sit through the whole film, which is obviously a loss. When handling large files under high concurrency, asynchronous IO and direct IO should be used instead of zero copy.
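The crowding-out effect can be shown with a toy model. The sketch below is a plain LRU cache, as named in the text, keyed by (file, page index); it is not the kernel's code, whose replacement policy is more elaborate, and the capacity and file names are made up.

```python
from collections import OrderedDict


class ToyPageCache:
    """A plain LRU cache keyed by (file, page index)."""

    def __init__(self, capacity_pages: int) -> None:
        self.capacity = capacity_pages
        self.pages: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, file: str, page: int) -> None:
        key = (file, page)
        if key in self.pages:
            self.hits += 1
            self.pages.move_to_end(key)         # mark as most recently used
        else:
            self.misses += 1                    # would have to go to disk
            self.pages[key] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page


def hot_file_hits_after_scan(scan_pages: int) -> int:
    """Warm 4 hot pages, scan `scan_pages` pages of a big file, re-read the hot pages."""
    cache = ToyPageCache(capacity_pages=8)
    for page in range(4):
        cache.read("hot_small_file", page)      # warm-up: 4 misses
    for page in range(scan_pages):
        cache.read("big_file", page)            # one pass over a large file
    hits_before = cache.hits
    for page in range(4):
        cache.read("hot_small_file", page)      # were the hot pages kept?
    return cache.hits - hits_before
```

With no scan in between, all 4 re-reads of the hot file hit the cache; after a single 100-page pass over the big file, none of them do.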
Asynchronous I/O + Direct I/O
Going back to the example at the beginning: when the read method is called to read a file, it actually blocks while the disk is seeking, so the process cannot do other tasks concurrently. As shown in the following figure, pulling the data blocks the entire process.
Asynchronous IO (which can handle both network IO and disk IO, but we'll focus on disk IO here) solves the blocking problem. It splits the read operation into two halves. In the first half, the process issues a read request to the kernel and returns immediately without waiting for the data to be in place, so it can handle other tasks concurrently. In the second half, after the kernel has copied the data from disk into the process buffer, the process receives a notification from the kernel and processes the data. As shown below:
As you can see from the figure, asynchronous IO does not copy the data into PageCache, which is actually a defect in the implementation of asynchronous IO.
IO that goes through PageCache is called cached IO. It is so tightly coupled to the virtual memory system that asynchronous IO has never supported cached IO since its birth. IO that bypasses PageCache is a different species, which we call direct IO. For disks, asynchronous IO supports only direct IO.
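The two-half shape of an asynchronous read can be sketched as follows. Python's standard library has no binding for kernel asynchronous disk IO, so this sketch swaps in a thread pool as a stand-in: it shows only the "submit and return, then get notified" structure, still goes through PageCache, and is not direct IO. The helper names are made up.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


def read_whole_file(path: str) -> bytes:
    """The blocking read; runs on a worker thread instead of the caller."""
    with open(path, "rb") as f:
        return f.read()


def read_async(pool: ThreadPoolExecutor, path: str,
               on_done: Callable[[bytes], None]) -> Future:
    """First half: submit the read request and return at once.

    Second half: on_done(data) is invoked once the data is ready.
    """
    future = pool.submit(read_whole_file, path)
    future.add_done_callback(lambda fut: on_done(fut.result()))
    return future
```

A caller would submit the read, carry on with other work, and handle the data in the callback, for example `read_async(pool, path, handle_data)` with a hypothetical `handle_data` function.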
Direct IO has only a few application scenarios, mainly two. First, the application already implements its own caching of disk files and does not need PageCache to cache them again, which would only add overhead. Databases such as MySQL use direct IO for this reason. Second, as mentioned above, large files rarely hit PageCache, incur an extra memory copy, and take up PageCache memory that small files need, so direct IO should be used for them.
The disadvantage of direct IO is that it forgoes the benefits of PageCache. With PageCache, the kernel (the IO scheduling algorithm) tries to cache as many consecutive IOs as possible and finally merges them into one larger IO sent to the disk, which reduces disk seeks. The kernel also reads ahead subsequent IO and places it in PageCache to reduce disk operations. Direct IO can do none of that.
Summary
When files are transferred through a user buffer, excessive memory copying and context switching degrade performance. Zero copy completes the memory copies inside the kernel, which naturally reduces their number. It combines the disk read and the network send into a single system call, which reduces the number of context switches. In particular, because the copying is done in the kernel, it can use as much of the socket buffer's available space as possible, which increases the amount of data handled per system call and further reduces the number of context switches.
Zero copy is built on PageCache. PageCache caches recently accessed data, which improves the performance of reading cached data. It also helps the IO scheduling algorithm merge IO and read ahead, which addresses the slow seeks of mechanical disks (this is also one reason why sequential reads perform better than random reads). Both further improve zero-copy performance. Almost all operating systems support zero copy, and it is a good choice whenever the scenario is sending files to the network.
In fact, if you use solid-state drives (SSDs, which have no head to move), PageCache will not have as much impact. For details, see my previous article on SSDs.
However, zero copy has a disadvantage: it does not allow the process to do anything to the file contents before sending them, such as compressing the data first. In addition, zero copy cannot be used when PageCache has a negative effect; in that case asynchronous IO plus direct IO can be used instead. We typically set a file size threshold, using asynchronous IO plus direct IO for large files and zero copy for small files.
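The threshold rule can be written down as a small dispatch function. The 64MB cut-off and all names below are hypothetical, and falling back to the read/write path when the data must be processed is an assumption drawn from the limitation described above, not a rule stated by the source.

```python
LARGE_FILE_THRESHOLD = 64 * 1024 * 1024  # hypothetical cut-off; tune per workload


def choose_transfer_strategy(size_bytes: int, needs_processing: bool = False,
                             threshold: int = LARGE_FILE_THRESHOLD) -> str:
    """Pick a transfer path for one file based on its size."""
    if needs_processing:
        # Zero copy cannot touch the data (e.g. compress it), so the data
        # has to pass through a user buffer.
        return "read_write"
    if size_bytes >= threshold:
        return "async_direct_io"  # keep large files out of PageCache
    return "zero_copy"            # small files benefit from PageCache
```

For example, a 4KB file maps to `"zero_copy"` and a 1GB file maps to `"async_direct_io"` under the default threshold.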