Performance optimization is systems engineering: it always concerns the whole stack, from program design and programming language down to the underlying infrastructure of the operating system, storage, and network. Any component can fail, and quite often several components degrade at the same time.
This article covers the basic metrics of Linux performance, the corresponding tools, and methods for observation, analysis, and tuning, spanning CPU performance, disk I/O performance, memory performance, and network performance. To understand system performance, you only need to grasp a few basic principles of how applications and the system work, and then build up an overall picture of performance through extensive hands-on practice.
Performance indicators: “high concurrency” and “fast response” evaluate performance from the perspective of application load and directly affect the end user’s experience of the product. They correspond to the two core indicators of performance optimization: “throughput” and “latency”.
As application load increases, system resource usage rises, eventually approaching its limit. The essence of a performance problem is that some system resource has hit a bottleneck, so requests are not processed fast enough to support more load. Performance analysis is about identifying application or system bottlenecks and finding ways to avoid or mitigate them, so that system resources can be used more efficiently to handle more requests.
Performance optimization methodology
- How do you evaluate the effectiveness of performance optimization?
Determine quantitative performance indicators, and test them both before and after optimization. When selecting quantitative indicators, choose at least two different metrics from the dimensions of the application and of system resources:
1) Application dimension: use throughput and request latency to evaluate application performance. 2) System resource dimension: use CPU usage to evaluate how busy the system's CPUs are. Points to note in performance testing:
1) Avoid performance testing tools interfering with application performance; 2) Avoid changes in the external environment affecting the evaluation of performance indicators.
- Multiple performance problems exist at the same time. Which should be optimized?
The 80/20 rule: not all performance problems are worth tuning. Find the most important ones, the ones that will yield the greatest performance improvement, starting with the system resource usage and performance indicators that deviate the most.
- Multiple optimization methods, which one to choose?
Choose the method that improves performance the most, but bear in mind that performance optimization often increases complexity, reduces maintainability, and may degrade other metrics.
Here are our starting points for system optimization.
This article starts with the CPU and introduces the direction of CPU-related optimization.
CPU optimization
Performance statistics
- Load average
Load average refers to the average number of processes in the runnable and uninterruptible states per unit of time, which is also the average number of active processes. It is not directly related to CPU usage.
- ① Running state
A process in the Running or Runnable (R) state, as shown by `ps aux`
- ② Uninterruptible state
A process that is executing in a critical section of the kernel and cannot be interrupted. The most common case is waiting for an I/O response from a hardware device. Such a process shows up as Uninterruptible Sleep (Disk Sleep, state D) in the `ps` command
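As a quick check, processes currently in uninterruptible sleep can be listed with `ps`. This is a generic sketch (not a command from the original text); on an idle machine the list is usually empty apart from the header.

```shell
# List processes whose state starts with D (uninterruptible sleep).
# The STAT column of `ps axo pid,stat,comm` holds the state letters;
# D-state processes are typically stuck waiting on disk or hardware I/O.
ps axo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If this prints many D-state processes, the system is likely bottlenecked on I/O rather than CPU.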
- Viewing the load average
The uptime command is used to check the average system load
[root@iZbp1agiqhn4s2q8a0980xZ ~]# uptime
15:00:00 up 535 days, 37 min, 1 user, load average: 0.04, 0.05, 0.05
15:00:00 // The current time
up 535 days, 37 min // System uptime
1 user // Number of logged-in users
load average: 0.04, 0.05, 0.05 // Load averages over the past 1, 5, and 15 minutes
For example, what does it mean when the average load is 2?
On a system with 2 CPUs, it means all CPUs are just about fully occupied. On a 4-CPU system, it means the CPUs are 50% idle. On a system with only one CPU, it means half of the processes cannot get the CPU. There is also a command that displays the system load average directly:
[root@iZbp1agiqhn4s2q8a0980xZ ~]# cat /proc/loadavg
0.10 0.06 0.01 1/72 29632
The first three numbers are the load averages. The fourth field is a fraction: the numerator is the number of currently running processes and the denominator is the total number of processes in the system. The last number is the PID of the most recently created process.
- What is a reasonable load average?
Use `grep 'model name' /proc/cpuinfo | wc -l` to see the number of CPUs. When the load average stays above 70% of the CPU count, it is time to analyze the cause of the high load.
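The 70% rule of thumb above can be scripted. The sketch below compares the 1-minute load average from /proc/loadavg against the CPU count; the 0.7 factor is the article's rule of thumb, not a kernel constant, and the nproc fallback covers systems whose /proc/cpuinfo lacks a "model name" field.

```shell
#!/bin/sh
# Read the 1-minute load average and the CPU count,
# then warn if load exceeds 70% of the number of CPUs.
load1=$(awk '{print $1}' /proc/loadavg)
ncpu=$(grep -c 'model name' /proc/cpuinfo)
[ "$ncpu" -gt 0 ] || ncpu=$(nproc)
# awk does the floating-point comparison: is load1 > 0.7 * ncpu ?
if awk -v l="$load1" -v n="$ncpu" 'BEGIN { exit !(l > 0.7 * n) }'; then
    echo "high load: $load1 on $ncpu CPU(s) - investigate"
else
    echo "load OK: $load1 on $ncpu CPU(s)"
fi
```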
- Relationship between average load and CPU
Load average refers to the number of active processes: going back to its definition, it is the average number of processes in the runnable and uninterruptible states per unit of time. So it includes not only processes using the CPU, but also processes waiting for the CPU and waiting for I/O. CPU usage, by contrast, is a statistic of how busy the CPU is per unit of time. The two correspond as follows:
- CPU-intensive processes: heavy CPU use drives up both the load average and CPU usage
- I/O-intensive processes: waiting for I/O also raises the load average, but does not necessarily raise CPU usage
- A large number of processes waiting for the CPU raises both the load average and CPU usage
Common Analysis Commands
- mpstat
1. A common multi-core CPU performance analysis tool; 2. Shows, in real time, the performance indicators of each CPU as well as the average indicators across all CPUs
Run mpstat to check the CPU usage
# -P ALL means monitor all CPUs; the number 5 means output one set of data every 5 seconds
[root@centos7-2 ~]# mpstat -P ALL 5
You can use pidstat to find out which process is responsible for the high CPU usage
# pidstat [options] [interval] [count]
# -u: displays CPU usage statistics for each process by default
# -r: Displays memory usage statistics for each process
# -d: Displays the IO usage of each process
# -p: Specifies the process number
# -w: Displays the context switch for each process
# -t: Displays thread-level statistics in addition to those of the task itself
[root@centos7-2 ~]# pidstat -u 5 1
Context switches
- Conceptual understanding
A CPU context switch saves the CPU context (the CPU registers and the program counter) of the previous task, loads the context of the new task into the registers and program counter, and then jumps to the new location the program counter points to and runs the new task. The saved context is stored in the system kernel and loaded again when the task is rescheduled to run.
Each context switch costs tens of nanoseconds to several microseconds of CPU time. If context switches happen too often, the CPU spends a great deal of time saving and restoring resources such as registers, the kernel stack, and virtual memory, which significantly reduces the time actually available for running processes (from the user's point of view, context switching is wasted time).
- Context switch timing
1) The CPU is allocated to tasks in time slices according to the scheduling policy; when a task's time slice runs out, a context switch is needed
2) A process with insufficient system resources is suspended until it obtains sufficient resources
3) A process suspends itself, for example via the sleep function
4) When a higher-priority process becomes runnable, the current process is suspended so that the higher-priority process can run; in other words, it is preempted
5) When a hardware interrupt occurs, the process on the CPU is suspended by the interrupt and an interrupt service routine runs in the kernel
- Context switch classification
1) Process context switch
2) Thread context switch
3) Interrupt context switch (hardware triggers signals that cause interrupt handlers to run, which is also a common kind of task)
- Viewing context switches
- vmstat
vmstat is a common system performance analysis tool, mainly used to analyze the system's memory usage, and also to analyze the number of CPU context switches and interrupts
Four columns need special attention:
- cs (context switch): the number of context switches per second.
- in (interrupt): the number of interrupts per second.
- r (Running or Runnable): the length of the ready queue, i.e. the number of processes running or waiting for the CPU.
- b (Blocked): the number of processes in an uninterruptible sleep state.
Vmstat only gives the overall context switch of the system, and to see the details of each process, you need to use pidstat, which we mentioned earlier. Add the -w option to see how each process is context-switched.
- us: a high value indicates that user processes are consuming a lot of CPU time; if it stays above 50% for a long period, consider tuning the application's algorithms or taking other measures
- sy: CPU consumed by the kernel. If sy is too high, the kernel is consuming too many CPU resources, which is not a good sign and needs investigation. The reference value for us + sy is 80%; if us + sy exceeds 80%, there may not be enough CPU
- id: idle time (including I/O wait time). In general, us + sy + id = 100
- wa: I/O wait time. If wa is too high, I/O wait is severe, which may be caused by heavy random disk access or a disk bandwidth bottleneck
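For reference, a single invocation such as `vmstat 1 2` prints two samples, with the last line reflecting the most recent interval. The sketch below extracts the columns discussed above; the field positions assume the procps vmstat layout, where in and cs are fields 11 and 12.

```shell
#!/bin/sh
# Extract r, b, in, and cs from the last vmstat sample.
# Skip gracefully if vmstat (procps) is not installed.
if ! command -v vmstat >/dev/null 2>&1; then
    echo "vmstat not installed"
    exit 0
fi
vmstat 1 2 | tail -n 1 |
    awk '{ printf "r=%s b=%s in=%s cs=%s\n", $1, $2, $11, $12 }'
```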
- pidstat
[root@centos7-2 ~]# pidstat -w -u 3
Conclusion: two columns in the results deserve our attention. One is cswch, the number of voluntary context switches per second; the other is nvcswch, the number of non-voluntary (involuntary) context switches per second.
- Voluntary context switch refers to the context switch caused by the process’s failure to obtain required resources. For example, voluntary context switching occurs when system resources such as I/O and memory are insufficient.
- Involuntary context switch refers to the context switch that occurs when the process is forcibly scheduled by the system due to the time slice. For example, when a large number of processes are competing for CPU, involuntary context switching can occur.
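The per-process counters behind pidstat's cswch and nvcswch columns are also exposed directly in /proc. As a minimal sketch, a shell can inspect its own counters:

```shell
# Every process exposes cumulative context-switch counts in
# /proc/<pid>/status; $$ is the current shell's own PID.
grep ctxt_switches /proc/$$/status
# voluntary_ctxt_switches:    the process gave up the CPU (e.g. waiting on I/O)
# nonvoluntary_ctxt_switches: the scheduler forced the process off the CPU
```

A mostly-voluntary count suggests a process that blocks on resources; a high involuntary count suggests contention for CPU time.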
CPU utilization
- concept
What metric best describes the system's CPU performance? Not the load average, and not CPU context switches, but another, more intuitive metric: CPU usage, a statistic of how busy the CPU is per unit of time, expressed as a percentage
- The beat rate
To keep track of CPU time, Linux triggers timer interrupts at a predefined beat rate (expressed as HZ in the kernel) and uses the global variable jiffies to record the number of beats since boot. Each timer interrupt increments jiffies by 1
- Check the beat rate
[root@iZbp1agiqhn4s2q8a0980xZ ~]# grep 'CONFIG_HZ=' /boot/config-$(uname -r)
CONFIG_HZ=1000
- User beat rate
Because the beat rate HZ is a kernel option, user-space programs cannot access it directly. For their convenience, the kernel also provides a user-space beat rate, USER_HZ, which is always fixed at 100, that is, 1/100 of a second. This way, a user-space program does not need to care what HZ is set to in the kernel, because it always sees the fixed value USER_HZ
CPU usage formula
CPU usage is the percentage of total CPU time spent on anything other than idle time, expressed by the formula:

CPU usage = 1 - (idle time / total CPU time)

Using this formula, we can easily calculate CPU usage from the data in /proc/stat. Likewise, dividing the CPU time of each state by the total CPU time gives the usage for that state.
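The formula can be applied by sampling /proc/stat twice and working with the deltas. This is a sketch that treats field 5 of the aggregate "cpu" line as idle time, per the proc(5) field layout.

```shell
#!/bin/sh
# Sample the aggregate "cpu" line of /proc/stat twice, one second apart,
# and compute CPU usage = 1 - (idle delta / total delta) as a percentage.
read_cpu() {
    # Print "<total ticks> <idle ticks>" from the first line of /proc/stat.
    awk '/^cpu /{ for (i = 2; i <= NF; i++) total += $i; print total, $5 }' /proc/stat
}
set -- $(read_cpu); t1=$1; i1=$2
sleep 1
set -- $(read_cpu); t2=$1; i2=$2
awk -v t="$((t2 - t1))" -v i="$((i2 - i1))" \
    'BEGIN { printf "CPU usage: %.1f%%\n", (1 - i / t) * 100 }'
```

The same delta technique works per-state: divide any field's delta (user, system, iowait, ...) by the total delta.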
- Performance tool calculation method
In practice, to compute CPU usage, performance tools calculate the difference between two samples taken over an interval (say, 3 seconds), which gives the average CPU usage over that interval
Command to check the CPU usage
top, ps, and pidstat are the most commonly used performance analysis tools for viewing CPU usage:
- top shows the overall CPU and memory usage of the system, as well as the resource usage of each process.
- ps only shows the resource usage of each process.
- pidstat analyzes the CPU usage of each process.
1. top output (screenshot omitted)
2. pidstat analyzes the CPU usage of each process
top does not break down user-mode and kernel-mode CPU time per process, so how do you view those details for each process?
Example: `pidstat 1 5` samples once per second, five times
3. pidstat output (screenshot omitted)
When CPU usage is too high
1. Query the processes that are heavily used
Run the top, ps, and pidstat commands
2. Determine which function in your code is occupying the most CPU
GDB and perf
GDB (the GNU Project Debugger) is a powerful program debugging tool. However, GDB interrupts the running program while debugging, which is often not acceptable in a production environment.
perf is a performance analysis tool built into Linux since kernel 2.6.31. Based on performance-event sampling, it can analyze various events and kernel performance of the system, as well as performance problems of a specified application. Here we use perf to analyze CPU performance problems
- case
Here are two usages
The first common usage is perf top, which is similar to top. It can display the functions or instructions that occupy the most CPU clock in real time. Therefore, it can be used to find hot functions, as shown in the following interface:
[root@centos7-2 ~]# perf top
Samples: 724 of event 'cpu-clock', Event count (approx.): 125711088
Overhead Shared Object Symbol
45.11% [kernel] [k] generic_exec_single
...
In the output, the first line contains three pieces of data: Samples, the event type, and Event count. In this example, perf collected 724 cpu-clock samples, with a total event count of 125711088. Pay special attention to the number of samples: if it is too small (say, only a dozen or so), the ordering and percentages below have no practical reference value
The first column, Overhead, is the proportion of the symbol's performance events among all samples, expressed as a percentage.
The second column, Shared, is the Dynamic Shared Object where the function or instruction resides, such as a kernel name, process name, dynamically linked library name, or kernel module name.
The third column, Object, is the type of the dynamic shared object. For example, [.] indicates user-space executables or dynamically linked libraries, while [k] indicates kernel space.
The last column, Symbol, is the symbol name, i.e. the function name. When the function name is unknown, a hexadecimal address is shown instead.
- Perf command description
- The second common usage is perf record and perf report. perf top displays the system's performance information in real time, but its drawback is that it does not save the data, so the data cannot be analyzed offline or later. perf record saves the sampled data, which is then parsed and displayed with perf report
Run perf record, then press Ctrl+C to stop sampling
[root@centos7-2 ~]# perf report
Samples: 5K of event 'cpu-clock', Event count (approx.): 1332500000
Overhead Command Shared Object Symbol
97.15% swapper [kernel.kallsyms] [k] native_safe_halt
0.49% swapper [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
...
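A typical record-then-report session looks like the sketch below. It is illustrative: it guards for perf being absent, and the `-a` (all CPUs) and `-g` (call-graph sampling) flags are common but optional; system-wide sampling may also require root privileges.

```shell
#!/bin/sh
# Sketch of an offline perf workflow: sample for 5 seconds,
# then inspect the saved data with perf report.
if ! command -v perf >/dev/null 2>&1; then
    echo "perf not installed; skipping" >&2
    exit 0
fi
# -a: all CPUs, -g: record call graphs; data is written to ./perf.data
perf record -a -g -- sleep 5 || { echo "perf record failed (may need root)" >&2; exit 0; }
# Parse and display the saved samples (use --stdio for plain text output)
perf report --stdio | head -n 20
```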
CPU performance tuning policies
CPU optimization: let's look at how to reduce CPU usage and improve the CPU's parallel processing capability, from the application and system perspectives respectively.
- Application optimization. First, from the application's perspective, the best way to reduce CPU usage is of course to eliminate all unnecessary work and keep only the core logic, for example by reducing loop nesting, recursion, and dynamic memory allocation. Beyond that, the most common methods to remember and use are:
1) Compiler optimization: many compilers provide optimization options; enable them appropriately so the compiler helps improve performance at build time.
2) Algorithm optimization: using an algorithm of lower complexity can significantly speed up processing.
3) Asynchronous processing: this prevents the application from blocking while waiting for a resource, increasing its concurrent processing capability. For example, replacing polling with event notification avoids the CPU-consuming cost of polling.
4) Multithreading instead of multiprocessing: compared with a process context switch, a thread context switch does not switch the process address space, so it reduces the cost of context switching.
5) Make good use of caching: put frequently accessed data, or intermediate results of computation, in a memory cache so that subsequent uses fetch them directly from memory, speeding up the program.
- System optimization. From the system's perspective, optimizing CPU operation means, on one hand, making full use of the locality of CPU caches to speed up access, and on the other hand, controlling processes' CPU usage and reducing interference between processes. Specifically, there are many system-level CPU optimization methods; the most common ones to remember and use are:
1. CPU binding: binding a process to one or more CPUs improves the CPU cache hit ratio and avoids the context switches caused by cross-CPU scheduling.
2. CPU exclusivity: similar to CPU binding, CPUs are grouped and processes are assigned to groups through the CPU affinity mechanism. The CPUs are then used exclusively by the assigned processes; in other words, no other process may use them.
3. Priority adjustment: use nice to adjust a process's priority; positive values lower the priority, negative values raise it.
4. Resource limits for processes: use Linux cgroups to cap a process's CPU usage, preventing an application's own problems from exhausting system resources.
5. Interrupt load balancing: interrupt handlers for both soft and hard interrupts can be CPU-intensive. Enable the irqbalance service, or configure smp_affinity, to spread interrupt handling across multiple CPUs automatically.
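The binding, priority, and limit techniques above map to concrete commands, sketched below. The values and the cgroup group name "demo" are illustrative examples only, and the cgroup v2 step (commented out) assumes root privileges and a mounted cgroup2 filesystem.

```shell
#!/bin/sh
# 1) CPU binding: pin a command to CPU 0
#    (taskset is part of util-linux; skipped if unavailable)
if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 sleep 1 &
fi

# 2) Priority adjustment: start a command with lowered priority (nice +10)
nice -n 10 sleep 1 &

# 3) Resource limit via cgroup v2: cap a group at 20% of one CPU
#    (hypothetical group "demo"; needs root and a cgroup2 mount)
# mkdir /sys/fs/cgroup/demo
# echo "20000 100000" > /sys/fs/cgroup/demo/cpu.max   # 20ms quota per 100ms period
# echo $$ > /sys/fs/cgroup/demo/cgroup.procs

wait
```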