About the author
Yang Yong and Wu Yihao are Linux system engineers from the Ali Cloud system group.
If you find any omissions in this article, or have suggestions or comments, please reply via our WeChat account, or email oliver.yang at linux.alibaba.com.
The Ali Cloud system team grew out of the former Taobao kernel group. In 2013, the Taobao kernel group answered the call of Alibaba Group and merged into Ali Cloud, where it began building solid low-level system support for cloud computing. The team is composed of kernel developers with a strong sense of mission and self-pursuit, most of whom are active community kernel developers. The team's current work focuses on (but is not limited to) Linux kernel memory management, file systems, networking, kernel maintenance and tooling, as well as user-space libraries and tools associated with the kernel. If you are interested in our work, please join us by sending your resume to tao.ma at linux.alibaba.com or boyu.mt at alibaba-inc.com.
1. What is CPI?
This section explains why it makes sense to use CPI to analyze application performance. If you already know what CPI means for analyzing application performance, skip this section.
1.1 How can the program run fast?
To understand what CPI is, let’s start by thinking about how to make a program run faster on a given processor.
Assuming that the standard for a fast program is a short execution time, the execution time of a program can be expressed by the classic performance equation:

Execution Time = Instruction Count × Average Clock Cycles per Instruction × Clock Cycle Time
Therefore, to make a program run faster, that is, to reduce its execution time, we need to work on the following three factors:

- Reduce the total number of program instructions. To reduce the total number of instructions executed by the program, you may use:
  - Algorithm optimization: a better algorithm may require fewer instructions to execute.
  - A more efficient compiler or interpreter: a newer compiler or interpreter may generate less machine code for the same source code.
  - A lower-level language: this is why the Linux kernel code is written in C and prefers inline assembly in critical paths.
  - Newer processor instructions: one of the most important jobs of a new compiler is to use the latest, more efficient instructions of a new processor, for example the x86 SSE and AVX instructions.
- Reduce the CPU clock cycle time. This point is easy to understand: shortening the CPU clock cycle means raising the CPU clock frequency, which was one of Intel's past success stories. Today it is difficult to shorten the clock cycle any further, as the clock frequency has been pushed to the limit of the manufacturing process.
- Reduce the average number of clock cycles per instruction. How do you reduce the average number of CPU clock cycles per instruction? Let's start from a CPU design perspective:
  - Scalar processor: executes at most one instruction per CPU clock cycle.
  - Superscalar processor: can execute multiple instructions per CPU clock cycle, usually by implementing multistage pipelines and multiple execution units in the processor.

Therefore, it is not hard to see that if we use a superscalar CPU and exploit the CPU pipeline to improve instruction-level parallelism, we can achieve our goal: the higher the parallelism of the pipeline, the higher the execution efficiency, and the lower the average number of clock cycles per instruction.

Of course, the parallelism and efficiency of a pipeline depend on many factors, such as instruction fetch speed, memory access speed, and the processor's out-of-order execution, branch prediction, and speculative execution capabilities. Once the pipeline's ability to execute in parallel is reduced, the performance of the program suffers. For details on superscalar processors, pipelining, out-of-order execution, and speculative execution, please refer to the relevant resources.
In addition, on SMP or multi-core processor systems, programs can be parallelized to increase instruction-level parallelism. This is why today's CPU architectures have moved to multi-core and many-core designs now that the CPU clock frequency can no longer be raised.
Processor manufacturers must also keep improving their manufacturing processes to reduce the chip area and power consumption of CPUs, because they need to execute more instructions per clock cycle while still raising the CPU clock frequency.
1.2 CPI and IPC
In the world of computer architecture, the term CPI is used frequently. CPI is short for Cycles Per Instruction, the average number of clock cycles each instruction takes to execute. In some contexts you will also see IPC, short for Instructions Per Cycle.
Therefore, it is not difficult to see that CPI and IPC are simply reciprocals of each other:

CPI = 1 / IPC
Using the definition of CPI, the formula at the beginning of this article for measuring program execution performance can be rewritten as:

Execution Time = Instruction Count × CPI × Clock Cycle Time
Due to the limitations of silicon and the manufacturing process, raising the processor clock frequency has hit a bottleneck. Therefore, program performance improvements must come mainly from reducing the Instruction Count and the CPI.
On Linux, it is easy to measure the IPC of a program with the perf tool, which reads the hardware counters provided by the processor's Performance Monitoring Unit (PMU). For example, the following kind of run measures a Java program over 8 seconds and reports an IPC of 0.54:
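The original perf output is not reproduced here; a sketch of such a measurement looks like the following (the PID 4475 is hypothetical, and the 8-second window matches the example above):

```shell
# Attach to the target process for 8 seconds; in its summary,
# perf stat prints "insn per cycle", which is the IPC.
perf stat -e cycles,instructions -p 4475 -- sleep 8
```

The `sleep 8` is only a timer: perf counts the target process's events while the sleep runs, then prints the totals and the derived IPC.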
Thus, from the IPC we can also compute the CPI as 1/0.54, which is about 1.85.
In general, CPI values can be used to determine whether a computationally intensive task is CPU-intensive or memory-intensive:

- With a CPI of less than 1, programs are usually CPU-intensive;
- With a CPI greater than 1, programs are typically memory-intensive.
1.3 Rethinking CPU Utilization
For programmers, CPU utilization while the program is running is commonly used to judge the efficiency of a computation-intensive task. Many people assume that high CPU utilization means the program's code is running flat out. In fact, high CPU utilization may also mean the CPU is busy-waiting on some bottlenecked resource, such as memory access.
Some computationally intensive tasks, under normal conditions, have a low CPI and otherwise good performance, with high CPU utilization. However, as the system load increases and other tasks compete for system resources, the CPI of these tasks can rise significantly and their performance deteriorates. At this point, CPU utilization is probably still very high, but that high utilization actually reflects busy-waiting and pipeline stalls.
Brendan Gregg pointed out in his blog post CPU Utilization is Wrong that the CPU utilization metric needs to be analyzed in conjunction with CPI/IPC metrics, and he explains the cause and effect in detail. Interested readers can read the original post, or subscribe to the kernel monthly talk WeChat account to read our translation.
At this point, it should be clear that CPI makes it possible to understand the performance of an application without modifying its binary. For computationally intensive applications, traditional metrics such as CPU utilization alone will not help you determine the efficiency of your application; CPU utilization must be combined with CPI/IPC.
1.4 How to Analyze CPI/IPC Indicators?
Although it is easy to obtain CPI/IPC metrics using perf, analyzing and optimizing an application with a high CPI requires tools and analysis methods that can find the cause of the high CPI and the associated software call stacks, so as to determine the direction of optimization.
The causes of a high CPI are analyzed in Appendix B of the Intel 64 and IA-32 Architectures Optimization Reference Manual. The main idea is a Top-Down method that investigates the four main categories of causes behind a high CPI. Since this article mainly introduces the CPI flame graph, the Top-Down analysis method will not be expanded here due to space limitations; it will be covered in a dedicated article later.
2. The CPI flame graph
Brendan Gregg, in CPI Flame Graphs: Catching Your CPUs Napping, describes using the CPI flame graph to correlate the CPI with the software call stack.
We already know that CPU utilization alone does not tell us what the CPU is doing, because the CPU may issue an instruction and then stall, waiting for a resource. This wait is transparent to the software, so from the user's point of view the CPU still appears to be in use, but in reality instructions are not being executed effectively; the CPU is busy-waiting, and this kind of utilization is not effective utilization.
The easiest way to find out what the CPU is actually doing when it is busy is to measure its average CPI. A higher CPI means more cycles are spent per instruction on average. These additional cycles are often pipeline stall cycles, such as cycles spent waiting for memory reads and writes.
Built on top of the CPU flame graph, the CPI flame graph provides a visual, combined analysis of a program's CPU execution efficiency based on both CPU utilization and the CPI metric.
The CPI flame graph below is taken from Brendan Gregg's blog post.
As you can see, the CPI flame graph is based on the CPU flame graph, with each frame colored according to its CPI: red represents instruction-executing cycles, and blue represents pipeline stall cycles. The width of each function frame shows how often a function or its children were on-CPU, just as in a normal CPU flame graph; the color shows whether the function was running on the CPU (red) or stalled (blue).
In the flame graph, the colors range from blue for the highest CPI (slowest instructions) to red for the lowest CPI (fastest instructions). The flame graph is an SVG vector image, and therefore supports click-to-zoom with the mouse.
However, in that blog post the CPI flame graph is generated with commands specific to the FreeBSD operating system. What do you do on Linux?
3. A small program
Let's write a small, artificial program to demonstrate the use of CPI flame graphs on Linux.

This simple program contains the following two functions:
- The cpu_bound function body is a loop of NOP instructions. Since NOP is one of the simplest instructions and does not access memory, the function's CPI is bound to be less than 1; this is typical CPU-intensive code.
- The memory_bound function uses _mm_clflush to evict the cache, artificially triggering L1 D-cache load misses. Its CPI is therefore bound to be greater than 1; this is typical memory-intensive code.
Here is the source code of the program:
While the above program runs, we generate the CPI flame graph with the following commands.
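The original command line is not shown here; one plausible recipe on Linux, using perf together with Brendan Gregg's FlameGraph tools, is sketched below (the program name `cpi`, the sampling window, and the choice of stall event are assumptions; available events vary by CPU):

```shell
# Sample on-CPU cycles and stalled cycles, with call stacks,
# while the test program is running.
perf record -g -e cycles -o cycles.data -p "$(pidof cpi)" -- sleep 10
perf record -g -e stalled-cycles-frontend -o stalled.data -p "$(pidof cpi)" -- sleep 10

# Fold the stacks with the FlameGraph scripts, then render a
# differential flame graph: stall-dominated frames trend blue,
# instruction-dominated frames trend red.
perf script -i cycles.data  | ./stackcollapse-perf.pl > cycles.folded
perf script -i stalled.data | ./stackcollapse-perf.pl > stalled.folded
./difffolded.pl stalled.folded cycles.folded | ./flamegraph.pl > cpi-flamegraph.svg
```

The stackcollapse-perf.pl, difffolded.pl, and flamegraph.pl scripts come from the FlameGraph repository; the two-profile differential rendering is one way to approximate the per-frame CPI coloring described above.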
The resulting flame graph is shown below.
As you can see, the CPI flame graph matches our expectations:

- All of the program's CPU time is split between the cpu_bound and memory_bound functions.
- cpu_bound is red, indicating that this function's instructions run continuously on the CPU.
- memory_bound is blue, indicating that this function suffers serious memory-access delays that cause pipeline stalls and busy-waiting.
4. A benchmark
Now, we can use the CPI flame graph to analyze a slightly more realistic test scenario. The CPI flame graph below is from an fio test scenario.
This fio run performs multiple synchronous direct IO sequential writes to a SATA disk. As you can see:
- The red functions are CPU bound. The darkest of these is _raw_spin_lock, whose cycles are caused by spin-lock wait loops.
- The blue functions are memory bound. The darkest of these is the fio test program's get_io_u function; further analysis with perf shows that it suffers serious LLC cache misses.
Since the CPI flame graph is a vector image and supports zooming, this can be further confirmed by zooming in on the get_io_u call stack.
At this point, the reader will have found that the CPI flame graph makes it very easy to analyze CPU utilization and to find and locate the functions causing CPU stalls. Once the relevant function is found, the instructions causing the stalls can be further confirmed with the perf annotate command. Furthermore, we can use the Top-Down analysis from Section 1.4 to determine which part of the CPU is the bottleneck, and finally decide the direction of optimization based on this information.
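For instance, once the flame graph points at a function, perf annotate can show which instructions the samples landed on (the get_io_u symbol here is taken from the fio example above, and perf.data is assumed to come from a prior perf record run):

```shell
# Annotate one symbol from an existing profile: instructions with
# high sample percentages are where the cycles are being spent.
perf annotate --symbol=get_io_u -i perf.data
```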
5. Summary
This article has introduced the use of CPI flame graphs to analyze program performance. The CPI flame graph not only shows how an application's call stacks correlate with its CPU usage, but also reveals which parts of that CPU usage are time spent effectively executing instructions and which are actually busy-waiting due to stalls.
System administrators can use this tool to discover resource bottlenecks in the system and alleviate them with system management commands. For example, cache-thrashing interference between applications can be resolved by binding the applications to different CPUs.
Application developers, in turn, can optimize the relevant functions to improve program performance. For example, stalls caused by memory access can be reduced by optimizing code to lower cache misses, thereby lowering the application's CPI.