Perf: performance analysis tool in Linux

Introduction

Perf is a system performance analysis tool that has shipped with the Linux kernel since version 2.6.31. It uses the PMU (Performance Monitoring Unit), tracepoints, and special counters in the kernel to collect statistics. Besides profiling applications, it can also analyze the running kernel code, which gives a more complete picture of an application's performance bottlenecks.

The basic principle of perf is to sample the target: at each sampling point it records which event occurred and in what context, and then counts how often each context shows up. For example, sampling can be driven by the tick interrupt: a sampling point fires inside the tick interrupt and records the context the process was in at that moment. If a process spends 90% of its time in function foo(), then about 90% of the sample points should land in the context of foo().

Perf can sample many kinds of events:

Hardware events, such as cpu-cycles, instructions, cache-misses, and branch-misses
Software events, such as page-faults and context-switches
Tracepoint events

Knowing cpu-cycles and instructions, we can compute the instructions per cycle (IPC) and judge whether the code makes good use of the CPU. The number of cache-misses shows whether the code respects locality of reference, and the number of branch-misses shows whether mispredicted branches are causing serious pipeline hazards. In addition, perf can sample at the function level to show where the performance bottleneck is.

Background knowledge

Hardware Cache

Memory reads and writes are fast, but still far slower than the rate at which the processor executes instructions. To read instructions and data from memory, the processor has to wait, and that wait is very long measured in processor time. A cache is built from SRAM, whose read/write speed is fast enough to keep up with the processor. Frequently used data is kept in the cache so that the processor does not have to wait, which improves performance. Caches are generally small, so making full use of them is an important part of software tuning.

CPU Cache

CPU cache knowledge relevant to the programmer
Basic principles of Cache

Hardware features: pipelining, superscalar architecture, out-of-order execution

One of the most effective ways to improve performance is parallelism. Processors are designed to be as parallel as possible in hardware, using techniques such as pipelining, superscalar architecture, and out-of-order execution.

A processor handles an instruction in several steps, such as fetching the instruction, performing the operation, and finally writing the result to the bus. Inside the processor, this can be viewed as a three-stage pipeline.

With three stages, three instructions can be processed simultaneously in one clock cycle, each by a different stage of the pipeline.

Superscalar refers to a pipelined architecture that can issue multiple instructions per clock cycle. Intel's Pentium processor, for example, has two internal execution units, which allows two instructions to be executed in a single clock cycle.

In addition, different instructions need different processing steps and different numbers of clock cycles. Executing strictly in program order would leave parts of the pipeline idle, so the processor may execute instructions out of order.

All three techniques place one basic requirement on the instruction stream: adjacent instructions should not depend on each other. If an instruction needs the result of the previous one, the pipeline stalls, because the second instruction must wait for the first to complete. Good software should avoid generating this kind of code as far as possible.

Branch instructions have a large influence on software performance, especially on pipelined processors. Suppose the pipeline has three stages and the instruction that has just entered it is a branch. If the branch is taken and jumps to another instruction, the next two instructions already prefetched into the pipeline must be discarded, which costs performance. For this reason, many processors provide branch prediction: based on the execution history of the same branch, the processor fetches the most likely next instruction rather than simply the next sequential one.

Branch prediction places some requirements on software structure. For a branch sequence that repeats in a regular pattern, the prediction hardware achieves good results, but for structures such as switch/case it often cannot predict well.

The processor features described above have a significant impact on software performance, but profilers that rely on periodic clock-driven sampling cannot reveal how a program uses them. To solve this problem, processor vendors add a PMU (Performance Monitoring Unit) to their hardware.

The PMU lets software program a counter for a hardware event, after which the processor counts occurrences of that event. When the count exceeds the value configured for the counter, an interrupt is raised. For example, when the number of cache misses reaches a set value, the PMU generates the corresponding interrupt.

Capturing these interrupts shows how efficiently a program uses these hardware features.

Tracepoints

Tracepoints are hooks scattered throughout the kernel source code. Once enabled, they fire when the specific code path runs, and they can be used by various trace and debug tools. Perf is one of the users of this feature.

If you want to know how the kernel memory management module behaves while an application runs, you can use the tracepoints placed in the slab allocator. Whenever the kernel reaches one of these tracepoints, perf is notified.

Perf records the events generated by tracepoints and produces reports, so that the person tuning the system can learn what the kernel did at runtime and diagnose performance symptoms more accurately.

Use Perf

As mentioned earlier, the events that perf can sample fall broadly into three categories. The tool itself consists of more than 20 subcommands, and the documentation for each is available through help.

perf help top/mem/bench..

This article walks through the use of perf with example programs. The first program, perf_test.c, is as follows:

int add(int i) {
    i++;
    return i;
}
int div(int i) {
    i--;
    return i;
}
int main(void) {
    long int i = 0;
    while(1) {
        i++;
        add(i);
        div(i);
    }
    return 0;
}

The program above is an infinite loop, so it consumes a great deal of CPU while it runs. The following sections use it to summarize everyday perf usage.

perf list

perf list is probably the first command most people run after installing perf. It lists the events that perf can sample. As described in the introduction, there are three main kinds of events:

alignment-faults                              [Software event]
bpf-output                                    [Software event]
context-switches OR cs                        [Software event]
cpu-clock                                     [Software event]
cpu-migrations OR migrations                  [Software event]
...
power/energy-ram/                             [Kernel PMU event]
rNNN                                          [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier             [Raw hardware event descriptor]
 (see 'man perf-list' on how to encode it)
mem:<addr>[/len][:access]                     [Hardware breakpoint]
...
dma_fence:dma_fence_enable_signal             [Tracepoint event]
dma_fence:dma_fence_init                      [Tracepoint event]
...

perf top

perf top is similar to the standard Linux top command. It shows, in real time, the hot spots of each function for a given event, which helps find the culprit that is slowing the system down. Even without a specific program to observe, you can run perf top directly to see which program is eating up system resources and causing an abnormal slowdown.

Hot spots are highlighted in red. The rightmost column is the symbol of each function, and the first column on the left is the share of events triggered by that symbol within the whole “monitoring domain”, which is called the heat of the symbol. The monitoring domain is the set of all symbols monitored by perf; by default it includes the functions of all programs, the kernel, and kernel modules. The second column from the left shows the shared object that contains the symbol. A [.] next to the symbol indicates user mode, and [k] indicates kernel mode.

Pressing h opens the help screen, which lists all the functions of perf top and their keys. Let’s try Annotate, a feature that allows further analysis of a single symbol. Moving the cursor to a symbol's line and pressing Enter twice also opens Annotate.

This is the Annotate of the div function in the program above:

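To watch a different event (the default is cycles) and control how often samples are taken, use -e and -c. The -c option sets a sample period, here one sample every 4000 events; a sampling frequency is set with -F instead (the default is 4000 samples per second):

perf top -e cache-misses -c 4000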

perf stat

This command collects event counts for a known program that needs to be optimized. For example, running perf stat on perf_test:

[root@k8s-master ~]# perf stat ./perf_test
^C./perf_test: Interrupt

 Performance counter stats for './perf_test':

         34,794.52 msec task-clock                #    0.998 CPUs utilized
         ...
               506      cpu-migrations            #    0.015 K/sec
               113      page-faults               #    0.003 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      34.859414299 seconds time elapsed

      34.793114000 seconds user
       0.001998000 seconds sys

Without the -e option to select specific events, perf stat collects a default set of events. Depending on the perf version, these include:

task-clock: the CPU time used by the program; the accompanying "CPUs utilized" figure shows CPU utilization. A high value indicates that most of the program's time is spent on CPU computation rather than I/O.
context-switches: the number of context switches that occur while the program runs. Frequent switching should be avoided.
cpu-migrations: the number of times the program was migrated while running, that is, the scheduler moved it from one CPU to another.
cycles: processor clock cycles. A single machine instruction may require multiple cycles.
instructions: the number of machine instructions executed.
IPC: the ratio instructions/cycles. The larger the value the better; it indicates that the program makes full use of the processor.
cache-references: the number of cache accesses.
cache-misses: the number of cache accesses that missed. A high value relative to cache-references indicates that the program uses the cache poorly.

perf record / perf report

After using top and stat, you probably have a rough idea of where the problem lies. Further analysis requires finer-grained information. For example, you may have concluded that the target program is compute-bound, perhaps because some of the code is not written efficiently. But which lines in a long source file need to change? For this, perf record collects statistics at the level of individual functions, and perf report displays the results.

Tuning should focus on the hot code that accounts for a high percentage of the run time. If a piece of code takes only 0.1% of the total running time, then even optimizing it down to a single machine instruction improves overall performance by at most about 0.1%. As the saying goes, put the best steel on the edge of the blade.

A problem that may occur with perf record is that the sampling frequency is too low, so some functions do not appear because they were never sampled. You can adjust the sampling frequency with the -F option; the maximum frequency is given by cat /proc/sys/kernel/perf_event_max_sample_rate.

Perf tuning examples

The infinite-loop program above served to introduce the common perf commands for finding hot spots on a production server. Here is another example that uses perf to find a problem in a program and fix it.

Program 2, cpu_cache_miss.c:

static char array[10000][10000];
int main (void){
  int i, j;
  for (i = 0; i < 10000; i++)
    for (j = 0; j < 10000; j++)
       array[j][i]++;
  return 0;
}

# perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./cpu_cache_miss

 Performance counter stats for './cpu_cache_miss' (5 runs):

               ...      cache-misses
               ...      cache-references                                   ( +-  0.09% )
     2,010,379,094      instructions              #    0.83  per cycle     ( +-  0.03% )
     2,418,641,076      cycles                                             ( +-  0.30% )

       1.180222417 seconds time elapsed                                    ( +-  1.53% )

Now change array[j][i]++ to array[i][j]++ so that the inner loop walks memory sequentially, build the result as cpu_cache_miss_new, and measure again:

# perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./cpu_cache_miss_new

 Performance counter stats for './cpu_cache_miss_new' (5 runs):

               ...      cache-misses
               ...      cache-references
               ...      instructions              #    1.36  per cycle     ( +-  0.03% )
               ...      cycles                                             ( +-  1.23% )

               ... seconds time elapsed                                    ( +-  3.21% )

The IPC rose visibly from 0.83 to 1.36.

Flame graph

The steps for generating a flame graph are covered in an earlier summary article: CPU flame graph

References and further reading

PERF tutorial: Finding execution hot spots
CPU cache knowledge relevant to the programmer
Basic principles of Cache
Linux performance analysis tool: Perf
Perf — System performance tuning tool for Linux, Part 1
CPU flame graph