As services evolve, we pay more and more attention to performance. Performance optimization is hard, though: it usually takes multiple rounds of tuning and testing, which is time-consuming and laborious, and sometimes yields little. With a good optimization methodology and tool-assisted analysis, however, we can quickly find the performance bottleneck and optimize it in a targeted way, getting twice the result with half the effort.

The challenge of performance optimization is identifying the key bottlenecks, which is hard to do for C++ programs without tools such as perf or BCC. Bottlenecks can come from many places: CPU, memory, disk, architecture, and so on. This article focuses solely on CPU tuning, i.e. how to squeeze the most throughput out of the CPU. (A CPU's peak performance is fixed the moment it leaves the factory; all we can do is keep it doing useful work, so optimizing CPU utilization is really about finding where our own code is not good enough.)

1. Example

First, the code:

#include <stdlib.h>

#define CACHE_LINE __attribute__((aligned(64)))

struct S1
{
  int r1;
  int r2;
  int r3;
  S1() : r1(1), r2(2), r3(3) {}
} CACHE_LINE;

void add(const S1 smember[], int members, long &total) {
    int idx = members;
    do {
       --idx;  // decrement first so the access stays within bounds
       total += smember[idx].r1;
       total += smember[idx].r2;
       total += smember[idx].r3;
    } while (idx);
}

int main(int argc, char *argv[]) {
  const int SIZE = 204800;
  S1 *smember = (S1 *) malloc(sizeof(S1) * SIZE);
  long total = 0L;
  int loop = 10000;
  while (--loop) {  // Repeat many times to make the tests easy to compare
      add(smember, SIZE, total);
  }
  return 0;
}


Note: the logic is just a simple accumulation, purely for demonstration.

Compile + run:

g++ cache_line.cpp -o cache_line; taskset -c 1 ./cache_line


Running it, cache_line occupies CPU core 1 at 99.7% utilization, so the core is almost fully loaded. But how do we know whether the CPU is actually doing useful work while running cache_line?

Some readers may suggest using perf to find the hot functions. perf can indeed tell us that a function (or even a particular assembly instruction) is hot, but it cannot tell us which CPU operations are causing the bottleneck, such as instruction fetch, decode, and so on.
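For reference, a typical perf workflow for finding hot functions looks like this (the report output depends on your kernel and perf version):

perf record -g ./cache_line   # sample the program with call graphs
perf report                   # browse the hottest functions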

If you are still unsure which CPU operations are causing your service's performance bottleneck, this article should give you some clarity. It introduces the top-down analysis method (TMAM) for quickly and accurately locating CPU performance bottlenecks, along with related optimization suggestions to help improve service performance. To follow the method, some background knowledge is needed first.

2. Introduction to the CPU pipeline

(Image credit: Intel official document)

Modern computers generally follow the von Neumann model, with five core components: arithmetic unit, memory, control unit, input, and output. The method introduced in this article concerns the CPU. CPU execution involves the basic stages of instruction fetch, decode, execute, and write-back. The earliest CPUs ran one instruction through all of these steps before starting the next one, which obviously left each hardware unit idle most of the time. To improve performance, Intel introduced techniques such as multi-stage pipelining and out-of-order execution. A classic Intel CPU has a 5-stage pipeline, i.e. the same cycle can have five different operations in flight; some newer CPUs have as many as 15 pipeline stages. The figure below shows the state of a 5-stage pipeline.

This is also why the CPU decodes instructions: by breaking an instruction into micro-operations (uOps) that target different resources, e.g. ADD eax, [mem1] can be decoded into two uOps, one that loads data from memory into a temporary register and one that performs the addition. The execution unit can then run another instruction's uOps while the load is in flight, and multiple resource units can work in parallel.

(Image credit: Intel official document)

There are many kinds of resources inside the CPU, such as the TLB, ALUs, the L1 cache, registers, ports, the BTB, and so on. Each resource runs at a different speed, some fast and some slow, and they depend on each other, so while a program runs, different CPU resources hit different constraints. This article uses TMAM to analyze, more objectively, which internal CPU resources become the bottleneck while an application executes.

3. Top-down Analysis (TMAM)

TMAM stands for Top-down Microarchitecture Analysis Methodology. It is the methodology Intel's CPU engineers use to optimize CPU performance. The idea behind TMAM is to classify CPU resources and micro-operations into categories, identify the likely bottleneck from a high-level view first, and then drill down to the exact bottleneck. This also matches how people think: go from the macro picture to the details, because paying attention to details too early wastes time. The advantages of this methodology are:

  1. Programs can be optimized based on CPU characteristics even without deep hardware knowledge
  2. It systematically eliminates guesswork about performance bottlenecks: is the branch prediction success rate low? Is the CPU cache hit ratio low? Is memory the bottleneck?
  3. It quickly identifies bottlenecks on multi-core, out-of-order CPUs

TMAM measures its various metrics in two ways: one is CPU clock cycles (cycle [6]) and the other is CPU pipeline slots [4]. The method assumes that each CPU core's pipeline has 4 slots per cycle, i.e. the pipeline width is 4. The figure below shows the different states of these four slots in each clock cycle; note that only Clockticks 4 uses 100% of its slots, while the other cycles contain stalled slots.

(Image credit: Intel official document)
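As a simple illustrative calculation (the numbers are made up; the 4-wide assumption comes from TMAM itself): if a program runs for 1,000 cycles on one core, it is offered 1,000 x 4 = 4,000 pipeline slots. If 2,600 of those slots retire uOps, then Retiring = 2,600 / 4,000 = 65%, and the remaining 35% of the slots are attributed to the other three categories described below.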

3.1 Basic Classification

(Image from Intel Documentation)

TMAM categorizes the CPU's resources and identifies bottlenecks in how those resources are used through different categories. It first identifies the broad bottleneck area, then analyzes in depth to find and remove the specific bottlenecks one by one. At the top level, TMAM divides pipeline slots into four categories, whose meanings are introduced next.

3.1.1 Retiring

Retiring refers to pipeline slots that run useful uOps [3], i.e. uOps that eventually retire (note that the result of a uOp is either discarded or written back to a register). It can be used to evaluate how much genuinely useful work the program gets out of the CPU. Ideally, all pipeline slots would be Retiring: 100% Retiring means the maximum number of uOps retires per cycle, and a higher Retiring share increases the instructions per cycle (IPC). Note that a high Retiring share does not mean there is no room for improvement; for example, the Microcode Assists sub-category under Retiring actually hurts performance and needs to be avoided.

3.1.2 Bad Speculation

Bad Speculation refers to pipeline slots wasted because of mis-speculation: it includes uOps that will never retire because they were issued down a mispredicted path, as well as slots stalled while the pipeline recovers from a previous misprediction. Work wasted on the wrong branch falls into this category; branchy constructs such as if, switch, while, for and the ternary operator can all contribute to it.

3.1.3 Front-End Bound

Responsibilities of the front-end:

  1. Instruction fetch

  2. Decode instructions into microinstructions

  3. Distribute the micro-instructions to the back-end, at most four per cycle

Front-End Bound means the fraction of slots where the front-end fails to deliver enough uOps to the back-end even though the back-end could accept them. The front-end is the first part of the processor, and its core responsibility is to supply the instructions the back-end needs: the predictor predicts the next address to fetch, the corresponding cache line is fetched from the memory subsystem, it is converted into the corresponding instructions, and those are finally decoded into uOps. Front-End Bound means some slots stay idle even though the back-end is not stalled; for example, stalls caused by instruction cache misses are classified as Front-End Bound.

3.1.4 Back-End Bound

Responsibilities of the back-end:

  1. Receive the micro-instructions submitted by the front-end

  2. Reorder the submitted micro-instructions when necessary (out-of-order execution)

  3. Fetch the corresponding operands from memory

  4. Execute the micro-instructions and commit the results to memory

Back-End Bound represents the fraction of pipeline slots where no uOps are delivered to the back-end because the back-end lacks some resource needed to accept them.

The core of the back-end is the scheduler, which dispatches ready uOps out of order to the corresponding execution units; once execution completes, the uOps retire their results in program order. Stalls caused by data cache misses, or stalls while the divider unit is overloaded, fall into this category. It can be further broken down into two sub-categories: Memory Bound and Core Bound.

To sum up:

Front-End Bound = bound in instruction fetch -> decode (instruction cache, iTLB)

Back-End Bound = bound in execute -> commit (e.g. execution stalls, load latency)

Bad Speculation = the pipeline speculates down the wrong path (e.g. branch mispredict, memory ordering nuke)

Retiring = the pipeline is retiring uOps

A microinstruction state can be categorized according to the following decision tree:

(Image credit: Intel official document)

After the program has run for a while, each leaf node in the figure above accounts for some share of the pipeline slots. What share is reasonable for each category, i.e. when is performance good enough that no further optimization is needed? Intel provides reference values measured in its labs for different types of applications:

(Image: Intel User Manual)

In the table above, Retiring should be high and the other categories should be low. If one category is disproportionately high, that is the one to focus on when optimizing.

There are two well-known performance analysis tools based on this methodology: Intel VTune (commercial and expensive) and the open-source community's pm-tools.
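For illustration only (the exact command-line form is an assumption to check against the tool's own documentation), a level-1 top-down breakdown with the toplev script from pm-tools (pmu-tools) looks roughly like:

toplev.py -l1 ./cache_line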

With that in mind, let’s take a look at the categories of the initial example:

Judging against the reference values above, as long as Retiring is not 100% there is still room for improvement. The bottleneck in the figure above is clearly in the back-end.

3.3 How to optimize for different categories?

When using VTune or pm-tools, focus on whichever of the three categories other than Retiring has a high share, and optimize the one that stands out. In addition, pay attention to the MUX Reliability (multiplexing reliability) metric when analyzing a project with these tools: the closer it is to 1, the more reliable the current results; below 0.7, the results are not trustworthy. Now let's analyze and optimize the three categories one by one.

3.3.1 Front-End Bound

(Image credit: Intel official document)

As the figure above shows, the front-end is responsible for fetching instructions (possibly ahead of time, based on prediction), decoding them, and distributing them to the back-end pipeline. Its performance is limited by two aspects: latency and bandwidth. Latency is usually caused by instruction fetch (L1 ICache or iTLB misses, or interpreted languages such as Python or Java) and by decoding (certain special instructions or queuing problems). When the front-end is the limit, pipeline utilization drops; in the figure below, the non-green parts are unused slots, and in ClockTicks 1 slot utilization is only 50%. Bandwidth is divided into three sub-categories, MITE, DSB and LSD; interested readers can look these up separately.

(Image credit: Intel official document)

3.3.1.1 Optimization Suggestions for front-end:

  • Minimize the code footprint [7]:

C/C++ code can use compiler options to help reduce the footprint, such as the gcc -O* optimization levels or -fomit-frame-pointer.
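For instance (an illustrative command line, not from the original article; -Os optimizes for size, which usually also shrinks the code footprint):

g++ -Os -fomit-frame-pointer cache_line.cpp -o cache_line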

  • Take full advantage of CPU hardware features: Macro-fusion

The macro-fusion feature can combine two adjacent instructions (typically a compare and a conditional jump) into a single micro-op, which improves front-end throughput. For example, consider a typical counting loop:
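A minimal sketch (the function is illustrative, not from the original article): with an unsigned loop counter the loop condition compiles to an unsigned compare plus conditional jump, a pair that Intel cores can macro-fuse into a single micro-op; signed comparisons historically fused in fewer cases.

long sum(const int *data, unsigned int size) {
    long total = 0;
    // Unsigned loop condition: compiles to a cmp + jb style pair, eligible for macro-fusion
    for (unsigned int i = 0; i < size; ++i) {
        total += data[i];
    }
    return total;
}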

Therefore, it is recommended to use unsigned data types in loop conditions so that macro-fusion can kick in and improve front-end throughput.

  • Optimize the code layout:

① Make full use of the compiler's PGO feature: -fprofile-generate / -fprofile-use

② Adjust the layout of hot and cold code in memory with __attribute__((hot)) / __attribute__((cold))

Laying hot code out close together helps the CPU prefetch instructions ahead of the decode phase; a minimal sketch follows below.

For other options, see the GCC optimization options and GCC common attribute options documentation.
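A minimal sketch of points ① and ② (the commands and function names are illustrative, not from the original article):

// Step 1: build with instrumentation, run a representative workload, then rebuild
// with the collected profile so the compiler can group hot code together:
//   g++ -O2 -fprofile-generate app.cpp -o app && ./app
//   g++ -O2 -fprofile-use app.cpp -o app

// Step 2: hint hot/cold paths by hand where no profile is available.
__attribute__((hot))  void handle_request()     { /* hot path, kept compact */ }
__attribute__((cold)) void report_fatal_error() { /* rarely executed */ }

int main() {
    handle_request();
    return 0;
}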

  • Branch prediction optimization

① Eliminate branches to reduce the chance of misprediction: for example, small loops (fewer than 64 iterations or so) can be unrolled (with GCC, use -funroll-loops).

② Prefer if over the ternary operator ?: ; avoid writing a = b > 0 ? x : y, because the ternary form does not branch-predict well.

③ Minimize combined conditions and use single conditions where possible: with code like if (a || b) { ... } else { ... }, the CPU cannot predict the branch well.

④ For a switch with multiple cases, put the most likely case first

⑤ We can arrange the code layout to match the CPU's static branch prediction algorithm, which assumes the following:

Forward branches: make the block of code that immediately follows the conditional branch the one most likely to be executed.

bool is_expect = true;
if (is_expect) {
   // Code with a high probability of execution is placed here as much as possible
} else {
   // Code with a low probability of execution is placed here as much as possible
}

Backward branches: make the branch with a backward target (for example the loop-back branch of a do-while) the one that is usually taken, i.e. falling out of the loop should be the unlikely path:

do {
   // The loop body is the common path and is expected to iterate repeatedly
} while (conditions);


3.3.2 Back-End Bound

Optimizing this category mainly involves optimizing CPU cache usage. The CPU cache [13] exists to bridge the speed gap between the very fast CPU and much slower DRAM. The CPU has multiple levels of storage (registers, L1, L2, L3), and the TLB was introduced to speed up the translation between virtual and physical memory addresses.

If every instruction or piece of data had to be loaded from DRAM with no cache, the latency would be unacceptable.

(Image credit: Intel official document)

3.3.2.1 Optimization Suggestions:

  • Adjust the algorithm to reduce the amount of data it touches, reduce data dependencies between adjacent instructions, and increase instruction-level parallelism
  • Size and lay out data structures with the cache line in mind
  • Avoid false sharing in the L2 and L3 caches

(1) Proper use of cache line alignment

The CPU cache is precious, so we should try to raise its utilization; mistakes can leave cache utilization low. Let's first look at a version of the program without cache-line alignment:

#include <stdlib.h>
 #define CACHE_LINE
 
 struct S1
 {
   int r1;
   int r2;
   int r3;
   S1() :r1 (1), r2 (2), r3 (3){}
 } CACHE_LINE;
 
 int main (int argc, char *argv[])
{
  // Same as before
 }


Here is the test effect:

Now with cache-line alignment:

#include <stdlib.h>

#define CACHE_LINE __attribute__((aligned(64)))

struct S1 {
  int r1;
  int r2;
  int r3;
  S1() : r1(1), r2(2), r3(3) {}
} CACHE_LINE;

int main(int argc, char *argv[]) {
  // Same as before
}



Test results:

Comparing the two runs, the unaligned version retires more uOps because its cache utilization is higher: with 64-byte alignment, each S1 occupies a whole cache line but only uses 3*4/64 = 18% of it. So when is cache-line alignment appropriate? Some guiding principles:

  • Multiple threads write to the same object or struct concurrently (i.e. a false-sharing scenario)
  • The object or struct itself is large
  • Place the most frequently accessed fields at the head of the object or struct whenever possible (see the sketch below)
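A minimal sketch of the last point (the Connection struct and its fields are hypothetical): keep the fields touched on the hot path together at the head of the object so they fall into its first cache line.

#define CACHE_LINE __attribute__((aligned(64)))

struct Connection {
    // Hot fields, read/written on every request: kept together at the head
    // so they share the object's first cache line.
    int      fd;
    unsigned state;
    long     bytes_in;
    long     bytes_out;

    // Cold fields, touched only on setup/teardown.
    char     peer_name[128];
    long     created_at;
} CACHE_LINE;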

(2) False sharing

This section describes how cache-line alignment solves false sharing on SMP systems. When multiple CPUs modify data that sits in the same cache line at the same time, the cache line keeps being invalidated in the other CPUs' caches; this is a cache-invalidation problem. Why does false sharing only appear in multi-threaded scenarios and not in multi-process ones? Because of how Linux virtual memory works, each process has its own isolated virtual address space, so a cache line loaded while running process 1 contains only process 1's data; it cannot be partly process 1's and partly process 2's.

(In the figure above, note that the L2 cache may be organized differently on different CPU types; on some, such as Skylake, it is exclusive to each core.)

False sharing has a significant impact on performance because it forces operations that could run in parallel to be effectively serialized. That is unacceptable for a high-performance service, so we align the contended data to cache-line boundaries to eliminate false sharing, which is essentially trading space for time. For example:


#define CACHE_LINE __attribute__((aligned(64)))
 
  struct S1 {
    int r1;
    int r2;
    int r3;
    S1() :r1(1),r2(2),r3(3){}
  } CACHE_LINE;


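To make the effect concrete, here is a minimal two-thread sketch (not from the original article; the struct and function names are made up, and it assumes a 64-byte cache line and compiling with g++ -O2 -pthread). With CACHE_LINE on each field, the two counters land on different cache lines; remove CACHE_LINE to observe the slower, false-sharing version.

#include <cstdint>
#include <iostream>
#include <thread>

#define CACHE_LINE __attribute__((aligned(64)))

struct Counters {
    volatile uint64_t c1 CACHE_LINE;  // each counter gets its own 64-byte cache line
    volatile uint64_t c2 CACHE_LINE;
};

static void spin(volatile uint64_t *c) {
    for (int i = 0; i < 100000000; ++i) {
        ++*c;  // repeated writes that keep the counter's cache line in the Modified state
    }
}

int main() {
    Counters counters{};
    std::thread t1(spin, &counters.c1);
    std::thread t2(spin, &counters.c2);
    t1.join();
    t2.join();
    std::cout << counters.c1 + counters.c2 << std::endl;  // keep the work observable
    return 0;
}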

In short, whether to use cache-line alignment has to be judged against your actual code, not copied blindly from others.

3.3.3 Bad Speculation (branch misprediction)

(Image credit: Intel official document)

Bad Speculation arises when the back-end discards micro-instructions because a speculation turned out to be wrong, which means the front-end's work fetching and decoding those instructions was wasted. We should therefore avoid branches, or improve branch prediction accuracy, to improve service performance. Although the CPU keeps a history of predictions in the BTB, that cache is very small and can hold very little data.

Branch prediction is used in the front-end to speed up the process of getting the CPU its next instructions, instead of waiting until an instruction has to be read from main memory. Using branch prediction, the front-end can load the predicted instructions into the L2 cache ahead of time, which greatly reduces fetch latency, so we should try not to defeat the predictor.

  • Use GCC’s built-in branch prediction feature whenever possible where if is used (see the front-end section for other cases)
#define likely(x)   __builtin_expect(!!(x), 1)  // GCC built-in that helps the compiler lay out branches
#define unlikely(x) __builtin_expect(!!(x), 0)

if (likely(condition)) {
  // This code has a high probability of being executed
}
if (unlikely(condition)) {
  // This code has a low probability of being executed
}
  • Avoid indirect jumps or calls

In C++, for example, a switch, a call through a function pointer, or a virtual function call may produce an indirect jump or call with multiple possible targets in the generated assembly, which hurts branch prediction. The BTB can help with these, but BTB resources are very limited (the Intel P3 had a 512-entry BTB; figures for some newer CPUs are not publicly available).
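A minimal sketch of the idea (illustrative, not from the original article): in the first function the call target comes from a function pointer, so an indirect call remains for the CPU to predict; in the second, the callable is a template parameter, so the compiler can inline it and no indirect branch is left. In real code the pointer usually comes from data (e.g. a callback table), which is what defeats prediction; here it is hard-wired only to keep the sketch short.

#include <cstdio>

using Op = long (*)(long, long);
static long add_op(long a, long b) { return a + b; }

// Indirect call on every iteration (through the function pointer).
static long run_indirect(Op op, long n) {
    long total = 0;
    for (long i = 0; i < n; ++i) total = op(total, i);
    return total;
}

// Direct, inlinable call: the callable is known at compile time.
template <typename F>
static long run_direct(F op, long n) {
    long total = 0;
    for (long i = 0; i < n; ++i) total = op(total, i);
    return total;
}

int main() {
    printf("%ld\n", run_indirect(add_op, 1000000));
    printf("%ld\n", run_direct([](long a, long b) { return a + b; }, 1000000));
    return 0;
}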

4. Final words

Finally, let's return to the initial example. After applying the optimizations described above, the results are as follows:

g++ cache_line.cpp -o cache_line -fomit-frame-pointer; taskset -c 1 ./cache_line

The runtime dropped from 15 s to 9.8 s, a performance improvement of roughly 34%; Retiring rose from 66.9% to 78.2%, and Back-End Bound fell from 31.4% to 21.1%.

5. CPU knowledge quick reference

[1] CPI (Cycles Per Instruction): the average number of clock cycles per instruction.

[2] IPC (Instructions Per Cycle): the number of instructions executed per CPU clock cycle.

[3] uOps (micro-ops): modern processors can decode at least four instructions per clock cycle; the decode process produces many small operations called micro-ops (uOps).

[4] Pipeline slot: the hardware resources needed to process one uOp. TMAM assumes each CPU core has several pipeline slots available per clock cycle; that number is called the pipeline width.

[5] MIPS (Million Instructions Per Second): MIPS = 1 / (CPI x clock cycle) = clock frequency / CPI.

[6] cycle (clock cycle): cycle = 1 / clock frequency.

[7] Memory footprint: the memory a program touches while running, including code segments, data segments, heap, call stacks, and hidden storage such as symbol tables, debugging data structures, open files, shared libraries mapped into the process space, and so on.

[8] MITE: Micro-instruction Translation Engine.

[9] DSB: Decode Stream Buffer, i.e. the decoded uOp cache.

[10] LSD: Loop Stream Detector.

[11] Analysis of various CPU dimensions

[12] An introduction to TMAM Theory

[13] CPU Cache

[14] micro architecture

Author: Vivo – Li Qingxing