When we talk about performance optimization, most people think of optimization at the system level: in a web service, for example, adding Redis or another cache to speed up site access. This is understandable: on the one hand, the compiler already does a lot of optimization work for us; on the other, system-level changes tend to feel more visible and impactful. In fact, performance optimization at the level of program code can be just as worthwhile. Let the facts speak for themselves. The two programs below do exactly the same thing: they increment every element of a two-dimensional array by one. Take a look: do you think there is a performance difference between them? In an actual test, they differed by nearly a factor of four.

// Both loops assume: int array[1024][1024]; int i, j;

// This is the first program
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        array[j][i]++;
// The only difference is here ^^^ (the index order)

// This is the second program
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        array[i][j]++;
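For readers who want to reproduce the measurement, here is a minimal timing harness (a sketch; the absolute numbers will depend on your hardware, compiler, and flags). Swap the marked line to switch between the two access orders:

#include <stdio.h>
#include <time.h>

static int array[1024][1024];

int main(void)
{
    clock_t start = clock();
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            array[j][i]++;   /* change to array[i][j] for the second program */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    /* printing array[0][0] keeps the compiler from eliminating the loop */
    printf("elapsed: %f s (array[0][0] = %d)\n", secs, array[0][0]);
    return 0;
}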

Analyzing the cause of the performance difference

Think about it: why is there such a big difference in performance? The only difference between the two pieces of code is the order in which the array elements are accessed: the first program walks the array column by column, the second row by row. Figure 1 may make this a little clearer. Combined with the layout rules for 2D arrays in memory in C (row-major order, which you can verify by printing addresses in the code above), we can see that the second program accesses a contiguous address space, while the first jumps around in the address space.


Take a 4-byte int array: the second program accesses addresses X, X+4, X+8, and so on, while the first accesses X, X+4096, X+8192, and so on, jumping 4 KB of address space at a time. With that in mind, why should this cause a performance difference? As we know, to speed up memory access, the CPU places caches between itself and main memory. Modern CPUs usually have three levels of cache, L1, L2, and L3; L1 and L2 are private to each CPU core, while L3 is shared by all cores of the same CPU. The basic architecture is shown in Figure 2.

Because the cache is distributed across cores, consistency has to be maintained between CPUs. To make this manageable, the cache is divided into smaller units of management called cache lines (typically 64 bytes on mainstream modern CPUs).
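If you are curious about the cache line size on your own machine, glibc on Linux exposes it through sysconf (a Linux-specific sketch; other platforms query it differently):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* L1 data cache line size in bytes; may return -1 or 0
     * if the value is not available on this system. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 cache line size: %ld bytes\n", line);
    return 0;
}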

Caching pays off because programs typically exhibit two kinds of locality. Spatial locality: when a piece of data has just been accessed, its neighboring data has a high probability of being accessed soon. Temporal locality: data that has just been accessed has a high probability of being accessed again soon.

With these principles in mind, we can see what happens in the code above: because the first program jumps 4 KB between consecutive accesses, it has no spatial locality to exploit, so nearly every access is a cache miss. In other words, the first program rarely finds its data in the cache and must fetch it from main memory instead, and memory access is far slower than cache access. Hence the nearly four-fold performance difference from the start of the article.

Additional considerations regarding program performance

A very small change to a program can have a very large impact on its performance. In day-to-day development, therefore, we should stay alert to code patterns that cause performance problems. The following sections give several performance-related examples for reference.

Program structure

The impact of poor program structure on performance can sometimes be disastrous. The performance difference between the following two functions can be enormous for long strings. The lower1 function evaluates the string length on every loop iteration, but this recomputation is unnecessary. The lower2 function computes the length once before the loop starts and then tests the loop condition against that saved value. The root of the problem is the strlen function, which scans the whole string to find its length; calling it on every iteration makes lower1 quadratic in the string length.

#include <string.h>

// lower1 calls strlen on every iteration; since strlen scans the
// whole string, the function is O(n^2) overall.
void lower1(char *s)
{
    int i;

    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

// The following implementation performs better: the length is
// computed once, so the function is O(n).
void lower2(char *s)
{
    int i;
    int len = strlen(s);

    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Procedure (function) calls

As we know, a procedure call involves pushing and popping stack frames. These are memory operations and the process is relatively involved; in other words, calling a function has real overhead, and unnecessary calls should be minimized. The good news is that modern compilers optimize function calls aggressively, and simple calls can often be inlined: after such optimization there is no function call left at the machine-language (assembly) level. Let's look at a concrete example of a simple function call in C, where fun_1 calls fun_2, which in turn calls printf. Here fun_2 does very little work: it just adds its two arguments and passes the result to printf.
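The original listing is not reproduced in this article, so the following is a sketch reconstructed from the description above (the names fun_1 and fun_2 come from the text; the exact signatures are assumptions):

#include <stdio.h>

/* fun_2 does very little work: add the two arguments
 * and hand the result to printf. */
static void fun_2(int a, int b)
{
    printf("%d\n", a + b);
}

void fun_1(int x, int y)
{
    fun_2(x, y);   /* a modern compiler will typically inline this call */
}

Compiling with gcc -O2 -S and inspecting the generated assembly is an easy way to see the inlining for yourself.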

Indeed, the assembly code for fun_1 (compiled with optimization enabled) shows that it does not call fun_2 at all: it calls printf directly.

Operator differences

Different operators have very different costs: multiplication takes roughly two to three times as long as addition, while division can take more than ten times as long. Reducing the use of division in hot code paths can therefore bring a noticeable improvement. In the JDK's HashMap, for example, the bucket index is derived from the hash with bitwise operations rather than a modulo operation (the table size is kept a power of two, so (n - 1) & hash does the job). Because a modulo is itself a division, it can be an order of magnitude slower than the bitwise equivalent. The hash method below spreads the high bits of the hash code into the low bits so that the bitwise masking works well:

// From the JDK's HashMap: XOR the high 16 bits of the hash code
// into the low 16 bits so that power-of-two masking uses them too.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

Refer to the JDK source code for the more detailed processing logic; this article only scratches the surface.
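As a quick sanity check of the trick itself (a sketch in C, independent of the JDK), the modulo of a power-of-two divisor is exactly a bitwise AND with the divisor minus one:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    unsigned int n = 16;   /* table size: must be a power of two */
    for (unsigned int x = 0; x < 1000; x++)
        assert(x % n == (x & (n - 1)));
    printf("x %% n == x & (n - 1) for all tested x\n");
    return 0;
}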

Reference and copy

High-level languages with classes copy objects when they are passed by value, and copying an object is itself a costly operation. Of course, these languages also provide references, a mechanism for passing an object's address instead, which avoids the copy (this is the difference between pass-by-value and pass-by-reference). There are far too many performance pitfalls in program development to list in one article. The key is to master the underlying principles of the technology you use: everything at the higher levels can be explained from those principles; as the saying goes, however much things change, the essence remains the same.
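To make the cost concrete, here is a minimal C sketch (C has no classes, but a large struct shows the same value-versus-reference trade-off):

#include <stdio.h>

typedef struct {
    int data[1024];   /* about 4 KB: a deliberately large object */
} Big;

/* Pass by value: the entire 4 KB struct is copied on every call. */
int sum_by_value(Big b)
{
    int s = 0;
    for (int i = 0; i < 1024; i++)
        s += b.data[i];
    return s;
}

/* Pass by reference (a pointer): only an address is copied. */
int sum_by_ref(const Big *b)
{
    int s = 0;
    for (int i = 0; i < 1024; i++)
        s += b->data[i];
    return s;
}

int main(void)
{
    static Big big;   /* zero-initialized */
    printf("%d %d\n", sum_by_value(big), sum_by_ref(&big));
    return 0;
}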