Written up front

As a front-end developer, I always had a hard time appreciating the importance of learning the basics of computing, because I couldn't imagine how that knowledge would apply to front-end work.

However, the chapter of CSAPP on optimizing program performance completely changed my mind. In it, the author describes how to squeeze more performance out of a C program that is already correct and well written.

For code with exactly the same functionality, the optimized version can run dozens of times faster than the unoptimized one. What puzzled me was whether these optimizations would have the same effect in JavaScript, which is so different from C at the language level. In practice, the answer is yes.

So the fundamentals of computing are not castles in the air; they actually help you write better programs. Why do big tech companies like to probe the fundamentals in interviews? Because the fundamentals really do reflect a person's level as a programmer in some respects, and for computer science majors they are exactly where you can open up a gap with everyone else.

The main text

A primer

Take C as an example. The following code converts the uppercase letters in a string to lowercase:

void lower(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    }
}

Can you see what’s wrong with this code?

If you're sensitive enough, you'll notice that the strlen() call in the for loop is the problem, as the subheading hints. To be clear, this code is functionally correct; it is its performance that is questionable.

For those unfamiliar with strlen, here's a hint: strlen walks the string to compute its length, so every iteration of the for loop traverses the string s once, which makes this code O(n^2) in time. But all we are doing is changing values in s; we never add or remove characters, so there is no need to recompute the length every time. You might wonder whether the compiler can recognize this pattern and optimize it away. The answer is no, essentially because the compiler cannot prove that the loop body has no side effects that would change the loop condition. Memory aliasing is another reason, which I won't go into here; interested readers can look it up.

Of course, the fix for this code is simple: compute strlen once before the loop, store it in a temporary variable, and use that variable in the loop condition. It's worth noting that with this tiny change, the time complexity drops from O(n^2) to O(n), which is a huge improvement.

Experienced programmers might think this problem is unique to C, because most high-level languages maintain a string's length as a property of the string itself, so reading it does not require a traversal. This example is really just a reminder that small changes can lead to huge performance degradations, or improvements.
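
To see the same trap in JavaScript terms, here is a minimal sketch, assuming a hypothetical helper countLength that scans the array to count its elements (the role strlen plays for C strings) and an array with no undefined entries:

function countLength(arr) {
    let n = 0;
    while (arr[n] !== undefined) n++; // O(n) scan, like strlen
    return n;
}

function doubleAllSlow(arr) {
    for (let i = 0; i < countLength(arr); i++) { // length recomputed every iteration: O(n^2) overall
        arr[i] *= 2;
    }
}

function doubleAllFast(arr) {
    const len = countLength(arr); // hoisted out of the loop: computed once, O(n) overall
    for (let i = 0; i < len; i++) {
        arr[i] *= 2;
    }
}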


Next, let's use a combine function that sums an array to see the performance difference between functionally identical code before and after optimization.

The combine1 function calls getLengthFromArray in the for loop condition to obtain the array length; unlike strlen, this is O(1). As for why we bother with this roundabout step instead of simply reading the .length property, there are several reasons:

  • Simulate behavior in other languages
  • You can dynamically calculate the length of an array
  • Provide error checking
  • Even O(1) calls can have a performance impact

As for why we use getFromArray to read array elements: JavaScript arrays have no out-of-bounds check (out-of-bounds access returns undefined instead of throwing), so a program can run for a long time before a mistake surfaces. The purpose of getFromArray is therefore to provide bounds checking and throw an error as soon as an access goes out of bounds.

Version 1:

// This code can be run directly

function getFromArray(arr, index) { // Provides array getters for out-of-bounds checks
    if (index < 0 || index >= arr.length) throw new Error('OUT OF INDEX ' + index);
    return arr[index]
}

function getLengthFromArray(arr) { // Get the length of the array
    if (arr == null) throw new Error(arr + ' is not an Array.')
    return arr.length; // Even constant function calls still incur overhead
}

void function() {
    console.log('start');

    const numberOfElement = 9999999;
    const arr = Array(numberOfElement).fill(2);

    (function combine1() {
        console.time('combine1');

        let sum = 0;
        for (let i = 0; i < getLengthFromArray(arr); i++) { // The length is recalculated each time
            sum += getFromArray(arr, i); // Get array elements using getters that provide out-of-bounds checks
        }
        console.log('sum', sum);

        console.timeEnd('combine1');
    }());

    console.log('end');
}();

Eliminate loop inefficiencies

On every iteration, combine1 calls getLengthFromArray as the loop's test condition. Given what the function does, we only read each element of the array and never modify its length, so a great many of those calls are wasted work, and that is what we optimize next.

Version 2:

(function combine2() {
    // Runs in the same scope as combine1 above, so arr and the helper functions are available
    console.time('combine2');
    let sum = 0;
    const len = getLengthFromArray(arr); // Eliminate the array length calculation for each loop
    for (let i = 0; i < len; i++) {
        sum += getFromArray(arr, i);
    }
    console.log('sum', sum);
    console.timeEnd('combine2');
}());

In my tests on my PC, combine2 was indeed about 10ms faster than combine1 after eliminating the useless computation in the for loop's condition. You might think that's nothing, but no matter how small this kind of waste looks, it is a source of inefficiency that needs to be eliminated, or it can become a bottleneck for further optimization.

Reduce redundant function calls

Analyzing combine2, we can see that the bounds check provided by getFromArray seems unnecessary as long as we set the loop's termination condition correctly. getFromArray performs a comparison on every access, which is redundant here, and every function call has its own overhead.

(function combine3() {
    /**
     * The per-iteration length computation is gone, and because the loop bound
     * guarantees every access is in range, no bounds check is needed when
     * reading the array.
     */
    console.time('combine3');
    let sum = 0;
    const len = getLengthFromArray(arr);
    for (let i = 0; i < len; i++) {
        sum += arr[i]; // Direct access
    }
    console.log('sum', sum);
    console.timeEnd('combine3');
}());

In my tests, combine3 runs almost twice as fast as combine1, so from a performance standpoint this optimization is clearly worthwhile.

There is a counter-argument, though: if you think of arr as a data structure provided by someone else, with getFromArray as its getter, then we, as users, should not make any assumptions about the underlying implementation of arr. We don't know whether it's an array or a linked list, so bypassing the getter violates the black-box principle and increases the coupling of the program.

So when optimizing a program, you may need to weigh high performance with high coupling against lower performance with low coupling.

Eliminate unnecessary memory access

To understand this optimization point, you first need to understand:

  • 1) A function's local variables are typically kept in registers, which you can think of as a small cache between the CPU and memory
  • 2) A program can access registers much faster than memory
  • 3) For a variable whose value is an array, what the register holds is the address of the array in memory, so reading or modifying the array's data requires going to memory

With these points in mind, let’s look at a piece of negatively optimized code:

(function combine4() {
    let sum = [0]; // Negative optimization: sum is now a reference to an array in memory
    console.time('combine4');
    const len = getLengthFromArray(arr);
    for (let i = 0; i < len; i++) {
        sum[0] += arr[i]; // Three memory accesses
    }
    console.log('sum', sum[0]);
    console.timeEnd('combine4');
}());

combine4 is a negative optimization because the statement sum[0] += arr[i] in the loop body requires three memory accesses per iteration:

  • Read sum[0]
  • Read arr[i]
  • Write back sum[0]

Two reads and one write: three memory accesses in total.

To make the conclusion more convincing, here is a positively optimized counterpart of combine4 for comparison.

(function combine4_anothor() {
    let sum = [0];
    let tmp = 0; // Use a temporary variable to reduce the number of memory accesses
    console.time('combine4_anothor');
    const len = getLengthFromArray(arr);
    for (let i = 0; i < len; i++) {
        tmp += arr[i]; // One memory access per iteration
    }
    sum[0] = tmp;
    console.log('sum', sum[0]);
    console.timeEnd('combine4_anothor');
}());

combine4_anothor performs only one memory read (arr[i]) per loop iteration, which reduces the number of memory accesses compared with combine4, and it writes the result back to memory only once, after the loop. The improvement in performance is in fact substantial. Here are the results on my own machine:

  • combine3: 15.298ms
  • combine4: 28.701ms
  • combine4_anothor: 17.618ms

The core of this section: by introducing temporary variables and reusing them, we reduce unnecessary memory accesses.

At this point, you may be wondering: why is memory slow? What is the relationship between registers and memory? Why are local variables stored in registers, and how is that done? Again, these questions are answered in CSAPP.

Use loop unrolling

(function combine5() {
    /** Use loop unrolling */
    let sum = 0;
    console.time('combine5');
    const len = getLengthFromArray(arr);
    const limit = len - 1; // Prevent the unrolled accesses from going out of bounds
    let i;
    for (i = 0; i < limit; i += 2) { // Unroll with a step of 2
        sum += arr[i]; // Direct access
        sum += arr[i + 1];
    }
    for (; i < len; i++) { // Process the remaining unaccessed elements
        sum += arr[i];
    }
    console.log('sum', sum);
    console.timeEnd('combine5');
}());

How does loop unrolling improve the performance of this code? The reason is that it reduces the overhead of the loop itself. In this particular example, unrolling roughly halves the number of iterations, and therefore the number of loop-condition comparisons, so performance improves.

But here's the surprise: when I actually ran it, combine5 with a step of 2 was not faster than combine4_anothor; on the contrary, it was slightly slower. However, as I gradually increased the step, the time gradually approached that of combine4_anothor and eventually they were about the same. Why?

As mentioned just now, unrolling improves performance by reducing the number of comparisons made in the for loop. When we write an ordinary loop with a step of 1, a modern compiler can recognize this common loop pattern and apply a similar unrolling optimization itself, so by increasing the unrolling step by hand we are really just doing that optimization manually. It is worth noting that the compiler can recognize code this simple, but not necessarily more complex code, so the unrolling technique is still worth mastering, and it is worth thinking about exactly which bottleneck loop unrolling actually removes.
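
As a concrete illustration of increasing the step, here is a sketch of the same summation unrolled by a factor of 4. The name combine5_by4 is my own; it assumes arr and getLengthFromArray from the earlier snippets are in scope.

(function combine5_by4() {
    // Sketch: the same loop unrolled by a factor of 4
    console.time('combine5_by4');
    let sum = 0;
    const len = getLengthFromArray(arr);
    const limit = len - 3; // Leave room for the 4-element stride
    let i;
    for (i = 0; i < limit; i += 4) {
        sum += arr[i];
        sum += arr[i + 1];
        sum += arr[i + 2];
        sum += arr[i + 3];
    }
    for (; i < len; i++) { // Handle the remaining 0-3 elements
        sum += arr[i];
    }
    console.log('sum', sum);
    console.timeEnd('combine5_by4');
}());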

Analyze the performance bottleneck again

Now let's analyze what limits the performance of combine5: which factor does its running time mainly depend on?

Loop unrolling lets us reduce the number of iterations and thus save the cost of the comparison performed on each one, but the cost of sum += arr[i] itself cannot be saved: no matter how we optimize, we must read each of the len elements of arr at least once, and the computer must execute sum += arr[i] sequentially, unrolled or not, because each new value of sum depends on the previous one.

In other words, each execution of sum += arr[i] has to wait for the previous sum += arr[i] to finish before it can start.

Here we need to introduce a bit of knowledge about modern CPUs. We tend to think of code (instructions) as executing strictly in order, and from the outside that is how it appears, but underneath, the CPU actually executes instructions out of order. By analyzing and reordering instructions, the CPU can execute them concurrently, and even in parallel by using multiple functional units. Why is this still correct?

Consider the following code

let a = 1 + 1;
let b = 2 + 2;
let c = a + b;
let d = c + c;

a and b can be computed in parallel because they do not depend on each other, while c must wait until a and b are finished, and d must wait until c is finished; therefore c and d cannot run in parallel with each other and can only execute in sequence.

Now let’s review the reasons that restrict the performance of Combine5:

  • 1) No matter what we do, we must read every element of arr at least once
  • 2) The computer must execute sum += arr[i] in order, regardless of loop unrolling, because each calculation of sum depends on the previous sum

Point 1) is unavoidable, but is there room for improvement on point 2)? Consider the following code.

Improve code parallelism

(function combine5_another() {
    /** Improve parallelism */
    console.time('combine5_another');
    let sum = 0;        
    let tmp1 = 0;
    const len = getLengthFromArray(arr);
    const limit = len - 1; // Prevent loop expansion from overstepping bounds
    let i;
    for (i = 0; i < limit; i += 2) {
        sum += arr[i]; // Direct access
        tmp1 += arr[i + 1];
    }
    for (; i < len; i++) { // Process the remaining unaccessed elements
        sum += arr[i];
    }
    sum += tmp1; // Merge tmp1 into the final sum
    console.log('sum', sum);
    console.timeEnd('combine5_another');
}());

By introducing the temporary variable tmp1, combine5_another accumulates the odd-indexed elements (arr[1], arr[3], ...) into tmp1 while sum accumulates the even-indexed elements (arr[0], arr[2], ...). This lets the CPU compute tmp1 and sum in parallel, since tmp1 does not depend on sum and vice versa, and the two results are merged once the loop ends. In theory, an execution time of 100s would drop to 50s.

However, in my own tests this parallel optimization brought no improvement, and even made things slightly worse. There are two possible reasons; one is interference and the other is more important. The interference first: because improving parallelism relies on loop unrolling, the compiler may no longer recognize the loop pattern and apply its own optimization.

The other, more important reason is that both loop unrolling and increasing parallelism depend on the underlying hardware. The CPU can compute in parallel because it has multiple redundant functional units, and the number of those units limits how many computations can happen at the same time; **for the same optimized code run on different CPUs, the performance improvement is not guaranteed and may even turn into a regression**. So these two optimizations have to be analyzed machine by machine, case by case, although the principles behind them are general.
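
To make the role of those functional units concrete, here is a sketch that pushes the two-accumulator idea further and uses four independent accumulators. The name combine5_four_acc is my own, not from the book, and as noted above, whether it actually helps depends on the CPU and on what the JS engine already does, so treat it as an experiment rather than a guaranteed win.

(function combine5_four_acc() {
    // Sketch: unroll by 4 with 4 independent accumulators, so the four
    // additions in each iteration do not depend on one another
    console.time('combine5_four_acc');
    let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const len = getLengthFromArray(arr);
    const limit = len - 3;
    let i;
    for (i = 0; i < limit; i += 4) {
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    let sum = s0 + s1 + s2 + s3; // Merge the independent accumulators
    for (; i < len; i++) { // Handle the remaining 0-3 elements
        sum += arr[i];
    }
    console.log('sum', sum);
    console.timeEnd('combine5_four_acc');
}());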

Similarly, CSAPP explains the principles and reasons behind these two optimizations in detail, and you can read it if you are interested.

Finally, the comparison between unoptimized and optimized versions is provided:

  • Before optimization: 47.267ms
  • Optimized: 12.731ms

Approximately 70% performance improvement

Take advantage of program locality

The main point of this section is that you need to write programs with good locality

Let's get one thing straight first: the system caches frequently used data by keeping copies in memory and in the CPU cache. When a program requests a piece of data, the system looks in the cache first, then in memory (memory is faster than disk, but still slow compared with the CPU, which is why a layer of cache sits between the CPU and memory), and then on disk, working its way down. As soon as the data is found it is returned, and each level down is slower than the one above it.

Second, in layman's terms, a program has locality when data that was just accessed, or code that was just executed, is likely to be used again soon, either the item itself or items adjacent to it.

Locality comes in two kinds:

  • Temporal locality: data (or code) that has been accessed is likely to be accessed again
  • Spatial locality: when data (or code) has been accessed, the data (or code) adjacent to it is likely to be accessed

For example: the following code sums a 10 by 10 matrix.

const matrix = getMatrix(10, 10);
let sum = 0;
for (let row = 0; row < matrix.length; row++) {
    for (let col = 0; col < matrix[0].length; col++) {
        sum += matrix[row][col];
    }
}

The variable sum shows good temporal locality: after the first access, every iteration of the loop accesses it again, so once the system has put it in the cache on the first access, each subsequent access to sum goes straight to the cache rather than to memory, which cuts the time spent on memory accesses.

For matrix[row][col], it is spatial locality at work. To understand this, you need to know how a two-dimensional array is stored in memory: it is laid out row by row (row-major order). The addresses of the elements then look like this:

  • arr[0][0] -> memory address 0
  • arr[0][1] -> memory address 1
  • ...
  • arr[0][9] -> memory address 9
  • arr[1][0] -> memory address 10
  • arr[1][1] -> memory address 11
  • ...
  • arr[9][0] -> memory address 90
  • arr[9][1] -> memory address 91
  • ...
  • arr[9][9] -> memory address 99

That is, if we iterate row-first (as the code above does), we visit memory addresses in linearly increasing order: 0, 1, 2, ..., 50, 51, 52, ..., 90, 91, 92, ..., 99.

When we access arr[0][0] (memory address 0), the system, following the principle of spatial locality, assumes we are likely to access arr[0][1] next, so it loads it into the cache as well. When we do access it, there is no need to wait for memory; it comes straight from the cache.
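
To spell the row-major address arithmetic out in code, here is a small sketch using a flat typed array; toFlatIndex is a hypothetical helper of mine, and a Float64Array is used because its backing buffer is guaranteed to be contiguous, whereas the in-memory layout of a nested JS array is left to the engine.

function toFlatIndex(row, col, numCols) {
    return row * numCols + col; // Row-major: each row is laid out right after the previous one
}

const rows = 10, cols = 10;
const flat = new Float64Array(rows * cols).fill(1); // Contiguous block of 100 numbers

let total = 0;
for (let row = 0; row < rows; row++) {
    for (let col = 0; col < cols; col++) {
        total += flat[toFlatIndex(row, col, cols)]; // Visits indices 0, 1, 2, ... in order
    }
}
console.log('total', total); // 100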

By way of comparison, let’s introduce a bad example

let sum = 0;
for (let col = 0; col < matrix[0].length; col++) {
    for (let row = 0; row < matrix.length; row++) {
        sum += matrix[row][col];
    }
}

This code does exactly the same thing as the code above; the only difference is that the traversal is now column-first, then row. To make it concrete, the memory addresses are now visited in the order 0, 10, 20, ..., 80, 90, 1, 11, 21, ..., 81, 91, 2, 12, 22, ... Compared with the continuously increasing traversal, this order jumps around. When you access arr[0][0], the system pulls arr[0][1] into the cache, but the next access is arr[1][0]; the cache misses and you have to wait for memory. As these misses accumulate, program performance drops significantly.

Here is a loose but intuitive result from the matrix-summation example (the code that produced it follows):

start
sum 99980001
Good locality: 159.879ms
sum 99980001
Bad locality: 3420.815ms
end
// This code can be copied directly to the browser to run
void function() {
    console.log('start');
    const size = 9999;
    const matrix = Array(size);
    for (let i = 0; i < matrix.length; i++) {
        matrix[i] = Array(size).fill(1);
    }

    (function goodVersion() {
        console.time('Good locality');
        let sum = 0;
        for (let row = 0; row < matrix.length; row++) {
            for (let col = 0; col < matrix[0].length; col++) {
                sum += matrix[row][col];
            }
        }
        console.log('sum', sum);
        console.timeEnd('Good locality');
    })();

    (function worseVersion() {
        console.time('Bad locality');
        let sum = 0;
        for (let col = 0; col < matrix[0].length; col++) {
            for (let row = 0; row < matrix.length; row++) {
                sum += matrix[row][col];
            }
        }
        console.log('sum', sum);
        console.timeEnd('Bad locality');
    })();

    console.log('end');
}();

Again, the point is: your programs need to have good enough locality.

summary

Because of length limits, this article only gives readers a taste of the principles behind high-performance code, and leaves a lot out. Some optimizations touch every level of the computer; for example, the discussion of program locality here stays at a macro level and uses only qualitative analysis, but the principle of locality rests on a solid body of theory and on a clean division of labor between the layers of the computer, and this kind of optimization can be analyzed quantitatively. If you want to learn more, in more detail and more systematically, I suggest reading CSAPP together with its companion open course; see: www.yuque.com/u184091/ohc…