This article is participating in the Java Theme Month – Java Debug Notes Event, see the event link for details
Question: Why is it faster to process sorted arrays than unsorted arrays?
Here is a piece of C++ code that exhibits some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code nearly six times faster:
#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds. With the data sorted, the code runs in 1.93 seconds. Initially, I thought this might just be a language or compiler anomaly, so I tried Java:
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i)
        {
            for (int c = 0; c < arraySize; ++c)
            {   // Primary loop
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
With similar but less extreme results.
My first thought was that sorting brings the data into the cache, but then I realized how silly that was: the array had just been generated.
1. What’s going on?
2. Why is it faster to process sorted arrays than unsorted arrays?
Answer:
Consider a railway junction:
Now for the sake of argument, suppose this was back in the 19th century – before long-distance or radio communication.
You are the junction operator and you hear a train coming. You have no idea which way it is supposed to go. You stop the train and ask the driver which direction they want. Then you set the switch appropriately.
Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.
Is there a better way? Guess which way the train is going!
If you guess right, it continues on. If you guess wrong, the driver will stop, back up, and yell at you to flip the switch; then the train can restart down the other path. If you guess right every time, the train never has to stop. If you guess wrong too often, the train spends a lot of time stopping, backing up, and restarting.
Consider the if statement: at the processor level, it is a branch instruction:
You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions complete. Then you continue down the correct path.
Modern processors are complicated and have long pipelines, so they take forever to "warm up" and "slow down".
Is there a better way? You guess which way the branch will go!
If you guess right, you continue on. If you guess wrong, you flush the pipeline and roll back to the branch; then you can restart down the other path. If you guess right every time, execution never has to stop. If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.
This is branch prediction. It's not the best analogy, I admit, since a train could signal its direction with a flag. But in a computer, the processor doesn't know which way a branch will go until the last moment.
So how would you strategically guess to minimize the number of times the train must back up and go down the other path? Look at the past history! If the train goes left 99% of the time, you guess left. If it alternates, you alternate your guesses. If it goes one way every three times, you guess accordingly…
In other words, you try to recognize and follow the pattern. This is more or less how branch predictors work.
Most applications have well-behaved branches, so modern branch predictors typically achieve hit rates above 90%. But when faced with unpredictable branches that have no recognizable pattern, branch predictors are virtually useless.
Further reading: Wikipedia article on “branch predictors”.
As mentioned above, the culprit is this if statement:
if (data[c] >= 128)
sum += data[c];
Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if statement; after that, all of them will.
This is very friendly to the branch predictor, since the branch goes the same direction many times in a row. Even a simple saturating counter will predict the branch correctly, except for the few iterations right after it switches direction.
Quick visualization:
T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T   ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)
However, when the data is completely random, the branch predictor is useless because it cannot predict random data. The misprediction rate will therefore be around 50% (no better than random guessing).
data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118, 14, 150, 177, 182, ...
branch = T, T, N, T, T, T, T, N, T, N, N, T, T, T ...
= TTNTTTTNTNNTTT ... (completely random - impossible to predict)
So what can be done?
If the compiler cannot optimize the branch into a conditional move, you can try some hacks, if you are willing to sacrifice readability for performance.
Replace:
if (data[c] >= 128)
sum += data[c];
With:
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
This eliminates branching and replaces it with some bit operations.
(Note that this hack is not strictly equivalent to the original if statement, but in this case it is valid for all input values of data[].)
Baseline: Core i7 920 @ 3.5 GHz
C++ – visual studio 2010 – x64 release
Scenario                   Time (s)
Branch    - Random data    11.777
Branch    - Sorted data     2.352
No branch - Random data     2.564
No branch - Sorted data     2.587

Java – NetBeans 7.1.1 JDK 7 – x64

Scenario                   Time (s)
Branch    - Random data    10.93293813
Branch    - Sorted data     5.643797077
No branch - Random data     3.113581453
No branch - Sorted data     3.186068823
Observations:

With the branch: There is a huge difference between sorted and unsorted data.
With the hack: There is no difference between sorted and unsorted data.
In the C++ case, the hack is actually a bit slower than the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops, as in this example.
Update:
GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. As a result, there is no difference between sorted and unsorted data – both are fast.
(Or somewhat fast: for the already-sorted case, cmov can be slower, especially if GCC puts it on the critical path instead of just add, particularly on Intel CPUs before Broadwell, where cmov has a 2-cycle latency: GCC optimization flag -O3 makes code slower than -O2.)
VC++ 2010 is unable to generate conditional moves for this branch, even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. As a result, it is not only immune to mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC exploited the test loop to defeat the benchmark…
If you give the Intel compiler the branchless code, it just vectorizes it outright… and it is just as fast as with the branch (and the loop interchange).
This shows that even mature modern compilers can vary wildly in their ability to optimize code…