In any era, those who know how to learn are never treated badly.
Hello, I’m Yes.
I was supposed to write a RocketMQ article this time, but this topic cut in line, which I didn’t expect.
Let’s take a look at this code from ChannelEventRunnable. This runnable is created by a Dubbo IO thread and thrown into the business thread pool for processing.

Notice the odd shape: `state == ChannelState.RECEIVED` gets a standalone if all to itself, while every other state is handled by the switch that follows.
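Dubbo’s actual source is longer, but the shape in question looks roughly like this sketch (class name and handling logic are simplified stand-ins, not Dubbo’s real code; the ChannelState values are assumptions):

```java
// Simplified sketch of the pattern: the hot state gets its own if,
// everything else goes through the switch.
public class ChannelDispatchSketch {
    enum ChannelState { CONNECTED, DISCONNECTED, SENT, RECEIVED, CAUGHT }

    static String dispatch(ChannelState state) {
        if (state == ChannelState.RECEIVED) {      // hot path: >99.9% of calls
            return "received";
        }
        switch (state) {                           // cold paths
            case CONNECTED:    return "connected";
            case DISCONNECTED: return "disconnected";
            case SENT:         return "sent";
            case CAUGHT:       return "caught";
            default:           return "unknown";
        }
    }
}
```

The question is why the hot case was pulled out of the switch at all, which is exactly what the rest of this article digs into.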
I turned it over in my head again and again, wondering what trick this was; with my shallow knowledge I was completely baffled.
So began an adventure!
It turned out to be CPU branch prediction
When you hit a problem, of course you ask a search engine. I usually query several engines at once; I won’t argue about which is better overall, but sometimes Baidu genuinely wins. For this search Baidu ranked the answer near the top, while Google buried it further down.
Generally I like to search the official site first and only fall back to open search if that fails, so I started with `site:xxx.com key`.
There you go. It’s perfect!
Let’s take a look at this blog post on the official website and then analyze it in more detail.
Dubbo’s blog
Modern CPUs all support branch prediction and instruction pipelining, which together greatly improve CPU efficiency. For a simple if jump, the CPU can predict the branch well. But for a switch jump there isn’t much the CPU can do: a switch essentially indexes into an address array based on a value and jumps to the resulting address.

In other words, an if compiles to a conditional jump instruction, and for a simple jump the CPU can use branch prediction to execute the target instructions in advance; a switch must first look up the target address in an array-like structure based on the value and then jump, so branch prediction doesn’t help it.
Since a channel’s state is ChannelState.RECEIVED more than 99.9% of the time after it is established, that state is singled out into its own if so that the CPU’s branch prediction mechanism can kick in and improve execution efficiency.
The blog’s BenchSwitch benchmark generates one million states, 99.99% of them ChannelState.RECEIVED, and compares the two approaches (a detail I only noticed while proofreading this article).
Although the blog gives its own comparison results, I wanted to run it locally and see for myself. Strictly speaking, JMH recommends against running inside an IDE, but I’m lazy and ran it directly in IDEA.
It turns out that standalone if code executes more efficiently (note that we’re measuring throughput here), and the blog also suggests that this technique can be used where performance is critical, meaning it’s not necessary in general.
At this point we’ve confirmed the claim, but we should take a closer look at the difference between how if and switch execute, and then at what branch prediction and instruction pipelining actually are.
if vs switch
Let’s start with a quick demo of if versus switch efficiency. I added a pure if-else version to the mix, kept the switch and if + switch versions unchanged, and compared all three (as before, RECEIVED makes up more than 99.9% of the states).
Take a look at the results of the implementation:
Pure if-else comes out on top, followed by if + switch, with pure switch last.
Then I changed the generated states to be fully random and ran it again to see what happens:
I ran it several times and pure if still had the highest throughput; somehow all-if is simply the best.
Decompiling if and switch
My impression had been that, CPU branch prediction aside, switch should beat if at the bytecode level. So let’s look at the bytecode each one generates.
First the decompiled switch; I’ve kept only the key part.
In other words, the switch generates a tableswitch instruction. After getstatic loads the value, it can index directly into the table and jump to the corresponding row. That is, the lookup is O(1).
For example, if the value is 1, skip to line 64, or if it is 4, skip to line 100.
There are a few finer details about switch. When the case values are non-consecutive and far apart, a lookupswitch is generated instead, which binary-searches for the target in O(log n) rather than indexing directly. The generated lookupswitch entries are sorted by value, which is what makes the binary search possible.
If the case values are non-consecutive but close together, a tableswitch is still generated with the gaps filled in. For example, if the case values are 1, 3, 5, 7, and 9, the compiler fills rows for 2, 4, 6, and 8 that all jump to default.
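You can see both instructions for yourself with a toy class like the one below (a hypothetical example; compile it and inspect with `javap -c`). The comments state which instruction javac emits for each method:

```java
// Dense case values compile to tableswitch; sparse, far-apart values
// compile to lookupswitch. Verify with: javac SwitchShapes.java && javap -c SwitchShapes
public class SwitchShapes {
    // Values 1..5 are consecutive -> tableswitch (O(1) indexed jump).
    static String dense(int v) {
        switch (v) {
            case 1: return "a";
            case 2: return "b";
            case 3: return "c";
            case 4: return "d";
            case 5: return "e";
            default: return "?";
        }
    }

    // Values 1, 100, 10_000 span too wide a range for a table
    // -> lookupswitch (sorted pairs, binary search, O(log n)).
    static String sparse(int v) {
        switch (v) {
            case 1:      return "a";
            case 100:    return "b";
            case 10_000: return "c";
            default:     return "?";
        }
    }
}
```

Both methods behave identically from the caller’s point of view; only the bytecode dispatch strategy differs.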
Let’s look at the decompilation of if again:
You can see that if reloads the variable and compares it against each condition in turn, while switch loads the variable once, looks up the table, and jumps straight to the right row. From this angle, switch should be more efficient than if. Of course, if the first if condition matches, it gotos straight out and skips all the remaining comparisons.
So judging by the generated bytecode, switch should be more efficient than if; but judging by the test results, if beats switch, whether the states are random or 99.99% identical.
The CPU branch-prediction benefit is certainly real, but I’m not sure why if still wins in the random case. Maybe the JIT does some optimization, or maybe even with random input the payoff from correct predictions outweighs the cost of the mispredictions.
Or do I have too few enum values for switch to shine? With random input, switch shouldn’t be weaker than if. I bumped the test up to 7 enum values, with the following result:
The gap had narrowed; sensing a chance, I went all in and used the 26 letters of the alphabet as enum values (honestly, I just typed out the alphabet).
With the branch count expanded, another round of tests finally let switch come into its own and beat if.
An aside:
If the JMH code isn’t written correctly, say switching on a constant, or not consuming the result of the benchmark method, you never know how the JIT will optimize it; dozens of lines of my code got optimized down to simply returning a value.
Summarizing the test results
So let’s summarize this.
First, pulling the hot branch out of the switch into a standalone if, so the CPU’s branch prediction can be fully exploited, really is better than a pure switch: by our test results, roughly twice the throughput.
However, with a hot branch, replacing if + switch with pure if-else improves throughput even more: about 3.3x pure switch and 1.6x if + switch.
In the random case, the three are close, but pure if still does best.
From the bytecode’s point of view, the switch mechanism should be more efficient, whether it is O(1) or O(log n), but the test results say otherwise.
With few cases, if beats switch. I’m not sure why; maybe with few values the cost of the table lookup outweighs the benefit it brings? If you know, leave a comment at the end of the article.
With many cases, switch beats if. I didn’t test even larger case counts; if you’re interested, try it yourself, but that’s the trend.
CPU branch prediction
Let’s look at how it works and why it’s called “branch prediction”, but first we need to talk about instruction pipelining, the pipeline inside modern microprocessors.
A CPU basically does one thing: fetch an instruction and execute it, again and again. A classic pipeline splits that into five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write back (WB). Take a look at this picture from Wikipedia.
There may be more stages in practice, but that’s exactly the point of splitting one execution into so many steps: the steps can overlap in parallel, improving processing efficiency.
So an instruction pipeline is an attempt to keep each part of the processor busy with instructions by dividing the incoming instructions into a series of sequential steps that are executed by different processor units, with different parts of the instructions processed in parallel.
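A back-of-the-envelope model shows why that helps (an idealized sketch that ignores stalls and hazards, not a model of any real CPU): without a pipeline, each instruction occupies all S stages before the next one starts; with a pipeline, once the pipe is full, one instruction completes every cycle.

```java
// Idealized cycle counts for executing n instructions on an S-stage pipeline.
public class PipelineModel {
    // No overlap: each instruction takes all S stages end to end.
    static long unpipelinedCycles(long n, int stages) {
        return n * stages;
    }

    // Full overlap: S cycles to fill the pipe, then one completion per cycle.
    static long pipelinedCycles(long n, int stages) {
        return stages + (n - 1);
    }
}
```

For a million instructions on a 5-stage pipeline that’s 5,000,000 cycles versus 1,000,004, nearly a 5x speedup, which is why deeper pipelines are so attractive (and why flushing one hurts, as we’ll see next).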
It’s like an assembly line in a factory: while I put the head on one Ultraman figure, the next worker is putting the feet on another; I don’t stand around waiting for a whole figure to be finished before starting on the next.
Of course it isn’t that rigid, and execution isn’t necessarily strictly in order: some later instructions don’t depend on earlier results and can run ahead of them. That’s called out-of-order execution.
Let’s go back to our branch predictions.
Just like in life, code constantly faces choices, and only after choosing do we know what to do next. But in practice, code usually makes the same choice, so someone came up with the branch predictor, which guesses the direction and executes the instructions there in advance.
What if the guess is wrong? Unlike in life, the CPU can throw away all the speculative results and start over. But there’s a cost: the deeper the pipeline, the more work a misprediction wastes. A mispredict typically costs 10 to 20 clock cycles, so the side effects are real.
To put it simply, the branch predictor is used to predict the instructions to be executed in the future, and then it is pre-executed. In this way, the result can be directly obtained when it is really needed, which improves efficiency.
There are many kinds of prediction methods, such as static prediction, dynamic prediction, random prediction, etc. According to Wikipedia, there are 16 kinds.
Let me briefly cover the three I mentioned. Static prediction is dumb, like blindly guessing on a multiple-choice English test: who cares what the question is, always pick A. It predicts one fixed direction, simple and crude.
Dynamic prediction decides based on history: if the branch was taken the last few times, predict it will be taken next time; if it was not taken recently, predict not taken. Once again, the principle of locality at work.
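The textbook version of dynamic prediction is a 2-bit saturating counter. This is a toy sketch of the idea, not how any particular CPU implements it: states 0–1 predict “not taken”, states 2–3 predict “taken”, and each outcome only nudges the counter one step, so a single stray outcome doesn’t flip a well-established prediction.

```java
// Toy 2-bit saturating counter: the classic dynamic branch predictor.
public class TwoBitPredictor {
    private int counter = 2; // start in "weakly taken"

    // Predict taken when the counter is in the upper half (2 or 3).
    boolean predict() {
        return counter >= 2;
    }

    // Nudge the counter toward the actual outcome, saturating at 0 and 3.
    void update(boolean taken) {
        counter = taken ? Math.min(3, counter + 1)
                        : Math.max(0, counter - 1);
    }
}
```

The nice property: after a long run of taken branches, one not-taken outcome drops the counter from 3 to 2, so the predictor still says “taken”, exactly the hysteresis you want for loop-closing branches.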
Random prediction is just what it sounds like: another way to guess at multiple choice, a shot in the dark, picking a direction at random and executing it.
The list goes on for those interested in exploring on their own. By the way, in 2018 Google’s Project Zero and other researchers disclosed a catastrophic security vulnerability called Spectre, which exploits the CPU’s branch-predicted speculative execution to leak sensitive information. I won’t go into it here; there’s a link at the end.
Then came another attack called BranchScope, which also abuses speculative execution. Every new mechanism brings both pros and cons.
Now that we know what pipelining and branch prediction are, and understand why Dubbo optimized this way, I want to mention that famous Stack Overflow question. Just look at the vote count.
Why is processing a sorted array faster than processing an unsorted array?
That blog post opened with this very question, and it’s clearly related to branch prediction. Since we’ve now read and analyzed all of the above, try answering it in your head; we already know the answer, so let’s see whether we can explain it clearly.
Here’s the code: sorting the array makes the loop run faster.
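For readers who haven’t seen the question, the setup looks roughly like this (a reconstruction of the question’s gist, not a verbatim copy): sum the elements that are at least 128; the only difference between the two runs is whether the array is sorted first.

```java
import java.util.Arrays;
import java.util.Random;

// Roughly the Stack Overflow setup: sum the elements >= 128.
// The branch inside the loop is the whole story.
public class SortedVsUnsorted {
    static long sumOver128(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] >= 128) {  // this branch is what the predictor sees
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[32768];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextInt(256);

        long unsorted = sumOver128(data);
        Arrays.sort(data);  // sorted: branch goes false,false,...,true,true,...
        long sorted = sumOver128(data);
        System.out.println(unsorted == sorted); // same result either way
    }
}
```

The result is identical either way; only the speed differs when you time the loop, and the rest of the answer explains why.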
Then the top answerer arrived. Let’s see what the great one had to say.
The very first sentence hits the nail on the head:
You are a victim of branch prediction fail.
Then comes a picture; one glance and you can tell this is an old hand.
He says: go back to the 19th century, before long-distance communication and widespread radio. If you operated a railroad junction, how would you know which way to throw the switch when a train came?
Stopping and restarting a train is very expensive. If you stopped at every fork to ask the driver “which way, buddy?”, threw the switch, and then restarted, it would waste enormous time. So what do you do? Guess!
If you’re right, the train keeps going. If you’re wrong, stop and back up and change lanes.
So guess boldly! Guess well enough and the bicycle turns into a motorcycle.
Then he points out the assembly corresponding to the key line of code: the jump instruction. That’s the train’s fork, where a path must be chosen.
I won’t walk through the rest; you can already see it: with the sorted array, once a value exceeds 128, every later value also exceeds 128, so the branch prediction is correct every time and execution is very efficient.
An unsorted array is in random order, so the prediction is often wrong, and every wrong guess means flushing the pipeline and backing up, which is of course slow.
So the big guy says you are the victim of branch prediction error.
Finally, he offers a fix: if the if is the problem, can we just avoid it entirely? He achieves the same result with bit manipulation. I won’t analyze it in detail; here is his suggested change.
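In Java, the branch-free rewrite from that answer looks like this (my transcription of the technique; variable names are my own). Since the values are 0..255, `data[i] - 128` is negative exactly when the element is below 128, so an arithmetic shift turns it into a mask of all zeros or all ones, and the mask replaces the branch:

```java
// Branch-free sum of elements >= 128: nothing for the predictor to get wrong.
public class BranchlessSum {
    static long sumOver128(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            int t = (data[i] - 128) >> 31; // 0 if data[i] >= 128, else -1 (all ones)
            sum += ~t & data[i];           // adds data[i] or 0, no jump taken
        }
        return sum;
    }
}
```

It computes the same sum as the branchy loop, but its speed no longer depends on whether the data is sorted, because there is no conditional jump left to mispredict.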
Finally
That’s about enough for this article. Today we started from a snippet of Dubbo code and went on an exploration, analyzing if versus switch. Judging by the test results, Dubbo’s optimization doesn’t go far enough: it should be changed to a pure if-else structure.
At the bytecode level switch looks superior to if, but by the test results it only shows an advantage with many branches and generally fails to beat if.
We also learned what an instruction pipeline is, which really does work like a real assembly line, and that branch prediction is another way to squeeze out efficiency, provided you guess right; otherwise the misprediction penalty can’t be ignored, so the demands on the branch predictor are high.
I’ve packaged up the JMH test code; readers who want it can reply “branch prediction” to this account to get it. If you found this article decent, give it a like, and if there’s a mistake, please come at me with the whip.
Shoulders of giants
Spectre :www.freebuf.com/vuls/160161…
Dubbo blog:Dubbo.apache.org/zh-cn/blog/…
Stackoverflow.com/questions/1…
En.wikipedia.org/wiki/Instru…
En.wikipedia.org/wiki/Branch…
I’m yes, from a little bit to a billion bits. See you next time.