Contents
- Basic knowledge
- Assembly syntax
- Demo
- The basic program
- Debugging
Speaking of assembly optimization, I have to quote Donald Knuth's famous line: premature optimization is the root of all evil. Unless you are truly cornered or need to squeeze the last drop out of your CPU, don't try what follows in this talk.
I submitted these changes for Go 1.11. The first is a hashmap optimization: computing the hash is the most time-consuming step in common map operations. vDSO, the virtual dynamic shared object, is mainly there to speed up system time calls. I won't go into MD5 and ChaCha20. There is also duffcopy, the unrolled copy routine used by the compiler, which was not well optimized on arm64, so I optimized that too. All of these are on Go master except ChaCha20, which is not finished yet. Some of you may wonder why everything is for arm64. The official Go team maintains x86-64 itself and it is already well optimized, so I stayed out of it and chose a relatively new platform, arm64.
Xiao Wei, a team lead at ARM in China, also led his team on Go optimizations such as SHA256, which they made 16 times faster. Cloudflare, the CDN company, contributed optimizations as well, and those were also merged for Go 1.11. How much faster did things get?
This is a tweet that their CTO retweeted. The CTO asked him what he had optimized last week, and he answered that he had optimized some of Go's libraries: RSA became 20 times faster, AES-GCM 15 times and P256 18 times. Seeing experts get speedups like that, aren't you tempted? This talk is an introduction to assembly optimization: how to get a speedup of ten times or more.
1. Basic knowledge
So how do you make code run this fast? You have to know what to aim for. It comes down to three things: fewer reads and writes, parallel operations, and hardware acceleration.
1.1 Reducing Read and Write
Above is "Latency Numbers Every Programmer Should Know", shared by Google's Jeff Dean. What are these latencies? For example, in 2012 a read from the CPU's L1 cache took 0.5 ns and a read from L2 took 7 ns. Main memory, what we usually just call memory, takes 100 ns. Notice that each extra layer costs roughly an order of magnitude in performance, so you want to touch memory less and use registers more. There is also a trick to how the CPU accesses memory: if the address is aligned, the access is faster. This is basic knowledge that you can look up on Baidu or Google, so I won't expand on it.
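To put those numbers in perspective, here is a small back-of-the-envelope sketch in Go. It uses only the three 2012 figures quoted above; the access count of one million and the helper name totalMicros are arbitrary choices made for this illustration.

```go
package main

import "fmt"

// Latencies in nanoseconds, taken from the 2012 figures quoted above.
const (
	l1NS  = 0.5
	l2NS  = 7.0
	memNS = 100.0
)

// totalMicros returns the time in microseconds for n accesses
// that each cost perAccessNS nanoseconds.
func totalMicros(n int, perAccessNS float64) float64 {
	return float64(n) * perAccessNS / 1000
}

func main() {
	const n = 1_000_000 // arbitrary number of accesses, for illustration only
	fmt.Printf("L1:     %8.0f us\n", totalMicros(n, l1NS))
	fmt.Printf("L2:     %8.0f us (%.0fx L1)\n", totalMicros(n, l2NS), l2NS/l1NS)
	fmt.Printf("memory: %8.0f us (%.0fx L1)\n", totalMicros(n, memNS), memNS/l1NS)
}
```

With these figures each step down the hierarchy costs about 14 times more, which is where the "order of magnitude per layer" rule of thumb comes from.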
1.2 Parallel Operations
This is what the industry calls SIMD, single instruction, multiple data: one instruction operates on several pieces of data at once. A normal add instruction adds one number at a time, but with a vector instruction set you can add 8, 16 or 32 at a time. You process more data in the same amount of time, so naturally it is faster.
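Plain Go cannot issue vector instructions directly, but the idea of one operation handling several lanes can be imitated inside an ordinary 64-bit register. The sketch below uses that substitute technique (often called SWAR, SIMD within a register), not real vector instructions. The function name addBytes8 and the constants are made up for this illustration. It adds eight uint8 lanes with a single 64-bit addition, each lane wrapping around independently.

```go
package main

import "fmt"

const (
	low7 = 0x7f7f7f7f7f7f7f7f // low 7 bits of every byte lane
	high = 0x8080808080808080 // top bit of every byte lane
)

// addBytes8 treats a and b as eight independent uint8 lanes and adds them
// lane by lane (mod 256), without letting a carry leak into the next lane.
func addBytes8(a, b uint64) uint64 {
	// Add only the low 7 bits of each lane: every lane sum fits in 8 bits,
	// so no carry crosses a lane boundary.
	sum := (a & low7) + (b & low7)
	// Fold in the top bits with XOR, which is addition without carry-out.
	return sum ^ ((a ^ b) & high)
}

func main() {
	a := uint64(0x01020304ff807f10)
	b := uint64(0x01010101018001f0)
	// Lanes such as 0xff+0x01 and 0x10+0xf0 wrap to 0x00 on their own.
	fmt.Printf("%016x\n", addBytes8(a, b))
}
```

A real vector unit does the same thing with wider registers and dedicated lane logic, so none of the masking is needed there.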
1.3 Hardware Acceleration
No matter how good the algorithm is, it gets you at most 10 times, whereas hardware instructions start at 16 times. Xiao Wei and Vlad, for example, mostly optimize with hardware instructions: simple and brutal. As Jack Ma said, however good your kung fu is, you still fear a kitchen knife.
1.4 Program memory distribution
- Structure is consistent with other programs
- TEXT = executable code
- DATA = heap + global variables
- Frame = function arguments + temporary data
- Stack = Go scheduler / signal handling
You need a general understanding of Go's memory layout, because assembly operates on memory directly, so you must know where things live and what is stored where. Go uses memory in much the same way other programs do. At the bottom, TEXT holds the executable code, and DATA holds the heap and global variables. The only difference is that Go does not put everything on the system stack; it breaks the stack down into stack frames, which hold a function's arguments and temporary data. So what happens to the original system stack? Go's scheduler and signal handling run on the system stack, not in those stack frames.
2. Assembly syntax
Features of assembly syntax
- Quasi-abstract assembly language
- AT&T style (left to right)
- Instruction, parameter × N, target (N = 0…3)
Assembly syntax looks complicated, but it is actually simple and blunt, without the pile of terminology you get in C++ or Java. You operate directly on memory; it is that simple. Go's assembly syntax has deep roots in the Plan 9 operating system, which you may not have heard of, but which was built by the same people as Go.
Go's assembly is a quasi-abstract assembly language. Why quasi-abstract? The original hope was one unified assembly language: write it once and it works on x86-64, arm64 and everything else. Once most of it was implemented, that turned out to be impossible, so in the end the platform differences were kept, the style was unified, and the assembler emits machine code from that. Hence quasi-abstract. Then there is the AT&T style, left to right: the instruction is on the far left, a few parameters come in the middle, and the target register or memory location comes last. Everything else differs completely from platform to platform.
2.1 Assembly syntax example
This is a "very complicated" function: c = a + b, then return. What comes first? SB tells the assembler that this symbol is relative to the static base, that is, addressed from a static address. TEXT tells the assembler that the function goes in the text section: start looking here and nowhere else, and the assembler simply encodes the address. Remember that the example has three values, a, b and c, all int64. How many bytes is an int64? 8, so the frame here is 24 bytes. Note that there is an alignment issue: on other platforms it is not necessarily 24, but I write 24 here for simplicity.
2.2 Example code explanation
The first step is the move instruction.
A move copies data from one place to another; here it puts a and b into registers R1 and R2. There is one extra thing here called FP, the frame pointer. Addressing starts from FP, which points to the bottom of the stack frame. You take a from offset 0 and move it into R1; b sits 8 bytes further on, so you take it from offset 8 and store it in R2.
Step three: R3 = a + b.
Finally, write the value in R3 back to the result c, and return.
You’ve learned assembly language by now.
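The slide with the code is not reproduced in this transcript, so here is a reconstruction of what such a function might look like, as a sketch only. The Go function below is what actually runs if you build the file as-is. The comment shows a plausible arm64 body in Go assembly syntax that follows the steps described above; the names add, a, b and c and the file name add_arm64.s are assumptions.

```go
package main

import "fmt"

// add returns a + b. With an assembly implementation, the Go file would keep
// only the declaration `func add(a, b int64) (c int64)` and the body would go
// in a file such as add_arm64.s (hypothetical name), roughly like this:
//
//	TEXT ·add(SB), $0-24      // symbol relative to SB; 24 bytes of args + result
//	    MOVD a+0(FP), R1      // R1 = a, at offset 0 from FP
//	    MOVD b+8(FP), R2      // R2 = b, at offset 8
//	    ADD  R1, R2, R3       // R3 = a + b
//	    MOVD R3, c+16(FP)     // store the result c at offset 16
//	    RET
func add(a, b int64) (c int64) {
	c = a + b
	return c
}

func main() {
	fmt.Println(add(40, 2))
}
```

Each assembly line reads left to right, with the destination last, which is the AT&T-style ordering mentioned earlier.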
Very simple, but you would not want to write that. Why? In Go it is a single line. Writing complex logic in assembly is very hard. Remember the three goals of assembly optimization we talked about: fewer reads and writes, parallel operations, and hardware acceleration.
2.3 Reduce Reading and Writing
Take memmove, which is behind Go's built-in copy function. It is simple: move a block of data from a source address to a destination address. The most naive way is to go piece by piece: load 8 bytes from the source, store them, load the next 8 bytes, store them, and so on. What could go wrong here?
Fill the registers
The CPU is not happy when every 8-byte move has to go through the same register; performance deteriorates.
Fill the pipeline
This is about CPU pipelining. If you move the data one piece after another, each store waits on the load before it and the pipeline stalls. How do you solve this? Load as much data as you can from the source address into different CPU registers in one go, and then write it all to the destination address in one go. That avoids stalling the pipeline.
Process data in blocks
Block data is easy for the CPU to handle: data it has touched before can be kept in L1 and L2, so access is much faster than pulling it from main memory.
This is a bit complicated to write, but the core idea is that arm64 has a lot of registers: 32, minus the 4 that Go reserves for internal use, which leaves 28. So you can move 28 × 8 bytes in one go. The code looks roughly like this. You will notice that performance actually drops in some cases here. Why? That depends on the CPU: every vendor implements the CPU differently, some cut corners, and unfortunately I ran into one of those. According to ARM's specification there is no performance penalty for accessing unaligned addresses. So how do you measure it properly? Find a better CPU to test on. In the end a friend at Qualcomm sent me data showing that the optimization works well and improves every case.
Unfortunately this patch was not accepted upstream. Why? Licensing. Go uses a BSD license, and I had consulted the glibc code; after all, I did not come up with this algorithm myself, and code everywhere borrows from code elsewhere. glibc uses the GPL family of licenses (the LGPL, to be precise), and a Go core developer said that GPL code is a no. I argued that glibc is open source too, so why not? The official answer was that the company does not allow BSD and GPL code to be mixed, so the patch did not make it into Go master. A pity.
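The load-many-then-store-many idea can be sketched in plain Go. This is neither the runtime's memmove, which is written in assembly, nor the rejected patch. The name copyBlocks is made up, and the block size of four 8-byte words is chosen only to keep the example short, where the talk's version uses far more registers.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// copyBlocks copies min(len(dst), len(src)) bytes from src to dst and returns
// that count. It assumes dst and src do not overlap.
func copyBlocks(dst, src []byte) int {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	i := 0
	// Main loop: four independent 8-byte loads first, then four stores,
	// instead of alternating one load with one store.
	for ; i+32 <= n; i += 32 {
		w0 := binary.LittleEndian.Uint64(src[i:])
		w1 := binary.LittleEndian.Uint64(src[i+8:])
		w2 := binary.LittleEndian.Uint64(src[i+16:])
		w3 := binary.LittleEndian.Uint64(src[i+24:])
		binary.LittleEndian.PutUint64(dst[i:], w0)
		binary.LittleEndian.PutUint64(dst[i+8:], w1)
		binary.LittleEndian.PutUint64(dst[i+16:], w2)
		binary.LittleEndian.PutUint64(dst[i+24:], w3)
	}
	// Tail: copy the remaining bytes (fewer than 32) one at a time.
	for ; i < n; i++ {
		dst[i] = src[i]
	}
	return n
}

func main() {
	src := make([]byte, 100)
	for i := range src {
		src[i] = byte(i)
	}
	dst := make([]byte, 100)
	n := copyBlocks(dst, src)
	fmt.Println(n, bytes.Equal(dst, src))
}
```

The Go compiler is free to reorder or merge these loads and stores, so this only illustrates the shape of the technique. In real Go code, call the built-in copy, which already uses the optimized assembly routine.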
2.4 Parallel Operations
Let me give you an example. It is simple: take a uint8 slice and add all of its data into dst. The real data is a little more involved, so let me show you a demo.
There are three functions in here. Look at the first one: it has the same shape as the function we saw before, with a declaration and an empty body, which tells the toolchain that the implementation starts in assembly. I process the data 64 bytes at a time: each pass adds 64 bytes into dst and then subtracts 64 from the remaining count, so every iteration handles 64 bytes. The Go body is empty; there is nothing in it. The assembly is more or less the same as before. Does anyone know the data structure of a slice? It is just three numbers: a pointer, a length and a capacity. Remember the function signature? Two slices: I read the first slice starting at offset 0, and the second slice starts right after the first one's three words. If that is hard to follow, just think of it for now as loading two pointers into R1 and R3. Then I load the data at R1 and at R3 into 4 vector registers each, 8 in total. Once the data is loaded I do the vector addition and store the result back through R1. Finally, return. To build code against Go master you may need a recent Go toolchain. The test results came out clean.
Now look at the results. The top function is the version written in plain Go, the bottom one uses vector addition; the only difference between the two is the function name. Throughput with plain Go was 285 MB/s; with vector instructions it is 3 GB/s, about a 10-fold improvement. In real-world optimization the gain is usually not that high; you only get this kind of improvement when your algorithm and data structures are already well implemented.
You may have noticed that some file names look odd. Why is there an arm64 at the end? The suffix tells the toolchain to build the file only for arm64 and to ignore it on other platforms.
Benchmarks are important. You may think your code or data structure is good, but you have not measured it, so how do you know? It takes tests and benchmarks to find out.
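A throughput number like the MB/s figures above comes out of Go's benchmark support once the benchmark reports how many bytes each iteration processes. Here is a minimal sketch, assuming a plain-Go element-wise add named addBytes as a stand-in for the speaker's demo code, which is not shown in this transcript. Normally this would live in a _test.go file and run under go test -bench, but testing.Benchmark lets it run from main as well.

```go
package main

import (
	"fmt"
	"testing"
)

// addBytes adds src into dst element by element (mod 256) over the first
// min(len(dst), len(src)) elements. Plain Go, one byte at a time.
func addBytes(dst, src []uint8) {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	for i := 0; i < n; i++ {
		dst[i] += src[i]
	}
}

func main() {
	const size = 64 * 1024 // buffer size is an arbitrary choice
	dst := make([]uint8, size)
	src := make([]uint8, size)
	for i := range src {
		src[i] = uint8(i)
	}

	res := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size) // bytes handled per iteration; enables MB/s reporting
		for i := 0; i < b.N; i++ {
			addBytes(dst, src)
		}
	})
	// The printed result includes ns/op and, because of SetBytes, MB/s.
	fmt.Println(res)
}
```

Running the same benchmark against an assembly version of the function gives the kind of side-by-side comparison described above.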
2.5 Debugging with GDB
For a binary built from Go code, how do you start the program in GDB? With run.
To set a breakpoint at a particular line or location, which matters a lot for assembly programs, use break.
To see what happens next, use next. Go does a good job here: you can even step through assembly line by line.
Go optimizations often use registers, so to inspect the registers use info registers.
Sometimes you want to look at global variables. How? Use examine (the x command) to inspect the address of a global variable or the memory a register points to; mostly you will use it with registers.
And finally hardware acceleration, which I do not have time to cover. Hardware acceleration is hard: you have to know the instruction set of a specific CPU very well.
That’s all, thank you!
[Q&A]
Question: I can't follow the assembly; a lot of it doesn't match up.
Munjo: The figure on the right is an example. Specifically, stack frames are implemented differently on each platform. Among my references, Xargin Cao, a developer at Didi in China, has also studied this and found that stack frames are implemented differently on x86 and arm64.
Question: So should we read the compiler code?
Munjo: No, read the runtime code. It has some documentation, but it is incomplete. If you are really in doubt, step through it once with GDB to find out.
Question: Loading many values at once reduces the number of reads. Is that trading space for time? Will it take up a lot of space?
Munjo: Yes, it trades register space for execution time. Registers are there to hold things, so isn't it good that I put them to full use?
Question: How should we go about learning assembly? I have read some of the official source code too. If I want to debug that source, how do I do it?
Munjo: Use GDB. Logs will not show you much, but you can watch a function's behavior, and you can set breakpoints.
Question: I can find the material, and I have the Go source code now, but how do I run it? For example, I might want to see how a piece of Go syntax works, or even how it is compiled. How do I debug that?
Munjo: Go itself is just a compiler, and everything it produces is machine-executable. Three parts are involved: the compiler, the linker, and finally the executable. You have to look at all three. Reading the Go source is not the problem; the toolchain just turns Go statements into a binary, that is all. What were you asking about the compilation process just now?
Question: How to change it.
Munjo: Look at the compiler code in the Go source tree.
Question: I want to see how it works and follow the execution steps in an IDE.
Munjo: In an IDE it works like any other Go program, with go run and so on.
Question: I have never got it to run. I just want to analyze how the language works and see how it compiles. I know the grammar is officially documented, but I want to change the grammar myself, re-implement part of it, and add syntax features for my own internal testing.
Other audience member: Go has a dedicated AST package.
Munjo: You need to study that part, from parsing through compiling and linking. The official Go documentation covers it, and the source is at golang.org/pkg/go.