Go ARM64 Map optimization tips

Background

Go has a built-in map type, and the hash algorithm behind it is a variant of CityHash.

To avoid collision attacks and to speed up hashing, Keith Randall added a hardware-accelerated AESHASH algorithm for x86_64 back in Go 1.0. I searched the Internet and was surprised to find that this algorithm seems to be implemented only in Go; it is a brilliant idea.

I had been looking around the ARM64 Go runtime for things to optimize: ARM64 also has hardware AES instructions that could accelerate hashing, but Go was not using them.

The corners of my mouth curled into a smile and, full of joy, I got ready to add the code. Little did I know what monsters were hiding beneath this seemingly calm sea…

Getting started

To beat a snake, strike it seven inches below the head: the natural first step is to look at how the code is implemented. The initialization code lives in runtime/alg.go:

if (GOARCH == "386" || GOARCH == "amd64") && GOOS != "nacl" &&
	support_aes && // AESENC
	support_ssse3 && // PSHUFB
	support_sse41 { // PINSR{D,Q}
	useAeshash = true
	algarray[alg_MEM32].hash = aeshash32
	algarray[alg_MEM64].hash = aeshash64
	algarray[alg_STRING].hash = aeshashstr
	// Initialize with random data so hash collisions will be hard to engineer.
	getRandomData(aeskeysched[:])
	return
}

As you can see, the acceleration is achieved simply by swapping the hash functions in algarray for the aeshash variants. The code clearly leaves room for other platforms to follow up later, which felt like a kind invitation for me to finish the job.
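The natural extension, then, is an arm64 branch next to the x86 one. Below is a rough sketch of what it could look like; the arm64HasAES feature flag is my own placeholder for "the CPU supports AESE/AESMC", not the actual runtime variable.

// Hypothetical arm64 branch in runtime/alg.go, mirroring the x86 check above.
// arm64HasAES is an assumed name for the AES feature flag on arm64.
if GOARCH == "arm64" && arm64HasAES { // AESE/AESMC
	useAeshash = true
	algarray[alg_MEM32].hash = aeshash32
	algarray[alg_MEM64].hash = aeshash64
	algarray[alg_STRING].hash = aeshashstr
	// Initialize with random data so hash collisions will be hard to engineer.
	getRandomData(aeskeysched[:])
	return
}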

Let's take a look at the simplest of them, aeshash64:

//func aeshash64(p unsafe.Pointer, h uintptr) uintptr
TEXT runtime·aeshash64(SB),NOSPLIT,$0-24
	MOVQ	p+0(FP), AX	// ptr to data
	MOVQ	h+8(FP), X0	// seed
	PINSRQ	$1, (AX), X0	// data
	AESENC	runtime·aeskeysched+0(SB), X0
	AESENC	runtime·aeskeysched+16(SB), X0
	AESENC	runtime·aeskeysched+32(SB), X0
	MOVQ	X0, ret+16(FP)
	RET

The pointer to the data is loaded into AX, then the map's seed and the data itself are packed into X0 for the calculation. Note that every hashmap is initialized with a random seed, which prevents an attacker from recovering the seed and mounting a hash-collision attack. The following steps then encrypt the data with the random key runtime·aeskeysched, generated at program startup, and the result is returned.

The more complex aeshash variants simply load data of various lengths for the same calculation.
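In current Go versions the same seed-dependent hashing is exposed through the hash/maphash package. Here is a small standalone illustration, independent of the runtime internals above: the same input hashes to different values under different seeds, which is what makes engineered collisions hard.

package main

import (
	"fmt"
	"hash/maphash"
)

func main() {
	// Two different random seeds, as each map effectively gets at runtime.
	s1 := maphash.MakeSeed()
	s2 := maphash.MakeSeed()

	var h maphash.Hash
	h.SetSeed(s1)
	h.WriteString("hello")
	fmt.Printf("seed1: %#x\n", h.Sum64())

	h.SetSeed(s2) // switching the seed discards previous state
	h.WriteString("hello")
	fmt.Printf("seed2: %#x\n", h.Sum64())
}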

At this point I could only sigh at how easy it looked. It still took me two weekends to get the general code written, running into the following problems:

  1. Platform differences
  2. Smhasher
  3. Collisions
  4. Go compiler bug

Platform differences

The first problem I discovered was that ARM64 has no equivalent of x86's AESENC; the operation is split into two instructions, AESE and AESMC.

Standard AES encryption is divided into 4 steps:

  1. AddRoundKey
  2. SubBytes
  3. ShiftRows
  4. MixColumns

The X86 AESENC instruction is equivalent to:

  1. SubBytes
  2. ShiftRows
  3. MixColumns
  4. data XOR Key

But… The AESE instruction for ARM64 is equivalent to:

  1. AddRoundKey
  2. SubBytes
  3. ShiftRows

So if you just mimic the x86 form AESENC X0, X0 and write AESE with the same register for both operands… the data is wiped clean (it broke): AESE's AddRoundKey step XORs the register with itself, so the state is zeroed before SubBytes ever runs. With no direct equivalent, I had to find another way to encrypt the data with the system's random seed. The idea in code:

// Load the system seeds into V1
// Load the seed and data into V2
AESE	V1.B16, V2.B16

Smhasher & Collision

Having solved that problem, I started testing. Go uses Smhasher, and a hash function has to pass the following tests:

  1. Sanity: the whole key has to be processed
  2. AppendedZeros: keys padded with zeros to different lengths
  3. SmallKeys: all combinations of small keys (< 3 bytes)
  4. Cyclic: cyclic shifts of the same blocks, e.g. 11211 -> 11121
  5. Sparse: keys with only a few bits set, e.g. 0b00001 and 0b00100
  6. Permutation: the same blocks combined in different orders
  7. Avalanche: flip every bit of the input
  8. Windowed: e.g. for a 32-bit hash, keys that differ only within a 20-bit window must still hash differently
  9. Seed: changing the seed must affect the result

Every time Smhasher reported an error I went looking for what was wrong in the code; it was usually something simple, like using the wrong register… Go also tests how a map's keys are distributed across buckets, and that can fail too if too many keys land in one bucket. This part, however, went smoothly.
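As a flavour of what these checks look like, here is a toy, standalone version of the "flip every bit" idea, written against hash/maphash in current Go rather than the actual Smhasher code. It only verifies that flipping any single input bit changes the hash; the real Avalanche test also measures statistical bias.

package main

import (
	"fmt"
	"hash/maphash"
)

// hashBytes hashes b with the given seed.
func hashBytes(seed maphash.Seed, b []byte) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	h.Write(b)
	return h.Sum64()
}

func main() {
	seed := maphash.MakeSeed()
	key := []byte("some key to be hashed")
	base := hashBytes(seed, key)

	// Flip every single bit of the key and make sure the hash changes.
	for i := range key {
		for bit := uint(0); bit < 8; bit++ {
			key[i] ^= 1 << bit
			if hashBytes(seed, key) == base {
				fmt.Printf("hash unchanged after flipping byte %d, bit %d\n", i, bit)
			}
			key[i] ^= 1 << bit // restore the original key
		}
	}
	fmt.Println("done")
}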

Go compiler bug

With that done, I found another hole. To reduce the instruction count, I wanted to load general-purpose registers directly into specific lanes of the ARM64 vector registers.

But Go's assembler could not encode different lane indexes correctly; for example, the following two instructions produce the same machine code…

VMOV R1, V2.D[0]
VMOV R1, V2.D[1]

I had to report a bug first:

cmd/asm: wrong implement vmov/vld on arm64

For now I fell back to emitting the raw instruction encoding directly; see the Tips section at the end for how to get it.

The devil

Finally, all the runtime Smhasher and hash tests passed, and I started trying to build Go itself by running src/all.bash.

Then I reached the monster at the bottom of the sea…

The build log

$ ./all.bash
Building Go cmd/dist using /usr/lib/go-1.6.
Building Go toolchain1 using /usr/lib/go-1.6.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
# runtime
duplicate type..hashfunc.struct { "".full "".lfstack; "".empty "".lfstack; "".pad0 [64]uint8; "".wbufSpans struct { "".lock "".mutex; "".free "".mSpanList; "".busy "".mSpanList }; _ uint32; "".bytesMarked uint64; "".markrootNext uint32; "".markrootJobs uint32; "".nproc uint32; "".tstart int64; "".nwait uint32; "".ndone uint32; "".alldone "".note; "".helperDrainBlock bool; "".nFlushCacheRoots int; "".nDataRoots int; "".nBSSRoots int; "".nSpanRoots int; "".nStackRoots int; "".markrootDone bool; "".startSema uint32; "".markDoneSema uint32; "".bgMarkReady "".note; "".bgMarkDone uint32; "".mode "".gcMode; "".userForced bool; "".totaltime int64; "".initialHeapLive uint64; "".assistQueue struct { "".lock "".mutex; "".head "".guintptr; "".tail "".guintptr }; "".sweepWaiters struct { "".lock "".mutex; "".head "".guintptr }; "".cycles uint32; "".stwprocs int32; "".maxprocs int32; "".tSweepTerm int64; "".tMark int64; "".tMarkTerm int64; "".tEnd int64; "".pauseNS int64; "".pauseStart int64; "".heap0 uint64; "".heap1 uint64; "".heap2 uint64; "".heapGoal uint64 }
build failed

Huh… a compiler build error?? But all the tests passed!?

Running all.bash again, the error was not the same as before.

Using GDB breakpoints on the aeshash I wrote in asm_arm64.s to trace the execution flow, hashing long strings (>129 bytes) looked fine. Was it a compiler bug?

So I started tracing what the compiler does and found that only the symbol table uses the map functionality. After catching up on compiler fundamentals and Go's implementation, I realized the symbol table only uses aeshashstr (string hashing). I switched all the hashes in Smhasher over to aeshashstr, and it still, miraculously, passed the tests. And aeshash32 and aeshash64 produced results identical to the x86 implementation.
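One way to probe this area from outside the runtime is to hash a large number of distinct, symbol-like long strings and check that none of them collide. A standalone sketch of that kind of check using hash/maphash (not the runtime's aeshashstr; the 256-byte prefix length is my own choice):

package main

import (
	"fmt"
	"hash/maphash"
	"strings"
)

func main() {
	seed := maphash.MakeSeed()
	seen := make(map[uint64]string)

	// Hash many distinct symbol-like strings longer than 256 bytes and
	// check that none of their 64-bit hashes collide.
	for i := 0; i < 1<<16; i++ {
		s := strings.Repeat("x", 256) + fmt.Sprintf("type..hashfunc.%d", i)
		var h maphash.Hash
		h.SetSeed(seed)
		h.WriteString(s)
		sum := h.Sum64()
		if prev, ok := seen[sum]; ok {
			fmt.Printf("collision between %q and %q\n", prev[len(prev)-20:], s[len(s)-20:])
		}
		seen[sum] = s
	}
	fmt.Println("checked", len(seen), "strings")
}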

So I sent emails, made posts, and asked about this weird problem in various groups. Still no solution.

Over about a month of spare time I read most of the relevant compiler code, and it was clear that two different symbols were being treated as the same one. In desperation I even offered money to see whether anyone would debug it for me, an attitude that rubbed some people the wrong way. I guess this bug was messing with my head…

Out of the pit

Until recently, it occurred to me: maybe the Smhasher tests don't cover everything? Sure enough, a closer look at the aeshash129plus section turned up:

SUBS	$1, R1, R1
BHS	aesloop

SUBS is the flag-setting form of SUB, and BHS keeps branching as long as the subtraction does not borrow, so the loop does not stop when R1 reaches zero but runs one extra iteration. Smhasher only tests keys up to 128 bytes exhaustively, so the tests pass, yet the compiler fails to build: the bug only fires on keys larger than 256 bytes.

The fix:

SUB	$1, R1, R1
CBNZ	R1, aesloop
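To see the off-by-one concretely, here is a toy Go model of the two exit conditions; it mimics the flag/branch behaviour of the instructions, it is not the actual assembly.

package main

import "fmt"

// iterBHS models: SUBS $1, R1, R1; BHS aesloop
// (the loop continues while the subtraction does not borrow, i.e. old R1 >= 1).
func iterBHS(n uint64) int {
	count := 0
	r1 := n
	for {
		count++
		borrow := r1 < 1
		r1--
		if borrow { // carry clear, BHS not taken
			break
		}
	}
	return count
}

// iterCBNZ models: SUB $1, R1, R1; CBNZ R1, aesloop
// (the loop continues while the decremented counter is non-zero).
func iterCBNZ(n uint64) int {
	count := 0
	r1 := n
	for {
		count++
		r1--
		if r1 == 0 {
			break
		}
	}
	return count
}

func main() {
	for _, n := range []uint64{1, 2, 4} {
		fmt.Printf("blocks=%d  BHS iterations=%d  CBNZ iterations=%d\n", n, iterBHS(n), iterCBNZ(n))
	}
}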

Finally, I could submit the CL:

runtime: implement aeshash for arm64 platform

Note that using the PRFM (prefetch) instruction makes Hash1024 about 30-40 MB/s faster. That may be the focus of the next round of optimization (alignment and cache behaviour).

name       old speed      new speed      delta
Hash5      97.0MB/s ± 0%  97.0MB/s ± 0%  -0.03%  (p=0.008 n=5+5)
Hash16      329MB/s ± 0%   329MB/s ± 0%     ~    (p=0.302 n=4+5)
Hash64      858MB/s ±20%   890MB/s ±11%     ~    (p=0.841 n=5+5)
Hash1024   3.50GB/s ±16%  3.57GB/s ± 7%     ~    (p=0.690 n=5+5)
Hash65536  4.54GB/s ± 1%  4.57GB/s ± 0%     ~    (p=0.310 n=5+5)
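For anyone who wants to reproduce this kind of measurement without touching the runtime, a rough equivalent can be written as a regular Go benchmark against hash/maphash; the sizes below mirror the table above, but the numbers will of course not match the runtime's internal Hash benchmarks exactly.

package hashbench

import (
	"hash/maphash"
	"testing"
)

var seed = maphash.MakeSeed()

// benchmarkHash measures hashing throughput for a key of the given size.
func benchmarkHash(b *testing.B, size int) {
	data := make([]byte, size)
	b.SetBytes(int64(size))
	for i := 0; i < b.N; i++ {
		var h maphash.Hash
		h.SetSeed(seed)
		h.Write(data)
		_ = h.Sum64()
	}
}

func BenchmarkHash5(b *testing.B)     { benchmarkHash(b, 5) }
func BenchmarkHash16(b *testing.B)    { benchmarkHash(b, 16) }
func BenchmarkHash64(b *testing.B)    { benchmarkHash(b, 64) }
func BenchmarkHash1024(b *testing.B)  { benchmarkHash(b, 1024) }
func BenchmarkHash65536(b *testing.B) { benchmarkHash(b, 65536) }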

Tips

How do you get the native encoding of an ARM64 instruction from the GNU assembler?

$ cat vld.s
ld1 {v2.b}[14], [x0]   

$ as -o vld.o -al vld.lis vld.s
AARCH64 GAS  vld.s                      page 1    

   1 0000 0218404D      ld1 {v2.b}[14], [x0] 

The third column is the instruction encoding, which can be pasted into the Go assembly (note the byte order: the listing shows the bytes in memory order, while WORD takes the 32-bit value):

WORD $0x4D401802

There is also a tool called asm2plan9s for this, but it does not currently support ARM64.

Thank you

Finally, many thanks to

  • Wei Xiao
  • Fangming
  • Keith Randall

for their meticulous help.