Recap

Before reading, you can watch my JTalk sharing first: www.bilibili.com/video/BV1UZ…

This article is a detailed write-up of the final section of that talk, "optimizing with SIMD."

TL;DR

Encoding gets about 6x faster for list&lt;i64&gt; and more than 12x faster for list&lt;i32&gt;.

Background

With the FastRead/Write interface, we already have the whole buffer in memory, so we can try SIMD for further optimization.

The idea

The most obvious optimization target is also the most common type in company usage: list&lt;i64/i32&gt;, which naturally lends itself to SIMD.

In the thrift binary protocol, integers are converted to big-endian (binary.BigEndian.PutUint32/64) before being copied into the buffer. On AMD64 there is a BSWAP instruction that does this byte swap directly, and the Go compiler already performs this optimization, so the current code is roughly this pseudocode:

var src, dst []int64
for i := 0; i < len(src); i++ {
    dst[i] = bswap(src[i]) // compiled to a single BSWAP on amd64
}

As you can see, the operation is highly regular and touches adjacent elements, which is exactly the access pattern SIMD instructions are designed for.
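As a concrete baseline sketch of the scalar path (function name is mine, not from the original code), this is what the element-by-element big-endian encoding looks like in Go; the compiler recognizes the binary.BigEndian.PutUint64 pattern and emits a BSWAP plus a single store per element on amd64:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBigEndian copies src into a byte buffer, converting each
// int64 to big-endian on the way. The PutUint64 call is recognized
// by the Go compiler and lowered to BSWAP + MOV on amd64.
func encodeBigEndian(src []int64) []byte {
	dst := make([]byte, 8*len(src))
	for i, v := range src {
		binary.BigEndian.PutUint64(dst[8*i:], uint64(v))
	}
	return dst
}

func main() {
	// Each int64 becomes 8 bytes with the most significant byte first.
	fmt.Printf("% x\n", encodeBigEndian([]int64{1, 2}))
	// → 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 02
}
```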

POC

I first wrote a POC in C++ (only the key code is posted here; the complete code is at gist.github.com/PureWhiteWu…):

#include <immintrin.h>
#include <byteswap.h>

const long long int MASK = 0x0001020304050607;
const __mmask16 bit16mask[17] = {
    0x0000, 0x0001, 0x0003, 0x0007, 0x000f, 0x001f, 0x003f, 0x007f,
    0x00ff, 0x01ff, 0x03ff, 0x07ff, 0x0fff, 0x1fff, 0x3fff, 0x7fff, 0xffff};

void avx512_little_2_big(const long long int *src, long long int *dst, int n) {
    int loop_count = n / 8;
    int remainder = n % 8;
    __m512i mask = _mm512_set1_epi64(MASK);
    for (int i = 0; i < loop_count; i++) {
        int index = i * 8;
        __m512i input_data = _mm512_loadu_si512(&src[index]);
        __m512i output_data = _mm512_shuffle_epi8(input_data, mask);
        _mm512_storeu_si512(&dst[index], output_data);
    }
    if (remainder != 0) {
        // Handle the tail with a masked load/store so we never touch
        // memory past the end of the slice.
        int index = loop_count * 8;
        __m512i padding = _mm512_set1_epi64(0);
        __m512i input_data =
            _mm512_mask_loadu_epi64(padding, bit16mask[remainder], &src[index]);
        __m512i output_data = _mm512_shuffle_epi8(input_data, mask);
        _mm512_mask_storeu_epi64(&dst[index], bit16mask[remainder], output_data);
    }
}

void avx2_little_2_big(const long long int *src, long long int *dst, int n) {
    int loop_count = n / 4;
    int remainder = n % 4;
    __m256i mask = _mm256_set1_epi64x(MASK);
    for (int i = 0; i < loop_count; i++) {
        int index = i * 4;
        __m256i input_data = _mm256_loadu_si256((__m256i *)&src[index]);
        __m256i output_data = _mm256_shuffle_epi8(input_data, mask);
        _mm256_storeu_si256((__m256i *)&dst[index], output_data);
    }
    if (remainder != 0) {
        // Tail elements fall back to scalar bswap.
        int index = loop_count * 4;
        for (int i = index; i < index + remainder; i++) {
            dst[i] = bswap_64(src[i]);
        }
    }
}

The test results

The compile command is as follows:

$ g++ little_2_big_gcc.cpp -o ll2 -mavx512f -mavx512bw -mavx2 -mavx -O3

The test results on a Linux physical machine are as follows:

avx512 time: 27009 us
avx2 time: 21920 us
bswap time: 49967 us

It can be concluded that:

  1. AVX-512 performance is unstable and in some cases worse than AVX2;
  2. Compared with the plain BSWAP approach, AVX2 roughly doubles the performance;
  3. Linus did not lie to me (about AVX-512).

Detailed explanation

What bswap does is reverse the byte order of a value. Take int32, which contains 4 bytes, and suppose the original data is:

00000000 00000001 00000010 00000011

So after bswap, the data is:

00000011 00000010 00000001 00000000

AVX2 also has an instruction, vpshufb, that can achieve a similar effect. It is not a pure bswap, though; see: software.intel.com/content/www…

Shuffle, as in shuffling cards, rearranges byte positions according to an incoming mask. So the key is the first line of the code example:

const long long int MASK = 0x0001020304050607;

Why does this mask work? We need to revisit big- and little-endianness.

Take int32 as an example again, and suppose a value whose bytes, written from the most significant to the least significant, are:

00000011 (most significant byte) 00000010 00000001 00000000

On our (little-endian) machines, this value is stored in memory like this:

Memory address: 0        1        2        3 (most significant byte)
Value:          00000000 00000001 00000010 00000011

The corresponding MASK here is 0x00010203, which is laid out in little-endian order in memory as:

Memory address: 0 1 2 3 (most significant byte)
Value:          3 2 1 0

During the shuffle, output byte 0 takes the source byte at address 3 (the mask value at position 0), output byte 1 takes the byte at address 2, and so on.

After the shuffle, the memory becomes:

Memory address 0 1 2 3
value 00000011 00000010 00000001 00000000

(Figure: the shuffle operation)

At this point, a bswap operation has been successfully completed.
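The shuffle semantics can be sketched in pure Go as a scalar model (this is an illustration of the byte-rearranging rule described above, not the actual vpshufb intrinsic):

```go
package main

import "fmt"

// shuffle models the behaviour described above: output byte i takes
// the input byte at index mask[i].
func shuffle(in, mask []byte) []byte {
	out := make([]byte, len(in))
	for i := range mask {
		out[i] = in[mask[i]]
	}
	return out
}

func main() {
	// The int32 value stored at addresses 0..3, as in the tables above.
	in := []byte{0x00, 0x01, 0x02, 0x03}
	// Little-endian memory layout of MASK 0x00010203 is 3 2 1 0.
	mask := []byte{3, 2, 1, 0}
	fmt.Printf("% x\n", shuffle(in, mask)) // → 03 02 01 00
}
```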

Similarly, an int64 has 8 bytes, so the mask 0x0001020304050607 performs a bswap on each int64.

(Note: no zero key was abused in the writing of this section.)

Go test results

Here is a benchmark with 12,345 elements in the List:

BenchmarkWriteListI64
BenchmarkWriteListI64-16         703928         1753 ns/op
BenchmarkWriteI64
BenchmarkWriteI64-16              98204        11875 ns/op
BenchmarkWriteListI32
BenchmarkWriteListI32-16        1300507          907 ns/op
BenchmarkWriteI32
BenchmarkWriteI32-16              98522        12580 ns/op

As you can see, the Go performance improves substantially: about six times for List&lt;i64&gt;, and more than ten times for List&lt;i32&gt;.

The gains are even larger than in the C++ POC because the Go compiler's optimizer is much weaker than GCC's, leaving more on the table for hand-written SIMD.
