In a previous article [compiler optimization, what the compiler actually does], we mentioned the compilation pipeline:

Code -> Lexical analysis -> Semantic analysis -> Intermediate code generation -> Object code generation

For the generated object code to be executed by the CPU, it must use the interface the CPU exposes, which is the CPU instruction set.

Common classifications

  • Complex Instruction Set (CISC)

Complex instructions are implemented in hardware. Complex-instruction-set architectures include x86 and x86_64.

  • Reduced Instruction Set (RISC)

Instructions are few, simple, and of equal length; the compiler combines them to perform complex operations. RISC architectures include ARM, among others.

The two designs start from different points: (1) RISC favors simple, commonly used instructions that occupy less space. The low instruction complexity keeps the CPU circuitry simpler, with lower power consumption and heat, but the same task takes more instructions to complete. (2) CISC provides complex instructions to reduce instruction count, implementing a function in fewer instructions; at the same time, that complexity brings its own problems, such as higher power consumption and harder compiler optimization.

Dazzling instruction sets

When it comes to instruction sets, you will often run into a dazzling array of names.

The instruction sets a CPU supports are also recorded in the flags field of /proc/cpuinfo.

You might see these keywords:

x86, x86_64, sse, sse2, sse3, sse4, avx, avx-512 ...
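As a small sketch, a C program can check for these flags at runtime by scanning /proc/cpuinfo. This is Linux-specific, and the helper name cpu_has_flag is our own, not a standard API; the check is a plain substring match, which is good enough for illustration (note it would also match e.g. "avx2" when asked for "avx"):

#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears in the flags line of /proc/cpuinfo,
   0 if it does not, -1 if the file cannot be read (e.g. non-Linux). */
int cpu_has_flag(const char *flag) {
    FILE *fp = fopen("/proc/cpuinfo", "r");
    if (!fp) return -1;
    char line[4096];
    int found = 0;
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "flags", 5) == 0 && strstr(line, flag) != NULL) {
            found = 1; /* substring match; fine for a sketch */
            break;
        }
    }
    fclose(fp);
    return found;
}

int main(void) {
    printf("sse2: %d\n", cpu_has_flag("sse2"));
    printf("avx:  %d\n", cpu_has_flag("avx"));
    return 0;
}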

In a nutshell, they fall into two types: the basic instruction set and extended instruction sets.

  • Basic instruction set (x86, x86_64)

Covers the CPU's basic logic control and data computation, e.g. ADD, DIV, OR, JMP.

  • Extended instruction sets (AVX, VT-x, AMD-V, SSE)

Extended instruction sets are provided for scenarios that the basic instructions cannot handle, or handle inefficiently.

x86, x86_64, AMD64?

x86 is a 32-bit instruction set architecture designed and launched by Intel; essentially all early CPUs used and supported it. x86_64 is the 64-bit extension of the x86 architecture. AMD was the first to introduce a 64-bit extension of x86, which it called AMD64.

SSE, AVX? SIMD!

We often see SSE and AVX instruction-set options in compiler flags or in the Linux kernel source. What's special about them? Both are actually implementations of the SIMD idea.

SIMD (Single Instruction, Multiple Data) uses one instruction to operate on multiple pieces of data, realizing parallel operation on small blocks of data. It is an extension of the CPU's basic instruction set.

For example, take the ADD instruction. The normal flow is instruction fetch, decode, and execute: the addition instruction is fetched and decoded, memory is accessed to obtain the first operand, memory is accessed again to obtain the second operand, and only then is the sum computed.

With a SIMD-capable instruction set, after the same fetch and decode, the CPU loads multiple operands from memory and computes on all of them with a single instruction.

The logical contrast between the two: a scalar ADD handles one pair of operands per instruction, while a SIMD ADD handles several pairs at once.

How can this process be achieved? Several questions are involved:

32-bit, 64-bit

When looking at operating systems, we always see that a system is 32-bit or 64-bit. For 64-bit, the first thing that comes to mind is the larger address space, which allows more than 4 GB of memory. But it has another implication: the word length the CPU processes at once grows from 32 bits to 64 bits. The per-operation word length doubles, and the general-purpose registers double in size accordingly.
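A quick way to see the word length on your own machine is to print the sizes of a pointer and a long. The expected values below assume a typical LP64 system such as 64-bit Linux; on a 32-bit (ILP32) system both would be 4 bytes:

#include <stdio.h>

int main(void) {
    /* LP64 (64-bit Linux): both 8 bytes. ILP32 (32-bit): both 4 bytes. */
    printf("sizeof(void *) = %zu\n", sizeof(void *));
    printf("sizeof(long)   = %zu\n", sizeof(long));
    return 0;
}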

64-bit registers perform 32-bit operations

Today's 64-bit CPUs use 64-bit general-purpose registers, so an operation on an ordinary 4-byte integer uses only the lower 32 bits of a register; the upper 32 bits go unused.

Are the unused upper 32 bits simply wasted?

Logically, no, you can take advantage of them. For example, if the instruction is an addition, the lower 32 bits and the upper 32 bits can in principle be computed separately and simultaneously. This is exactly how SIMD computes.
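This lane-splitting idea can even be sketched in plain C without any special instructions, a trick sometimes called SWAR ("SIMD within a register"). The function add_2x32 below is our own illustration, not a real CPU instruction:

#include <stdint.h>
#include <stdio.h>

/* Treat one 64-bit word as two independent 32-bit lanes and add
   both lanes with a single 64-bit addition. The masks keep a carry
   out of the low lane from spilling into the high lane. */
static uint64_t add_2x32(uint64_t a, uint64_t b) {
    const uint64_t LO = 0x00000000FFFFFFFFULL;
    const uint64_t HI = 0xFFFFFFFF00000000ULL;
    uint64_t lo = (a & LO) + (b & LO); /* a carry, if any, lands in bit 32 */
    uint64_t hi = (a & HI) + (b & HI); /* overflow past bit 63 wraps away  */
    return (lo & LO) | (hi & HI);      /* keep 32 bits per lane            */
}

int main(void) {
    uint64_t a = (3ULL << 32) | 5;  /* lanes: hi=3, lo=5   */
    uint64_t b = (7ULL << 32) | 9;  /* lanes: hi=7, lo=9   */
    uint64_t r = add_2x32(a, b);    /* lanes: hi=10, lo=14 */
    printf("hi=%llu lo=%llu\n",
           (unsigned long long)(r >> 32),
           (unsigned long long)(r & 0xFFFFFFFFULL));
    return 0;
}

Real SIMD hardware does the same lane bookkeeping in the circuit itself, so no masking instructions are needed.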

The register is only 64 bits

Implementing parallel operation runs into a problem: a single operand can already be up to 64 bits, but the register is only 64 bits wide. How is this solved? The answer is simple: bigger, wider registers.

How do we tell the CPU to parallelize?

The basic instruction set does not support such parallel operations, so a new extended instruction set is needed to express them.

In fact, the questions above basically capture the idea behind SIMD: use a larger register, divide it into multiple lanes, and extend the instruction set to operate on those lanes in parallel. SSE and AVX are concrete implementations of SIMD.

SIMD technology was pioneered by Intel: in 1996 it introduced the MMX extended instruction set, with 64-bit vector processing capability; SSE, SSE2, SSE3, and SSE4 followed, expanding vector processing from 64 bits to 128 bits. In 2007 AMD tried to preempt Intel with SSE5, and Intel launched AVX the following year.

Application of SIMD technology

Taking AVX as an example, let’s briefly understand the application of SIMD technology.

Before using it, let's look at the description of the AVX instruction set in the official documentation to see what it provides (Intel® Advanced Vector Extensions introduction):

Intel® Advanced Vector Extensions (Intel® AVX) is a set of instructions for performing single instruction multiple data (SIMD) operations on Intel® architecture CPUs. These instructions extend previous SIMD offerings (MMX™ instructions and Intel® Streaming SIMD Extensions, Intel® SSE) by adding the following features:
(1) The 128-bit SIMD registers are extended to 256 bits. Intel® AVX is designed to support 512 or 1024 bits in the future.
(2) Three-operand, non-destructive operations are added. Previous two-operand instructions performed A = A + B, overwriting a source operand; the new instructions can perform A = B + C, leaving the original source operands unchanged.
(3) A few instructions take four register operands, supporting smaller, faster code by removing unnecessary instructions.
(4) Memory alignment requirements for operands are relaxed.
(5) A new extension coding scheme (VEX) is designed to make future additions easier and to make instruction encoding smaller and faster.

Setting aside the relaxed alignment restrictions and the new encoding scheme, two of these extensions are of particular interest:

  1. Larger SIMD registers. A 256-bit register can hold four 64-bit operands in parallel, doubling SSE's 128 bits.

  2. Single instructions support three or four operands and are non-destructive. The official example illustrates this well: the original addition used EAX as a source operand and placed the result back in EAX, so only two operands were supported and the source operand in EAX was overwritten. Three-operand, non-destructive instructions make the code smaller and faster in some scenarios.

AVX

These extensions can also be observed in AVX's definitions:

AVX registers: YMM0 ~ YMM15

__m256, 256-bit single-precision vector
__m256i, 256-bit integer vector
__m256d, 256-bit double-precision vector

AVX instructions

The Intel® Intrinsics Guide lists all of the supported intrinsic APIs.

Double-precision vector addition: __m256d _mm256_add_pd (__m256d a, __m256d b)

Operation
FOR j := 0 to 3
	i := j*64
	dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:256] := 0
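A hedged sketch of what that pseudocode looks like from C. It assumes an AVX-capable x86 CPU; the #pragma lets GCC emit AVX code for this file even without the -mavx flag:

#pragma GCC target("avx")
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* setr = "set in reverse order", i.e. lanes listed low to high */
    __m256d a = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    __m256d b = _mm256_setr_pd(10.0, 20.0, 30.0, 40.0);

    /* One instruction adds all four 64-bit lanes; a and b are left
       unchanged (the non-destructive three-operand form). */
    __m256d c = _mm256_add_pd(a, b);

    double out[4];
    _mm256_storeu_pd(out, c); /* unaligned store: no 32-byte alignment needed */
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}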

High/low 128-bit store: _mm256_storeu2_m128i(__m128i* hiaddr, __m128i* loaddr, __m256i a)

Operation
MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
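The same high/low split can be sketched with the more widely available cast/extract intrinsics (useful where the header lacks _mm256_storeu2_m128i, which arrived relatively late in GCC). Again this is a sketch assuming an AVX-capable x86 CPU:

#pragma GCC target("avx")
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m256i a = _mm256_loadu_si256((const __m256i *)src);

    int32_t lo[4], hi[4];
    /* Equivalent of _mm256_storeu2_m128i(hi_ptr, lo_ptr, a): */
    _mm_storeu_si128((__m128i *)lo, _mm256_castsi256_si128(a));      /* a[127:0]   */
    _mm_storeu_si128((__m128i *)hi, _mm256_extractf128_si256(a, 1)); /* a[255:128] */

    printf("lo: %d..%d  hi: %d..%d\n", lo[0], lo[3], hi[0], hi[3]);
    return 0;
}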

The instructions alone are dry, so let's look at an example of the improvement AVX brings.

#include <stdio.h>
#include <time.h>

/* ARRAY_SIZE is assumed to be defined elsewhere */
int main(void) {
    unsigned long a[ARRAY_SIZE] = {0};
    unsigned long index;
    unsigned long sum = 0;

    for (index = 0; index < ARRAY_SIZE; index++) {
        a[index] = index % 2;
    }

    long start = clock();
    for (index = 0; index < ARRAY_SIZE; index++) {
        sum += a[index];
    }
    long end = clock();

    printf("Sum: %ld, Time cost: %lf\n", sum, (end - start) * 1.0 / CLOCKS_PER_SEC);
    return 0;
}

The code initializes array a, then sums its elements.

If optimized using AVX, the code would be rewritten as:

#include <stdio.h>
#include <time.h>
#include <immintrin.h>

/* ARRAY_SIZE is assumed to be defined elsewhere */
int main(void) {
    /* _mm256_load_pd requires 32-byte alignment, hence the attribute */
    double a[ARRAY_SIZE] __attribute__((aligned(32))) = {0};
    unsigned long index;
    __m256d sum = _mm256_setzero_pd();
    double ret[4] __attribute__((aligned(32))) = {0};

    for (index = 0; index < ARRAY_SIZE; index++) {
        a[index] = index % 2;
    }

    long start = clock();
    for (index = 0; index < ARRAY_SIZE; index += 4) {
        __m256d ax = _mm256_load_pd(a + index);
        sum = _mm256_add_pd(sum, ax);
    }
    _mm256_store_pd(ret, sum);
    long end = clock();

    double result = ret[0] + ret[1] + ret[2] + ret[3];
    printf("Sum: %lf, Time cost: %lf\n", result, (end - start) * 1.0 / CLOCKS_PER_SEC);
    return 0;
}

The modified code makes several changes: (1) floating-point calculation replaces integer calculation; (2) sum changes type to the 256-bit double-precision vector __m256d; (3) the summing loop uses AVX intrinsics: _mm256_load_pd loads four 64-bit floating-point numbers at a time, and _mm256_add_pd adds them to the running vector sum; (4) after the loop, _mm256_store_pd writes the four 64-bit lanes of the 256-bit vector back out, and they are added together to get the final result.

Since this code uses AVX intrinsics, compiling with GCC requires the -mavx flag. After compiling, the execution result is as follows:

As can be seen from the results, the time consumed after optimizing with AVX instructions is about 60% of that before optimization.

JVM + SIMD

In Java, the JVM actually turns SSE and AVX on by default, as described in the official documentation (the java command):

-XX:UseSSE=version
Enables the use of SSE instruction set of a specified version. Is set by default to the highest supported version available (x86 only).

-XX:UseAVX=version
Enables the use of AVX instruction set of a specified version. Is set by default to the highest supported version available (x86 only).