What is SVE?

The NEON instruction set is the standard single instruction, multiple data (SIMD) implementation of the ARM64 architecture. The Scalable Vector Extension (SVE) is a new vector instruction set developed for areas such as high-performance computing (HPC) and machine learning. It is a next-generation SIMD instruction set rather than a simple extension of NEON. SVE shares many concepts with NEON, such as vectors, lanes, and data elements. SVE also introduces a new concept: vector-length agnostic (VLA) programming.

Traditional SIMD instruction sets use fixed-size vector registers; NEON, for example, has 128-bit vector registers. SVE, which supports the VLA programming model, has variable-length vector registers. This lets the chip designer choose an appropriate vector length based on workload and cost. An SVE vector register is at least 128 bits and at most 2048 bits long, in increments of 128 bits. The SVE design ensures that the same application can run on machines with different SVE vector lengths without being recompiled, which is the essence of the VLA programming model.
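The VLA idea can be illustrated with a small Python model. This is only a sketch, not real SVE code: the function name `vla_add` and the `vl_bits` parameter are hypothetical, and the per-iteration `active` list merely stands in for the predicate that an SVE loop would use for the final partial vector. The point is that the loop never hard-codes the number of lanes.

```python
def vla_add(a, b, vl_bits, elem_bits=32):
    """Add two equal-length lists one hardware vector at a time.

    vl_bits stands in for the vector length chosen by the chip designer
    (an assumption of this model); the loop is identical for every value.
    """
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    lanes = vl_bits // elem_bits      # lanes per vector, discovered at run time
    mask = (1 << elem_bits) - 1       # each lane wraps around independently
    out = []
    for start in range(0, len(a), lanes):
        # Lanes that would run past the end of the data are inactive.
        active = [start + i < len(a) for i in range(lanes)]
        for i, on in enumerate(active):
            if on:
                out.append((a[start + i] + b[start + i]) & mask)
    return out


data_a = list(range(10))
data_b = [100] * 10
# The same function, unchanged, "runs" on four different vector lengths.
results = {vl: vla_add(data_a, data_b, vl) for vl in (128, 256, 512, 2048)}
```

Every entry in `results` holds the same sums; only the number of loop iterations differs between vector lengths.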

SVE is a new instruction set built on the A64 instruction set, and SVE2, released with the ARMv9 architecture, is a superset and extension of SVE.

The SVE instruction set contains hundreds of instructions, which can be grouped into the following categories.

ø Load, store, and prefetch instructions

ø Vector move instructions

ø Integer operation instructions

ø Bitwise operation instructions

ø Floating-point operation instructions

ø Predicate operation instructions

ø Data element operation instructions

SVE2 further expands and improves on SVE, adding new instructions and extensions. This section does not cover each instruction in detail; interested readers can consult the ARMv9 instruction set documentation: Arm A64 Instruction Set Architecture: Armv9, for Armv9-A architecture profile.

What is vector computation?

SIMD stands for Single Instruction, Multiple Data: a single instruction operates on multiple data items, providing data-level parallelism for small data elements. ARM introduced the NEON instruction set extension in the ARMv7 architecture to support vectorized parallel computing for image processing, audio and video processing, video codecs, and similar scenarios.

SISD stands for Single Instruction, Single Data. Most ARM64 instructions are SISD instructions: each one performs its operation on a single data item, so processing multiple data items requires multiple instructions. For example, performing four additions on four pairs of registers takes four instructions.

ADD w0, w0, w5

ADD w1, w1, w6

ADD w2, w2, w7

ADD w3, w3, w8

 

If the data elements are small, such as when adding 8-bit values, each 8-bit value needs to be loaded into a separate 64-bit register. Because processors, registers, and data paths are designed for 64-bit computing, performing a large number of individual operations on a small data size is not an efficient use of machine resources.

SIMD stands for single instruction, multiple data: one instruction performs the same operation on multiple data elements simultaneously. These data elements are packed into separate lanes of a larger register. For example, the following ADD instruction adds 32-bit data elements. The values are packed into the lanes of two 128-bit source registers, V1 and V2. Each lane of the first source register is added to the corresponding lane of the second source register, and the result is stored in the same lane of the destination register (V0).

ADD V0.4S, V1.4S, V2.4S

As shown in Figure 1, the ADD instruction performs four additions in parallel, one in each of four compute lanes inside the processor, and the four additions are independent of each other. An overflow or carry in one lane does not affect the other lanes.

V0.4S[0] = V1.4S[0] + V2.4S[0]

V0.4S[1] = V1.4S[1] + V2.4S[1]

V0.4S[2] = V1.4S[2] + V2.4S[2]

V0.4S[3] = V1.4S[3] + V2.4S[3]

In Figure 1, a 128-bit vector register, Vn, holds four 32-bit elements (Sn) at the same time. It can also hold two 64-bit elements (Dn), eight 16-bit elements (Hn), or sixteen 8-bit elements (Bn).
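The lane-wise behaviour described above can be modelled in a few lines of Python. This is a rough sketch rather than an emulator: `lane_count` and `simd_add` are hypothetical helper names, and a register is represented simply as a list of lane values.

```python
REGISTER_BITS = 128  # width of one NEON vector register


def lane_count(elem_bits):
    """Number of lanes a 128-bit register holds for a given element width."""
    return REGISTER_BITS // elem_bits


def simd_add(v1, v2, elem_bits=32):
    """Lane-wise add; a carry out of one lane never reaches its neighbour."""
    lanes = lane_count(elem_bits)
    if len(v1) != lanes or len(v2) != lanes:
        raise ValueError(f"expected {lanes} lanes per register")
    mask = (1 << elem_bits) - 1
    return [(x + y) & mask for x, y in zip(v1, v2)]


v1 = [1, 2, 3, 0xFFFFFFFF]
v2 = [10, 20, 30, 1]
v0 = simd_add(v1, v2)  # lane 3 wraps around to 0 without disturbing lane 2
```

Calling `lane_count` with 64, 32, 16, and 8 gives 2, 4, 8, and 16 lanes, matching the Dn, Sn, Hn, and Bn layouts.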

SIMD is well suited to image processing. Commonly used image formats include RGB565, RGBA8888, and YUV422. In a format such as RGBA8888, each component of a pixel (A, R, G, and B) is represented by 8 bits. If a traditional scalar instruction does the calculation, only the low 8 bits of a register are used for the data even though the register is 32 or 64 bits wide, which wastes most of it. If a 64-bit register is split into eight 8-bit lanes, eight operations can be performed at the same time, for a computational speedup of up to eight times.
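To make the "eight 8-bit lanes in one 64-bit register" idea concrete, here is a minimal Python sketch that packs eight pixel components into one 64-bit integer and adds them all with a single expression. It uses a well-known masking trick to stop carries from crossing lane boundaries; the helper names are hypothetical, and the addition wraps around modulo 256 (real image code would more likely use a saturating add).

```python
LOW7 = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every byte
HIGH1 = 0x8080808080808080  # top bit of every byte


def pack8(values):
    """Pack eight 8-bit values into one 64-bit integer, lane 0 in the low byte."""
    if len(values) != 8 or any(not 0 <= v <= 0xFF for v in values):
        raise ValueError("expected eight values in the range 0..255")
    return int.from_bytes(bytes(values), "little")


def unpack8(word):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return list(word.to_bytes(8, "little"))


def add8x8(a, b):
    """Add eight 8-bit lanes at once; carries never cross a lane boundary."""
    # Add the low 7 bits of each lane, then fix up the top bit with XOR.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH1)


pixels = [0, 10, 100, 200, 250, 255, 128, 127]
brighter = unpack8(add8x8(pack8(pixels), pack8([10] * 8)))
```

Here `brighter` contains each component plus 10, with 250 and 255 wrapping around to 4 and 9 inside their own lanes.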

In summary, the difference between SISD and SIMD is shown in Figure 2.

Vectors and lanes

Vector formats are often used in SIMD instructions. A vector is divided into lanes, each of which holds one vector element. As shown in Figure 3, a Vn vector register can be divided into eight 16-bit lanes: lane 0, lane 1, and so on.

A lane can hold any of several data types: 128-bit data is denoted Vn, 64-bit data Dn, 32-bit data Sn, 16-bit data Hn, and 8-bit data Bn, as shown in Figure 4.

In the vector instruction sets (NEON and SVE), instructions usually fall into two categories: vector operation instructions and scalar operation instructions. A vector operation is performed on all lanes of the vector register at the same time, while a scalar operation is performed on only one lane of the vector register.

SVE register group

The SVE instruction set provides a whole new set of registers.

ø 32 new variable-length vector registers, Z0 to Z31.

ø 16 predicate registers, P0 to P15.

ø The First Fault Register (FFR).

ø The SVE control registers, ZCR_ELx.

(1) Variable length vector register Z

The Z registers are variable-length data registers. Their length is a multiple of 128 bits, up to 2048 bits. The data in a Z register can be interpreted as 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit data elements, as shown in Figure 5. The low 128 bits of each Z register are shared with the corresponding NEON register.

(2) Predicate register P

A P register holds one bit for each byte of a Z register, so a P register is always 1/8 the size of a Z register. Predicated instructions use a P register to determine which vector elements (lanes) to process. Each bit in the P register specifies whether the corresponding byte in the Z register is active or inactive.

When the data elements are 8 bits wide (Bn), the P register uses 1 bit per element to indicate its state: 1 for active and 0 for inactive. By analogy, when the data elements are 64 bits wide (Dn), 8 bits of the P register correspond to each data element in the Z register, but only the lowest of those bits indicates the active state; the other bits are reserved.

Assume the vector register is 256 bits long and is divided into 8 lanes, each holding a 32-bit element. To operate on all 8 lanes at once in an SVE instruction, a P register is needed to describe the state of these 8 lanes. As shown in Figure 6, the Pn register is likewise divided into 8 groups of 4 bits each, and each group uses only its lowest bit to indicate the active state of the corresponding 32-bit lane in the Zn register. For example, Bit[3:0] of the Pn register corresponds to lane 0 of the Zn register, Bit[7:4] corresponds to lane 1, and so on.
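The bit layout just described can be checked with a short Python sketch. The helper names `predicate_value` and `decode_predicate` are hypothetical, and the P register is modelled as a plain integer; the only rule encoded is the one above, one predicate bit per byte with the lowest bit of each group carrying the lane state.

```python
def predicate_value(lanes_on, elem_bits, vl_bits=256):
    """Build the integer value of a P register with the given lanes active."""
    bits_per_lane = elem_bits // 8      # one predicate bit per byte of Z
    lanes = vl_bits // elem_bits
    p = 0
    for lane in lanes_on:
        if not 0 <= lane < lanes:
            raise ValueError(f"lane {lane} out of range 0..{lanes - 1}")
        # Only the lowest bit of each group marks the lane as active.
        p |= 1 << (lane * bits_per_lane)
    return p


def decode_predicate(p, elem_bits, vl_bits=256):
    """Return the list of active lane numbers encoded in a P register value."""
    bits_per_lane = elem_bits // 8
    lanes = vl_bits // elem_bits
    return [lane for lane in range(lanes) if (p >> (lane * bits_per_lane)) & 1]


# All eight 32-bit lanes of a 256-bit vector active: one bit set in every nibble.
all_s = predicate_value(range(8), elem_bits=32)
```

With 32-bit elements the fully active predicate is `0x11111111`, while with 8-bit elements every one of the 32 predicate bits is used.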

(3) FFR register

The FFR has the same size and format as a predicate register P. It is used by first-fault load instructions, such as LDFF1D. When vector elements are loaded with a first-fault load instruction, the FFR records the load status of each data element, that is, whether the load succeeded or failed.
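A simplified Python model may help show how the FFR reports partial success. It rests on several assumptions that are not stated in the text above: memory is a Python list, any index outside the list counts as a faulting address, all lanes are active, a fault on the first element is taken as a normal exception, and a fault on a later element clears the FFR bit for that lane and every lane after it. The function name `ldff1` is hypothetical.

```python
def ldff1(memory, base, lanes, ffr):
    """Model a first-fault load of `lanes` elements starting at index `base`.

    Returns (values, new_ffr). Lanes that were not loaded are left as 0 here;
    this model does not claim to match what hardware leaves in them.
    """
    values = [0] * lanes
    new_ffr = list(ffr)
    for lane in range(lanes):
        addr = base + lane
        if not 0 <= addr < len(memory):
            if lane == 0:
                # The first element behaves like an ordinary load.
                raise MemoryError("fault on the first element")
            # Later faults are suppressed and recorded in the FFR instead.
            for later in range(lane, lanes):
                new_ffr[later] = False
            break
        values[lane] = memory[addr]
    return values, new_ffr


mem = [1, 2, 3, 4, 5, 6]
vals, ffr_after = ldff1(mem, base=4, lanes=4, ffr=[True] * 4)
```

In this run the first two lanes load successfully and the FFR marks lanes 2 and 3 as failed, so software can retry from the first failed element.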

(4) ZCR_ELx register

System software can set the length of the vector registers through the LEN field in the ZCR_ELx register. However, the configured length cannot exceed the length implemented by the hardware.
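As a rough illustration of the capping rule, the sketch below computes an effective vector length from a LEN value. It assumes that LEN is a 4-bit field requesting (LEN + 1) × 128 bits, and that the hardware supports every multiple of 128 bits up to its maximum; neither detail comes from the text above, and the function name is hypothetical.

```python
def effective_vl_bits(zcr_len, hw_max_bits):
    """Vector length in bits requested through LEN, capped by the hardware.

    Assumption of this sketch: every multiple of 128 bits up to
    hw_max_bits is supported by the implementation.
    """
    if not 0 <= zcr_len <= 15:
        raise ValueError("LEN is modelled as a 4-bit field")
    if hw_max_bits % 128 != 0 or not 128 <= hw_max_bits <= 2048:
        raise ValueError("hardware length must be a multiple of 128 in [128, 2048]")
    requested = (zcr_len + 1) * 128
    return min(requested, hw_max_bits)


# On hardware with 512-bit vectors, asking for the maximum still yields 512 bits.
capped = effective_vl_bits(15, hw_max_bits=512)
```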

SVE instruction syntax

The syntax of SVE instructions is quite different from that of NEON instructions. An SVE instruction consists of an opcode, a destination register, a predicate register, and source operands. Here are some examples.

[Example 1]

The following is the format of an LD1D instruction.

LD1D { <Zt>.D }, <Pg>/Z, [<Xn|SP>, <Xm>, LSL #3]

Where:

ø Zt is the destination vector register. Z0 to Z31 can be used.

ø D is the data type of the lanes in the vector register (64-bit doubleword elements).

ø Pg is the governing predicate operand. In this encoding it is limited to P0 to P7.

ø The Z in /Z stands for zeroing predication: inactive elements of the destination are set to zero.

ø Xn|SP is the base address of the source operand. Xn is a general-purpose register and SP is the stack pointer register.

ø Xm is the general-purpose register that holds the offset, which is shifted left by 3 bits (LSL #3) before being added to the base address.
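The effect of this LD1D form can be sketched in Python. This is a simplified model, not an emulator: memory is a hypothetical dict from byte address to 64-bit value, the predicate is a list of booleans with one entry per 64-bit lane, and the function name `ld1d_zeroing` is made up for illustration.

```python
def ld1d_zeroing(memory, xn, xm, pg):
    """Model LD1D { Zt.D }, Pg/Z, [Xn, Xm, LSL #3] for one vector.

    memory: dict mapping byte address -> 64-bit value (an assumption of
    this sketch). pg: list of booleans, one per 64-bit lane.
    """
    # LSL #3 scales the index by 8, the size of a doubleword in bytes.
    base = xn + (xm << 3)
    # Active lanes load consecutive doublewords; inactive lanes become 0
    # (zeroing predication) and do not access memory at all.
    return [memory[base + 8 * lane] if active else 0
            for lane, active in enumerate(pg)]


memory = {0x1000 + 8 * i: 100 + i for i in range(8)}
zt = ld1d_zeroing(memory, xn=0x1000, xm=2, pg=[True, False, True, True])
```

With `xm` set to 2 the load starts at address 0x1010, and lane 1 reads as zero because its predicate bit is clear.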

 

[Example 2]

Here is the format of an ADD instruction.

ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>

Where:

ø Zdn is both the first source vector register and the destination vector register.

ø Pg is the governing predicate register. In this encoding it is limited to P0 to P7.

ø The M in /M stands for merging predication: inactive elements of the destination keep their previous values.

ø Zm is the second source vector register.

ø T is the data type of the lanes in the vector register.
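Merging predication for this ADD form can be modelled with a one-line list comprehension. The sketch below is illustrative only; `add_merging` is a hypothetical name, registers are lists of lane values, and the predicate is a list of booleans with one entry per lane.

```python
def add_merging(zdn, zm, pg, elem_bits=32):
    """Model ADD Zdn.T, Pg/M, Zdn.T, Zm.T on lists of lane values."""
    if not len(zdn) == len(zm) == len(pg):
        raise ValueError("registers and predicate must have the same lane count")
    mask = (1 << elem_bits) - 1
    # Active lanes receive the sum; inactive lanes keep the old Zdn value.
    return [(a + b) & mask if active else a
            for a, b, active in zip(zdn, zm, pg)]


zdn = add_merging([1, 2, 3, 4], [10, 20, 30, 40], [True, False, True, False])
```

Lanes 1 and 3 are inactive, so they still hold 2 and 4 after the operation, whereas zeroing predication would have cleared them.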
