Huawei | Using SIMD Instructions in Rust to Accelerate Computation

Author: Li Yuan


Introduction to SIMD

SIMD stands for Single Instruction, Multiple Data. It is a performance-optimization technique built on dedicated CPU instruction sets: as the name implies, a single CPU instruction operates on multiple pieces of data at once. In data-intensive scenarios such as scientific computing and multimedia processing, SIMD can improve performance several-fold, or even by tens of times.
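To make the idea concrete, here is a plain-Rust sketch (the editor's illustration, not code from any SIMD library) of the kind of data-parallel loop SIMD targets: the same operation is applied to every element, so a SIMD-capable compiler such as LLVM can map one instruction onto several elements at once (it may even auto-vectorize this loop).

```rust
// The same operation on every element: the data-parallel shape SIMD exploits.
fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y; // one logical operation per "lane" of data
    }
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut out = [0.0f32; 4];
    add_arrays(&a, &b, &mut out);
    assert_eq!(out, [6.0, 8.0, 10.0, 12.0]);
}
```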

Introduction to the official Rust SIMD acceleration libraries

Rust is a programming language that supports multiple compiler backends, though most Rust projects in industry today use the default backend, LLVM. Notably, LLVM already integrates the instruction sets, SIMD included, of most mainstream CPU architectures. This makes SIMD a natural fit for Rust: the compiler, and even user code, can call the SIMD function interfaces LLVM provides and link them statically, rather than hand-writing assembly code as is sometimes required in Go, C, and other languages.

Rust provides two SIMD acceleration libraries in official GitHub projects: stdarch and stdsimd, both found under the rust-lang organization. stdarch exposes, in a modular fashion, the architecture-specific SIMD instruction sets of each CPU family: AVX, AVX-512, SSE, and SSE2 for x86; NEON and SVE for ARM/AArch64; and the SIMD instruction sets of RISC-V, WebAssembly, and other architectures. Its users must understand both the CPU architecture they are targeting and the SIMD functions that architecture provides. stdsimd, by contrast, offers a large set of abstract SIMD function interfaces common to all platforms, such as vector addition, subtraction, multiplication, division, shifts, and type conversion. Its users need not understand the underlying hardware or instruction set, so it is comparatively convenient, at the cost of some functional limitations. The two projects have different design goals and different maintainers; their design and usage are described below.

Using stdsimd

stdsimd provides a common SIMD acceleration interface for all platforms. Its implementation relies on the set of platform-intrinsic interfaces provided by the Rust compiler, which in turn wrap the platform instruction sets exposed by LLVM. The relationship between them is therefore:

stdsimd → wraps → Rust compiler intrinsics → wrap → LLVM

The stdsimd project has not yet been integrated into the Rust standard library because its functionality is incomplete. Readers can clone its source into their own projects, or use the community version, packed_simd, by adding `packed_simd = { version = "0.3.4", package = "packed_simd_2" }` to the Cargo.toml file. The usage described below is based on the community version.

The packed_simd project provides a family of vector data types, Simd<[T; N]> (vectors of N elements of type T), along with easy-to-read type aliases such as f32x4, which stands for Simd<[f32; 4]>. The SIMD acceleration functions packed_simd provides all operate on these vector types.

packed_simd provides the following SIMD data types, where {element_width} denotes the element size and lane count, such as 32x4 or 64x8:

  • i{element_width}: signed integers
  • u{element_width}: unsigned integers
  • f{element_width}: floating-point numbers
  • m{element_width}: boolean masks
  • *{const,mut} T: const or mutable SIMD pointer types

By default, operations on vector types are "vertical": they are applied to each vector lane independently of the other lanes, as in this example:

let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));

This example declares two i32x4 vectors and computes their sum using the overloaded addition operator. "Horizontal" operations, which combine the lanes of a single vector, are also provided, as in the following example:

assert_eq!(a.wrapping_sum(), 10);

In general, "vertical" operations are the fastest, while "horizontal" operations are relatively slow. The fastest way to sum an array, then, is to use many "vertical" operations followed by a single "horizontal" operation, as shown below:

fn reduce(x: &[i32]) -> i32 {
    assert!(x.len() % 4 == 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    sum.wrapping_sum()
}

let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
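As a plain-Rust sketch (the editor's scalar restatement, not packed_simd code), the same strategy can be written with an array of four independent partial sums: the inner "vertical" adds never depend on each other, and only the final "horizontal" step combines them. In the i32x4 version, those four partial sums live in a single SIMD register.

```rust
// Scalar sketch of the packed_simd reduce: four independent partial sums
// ("vertical" adds), combined once at the end ("horizontal" add).
fn reduce_scalar(x: &[i32]) -> i32 {
    assert!(x.len() % 4 == 0);
    let mut sum = [0i32; 4]; // plays the role of i32x4::splat(0)
    for chunk in x.chunks_exact(4) {
        for lane in 0..4 {
            sum[lane] = sum[lane].wrapping_add(chunk[lane]); // vertical add
        }
    }
    sum.iter().copied().fold(0i32, i32::wrapping_add) // horizontal add
}

fn main() {
    let x = [0, 1, 2, 3, 4, 5, 6, 7];
    assert_eq!(reduce_scalar(&x), 28);
}
```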

Here are some more common use cases:

// Generate an i32x4 with all elements set to 0:
let a = i32x4::splat(0);

// Generate an i32x4 from the first four elements of an array:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);

// Read an element from the vector:
assert_eq!(b.extract(3), 1);

// Replace the element at the given position in the vector:
let a = a.replace(3, 1);
assert_eq!(a, b);

// Write a vector back into an array:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);

In addition, packed_simd provides conditional operations on vectors. For example, the following code adds 1 to each element of a vector according to whether the corresponding element of the mask m is true:

let a = i32x4::new(1, 1, 2, 2);

// Add 1 to the first two elements of a.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);
assert_eq!(a, i32x4::splat(2));

This enables more flexible uses, such as building a new vector from the larger value at each position of two vectors:

let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);

// ge: greater-than-or-equal comparison, producing a vector of boolean elements
let m = a.ge(i32x4::splat(2));

if m.any() {
    // Select elements from a or b based on the result in m
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
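The semantics of mask-based selection can be restated lane by lane in plain Rust (an editor's illustration with a hypothetical helper name, not the packed_simd API): where the mask is true, take the element from the first vector, otherwise from the second.

```rust
// Scalar sketch of a 4-lane mask/select: per-lane choice between a and b.
fn select4(m: [bool; 4], a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = if m[i] { a[i] } else { b[i] }; // lane-wise choice
    }
    out
}

fn main() {
    let a = [1, 1, 3, 3];
    let b = [2, 2, 0, 0];
    // m = a >= 2, lane by lane (like a.ge(i32x4::splat(2)))
    let m = [a[0] >= 2, a[1] >= 2, a[2] >= 2, a[3] >= 2];
    assert_eq!(select4(m, a, b), [2, 2, 3, 3]);
}
```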

This covers the basic usage of stdsimd (packed_simd), a project that lets developers enjoy SIMD acceleration easily through SIMD-typed data structures. The project has some drawbacks, however, such as requiring the user to choose the vector length manually. Since most CPU architectures provide at least a 128-bit SIMD instruction set, a 128-bit vector length is always a reasonable choice; but when the CPU offers a more advanced SIMD instruction set (such as AVX-512), a longer vector is better. Developers with some knowledge of CPU architectures and SIMD can therefore get twice the result with half the effort.
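One way to handle the width-selection problem is to pick the lane count at compile time from the target's advertised features. The following is an editor's sketch (the constant name is hypothetical, and the feature-to-width mapping assumes 32-bit lanes), not part of packed_simd:

```rust
// Choose a wider vector when the build target advertises a wider SIMD unit.
#[cfg(target_feature = "avx512f")]
const LANES: usize = 16; // 512-bit register / 32-bit lanes
#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
const LANES: usize = 8; // 256-bit register / 32-bit lanes
#[cfg(not(any(target_feature = "avx2", target_feature = "avx512f")))]
const LANES: usize = 4; // assume at least a 128-bit SIMD unit

fn main() {
    // A chunked loop can then step by LANES instead of a hard-coded 4.
    assert!(LANES == 4 || LANES == 8 || LANES == 16);
}
```

Because target_feature is evaluated at compile time, this selects the width for the build target; runtime dispatch (shown later with stdarch) is needed to adapt to the actual CPU.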

Using the architecture-specific instruction library stdarch

stdarch is integrated into the Rust standard library and can be used in code via a use std::arch statement. Note that only two architectures, x86 and x86_64, have been stabilized so far; for other architectures, such as ARM and AArch64, you must switch the Rust compiler to the nightly channel (run rustup default nightly on the command line) to compile and use them. The following therefore takes x86_64 (and x86), available on stable, as an example.

stdarch statically links the many SIMD instruction sets provided by LLVM and exposes them, module by module, for the mainstream architectures listed below. The SIMD function interfaces available for each architecture can be viewed in the corresponding module documentation.

  • x86
  • x86_64
  • arm
  • aarch64
  • mips
  • mips64
  • powerpc
  • powerpc64
  • nvptx
  • wasm32

Compared with stdsimd, stdarch demands deeper CPU-architecture knowledge from developers: it provides thousands of SIMD instructions with different functions for every major CPU architecture, and developers must identify for themselves which instructions they need.

Take this example:

#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
))]
fn foo() {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm256_add_epi64;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm256_add_epi64;

    unsafe {
        _mm256_add_epi64(...);
    }
}

This code first uses Rust's compile-time CPU detection, the target_arch attribute, to check whether the build target is x86_64 or x86, and then uses the target_feature attribute to check whether the AVX2 instruction set is available. The foo function is compiled only when all of these conditions are met. Inside foo, the appropriate SIMD intrinsic is imported depending on whether the target is x86_64 or x86.

Alternatively, developers can use the runtime feature-detection macro is_x86_feature_detected!, as follows:

fn foo() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { foo_avx2() };
        }
    }
    // fall back to a path that does not use AVX2
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm256_add_epi64;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm256_add_epi64;

    _mm256_add_epi64(...);
}

stdarch itself contains plenty of similar conditionally compiled code, so each instruction-set module is available only in an environment that meets its requirements. For example, on x86_64 the statement use std::arch::x86_64 works, but use std::arch::x86 or use std::arch::arm does not.
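The dispatch pattern above can be seen in a minimal, compilable form below (the function name which_impl is the editor's invention; the `...` placeholders of the schematic foo are replaced by a string so the sketch actually runs). On x86/x86_64, AVX2 is detected at runtime; on any other architecture the detection block is compiled out entirely and the scalar path is the only one.

```rust
// Minimal runnable version of the runtime-dispatch pattern.
fn which_impl() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Runtime check: does this CPU actually support AVX2?
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    "scalar" // non-x86 targets, or x86 CPUs without AVX2
}

fn main() {
    // Which answer you get depends on the machine running the code.
    let chosen = which_impl();
    assert!(chosen == "avx2" || chosen == "scalar");
}
```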

The following walks through a concrete stdarch example: a SIMD implementation of a hexadecimal encoding function, using mainly the SSE4.1 instruction set on x86 and x86_64.

The full code is below; the individual SIMD intrinsics and their uses are explained in the comments and documented in the corresponding module (x86 or x86_64) listed above.

fn main() {
    let mut dst = [0; 32];
    hex_encode(b"\x01\x02\x03", &mut dst);
    assert_eq!(&dst[..6], b"010203");

    let mut src = [0; 16];
    for i in 0..16 {
        src[i] = (i + 1) as u8;
    }
    hex_encode(&src, &mut dst);
    assert_eq!(&dst, b"0102030405060708090a0b0c0d0e0f10");
}

pub fn hex_encode(src: &[u8], dst: &mut [u8]) {
    let len = src.len().checked_mul(2).unwrap();
    assert!(dst.len() >= len);

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse4.1") {
            return unsafe { hex_encode_sse41(src, dst) };
        }
    }

    hex_encode_fallback(src, dst)
}

#[target_feature(enable = "sse4.1")]
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
unsafe fn hex_encode_sse41(mut src: &[u8], dst: &mut [u8]) {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // A vector of 16 int8 elements, all set to the ASCII code of '0'
    let ascii_zero = _mm_set1_epi8(b'0' as i8);
    // A vector of 16 int8 elements, all set to the integer 9
    let nines = _mm_set1_epi8(9);
    // A vector of 16 int8 elements, all set to the ASCII code of 'a' minus 10
    let ascii_a = _mm_set1_epi8((b'a' - 9 - 1) as i8);
    // A vector of 16 int8 elements, all set to the binary value 00001111
    let and4bits = _mm_set1_epi8(0xf);

    let mut i = 0_isize;
    while src.len() >= 16 {
        // Load 128 bits from the pointer into a 128-bit vector
        // (interpretable as int8x16, int32x4, and so on)
        let invec = _mm_loadu_si128(src.as_ptr() as *const _);

        // AND each byte with 00001111, keeping the low nibble
        let masked1 = _mm_and_si128(invec, and4bits);
        // Logically shift each 64-bit lane right by 4 bits, then AND each
        // byte with 00001111, keeping the high nibble
        let masked2 = _mm_and_si128(_mm_srli_epi64(invec, 4), and4bits);

        // Find the positions of all elements greater than 9
        let cmpmask1 = _mm_cmpgt_epi8(masked1, nines);
        let cmpmask2 = _mm_cmpgt_epi8(masked2, nines);

        // _mm_blendv_epi8 builds a new vector whose elements come from
        // ascii_a or ascii_zero, depending on whether the corresponding
        // position in the compare mask is true.
        // _mm_add_epi8 then adds the elements at corresponding positions;
        // the result is the ASCII code of the hexadecimal digit.
        let masked1 = _mm_add_epi8(
            masked1,
            _mm_blendv_epi8(ascii_zero, ascii_a, cmpmask1),
        );
        let masked2 = _mm_add_epi8(
            masked2,
            _mm_blendv_epi8(ascii_zero, ascii_a, cmpmask2),
        );

        // Interleave masked2 (even positions, counting from 0) with masked1
        // (odd positions). The combined result is 256 bits wide, so the low
        // 128 bits go into res1 and the high 128 bits into res2.
        let res1 = _mm_unpacklo_epi8(masked2, masked1);
        let res2 = _mm_unpackhi_epi8(masked2, masked1);

        // Write the result vectors to the destination pointer
        _mm_storeu_si128(dst.as_mut_ptr().offset(i * 2) as *mut _, res1);
        _mm_storeu_si128(
            dst.as_mut_ptr().offset(i * 2 + 16) as *mut _,
            res2,
        );
        src = &src[16..];
        i += 16;
    }

    let i = i as usize;
    hex_encode_fallback(src, &mut dst[i * 2..]);
}

fn hex_encode_fallback(src: &[u8], dst: &mut [u8]) {
    fn hex(byte: u8) -> u8 {
        static TABLE: &[u8] = b"0123456789abcdef";
        TABLE[byte as usize]
    }

    for (byte, slots) in src.iter().zip(dst.chunks_mut(2)) {
        slots[0] = hex((*byte >> 4) & 0xf);
        slots[1] = hex(*byte & 0xf);
    }
}

This concrete example briefly illustrates the usage of SIMD acceleration instructions in stdarch. As it shows, compared with stdsimd, using architecture-specific instructions demands far more SIMD experience from developers, but it offers more complete functionality and covers more scenarios.

The above is a brief introduction to using Rust's official SIMD acceleration libraries; I hope it inspires and helps readers in their learning and development.