Ant Group | Rust data memory layout - Moment For Technology

Author: worcsrcsgg

background

Jiacai Liu, a colleague on our team, mentioned in a previous article that a pointer to a trait object is a fat pointer:

Rust uses fat pointers (two pointers) to represent references to trait objects: one points to the data and the other to the vtable.

In addition, the team uses wrappers around C libraries, such as rust-rocksdb, in which #[repr(C)] appears frequently on the data structures that wrap C types.

This article grew out of those two observations and explores the memory layout of Rust data types.

It has two parts: the basic memory layout of Rust's data types, and the representations (repr attributes) that control that layout.

Commonly used types

The layout of a type is its size, its alignment, and the relative offsets of its fields. For enums, how the discriminant is laid out and interpreted is also part of the type layout. Type layout can change with each compilation. For a Sized type, the layout is known at compile time, and its size and alignment can be obtained with size_of and align_of.

Numeric types

Integer types

Type	Maximum	size(bytes)	align(bytes)
`u8`	2⁸– 1	1	1
`u16`	2¹⁶– 1	2	2
`u32`	2³²– 1	4	4
`u64`	2⁶⁴– 1	8	8
`u128`	2¹²⁸– 1	16	16

Type	Minimum	Maximum	size(bytes)	align(bytes)
`i8`	– (2⁷)	2⁷– 1	1	1
`i16`	– (2¹⁵)	2¹⁵– 1	2	2
`i32`	– (2³¹)	2³¹– 1	4	4
`i64`	– (2⁶³)	2⁶³– 1	8	8
`i128`	– (2¹²⁷)	2¹²⁷– 1	16	16

Floating point Numbers

The IEEE 754-2008 “binary32” and “binary64” floating-point types are f32 and f64, respectively.

Type	size(bytes)	align(bytes)
f32	4	4
f64	8	8

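Alignment is platform-specific: on some 32-bit x86 targets, u64 and f64 are aligned to only 4 bytes.

usize & isize

usize is an unsigned integer and isize is a signed integer. Both have the width of a pointer on the target: 8 bytes on a 64-bit system and 4 bytes on a 32-bit system.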
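
The sizes listed above can be checked directly with size_of and align_of. The following is a minimal sketch; layout_of is a hypothetical helper introduced only for this illustration, and the alignment of the wider types is printed rather than asserted because it depends on the target.

```rust
use std::mem::{align_of, size_of};

/// Returns (size, alignment) in bytes for `T` (hypothetical helper).
fn layout_of<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

fn main() {
    // Sizes of the fixed-width primitives are the same on every platform.
    assert_eq!(layout_of::<u8>(), (1, 1));
    assert_eq!(size_of::<u32>(), 4);
    assert_eq!(size_of::<i128>(), 16);
    assert_eq!(size_of::<f64>(), 8);

    // Alignment of wider types is platform-specific, so it is only printed.
    println!("u64:  {:?}", layout_of::<u64>());
    println!("f64:  {:?}", layout_of::<f64>());
    println!("u128: {:?}", layout_of::<u128>());

    // On mainstream targets usize and isize are exactly pointer-sized.
    assert_eq!(size_of::<usize>(), size_of::<*const u8>());
    assert_eq!(size_of::<isize>(), size_of::<usize>());
}
```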

bool

A bool is either true or false. Its size and alignment are both 1 byte.

array

let array: [i32; 3] = [1, 2, 3];

An array is laid out as a contiguous, ordered sequence of elements of the same type.

For [T; n], the size is n * size_of::<T>() and the alignment is align_of::<T>().
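
As a quick check of the size and alignment rule for arrays, here is a small sketch. The helper element_stride is an assumption made for this example, not part of any library.

```rust
use std::mem::{align_of, size_of, size_of_val};

/// Distance in bytes between the first two elements (hypothetical helper).
fn element_stride(array: &[i32; 3]) -> usize {
    let first = &array[0] as *const i32 as usize;
    let second = &array[1] as *const i32 as usize;
    second - first
}

fn main() {
    let array: [i32; 3] = [1, 2, 3];
    // size = n * size_of::<T>(), align = align_of::<T>()
    assert_eq!(size_of::<[i32; 3]>(), 3 * size_of::<i32>());
    assert_eq!(align_of::<[i32; 3]>(), align_of::<i32>());
    assert_eq!(size_of_val(&array), 12);
    // Elements are contiguous, so neighbours are size_of::<i32>() apart.
    assert_eq!(element_stride(&array), size_of::<i32>());
}
```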

str

The char type

A char is a 32-bit value that holds a Unicode scalar value, that is, a code point in the range 0x0000-0xD7FF or 0xE000-0x10FFFF.

The str type

str has the same representation as the byte slice [u8]. Rust's standard library additionally assumes that a str is always valid UTF-8. Its memory layout is the same as that of [u8].

slice

A slice is a DST (dynamically sized type): a view into a sequence of elements of type T. A slice must be used behind a pointer. &[T] is a fat pointer that holds the address of the data and the number of elements. The memory layout of a slice is the same as that of the part of the array it points to.
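
The fat-pointer claim can be observed by comparing pointer sizes. This is a minimal sketch; words is a hypothetical helper that expresses a type's size in machine words.

```rust
use std::mem::{size_of, size_of_val};

/// Size of a type measured in machine words (hypothetical helper).
fn words<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    let data = [10u16, 20, 30, 40];
    let slice: &[u16] = &data[1..3];

    // A thin reference is one word; a slice reference also carries a length.
    assert_eq!(words::<&u16>(), 1);
    assert_eq!(words::<&[u16]>(), 2);
    assert_eq!(words::<&str>(), 2);

    // The slice itself is just the elements it views.
    assert_eq!(slice.len(), 2);
    assert_eq!(size_of_val(slice), 2 * size_of::<u16>());
    // It points into the original array rather than owning a copy.
    assert_eq!(slice.as_ptr(), &data[1] as *const u16);
}
```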

The difference between &str and String

The following compares the memory structure of String and &str:

let mut my_name = "Pascal".to_string();
my_name.push_str(" Precht");

let last_name = &my_name[7..];

String

                     buffer
                   /   capacity
                 /   /   length
               /   /   /
            +---+---+---+
stack frame | • | 8 | 6 | <- my_name: String
            +-|-+---+---+
              |
            [-|-------- capacity -----------]
              |
            +-V-+---+---+---+---+---+---+---+
       heap | P | a | s | c | a | l |   |   |
            +---+---+---+---+---+---+---+---+

            [------- length --------]

String vs &str

            my_name: String   last_name: &str
            [-------------]   [-------]
            +---+----+----+---+---+---+
stack frame | • | 16 | 13 |   | • | 6 |
            +-|-+----+----+---+-|-+---+
              |                 |
              |                 +---------+
              |                           |
              |                         [-|------- str ---------]
            +-V-+---+---+---+---+---+---+-V-+---+---+---+---+---+---+---+---+
       heap | P | a | s | c | a | l |   | P | r | e | c | h | t |   |   |   |
            +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
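
The relationship shown in the diagram can be checked in code. The sketch below assumes the current standard-library layout, in which a String occupies three machine words on the stack; offset_within is a hypothetical helper written for this illustration.

```rust
use std::mem::size_of;

/// Byte offset of `part` inside `whole`, if `part` borrows from it
/// (hypothetical helper).
fn offset_within(whole: &str, part: &str) -> Option<usize> {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    if p >= start && p + part.len() <= start + whole.len() {
        Some(p - start)
    } else {
        None
    }
}

fn main() {
    let mut my_name = "Pascal".to_string();
    my_name.push_str(" Precht");
    let last_name = &my_name[7..];

    // String: pointer, capacity and length (three words in practice).
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    // &str: pointer and length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());

    assert_eq!(my_name.len(), 13);
    assert!(my_name.capacity() >= my_name.len());
    assert_eq!(last_name, "Precht");
    // last_name points 7 bytes into the heap buffer owned by my_name.
    assert_eq!(offset_within(&my_name, last_name), Some(7));
}
```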

struct

Structs are named compound types. There are three kinds of struct:

StructExprStruct

struct A {
    a: u8,
}

StructExprTuple

struct Position(i32, i32, i32);

StructExprUnit

struct Gamma;

See Section 2, Data Layout – Data Alignment, for the detailed memory layout.

tuple

A tuple is an anonymous compound type. Some examples of tuple types:

() (unit)
(f64, f64)
(String, i32)
(i32, String) (different type from the previous example)
(i32, f64, Vec<String>, Option<bool>)

A tuple is laid out like a struct, except that its elements are accessed by index rather than by name.

closure

A closure is equivalent to a struct that holds the captured variables and implements one or more of FnOnce, FnMut, and Fn.

fn f<F: FnOnce() -> String>(g: F) {
    println!("{}", g());
}

let mut s = String::from("foo");
let t = String::from("bar");

f(|| {
    s += &t;
    s
});
// Prints "foobar".

The compiler generates a closure type roughly like the following:

// Illustrative only: the Fn* traits cannot be implemented by hand on stable Rust.
struct Closure<'a> {
    s: String,
    t: &'a String,
}

impl<'a> FnOnce<()> for Closure<'a> {
    type Output = String;
    fn call_once(mut self) -> String {
        self.s += &*self.t;
        self.s
    }
}

f(Closure { s: s, t: &t });
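
Because a closure is a struct of its captures, its size follows from what it captures. The sketch below reflects the behavior of current rustc; these sizes are not a documented guarantee, and closure_sizes is a hypothetical helper.

```rust
use std::mem::size_of_val;

/// Sizes of three closures: no capture, capture by reference, capture by move
/// (hypothetical helper).
fn closure_sizes() -> (usize, usize, usize) {
    let s = String::from("foo");
    let t = String::from("bar");

    // Captures nothing: behaves like a unit struct.
    let nothing = || 42;
    // Captures `t` by shared reference: like a struct holding one `&String`.
    let by_ref = || t.len();
    // `move` captures `s` by value: like a struct holding one `String`.
    let by_move = move || s.len();

    let sizes = (
        size_of_val(&nothing),
        size_of_val(&by_ref),
        size_of_val(&by_move),
    );
    // Call the closures so they are demonstrably usable.
    assert_eq!((nothing(), by_ref(), by_move()), (42, 3, 3));
    sizes
}

fn main() {
    let (none, by_ref, by_move) = closure_sizes();
    println!("no capture: {}, by reference: {}, by move: {}", none, by_ref, by_move);
    assert_eq!(by_ref, std::mem::size_of::<&String>());
    assert_eq!(by_move, std::mem::size_of::<String>());
}
```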

union

The key feature of a union is that all fields of the union share common storage. Thus, writes to one field of the union can override its other fields, and the size of the union is determined by the size of its largest field.

#[repr(C)]
union MyUnion {
    f1: u32,
    f2: f32,
}

Each union access simply interprets the storage as the type of the field used for the access. Reading a union field reads the bits of the union at the field's type. Fields may have a non-zero offset (except when the C representation is used); in that case, the bits starting at the field's offset are read. It is the programmer's responsibility to ensure that the data is valid at the field's type. Failing to do so results in undefined behavior. For example, reading the value 3 through a field of type bool is undefined behavior.
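
The shared storage can be seen by writing one field and reading the other. This sketch reuses MyUnion from above; bits_to_float is a hypothetical helper, and the read is sound only because every 32-bit pattern is a valid f32.

```rust
#[repr(C)]
union MyUnion {
    f1: u32,
    f2: f32,
}

/// Reinterprets the bits of a `u32` as an `f32` through the union
/// (hypothetical helper).
fn bits_to_float(bits: u32) -> f32 {
    let u = MyUnion { f1: bits };
    // SAFETY: every 32-bit pattern is a valid f32, so reading f2 is sound.
    unsafe { u.f2 }
}

fn main() {
    // Both fields share the same 4 bytes.
    assert_eq!(std::mem::size_of::<MyUnion>(), 4);
    // 0x3F80_0000 is the IEEE 754 bit pattern of 1.0f32.
    assert_eq!(bits_to_float(0x3F80_0000), 1.0);
    assert_eq!(bits_to_float(0x3F80_0000), f32::from_bits(0x3F80_0000));
}
```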

enum

enum Animal {
    Dog(String, f64),
    Cat { name: String, weight: f64 },
}

let mut a: Animal = Animal::Dog("Cocoa".to_string(), 37.2);
a = Animal::Cat { name: "Spotty".to_string(), weight: 2.7 };

An enum item declares an enumeration type together with its variants, each of which is independently named and has the syntax of a struct, tuple struct, or unit-like struct. An enum is a tagged union, so the memory used by a value is the size of its largest variant plus whatever is needed to store the discriminant.

use std::mem;

enum Foo { A(&'static str), B(i32), C(i32) }

assert_eq!(mem::discriminant(&Foo::A("bar")), mem::discriminant(&Foo::A("baz")));
assert_eq!(mem::discriminant(&Foo::B(1)), mem::discriminant(&Foo::B(2)));
assert_ne!(mem::discriminant(&Foo::B(3)), mem::discriminant(&Foo::C(3)));

enum Foo {
    A(u32),
    B(u64),
    C(u8),
}

// Foo might be laid out like this:
struct FooRepr {
    data: u64, // a u64, u32, or u8, depending on the tag
    tag: u8,   // 0 = A, 1 = B, 2 = C
}
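
The cost of the tag can be measured. In this sketch the exact numbers are printed rather than asserted, because the default representation is unspecified; overhead is a hypothetical helper.

```rust
use std::mem::{align_of, size_of};

#[allow(dead_code)]
enum Foo {
    A(u32),
    B(u64),
    C(u8),
}

/// Bytes the enum needs beyond its largest payload: tag plus padding
/// (hypothetical helper).
fn overhead() -> usize {
    size_of::<Foo>() - size_of::<u64>()
}

fn main() {
    println!("size = {}, align = {}", size_of::<Foo>(), align_of::<Foo>());
    // The u64 payload uses every bit pattern, so the tag needs extra space.
    assert!(overhead() >= 1);
    // The total size is always a multiple of the alignment.
    assert_eq!(size_of::<Foo>() % align_of::<Foo>(), 0);
}
```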

trait object

Official definition:

A trait object is an opaque value of another type that implements a set of traits.
The set of traits is made up of an object safe base trait plus any number of auto traits.

A trait object is a DST. A pointer to a trait object is a fat pointer made of two pointers, one to the data and one to the vtable. A more detailed description is available in the reference material.
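
The two-pointer representation shows up in the size of trait-object pointers. This is a minimal sketch; the Shape trait, the Square type, and the pointer_words helper are all hypothetical names used only for illustration.

```rust
use std::mem::size_of;

trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

/// Size of a pointer type in machine words (hypothetical helper).
fn pointer_words<P>() -> usize {
    size_of::<P>() / size_of::<usize>()
}

fn main() {
    // A reference to a concrete type is a single pointer.
    assert_eq!(pointer_words::<&Square>(), 1);
    // References and boxes of a trait object hold data pointer + vtable pointer.
    assert_eq!(pointer_words::<&dyn Shape>(), 2);
    assert_eq!(pointer_words::<Box<dyn Shape>>(), 2);

    // The method call is dispatched through the vtable.
    let boxed: Box<dyn Shape> = Box::new(Square(3.0));
    assert_eq!(boxed.area(), 9.0);
}
```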

Dynamically Sized Types (DST)

For most types, size and alignment are known at compile time, and the Sized trait marks this. Types whose size is not known at compile time (?Sized) are called DSTs. DSTs include slices and trait objects. A DST must be used behind a pointer. Note:

A DST can be used as a generic argument, but generic parameters are Sized by default. To accept a DST, you must specify ?Sized explicitly.

struct S {
    s: i32
}

impl S {
    fn new(i: i32) -> S {
        S{s:i}
    }
}

trait T {
    fn get(&self) -> i32;
}

impl T for S {
    fn get(&self) -> i32 {
        self.s
    }
}

fn test<R: T>(t: Box<R>) -> i32 {
    t.get()
}


fn main() {
    let t: Box<dyn T> = Box::new(S::new(1));
    let _ = test(t);
}

This produces a compiler error:

error[E0277]: the size for values of type `dyn T` cannot be known at compilation time
   |
21 | fn test<R: T>(t: Box<R>) -> i32 {
   |         - required by this bound in `test`
...
28 |     let _ = test(t);
   |                  ^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `dyn T`
help: consider relaxing the implicit `Sized` restriction
   |
21 | fn test<R: T + ?Sized>(t: Box<R>) -> i32 {
   |              ^^^^^^^^

The fix:

fn test<R: T + ?Sized>(t: Box<R>) -> i32 {
    t.get()
}

Within a trait definition, Self is ?Sized by default.
A struct can store a DST directly as its last field, but this makes the struct itself a DST. Refer to the material on DSTs to learn more about user-defined DSTs.
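
Both points, the ?Sized bound and a struct whose last field is a DST, can be sketched as follows. byte_size and Packet are hypothetical names introduced for this example.

```rust
use std::mem::size_of_val;

/// Accepts sized and dynamically sized types alike thanks to `?Sized`
/// (hypothetical helper).
fn byte_size<T: ?Sized>(value: &T) -> usize {
    size_of_val(value)
}

/// A user-defined DST when `T` is unsized: the last field holds the DST
/// (hypothetical type).
struct Packet<T: ?Sized> {
    id: u16,
    payload: T,
}

fn main() {
    assert_eq!(byte_size(&7u32), 4);
    assert_eq!(byte_size("hello"), 5); // T = str
    assert_eq!(byte_size(&[1u16, 2, 3][..]), 6); // T = [u16]

    // An unsizing coercion turns &Packet<[u8; 3]> into &Packet<[u8]>.
    let sized = Packet { id: 1, payload: [1u8, 2, 3] };
    let dynamic: &Packet<[u8]> = &sized;
    assert_eq!(dynamic.id, 1);
    assert_eq!(dynamic.payload.len(), 3);
}
```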

ZST, Zero Sized Type

struct Nothing; // No fields = no size

// All fields have no size = no size
struct LotsOfNothing {
    foo: Nothing,
    qux: (),      // empty tuple has no size
    baz: [u8; 0], // empty array has no size
}

One of the most extreme examples of ZSTs in use is Set and Map. Given a type Map<Key, Value>, the common way to implement Set<Key> is as a thin wrapper around Map<Key, UselessJunk>. Many languages would have to allocate space for UselessJunk, store it, load it, and then simply discard it without doing anything. It is difficult for the compiler to prove that these operations are unnecessary. But in Rust, we can just say Set<Key> = Map<Key, ()>. Rust statically knows that every load and store is useless and that no space needs to be allocated for the values. As a result, the monomorphized code is essentially a custom implementation of a HashSet, with none of the overhead a HashMap needs to handle values.
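
To make the Set = Map<Key, ()> idea concrete, here is a minimal sketch. TinySet is a hypothetical type written only for this illustration, and the tuple-size check reflects what current rustc does rather than a documented guarantee.

```rust
use std::collections::HashMap;
use std::mem::size_of;

struct Nothing;

/// A set sketched as a thin wrapper over a map with zero-sized values
/// (hypothetical type, for illustration only).
struct TinySet {
    inner: HashMap<u32, ()>,
}

impl TinySet {
    fn new() -> Self {
        TinySet { inner: HashMap::new() }
    }

    /// Returns true if the key was not present before.
    fn insert(&mut self, key: u32) -> bool {
        self.inner.insert(key, ()).is_none()
    }

    fn contains(&self, key: u32) -> bool {
        self.inner.contains_key(&key)
    }
}

fn main() {
    assert_eq!(size_of::<Nothing>(), 0);
    assert_eq!(size_of::<()>(), 0);
    assert_eq!(size_of::<[u8; 0]>(), 0);
    // A (key, ()) entry is no larger than the key alone.
    assert_eq!(size_of::<(u32, ())>(), size_of::<u32>());

    let mut set = TinySet::new();
    assert!(set.insert(3));
    assert!(!set.insert(3));
    assert!(set.contains(3));
}
```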

Empty Types

enum Void {} // No variants = EMPTY

A major use of empty types is to express unreachability at the type level. For example, suppose an API must return a Result in general, but in one particular case it can never fail. By declaring the return type as Result<T, Void>, the API lets callers use unwrap with confidence: a Void value cannot be constructed, so the return value can never be an Err.
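
A small sketch of this pattern follows. parse_infallible and into_ok are hypothetical functions; instead of unwrap, the example uses an empty match, which the compiler accepts because Void has no values.

```rust
enum Void {}

/// A hypothetical API that must return `Result` but can never fail.
fn parse_infallible(input: &str) -> Result<usize, Void> {
    Ok(input.len())
}

/// Extracts the value without any panic path: the `Err` arm has no
/// possible value, so an empty match covers it (hypothetical helper).
fn into_ok<T>(result: Result<T, Void>) -> T {
    match result {
        Ok(value) => value,
        Err(never) => match never {},
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<Void>(), 0);
    assert_eq!(into_ok(parse_infallible("abc")), 3);
}
```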

Data layout

Data alignment

Data alignment has significant benefits for both CPU operations and caching. The alignment of a struct in Rust is equal to the largest alignment among its fields. Rust inserts padding where necessary to ensure that each field is properly aligned and that the size of the whole type is a multiple of its alignment. For example:

struct A {
    a: u8,
    b: u32,
    c: u16,
}

Printing the address of each field shows that the struct is aligned to 4 bytes:

fn main() {
    let a = A {
        a: 1,
        b: 2,
        c: 3,
    };
    println!("0x{:X} 0x{:X} 0x{:X}", &a.a as *const u8 as usize, &a.b as *const u32 as usize, &a.c as *const u16 as usize);
}

0x7FFEE6769276 0x7FFEE6769270 0x7FFEE6769274

Data alignment in Rust

The addresses show that the compiler reordered the fields: b is at offset 0, c at offset 4, and a at offset 6. The resulting layout is equivalent to:

struct A {
    b: u32,
    c: u16,
    a: u8,
    _pad: [u8; 1],
}

Compiler optimization

Let’s look at this structure

struct Foo<T, U> {
    count: u16,
    data1: T,
    data2: U,
}

fn main() {
    let foo1 = Foo::<u16, u32> {
        count: 1,
        data1: 2,
        data2: 3,
    };
    let foo2 = Foo::<u32, u16> {
        count: 1,
        data1: 2,
        data2: 3,
    };
    println!("0x{:X} 0x{:X} 0x{:X}", &foo1.count as *const u16 as usize, &foo1.data1 as *const u16 as usize, &foo1.data2 as *const u32 as usize);
    println!("0x{:X} 0x{:X} 0x{:X}", &foo2.count as *const u16 as usize, &foo2.data1 as *const u32 as usize, &foo2.data2 as *const u16 as usize);
}

0x7FFEDFDD61C4 0x7FFEDFDD61C6 0x7FFEDFDD61C0
0x7FFEDFDD61CC 0x7FFEDFDD61C8 0x7FFEDFDD61CE

The last hex digit of each address gives the field order. foo1: data2 (0), count (4), data1 (6). foo2: data1 (8), count (c), data2 (e). To minimize memory use, different instantiations of the same generic struct may order their fields differently. Without this optimization, the fields would stay in declaration order and the second instantiation would need padding, wasting memory:

// Pseudocode: the two instantiations with fields kept in declaration order.
struct Foo<u16, u32> {
    count: u16,
    data1: u16,
    data2: u32,
}

struct Foo<u32, u16> {
    count: u16,
    _pad1: u16,
    data1: u32,
    data2: u16,
    _pad2: u16,
}
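
The effect of reordering can be measured by comparing the default representation with one that keeps declaration order. In this sketch FooC is a hypothetical repr(C) twin of Foo and sizes is a hypothetical helper; only the repr(C) sizes are asserted, since the default layout is unspecified.

```rust
use std::mem::size_of;

#[allow(dead_code)]
struct Foo<T, U> {
    count: u16,
    data1: T,
    data2: U,
}

/// Same fields, but repr(C) keeps them in declaration order
/// (hypothetical type for comparison).
#[allow(dead_code)]
#[repr(C)]
struct FooC<T, U> {
    count: u16,
    data1: T,
    data2: U,
}

/// Returns (default size, repr(C) size) for one instantiation.
fn sizes<T, U>() -> (usize, usize) {
    (size_of::<Foo<T, U>>(), size_of::<FooC<T, U>>())
}

fn main() {
    // Declaration order: u16, u16, u32 packs tightly into 8 bytes.
    assert_eq!(sizes::<u16, u32>().1, 8);
    // Declaration order: u16, pad, u32, u16, pad needs 12 bytes.
    assert_eq!(sizes::<u32, u16>().1, 12);
    // The default representation is free to reorder and is usually smaller.
    println!("Foo<u16, u32>: {:?}", sizes::<u16, u32>());
    println!("Foo<u32, u16>: {:?}", sizes::<u32, u16>());
}
```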

repr(C)

The purpose of repr(C) is simple: keep the memory layout consistent with C. Every type that crosses an FFI boundary should have repr(C). repr(C) is also necessary if we want to do tricks with data layout, such as reinterpreting values as a different type. For more information, see repr(C).
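
Both uses can be sketched without an external C library. In this example Point, point_norm_sq, and as_array are hypothetical; point_norm_sq is written in Rust with the C calling convention to stand in for a real foreign function.

```rust
#[repr(C)]
#[derive(Clone, Copy)]
struct Point {
    x: f64,
    y: f64,
}

/// Stands in for a C function that receives a pointer to a C struct
/// (hypothetical; a real one would be declared in an `extern "C"` block).
extern "C" fn point_norm_sq(p: *const Point) -> f64 {
    // SAFETY: callers must pass a valid, aligned pointer to a Point.
    let p = unsafe { &*p };
    p.x * p.x + p.y * p.y
}

/// Reinterprets a Point as an array of its coordinates.
fn as_array(p: Point) -> [f64; 2] {
    // SAFETY: with repr(C), Point is two f64 fields in declaration order
    // with no padding, which matches the layout of [f64; 2].
    unsafe { std::mem::transmute::<Point, [f64; 2]>(p) }
}

fn main() {
    let p = Point { x: 3.0, y: 4.0 };
    assert_eq!(point_norm_sq(&p), 25.0);
    assert_eq!(as_array(p), [3.0, 4.0]);
}
```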

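repr(u*) repr(i*)

These specify the integer type used to store the discriminant of a fieldless enum, and therefore its size. The type can be u8, u16, u32, u64, u128, usize, i8, i16, i32, i64, i128, or isize. As the example below shows, they can also be applied to enums with fields.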

enum Enum {
    Variant0(u8),
    Variant1,
}

#[repr(C)]
enum EnumC {
    Variant0(u8),
    Variant1,
}

#[repr(u8)]
enum Enum8 {
    Variant0(u8),
    Variant1,
}

#[repr(u16)]
enum Enum16 {
    Variant0(u8),
    Variant1,
}

fn main() {
    assert_eq!(std::mem::size_of::<Enum>(), 2);
    // The size of the C representation is platform dependent
    assert_eq!(std::mem::size_of::<EnumC>(), 8);
    // One byte for the discriminant and one byte for the value in Enum8::Variant0
    assert_eq!(std::mem::size_of::<Enum8>(), 2);
    // Two bytes for the discriminant and one byte for the value in Enum16::Variant0
    // plus one byte of padding.
    assert_eq!(std::mem::size_of::<Enum16>(), 4);
}
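
For the fieldless case, a primitive representation fixes both the size and the numeric value of each variant. This is a minimal sketch; Level and from_u8 are hypothetical names.

```rust
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Low = 1,
    Mid = 5,
    High = 10,
}

impl Level {
    /// Checked conversion from the raw discriminant (hypothetical helper).
    fn from_u8(raw: u8) -> Option<Level> {
        match raw {
            1 => Some(Level::Low),
            5 => Some(Level::Mid),
            10 => Some(Level::High),
            _ => None,
        }
    }
}

fn main() {
    // The whole enum is exactly one byte: the u8 discriminant.
    assert_eq!(std::mem::size_of::<Level>(), 1);
    assert_eq!(Level::Mid as u8, 5);
    assert_eq!(Level::from_u8(10), Some(Level::High));
    assert_eq!(Level::from_u8(2), None);
}
```

A checked conversion such as from_u8 is used here because turning an arbitrary integer into an enum with transmute would be undefined behavior for values that match no variant.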

repr(align(x)) repr(packed(x))

The align and packed modifiers raise or lower, respectively, the alignment of structs and unions. packed can also change the padding between fields. align enables tricks such as ensuring that adjacent elements of an array never share a cache line (which can speed up some kinds of concurrent code). packed is not to be used lightly: avoid it unless you have extreme requirements.

#[repr(C)]
struct A {
    a: u8,
    b: u32,
    c: u16,
}

#[repr(C, align(8))]
struct A8 {
    a: u8,
    b: u32,
    c: u16,
}

fn main() {
    let a = A {
        a: 1,
        b: 2,
        c: 3,
    };
    println!("{}", std::mem::align_of::<A>());
    println!("{}", std::mem::size_of::<A>());
    println!("0x{:X} 0x{:X} 0x{:X}", &a.a as *const u8 as usize, &a.b as *const u32 as usize, &a.c as *const u16 as usize);

    let a = A8 {
        a: 1,
        b: 2,
        c: 3,
    };
    println!("{}", std::mem::align_of::<A8>());
    println!("{}", std::mem::size_of::<A8>());
    println!("0x{:X} 0x{:X} 0x{:X}", &a.a as *const u8 as usize, &a.b as *const u32 as usize, &a.c as *const u16 as usize);
}

The result:

4
12
0x7FFEE7F0B070 0x7FFEE7F0B074 0x7FFEE7F0B078
8
16
0x7FFEE7F0B1A0 0x7FFEE7F0B1A4 0x7FFEE7F0B1A8

#[repr(C)]
struct A {
    a: u8,
    b: u32,
    c: u16,
}

// Despite its name, A8 is the packed(1) variant in this example.
#[repr(C, packed(1))]
struct A8 {
    a: u8,
    b: u32,
    c: u16,
}

fn main() {
    let a = A {
        a: 1,
        b: 2,
        c: 3,
    };
    println!("{}", std::mem::align_of::<A>());
    println!("{}", std::mem::size_of::<A>());
    println!("0x{:X} 0x{:X} 0x{:X}", &a.a as *const u8 as usize, &a.b as *const u32 as usize, &a.c as *const u16 as usize);

    let a = A8 {
        a: 1,
        b: 2,
        c: 3,
    };
    println!("{}", std::mem::align_of::<A8>());
    println!("{}", std::mem::size_of::<A8>());
    // Fields of a packed struct may be misaligned, so recent compilers reject `&a.b`.
    // Take raw pointers with addr_of! instead of creating references.
    println!("0x{:X} 0x{:X} 0x{:X}", std::ptr::addr_of!(a.a) as usize, std::ptr::addr_of!(a.b) as usize, std::ptr::addr_of!(a.c) as usize);
}

The result:

4
12
0x7FFEED627078 0x7FFEED62707C 0x7FFEED627080
1
7
0x7FFEED6271A8 0x7FFEED6271A9 0x7FFEED6271AD
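
The cache-line trick mentioned above can be sketched as follows. The 64-byte line size is an assumption (it is common but not universal), and PaddedCounter and element_stride are hypothetical names.

```rust
use std::mem::{align_of, size_of};

/// One counter per cache line; 64 bytes is an assumed cache-line size.
#[repr(align(64))]
struct PaddedCounter {
    value: u64,
}

/// Distance in bytes between neighbouring array elements
/// (hypothetical helper).
fn element_stride(counters: &[PaddedCounter; 4]) -> usize {
    let first = &counters[0] as *const PaddedCounter as usize;
    let second = &counters[1] as *const PaddedCounter as usize;
    second - first
}

fn main() {
    let counters: [PaddedCounter; 4] =
        std::array::from_fn(|i| PaddedCounter { value: i as u64 });

    // Raising the alignment also rounds the size up to a multiple of it.
    assert_eq!(align_of::<PaddedCounter>(), 64);
    assert_eq!(size_of::<PaddedCounter>(), 64);
    // Each element starts on its own 64-byte boundary.
    assert_eq!(element_stride(&counters), 64);
    assert_eq!(&counters[0] as *const PaddedCounter as usize % 64, 0);
    assert_eq!(counters[3].value, 3);
}
```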

repr(transparent)

#[repr(transparent)] can be applied to a struct or single-variant enum that has exactly one field of non-zero size. It tells the Rust compiler that the new type exists only at the Rust type level and should be ignored by the ABI: the memory layout of the new type is that of its single field.

This attribute can be applied to newtype-like structs that contain a single field. It indicates that the newtype should be represented exactly like that field's type, i.e., the newtype should be ignored for ABI purposes: not only is it laid out the same in memory, it is also passed identically in function calls. Structs and enums with this representation have the same layout and ABI as the single non-zero-sized field.
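
A minimal sketch of a transparent newtype follows. Meters and to_raw are hypothetical names; the transmute is sound only because repr(transparent) guarantees the same layout as the wrapped f64.

```rust
use std::mem::{align_of, size_of};

/// A newtype that is an f64 in memory and at the ABI level
/// (hypothetical type).
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Meters(f64);

/// Reinterprets the wrapper as its single field (hypothetical helper).
fn to_raw(m: Meters) -> f64 {
    // SAFETY: repr(transparent) guarantees Meters has the layout of f64.
    unsafe { std::mem::transmute::<Meters, f64>(m) }
}

fn main() {
    assert_eq!(size_of::<Meters>(), size_of::<f64>());
    assert_eq!(align_of::<Meters>(), align_of::<f64>());
    assert_eq!(to_raw(Meters(2.5)), 2.5);
}
```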

conclusion

This article has described the memory layout of common data types in Rust and the repr attributes that control it.

reference

Type system
Data Layout
Item
String vs &str in Rust
Data layout
enter-reprtransparent

About us

We are the time series storage team at The Ant Intelligence Monitoring Technology Center. We are using Rust to build a new generation of time series databases with high performance, low cost and real-time analysis capabilities. Please contact: jiachun.fjc@antgroup.com
