preface

SDS (simple dynamic string) is the most basic data type of Redis, so what is SDS string? Why is Redis not directly represented by C strings? Why is SDS more efficient than C strings

Let’s get straight to those questions

What is SDS

SDS (Simple Dynamic String) stands for simple dynamic String and is the default string representation of Redis

C strings use an array of characters of length N+1 to represent strings of length N, and the last element is always a null character ‘\0’ (SDS follows the convention that C strings end with a null character, that is, SDS last byte holds the null character ‘\0’). However, the strings used by C language cannot meet the requirements of Redis on the security, efficiency and function of strings, so WE build SDS by ourselves

There are two versions of SDS, this is the previous version of SDS:

typedef char *sds;

struct sdshdr {
    // Records the number of bytes used in the BUF array, i.e. the length of the string held by SDS
    unsigned int len;
    // Records the number of unused bytes in the BUF array
    unsigned int free;
    // An array of bytes to hold strings
    char buf[];
};
Copy the code

Of course, the len attribute here does not contain a null character at the end of an SDS (such as an SDS with unused space) :

Later versions of SDS define different structures for different length ranges, but generally the structures are the same: len for the used length, alloc for the allocated length of the BUF array, Flag for the SDS type, and buF for the byte array

SDS and C strings

1. Obtain the time complexity of the string length

C string does not record its length, so in order to get the length of a C string, you have to traverse the entire string, so the time is O(n).

For SDS, the len attribute is simply accessed, so Redis reduced the complexity of obtaining string length from O(n) to O(1)

2. Prevent buffer overflow

A C string with no record of its length can easily cause a buffer overflow. For example, a C string SRC is concatenated directly to the end of another string, dest. If dest does not preallocate enough memory, this will cause a buffer overflow

As for SDS, since it has the attribute of free, it should check whether the length is enough before performing the splicing operation. If it is possible, it should directly splice; otherwise, it should expand the space of SDS first and then splice

3. Reduce the number of memory reallocation times caused by string modification

With C strings, every time you grow or shorten it, you need to reallocate the array of C strings, such as concatenation of C strings, you need to expand the size of the underlying array through memory redistribution to avoid buffer overflow, or truncation of C strings, Memory reallocation is needed to free up unused parts of the string to avoid memory leaks

SDS can realize two optimization strategies of space pre-allocation and lazy space release through len and free attributes

  • Space preallocation

Space preallocation is used to optimize string growth operations for SDS: When space expansion is required for SDS, it not only allocates the necessary space for modification, but also allocates additional unused space for SDS

If the modified SDS is less than 1MB, the value of the free attribute is the same as the value of the len attribute

If the modified SDS is greater than or equal to 1MB, 1MB of unused space is allocated

You can see that with this preallocation strategy, SDS reduces the number of memory reallocation required to grow strings N times in a row from a certain N to a maximum of N

  • Inert space release

Lazy space free is used to optimize string shortening operations on SDS: Shortening operations on SDS are recorded with the free property instead of using memory reallocation to reclaim the extra bytes

Of course, there are apis to actually free up SDS unused space if needed

4. Binary security

Characters in C must conform to some encoding (such as ASCII), and strings cannot contain null characters, making C strings only hold text data

The API of SDS is binary safe, and will process the data of BUF array in SDS in the way of binary processing. Therefore, SDS can not only store text data, but also store binary data of any format

The encoding method of SDS

The encoding of a string object can be int, RAW, or embstr

If a string object holds an integer value that can be represented as long, the string object stores the integer value in the PTR property of the string object structure (converting void* to long) :

redis> SET number 666
OK
redis> OBJECT ENCODING number
"int"
Copy the code

Here’s the redisObject structure

All Redis objects have a Redis object header structure

Different objects have different types of type, and types of the same type have different encoding

Each object has a reference count refCount, and when the reference count reaches zero, the object is destroyed and memory is reclaimed. The PTR pointer points to the exact location where the object content (body) is stored

struct RedisObject { // It takes 16 bytes
    int4 type; / / 4 bits type
    int4 encoding; / / 4 bits coding
    int24 lru; // 24bits Records LRU information
    int32 refcount; // 4bytes reference count
    void *ptr; // 8bytes, 64-bit system
} robj;
Copy the code

If the string object holds a string value and the string value is longer than 39 bytes, the string object holds the string value using SDS and sets the encoding of the object to RAW:

redis> SET sentence "I was forced to 996 and I just felt tired after work"
OK
redis> STRLEN sentence
(integer) 52
redis> OBJECT ENCODING sentence
"raw"
Copy the code

If the string object holds a string value with a length of 39 bytes or less, the string object will use embSTR encoding to hold the string value

Embstr is an optimized encoding for short strings. Instead of raw encoding, which calls memory allocation functions twice to create redisObject and SDSHDR structures, Embstr encodings allocate a contiguity of space by calling the memory allocation function once and include redisObject and SDSHDR structures

In addition, all the data of embSTR-encoded string objects is stored in a contiguous memory, so it is better to take advantage of the advantages of caching

redis> SET msg "hello"
OK
redis> OBJECT ENCODING msg
"embstr"
Copy the code

If you want to save a floating-point number into a string object, you will first converts the floating-point number to a string value, and then save the transformation of the proceeds of the string value, if the need of the floating-point calculations, will be stored in a string object in the string value into a floating point value, perform operations to convert the floating-point back again after a string value and save the inside

In addition, embstr-encoded string objects are actually read-only (because no modification program is written for embstr-encoded string objects). If any modification commands need to be executed on an EMbstr-encoded string object, the program converts the encoding from EMbstr to RAW, and then executes the modification commands