Instead of using C’s traditional string representation directly: an array of characters ending in null characters (i.e., “\0”), Redis built an abstract type called Simple Dynamic String (SDS).

1. What is an SDS string

Each SDS. h/ SDSHDR structure represents an SDS value:


struct sdshdr {
    // The length of the string stored by the SDS
    int len;
    // Record the number of unused bytes in the buf array
    int free;
    // An array of bytes to hold strings
    char buf[];
};
Copy the code

Figure 1-1 shows an example of an SDS:

· The value of the free attribute is 0, indicating that this SDS has not allocated any unused space.

· The len attribute has a value of 5, indicating that the SDS holds a five-byte string.

· The buf attribute is an array of type char. The first five bytes of the array hold the characters ‘R’, ‘e’, ‘d’, ‘I’, ‘s’, and the last byte holds the null character ‘\0’.

SDS follow the conventions of C string null terminated save null character 1 byte space does not calculate in SDS len attribute, and allocate an additional 1 byte to null character space, and operations such as add null character to the end of the string is a function of SDS automatically, so the null character is completely transparent for users of SDS. The advantage of following the null-terminated convention is that SDS can directly reuse some of the functions in the C string function library.

C uses a character array of length N+1 to represent a string of length N, and the last element of the character array is always the null character ‘\0’.

For example, Figure 1-2 shows a C string with the value “Redis”.

This simple string representation method used by C language cannot meet the requirements of Redis for string security, efficiency and function. Let’s go over the difference between C strings and SDS.

2. Why Redis chooses to use the SDS string

2.1 The length of the string can be obtained in the constant complexity by SDS

Because C strings do not record their own length information, to get the length of a C string, the program must walk through the entire string, counting every character it encounters until it encounters a null character representing the end of the string, an operation of O (N) complexity.

Unlike C strings, because SDS records the length of the SDS itself in the len attribute, the complexity of getting an SDS length is only O (1).

2.2 Preventing buffer overflows

Since C strings do not record their own length, strcat assumes that the user has allocated enough memory for dest to hold all the contents of the SRC string while executing this function, and if this assumption fails, a buffer overflow occurs.

For example, suppose you have two C strings s1 and S2 next to each other in memory, where S1 holds the string “Redis” and S2 holds the string “MongoDB”, as shown in Figure 2-2.

If a programmer decides to implement:


strcat(s1, " Cluster");
Copy the code

The content of S1 is changed to “Redis Cluster”, but the careless one forgot to allocate enough space for S1 before strcat is executed. After strcat is executed, the data of S1 will overflow into the space of S2, causing the content of S2 to be accidentally modified, as shown in Figure 2-3.

Unlike C strings, the SDS space allocation strategy completely eliminates the possibility of buffer overflows: When the SDS API needs to change the SIZE of the SDS, the API first checks whether the size of the SDS meets the modification requirements. If not, the API automatically expands the size of the SDS to the required size before performing the actual modification. Therefore, you do not need to manually change the size of the SDS. There will be no buffer overflow problems described earlier.

For example, the SDS API also has a function called SDSCat that performs concatenation operations. It concatenates a C string to the end of the string in a given SDS. But before concatenating, sdSCat checks to see if there is enough space for the given SDS. Sdscat expands the space of the SDS before performing the splicing operation.

For example, if we implement:


sdscat(s, " Cluster");
Copy the code

Sds-s is shown in Figure 2-4. Then SDSCat will check whether the length of S is sufficient before splicing. After finding that the current space of S is insufficient for splicing “Cluster”, SDSCat will first expand the space of S before splicing “Cluster”. Figure 2-5 shows the SDS after the splicing.

Figure 2-4 SDS before sdSCAT is executed

Figure 2-5 SDS after SDSCAT is executed

Note the SDS shown in Figure 2-5. Not only did SDSCAT concatenate the SDS, it also allocated 13 bytes of unused space to the SDS, and the concatenated string is exactly 13 bytes long. This phenomenon is neither a bug nor a coincidence, it is related to the SDS space allocation strategy

2.3 Reduce the number of memory reassignments caused by string modification

Because C strings do not record their own length, for a C string containing N characters, the underlying implementation of the C string is always an array of N+1 characters (with an extra character space for empty characters). Because the length of the C string and the bottom there is a correlation between the length of the array, or shorten a C string, so every time growth program is always want to save the C string array redistribution, a memory because memory redistribution involving complex algorithm, and may need to perform a system call, so it is usually a more time-consuming operation.

To avoid this drawback of C strings, SDS disassociate the string length from the underlying array length by using unused space: In SDS, the length of a BUF array is not necessarily the number of characters plus one; the array can contain unused bytes, and the number of bytes is recorded by the SDS free attribute.

SDS implements two optimization strategies, space pre-allocation and lazy space release, through unused space.

2.3.1 Space preallocated

Space preallocation to optimize SDS string growth operations: When the SDS API makes modifications to an SDS that require space expansion, the program not only allocates the space necessary for the modifications, but also allocates additional unused space to the SDS.

Where, the amount of extra allocated unused space is determined by the following formula:

  • If the SDS is modified so that the length of the SDS (the value of the len attribute) is less than 1MB, then the program allocates the same amount of unused space as the len attribute, and then the value of the SDS len attribute will be the same as the value of the free attribute. For example, if the len of SDS is changed to 13 bytes, the program will also allocate 13 bytes of unused space, and the actual length of the BUF array of SDS will be 13+13+1=27 bytes (one extra byte for null characters).

  • If you modify the SDS so that the LENGTH of the SDS is greater than or equal to 1MB, the program allocates 1MB of unused space. For example, if the len of the SDS is changed to 30MB, then the program allocates 1MB of unused space, and the actual length of the BUF array of the SDS is 30MB+1MB+1byte.

With a pre-allocation strategy, Redis can reduce the number of memory reallocations required to perform string growth operations continuously.

2.3.2 Lazy space is released

Lazy space release is used to optimize the string shortening operations of SDS: When the SDS API needs to shorten a string saved by SDS, the program does not immediately use memory reallocation to reclaim the shortened extra bytes. Instead, the program uses the free attribute to record the number of these bytes and wait for future use.

2.4 Binary Security

The characters in a C string must conform to some encoding (such as ASCII) and must not contain null characters except at the end of the string, otherwise the first null character read by the program will be mistaken for the end of the string. These limitations make C strings hold only text data. It cannot save binary data such as pictures, audio, video, and compressed files.

To ensure that Redis can be used ina variety of scenarios, the SDS API is binary-safe. All SDS apis will handle the data stored in the BUF array as binary, without any restrictions, filters, or assumptions. What data is when it is written is what it is when it is read.

This is why the BUF property of an SDS is called a byte array — Redis uses this array not to hold characters, but to hold a list of binary data.

2.5 Compatible with some C string functions

While the SDS apis are binary-safe, they also follow the null-terminated convention of C strings: These apis always set the end of the data stored by the SDS to a null character, and always allocate an extra byte to accommodate this null character when allocating space in the BUF array, so that SDS that hold text data can reuse some of the functions defined by the <string.h> library.

3 summary

Table 3-1 summarizes the differences between C strings and SDS.

Table 3-1c Differences between character strings and SDS

Reference: Redis Design and Implementation, Huang Jianhong Machine Press