This is my first article on getting started.

A simple dynamic string

1. Introduction 2. The definition of SDS 3. The difference between SDS and C string 4

1.1 introduction

Instead of directly using C’s traditional string representation (null-terminated character arrays), Redis implements an abstract type called Simple Dynamic String (SDS) and uses SDS as the default string representation for Redis.

1.2 Definition of SDS

Each SDS. h/ SDSHDR structure represents an SDS value:

struct sdshdr {
    // Record the number of bytes used in the buf array
    // is equal to the length of the string held by SDS
    int len;
    // Record the number of unused bytes in the buf array
    int free;
    // An array of bytes to hold strings
    char buf[];
};
Copy the code

Examples of SDS data structures

SDS follow C string to an empty string at the end of the practice, save the empty string 1 byte space does not calculate in SDS len attribute, and allocate an additional 1 byte to null character space, as well as add operation, such as at the end of an empty string is done automatically by SDS function, namely the empty string is completely transparent for users. The advantage of following an empty string end is that SDS can directly reuse functions from the C string function library.

1.3 SDS and C Strings

The figure shows a C string with the value “Redis”.

C uses a character array of length N+1 to represent a string of length N, and the last element of the character array is always the null character ‘\0’. This simple string representation method used by C language cannot meet the requirements of Redis for string security, efficiency and function.

1.4 SDS features

1.4.1 Constant Complexity Obtain the length of the string

Since C strings do not record their own length information, to get the length of a C string, the program must walk through the entire string, counting every character it encounters until it encounters an empty string representing the end of the string, which is O(N). Unlike C strings, because SDS records the length of the SDS itself in the len attribute, the complexity of getting an SDS length is only O(1). The work of setting and updating the LENGTH of the SDS is done automatically by the API of the SDS at execution time. Using SDS does not require any manual modification of the length. That is, Redis’ work to get the length of a string is not a performance bottleneck for Redis.

                An 11-byte SDS

1.4.2 Preventing buffer overflows

Another problem with C strings that do not record their own length is that they are prone to buffer overflow. For example, the

/strcat function concatenates the contents of the SRC string to the end of the dest string: char *strcat(char *dest, const char * SRC); Once no Dest allocates enough memory, a buffer overflow occurs.

Unlike C strings, the SDS space allocation strategy completely eliminates the possibility of buffer overflows.

For example, the SDS API has a function called SDSCat that performs concatenation. It concatenates a C string to the end of the string held by a given SDS. But before concatenating, sdSCat checks to see if there is enough space for the given SDS. Sdscat will first expand the space of the SDS and then perform the splicing operation. For example, if we execute sdscat(s, “Cluster”);

The execution process of sdSCat function is shown in the figure above: The figure shows the SDS before sdSCAT was executed

Perform the operation of splicing “Cluster”, and the SDS after the splicing is completed is as follows:



Note that the SDS above, SDSCAT, not only concatenated the SDS, but also allocated 13 bytes of unused space to the SDS, and the concatenated string is exactly 13 bytes long because it is related to the SDS space allocation strategy. This strategy is explained in the next section.

1.4.3 Reduce the number of memory reallocations caused by string modification

Because there is a correlation between the length of a C string and the length of the underlying array, each time a C string is grown or shortened, the program reallocates memory to the array that holds the C string. For example, the concatenation operation (Append). Before the operation can be performed, the program must manually expand the size of the underlying array by reallocating memory. If a truncation operation (trim) is performed, the part of space that is no longer used must be freed manually after the operation is performed, otherwise a memory leak will occur.

Because of the complexity of the algorithms involved and the system calls that might be required, reallocating memory is a time-consuming operation, but not for a speed-critical database like Redis. To avoid this drawback of C strings, SDS disassociates the string length from the underlying array by using unused space. In SDS, the length of the BUF array is not necessarily the number of characters plus one. The array can contain unused bytes, and the number of bytes is recorded by the SDS free attribute.

SDS implements two optimization strategies, space pre-allocation and lazy space release, through unused space.

1. Pre-allocate space

Space preallocation is used to optimize the string growth operation of SDS. When a modification is made to an SDS that requires space expansion, the program not only allocates all the necessary space for the modification to the SDS, but also allocates additional unused space for the SDS.

Where, the amount of extra allocated unused space is as follows:



2, inert space release

Lazy space frees string shortening operations used to optimize SDS: When the SDS API needs to shorten a string saved by SDS, the program does not immediately use memory reallocation to reclaim the shortened extra bytes. Instead, the program uses the free attribute to record the number of these bytes and wait for future use.

For example, the sdstrim function takes an SDS and a C string and removes all characters that occur in the C string from the left and right sides of the SDS.

For example, do this for the figure abovesdstrim(s, "XY"); // Remove all X's and Y's of the SDS stringAfter the function, the SDS will be modified as shown in the figure below:



Note that the SDS after the sdSTRIm is executed does not release the extra 8 bytes. Instead, the 8 bytes are kept in the SDS as unused space, which can be used if the SDS needs to be grown later. For example, if you want to perform sdSCat (s, “Redis”) operation on S, you do not need to perform memory reallocation to complete the operation: the 8 bytes of space reserved in SDS is enough to splint 6 bytes of” Redis”.

With the lazy space release strategy, SDS avoids the memory reallocation operations required when shortening strings and provides optimization for possible future growth operations. At the same time, THE SDS provides apis that allow us to actually free the SDS ‘unused space when needed, so we don’t have to worry about wasting memory with this strategy.

1.4.4 Binary security

The characters in a C string must conform to some encoding (such as ASCII) and must not contain empty characters except at the end of the empty string, otherwise the first empty string read by the program will be mistaken for the end of the string. However, these limitations make C strings hold only text data. Cannot save binary data such as pictures, audio, video, compressed files.

To ensure that it is applicable to different scenarios, the SDS API is binary-safe. All THE SDS API will process the data stored in the BUF array as binary, without any restrictions or filters. What data is when it is written is what it is when it is read. This is why the BUF attribute of an SDS is called a byte array ————Redis uses this array not to hold characters, but to hold a series of binary data.

1.4.5 Compatibility with some C string functions

While the SDS apis are binary-safe, they also follow the null-ending convention of C strings: these apis always set the end of SDS data to a null character, and always allocate an extra byte to accommodate this null character when allocating space in the BUF array. This is all to avoid unnecessary code duplication by allowing SDS that hold text data to reuse some of the functions defined by the <string.h> library.

1.5 summarize

C string SDS
Get the complexity O(N) of the length of the string Get the complexity of string length O(1)
The API is not secure and can cause buffer overflows The API is secure and does not cause buffer overflows
If the string length is changed N times, the memory must be reallocated N times You need to perform a maximum of N memory reassignments to change the string length for N times
Only text data can be saved You can save text data or binary data
You can use functions from the <string.h> library You can use some of the functions from the <string.h> library