Cabbage Java self study room covers core knowledge

Dynamic string

Redis has five basic data structures: string, list, hash, set, and zset.

The string in Redis is not a traditional C language string (namely character array, hereinafter referred to as C string), but a simple dynamic string (SDS) is constructed by itself, and SDS is represented as the default string of Redis.

For a primer on Redis, read the author’s article on the Path to Becoming a Java Engineer

In Redis, C strings are typically used only where string values do not need to be modified, such as Redis startup logs. Redis requires a string of modifiable character length, so SDS is used to represent a string. Take this example:

127.0.0.1:6379> set msg "hello world"
OK
Copy the code

This is a very simple command that maps the string “hello world” to the MSG key. And “Hello world”, in Redis, is an SDS. After all that SDS, what does this SDS look like? Let’s take a look:

SDS. H:

Struct SDSHDR {// Record the number of bytes used in the buF array // is equal to the length of the string held by the SDS int len; // Record the number of unused bytes in the buF array int free; // A byte array to hold the string char buf[]; };Copy the code

An example of an SDS is shown below:

  1. The value of the free attribute is 0, indicating that the SDS does not have any bytes left to use;
  2. Len is 5, indicating that the SDS stores a string of length 5;
  3. The buf property is an array of type char. The first five bytes of the array hold ‘R’, ‘e’, ‘d’, ‘I’, and ‘s’ respectively. The last byte holds the null character ‘\0’, indicating the end of the string.

2. Advantages of SDS

Reading this, there may still be people who do not understand the benefits of using SDS. Never mind, let’s look at another example. If we look at the following figure, this SDS is different from the SDS shown above, although both save the string “Redis”. However, the length of buF character array and the value saved by free of SDS in the figure below are different from that of SDS in the figure above:

As we all know, Redis is a non-relational in-memory database, its value is very easy to change. We also know that in C, the length of the character array cannot be changed. If the string is used in the Redis C string, rather than the SDS, when we change a key of the string, if the new string length is less than or equal to the length of the string, the original so we just replace the content on the character array, and then advance the representative at the end of the string (if the old and the new string length is equal, The empty string remains where it was. But if the length of the new string is larger than the length of the old string, then unfortunately we have to apply a new array that can hold the new string, which is not good for Redis.

When Redis creates an SDS object for a string, it usually applies for an array of bytes longer than the string (BUF). Redis stores the string in the array, along with len, the length of the string, and free, the number of bytes not yet used by BUF.

Struct SDSHDR {// Record the number of bytes used in the buF array // is equal to the length of the string held by the SDS int len; // Record the number of unused bytes in the buF array int free; // A byte array to hold the string char buf[]; };Copy the code

When a client asks to change the string corresponding to a key, if the buF length is greater than the length of the new string, there is no need to declare a new array to hold the new string.

Let’s look at the structure SDSHDR, which contains the free, Len, and BUF fields. So is this len a little bit redundant? Since free already records the number of unused bytes, and len we can calculate string length by strlen(buf)strlen(buf)strlen(buf) is len really redundant? No, if you’ve ever used Java or Python or some other high-level language, it’s pretty easy to call a function to get the length of a string, and the internal string implementations of these high-level languages also record the length of a string.

If we want to calculate the length of a string, we call strlen(buf)strlen(buf)strlen(buf) strlen(buf) every time. This operation is O(N)O(N)O(N) O(N).

As you can see from the figure above, when you compute the length of a C string by strlen(buf)strlen(buf)strlen(buf), the cursor will traverse to the null character before it stops, and in programming, you will use a certain string length repeatedly in some business scenarios. So, in SDS, that’s what len does, we only have to calculate the length of the string once, and when we need it, we just take it from Len, and the time complexity is O(1)O(1)O(1).

In addition to the time complexity of retrieving the length of a string, another problem with C strings not recording their length is the vulnerability to buffer overflows. The STRcat function in C can concatenate two strings as follows:

#include <string.h>
char *strcat(char *dest, const char *src);
Copy the code

The strcat function can concatenate the contents of the SRC string to the end of the dest string, but since C itself does not record string lengths, it is assumed that dest has allocated enough memory. For example, suppose you have two strings S1 and S2 next to each other, where S1 holds the string “Redis” and S2 holds the string “MongoDB”, as shown:

Strcat (S1,”Cluster”)strcat(S1, “Cluster”)strcat(S1,”Cluster”) strcat(S1,”Cluster”) In other words, the contents of the character array corresponding to S2 will be modified, as shown in the figure:

Unlike C strings, the SDS space allocation strategy completely eliminates the possibility of buffer overflows. When the SDS API needs to modify the SDS, the API checks whether the SDS space meets the modification requirements. If not, The API automatically expands the SDS space to the size required for execution before performing modification operations.

The SDS API has a sdSCat function that performs the concatenation operation. It can concatenate a C string to the end of the string saved by a given SDS, but before concatenating, SDSCat checks whether the given SDS has enough space. If not, Sdscat expands the SDS space before performing the splice operation. For example, we execute sdscat(S,”Cluster”) sdSCat (S,”Cluster”) sdSCat (s,”Cluster”), where the SDS value s is shown in the figure below:

After checking, SDSCat finds that the current space of S is not enough to splice “Cluster”. Then, SDSCAT will expand s space and perform splice operation of” Cluster”. SDS after splice is shown in the figure:

Redis as in-memory database, demanding is often used to speed, data are frequently modified occasions, if every time to modify the length of the string need to perform a memory redistribution, the execution time for the redistribution of memory alone would take up the greater part of the modified string used in time, if this change occurred frequently, There may also be an impact on performance.

3. Spatial optimization strategy

To avoid the pitfalls of C strings, SDS disassociates the string length from the underlying array length by using no space. In SDS, the length of the BUF array is not necessarily the number of characters plus one. The array can contain unused bytes, and the number of bytes is recorded by the FREE property of SDS. SDS realizes two optimization strategies of space pre-allocation and inert space release through unused space.

3.1. Space pre-allocation

Space preallocation is used to optimize string growth operations in SDS, and we all know that when the SDS API makes changes to an SDS, it allocates some spare space in addition to the required byte space. So, how big is this spare space? The spare space is determined by the following formula:

  • If the length of SDS (the value of len attribute) is less than 1MB after modification, then the program allocates the same amount of unused space as len attribute, then the value of free attribute of SDS will be the same as len attribute. For example, if SDS len is changed to 13 bytes, the program will allocate 13 bytes of spare space, plus one byte for storing empty strings to indicate the end of strings. So the actual length of buF array of SDS is 13B+13B+1B=27B13B+13B+1B=27B13B+13B+1B=27B bytes;
  • If the SDS is greater than or equal to 1MB after modification, the program allocates 1MB more unused time. For example, after modification, len of SDS is 30MB, then the program will allocate 1MB more unused space, and the actual length of BUF array of SDS is 30MB+1MB+1B30MB+1MB+1B30MB+1MB+ 1MB+1B bytes;

3.2. Lazy space release

Lazy space free is used to optimize the string shortening operation of SDS. When the SDS API needs to shorten the string saved by SDS, the program does not immediately use memory reallocation to reclaim the shortened extra bytes, but stores the modified number of unused bytes in FREE for future use.

4. Binary security

C strings must conform to a certain encoding (such as ASCII), and must contain no null characters except the end of the string. Otherwise, the first empty string read by the program will be mistaken for the end of the string. This limits C strings to only hold text data. It cannot save binary data such as pictures, audio, video, and compressed files.

Although databases are generally used to store text data, it is not uncommon to use a database to store binary data. Therefore, to ensure that Redis can be used in a variety of scenarios, SDS apis are binary safe. All SDS apis treat data stored in BUF arrays as binary, and the program does not restrict, filter, or make assumptions about the data. Data is read as it is written.

This is why we call the BUF property of SDS a byte array, because Redis does not use this array to hold characters, but to hold a series of binary data. And in SDS, the end of the character array is not judged by a null character, but by the value of len attribute, as shown in the figure:

Compatible with some C string functions

Although SDS apis are binary safe, they all follow the null-terminating convention of C strings: These apis always set the end of the data stored by SDS to a null character, and always allocate an extra byte of space to hold the null character when assigning space to the BUF array, so that SDS holding text data can reuse some of the functions defined by the <string.h> library. For example, strcasecmp is a function that compares two strings regardless of case, using it to compare a string saved by SDS with another C string.

strcasecmp(sds->buf, "hello world");
Copy the code

Alternatively, we can append what buF holds in SDS to another C string. In this way, there is no need to write another set of functions similar to those in <string.h>.

strcat(c_string, sds->buf);
Copy the code

5. Difference between C string and SDS

C string SDS
The time to get the string length is O(N) The time to get the string length is O(1)
The API is not secure and may cause buffer overflows The API is secure and does not cause buffer overflows
Changing the string length N times requires N memory reallocations Changing the string length N times requires a maximum of N memory reallocations
Only text data can be saved You can save text or binary data
You can use all the functions in the <string.h> library You can use some of the functions in the <string.h> library