This article, the second in the Redis Internal Data Structure Detail series, looks at one of the most used basic data structures in Redis: SDS.

Strings are almost the most used data structure in any programming language. SDS is the String structure that is widely used in Redis. Its full name is Simple Dynamic String. Compared to strings found in other locales, it has the following notable characteristics:

  • Dynamically expandable memory. SDS represents a string whose contents can be modified or appended. In many languages, strings are both mutable and immutable, and SDS is clearly a mutable type.
  • Binary Safe. SDS can store arbitrary binary data, not just printable characters.
  • Compatible with traditional C string types. The implications of this will be discussed in a moment.

Seeing this, many students who are familiar with Redis may have had a question: Redis has exposed a string structure, called string, then what is the relationship between SDS and string? One might guess that string is implemented based on SDS. This conjecture is very close to the truth, but the description is not very accurate. A more detailed analysis of the relationship between String and SDS will be covered later. For the sake of discussion, let’s briefly consider that the underlying implementation of String is SDS.

Before we get into the details of SDS implementation, let’s take a look at some of the major operations supported by String from the Redis consumer’s perspective. Here is an example operation:

All of the above operations are relatively simple, let’s briefly explain:

  • The initial string value is set to “tielei”.
  • Step 3 appends the string to “Tielei Zhang” with the append command.
  • Then set the 53rd bit to 1 using the setbit command. The offset of bit is counted from the left, starting at 0. The 48-55th bit is the middle space character, and its ASCII code is 0x20. After setting the 53rd bit to 1, its ASCII code becomes 0x24, which prints out as’ $’. So now the value of the string is “tielei$zhang”.
  • Finally, getrange is used to fetch the contents from the fifth byte to the first byte to get “Zhang”.

The realization of these commands is partly related to the realization of SDS. Let’s begin our discussion in detail.

SDS data structure definition

As we know, in C, strings are stored as arrays of ‘\0’ characters terminated by the NULL terminator, usually expressed as character Pointers (char *). It does not allow byte 0 to appear in the middle of a string, so it cannot be used to store arbitrary binary data.

We can find the type definition of SDS in SDS.h:

typedef char *sds;Copy the code

Someone must be confused, SDS is the same as char *? As we mentioned earlier, SDS and traditional C strings remain type-compatible, so their type definition is the same, char *. In some cases, where you need to pass in a C string, you can indeed pass in an SDS. However, SDS and CHAR * are not the same. SDS is Binary Safe. It can store arbitrary Binary data. It cannot end a string with the character '\0' as C strings do, so it must have a length field. But where is this length field? In fact, SDS also contains a header structure:

struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};Copy the code

There are five types of headers in SDS. The reason there are five is so that strings of different lengths can use headers of different sizes. In this way, shorter strings can use smaller headers, saving memory.

The complete structure of an SDS string, consisting of two parts adjacent to each other in memory addresses:

  • A header. Usually contains the length of the string (len), the maximum size (alloc), and flags. Sdshdr5 is different.
  • A character array. The length of this character array is equal to the maximum capacity +1. The length of truly valid string data is usually less than the maximum capacity. After the actual string data, there are unused bytes (typically filled with byte zeros), allowing limited backward expansion of the string data without reallocating memory. After the actual string data, there is a NULL terminator, which is the ASCII 0 '\0' character. This is for compatibility with traditional C strings. The reason the character array is 1 byte longer than the maximum size is so that there is still 1 byte to hold the NULL terminator when the string reaches the maximum size.

Except for sdSHdr5, the structure of the other four headers contains three fields:

  • Len: Represents the true length of the string (excluding the NULL terminator).
  • Alloc: Indicates the maximum size of a string (excluding the last extra byte).
  • Flags: Always occupy one byte. The lowest three bits represent the type of the header. There are five types of headers, defined as constants in SDS.h.
#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4Copy the code

SDS data structure, we need to parse it very carefully.

An example of the internal structure of AN SDS is shown above. The figure shows the memory structure of two SDS strings s1 and S2, one using header of type SDSHDR8 and the other using header of type SDSHdr16. But they all represent the same value of a string of length 6: "tielei". Let's look at the code to explain what each part is made of.

The character Pointers to SDS (S1 and S2) point to the beginning of the real data (the character array), and the header is in the lower direction of the memory address. In SDS.h there are some macro definitions related to parsing headers:

#define SDS_TYPE_MASK 7
#define SDS_TYPE_BITS 3
#define SDS_HDR_VAR(T,s) struct sdshdr##T *sh = (void*)((s)-(sizeof(struct sdshdr##T)));
#define SDS_HDR(T,s) ((struct sdshdr##T *)((s)-(sizeof(struct sdshdr##T))))
#define SDS_TYPE_5_LEN(f) ((f)>>SDS_TYPE_BITS)Copy the code

SDS_HDR is used to obtain a pointer to the start position of the header from the SDS string, for example, SDS_HDR(8, S1) represents the header pointer of S1, and SDS_HDR(16, s2) represents the header pointer of S2.

Of course, before using SDS_HDR, we need to know what kind of header it is, so that we know what the first parameter of SDS_HDR should be. The header type is obtained from the SDS character pointer by offsetting the lower address by 1 byte to get the FLAGS field. For example, s1[-1] and S2 [-1] get flags for S1 and S2, respectively. Then take the lowest 3 bits of flags to get the header type.

  • Since S1 [-1] == 0x01 == SDS_TYPE_8, the header type for S1 is SDSHdr8.
  • Since S2 [-1] == 0x02 == SDS_TYPE_16, s2's header type is SDSHdr16.

With the header pointer, you can quickly locate its len and alloc fields:

  • In the header of S1, len is 0x06, indicating that the string data length is 6. The value of alloc is 0x80, indicating that the maximum size of the character array is 128.
  • In the header of S2, len is 0x0006, indicating that the string data length is 6. The value of alloc is 0x03E8, indicating that the maximum size of the character array is 1000. (Note: the figure is constituted by the little end address)

There are a few more things to note in the header type definitions:

  • __attribute__ ((Packed)) is used in the definition of each header to allow the compiler to allocate memory in compact mode. Without this property, the compiler might optimize the alignment of the struct's fields, filling them with empty bytes. In that case, there is no guarantee that the header and SDS data sections are close to each other, nor that the flags field can be retrieved by a fixed offset of 1 byte in the direction of the lower address.
  • There is a char buf[] at the end of each header definition. We notice that this is an array of characters of no specified length, and this is a special way of saying that in C we're going to define an array of characters, and it's just going to act as a flag to indicate that there's an array of characters after the flags field. And when the program allocates memory for the header, it doesn't take up memory. If the sizeof(struct sdshdr16) is evaluated, the result is 5 bytes with no BUf field.
  • Sdshdr5 differs from the other header structures in that it does not contain the alloc field and the length is stored using the higher 5 bits of flags. Therefore, it cannot allocate free space for strings. If the string needs to grow dynamically, it must reallocate memory. Therefore, this type of SDS string is more suitable for storing statically short strings (less than 32 in length).

At this point, we can see very clearly that the header of the SDS string is actually hidden in front of the actual string data (low address direction). Such a definition has the following advantages:

  • Headers and data are adjacent rather than being split into two separate memory Spaces. This helps reduce memory fragmentation and improve memory efficiency.
  • Although headers have multiple types, SDS can be expressed in a uniform char *. And it is type-preserving compatible with traditional C language strings. If an SDS stores printable strings, we can pass it directly to C functions, such as STRCMP for comparing string sizes, or printf for printing.

After making clear the data structure of SDS, its specific operation function is easier to understand.

Some basic functions of SDS

  • Sdslen (const SDS S): Gets the length of the SDS string.
  • Sdssetlen (SDS S, size_t newlen): Sets the length of the SDS string.
  • Sdsinclen (SDS S, size_t inc): Increases the length of the SDS string.
  • Sdsalloc (const SDS S): Gets the SDS string capacity.
  • Sdssetalloc (SDS S, size_t newlen): Sets the size of the SDS string.
  • Sdsavail (const SDS S): Get the free space of the SDS string (that is, alloc - len).
  • SdsHdrSize (char type): Specifies the header size based on the header type.
  • SdsReqType (size_t string_size): The required header type is calculated based on the length of the string data.

Here we pick the code for sdSLen and sdsReqType and take a look.

static inline size_t sdslen(const sds s) { unsigned char flags = s[-1]; switch(flags&SDS_TYPE_MASK) { case SDS_TYPE_5: return SDS_TYPE_5_LEN(flags); case SDS_TYPE_8: return SDS_HDR(8,s)->len; case SDS_TYPE_16: return SDS_HDR(16,s)->len; case SDS_TYPE_32: return SDS_HDR(32,s)->len; case SDS_TYPE_64: return SDS_HDR(64,s)->len; } return 0; } static inline char sdsReqType(size_t string_size) { if (string_size < 1<<5) return="" sds_type_5; ="" if="" (string_size="" <="" 1<<8)="" sds_type_8; ="" 1<<16)="" sds_type_16; ="" 1ll<<32)="" sds_type_32; ="" sds_type_64; ="" }<="" pre="">

Similar to the previous analysis, sdSLen first offsets the lower address by 1 byte with S [-1] to get flags. Then, bitwise and SDS_TYPE_MASK are used to obtain the header type. Then, depending on the different header types, SDS_HDR is called to get the header starting pointer, and hence the len field.

From the sdsReqType code, it is easy to see:

  • The length is between 0 and 2^5-1. Use the SDS_TYPE_5 header.
  • The length is between 2^5 and 2^8-1. Use the SDS_TYPE_8 header.
  • The length is between 2^8 and 2^16-1. Use the SDS_TYPE_16 header.
  • The length is between 2^16 and 2^32-1. Use the SDS_TYPE_32 header.
  • If the length is greater than 2^32, select the SDS_TYPE_64 header. The maximum length that can be represented is 2^64-1.

Note: The implementation code for sdsReqType, up until 3.2.0, had a problem with length boundary values until recently fixed by commit 6032340 on 3.2 Branch.

Creation and destruction of SDS

sds sdsnewlen(const void *init, size_t initlen) { void *sh; sds s; char type = sdsReqType(initlen); /* Empty strings are usually created in order to append. Use type 8 * since type 5 is not good at this. */ if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8; int hdrlen = sdsHdrSize(type); unsigned char *fp; /* flags pointer. */ sh = s_malloc(hdrlen+initlen+1); if (! init) memset(sh, 0, hdrlen+initlen+1); if (sh == NULL) return NULL; s = (char*)sh+hdrlen; fp = ((unsigned char*)s)-1; switch(type) { case SDS_TYPE_5: { *fp = type | (initlen << SDS_TYPE_BITS); break; } case SDS_TYPE_8: { SDS_HDR_VAR(8,s); sh->len = initlen; sh->alloc = initlen; *fp = type; break; } case SDS_TYPE_16: { SDS_HDR_VAR(16,s); sh->len = initlen; sh->alloc = initlen; *fp = type; break; } case SDS_TYPE_32: { SDS_HDR_VAR(32,s); sh->len = initlen; sh->alloc = initlen; *fp = type; break; } case SDS_TYPE_64: { SDS_HDR_VAR(64,s); sh->len = initlen; sh->alloc = initlen; *fp = type; break; } } if (initlen && init) memcpy(s, init, initlen); s[initlen] = '\0'; return s; } sds sdsempty(void) { return sdsnewlen("",0); } sds sdsnew(const char *init) { size_t initlen = (init == NULL) ? 0 : strlen(init); return sdsnewlen(init, initlen); } void sdsfree(sds s) { if (s == NULL) return; s_free((char*)s-sdsHdrSize(s[-1])); }Copy the code

Sdsnewlen creates an SDS string of length initlen and initializes the data with the array of characters pointed to by init (arbitrary binary data). If init is NULL, all zeros are used to initialize the data. In its implementation, we need to pay attention to:

  • If you want to create an empty string of length 0, use the SDS_TYPE_8 header instead of the SDS_TYPE_5 header. This is because creating an empty string is likely to be followed by appending data, but SDS strings of type SDS_TYPE_5 are not suitable for appending data (which causes memory reallocation).
  • The required memory space is allocated once, which consists of three parts: header, data, and the last extra byte (hdrlen+initlen+1).
  • The initialized SDS string data is appended with a NULL terminator (s[initlen] = '\0').

One thing to note about sDSfree is that memory needs to be freed as a whole, so you compute the header starting pointer and pass it to the s_free function. This pointer is also the address returned by the call s_malloc in sdsnewlen.

The join (append) operation of SDS

sds sdscatlen(sds s, const void *t, size_t len) {
    size_t curlen = sdslen(s);

    s = sdsMakeRoomFor(s,len);
    if (s == NULL) return NULL;
    memcpy(s+curlen, t, len);
    sdssetlen(s, curlen+len);
    s[curlen+len] = '\0';
    return s;
}

sds sdscat(sds s, const char *t) {
    return sdscatlen(s, t, strlen(t));
}

sds sdscatsds(sds s, const sds t) {
    return sdscatlen(s, t, sdslen(t));
}

sds sdsMakeRoomFor(sds s, size_t addlen) {
    void *sh, *newsh;
    size_t avail = sdsavail(s);
    size_t len, newlen;
    char type, oldtype = s[-1] & SDS_TYPE_MASK;
    int hdrlen;

    /* Return ASAP if there is enough space left. */
    if (avail >= addlen) return s;

    len = sdslen(s);
    sh = (char*)s-sdsHdrSize(oldtype);
    newlen = (len+addlen);
    if (newlen < SDS_MAX_PREALLOC)
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;

    type = sdsReqType(newlen);

    /* Don't use type 5: the user is appending to the string and type 5 is
     * not able to remember empty space, so sdsMakeRoomFor() must be called
     * at every appending operation. */
    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type);
    if (oldtype==type) {
        newsh = s_realloc(sh, hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        s = (char*)newsh+hdrlen;
    } else {
        /* Since the header size changes, need to move the string forward,
         * and can't use realloc */
        newsh = s_malloc(hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh+hdrlen, s, len+1);
        s_free(sh);
        s = (char*)newsh+hdrlen;
        s[-1] = type;
        sdssetlen(s, len);
    }
    sdssetalloc(s, newlen);
    return s;
}Copy the code

Sdscatlen appends any binary data of length len to t to the end of the SDS string S. The append command for String, as demonstrated at the beginning of this article, is implemented internally by calling sdSCATlen.

In the sdSCATlen implementation, sdsMakeRoomFor is called first to ensure that the string S has enough space to append data of len length. SdsMakeRoomFor may or may not allocate new memory.

SdsMakeRoomFor is a very important function in SDS. Here are some things to note about its implementation code:

  • If the free space in the original string is sufficient (Avail >= addlen), it does nothing and returns directly.
  • If space needs to be allocated, it will be allocated a little more than is actually requested, in case of further appending. It allocates at least one more SDS_MAX_PREALLOC byte if the string is already long. This constant is defined in SDS.h as (1024*1024)=1MB.
  • Depending on the size of the allocated space, you may need to change the header type (the original header alloc field is too short to express the increased capacity).
  • If the header needs to be replaced, then the entire string space, including the header, needs to be reallocated (s_malloc) and the original data copied to the new location.
  • If you don't need to replace the header (the old header will suffice), a special s_realloc is called to try to reallocate space at the old address. The implementation of s_realloc depends on which Allocator Redis compiled with (jemalloc is the default on Linux). But no matter which realloc implementation it is, it basically says the same thing: it tries to reassign at the original location, and if the original location has enough free space to complete the reallocation, it returns the same new address as the old one. Otherwise, it allocates a new address block and moves the data. Cx/realloc. See man.

From the function interface of sdSCATlen, we can see a usage pattern: when it is called, an old SDS variable is passed in, and then it returns a new SDS variable. Because its internal implementation may cause address changes, the old variables become invalid after the caller is done, and should be replaced with the newly returned variables. Not only the SDSCATlen function, other functions in SDS (such as SDSCpy, SDSTRIm, sDSJoin, etc.), as well as other data structures in Redis that can automatically expand memory (such as Ziplist), are also used in the same way.

On the relationship between SDS and String

Now let's go back to the string operation example given at the beginning of this article.

  • The append operation is implemented using SDSCATlen of SDS. It was mentioned earlier.
  • Both setbit and Getrange fetch the entire SDS string by key, and then select or modify the specified portion of the string. Since AN SDS is just an array of characters, it seems easy to manipulate any part of it.

However, in addition to these operations, string also supports incr, DECr, and so on when the value it stores is a number. So, when a string stores numeric values, is its internal storage still SDS? Actually, no. Also, setbit and Getrange implementations will be different in this case. These details will be discussed systematically in the next article on Robj.



Original article, reproduced please indicate the source!



Links to this article:Zhangtielei.com/posts/blog-...


Copy the code