Background

Squirrel is a caching system built by Meituan's technology team on top of Redis Cluster. Through continuous iteration it has grown into a complete automated operations platform, covering one-click cluster operations, fine-grained monitoring, automatic capacity expansion, hot-key monitoring and other solutions, while the servers are deployed with Docker to maximize operational flexibility. Since its launch in 2015, the distributed cache Squirrel has been widely used across Meituan, with storage exceeding 60 TB and daily calls exceeding one trillion, gradually becoming one of Meituan's most important cache systems.

As usage volume and scenarios grew, the Squirrel team ran into several pitfalls and shortcomings in Redis, and kept improving Redis to support the rapidly growing business needs inside Meituan. This article shares some of the pitfalls of the Redis Rehash mechanism that we hit in operations, together with our solutions; the phenomenon of packet loss on physical machines under high load and its solution have already been covered in an earlier post, and interested readers can refer to "Redis interrupt optimization under high load".

Cases

A large number of Key evictions triggered by Rehash when Redis is at full capacity

Let's first look at a monitoring chart (above; a real case from our production environment). With Redis at full capacity under an eviction policy, both the Master and the Slave evicted a large number of keys, resulting in Master/Slave inconsistency.

Locating the root cause

The Slave uses less memory than the Master by the size of the replication backlog buffer (repl-backlog, 128 MB in our production setup). Normally, when the Master's memory reaches its limit, keys are evicted according to the eviction policy and the evictions are synchronized to the Slave, so the Slave should never trigger eviction on its own because of full memory.

Based on past experience, troubleshooting first focused on whatever could cause a sudden increase in Slave memory, i.e. all the external factors that can make Redis memory spike, such as client connections, input/output buffers, business access patterns, and network jitter. Neither Redis monitoring nor the service-link monitoring located the problem.

So we combed through the Redis source code and turned to an important internal mechanism that consumes extra memory: Redis Rehash.

Redis Rehash internal implementation

In Redis, key-value pairs are stored in a Dict, and the underlying dictionary is implemented with hash tables: the key-value pairs in the dictionary are held by the hash table's nodes. Similar to HashMap in Java, a Key is mapped to a hash table slot through a hash function.

Let's take a step-by-step look at the mechanics and process of Redis Dict Rehash.

(1) Redis hash table structure:

/* Hash table structure definition */
typedef struct dictht {
    dictEntry **table;      // Hash table array
    unsigned long size;     // Hash table size
    unsigned long sizemask; // Hash table size mask (size - 1)
    unsigned long used;     // Number of nodes in the hash table
} dictht;

Instantiate an empty hash table of size 4 (the Redis default):

(2) The table array of the Redis hash table stores the hash bucket structure (dictEntry), which holds the Redis key-value pairs. As in the Java HashMap implementation, Redis's dictEntry also resolves hash conflicts with linked lists (next pointers):

/* Hash bucket */
typedef struct dictEntry {
    void *key;               // Key
    union {                  // Value
        void *val;           // Custom type
        uint64_t u64;        // Unsigned integer
        int64_t s64;         // Signed integer
        double d;            // Floating point
    } v;
    struct dictEntry *next;  // Points to the next hash table node
} dictEntry;

(3) Redis Dict defines two hash tables, so that the second can be used for rehash when the dictionary is later expanded:

/* Dictionary structure definition */
typedef struct dict {
    dictType *type;   // Dictionary type
    void *privdata;   // Private data
    dictht ht[2];     // Two hash tables
    long rehashidx;   // Rehash progress; -1 means no rehash is in progress
    int iterators;    // Number of iterators currently running
} dict;

To sum up:

  • In Cluster mode, a Redis instance corresponds to one RedisDB (db0);
  • A RedisDB corresponds to one Dict;
  • A Dict corresponds to two dicthts; normally only ht[0] is used, while ht[1] is used during Rehash.

Above, we reviewed how Redis stores key-value pairs. (Redis has other internal structures as well, but they are not related to Rehash.)

We know that when the load factor of a HashMap (driven by hash collisions) exceeds a certain threshold, a Resize is performed for performance reasons. Redis behaves the same way. Let's look at the implementation in Redis:

/* Extend the dictionary according to the relevant trigger conditions */
static int _dictExpandIfNeeded(dict *d)
{
    if (dictIsRehashing(d)) return DICT_OK;  // If a rehash is already in progress, return directly
    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);  // If ht[0] is empty, create and initialize it
    /* Expand when the load factor (ht[0].used / ht[0].size) >= 1 and resizing is allowed,
     * or when the ratio exceeds dict_force_resize_ratio */
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht[0].used*2);   // Expand the dictionary to twice the number of used nodes
    }
    return DICT_OK;
}
...
/* Calculate the index of the bucket that will store the Key */
static int _dictKeyIndex(dict *d, const void *key)
{
    unsigned int h, idx, table;
    dictEntry *he;

    /* Check whether the hash table needs to be expanded; return -1 on failure */
    if (_dictExpandIfNeeded(d) == DICT_ERR)
        return -1;
    /* Compute the hash value of the Key */
    h = dictHashKey(d, key);
    for (table = 0; table <= 1; table++) {
        idx = h & d->ht[table].sizemask;  // Calculate the bucket index for the Key
        /* Check whether the Key already exists in this bucket */
        he = d->ht[table].table[idx];
        /* Walk the bucket's linked list */
        while (he) {
            if (key == he->key || dictCompareKeys(d, key, he->key))
                return -1;
            he = he->next;
        }
        if (!dictIsRehashing(d)) break;  // If not rehashing, ht[1] does not need to be scanned after ht[0]
    }
    return idx;
}
...
/* Insert a Key into the hash table */
dictEntry *dictAddRaw(dict *d, void *key)
{
    int index;
    dictEntry *entry;
    dictht *ht;

    if (dictIsRehashing(d)) _dictRehashStep(d);  // If the hash table is rehashing, perform one step of rehash

    /* Call _dictKeyIndex() to check whether the key exists; return NULL if it does */
    if ((index = _dictKeyIndex(d, key)) == -1)
        return NULL;

    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    entry = zmalloc(sizeof(*entry));   // Allocate memory for the new node
    entry->next = ht->table[index];    // Insert the node at the head of the linked list
    ht->table[index] = entry;          // Update the bucket pointer
    ht->used++;                        // Update the used counter

    /* Set the key for the new node */
    dictSetKey(d, entry, key);
    return entry;
}
...
/* Add a new key-value pair */
int dictAdd(dict *d, void *key, void *val)
{
    dictEntry *entry = dictAddRaw(d, key);  // Add the new key

    if (!entry) return DICT_ERR;   // If the key already exists, return failure
    dictSetVal(d, entry, val);     // Otherwise set the node's value
    return DICT_OK;
}

The dictExpand() implementation:

int dictExpand(dict *d, unsigned long size)
{
    dictht n; // New hash table
    unsigned long realsize = _dictNextPower(size);  // Compute the size of the new hash table (see _dictNextPower() below)

    /* Return an error if a rehash is in progress or the requested size is smaller than the number of nodes in use */
    if (dictIsRehashing(d) || d->ht[0].used > size)
        return DICT_ERR;

    /* Also return an error if the computed size equals the existing hash table size */
    if (realsize == d->ht[0].size) return DICT_ERR;

    /* Initialize the new hash table */
    n.size = realsize;
    n.sizemask = realsize - 1;
    n.table = zcalloc(realsize*sizeof(dictEntry*));  // Allocate the array of dictEntry pointers
    n.used = 0;

    /* If ht[0] is empty, initialize ht[0] as the hash table for the current key-value pairs */
    if (d->ht[0].table == NULL) {
        d->ht[0] = n;
        return DICT_OK;
    }

    /* If ht[0] is not empty, initialize ht[1] as the new hash table and enable incremental (progressive) rehash */
    d->ht[1] = n;
    d->rehashidx = 0;
    return DICT_OK;
}
...
static unsigned long _dictNextPower(unsigned long size)
{
    unsigned long i = DICT_HT_INITIAL_SIZE;  // Initial hash table size: 4

    if (size >= LONG_MAX) return LONG_MAX;
    /* Compute the new hash table size: the first power of two greater than or equal to size */
    while (1) {
        if (i >= size)
            return i;
        i *= 2;
    }
}

To summarize the specific logical implementation:

It can be confirmed that when the load of the Redis hash table reaches the trigger condition, dictExpand() is called to expand the HashTable.

DICT_HT_INITIAL_SIZE is 4, and by the expression above the expansion size is the smallest 4*2^n that is greater than or equal to ht[0].used*2. In other words, ht[1].size is the first power of two greater than or equal to ht[0].used*2.

Redis creates the dictionary with dictCreate(). At initialization the table pointers are NULL, so the two hash tables ht[0].table and ht[1].table only get memory for their dictEntry pointer arrays when dictExpand() expands the dictionary.

From the above, when Redis triggers a Resize it dynamically allocates a block of memory that ht[1].table ends up pointing to. The size of that allocation is realsize * sizeof(dictEntry*): the table is an array of dictEntry* pointers, each 8 bytes on a 64-bit OS, so ht[1].table occupies 8 * 2^n bytes, where 2^n is the new table size.

Hash table size vs. memory requested on Resize:

ht[0].size    Memory that ht[1].table needs when a Resize is triggered
4             64 bytes
8             128 bytes
16            256 bytes
...           ...
65536         1024 KB
...           ...
8388608       128 MB
16777216      256 MB
33554432      512 MB
67108864      1024 MB
...           ...
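The mapping above can be reproduced with a short standalone sketch (not Redis code; it simply applies the _dictNextPower() rule and assumes 8-byte pointers, i.e. a 64-bit build):

#include <stdio.h>

#define DICT_HT_INITIAL_SIZE 4

/* Smallest power of two (starting at 4) >= size, like Redis's _dictNextPower(). */
static unsigned long next_power(unsigned long size) {
    unsigned long i = DICT_HT_INITIAL_SIZE;
    while (i < size) i *= 2;
    return i;
}

int main(void) {
    unsigned long sizes[] = {4, 8, 16, 65536, 8388608, 16777216, 33554432, 67108864};
    for (int i = 0; i < (int)(sizeof(sizes)/sizeof(sizes[0])); i++) {
        /* Expansion triggers when used >= size, so the new table holds next_power(size*2) slots. */
        unsigned long newsize = next_power(sizes[i] * 2);
        unsigned long bytes = newsize * sizeof(void *);   // ht[1].table allocation
        printf("ht[0].size=%8lu -> ht[1].table allocation: %lu bytes\n", sizes[i], bytes);
    }
    return 0;
}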

Reproduction and verification

Let's verify the actual memory usage during a Redis Rehash with data from our test environment.

In the two figures above, the number of keys in Redis crosses the critical point that triggers a Resize. Once the total number of keys stabilizes and the Rehash completes, Redis memory (on the Slave) drops from 3586 MB to 3522 MB: 3586 - 3522 = 64 MB. This confirms that between triggering the Resize and completing it, Redis holds on to that extra memory for a while, and the amount matches the sizes listed in the table above.

Take a closer look at Redis internal statistics:

/* Dict state when the Redis node holds about 8 million keys: only ht[0] exists. */
[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 8388608
 number of elements: 8003582
 different slots: 5156314
 max chain length: 9
 avg chain length (counted): 1.55
 Chain length distribution:
   0: 3232294 (38.53%)
   1: 3080243 (36.72%)
   2: 1471920 (17.55%)
   3: 466676 (5.56%)
   4: 112320 (1.34%)
   5: 21301 (0.25%)
   6: 3361 (0.04%)
   7: 427 (0.01%)
   8: 63 (0.00%)
   9: 3 (0.00%)

/* Dict state when the Redis node holds about 8.4 million keys and is rehashing: both ht[0] and ht[1] exist. */
[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 8388608
 number of elements: 8019739
 different slots: 5067892
 max chain length: 9
 avg chain length (counted): 1.58
 Chain length distribution:
   0: 3320716 (39.59%)
   1: 2948053 (35.14%)
   2: 1475756 (17.59%)
   3: 491069 (5.85%)
   4: 123594 (1.47%)
   5: 24650 (0.29%)
   6: 4135 (0.05%)
   7: 553 (0.01%)
   8: 78 (0.00%)
   9: 4 (0.00%)
Hash table 1 stats (rehashing target):
 table size: 16777216
 number of elements: 384321
 different slots: 305472
 max chain length: 6
 avg chain length (counted): 1.26
 Chain length distribution:
   0: 16471744 (98.18%)
   1: 238752 (1.42%)
   2: 56041 (0.33%)
   3: 9378 (0.06%)
   4: 1167 (0.01%)
   5: 119 (0.00%)
   6: 15 (0.00%)

/* Dict state when the Redis node holds about 8.4 million keys, after the Rehash has completed: ht[0].size has grown from 8388608 to 16777216. */
[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 16777216
 number of elements: 8404060
 different slots: 6609691
 max chain length: 7
 avg chain length (counted): 1.27
 Chain length distribution:
   0: 10167525 (60.60%)
   1: 5091002 (30.34%)
   2: 1275938 (7.61%)
   3: 213024 (1.27%)
   4: 26812 (0.16%)
   5: 2653 (0.02%)
   6: 237 (0.00%)
   7: 25 (0.00%)

By digging into the internals of Redis Rehash and combining Redis state monitoring with Redis's internal statistics, we can conclude that when the total number of keys in a Redis node reaches the critical point, Redis triggers a Dict expansion and Rehash, and requests the corresponding amount of memory for the expanded table.

As analyzed above, Redis Rehash was the root cause of the massive evictions triggered on both Master and Slave while Redis was at full capacity under an eviction policy.

Besides triggering evictions when Redis is at full capacity, Redis Rehash can cause a number of other problems:

  • If the tablesize is out of step with the number of keys actually stored, then after a master/slave switchover the new slave rebuilds its tablesize to match the existing keys during the full resynchronization, leading to memory skew between the nodes;
  • A shard in a Redis Cluster may resize earlier than the others because it holds a disproportionate number of keys, leading to uneven memory usage across the cluster's shards; and so on.

Redis Rehash mechanism optimization

So how can we avoid the Redis jitter caused by Rehash when Redis is at full capacity under an eviction policy?

  • In the source code of the Redis Rehash logic, we add a judgment condition: if the remaining free memory is not enough to hold the Rehash, the Resize operation is simply not performed (see the sketch below);
  • It can also be avoided operationally, for example by taking the memory consumed by Rehash into account during capacity estimation, or by monitoring and expanding capacity ahead of time.
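A minimal sketch of the first approach is shown below. It mirrors _dictExpandIfNeeded(); used_memory() and max_memory() are hypothetical stand-ins for zmalloc_used_memory() and server.maxmemory (how they get wired into dict.c is omitted), so this illustrates the idea rather than the exact patch:

/* Sketch: skip expansion when the new ht[1].table would not fit into the
 * remaining memory budget. used_memory()/max_memory() are hypothetical
 * wrappers around zmalloc_used_memory() and server.maxmemory. */
static int _dictExpandIfNeeded(dict *d)
{
    if (dictIsRehashing(d)) return DICT_OK;
    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);

    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
    {
        size_t needed = _dictNextPower(d->ht[0].used*2) * sizeof(dictEntry*);
        if (max_memory() != 0 && used_memory() + needed > max_memory())
            return DICT_OK;   // Not enough headroom: keep serving from ht[0], skip the Resize
        return dictExpand(d, d->ht[0].used*2);
    }
    return DICT_OK;
}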

Besides the memory management and usage described above, the Redis Rehash mechanism also affects other Redis modules that interact with it. Let's share a second pitfall caused by the Rehash mechanism.

Scan-based Key cleanup misses data because of Rehash

The Squirrel platform provides an API whose backend logic cleans up keys for business teams, implemented with Scan. In production it did not always clean everything up: at a low frequency, some keys matching the pattern were missed by the Scan-and-delete pass and not all keys were removed. With our previous experience, we went straight to the underlying mechanism.

How Scan works

To efficiently match all keys in the database that fit a given pattern, Redis provides the Scan command. Each call returns a batch of keys and a cursor value (the initial value is 0); the returned cursor is fed into the next call, and the iteration ends when the returned cursor is 0 again.
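For reference, a typical client-side Scan loop looks like the sketch below (assuming the hiredis C client; the host, the MATCH pattern "user:*" and COUNT 100 are arbitrary examples):

#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;

    char cursor[32] = "0";
    do {
        redisReply *reply = redisCommand(c, "SCAN %s MATCH %s COUNT %d",
                                         cursor, "user:*", 100);
        if (reply == NULL || reply->type != REDIS_REPLY_ARRAY) break;

        /* element[0] is the next cursor, element[1] is this batch of keys */
        snprintf(cursor, sizeof(cursor), "%s", reply->element[0]->str);
        for (size_t i = 0; i < reply->element[1]->elements; i++)
            printf("%s\n", reply->element[1]->element[i]->str);

        freeReplyObject(reply);
    } while (strcmp(cursor, "0") != 0);

    redisFree(c);
    return 0;
}

The iteration ends only when the server hands back cursor 0 again; the cursor itself encodes the traversal position, which is what the rest of this section examines.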

Redis officially defines Scan features as follows:

  1. Throughout a full traversal, from start to finish, every key that stayed in the Redis dataset the whole time and matches the pattern is returned;
  2. If a rehash occurs, the same element may be returned multiple times, and keys added or removed during the traversal may or may not be returned.

The specific implementation

As mentioned above, Redis keys are stored in a Dict. Normally, all keys could be scanned completely by walking every hash bucket of the Dict once. In practice, however, the Redis Dict is stateful: it changes as keys are added and deleted.

Next, the different implementations of Scan are analyzed according to the four Dict states.

Dict has four state scenarios:

  1. The dictionary tablesize stays unchanged;
  2. The dictionary has completed a Resize that expanded it;
  3. The dictionary has completed a Resize that shrank it;
  4. The dictionary is in the middle of Rehashing.

(1) The tablesize stays unchanged: in this stable state the Dict can simply be traversed in order.
(2) The dictionary has been resized larger: traversing in plain order would scan a large number of duplicate keys. For example, if the tablesize grows from 8 to 16 and bucket 3 has just been visited, buckets 4 to 15 still have to be visited; but after the Dict grew, the data that was in buckets 0 to 3 has been migrated into buckets 8 to 11, so traversing buckets 8 to 11 returns a large number of duplicate keys.
(3) The dictionary has been resized smaller: traversing in plain order would miss a large number of keys. For example, if the tablesize shrinks from 8 to 4 and bucket 3 has just been visited, the traversal would finish on the next call; but after shrinking, the data from buckets 4 to 7 has been migrated into buckets 0 to 3, so those keys are never scanned.
(4) The dictionary is in the middle of Rehashing: a combination of (2) and (3), either rescanning many keys or missing many keys.

So when the Dict is not in a stable state, i.e. it is rehashing, how can Scan guarantee that all of the original keys are traversed while repeating as few of them as possible? Redis Scan solves this by iterating the cursor on the high-order bits of the hash bucket mask (high-order access).

High-order access increments the cursor starting from the highest significant bit of the Dict sizemask (3 significant bits in the figure above), instead of the usual increment on the lowest bit:
Low-order order: 0→1→2→3→4→5→6→7
High-order order: 0→4→2→6→1→5→3→7

Scan uses high-order access so that, while the Dict is rehashing, each key is returned as few times as possible.
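The high-order order above can be reproduced with the same reverse-cursor arithmetic that Redis uses (a standalone sketch, assuming a 64-bit unsigned long; rev() mirrors the bit-reversal helper in dict.c):

#include <stdio.h>

/* Reverse the bits of an unsigned long, mirroring rev() in Redis dict.c. */
static unsigned long rev(unsigned long v) {
    unsigned long s = 8 * sizeof(v);   // 64 bits
    unsigned long mask = ~0UL;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

/* Advance a scan cursor the way dictScan() does for a table with this sizemask. */
static unsigned long next_cursor(unsigned long v, unsigned long sizemask) {
    v |= ~sizemask;   // Set the bits above the mask so the reversed increment carries correctly
    v = rev(v);
    v++;
    v = rev(v);
    return v;
}

int main(void) {
    unsigned long v = 0, sizemask = 7;   // Table size 8 -> mask 0b111
    do {
        printf("%lu ", v);
        v = next_cursor(v, sizemask);
    } while (v != 0);
    printf("\n");                        // Prints: 0 4 2 6 1 5 3 7
    return 0;
}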

For example, suppose the Dict tablesize expands from 8 to 16; let's walk through the Scan order:

  1. Dict(8) is scanned starting from cursor 0;
  2. A Resize happens when cursor 6 is about to be scanned; the table doubles in size and the Rehash completes;
  3. The client then continues iterating Dict(16) from cursor 6;
  4. The remaining scan order is: 6→14→1→9→5→13→3→11→7→15.

As you can see, high-order scanning avoids repeated traversal during a Dict Rehash while still returning all of the original keys. The same holds for dictionary shrinking, which can be viewed as expansion in reverse.

That is the theory behind Scan; let's look at how the Redis source code implements it.

(1) Implementation in non-rehashing state:

    if (!dictIsRehashing(d)) {   // Not rehashing: only ht[0] needs to be scanned
        t0 = &(d->ht[0]);    // ht[0]
        m0 = t0->sizemask;   // Mask

        /* Emit entries at cursor */
        de = t0->table[v & m0];   // Target bucket
        while (de) {
            fn(privdata, de);
            de = de->next;        // Iterate over all nodes in the bucket, handing each to the callback fn()
        }
    }
    ...
    /* Reverse binary iteration -- the essence of the cursor implementation */
    /* Set unmasked bits so incrementing the reversed cursor
     * operates on the masked bits of the smaller table */
    v |= ~m0;

    /* Increment the reverse cursor */
    v = rev(v);
    v++;
    v = rev(v);

    return v;
}

In the source code, Redis computes the cursor with Reverse Binary Iteration, which is exactly what produces the high-order scan order described above.

(2) Implementation in the rehashing state:

    ... else {    // Rehashing: both ht[0] and ht[1] must be scanned
        t0 = &d->ht[0];
        t1 = &d->ht[1];   // Pointers to the two hash tables

        /* Make sure t0 is the smaller and t1 is the bigger table */
        if (t0->size > t1->size) {   // Ensure that t0 is the smaller table
            t0 = &d->ht[1];
            t1 = &d->ht[0];
        }

        m0 = t0->sizemask;
        m1 = t1->sizemask;   // The corresponding masks

        /* Emit entries at cursor */
        /* Iterate over all nodes in the small table t0's bucket */
        de = t0->table[v & m0];
        while (de) {
            fn(privdata, de);
            de = de->next;
        }

        /* Iterate over indices in larger table that are the expansion
         * of the index pointed to by the cursor in the smaller table */
        do {
            /* Emit entries at cursor */
            /* Scan the buckets in t1 that the small-table cursor expands to */
            de = t1->table[v & m1];
            while (de) {
                fn(privdata, de);
                de = de->next;
            }

            /* Increment bits not covered by the smaller mask */
            v = (((v | m0) + 1) & ~m0) | (v & m0);

            /* Continue while bits covered by mask difference is non-zero */
        } while (v & (m0 ^ m1));
    }

    /* Set unmasked bits so incrementing the reversed cursor
     * operates on the masked bits of the smaller table */
    v |= ~m0;

    /* Increment the reverse cursor */
    v = rev(v);
    v++;
    v = rev(v);

    return v;

In the Rehashing state, Redis takes the else branch above and scans both hash tables.

Sort out the logical flow:

When Redis handles dictScan(), the four scenarios above collapse into two logical branches:

  1. Not in the Rehashing state: in this state the Dict is static. For the first three scenarios above, Redis uses the aforementioned Reverse Binary Iteration algorithm: ⅰ. first, reverse the bits of the cursor; ⅱ. add 1 to the reversed value; ⅲ. finally, reverse the result again.

Enumerating from the highest significant bit and carrying downward (i.e. the high-order access described earlier) guarantees that all elements are traversed.

This algorithm already keeps repeated returns to a minimum, but duplicates are still possible in practice. For example, when the Dict shrinks, high buckets are merged into low buckets, and elements already returned from a low bucket may be returned again.

  2. In the Rehashing state: while Redis is rehashing, dictScan() scans both of the dictionary's hash tables in a single call, so that no unmaintainable intermediate state exists. Concretely, after traversing the bucket at the cursor position in the small table, it traverses every position in the large table that the small-table bucket can rehash into, and then returns the traversed elements together with the next small-table cursor.

Locating the root cause

In the Rehashing state, the main cursor-iteration logic is implemented as follows:

            /* Increment bits not covered by the smaller mask */
            v = (((v | m0) + 1) & ~m0) | (v & m0);   // BUG

ⅰ. (v | m0) + 1 adds 1 to the low bits so that a carry propagates into the high bits; ⅱ. & ~m0 then keeps only the part of v above the small mask m0; ⅲ. | (v & m0) restores the original low bits of v. In other words, the low bits stay the same while the high bits are incremented by 1, which is how a small-table bucket is linked to its corresponding large-table buckets.

For example, suppose the Dict tablesize expands from 8 to 32; let's walk through the Scan order:

Dict(8) is scanned starting from cursor 0; a Resize to 4 times the original size happens when cursor 4 is about to be scanned, and the dict enters the Rehashing state; the client first visits bucket 4 in Dict(8), and then moves on to Dict(32): 4→12→20→28.

Here you can see that the large-table buckets are not visited in the binary high-order order described earlier; instead, the significant bits that the large table has beyond the small table are enumerated in plain low-order fashion.

In the large table t1, the extra high bits are produced by adding 1 to them; in other words the scan order adds 1 starting at the lowest of the extra bits and carries upward. This implementation of the rehashing branch works well when the dict is expanding, but when it is shrinking, mixing high-order traversal on one table with low-order traversal on the other can go wrong under certain conditions.

Again as an example, suppose the Dict tablesize shrinks from 32 to 8:

  1. Dict(32) is scanned starting from cursor 0;
  2. A Resize occurs when cursor 20 is about to be scanned; the tablesize becomes 8 and the dict is Rehashing;
  3. The client sends cursor 20, and bucket 4 in Dict(8) is visited;
  4. Then Dict(32) is visited: 20→28;
  5. Finally, cursor 2 is returned.

As you can see, bucket 12 in the large table is never visited: when the large table is traversed, the low-order enumeration of its extra bits skips some buckets.
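To make this concrete, the standalone sketch below (not Redis code; it simply replays the cursor arithmetic with m0 = 7 and m1 = 31, as in the two examples above) compares the original increment with the reverse-cursor increment later used in the fix:

#include <stdio.h>

/* Reverse the bits of an unsigned long, mirroring rev() in Redis dict.c. */
static unsigned long rev(unsigned long v) {
    unsigned long s = 8 * sizeof(v);
    unsigned long mask = ~0UL;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

/* Walk the large-table buckets derived from one small-table cursor, using either
 * the original increment (buggy when shrinking) or the reverse-cursor increment. */
static void walk(unsigned long v, unsigned long m0, unsigned long m1,
                 int fixed, const char *tag) {
    printf("%s:", tag);
    do {
        printf(" %lu", v & m1);
        if (fixed) {
            v |= ~m1;                               // Fixed: reverse-increment on the larger mask
            v = rev(v); v++; v = rev(v);
        } else {
            v = (((v | m0) + 1) & ~m0) | (v & m0);  // Original: keep low bits, +1 on the high bits
        }
    } while (v & (m0 ^ m1));
    printf("\n");
}

int main(void) {
    /* Expansion 8 -> 32 while rehashing, cursor 4: both variants visit the same
     * four buckets {4, 12, 20, 28}, only the order differs. */
    walk(4, 7, 31, 0, "expand, original");   // 4 12 20 28
    walk(4, 7, 31, 1, "expand, fixed");      // 4 20 12 28

    /* Shrink 32 -> 8 while rehashing, cursor 20: the original increment skips bucket 12. */
    walk(20, 7, 31, 0, "shrink, original");  // 20 28
    walk(20, 7, 31, 1, "shrink, fixed");     // 20 12 28
    return 0;
}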

Certain conditions must be met for this to happen:

  1. Scan runs while the Dict is rehashing for a shrink;
  2. The Dict shrinks to at most one quarter of its original tablesize; only then does the large table have at least two more significant bits than the small table, which makes skipping a bucket possible;
  3. The cursor returned before the Rehash started lies outside the range representable by the small table. If it lies inside that range (i.e. no greater than 7 in the example), the increment of the extra high bits starts from 0 and every related bucket is visited; if it lies outside (e.g. 20), the increment of the extra high bits can skip or repeat buckets.

In other words, keys are missed during a Scan only when the three conditions above all hold. If a large number of keys are deleted and the hash table inside Redis shrinks to a quarter (or less) of its original tablesize while the cleanup is running, some keys may be missed.

Scan source code optimization

The fix is to enumerate from the high bits for both tables, i.e. both the small table and the large table use high-order access. The fixed source code is as follows:

unsigned long dictScan(dict *d,
                       unsigned long v,
                       dictScanFunction *fn,
                       dictScanBucketFunction* bucketfn,
                       void *privdata)
{
    dictht *t0, *t1;
    const dictEntry *de, *next;
    unsigned long m0, m1;

    if (dictSize(d) == 0) return 0;

    if (!dictIsRehashing(d)) {
        t0 = &(d->ht[0]);
        m0 = t0->sizemask;

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]);
        de = t0->table[v & m0];
        while (de) {
            next = de->next;
            fn(privdata, de);
            de = next;
        }

        /* Set unmasked bits so incrementing the reversed cursor * operates on the masked bits */
        v |= ~m0;

        /* Increment the reverse cursor */
        v = rev(v);
        v++;
        v = rev(v);

    } else {
        t0 = &d->ht[0];
        t1 = &d->ht[1];

        /* Make sure t0 is the smaller and t1 is the bigger table */
        if (t0->size > t1->size) {
            t0 = &d->ht[1];
            t1 = &d->ht[0];
        }

        m0 = t0->sizemask;
        m1 = t1->sizemask;

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]);
        de = t0->table[v & m0];
        while (de) {
            next = de->next;
            fn(privdata, de);
            de = next;
        }

        /* Iterate over indices in larger table that are the expansion * of the index pointed to by the cursor in the smaller table */
        do {
            /* Emit entries at cursor */
            if (bucketfn) bucketfn(privdata, &t1->table[v & m1]);
            de = t1->table[v & m1];
            while (de) {
                next = de->next;
                fn(privdata, de);
                de = next;
            }

            /* Increment the reverse cursor not covered by the smaller mask.*/
            v |= ~m1;
            v = rev(v);
            v++;
            v = rev(v);

            /* Continue while bits covered by mask difference is non-zero */
        } while (v & (m0 ^ m1));
    }

    return v;
}


Our team has submitted this fix to the official Redis project as a PR: Fix dictScan(): It can’t scan all buckets when dict is decreasing, and it has been merged upstream.

At this point, the two Rehash-related mechanisms, Redis Rehash itself and the Scan implementation built on top of it, have been analyzed and optimized.

Conclusion

This article described two pitfalls of the Redis Rehash mechanism and walked from the symptoms to the underlying principles. To summarize: in the first case, a large number of keys were evicted on production clusters, masters and slaves became inconsistent, and massive timeouts appeared at the business level, hurting service availability, so it is a serious problem that deserves attention. The second case causes a single cleanup pass to miss some data, but the missed keys can be cleaned up by running the Scan again. Note: the source code in this article is based on Redis 3.2.8.

About the authors

Chun Lin joined Meituan in 2017. He has worked in operations since graduation, moving from network engineer to Oracle DBA and then to MySQL DBA.

Zhao Lei joined Meituan in 2017. He has been engaged in research and improvement of the Redis kernel since graduation, and has submitted several optimizations that were adopted by the community.

The Meituan Squirrel technical team is responsible for the research, development, operation and maintenance of Meituan's large-scale distributed cache Squirrel, supporting the fast and stable growth of Meituan's business. The Squirrel team will continue to contribute internal optimizations and findings to the open-source community, giving back in the hope of working with the industry to make Redis strong and prosperous. If you are interested in Redis, please join us: hao.zhu#dianping.com.