1. Background

In the production environment, services that use Redis clusters often need to scale out their nodes as business volume grows.

We learned that after the operations engineers expanded some Redis clusters with a large number of nodes, the business side observed a decline in cluster performance, reflected in a noticeable increase in access latency.

Some services are sensitive to Redis cluster access latency, for example services that read models in real time, or services whose synchronous processes depend on reading from the Redis cluster. An increase in cluster access latency delays the real-time processes of these services, which the business side may not be able to accept.

To find out the root cause of this problem, we investigated the cluster performance degradation after a Redis cluster migration operation.

1.1 Problem Description

The problem scenario this time is as follows: a Redis cluster was expanded, and the business side accesses the cluster using the hiredis-vip client, mainly performing MGET operations.

The business side observed that the latency of accessing the Redis cluster increased.

1.2 Production Environment Description

  • Most Redis versions deployed in production are 3.x or 4.x;

  • Businesses access the Redis clusters with a variety of client types, most commonly Jedis. The client involved in this troubleshooting is hiredis-vip;

  • Redis cluster has a large number of nodes, with a scale of 100+.

  • The cluster had been expanded before the problem appeared.

1.3 Observation

Given the high latency, we investigated the following aspects:

  • Whether the bandwidth was saturated;

  • Whether the CPU usage was too high;

  • Whether the OPS was too high.

Simple monitoring showed that the bandwidth load was not high, but the CPU behaved abnormally:

1.3.1 Comparing OPS and CPU load

Looking at the monitoring for the MGET operations the business reported alongside the CPU load, we found the corresponding curves.

Judging by the timeline, MGET is not directly related to the high CPU load. The business side reported that MGET latency rose across the board, yet here the MGET OPS peaks and the CPU load peaks are staggered.

We can tentatively conclude that there is no direct relationship between the business requests and the CPU load, although the curves suggest an indirect relationship on the same timeline.

1.3.2 Comparing Cluster OPS and CPU load

Since operations colleagues reported that the cluster had been expanded, slot migration must have occurred.

Given that business clients generally cache the slot topology information of the Redis cluster, we suspected that the CLUSTER command had some connection with the CPU load.

We did find some connections:

The CPU usage clearly rises whenever an instance executes the CLUSTER command.

Based on the above observations, we can roughly narrow the focus:

  • The business side performs MGET, and for some reason also executes the CLUSTER command;

  • The CLUSTER command consumes a lot of CPU for some reason and affects other operations;

  • The CLUSTER command is suspected to be the performance bottleneck.

At the same time, several questions need attention:

Why are so many CLUSTER commands being executed?

Why does the CLUSTER command consume so much CPU when executed?

Why is a cluster with a large number of nodes so easily affected by slot migration?

2. Problem Investigation

2.1 Redis hotspot check

We used perf top to run a quick on-site analysis of a Redis instance with high CPU load:

From the output we can see that the function clusterReplyMultiBulkSlots takes up as much as 51.84% of CPU, which is abnormal.

2.1.1 clusterReplyMultiBulkSlots Implementation Principle

We analyze the clusterReplyMultiBulkSlots function:

void clusterReplyMultiBulkSlots(client *c) {
    /* Format: 1) 1) start slot
     *            2) end slot
     *            3) 1) master IP
     *               2) master port
     *               3) node ID
     *            4) 1) replica IP
     *               2) replica port
     *               3) node ID
     *           ... continued until done
     */

    int num_masters = 0;
    void *slot_replylen = addDeferredMultiBulkLength(c);

    dictEntry *de;
    dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
        /* Iterate over all cluster nodes known to the current Redis node */
        clusterNode *node = dictGetVal(de);
        int j = 0, start = -1;

        /* Skip slaves (they are iterated when producing the output of their
         * master) and masters not serving any slot. Replica information is
         * emitted under the corresponding master below. */
        if (!nodeIsMaster(node) || node->numslots == 0) continue;
        for (j = 0; j < CLUSTER_SLOTS; j++) {
            /* Iterate over all slots, checking each against the current node */
            int bit, i;
            /* Check whether the current node serves the slot of this iteration */
            if ((bit = clusterNodeGetSlotBit(node,j)) != 0) {
                if (start == -1) start = j;
            }
            /* Once a contiguous range of slots served by this node ends
             * (a non-matching slot is hit, or the last slot is reached),
             * emit that range */
            if (start != -1 && (!bit || j == CLUSTER_SLOTS-1)) {
                int nested_elements = 3; /* slots (2) + master addr (1). */
                void *nested_replylen = addDeferredMultiBulkLength(c);

                if (bit && j == CLUSTER_SLOTS-1) j++;

                /* If slot exists in output map, add to it's list.
                 * else, create a new output map for this slot */
                if (start == j-1) {
                    addReplyLongLong(c, start); /* only one slot; low==high */
                    addReplyLongLong(c, start);
                } else {
                    addReplyLongLong(c, start); /* low */
                    addReplyLongLong(c, j-1);   /* high */
                }
                start = -1;

                /* First node reply position is always the master */
                addReplyMultiBulkLen(c, 3);
                addReplyBulkCString(c, node->ip);
                addReplyLongLong(c, node->port);
                addReplyBulkCBuffer(c, node->name, CLUSTER_NAMELEN);

                /* Remaining nodes in reply are replicas for slot range */
                for (i = 0; i < node->numslaves; i++) {
                    /* The replicas of this master are iterated here so their
                     * info is returned alongside the master's */
                    /* This loop is copy/pasted from clusterGenNodeDescription()
                     * with modifications for per-slot node aggregation */
                    if (nodeFailed(node->slaves[i])) continue;
                    addReplyMultiBulkLen(c, 3);
                    addReplyBulkCString(c, node->slaves[i]->ip);
                    addReplyLongLong(c, node->slaves[i]->port);
                    addReplyBulkCBuffer(c, node->slaves[i]->name, CLUSTER_NAMELEN);
                    nested_elements++;
                }
                setDeferredMultiBulkLength(c, nested_replylen, nested_elements);
                num_masters++;
            }
        }
    }
    dictReleaseIterator(di);
    setDeferredMultiBulkLength(c, slot_replylen, num_masters);
}
 
/* Return the slot bit from the cluster node structure. */
/* This function checks whether the specified slot is served by the given clusterNode */
int clusterNodeGetSlotBit(clusterNode *n, int slot) {
    return bitmapTestBit(n->slots,slot);
}

/* Test bit 'pos' in a generic bitmap. Return 1 if the bit is set,
 * otherwise 0. */
/* Used to determine whether the bit at the specified position of the bitmap is 1 */
int bitmapTestBit(unsigned char *bitmap, int pos) {
    off_t byte = pos/8;
    int bit = pos&7;
    return (bitmap[byte] & (1<<bit)) != 0;
}

typedef struct clusterNode {
    ...
    /* A char array of CLUSTER_SLOTS/8 bytes recording the slots currently
     * allocated to this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    ...
} clusterNode;

Each clusterNode uses the slots[CLUSTER_SLOTS/8] bitmap to record its slot allocation.

clusterNode->slots is an array of CLUSTER_SLOTS/8 bytes, where CLUSTER_SLOTS is the fixed value 16384. Each bit in the array represents one slot: the byte subscripts run from 0 to 2047, while the slots run from 0 to 16383.

bitmapTestBit checks whether the bit at position pos is 1:

  • off_t byte = pos/8: locates the byte that holds position pos. Since a byte has 8 bits, pos/8 tells us which byte to look in; treating the bitmap as an array, this is the corresponding byte subscript.

  • int bit = pos&7: locates the bit within that byte corresponding to position pos. &7 is essentially %8: think of the positions as grouped in eights; the remainder of the last (incomplete) group is the bit's subscript within the byte found above.

  • (bitmap[byte] & (1<<bit)): checks whether the corresponding bit is set in bitmap[byte].

Take slot 10001 as an example.

Slot 10001 corresponds to the byte with subscript 1250, within which the bit with subscript 1 is checked.

In the figure of the clusterNode->slots bitmap, the green square represents bitmap[1250], the byte corresponding to slot 10001, and the red box (bit[1]) marks the position of 1<<bit. bitmap[byte] & (1<<bit) checks whether the bit in the red box is 1; if it is, slot 10001 is marked on the bitmap.
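As a quick check of this arithmetic, here is a minimal standalone program (illustrative only) that computes the byte and bit subscripts for slot 10001:

#include <stdio.h>

int main(void) {
    int pos = 10001;
    printf("byte subscript: %d\n", pos / 8); /* prints 1250 */
    printf("bit subscript:  %d\n", pos & 7); /* prints 1, same as pos % 8 */
    return 0;
}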

In summary, clusterNodeGetSlotBit determines whether a given slot is allocated to the current node. The logic of clusterReplyMultiBulkSlots can therefore be represented as follows:

The steps are:

  • Traverse every node;

  • For each node, traverse all slots and use clusterNodeGetSlotBit to determine whether each traversed slot is allocated to the current node.

Since the CLUSTER SLOTS result must be produced this way, the complexity is <number of cluster master nodes> × <total number of slots>, where the total number of slots is the fixed value 16384.
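To put a number on it for the scale investigated here (a cluster with roughly 100 master nodes): a single CLUSTER SLOTS reply requires about 100 × 16384 ≈ 1.6 million clusterNodeGetSlotBit calls.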

2.1.2 Redis hotspot investigation summary

So far we can see that the latency of the CLUSTER SLOTS command grows linearly with the number of master nodes in the Redis cluster. The cluster we investigated has a particularly large number of master nodes, which explains the long CLUSTER SLOTS latency observed in this production incident.

2.2 Client Troubleshooting

We learned that the operations engineers had expanded the cluster. After an expansion, some keys are inevitably migrated, and accessing them returns MOVED errors.
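For illustration, a MOVED error reply looks like the following (the slot number and address here are made up):

127.0.0.1:7000> GET somekey
(error) MOVED 3999 127.0.0.1:7002

The error carries the slot of the key and the node currently serving it, which is what triggers the client-side topology refresh described below.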

A quick look at the hiredis-vip client code currently in use shows how it handles MOVED. Since most other businesses use the Jedis client, the corresponding handling in Jedis is briefly analyzed as well.

2.2.1 Implementation Principle of hiredis-vip for MOVED

hiredis-vip's handling of MOVED:

Looking at the call path of cluster_update_route:

cluster_update_route_by_addr performs the CLUSTER SLOTS operation. As we can see, when MOVED is received, hiredis-vip updates the Redis cluster topology, with the following characteristics:

  • Nodes are keyed by ip:port, and the hash method is the same everywhere. If the cached cluster topology is the same, multiple clients easily access the same node at the same time;

  • If a node fails to respond, the iterator moves on to the next node. Combined with the point above, multiple clients easily end up accessing the same next node at the same time (see the sketch below).
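The following minimal sketch illustrates this fall-through behavior. It is not hiredis-vip source code; the addresses and the try_cluster_slots stub are made up. The point is that every client walks the same node sequence, so when node k is overloaded, all clients fall through to the same node k+1.

#include <stdio.h>

/* Stands in for "send CLUSTER SLOTS to this node and rebuild the topology";
 * always fails here so that the fall-through order is visible. */
static int try_cluster_slots(const char *addr) {
    printf("CLUSTER SLOTS -> %s ... no reply, trying next node\n", addr);
    return -1;
}

int main(void) {
    /* Every client caches the same node table in the same order. */
    const char *nodes[] = { "10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379" };
    for (int i = 0; i < 3; i++) {
        if (try_cluster_slots(nodes[i]) == 0) break; /* topology refreshed */
    }
    return 0;
}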

2.2.2 Implementation Principle of Jedis for MOVED

A quick look at the Jedis client code shows that renewSlotCache is invoked when a MOVED error is received.

Following the calls to renewSlotCache, we see that in cluster mode, Jedis sends the CLUSTER SLOTS command to re-pull the slot topology of the Redis cluster whenever it receives a MOVED error.

2.2.3 Client Implementation Principle Summary

Since Jedis is a Java Redis client and hiredis-vip is a C Redis client, it is reasonable to infer that this exception-handling mechanism is a common pattern.

The flow for handling MOVED in client cluster mode is as follows:

In summary:

1) Access keys using the slot topology cached by the client;

2) The Redis node returns a normal result:

  • If the access is normal, proceed with subsequent operations;

3) The Redis node returns MOVED:

  • Execute the CLUSTER SLOTS command against a Redis node to update the topology;

  • Access the key again using the new topology (a sketch of this flow follows).
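Below is a minimal sketch of this flow written against the plain hiredis API (redisConnect, redisCommand, and freeReplyObject are real hiredis calls; connect_for_key and refresh_slot_cache are hypothetical placeholders for the client's slot-cache logic, and the address is made up):

#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Hypothetical: a real client maps key -> slot -> cached node here. */
static redisContext *connect_for_key(const char *key) {
    (void)key;
    return redisConnect("127.0.0.1", 7000);
}

/* Step 3: re-pull the slot topology with CLUSTER SLOTS. */
static void refresh_slot_cache(redisContext *c) {
    redisReply *r = redisCommand(c, "CLUSTER SLOTS");
    /* ... a real client would parse r and rebuild its slot -> node map ... */
    if (r) freeReplyObject(r);
}

int main(void) {
    const char *key = "somekey";
    for (int attempt = 0; attempt < 2; attempt++) {
        redisContext *c = connect_for_key(key); /* step 1: cached topology */
        if (!c || c->err) return 1;
        redisReply *reply = redisCommand(c, "GET %s", key);
        if (reply && reply->type == REDIS_REPLY_ERROR &&
            strncmp(reply->str, "MOVED", 5) == 0) {
            refresh_slot_cache(c);  /* topology is stale: refresh ... */
            freeReplyObject(reply);
            redisFree(c);
            continue;               /* ... and retry with the new topology */
        }
        /* Step 2: normal reply, proceed with subsequent operations. */
        if (reply) freeReplyObject(reply);
        redisFree(c);
        break;
    }
    return 0;
}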

2.2.4 Client Troubleshooting Summary

During slot migration, business accesses to keys being migrated return MOVED, and each MOVED drives the client to refresh its slot topology by running CLUSTER SLOTS.

If the hit ratio of the keys being migrated is high, the CLUSTER SLOTS command is executed very frequently. As a result, the Redis cluster keeps executing CLUSTER SLOTS throughout the migration.

2.3 Investigation Summary

Here, combining the behavior of CLUSTER SLOTS on the Redis side with the client handling of MOVED, we can answer the earlier questions:

Why are so many CLUSTER commands being executed?

  • Because of the migration, the business accesses keys that have been MOVED, and the client re-pulls the slot topology information by executing CLUSTER SLOTS.

Why does the CLUSTER command consume so much CPU when executed?

  • Analysis of the Redis source shows that the time complexity of the CLUSTER SLOTS command is proportional to the number of master nodes. The cluster in question has a large number of master nodes, so each execution takes a long time and occupies a lot of CPU.

Why is a cluster with a large number of nodes so easily affected by slot migration?

  • The migration operation inevitably causes some clients to receive MOVED when accessing keys;

  • On receiving MOVED, the client executes the CLUSTER SLOTS command;

  • The cost of the CLUSTER SLOTS command grows with the number of master nodes in the cluster;

  • During slot migration, business access latency therefore increases because of the CLUSTER SLOTS delay, perceived externally as slow command execution.

3. Optimization

3.1 Status Analysis

Given the current situation, it is normal for a client that encounters MOVED to execute CLUSTER SLOTS, since it needs to update the cluster's slot topology to make subsequent cluster accesses more efficient.

Besides Jedis and hiredis-vip, other clients presumably perform similar slot-cache optimizations. Given Redis's cluster access mechanism, there is not much room to optimize this client-side flow.

Therefore, we turned to analyzing the cluster information records on the Redis side.

3.1.1 Redis cluster metadata analysis

Each Redis node in the cluster has some metadata records of the cluster, which are recorded in server.cluster as follows:

typedef struct clusterState {
    ...
    dict *nodes;          /* Hash table of name -> clusterNode structures */
    /* nodes records all cluster nodes, using a dict */
    clusterNode *slots[CLUSTER_SLOTS];
    /* slots is an array recording, for each slot, a pointer to the node serving it */
    ...
} clusterState;

As described in 2.1, the original logic obtains the topology by iterating through slot information for each node.

3.1.2 Analysis of the CLUSTER SLOTS Result

Observe the result from CLUSTER SLOTS:

/* Format: 1) 1) start slot
 *            2) end slot
 *            3) 1) master IP
 *               2) master port
 *               3) node ID
 *            4) 1) replica IP
 *               2) replica port
 *               3) node ID
 *           ... continued until done
 */
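For illustration, a reply in this format looks like the following (addresses and node IDs here are made up):

127.0.0.1:7000> CLUSTER SLOTS
1) 1) (integer) 0
   2) (integer) 5460
   3) 1) "127.0.0.1"
      2) (integer) 7000
      3) "09dbe9720cda62f7865eabc5fd8857c5d2678366"
2) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "127.0.0.1"
      2) (integer) 7001
      3) "67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1"
...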

Observing this format, the reply can instead be produced by traversing server.cluster->slots: that array is updated on every cluster topology change, and it stores the node pointers directly.

3.2 Optimization Scheme

The simple optimization idea is as follows:

  • Traverse the slot pointer array, treating runs of consecutive slots that belong to the same node as contiguous blocks;

  • If the node of the currently traversed slot is the same as that of the previous slot, the current slot lies in the same contiguous slot range of the same node as the previous one;

  • If the node of the currently traversed slot differs from the previous one, the previous contiguous range has ended and can be output, and the current slot starts the next contiguous range.

So a single traversal of server.cluster->slots is sufficient; see the implementation in 3.3.

The time complexity is thus reduced to <total number of slots>.
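For the same 100-master cluster as before, this shrinks the traversal work from roughly 1.6 million bitmap tests to a single pass over the 16384 slot pointers, about a hundredfold reduction in loop iterations.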

3.3 Implementation

The optimization logic is as follows:

void clusterReplyMultiBulkSlots(client *c) {
    /* Format: 1) 1) start slot
     *            2) end slot
     *            3) 1) master IP
     *               2) master port
     *               3) node ID
     *            4) 1) replica IP
     *               2) replica port
     *               3) node ID
     *           ... continued until done
     */
    clusterNode *n = NULL;
    int num_masters = 0, start = -1;
    void *slot_replylen = addReplyDeferredLen(c);

    for (int i = 0; i <= CLUSTER_SLOTS; i++) {
        /* Traverse all slots */
        /* Find start node and slot id. */
        if (n == NULL) {
            if (i == CLUSTER_SLOTS) break;
            n = server.cluster->slots[i];
            start = i;
            continue;
        }

        /* Add cluster slots info when occur different node with start
         * or end of slot. */
        if (i == CLUSTER_SLOTS || n != server.cluster->slots[i]) {
            /* Emit the slot range [start, i-1]; the helper also iterates the
             * replicas under this master and adds their info to the reply */
            addNodeReplyForClusterSlot(c, n, start, i-1);
            num_masters++;
            if (i == CLUSTER_SLOTS) break;
            n = server.cluster->slots[i];
            start = i;
        }
    }
    setDeferredArrayLen(c, slot_replylen, num_masters);
}

The optimized version traverses server.cluster->slots looking for contiguous slot ranges under one node. When the run of slots breaks, it outputs the node information of the completed range together with that of its replicas, then starts on the next contiguous range.
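The helper addNodeReplyForClusterSlot invoked above is not shown in the snippet. A sketch consistent with the per-range output logic of the original function, using the same Redis 6 reply helpers as the optimized code (an approximation, not necessarily the exact committed source):

/* Emit one slot range: the range bounds, the master, then its live replicas. */
void addNodeReplyForClusterSlot(client *c, clusterNode *node,
                                int start_slot, int end_slot) {
    int i, nested_elements = 3; /* slots (2) + master addr (1). */
    void *nested_replylen = addReplyDeferredLen(c);

    addReplyLongLong(c, start_slot);
    addReplyLongLong(c, end_slot);

    /* First node reply position is always the master. */
    addReplyArrayLen(c, 3);
    addReplyBulkCString(c, node->ip);
    addReplyLongLong(c, node->port);
    addReplyBulkCBuffer(c, node->name, CLUSTER_NAMELEN);

    /* Remaining nodes in the reply are replicas for the slot range. */
    for (i = 0; i < node->numslaves; i++) {
        if (nodeFailed(node->slaves[i])) continue;
        addReplyArrayLen(c, 3);
        addReplyBulkCString(c, node->slaves[i]->ip);
        addReplyLongLong(c, node->slaves[i]->port);
        addReplyBulkCBuffer(c, node->slaves[i]->name, CLUSTER_NAMELEN);
        nested_elements++;
    }
    setDeferredArrayLen(c, nested_replylen, nested_elements);
}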

4. Comparison of Optimization Results

We ran a horizontal comparison of the CLUSTER SLOTS command across the two Redis versions.

4.1 Test Environment & Stress-Test Scenario

Operating system: Manjaro 20.2

Hardware configuration:

  • CPU: AMD Ryzen 7 4800H

  • DRAM: DDR4 3200MHz 8G x 2

Redis cluster information:

1) Persistence configuration:

  • AOF disabled;

  • bgsave disabled.

2) Cluster node information:

  • Number of nodes: 100

  • All nodes are primary nodes

Stress-test scenario:

  • Use a benchmark tool to continuously send CLUSTER SLOTS commands to a single cluster node (an example invocation is sketched below);

  • After stress-testing one version, tear down the cluster, redeploy it, and run the test against the next version.
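The article does not record the exact benchmark invocation. Since redis-benchmark sends whatever command is given as trailing arguments, one plausible invocation (host, port, connection count, and request count are assumptions) would be:

redis-benchmark -h 127.0.0.1 -p 7000 -c 50 -n 100000 cluster slots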

4.2 CPU Usage Comparison

The flame graphs were exported with perf. Original version:

After the optimization:

The share of CPU time taken by clusterReplyMultiBulkSlots drops markedly after optimization, basically as expected.

4.3 Time Comparison

For the test, we embedded timing code:

else if (!strcasecmp(c->argv[1]->ptr,"slots") && c->argc == 2) {
    /* CLUSTER SLOTS */
    long long now = ustime();
    clusterReplyMultiBulkSlots(c);
    serverLog(LL_NOTICE,
        "cluster slots cost time:%lld us", ustime() - now);
}

The logs output for comparison are as follows.

Original log output:

37351:M 06 Mar 2021 16:11:39.313 * cluster slots cost time:2061 us

Optimized version log output:

35562:M 06 Mar 2021 16:11:27.862 * cluster slots cost time:168 us

The drop in time consumption is obvious: from 2000+ us to under 200 us. In a cluster of 100 master nodes, the elapsed time fell to 8.2% of the original, basically in line with expectations.

5. Summary

To close, here is a brief account of how the steps described above lead to the conclusion of a performance defect.

Briefly summarizing the investigation and optimization process:

  • In a large Redis cluster, the CLUSTER command caused obvious access latency on some nodes;

  • Profiling the Redis instances with the perf top command showed that clusterReplyMultiBulkSlots consumed an abnormal share of CPU;

  • Analysis of clusterReplyMultiBulkSlots revealed a significant performance problem in this function;

  • After optimizing clusterReplyMultiBulkSlots, performance improved significantly.

From the above investigation and optimization process, we can conclude that Redis had a performance defect in the CLUSTER SLOTS command.

Because of Redis's data sharding mechanism, key access in cluster mode relies on cached slot topology information, so the only place to optimize was CLUSTER SLOTS itself. Redis clusters rarely have this many nodes, which is why the problem had not been apparent before.

In fact, the hiredis-vip logic has problems of its own. As mentioned in 2.2.1, hiredis-vip updates the slot topology by traversing all nodes one by one and running CLUSTER SLOTS against each. If the Redis cluster is large and the business side runs many clients, a chain reaction occurs:

1) If the Redis CLUSTER is large, CLUSTER SLOTS responds slowly;

2) If a node does not respond or returns an error, the hiredis-VIP client will continue to request the next node;

3) The hiredis-vip clients all iterate the nodes of the Redis cluster in the same order (because the cluster information cached by each client is basically the same). With a large client fleet, when one Redis node is blocked, the hiredis-vip clients all move on to the same next Redis node;

4) A large number of hiredis-vip clients thus visit certain Redis nodes one after another. If the Redis nodes cannot withstand such requests, the nodes get overwhelmed one by one under the "traversal" of the mass of hiredis-vip clients:

In conjunction with point 3 above, imagine 10,000 clients accessing the Redis cluster. Because a key with a high hit ratio is migrated, all clients need to update their slot topology. Since the cluster node information cached by all clients is the same, the order in which they traverse the nodes is also the same: the 10,000 clients all execute CLUSTER SLOTS against the same sequence of cluster nodes. Because CLUSTER SLOTS performs poorly in a large cluster, a Redis node easily becomes inaccessible under the flood of client requests, whereupon the clients (for example, 9000+ of them) move on to the next Redis node in the traversal order and execute CLUSTER SLOTS there, blocking each Redis node in turn.

5) As a result, the CPU load of most Redis nodes spikes, and many hiredis-vip clients continue to fail to update their slot topology.

The end result: after slot migration is performed in a large-scale Redis cluster, under access from a large fleet of hiredis-vip clients, the business side perceives increased latency of ordinary commands while the CPU usage of the Redis instances soars. This logic can be optimized.

The optimizations in section 3 above have now been committed and incorporated into Redis 6.2.2.

6. References

1. hiredis-vip: github.com

2. Jedis: https://github.com/redis/jedis

3. Redis: github.com/redis/redis

4. perf: perf.wiki.kernel.org

Author: Yuan Jianwei, Vivo Internet Database Team