The amount of data a Redis Cluster can store and the throughput it can support are closely tied to the number of instances in the cluster. Redis officially gives an upper limit on cluster size: a cluster of 1,000 instances.
So, you might ask, why limit cluster size? In fact, one of the key factors here is that the communication overhead between instances increases as the instance size increases, and the cluster throughput decreases beyond a certain size (such as 800 nodes). As a result, the actual size of the cluster is limited.
In today’s lesson, we will talk about how the communication cost between Cluster instances affects the size of Redis Cluster and how to reduce the communication cost between instances. With today’s content, you can scale up the Redis Cluster with proper configuration while maintaining high throughput.
Instance communication methods and the impact on cluster size
At runtime, each instance in a Redis Cluster stores the Slot mapping table (which Slots belong to which instances) along with its own state information.
In order for each instance in the cluster to know the state information of all other instances, instances communicate with each other according to certain rules. This rule is called the Gossip protocol.
The Gossip protocol works in two ways.
First, some instances are randomly selected from the cluster at a certain frequency. PING messages are sent to the selected instances to check whether they are online and exchange status information. PING messages encapsulate the status information of the instance that sends the message, some other instances, and Slot mapping tables.
Second, when an instance receives a PING message, it sends a PONG message to the instance that sent the PING message. A PONG message contains the same content as a PING message.
The following figure shows PING and PONG messaging between two instances.
The Gossip protocol ensures that, after a certain period of time, each instance in the cluster gets the status information for all other instances.
In this way, even when a new node is added, a node fails, or Slot changes occur, the cluster status can be synchronized on each instance through the transmission of PING and PONG messages.
From this analysis, it is easy to see that the communication cost of the Gossip protocol is driven by two factors: the size of the messages and the frequency at which they are sent.
The larger the messages and the higher the frequency, the higher the communication overhead. To communicate efficiently, we can tune both aspects. Next, let's analyze each in detail.
First, let’s look at the message size of the instance communication.
Gossip message size
The body of a PING message sent by a Redis instance is made up of the clusterMsgDataGossip structure, which is defined as follows:
```c
typedef struct {
    char nodename[CLUSTER_NAMELEN]; // 40 bytes
    uint32_t ping_sent;             // 4 bytes
    uint32_t pong_received;         // 4 bytes
    char ip[NET_IP_STR_LEN];        // 46 bytes
    uint16_t port;                  // 2 bytes
    uint16_t cport;                 // 2 bytes
    uint16_t flags;                 // 2 bytes
    uint32_t notused1;              // 4 bytes
} clusterMsgDataGossip;
```
CLUSTER_NAMELEN and NET_IP_STR_LEN are macros defined as 40 and 46 bytes, respectively, so the fields of the structure add up to 104 bytes.
When an instance sends a Gossip message, besides its own status information, it also carries, by default, the status information of one-tenth of the cluster's instances.
So, for a cluster of 1000 instances, each instance sending a PING message contains status information for 100 instances, which is a total of 10400 bytes. Plus the information sent by the instance itself, a Gossip message is about 10KB.
In addition, so that the Slot mapping table can propagate between instances, each PING message also carries a 16,384-bit Bitmap in which each bit corresponds to one Slot; a bit set to 1 means that Slot belongs to the current instance. This Bitmap takes 16,384 / 8 = 2,048 bytes, i.e., 2KB. Adding the instance status information to the Slot allocation information, a PING message comes to about 12KB.
The PONG message has the same content as the PING message, so its size is about 12KB. After each instance sends a PING message, it also receives a PONG message back, which adds up to 24KB.
While 24 kilobytes may not be large in absolute terms, consider that a typical request an instance handles is only a few kilobytes: the PING/PONG messages the instance transmits to keep the cluster state consistent are larger than a single business request. Moreover, every instance sends PING/PONG messages to other instances. As the cluster grows, the number of heartbeat messages grows with it, occupying part of the cluster's network bandwidth and reducing the throughput available for normal client requests.
Besides heartbeat message size, frequency also affects communication cost: if instances communicate very frequently, the cluster's network bandwidth is constantly occupied. So how frequently do instances in a Redis Cluster communicate?
Frequency of communication between instances
After a Redis Cluster instance starts, by default it randomly selects five instances from its local instance list every second, finds among them the one it has not communicated with for the longest time, and sends that instance a PING message. This is the basic way instances send periodic PING messages.
However, there is a problem: the "longest without communication" instance is picked from only five randomly chosen instances, so it is not guaranteed to be the instance that has gone longest without communication in the entire cluster.
As a result, some instances may never be pinged, and the cluster state they maintain becomes stale.
To avoid this, a Redis Cluster instance also scans its local instance list every 100ms. If it finds an instance whose last PONG message was received more than half of the configuration item cluster-node-timeout ago (i.e., cluster-node-timeout/2), it immediately sends that instance a PING message to refresh the cluster state.
As the cluster size increases, network communication latency between instances increases due to network congestion or traffic competition between different servers. If some instances fail to receive PONG messages from other instances, frequent PING messages will be sent between instances, which will bring additional overhead to the cluster network communication.
To sum up, the number of PING messages a single instance sends per second looks like this:
Number of PING messages sent per second = 1 + 10 × (number of instances whose last PONG was received more than cluster-node-timeout/2 ago)
Here, 1 is the routine PING the instance sends once per second, and 10 reflects the ten checks the instance performs per second, each of which sends a PING to every instance whose PONG message has timed out.
Let me take you through an example of how PING messages take up cluster bandwidth at this frequency.
Suppose that at each 100-millisecond check a single instance finds 10 instances whose PONG messages have timed out. That instance then sends 101 PING messages per second, about 1.2MB/s of bandwidth. If 30 instances in the cluster send messages at this frequency, they consume 36MB/s, cannibalizing the bandwidth the cluster needs to serve normal requests.
So, we’re looking for ways to reduce the communication overhead between instances, and how do we do that?
How to reduce communication overhead between instances?
In principle, to reduce the communication overhead between instances, we could shrink the messages they transmit (PING/PONG messages and Slot allocation information). However, cluster instances rely on those very messages to keep the cluster state consistent; shrinking them would reduce the information exchanged between instances, which works against cluster maintenance. So we cannot take this approach.
So, can we reduce the frequency of sending messages between instances? Let’s analyze it first.
From what we’ve just learned, we now know that there are two frequencies for sending messages between instances.
- Each instance sends a PING message every 1 second. This is not a high frequency, and if you lower it any further, the state of the instances in the cluster may not propagate in time.
- Each instance performs a check every 100 milliseconds and sends a PING message to any instance whose last PONG message was received more than cluster-node-timeout/2 ago. This 100ms interval is the default frequency of Redis's periodic background tasks, and we generally do not need to change it.
Then, only the cluster-node-timeout configuration item can be modified.
The configuration item cluster-node-timeout defines the heartbeat timeout after which a cluster instance is considered faulty; the default is 15 seconds. If cluster-node-timeout is small, then in a large-scale cluster PONG messages will time out frequently, causing instances to perform the "send PING to instances whose PONG timed out" operation 10 times per second.
Therefore, to keep excessive heartbeat messages from crowding the cluster's bandwidth, we can increase cluster-node-timeout, for example to 20 or 25 seconds. This way, PONG timeouts become less common, and a single instance no longer needs to send extra heartbeats 10 times per second.
Of course, we should not set cluster-node-timeout too high either; otherwise, when an instance really does fail, we must wait the full cluster-node-timeout before the failure is detected, which lengthens the actual recovery time and may affect cluster service.
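For example, the adjustment is a single line in redis.conf (the 25-second value mirrors the suggestion above; tune it for your own network and failover requirements):

```
# redis.conf -- raise the failure-detection timeout (default 15000 ms)
# to reduce spurious PONG timeouts in a large cluster. Higher values
# also delay failover, so do not overdo it.
cluster-node-timeout 25000
```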
In order to check whether adjusting the cluster-node-timeout value can reduce the cluster network bandwidth consumed by heartbeat messages, I suggest that you use the tcpdump command to capture the network packets sent by heartbeat messages before and after adjusting the cluster-node-timeout value.
For example, by running the following command, we can grab the heartbeat network packet sent from port 16379 by the instance on the machine 192.168.10.3 and save the contents of the network packet to r1.cap:
```shell
tcpdump -i <NIC name> -w /tmp/r1.cap host 192.168.10.3 and port 16379
```
By analyzing the number and size of the captured packets, you can determine the bandwidth occupied by heartbeat messages before and after adjusting cluster-node-timeout.
Summary
In this lesson, I introduce the Gossip protocol between Redis Cluster instances. When Redis Cluster runs, each instance needs to exchange information through PING and PONG messages. These heartbeat messages contain the status information of the current instance and some other instances, as well as Slot allocation information. This communication mechanism helps all instances in the Redis Cluster have complete Cluster status information.
However, as the size of the cluster increases, so does the traffic between instances. If we blindly expand the Redis Cluster, we may encounter Cluster performance slowdowns. This is because the large number of heartbeat messages between instances in the cluster can cannibalize the bandwidth for processing normal requests. In addition, some instances may not receive PONG messages in a timely manner due to network congestion, and each instance will periodically (10 times per second) detect if this happens and immediately send heartbeat messages to those instances whose PONG messages have timed out. The larger the cluster size is, the higher the probability of network congestion will be, and accordingly, the higher the probability of PONG message timeout will be, which will lead to a large number of heartbeat messages in the cluster and affect the normal cluster service requests.
Finally, a small suggestion: although adjusting the cluster-node-timeout configuration can reduce the bandwidth occupied by heartbeat messages, in practice, unless you truly need a very large cluster, I recommend limiting a Redis Cluster to 400 to 500 instances.
Assuming a single instance can serve 80,000 request operations per second (80,000 QPS) and each primary instance is paired with one replica, 400 to 500 instances can support 16 to 20 million QPS (200 to 250 primary instances × 80,000 QPS). This throughput meets the needs of many business applications.
Question for this lesson
As usual, here is a quick question for you: if we stored cluster instance status and Slot allocation information in a third-party storage system (such as ZooKeeper), the way Codis stores its Slot allocation information, would that have any impact on cluster size?
Welcome to write down your thoughts and answers in the comments area, and we will exchange and discuss together. If you find today’s content helpful, you are welcome to share it with your friends and colleagues. I’ll see you next time.