As an in-memory database, Redis delivers very high performance: a single instance can reach roughly 100,000 QPS. In practice, however, we often run into occasional latency spikes, and if you don't know Redis's internal implementation, you will have no idea where to start troubleshooting.

Most of the time, increased Redis access latency is caused by our own improper usage or unreasonable operations.

In this article, we will analyze the latency problems frequently encountered when using Redis, and how to locate and analyze them.

Using complex commands

How do I troubleshoot a sudden increase in access latency while using Redis?

As a first step, I suggest you check the Redis slow log. Redis can record commands whose execution time exceeds a threshold; by setting the following parameters, you can see which commands took a long time to execute.

First set the slow log threshold: only commands exceeding it will be logged, and the unit is microseconds. For example, set the threshold to 5 milliseconds and keep only the latest 1000 slow log entries:

slowlog-log-slower-than 5000  # threshold in microseconds: log commands slower than 5 ms
slowlog-max-len 1000          # keep the latest 1000 slow log entries

After setting these, you can run SLOWLOG get 5 to query the latest 5 slow log entries:

127.0.0.1:6379> SLOWLOG get 5
1) 1) (integer) 32693        # slow log ID
   2) (integer) 1593763337   # timestamp when the command ran
   3) (integer) 5299         # execution time, in microseconds
   4) 1) "LRANGE"            # the command and its arguments
      2) "user_list_2000"
      3) "0"
      4) "-1"
2) 1) (integer) 32692
   2) (integer) 1593763337
   3) (integer) 5044
   4) 1) "GET"
      2) "book_price_1000"
...

By looking at the slow log, we can learn which commands were time-consuming to execute and when. If your business often uses commands with O(N) or higher complexity, such as SORT, SUNION, or ZUNIONSTORE, or runs O(N) commands over large amounts of data, Redis will spend a lot of time processing the data in these cases.

If your service does not receive many requests but the Redis instance still shows high CPU usage, it is very likely that complex commands are being used.

The solution is to avoid these complex commands and avoid fetching too much data at once: operate on a small amount of data per call, so that Redis can process it and return promptly.
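For example, instead of pulling an entire large list with LRANGE key 0 -1, read it in small slices. Below is a minimal sketch using the redis-py client; the key name, batch size, and the stand-in processing step are illustrative assumptions, not from the original text:

import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

BATCH = 100  # assumption: small enough for Redis to answer each call quickly
start = 0
while True:
    # fetch one small slice of the (hypothetical) big list per round trip
    chunk = r.lrange("user_list_2000", start, start + BATCH - 1)
    if not chunk:
        break
    print(f"processed {len(chunk)} items")  # stand-in for real business logic
    start += BATCH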

Storing bigkeys

If the slow log entries are not complex commands, but simple ones such as SET or DEL, then you should suspect that bigkeys are being written to Redis.

When Redis writes data, it allocates memory for new data, and when data is deleted from Redis, it frees the corresponding memory space.

If a key holds a large amount of data, allocating memory for it takes time; likewise, deleting that key takes a long time to free its memory.

You need to check your business code for bigkey writes, evaluate the size of the data being written, and avoid storing too much data under a single key at the business layer.

Is there any way to scan for bigkey data in Redis?

Redis also provides a way to scan bigkeys:

redis-cli -h $host -p $port --bigkeys -i 0.01

The command above scans the key-size distribution of the whole instance, with the results reported per data type.

Note that a bigkey scan sharply increases the QPS of the Redis instance. To reduce the impact on Redis during the scan, control the scan frequency with the -i parameter, which sets the sleep interval between scan batches, in seconds.

Internally, this command runs SCAN over all keys, then executes STRLEN, LLEN, HLEN, SCARD, or ZCARD depending on each key's type, to get the string length or the number of elements of the container types (list/hash/set/zset).

For container keys, note that the scan reports only the key with the most elements per type, and the key with the most elements does not necessarily occupy the most memory. Still, this command gives a good general picture of the key distribution across the instance.
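To make the mechanism above concrete, here is a rough Python (redis-py) sketch of what a --bigkeys style scan does, assuming a local instance; it is a simplification for illustration, not the actual redis-cli implementation:

import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

# length command per type, mirroring strlen/llen/hlen/scard/zcard
len_cmd = {
    "string": r.strlen,
    "list": r.llen,
    "hash": r.hlen,
    "set": r.scard,
    "zset": r.zcard,
}

biggest = {}  # per type: (length or element count, key name)
cursor = 0
while True:
    cursor, keys = r.scan(cursor=cursor, count=100)  # SCAN in batches
    for key in keys:
        t = r.type(key)
        if t in len_cmd:
            size = len_cmd[t](key)
            if size > biggest.get(t, (0, None))[0]:
                biggest[t] = (size, key)
    if cursor == 0:  # a zero cursor means the scan is complete
        break
    time.sleep(0.01)  # like -i 0.01: pause between batches to limit load

for t, (size, key) in biggest.items():
    print(t, key, size)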

Redis 4.0 introduced the lazy-free mechanism, which releases bigkey memory asynchronously to reduce the impact on Redis performance. Even so, we do not recommend using bigkeys: they can also hurt performance during cluster migration, which will be covered in more detail in a later article on clusters.
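For reference, these are the lazy-free options added in Redis 4.0 (they all default to no; the values below are illustrative). With them enabled, eviction, expiration, and implicit deletions free memory in a background thread, and the UNLINK command can be used instead of DEL to delete a bigkey asynchronously:

lazyfree-lazy-eviction yes    # free memory asynchronously when evicting under maxmemory
lazyfree-lazy-expire yes      # free memory asynchronously when deleting expired keys
lazyfree-lazy-server-del yes  # free memory asynchronously on implicit DEL (e.g. RENAME)
slave-lazy-flush yes          # flush the old dataset asynchronously on full resync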

Concentrated key expiration

Sometimes you will find no large latency while using Redis, but at a certain point in time a wave of latency appears, and it recurs very regularly, for example every hour or at some other fixed interval.

If this happens, consider whether a large number of keys are set to expire at the same time.

If a large number of keys expire at a fixed point in time, it is possible to increase latency when accessing Redis at that point.

Redis adopts two expiration strategies: active expiration and lazy expiration:

  • Active expiration: by default, Redis runs a scheduled task every 100 milliseconds that randomly samples 20 keys from the expires dictionary and deletes the ones that have expired. If more than 25% of the sampled keys had expired, it samples another 20 keys and deletes the expired ones again, looping until the proportion of expired keys drops below 25% or the task has run for more than 25 milliseconds
  • Lazy expiration: a key is checked for expiration only when it is accessed; if it has expired, it is deleted from the instance

Note that the active-expiration task runs in Redis's main thread. That means if a large number of expired keys must be deleted, service requests wait until the expiration task finishes before being processed, which shows up as increased access latency, up to 25 milliseconds at most.

Moreover, this latency is not recorded in the slow log. The slow log records only the time spent executing a command, while active expiration runs before the command; if the command itself stays below the slow log threshold, nothing is recorded, even though total latency increased.

In this case, check your business code for logic that expires keys in bulk at the same moment. Concentrated expiration is usually set with the EXPIREAT or PEXPIREAT command, so you can search the code for those keywords.

If your business really needs to centrally expire some keys without causing Redis to jitter, what are the optimizations?

The solution is to add a random offset to each key's expiry, spreading out the expiration times of keys that would otherwise expire together.

The pseudocode could be written like this:

redis.expireat(key, expire_time + random(300))
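A runnable version of that pseudocode with the redis-py client might look like this; the key name, value, and the one-hour base expiry are illustrative assumptions:

import random
import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

expire_time = int(time.time()) + 3600  # assumed base expiry: one hour from now
r.set("user_list_2000", "some value")
# spread expirations across a 5-minute window so keys do not all expire at once
r.expireat("user_list_2000", expire_time + random.randint(0, 300))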

This way, Redis avoids deleting a huge batch of keys at once when handling expiration, and the main thread is not blocked.

Besides fixing the business usage, you can also detect this problem in time at the operations level.

Redis exposes an expired_keys metric in the stats section of INFO: a cumulative count of the expired keys the instance has deleted.

Monitor this metric: when it jumps sharply within a very short time, raise an alert, then compare the jump against the time when the business reported slowness. If the times match, you can conclude the latency increase was indeed caused by concentrated expiration.
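A minimal monitoring sketch along these lines, using redis-py (the polling interval and alert threshold are assumptions to adapt to your own alerting system):

import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

prev = r.info("stats")["expired_keys"]  # cumulative counter from INFO stats
while True:
    time.sleep(10)  # assumed polling interval
    cur = r.info("stats")["expired_keys"]
    if cur - prev > 10000:  # assumed threshold: sudden burst of expirations
        print(f"ALERT: {cur - prev} keys expired in the last 10s")
    prev = cur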

The instance memory reaches its limit

Sometimes, when using Redis as a pure cache, we give the instance a memory limit, maxmemory, and enable an LRU eviction policy.

When the instance reaches maxmemory, you may notice that each subsequent write becomes slower.

The slowdown happens because when Redis memory reaches maxmemory, some data must be evicted to bring memory back under the limit before any new data can be written.

This eviction logic takes time, and how much depends on the configured eviction policy:

  • allkeys-lru: evicts the least recently accessed keys, whether or not they have an expiry set
  • volatile-lru: evicts the least recently accessed keys, but only among keys with an expiry set
  • allkeys-random: evicts keys at random, whether or not they have an expiry set
  • volatile-random: evicts keys at random, but only among keys with an expiry set
  • volatile-ttl: evicts the keys closest to their expiration time, among keys with an expiry set
  • noeviction: evicts nothing; once memory is full, new writes return an error
  • allkeys-lfu: evicts the least frequently accessed keys, whether or not they have an expiry set (4.0+)
  • volatile-lfu: evicts the least frequently accessed keys, but only among keys with an expiry set (4.0+)

The specific policy depends on the service scenario.
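For example, a pure-cache instance is often configured along these lines (the 4gb limit is an illustrative value; maxmemory-samples controls how many keys each eviction round samples, and 5 is the default):

maxmemory 4gb
maxmemory-policy allkeys-lru
maxmemory-samples 5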

The most commonly used policies, allkeys-lru and volatile-lru, work by approximate LRU: each round, Redis randomly samples a batch of keys (the batch size is configurable), evicts the least recently accessed one, keeps the remaining sampled keys in a pool, then samples another batch and compares it against the pool, again evicting the least recently accessed key. This loops until memory drops below maxmemory.

With allkeys-random or volatile-random, eviction is much faster than with the LRU policies: because keys are evicted at random, there is no need to compare access recency, so a sampled batch can be evicted directly.

But all of this eviction logic runs in the main thread before the actual command is executed; in other words, it delays the command you are running against Redis.

In addition, if the instance stores bigkeys, evicting a bigkey to free its memory takes much longer.

If your write volume is high, you must cap the instance with maxmemory, and you face eviction-induced latency, consider splitting the instance: spreading the keys over multiple instances spreads the eviction burden as well, reducing latency to some extent.

Time-consuming fork

If you have enabled automatic RDB snapshots and AOF rewrite, Redis may generate RDB files or rewrite the AOF in the background, increasing access latency; once those tasks complete, the latency disappears.

This kind of latency is usually the result of running RDB generation and AOF rewrite tasks.

To generate an RDB file or rewrite the AOF, the parent process must fork a child process to perform the persistence. During the fork, the parent has to copy its memory page tables to the child; if the instance's memory footprint is large, copying the page tables is time-consuming and burns a lot of CPU. Until the fork completes, the whole instance is blocked and cannot handle any request, and if CPU resources are tight, the fork takes even longer, sometimes seconds. This can seriously affect Redis performance.

We can run the INFO command and check latest_fork_usec, the duration of the most recent fork in microseconds. This is the time during which the entire instance was blocked and unable to process requests.
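For example (the field lives in the stats section of INFO; the value shown is illustrative):

redis-cli -h $host -p $port INFO stats | grep latest_fork_usec
# latest_fork_usec:59477  -> the last fork blocked the instance for about 59 ms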

Besides generating RDB files for backups, when a master and a replica establish data synchronization for the first time, the master also forks to generate an RDB file for the full sync, and this likewise affects Redis performance.

To avoid this, plan the backup schedule carefully: run backups on the replica, preferably during off-peak hours. And if the business is insensitive to data loss, you can simply not enable AOF and AOF rewrite at all.

Also, fork duration is system-dependent and increases if Redis is deployed on a virtual machine. So it is recommended to deploy Redis on physical machines to reduce the impact of fork.

CPU binding

Often, to improve performance and reduce the cost of context switches when an application uses multiple CPUs, we pin a process to specific CPU cores when deploying a service.

With Redis, we don’t recommend doing this for the following reasons.

When Redis is bound to a CPU and performs persistence, the forked child process inherits the parent's CPU preference. The child then consumes heavy CPU on persistence while competing with the parent for the same core, leaving the parent process short of CPU and increasing access latency.

So when deploying the Redis process, if you need to enable RDB and AOF rewrite, never bind it to a CPU!
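For context, CPU pinning is usually done with something like taskset when starting the process (the config path here is illustrative). This is exactly the pattern to avoid for a Redis instance that forks for persistence:

taskset -c 0 redis-server /etc/redis/redis.conf  # pins Redis (and any forked child) to core 0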

Unreasonable AOF configuration

As mentioned above, AOF rewrites increase Redis latency because of the fork. Beyond that, if AOF is enabled, an improperly chosen flushing policy can also cause performance problems.

After AOF is enabled, Redis appends write commands to the AOF file in real time. However, the write first lands in an in-memory buffer; the data is only flushed to disk once a threshold is exceeded or a time interval elapses, according to the configured policy.

AOF provides three flushing (fsync) policies that trade performance against data safety:

  • appendfsync always: fsync on every write; the biggest performance impact and the highest disk I/O, but the highest data safety
  • appendfsync everysec: fsync once per second; little performance impact, and at most one second of data is lost if the node goes down
  • appendfsync no: leave flushing to the operating system; minimal performance impact but the lowest data safety, since how much data is lost on a crash depends on the OS's flushing behavior

With the first policy, appendfsync always, Redis flushes the command to disk every time it handles a write, and this happens in the main thread.

Flushing in-memory data to disk increases the disk I/O burden, and a disk operation costs far more than a memory operation. If the write volume is high and every update is flushed to disk, the machine's disk I/O becomes saturated and drags down Redis performance, so we do not recommend this policy.

Compared with the first policy, appendfsync everysec flushes once per second, while appendfsync no leaves the flush timing to the operating system and offers the weakest safety. We therefore recommend appendfsync everysec: in the worst case it loses only one second of data, while keeping good access performance.
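A minimal AOF configuration reflecting this recommendation:

appendonly yes
appendfsync everysec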

Of course, for some business scenarios that are not sensitive to data loss, AOF may not be enabled.

Swap is being used

If you find that Redis suddenly becomes very slow, with each access taking hundreds of milliseconds or even seconds, check whether the machine has pushed Redis memory into swap. Once that happens, Redis is basically unable to provide high-performance service.

As we know, the operating system provides the swap mechanism so that when physical memory runs short, part of the in-memory data can be moved to disk, buffering memory usage.

But when the data in memory is transferred to disk, accessing the data requires reading it from disk, which is much slower than memory!

This is especially painful for a high-performance in-memory database like Redis: if Redis memory is swapped to disk, the resulting access times are unacceptable for something so performance-sensitive.

We need to check the memory usage of the machine to see if Swap was used due to insufficient memory.
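One way to check, as a sketch: find the Redis process ID via INFO, then look at the Swap entries of that process in /proc (the PID shown is illustrative):

redis-cli -h $host -p $port INFO server | grep process_id
# process_id:5332
cat /proc/5332/smaps | grep -i swap
# large Swap: values mean Redis memory has been swapped out to disk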

If swap is in use, you need to free up enough memory on the machine for Redis, and then release the swap occupied by Redis so that it can use physical memory again.

Releasing the swap generally requires restarting the instance. To avoid the impact of a restart on the service, perform a master/replica switchover first, release the swap on the original master by restarting it, and after data synchronization completes, switch back to it.

As you can see, once Redis falls into swap, its high performance is essentially destroyed, so we need to prevent this situation in advance.

Monitor the memory and swap usage of the machines running Redis, alert promptly when memory runs low or swap is used, and deal with it in time.

Network card load is too high

If you have avoided all of the performance pitfalls above and Redis has been running stably for a long time, but after a certain point accesses start to slow down and stay slow, what causes that?

We have run into this problem before: after a certain point everything slows down and stays that way. At that point you need to check the machine's network card traffic to see whether the bandwidth is saturated.
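For example, with the sysstat package installed you can watch per-NIC throughput and compare it against the card's bandwidth:

sar -n DEV 1
# watch the rxkB/s and txkB/s columns of the Redis machine's NIC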

When the network card is overloaded, data transmission at the network and TCP layers suffers delays and packet loss. Besides memory, Redis's high performance lies in network I/O, and a sudden surge of requests can drive the network card load very high.

If this happens, find out which Redis instances on the machine are generating the heavy traffic that fills the bandwidth, then confirm with the business whether the surge is expected. If it is, scale out or migrate instances in time to keep the machine's other instances from being affected.

At the operations level, add monitoring for the machine's various metrics, including network traffic, alert before thresholds are reached, and confirm with the business and expand capacity in time.

Conclusion

Above, we have summarized the common scenarios in Redis that can increase latency or even block the instance, covering both business usage and Redis operations.

As you can see, guaranteeing Redis's high performance touches on CPU, memory, network, and even disk, as well as the relevant features of the operating system.

As developers, we need to understand how Redis works, such as each command's execution time complexity, the data expiration strategy, and the data eviction strategy, and use commands sensibly in combination with our business scenarios.

As DBAs and operations engineers, we need to understand data persistence, the operating system's fork principle, and the swap mechanism, plan Redis capacity reasonably, reserve enough machine resources, and monitor the machines thoroughly to keep Redis running stably.
