
Redis is an excellent in-memory database with very high performance: a single instance can reach roughly 100,000 OPS. Precisely because of that, an increase in latency is the last thing we expect when using Redis.

You may have encountered the following scenarios at some point:

  • Why is the same command sometimes fast and sometimes slow on Redis?
  • Why do even simple SET and DEL commands take so long on Redis?
  • Why did my Redis suddenly slow down and then return to normal?
  • Why has my Redis, which had been running steadily for so long, suddenly started to slow down at some point?
  • …

If you do not understand Redis's internal implementation, you will have no idea where to start troubleshooting these latency problems.

If you are in this situation, this article will give you a comprehensive set of troubleshooting ideas, plus efficient solutions for each slowdown scenario.

Before the main text begins, a reminder: this article is very long and covers a wide range of Redis knowledge points, close to 20,000 words in total. If this is not a good moment for focused reading, I suggest you bookmark the article first and come back to it at a suitable time.

If you patiently and carefully read it through, I can guarantee that it will pay off handsomely for your Redis performance tuning.

If you’re ready, follow my lead and get started!

Is Redis really slowing down?

First, before anything else, you need to confirm whether Redis has really slowed down.

If you find that your business service's API is responding slowly, you first need to look inside the service to find out what is slowing the whole service down.

A more efficient approach here is to integrate distributed tracing within the service, that is, at every point where the service calls out to an external dependency, record the response latency of that dependency for each request.

If you do find that operations on the Redis link are taking longer, then for now you should focus on the link from your business service to Redis.

There are two possible reasons why the link from your business service to Redis is slow:

  1. There is a problem with the network between the service's server and the Redis server, such as poor line quality causing delay or packet loss in transmission
  2. Redis itself has a problem, which requires further investigation to find out what is slowing it down

Generally speaking, the first case is unlikely. If there were a network problem between the servers, every service deployed on that server would experience network delays; in that case, you need to work with your network operations (O&M) colleagues to solve it.

In this article, we focus on the second case.

That is to say, from the perspective of Redis itself, check which scenarios lead to a slowdown and which factors increase Redis latency, and then optimize accordingly.

Having excluded network causes, how do you know whether your Redis is really slow?

First, you need to measure Redis's baseline performance to see how it performs on your production servers.

What is baseline performance?

Simply put, baseline performance is the maximum and average response latency of Redis on a machine under normal load.

Why measure baseline performance? Can't I just use the response latency figures reported by others to decide whether my Redis is slow?

The answer is no.

Redis performs differently in different hardware and software environments.

For example, on my low-spec machine I might consider Redis slow once latency reaches 2ms, while on your high-spec hardware you might consider it slow at 0.5ms.

So only when you know the baseline performance of your Redis on your own production servers can you judge at what latency Redis has actually slowed down.

How to do it?

To rule out network latency between the business server and the Redis server, you need to test the instance's response latency directly on the Redis server. To measure the instance's maximum response latency over 60 seconds, run the following command:

$ redis-cli -h 127.0.0.1 -p 6379 --intrinsic-latency 60
Max latency so far: 1 microseconds.
Max latency so far: 15 microseconds.
Max latency so far: 17 microseconds.
Max latency so far: 18 microseconds.
Max latency so far: 31 microseconds.
Max latency so far: 32 microseconds.
Max latency so far: 59 microseconds.
Max latency so far: 72 microseconds.

1428669267 total runs (avg latency: 0.0420 microseconds / 42.00 nanoseconds per run).
Worst run took 1429x longer than the average latency.

As you can see from the output, the maximum response delay for this 60-second period is 72 microseconds (0.072 ms).

You can also use the following command to view the minimum, maximum, and average access latency of Redis over a period of time:

$ redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 1
min: 0, max: 1, avg: 0.13 (100 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.12 (99 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.13 (99 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.10 (99 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.13 (98 samples) -- 1.00 seconds range
min: 0, max: 1, avg: 0.08 (99 samples) -- 1.01 seconds range
...

This output shows that, sampling once per second, the average latency of Redis operations is 0.08 to 0.13 ms.

Knowing how to measure baseline performance, you can follow these steps to determine whether your Redis has really slowed down:

  1. On a server with the same configuration, measure the baseline performance of a normal Redis instance
  2. Find the Redis instance you suspect is slow and measure its baseline performance
  3. If that instance's latency is more than twice the normal baseline, you can conclude that it has indeed slowed down

Having confirmed that Redis has slowed down, how do you figure out where the problem is?

Follow my train of thought: we will analyze the factors that can slow Redis down step by step, from the simple to the complex.

Using commands that are too complex

First, check Redis's slow log (slowlog).

Redis provides slow-command logging, which records commands whose execution exceeds a given threshold.

Before viewing the slow log, you need to set the slow log threshold. For example, to record commands slower than 5 milliseconds and keep the most recent 500 entries:

# Commands slower than 5 milliseconds (5000 microseconds) are recorded
CONFIG SET slowlog-log-slower-than 5000
# Keep the most recent 500 slow log entries
CONFIG SET slowlog-max-len 500

After the setup, Redis records all commands that take more than 5 milliseconds to execute.

In this case, you can run the following command to query the latest slow logs:

127.0.0.1:6379> SLOWLOG get 5
1) 1) (integer) 32693       # slow log ID
   2) (integer) 1593763337  # timestamp of execution
   3) (integer) 5299        # execution time, in microseconds
   4) 1) "LRANGE"           # command and arguments
      2) "user_list:2000"
      3) "0"
      4) "-1"
2) 1) (integer) 32692
   2) (integer) 1593763337
   3) (integer) 5044
   4) 1) "GET"
      2) "user_info:1000"
...

From the slow log, we can see exactly when, and by which commands, execution time was consumed.

If the Redis commands your application executes have either of the following characteristics, their latency may be high:

  1. Frequent use of commands with complexity above O(N), such as SORT, SUNION, ZUNIONSTORE and other aggregation commands
  2. Use of O(N) commands where the value of N is very large

In the first case, the slowness comes from the high time complexity of operating on the in-memory data, which consumes more CPU resources.

In the second case, the slowness comes from Redis having to return too much data to the client at once; most of the time goes into assembling the data into the protocol and transmitting it over the network.

We can also judge this from resource utilization: if your application's OPS against Redis is not high, yet the CPU utilization of the Redis instance is, the cause is very likely complex commands.

Besides, as we all know, Redis processes client requests in a single thread. If you use the commands above frequently, then whenever one command takes a long time, subsequent requests queue up behind it, and from the client's point of view response latency grows.

How to solve this situation?

The answer is simple. You can optimize your business in the following ways:

  1. Avoid commands with complexity above O(N); do data aggregation on the client side instead (see the sketch after this list)
  2. When running O(N) commands, keep N as small as possible (N <= 300 is recommended) and fetch as little data as possible each time, so that Redis can process and return it promptly
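
For example, with a client library such as redis-py (an assumption here, purely for illustration; the key names are made up), you can read a large hash or list in small batches instead of pulling everything back with one command:

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Instead of HGETALL on a huge hash, iterate it with HSCAN in small batches,
# so each round trip returns only a little data
user_profile = {}
for field, value in r.hscan_iter("user_profile:1000", count=300):
    user_profile[field] = value

# Instead of LRANGE key 0 -1 on a long list, read it page by page
first_page = r.lrange("user_list:2000", 0, 299)  # first 300 elements only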

Operating on bigkeys

If you query the slow log and find that it is caused not by complex commands but by simple commands such as SET/DEL, then you should suspect that your instance contains bigkeys.

When Redis writes data, it allocates memory for the new data; correspondingly, when data is deleted from Redis, it frees the corresponding memory space.

If the value written under a key is very large, Redis needs more time to allocate the memory; likewise, deleting that key takes more time to free the memory. This type of key is commonly called a bigkey.

At this point, you need to check your business code for bigkey writes: estimate the size of the data written under each key, and avoid storing too much data under a single key.

If bigkeys are already written, is there any way to scan the distribution of bigkeys in the instance?

The answer is yes.

Redis provides a command to scan for bigkeys. To scan the distribution of bigkeys in an instance, run the following command:

$ redis-cli -h 127.0.0.1 -p 6379 --bigkeys -i 0.01

...
-------- summary -------

Sampled 829675 keys in the keyspace!
Total key length in bytes is 10059825 (avg len 12.13)

Biggest string found 'key:291880' has 10 bytes
Biggest   list found 'mylist:004' has 40 items
Biggest    set found 'myset:2386' has 38 members
Biggest   hash found 'myhash:3574' has 37 fields
Biggest   zset found 'myzset:2704' has 42 members

36313 strings with 363130 bytes (04.38% of keys, avg size 10.00)
787393 lists with 896540 items (94.90% of keys, avg size 1.14)
1994 sets with 40052 members (00.24% of keys, avg size 20.09)
1990 hashs with 39632 fields (00.24% of keys, avg size 20.92)
1985 zsets with 39750 members (00.24% of keys, avg size 20.03)

From this output we can clearly see, for each data type, the key that occupies the most memory or holds the most elements, as well as each type's share of the keyspace and its average size / element count.

In fact, this command works by running SCAN inside Redis to traverse all the keys in the instance, then executing STRLEN, LLEN, HLEN, SCARD, or ZCARD according to each key's type, to obtain the length of Strings and the element count of container types (List, Hash, Set, ZSet).

There are two things I need to remind you of when executing this command:

  1. The OPS of Redis rises sharply while a bigkey scan runs on a live instance. To reduce the impact of the scan on Redis, it is best to control the scan frequency by specifying the -i parameter, which is the pause after each scan request, in seconds
  2. For container keys (List, Hash, Set, ZSet), the scan only reports the keys with the most elements. A key having more elements does not necessarily mean it uses more memory; you still need to evaluate actual memory usage against your business situation (see the sketch below)
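
For point 2, Redis 4.0 and above provides the MEMORY USAGE command, which reports a key's actual size in bytes. A minimal follow-up scan might look like this (redis-py assumed; the threshold is illustrative):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)
THRESHOLD = 1024 * 1024  # flag keys over 1MB

for key in r.scan_iter(count=100):           # SCAN instead of KEYS, to avoid blocking
    nbytes = r.memory_usage(key, samples=0)  # samples=0 measures every element
    if nbytes and nbytes > THRESHOLD:
        print(key, nbytes)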

What is a good way to solve bigkey latency problems?

There are a few ways to optimize:

  1. Business applications should avoid writing bigkeys in the first place
  2. If you are using Redis 4.0 or higher, use the UNLINK command instead of DEL; it hands the key's memory release to a background thread, reducing the impact on Redis
  3. If you are using Redis 6.0 or higher, you can enable lazy-free (lazyfree-lazy-user-del = yes), so that memory is freed in a background thread even when DEL is executed (see the sketch below)
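
A minimal sketch of options 2 and 3 (the key name is illustrative):

# Redis 4.0+: release this key's memory in a background thread
127.0.0.1:6379> UNLINK user_list:2000

# Redis 6.0+ (redis.conf): make DEL behave like UNLINK
lazyfree-lazy-user-del yes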

But even with options 2 and 3 available, I still do not recommend storing bigkeys in an instance.

That is because bigkeys cause performance problems in many other scenarios as well, for example during data migration in sharded cluster mode, and in data expiration, data eviction, and copy-on-write under transparent huge pages, all of which I will cover later.

Centralized expiration

If you find that Redis usually shows no large delays when you operate it, but at a certain point in time a wave of latency suddenly appears, and the timing is very regular, for example on the hour, or repeating at a fixed interval, then this scenario applies.

If that is the case, you need to check your business code for logic that sets a large batch of keys to expire at the same time.

If a large number of keys expire at a fixed point in time, accessing Redis around that time can show increased latency.

Why does centralized expiration cause Redis latency to increase?

To answer that, we need to understand Redis's expiration strategy.

Redis expires data using two strategies combined, passive expiration plus active expiration:

  1. Passive expiration: only when a key is accessed does Redis check whether it has expired; if so, it is deleted from the instance
  2. Active expiration: Redis maintains an internal timed task that, by default, randomly samples 20 keys from the global expiry hash table every 100 milliseconds (10 times per second) and deletes the expired ones among them. If more than 25% of the sampled keys were expired, it repeats the process, looping until the expired ratio drops below 25% or the task has run for more than 25 milliseconds

Note that this timed task for actively expiring keys runs in the Redis main thread.

In other words, if a large number of keys must be deleted during active expiration, then while this is happening, applications accessing Redis must wait for the expiration task to finish before Redis can serve their requests.

At that point, the application's access to Redis takes noticeably longer.

If bigkeys happen to be expiring at this moment, the delay gets even longer. Worse still, these delayed operations are not recorded in the slow log.

That is because the slow log only records the time a command spends actually operating on memory data, while Redis deletes expired keys proactively before the command executes.

So in this case you will see no time-consuming commands in the slow log, yet your application clearly perceives delays; the time is actually being spent deleting expired keys. We need to pay special attention to this.

How to analyze and troubleshoot this situation?

At this point, you need to check your business code for logic that expires keys in a centralized way.

Centralized expiration is usually set with the EXPIREAT / PEXPIREAT commands, so search for them in your code.

If after reviewing the code you do find such centralized-expiration logic, but the business genuinely requires it, how can you optimize while keeping Redis performance intact?

There are two ways to avoid this problem:

  1. Add a random offset to the expiration time of the batch of keys, spreading out when they expire and reducing the pressure on Redis when it cleans up expired keys
  2. If you are using Redis 4.0 or higher, enable lazy-free so that expired keys are deleted in a background thread, avoiding blocking the main thread

For the first option, when setting a key's expiration time, add a random amount on top, which can be written like this:

redis.expireat(key, expire_time + random(300))

This way, when handling expirations, Redis will not have to delete too many keys at once, which avoids blocking the main thread.
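
A concrete version of this idea in Python (redis-py assumed; the keys and the one-hour window are illustrative):

import random
import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

expire_at = int(time.time()) + 3600            # intended expiry: one hour from now
for key in ["order:1", "order:2", "order:3"]:  # illustrative batch of keys
    # each key expires up to 300 seconds apart, instead of all at once
    r.expireat(key, expire_at + random.randint(0, 300))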

The second option, Redis 4.0 or higher, enables lazy-free:

# Release the memory of expired keys in a background thread
lazyfree-lazy-expire yes

Besides optimizing at the business level and adjusting the configuration, you can also detect these situations at the operations level.

At the O&M level, you need to monitor Redis's runtime state. Running the INFO command against Redis returns all of the instance's runtime metrics.

Here the one to watch is expired_keys, the cumulative number of expired keys the instance has deleted so far.

You should monitor this metric and alert promptly when it spikes within a short period, then compare the spike with the time when the application reported slowness. If they match, you can confirm that the latency increase was indeed caused by centralized expiration.
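
The counter can be pulled like this (the sample value is illustrative):

$ redis-cli -h 127.0.0.1 -p 6379 INFO stats | grep expired_keys
expired_keys:1258724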

Instance memory reaches the upper limit

If your Redis instance has maxmemory set, it is also possible for Redis to slow down.

When we use Redis purely as a cache, we usually give the instance a memory upper limit, maxmemory, along with a data eviction policy.

Once the instance reaches maxmemory, you may notice that every subsequent write of new data comes with increased latency.

Why is that?

The reason is that when Redis reaches maxmemory, before writing new data in, it must first evict some data to keep the whole instance below maxmemory.

This eviction of old data also takes time, and how long depends on the eviction policy you configured:

  • allkeys-lru: evicts the least recently accessed keys, whether or not they have an expiry set
  • volatile-lru: evicts the least recently accessed keys among those with an expiry set
  • allkeys-random: evicts keys at random, whether or not they have an expiry set
  • volatile-random: evicts keys at random among those with an expiry set
  • volatile-ttl: evicts the keys closest to expiration among those with an expiry set
  • noeviction: evicts nothing; once instance memory reaches maxmemory, writing new data simply returns an error
  • allkeys-lfu: evicts the least frequently accessed keys, whether or not they have an expiry set (4.0+)
  • volatile-lfu: evicts the least frequently accessed keys among those with an expiry set (4.0+)

The specific policy needs to be configured based on specific service scenarios.

The most commonly used policies, allkeys-lru / volatile-lru, work roughly like this: Redis randomly samples a configurable number of keys at a time, evicts the least recently accessed among them, and keeps the remaining candidates in a pool; it then samples another batch, compares them with the ones in the pool, and again evicts the least recently accessed key, repeating until instance memory drops below maxmemory.
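
The sampling behavior is driven by configuration. A minimal sketch (redis.conf; the values are illustrative):

# Cap instance memory and choose an eviction policy
maxmemory 4gb
maxmemory-policy allkeys-lru
# Keys sampled per eviction round: larger is closer to true LRU but costs more CPU
maxmemory-samples 5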

Note that Redis's eviction logic, like expired-key deletion, also runs before a command actually executes. In other words, it too adds latency to Redis operations, and the higher the write OPS, the more noticeable the delay.

Moreover, if the instance stores bigkeys, evicting a bigkey to free memory will also take longer.

See? Bigkeys are everywhere. That is why I warned you earlier not to store them.

What is the solution to this situation?

I will give you four suggestions for optimization:

  1. Avoid storing bigkeys, reducing the time spent releasing memory
  2. Switch the eviction policy to random eviction, which is much faster than LRU (whether this fits depends on the business)
  3. Split the instance, spreading the burden of evicting keys over multiple instances
  4. If you are using Redis 4.0 or higher, enable lazy-free so that the memory of evicted keys is released in a background thread (see the configuration below)
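
Suggestion 4 corresponds to one line of configuration:

# Redis 4.0+: release the memory of evicted keys in a background thread
lazyfree-lazy-eviction yes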

Fork takes too long

To keep Redis data safe, we may enable scheduled background RDB snapshots and AOF rewrite.

However, if you find that Redis latency rises during RDB snapshots and AOF rewrites, you need to investigate what might be slowing things down during those periods.

When Redis performs an RDB snapshot or an AOF rewrite, the main process has to create a child process to persist the data.

The main process creates the child by calling the fork function provided by the operating system.

During fork, the main process needs to copy its own page table to the child process. If the instance is large, this copying can be quite time-consuming.

The fork also consumes a lot of CPU resources, and until it completes, the entire Redis instance is blocked and cannot process any client requests.

If your CPU resources are already tight at that moment, the fork takes even longer, possibly reaching the second level, which seriously affects Redis performance.

How do you confirm that the Redis delay is actually due to the fork time?

You can run the INFO command on Redis and check latest_fork_usec, in microseconds:

# Duration of the most recent fork, in microseconds
latest_fork_usec:59477

This is how long the main process was stuck in fork, during which the entire instance could not process client requests.

If this value is large, be alert: it means your entire Redis instance was unavailable for that long.

Besides data persistence, an RDB is also generated when a master and a replica first establish synchronization: the master creates a child process to generate the RDB and sends it to the replica for a full sync, so this process affects Redis performance in the same way.

To avoid this, you can optimize by:

  1. Control the instance's memory: keep it below 10GB if possible. Fork time grows with instance size, and larger instances take longer
  2. Configure your persistence strategy sensibly: take RDB backups on a replica, preferably during off-peak hours; for services insensitive to data loss (e.g., Redis as a pure cache), AOF and AOF rewrite can be disabled
  3. Do not deploy Redis instances on virtual machines: fork time also depends on the system, and virtual machines take longer than physical machines
  4. Reduce the probability of full master-replica synchronization: increase the repl-backlog-size parameter appropriately to avoid full syncs (see the sketch after this list)
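
For point 4, the replication backlog is a single configuration item. A sketch (the size is illustrative; tune it to your write volume and network stability):

# A larger backlog lets a briefly disconnected replica catch up with a
# partial resync instead of triggering a full RDB-based synchronization
repl-backlog-size 512mb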

Huge pages are enabled

Besides the latency introduced by forking child processes for RDB and AOF rewrite, there is one more thing to check here: whether the operating system has transparent huge pages enabled.

What are huge pages?

As we all know, applications request memory from the operating system in pages, and the typical page size is 4KB.

Starting with kernel 2.6.38, Linux supports huge pages, which allow applications to request memory from the operating system in units of 2MB.

Requesting memory in larger units also means that each request takes longer.

How will this affect Redis?

When Redis runs a background RDB snapshot or AOF rewrite, it forks a child process. After the fork, however, the main process can still receive write requests, and those incoming writes modify memory data using copy-on-write.

That is, when the main process needs to modify some data, Redis does not modify the existing memory in place; it first copies the memory to a newly allocated region, then applies the modification there. This is called "copy-on-write".

You can also put it this way: whoever wants to write must copy first, then modify.

The benefit is that nothing the parent process writes affects the child's persistence (the child only persists the snapshot of all the data in the instance as it was at the moment of fork, and does not care about later changes, since all it needs is that memory snapshot to write to disk).

Note, however, that when the main process copies memory data, it has to allocate new memory at that moment. If the operating system has huge pages enabled, then during this period, even if the client modifies only 10B of data, Redis will ask the operating system for memory in 2MB units, so each allocation takes longer. This inflates the latency of every write request and affects Redis performance.

Likewise, if a write request touches a bigkey, the main process must copy an even larger block of memory, which takes longer still. Once again, bigkeys hurt performance.

So how do you solve this problem?

It's easy: you just need to disable the huge page mechanism.

First, check whether huge pages are enabled on the Redis machine:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

If the output is [always], transparent huge pages are currently enabled, and we need to disable them:

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
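
Note that this echo does not survive a reboot. One common approach (an assumption: your distribution still executes /etc/rc.local at boot) is to reapply it there:

$ echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local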

To be fair, the advantage of the operating system's huge page mechanism is that it reduces, to some degree, the number of times an application has to request memory.

However, for a database like Redis that is extremely sensitive to performance and latency, we want Redis to take as little time as possible for each memory request, so I don’t recommend that you enable this mechanism on a Redis machine.

AOF is enabled

Earlier we analyzed the impact of RDB snapshots and AOF rewrite on Redis performance, focusing on fork.

In fact, there are more persistence-related factors that affect Redis performance. This time we focus on AOF data persistence itself.

If your AOF is not configured properly, it can still cause performance problems.

When Redis turns on AOF, it works as follows:

  1. After executing a write command, Redis appends the command to the AOF file buffer in memory (the write system call)
  2. Redis flushes the AOF buffer to disk according to the configured AOF flush policy (the fsync system call)

To control how safely AOF data is persisted, Redis provides three flush mechanisms:

  1. appendfsync always: the main thread fsyncs to disk immediately after every write. This consumes heavy disk I/O resources but is the safest for data
  2. appendfsync no: the main thread only writes to memory after each write, and the operating system decides when to flush it to disk. This has the least performance impact but the weakest data safety
  3. appendfsync everysec: the main thread only writes to memory on each write, and a background thread flushes to disk once per second (triggering the fsync system call). The performance impact is relatively small, but up to 1 second of data is lost if Redis crashes (a minimal configuration sketch follows this list)
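
A minimal AOF configuration sketch with the compromise policy (redis.conf):

# Enable AOF persistence
appendonly yes
appendfilename "appendonly.aof"
# Write to memory on each command; fsync to disk once per second
appendfsync everysec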

Let’s examine, in turn, the impact of each of these mechanisms on performance.

If your AOF is configured with appendfsync always, then for every write operation Redis handles, it writes the command to disk before returning, and the whole process happens in the main thread, which greatly increases the burden on writes.

The reason is simple: disk operations are hundreds of times slower than memory operations, so this configuration severely drags down Redis performance. I do not recommend setting the AOF flush mode to always.

Let's move on to the appendfsync no configuration item.

With this configuration, Redis only writes to memory on each write and lets the operating system decide when to flush the data to disk. This has the least impact on Redis performance, but some data is lost when Redis crashes; for data safety, we generally do not use it.

That said, if Redis is used purely as a cache and data loss does not matter, configuring appendfsync no is acceptable.

What about appendfsync everysec, the compromise between the two?

Its advantage is that the Redis main thread returns right after writing to memory; the actual flushing is done by a background thread, which flushes the in-memory data to disk once every second.

Isn’t this a perfect solution for keeping data as secure as possible while maintaining performance?

However, I have to pour some cold water here: you should also be cautious with this scheme, because there are still situations where Redis latency increases, or even where the whole instance blocks.

Why? Even though the most time-consuming part, the flush, has been handed to a background thread, there are cases where it still drags down the Redis main thread.

Imagine this situation: the Redis background thread is flushing the AOF file, and the disk's I/O load is high, so the background thread blocks inside the flush (the fsync system call).

Meanwhile the main thread keeps receiving write requests and needs to write data into the file's memory buffer (the write system call). But because the disk load is high and the background fsync is stuck, the main thread blocks inside its write call too, and it cannot return until the background thread's fsync completes.

See? There is still a risk that the main thread will block during this process.

So, even if your AOF configuration is appendfsync everysec, beware of Redis performance issues due to disk stress.

Under what circumstances can disk I/O load be too high? And how to solve this problem?

I have summarized the following situations for troubleshooting:

  1. A child process is performing AOF rewrite, which itself consumes a large amount of disk I/O
  2. Other applications on the machine are writing files heavily and hogging disk I/O resources

Case 1 is ironic: Redis's own AOF rewrite child process collides with its own AOF flushing.

What then? Do we have to turn off AOF rewrite?

Fortunately, Redis provides a configuration option: while a child process is performing AOF rewrite, the background thread skips flushing (does not trigger the fsync system call).

This is equivalent to temporarily setting appendfsync to no during AOF rewrite:

# While AOF rewrite is in progress, the AOF background thread does not fsync;
# this temporarily behaves as if appendfsync were set to no
no-appendfsync-on-rewrite yes

Of course, with this option turned on, an instance crash during AOF rewrite loses more data; you have to weigh performance against data safety.

If other applications are hogging disk resources, things are simpler: find out which application is doing the heavy file writing and migrate it to another machine, away from Redis.

And if you have high requirements for both Redis performance and data safety, I suggest optimizing at the hardware level: switch to SSDs to raise the disk's I/O capacity and make sure ample disk resources are available while AOF runs.

Binding the CPU

Quite often, to improve service performance and reduce the performance cost of context switching across multiple CPU cores, we bind a process to specific CPUs when deploying it.

However, when deploying Redis, if you need to bind a CPU to improve its performance, I recommend you think twice before doing so.

Why is that?

If you do not understand how Redis works, binding CPUs blindly will not improve performance and may even make it worse.

As we all know, a modern server usually has multiple CPUs (sockets), each CPU contains multiple physical cores, each physical core is split into multiple logical cores, and the logical cores under the same physical core share the L1/L2 cache.

A Redis server, besides the main thread that serves client requests, also creates child processes and child threads.

The child processes handle data persistence, while the child threads perform time-consuming operations such as asynchronously closing file descriptors, asynchronously flushing AOF, asynchronous lazy-free, and so on.

If you bind the Redis process to a single CPU logical core, then when Redis forks for persistence, the child process inherits the parent's CPU affinity.

The child then consumes a lot of CPU for data persistence (scanning out all the instance's data requires CPU), so it ends up competing with the main process for the same CPU, which affects the main process's ability to serve clients and increases access latency.

This is the performance problem with binding Redis to a CPU.

So how do you solve this problem?

If you really want to bind CPUs, do not bind the Redis process to just one logical core; bind it to several, preferably logical cores of the same physical core, so that they share the L1/L2 cache.
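
For example, with taskset (the core numbers here are hypothetical; check your machine's topology with lscpu first):

# Start Redis pinned to logical cores 0 and 12, which on this imaginary
# box are the two hyperthreads of the same physical core
$ taskset -c 0,12 redis-server /etc/redis/redis.conf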

Of course, even if we bind Redis to multiple logical cores, this only alleviates, to some degree, the CPU contention among the main thread, child processes, and background threads.

These child processes and child threads will still migrate among those logical cores, and the switching costs performance.

How to further optimize?

You may already be wondering: could we pin the main thread, the child processes, and the background threads each to a fixed CPU core, so that they never switch back and forth and the CPU resources they use never interfere with one another?

In fact, Redis has already thought of this plan.

The main thread (and I/O threads), background threads, background RDB process, and AOF rewrite process can each be bound to fixed CPU logical cores:

# Redis Server and IO threads bound to CPU cores 0,2,4,6
server_cpulist 0-7:2

# Background child threads (bio) bound to CPU cores 1,3
bio_cpulist 1,3

# Background AOF rewrite process bound to CPU cores 8,9,10,11
aof_rewrite_cpulist 8-11

# Background RDB process bound to CPU cores 1,10,11
bgsave_cpulist 1,10-11

If you happen to be on Redis 6.0, you can use these configurations to squeeze out further performance.

One reminder, though: Redis performance is usually good enough as it is, and unless you have very demanding performance requirements, binding CPUs is not recommended.

As the analysis above shows, binding CPUs requires a very clear understanding of your CPU architecture; otherwise, proceed with caution.

Let’s move on to what other scenarios might cause Redis to slow down.

Swap is being used

If you find that Redis has suddenly become very slow, with every operation taking hundreds of milliseconds or even seconds, then you need to check whether Redis is using swap; in that state Redis is basically unable to deliver high performance.

What is Swap? Why does using Swap cause Redis performance to degrade?

If you know a bit about operating systems, you will know that to soften the impact of running out of physical memory, the operating system can move part of an application's memory out to disk; the disk area that holds this swapped-out memory is called swap.

The problem is that once in-memory data has been moved to disk, accessing it forces Redis to read from disk, and disk is hundreds of times slower than memory!

This delay is unacceptable, especially for a database such as Redis, which is very performance-sensitive and demanding.

At this point, you need to check the memory usage of the Redis machine to see if Swap is being used.

You can check whether the Redis process uses Swap in the following way:

# First find the process ID of Redis
$ ps -aux | grep redis-server

# Then view the Redis process's swap usage
$ cat /proc/$pid/smaps | egrep '^(Swap|Size)'

The following output is displayed:

Size:               1256 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                132 kB
Swap:                  0 kB
Size:              63488 kB
Swap:                  0 kB
Size:                132 kB
Swap:                  0 kB
Size:              65404 kB
Swap:                  0 kB
Size:            1921024 kB
Swap:                  0 kB
...

The result lists the memory usage of the Redis process.

Each Size line shows the size of one block of memory used by Redis, and the Swap line right below it shows how much of that block has been swapped out to disk. If the two values are equal, all the data in that block has been swapped to disk.

If only a small amount of data has been swapped, for example a small fraction of each block's size, the impact is not large. But if hundreds of megabytes or even gigabytes have been swapped out, be alert: Redis performance is bound to drop sharply in that situation.

The solution at this point is:

  1. Increase the machine's memory so that Redis has enough memory to use
  2. Tidy up the memory space, free enough memory for Redis, then release Redis's swap so that Redis can use physical memory again

Releasing Redis's swap usually requires restarting the instance. To avoid the restart affecting the business, a master-replica switchover is usually performed first: switch the old master into a replica, release its swap, restart it, and after data synchronization completes, switch it back to master.

As you can see, once Redis is pushed into swap, it basically cannot meet high-performance requirements anymore (think of it as a martial artist whose powers have been crippled), so you need to guard against this in advance.

The way to guard against it is to monitor the memory and swap usage of the Redis machine, alert when memory runs low or swap is used, and deal with it promptly.

Memory defragmentation

Redis data lives in memory. When applications frequently modify the data in Redis, memory fragmentation may result.

Memory fragmentation lowers Redis's memory utilization. You can obtain the instance's memory fragmentation ratio with the INFO command:

# Memory
used_memory:5709194824
used_memory_human:5.32G
used_memory_rss:8264855552
used_memory_rss_human:7.70G
...
mem_fragmentation_ratio:1.45

How is the memory fragmentation rate calculated?

Very simple, mem_fragmentation_ratio = used_memory_rss/used_memory.

Here, used_memory is the amount of memory Redis uses to store its data, and used_memory_rss is the amount of physical memory the operating system has actually allocated to the Redis process.

If mem_fragmentation_ratio > 1.5, the memory fragmentation rate has exceeded 50%, at which point we need to take steps to reduce memory fragmentation.
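
A minimal monitoring sketch for this metric (redis-py assumed; the 1.5 threshold comes from the rule above):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

ratio = r.info("memory")["mem_fragmentation_ratio"]
if ratio > 1.5:
    print(f"high memory fragmentation: {ratio}")  # hook this into your alerting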

The solutions are generally as follows:

  1. If you are on a version below Redis 4.0, your only option is to restart the instance
  2. Redis 4.0 and above provides automatic defragmentation, which can be enabled through configuration

However, when automatic defragmentation is enabled, it can itself degrade Redis performance.

The reason is that defragmentation runs in the Redis main thread. It inevitably consumes CPU resources and takes time, and can therefore affect client requests.

Therefore, when you need to enable this feature, it is best to test and evaluate its impact on Redis in advance.

The parameters of Redis defragmentation are set as follows:

# Enable automatic memory defragmentation (master switch)
activedefrag yes

# Do not defragment if fragmented memory is below 100MB
active-defrag-ignore-bytes 100mb

# Start defragmenting when the fragmentation ratio exceeds 10%
active-defrag-threshold-lower 10
# Defragment as aggressively as possible when the ratio exceeds 100%
active-defrag-threshold-upper 100

# Minimum percentage of CPU resources used by defragmentation
active-defrag-cycle-min 1
# Maximum percentage of CPU resources used by defragmentation
active-defrag-cycle-max 25

# Maximum number of List/Set/Hash/ZSet elements scanned at once during defragmentation
active-defrag-max-scan-fields 1000

You need to evaluate the load on your Redis machine and the latency range your application can accept, then adjust the defragmentation parameters so as to minimize the impact of defragmentation on Redis.

Network Bandwidth Overload

If you have avoided all of the performance pitfalls above, and Redis has been running stably for a long time, but after a certain point it suddenly starts to slow down and stays slow, what could cause that?

At this point you need to check whether the network bandwidth of the Redis machine is overloaded, and whether any single instance is saturating the machine's entire bandwidth.

When network bandwidth is overloaded, the server starts to suffer packet-send delays and packet loss at the TCP and network layers.

Beyond memory operations, the high performance of Redis lies in network I/O; if network I/O becomes a bottleneck, Redis performance suffers severely.

If this happens, you need to promptly find out whether a Redis instance is using up the full network bandwidth. If it is, and the traffic is legitimate business access, you need to scale up or migrate the instance in time, to prevent this instance's traffic from affecting the other instances on the machine.
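
Redis itself exposes per-instance traffic counters that help pinpoint the culprit (sample values illustrative):

$ redis-cli -h 127.0.0.1 -p 6379 INFO stats | grep instantaneous
instantaneous_ops_per_sec:1021
instantaneous_input_kbps:512.33
instantaneous_output_kbps:3072.17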

At the O&M level, you need to expand monitoring of the Redis machine's various metrics, including network traffic, alerting before a threshold is reached so you can confirm the cause and scale capacity in time.

Other reasons

Ok, so these are the ideas and paths to troubleshoot Redis delays.

In addition to the above, there are a few smaller points that you should also be aware of:

1) Frequent short connections

Your business applications should operate Redis over long-lived connections, avoiding frequent short connections.

Frequent short connections make Redis spend a lot of time establishing and tearing down connections, and TCP's three-way handshake and four-way teardown add to your access latency (see the sketch below).
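
A minimal sketch of the long-connection approach (redis-py assumed; the pool size is illustrative):

import redis

# One long-lived connection pool shared across requests, instead of
# opening and closing a TCP connection for every command
pool = redis.ConnectionPool(host="127.0.0.1", port=6379, max_connections=50)
r = redis.Redis(connection_pool=pool)

def handle_request(user_id: int):
    # connections are borrowed from the pool and returned automatically
    return r.get(f"user_info:{user_id}")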

2) Operation and maintenance monitoring

As I mentioned earlier, good monitoring is essential for anticipating Redis slowdowns.

Monitoring essentially collects Redis's runtime metrics: the usual practice is for the monitoring program to periodically collect Redis's INFO output, then display and alert based on that status data.

It is important not to be careless when writing monitoring scripts or using open-source monitoring components.

When writing monitoring scripts that access Redis, collect the status information over a long-lived connection, avoiding frequent short connections, and throttle how often you query Redis so you do not interfere with business requests.

When using open-source monitoring components, it is best to understand how they are implemented and how to configure them correctly, to prevent a buggy component from hammering Redis with bursts of operations and hurting its performance.

We once had a case where the DBAs, using an open-source component, ran into configuration and usage problems that made the monitor connect to and disconnect from Redis at high frequency, causing Redis to respond slowly.

3) Other programs compete for resources

Finally, a reminder: your Redis machine is best kept dedicated, deploying only Redis instances and no other applications, giving Redis a relatively "quiet" environment and preventing other programs from consuming the CPU, memory, and disk resources that Redis needs.

Conclusion

OK, that is my summary of the common scenarios that cause latency and even blocking when using Redis, how to quickly locate and analyze these problems, and targeted solutions for each.

Here I also put together a mind map to help you quickly analyze and locate Redis performance problems.

To recap: Redis performance problems involve both how business developers use it and how DBAs operate it.

As business developers, we need to understand Redis's basic principles, such as the time complexity of each command, the data expiration strategy, and the data eviction strategy, so that we can use Redis commands more sensibly and optimize for our business scenarios.

As DBAs and O&M personnel, you need to understand Redis's operating mechanisms, such as data persistence, memory defragmentation, and process binding configuration, and you also need operating-system knowledge, such as copy-on-write, huge pages, and the swap mechanism.

Also, when deploying Redis, DBAs need to plan capacity in advance, reserve enough machine resources, and build complete monitoring of the machines and instances, so that Redis runs as stably as possible.

Afterword

If you have the patience to read this, you must have learned a lot about Redis performance tuning.

As you may have noticed, Redis performance involves a very wide range of knowledge, covering almost every aspect of CPU, memory, network, and even disk, and it also requires understanding computer architecture and various mechanisms of the operating system.

Looking at it from the angle of resource usage, the knowledge points break down as follows:

  • CPU-related: heavy use of complex commands and data persistence both consume a lot of CPU resources
  • Memory-related: bigkey memory allocation and release, data expiration, data eviction, defragmentation, huge pages, and copy-on-write are all memory matters
  • Disk-related: data persistence and the AOF flush policy are also affected by the disk
  • Network-related: short connections, instance traffic overload, and machine-wide network overload all reduce Redis performance
  • Computer architecture: CPU structure and memory allocation belong to the most basic computer knowledge
  • Operating system: copy-on-write, huge pages, swap, and CPU binding all belong to operating-system-level knowledge

Didn't expect that, did you? To squeeze out maximum performance, Redis touches this many areas of optimization.

If you can absorb more than 90% of the content of this article, it shows that you have a deep understanding of Redis principle, computer foundation and operating system.

If you absorbed around 50%, you can map out your blind spots and fill them in with targeted study.

If you absorbed less than 30%, you can start from the fundamentals of Redis, understand its various mechanisms, then ask why Redis uses each mechanism to improve performance and which features of computers and operating systems each one exploits, expanding your knowledge step by step. This is a very efficient learning path.

Due to space constraints, many Redis details could not be fully covered; in fact, every scenario that causes Redis performance problems deserves an article of its own.

For example, how binding the Redis process to CPUs and the operating system's use of swap interact with the NUMA (non-uniform memory access) architecture was not covered in detail here.

If you want to read more high-quality original articles, follow my official account, “Water Drop and Silver Bullet”.
