Redis is usually a key component of our business systems, serving as a cache, storing account login state, powering leaderboards, and so on.
Once Redis request latency increases, it can trigger an avalanche of failures across the business systems that depend on it.
I work at an Internet dating company that ran a Singles' Day promotion letting users send an order to their girlfriend.
Who would have thought that right after midnight, user traffic exploded, a technical failure struck, users could not place orders, and the boss hit the roof!
The error was: `Could not get a resource from the pool`.
Connection resources could not be obtained, and the number of Redis connections on each machine in the cluster was high.
A large amount of traffic missed the Redis cache and went straight to MySQL, and finally the database crashed too…
So we raised the maximum number of connections and the number of waiting connections. Although the errors became less frequent, they were still being reported continuously.
Later, offline testing revealed that the data stored in Redis was very large, and returning it took 1 s on average.
As you can see, once the Redis delay is too high, it can cause various problems.
Today, Code-guy takes a look at how to identify Redis performance problems and how to solve them.
[toc]
Is there a problem with Redis?
Latency is the time between the client issuing a command and the client receiving the response to that command. Normally, Redis processing time is extremely short, at the microsecond level.
When Redis has performance fluctuations, such as a few seconds to more than ten seconds, it is obvious that we can identify Redis performance slowed down.
On a high-end hardware configuration, we might already consider 0.6 ms latency slow. On poor hardware, it may take 3 ms before we think there is a problem.
So how do we define Redis as really slow?
Therefore, we need to measure the Redis baseline performance of the current environment, that is, the basic performance of a system at low pressure with no interference.
When you find that Redis is running with more than twice the latency of baseline performance, you can determine that Redis performance is slowing down.
Delayed baseline measurement
The redis-cli command provides the `--intrinsic-latency` option, which monitors and reports the maximum latency, in microseconds, observed during a test run; this value can be used as the Redis baseline performance.
For example, execute the following command:

```shell
redis-cli --intrinsic-latency 100
Max latency so far: 4 microseconds.
Max latency so far: 18 microseconds.
Max latency so far: 41 microseconds.
Max latency so far: 57 microseconds.
Max latency so far: 78 microseconds.
Max latency so far: 170 microseconds.
Max latency so far: 342 microseconds.
Max latency so far: 3079 microseconds.

45026981 total runs (avg latency: 2.2209 microseconds / 2220.89 nanoseconds per run).
Worst run took 1386x longer than the average latency.
```
Note: Parameter 100 is the number of seconds in which the test will be executed. The longer we run the tests, the more likely we are to find spikes in latency.
Running for 100 seconds is usually good enough to detect delays, but we can run it several times at different times to avoid errors.
Here the maximum latency is 3,079 microseconds (about 3 ms), so the baseline performance is 3 ms.
Note that we are running on the server side of Redis, not the client side. In this way, the impact of the network on baseline performance can be avoided.
To monitor the impact of the network on Redis performance, run `redis-cli --latency -h <host> -p <port>` from a client to measure round-trip latency to the server, or use iperf to measure the network latency from the client to the server.
If the network delay is several hundred milliseconds, other programs with heavy traffic may be running on the network, causing network congestion. Therefore, O&M needs to coordinate network traffic allocation.
Slow command monitoring
How do you tell if it’s slow?
See if the operation complexity is O(N). The official documentation describes the complexity of each command. If possible, use O(1) and O(log N) commands.
The complexity of collection operations is generally O(N), such as collection full query HGETALL, SMEMBERS, and collection aggregation operations: SORT, LREM, SUNION, etc.
Is there monitoring data to look at? I didn't write all the code myself, so I don't know whether anyone used slow commands.
There are two ways to check:
- Use the Redis slow log to detect slow commands.
- Use the latency monitoring tool.

In addition, you can use system tools (top, htop, prstat, etc.) to quickly check the CPU consumption of the main Redis process. If CPU usage is high while traffic is low, it usually indicates that slow commands are being used.
Slow log function
The slowlog command in Redis allows us to quickly locate commands that are slow beyond the specified execution time. By default, commands that take more than 10ms to execute are logged.
Slowlog records only the time its commands take to execute, excluding I/O round-trips, and does not record slow responses caused solely by network latency.
We can customize the criteria for slow commands based on baseline performance (configured to be twice the maximum latency of baseline performance) and adjust the threshold for triggering slow command logging.
It can be configured from redis-cli by entering the following command:

```shell
redis-cli CONFIG SET slowlog-log-slower-than 6000
```
It can also be set, in microseconds, in the redis.conf configuration file.
To view all the commands that took a long time to execute, run `slowlog get` from the redis-cli tool. The third field of each entry in the output shows the command's execution time in microseconds.
If you only need to view the last two slow commands, enter slowlog get 2.
Example:

```shell
127.0.0.1:6381> SLOWLOG get 2
1) 1) (integer) 6
   2) (integer) 1458734263
   3) (integer) 74372
   4) 1) "hgetall"
      2) "max.dsp.blacklist"
2) 1) (integer) 5
   2) (integer) 1458734258
   3) (integer) 5411075
   4) 1) "keys"
      2) "max.dsp.blacklist"
```
Taking the first HGETALL command as an example, each slowlog entry has four fields:

- Field 1: a unique, incrementing identifier for the slowlog entry, which grows after server startup; currently 6.
- Field 2: the Unix timestamp at which the query was executed.
- Field 3: the query execution time in microseconds; currently 74372 microseconds, or about 74 ms.
- Field 4: the command and its arguments. If the arguments are numerous or large, only part of them is shown along with the argument count. The current command is `hgetall max.dsp.blacklist`.
Latency Monitoring
Redis introduced Latency Monitoring in version 2.8.13 to monitor the frequency of various events on a second-by-second basis.
The first step in enabling the delay monitor is to set the delay threshold in milliseconds. Only the time above this threshold is recorded, such as when we set the threshold to 9 ms based on 3 times baseline performance (3ms).
This can be set using redis-cli or in redis.conf.

```shell
CONFIG SET latency-monitor-threshold 9
```
Details of the events recorded by the tool can be found in the official documentation: redis.io/topics/late…
For example, to get the latest latency events:

```shell
127.0.0.1:6379> debug sleep 2
OK
(2.00s)
127.0.0.1:6379> latency latest
1) 1) "command"
   2) (integer) 1645330616
   3) (integer) 2003
   4) (integer) 2003
```
- the event name;
- the Unix timestamp of the event's most recent latency spike;
- the latest latency in milliseconds;
- the maximum latency recorded for the event.
How do I fix Redis slowness?
Redis data reads and writes are performed by a single thread, which can block the main thread if it takes too long to perform operations.
What operations block the main thread and how can we resolve them?
Delay caused by network communication
Clients connect to Redis using TCP/IP connections or Unix domain connections. Typical latency for 1 Gbit/s networks is about 200 us.
The Redis client executes a command in four steps:

Send command -> queue command -> execute command -> return result

This round trip is called round-trip time (RTT). MGET and MSET effectively save RTTs, but most commands (such as HGETALL, for which there is no MHGETALL) do not support batch operation, so N commands cost N RTTs. A pipeline is needed to solve this.

A Redis pipeline chains multiple commands together, reducing the number of network round trips.
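To see why a pipeline helps, here is a toy simulation of the round-trip accounting (the RTT value and function names are illustrative, not a real client API):

```python
# Toy model: each individually sent command pays one round trip (RTT),
# while a pipeline flushes all commands in a single batch for one RTT.
# RTT_MS is an assumed value for a 1 Gbit/s LAN (~200 us).

RTT_MS = 0.2

def time_without_pipeline(n_commands: int) -> float:
    """Each command waits out its own network round trip."""
    return n_commands * RTT_MS

def time_with_pipeline(n_commands: int) -> float:
    """All commands travel together: a single round trip."""
    return RTT_MS

if __name__ == "__main__":
    n = 100
    print(f"sequential: {time_without_pipeline(n):.1f} ms")  # 100 RTTs
    print(f"pipelined:  {time_with_pipeline(n):.1f} ms")     # 1 RTT
```

With a real client such as redis-py, the same idea is building a `pipeline()`, queueing the commands, and calling `execute()` once.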
Delay caused by slow instruction
Use the slow-command monitoring described above to locate slow commands. You can then solve the problem in the following two ways:

- For example, in a Cluster, run O(N) operations such as aggregations on a slave, or do them on the client side.
- Use efficient commands instead. Use incremental iteration to avoid querying a large amount of data at a time. For details, see the SCAN, SSCAN, HSCAN, and ZSCAN commands.
In addition, the KEYS command is disabled in production and is only used for debugging. Because it iterates through all key-value pairs, operation latency is high.
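The cursor-based pattern behind SCAN can be sketched with an in-memory stand-in (a toy model; a real client would pass the cursor returned by Redis back into the next SCAN call):

```python
# Toy sketch of SCAN-style incremental iteration: each call returns one
# small page of keys plus a cursor; a cursor of 0 means iteration is done.
# This avoids the single huge, blocking reply that KEYS would produce.

def scan(keys, cursor=0, count=10):
    """Return (next_cursor, one page of keys); next_cursor 0 means done."""
    page = keys[cursor:cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(keys):
        next_cursor = 0
    return next_cursor, page

def scan_all(keys, count=10):
    """Drive the cursor loop, processing each small batch in turn."""
    cursor, results = 0, []
    while True:
        cursor, page = scan(keys, cursor, count)
        results.extend(page)  # in real code: process this batch, then continue
        if cursor == 0:
            break
    return results

all_keys = [f"user:{i}" for i in range(25)]
print(len(scan_all(all_keys, count=10)))  # -> 25
```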
Fork Delays caused by RDB generation
To generate the RDB snapshot, Redis must fork the background process. The fork operation (running in the main thread) itself causes latency.
Redis uses the multi-process copy-on-write (COW) technology of the operating system to implement snapshot persistence and reduce memory usage.
But fork involves copying the process's page tables, and a large 24 GB Redis instance requires 24 GB / 4 KB * 8 bytes = 48 MB of page tables.
When bgSave is executed, this involves allocating and copying 48 MB of memory.
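The arithmetic above can be checked directly (page size and the 8-byte page-table entry size for 64-bit Linux are as stated in the text):

```python
# Page-table size a fork must copy for a 24 GB Redis instance:
# (instance size / page size) entries, 8 bytes each.

GB = 1024 ** 3
KB = 1024

instance_bytes = 24 * GB
page_size = 4 * KB
pte_size = 8  # bytes per page-table entry on 64-bit Linux

pages = instance_bytes // page_size       # 6,291,456 pages
page_table_bytes = pages * pte_size
print(page_table_bytes // (1024 * 1024))  # -> 48 (MB)
```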
In addition, a replica cannot serve reads and writes while it is loading the RDB, so keep the master's data volume at about 2–4 GB so that replicas can finish loading quickly.
Transparent Huge Pages
While regular memory pages are allocated at 4 KB, the Linux kernel has supported the large memory page mechanism since 2.6.38, which supports allocation of 2MB of memory pages.
Redis uses fork to generate RDB for persistence to ensure data reliability.
When creating an RDB snapshot, Redis uses a copy-on-write technique so that the host thread can still receive write requests from the client.
When data is modified, Redis makes a copy of the data and modifies it.
With huge pages enabled, during RDB generation Redis must copy a whole 2 MB page even if the client modifies only 50 B of data. When many writes come in, the copying multiplies, slowing Redis down.
To disable large Linux memory pages, use the following command:
```shell
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```
Swap: operating system paging
When physical memory runs out, the operating system swaps some in-memory data to swap space, so the system does not hit an OOM or an even more fatal condition from running out of memory.
When a process requests memory from the OS and finds that the memory is insufficient, the OS swaps the unused memory data to the SWAP partition. This process is called SWAP OUT.
If a process needs the data and the OS finds that there is free physical memory, the data IN the SWAP partition is swapped back to the physical memory. This process is called SWAP IN.
Memory swap is a mechanism for the OPERATING system to swap memory data back and forth between the memory and disk.
What are the circumstances in which a swap can be triggered?
There are two common scenarios for Redis:
- Redis uses more memory than is available;
- Other processes running on the same machine as Redis perform a large number of file I/O operations (including generating large RDB files and AOF rewrites in background threads). File reads and writes consume memory, leaving Redis less memory and triggering swap.
Code, how do I check if the performance is slow due to swap?
Linux provides good tools to troubleshoot this problem, so when swap-induced delays are suspected, just follow these steps.
Get the Redis instance PID
```shell
$ redis-cli info | grep process_id
process_id:13160
```
Go to the /proc filesystem directory for this process:
```shell
cd /proc/13160
```
Here there is a smaps file that describes the memory layout of the Redis process. Run the following command to pull out the Size and Swap fields with grep.
```shell
$ cat smaps | egrep '^(Swap|Size)'
Size: 316 kB
Swap: 0 kB
Size: 4 kB
Swap: 0 kB
Size: 8 kB
Swap: 0 kB
Size: 40 kB
Swap: 0 kB
Size: 132 kB
Swap: 0 kB
Size: 720896 kB
Swap: 12 kB
```
Each line of Size represents the Size of the memory block used by the Redis instance, and the Swap below Size corresponds to how much of the memory area has been swapped out to disk.
If Size == Swap, the data has been completely swapped out.
You can see that the 720,896 kB memory region has only 12 kB swapped out to disk, which is fine.
Redis itself uses a lot of memory blocks of different sizes, so you can see a lot of Size rows, some very small (4KB) and some very large (720,896KB). Different memory blocks are swapped out to disk at different sizes.
Key points

If every Swap entry is 0 kB, or there are only sporadic 4 kB entries, everything is fine.
When swapped sizes reach hundreds of MB or even GB, it indicates that the Redis instance is under serious memory pressure and is likely to slow down.
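The manual grep can also be automated. A minimal sketch that sums the Swap fields from smaps-style text (sample data inlined; a real script would read `/proc/<pid>/smaps`):

```python
# Sum the Swap: fields of an smaps dump to estimate total swapped-out memory.
# SAMPLE_SMAPS stands in for the real /proc/<pid>/smaps contents.

SAMPLE_SMAPS = """\
Size:                316 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:             720896 kB
Swap:                 12 kB
"""

def total_swap_kb(smaps_text: str) -> int:
    """Add up every 'Swap:' line; the second column is the size in kB."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Swap:"):
            total += int(line.split()[1])
    return total

print(total_swap_kb(SAMPLE_SMAPS))  # -> 12
```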
The solution
- Increase machine memory;
- Run Redis on a separate machine to avoid running processes that require a large amount of memory on the same machine, thus meeting the memory requirements of Redis.
- Increase the number of clusters in a Cluster to reduce the memory required by each instance.
Delays caused by AOF and disk I/O
To ensure data reliability, Redis uses AOF and RDB snapshots for instant recovery and persistence.
AOF can be configured to perform write or fsync on disk in three different ways using the `appendfsync` option (this can be modified at runtime with the CONFIG SET command, for example: `redis-cli CONFIG SET appendfsync no`).
- No: Redis does not perform fsync, and the only delay comes from the write call, which only needs to write the log to the kernel buffer to return.
- Everysec: Redis performs fsync every second. The fsync operation is completed asynchronously using the backend child thread. At most 1s of data is lost.
- Always: Fsync is executed for every write operation and replies to the client with an OK code (Redis actually tries to aggregate many commands executed simultaneously into a single fsync) without data loss. Performance is typically very low in this mode, and it is highly recommended to use fast disks and file system implementations that can perform fsync in a short amount of time.
We usually use Redis as a cache, where lost data can be rebuilt from the data source, so high data reliability is not required; setting `no` or `everysec` is recommended.
In addition, if the AOF file is too large, Redis will rewrite the AOF file to generate a smaller AOF file.
The no-appendfsync-on-rewrite configuration item can be set to yes to indicate that no fsync operation will be performed during AOF rewriting.
That is, the Redis instance writes the write command to memory and returns it without calling the background thread for fsync.
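Putting the cache-oriented recommendations together, the relevant redis.conf fragment might look like this (a sketch; tune `appendfsync` to your own durability needs):

```conf
appendonly yes                  # enable AOF persistence
appendfsync everysec            # fsync once per second: at most 1 s of data lost
no-appendfsync-on-rewrite yes   # skip fsync while an AOF rewrite is running
```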
Expires: evicting expired data
Redis has two ways of weeding out outdated data:
- Lazy deletion: an expired key is deleted only when a request accesses it and finds it has expired.
- Scheduled deletion: Deletes expired keys every 100 milliseconds.
The algorithm for periodic deletion is as follows:

1. Randomly sample ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP keys and delete all the expired ones.
2. If more than 25% of the sampled keys were expired, repeat step 1.

ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP defaults to 20, and this deletion loop runs 10 times per second.
If the second condition keeps being triggered, Redis continually frees memory by deleting expired data, and the deletions block the main thread.

Hey, Code-guy, what triggers that?

A large number of keys set with the same expiration time: within the same second, a large number of keys expire, and the loop must keep deleting until the expired ratio drops below 25%.

In short: a large number of keys expiring at the same time can cause performance fluctuations.
The solution
If a batch of keys is meant to expire at the same time, add a random number within a certain range to the expiration time passed to EXPIREAT or EXPIRE. This ensures the keys are deleted within a nearby time window, avoiding the pressure caused by simultaneous expiration.
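A minimal sketch of the jitter idea (the key name, base TTL, and jitter range are all illustrative assumptions):

```python
import random

BASE_TTL = 3600    # intended expiry: one hour (illustrative)
JITTER_MAX = 300   # spread expirations over an extra 5 minutes (illustrative)

def ttl_with_jitter(base: int = BASE_TTL, jitter_max: int = JITTER_MAX) -> int:
    """TTL to pass to EXPIRE: the base plus a random offset."""
    return base + random.randint(0, jitter_max)

# With a real client this would be e.g.:
#   r.expire("activity:order:42", ttl_with_jitter())  # hypothetical key name
print(min(ttl_with_jitter() for _ in range(1000)) >= BASE_TTL)  # -> True
```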
bigkey
A key containing a large value, or a large number of members or elements, is usually called a big key. Here are several practical examples of big-key characteristics:

- A STRING key whose value is 5 MB (the value is too large)
- A LIST key with 10,000 elements (too many elements)
- A ZSET key with 10,000 members (too many members)
- A HASH key with only 1,000 members but a total value size of 10 MB (the value is too large)
Bigkey presents a problem as follows:
- Redis memory keeps climbing until it hits OOM, or, with maxmemory set, writes block or important keys get evicted.
- The memory of a node in Redis Cluster is much larger than that of other nodes, but the memory on the node cannot be equalized because the minimum granularity of data migration in Redis Cluster is Key.
- Read requests by BigKey occupy too much bandwidth, slowing down the read requests and affecting other services on the server.
- Deleting a bigkey blocks the master library for a long time and causes a synchronization break or a master/slave switch.
Find bigkey
Use the redis-rdb-tools tool to find big keys according to your own customized size criteria.
The solution
Split large keys
For example, if a HASH Key containing tens of thousands of members is split into multiple HASH keys, and the number of members for each Key is within a reasonable range, in the Redis Cluster structure, splitting large keys plays a significant role in memory balance among nodes.
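A minimal sketch of routing hash fields into smaller buckets (the bucket count, key prefix, and the choice of CRC32 are illustrative assumptions):

```python
import zlib

BUCKETS = 100  # number of smaller hashes to split the big HASH into (assumed)

def shard_key(base_key: str, field: str) -> str:
    """Deterministically route a hash field to one of BUCKETS smaller keys."""
    bucket = zlib.crc32(field.encode("utf-8")) % BUCKETS
    return f"{base_key}:{bucket}"

# Every HSET/HGET then targets the shard instead of the single big key, e.g.:
#   r.hset(shard_key("user:profile", user_id), user_id, payload)  # hypothetical
print(shard_key("user:profile", "alice"))
```

Because the routing is deterministic, reads and writes for the same field always land on the same shard, and members spread roughly evenly across the 100 smaller keys.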
Asynchronously clear large keys
Redis 4.0 provides the UNLINK command, which reclaims the given keys slowly and gradually in a non-blocking manner. With UNLINK, you can safely delete big keys and even huge keys.
Conclusion
Here is a checklist to help you efficiently resolve Redis slowdowns.
- Get the current Redis baseline performance;
- Enable slow command monitoring to locate problems caused by slow command;
- For slow commands such as KEYS, switch to SCAN-style incremental iteration;
- The data size of the instance is controlled within 2-4GB to avoid blocking due to excessive RDB files loaded by the master/slave replication.
- Disable transparent huge pages. With 2 MB huge pages enabled, Redis must copy a whole 2 MB page during RDB generation even if the client modifies only 50 B of data; with many writes, the copying multiplies and slows Redis down.
- Whether the memory used by Redis is too large to cause swap;
- Set no-appendfsync-on-rewrite to yes to prevent AOF rewriting from competing with fsync for disk I/O resources and causing delay in Redis.
- Bigkeys cause a number of problems, and we need to split them to prevent bigkeys and remove them asynchronously through UNLINK.