Recently our Redis has been very "busy": every day as 10 a.m. approaches, Redis starts throwing all kinds of errors. In most scenarios this is tolerable, since we catch cache exceptions and only log them instead of rethrowing, so at worst things get slower. Still, there are some scenarios where the trouble is real:
1. Some scenarios use Redis to cache verification codes. For example, when a user logs in with an SMS verification code, the code is clearly sent successfully and entered correctly, yet the user is told that verification failed.
2. As every service that relies on Redis slows down, the site keeps returning 502 and eventually becomes essentially inaccessible.
So the operations team had no choice but to restart Redis. After the restart, Redis memory usage dropped dramatically, the number of connections fell, the site quickly stabilized, and everything seemed to come back to life.
This repeated for several days until we spotted the pattern, so we simply set up a scheduled task to restart Redis at 5 a.m. every day, heading off the coming 10 a.m. turbulence before it could build up.
But for a team with technical standards and technical ambition, how could we settle for such timidity?
Two phenomena stood out:
1. PHP generates a large cache (more than 1 GB) that keeps growing.
2. The number of Redis connections surges around 10 a.m. every day.
Serious problems appear once cache memory usage exceeds 1 GB and the connection count is high: most cache data can no longer be read or written, so the website slows down. Worse, when a Redis read fails the application tries to rebuild and write the cache entry; that write fails too, the next read fails again, and so on, producing far more writes than normal. The result is an avalanche: eventually Redis stops working altogether and the whole site becomes inaccessible.
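The feedback loop is easier to see in code. Below is a minimal sketch of the cache-aside read path, with hypothetical function, table, and key names (an illustration of the pattern, not our actual code): under overload, every failed read falls through to the database and triggers a cache write that fails as well.

```php
<?php
// Illustrative cache-aside read path (names are hypothetical).
function getPost(Redis $redis, PDO $db, int $id)
{
    try {
        $cached = $redis->get("post_$id");
        if ($cached !== false) {
            return unserialize($cached);   // cache hit
        }
    } catch (RedisException $e) {
        error_log($e->getMessage());       // we only log cache errors
    }

    // Cache miss or cache failure: fall through to the database.
    $stmt = $db->prepare('SELECT * FROM posts WHERE id = ?');
    $stmt->execute([$id]);
    $post = $stmt->fetch(PDO::FETCH_ASSOC);

    try {
        // Every failed read also triggers a write attempt. Under
        // overload this write fails too, so the next request misses
        // again: reads and writes snowball into an avalanche.
        $redis->setex("post_$id", 86400, serialize($post));
    } catch (RedisException $e) {
        error_log($e->getMessage());
    }

    return $post;
}
```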
After Redis is restarted, memory usage drops dramatically because persistence is not configured. Cache writes recover and the number of connections falls significantly. Over the next day or two, however, Redis memory usage gradually climbs back to its original level.
To solve the problem we made some attempts. For example, the cache and the queue originally shared one Redis instance, so we split them onto separate instances; we also switched to long-lived connections to Redis. Neither helped much, and neither addressed the root cause. So we went back to square one and asked: what are we actually using Redis for?
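For reference, the connection change is a one-line switch in phpredis; a minimal sketch, with a placeholder host:

```php
<?php
$redis = new \Redis();
// connect() opens a fresh TCP connection on every request;
// pconnect() keeps a persistent connection alive per PHP-FPM
// worker, cutting connect overhead and TIME_WAIT sockets.
$redis->pconnect('172.16.0.1', 6379);
```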
Faced with a vast sea of cache entries and no obvious place to start, we decided to find the main problem first and study the secondary ones later.
A single large key
redis-cli provides a simple command for finding large keys:
```bash
redis-cli --bigkeys
```
The largest key was 1,003,547 bytes, about 980 KB. It is used when the APP searches merchant names, an interface that is called quite frequently and not cached on the client. Kibana showed the interface backed by this cache peaking at nearly 500 calls per 30 minutes: a key that is both big and hot.
Apart from this standout, the next largest keys were 150 KB, 95 KB, 90 KB, and so on. The first priority, then, was to eliminate the biggest culprit.
Since this API is consumed by the APP, any major change would force an APP re-release. Instead, we filtered the interface's output and kept only the data the APP actually uses. Once the tweaked interface was redeployed, the cached content immediately shrank and the overall response time improved.
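The idea, as a hedged sketch (the field names below are invented for illustration): whitelist the fields the APP renders before caching and returning the result.

```php
<?php
// Keep only the fields the APP actually renders (hypothetical names).
function filterMerchant(array $merchant): array
{
    $wanted = ['id', 'name', 'logo', 'address'];
    return array_intersect_key($merchant, array_flip($wanted));
}

// Applied to the search results before they are cached and returned.
$results = array_map('filterMerchant', $searchResults);
```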
With that change made, we compared Redis storage size, response body length, and response time before and after:
| Dimension | Before | After | Change |
|---|---|---|---|
| Redis storage size | 980 KB | 437 KB | -55.4% |
| Response body length | 5302 KB | 969 KB | -81.7% |
| Response time | 700 ms | 40 ms | -94.2% |
Obviously, the overall improvement is quite large.
That dealt with the largest key, and the improvement mattered greatly for that one interface, but it barely changed the cache as a whole. We had to keep digging.
In addition, while inspecting the source code we found that the Redis read/write timeout configuration was not actually taking effect. We had configured a timeout of 100 ms, yet reads and writes of this large key took well over 100 ms and still completed, which is the only reason they worked at all. Because the timeout did not apply, reads of large keys ran longer and held Redis resources longer, degrading every other connection and lowering Redis's overall capacity to serve.
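With phpredis this is an easy trap: the timeout passed to connect() only covers connection establishment, and the read timeout must be set separately. A minimal sketch (placeholder host):

```php
<?php
$redis = new \Redis();
// The third argument is only the *connect* timeout, in seconds.
$redis->connect('172.16.0.1', 6379, 0.1);
// The read/write timeout must be set separately; without it,
// reading a large key can run far past the intended 100 ms.
$redis->setOption(Redis::OPT_READ_TIMEOUT, 0.1);
```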
A series of large keys
Besides individual large keys, there are also large key series: groups of keys that store data of the same nature, sharing a key-name prefix and differing only in the feature ID at the end, such as post_1111, post_1112, and post_1233. Although no single key takes much space, there are very many of them, they are written frequently, and some are even written but never read, so their actual hit ratio is very low.
We found this while analyzing the keys stored in Redis. The keys being written can be captured with tcpflow:
```bash
tcpflow port 6379 -cp -i ens192 | grep SET -A 2 > setcache.log
```
After capturing for a while, we tallied how frequently each stored key was used, which told us which keys were most active. The next step was to find out exactly what these keys store, what they are used for, and roughly how long they are, so we could decide what to do with them.
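Tallying the captured log for frequency takes only a few lines. A sketch under these assumptions: in the grep output the key name sits two lines after each "SET" token (RESP framing puts a "$&lt;len&gt;" length marker in between), and keys end in a numeric ID, so trailing digits are collapsed to group keys into series. The exact framing depends on your capture, so treat this as a starting point.

```php
<?php
// Tally SET frequency per key series from the tcpflow capture.
// Assumes the key name sits two lines after "SET"; adjust as needed.
$lines  = file('setcache.log', FILE_IGNORE_NEW_LINES);
$counts = [];
foreach ($lines as $i => $line) {
    if (trim($line) === 'SET' && isset($lines[$i + 2])) {
        $key = trim($lines[$i + 2]);
        // Collapse trailing numeric IDs so post_1111 and post_1112
        // are counted under the same series, "post_*".
        $series = preg_replace('/\d+$/', '*', $key);
        $counts[$series] = ($counts[$series] ?? 0) + 1;
    }
}
arsort($counts);
print_r(array_slice($counts, 0, 20, true)); // top 20 series
```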
To solve these problems, we wrote a few simple tools:
- Fetch the content stored under a given key. Large cache entries are compressed, so a tool is needed to restore them (a sketch follows this list).
- List all keys matching a given prefix, then sum the length of each key to get the total and average storage size.
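For the first tool, a minimal sketch, assuming entries are stored as gzcompress(serialize($data)); adjust the restore step to match however your code actually serializes and compresses (the host is a placeholder):

```php
#!/bin/env php
<?php
// Fetch one cache entry and restore it for inspection.
$key = $argv[1];

$redis = new \Redis();
$redis->connect('172.16.0.1', 6379);

$raw = $redis->get($key);
if ($raw === false) {
    exit("key not found\n");
}

// Assumption: entries are gzcompress(serialize($data)).
$plain = @gzuncompress($raw);
$data  = unserialize($plain !== false ? $plain : $raw);
var_export($data);
echo PHP_EOL;
```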
The Redis KEYS command finds the keys matching a given prefix, for example:
```bash
redis-cli KEYS "prefix*"
```
So how do we measure the length of each cached entry? Since redis-cli --bigkeys knows which key is the largest, the answer had to be in its source. Sure enough, a look at the redis-cli source code turned up what we needed:
```c
static void getKeySizes(redisReply *keys, int *types,
                        unsigned long long *sizes)
{
    redisReply *reply;
    char *sizecmds[] = {"STRLEN","LLEN","SCARD","HLEN","ZCARD"};
    unsigned int i;

    /* Pipeline size commands */
    for(i = 0; i < keys->elements; i++) {
        /* Skip keys that were deleted */
        if(types[i] == TYPE_NONE) continue;
        redisAppendCommand(context, "%s %s",
                           sizecmds[types[i]], keys->element[i]->str);
    }
    /* ... */
```
Each Redis data type has its own command for measuring size. Since we use Redis as a cache and serialize data to strings before storing it, STRLEN gives us the storage size of each key.
Our tools are roughly as follows:
```php
#!/bin/env php
<?php
$key = $argv[1];

$redis = new \Redis();
$redis->connect('172.16.0.1', 6379);

$keys = $redis->keys($key);
$num  = count($keys);
if ($num == 0) {
    echo "found none.\n";
    exit(0);
}
echo "found $num keys.\n";

$total = 0;
foreach ($keys as $i => $key) {
    $total += $redis->strlen($key);
    if ($i % 1000 == 0) {
        echo $i . ":" . $total . "\r";
    }
}
$redis->close();

printf("\naverage size:%.2fk", $total / $num / 1024);
printf("\ntotal size:%.2fM", $total / 1024 / 1024);
echo PHP_EOL;
```
With these tools, further analysis of the live cache keys finally turned up several significant large key series:
| Key | Count | Average size | Total size |
|---|---|---|---|
| php_newGetTopic* | 131,393 | 2.73 KB | 350.78 MB |
| php_bbs_posts* | 144,066 | 1.51 KB | 212.22 MB |
| php_getAskPostList* | 39,155 | 1.61 KB | 61.75 MB |
| Combined | 314,614 | – | 624.75 MB |
One look at this was startling: of the 1 GB and more of cached data, these key series accounted for more than half. Why?
A look at the contents showed that these keys cache the bodies and answers of forum posts and Q&A questions. This content is no longer very active, but it is a crawler's favorite. Worse, a crawler follows links tirelessly from one post to the next. The cache expiration was usually set to more than a day, yet in practice most of this content is crawled only once a day, after which the cached copy is never used again, so the hit ratio of such caches is essentially zero. On top of that, post bodies are long, so they naturally take up more space.
On further analysis, these caches are looked up via the primary key, and fetching the same data from the database by primary key is already very efficient. With the hit ratio this low, it was better to drop the cache entirely. After some code changes and an online release, we got exactly the effect we hoped for; the monitoring graphs spoke for themselves.
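Schematically, the change looks like this (function and table names are hypothetical): the cached read path is replaced by a plain primary-key query.

```php
<?php
// Before (hypothetical): post content cached by ID with a TTL
// over one day, even though most entries were read once at most.
//
// After: the primary-key lookup is cheap, so skip the cache.
function getPost(PDO $db, int $id): ?array
{
    $stmt = $db->prepare('SELECT * FROM posts WHERE id = ?');
    $stmt->execute([$id]);
    return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
}
```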
After a few days of transition, our cache finally settled back to around 300 MB. Operation is stable, the connection count stays low with very little jitter, and the "busy" scenes of before have not returned.
Conclusion
Cache troubleshooting workflow
- Watch how the cache capacity changes over time. If keys keep being added, identify which keys are added most frequently, have long expiration times, and store large content.
- If the total storage is large, add cache instances. If there are too many connections, or short-lived connections leave too many sockets in TIME_WAIT, switch to long-lived connections.
- Find the large keys in the cache with `redis-cli --bigkeys`.
- Check whether the large keys contain a variable part (such as a numeric ID). If so, count how many such keys exist and evaluate their hit ratio (a quick way to check is sketched below); try to cache only the essential primary-key data. Content retrievable by primary key can be read directly from the DB without caching.
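One quick way to gauge the hit ratio (instance-wide, not per key series) is Redis's own counters. A minimal sketch using phpredis (placeholder host):

```php
<?php
$redis = new \Redis();
$redis->connect('172.16.0.1', 6379);

// keyspace_hits / keyspace_misses are cumulative since startup;
// for a per-change view, compare snapshots before and after.
$stats  = $redis->info('stats');
$hits   = (int) $stats['keyspace_hits'];
$misses = (int) $stats['keyspace_misses'];

printf("hit ratio: %.2f%%\n", 100 * $hits / max(1, $hits + $misses));
```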
Standardizing cache usage
Hunting down root causes after a cache problem appears is effective but always lags behind the damage. We also need foresight: standards and processes that keep cache usage reasonable from the start, and that make its reasonableness checkable, with evidence to point to.
From this investigation, the following aspects of cache usage can be standardized:
- Key names should not be too long; using MD5 hashes as keys is not recommended.
- A key's name should make its purpose obvious.
- Values should not be too large, preferably within 10 KB or less (a rule of thumb, to be verified later). Oversized values hurt network transfer speed.
- Values should store key IDs and relationships; objects addressable by primary key can be fetched directly from the DB (a sketch follows this list).
- Cache only hotspot data, to improve the cache hit ratio.
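As an illustration of the "store IDs and relationships" rule, a sketch with hypothetical key, table, and column names: cache only the ID list for a hot listing and fetch the rows themselves by primary key.

```php
<?php
// Cache the ID list for a hot listing, not the full objects.
function getHotPostIds(Redis $redis, PDO $db): array
{
    $cached = $redis->get('hot_post_ids');
    if ($cached !== false) {
        return unserialize($cached);
    }
    $ids = $db->query('SELECT id FROM posts ORDER BY views DESC LIMIT 20')
              ->fetchAll(PDO::FETCH_COLUMN);
    $redis->setex('hot_post_ids', 300, serialize($ids)); // short TTL
    return $ids;
}
// Post bodies are then fetched by primary key straight from the DB.
```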
Finally, an extended reading: “Aliyun Redis development specification”.