Preface

When designing a cache system, we must consider cache penetration, cache breakdown, and the avalanche effect of cache failure.

Cache penetration

Cache penetration refers to querying data that does not exist. Because the cache is only written passively on a miss, and, for fault tolerance, a lookup that finds nothing in the storage layer is not written back to the cache, every query for non-existent data goes straight to the storage layer, defeating the purpose of the cache.

Under heavy traffic, this can bring the DB down. If someone deliberately and frequently attacks our application with non-existent keys, this becomes a vulnerability.

The solution

There are several effective ways to solve the cache penetration problem. The most common is a Bloom filter: hash all possibly existing data into a sufficiently large bitmap, so that a query for data that definitely does not exist is intercepted by the bitmap, sparing the underlying storage system the query pressure.
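As a minimal sketch, assuming Guava's BloomFilter (the article does not name a library) and a hypothetical queryCacheThenDb lookup, valid keys are preloaded into the filter and impossible keys are rejected before they touch the cache or the DB:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterGuard {
	// sized for 1M keys with a 1% false-positive rate (illustrative numbers)
	private final BloomFilter<String> filter =
			BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

	public void preload(Iterable<String> allValidKeys) {
		for (String key : allValidKeys) {
			filter.put(key);
		}
	}

	public String get(String key) {
		// mightContain == false means the key definitely does not exist:
		// reject it without touching the cache or the storage layer
		if (!filter.mightContain(key)) {
			return null;
		}
		return queryCacheThenDb(key); // hypothetical downstream lookup
	}

	private String queryCacheThenDb(String key) { /* cache, then DB */ return null; }
}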

A cruder approach (the one we adopted) is: if a query returns an empty result (whether because the data does not exist or because of a system failure), we still cache the empty result, but with a short expiration time, no more than five minutes.
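A minimal sketch of that empty-result caching, assuming the Jedis client and a hypothetical loadFromDb helper (TTL values are illustrative):

import redis.clients.jedis.Jedis;

public class EmptyResultCache {
	private final Jedis redis = new Jedis("localhost", 6379);

	public String get(String key) {
		String value = redis.get(key);
		if (value != null) {
			// an empty string is our sentinel for "known to be missing"
			return value.isEmpty() ? null : value;
		}
		value = loadFromDb(key); // hypothetical storage-layer lookup
		if (value == null) {
			// cache the empty result too, but only briefly (5 minutes at most)
			redis.setex(key, 5 * 60, "");
			return null;
		}
		redis.setex(key, 60 * 60, value); // normal TTL for real data
		return value;
	}

	private String loadFromDb(String key) { /* query the DB */ return null; }
}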

Cache avalanche

Cache avalanche refers to many cache entries being set with the same expiration time, so that the cache collectively becomes invalid at a certain moment and all requests are forwarded to the DB, which is instantly overloaded.

The solution

The avalanche effect of a cache failure on the underlying system is terrifying. Most system designers consider locking or queuing to guarantee single-threaded (single-process) writes to the cache, so that a flood of concurrent cache misses does not fall through to the underlying storage system.

A simple solution shared here is to spread out cache expiration times. For example, add a random value on top of the original expiration time, say 1 to 5 minutes chosen at random, so that expiration times rarely coincide and a collective failure event is hard to trigger.
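A minimal sketch of that jitter (the base TTL and the 1 to 5 minute range are the illustrative numbers from above):

import java.util.concurrent.ThreadLocalRandom;

public class CacheTtl {
	// add a random 1-5 minutes to the base TTL so entries written together
	// do not all expire at the same instant
	static int withJitter(int baseSeconds) {
		return baseSeconds + ThreadLocalRandom.current().nextInt(60, 5 * 60 + 1);
	}
}
// usage: redis.setex(key, CacheTtl.withJitter(3600), value);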

Cache breakdown

Some keys with an expiration time set hold very “hot” data that may be accessed with extremely high concurrency at some point in time. Here a problem has to be considered: cache “breakdown”. The difference from cache avalanche is that breakdown concerns a single key, whereas avalanche concerns many keys.

When such a key expires at a certain point in time, a large number of concurrent requests for it arrive. Each request sees the cache miss, loads the data from the backend DB, and writes it back into the cache; at that moment the flood of concurrent requests can overwhelm the backend DB.

The solution

1. Use mutex keys

A common industry practice is to use a mutex. In simple terms: when the cache misses, instead of loading from the DB immediately, first set a mutex key using an operation that reports success atomically (such as Redis SETNX or Memcache ADD). If the operation succeeds, load from the DB and set the cache; otherwise, retry the whole get-from-cache method.

SETNX, which stands for “SET if Not eXists”, can be used to achieve the effect of a lock. Before Redis 2.6.12 there was no atomic way to combine SETNX with an expiration time, so here are two versions of the code for reference:

String get(String key) throws InterruptedException {
	String value = redis.get(key);
	if (value == null) {
		if (redis.setnx(key_mutex, "1")) {
			// 3 min timeout to avoid mutex holder crash
			redis.expire(key_mutex, 3 * 60);
			value = db.get(key);
			redis.set(key, value);
			redis.delete(key_mutex);
		} else {
			Thread.sleep(50);
			return get(key); // retry
		}
	}
	return value;
}

New version code:

public String get(String key) {
	String value = redis.get(key);
	if (value == null) { // the cached value has expired
		// set a 3 minute timeout so the db is not reloaded if the del below fails
		if (redis.setnx(key_mutex, 1, 3 * 60) == 1) { // set succeeded
			value = db.get(key);
			redis.set(key, value, expire_secs);
			redis.del(key_mutex);
		} else { // another thread has loaded the db and set the cache
			sleep(50);
			return get(key); // try again
		}
	}
	return value;
}

Memcache code:

if (memcache.get(key) == null) {
	// 3 min timeout to avoid mutex holder crash
	if (memcache.add(key_mutex, 3 * 60 * 1000) == true) {
		value = db.get(key);
		memcache.set(key, value);
		memcache.delete(key_mutex);
	} else {
		sleep(50);
		retry();
	}
}
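For reference, a minimal sketch of the same mutex pattern against a real client, assuming Jedis 3.x and a hypothetical loadFromDb helper; it uses the atomic SET with NX and EX options available since Redis 2.6.12:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class MutexCache {
	private final Jedis redis = new Jedis("localhost", 6379);

	public String get(String key) throws InterruptedException {
		String value = redis.get(key);
		while (value == null) {
			String mutexKey = "mutex:" + key;
			// SET mutexKey 1 NX EX 180: set-if-absent with a 3 min TTL, atomically
			String ok = redis.set(mutexKey, "1", SetParams.setParams().nx().ex(180));
			if ("OK".equals(ok)) {
				value = loadFromDb(key);       // hypothetical storage-layer lookup
				redis.setex(key, 3600, value); // write back with a normal TTL
				redis.del(mutexKey);
			} else {
				Thread.sleep(50);              // another thread holds the mutex
				value = redis.get(key);        // re-read the cache and retry
			}
		}
		return value;
	}

	private String loadFromDb(String key) { /* query the DB */ return "value"; }
}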

2. Use mutex keys “ahead of time”

Set a timeout value (timeout1) inside the cached value itself, smaller than the actual memcache timeout (timeout2).

When a read finds that timeout1 has expired, extend timeout1 and write the value back to the cache; then load the data from the database and set it into the cache.

The pseudocode is as follows:

v = memcache.get(key);
if (v == null) {
	if (memcache.add(key_mutex, 3 * 60 * 1000) == true) {
		value = db.get(key);
		memcache.set(key, value);
		memcache.delete(key_mutex);
	} else {
		sleep(50);
		retry();
	}
} else {
	if (v.timeout <= now()) {
		if (memcache.add(key_mutex, 3 * 60 * 1000) == true) {
			// extend the timeout for other threads
			v.timeout += 3 * 60 * 1000;
			memcache.set(key, v, KEY_TIMEOUT * 2);
			// load the latest value from db
			v = db.get(key);
			v.timeout = KEY_TIMEOUT;
			memcache.set(key, v, KEY_TIMEOUT * 2);
			memcache.delete(key_mutex);
		} else {
			sleep(50);
			retry();
		}
	}
}

3. “Never expire”

Here “never expires” has two meanings:

(1) From the Redis point of view, no expiration time is actually set, which guarantees the hot key can never hit the expiration problem; that is, it does not expire “physically”.

(2) From a functional point of view, if the key never expires, doesn't the data go stale? So we store an expiration time in the value associated with the key; when we find that this time is about to pass, we rebuild the cache on a background asynchronous thread. This is called “logical” expiration.

From a practical point of view, this approach is very performance-friendly. The only downside is that while the cache is being rebuilt, other threads (those not rebuilding it) may read stale data, which is tolerable for most Internet applications.

String get(final String key) {
	V v = redis.get(key);
	String value = v.getValue();
	long timeout = v.getTimeout();
	if (timeout <= System.currentTimeMillis()) {
		// rebuild the cache asynchronously on a background thread
		threadPool.execute(new Runnable() {
			public void run() {
				String keyMutex = "mutex:" + key;
				if (redis.setnx(keyMutex, "1")) {
					// 3 min timeout to avoid mutex holder crash
					redis.expire(keyMutex, 3 * 60);
					String dbValue = db.get(key);
					redis.set(key, dbValue);
					redis.delete(keyMutex);
				}
			}
		});
	}
	return value;
}

4. Resource protection

With Netflix Hystrix, you can isolate resources to protect the application's main thread pool; the same idea can be applied to loading the cache.
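As a rough illustration using Hystrix's standard command API (the command name and the dbLoader supplier are hypothetical), the DB load runs on an isolated thread pool and falls back gracefully when that pool is saturated:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import java.util.function.Supplier;

public class LoadFromDbCommand extends HystrixCommand<String> {
	private final Supplier<String> dbLoader; // hypothetical wrapper around db.get(key)

	public LoadFromDbCommand(Supplier<String> dbLoader) {
		// commands in the same group share one isolated thread pool by default
		super(HystrixCommandGroupKey.Factory.asKey("CacheLoader"));
		this.dbLoader = dbLoader;
	}

	@Override
	protected String run() {
		// the protected call; executes on the Hystrix thread pool, not the caller's
		return dbLoader.get();
	}

	@Override
	protected String getFallback() {
		// returned when the pool is saturated, the call times out, or run() throws:
		// degrade gracefully instead of piling more requests onto the DB
		return null;
	}
}
// usage: String value = new LoadFromDbCommand(() -> db.get(key)).execute();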

Among these four solutions there is no best, only the most suitable.



Conclusion

For business systems, it is always case-by-case analysis: there is no best solution, only the most suitable one.

Finally, two other common cache-system problems, the cache filling up and data loss, also need to be analyzed against the specific business. Overflow is usually handled with an LRU eviction policy, and Redis's RDB and AOF persistence strategies can be used to keep data safe where required.
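In Redis, for example, these policies map onto a few redis.conf directives (the values here are illustrative, not recommendations):

maxmemory 2gb                  # cap memory usage so eviction can kick in
maxmemory-policy allkeys-lru   # evict least-recently-used keys on overflow
save 900 1                     # RDB: snapshot if at least 1 change in 900 s
appendonly yes                 # AOF: log every write for durability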
