A sense of security is the stability after the dust settles, and the belief that is never abandoned.

Background

With Redis as the existing cache scheme, growing data volumes, increasingly complex data structures, and a rising number of access requests have led to performance problems: network IO can no longer be ignored, and Redis failures arrive one after another. A multi-level cache practice has become urgent.

The multi-level cache is tentatively named multi-layer-cache (MLC).

The company runs a B2B e-commerce project with many business lines and product types. Suppliers occasionally run activities such as flash sales, product promotions, and product auctions, which turns links such as order summaries, inventory, logistics, and product details into hotspots on the application's cache access path. Before each promotion, hot data can be simulated in network stress tests, and data engineers can roughly predict hotspot data to reduce its uncertainty. However, at access time a small number of hot keys at the application layer still generate a huge number of requests to the cache system. These requests occupy bandwidth, put pressure on the cache system, and ultimately affect the overall stability of the application layer.

  • Pain points of implementing a multi-level cache solution on top of the original project
    • Effect verification: how the application layer checks the local cache hit ratio and hotspot key data to verify the effect of the multi-level cache
    • Data consistency: consistency between the application-layer cache and the distributed cache
    • Safe access: minimizing damage to the system while the solution is rolled out

MLC mainly provides hotspot detection and local caching, reducing the impact of hotspot access on the downstream distributed cache service and protecting the performance and stability of application services.

Preparation of multi-level caching schemes

Local cache selection

EhCache: as an in-process Java caching framework, EhCache has the advantages of flexibility, a small footprint, and compatibility with stable, popular applications.

  • Memory capacity problem
    • EhCache supports two default storage modes: one entirely memory-based, the other able to overflow to disk.
    • EhCache also provides LRU, LFU, and FIFO eviction algorithms; basic attributes support hot configuration, and many plug-ins are available, so it is flexible to use.
    • EhCache provides listener modes, such as the cache manager listener and the cache listener, which make it easy to broadcast cache-related operations and implement specific functional needs in more scenarios.
  • Distributed applications
    • EhCache has inherent defects in distributed scenarios, such as stability and consistency.
    • RMI mode
      • Java’s built-in RMI support means no extra dependencies are introduced; the caller only needs to care about the serialization protocol between key and value.
      • Manual configuration requires writing the other nodes’ information by hand to configure the cluster. In practice the differences between online and offline environments lead to different configuration information, and manual operation is error-prone.
      • The automatic discovery mechanism finds nodes in the cluster through multicast. Compared with manual configuration it is more flexible, simpler, and less error-prone.
    • JGroups mode
      • Complex configuration files
      • Compared with RMI mode, it provides a TCP-based unicast mode and a UDP-based multicast mode, and relies on third-party packages.
    • EhCache Server
      • An independent cache server that uses EhCache internally and the two modes above for clustering (EhCache Server is essentially an external wrapper around them)
      • Provides a programming-language-independent interface that is RESTful over HTTP or SOAP.

EhCache summary: EhCache supports element-level cache operations and eviction algorithms. The Spring framework and its annotations support EhCache, which has a rich and stable ecosystem. EhCache is used in many production environments, making it easy to get started.

Caffeine is a high-performance, near-optimal caching library based on Java 8. Caffeine draws on the design experience of the Guava cache and ConcurrentLinkedHashMap to provide an in-memory cache.

  • Caffeine has the following advantages
    • Data can be loaded into the cache automatically; loading can be asynchronous, and asynchronous refresh is also supported.
    • Eviction and expiration strategies can be based on size, time, or access frequency.
    • Keys and values can be wrapped in weak or soft references, which is an advantage for GC.
    • Listeners are notified when an entry is evicted or removed.
    • Cache access comes with built-in statistics.
    • Of course, Caffeine has many more caching features.
  • Disadvantages
    • Capacity is limited by the application’s memory.
    • No persistence.
    • Data cannot be synchronized in a distributed environment.

For more information, visit Caffeine’s official website

When open-source caches are not transparent or detailed enough, or the cache is too important to your project, you can opt for a self-built cache component. It is reinventing the wheel, but it is also something to be proud of, and it lets us maintain our core components better.

  • If you only need to store a few values, a HashMap is a good choice.
  • If you need thread safety, you can build your own cache component on ConcurrentHashMap.
  • An eviction mechanism can be implemented to control the objects: attach an expiration time to each key and run a background thread to apply the expiration strategy; you can also extend LinkedHashMap to implement LRU or FIFO eviction and manage expired keys (a minimal sketch follows this list).
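As a rough illustration of the last point, here is a minimal, hedged sketch of a self-built LRU cache based on LinkedHashMap; the class name and capacity are made up for the example and are not part of the original design.

import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache sketch built on LinkedHashMap in access-order mode. */
public class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public SimpleLruCache(int maxEntries) {
        // accessOrder = true makes get() move entries to the tail (most recently used)
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // evict the least recently used entry once the capacity is exceeded
        return size() > maxEntries;
    }
}

For thread safety it would still need to be wrapped with Collections.synchronizedMap, or replaced by a ConcurrentHashMap-based design as the bullet above notes.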

We selected Caffeine, with its stable and excellent performance, as the local cache. The feature comparison on Caffeine’s official website shows how it matches up against the alternatives:

There is a lot to learn about Caffeine’s usage and underlying implementation, so the time cost has to be taken into account.

Distributed cache selection

Since the original project already used Redis, Redis was chosen as the distributed cache so as not to increase system complexity.

  • Of course, there are more reasons to choose Redis
    • Rich data structures
    • Lua scripting
    • LRU eviction
    • Persistence
    • High availability through Redis Sentinel and automatic partitioning through Redis Cluster


Caffeine and Redis have complementary strengths and weaknesses.

For more information about distributed caching, see Cache Component Selection.

MLC overall architecture

  • The overall architecture of MLC is divided into two layers, as shown in the figure above
    • Storage layer: provides the basic KV data storage capability, backed by the final DB.
    • Application layer: provides a unified client for application services, with built-in functions such as hotspot detection and local caching, transparent to the business.

Positioning of the application layer: an SDK that is transparent to the business and handles hotspots and concurrency control (rate limiting).

As performance pressure increases, the storage layer also needs data sharding. In that case, a proxy layer can be added between the two layers to provide a unified cache access and communication protocol for the application layer, and to take over routing and forwarding after the distributed data is split horizontally. Adding a proxy layer helps decouple the application layer from the storage layer, but it also brings architectural complexity and reduced availability.

For now we mainly consider the details of data acquisition, data consistency, and hotspot detection in the application-layer client.

Read and write of multi-level cache

Sorting out the scheme

In a distributed system where the cache and database coexist, the write order between them must be decided, and the read/write maintenance scheme must be deduced step by step. When two write operations happen at the same point in time:

  • Operate on the database first, then the cache: the steps should be [3,2,1,6,5,4]. After the cache is loaded by subsequent read operations, there are no data consistency problems either.

When a read and a write operation occur at the same point in time:

  • Cache first, then database: if the steps follow the order [1,2,3,4,5,6,7,8] there is no problem, but when the order becomes [1,2,3,4,6,7,5,8] or [1,3,4,2,6,7,8,5], old data remains in the cache; every subsequent read then returns stale data, and the cache ends up inconsistent with the database.

  • Database first, then cache: if the steps follow the order [3,2,1,4,5,6,7,8], eventual consistency of the data is achieved while dirty reads are tolerated. This scheme has no major defects; it is possible that the cache-delete step in [1,2] fails, but the probability is relatively small, and it is clearly better than the previous scheme, so this scheme is adopted.

To keep concurrent reads and writes and double writes correct, write operations therefore operate on the database first and then on the cache.

The Cache Aside Pattern is a good practice for multi-level data caching.

  • The detailed steps of a read request are as follows (see the sketch after this list):

    • Every time data is read, it is read from the cache first
    • If the data is there, it is returned: a cache hit
    • If the data cannot be read from the cache, it is fetched from the DB: a cache miss
    • The data fetched from the DB is then put into the cache, so the next read will hit
  • The detailed steps of a write request are as follows:

    • Write the changes to the database
    • Delete the corresponding data from the cache
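A minimal sketch of the Cache Aside read/write path, assuming a Spring-style StringRedisTemplate bean; OrderMapper, the key format, and the TTL values are hypothetical illustrations rather than the project's actual code.

import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderQueryService {

    @Autowired
    private StringRedisTemplate redisTemplate;
    @Autowired
    private OrderMapper orderMapper; // hypothetical DAO

    // Read path: cache first, fall back to the DB on a miss, then back-fill the cache
    public String getOrderSummary(String orderId) {
        String cacheKey = "order:summary:" + orderId;
        String cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached; // cache hit
        }
        String fromDb = orderMapper.selectSummary(orderId); // cache miss
        if (fromDb != null) {
            redisTemplate.opsForValue().set(cacheKey, fromDb, 30, TimeUnit.MINUTES);
        }
        return fromDb;
    }

    // Write path: update the database first, then delete the cache entry
    public void updateOrderSummary(String orderId, String summary) {
        orderMapper.updateSummary(orderId, summary);
        redisTemplate.delete("order:summary:" + orderId);
    }
}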

Because cache deletion can fail during reads and writes, and because MLC’s local caches are isolated per service in a distributed application, extra steps are needed to ensure data consistency.

Method 1: put the data update and the cache delete in one transaction, so they advance and roll back together.

Method 2: after a cache deletion fails, retry a certain number of times. If it still fails, the cache service itself may be down; in that case, log the keys and delete them when the cache service recovers.

Method 3: delete the cache, update the database, then delete the cache again. It takes more operations, but it is safer.
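Method 3 is often implemented as a "delayed double delete". Below is a hedged sketch, assuming a redisTemplate bean is available; updateDb and the 500 ms delay are illustrative values, not taken from the original text.

// Sketch of Method 3: delete cache -> update DB -> delete cache again after a short delay
private final ScheduledExecutorService delayPool = Executors.newSingleThreadScheduledExecutor();

public void updateWithDoubleDelete(String key, String newValue) {
    redisTemplate.delete(key);          // first delete, so readers fall through to the DB
    updateDb(key, newValue);            // hypothetical DB update
    // second delete after a short delay, to clear any stale value written back
    // by a concurrent read that loaded the old data
    delayPool.schedule(() -> redisTemplate.delete(key), 500, TimeUnit.MILLISECONDS);
}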

Method 4:

Binlog is the MySQL log that records data updates and potential updates (for example, a DELETE statement that matches no rows). Binlog is also used in MySQL primary/secondary replication.

In the flow above (DB -> async update -> Redis cache), keys can be evicted asynchronously through the database binlog. The binlog can be collected by third-party tools such as Alibaba’s Canal and sent to MQ, with delivery confirmed through the ACK mechanism. This ensures data consistency from the DB to the Redis cache, but using binlog between the DB and the Redis cache increases system complexity.
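A hedged sketch of the consumer side of that pipeline; the onBinlogEvent entry point and the message fields are hypothetical and not tied to a specific MQ or Canal client, and the message is only acknowledged after the cache delete succeeds.

// Hypothetical handler invoked by the MQ client for each binlog event collected by Canal.
// Returning true acknowledges (ACKs) the message; returning false leaves it for redelivery.
public boolean onBinlogEvent(String table, String primaryKey) {
    String cacheKey = table + ":" + primaryKey; // illustrative key convention
    try {
        redisTemplate.delete(cacheKey); // evict the stale entry from the Redis cache
        return true;                    // ACK only after the delete succeeds
    } catch (Exception e) {
        // keep the message un-ACKed so MQ redelivers it and the delete is retried
        return false;
    }
}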

Dirty reads are still possible (the cache delete may fail and leave dirty data for an incoming read request, or a read request may fetch old data from the database and write it back to the cache after a concurrent write has finished).

There are many more cache synchronization patterns, such as Read Through, Write Through, and Write Behind Caching. How do you choose? Business, business, business: what matters, said three times. These patterns are widely used, but because most business code is unaware of them, many people never organize them; they mostly live in middleware or are implemented in lower-level databases, so business code may never touch them. For example, with Write Behind Caching, data is first written to the cache, and an asynchronous thread slowly flushes the cached data to the DB. The benefit is that data I/O is fast, and because it is asynchronous, write-back can also merge multiple operations on the same data, so the performance gain is significant. Before using this pattern you need to evaluate whether your data can tolerate loss and whether your cache capacity can withstand peak business. But it has nothing to do with our current business needs.

pub/sub

The interaction between the level-2 Redis cache and the DB was basically handled in the previous section. As a distributed application, MLC can form a cluster, so data updates between the level-1 Caffeine cache and the level-2 Redis cache also need to be handled carefully to ensure data consistency.

When data is deleted or updated through the MLC in one service instance, how do the level-1 caches of the other instances find out?

There are generally two ways to keep data consistent across machines. One is push mode, which has good real-time behavior but may lose messages. The other is pull mode, which spreads the load and does not lose data, but whose real-time behavior is worse.

MLC mainly combines push/pull modes to ensure data consistency of multiple machines.

Pull is mainly based on the message offset, while push is mainly based on the Redis pub/sub mechanism. In the implementation, push uses a Redis-based broadcast, and pull is driven by a scheduled task.

public class RedisListMessageListener implements RedisPubSubListener<String, String> {

    @Override
    public void message(String channel, String message) {
        // update the last-processed time

        // pull and process the pending messages
    }

    // other RedisPubSubListener methods omitted
}

The offset indicates the position up to which the service has processed cached messages; it is updated after each message is processed.

  • Messages are allowed to be consumed repeatedly
    • Hot-key delete actions: even if a level-1 cache entry is deleted again, it will be rebuilt from the level-2 cache.
    • Hot-key get actions: the data is fetched again from the DB after being deleted.

However, the message list data can be trimmed according to requirements.

The pub/sub reconnection problem can be handled with a scheduled task that records the time A when a push message was last processed and the time B when a pull last ran. If the difference is greater than 8s, the connection is considered broken and a reconnection attempt is made; at the same time the latest Redis offset is fetched locally, the pending data is processed, and the local offset and update time are refreshed.
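A hedged sketch of that health check, assuming Spring's @Scheduled is available; the fields and the resubscribe/fetchRemoteOffset/pullFromOffset helpers are hypothetical, while the 8-second threshold comes from the text above.

// Hedged sketch: hypothetical fields and helpers, not the project's actual code
private final AtomicLong lastPushTime = new AtomicLong(System.currentTimeMillis()); // time A
private final AtomicLong lastPullTime = new AtomicLong(System.currentTimeMillis()); // time B

@Scheduled(fixedDelay = 2000) // illustrative check interval
public void checkPubSubAlive() {
    if (lastPullTime.get() - lastPushTime.get() > 8000) {
        // pulls keep running but no push has arrived for more than 8s:
        // treat the subscription as broken and reconnect
        resubscribe();                              // hypothetical: re-establish the pub/sub subscription
        long latestOffset = fetchRemoteOffset();    // hypothetical: read the latest offset from Redis
        pullFromOffset(latestOffset);               // hypothetical: process messages after the offset
        lastPushTime.set(System.currentTimeMillis());
        lastPullTime.set(System.currentTimeMillis());
    }
}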

  • Why use Redis pub/sub for update synchronization of the level-1 cache
    • Using a cache inherently tolerates dirty reads, so some latency is acceptable.
    • Redis itself is highly available, and the delete action is not a very frequent one, so using Redis’s native publish/subscribe has no performance problems.

Zookeeper could also be used as the synchronization scheme: MLC could watch the znode corresponding to the specified hotspot cache and be notified immediately when it changes. In practice, pub/sub is better suited to synchronizing the MLC local cache in our company’s project: the project is not very large, and Zookeeper already carries heavy workloads, so Redis pub/sub was adopted.

Cache design details

Lesson tip: cache breakdown. When a frequently queried key expires, or a previously unpopular key suddenly receives a large number of access requests, those concurrent requests penetrate the cache and go straight to the database, instantly increasing the access pressure on the database.

  • Cache breakdown prevention (see the sketch after this list)
    • When multiple requests try to read a value that does not exist in the cache, acquire a lock first (a Redis distributed lock whose key is a global prefix + the key).
    • Requests that fail to obtain the lock are blocked
    • The request holding the lock fetches the value from the database and writes it back to the cache
    • After a blocked thread is woken up, it checks whether the key now exists in the cache; if so, it returns directly
    • If not, it repeats the preceding steps
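A hedged sketch of this lock-then-load flow, assuming a Spring StringRedisTemplate; the lock key prefix, the timeouts, and the loadFromDb helper are illustrative assumptions, not the project's actual code.

public String getWithBreakdownProtection(String key) {
    String value = redisTemplate.opsForValue().get(key);
    if (value != null) {
        return value; // cache hit
    }
    String lockKey = "mlc:lock:" + key; // global prefix + key, as described above
    // try to become the single loader; the lock expires on its own to avoid deadlock
    Boolean locked = redisTemplate.opsForValue()
            .setIfAbsent(lockKey, "1", 10, TimeUnit.SECONDS);
    if (Boolean.TRUE.equals(locked)) {
        try {
            value = loadFromDb(key);                 // hypothetical DB loader
            if (value != null) {
                redisTemplate.opsForValue().set(key, value, 30, TimeUnit.MINUTES);
            }
            return value;
        } finally {
            redisTemplate.delete(lockKey);           // release the distributed lock
        }
    }
    // did not get the lock: back off briefly, then re-check the cache (repeats the steps above)
    try {
        Thread.sleep(50);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    return getWithBreakdownProtection(key);
}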

However, if the DB has no value for the key, the operations above add load to the system and hurt the performance of the upper layers. Therefore the design needs null-value objects to prevent this useless work. The drawback is extra memory consumption (whether to use null-value objects must be weighed by the business, and the implementation should also provide a switch for them).

public final class NullObject implements Serializable {
    private static final long serialVersionUID = 1454545343454684645L;

    public static final NullObject INSTANCE = new NullObject();

    private NullObject() {
        // placeholder stored in the cache for keys that do not exist in the DB
    }
}

  • Prevention of cache penetration

Lesson tip: cache penetration. When a user queries data that exists in neither the database nor the cache, the lookup misses the cache and goes to the database; since no data comes back, nothing is cached, so every such query keeps hitting the database, which puts a lot of pressure on database access.

The data fetched from the DB needs to be filtered so that this avoidable database pressure is eliminated; a small sketch of caching the null-value object follows.
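A hedged sketch of how the NullObject defined earlier might be used on the read path, assuming a RedisTemplate whose serializer can store the placeholder; the loadFromDb helper and the short TTL for null entries are illustrative assumptions.

public Object getWithNullProtection(String key) {
    Object cached = redisTemplate.opsForValue().get(key);
    if (cached != null) {
        // a cached NullObject means "the DB has no such record": return null without querying the DB
        return cached instanceof NullObject ? null : cached;
    }
    Object fromDb = loadFromDb(key); // hypothetical DB loader
    if (fromDb == null) {
        // cache the placeholder with a short TTL so repeated misses stop hitting the DB
        redisTemplate.opsForValue().set(key, NullObject.INSTANCE, 5, TimeUnit.MINUTES);
        return null;
    }
    redisTemplate.opsForValue().set(key, fromDb, 30, TimeUnit.MINUTES);
    return fromDb;
}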

MLC local cache

At the current stage the project is implemented with Spring Cache, annotations, and Redis. To bring MLC in safely, the idea of Spring Cache is borrowed: AOP plus annotations decouple caching from the business logic. Where query results need to be cached, annotations mark the call; business code is written with RedisTemplate, based on the spring-data-redis package and Lettuce, and the actual requests to the storage-layer Redis go through the Netty-based StatefulRedisConnection. The annotated calls are then proxied to add the “hotspot discovery” + “local cache” functions, so these functions are accessed transparently.
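A hedged sketch of what that annotation-plus-aspect wiring might look like; the @MlcCacheable annotation, the aspect class, and the mlcClient call are all hypothetical illustrations of the approach, not the project's actual SDK.

import java.lang.annotation.*;
import java.util.Arrays;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MlcCacheable {
    String keyPrefix();               // illustrative: prefix used to build the cache key
}

@Aspect
@Component
public class MlcCacheAspect {

    @Autowired
    private MlcClient mlcClient;      // hypothetical MLC SDK entry point

    @Around("@annotation(mlcCacheable)")
    public Object around(ProceedingJoinPoint pjp, MlcCacheable mlcCacheable) throws Throwable {
        String key = mlcCacheable.keyPrefix() + Arrays.toString(pjp.getArgs());
        // try the hot-key local cache / Redis first, fall back to the original method on a miss
        return mlcClient.get(key, () -> {
            try {
                return pjp.proceed();
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        });
    }
}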

The overall structure

  • Module partition

    • Lettuce client: the direct interface between the Java application and the cache server; the interface definition is the same as the original Lettuce client + Spring Boot JPA;
    • MLC SDK: the SDK package that implements the “hotspot discovery + local cache” functions
    • Cache cluster: consists of the proxy layer and the storage layer, providing a unified distributed cache entry point for application clients.
  • The basic flow (see the sketch after this list)

    • Key value get
      • When a Java application calls GET to read the cached value of a key, it first asks the MLC SDK whether the key is currently a hot key.
      • For a hot key, the value is read directly from the local cache without touching the cache cluster, so the access request is absorbed at the application layer.
      • For a non-hot key, the value is retrieved from the cache cluster through a Callable callback.
      • Every key access is recorded (asynchronously) by the communication module so that hotspots can be detected.
    • Key value delete
      • When a Java application calls evict to delete the cached value of a key, the communication module publishes a push message to Redis
      • The MLC subscribers perform the delete action
    • Hotspot discovery
      • The hotspot discovery system (described later) receives key access records from MLC and counts key accesses within the configured interval
      • It pushes to Redis to tell the MLC instances in the cluster which keys should be cached locally and which keys should be deleted
      • The MLC subscribers fetch the data from the cache cluster through the Lettuce client
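A hedged sketch of the read path described above; the MlcClient class, the hot-key registry, and the Caffeine/Lettuce wiring illustrate the flow rather than reproducing the actual SDK code.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.lettuce.core.api.sync.RedisCommands;
import java.util.Set;
import java.util.concurrent.*;

public class MlcClient {

    private final Cache<String, Object> localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .build();
    private final Set<String> hotKeys = ConcurrentHashMap.newKeySet(); // filled by hotspot discovery
    private final RedisCommands<String, Object> redis;                 // Lettuce sync commands
    private final Executor statsExecutor = Executors.newSingleThreadExecutor();

    public MlcClient(RedisCommands<String, Object> redis) {
        this.redis = redis;
    }

    public Object get(String key, Callable<Object> dbLoader) throws Exception {
        // record the access asynchronously for hotspot detection
        statsExecutor.execute(() -> recordAccess(key));
        if (hotKeys.contains(key)) {
            // hot key: serve from the local Caffeine cache, rebuilding from Redis on a miss
            return localCache.get(key, k -> redis.get(k));
        }
        // non-hot key: go to the cache cluster, falling back to the DB loader on a miss
        Object value = redis.get(key);
        return value != null ? value : dbLoader.call();
    }

    private void recordAccess(String key) {
        // hypothetical: send the access record to the hotspot discovery system
    }
}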

Hotspot discovery

Considering the company’s hardware, a typical single-instance deployment machine may be a 4-core 8 GB machine, which leaves little room for a local cache. Therefore the local Caffeine cache basically only stores hot data.

Why does hot data exist at all? Generally, when there is a cache hotspot the concurrency must be very high, perhaps hundreds of thousands or even millions of requests per second. This is entirely possible, for example the data of an item on promotion, or a piece of trending news.

When designing hotspot discovery, I talked with the data engineers and learned that big-data streaming computation, such as Storm or Flink, could be plugged into the MLC external interface to produce real-time access statistics (I am not familiar with either of them).

Then, based on the real-time statistics of access counts and its own rules (for example, if the access count of some data suddenly exceeds 500 within one second), the system immediately marks that data as hot data, writes it into Redis, and publishes it to the other MLC instances via pub/sub. This would reduce MLC’s own burden, but considering our project it adds complexity to the system. A streaming system such as Storm can spread the same piece of data across many machines for local counting and then aggregate the local results on one machine for a global summary, guaranteeing availability and clustering for performance. However, the traffic of access-request records flowing into the Storm system also consumes I/O and cannot be ignored.
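For the simpler in-process route, here is a hedged sketch of the per-second counting rule mentioned above (more than 500 accesses within one second marks a key as hot); the class, the scheduling, and the publishHotKey step are illustrative assumptions.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import org.springframework.scheduling.annotation.Scheduled;

public class HotKeyDetector {

    private static final int HOT_THRESHOLD = 500;   // accesses per second, from the rule above
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Called for every key access (asynchronously from the read path). */
    public void recordAccess(String key) {
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    /** Runs once per second: publish keys over the threshold as hot, then reset the window. */
    @Scheduled(fixedRate = 1000)
    public void sweep() {
        counters.forEach((key, counter) -> {
            if (counter.sumThenReset() > HOT_THRESHOLD) {
                publishHotKey(key);   // hypothetical: write to Redis and broadcast via pub/sub
            }
        });
    }

    private void publishHotKey(String key) {
        // hypothetical publish step
    }
}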

Youzan’s tech blog describes their transparent multi-level cache solution (TMC), which covers the implementation of hotspot discovery in detail; we can follow that approach to implement our own.

Rate limiting and circuit-breaker protection

MLC should also add dedicated circuit-breaker protection for hot-data access. Once the threshold is exceeded, the circuit breaker can trip: requests to the cache cluster are rejected and a blank message is returned, and the user simply refreshes the page.

Using LongAdder for access statistics, the application layer itself can add circuit-breaker protection so that the cache cluster and database cluster are not overwhelmed (cache breakdown prevention was introduced earlier; the double protection makes the system more robust). All keys that need circuit breaking are processed and counted by a scheduled task.

public class LimitStat {

    private String key;
    private LongAdder longAdder;
    private long time;
    // ...

    public LimitStat(String key) {
        // circuit-breaker / rate-limit rule
    }

    public LimitStat recreate() {
        // rebuild the statistics window
    }

    public LimitStat decrement() {
        // decrease the remaining quota and judge whether to trip the breaker
    }
    // ...
}

A simple idea for the hit ratio is to keep cache hits and misses temporarily in memory with LongAdder, then synchronize them to Redis through a scheduled task and reset the LongAdder. In a clustered environment, a distributed lock is needed when synchronizing data to Redis to ensure accuracy. Caffeine itself also provides hit statistics:

@Bean
public Cache myCache() {
    return new CaffeineCache("local1", Caffeine.newBuilder()
            .maximumSize(35000)
            .expireAfterWrite(1000, TimeUnit.MINUTES)
            .recordStats() // record hit and miss information
            .build(), true);
}
Cache<String, Object> cache = Caffeine.newBuilder()
        // customize the statistics collector
        .recordStats(() -> new StatsCounter() {
            // override the StatsCounter methods here
        })
        .build();
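Reading the collected statistics back is straightforward; a minimal example using Caffeine's CacheStats, assuming the cache variable was built with recordStats() as above:

CacheStats stats = cache.stats();     // snapshot of the statistics recorded so far
double hitRate = stats.hitRate();     // hits / (hits + misses)
long missCount = stats.missCount();   // number of lookups that missed the cache
System.out.println("hitRate=" + hitRate + ", missCount=" + missCount);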

At the end

Hotspot detection and its concrete implementation are kept separate, so as better and more efficient detection algorithms appear they can be plugged in without the business noticing. In future evolution, the hotspot threshold and the number of hotspot key probes can be made flexibly configurable to achieve better results. The article will continue to be updated as progress is made.

Of course, the company is growing step by step, and I hope the solution can be applied more quickly.

Finally, happy May Day holiday!!