Performance can be tuned in a number of ways, as discussed in the previous article. When indexing alone still does not deliver acceptable performance, consider caching. Caching addresses the most fundamental problem in performance optimization: bridging the gap between fast CPUs and slow I/O reads and writes.
Caching and multi-level caching
Introduction of caching
When business volume is small at the start, the database can handle the read and write load on its own; the application talks to the DB directly and the architecture stays simple and robust. As traffic grows, DB query pressure and latency increase, so a distributed cache is introduced: it relieves the DB while providing much higher QPS. With further growth the distributed cache itself becomes a bottleneck, because high-frequency QPS is a heavy burden for it, and cache evictions or network jitter can hurt system stability. At that point a local cache is introduced to take load off the distributed cache and to remove the network and serialization overhead.
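A minimal sketch of this multi-level read path, assuming hypothetical local and distributed cache clients (the names localCache, DistributedCache, and loadFromDb are placeholders, not a specific library's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Toy two-level read path: local cache -> distributed cache -> DB.
// "DistributedCache" and "loadFromDb" stand in for real clients (e.g. Redis, MySQL).
public class TwoLevelCache {
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final DistributedCache distributedCache;
    private final Function<String, String> loadFromDb;

    public TwoLevelCache(DistributedCache distributedCache, Function<String, String> loadFromDb) {
        this.distributedCache = distributedCache;
        this.loadFromDb = loadFromDb;
    }

    public String get(String key) {
        // 1. Local cache: no network hop, no serialization.
        String value = localCache.get(key);
        if (value != null) return value;

        // 2. Distributed cache: shields the DB from most traffic.
        value = distributedCache.get(key);
        if (value == null) {
            // 3. Fall back to the DB, then backfill the distributed cache.
            value = loadFromDb.apply(key);
            distributedCache.set(key, value);
        }
        localCache.put(key, value);
        return value;
    }

    public interface DistributedCache {
        String get(String key);
        void set(String key, String value);
    }
}
```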
Improved read and write performance
Caching improves read and write performance by reducing I/O operations. Typical latency figures show that disk and network I/O take orders of magnitude longer than memory access.
- Read optimization: a request that hits the cache returns immediately, skipping the I/O read and reducing read cost.
- Write optimization: write operations are merged in a buffer so the I/O device can process them in batches, reducing write cost (see the sketch below).
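A sketch of the write-merging idea under simplified assumptions (not a production write-behind implementation); flushToDevice is a placeholder for a real batched write:

```java
import java.util.HashMap;
import java.util.Map;

// Toy write buffer: merge repeated writes to the same key in memory,
// then flush them to the slow device in a single batch.
public class WriteBuffer {
    private final Map<String, String> pending = new HashMap<>();
    private final int flushThreshold;

    public WriteBuffer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public synchronized void write(String key, String value) {
        pending.put(key, value);        // later writes to the same key overwrite earlier ones
        if (pending.size() >= flushThreshold) {
            flush();
        }
    }

    public synchronized void flush() {
        Map<String, String> batch = new HashMap<>(pending);
        pending.clear();
        flushToDevice(batch);           // one batched I/O instead of many small writes
    }

    private void flushToDevice(Map<String, String> batch) {
        // Placeholder for a real batched write (e.g. a multi-row INSERT or a file append).
    }
}
```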
Cache misses
Cache misses are unavoidable. With limited capacity, the cache must keep hot data resident to strike a balance between performance and cost. Caches typically use an LRU algorithm to evict the least recently used keys.
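A minimal LRU cache in Java, built on LinkedHashMap with access order enabled; production caches such as Redis use an approximated LRU, but the eviction idea is the same:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: when capacity is exceeded, the least recently
// accessed entry is evicted.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves the entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the head, i.e. the least recently used key
    }
}
```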
How to avoid mass expiration in a short window
In some scenarios an application loads data into the cache in batches, for example when data is uploaded through Excel, parsed by the system, and then written in bulk to the DB and the cache. If the expiration time is not designed carefully, the whole batch usually gets the same TTL. When it expires, all of its traffic shifts to the DB at once, degrading the performance and stability of the interface or even the whole system. Scatter the cache expiration times with a random offset, for example TTL = N hours + random(8000) ms.
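A small sketch of the TTL-scattering idea, using the base TTL and jitter window from the example above; cacheClient.setex in the usage comment is a placeholder for whatever cache client is in use:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class JitteredTtl {
    // Spread expirations: base TTL of N hours plus up to 8000 ms of random jitter,
    // so a batch written at the same moment does not expire at the same moment.
    public static Duration ttlWithJitter(int baseHours) {
        long jitterMillis = ThreadLocalRandom.current().nextLong(8000);
        return Duration.ofHours(baseHours).plusMillis(jitterMillis);
    }

    // Usage (cacheClient.setex is a placeholder for a real cache client call):
    // cacheClient.setex(key, ttlWithJitter(2).toSeconds(), value);
}
```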
Cache consistency
The system should keep DB and cache data as consistent as possible, and the cache-aside pattern is the most common way to do this. Avoid the unconventional patterns of updating the cache first and then the DB, or updating the DB first and then updating (rather than deleting) the cache: the risk of inconsistency in those patterns is high.
Cache design patterns
Business systems generally use the cache-aside pattern; operating systems, databases, and distributed caches also use write-through and write-back.
Cache aside
The cache-aside pattern works well most of the time, but inconsistencies can still occur in extreme scenarios, mainly in two ways:
- Cache invalidation (deletion) fails because of middleware or network problems, leaving stale data in the cache.
- The cache entry expires unexpectedly and an unlucky interleaving of reads and writes repopulates it with stale data.
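For reference, a minimal cache-aside sketch, assuming placeholder cache and DB clients: reads go through the cache and backfill it on a miss; writes update the DB first and then delete the cache entry rather than updating it:

```java
// Minimal cache-aside sketch. "KvCache" and "UserDb" are placeholders for real clients.
public class CacheAsideRepository {
    private final KvCache cache;
    private final UserDb db;

    public CacheAsideRepository(KvCache cache, UserDb db) {
        this.cache = cache;
        this.db = db;
    }

    // Read path: cache first, backfill on miss.
    public String getUser(String id) {
        String cached = cache.get(id);
        if (cached != null) return cached;
        String fromDb = db.queryUser(id);
        if (fromDb != null) cache.set(id, fromDb);
        return fromDb;
    }

    // Write path: update the DB first, then delete the cache entry.
    // Deleting (instead of updating) the cache avoids most ordering races.
    public void updateUser(String id, String newValue) {
        db.updateUser(id, newValue);
        cache.delete(id);
    }

    public interface KvCache {
        String get(String key);
        void set(String key, String value);
        void delete(String key);
    }

    public interface UserDb {
        String queryUser(String id);
        void updateUser(String id, String value);
    }
}
```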
From heap memory to direct memory
Introduction of direct memory
Java local (in-process) caches come in two flavors: heap-based and direct-memory-based (off-heap).
The main problem with heap-based caching is GC: cached objects tend to live long and are only reclaimed by major GCs. When the cache is large, those collections can take a long time.
The main problem with direct-memory caching is memory management: the program has to control allocation and reclamation itself, with the attendant risk of OOM or memory leaks. In addition, objects in direct memory cannot be accessed as objects; they must be serialized and deserialized on every operation.
Direct memory reduces GC pressure because the heap only holds references to the off-heap memory where the objects themselves are stored. References promoted to the old generation take up very little space, so the extra burden on GC is negligible.
Reclaiming direct memory depends on System.gc(), a call the JVM is not guaranteed to execute, or to execute at any particular time, so its behavior is uncontrollable. Programs therefore generally manage direct memory themselves, calling malloc and free in matched pairs. This manual, C-like memory management gives more control and flexibility over reclamation.
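A small illustration of off-heap storage with java.nio.ByteBuffer.allocateDirect: the heap holds only the buffer reference, while the cached bytes live in direct memory and must be serialized and deserialized on each access.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DirectMemoryValue {
    // Only this small reference lives on the heap; the bytes live off-heap.
    private final ByteBuffer buffer;

    public DirectMemoryValue(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8); // "serialization" step
        buffer = ByteBuffer.allocateDirect(bytes.length);
        buffer.put(bytes);
        buffer.flip();
    }

    public String read() {
        // Every read copies the bytes back onto the heap and deserializes them.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```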
Direct memory management
Direct memory allocation and reclamation are expensive because physical memory has to be requested through the kernel. The usual approach is to allocate a large chunk up front and then hand smaller blocks to threads on demand; freed blocks are not released back to the OS but returned to a memory pool for reuse. Finding a free block quickly, limiting fragmentation, and reclaiming quickly are systemic problems with many dedicated algorithms. jemalloc is one with good all-round performance: FreeBSD and Redis use it by default, and the OHC cache also recommends configuring it on the server.
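A toy sketch of the pooling idea under those assumptions (nothing like jemalloc's real arenas and size classes): reserve one large direct buffer up front, hand out fixed-size blocks, and return freed blocks to a free list instead of releasing them.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy fixed-size block pool over one large direct buffer.
// Real allocators (e.g. jemalloc) use size classes, arenas and much smarter bookkeeping.
public class DirectBlockPool {
    private final Deque<ByteBuffer> freeBlocks = new ArrayDeque<>();

    public DirectBlockPool(int blockSize, int blockCount) {
        ByteBuffer arena = ByteBuffer.allocateDirect(blockSize * blockCount); // one expensive allocation
        for (int i = 0; i < blockCount; i++) {
            arena.position(i * blockSize).limit((i + 1) * blockSize);
            freeBlocks.push(arena.slice());   // carve the arena into fixed-size blocks
        }
    }

    public synchronized ByteBuffer acquire() {
        if (freeBlocks.isEmpty()) {
            throw new IllegalStateException("pool exhausted"); // a real cache would evict or grow
        }
        return freeBlocks.pop();
    }

    public synchronized void release(ByteBuffer block) {
        block.clear();            // reuse the block instead of freeing the underlying memory
        freeBlocks.push(block);
    }
}
```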
CPU cache
Beyond distributed and local caches, performance can be pushed further at the level of the CPU cache. The effect is subtle, but it matters under high concurrency. CPU caches come in three levels, L1, L2, and L3; the closer a cache sits to the CPU, the smaller its capacity and the faster its access. If data cannot be found in the L3 cache, it is fetched from main memory.
CPU cache line
The CPU cache is made up of cache lines. Each cache line is 64 bytes and can hold eight longs. When the CPU fetches data from main memory, it loads it one cache line at a time, so adjacent data is brought into the cache together. It is easy to see why sequentially traversing an array and computing over adjacent data is so efficient.
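A small Java illustration of cache-line friendliness: row-major traversal of a 2D array touches adjacent memory that arrives 64 bytes at a time, while column-major traversal of the same array jumps between rows. The timing is illustrative only; Java micro-benchmarks are sensitive to JIT and heap layout.

```java
// Illustrative only: sums the same matrix twice, once cache-friendly
// (row-major, adjacent elements share cache lines) and once not (column-major).
public class TraversalOrder {
    public static void main(String[] args) {
        int n = 2048;
        long[][] matrix = new long[n][n];

        long t0 = System.nanoTime();
        long sumRowMajor = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sumRowMajor += matrix[i][j];      // walks each row array sequentially

        long t1 = System.nanoTime();
        long sumColMajor = 0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                sumColMajor += matrix[i][j];      // touches a different row array on every step

        long t2 = System.nanoTime();
        System.out.printf("row-major: %d ms, column-major: %d ms (sums: %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumRowMajor, sumColMajor);
    }
}
```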
False sharing
CPU caches also have consistency issues, which are handled by the MESI protocol and its MESIF extension. False sharing arises under high concurrency: data belonging to different threads happens to sit in the same cache line, so writes by one thread invalidate the line for the others, and the threads interfere with each other and degrade processing performance.
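A minimal sketch of false sharing: two threads update two unrelated counters that very likely share a 64-byte cache line. Field layout is JVM-dependent, so this is a demonstration rather than a benchmark; padding the fields apart, or using the JDK's @Contended annotation with -XX:-RestrictContended, typically makes the same loops noticeably faster.

```java
// Sketch of false sharing: two threads write two volatile longs that sit
// next to each other and, very likely, in the same 64-byte cache line.
public class FalseSharingDemo {
    static volatile long a;                   // updated by thread 1
    static volatile long b;                   // updated by thread 2; likely shares a's cache line

    static final long ITERATIONS = 100_000_000L;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) a++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) b++; });

        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join();  t2.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // With 'a' and 'b' forced onto separate cache lines (padding or @Contended),
        // the same loops usually finish faster because the line stops ping-ponging between cores.
        System.out.println("elapsed: " + elapsedMs + " ms, a=" + a + ", b=" + b);
    }
}
```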