Background

A DSP system is an Internet advertising demand-side platform used to take on media traffic and serve advertisements. The service is characterized by high concurrency and low average response time.

To effectively improve the performance of the DSP system, the Meituan platform introduced LruCache (Least Recently Used Cache), a cache structure with an eviction mechanism. The current DSP system uses an LruCache + key-value database mechanism to cache remote data locally, which not only reduces the average data-retrieval time but also, through the eviction mechanism, keeps the service's memory usage within a safe range.

Based on real application scenarios, this article explains the reasons for introducing LruCache and gives a detailed, in-depth account of the challenges met under high QPS and how they were solved, in the hope of being useful to readers interested in DSP systems.

Introduction to LruCache

LruCache uses the LRU (Least Recently Used) cache algorithm. Its core idea is that once the cached data reaches a preset upper limit, the least recently used cache entries are evicted first.

Internally, LruCache maintains a doubly linked list and a mapping table. The linked list stores the cached data in order of use: the longer ago a piece of data was used, the closer it sits to the tail of the list; the more recently it was used, the closer it sits to the head. The mapping table provides efficient lookups through its key-value structure: a key tells you whether the data is cached, and if it is, the mapping table yields the linked-list node that holds it, from which the cached data can be read directly. The LruCache structure diagram is as follows: the upper part is the doubly linked list and the lower part is the mapping table (which is not necessarily ordered). In the doubly linked list, value_1 is at the head and value_N is at the tail.
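As a minimal C++ sketch of this layout (an illustration with string values, not the article's production code; the operations on it are sketched after the read/write description below):

```cpp
#include <list>
#include <string>
#include <unordered_map>

// Layout sketch of the structure just described. The list keeps
// recency order (head = most recently used, tail = least recently
// used); the map jumps from a key straight to that key's list node.
struct Entry {
    std::string key;
    std::string value;  // the real system caches complex ad data
};

std::list<Entry> items;                               // doubly linked list
std::unordered_map<std::string,
                   std::list<Entry>::iterator> index; // key -> list node
```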

A read operation first checks the mapping table by key to see whether the data is cached. If it is, the node holding the data is unlinked from its current position in the list and moved to the head. If it is not, the lookup fails and the data waits to be written by a later write. The following figure shows how the LruCache structure changes after key_2 is looked up.

For a write operation while LruCache is below its preset upper limit, the data is simply added to the head of the linked list, and at the same time the key and the doubly-linked-list node holding the data are inserted into the mapping table as a key-value pair. The following figure shows the data structure after M is written, when the preset upper limit of LruCache is greater than N.

When LruCache has reached its preset upper limit, the entry at the tail of the linked list is first removed from the mapping table, then deleted from the list itself, and the new data is written into the cache as usual. The following figure shows the data structure after M is written with LruCache already at its preset upper limit.

The thread-safe LruCache protects the critical section with a lock around every read and write operation, ensuring the cache can be used safely from multiple threads.
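Putting the read, write, and eviction rules together, a minimal thread-safe sketch might look as follows (illustrative only: string values, a capacity assumed to be greater than zero, and a single std::mutex standing in for the critical-section protection just described):

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative thread-safe LruCache combining the read, write, and
// eviction rules above; a sketch, not the production implementation.
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Read: on a hit, splice the node to the head (most recently used).
    std::optional<std::string> Get(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;        // miss
        items_.splice(items_.begin(), items_, it->second);  // move to head
        return it->second->value;
    }

    // Write: insert at the head; at the preset upper limit, first evict
    // the tail entry from the map, then from the list.
    void Put(const std::string& key, std::string value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it != index_.end()) {           // key already cached: update
            it->second->value = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() >= capacity_) {   // full: evict the LRU entry
            index_.erase(items_.back().key);
            items_.pop_back();
        }
        items_.push_front({key, std::move(value)});
        index_[key] = items_.begin();
    }

private:
    struct Entry { std::string key; std::string value; };
    std::size_t capacity_;  // preset upper limit, assumed > 0
    std::list<Entry> items_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    std::mutex mutex_;
};
```

On a `Get` miss, the DSP service described below would fetch the data from Redis and write it back with `Put`.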

Application scenarios of LruCache in the Meituan DSP system

Key-value storage databases are widely used in the Meituan DSP system. For example, Redis stores advertising information, and the service retrieves ad information by ad ID. If every request fetched ad information from the remote key-value database, requests would take far too long. As the business grew, QPS rose sharply. In such a high-concurrency scenario, a common solution is to move ad information from the remote key-value database into a local cache to cut query time. At the same time, the service's own memory footprint must stay within a safe range. Facing an ever-growing volume of ad information, the LruCache + key-value database mechanism was introduced to improve system performance while keeping memory usage safe and stable.

Evolution of the LruCache + Redis mechanism in practice

In practice, the LruCache + Redis mechanism went through four stages: introducing LruCache, adding a time-based eviction mechanism to LruCache, introducing HashLruCache for high-QPS scenarios, and introducing a zero-copy mechanism. The test machine for every stage is a 16-core, 16 GB machine.

Evolution 1: Introducing LruCache to improve the performance of the Meituan DSP system

In a low-QPS environment, requesting ad information directly from Redis meets the needs of the scenario. However, as per-machine QPS rises, fetching ad information directly from Redis takes longer and longer and can no longer meet the needs of the business.

LruCache was introduced to localize the information stored remotely in Redis. The cache's upper limit can be set based on the server's memory and the service's own memory usage, which keeps the service's memory footprint within a safe range after LruCache is added. At the same time, the hit ratio of the cache in actual use can be computed from the query operations.

The figure below compares the average data-acquisition time (ms) under continuously increasing QPS, before and after adding the LruCache structure, with the LruCache hit ratio above 95%:

From the average-time chart, the following conclusions can be drawn:

  1. When QPS is higher than 250, directly requesting Redis takes more than 10 ms on average, which completely fails to meet usage requirements.
  2. Adding LruCache reduces the time by an order of magnitude: in terms of average time, when QPS is below 500 the latency stays under 2 ms.

The figure below compares the Top999 (99.9th-percentile) data-acquisition time (ms) under continuously increasing QPS, before and after adding the LruCache structure, with the LruCache hit ratio above 95%:

From the Top999 chart, the following conclusions can be drawn:

  1. After the LruCache structure is added, the Top999 time is an order of magnitude higher than the average time.
  2. Even at low QPS, the Top999 time with the LruCache structure remains relatively high.

In practice, introducing the LruCache structure effectively reduces data-acquisition time within a certain QPS range. Once QPS exceeds that range, however, both the average time and the Top999 time become very high, so LruCache performance needs further optimization at higher QPS.

Evolution 2: Adding a time-based eviction mechanism to LruCache

In this business scenario, the ad data in Redis may be modified. The service itself, as a consumer of the data, is unaware of changes at the data source. When the cache hit ratio is high, or some entries keep being hit over a long period, stale data can result: the data source has changed, but the service fails to pick up the update in time. To handle this scenario, a time-based eviction mechanism was added.

The mechanism has three parts: an expiration time for cached data, a timestamp on each cache entry, and a validity check at query time. Each cache entry stores, alongside the data, the timestamp at which the data entered the LruCache. The expiration time is the upper limit on how long an entry may stay cached. The validity check at query time means that when the difference between the query timestamp and the cached timestamp exceeds the expiration time, the entry is forcibly evicted and the data is fetched from Redis again and re-cached.

Performing the validity check at query time minimizes its interference with the service. When the preset upper limit of LruCache is low, periodically sweeping the full data set does not noticeably affect the service; but when the upper limit is very high, a full sweep could take seconds or even minutes, seriously stalling the service. Therefore the validity check is placed in the query path and applied to one cache entry at a time, a compromise between service performance and data freshness that satisfies the business requirements.
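A hedged sketch of this lazy, per-entry check, reusing the LruCache sketch above; the `cached_at` field on each entry and the `ttl_` member holding the expiration time are illustrative additions:

```cpp
#include <chrono>

// Get with the per-entry validity check described above. Assumes
// Entry gained `std::chrono::steady_clock::time_point cached_at;`
// (set in Put) and the class gained a `ttl_` duration member.
std::optional<std::string> Get(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;       // plain miss
    auto age = std::chrono::steady_clock::now() - it->second->cached_at;
    if (age > ttl_) {                                  // expired: forcibly
        items_.erase(it->second);                      // evict; the caller
        index_.erase(it);                              // re-fetches from
        return std::nullopt;                           // Redis and re-caches
    }
    items_.splice(items_.begin(), items_, it->second); // normal LRU touch
    return it->second->value;
}
```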

Evolution 3: Applying HashLruCache under high QPS

After LruCache was introduced into the Meituan DSP system, it supported business growth well for a period of time. As the business iterated, per-machine QPS kept rising. At higher QPS the query time of LruCache grew significantly, and LruCache could no longer meet the low-latency requirements of the business scenario. The HashLruCache mechanism was introduced to solve this problem.

Why LruCache latency rises under high QPS:

The thread-safe LruCache contains a lock: every read or write operation acquires the lock first and releases it afterwards. Under low QPS, the time spent contending for the lock is negligible. Under high QPS, however, a large amount of time is spent waiting for the lock, and latency rises.

How HashLruCache adapts to high-QPS scenarios:

When a large number of synchronous waits drive latency up, the solution is to minimize the critical section. A hash mechanism is therefore introduced to shard all the data, forming HashLruCache on top of the original LruCache to reduce query time.

HashLruCache applies a hash algorithm to spread the cached data across N LruCaches. The simplest hash is the modulus: ad information is assigned to one of the N LruCaches by taking its ID modulo N. A query uses the same hash algorithm to locate the shard that may hold the data and then queries that shard alone. This increases the parallelism of reads and writes on the LruCache and reduces synchronous waiting time.
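Building on the LruCache sketch above, a minimal illustration of the sharding idea; the std::hash routing and the constructor parameters are assumptions for the example (the tests below use 16 shards):

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Sketch of HashLruCache: N independent LruCaches, each with its own
// internal lock, so threads hitting different shards never contend.
class HashLruCache {
public:
    HashLruCache(std::size_t shards, std::size_t capacity_per_shard) {
        for (std::size_t i = 0; i < shards; ++i)
            shards_.emplace_back(std::make_unique<LruCache>(capacity_per_shard));
    }
    std::optional<std::string> Get(const std::string& key) {
        return shard(key).Get(key);
    }
    void Put(const std::string& key, std::string value) {
        shard(key).Put(key, std::move(value));
    }
private:
    // Simplest routing, per the text: hash the key (the ad ID) mod N.
    LruCache& shard(const std::string& key) {
        return *shards_[std::hash<std::string>{}(key) % shards_.size()];
    }
    std::vector<std::unique_ptr<LruCache>> shards_;
};
```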

The figure below compares the average data-acquisition time (ms) under continuously increasing QPS, before and after using the 16-shard HashLruCache structure, with the hit ratio above 95%:

From the average-time chart, the following conclusions can be drawn:

  1. Using HashLruCache cuts the average time by nearly half; the effect is significant.
  2. Compared with the case without HashLruCache, the average time with HashLruCache is insensitive to QPS growth and shows no significant increase.

The figure below compares the Top999 data-acquisition time (ms) under continuously increasing QPS, before and after using the 16-shard HashLruCache structure, with the hit ratio above 95%:

From the Top999 chart, the following conclusions can be drawn:

  1. With HashLruCache, the Top999 time is about one third of that without it.
  2. With HashLruCache, the Top999 time grows significantly more slowly with QPS than without it, and is relatively insensitive to QPS growth.

After the HashLruCache structure was introduced, both the average time and the Top999 time dropped markedly in actual use; the effect is very significant.

Determining the number of HashLruCache shards:

Following the analysis above, one way to further improve HashLruCache performance is to find the most reasonable number of shards, providing enough parallelism to reduce synchronous waiting. A natural choice is to make the number of shards equal to the number of CPU cores; with hyper-threading, the number of shards can be raised further to increase parallelism, as in the sketch below.
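As an illustrative starting point (an assumption, not a recommendation from the article), the shard count could be seeded from the concurrency reported by the standard library and then tuned against measurements like those that follow:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// hardware_concurrency() reports hardware threads (hyper-threading
// included) and may return 0 on some platforms, hence the floor of 1.
std::size_t shards =
    std::max<std::size_t>(1, std::thread::hardware_concurrency());
```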

The figure below compares the average data-acquisition time (ms) under different QPS and different shard counts after the HashLruCache mechanism is used, with the hit ratio above 95%:

The average-time chart shows that under high QPS, the average time does not fall significantly as the shard count grows; it stays basically stable.

The figure below compares the Top999 data-acquisition time (ms) under different QPS and different shard counts after the HashLruCache mechanism is used, with the hit ratio above 95%:

The Top999 chart shows that at 750 QPS, raising the shard count from 8 to 16 and then to 24 reduces the Top999 time somewhat, but not significantly. At 1000 QPS, the Top999 time drops significantly when the shard count goes from 8 to 16, yet stays basically stable from 16 to 24. This clearly correlates strongly with the actual number of CPU cores on the machine.

In practice, the shard count of the HashLruCache mechanism can be tuned to the machine's capability and the actual QPS of the scenario to reach the best performance.

Evolution 4: Zero-copy mechanism

The thread-safe LruCache maintains its own set of data; when serving data externally, it hands the caller a complete copy. For simple data, the cost of copying is minimal and this mechanism is no bottleneck. In the Meituan DSP scenario, however, the data structures stored in LruCache are very complex, a single copy is expensive, and the mechanism becomes a performance bottleneck.

Ideally, LruCache would hand out only the address of the data, that is, a pointer, and the caller would access the data through that pointer wherever the business needs it. This turns an expensive copy of complex data into a cheap address copy, greatly reducing the cost of the copy operation; this is the zero-data-copy mechanism. A naive zero-copy mechanism, however, carries a safety risk: because of LruCache's time-based eviction, an entry may be deleted after it expires while a caller still holds, and accesses the data through, a now-invalid pointer.

Further analysis showed that the core of the problem is that the life cycle of the data stored in LruCache is opaque to its users. The solution is to attach a reference count held in an atomic variable to each piece of data stored in LruCache. Atomic variables both make the reference count thread-safe, so that every thread reads a consistent value, and minimize the synchronization overhead under concurrency. The reference count is incremented each time a data pointer is taken, whether by the LruCache itself or by a user; likewise, it is decremented by 1 when a data pointer is released. When the reference count reaches 0, the data is used by no one and has already been removed from the LruCache, so it can be deleted safely.
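In C++ this pattern can be sketched with std::shared_ptr, whose control block holds exactly such an atomic count (incremented on each pointer copy, decremented on each destruction); the class below is an illustration of the idea, not the article's implementation, and omits the LRU bookkeeping for brevity:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct AdInfo { /* complex ad fields elided */ };

// Zero-copy sketch: the cache stores and hands out shared_ptrs, so
// Get copies a pointer, not the complex ad structure, and the AdInfo
// stays alive while any caller still holds a reference, even if the
// cache evicts the entry in the meantime.
class ZeroCopyCache {
public:
    std::shared_ptr<const AdInfo> Get(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        return it->second;  // pointer copy: atomically bumps the count
    }
    void Put(const std::string& key, std::shared_ptr<const AdInfo> value) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_[key] = std::move(value);
    }
    void Evict(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_.erase(key);  // drops only the cache's reference; the data
    }                     // is destroyed when the count reaches 0
private:
    std::mutex mutex_;
    std::unordered_map<std::string, std::shared_ptr<const AdInfo>> map_;
};
```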

The figure below compares the average data-acquisition time (ms) under different QPS before and after enabling the zero-copy mechanism, with the hit ratio above 95%:

The average-time chart shows that the zero-copy mechanism cuts the average time significantly, by more than 60%.

The figure below compares the Top999 data-acquisition time (ms) under different QPS before and after enabling the zero-copy mechanism, with the hit ratio above 95%:

From the Top999 chart, the following conclusions can be drawn:

  1. With zero copy, the Top999 time drops by nearly 50%; the effect is very obvious.
  2. Under high QPS, the Top999 time with the zero-copy mechanism grows more slowly with QPS than without it, and is relatively insensitive to QPS growth.

With the zero-copy mechanism in place, copying pointers replaces copying data, which greatly reduces the time to obtain complex business data and further minimizes the critical section. Thread-safe atomic increment and decrement are available in many base libraries; C++11, for example, provides built-in atomic integer types for exactly this purpose.

Introducing the zero-copy mechanism into HashLruCache further reduces both the average time and the Top999 time, and works very well for stabilizing the Top999 time under high QPS.

Conclusion

The figure below compares the average data-acquisition time (ms) under different QPS before and after the whole series of optimizations, with the hit ratio above 95%:

The average-time chart shows that the optimized average time is less than 20% of the pre-optimization value; performance improved markedly. The optimized average time is also less sensitive to QPS growth and supports high-QPS business scenarios better.

The figure below compares the Top999 data-acquisition time (ms) under different QPS before and after the whole series of optimizations, with the hit ratio above 95%:

The Top999 chart shows that the optimized Top999 time is less than 20% of the pre-optimization value, and the latency of long-tail requests drops significantly.

LruCache is a very common data structure and plays an important role in the high-QPS business scenarios of the Meituan DSP system. To meet business needs, a time-based forced-eviction mechanism was added on top of the original LRU eviction. As the business developed, the HashLruCache mechanism was adopted to cut cache query time in even higher-QPS scenarios, and for different concrete scenarios and QPS levels, more reasonable shard counts were continually tried to improve HashLruCache query performance. Finally, a zero-copy mechanism based on reference counting was introduced into HashLruCache, further reducing the average time and the Top999 time and better serving the growth of the business.

About the authors

Wang Can joined Meituan in November 2018 as a senior engineer, responsible for developing the back-end infrastructure of the Meituan DSP system.

Cui Tao joined Meituan in June 2015 as a senior advertising technology expert. During this time he led the construction of the Meituan DSP platform from 0 to 1. He has rich experience in large-scale computing engine development and performance optimization.

Shuang Shuang joined Meituan in June 2015 and is a senior engineer in charge of the back-end infrastructure and machine-learning architecture of the Meituan DSP system, fully responsible for the architecture design and optimization of the DSP business's ad recall and ranking services.

Recruitment

The Meituan online marketing DSP team is recruiting talented people in engineering, algorithms, data, and other areas. Send your resume to [email protected] and join us in building and optimizing highly reliable systems that serve tens of billions of traffic requests.