First published on the official accountMXPlayer technical teamWelcome your attention.

The introduction

Mxplayer, as a popular online video player, has a large number of users. In many scenarios, such as short video recommendation, it is necessary to filter out the short video that users have seen before from the recall results to ensure the diversity of recommended content to improve user experience. Therefore, it is necessary to save the recommended data into the user’s history. With the growth of users and the increase of user history, the storage and efficient filtering of massive historical data is a problem that needs to be solved. Mx recommendation system effectively solves this problem by using BloomFilter.

MX solution (Option 1)

In the initial user history filtering, the user’s history data (item ID list) was stored in redis’ Zset. Because the number of users’ history may be large, a truncation operation was made to keep only the last 1400 items. Each time, all the historical data of the user is extracted from Redis and filtered. Then, before the result list is returned, the data of the list is added to the historical data and pushed to Redis together. This kind of history filtering not only takes up a lot of memory of Redis, but also needs to transmit a lot of history data with Redis twice for each request, which consumes a lot of time and network bandwidth. Meanwhile, the server also consumes a lot of memory because of caching history data.

BloomFilter Historical filtering (Scheme 2)

BloomFilter profile

Bloom Filter is a spatially efficient random data structure. Its principle is that when an element is added to a set, it is mapped to K points in a bit array by K Hash functions, setting them to 1. If any of these points have a 0, the element being retrieved must not be there. If both are 1, the element being retrieved is likely to exist.

An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.
Copy the code

Refer to the Wikipedia article for details. As can be seen from the figure above, if an element v does not exist in the set, but the value obtained by three hash functions is (1,3,4), then BloomFilter will determine that v exists in the set, so BloomFilter has the possibility of False positives.

BloomFilter practices in MX

BloomFilter has a variety of implementations. The MX recommendation system uses BloomFilter directly from Google’s Guava library and assigns each user a BloomFilter with a capacity of 10000 and a false positive rate of 0.01.

  • Historical stored procedure

    Each time before the recommender system returns the data to the user, add the ID of each piece of data to the user’s BloomFilter, then export the BloomFilter as a byte array and store it in Redis, set the expiration time to 7 days, and asynchronously store the original item ID to the Zset of PIKA (left at the end).

  • Historical filtering process

    Take the byte array of BloomFilter from Redis to construct guava BloomFilter object for filtering

Compare the two schemes

  1. storage

    As can be seen from the figure above, the memory size of BloomFilter is approximately one order of magnitude smaller than that of item List for storing the same amount of historical data. The test results show that guava BloomFilter with 10000 capacity and 0.01 false positive rate exports a byte array of 11990 length, which takes about 12KB of memory, that is to say, it takes only 12KB to store 10000 historical data of a user. However, if the historical data is stored in the form of item ID list, since the item ID is a 32-bit character string (hexadecimal number) encoded in UTF-8, each character occupies one byte, storing 10,000 item ids will occupy 10000×32×1=320000B. It is about 312KB, which is the amount of memory used for BloomFilter storage26Times! That is, using BloomFilter saves approximately more than using the item list96%Memory, reduce a large amount of storage overhead. Although we also use PIKA to store the Item ID List as the base, PIKA is a hard disk database, which is far cheaper than memory and almost negligible compared to Redis.

  2. The efficiency of

    In scheme 1, there are two ways to write data to the Zset of Redis. One is to add the IDS in the item List to the Zset asynchronously through a loop, and the other is to add the item ID list and the corresponding score to an array, and then transfer the array to Redis as zADD parameter once and execute it. The efficiency of the two methods is not that much different. Option 2 simply exports BloomFilter as a byte array and stores it directly in Redis as byte. From a coding point of view, BloomFilter is much simpler; From the perspective of network transmission, the item ID List consumes longer time than BloomFilter due to the large amount of data. From the point of view of command execution on the Redis side, scheme 1 needs to execute zadd multiple times, while Scheme 2 only needs to execute set(bytes) once, so scheme 2 still takes less time than scheme 1. So overall, plan TWO is much more efficient.

extension

  1. The choice of BloomFilter

    Redis supports modules from version 4.0, we can develop and load any Module, the official also provides many excellent modules, including BloomFilter implementation Module – RedisBloom, so we can directly use redis BloomFilter. Instead of using Guava BloomFilter to export byte arrays to Redis. However, if the load balancing of the system is session-friendly (i.e., sticky sessions), then using Guava BloomFilter and caching locally will greatly improve the system performance, whereas RedisBloom cannot cache locally because we don’t know what RedisBloom’s algorithm is. Or not implementing the same algorithm as RedisBloom.

  2. False positive rate of BloomFilter

    Assuming that the Hash function selects and sets a certain bit in the bit array with equal probability, assuming that the position of the bits to be set is independent of each other calculated by each Hash, m is the size of the bit array, k is the number of Hash functions, and n is the number of elements to be inserted, the false positive rate is approximately:


    Detailed calculations can be found in the chapter on Bloem filters in The Beauty of Mathematics, Second edition. As can be seen from the above formula, we can predict the number of target elements and adjust the size of the bit array to achieve an acceptable false positive rate. Similarly, we can calculate the size of the bit array by setting the number of target elements and the false positive rate (refer to the construction parameters of guava BloomFilter).