The birthday paradox: in a group of just 23 people, the probability that at least two share the same birthday is greater than 50%. In an elementary-school class of 30, for example, there is about a 70% chance that two people share a birthday; in a large class of 60, the probability is greater than 99%. The birthday paradox is not a "paradox" in the sense of a logical contradiction; it is called one because the mathematical fact is so counterintuitive.
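These numbers are easy to verify exactly by computing the complement (the probability that all birthdays are distinct). A quick sketch:

```python
from math import prod

def p_shared_birthday(k: int) -> float:
    """Probability that at least two of k people share a birthday (365-day year)."""
    # Complement: the probability that all k birthdays are distinct.
    p_distinct = prod((365 - i) / 365 for i in range(k))
    return 1 - p_distinct

for k in (23, 30, 60):
    print(k, round(p_shared_birthday(k), 4))
```

Running this gives roughly 0.507 for 23 people, 0.706 for 30, and 0.994 for 60, matching the claims above.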

The birthday paradox is an interesting concept, but what does it have to do with saving hundreds of gigabytes of memory?

Background

First, some background. I was responsible for an ad data system, and one of its features was deduplicating ad exposure requests in real time. We only need to know that a request exposed a given ad once; duplicate exposure records produced by the same request are meaningless, and storing them inflates our storage costs. So we need logic that decides, for each new exposure, whether it has already been recorded. The old scheme stored the request's unique identifier (UUID) together with the advertisement ID (adId) in Redis. Every time data came in, we checked whether a UUID + adId key already existed in Redis: if it did, the record was filtered out; if not, it was recorded in Redis as seen.
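The check-and-record logic can be sketched as follows. This is a minimal illustration, not the production code: an in-memory dict stands in for Redis (the real system used a Redis key per UUID + adId pair, with a TTL), and all names are assumptions.

```python
# In-memory stand-in for the Redis lookup; production used a Redis key
# per uuid+adId pair with an expiry instead of this dict.
_seen: dict[str, int] = {}

def is_new_exposure(request_uuid: str, ad_id: str) -> bool:
    """Return True (and record the pair) the first time a uuid+adId is seen."""
    key = request_uuid + ad_id
    if key in _seen:
        return False  # duplicate exposure: filter it out
    _seen[key] = 1    # first time: record it as seen
    return True
```

In production this would be a single atomic `SET key value NX EX ttl` against Redis rather than two separate operations.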

The problem was that the number of Redis records was huge (over 400 GB in two days), and as our business grew, our Redis cluster was overwhelmed… Of course, money and machines are the simplest fix; a problem that money can solve is hardly a problem at all. But to save the company money, I took the optimization path.

How to optimize?

One thing is certain: the number of records is fixed, because that is where the business is, so reducing the record count is not an option. Can we instead reduce the length of each record?

Let's take a look at the Redis storage design: each record is 45 bytes in total. Can we shorten that? Of course, we could truncate part of the UUID, but that raises a new problem: truncating the UUID increases the probability of duplicates. So first we need to figure out how to truncate and by how much.

We use random (version 4) UUIDs here, which have 122 random bits, so there are $2^{122}$ valid UUIDs. Because they are generated randomly, duplicates are possible. What is the probability of a duplicate? Knowing $2^{122}$ alone is not enough to calculate it; you also need to know how many UUIDs you generate. For example, what is the probability of a duplicate among $2^{36}$ UUIDs?

$$p(n) \approx 1-e^{-\frac{n^{2}}{2x}}$$

This is just the birthday paradox at a larger scale. The probability follows from the formula above, where x is the total number of possible UUIDs, $2^{122}$, and n is $2^{36}$. According to Baidu Baike, the probability is about $4 \times 10^{-16}$, far smaller than the probability of being hit by a meteorite when you step out the door.

| n | probability of a duplicate |
| --- | --- |
| $2^{36}$ | $4 \times 10^{-16}$ |
| $2^{41}$ | $4 \times 10^{-13}$ |
| $2^{46}$ | $4 \times 10^{-10}$ |
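These values can be reproduced directly from the approximation, as a quick sanity check:

```python
import math

def collision_prob(n: int, x: int) -> float:
    """Birthday-bound approximation: p(n) ≈ 1 - e^(-n^2 / (2x))."""
    # expm1 keeps precision when the exponent is tiny, as it is here.
    return -math.expm1(-n * n / (2 * x))

X = 2 ** 122  # total random UUIDs (122 random bits in a version-4 UUID)
for bits in (36, 41, 46):
    print(f"2^{bits}: {collision_prob(2 ** bits, X):.1e}")
```

The printed values are about 4.4e-16, 4.5e-13, and 4.7e-10, matching the table.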

The formula also shows that when n is fixed, the probability p increases as the number of effective bits (and hence x) decreases. Back to our ad scenario: the maximum number of requests per day, n, is basically fixed, and we can tolerate a small probability p (less than one in 10,000), from which we can work backwards to how big x needs to be.

For small p the formula simplifies to $p \approx \frac{n^{2}}{2x}$, so requiring $p \le 10^{-4}$ means $x \ge \frac{n^{2}}{2 \times 10^{-4}}$. I computed the size of n from historical numbers, then calculated x, left some headroom, and redesigned the Redis key accordingly. (The intermediate calculation involves company data, so it is not published here.)
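Inverting the formula for x can be sketched like this. The n below is purely illustrative (the article's real volume is company data), and the function name is an assumption:

```python
import math

def min_keyspace_bits(n: int, p_max: float) -> int:
    """Bits of keyspace needed so that p(n) ≈ 1 - e^(-n^2/(2x)) stays <= p_max.
    Inverting the formula: x >= n^2 / (-2 * ln(1 - p_max))."""
    x = n * n / (-2 * math.log1p(-p_max))
    return math.ceil(math.log2(x))

# Hypothetical daily volume of one million keys, for illustration only.
print(min_keyspace_bits(10**6, 1e-4), "bits of keyspace needed")
```

Note that this sizes x so that the probability of *any* collision among n keys stays below p_max; tolerating a small per-record duplicate rate instead would need far fewer bits.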

New design

In the end, I selected 40 effective bits (10 hexadecimal characters). But I did not simply take the first 10 characters of the UUID, because the leading characters of a UUID depend on time and are not very random. Instead, I hash the whole UUID with MD5, take the first 10 hex characters of the digest, and append the adId to form the final key. The key length is roughly halved, for an overall storage saving of at least 50%. Note: the actual Redis storage implementation differs slightly from the description above. To keep the focus, the description here is deliberately simplified, so in the end the real memory saving is not quite half, only about 35%.
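The key construction described above can be sketched as follows (the function name and the sample adId are assumptions for illustration):

```python
import hashlib
import uuid

def dedup_key(request_uuid: str, ad_id: str) -> str:
    """Shortened dedup key: first 10 hex chars (40 bits) of MD5(UUID) + adId.
    Hashing the whole UUID first avoids relying on its time-dependent,
    poorly-random leading characters."""
    digest = hashlib.md5(request_uuid.encode("utf-8")).hexdigest()
    return digest[:10] + ad_id

key = dedup_key(str(uuid.uuid4()), "ad123456")
print(key, len(key))  # 10 hex chars + adId, versus the original 36-char UUID
```

With a hypothetical 8-character adId, the key drops from 44 characters to 18.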

How to further optimize?

There is still room: the adId also contains a lot of redundant information and could likewise be truncated; we could tolerate a higher duplicate rate; we could shorten the Redis storage time…

Each of the methods above would shave off some storage, but none by an order of magnitude. The following approach, however, can reduce storage by more than 99%.

BloomFilter

For how a Bloom filter works, you can refer to my earlier article, BloomFilter principle, implementation and performance testing. A Bloom filter is designed precisely for deduplication scenarios; by a conservative estimate, switching our ad dedup to a Bloom filter would save at least 90% of memory.
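The savings estimate follows from the standard Bloom filter sizing formula. A sketch, assuming the same one-in-10,000 tolerated error rate as above and the 45-byte record size from the original design:

```python
import math

def bloom_bits_per_item(p: float) -> float:
    """Optimal Bloom filter size per stored item: m/n = -ln(p) / (ln 2)^2 bits."""
    return -math.log(p) / (math.log(2) ** 2)

bits = bloom_bits_per_item(1e-4)   # ~19.2 bits, i.e. ~2.4 bytes per item
savings = 1 - (bits / 8) / 45      # versus the 45-byte Redis record
print(f"{bits:.1f} bits/item, roughly {savings:.0%} smaller")
```

At a 0.01% false-positive rate this works out to about 2.4 bytes per item instead of 45, consistent with the "at least 90%" estimate.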

The reason I did not use a Bloom filter in the end is implementation complexity. Redis has supported modules since 4.0, and there is a Bloom filter module, RedisBloom, but we are still on Redis 3.x! I also considered implementing a Bloom filter on the application side on top of Redis, but our application runs as a cluster, which would mean solving distributed data-consistency issues, so I dropped the idea.

Conclusion


In the end, the optimization cut more than 100 GB of memory from the 400 GB+; in practice that is one server, or several servers once we scale further. It saves the company thousands of yuan in cost every month, and honestly, a big factory like our company does not care about that money. But even if those few thousand will never translate into my salary or bonus, optimizations like this still have to be done. If everyone followed the principle of doing things well at the lowest cost, then at our company's scale, tens of millions of yuan could be saved every month.

References

  1. Baidu Baike: birthday paradox
  2. Baidu Baike: UUID
  3. BloomFilter principle, implementation and performance testing
  4. RedisBloom: a Redis-based Bloom filter module

    This article is from blog.csdn.net/xindoo