This article mainly introduces Redis to achieve unique count of 3 methods to share, this article explains based on SET, based on bit, based on HyperLogLog three methods, need friends can refer to

Unique counting is a very common feature in web site systems, such as unique visitor (UV) where a web site needs to count the number of visitors per day. The counting problem is very common, but it can be very complicated to solve. First, the amount of data needed to count may be very large. For example, a large site has millions of people visiting it every day, so the amount of data is quite large. Second, you often want to extend the dimension of counting, such as weekly or monthly UVs as well as daily UVs, which makes the calculation very complicated.

On relational database storage systems, the only way to implement a count is to select count(DISTINCT <item_id>), which is simple, but slow to execute if there is a large amount of data. Another problem with relational databases is that insert data performance is also poor.

Redis is handy for solving such counting problems, faster and less resource-intensive than relational databases, and even offers three different approaches.

1. Based on the set

Redis set is used to save a unique data set, through which you can quickly judge whether an element exists in the set, can also quickly calculate the number of elements in a set, and can merge the set into a new set. The commands involved are as follows:

The copy code is as follows:

SADD key member SCARD key member SADD key member SCARD keyCopy the code

The method based on set is simple and effective, accurate counting, wide application and easy to understand. Its disadvantage is that it consumes large resources (of course, it is much less than the relational database). If the number of elements is very large (such as hundreds of millions of counts), it consumes terrible memory.

2. Based on the bit

Redis’s bit can be used to implement a count that is more compressed than set memory, which stores the existence of an element with a bit 1 or 0. For example, the number of unique visitors to a website can be set to user_id as the offset of bit. If user_id is set to 1, it can store the daily access count of more than 8 million users. The commands involved are as follows: Copy code The code is as follows:

SETBIT key offset value GETBIT key offset value obtain bit information BITCOUNT key [start end] # Count BITOP operation destkey key [key ...]. # bitmap mergeCopy the code

The bit-based approach consumes much less space than the set approach, but it requires elements to be easily mapped to bit offsets, which is much narrower. In addition, the amount of space it consumes depends on the maximum offset, regardless of the count. If the maximum offset is large, the memory consumption can be considerable.

3. Based on HyperLogLog

It is difficult to achieve accurate unique Counting of large amount of data, but if it is only approximate, there are many efficient algorithms in computing science, among which HyperLogLog Counting is a very famous algorithm, which can only use about 12 K memory to achieve hundreds of millions of unique Counting. And the error is controlled within 1% or so. The commands involved are as follows: Copy code The code is as follows:

PFADD key element [element ...] PFCOUNT key [key...] # countCopy the code

This counting method is really amazing, which involves some uniform distribution, random probability, Bernoulli distribution in statistics, etc., which I have not fully understood, so I can study relevant articles in depth if I am interested.

The three unique counting methods provided by Redis have their own advantages and disadvantages, which can fully meet the counting requirements in different situations.

4. Based on bloomfilter

BloomFilter uses data structures like bitmaps or bit sets to store data. It uses bit arrays to represent a set succinctly and can quickly determine whether an element already exists in the set. Although BloomFilter is not 100% accurate, you can reduce the error rate by adjusting the parameters, using the number of Hash functions, and the size of the bit array. This adjustment can reduce the error rate to close to zero. This is enough for most scenarios.

Suppose there is a set S = {x1, x2… Xn}, Bloom Filter uses k independent hash functions to map each element in the set to {1,… The range of m}. For any element, the number mapped to acts as the index of the corresponding bit array, and that bit is set to 1. For example, if the element x1 is hash mapped to the number 8, the eighth bit of the array will be set to 1. In the following figure, set S has only two elements x and y, which are mapped by three hash functions to positions (0,3,6) and (4,7,10) respectively, and the corresponding bit will be set to 1:Now, if you want to determine whether another element is in the set, you just need to map the three hash functions to see if there is a 0 in the corresponding position. If there is, this element definitely does not exist in the set, otherwise it may exist.

Redis use bloom filter need to install the plug-in: https://blog.csdn.net/u013030276/article/details/88350641.

Hive computing maximum number of consecutive login days Hadoop data migration usage details Hbase repair tool Hbck data warehouse modeling layered theory article to understand Hive data storage and compression components focus on learning this several