The premise
In the future, a large number of projects or requirements will be used in Redis. During this period of time when business is not too busy, take some time to preview and review the related content of Redis. Just see the blog below UV and PV statistics, thought of the recent book mentioned in the HyperLogLog data type, so take some time to analyze its use and use scenarios (temporarily not explore the implementation principle of HyperLogLog). The HyperLogLog data type in Redis is introduced by Redid 2.8.9. Ensure that Redis >= 2.8.9 when using HyperLogLog.
HyperLogLog profile
Cardinality counting is used to count the number of unique elements ina set. A very common example is counting UV (Unique Visitor) for an article. In the context of large data volume, to realize cardinality counting, elements that store the full set of cardinality will not be selected in most cases, because the memory cost of storage can be calculated. Assuming that the average size of each element being counted is 32bit, the memory occupied by 100 million data will be:
32 * 100000000/8/1024/1024 ≈ 381M
.
The complexity of this operation can be devastating if there are multiple collections and the result of the merge count is allowed to be computed for multiple collections. Therefore, data structures such as Bitmap, Tree or HashSet will not be used to directly store the collection of counting elements for counting. Instead, the probability algorithm of radix counting is used for counting on the premise of not pursuing absolutely accurate counting results. Currently, there are three common probability algorithms:
Linear Counting(LC)
.LogLog Counting(LLC)
.HyperLogLog Counting(HLL)
.
Therefore, HyperLogLog is actually a radix counting probability algorithm, which is not unique to Redis. Redis implements HyperLogLog based on C language and provides relevant command API entry.
Antirez, the author of Redis, in order to commemorate Philippe Flajolet’s research on combinatorial mathematics and radix calculation algorithm analysis, he used Philippe Flajolet’s English initials PF as a prefix when designing HyperLogLog commands. That said, Dr. Philippe Flajolet was a significant contributor to the HLL algorithm, but he was not actually the developer of the HyperLogLog data type in Redis. Unfortunately, Dr. Philippe Flajolet passed away in Paris on 22 March 2011 after an illness. This is a Wikipedia photo of Dr. Philippe Flajolet:
Features of the HyperLogLog data type provided by Redis:
- Basic features: use
HyperLogLog Counting(HLL)
The implementation,Do cardinality calculations only and do not save metadata. - Memory usage:
HyperLogLog
eachKEY
Most take up12K
The memory space can be calculated close to2 ^ 64
Its storage space is stored by sparse matrix, which occupies very little space. Only when the number of counting bases gradually increases and the space occupied by sparse matrix gradually exceeds the threshold value, it will be converted into dense matrix at one time, and will be occupied after it is transformed into dense matrix12K
Memory space. - Counting error range: The result of a cardinal count is a standard error (
Standard Error
) for0.81%
When the amount of data is small, the result may also be an accurate value.
The small memory footprint (up to 12K per KEY) is HyperLogLog’s biggest advantage, but it has two relatively obvious limitations:
- The calculated result is not an exact value, and there is a standard error, because it is essentially a probabilistic algorithm.
- Cardinality metadata is not saved, which is not friendly to scenarios that require metadata for data analysis.
The HyperLogLog command is used
Redis provides three command apis for HyperLogLog data types: PFADD, PFCOUNT, and PFMERGE.
PFADD
The PFADD command parameters are as follows:
PFADD key element [Element...Copy the code
The Redis version that supports this command is: >= 2.8.9 Time complexity: The complexity of each added element is O(1).
- Function: Sets all element parameters
element
Add to key forkey
theHyperLogLog
Data structure.
The process for executing the PFADD command is as follows:
The modes of using the PFADD command are as follows:
127.0.0.1:6379> PFADD Food apple fish (INTEGER) 1 127.0.0.1:6379> PFADD food apple (integer) 0 127.0.0.1:6379> PFADD Throwable (integer) 1 127.0.0.1:6379> SET name doge OK 127.0.0.1:6379> PFADD name Throwable (error) WRONGTYPE Key is not a valid HyperLogLog string value.Copy the code
Although the HyperLogLog data structure is essentially a String, you cannot use HyperLogLog commands on String keys.
PFCOUNT
The PFCOUNT command parameters are as follows:
PFCOUNT key [key]...Copy the code
The Redis version that supports this command is as follows: >= 2.8.9 Time complexity: The complexity of the cardinality count of a single HyperLogLog is O(1) and the average constant time is low. When the parameter is multiple keys, the complexity is O(N), where N is the number of keys.
- when
PFCOUNT
Using a single commandkey
Returns the value stored in the given keyHyperLogLog
The approximate cardinality of the data structure, returned if no key exists0
. - when
PFCOUNT
Command to usemoreakey
Returns all stored in the givenHyperLogLog
The approximate cardinality of the union of data structures, that is, will divide all of theHyperLogLog
Data structure merged into a temporaryHyperLogLog
Data structure, and then calculate the approximate cardinality.
The PFCOUNT command can be used as follows:
127.0.0.1:6379> PFADD POST:1 ip-1 ip-2
(integer) 1
127.0.0.1:6379> PFADD POST:2 ip-2 ip-3 ip-4
(integer) 1
127.0.0.1:6379> PFCOUNT POST:1
(integer) 2
127.0.0.1:6379> PFCOUNT POST:1 POST:2
(integer) 4
127.0.0.1:6379> PFCOUNT NOT_EXIST_KEY
(integer) 0
Copy the code
PFMERGE
The PFMERGE command parameters are as follows:
PFMERGE destkey sourcekey [sourcekey ...]
Copy the code
The Redis version that supports this command is: >= 2.8.9 Time complexity: O(N), where N is the number of HyperLogLog data structures to be merged. The constant time of this command is relatively high
- Function: Multiple
HyperLogLog
The data structure is merged into a new key fordestkey
theHyperLogLog
Data structure, mergedHyperLogLog
The cardinality of is close to all inputsHyperLogLog
The visible set ofObserved Set
Cardinality of the union of. - Command return value: Only strings are returned
OK
.
The following describes how to use the PFMERGE command
127.0.0.1:6379> PFADD POST:1 ip-1 ip-2
(integer) 1
127.0.0.1:6379> PFADD POST:2 ip-2 ip-3 ip-4
(integer) 1
127.0.0.1:6379> PFMERGE POST:1-2 POST:1 POST:2
OK
127.0.0.1:6379> PFCOUNT POST:1-2
(integer) 4
Copy the code
Use HyperLogLog statistics UV case
Suppose you have a simple scenario where you count uVs for blog posts, and you don’t need to count uVs accurately, and you don’t need to save client IP data. The following scenario, using HyperLogLog to do a simple scheme and coding implementation.
The sequence of possible steps in the process may change, but the operations to be done remain essentially the same. Let’s briefly assume that the content and statistics of the article are returned by the background service, and the two interfaces are designed separately. Introducing Redis’s premium client Lettuce dependency:
<dependency>
<groupId>io.lettuce</groupId>
<artifactId>lettuce-core</artifactId>
<version>5.2.1. RELEASE</version>
</dependency>
Copy the code
The encoding is as follows:
public class UvTest {
private static RedisCommands<String, String> COMMANDS;
@BeforeClass
public static void beforeClass(a) throws Exception {
// Initialize the Redis client
RedisURI uri = RedisURI.builder().withHost("localhost").withPort(6379).build();
RedisClient redisClient = RedisClient.create(uri);
StatefulRedisConnection<String, String> connect = redisClient.connect();
COMMANDS = connect.sync();
}
@Data
public static class PostDetail {
private Long id;
private String content;
}
private PostDetail selectPostDetail(Long id) {
PostDetail detail = new PostDetail();
detail.setContent("content");
detail.setId(id);
return detail;
}
private PostDetail getPostDetail(String clientIp, Long postId) {
PostDetail detail = selectPostDetail(postId);
String key = "puv:" + postId;
COMMANDS.pfadd(key, clientIp);
return detail;
}
private Long getPostUv(Long postId) {
String key = "puv:" + postId;
return COMMANDS.pfcount(key);
}
@Test
public void testViewPost(a) throws Exception {
Long postId = 1L;
getPostDetail("111.111.111.111", postId);
getPostDetail("111.111.111.222", postId);
getPostDetail("111.111.111.333", postId);
getPostDetail("111.111.111.444", postId);
System.out.println(String.format("The uv count of post [%d] is %d", postId, getPostUv(postId))); }}Copy the code
Output result:
The uv count of post [1] is 4
Copy the code
It is appropriate to call getPostDetail() with a larger number of different client IP addresses, and then count the error.
Off topic – How to accurately count UV
If you want accurate UV statistics, you need to pay attention to several points:
- You need to have enough memory or disk capacity, because as far as cardinality counting is concerned, no algorithm can do an accurate count without saving metadata.
- If user behavior analysis is needed, metadata ultimately needs to be persisted, which should rely on big data system. I have no experience in this aspect, so I will not talk about it for the time being.
Assuming that without considering the memory cost, we can still use Redis to do accurate and real-time UV statistics, simple use of Set data type, increase UV only need to use SADD command, statistics UV only need to use SCARD command (time complexity O(1), can be trusted to use). For example:
127.0.0.1:6379> SADD puv:1 ip-1 ip-2
(integer) 2
127.0.0.1:6379> SADD puv:1 ip-3 ip-4
(integer) 2
127.0.0.1:6379> SCARD puv:1
(integer) 4
Copy the code
If these statistics are only client-side presentations, then an asynchronous design can be used:
In small volumes, all of the above application functions can be performed in the same service, and message queues can be replaced by asynchronous schemes of thread pools.
summary
This article only briefly introduces the use of HyperLogLog and statistical UV usage scenarios. In general, HyperLogLog can be used for counting in scenarios where (1) the amount of original data is huge, (2) the memory usage is required to be as small as possible, (3) certain errors are allowed in counting, and (4) metadata storage is not required.
References:
- antirez-Redis new data structure: the HyperLogLog
- Redis Commands
- wikipedia
(C-3-D E-A-20191117)
Technical official account (Throwable Digest), push the author’s original technical articles from time to time (never plagiarize or reprint) :
Entertainment public account (” Sand carving every Day “), select the interesting sand carving text and video push irregularly, relieve the pressure of life and work: