I. Background introduction
The product wants to count the daily live volume and total number of active users of a page in the current system. Since both logged in and unlogged users can access the page, the user ID cannot be used to count the daily live volume and total number of active users. Instead, IP addresses are used to perform statistics and de-processing.
Two, technology selection
- First thought of the solution is to use redis set data structure, because it is an unordered set, we get the IP address, and then into the set of statistics can be realized and the effect of weight, but the set has a big problem is that every piece of data takes up the space is big, if the data in large amounts may cause memory problems.
- Therefore, I think of using some space-saving data structures, and I think of the bitmap I have learned before. The space occupation is relatively low, but bitmap is more suitable for the scenario of user ID increasing, because THE IP address is random, it directly occupies a large space. If the ID is increasing, it can be gradually expanded.
- Finally, it is found that Redis Hyperloglog data structure is very consistent, Hyperloglog can achieve mass data statistics and deduplication. Unlike a Bitmap, it simply inputs elements to calculate the cardinality and does not store the elements themselves. This time, you do not need to store the value of specific IP addresses. You only need to count the total number of IP addresses and delete them.
Third, concrete implementation
- The first step is to use an IPUtil utility class to get the IP of the user visiting the site.
/** * @Author: qubingquan * @Date: 2020/9/14 1:51 PM */ public class IPUtil {/** ** Request. GetRemoteAddr (); The reason is that it is possible that users use proxy software to avoid real IP addresses. * However, if multiple levels of reverse forwarders are Forwarded, x-Forwarded-For does not carry a single value. It carries a string of IP addresses. Which one is the true IP address of the client? @Forwarded-For @param Request @return */ Public static String @forwarded-For @forwarded-For @forwarded-For @forwarded-For @forwarded-For getIpAddress(HttpServletRequest request) { String ip = request.getHeader("x-forwarded-for"); if (ip == null || ip.length() == 0 || "unknown".equalsIgnoreCase(ip)) { ip = request.getHeader("Proxy-Client-IP"); } if (ip == null || ip.length() == 0 || "unknown".equalsIgnoreCase(ip)) { ip = request.getHeader("WL-Proxy-Client-IP"); } if (ip == null || ip.length() == 0 || "unknown".equalsIgnoreCase(ip)) { ip = request.getHeader("HTTP_CLIENT_IP"); } if (ip == null || ip.length() == 0 || "unknown".equalsIgnoreCase(ip)) { ip = request.getHeader("HTTP_X_FORWARDED_FOR"); } if (ip == null || ip.length() == 0 || "unknown".equalsIgnoreCase(ip)) { ip = request.getRemoteAddr(); If (" 127.0.0.1 ". The equals (IP) | | "0:0:0:0:0:0:1-0." equals (IP)) {/ / according to the network card in the machine configuration of IP InetAddress inet = null; try { inet = InetAddress.getLocalHost(); } catch (UnknownHostException e) { e.printStackTrace(); } ip= inet.getHostAddress(); } } return ip; }}Copy the code
- Then we use the redisTemplate to execute Hyperloglog commands.
// Get the visitor's IP String ipAdress = iputil. getIpAddress(httpServletRequest); Log.debug (" Access list page IP address :[{}]",ipAdress); / / will be credited to redis IP HyperLogLogOperations < String, the String > hyperlog = redisTemplate. OpsForHyperLogLog (); hyperlog.add("cpp_bank_list_total_size_today",ipAdress);Copy the code
Here the PFADD command is executed to store the data into a Hyperloglog data structure. We store the IP of this visit into the variable cpp_bank_LIST_TOTAL_size_today.
- Finally, we look at the daily scheduled tasks performed at 0 o ‘clock
private static String COUNT = "cpp_bank_list_total_size_today"; private static String TOTAL_COUNT = "cpp_bank_list_total_size"; private static String TOTAL_ID = "0"; @Scheduled(cron = "0 0 0 * * ?" ) //@Scheduled(cron = "*/5 * * * * ?" ) public void saveUserAccessLog(){ HyperLogLogOperations<String,String> hyperlog = redisTemplate.opsForHyperLogLog(); int count = hyperlog.size(COUNT).intValue(); Calendar calendar = Calendar.getInstance(); calendar.add(Calendar.DAY_OF_MONTH,-1); CppBankAccessLog cppBankAccessLog = CppBankAccessLog.builder() .id(Sequence.getInstance().getSequenceNumber()) .time(new SimpleDateFormat("yyyy-MM-dd").format(calendar.getTime())) .count(count) .build(); cppBankAccessLogMapper.insertSelective(cppBankAccessLog); Hyperlog. union(TOTAL_COUNT,COUNT); int totalCount = hyperlog.size(TOTAL_COUNT).intValue(); cppBankAccessLogMapper.updateCountById(TOTAL_ID,totalCount); The info (" living quantity information storage, data yesterday: [{}], according to the total: [{}] ", the count, totalCount); // Delete data from today hyperlog.delete(COUNT); }Copy the code
The main thing to be done here is to import the statistics of the day into the database, then use the union command to merge the daily activity and the total active volume of the day, and then enter the total active volume of the database to delete the daily active data of the day.
Four, Hyperloglog principle introduction
First of all, HyperLogLog doesn’t actually store the value of every element. It uses a probabilistic algorithm to count elements by storing the first 1 of their hash value. This is error-prone and not suitable for scenarios with absolutely accurate counting. HyperLogLog in Redis, only need 12K memory, in the standard error of 0.81% under the premise of 2 power 64 data statistics.
- Bernoulli experiment
To understand the Hyperloglog principle, one must first understand Bernoulli’s experiment. So Bernoulli’s experiment was to flip a coin, flip the coin a few times, and then see how many flips it takes to get heads, as you can see here.K is the number of times it takes to flip 1 in each turn. What we know is the maximum value of K, which can be expressed as kmax. Since there are only 0 and 1 outcomes of each coin flip, the probability of kmax appearing in any turn isSo we can derive thatBut the error rate is huge, so if K is 3, we’re going to get 8 flips, but chances are I’m going to get 001 the first time, so K is 3.
In order to reduce the error rate, the concept of buckets is introduced to calculate the weighted average of M buckets. Here is the LogLog estimation formula:
The DVLL of the above formula corresponds to N. Constant is the correction factor, and its specific value is uncertain, which can be set in branches according to the actual situation. M is the number of rounds. The R with a bar on the head is the average :(k_max_1 +… + k_max_m)/m.
This algorithm optimization by increasing the number of test rounds and then taking the k_max average is LogLog’s approach. The difference between HyperLogLog and LogLog is that instead of using an average, HyperLogLog uses a harmonic average. The advantage of the harmonic mean over the mean is that it is less susceptible to large values. Example: Find the average salary:
A has 1000/ month and B has 30,000 / month. The mean is: (1000 + 30000) / 2 = 15500. The harmonic mean is: 2/(1/1000 + 1/30000) ≈ 1935.484Copy the code
Harmonic mean formula:
- Hyperloglog
For an input string, we first get a 64-bit hash value, using the first 14 bits to locate the bucket (a total of 2 ^ 14, 16,384 buckets). The last 50 bits are the Bernoulli process, and each bucket has 6 bits. Count is the first occurrence of 1. If count>oldcount, oldcount is replaced by count.
Imitating the process above, multiple user ids are split into different buckets, each with its own K_max. Then when it comes time to figure out how many users hit the page, it’s an estimate. Finally, the k_max in all buckets is combined into the estimation formula to obtain the estimated value. Each bucket has 6 bits, that is, [000 000]. The maximum value is [111 111], indicating 63.
Write at the end
Actual Redis will have sparse storage structure and dense storage structure two implementation, to learn more please refer to the resources below, there is a comprehensive introduction.
Vi. Reference materials
zhuanlan.zhihu.com/p/58519480