Definitions

PV stands for Page View and is often the primary traffic metric for a news portal, a website, or even a single piece of web news. The number of page views is one of the most commonly used indicators for evaluating website traffic.

UV is short for Unique Visitor: a natural person who visits the website over the Internet.

From these definitions it is clear that PV is the simpler metric: every visit to the site increments PV, but not necessarily UV. A UV is essentially a natural person identified by some criterion, and that criterion is ours to define. For example, we may treat visitors with the same IP as the same UV, which is one of the most common definitions; cookie-based definitions and others also exist. Both PV and UV must be described over a time period: when we quote PV or UV numbers, we usually mean the figures for 24 hours (one natural day).
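As a minimal illustration of the difference (the log data here is made up), counting PV versus UV for one natural day could look like this:

```csharp
using System;
using System.Collections.Generic;

class PvUvExample
{
    static void Main()
    {
        // Hypothetical access log for one natural day: one source IP per request
        string[] requestIps = { "1.2.3.4", "1.2.3.4", "5.6.7.8", "1.2.3.4", "5.6.7.8" };

        int pv = 0;                            // every request counts toward PV
        var uniqueIps = new HashSet<string>(); // same IP => same UV

        foreach (var ip in requestIps)
        {
            pv++;
            uniqueIps.Add(ip); // duplicates are ignored by the set
        }

        Console.WriteLine($"PV = {pv}, UV = {uniqueIps.Count}"); // PV = 5, UV = 2
    }
}
```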

Compared with UV, PV is technically easier to count, so today let's take a quick look at UV. Why is UV statistics relatively hard? Because UV requires deduplicating natural persons against the same criterion, and for a site with a large UV volume in particular, designing a good UV statistics system may not be as easy as you imagine.

So let's design a UV statistics system that takes a natural day as its time period, defines a natural person (UV) as the same source IP (you can of course choose other criteria), and assumes a data volume on the order of 10 million UVs per day.

Note: our focus today is how to design a UV statistics system once the definition of a natural person has been settled, not how to arrive at that definition. Designing a UV system is not as simple as it looks, because UV traffic can spike with the site's marketing activities, such as a flash-sale event.

Database-based solution

There is a saying in server-side programming: there is no feature that a table cannot solve, and if there is one, use two tables, or three. A UV statistics system can indeed be built on a database, and it is not complicated. The record table for UV statistics could look something like the following (don't worry too much about the exact design):

| Field | Type | Description |
| --- | --- | --- |
| IP | varchar(30) | Source IP address of the client |
| DayID | int | The date as an integer, for example 20190629 |
| Other fields | int | Descriptions of any other fields |

When a request arrives, the server queries the database to check whether a record with the current IP and date already exists. If it does, the request belongs to an existing UV; if not, it is a new UV and a record is inserted. The two steps can also be combined into a single SQL statement:

```sql
IF EXISTS (SELECT 1 FROM [table] WHERE IP = @ip AND DayID = @dayId)
BEGIN
    RETURN 0
END
ELSE
BEGIN
    INSERT INTO [table] ...
END
```
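As a sketch of how the application side might call this (the table name uv_record and all other names here are assumptions, and System.Data.SqlClient is just one possible client), the check-and-insert can be sent as one parameterized statement:

```csharp
using System.Data.SqlClient; // classic ADO.NET client for SQL Server

static class DbUvRecorder
{
    // Hypothetical sketch: record the request as a UV only if (IP, DayID) is new.
    public static void RecordUv(string connectionString, string ip, int dayId)
    {
        const string sql = @"
            IF NOT EXISTS (SELECT 1 FROM uv_record WHERE IP = @ip AND DayID = @dayId)
                INSERT INTO uv_record (IP, DayID) VALUES (@ip, @dayId);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@ip", ip);
            cmd.Parameters.AddWithValue("@dayId", dayId);
            conn.Open();
            cmd.ExecuteNonQuery(); // one round trip per request
        }
    }
}
```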

Almost all database-based solutions run into bottlenecks as data volumes grow. Facing tens of millions of UV records per day, this database-centered approach is unlikely to be optimal.

Optimization approach

For every system we design, we should think about the specific business. UV statistics as a business has several characteristics:

  1. Each request needs to determine whether the same UV record already exists
  2. Persisting UV data cannot affect normal business
  3. The accuracy of UV data can tolerate some degree of error

Hash table

In the database-based solution, one cause of the performance bottleneck under large data volumes is the check for whether the same record already exists, so any optimization has to start there. Can you think of a data structure that solves this? Yes: a hash table. Looking up a value by key in a hash table is O(1), constant time, which removes the lookup bottleneck perfectly.

When the UV volume is relatively small, a hash table may be a good choice. But facing tens of millions of UVs, hash collisions, resizing, and the memory the table occupies make it less attractive. Assuming each key-value pair takes about 10 bytes, 10 million UV records occupy roughly 100 MB. For a modern computer 100 MB isn't that much, but is there a better solution?
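A minimal in-memory sketch of this idea (a per-day HashSet standing in for the hash table; all names are illustrative):

```csharp
using System.Collections.Generic;

// Sketch: in-memory deduplication with a hash table, one instance per natural day.
class HashTableUvCounter
{
    readonly HashSet<string> seen = new HashSet<string>();

    // Returns true if this request is a new UV for the day.
    // HashSet.Add is O(1) on average and returns false for duplicates.
    public bool Record(string ip) => seen.Add(ip);

    public int UvCount => seen.Count;
}
```

At 10 million UVs the set has to hold every distinct IP string in memory, which is where the rough 100 MB estimate above comes from.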

Optimize the hash table

A hash-table-based solution is just about acceptable at tens of millions of records, but what about a billion? Is there a better way to quickly determine whether a record exists among a billion UVs? (I will set persistence aside for now; persistence is optimized by splitting the data across tables and databases, and we will come back to it later.)

To minimize memory usage, we can pre-allocate an array of bits whose length is a multiple of the maximum expected data volume; the multiple is up to you. Suppose the system's maximum UV volume is 10 million: the system can pre-allocate a bit array of length 50 million. A bit is the smallest unit of memory, occupying just one bit. For each piece of data, compute its hash with a low-collision hash function and set the corresponding position of the bit array to 1.

Because hash functions collide, different data can produce the same hash value and cause false positives. However, we can run the same data through several different hash functions to produce several hash values and set all of those array positions to 1, which greatly reduces the false-positive rate. The array is sized at a multiple of the maximum data volume precisely to reduce collisions: the larger the array, the fewer the collisions. This structure is the classic Bloom filter. At the 10-million-UV level, a 50-million-bit array occupies only about 6 MB, far less than a hash table, and the gap grows even larger at the billion level.
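For reference, the expected false-positive rate of such a bit array (this is the standard Bloom-filter estimate, not part of the original text) with $m$ bits, $k$ hash functions, and $n$ inserted keys is approximately

$$p \approx \left(1 - e^{-kn/m}\right)^k$$

Plugging in the test setup used below ($m = 2 \times 10^8$ bits, $k = 4$, $n = 10^7$) gives $p \approx (1 - e^{-0.2})^4 \approx 1.1 \times 10^{-3}$, which lines up with the measured rate of roughly 0.7‰.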

Here is some example code:

```csharp
using System.Collections;

class BloomFilter
{
    BitArray container = null;

    public BloomFilter(int length)
    {
        container = new BitArray(length);
    }

    public void Set(string key)
    {
        container[Hash1(key)] = true;
        container[Hash2(key)] = true;
        container[Hash3(key)] = true;
        container[Hash4(key)] = true;
    }

    public bool Get(string key)
    {
        return container[Hash1(key)]
            && container[Hash2(key)]
            && container[Hash3(key)]
            && container[Hash4(key)];
    }

    // Hash function 1: DJB-style string hash
    int Hash1(string key)
    {
        int hash = 5381;
        char[] bitarray = key.ToCharArray();
        int count = bitarray.Length;
        while (count > 0)
        {
            hash += (hash << 5) + bitarray[bitarray.Length - count];
            count--;
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }

    // Hash function 2: BKDR-style hash; a salt is appended so the four
    // functions yield independent values for the same key
    int Hash2(string key)
    {
        int seed = 131; // 31 131 1313 13131 131313 etc.
        int hash = 0;
        char[] bitarray = (key + "key2").ToCharArray();
        int count = bitarray.Length;
        while (count > 0)
        {
            hash = hash * seed + bitarray[bitarray.Length - count];
            count--;
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }

    // Hash function 3: ELF-style hash with its own salt
    int Hash3(string key)
    {
        int hash = 0;
        char[] bitarray = (key + "keykey3").ToCharArray();
        for (int i = 0; i < bitarray.Length; i++)
        {
            if ((i & 1) == 0)
                hash ^= (hash << 7) ^ bitarray[i] ^ (hash >> 3);
            else
                hash ^= ~((hash << 11) ^ bitarray[i] ^ (hash >> 5));
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }

    // Hash function 4: same scheme as Hash1 but with a different salt
    int Hash4(string key)
    {
        int hash = 5381;
        char[] bitarray = (key + "keykeyke4").ToCharArray();
        int count = bitarray.Length;
        while (count > 0)
        {
            hash += (hash << 5) + bitarray[bitarray.Length - count];
            count--;
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }
}
```

The test procedure is as follows:

```csharp
BloomFilter bf = new BloomFilter(200000000); // 200 million bits, about 25 MB
int existNumber = 0;
int notExistNumber = 0;

for (int i = 0; i < 10000000; i++)
{
    string key = $"ip_{i}";
    if (bf.Get(key))
    {
        existNumber += 1;     // false positive: this key was never inserted
    }
    else
    {
        bf.Set(key);
        notExistNumber += 1;
    }
}

Console.WriteLine($"Judged as existing: {existNumber}");
Console.WriteLine($"Judged as not existing: {notExistNumber}");
```

Test results:

```
Judged as existing: 7017
Judged as not existing: 9992983
```

Memory usage is about 40 MB, and the false-positive rate is below 1‰ (7,017 out of 10 million), which is acceptable in this business scenario. In a real system you would not allocate such a large bit array at startup; instead it would grow to that size as the data volume, and with it the risk of collisions, increases.

Asynchronous optimization

The next step is persisting the data to the DB. If the data volume is large or arrives in bursts, consider using MQ, or a NoSQL store with high read/write throughput, instead of inserting directly into a relational database.

The entire UV pipeline can also be made asynchronous, and this is the recommended approach.
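A minimal in-process sketch of this asynchronous shape (a BlockingCollection stands in for a real MQ here, and every name is illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch: the request path only enqueues; a background consumer batches
// new-UV records toward storage (a real system would publish to MQ/NoSQL).
class AsyncUvPersister
{
    readonly BlockingCollection<string> queue = new BlockingCollection<string>();

    public AsyncUvPersister()
    {
        Task.Run(() => Consume());
    }

    // Called on the request path after the Bloom filter reports a new UV;
    // it never blocks on the database.
    public void Enqueue(string ip) => queue.Add(ip);

    void Consume()
    {
        var batch = new List<string>(1000);
        foreach (var ip in queue.GetConsumingEnumerable())
        {
            batch.Add(ip);
            if (batch.Count >= 1000) // flush in batches to reduce write pressure
            {
                PersistBatch(batch);
                batch.Clear();
            }
        }
        // A real consumer would also flush the remainder on a timer or at shutdown.
    }

    void PersistBatch(List<string> ips)
    {
        // Placeholder: bulk insert into the DB, or publish to MQ/NoSQL.
        Console.WriteLine($"persisting {ips.Count} UV records");
    }
}
```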
