This is the fifth article in this series. In the last article, we introduced the PV/UV implementation and the calculation logic of the program. In this article, we introduce how to implement the retention metric on the Spark + HBase architecture.
As usual, Big Pig prefers to keep the talk short: a good picture explains more than words, and Big Pig tries to deliver exactly that.
Detailed analysis process
- Big Pig registered a Jianshu account on the 25th after reading one of the articles, then wandered off on the 26th.
- On the 27th, he logged in to Jianshu again. Can you guess which day's retention this counts toward?
- Such a simple question, surely everyone can answer it.
The answer is: 2-day retention for the 25th cohort.
Huh? Big Pig, how did I get it wrong?
Don't panic. The current time is the 28th, and Spark + HBase has only computed data up to March 27th. Since Big Pig was the only visitor on the 27th, the 2-day retention count for the 25th cohort can only increase by 1. Now look at the next picture.
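The retention day here is simply the gap in days between the return visit and the registration date. A minimal sketch in plain Python (the year is hypothetical; the article only gives month and day):

```python
from datetime import date

def retention_day(reg_date: date, visit_date: date) -> int:
    """Which N-day retention a return visit counts toward: days elapsed since registration."""
    return (visit_date - reg_date).days

# Big Pig: registered on the 25th, came back on the 27th.
reg = date(2019, 3, 25)    # year assumed for illustration
visit = date(2019, 3, 27)
print(retention_day(reg, visit))  # 2 -> counts toward the 25th cohort's 2-day retention
```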
- On the 21st, a fan of Big Pig, Big Red Pig, registered a Jianshu account after reading the PV/UV article, cough…
- Another fan, Big Yellow Pig, registered on Jianshu after reading the small high-performance ETL article.
- Then both big fat pigs came back on March 28th.
- Now work out which retention days these visits count toward.
Big Pig, this time I get it: the 21st cohort gets 7-day retention and the 25th cohort gets 3-day retention. See, I said our readers are smart; this one is fairly easy to understand.
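The cohort calculation above can be sketched in plain Python (registration and visit dates are from the example; the year and the data layout are assumptions):

```python
from datetime import date
from collections import defaultdict

# Return visits on March 28th from two earlier registration cohorts (year assumed).
visits = [
    (date(2019, 3, 21), date(2019, 3, 28)),  # registered on the 21st
    (date(2019, 3, 25), date(2019, 3, 28)),  # registered on the 25th
]

# cohort (registration date) -> {retention day N: visitor count}
cohorts = defaultdict(lambda: defaultdict(int))
for reg, visit in visits:
    cohorts[reg][(visit - reg).days] += 1

print(dict(cohorts[date(2019, 3, 21)]))  # {7: 1} -> 7-day retention for the 21st
print(dict(cohorts[date(2019, 3, 25)]))  # {3: 1} -> 3-day retention for the 25th
```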
Next, let's look at how to implement retention as an algorithm, and try to design it as SQL.
Big Pig has already designed the retention table, and yours will surely look much the same; after all, we are like-minded partners.
A user table is needed to record each user's registration time, and many other metrics depend on it as well. The HBase table creation statement is as follows:
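The original statement was shown as an image; as a sketch, such a table could be created in the HBase shell roughly like this (the table name and column-family name are assumptions):

```
# Minimal HBase shell DDL sketch: a 'user' table with one
# column family 'info' to hold fields such as the registration time.
create 'user', 'info'
```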
Spark computes the newly registered users and writes them into the user table. This calculation must run before all the other metrics, as shown below. The yellow box marks the batch write.
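A rough sketch of that "extract registered users, then batch-write" step in plain Python (the event fields and the put layout are assumptions; the article does this with Spark and the HBase client):

```python
# Raw events: field names are illustrative assumptions.
events = [
    {"user": "u1", "type": "register", "time": "2019-03-21"},
    {"user": "u1", "type": "view",     "time": "2019-03-28"},
    {"user": "u2", "type": "register", "time": "2019-03-25"},
]

# Collect each user's registration time (first registration wins).
reg_time = {}
for e in events:
    if e["type"] == "register":
        reg_time.setdefault(e["user"], e["time"])

# Batch write: accumulate all puts and flush them in one round trip,
# analogous to the yellow-boxed bulk write into the HBase user table.
batch = [(user, {"info:reg_time": t}) for user, t in reg_time.items()]
print(batch)
```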
Now let's look at how the retention metric itself is calculated.
Since the user table is involved, this is really just UV deduplication with the user's registration time added on top:
The first box was already explained in the previous article on the PV/UV metric: it marks the user.
The second box follows roughly the same logic as the first: it batch-fetches each user's registration time.
The third box merges the data from the first two boxes, attaching the registration time to every record so that the retention SQL below can use it.
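The merge step can be sketched in plain Python (the record layout and field names are assumptions):

```python
# From the user table: user -> registration time (dates illustrative).
reg_time = {"u1": "2019-03-21", "u2": "2019-03-25"}

# Deduplicated UV records for the current day.
uv_records = [
    {"user": "u1", "visit_date": "2019-03-28"},
    {"user": "u2", "visit_date": "2019-03-28"},
]

# Attach each user's registration time to the visit record,
# so every row carries both dates for the retention SQL.
merged = [{**r, "reg_time": reg_time[r["user"]]} for r in uv_records]
print(merged)
```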
Sharp-eyed readers will spot the core retention algorithm: it is in the yellow box.
Why so many functions? The date_add function adds N days to a date, and the IF is simple: whenever a record's visit date matches the corresponding retention date, it is SUMmed into that retention metric.
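A runnable approximation of that single-pass query, using SQLite from the Python standard library (CASE WHEN stands in for Hive/Spark's IF, and `date(reg_time, '+N day')` for `date_add(reg_time, N)`; the table name, column names, and data are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uv (user TEXT, reg_time TEXT, visit_date TEXT)")
conn.executemany("INSERT INTO uv VALUES (?, ?, ?)", [
    ("u1", "2019-03-21", "2019-03-28"),  # came back 7 days after registering
    ("u2", "2019-03-25", "2019-03-28"),  # came back 3 days after registering
])

# One pass over the data computes every retention window at once:
# each SUM(CASE WHEN ...) column only counts rows matching its own offset.
row = conn.execute("""
    SELECT
      SUM(CASE WHEN visit_date = date(reg_time, '+3 day') THEN 1 ELSE 0 END) AS retain_3,
      SUM(CASE WHEN visit_date = date(reg_time, '+7 day') THEN 1 ELSE 0 END) AS retain_7
    FROM uv
""").fetchone()
print(row)  # (1, 1): one user retained at day 3, one at day 7
```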
Why write it this way? Because Spark can then compute all the retention metrics in a single job; savor it slowly and you will see the point. If you did not write it this way and instead wrote one SQL per retention window, wouldn't that be rather problematic?
In the next article, we will continue with the algorithms for other interesting metrics.
Source code portal => retention computation source