Using behavior data of Zhihu users, we ran the Apriori algorithm and found some interesting association rules. Below is a brief analysis.
Data collection
Where does the data come from? Not handed over by Zhihu, of course — it was collected by a web crawler. How the crawling was done is not the topic of this article.
Data processing
In the earlier crawler, for convenience of storage, the topics associated with each user and the answers under each topic were packed into one long string — which turned out to be a pit. Now, for modeling, some data processing is needed: extract the topic ids from the long strings with Python regular expressions and turn them into a one-to-many structured data frame. This step turns 3,220,712 rows of data into 36,856,177.
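As a sketch of that step (the packed string format and the column names are assumptions — the real crawler output may differ), extracting topic ids with a regular expression and exploding to one row per (user, topic) pair might look like:

```python
import re

import pandas as pd

# Hypothetical packed format: each user row stores all topic ids in one string,
# e.g. "topic:19551147(answers:12);topic:19550994(answers:3)".
raw = pd.DataFrame({
    "user_token": ["u1", "u2"],
    "topics_raw": [
        "topic:19551147(answers:12);topic:19550994(answers:3)",
        "topic:19552330(answers:7)",
    ],
})

# Pull every topic id out of the packed string with a regular expression.
raw["topic_id"] = raw["topics_raw"].apply(lambda s: re.findall(r"topic:(\d+)", s))

# One-to-many: explode the list column so each (user, topic) pair is its own row.
user_topic = raw.explode("topic_id")[["user_token", "topic_id"]]
print(user_topic)
```

Exploding a list column like this is exactly why the row count balloons by an order of magnitude.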
Association analysis
The association analysis is, of course, still done by calling an R package. But the data volume is too large: reading the full table into memory on a single machine would blow it up, to say nothing of Apriori's full-table scans and step-by-step iterative candidate generation... So we sample instead, taking 1,000,000 rows as a sample to run the model on.
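One detail worth noting when sampling for basket analysis: sampling whole users rather than individual rows keeps each user's topic "basket" intact. A small sketch (the table and column names here are stand-ins for the processed data):

```python
import pandas as pd

# Toy stand-in for the full user-topic table; in practice this would be
# read from the processed CSV (path and column names are assumptions).
user_topic = pd.DataFrame({
    "user_token": ["u1", "u1", "u2", "u2", "u3"],
    "topic": ["film", "food", "film", "travel", "music"],
})

# Sample whole users (not individual rows) so each sampled user's
# topic basket stays complete for Apriori.
sampled_users = pd.Series(user_topic["user_token"].unique()).sample(n=2, random_state=42)
sample = user_topic[user_topic["user_token"].isin(sampled_users)]
print(sample)
```

Row-level sampling would instead drop random items out of baskets and bias the support counts downward.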
library(readr)
library(arules)
library(arulesViz)
library(dplyr)

topic_info <- read_csv("E:/data/data/zhihu_topics.csv")
Encoding(topic_info$topic) <- "gbk"
user_topic_sample <- read_csv("E:/data/data/zhihu_user_topic_sample.csv")
user_topic_sample <- user_topic_sample %>% left_join(topic_info[, 1:2])
trans <- as(split(user_topic_sample$topic, user_topic_sample$user_token), "transactions")
# Thresholds inferred from the output below (every rule has support >= 0.10
# and confidence >= 0.50)
rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.5))
rules_sorted <- sort(rules, by = "lift")
inspect(head(rules_sorted, 50))

     lhs                               rhs                    support   confidence lift     count
[1]  {travel, food, psychology}     => {fashion}              0.1015915 0.7318048  3.065149 3479
[2]  {fitness, …}                   => {fashion}              0.1031099 0.6927604  2.901612 3531
[3]  {film, travel, psychology}     => {fashion}              0.1069937 0.6879459  2.881447 3664
[4]  {food, psychology}             => {home}                 0.1003066 0.5069362  2.868005 3435
[5]  {film, travel, food}           => {fashion}              0.1104687 0.6830986  2.861144 3783
[6]  {film, food, psychology}       => {fashion}              0.1116659 0.6745458  2.825320 3824
[7]  {fitness, psychology}          => {fashion}              0.1055921 0.6569767  2.751733 3616
[8]  {home}                         => {fashion}              0.1146153 0.6484388  2.715972 3925
[9]  {travel, psychology}           => {fashion}              0.1209228 0.6474359  2.711771 4141
[10] {fitness, travel}              => {fashion}              0.1037232 0.6473483  2.711404 3552
[11] {travel, food}                 => …
[12] {film, travel, fashion}        => {food}                 0.1104687 0.8419764  2.689440 3783
[13] {travel, fashion, psychology}  => {food}                 0.1015915 0.8401352  2.683559 3479
[14] {business}                     => {startup}              0.1386772 0.6043523  2.653679 4749
[15] {startup}                      => {business}             0.1386772 0.6089242  2.653679 4749
[16] {food, psychology}             => {fashion}              0.1250986 0.6322314  2.648088 4284
[17] {food, design}                 => {fashion}              0.1017667 0.6320276  2.647234 3485
[18] {film, fitness, food}          => {travel}               0.1030223 0.8275862  2.635608 3528
[19] {film, home}                   => {food}                 0.1067601 0.8175313  2.611357 3656
[20] {film, life}                   => {music}                0.1106731 0.6273796  2.605143 3790
[21] {design, psychology}           => {fashion}              0.1066433 0.6206662  2.599647 3652
[22] {travel, psychology}           => {education}            0.1022631 0.5475297  2.595536 3502
[23] {film, fashion, psychology}    => {food}                 0.1116659 0.8118896  2.593336 3824
[24] {food, fashion, psychology}    => {travel}               0.1015915 0.8120915  2.586262 3479
[25] {film, food, fashion}          => {travel}               0.1104687 0.8102377  2.580358 3783
[26] {film, travel, psychology}     => {food}                 0.1241349 0.7981600  2.549481 4251
[27] {home, psychology}             => {food}                 0.1003066 0.7958758  2.542185 3435
[28] {economics}                    => {business}             0.1366915 0.5831568  2.541385 4681
[29] {business}                     => {economics}            0.1366915 0.5956987  2.541385 4681
[30] {travel, psychology}           => {career development}   0.1016791 0.5444028  2.538890 3482
[31] {travel, fashion}              => {food}                 0.1232005 0.7948380  2.538870 4219
[32] {film, fitness, psychology}    => {food}                 0.1009490 0.7898104  2.522811 3457
[33] {food, psychology}             => {education}            0.1051248 0.5312869  2.518538 3600
[34] {film, business}               => {Internet}             0.1016207 0.6815511  2.518041 3480
[35] {startup, film}                => {Internet}             0.1006862 0.6791412  2.509137 3448
[36] {film, fitness, psychology}    => {travel}               0.1004818 0.7861549  2.503662 3441
[37] {film, fitness, travel}        => {food}                 0.1030223 0.7826087  2.499807 3528
[38] {health}                       => {life}                 0.1190539 0.6937213  2.498579 4077
[39] {film, design, psychology}     => {food}                 0.1091254 0.7806559  2.493570 3737
[40] {education}                    => {career development}   0.1122500 0.5321152  2.481586 3844
[41] {career development}           => {education}            0.1122500 0.5234918  2.481586 3844
[42] {film, fashion, psychology}    => {travel}               0.1069937 0.7779193  2.477434 3664
[43] {fitness, food}                => {travel}               0.1156373 0.7769276  2.474276 3960
[44] {food, psychology}             => {career development}   0.1046576 0.5289256  2.466711 3584
[45] {film, fitness}                => {fashion}              0.1102351 0.5883728  2.464387 3775
[46] {film, Internet}               => {business}             0.1016207 0.5648434  2.461576 3480
[47] {design, fashion}              => {food}                 0.1017667 0.7699956  2.459519 3485
[48] {fitness, fashion}             => {travel}               0.1037232 0.7721739  2.459137 3552
[49] {film, Internet}               => {startup}              0.1006862 0.5596494  2.457391 3448
[50] {food, fashion}                => {travel}               0.1232005 0.7705936  2.454104 4219
The association rule with the highest lift is {travel, food, psychology} => {fashion}, at more than 3x! In fact, many of these top-50 rules point at the "fashion" topic, which is quite striking.
plot(rules, method = "graph", control = list(type = "items"))
(Figure: network graph of the association rules)
Honestly, I am not quite sure how to interpret this rule graph.
As an extra, the top 100 topics by follower count:
Zhihu top-100 followed topics
The results of association analysis can be used for recommendation. Compared with collaborative filtering, it does not need to compute a pairwise similarity matrix. Moreover, collaborative filtering only yields similarity scores, while association rules come with support, confidence and lift, which are more interpretable. On the other hand, because collaborative filtering works on pairwise similarities, any new input can always be matched to its most similar neighbors and get a recommendation; association rules can only recommend when a matching rule exists, so their coverage is narrower than collaborative filtering's.
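As a toy sketch of that recommendation logic (the rule base and thresholds below are invented for illustration, loosely modeled on the output above): given a new user's topics, fire every rule whose left-hand side the user covers, and rank the candidates by lift.

```python
# Toy rule base in (lhs, rhs, lift) form -- illustrative values only.
rules = [
    ({"travel", "food", "psychology"}, "fashion", 3.07),
    ({"economics"}, "business", 2.54),
    ({"education"}, "career development", 2.48),
]

def recommend(user_topics, rules):
    """Return rhs topics of all rules whose lhs the user covers, best lift first."""
    fired = [(rhs, lift) for lhs, rhs, lift in rules
             if lhs <= user_topics and rhs not in user_topics]
    fired.sort(key=lambda x: -x[1])
    return [rhs for rhs, _ in fired]

print(recommend({"travel", "food", "psychology", "education"}, rules))
# A user who covers no lhs gets nothing back -- the coverage gap noted above.
print(recommend({"music"}, rules))
```

The second call returning an empty list is exactly the coverage limitation compared with collaborative filtering.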
Summary of steps
1. Convert the user-topic data into transactions.
2. Set the association rule parameters (support, confidence, etc.) and build the rules.
3. Sort the rules by a chosen metric (lift, support, etc.).
4. Inspect and visualize the rules.
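The steps above can be sketched end to end in plain Python (a minimal Apriori for illustration only — the real analysis used the arules package, and this naive version would be far too slow at the case study's scale):

```python
from itertools import combinations

def apriori_rules(baskets, min_support=0.5, min_conf=0.6):
    """Tiny Apriori: frequent itemsets level by level, then rules sorted by lift."""
    n = len(baskets)
    support = {}                              # frozenset -> support
    items = {i for b in baskets for i in b}
    level = [frozenset([i]) for i in items]   # level 1: single items
    while level:
        counts = {s: sum(1 for b in baskets if s <= b) / n for s in level}
        freq = {s: c for s, c in counts.items() if c >= min_support}
        support.update(freq)
        # Candidate generation: join frequent k-sets into (k+1)-sets.
        fs = list(freq)
        level = list({a | b for a in fs for b in fs if len(a | b) == len(a) + 1})
    # Build rules lhs -> rhs from every frequent itemset of size >= 2.
    rules = []
    for s in support:
        for k in range(1, len(s)):
            for lhs in map(frozenset, combinations(s, k)):
                rhs = s - lhs
                conf = support[s] / support[lhs]
                if conf >= min_conf:
                    lift = conf / support[rhs]
                    rules.append((set(lhs), set(rhs), support[s], conf, lift))
    return sorted(rules, key=lambda r: -r[4])

baskets = [frozenset(b) for b in (
    {"film", "food"}, {"film", "food", "travel"},
    {"film", "travel"}, {"food", "travel"},
)]
for lhs, rhs, sup, conf, lift in apriori_rules(baskets):
    print(lhs, "=>", rhs, round(sup, 2), round(conf, 2), round(lift, 2))
```

The while loop is the "step-by-step iterative calculation" mentioned earlier: each pass rescans every basket, which is why the algorithm gets painful on large data.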
Extensions
Fortunately I had just switched to a new machine; otherwise both the Python data processing and the Apriori run would have lagged terribly. This is just a case study. Apriori's computational cost on large data is also daunting, and the FP-growth algorithm is worth considering. Calling an R package to model on a single machine is fine while learning, but there is still a gap to industrial-level modeling; for data of this volume, running the algorithms on a distributed parallel computing platform such as Spark is the way to go.