1. Basic Concepts
1. Overview of the algorithm
Association rule mining lets us discover relationships between items in data sets, and it has many real-life applications. "Shopping basket analysis" is a common scenario: it discovers associations between goods from consumer transaction records, which can then drive bundle sales or related recommendations to increase sales. Association rule mining is therefore a very useful technique.
Association rules reflect the interdependence and correlation between one thing and others, and are often used in recommendation systems for physical stores and online e-commerce: by mining association rules from a database of customer purchase records, the aim is to find common buying habits, such as the probability of buying product B given a purchase of product A. Based on the mining results, shelf layouts can be adjusted and promotional bundles designed to boost sales. The classic application case is "beer and diapers".
Association analysis, also known as association mining, looks for frequent patterns, associations, correlations or causal structures among itemsets or object sets in transactional data, relational data or other information carriers, and can uncover interesting associations and correlations between itemsets in large amounts of data. A classic example of association analysis is shopping basket analysis: the process analyzes customers' buying habits by finding connections between the different items they put in their shopping baskets. By understanding which items are frequently purchased together, such associations help retailers develop marketing strategies. Other applications include price list design, product promotion, product placement and customer segmentation based on buying patterns.
Rules such as "the occurrence of some events implies the occurrence of others" can be mined from a database. For example, "67% of customers buy beer and diapers at the same time", so placing beer and diapers on nearby shelves or bundling them can improve a supermarket's service quality and profit. Another example: "students who do well in the C language course have an 88% chance of doing well in Data Structures", so we can improve teaching results by strengthening the C language course.
2. Application scenarios
01) Internet recommendation
Personalized recommendation: recommend related products to users in the interface
Combination coupons: issue coupons to users who buy a combination of items at the same time
Bundling: sell a group of related products together
02) Analysis of offline stores
Product configuration analysis: which products can be purchased together, and how to display and promote related products
Customer demand analysis: analyze customers' buying habits, purchase times, locations, etc.
03) Financial insurance
Design different service combinations to maximize profit through basket analysis; basket analysis can also be used to detect and prevent potentially unusual insurance combinations.
04) Risk control
Analyze accounts that act simultaneously and find effective policy combinations
3. Several concepts
The three core concepts of association rules are support, confidence and lift. The classic beer-and-diapers example is used to illustrate all three. Below is the list of products purchased by five customers (the same five transactions used in the code example later):
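Order 1: milk, bread, diaper, beer, durian
Order 2: coke, bread, diaper, beer, jeans
Order 3: milk, diaper, beer, eggs, coffee
Order 4: bread, milk, diaper, beer, pajamas
Order 5: bread, milk, diaper, coke, chicken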
01) Support
Support: the ratio of the number of orders in which a product combination appears to the total number of orders.
In this example, "milk" appears in 4 orders, so the support of "milk" across these 5 orders is 4/5 = 0.8.
Similarly, "milk + bread" appears in 3 orders, so the support of "milk + bread" across these 5 orders is 3/5 = 0.6.
As an exercise, compute the support of "diaper + beer" by hand; the sketch below confirms the answer.
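These calculations are easy to check in code. A minimal sketch in plain Python (no mining library needed):

# Support: the fraction of orders that contain every item in the itemset
transactions = [
    {'milk', 'bread', 'diaper', 'beer', 'durian'},
    {'coke', 'bread', 'diaper', 'beer', 'jeans'},
    {'milk', 'diaper', 'beer', 'eggs', 'coffee'},
    {'bread', 'milk', 'diaper', 'beer', 'pajamas'},
    {'bread', 'milk', 'diaper', 'coke', 'chicken'},
]

def support(itemset):
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({'milk'}))            # 0.8
print(support({'milk', 'bread'}))   # 0.6
print(support({'diaper', 'beer'}))  # 0.8, the exercise above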
02) Confidence
Confidence: the probability of buying good B given that good A has been bought.
Confidence (milk → beer) = 3/4 = 0.75: of the 4 orders containing milk, 3 also contain beer, so this is the probability of buying beer given milk.
Confidence (beer → milk) = 3/4 = 0.75: the probability of buying milk given beer.
Confidence (beer → diapers) = 4/4 = 1.0: every order containing beer also contains diapers.
As these examples show, confidence is a conditional concept: the probability that B occurs given that A occurs, i.e. Confidence (A→B) = Support (A and B) / Support (A).
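Continuing the sketch above, confidence is just a ratio of two supports:

# Confidence(A -> B) = Support(A and B) / Support(A)
def confidence(a, b):
    return support(set(a) | set(b)) / support(a)

print(confidence({'milk'}, {'beer'}))    # 0.75
print(confidence({'beer'}, {'milk'}))    # 0.75
print(confidence({'beer'}, {'diaper'}))  # 1.0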
03) Lift
Lift: when making product recommendations or designing strategies, we mainly look at lift, because lift measures how much the appearance of product A raises the probability that product B appears.
Lift (A→B) = Confidence (A→B) / Support (B)
So there are three possibilities:
Lift (A→B) > 1: A lifts the probability of B;
Lift (A→B) = 1: A neither lifts nor lowers the probability of B;
Lift (A→B) < 1: A lowers the probability of B.
Lift (beer → diaper) = Confidence (beer → diaper) / Support (diaper) = 1.0/0.8 = 1.25
So beer lifts diapers, with a lift of 1.25. Intuitively: over the full set of orders, the probability of diapers is 0.8, while within the subset of orders containing beer, the probability of diapers is 1.0; restricting to that subset raises the probability of diapers.
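Lift follows the same pattern in the sketch; note that lift (milk → beer) comes out below 1, giving an example of the third case:

# Lift(A -> B) = Confidence(A -> B) / Support(B)
def lift(a, b):
    return confidence(a, b) / support(b)

print(lift({'beer'}, {'diaper'}))  # 1.25, beer lifts diapers
print(lift({'milk'}, {'beer'}))    # 0.9375, milk slightly lowers beer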
04) Frequent itemsets
Frequent itemset: an itemset whose support is greater than or equal to the minimum support (Min Support) threshold. Itemsets whose support is below the minimum are infrequent itemsets, and itemsets whose support is at or above it are frequent itemsets. An itemset can be a single good or a combination of goods.
Core ideas of the Apriori algorithm:
If an item set is frequent, then all subsets of it are frequent.
{Milk, Bread, Coke} is frequent → {Milk, Coke} is frequent
If an itemset is infrequent, then all of its supersets are infrequent.
{Battery} is infrequent → {Milk, Battery} is infrequent
For example, once {A,B} is found to be infrequent, its supersets {A,B,C}, {A,B,D} and so on must also be infrequent, so they can be ignored and never counted.
Using this pruning idea, the Apriori algorithm removes many infrequent itemsets and greatly simplifies the computation.
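Here is a minimal sketch of this level-wise generate-and-prune procedure, reusing the transactions list from the support sketch above; it illustrates the idea only, not the optimized library implementation used below:

from itertools import combinations

def apriori_frequent(transactions, min_support=0.6):
    """Level-wise search: keep frequent k-itemsets, build (k+1)-candidates
    from them, and prune any candidate with an infrequent subset."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    frequent = {1: {frozenset([i]) for i in items
                    if support(frozenset([i])) >= min_support}}
    k = 1
    while frequent[k]:
        # Join: merge pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in frequent[k] for b in frequent[k]
                      if len(a | b) == k + 1}
        # Prune: a candidate with any infrequent k-subset cannot be
        # frequent, so it is discarded before its support is ever counted
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[k]
                             for s in combinations(c, k))}
        frequent[k + 1] = {c for c in candidates if support(c) >= min_support}
        k += 1
    frequent.pop(k)  # drop the empty final level
    return frequent

# On the five orders above this finds, e.g., the frequent triple
# {beer, diaper, milk}, and never counts any superset of the infrequent {coke}
print(apriori_frequent(transactions, min_support=0.6))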
2. Algorithm introduction
Here is a Python example using the efficient-apriori package; R and other languages have corresponding packages, and the principle is the same.
pip install efficient-apriori  # install the package (shell command)

from efficient_apriori import apriori

# The five transactions
data = [('milk', 'bread', 'diaper', 'beer', 'durian'),
        ('coke', 'bread', 'diaper', 'beer', 'jeans'),
        ('milk', 'diaper', 'beer', 'eggs', 'coffee'),
        ('bread', 'milk', 'diaper', 'beer', 'pajamas'),
        ('bread', 'milk', 'diaper', 'coke', 'chicken')]

# Mine frequent itemsets and association rules
itemsets, rules = apriori(data, min_support=0.6, min_confidence=1)

print(itemsets)
# {1: {('beer',): 4, ('diaper',): 5, ('milk',): 4, ('bread',): 4},
#  2: {('beer', 'diaper'): 4, ('beer', 'milk'): 3, ('beer', 'bread'): 3,
#      ('diaper', 'milk'): 4, ('diaper', 'bread'): 4, ('milk', 'bread'): 3},
#  3: {('beer', 'diaper', 'milk'): 3, ('beer', 'diaper', 'bread'): 3,
#      ('diaper', 'milk', 'bread'): 3}}
itemsets[1]  # 1-item combinations that meet the thresholds
# {('beer',): 4, ('diaper',): 5, ('milk',): 4, ('bread',): 4}

itemsets[2]  # 2-item combinations that meet the thresholds
# {('beer', 'diaper'): 4, ('beer', 'milk'): 3, ('beer', 'bread'): 3,
#  ('diaper', 'milk'): 4, ('diaper', 'bread'): 4, ('milk', 'bread'): 3}

itemsets[3]  # 3-item combinations that meet the thresholds
# {('beer', 'diaper', 'milk'): 3, ('beer', 'diaper', 'bread'): 3,
#  ('diaper', 'milk', 'bread'): 3}

print(rules)
# [{beer} -> {diaper}, {milk} -> {diaper}, {bread} -> {diaper},
#  {beer, milk} -> {diaper}, {beer, bread} -> {diaper},
#  {milk, bread} -> {diaper}]
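Each mined rule also carries its own statistics; here is a short sketch, assuming the Rule objects returned by efficient-apriori expose lhs, rhs, confidence and lift attributes as described in the package's documentation:

# Sort the rules by lift, the measure used above for recommendations
for rule in sorted(rules, key=lambda r: r.lift, reverse=True):
    print(rule.lhs, '->', rule.rhs,
          'confidence = %.2f' % rule.confidence,
          'lift = %.2f' % rule.lift)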
3. Mining examples
Every director has his own casting preferences: Stephen Chow has his "Xing girls", Zhang Yimou has his "Mou girls", and Gong Li often appeared in Zhang Yimou's films. Taking director Ning Hao as an example, let's analyze his preferences when choosing actors. No public data set was available, so the cast lists were partly scraped by hand, roughly as follows; some lists were a bit long, so they have been simplified for the analysis.
As you can see, there are nine movies in total, so when we compute support the denominator is nine.
# Convert the movie data to lists: each row is the cast of one Ning Hao film
data = [['ge you', 'Huang Bo', 'corporate dilettantes', 'Deng Chao', 'Shen Teng', 'Zhang Zhanyi', 'Wang Baoqiang', 'Xu Zheng', 'ni yan', 'Mary'],
        ['Huang Bo', 'Zhang Yi', 'Han Haolin', 'study on modern hotel groups', 'ge you', 'Liu Haoran', 'song jia', 'Wang Qianyuan', 'Ren Suxi', 'Wu Jing'],
        ['guo tao', 'c', 'even jin', 'Huang Bo', 'Xu Zheng', 'optimal travelling', 'Roland', 'Wang Xun'],
        ['Huang Bo', 'shu qi', 'Wang Baoqiang', 'Zhang Yixing', 'Yu He Wei', 'Wang Xun', 'Li Qinqin', 'Li You-lin', 'ning hao', 'hu guan', 'nicole', 'Xu Zheng', 'Chen Sen', 'zhang'],
        ['Huang Bo', 'Shen Teng', 'Tom Perfrey', 'Matthew Morrison', 'Xu Zheng', 'Yu He Wei', 'Lei Jia Yin', 'c', 'deng fei', 'Michael Tsai', 'ge wang', 'Kate Nelson', 'Wang Yanwei', 'yi road'],
        ['Xu Zheng', 'Huang Bo', 'more than male', 'Dobje', 'Wang Shuangbao', 'and more', 'Yang Xinming', 'Guo Hong', 'hong tao', 'Huang Jing yi', 'yan-fang', 'maike'],
        ['Huang Bo', 'cheung yung', 'nine holes', 'Xu Zheng', 'Wang Shuangbao', 'and more', 'Dong Lifan', 'jack kao', 'Ma Shaohua', 'Wang Xun', 'liu', 'WorapojThuantanon', 'zhao ran', 'Li Qilin', 'Jiang Zhigang', 'Wang Lu', 'ning hao'],
        ['Huang Bo', 'Xu Zheng', 'wintenberger e', 'Zhou Dongyu', 'TaoHui', 'Yue Xiaojun', 'Shen Teng', 'Zhang Li', 'Ma Su', 'Liu Mei Ham', 'Wang Yanhui', 'Jiao Junyan', 'guo tao'],
        ['Lei Jia Yin', 'hong tao', 'Cheng Yuanyuan', 'Keiichi Yamazaki', 'guo tao', 'corporate dilettantes', 'sun', 'c', 'Huang Bo', 'Yue Xiaojun', 'Fu Heng', 'wang wen', 'Yang Xinming']]
# Apply the algorithm; min_support=0.6 keeps only actors who appear in
# at least 6 of the 9 films
itemsets, rules = apriori(data, min_support=0.6, min_confidence=1)

print(itemsets)
# {1: {('Xu Zheng',): 7, ('Huang Bo',): 9}, 2: {('Xu Zheng', 'Huang Bo'): 7}}

print(rules)
# [{Xu Zheng} -> {Huang Bo}]
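The percentages quoted below are easy to verify directly from the raw lists:

n = len(data)  # 9 films
huang = sum(1 for cast in data if 'Huang Bo' in cast)
xu = sum(1 for cast in data if 'Xu Zheng' in cast)
both = sum(1 for cast in data if 'Huang Bo' in cast and 'Xu Zheng' in cast)

print(huang / n)  # 1.0   -> 100% support for Huang Bo
print(xu / n)     # ~0.78 -> 78% support for Xu Zheng (7 of 9)
print(both / xu)  # 1.0   -> confidence(Xu Zheng -> Huang Bo) = 100%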
From the above analysis, it can be seen that:
In Ning Hao's films, Huang Bo and Xu Zheng are used the most: Huang Bo appears 9 times (100% support), Xu Zheng appears 7 times (78% support), and ('Xu Zheng', 'Huang Bo') appear together 7 times, giving 100% confidence. It seems that wherever there is Xu Zheng, there must be Huang Bo; Ning Hao really does stick with his golden partners.
Of course, the data set here is small enough to inspect by eye; this is only an analysis case to consolidate the basics. When data is too large for the human eye to perceive directly, mining and discovery by algorithm becomes especially meaningful.