1. Written at the front

If you want to work in data mining or machine learning, you need to master the common machine learning algorithms, which include:

  • Supervised learning algorithms: logistic regression, linear regression, decision trees, Naive Bayes, K-nearest neighbors, support vector machines, ensemble algorithms such as AdaBoost, etc.
  • Unsupervised learning algorithms: clustering, dimensionality reduction, association rules, PageRank, etc.

Already published:

  • Algorithm theory + practice: the K-nearest neighbors algorithm
  • Algorithm theory + practice: decision trees
  • Algorithm theory + practice: Naive Bayes
  • Algorithm theory + practice: support vector machines (SVM)
  • Algorithm theory + practice: the AdaBoost algorithm
  • Algorithm theory + practice: K-means clustering
  • Algorithm theory + practice: PCA dimensionality reduction
  • Algorithm theory + practice: EM clustering

To understand the principles in detail, I have read the watermelon book (Zhou Zhihua's Machine Learning), Statistical Learning Methods, and other machine learning texts, and listened to some machine learning courses, but the wording always felt abstruse and I lacked the patience to read them through. Theory is everywhere, yet practice is what matters most, so I want to use the simplest possible language to write this plain-language theory + practice series on machine learning algorithms.

In my opinion, understanding the idea behind an algorithm and how to use it matters more than understanding its mathematical derivation. The idea gives you an intuitive feel for why the algorithm is reasonable; the mathematical derivation merely expresses that reasonableness in more rigorous language. For example, you can express the sweetness of a pear mathematically as "a sugar content of 90%", but only by taking a bite yourself can you truly feel how sweet it is, and only then does that 90% figure really mean something. If the algorithm is the pear, the primary purpose of this series is to take a bite out of it. There are a few other purposes as well:

  • Test my own understanding of each algorithm and summarize its theory.
  • Learn the core ideas of these algorithms in an enjoyable way, find the fun in them, and lay a foundation for studying them in depth.
  • Pair the theory in each article with a practical case, so the knowledge can actually be applied; this exercises programming ability and also deepens the grasp of the theory.
  • Collect all my previous notes and references in one place, so they are easy to look up later.

Learning an algorithm should bring not only the theory, but also fun and the ability to solve practical problems!

Today is another installment of the plain-language machine learning theory + practice series, and the topic is association rule mining. After today's study, you should grasp the basic principles of the Apriori and FP-growth association rule mining algorithms, and learn how to apply them through the small hands-on exercises that follow.

The outline is as follows:

  • Understand several important concepts in association rules (frequent itemsets, support, confidence, lift)
  • The Apriori algorithm for association rules
  • The FP-growth algorithm for association rules
  • Hands-on practice with mlxtend and efficient_apriori

OK, let's go!

2. Association rule mining: starting again from beer and diapers

In case the rest of the article feels a bit abstract, let's first look at a classic case from data mining: beer and diapers.


In American families with babies, the mother stays at home while the young father goes to the supermarket to buy diapers. Fathers often buy beer for themselves when they buy diapers, so beer and diapers, two seemingly unrelated items, often end up in the same shopping basket. If the young father can only buy one of the two items at a store, he will probably give up and go to another store until he can buy both beer and diapers at the same time. Walmart noticed this peculiar phenomenon and began placing beer and diapers in the same section of the store, so that young fathers could find both items at once and finish shopping quickly, and Walmart got those customers to buy two items per trip instead of one, achieving good sales. That is the beer and diapers story.

How do you feel after reading this story? Isn't it a little strange? Beer and diapers, two things that sound completely unrelated, can actually be correlated, and by digging out the relationship between them, sales can be improved. That is association rule mining.

Once you master the technique, you can not only analyze customers' purchase records in a supermarket to find which goods sell better when placed together, but also analyze cross-selling of bank financial products and the connections between the apps on users' phones. You can even crawl a director's movie data and analyze which actors the director likes to use, and which actors often appear in films together.

Ha ha, sounds tempting, doesn't it? But first you need some background, such as: what exactly are association rules?

  • The concept of association rules was first proposed by Agrawal et al. in 1993.
  • Association rule mining lets us discover relationships between items in a data set, and it has many application scenarios in daily life. "Shopping basket analysis" is a common one: it discovers associations between goods from consumers' transaction records, which in turn can drive more sales through bundled selling or related-item recommendations.

But before we look at algorithms, we need to understand a few concepts, because these algorithms rely on these criteria to select association rules.

3. Several key concepts of association rules

To keep things plain, let's introduce them through an example. Throughout this section we will use the following supermarket shopping records (the same five orders as in the hands-on part later):

Order 1: milk, bread, diapers
Order 2: Cola, bread, diapers, beer
Order 3: milk, diapers, beer, eggs
Order 4: bread, milk, diapers, beer
Order 5: bread, milk, diapers, Cola

3.1 Frequent itemsets

Frequent itemsets are sets of items that often appear together, such as {beer, diapers, milk} in the data above. The association rule diapers → beer can also be found in this data set, meaning that if someone buys diapers, there is a good chance he will also buy beer. So how do we define and quantify frequent itemsets and association rules? This is where support and confidence come in.

3.2 Support

Support is a percentage: the ratio of the number of orders containing a particular product combination to the total number of orders. The higher the support, the more frequently the combination occurs.


For example, in the data above, "milk" appears in 4 of the 5 orders, so the support of "milk" is 4/5 = 0.8. Similarly, "milk + bread" appears in 3 orders, so the support of "milk + bread" in these 5 orders is 3/5 = 0.6.
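To make the definition concrete, here is a tiny Python sketch (my own illustration, not part of any toolkit) that counts support over the five orders above; the helper name support() is only for this example:

data = [{'milk', 'bread', 'diapers'},
        {'coke', 'bread', 'diapers', 'beer'},
        {'milk', 'diapers', 'beer', 'eggs'},
        {'bread', 'milk', 'diapers', 'beer'},
        {'bread', 'milk', 'diapers', 'coke'}]

def support(itemset, transactions):
    # fraction of transactions that contain every item in the itemset
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({'milk'}, data))           # 0.8
print(support({'milk', 'bread'}, data))  # 0.6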

3.3 Confidence

Confidence simply says: given that you bought item A, what is the probability that you will also buy item B? It is in fact a conditional probability.

  • Confidence (milk → beer) = 2/4 = 0.5, which means that if you buy milk, there is a 0.5 probability that you will also buy beer.
  • Confidence (beer → milk) = 2/3 ≈ 0.67, which means that if you buy beer, there is a 0.67 probability that you will also buy milk.


In the data above, beer appears in 2 of the 4 orders that contain milk, so confidence (milk → beer) = 2/4 = 0.5. Similarly, milk appears in 2 of the 3 orders that contain beer, so confidence (beer → milk) = 2/3 ≈ 0.67.

3.4 Lift

When making product recommendations, we focus on lift, because lift measures the degree to which "the appearance of product A raises the probability that product B appears".


Look at the data again. Confidence (Cola → diapers) = 1, which says that whenever Cola appears, the user also buys diapers. Does that mean that when a user buys Cola we should recommend diapers?

In fact, even users who don't buy Cola still buy diapers (diapers appear in every order), so buying Cola has no real effect on buying diapers. In other words, don't recommend a hot item on the basis of an unpopular one.

We can use the following formula to calculate the lift of item A on item B:


Lift (A → B) = Confidence (A → B) / Support (B). This formula measures whether the occurrence of A raises the probability of B. In the example, Lift (Cola → diapers) = 1 / 1 = 1, so Cola gives diapers no lift at all.

So lift has three possible ranges (a short code sketch follows the list):

  • Lift (A → B) > 1: A raises the probability of B;
  • Lift (A → B) = 1: A neither raises nor lowers the probability of B;
  • Lift (A → B) < 1: A lowers the probability of B.
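Continuing the small sketch from the support section (same data list and support() helper, both only for illustration), confidence and lift fall out directly:

def confidence(a, b, transactions):
    # P(B | A) = support(A and B) / support(A)
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    # confidence(A -> B) / support(B)
    return confidence(a, b, transactions) / support(b, transactions)

print(confidence({'milk'}, {'beer'}, data))  # 0.5
print(lift({'coke'}, {'diapers'}, data))     # 1.0 -> Cola gives diapers no lift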

These are the key concepts of association rule mining. Building on them, several mining algorithms have been developed; the best known are the Apriori algorithm and the FP-growth algorithm. Let's look at them in turn.

4. The working principle of Apriori

Having understood the important concepts of support, confidence and lift in association rules, let's take a look at how the Apriori algorithm works.

The Apriori algorithm is essentially a procedure for finding frequent itemsets. A frequent itemset is defined relative to a minimum support: any itemset whose support is greater than or equal to the minimum support is frequent. An itemset can be a single item or a combination of items. Let's again use the supermarket example to illustrate.


We replace the goods in the example with IDs: milk, bread, diapers, Cola, beer and eggs are numbered 1 to 6 respectively, so the data table becomes:

Order 1: 1, 2, 3
Order 2: 4, 2, 3, 5
Order 3: 1, 3, 5, 6
Order 4: 2, 1, 3, 5
Order 5: 2, 1, 3, 4

Suppose we arbitrarily set the minimum support to 50%, i.e. 0.5. Let's walk through how the Apriori algorithm proceeds.


First we compute the support of every single item, i.e. the itemsets with K = 1:

  • {1} milk: 4/5, {2} bread: 4/5, {3} diapers: 5/5, {4} Cola: 2/5, {5} beer: 3/5, {6} eggs: 1/5

Since the minimum support is 0.5, items 4 and 6 do not meet it and are not frequent, so the filtered frequent itemsets become {1}, {2}, {3}, {5}. On this basis we combine the remaining items in pairs and compute the support of the K = 2 itemsets:

  • {1,2}: 3/5, {1,3}: 4/5, {1,5}: 2/5, {2,3}: 4/5, {2,5}: 2/5, {3,5}: 3/5

Filtering out the combinations whose support is below the minimum leaves {1,2}, {1,3}, {2,3}, {3,5}. We then form the K = 3 combinations:

  • {1,2,3}: 3/5, {1,3,5}: 2/5, {2,3,5}: 2/5

and after filtering out the combinations below the minimum support, only {1,2,3} remains.

Through this procedure we obtain the K = 3 frequent itemset {1,2,3}, i.e. {milk, bread, diapers}.

The above walks through the entire Apriori process. To recap, the Apriori algorithm works as follows (a minimal code sketch follows the steps):

  1. Set K = 1 and compute the support of every K-itemset;
  2. Filter out the itemsets whose support is below the minimum support;
  3. If the remaining set is empty, the frequent (K-1)-itemsets are the final result; otherwise set K = K + 1, form the K-item candidates from the remaining itemsets, and repeat steps 1-3.
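To make the loop above concrete, here is a minimal, unoptimized Apriori sketch in plain Python (my own illustration, not the toolkit implementations used later), run on the same five orders:

from itertools import combinations

data = [{'milk', 'bread', 'diapers'},
        {'coke', 'bread', 'diapers', 'beer'},
        {'milk', 'diapers', 'beer', 'eggs'},
        {'bread', 'milk', 'diapers', 'beer'},
        {'bread', 'milk', 'diapers', 'coke'}]

def apriori_frequent_itemsets(transactions, min_support=0.5):
    n = len(transactions)
    # K = 1: every single item is a candidate
    candidates = list({frozenset([item]) for t in transactions for item in t})
    result = {}
    while candidates:
        # scan the data set and compute the support of every candidate
        supports = {c: sum(1 for t in transactions if c <= t) / n for c in candidates}
        frequent = {c: s for c, s in supports.items() if s >= min_support}
        if not frequent:
            break
        result.update(frequent)
        # join the surviving K-itemsets to build the (K+1)-item candidates
        keys = list(frequent)
        k = len(keys[0]) + 1
        candidates = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k})
    return result

for itemset, s in apriori_frequent_itemsets(data).items():
    print(set(itemset), s)

On this data the largest surviving itemset is {milk, bread, diapers} with support 0.6, matching the walk-through above.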

But Apriori has the following shortcomings in the calculation process:

  • It can generate a huge number of candidate sets, because it enumerates every possible combination of items;
  • Every round requires rescanning the data set to count the support of each candidate itemset.

So the Apriori algorithm can waste a lot of computing space and time. To address this, the FP-growth algorithm was proposed, with the following characteristics:

  • It builds an FP tree to store the frequent itemsets; before building the tree, items that do not meet the minimum support are deleted, which reduces storage. How to construct an FP tree is explained below;
  • The whole process only traverses the data set twice, which greatly reduces the amount of computation.

Let’s look at the FP-growth algorithm.

5. An improvement on Apriori: the FP-growth algorithm

FP-growth mines frequent itemsets in two stages: constructing the FP tree, and then mining frequent itemsets from it.

  1. Constructing the FP tree

    To construct the FP tree, first count the frequency of each element in the data set and delete the elements whose frequency is below the minimum support; the remaining elements are called frequent items. Then sort the items in each record by their frequency of occurrence. Next, insert each record of the updated data set into the FP tree, updating the header table along the way. The header table contains every frequent item and its frequency, plus, for each frequent item, a pointer to the next occurrence of the same element in the tree; it is mainly used when mining the FP tree. In the example, we first count the support of each item and remove the infrequent ones, then sort each record by item frequency.

Now let's build the FP tree step by step, inserting one order at a time:

  • Scan 1: insert the first order
  • Scan 2: insert the second order
  • Scan 3: insert the third order
  • Scan 4: insert the fourth order
  • Scan 5: insert the fifth order

That is the complete FP tree building process (figure: the final FP tree).

  2. Mining frequent itemsets from the FP tree

    At this point we have an FP tree that stores the frequent itemsets, together with a header table, and we can mine the frequent itemsets for each item through the header table.

    The key tool is a concept called the "conditional pattern base": take the node to be mined as the leaf nodes, extract the corresponding FP subtree bottom-up, and set the count of each ancestor node in that subtree to the sum of the counts of its leaf nodes.

    Take the "beer" node as an example. From the FP tree we can extract an FP subtree, recording the support of each ancestor node as the sum of its leaf nodes, which gives the conditional FP subtree for beer.


In this subtree you can see that the counts for diapers and milk are lower than in the original FP tree. This is because the subtree is rooted on "beer", i.e. beer must be included in the frequent itemset. Order 1 {milk, bread, diapers} and Order 5 {milk, bread, diapers, Cola} contain no "beer", so the diapers → milk → bread path contributed by order 1 is excluded, and the same path from order 5 is excluded as well. That is why, in the FP subtree rooted on "beer", the counts on the diapers/milk/bread path are 2 less than before.

The conditional pattern base itself does not include the "beer" node, and ancestor nodes whose count falls below the minimum support are pruned, so the conditional pattern base of "beer" ends up empty. In the same way we can obtain the conditional pattern base of "bread", from which the frequent itemsets for bread are {diapers, bread} and {diapers, milk, bread}. Similarly, we could derive the frequent itemsets for milk and for diapers, which we won't show here.

This is how FP-growth works.
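If you don't want to hand-roll the tree, note that recent versions of the mlxtend package used in the next section also provide an fpgrowth function with the same interface as its apriori function, so once you have the wide-table data (new_data, built in section 6.1 below) you can do, for example:

from mlxtend.frequent_patterns import fpgrowth, association_rules

# same call pattern as apriori(); only the mining algorithm underneath differs
frequent_itemset = fpgrowth(new_data, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemset, metric='confidence', min_threshold=1)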

6. Warm-up before the project

This is the warm-up before the project. Here I mainly show how to use toolkits to run the Apriori algorithm on a shopping basket (there are two libraries that implement Apriori; both are used below and their differences are discussed). Then the idea of shopping basket analysis is transferred to real supermarket orders, cross-selling of bank financial products, and association mining of mobile apps; the tasks are basically the same.

6.1 Shopping basket analysis

First, construct the data set:

data = [['milk', 'bread', 'diapers'],
        ['coke', 'bread', 'diapers', 'beer'],
        ['milk', 'diapers', 'beer', 'eggs'],
        ['bread', 'milk', 'diapers', 'beer'],
        ['bread', 'milk', 'diapers', 'coke']]

There are several implementations of the Apriori algorithm; here are two:

  1. Use the apriori function in the mlxtend package for association analysis. With this package, apriori and association_rules are used together, and the data must first be converted into a specific format (a "wide table").
import pandas as pd

from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

TransactionEncoder is used for the data conversion: it turns the list above into a wide table. What does a wide table look like?

"""Data conversion"""
transEn = TransactionEncoder()
oht_ary = transEn.fit_transform(data)
new_data = pd.DataFrame(oht_ary, columns=transEn.columns_)
new_data

The data then looks like the wide table below: every item becomes a column, and each order row holds True (1) if the item was purchased and False (0) otherwise:

    beer  bread   coke  diapers   eggs   milk
0  False   True  False     True  False   True
1   True   True   True     True  False  False
2   True  False  False     True   True   True
3   True   True  False     True  False   True
4  False   True   True     True  False   True

Once the data is in this form, Apriori can process it, and the rest is very simple, just two steps:

  • Step 1: Calculate frequent itemsets, where the minimum support can be defined to filter frequent itemsets:
"""Compute frequent itemsets"""
frequent_itemset = apriori(new_data, min_support=0.5, use_colnames=True)
frequent_itemset

The resulting frequent itemsets (support ≥ 0.5) are: {beer}, {bread}, {diapers}, {milk}, {beer, diapers}, {bread, diapers}, {bread, milk}, {diapers, milk} and {bread, diapers, milk}.

  • Step 2: Mine the association rules; the filtering metric can be confidence or lift:
rules = association_rules(frequent_itemset, metric='confidence', min_threshold=1)
rules

The association rules mined here are {beer} → {diapers}, {milk} → {diapers}, {bread} → {diapers}, and {milk, bread} → {diapers}.

Here is a summary of this method:


This approach typically uses three functions:

  • TransactionEncoder(): converts the data into a wide table
  • apriori(): requires a minimum support
  • association_rules(): requires a filtering metric (confidence, lift or support)

Advantages: the final rules table lists support, confidence, lift and other metrics in full detail, clear at a glance. Disadvantages: the data must be in a special format, which takes extra preprocessing, and frequent itemset mining and rule mining are split across two separate functions, which is a bit more cumbersome to use.
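As a small usage note: the rules returned by association_rules form a pandas DataFrame, so you can select just the metrics you care about, for example (column names as in recent mlxtend versions):

rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]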

  2. The efficient-apriori toolkit. This package is simpler to use: a single line of code finds both the frequent itemsets and the association rules, and no special preprocessing of the data is required.
itemsets, rules = apriori(data, min_support, min_confidence)


Here data is the data set we supply, a list of transactions. The min_support parameter is the minimum support; in efficient-apriori it is a value between 0 and 1 representing a percentage, e.g. 0.5 means a minimum support of 50%. min_confidence is the minimum confidence, also expressed as a fraction, e.g. 1 means 100%.

Use this package to implement the following:

from efficient_apriori import apriori

# set up the data
data = [('milk', 'bread', 'diapers'),
        ('coke', 'bread', 'diapers', 'beer'),
        ('milk', 'diapers', 'beer', 'eggs'),
        ('bread', 'milk', 'diapers', 'beer'),
        ('bread', 'milk', 'diapers', 'coke')]

itemsets, rules = apriori(data, min_support=0.5, min_confidence=1)
print(itemsets)
print(rules)

The result is the same set of association rules as above. The advantage of this approach is that it is simple to use, and the efficient-apriori toolkit treats each transaction as a set, ignoring the order of items within it; in practice, items in the same basket do not need to be purchased in any particular order. Some other Apriori implementations take the order into account and may therefore compute frequent itemsets incorrectly, which is why efficient-apriori is used here.

6.2 Transferring the idea to other examples

The association rule method above can be transferred to many other tasks, such as real supermarket order data, cross-selling of bank financial products, associations between mobile apps, and so on. How do you migrate it? The work is mainly in the data processing: how to convert ordinary data into the wide table form.

6.2.1 Real supermarket order data

Supermarket order data generally has one record per purchase, containing an order number, the purchased item and its quantity. This kind of data needs to be converted to a wide table, but before that it has to be grouped and aggregated, which the following line of code does:

new_data = data.groupby(['Order Number', 'Buy goods'])['number'].sum().unstack().reset_index().fillna(0).set_index('Order Number')
new_data

Let’s map the values to form a wide table:

def encode_unit(x):
    if x <= 0:
        return 0
    if x >=1 :
        return 1

new_data = new_data.applymap(encode_unit)
new_data

Now the data is in wide table form, and the rest is the same as the shopping basket analysis: compute the frequent itemsets and mine the association rules:

frequent_itemset = apriori(new_data, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemset, metric='confidence', min_threshold=1)

6.2.2 Cross-selling of bank financial products

The bank data is similar: each record contains a customer ID (CSR_ID) and a financial product (FIN_PROD). The procedure is the same as above: group and aggregate, convert to a wide table, then mine.

bankset = bank.groupby(['CSR_ID', 'FIN_PROD']).size().unstack().reset_index().set_index('CSR_ID').fillna(0)
bank_data = bankset.applymap(encode_unit)   # encode_unit is the function defined above
bank_data

You end up with a wide table like before, and then the analysis is exactly the same.
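For example, the mining step on bank_data reuses the same two mlxtend calls (the thresholds here are placeholders; tune them for the real data):

frequent_itemset = apriori(bank_data, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemset, metric='confidence', min_threshold=1)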

6.2.3 Association between mobile apps

The app data is similar again: each record contains a device ID and an app name. By now the migration pattern should be clear: first group and aggregate by device ID and app name, then convert to a wide table, then mine.

app_new = app.groupby(['Device ID', 'APP name']).size().unstack().fillna(0)
app_data = app_new.applymap(encode_unit)


7. Project practice: how does a director choose actors?

Here we use the Apriori algorithm to mine association rules and analyze which actors a director generally likes to use, and which actors tend to appear in films together.

The director used here is Ning Hao (you can pick another one). Each row of the data set starts with the title of a film directed by Ning Hao, followed by the names of the actors in it. To obtain this data we use a crawler: go to https://movie.douban.com and enter the director's name, e.g. "Ning Hao", in the search box. I won't go into crawler techniques here; the Python crawler quick-start I wrote earlier fully covers the data crawling needed for this task. The code is given below:

"""Download a director's movie data set."""Def dowloadData (director): def dowloadData (director): def dowloadData (director):'C:/Users/ZhongqiangWu/AppData/Local/Google/Chrome/Application/chromedriver') # write to CSV file file_name ='/' + director + '.csv'
    out = open(file_name,'w', newline=' ', encoding='utf-8-sig')
    csv_write = csv.writer(out, dialect='excel')
    flags = []
    """Download data for a specified page"""
    def download(request_url):

        driver.get(request_url)
        time.sleep(1)

        html = driver.find_element_by_xpath("/ / *").get_attribute("outerHTML") HTML = etree.html (HTML) # Set movie name, director actor's XPATH movie_lists = html.xpath("/html/body/div[@id='wrapper']/div[@id='root']/div[1]//div[@class='item-root']/div[@class='detail']/div[@class='title']/ a[@class='title-text']")
        name_lists = html.xpath("/html/body/div[@id='wrapper']/div[@id='root']/div[1]//div[@class='item-root']/div[@class='detail']/div[@class='meta abstract_2']"Num = num; num = numlen(movie_lists)
        if num > 15: # the first page will have it16Movie_lists = movie_lists[1:]
            name_lists = name_lists[1:]

        for(movie, name_list) in zip(movie_lists, name_lists): # The data will be emptyif name_list.text is None:
                continue
            print(name_list.text)
            names = name_list.text.split('/') # check whether the director is the specified directorif names[0].strip() == director and movie.text not in flags: # set first field to movie name [0] = movie.text
                flags.append(movie.text)
                csv_write.writerow(names)

        if num >=14: # There may be a page14A movie # Continue next pagereturn True
        else: # No next pagereturnFalse # Start with ID0, added per page15A base_url ='https://movie.douban.com/subject_search?search_text='+director+'&cat=1002&start='
    start = 0
    while start < 10000: # maximum extraction10000* * * * * * * * * * * * * * * * * * *if flag:
            start = start + 15
        else:
            break
    out.close(a)print('finished')

"""Call the above function"""
directorname = 'ning hao'
dowloaddata(directorname)
Copy the code

Now the association rule mining, using the second approach (efficient-apriori):

import csv
from efficient_apriori import apriori

director = 'ning hao'
file_name = '/' + director + '.csv'
lists = csv.reader(open(file_name, 'r', encoding='utf-8-sig'))
data = []
for names in lists:
    name_new = []
    for name in names:
        name_new.append(name.strip())   # strip spaces around the names
    data.append(name_new[1:])           # drop the first field (the movie title)
itemsets, rules = apriori(data, min_support=0.3, min_confidence=0.8)
print(itemsets)
print(rules)

The results show that director Ning Hao likes to use Xu Zheng and Huang Bo, and when Xu Zheng appears, Huang Bo usually appears as well. You can use the same code to dig into the casting patterns of other directors.

8. Summary

Today I spent a day learning association rule mining. The surface-level knowledge is not that much, and what's above is enough to get started, namely understanding the principles of Apriori and FP-growth. If you want to go deeper in this area, please consult other resources. Association rules are also useful in the direction of recommender systems.

