Association analysis looks for correlations between itemsets in large amounts of data, deriving rules of the form "the occurrence of some events leads to the occurrence of other events".

A typical example of association analysis is shopping-cart analysis. The process analyzes a user's buying habits by discovering connections between the different items he or she adds to the cart, and learns which items are frequently purchased together. Discovering these correlations helps merchants formulate marketing strategies, such as promotions and product placement.

The FP-Growth algorithm is an association analysis algorithm proposed by Jiawei Han et al. It compresses the items in the original data into an FP-tree (Frequent Pattern tree) with two scans of the data, then finds the conditional pattern base of each item from the FP-tree, and finally obtains all frequent itemsets.

  • Advantages: easy to use and easy to deploy.
  • Disadvantages: needs more data, and the results are mediocre.

1, FP-Growth

1.1 Preparing Raw Data

"TID" is the order ID, and "Items bought" lists the products purchased in that order. The following table shows the raw data.

TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}

1.2 Establish item header table

Before building the FP-tree, first create the item header table. Scan the raw data once and count each item. A support threshold can be set here, for example only keeping items that appear at least three times. After filtering, six items remain: a, b, c, f, m and p. They are sorted in descending order of count to form the item header table. (A sketch of this first scan is given after the table.)

Item Count
f 4
c 4
a 3
b 3
m 3
p 3
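
To make the first scan concrete, here is a minimal Python sketch of building the item header table (the function and variable names are made up for this illustration, and items with equal counts may be ordered differently than in the table above):

    from collections import Counter

    def build_header_table(transactions, min_support=3):
        # First scan: count every item, then keep only the items whose count
        # reaches the support threshold.
        counts = Counter(item for t in transactions for item in t)
        frequent = {item: n for item, n in counts.items() if n >= min_support}
        # Sort by count in descending order to obtain the item header table.
        return dict(sorted(frequent.items(), key=lambda kv: -kv[1]))

    transactions = [
        ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
        ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
        ['b', 'f', 'h', 'j', 'o'],
        ['b', 'c', 'k', 's', 'p'],
        ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
    ]
    print(build_header_table(transactions))
    # f and c have a count of 4; a, b, m and p have a count of 3.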

1.3 Filter and sort the original data

Then the raw data is scanned a second time. For each record, items not in the item header table are removed, and the remaining items are sorted in descending order of support. For example, order 100 contains {f, a, c, d, g, i, m, p}; after filtering it becomes {f, a, c, m, p}, and after sorting it becomes {f, c, a, m, p}. The other records are processed in the same way to obtain the sorted data set below, followed by a short sketch of this step.

TID Items bought (Ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
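
A sketch of this second scan, assuming the header order f, c, a, b, m, p from the table in 1.2 (the names are made up for illustration):

    def order_transaction(transaction, header_order):
        # Drop items that are not in the item header table, then sort the rest
        # by the header order, i.e. by descending support count.
        rank = {item: i for i, item in enumerate(header_order)}
        kept = [item for item in transaction if item in rank]
        return sorted(kept, key=lambda item: rank[item])

    header_order = ['f', 'c', 'a', 'b', 'm', 'p']  # from the item header table
    print(order_transaction(['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'], header_order))
    # -> ['f', 'c', 'a', 'm', 'p'], matching the row for TID 100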

1.4 Build the FP-Tree

With the item header table and the filtered and sorted data set, the FP-tree can be built. To do so, read the sorted records one by one and insert each of them into the tree in order. If a record shares a prefix with an existing path, increment the count of each shared ancestor node by 1. Whenever a new node is created, link it to the node linked list of the corresponding item in the header table. Once all records have been inserted, the tree is complete.

Let's use the data above as an example. TID 100 is inserted first. At this point there are no other nodes in the FP-tree, so {f, c, a, m, p} forms a single path and every node has a count of 1.

Then TID 200, {f, c, a, b, m}, is inserted. It shares f, c and a with the existing path, so their counts become 2. A new branch is created for node b, with m as its child, and both of these nodes have a count of 1. The linked lists for b and m in the header table are of course updated as well.

The remaining three records are processed in the same way, completing the FP-tree.
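
The construction can be sketched as follows (a minimal illustration, not a full implementation: the class and function names are made up, and the header-table links are kept as plain Python lists rather than a true linked list):

    class FPNode:
        # One node of the FP-tree: the item it holds, its count, a link to its
        # parent, and its children indexed by item.
        def __init__(self, item, parent):
            self.item = item
            self.count = 1
            self.parent = parent
            self.children = {}

    def insert_transaction(root, ordered_items, node_links):
        # Insert one filtered-and-sorted record, reusing shared prefixes.
        node = root
        for item in ordered_items:
            if item in node.children:
                # Common ancestor node: just increment its count.
                node.children[item].count += 1
            else:
                # New branch: create the node and record it in the header-table links.
                child = FPNode(item, node)
                node.children[item] = child
                node_links.setdefault(item, []).append(child)
            node = node.children[item]

    root = FPNode(None, None)
    node_links = {}
    ordered_records = [
        ['f', 'c', 'a', 'm', 'p'],
        ['f', 'c', 'a', 'b', 'm'],
        ['f', 'b'],
        ['c', 'b', 'p'],
        ['f', 'c', 'a', 'm', 'p'],
    ]
    for record in ordered_records:
        insert_transaction(root, record, node_links)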

1.5 Mining frequent itemsets

With the FP-tree in place, frequent itemsets can be mined. Iterate over the item header table and mine each item in turn, finding its conditional pattern base. The conditional pattern base is the subtree obtained by treating the item being mined as a leaf: take that subtree, set the count of each node to the count of the corresponding leaf node, and delete nodes whose count falls below the threshold. From this conditional pattern base, frequent itemsets can be mined recursively.

Assuming that the minimum required node count is 2, we start by mining node p, as shown in the figure below.

Delete the red {c:1}, {b:1}, {p:1} nodes in the figure, whose counts are less than 2. The conditional pattern base of the {p:2} node is then {f:2, c:2, a:2, m:2}. From it, the frequent 2-itemsets {f:2, p:2}, {c:2, p:2}, {a:2, p:2}, {m:2, p:2}, the frequent 3-itemsets {f:2, c:2, p:2}, {f:2, a:2, p:2}, and so on can be obtained recursively. The maximal frequent itemset here is the frequent 5-itemset {f:2, c:2, a:2, m:2, p:2}.

If the minimum node count were 1, then after mining the subtree above, the red {p:1} node could be found via the node linked list of the header table, and mining could continue from it.

At this point the mining of p is complete, and the other items can be mined in the same way. Once all items have been mined, all frequent itemsets have been obtained. For the last item, f, the conditional pattern base is empty, so there is nothing to mine.
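
Continuing the sketch above, the conditional pattern base of an item can be collected by following the header-table links and walking each node up to the root (prefix_paths is a name made up here; a full implementation would then build a new FP-tree from the result and recurse):

    def prefix_paths(item, node_links):
        # For every node holding `item`, walk up to the root and record the
        # prefix path together with that node's count.
        paths = []
        for node in node_links[item]:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                paths.append((list(reversed(path)), node.count))
        return paths

    print(prefix_paths('p', node_links))
    # -> [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)]
    # Dropping the path whose count is below 2 leaves {f:2, c:2, a:2, m:2},
    # the conditional pattern base of p described above.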

2, PFP-Growth

From the above it can be seen that when mining frequent itemsets, the FP-tree may be very large, and the number of frequent itemsets obtained by recursion may grow exponentially. However, the mining subtrees corresponding to the entries of the item header table are independent of each other, which means frequent itemsets can be mined in parallel. Therefore an FP-tree can be built for a single item, or for several items grouped together.

Recalling the single-machine steps for mining frequent itemsets, the steps that can be parallelized are:

  • Create the item header table
  • Build the FP-Tree
  • Mine frequent itemsets

Therefore, the steps of mining frequent itemsets in parallel can be summarized as follows:

  • The first step is to split the original data into shards, count the items of each shard in parallel, and generate the item header table from the merged counts.

A minimal Python sketch of this counting step follows (the shard layout and names are made up for illustration; in the paper this is a MapReduce job, while here the shards are simply processed in a loop):
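
    from collections import Counter

    def count_shard(shard):
        # "Map" step: count every item that appears in one shard of the data.
        return Counter(item for transaction in shard for item in transaction)

    def merge_counts(partial_counts, min_support):
        # "Reduce" step: sum the per-shard counters and keep the items that
        # reach the support threshold.
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return {item: n for item, n in total.items() if n >= min_support}

    shards = [
        [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
         ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
         ['b', 'f', 'h', 'j', 'o']],
        [['b', 'c', 'k', 's', 'p'],
         ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']],
    ]
    header = merge_counts([count_shard(s) for s in shards], min_support=3)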

  • The second step is to divide the items in the item header table into groups and mine frequent itemsets for each group separately.

A sketch of the grouping and of how each record is routed to its groups follows (round-robin grouping and made-up names; the paper partitions its group list differently and each group is mined on its own worker):
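
    def group_items(header_order, num_groups):
        # Assign every frequent item to a group id (round-robin here).
        return {item: i % num_groups for i, item in enumerate(header_order)}

    def group_dependent_records(ordered_records, group_of):
        # For each record, emit to every group the prefix that ends at that
        # group's right-most item; each group can then build its own FP-tree
        # and mine its items independently of the other groups.
        per_group = {}
        for record in ordered_records:
            seen = set()
            for i in range(len(record) - 1, -1, -1):
                gid = group_of[record[i]]
                if gid not in seen:
                    seen.add(gid)
                    per_group.setdefault(gid, []).append(record[:i + 1])
        return per_group

    group_of = group_items(['f', 'c', 'a', 'b', 'm', 'p'], num_groups=2)
    per_group = group_dependent_records([['f', 'c', 'a', 'm', 'p']], group_of)
    # Group 1 (c, b, p) receives ['f', 'c', 'a', 'm', 'p'];
    # group 0 (f, a, m) receives ['f', 'c', 'a', 'm'].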

  • The third step is to aggregate the results of the second step.

Finally, a sketch of this aggregation step, which keeps, for every item, the top-K itemsets containing it (the names are made up for illustration):
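
    import heapq
    from collections import defaultdict

    def aggregate(group_results, top_k=50):
        # Merge the frequent itemsets mined by every group and keep, for each
        # item, the top-K itemsets (by support) that contain it.
        per_item = defaultdict(list)
        for itemsets in group_results:            # one result list per group
            for itemset, support in itemsets:     # e.g. (('f', 'c', 'p'), 2)
                for item in itemset:
                    per_item[item].append((support, itemset))
        return {item: heapq.nlargest(top_k, entries)
                for item, entries in per_item.items()}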

3, Other

Finally, a few points where we can influence the results of the algorithm. Besides the two places in the algorithm above where data can be filtered (the support threshold of the item header table and the minimum node count when mining), the raw data can also be preprocessed before running the algorithm, and the generated frequent itemsets can be post-processed afterwards.

1. First, weights can be assigned according to information such as how many of an item appear in an order, the order time, and daily sales; the item header table and the FP-tree are then built from the weighted data.

2. Second, the generated frequent itemsets can be filtered into final results by computing confidence and lift.

  • Confidence is computed as conf(X -> Y) = P(Y|X) = P(X, Y) / P(X). The higher the confidence, the stronger the association.
  • Lift is computed as lift(X -> Y) = P(X, Y) / (P(X) P(Y)) = conf(X -> Y) / P(Y). An association rule with lift greater than 1 is considered valid; with lift less than or equal to 1 it is not. In particular, a lift of exactly 1 means the two itemsets are independent, i.e. the occurrence of one has no effect on the occurrence of the other.
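
As a small worked example on the five orders above (supports are raw counts; the function names are made up for illustration):

    def confidence(support_xy, support_x):
        # conf(X -> Y) = P(X, Y) / P(X)
        return support_xy / support_x

    def lift(support_xy, support_x, support_y, n_transactions):
        # lift(X -> Y) = P(X, Y) / (P(X) * P(Y)) = conf(X -> Y) / P(Y)
        p_xy = support_xy / n_transactions
        p_x = support_x / n_transactions
        p_y = support_y / n_transactions
        return p_xy / (p_x * p_y)

    # Rule {m} -> {p}: support({m, p}) = 2, support({m}) = 3, support({p}) = 3.
    print(confidence(2, 3))   # ~0.67
    print(lift(2, 3, 3, 5))   # (2/5) / ((3/5) * (3/5)) ~ 1.11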

Finally, two reference papers:

  • Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
  • PFP: Parallel FP-Growth for Query Recommendation