A note up front: the Apriori algorithm and the FP-growth algorithm are both very classical frequent-itemset mining algorithms, and I have been studying them on and off. This post records my understanding of the two algorithms and the differences between them; please point out anything that is wrong.

Apriori algorithm

The Apriori algorithm uses an iterative, level-wise search: frequent k-itemsets are used to explore (k+1)-itemsets (later algorithms such as HUI-Miner, FHN, and UMEpi follow the same idea). The algorithm relies on the downward closure property: every non-empty subset of a frequent itemset must itself be frequent. Equivalently, if an itemset A does not meet the minimum support threshold (i.e., A is not frequent), then adding any itemset B to A cannot produce a frequent itemset, so A∪B is not frequent either.
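To make the prior property concrete, here is a minimal Python sketch of the subset check Apriori uses to prune candidates; the function name and the tiny itemsets are my own illustrative choices, not anything from the original algorithm description.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Downward closure check: a k-itemset can only be frequent if
    every (k-1)-subset of it is already known to be frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Made-up frequent 2-itemsets, just to illustrate the pruning decision
frequent_2 = {frozenset(s) for s in [("A", "B"), ("A", "C"), ("B", "C")]}
print(has_infrequent_subset(("A", "B", "C"), frequent_2))  # False -> keep the candidate
print(has_infrequent_subset(("A", "B", "D"), frequent_2))  # True  -> prune the candidate
```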

Algorithm steps

  1. Scan the data set and count the support of every single item; the items whose support meets the threshold form the set F1 of frequent 1-itemsets.
  2. Join the frequent (k-1)-itemsets found in the previous pass pairwise to generate the candidate k-itemsets.
  3. Scan the database again to compute the support of the candidates, determining which candidate k-itemsets are contained in each transaction T.
  4. Delete every candidate itemset whose support is below the threshold.
  5. Repeat steps 2 to 4 until no new frequent itemsets are generated, at which point the algorithm terminates (a minimal sketch of the whole loop follows this list).
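The following is a rough, self-contained Python sketch of the level-wise loop described above, assuming transactions are given as lists of items and the support threshold is an absolute count; the function and variable names are mine, not from any particular implementation.

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support):
    """Level-wise Apriori loop following steps 1-5 above."""
    transactions = [frozenset(t) for t in transactions]

    # Step 1: first scan, count single items and keep the frequent 1-itemsets F1
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = {s: counts[s] for s in frequent}

    k = 2
    while frequent:
        # Step 2: join (k-1)-itemsets pairwise to build candidate k-itemsets,
        # then prune candidates that have an infrequent (k-1)-subset
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}

        # Step 3: rescan the data set to count the support of each candidate
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Step 4: keep only candidates that meet the threshold
        frequent = {c for c, n in counts.items() if n >= min_support}
        all_frequent.update((c, counts[c]) for c in frequent)
        k += 1  # Step 5: repeat until no new frequent itemsets appear

    return all_frequent

# Toy run on made-up transactions
data = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
print(apriori(data, min_support=3))
```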

disadvantages

As the steps above show: 1) the Apriori algorithm must scan the data set many times to compute the support of the candidate itemsets, and when the data set is large the I/O cost becomes unacceptable; 2) the set of candidate itemsets it generates is very large and, in the worst case, grows exponentially, which is equally unacceptable in terms of memory consumption and running time.

advantages

There is no doubt that the Apriori algorithm, as the earliest and most classic frequent-itemset mining algorithm, opened up a new mining approach by exploiting the prior (downward closure) property. It is easy to understand and places few requirements on the data set. Finally, the algorithm is widely used in practice and can mine many useful association rules.


FP-Growth algorithm

The FP-growth algorithm improves on Apriori in several respects, most notably through the FP-tree structure designed to store the key information of the data set. By borrowing this tree, it no longer needs to rescan the data set to confirm support counts (later algorithms such as UP-Growth, UP-GNV, and RFM-Growth reuse this storage structure). Frequent patterns are generated directly by recursively calling the FP-growth procedure, so no candidate patterns need to be generated at any point in the mining process.
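As a rough illustration of the storage structure (not the author's exact implementation), an FP-tree node typically keeps an item name, a count, a parent pointer, child links, and a node-link to the next node holding the same item, which is what the header table threads through:

```python
class FPNode:
    """Minimal FP-tree node: item, count, tree links, and the node-link
    that chains together all nodes carrying the same item."""
    def __init__(self, item, parent=None):
        self.item = item        # item name (None for the root)
        self.count = 1          # number of transactions sharing this prefix
        self.parent = parent    # link back toward the root
        self.children = {}      # item -> child FPNode
        self.node_link = None   # next node in the tree with the same item

    def increment(self, n=1):
        self.count += n
```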

Algorithm steps

  1. Scan the data set once to build the header table, ordered by the support of the frequent single items; during the second scan, the items in each transaction are sorted according to this order.
  2. Build the FP-tree storage structure from the data set as adjusted by the header table and the sorting (see the sketch after this list).
  3. Starting from the bottom of the header table and working upward (which also means traversing the FP-tree bottom-up), construct the higher-order itemsets; each higher-order itemset corresponds to a branch path.
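Below is a hedged sketch of steps 1 and 2, reusing the FPNode class sketched earlier; the header-table layout (item mapped to a support count and the head of its node-link chain) and all names are my own assumptions for illustration.

```python
from collections import defaultdict

def build_fp_tree(transactions, min_support):
    """Build the header table and FP-tree as in steps 1-2 above."""
    # First scan: count single items and keep only the frequent ones
    item_counts = defaultdict(int)
    for t in transactions:
        for item in t:
            item_counts[item] += 1
    frequent = {i: c for i, c in item_counts.items() if c >= min_support}

    # Header table: item -> [support, head of the node-link chain]
    header = {item: [count, None] for item, count in frequent.items()}
    root = FPNode(None)

    # Second scan: filter, sort by descending support, and insert each
    # transaction into the tree as a prefix path
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].increment()
            else:
                child = FPNode(item, parent=node)
                node.children[item] = child
                # thread the new node onto the header table's node-link chain
                child.node_link = header[item][1]
                header[item][1] = child
            node = node.children[item]
    return root, header
```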

disadvantages

The pruning is not thorough enough, and the bounds it uses are not tight enough with respect to the true support.

advantages

The FP-tree is a condensed representation of the data set that contains no data about infrequent single items. There is no need to scan the data set repeatedly each time frequent itemsets are searched for: one item is fixed and, starting from each of its nodes, the tree is traversed bottom-up to the root to build a sub-FP-tree (the projection).
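A small sketch of that projection step, under the same assumed node and header-table structures used in the earlier sketches: for a fixed item, walk its node-link chain and climb each node's parent pointers up to the root, collecting the prefix paths (with counts) from which a sub-FP-tree would then be built.

```python
def conditional_pattern_base(item, header):
    """Collect the prefix paths (and their counts) for a fixed item."""
    paths = []
    node = header[item][1]          # head of the node-link chain for this item
    while node is not None:
        prefix, parent = [], node.parent
        while parent is not None and parent.item is not None:
            prefix.append(parent.item)
            parent = parent.parent
        if prefix:
            paths.append((prefix[::-1], node.count))
        node = node.node_link
    return paths
```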

conclusion

  • On the core idea,

    • the Apriori algorithm combines two lower-order frequent itemsets to construct higher-order frequent itemsets
    • the FP-growth algorithm starts from a lower-order frequent itemset and filters out all higher-order frequent itemsets that contain it
  • In the implementation,

    • the Apriori algorithm needs to scan the data set several times to confirm the support of the frequent itemsets
    • the FP-growth algorithm mines frequent itemsets by fixing one item at a time and constructing conditional sub-FP-trees
  • The two algorithms represent different mining ideas; building on them, more efficient algorithms can be obtained by combining different pruning strategies and storage structures

P.S. Beyond studying the performance differences between these algorithms, it is more important to understand their context. Why did this algorithm win everyone's recognition? Why don't people use other algorithms that seem to perform much better?