Written up front: the high utility itemsets obtained by traditional high utility itemset mining algorithms are highly abstract sets; they do not explain why the elements inside a set form a high utility itemset, nor what correlations exist between those elements. Cross-level HUIM algorithms were therefore proposed to solve this problem

Mining Top-K Cross-Level High Utility Itemsets

Definitions

Most of the utility value calculations were already introduced in the HUI-Miner and kHMC algorithm notes, so only a few additional definitions are given here

  • Taxonomy: a tree $\tau$, whose leaves represent concrete items and whose parent nodes represent classification criteria (at varying degrees of abstraction)

  • Generalized item: an item at a higher level of abstraction; for example, both the Persian cat and the tiger can be generalized to “feline”. The set of generalized items is abbreviated $GI$. This leads to the location relationship: if in $\tau$ there is a path from node $g$ to leaf node $i$, then $(g, i) \in LR \subseteq GI \times I$

  • Abstract item: either a generalized item or an ordinary item, i.e. $AI = GI \cup I$ ($I$ is the set of items). In the same way, the location relationship can be generalized: if in $\tau$ there is a path from node $d$ to node $f$, then $(d, f) \in GR \subseteq AI \times AI$

  • Specialization: starting from a generalized item $g$ and following the branches down to the leaf nodes gives the specializations of $g$, formally $Leaf(g, \tau) = \{i \mid (g, i) \in LR\}$; for itemsets $X$ and $Y$ with $|X| = |Y|$, $X$ is a specialized itemset of $Y$ if every $f \in X$ satisfies $f \in Y$ or $f \in Desc(d, \tau)$ for some $d \in Y$

  • Descendants (derivation): naturally, all descendants of $g$ are written $Desc(g, \tau) = \{f \mid (g, f) \in GR\}$; for itemsets $X$ and $Y$ with $|X| = |Y|$, $X$ is a derived itemset of $Y$ if every $f \in X$ satisfies $f \in Desc(d, \tau)$ for some $d \in Y$

  • Cross-level pattern: an itemset containing items from different abstraction levels, i.e. both generalized items and specific items

  • Multi-level Pattern: Cross-level patterns obtained after multi-level filtering with different thresholds for different levels of abstraction

  • Generalized-weighted utility: similar to TWU, $GWU(X) = \sum_{T_i \in g(X)} TU(T_i)$, where $g(X)$ is the set of transactions containing itemset $X$ (a sketch of the GWU computation is given after this list)

  • Partial order relation ($\prec$): a sorting rule over all items in $AI$: $a \prec b$ if $level(a) < level(b)$, or $level(a) = level(b)$ and $GWU(a) < GWU(b)$

  • Join-based extension: for an itemset $X$, a join-based extension appends an item $y \in AI$ directly to $X$, on the condition that $i \prec y$ and $y \notin Desc(i, \tau)$ hold for every $i \in X$

  • Tax-based extension: replacing the last item $y$ of itemset $X$ with any of $y$'s child nodes in $\tau$ gives a tax-based extension of $X$

    Ps. Schematic diagram of the two extension methods (figure not reproduced here)

  • Tax-utility-list: $tuList(X) = \langle \{(tid, iutil, rutil), \ldots\}, children \rangle$, where $children$ records the tuLists of all child nodes (in $\tau$) of the last item of itemset $X$ (a sketch of this structure is given right after this list)

    Ps. Schematic diagram of the tuList structure (figure not reproduced here)
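Since the diagram is not reproduced, here is a minimal Python sketch of how I read the tax-utility-list structure; the class and field names (TuEntry, TuList, iutil, rutil, children) are my own naming, not the paper's notation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TuEntry:
    tid: int      # transaction id
    iutil: int    # utility of the itemset in this transaction
    rutil: int    # remaining utility of the items after the itemset in this transaction

@dataclass
class TuList:
    itemset: Tuple[str, ...]                               # the itemset X
    entries: List[TuEntry] = field(default_factory=list)   # the (tid, iutil, rutil) triples
    # tuLists of the taxonomy child nodes of the last item of X
    children: List["TuList"] = field(default_factory=list)
```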
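Likewise, a minimal sketch of the taxonomy-related definitions above (Desc, Leaf, GWU). The dict-based taxonomy, the transaction format (a set of leaf items plus its transaction utility) and all helper names are assumptions for illustration only; in particular, treating a generalized item as contained in a transaction whenever one of its leaves is present is my reading of $g(X)$:

```python
# Taxonomy tau as a parent -> children mapping; keys are generalized items,
# anything that never appears as a key is a leaf (concrete) item.
taxonomy = {
    "food":  ["bread", "milk"],
    "bread": ["white bread", "oat bread"],
}

def desc(g, tax):
    """Desc(g, tau): every item reachable below g in the taxonomy."""
    out = []
    for child in tax.get(g, []):
        out.append(child)
        out.extend(desc(child, tax))
    return out

def leaf(g, tax):
    """Leaf(g, tau): only the leaf (concrete) specializations of g."""
    return [d for d in desc(g, tax) if d not in tax]

def gwu(itemset, transactions, tax):
    """GWU(X): sum of TU over the transactions containing X, where a generalized
    item counts as contained if one of its leaves appears in the transaction."""
    total = 0
    for items, tu in transactions:   # items: set of leaf items, tu: transaction utility
        if all(i in items or any(l in items for l in leaf(i, tax)) for i in itemset):
            total += tu
    return total

transactions = [({"white bread", "milk"}, 12), ({"oat bread"}, 5)]
print(desc("bread", taxonomy))                         # ['white bread', 'oat bread']
print(gwu({"bread", "milk"}, transactions, taxonomy))  # 12
```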

motivation

The products on sale can be divided into many broad categories, such as bread, milk and so on, and these categories can be further divided into many subcategories (vertical segmentation), such as white bread, oat bread, buckwheat bread and so on. The traditional HUIM algorithm can only analyze itemsets in the broad sense; that is, the results show that “milk + bread” sells well, but cannot tell which specific kind of bread sells well together with milk (Ps. if the mining target is too fine-grained, the traditional HUIM algorithm may filter out some results for being below the threshold). Secondly, it is difficult to obtain satisfactory results in a short time by setting the threshold manually (a threshold that is too large or too small severely affects the mining results), so the top-$K$ idea is used instead to reformulate how results are obtained and avoid this problem

strategy

Pruning strategy based on antimonotonicity

This strategy is universal in utility mining: since GWU is anti-monotone, when $GWU(X) < minUtil$, neither $X$ nor any of its extensions can be a high utility itemset, so the whole branch can be pruned
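A minimal sketch of this pruning step, reusing the gwu() helper and toy data from the taxonomy sketch above (the function name promising_items is my own):

```python
def promising_items(all_items, min_util, transactions, tax):
    # Anti-monotone pruning: if GWU({i}) < minUtil, then i cannot occur in any
    # high utility itemset, so the whole branch rooted at i is discarded.
    return [i for i in all_items if gwu({i}, transactions, tax) >= min_util]

print(promising_items(["bread", "milk", "food"], 15, transactions, taxonomy))
# ['bread', 'food'] -- 'milk' (GWU = 12) falls below the threshold of 15
```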

Pruning strategy based on residual utility

This strategy is the easiest to implement in utility-list based algorithms, and it also fits the characteristics of the list structure best. With $reu(X) = \sum_{e \in tuList(X)} (e.iutil + e.rutil)$, when $reu(X) < minUtil$ all extensions of $X$ can be pruned
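A minimal sketch of this check, reusing the TuEntry/TuList classes sketched in the definitions section (the function names are my own):

```python
def reu(tul: TuList) -> int:
    """reu(X) = sum over tuList(X) entries of (iutil + rutil), an upper bound on
    the utility of X and of every extension of X."""
    return sum(e.iutil + e.rutil for e in tul.entries)

def can_prune(tul: TuList, min_util: int) -> bool:
    # If even iutil + rutil cannot reach minUtil, no extension of X can either,
    # so the subtree below X is cut.
    return reu(tul) < min_util
```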

Utility threshold raising strategy based on generalized items

In preparation for mining, the single items with $GWU > minUtil$ are found, and the real utility values of these items can be used to raise the current minimum threshold
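My reading of this strategy, as it is commonly implemented in top-$K$ HUIM: collect the real utilities of the promising items found during preparation and raise minUtil to the $k$-th largest among them; the exact rule in the paper may differ. A sketch under that assumption:

```python
import heapq

def raise_threshold(real_utils, k, min_util):
    """Raise minUtil to the k-th largest known real utility, if that is higher."""
    if len(real_utils) >= k:
        kth_largest = heapq.nlargest(k, real_utils)[-1]
        min_util = max(min_util, kth_largest)
    return min_util

# with k = 3 and item utilities [40, 25, 60, 10], minUtil rises from 1 to 25
print(raise_threshold([40, 25, 60, 10], 3, 1))   # 25
```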

Pseudo code

This algorithm is a top-$K$ improved version of the CLH-Miner algorithm. Through depth-first traversal, cross-level high utility itemsets are mined using the two extension methods, join-based and tax-based
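A very rough sketch of how the described search could fit together. This is not the paper's actual Find procedure: the two extension generators are left as stubs, details such as the exact pruning order are simplified, and it reuses reu() and TuList from the earlier sketches:

```python
import heapq

def join_extensions(tul: TuList):
    # Would append every item y that follows the last item of X in the order and
    # is not in Desc(i, tau) for any i in X, building tuList(X + {y}) via
    # Construct(); left as a stub in this sketch.
    return []

def tax_extensions(tul: TuList):
    # Would replace the last item of X by each of its taxonomy children,
    # i.e. descend into tul.children; left as a stub in this sketch.
    return []

def search(tul: TuList, k: int, min_util: int, topk: list) -> int:
    """Depth-first search over join-based and tax-based extensions, with
    top-k threshold raising and residual-utility pruning."""
    util = sum(e.iutil for e in tul.entries)          # real utility of X
    if util >= min_util:
        heapq.heappush(topk, (util, tul.itemset))     # keep the k best patterns
        if len(topk) > k:
            heapq.heappop(topk)
        if len(topk) == k:
            min_util = max(min_util, topk[0][0])      # raise the threshold
    if reu(tul) < min_util:
        return min_util                               # prune the whole subtree
    for ext in join_extensions(tul) + tax_extensions(tul):
        min_util = search(ext, k, min_util, topk)
    return min_util
```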

Main procedure

Find procedure

Since the original article does not provide pseudocode for ConstructJoinExtension, ConstructTaxExtension, and Construct(), I take the corresponding pseudocode from the CLH-Miner and HUI-Miner algorithms

ConstructJoinExtension procedure

ConstructTaxExtension procedure

Construct procedure
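Since the Construct() pseudocode is borrowed from HUI-Miner, here is a sketch of that construction in the same Python style as the earlier sketches (parameter names are my own):

```python
def construct(tul_p, tul_px, tul_py):
    """HUI-Miner style Construct(): build tuList(Pxy) by intersecting
    tuList(Px) and tuList(Py) on transaction ids. tul_p is the tuList of the
    common prefix P (None when Px and Py are 1-itemsets)."""
    by_tid_py = {e.tid: e for e in tul_py.entries}
    by_tid_p = {e.tid: e for e in tul_p.entries} if tul_p else {}
    entries = []
    for ex in tul_px.entries:
        ey = by_tid_py.get(ex.tid)
        if ey is None:
            continue                                  # Pxy does not occur in this transaction
        iutil = ex.iutil + ey.iutil
        if ex.tid in by_tid_p:
            iutil -= by_tid_p[ex.tid].iutil           # the prefix utility was counted twice
        entries.append(TuEntry(ex.tid, iutil, ey.rutil))
    return TuList(tul_px.itemset + (tul_py.itemset[-1],), entries)
```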

conclusion

Since this paper was published in the same year as its baseline algorithm CLH-Miner, it does not really optimize the threshold raising strategies and only adopts strategies that are common in top-$K$ HUIM algorithms. However, this algorithm is the first to address the top-$K$ cross-level HUIM problem. According to the experimental results, the algorithm performs well in terms of runtime, but its memory usage grows sharply as $K$ increases, which is a common weakness of the utility-list data structure. In my opinion, the data structure should be improved towards a higher degree of compression to improve mining efficiency and reduce the generation of invalid candidates