This algorithm is a top-K mining algorithm designed based on Hui-Miner’s Utility-list structure. Many threshold increment strategies adopted in this algorithm are worth studying. The purpose of this note is to introduce these strategies and personal ideas

An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies

The sample

External utility of items

Example database

define

You can refer to the hui-Miner algorithm notes for most of the content, but here are some additional additions

Estimated Utility Co-occurrence Structure (EUCS)

EUCS is defined as a tuple ({(a; b; TWU(ab))\{(a; \ b; \ TWU(ab) ){(a; b; TWU (ab)), a, b ∈ I ∗. A, b \ ^ I * in a, b ∈ I ∗, TWU (ab) ∈ R +} TWU (ab) \ in R ^ + \} TWU (ab) ∈ R +}). Where aaa and BBB are terms (not repeated by default) R+R^+R+ indicates that the algorithm does not consider the case of disutility. For the convenience of understanding, the author drew several matrix diagrams to illustrate (Ps. In fact, the information is stored in hashmap, not in real matrix structure). The detailed construction process is as follows:

Utility list construction

Item set utility: Iutil ∑ul∈ ul (X)ul. Iutil \sum_{\rm ul \in ul (X)}ul. Iutil ∑ul∈ ul (X)ul. UL(X)UL(X)UL(X) UL(X) refers to the utility list of item set X (see Hui-Miner for how to construct)

Transitive extension upper-bound utility

Given any item set XXX and its extensibility XXX, The new boundary value calculation formula is teu(X,y)teu(X, Y) teu (X, y) = ∑ ul ∈ ul (X) ul. Iutil \ sum_ {\ rm ul \ ul (X)} in ul. Iutil ∑ ul ∈ ul (X) ul. Iutil + ∑ ul ∈ ul (y) ul. Iutil \ sum_ {\ rm ul \ in UL (y)} UL. Iutil ∑ UL ∈ UL (y) UL. Iutil + u u u (y) (y) (y)

Coverage

When there is a relationship g(I)⊆ G (j)g(I) \subseteq g(j)g(I)⊆ G (J), we say that JJJ contains (cover) term III, Can be deduced about the item iii covers C (I) = {g (I) ⊆ g (x) Sunday afternoon x ∈ I} (I) = C \ {g (I) (x) \ \ subseteq g land x \ \} I in C (I) = {g (I) ⊆ g (x) Sunday afternoon x ∈ I}, Where g(x)g(x)g(x) represents the number (TID) of all transactions containing item XXX, i.e. G (x)={TID ∣x⊆Ttid}g(x) = \{TID \mid x \subseteq T_{\rm TID}\}g(x)={TID ∣x⊆Ttid}

Further expansion, coverage relationship also exists between item sets. When item set X=x1x2… XnX = x_1x_2 \ldots x_{\rm n}X=x1x2… Xn, xi∈C(x1)(2≤ I ≤n)x_{\rm I} \in C(x_1) (2 \le I \le n)xi∈C(x1)(2≤ I ≤n) is said to be a coverage itemset

This is actually a closure concept, right? Replace the following set of extended items with a small number of items that share the same properties, and move on to other useful properties based on the concept of coverage. The proof process is described in detail in the original text, and you can derive it manually

Property 1: Let iii and JJJ have the relation j∈C(I)j \in C(I)j∈C(I), Have u (ij) ∣ p u (I) + g (I) ∣ x p (j) > 0 u (ij) \ ge u (I) + / mid g (I)/mid/p (j) > 0 times u (ij) ∣ p u (I) + g (I) ∣ x p (j) > 0, Where the item set ijijij is extended by the combination of two items
Property 2: If j∈C(I)j \in C(I)j∈C(I), then TWU(I)=TWU(ij)TWU(I)=TWU(ij)TWU(I)=TWU(ij)
Property 3: set X=x1x2… XnX = x_1x_2 \ldots x_{\rm n}X=x1x2… Xn, Are u (X) = ∑ 2 or less I nu (x1xi) or less – (n – 1) * u (x1), u (X) = \ sum_ 2 I \ \ le le n} {u (x_1x_ {I} \ rm) – (n – 1) \ times U (x_1) u (X) = ∑ 2 or less I nu (x1xi) or less – (n – 1) * u (x1)
Property 4: According to property 3, we can get a new way of calculating utility: Set item XXX, xix_ixi and item set YYY where XI ∈C(x) X_I \in C(x)xi∈C(x), Y⊂C(x)Y \ C(x) subset C(x)Y⊂C(x) and xi yerx_I \not\in Yxi∈Y, Have u (xYxi) u (xYx_i) u (xYxi) = u (x, y) + u (xxi) – u (x) u (x, y) + u (xx_i) – u (x) u (x, y) + u (xxi) – u (x), We also know that u(xYxi)>u(xY)u(xYx_i) >u(xY)u(xYxi)>u(xY)
Property 5: According to property 4, it is easy to think that minUtil≤u(xY)

strategy

The estimated utility co-occurrence pruning strategy (EUCP)

At the end of the EUCS construction according to the definition, remove the TWU(xy)

Binary number of itemsets can be clearly seen, much less than the original, then on the basis of the binary itemsets mining much higher levels of efficiency with the collection, the pruning strategy core still use the traditional TWUTWUTWU properties, but most of the previous algorithm do first-order itemsets and pruning, and the algorithm can successfully on the second-order itemsets using this property, It has great reference value for other algorithms

The efficient utility-list pruning strategy

According to the two definitions of item set utility value calculation and item set residual utility value calculation, a new pruning strategy can be obtained. The ∑ ul ∈ ul (X) ul. Iutil + ∑ X ≻ iU (I) < minUtil \ sum_ {\ rm ul \ ul (X)} in ul. Iutil + \ sum_ \ succ {X} I U (I)” MinUtil ∑ul∈ ul (X)ul.iutil+∑X≻iU(I)

The Early Abandoning Additional Strategy (EA)

LA property: Given two item sets XXX and YYY, When the ∑ ∀ Ti ∈ DU (X, Ti) + RU (X, Ti) – ∑ ∀ Tj, D ∈ X ⊆ Tj Sunday afternoon Y ⊈ TjU (X, Tj) + RU (X, Tj) < minUtil \ sum_ rm {\ \ forall T_i \} in D U (X, T_ {I} \ rm) + RU (X, T_{\rm i}) – \sum_{\rm \forall T_j \in D, X \subseteq T_{\rm j} \land Y \not\subseteq T_{\rm j}}U(X, T_{\rm j}) + RU(X, T_ {\ rm j}) < minUtil ∑ ∀ Ti ∈ DU (X, Ti) + RU (X, Ti) – ∑ ∀ Tj, D ∈ X ⊆ Tj Sunday afternoon Y  ⊆ TjU (X, Tj) + RU (X, Tj) < minUtil, It has ∀X star X ‘∧Y star Y’ ε X ‘targeted HUIs\forall X \subseteq X^\prime \land Y \subseteq Y^\prime \Rightarrow X^\prime Y^\ Prime \not\in HUIs ∀ ⊆ X X ‘Sunday afternoon ⊆ Y Y’ ⇒ X ‘Y’  ∈ HUIs

The core idea of EA strategy is to stop constructing the Utility list of an item set when it is set to LA Property, so as to save space and time. The proof and derivation process is described in detail in the paper

Transitive extension pruning strategy (TEP)

According to the Transitive Extension upper-bound Utility definition, there are two properties:

$teu(X, y) \ge u(Xy)$
$teu(X, y) \ge \sum_{\rm ul \in UL(Xy)}(ul.iutil + ul.rutil)$

The proof process is derived in detail in the paper, here only to say the conclusion. By this property, when teu(X,y)

Real item utilities threshold raising strategy (RIU)

This strategy can be easily implemented on many algorithms, and there are many variations. The main idea is to collect the utility of each item or the utility of the transaction item set while scanning the data set, and usually we can use hashMap to store these values (because later mining can also be used).

The co-occurrence threshold raising strategy (CUD)

On the basis of EUCST, the information of 2-Itemsets can be used to raise the threshold, and other algorithms can be used to consider whether a CUDM should also be built to store these information (is there any way to avoid the calculation results obtained by scanning the data set twice).

The coverage threshold raising strategy (COV)

The main method is to construct a COV List to raise the threshold. The process is shown in the pseudocode below. Personally, I think this method is more applicable because only unary items are involved

algorithm

construct utility list

The key of this method is that each candidate has a “common prefix”. However, this method is actually inefficient and will produce many useless candidates. Therefore, it is necessary to cut off those inefficient candidates as much as possible before the stitching operation

iConstruct algorithm

The above utility list construction algorithm is actually very inefficient. It takes time and memory to traverse (albeit in dichotomies) two lists to find items with the same prefix and then concatenate them. KHMC has designed a new construction method that uses EA strategies to avoid constructing inefficient itemsets

kHMC algorithm

The pseudo code of the whole algorithm is as follows:

conclusion

KHMC algorithm applies a lot of threshold lifting strategies, and it is only based on the utility-list structure. When I first came into sight of this paper, I only glanced at it roughly, without a careful understanding of the applied methods and features. The author also makes many comparative experiments to test the superiority of kHMC algorithm from various aspects. At the end, the author also mentions that the concept of coverage can actually be used in utility mining in other fields. Personally, I think this algorithm has great reference value and is worth studying. The only drawback is that no code can be found on SPMF…

KHMC algorithm