It has been a long time since the learning of USpan algorithm in the last paper. This time, I will learn sequence mining again. This paper solves the derivation of association rules and advances the previous work of mining sequence patterns. HUSRM algorithm uses a lot of strategies, it needs to spend some time to study
Efficient Mining of High-Utility Sequential Rules
The sample
sequence dataset
utility table
define
For the basic definition, please refer to this USpan algorithm note, which is only a supplement here
- Let the sequential rule have two itemsets XXX and YYY with X,Y⊆IX, Y \ Subseteq IX,Y⊆I, X∩Y= \emptyX∩Y=∅ X and X,Y≠∅X, Y \not= \emptyX,Y=∅, the sequence rule X→YX \rightarrow YX→Y is related meaning that in the same sequence, As soon as the item set XXX happens, the item set YYY happens, the sequence rule size is ∣X∣×∣Y∣\mid X \mid \times \mid Y \mid∣X∣×∣Y∣, Meanwhile, seq(r)={s∣ S ∈SDB∧ R ⊆ S} SEq (r)=\lbrace s \mid s \in SDB \land r \subseteq s \rbraceseq(r)={s∣ S ∈SDB∧ R ⊆ S} was used to indicate that the rule RRR was included All sequences of, Ant (r)={s∣ S ∈SDB∧X star}ant(r)=\lbrace s \mid s \in SDB \land X \subseteq s \rbraceant(r)={s∣ S ∈SDB∧X star} S \ rBRACeant (r)={s∣ S ∈SDB∧X star} S
- The support of association rule RRR in the data set is also expressed in the form of ratio. Calculation expression for supSDB (r) = ∣ seq (r) ∣ ∣ SDB ∣ sup_ (r) = {SDB} \ frac {\ mid seq (r) \ mid} {\ mid SDB \ mid} supSDB (r) = ∣ SDB ∣ ∣ seq ∣ (r)
- The confidence of association rule RRR in the data set is also expressed in the form of ratio. Calculation expression for confSDB = ∣ seq (r) ∣ ∣ ant (r) ∣ conf_ = {SDB} \ frac {\ mid seq (r) \ mid} {\ mid ant (r) \ mid} confSDB = ∣ ant (r) ∣ ∣ seq ∣ (r)
- Utility of a sequential rule When a rule R :X→Yr: X \rightarrow Yr:X→Y Its utility value calculation expression for u (r, sc) = ∑ (I ∈ X ∪ Y) Sunday afternoon (r) ⊆ sc u (I, sc), u (r, s_c) = \ sum_ {(I \ \ X in CPU Y) \ land (r \ subseteq s_c)} u (I, S_c) u (r, sc) = ∑ (I ∈ X ∪ Y) Sunday afternoon (r) ⊆ sc u (I, sc), if r ⊈ SCR \ not \ subseteq s_cr ⊆ sc, u (r, sc) = 0 u (r, s_c) = 0 u (r, sc) = 0; The total utility value of this rule in the whole data set is uSDB(r)=∑s∈SDBu(r,s)u_{SDB}(r) = \sum_{s \in SDB}u(r, s)uSDB(r)=∑s∈SDBu(r,s)
- Given three thresholds of minsupminSupminsup, minconfMinConfminconf and minutilminutilminutil, When both values of a rule are greater than the threshold, we consider the rule to be an efficient sequential rule (i.e., more constraints)
One thing that should be paid attention to in mining rules is the extension method. It is also explained in this paper that the rules with the size of 1×11 \times 11×1 are gradually extended to the left and right to get longer interesting rules. However, it should be noted that the order of left and right extension and the recursive method will greatly affect the final mining results. For example, rule RRR: {a}→{c}\lbrace a \rbrace \rightarrow \lbrace c \rbrace{a}→{c} {a,b}→{c,d}\lbrace a,b \rbrace \rightarrow \lbrace C,d \rbrace{a,b}→{c,d} {c}→{e}\lbrace c \rbrace \rightarrow \lbrace e \rbrace{b}→{e} \lbrace c \rbrace \rightarrow \lbrace e{c}→{e}\lbrace c \rbrace \rightarrow \lbrace e {b,c}→{e}\lbrace b,c \rbrace \rightarrow \lbrace \rbrace{b,c}→{e} The solution is described in the definitions below
-
Extendability
-
For rule X→YX \rightarrow YX→Y and item I ∈Ii \in Ii∈I, When for ∀j∈X\forall j \in X∀j∈X has I ≻lexji \succ_{lex} ji≻lexj and I hereunder Yi \not\in Yi∈Y, The left extension is X∪{I}→YX \cup \lbrace I \rbrace \rightarrow YX∪{I}→Y
-
Similarly, for ∀j∈Y\forall j \in Y∀j∈Y there is I ≻lexji \succ_{lex} ji≻lexj and I misplaced Xi \not\in Xi∈X, X→Y∪{I}X \rightarrow Y \cup \lbrace I \rbraceX→Y∪{I}
-
Now, ≻lex\succ_{lex}≻lex is — that is, between items is lexicographical — which is used to solve the problem of different rules producing the same new rules because of expansion; Because it is possible for an item to be extended both left and right, the default is left extension first, then right extension, and the right extension can continue to extend left, but left extension can only be extended right, to solve the same rule because of different extension order to get repeated new rules
-
-
Sequence estimated Utility of an item has the same function as TWU, The calculation expression is SEU(I)SEU(I)= ∑(sc∈SDB)∧({I} star SC)SU(sc)\sum_{(s_c \in SDB) \land (\lbrace I \rbrace \subseteq S_c)}SU(s_c)∑(sc∈SDB)∧({I} star SC)SU(sc), where SUSUSU is the total utility value of scs_csc sequence, the same as TU, which means to calculate the sum of the utility values of all sequences including item III in the whole dataset; Similarly, SEU(r)SEU(r)SEU(r)= ∑sc∈seq(r)SU(sc)\sum_{s_c \in seq(r)}SU(s_c)∑sc∈seq(r)SU(sc), The purpose of these estimates is to help determine whether the current item/rule is potentially interesting, and as long as they are not less than minutilminutilminutil, then we think the item/rule is scalable and most likely efficient
-
This is a new storage structure designed separately to record all utility information related to rule RRR. It is represented by UT (r) UT (r) UT (R) UT (R). There are many pieces of data in the utility list structure. Format for (sid, iutil lutil, rutil, lrutil) (sid, iutil lutil, rutil, lrutil) (sid, iutil lutil, rutil, lrutil), including:
- Sid: represents the sequence number containing RRR (Ssid∈ SEq (r)S_{SID} \in SEq (r)Ssid∈ SEQ (r))
- Iutil: represents the utility value of RRR in the current sequence
- Lutil: represents the sum of utility values for items in the current sequence that can be left extended to the RRR
- Rutil: represents the sum of the utility values of the items in the current sequence that can extend the RRR to the right
- Lrutil: represents the sum of the utility values that can extend the RRR left/right in the current sequence
In fact, the utility values of the last three records of expansion can be used as part of the estimate, which is convenient for pruning RRR. In the process of use, it is necessary to clarify the current expansion mode, and it is easy to understand the relationship between the extension iii and the extended item/item set JJJ. Let’s continue with how to calculate the extension rule by utility-table. Given the extended rule r ‘r ^\primer ‘, rule RRR, I ≻lexj∈ SCI \succ_{lex} j \in s_ci≻lexj∈ SC and ut(r)ut(r)ut(r) ut(r ‘)ut(r^\prime)ut(r ‘) :
- Sid: This must be consistent because the extended rule must be in the same sequence as the original rule
- Iutil ‘: Iutil ‘(r’) iutil ^ \ prime (r ^ \ prime) iutil ‘(r’) = iutil (r) iutil iutil (r) (r) + u ({I}, ssid), u (I \ \ lbrace rbrace, s_{sid})u({i},ssid)
- Lutil ‘: Lutil ‘(r’) lutil ^ \ prime (r ^ \ prime) lutil ‘(r’) = lutil lutil (r) (r) lutil (r), u (j, ssid) u (j, s_ {sid}), u (j, ssid) – u (I, ssid) u (I, S_ {sid}) U (I,ssid), in which JJJ cannot be the left extension of rule r ‘r ^\primer ‘, but the left extension of rule RRR. In addition, if item III can be the left extension of rule RRR, then its utility value should be subtracted. Otherwise, there is no need to subtract the part value of item III
- Rutil ‘: Rutil ‘(r’) rutil ^ \ prime (r ^ \ prime) rutil ‘(r’) = rutil rutil (r) (r) rutil (r) – u (j, ssid) u (j, s_ {sid}) u (j, ssid) – u (I, ssid) u (I, S_ {sid})u(I,ssid
- Lrutil ‘: Lrutil ‘(r’) lrutil ^ \ prime (r ^ \ prime) lrutil ‘(r’) = lrutil lrutil (r) (r) lrutil (r), u (j, ssid) u (j, S_ {sid})u(j,ssid) -u (I,ssid) U (I, s_{sid}) U (I, SSID), where JJJ cannot be the left/right extension of rule r ‘r ^\primer ‘, but the left/right extension of rule RRR, Also subtracting the utility value of item III if it can be a left/right extension of rule RRR, otherwise do not subtract the part of item III
The reason why we need to subtract the utility values of item set III and JJJ after the extension is that the same item can not be repeated in the association rules, and also to avoid the right extension and left extension of an item
-
In the confidence of extension rule, bit vector B (I)b(I)b(I) B (I) is used to indicate the sequence in which item III appears. 111 is used to indicate that item III appears, and 000 indicates that it does not appear. Then we take the bit vectors of each item in the item set and operate them bit by bit, and the final number of 111 is the item set given Ant (X)∣ mid Ant (X) \mid∣ Ant (X)∣ RRR is calculated in the same way, the support can be intuitively obtained by using the Utility table; Bv bv such as bv (a) (a) (a) = 1111 Sunday afternoon \ land Sunday afternoon bv bv bv (b) (b) (b) = 1011 ⇒ \ Rightarrow ⇒ bv (ab) bv (ab) bv (ab) = 1011
Ps. While in the paper it looks like a long array is used to store the zeros and ones, in the code a hashmap is used to store the corresponding key-value pairs, and the new regular bit vector is calculated directly by comparing the keys
strategy
Since pruning strategy involves two well-researched mining fields (frequent and utility), pruning strategy is carried out by their anti-monotonicity respectively. As long as a condition is not met, we will delete the node and do not carry out further expansion work. Hui-miner has a detailed introduction
utility-table
In fact, the information stored in the Utility-list structure is very intuitive, because all the information of the item/set and the item is stored in the list, we only need to combine these known conditions to get the key information
- U (r)u(r)u(r) u(r)= ∑ I ∈ Sidiutil (ri)\sum_{I \in sid}iutil(r_i)∑ I ∈ Sidiutil (ri)
- Similarly, the formula for calculating the support of the rule RRR is supSDB(r)sup_{SDB}(r)supSDB(r)= ∣ut(r)∣ SDB∣\frac{\mid ut(r) \mid}{\mid SDB \mid}∣SDB∣ ut(r)∣
- Iutil (r)iutil(r)+ rutil(r)rutil(r) Rutil (r)rutil(r)+ Lutil (r) Lutil (r) Lutil (r)+ Lrutil (r)≤SEU(r)lrutil(r) \le SEU(r)lrutil(r)≤SEU(r)
- If rule RRR is extended left/right once to get a new rule TTT, Have u (t) or less iutil (r), u (t) \ le iutil (r) u (t) or less iutil + rutil (r) (r) rutil rutil (r) (r) + lutil lutil (r) (r) lutil (r) + Lrutil (r)lrutil(r)lrutil(r) (because with each extension, there are fewer sequences containing this new rule)
- If rule RRR gets a new rule TTT after a left extension, U (t)≤iutil(r)u(t) \le iutil(r)u(t)≤iutil(r)+ lutil(r)lutil(r)lutil(r)+ lrutil(r)lrutil(r)lrutil(r) lrutil(r)lrutil(r)
compact utility-table
The author observed two rules during the experiment:
-
During the left extension:
- Rutilrutilrutil is never used, so you don’t need to calculate it
- The sum of Lutillutillutil and lrutillrutillrutil is always used, so you can use the sum instead of the two values
Obviously, this makes the table smaller and takes less time to construct
-
In any sequence, the items after the first occurrence of a rule can be extended to the right, and correspondingly, the items that precede it can be extended to the left
algorithm
HUSRM algorithm
Left Expension
Right Expension
conclusion
The above is all the content of HUSRM algorithm, borrowing from the idea of Utility-list, directly through the list without traversing the data set to construct higher level rules, is also a breakthrough work in the field of efficient use of sequential rules mining. Personally, I think it is very interesting for the author to split the original algorithm into four versions of different degrees of optimization for experimental comparison. In fact, we may try to figure out the purpose of the author’s arrangement. The biggest problem of this kind of algorithm is that the running time is too long. Because of the complex construction process, the experimental results of many algorithms are not very good. More research is needed to find out whether there is a more efficient storage structure to simplify the construction process or a new construction idea.