Write in front: Episode pattern mining is different from Itemset pattern mining, and its complexity is much more than itemset. First, “plot” means the occurrence of a series of events, and multiple Itemsets are generated within a period of time. These Itemsets may occur in sequence or simultaneously. That is, episode mining should consider not only horizontal sequences of events, but also vertical sets of events.

Mining High Utility Episodes in Complex Event Sequences

define

Episode mining starts with Frequent calculations, so here are some basic definitions

The Frequent class

  • Simple event sequence: let ε={E1,E2… , Em} \ varepsilon = \ {E_1, E_2, \ dots, E_ {\ rm m} \} epsilon = {E1, E2,… ,Em} is a set of finite events, simple event sequence SS=<(E1,T1),(E2,T2)… , (Em, Tm) > SS = < (E_1, T_1), (E_2, T_2), \ dots, (E_ {} \ rm m, T_ {\} rm m) > SS = < (E1, T1), (E2, T2),… ,(Em,Tm)> is an ordered sequence of events in which each event Ei∈εE_{\rm I} \in \varepsilonEi∈ε is associated with a time point Ti∈N+T_{\rm I} \in N^+Ti∈N+. For example, in Figure 1 below:

  • Alpha alpha is a set of non-empty ordered events. For example, in FIgure (FIgure 1) < (A), (C) > < \ (A), (C) > < (A), (C) > is A simple plot

  • Simultaneous Event set SE=(E1,E2,… ,Em)SE =(E_1, E_2, \dots, E_{\rm m})SE=(E1,E2… Em) refers to a set of events that occur on TiT_{\rm I}Ti at the same point in time.

  • Complex event sequence: CS=<(SE1,T1),(SE2,T2)… , CS = < (SE_1, T_1), (SE_2, T_2), \ dots, CS = < (SE1, T1), (SE2, T2),… , (SEn,Tn)>(SE_{\rm n}, T_{\rm n})>(SEn,Tn)> is a series of ordered event sets, in which there are multiple simultaneous event sets, each of which forms an event sequence. For example, see the following Figure (Figure 2) :

  • Length and Size: The Length of event episode α\alphaα is defined as the number of events in the set, which is called KKk-episode; The magnitude of event episode α\alphaα is equal to the number of simultaneous event sets in that episode. For example: “(AB), (C) > < (AB), (C) > < (AB), (C) > is the size of 2 3 – episode came

  • Occurrence: Given a scenario, Occurrence occurs when 1) scenario α=<(SE1),(SE2)… , (SEm) > \ alpha = < (SE_1), (SE_2), \ dots, (SE_ {\} rm m) > alpha = < (SE1), (SE2),… ,(SEm)> occurs in the time period [Ts,TeT_{\rm s}, T_{\rm e}Ts,Te]; 2) The first simultaneous event set SE1SE_1SE1 of episode α\alphaα occurs at TsT_{\rm s}Ts time, and the last simultaneous event set SEmSE_{\rm m}SEm occurs at TeT_{\rm e}Te time, Then the plot α\alphaα is said to occur at the time interval [Ts,TeT_{\rm s}, T_{\rm e}Ts,Te]. Among them, all the time periods of the episode α\alphaα form the set occSet(α)occSet(\alpha)occSet(α). Such as: In Figure (Figure 2) occSet (< (AB), (C) >) = {[1, 2], [1, 3], [1, 6], [1, 7], [5, 6], [5, 7]} occSet (< (AB), (C) >) = \ {[1, 2], [1, 3], [1, 6]. [1, 7], [5, 6], [5, 7] \} occSet (< (AB), (C) >) = {[1, 2], [1, 3], [1, 6], [1, 7], [5, 6], [5, 7]}

  • Minimal occurrence: A given plot of alpha \ alpha alpha two interval [Ts, TeT_} {\ rm s, T_} {\ rm e Ts, Te], [Ts’ and ‘Te T_} {\ rm s’, T_ {\} rm e’ Ts’, ‘Te]. [Ts’, ‘Te T_} {\ rm s’, T_ {\} rm e’ Ts’ and ‘Te] is [Ts, TeT_} {\ rm s, T_} {\ rm e Ts, Te] a subset of the: When 1) the plot α\alphaα takes place in the time period [Ts,TeT_{\rm s}, T_{\rm e}Ts,Te]; 2) [Ts ‘, Te ‘T_{\rm s}’, T_{\rm e}’Ts ‘, Te’] no subset exists. The shortest occurrence period is denoted as Mo (α) Mo (\alpha) Mo (α). Among them, the set moSet(α)moSet(\alpha)moSet(α) is composed of all the shortest occurrence time periods of the plot α\alphaα. Such as: The plot “(AB), (C) > < (AB), (C) > < (AB), (C) > in the shortest time [1, 2], And moSet (< (AB), (C) >) = {[1, 2], [5, 6]} moSet (< (AB), (C) >) = \ {[1, 2], [5, 6] \} moSet (< (AB), (C) >) = {[1, 2], [5, 6]}

  • Support of an episode: SC(α)SC(\alpha)SC(α) is defined as the number of the shortest occurrence time periods in moSet(α)moSet(\alpha)moSet(α) (Ps. The support rate is defined as the ratio of SC(α)SC(\alpha)SC(α) in CS to the point in time

  • Frequent episodes: An episode is called a Frequent episode when its support level is no lower than the support threshold set by the user (minSupminSupminSup).

The paper also mentioned why a new Utility mining was designed to replace frequency mining: One is that the meaning of frequency is too narrow (e.g., diamond and bread), and the other is that it is difficult to talk about Simultaneous occurrence (defined as only one event occurring at the same time) based on frequency.

The Utility class

CS = < (tSE1, T1), (tSE2, T2),… , (tSEn, Tn) > CS = < (t_ {SE_1}, T_1), (t_ {SE_2}, T_2), \ dots, (t_ {SE_ {\ rm n}}, t_ n} {\ rm) > CS = < (tSE1, T1), (tSE2, T2),… (tSEn,Tn)>, where Ti∈N+T_{\rm I} \in N^+Ti∈N+, Every event Ei∈εE_{\rm I} \in \varepsilonEi∈ε is associated with profit P (Ei,CS)p(E_{\rm I}, CS)p(Ei,CS) (external utility) and quantity Q (Ei,Ti)q(E_{\rm I}, T_{\rm I})q(Ei,Ti) (internal utility). The following Table (Table 1) and Figure (Figure 3) respectively represent the external and internal utility of each event

  • Utility of an event at a time point At some point in time Ti ∈ N + T_ {I} \ rm \ N ^ + Ti ∈ N + in an event under EiE_ {I} \ rm utility of Ei is defined as the u (Ei, Ti) = p (Ei, CS) x q (Ei, Ti) u (E_ {I} \ rm, T_ {I} \ rm) = p (E_ {I} \ rm, CS) \ times q (E_ {I} \ rm, T_ {I} \ rm) u (Ei, Ti) = p (Ei, CS) x q (Ei, Ti) (the amount of profit * \ times *)

  • Utility of a simultaneous event set at a time point At a time point Ti∈N+T_{\rm I} \in N^+Ti∈N+ the event set SE=(E1,E2… En)SE =(E_1, E_2, \dots, E_{\rm n})SE=(E1,E2… , the effect of En) is defined as the u (SE, Ti) = ∑ j = 1 nu (Ej, Ti) u (SE, T_ {I} \ rm) = \ sum_ ^ {j = 1} {n} u (E_} {\ rm j, T_ {I} \ rm) u (SE, Ti) = ∑ j = 1 nu (Ej, Ti)

  • Total Utility of Database complex Event Sequence Based on the complex sequence of events CSCSCS total utility is defined as u nu (CS) = ∑ I = 1 (SEi, Ti) u (CS) = \ sum_ {I = 1} ^ {n} u (SE_ {I} \ rm, T_ {I} \ rm) u nu (CS) = ∑ I = 1 (SEi, Ti)

  • Utility Value of an episode W.R. Minimal occurrence: Mo (alpha) = / Ts, Te mo (\ alpha) = [T_} {\ rm s, T_} {\ rm e] mo (alpha) = [Ts, Te] is the plot of alpha = < (SE1, SE2,… , SEn) > \ alpha = < (SE_1 SE_2, \ dots, SE_ {\ rm n}) > alpha = < (SE1, SE2,… ,SEn)>, in which each parallel occurrence event set SEi∈αSE_{\rm I} \in \alphaSEi∈α is associated with a certain time point Ti∈N+T_{\rm I} \in N^+Ti∈N+. Happened about the shortest period of time the plot of the utility is defined as u (alpha, mo (alpha)) = ∑ I = 1 nu (SEi, Ti) u (\ alpha, mo (\ alpha)) = \ sum_ {I = 1} ^ {n} u (SE_ {I} \ rm, T_ {I} \ rm) u (alpha, mo (alpha)) = ∑ I = 1 nu (SEi, Ti)

  • Utility of an episode in a complex event sequence MoSet (alpha) = [TI1, TI2,…, TIn] moSet (\ alpha) = [T_ {\ rm I_1}, T_ {\ rm I_2}, \ dots, T_ {\ rm I_n}] moSet (alpha) = [TI1, TI2,…, TIn] is about the plot of alpha \ alpha alpha minimum set. The utility of scenario based on complex time series CSCSCS is defined as uv(α,CS)=∑ I =1nu(α,TIn) UV (\alpha, CS)= \sum_{I =1}^{n}u(\alpha, T_ {\ rm I_n}) uv (alpha, CS) = ∑ nu (alpha, TIn), I = 1 and u (alpha) = uv (alpha)/u (CS) u (\ alpha) = uv (\ alpha)/u (CS) u (alpha) = uv (alpha)/u (CS)

  • Maximum time duration: Let MTDMTDMTD be the maximum duration preset by the user, and mo(α)=[Ts,Te] Mo (\alpha) =[T_{\rm s}, T_{\rm e}] Mo (α)=[Ts,Te] is the shortest occurrence period of the episode α\alphaα. When (+ 1) Te – Ts MTD or less (T_} {\ rm e – T_} {\ rm s + 1) \ le MTD (+ 1) Te – Ts MTD or less, Mo (α) Mo (\alpha) Mo (α) is constrained (or satisfied) by MTDMTDMTD (ps.MTD plays the role of time window, because we must define a range to calculate, otherwise infinite time or infinitesimal is not good for us to study)

  • Simultaneous and serial concatenations: let α=<(SE1),(SE2)… , (SEx) > \ alpha = < (SE_1), (SE_2), \ dots, (SE_ {\ rm x}) >, alpha = < (SE1), (SE2),… , (SEx) >, beta = < (SE1 ‘), (SE2 ‘),… , (SEy ‘) > \ beta = < (SE_1 ‘), (SE_2), \ dots (SE_ {\ rm y} ‘) > beta = < (SE1 ‘), (SE2 ‘),… , (SEy ‘) >, The parallel connection between α\alphaα and β\betaβ is defined as simulsimul-concat (α,β)=<(SE1),(SE2),concat(\alpha, \ beta) = < (SE_1), (SE_2), concat (alpha, beta) = < (SE1), (SE2),… , (SEx ∪ SE1 ‘), (SE2 ‘), (SEy ‘) > \ dots, (SE_ {\ rm x} \ CPU SE_1 ‘), (SE_2), (SE_ {\ rm y} ‘) >… , (SEx ∪ SE1 ‘), (SE2 ‘), (SEy ‘) >, The serial connection between α\alphaα and β\betaβ is defined as serialserialserial-concat (α,β)=<(SE1),(SE2),concat(\alpha, \beta) =<(SE_1), (SE_2), concat (alpha, beta) = < (SE1), (SE2),… , (SEx), (SE1 ‘), (SE2 ‘),… , (SEy ‘) > \ dots, (SE_ {\ rm x}), (SE_1 ‘), (SE_2), \ dots (SE_ {\ rm y} ‘) >… , (SEx), (SE1 ‘), (SE2 ‘),… ,(SEy ‘)> (ps. both are extended itemsets, but the combination form is different)

  • High Utility Episode: If the utility value of EiE_{\rm I}Ei is u(Ei)≥minUtilu(E_{\rm I}) \ge minUtilu(Ei)≥minUtil, it is considered as pattern (HUE), otherwise it is not

  • Episode-Weighted Utilization of an Episode: MoSet (alpha) = [TI1, TI2,…, TIn] moSet (\ alpha) = [T_ {\ rm I_1}, T_ {\ rm I_2}, \ dots, T_ {\ rm I_n}] moSet (alpha) = [TI1, TI2,…, TIn], And TIi∈moSet(α)T_{\rm I_i} \in moSet(\alpha)TIi∈moSet(α) satisfies the MTDMTDMTD constraint. Then, based on complex event sequence CSCSCS, the plot α\alphaα utility weight EWU(α)=∑ I =1nEWU(α,TIi)/u(CS)EWU(\alpha) = \sum_{I =1}^{n}EWU(\alpha, T_{\rm I_i}) /u(CS) EWU(α)=∑ I =1nEWU(α,TIi)/u(CS). Such as: MTD = 3, alpha = < (A), (C) > MTD = 3, \ alpha = < (A), (C) > MTD = 3, alpha = < (A), (C) >, Then the EWU (< (A), (C) >) = [u ((AB), T1) + u ((BC), T2) + u ((C), T3)] + [u ((AB), T5) + u ((CD), T6) + u ((C), T7 has)] / u (CS) = 40/40 EWU (< (A), (C)>) = [u((AB), T_1) + u((BC), T_2) + u((C), T_3)] + [u((AB), T_5) + u((CD), T_6) + u((C), T_7)] / u(CS) = 40 / 40EWU(<(A),(C)>)=[u((AB),T1)+u((BC),T2)+u((C),T3)]+[u((AB),T5)+u((CD),T6)+u((C),T7)]/u(CS)=40/40

  • About Episode weight utility of an Episode W.R.T. a minimal occurrence: Mo (alpha) = / Ts, Te mo (\ alpha) = [T_} {\ rm s, T_} {\ rm e] mo (alpha) = [Ts, Te] is the plot of alpha = < (SE1), (SE2),… , (SEn) > \ alpha = < (SE_1), (SE_2), \ dots, (SE_ {\ rm n}) > alpha = < (SE1), (SE2),… ,(SEn)> the shortest occurrence period, where Mo (α) Mo (\alpha) Mo (α) is constrained by MTDMTDMTD. So in [Ts, TeT_} {\ rm s, T_} {\ rm e Ts, Te] under the circumstances of alpha \ alpha alpha plot weight utility EWU (alpha, mo (alpha)) = EWU (\ alpha, Mo (\ alpha)) = EWU (alpha, mo (alpha)) = ∑ I = 1 nu (SEi, Ti) + ∑ I = e (s + MTD – 1) u (tSEi, Ti) \ sum_ {I = 1} ^ {n} u (SE_ {I} \ rm, T_ {I} \ rm) + \ sum_} {I = e ^ {(s + MTD – 1)} u (T_ {SE_ {\ rm I}}, T_ {I} \ rm) nu ∑ I = 1 (SEi, Ti) + ∑ I = e (s + MTD – 1) u (tSEi, Ti), TSEit_ {SE_{\rm I}}tSEi is the time point TiT_{\rm I}Ti of the simultaneous event set in CSCSCS. Such as: Set MTD = 4, alpha = < (C), (A) >, mo (< (C), (A) >) = (3, 5) MTD = 4, \ alpha = < (C), (A) >, mo (< (C), (A) >) = (3, 5) MTD = 4, alpha = < (C), (A) >, mo () < (C), (A) > = (3, 5), The EWU (< > (A) (C), and (3, 5)) = [u (C), (T3)] + [u ((AB), T5) + u ((CD), T6)] = 25 EWU (< (C), (A) >, [3, 5]) = [u ((C), T_3)] + [u ((AB), T_5) + u ((CD), T_6)] = 25 ewu (< > (A) (C), and (3, 5)) = [u (C), (T3)] + [u ((AB), T5) + u ((CD), T6)] = 25

  • High Weighted Utilization episodes: When a scenario α\alphaα has EWU(α)≥minUtilEWU(\alpha) \ge minUtilEWU(α)≥minUtil, the scenario is called a high-weighted utility scenario (HWUE), which is the candidate event for the next definition

  • Promising Event: When EWU(e)≥minUtilEWU(e) \ge minUtilEWU(e)≥minUtil, the event EEE is a Promising event and its Epsiode Utility needs to be further calculated for accurate judgment. Otherwise unpromising events, when α\alphaα is a non-candidate event, their superset γ\gammaγ is inefficient for any event β\betaβ.

  • Episode-Weighted Price Closure Property (EWDC) : Suppose α\alphaα and β\betaβ are plots, And γ=simul\gamma =simul\ γ= simul-concat (α,β)concat(\alpha, \alpha)concat(α,β) or serialserialserial-concat (α,β)concat(\alpha, \beta)concat(α,β), When EWU(α)

strategy

Discarding Global unpromising Events(DGE)

Using the downward closure of the plot weight, those non-effective events are removed at the beginning to achieve the purpose of optimizing the algorithm.

Discarding Local unpromising Events(DLE)

The downward closure of plot weight is used to remove those non-effective events in the operation process (such as in the data set after pruning and sorting) to achieve the purpose of optimizing the algorithm

algorithm

The idea of the main algorithm is the old one: find a story of length 1, prefix it with this event and add other events to check if it can become a HEWU. It is important to extend one event at a time, and the time of the event should not be more than MTD from the time of the prefix

The UP – Span:

MiningSimultHUE:

Mainly responsible for mining the concurrent events of the current event set

MiningSerialHUE:

Mainly responsible for mining the current event set of continuous events

Personal summary

After learning high-utility Pattern Mining algorithm, it is not difficult to understand, and most definitions add the dimension of time. From the perspective of pseudocode, the algorithm is not very complex as a whole, and its thinking is roughly the same as that of Hui-Miner algorithm. However, there are few studies in this area and there is still a lot of room for expansion. This paper is the first to put forward the concept of efficient use term mining based on plot and implement it successfully. The drawback is that only taking the proposed EWU as the boundary value will inevitably be too vague and produce many candidate sets that are not needed originally. It is also necessary to use better pruning algorithms such as those used in FHM or to change the data storage structure to improve retrieval efficiency