SPADE is an early and famous algorithm in sequence mining. It provides a set of solutions for a series of pain points of sequence mining task itself, such as Lattice search technique, vertical ID-list and so on. Episode Utility, previously studied, also belongs to a branch of sequence mining. The difference lies in that SPADE considers that each sequence is independent of each other, while episode sequence is interrelated. Although the performance of this algorithm is not as good as the current, but its design ideas can still be learned

An Efficient Algorithm for Mining Frequent Sequences

introduce

A series of people’s behaviors can be regarded as a sequence (they can be independent or related to each other. For simplification, only independent relations are considered in this paper). Each sequence can be divided into several events. Different steps can be executed in serial or parallel. Each step can be further subdivided into specific actions (items), which are also in sequence (there is only one subject and no simultaneous actions by default). Since symbolization is adopted in the algorithm, the sorting rule between each item is set as lexicographic order. For convenience, SPADE algorithm defaults that multiple events will not occur simultaneously in the same sequence (different from episode), as shown in the figure below:

To solve the problem

  1. Dig frequent subsequences, such as D→BF→AD \rightarrow {BF} \rightarrow AD→BF→A to get BF{BF}BF;
  2. Find A span subsequence, such as D→BF→AD \rightarrow {BF} \rightarrow AD→BF→A to get D→BFD \rightarrow {BF}D→BF;
  3. The mining objects have high generality and can be directly applied to categorical domain, such as DNA sequence, text analysis, web behavior analysis and so on

strategy

Lattice theory

Set P={X,Y,Z}P = \{X, Y,Z \}P={X,Y,Z \}, the following partial order relation holds:

  • Reflexivity: X≤XX \le XX≤X
  • X=YX =Y \le X=Y \le X=Y \le X=Y \le X=Y
  • Transitivity: X≤YX \le YX≤Y and Y≤ZY \le ZY≤Z then X≤ZX \le ZX≤Z

With respect to the set S⊆PS \subseteq PS⊆P, union (\lor)/intersection (∧\land∧) for SSS could be proved with respect to the upper/lower bound (upper/lower bound) when XXX was the upper/lower bound of SSS. Minimum upper bound)/maximum lower bound; When any two elements in the set LLL have upper and lower bounds, the LLL is called a lattice, and the sub-lattice of LLL must also meet the above conditions. Further, when X∧YX \land YX∧Y/ X∨YX \lor YX∨Y represents an upper/lower bounded set, then LLL is called hyper-lattice.

Ps. It should be noted that in this paper, the object of the author’s study is supercase, but it is referred to as case, so please do not confuse it

Download-closure Property

In the absence of constraint, a sequence can produce an infinite number of subsequence combinations (that is, the lattice is infinitely large). Therefore, according to the actual situation, a sequence is limited to a maximum of CCC events and each event contains a maximum of TTT items, i.e. the longest sequence is C TC · TC⋅T. Suppose that both XXX and YYY of a given sequence are frequent, then the operation of their intersection is called meeting-semilattice, and similarly, the operation of the two phase union is called join-semilattice (Ps. Refer to the introduction to discrete mathematics for a detailed concept of “semilattice”). Therefore, when XXX and YYY are frequent, X∧YX \land YX∧Y is also frequent (maximum lower bound). Conversely X∨YX \ LOR YX∨Y is not valid (minimum upper bound). Thus, a very famous pruning strategy can be obtained. There is no infrequent subset in frequent sets. When a subset is detected to be infrequent, all sets containing this subset can be skipped in the subsequent detection process

Intersecting count

In this paper, a method to calculate the support degree is proposed. By constructing an ID list for each element, the two parameters (sequence’s ID, event’s ID) can directly locate the position of the modified element in the data set, as shown in the figure below:

The support degree of two elements to construct higher-order set can be obtained through intersection operation. Firstly, temporal Join based on sequence’s ID and then Equality Join based on event’s ID can be obtained. In this way, complete information of the occurrence position of the higher-order set can be obtained

Divide and conquer thoughts

According to the paper, sequence mining takes time and memory (the time complexity of sequence of length K is O(
n k n^k
), so a strategy is used to slice large tasks into individual smaller tasks. The algorithm adopts the idea of common prefix to process sets with the same prefix together, and the partition method can be determined by the length of common prefix, as shown in the figure below:

In this paper, the Depth First Search (DFS) and Breadth First Search (BFS) retrieval strategies are discussed by using this method. For example, BFS has the advantage of knowing more information in advance than DFS, which can assist the subsequent pruning process of generating high-order sets. Since DFS mines only on a single branch, the information stored in memory is consistent, which can save a lot of memory resources (Ps. At this point, most mining algorithms favor the DFS strategy.)

Pseudo code

It should be pointed out that SPADE algorithm does not adopt horizontal database which is treated like Apriori and FP-growth algorithm, but chooses Vertical database

Main algorithm

Searching space

Ps. How to avoid duplication of sets composed of different elements is involved here, please refer to the previousHUSRM algorithmSeveral construction ideas introduced in the notes

conclusion

Because the SPADE algorithm published early, so analysis code is not complicated, personally think that the beauty of this paper is that took a long space clear the principle of frequency mining, is a rare literature, as a beginner can study, such as the concept of lattice is the core of the mining algorithm, and in order to simplify the calculations, Depth-first traversal and breadth-first traversal methods are optimized to different degrees, and unnecessary nodes are cut out in advance. In addition, if you read the papers published in related fields in recent years, mining may not be from bottom to top, but top-down mining ideas based on the characteristics of data sets are also good in terms of performance.