Sequence learning has been one of the hottest topics in deep learning in recent years. From recommender systems to speech recognition to natural language processing, its potential seems endless, driving innovation and previously unseen solutions.

Sequence learning is implemented in various forms: Markov models and directed graphs in the machine learning domain, RNNs/LSTMs in the deep learning domain, and so on.



In this article, we will use a little-known algorithm called the Compact Prediction Tree (CPT) for sequence learning. This approach is surprisingly simple, and more powerful than well-known alternatives such as Markov models and directed graphs.

Before diving into this article, I recommend reading “Sequence Models you must Read (Side Cases),” in which the author, Tavish, introduces sequence models along with their typical use cases and application scenarios.

Contents of this article:

Introduction to sequence learning

Solution status

A preliminary study on CPT (Compact Prediction Tree)

Understanding the data structures in CPT

CPT for training and prediction

Modeling and forecasting

Endnotes
Introduction to sequence learning


Sequence learning is required when we need to predict which event is likely to occur after a given event. It is widely used across industries, for example:

Page prefetch: predicting the next page a user will visit so that it can be loaded in advance.

Product recommendation: predicting the next product a customer is likely to buy, based on the sequence of past purchases.

Prediction of clinical events: anticipating a patient's next clinical event from the sequence of events in their medical history.

Weather forecast: predicting the next state of the weather from the sequence of preceding observations.
Sequence learning has many other potential applications as well.

Solution Status

To explore more solutions in this area, we launched the Sequence Learning hackathon. The most popular approach was LSTMs/RNNs, which were used by most of the top 10 finishers on the private leaderboard.

LSTMs/RNNs have become a popular choice for sequence modeling, whether for text, audio, or other data. However, they have two basic problems:

They are expensive and time-consuming to train.

They tend to fail when a sequence contains items that never appeared in the training data.

A Preliminary Study on CPT(Compact Prediction Tree)

The Compact Prediction Tree (CPT) is a more accurate algorithm than traditional machine learning models (such as Markov models) and deep learning models (such as autoencoders).

The CPT algorithm is also unique in its fast training and prediction times: I was able to train on, and predict for, the hackathon sequence dataset mentioned above in 4 minutes.

Unfortunately, this algorithm has so far only been implemented in Java, so it hasn't caught on with data scientists (especially those using Python).

That is why I created a Python version of the library, based on the documentation published by the algorithm's developers. Their Java code was certainly helpful for understanding some parts covered in this article.

The Python library is public (https://github.com/analyticsvidhya/CPT), and you are welcome to contribute to it. The library is still incomplete, in the sense that it doesn't yet implement everything the algorithm's developers recommend, but it's already good enough to score 0.185 on the hackathon leaderboard, all in a matter of minutes. I believe that, when complete, the library should have performance comparable to RNNs/LSTMs.

In the next section, we will look at the inner workings of the CPT algorithm and see how it can perform better than traditional machine learning models such as Markov chains and directed graphs.

Understanding the data structures in CPT

As a prerequisite, you first need to understand the data format accepted by the Python CPT library. CPT takes two .csv files – a training file and a test file. The training file contains the training sequences, and the test file contains, for each sequence, the next three items to be predicted. For clarity, the sequences in the training and test files look like this:

1,2,3,4,5,6,7

5,6,3,6,3,4,5

.

.

Note that the sequences can have different lengths. Also note that one-hot encoding of the sequences is not required.
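To make the format concrete, here is one way such variable-length rows could be read with Python's standard csv module (the inline string stands in for a train.csv file; the real CPT loader may differ):

```python
import csv
import io

# Stand-in for the contents of a train.csv file with variable-length rows.
raw = "1,2,3,4,5,6,7\n5,6,3,6,3,4,5\n"

# csv.reader handles rows of different lengths; skip any blank rows.
sequences = [row for row in csv.reader(io.StringIO(raw)) if row]
# sequences[0] == ['1', '2', '3', '4', '5', '6', '7']
```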

The CPT algorithm uses three basic data structures, which are briefly described below.

1. The prediction tree

A prediction tree has many nodes, and each node has three elements:

Item – the data item stored in the node.

Children – a list of the node's child nodes.

Parent – a link to the node's parent node.
A prediction tree is basically a trie data structure that compresses the entire training data into a single tree. If you don't know how a trie works, consider how the following two sequences are stored.

Sequence 1:A, B, C

Sequence 2:A, B, D



The trie starts with the first element of sequence A,B,C, namely A, and adds it to the root node. Then B is added as a child of A, and C as a child of B. For each new sequence, the trie starts again from the root node and skips any element that has already been added to the structure; for sequence A,B,D, it reuses the existing A and B nodes and only adds D under B. This is how the prediction tree effectively compresses the training data.
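The insertion logic described above can be sketched in a few lines of Python. This is a minimal illustration of the trie idea, not the actual CPT library's classes; the names Node, add_child, and insert are my own:

```python
class Node:
    """One node of the prediction tree: an item, a parent link, and children."""
    def __init__(self, item=None, parent=None):
        self.item = item          # data item stored at this node
        self.parent = parent      # link back to the parent node
        self.children = {}        # item -> child Node

    def add_child(self, item):
        # Skip creation if the item is already a child (this is the compression).
        if item not in self.children:
            self.children[item] = Node(item, parent=self)
        return self.children[item]

def insert(root, sequence):
    """Insert one training sequence, returning its end node."""
    current = root
    for item in sequence:
        current = current.add_child(item)
    return current

root = Node()
insert(root, ['A', 'B', 'C'])
insert(root, ['A', 'B', 'D'])
# Both sequences share the path A -> B; only C and D branch below B.
```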

2. Inverted index (II)

An inverted index is a dictionary in which the keys are the data items in the training set and the values are the sets of sequences in which each item appears. For example:

Sequence 1:A,B,C,D

Sequence 2:B,C

Sequence 3:A,B

The inverted index for the above sequences is:

II = {

'A' : {'Seq1', 'Seq3'},

'B' : {'Seq1', 'Seq2', 'Seq3'},

'C' : {'Seq1', 'Seq2'},

'D' : {'Seq1'}

}
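Building such an index takes one pass over the training data. A short sketch, using a defaultdict of sets (the variable names mirror the example above and are not the CPT library's own):

```python
from collections import defaultdict

# The three example sequences, keyed by their IDs.
sequences = {
    'Seq1': ['A', 'B', 'C', 'D'],
    'Seq2': ['B', 'C'],
    'Seq3': ['A', 'B'],
}

II = defaultdict(set)
for seq_id, seq in sequences.items():
    for item in seq:
        II[item].add(seq_id)   # record every sequence the item appears in

# II['A'] == {'Seq1', 'Seq3'} and II['D'] == {'Seq1'}
```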

3. Lookup table (LT)

The lookup table is a dictionary whose keys are the sequence IDs and whose values are the end nodes of those sequences in the prediction tree. For example:

Sequence 1: A, B, C

Sequence 2: A, B, D

LT = {

'Seq1' : node(C),

'Seq2' : node(D)

}

CPT for training and prediction

Here is an example to help understand how the CPT algorithm trains and predicts. The sample training set is:

A,B,C

A,B

A,B,D,C

B,C

The training set has four sequences. Let's refer to them by the IDs seq 1 through seq 4. A, B, C, and D are the data items in the training set.

1. Training phase

In the training phase, the prediction tree, inversion index (II) and lookup table (LT) are simultaneously established. The entire training process is as follows.

Step 1: Insert A,B,C




We start with a root node and a current node that is initially set to the root.

We begin with A and check whether A exists as a child of the root node. If not, we add A to the root node's child list, add an entry for A to the inverted index with the value seq 1, and then move the current node to A.

We then look at the next item, B, and check whether B exists as a child of the current node A. If not, we add B to A's child list, add an entry for B to the inverted index with the value seq 1, and then move the current node to B.

We repeat this process until we have added the last element of seq 1. Finally, we add C, the last node of seq 1, to the lookup table with key = 'seq 1' and value = node(C).

Step 2: Insert A,B



Step 3: Insert A,B,D,C



Step 4: Insert B,C



Repeat this process until you have exhausted every row in the training data set (remember, a row represents a single sequence). Now that we have all the necessary data structures in place, we can start making predictions against our test data set.
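The whole training phase can be sketched as a single pass that builds all three data structures at once. This is an illustrative reimplementation under my own names (Node, train), not the CPT library's code:

```python
from collections import defaultdict

class Node:
    """Prediction-tree node: item, parent link, and a dict of children."""
    def __init__(self, item=None, parent=None):
        self.item, self.parent, self.children = item, parent, {}

def train(sequences):
    """Build the prediction tree, inverted index (II) and lookup table (LT)."""
    root, II, LT = Node(), defaultdict(set), {}
    for seq_id, seq in enumerate(sequences, start=1):
        current = root
        for item in seq:
            if item not in current.children:            # skip existing nodes
                current.children[item] = Node(item, parent=current)
            current = current.children[item]
            II[item].add(seq_id)                        # inverted-index entry
        LT[seq_id] = current                            # end node of the sequence
    return root, II, LT

# The four training sequences from the example above.
root, II, LT = train([['A', 'B', 'C'], ['A', 'B'], ['A', 'B', 'D', 'C'], ['B', 'C']])
```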

2. Prediction stage

In the prediction phase, the algorithm iterates over each sequence in the test set. For a single row, we first use the inverted index (II) to find the sequences similar to that row. Then we find the subsequent items of those similar sequences, add them to a count dictionary, and score them. Finally, the item with the highest score in the count dictionary is returned as the prediction. Each step is detailed below.

Target sequence: A,B

Step 1: Find a sequence that is similar to the target sequence.

We find the sequences similar to the target sequence using the inverted index, as follows:

For each unique item in the target sequence, look up the set of sequences in which that item appears.

Take the intersection of all these sets; the result is the set of sequences similar to the target.

For example:

Set of sequences containing item A = {'Seq1', 'Seq2', 'Seq3'}

Set of sequences containing item B = {'Seq1', 'Seq2', 'Seq3', 'Seq4'}

Sequences similar to the target sequence = intersection of the two = {'Seq1', 'Seq2', 'Seq3'}
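In Python, this step is a set intersection over the inverted-index entries. A sketch using the example values above (the II dictionary here is illustrative):

```python
# Illustrative inverted-index entries from the example above.
II = {
    'A': {'Seq1', 'Seq2', 'Seq3'},
    'B': {'Seq1', 'Seq2', 'Seq3', 'Seq4'},
}

target = ['A', 'B']

# Intersect the sequence sets of every item in the target sequence.
similar = set.intersection(*(II[item] for item in set(target)))
# similar == {'Seq1', 'Seq2', 'Seq3'}
```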

Step 2: Find subsequent sequences that are similar to the target sequence

For each similar sequence, the subsequent sequence is defined as the subsequence that follows the last occurrence, in the similar sequence, of the last item of the target sequence, minus any items that already appear in the target sequence.

Note: This is a little different from what the developers describe in their paper, but my approach seems to work better than theirs.

To understand this, consider the following example:

Target sequence = ['A', 'B', 'C']

Similar sequence = ['X', 'A', 'Y', 'B', 'C', 'E', 'A', 'F']

Last item of the target sequence = 'C'

Subsequence after the occurrence of 'C' in the similar sequence = ['E', 'A', 'F']

Subsequent sequence = ['E', 'F']
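This extraction can be written as a small helper. A sketch of my reading of the rule (the function name consequent is my own, not the library's):

```python
def consequent(similar_seq, target_seq):
    """Items after the last occurrence of the target's final item,
    minus items already present in the target sequence."""
    last = target_seq[-1]
    # Index of the last occurrence of the target's final item.
    cut = len(similar_seq) - 1 - similar_seq[::-1].index(last)
    tail = similar_seq[cut + 1:]
    return [item for item in tail if item not in target_seq]

target = ['A', 'B', 'C']
similar = ['X', 'A', 'Y', 'B', 'C', 'E', 'A', 'F']
# consequent(similar, target) == ['E', 'F']
```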

Step 3: Add the corresponding items to the Count dictionary, along with their scores.

The subsequent items of each similar sequence are added to a count dictionary along with their scores. Continuing the example above, the scores for the subsequent items ['E', 'F'] are calculated as follows:

Initially, the count dictionary is empty: {}.

If the item is not already in the count dictionary:

score = 1 + (1 / number of similar sequences) + (1 / (number of items currently in the count dictionary + 1)) * 0.001

Otherwise:

score = (1 + (1 / number of similar sequences) + (1 / (number of items currently in the count dictionary + 1)) * 0.001) * old score

So for element E, the first term that follows, the score is zero

Score [E] = 1 + (1/1) +1 /(0+1)*0.001 = 2.001

Score [F]=1 + (1/1) +1 /(1+1)*0.001 = 2.0005

After the above calculations, the count dictionary is:

Count dictionary = {'E' : 2.001, 'F' : 2.0005}
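The scoring rule above, reproduced in code. This is my sketch of the formula, with illustrative names (update_score, counttable):

```python
def update_score(counttable, item, n_similar):
    """Apply the CPT-style scoring rule for one subsequent item."""
    weight = 1 + (1 / n_similar) + (1 / (len(counttable) + 1)) * 0.001
    if item not in counttable:
        counttable[item] = weight          # first time this item is seen
    else:
        counttable[item] *= weight         # multiply into the old score
    return counttable

counttable = {}
# The subsequent items of the single similar sequence in the example.
for item in ['E', 'F']:
    update_score(counttable, item, n_similar=1)
# counttable == {'E': 2.001, 'F': 2.0005}
```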

Step 4: Make a prediction using the values of the count dictionary

Finally, the key with the highest value in the count dictionary is returned as the predicted value. In the example above, E is returned as a predicted value.
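Picking the winner is a single max over the count dictionary, continuing the worked example:

```python
# Count dictionary from the worked example above.
counttable = {'E': 2.001, 'F': 2.0005}

# The predicted item is the key with the highest score.
prediction = max(counttable, key=counttable.get)
# prediction == 'E'
```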

Modeling and forecasting

Step 1: Clone the GitHub repository.

git clone https://github.com/NeerajSarwan/CPT.git

Step 2: Read the .csv files, train the model, and make predictions using the following code.

#Importing everything from the CPT file

from CPT import *

#Creating an object of the CPT class

model = CPT()

#Reading the train and test files and converting them to data and target

data, target = model.load_files("./data/train.csv", "./data/test.csv")

#Training the model

model.train(data)

#Making predictions on the test dataset

predictions = model.predict(data, target, 5, 1)

Endnotes

In this article, we introduced an efficient and accurate sequence learning algorithm called the Compact Prediction Tree. I encourage you to try it on the hackathon's sequence dataset, and good luck climbing the private leaderboard!

If you want to contribute to the CPT library, feel free to raise issues. If you know of any other ways to perform sequence learning, share them in the comments section below. And don't forget to star the CPT library.


The original article was published at: 2018-05-11

Author: NSS

This article is from “Data THU”, a partner of the cloud community. For more information, follow “Data THU”.