• Extreme Rare Event Classification using Autoencoders in Keras
  • By Chitta Ranjan
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: ccJia
  • Proofread by: LSvih

In this article, we will learn how to build a rare event classifier using an autoencoder. We will use a real-world rare event dataset from [1].

Background

What is an extremely rare event?

In a rare event problem, we are dealing with an imbalanced dataset: there are far fewer positive samples than negative samples. In a typical rare event problem, positive samples make up about 5-10% of the data. In an extremely rare event problem, positive samples account for less than 1%. For example, in the dataset we use here, only about 0.6% of the samples are positive.

Such extremely rare events are common in the real world. Examples include paper breaks and machine failures in factories, or clicks and purchases in online retail.

Classifying these rare events can be challenging. Recently, deep learning has been widely applied to classification problems, but the small number of positive samples limits its application: no matter how large the dataset is, the amount of positive data constrains how well deep learning can perform.

Why bother using deep learning?

It’s a fair question. Why don’t we think about using other machine learning methods?

The answer is subjective. We could use a machine learning approach. To make it work, we could undersample the negative (majority) class so that the data is close to balanced. Since positive samples are only 0.6% of the data, the balanced dataset would be roughly 1% of the size of the original. Machine learning methods such as SVMs or random forests can work with this amount of data. However, their accuracy would be limited, because we would not be using the other ~99% of the data.
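For reference, here is a minimal sketch of such majority-class undersampling with pandas. The helper undersample below is hypothetical and not part of the article's code.

import pandas as pd

def undersample(df, label_col='y', ratio=1.0, seed=123):
    # Keep all positive rows and a random subset of negative rows
    # of size ratio * (number of positives), then shuffle.
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(n=int(len(pos) * ratio), random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

With ratio=1.0 this produces a roughly balanced dataset, which for 0.6% positives is only about 1% of the original size, as noted above.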

Deep learning performs better when there is enough data, and it also offers the flexibility to improve the model through different architectures. So we are going to try deep learning.


In this post, we will learn how to build a rare event classifier using a simple fully connected autoencoder. The purpose of this post is to demonstrate an autoencoder implementation of an extremely rare event classifier. We leave exploring different autoencoder architectures and configurations to the reader. If you find anything interesting, please share it with us.

Autoencoder for classification

The way an autoencoder handles a classification task is similar to anomaly detection. In anomaly detection, we learn the pattern of a normal process; anything that does not follow this pattern is considered an anomaly. For a binary classification of rare events, we can use an autoencoder in a similar way (read more in [2]).

Quick review: What is an autoencoder?

  • An autoencoder consists of an encoder and a decoder.
  • The encoder learns the underlying features of the process. These features are usually represented in a reduced number of dimensions.
  • The decoder can reconstruct the original data from these underlying features.

How do we use an autoencoder to classify rare events?

  • We first divide the data into two parts: positively labeled and negatively labeled samples.
  • The negatively labeled data is treated as the normal state of the process. The normal state is when the process is event-free.
  • We ignore the positively labeled data and train the autoencoder only on the negatively labeled data.
  • The autoencoder now learns the features of the normal process.
  • A well-trained autoencoder will reconstruct any data coming from the normal state well (because it has the same pattern and distribution).
  • Therefore, the reconstruction error will be small.
  • However, if we try to reconstruct data from a rare event, the autoencoder will struggle.
  • This makes the reconstruction error high when reconstructing rare events.
  • We can catch such high reconstruction errors and label them as rare events (a short sketch of this idea follows this list).
  • This procedure is similar to anomaly detection.
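A schematic sketch of the idea is below. The names x, threshold, and autoencoder are placeholders here; the actual implementation appears later in the post.

# The autoencoder is trained only on normal (label 0) data beforehand.
# Rows whose reconstruction error exceeds a chosen threshold are flagged as rare events.
reconstructions = autoencoder.predict(x)
reconstruction_error = np.mean(np.power(x - reconstructions, 2), axis=1)
predicted_label = (reconstruction_error > threshold).astype(int)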

Implementation

Data and problem

This is binary labeled data on paper breaks from a paper mill. Paper breaks are a serious problem in paper manufacturing: a single break can cause losses of several thousand dollars, and a mill typically sees at least one break every day. This adds up to millions of dollars of losses every year and puts jobs at risk.

Because of the nature of the process, detecting break events is challenging. As mentioned in [1], even a 5% reduction in breaks would bring significant benefit to the mill.

The data contains about 18k rows collected over 15 days. The column ‘y’ holds the binary label, with 1 denoting a break. The other columns are predictors. There are 124 positive samples (~0.6%).

Download the data from [here](docs.google.com/forms/d/e/1…).

Code

Import the desired libraries.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
from pylab import rcParams

import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.metrics import recall_score, classification_report, auc, roc_curve
from sklearn.metrics import precision_recall_fscore_support, f1_score

from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)

SEED = 123 #used to help randomly select the data points
DATA_SPLIT_PCT = 0.2

rcParams['figure.figsize'] = 8, 6
LABELS = ["Normal", "Break"]

Note that we set the random number seed for reproducible results.

Data preprocessing

Now let’s read and prepare the data.

df = pd.read_csv("data/processminer-rare-event-mts - data.csv")
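As a quick sanity check on the class balance after loading (optional; the counts quoted earlier come from the article's description of the data):

print(df.shape)
print(df['y'].value_counts())   # the 1s should be ~0.6% of the rows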

The purpose of this rare event problem is to predict a break before it occurs. We will try to predict the break up to 4 minutes in advance. To build such a model, we shift the labels 2 rows up (corresponding to 4 minutes). We could do this simply as df.y = df.y.shift(-2). However, for this problem we want the shifting to work as follows: if row n is positively labeled,

  • Make rows (n-2) and (n-1) equal to 1. This helps the classifier learn to predict up to 4 minutes in advance.

  • Delete row n, because we do not want the classifier to learn to predict a break that is already happening.

We will write the following UDF, curve_shift, to perform this label shifting.

sign = lambda x: (1, -1)[x < 0]

def curve_shift(df, shift_by):
    '''
    This function shifts the binary labels in the data.
    For example, if shift_by = -2, the following happens:
    if row n is labeled 1, then
    - make rows (n+shift_by):(n+shift_by-1) = 1
    - remove row n.
    In other words, the labels are moved up by 2 rows.

    Inputs:
    df        A pandas dataframe with a binary label column named 'y'.
    shift_by  An integer denoting the number of rows to shift.

    Output:
    df        The dataframe with the labels shifted by shift_by.
    '''

    vector = df['y'].copy()
    for s in range(abs(shift_by)):
        tmp = vector.shift(sign(shift_by))
        tmp = tmp.fillna(0)
        vector += tmp
    labelcol = 'y'
    # Add vector to the df
    df.insert(loc=0, column=labelcol+'tmp', value=vector)
    # Remove the rows with labelcol == 1
    df = df.drop(df[df[labelcol] == 1].index)
    # Drop labelcol and rename the tmp column as labelcol
    df = df.drop(labelcol, axis=1)
    df = df.rename(columns={labelcol+'tmp': labelcol})
    # Make the labels binary
    df.loc[df[labelcol] > 0, labelcol] = 1

    return df
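A short usage sketch, assuming the dataframe df read above and the 2-row (4-minute) shift described earlier:

# Shift the labels up by 2 rows (4 minutes) and drop the break rows themselves.
df = curve_shift(df, shift_by=-2)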

Now, we divide the data into training sets, validation sets, and test sets. We will then train the autoencoder using only the subset labeled 0.

df_train, df_test = train_test_split(df, test_size=DATA_SPLIT_PCT, random_state=SEED)
df_train, df_valid = train_test_split(df_train, test_size=DATA_SPLIT_PCT, random_state=SEED)

df_train_0 = df_train.loc[df_train['y'] == 0]
df_train_1 = df_train.loc[df_train['y'] == 1]
df_train_0_x = df_train_0.drop(['y'], axis=1)
df_train_1_x = df_train_1.drop(['y'], axis=1)

df_valid_0 = df_valid.loc[df_valid['y'] == 0]
df_valid_1 = df_valid.loc[df_valid['y'] == 1]
df_valid_0_x = df_valid_0.drop(['y'], axis=1)
df_valid_1_x = df_valid_1.drop(['y'], axis=1)

df_test_0 = df_test.loc[df_test['y'] == 0]
df_test_1 = df_test.loc[df_test['y'] == 1]
df_test_0_x = df_test_0.drop(['y'], axis=1)
df_test_1_x = df_test_1.drop(['y'], axis=1)

Standardization

For autoencoders, it is usually best to use standardized data (zero mean and unit variance).

scaler = StandardScaler().fit(df_train_0_x)
df_train_0_x_rescaled = scaler.transform(df_train_0_x)
df_valid_0_x_rescaled = scaler.transform(df_valid_0_x)
df_valid_x_rescaled = scaler.transform(df_valid.drop(['y'], axis=1))

df_test_0_x_rescaled = scaler.transform(df_test_0_x)
df_test_x_rescaled = scaler.transform(df_test.drop(['y'], axis=1))

Autoencoder classifier

Initialization

First, we will initialize the autoencoder architecture. We build just a simple autoencoder; more complex architectures and configurations are left for the reader to explore.

nb_epoch = 100
batch_size = 128
input_dim = df_train_0_x_rescaled.shape[1] #num of predictor variables, 
encoding_dim = 32
hidden_dim = int(encoding_dim / 2)
learning_rate = 1e-3

input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="tanh", activity_regularizer=regularizers.l1(learning_rate))(input_layer)
encoder = Dense(hidden_dim, activation="relu")(encoder)
decoder = Dense(hidden_dim, activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
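To sanity-check the architecture before training, we can print a layer summary (this line is optional and not part of the original code).

autoencoder.summary()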

Training

We will train the model and save it to a file. Saving a trained model saves time for future analysis.

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='adam')

cp = ModelCheckpoint(filepath="autoencoder_classifier.h5",
                               save_best_only=True,
                               verbose=0)

tb = TensorBoard(log_dir='./logs',
                histogram_freq=0,
                write_graph=True,
                write_images=True)

history = autoencoder.fit(df_train_0_x_rescaled, df_train_0_x_rescaled,
                    epochs=nb_epoch,
                    batch_size=batch_size,
                    shuffle=True,
                    validation_data=(df_valid_0_x_rescaled, df_valid_0_x_rescaled),
                    verbose=1,
                    callbacks=[cp, tb]).history
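Since load_model is imported above, the checkpointed model can be reloaded in a later session. A minimal sketch:

# Reload the best model saved by ModelCheckpoint (best validation loss by default).
autoencoder = load_model('autoencoder_classifier.h5')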

Classification

Next, we will show how to use the autoencoder's reconstruction error to classify rare events.

As mentioned earlier, if the reconstruction error is high, we classify the row as a break. We need to determine a threshold for this.

We use the validation set to set the threshold.

valid_x_predictions = autoencoder.predict(df_valid_x_rescaled)
mse = np.mean(np.power(df_valid_x_rescaled - valid_x_predictions, 2), axis=1)
error_df = pd.DataFrame({'Reconstruction_error': mse,
                        'True_class': df_valid['y']})

precision_rt, recall_rt, threshold_rt = precision_recall_curve(error_df.True_class, error_df.Reconstruction_error)
plt.plot(threshold_rt, precision_rt[1:], label="Precision",linewidth=5)
plt.plot(threshold_rt, recall_rt[1:], label="Recall",linewidth=5)
plt.title('Precision and recall for different threshold values')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.legend()
plt.show()
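The article picks the threshold by inspecting this plot (0.85 is used below). As an alternative, here is one possible way, not from the original article, to pick a threshold programmatically from the validation precision-recall curve, for example by maximizing F1:

# precision_rt and recall_rt have one more element than threshold_rt,
# so drop their first entry to align them with the thresholds.
f1 = 2 * precision_rt[1:] * recall_rt[1:] / (precision_rt[1:] + recall_rt[1:] + 1e-12)
candidate_threshold = threshold_rt[np.argmax(f1)]
print('Candidate threshold (max F1 on validation):', candidate_threshold)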

Now, we will classify the test data.

We should not estimate classification thresholds based on test data. This leads to overfitting.

test_x_predictions = autoencoder.predict(df_test_x_rescaled)
mse = np.mean(np.power(df_test_x_rescaled - test_x_predictions, 2), axis=1)
error_df_test = pd.DataFrame({'Reconstruction_error': mse,
                        'True_class': df_test['y']})
error_df_test = error_df_test.reset_index()

threshold_fixed = 0.85
groups = error_df_test.groupby('True_class')

fig, ax = plt.subplots()

for name, group in groups:
    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle=' ',
            label= "Break" if name == 1 else "Normal")
ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show();

In the plot above, the orange and blue points above the threshold line represent true positives and false positives, respectively. As we can see, there are many false positives. To understand this better, let's look at the confusion matrix.

pred_y = [1 if e > threshold_fixed else 0 for e in error_df_test.Reconstruction_error.values]

conf_matrix = confusion_matrix(error_df_test.True_class, pred_y)

plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

We can predict 9 out of 32 breaks. Note that these predictions are made 2 to 4 minutes in advance. That is a recall of about 28%, which is a good recall rate for the paper industry. The false positive rate is about 6.3%. It is not perfect, but it is not bad for the mill either.
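For reference, a small sketch of how these rates can be computed from the confusion matrix above (sklearn's binary confusion matrix is laid out as [[TN, FP], [FN, TP]]):

tn, fp, fn, tp = conf_matrix.ravel()
recall = tp / float(tp + fn)            # fraction of breaks detected
false_pos_rate = fp / float(fp + tn)    # fraction of normal rows flagged as breaks
print('Recall: %.3f, False positive rate: %.3f' % (recall, false_pos_rate))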

The model can be improved further to increase the recall rate while keeping the false positive rate small. We will look at the AUC below and then discuss the next approach for improvement.

ROC curve and AUC

false_pos_rate, true_pos_rate, thresholds = roc_curve(error_df_test.True_class, error_df_test.Reconstruction_error)
roc_auc = auc(false_pos_rate, true_pos_rate)

plt.plot(false_pos_rate, true_pos_rate, linewidth=5, label='AUC = %0.3f' % roc_auc)
plt.plot([0, 1], [0, 1], linewidth=5)

plt.xlim([-0.01, 1])
plt.ylim([0, 1.01])
plt.legend(loc='lower right')
plt.title('Receiver operating characteristic curve (ROC)')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The AUC is 0.624.

GitHub repository

The annotated code is available in the GitHub repository cran2367/autoencoder_classifier ("Autoencoder model for rare event classification"); see [3].

What could be better?

This is (multivariate) time series data, but we did not take the temporal information/patterns in the data into account. In the next post we will explore whether we can combine this approach with an RNN. We will try an LSTM autoencoder.

Conclusion

We studied how to use an autoencoder to build a rare event classifier on binary labeled data from an extremely rare event problem in a paper mill. We achieved reasonable accuracy. Our goal was to demonstrate the basic application of autoencoders to rare event classification. We will later try other methods, including an LSTM autoencoder that can exploit the temporal features of the data for better results.

Read next: LSTM Autoencoder for Rare Event Classification.

References

  1. Ranjan, C., Mustonen, M., Paynabar, K., & Pourak, K. (2018). Dataset: Rare Event Classification in Multivariate Time Series. arXiv preprint arXiv:1809.10717.
  2. www.datascience.com/blog/fraud-…
  3. GitHub repo: github.com/cran2367/au…
