I. Introduction to anomaly detection
Anomaly detection uses data mining methods to find data that is inconsistent with the distribution of the data set; it is also known as outlier detection or deviation detection.
1.1 Scenarios where anomaly detection applies
Anomaly detection algorithms are suitable in the following scenarios:
(1) The data has no labels, or the classes are highly imbalanced;
(2) Abnormal data differ markedly from the majority of the samples;
(3) Abnormal data make up a very small proportion of the overall sample.
Common application cases include:
- Finance: identifying fraudulent users from financial data, such as credit card application fraud, stolen credit cards, and loan fraud;
- Security: judging whether fluctuations in traffic data indicate an attack;
- E-commerce: identifying "malicious buyers" from transaction data, such as coupon-abuse rings and fake-order/review gangs;
- Ecological disaster warning: judging possible future extreme weather from weather indicator data;
- Medical monitoring: detecting abnormal readings in medical device data that may indicate a disease state.
1.2 Challenges in anomaly detection
Anomaly detection is a popular research field, but because anomalies are unpredictable, heterogeneous, rare, and diverse, the field still faces many challenges:
- 1) One of the hardest problems is achieving a high recall rate: because anomalies are so rare and heterogeneous, it is difficult to identify all of them.
- 2) To improve the precision of an anomaly detection model, it usually has to be deeply combined with the business context; otherwise the results are poor, and the model can easily introduce algorithmic bias against minority groups.
II. Anomaly detection methods
According to whether the training set contains outliers, the task can be divided into outlier detection and novelty detection; a representative novelty detection method is One-Class SVM.
According to the type of anomaly, anomaly detection can be divided into point anomaly detection (e.g., an abnormal consumer), contextual anomaly detection (e.g., anomalies in a time series), and collective (group) anomaly detection (e.g., an abnormal gang).
According to the learning paradigm, anomaly detection can be divided into supervised, semi-supervised, and unsupervised anomaly detection. Because labeled abnormal samples are difficult to collect, unsupervised anomaly detection is the most widely used.
By algorithm, unsupervised anomaly detection can be roughly divided into the following categories:
2.1 Clustering-based methods
Clustering-based anomaly detection methods usually rely on the following assumptions:
1) Normal data instances belong to a cluster in the data, while abnormal instances do not belong to any cluster;
2) Normal data instances are close to their nearest cluster centroid, while abnormal data are far from their nearest cluster centroid;
3) Normal data instances belong to large, dense clusters, while abnormal instances belong to small or sparse clusters.
By grouping the data into clusters, outliers are the points that belong to small clusters, belong to no cluster, or lie far from the cluster centers.
- Treating data far from the cluster centers as outliers:
Representative methods include SOM, K-means, Expectation Maximization (EM), and algorithms based on the semantic anomaly factor.
- Treating small clusters produced by clustering as outliers:
Representative methods include K-means clustering.
- Treating points that belong to no cluster as outliers:
Representative methods include DBSCAN, ROCK, and SNN clustering. (A minimal sketch of these ideas follows.)
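To make the clustering-based ideas above concrete, here is a minimal sketch (not part of the original text; the synthetic data and parameters are illustrative assumptions) that scores points by their distance to the nearest K-means centroid and, separately, uses DBSCAN's noise label to mark points that belong to no cluster:
# Minimal sketch: clustering-based outlier scoring (illustrative parameters)
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (500, 2)),      # dense "normal" cluster
               rng.normal(8, 1, (500, 2)),      # second normal cluster
               rng.uniform(-10, 20, (10, 2))])  # a few scattered outliers

# 1) Distance to the nearest K-means centroid as an outlier score
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = kmeans.transform(X).min(axis=1)
top_outliers = np.argsort(dist_to_centroid)[-10:]   # the 10 farthest points

# 2) DBSCAN: points labeled -1 belong to no cluster and are treated as outliers
noise_mask = DBSCAN(eps=0.8, min_samples=10).fit_predict(X) == -1
print(top_outliers, noise_mask.sum())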
2.2 Statistics-based methods
Statistical methods rely on the assumption that the data set follows a certain distribution (e.g., normal, Poisson, or binomial) or probability model, and detect anomalies by judging whether a data point conforms to that distribution/model (i.e., by identifying low-probability events). Depending on how the probability model is obtained, they can be divided into:
- Parametric methods estimate the parameters of a model (e.g., a Gaussian model) from data assumed to follow a known distribution. The simplest parametric anomaly detection model assumes the samples follow a univariate normal distribution: when a data point deviates from the mean by more than two or three standard deviations, it is considered abnormal. (A minimal sketch follows this list.)
- Non-parametric methods are used when the data distribution is unknown. A histogram can be built from the training set, and a point is flagged as anomalous if it does not fall within the histogram. Measures of dispersion (mean deviation, standard deviation, coefficient of variation, interquartile range, etc.) can also be used to find outliers in the data.
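A minimal sketch of the parametric idea above (the univariate 3-sigma rule; the synthetic data and the 3-standard-deviation threshold are illustrative assumptions):
# Minimal sketch: univariate 3-sigma anomaly detection (illustrative threshold)
import numpy as np

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(10, 2, 1000), [30, -5]])  # two injected outliers

mu, sigma = x.mean(), x.std()
z_scores = np.abs(x - mu) / sigma
outlier_idx = np.where(z_scores > 3)[0]   # flag points beyond 3 standard deviations
print(outlier_idx, x[outlier_idx])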
2.3 Depth-based methods
This approach organizes the data into nested layers in k-dimensional space and assumes that outliers lie in the outer layers (shallow depth), while normal data points lie near the center of the structure (greater depth).
- The ISODEPTH method computes the depth of every point and judges abnormal points by their depth values.
- The Minimum Volume Ellipsoid (MVE) estimator fits, from the probability distribution of the majority of the data points (usually > 50%), the boundary of the smallest ellipsoid enclosing that majority; data points outside this boundary are judged to be outliers.
- Isolation Forest. The time complexity of the two depth-based models above grows exponentially with the feature dimension k, so they are generally applicable only when k ≤ 3; by changing the way depth is computed, Isolation Forest can also be applied to high-dimensional data.
The Isolation Forest algorithm is an ensemble-based anomaly detection method with linear time complexity. Because it is accurate and fast on large data sets, it is widely used in industry. The basic idea is to randomly partition the sample space with tree models: points in high-density regions need many cuts before they are isolated (i.e., before every point sits in its own subspace), while sparsely distributed points (i.e., outliers) usually end up alone in a subspace very early. The algorithm steps are as follows:
1) Randomly select ψ samples from the training data to train a single tree.
2) Randomly pick a dimension (attribute) q and randomly generate a cut point p within the current node's data, between the minimum and maximum values of dimension q in that node.
3) The cut point defines a hyperplane that splits the current node's data space into two subspaces: points whose value in dimension q is less than p go to the left branch, and points greater than or equal to p go to the right branch.
4) Recurse steps 2 and 3 on the left and right branches, constructing new nodes until a leaf contains only one data point (it cannot be cut further) or the tree reaches the preset height. (A maximum tree height is set because abnormal records are few and have short paths; we only need to separate normal records from abnormal ones, so only the part below the average height matters, which makes the algorithm more efficient.)
5) Because the cutting of the feature space in each tree is completely random, an ensemble is needed for the result to converge: several trees are built and their results are averaged. For each sample x, the overall anomaly score s is computed as s(x, ψ) = 2^(−E(h(x)) / c(ψ)), where h(x) is the path length (depth) of x in a tree, E(h(x)) is its average over all trees, and c(ψ) is the average path length for ψ samples, used to normalize h(x).
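A minimal sketch using scikit-learn's IsolationForest (not part of the original project; the synthetic data and parameters such as contamination are illustrative assumptions):
# Minimal sketch: Isolation Forest with scikit-learn (illustrative parameters)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (1000, 5)),     # normal samples
               rng.uniform(-6, 6, (20, 5))])    # scattered outliers

iso = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)              # 1 = normal, -1 = outlier
scores = -iso.decision_function(X)   # higher score = more anomalous
print((labels == -1).sum(), scores[:5])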
2.4 Classification-based methods
The representative method is One-Class SVM. Its principle is to find a hyperplane that encloses the positive examples in the sample; prediction then uses this boundary to make decisions, and samples inside it are considered positive (normal). Because kernel computation is time-consuming, it is rarely used on massive data.
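A minimal sketch of One-Class SVM novelty detection with scikit-learn (the synthetic data and parameters such as nu are illustrative assumptions):
# Minimal sketch: One-Class SVM novelty detection (illustrative parameters)
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (500, 2))            # train only on "normal" data
X_new = np.vstack([rng.normal(0, 1, (20, 2)),   # new normal points
                   rng.uniform(4, 6, (5, 2))])  # novel (abnormal) points

ocsvm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)
pred = ocsvm.predict(X_new)   # 1 = inlier, -1 = outlier/novelty
print(pred)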
2.5 Proximity-based methods
The assumption is that normal data instances lie in dense neighborhoods, while the neighborhoods of abnormal instances are sparse. These methods can be further subdivided into density-based and distance-based approaches:
- Density-based methods compute the density of each region of the data set and treat low-density regions as outlier regions. The classic method is the Local Outlier Factor (LOF). Unlike the traditional definition of an outlier, LOF defines outliers locally: each data point is assigned a LOF value that describes its neighborhood; the larger the LOF, the lower the neighborhood density and the more likely the point is an outlier. However, the minimum neighborhood size is difficult to choose, and the computational and time complexity grow with the data dimension. (A minimal sketch follows this list.)
- Distance-based methods detect anomalies by comparing each data point with its nearest-neighbor set: normal points are similar to their nearest neighbors, while abnormal points differ from them.
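A minimal sketch of the density-based idea using scikit-learn's LocalOutlierFactor (the synthetic data and parameters are illustrative assumptions):
# Minimal sketch: Local Outlier Factor (illustrative parameters)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (300, 2)),
               rng.uniform(-4, 4, (10, 2))])    # a few sparse outliers

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
labels = lof.fit_predict(X)                     # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_      # larger value = more abnormal
print((labels == -1).sum(), lof_scores.max())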
2.6 Deviation-based methods
Given a data set, deviation-based methods find the points that are inconsistent with the characteristics of the whole data set: removing these outliers reduces the variance of the set. These methods include the sequential exception technique and the OLAP data cube technique; they are seldom used in practice today.
2.7 Reconstruction-based methods
The representative method is PCA. There are two ways to use PCA for anomaly detection: one is to map the data to a low-dimensional feature space and examine how far each point deviates from the other data along the different components; the other is to map the data to the low-dimensional feature space, map it back to the original space, reconstruct each sample from its low-dimensional features, and measure the reconstruction error.
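A minimal sketch of the PCA reconstruction-error approach (the synthetic data and the number of components are illustrative assumptions):
# Minimal sketch: PCA reconstruction error as an anomaly score (illustrative parameters)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(0, 1, (1000, 10))
X[:5] += 6                                      # make the first 5 rows anomalous

pca = PCA(n_components=3).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))
recon_error = np.mean((X - X_reconstructed) ** 2, axis=1)
print(np.argsort(recon_error)[-5:])             # indices with the largest error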
2.8 Neural-network-based methods
Representative methods include the autoencoder (AE), LSTM, and so on.
- LSTM can be used for anomaly detection on time-series data: train a model on historical sequences and flag points whose actual values differ greatly from the predicted values (a minimal sketch is given after this list).
- Autoencoder anomaly detection
An autoencoder essentially uses a neural network to produce a low-dimensional representation of a high-dimensional input. It is similar to principal component analysis (PCA), but with nonlinear activation functions it overcomes PCA's linearity limitation. The basic assumption of the algorithm is that outliers follow a different distribution: an autoencoder trained on normal data can reconstruct normal samples well, but cannot reconstruct points that deviate from the normal distribution, which leads to large reconstruction errors. When the reconstruction error exceeds a certain threshold, the sample is marked as an outlier.
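For the LSTM bullet above, a common pattern is to forecast the series and flag points with large prediction residuals. Below is a minimal sketch (not part of the original project; the window size, network size, and 3-sigma threshold are illustrative assumptions), written with Keras as in the project code that follows:
# Minimal sketch: LSTM forecast-residual anomaly detection (illustrative parameters)
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def make_windows(series, window=30):
    # Build sliding windows X and next-step targets y
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.sin(np.linspace(0, 60, 2000)) + 0.05 * np.random.randn(2000)
series[1500] += 3.0                      # inject an artificial anomaly
X, y = make_windows(series)

model = Sequential([LSTM(32, input_shape=(X.shape[1], 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=3, batch_size=64, verbose=0)

residual = np.abs(model.predict(X).ravel() - y)
threshold = residual.mean() + 3 * residual.std()   # simple 3-sigma rule on residuals
anomaly_idx = np.where(residual > threshold)[0] + X.shape[1]
print("flagged indices:", anomaly_idx)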
III. Summary: the key elements of unsupervised anomaly detection are selecting relevant features and choosing an appropriate algorithm based on reasonable assumptions about the data; with both in place, anomaly detection can perform well.
IV. Project in practice: credit card fraud detection
The project is the classic credit card fraud detection task on Kaggle. The data set is of high quality, and the positive and negative classes are extremely imbalanced. In this project we mainly use an unsupervised autoencoder for novelty detection, identifying abnormal (fraudulent) samples by their reconstruction error.
#!/usr/bin/env python
# coding: utf-8
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
plt.style.use('seaborn')
import tensorflow as tf
import seaborn as sns
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint
from keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, precision_recall_curve
# A recommended anomaly detection Python library: https://github.com/yzhao062/Pyod
# Read the data: credit card fraud data set from https://www.kaggle.com/mlg-ulb/creditcardfraud
d = pd.read_csv('creditcard.csv')
# Look at the class ratio
num_nonfraud = np.sum(d['Class'] == 0)
num_fraud = np.sum(d['Class'] == 1)
plt.bar(['Fraud', 'non-fraud'], [num_fraud, num_nonfraud], color='dodgerblue')
plt.show()
# Drop the Time column and standardize the Amount column
data = d.drop(['Time'], axis=1)
data['Amount'] = StandardScaler().fit_transform(data[['Amount']])
# For the unsupervised novelty detection approach, keep only the negative (non-fraud) samples and split them 8:2 into training and test sets
mask = (data['Class'] == 0)
X_train, X_test = train_test_split(data[mask], test_size=0.2, random_state=0)
X_train = X_train.drop(['Class'], axis=1).values
X_test = X_test.drop(['Class'], axis=1).values
# Extract all positive (fraud) samples to use as part of the test data
X_fraud = data[~mask].drop(['Class'], axis=1).values
# Build the autoencoder network
# Hidden layers have 16, 8, and 8 units; the output layer restores the input dimension
# Train for 5 epochs with a batch size of 32
input_dim = X_train.shape[1]
encoding_dim = 16
num_epoch = 5
batch_size = 32
input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="tanh",
activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(int(encoding_dim / 2), activation="relu")(encoder)
decoder = Dense(int(encoding_dim / 2), activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['mae'])
# Save the model to model.h5 and start training
checkpointer = ModelCheckpoint(filepath="model.h5",
verbose=0,
save_best_only=True)
history = autoencoder.fit(X_train, X_train,
epochs=num_epoch,
batch_size=batch_size,
shuffle=True,
validation_data=(X_test, X_test),
verbose=1,
callbacks=[checkpointer]).history
# Plot the loss curves
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(history['loss'], c='dodgerblue', lw=3)
plt.plot(history['val_loss'], c='coral', lw=3)
plt.title('model loss')
plt.ylabel('mse'); plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.subplot(122)
plt.plot(history['mae'], c='dodgerblue', lw=3)
plt.plot(history['val_mae'], c='coral', lw=3)
plt.title('model mae')
plt.ylabel('mae'); plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
# Load the trained model
autoencoder = load_model('model.h5')
# Reconstruct the test set with the autoencoder
pred_test = autoencoder.predict(X_test)
# Reconstruct the fraud samples
pred_fraud = autoencoder.predict(X_fraud)
# Compute the reconstruction MSE and MAE errors
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
mse_fraud = np.mean(np.power(X_fraud - pred_fraud, 2), axis=1)
mae_test = np.mean(np.abs(X_test - pred_test), axis=1)
mae_fraud = np.mean(np.abs(X_fraud - pred_fraud), axis=1)
mse_df = pd.DataFrame()
mse_df['Class'] = [0] * len(mse_test) + [1] * len(mse_fraud)
mse_df['MSE'] = np.hstack([mse_test, mse_fraud])
mse_df['MAE'] = np.hstack([mae_test, mae_fraud])
mse_df = mse_df.sample(frac=1).reset_index(drop=True)
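# (Optional sketch, not in the original article) One way to turn the reconstruction
# error into a concrete fraud flag: put the threshold at a high quantile of the
# errors on normal test data (the 99.5% quantile here is an illustrative choice).
threshold = np.percentile(mse_test, 99.5)
pred_label = (mse_df['MSE'] > threshold).astype(int)
tp = np.sum((pred_label == 1) & (mse_df['Class'] == 1))
precision_at_thr = tp / max(pred_label.sum(), 1)
recall_at_thr = tp / (mse_df['Class'] == 1).sum()
print('threshold=%.4f precision=%.3f recall=%.3f' % (threshold, precision_at_thr, recall_at_thr))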
# Plot the reconstruction MAE and MSE of the fraud and non-fraud samples in the test data
markers = ['o', '^']
colors = ['dodgerblue', 'coral']
labels = ['Non-fraud', 'Fraud']
plt.figure(figsize=(14, 5))
plt.subplot(121)
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp.index,
                temp['MAE'],
                alpha=0.7,
                marker=markers[flag],
                c=colors[flag],
                label=labels[flag])
plt.title('Reconstruction MAE')
plt.ylabel('Reconstruction MAE'); plt.xlabel('Index')
plt.subplot(122)
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp.index,
                temp['MSE'],
                alpha=0.7,
                marker=markers[flag],
                c=colors[flag],
                label=labels[flag])
plt.legend(loc=[1, 0], fontsize=12); plt.title('Reconstruction MSE')
plt.ylabel('Reconstruction MSE'); plt.xlabel('Index')
plt.show()
# The two plots show the MAE and MSE reconstruction errors: the coral points are credit card fraud (the anomalies) and the blue points are normal. The reconstruction errors of the anomalous points are clearly higher overall.
# Plot the Precision-Recall curves
plt.figure(figsize=(14, 6))
for i, metric in enumerate(['MAE', 'MSE']):
    plt.subplot(1, 2, i+1)
    precision, recall, _ = precision_recall_curve(mse_df['Class'], mse_df[metric])
    pr_auc = auc(recall, precision)
    plt.title('Precision-Recall curve based on %s\nAUC = %0.2f' % (metric, pr_auc))
    plt.plot(recall[:-2], precision[:-2], c='coral', lw=4)
    plt.xlabel('Recall'); plt.ylabel('Precision')
plt.show()
# Plot the ROC curves
plt.figure(figsize=(14, 6))
for i, metric in enumerate(['MAE', 'MSE']):
    plt.subplot(1, 2, i+1)
    fpr, tpr, _ = roc_curve(mse_df['Class'], mse_df[metric])
    roc_auc = auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic based on %s\nAUC = %0.2f' % (metric, roc_auc))
    plt.plot(fpr, tpr, c='coral', lw=4)
    plt.plot([0, 1], [0, 1], c='dodgerblue', ls='--')
    plt.ylabel('TPR'); plt.xlabel('FPR')
plt.show()
# Whether MAE or MSE is used as the criterion, the model performs well: the PR AUC is 0.51 and 0.44 respectively, and the ROC AUC reaches 0.95 in both cases.
# Scatter plot of MSE vs. MAE
markers = ['o', '^']
colors = ['dodgerblue', 'coral']
labels = ['Non-fraud', 'Fraud']
plt.figure(figsize=(10, 5))
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp['MAE'],
                temp['MSE'],
                alpha=0.7,
                marker=markers[flag],
                c=colors[flag],
                label=labels[flag])
plt.legend(loc=[1, 0])
plt.ylabel('Reconstruction MSE'); plt.xlabel('Reconstruction MAE')
plt.show()
This article was first published on the "Algorithm Advanced" official account; the original post links to the GitHub repository with the full project source code.