Essentially, in anomaly detection, we are looking for observations that deviate from the norm: points that are either outside of what we have found or defined as normal, or that fall short of it. Anomaly detection therefore provides benefits from both a business and a technical perspective. To detect anomalies, one can rely on tools such as scikit-learn. However, when it comes to performing the task end to end, there are only a few options, such as PyFBAD, a Python-based software package. With it, we can load data from a variety of files and database servers and run state-of-the-art algorithms for anomaly detection. We’ll talk about these tools in this article, but first, let’s look at the main points listed below.
Table of contents
- What is anomaly detection?
- Anomaly detection techniques
- Algorithms for anomaly detection
- How does PyFBAD detect anomalies?
Let’s start by understanding anomaly detection.
What is anomaly detection?
Anomalies are data points in a dataset that stand out from the rest of the data and contradict its expected behaviour. These points or observations deviate from the dataset’s typical behaviour patterns. Anomaly detection is the technique of finding such points, and it is often based on unsupervised processing of the data. Anomalies can be classified into several categories, including outliers, which are short or small anomalous patterns that appear in the data in a temporary or unsystematic manner, and drift, which is a slow, long-term change in the data.
Anomaly detection is useful for detecting fraudulent transactions, detecting disease, and handling case studies with a high level of class imbalance. With strong anomaly detection techniques, more robust data science models can be built.
Outlier analysis (also known as outlier detection) is a data mining step that detects data points, events, and/or observations that deviate from the usual behaviour of a dataset. Anomalous data can reveal critical incidents, such as technical glitches, or potential opportunities, such as changes in consumer behaviour. Machine learning is increasingly being used to detect these anomalies.
Anomaly detection techniques
There are three types of anomaly detection techniques: unsupervised, semi-supervised, and supervised. The most suitable method is largely determined by the labels available in the dataset. Supervised anomaly detection techniques require a dataset with a complete set of “normal” and “anomalous” labels for a classification algorithm to work, and training a classifier is part of the method.
This setting is similar to traditional pattern recognition, except that anomaly detection naturally produces a strong imbalance between the classes. Because of this inherent imbalance, not every statistical classification algorithm is well suited to the task; a minimal illustration of the supervised setting is sketched below.
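The following sketch uses scikit-learn on a synthetic, heavily imbalanced labelled dataset; the data, the logistic regression model, and all parameter values are illustrative assumptions rather than anything prescribed by the article.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# synthetic labelled data in which only about 1% of the samples are anomalies
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' compensates for the strong class imbalance
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))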
Semi-supervised anomaly detection uses a labelled training set of normal data to build a model that represents normal behaviour. That model is then used to find anomalies by determining how likely it is to have generated any given instance.
Unsupervised anomaly detection methods detect anomalies in unlabelled test data based only on the inherent properties of the data. The working assumption is that the vast majority of instances in the dataset are normal, as they are most of the time. The algorithm then looks for instances that do not seem to fit with the rest of the dataset.
Algorithms for anomaly detection
Isolation Forest
The Isolation Forest algorithm uses a tree-based approach to detect anomalies. Rather than modelling normal data, it isolates the few and different anomalies in the feature space. The algorithm essentially works as follows: it grows a random forest in which each decision tree is built randomly; at each node, a feature is selected at random and a random threshold is chosen to split the dataset in two.
It keeps splitting the dataset until every instance is isolated from the others. Because an anomaly usually lies far from other instances, it is isolated in fewer steps on average than a normal instance (across all decision trees).
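A minimal sketch of this idea with scikit-learn’s IsolationForest on synthetic two-dimensional data follows; the data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = 0.3 * rng.randn(200, 2)  # dense cluster of "normal" points
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))  # scattered anomalous points
X = np.vstack([X_normal, X_outliers])

# contamination is the expected fraction of anomalies in the data
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal
print("Detected anomalies:", (labels == -1).sum())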
Density-based algorithms
Common density-based techniques include k-nearest neighbours (KNN) and the Local Outlier Factor (LOF), among others. Both regression and classification systems can benefit from these techniques.
Each of these algorithms defines expected behaviour by following the regions of highest data-point density. Any point that falls a statistically significant distance outside these dense regions is marked as an outlier. Because most of these techniques rely on distances between points, it is essential to scale the dataset and normalise the units to obtain accurate results.
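A minimal sketch of the Local Outlier Factor with scikit-learn on synthetic data follows; as above, the data and parameters are illustrative assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_inliers = 0.3 * rng.randn(100, 2) + 2  # dense cluster of normal points
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))  # scattered outliers
X = np.vstack([X_inliers, X_outliers])

# n_neighbors controls the local neighbourhood used to estimate density
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
print("Detected outliers:", (labels == -1).sum())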
SVM-based method
The support vector machine (SVM) is a supervised learning model that can produce robust prediction models and is mainly used for classification; for anomaly detection, its one-class variant (One-Class SVM) is typically used. The standard technique takes a set of training instances, each labelled as belonging to one of two groups.
The system then derives criteria for classifying other cases. To maximise the margin between the two categories, the algorithm maps the examples to points in space.
If a value falls too far outside the range of the two categories, the system flags it as an outlier. If you do not have labelled data, you can use unsupervised learning strategies to build categories by looking for groups among the cases.
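A minimal One-Class SVM sketch with scikit-learn follows: the model is trained on "normal" data only and then flags new observations that fall outside the learned boundary. The data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)  # training data: normal behaviour only
X_test = np.vstack([0.3 * rng.randn(20, 2),  # new normal points
                    rng.uniform(low=-4, high=4, size=(5, 2))])  # new anomalous points

# nu roughly bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05).fit(X_train)
labels = clf.predict(X_test)  # -1 = outlier, 1 = inlier
print("Flagged as anomalous:", (labels == -1).sum())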
How does PyFBAD detect anomalies?
The PyFBAD library is an end-to-end unsupervised anomaly detection package. The source code for all stages of the ML workflow is included in this package. With PyFBAD’s modules, data can be read from files such as CSV, or from databases such as MongoDB or MySQL. A preprocessor can be used to prepare the data that has been read for the model.
Different machine learning models, such as Prophet or Isolation Forest, can be used for training. The results of anomaly detection can be sent via email or Slack. In other words, the entire project cycle can be completed using only the source code provided by PyFBAD, without any other libraries.
Let’s get started with this package. We first install it with pip and import all the dependencies; Plotly is used for interactive plotting in this implementation.
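A typical installation command (assuming the package is published on PyPI under the name pyfbad) is:
pip install pyfbad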
import plotly.express as px
import plotly.graph_objects as go
from pyfbad.data import database as db  # file and database connectors
from pyfbad.models import models as md  # anomaly detection models
from pyfbad.features import create_feature as cf  # feature preparation utilities
Since the tool is an end-to-end platform, we can pull our data straight from a database; this is done through its database objects. Here we load a standard CSV file that holds Microsoft stock information, which can be done as follows.
# initialize the connection
connection = db.File()
data = connection.read_from_csv('/content/Microsoft_Stock.csv')
data.head()
For time-series anomaly detection, we need to create a feature set that contains a date-time column and the values in which we want to detect anomalies; in our case, that is the trading volume.
# build the feature set from the date and volume columns
features = cf.Features()
features_set = features.get_model_data(df=data, time_column_name='Date', value_column_name='Volume')
features_set
Next, using the feature set generated above, PyFBAD provides a model object that we can use to detect anomalies. At the time of writing, it offers Prophet and Isolation Forest as the algorithms to work with.
# initialize the algorithm
models = md.Model_Prophet()
# train algorithm on the features
trained_features = models.train_model(features_set)
# get the anomalies
forecast_anomaly = models.train_forecast(trained_features)
Now that we have detected a set of outliers in our dataset, let’s visualise them with Plotly, as sketched below: first the main series is plotted, and then the outliers detected by the model are overlaid on it.
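The exact schema of the forecast output depends on the PyFBAD version; the column names 'ds', 'fact', and 'anomaly' used below are assumptions rather than documented API, so adjust them to whatever forecast_anomaly actually contains.
# plot the raw series first
fig = px.line(data, x='Date', y='Volume', title='Microsoft stock volume')
fig.show()

# overlay the points flagged as anomalous (column names here are assumptions)
anomalies = forecast_anomaly[forecast_anomaly['anomaly'] != 0]
fig = go.Figure()
fig.add_trace(go.Scatter(x=data['Date'], y=data['Volume'], mode='lines', name='Volume'))
fig.add_trace(go.Scatter(x=anomalies['ds'], y=anomalies['fact'], mode='markers',
                         name='Anomaly', marker=dict(color='red', size=8)))
fig.show()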
Final words
In this article, we discussed anomalies and the importance of detecting and handling them appropriately to arrive at the right business solution. We briefly covered the basic techniques and algorithms used to do so. Finally, to detect anomalies in a dataset, we used PyFBAD, a Python-based toolkit.
References
- Anomaly detection
- PyFBAD official repository
- Link to the code above