Guest blog, compiled by Flin. Source: analyticsvidhya

Overview

  • Get familiar with class imbalance

  • Learn about various techniques for handling imbalanced classes, such as:

    • Random undersampling
    • Random oversampling
    • NearMiss
  • You can check the code for this article in my GitHub repository here

Introduction

Class imbalance occurs when the number of observations in one class is significantly higher than in the other classes.

Example: detecting fraudulent credit card transactions. As shown in the figure below, fraudulent transactions number about 400, while non-fraudulent transactions number about 90,000.

Class imbalance is a common problem in machine learning, especially in classification problems. Imbalanced data can seriously hamper our model's accuracy.

Class imbalances occur in many areas, including:

  • Fraud identification
  • Spam filtering
  • Disease screening
  • SaaS subscription churn
  • Advertising click-throughs

The class imbalance problem

Most machine learning algorithms work best when the number of samples in each category is roughly equal. This is because most algorithms are designed to maximize accuracy and reduce error.

However, if the data set is imbalanced, you can achieve fairly high accuracy simply by predicting the majority class, while failing to capture the minority class, which is usually the whole point of building the model in the first place.

Credit card fraud detection example

Suppose we have a data set of credit card companies and we have to find out whether credit card transactions are fraudulent.

But there’s a catch… Fraudulent transactions are relatively rare, with only 6% of transactions being fraudulent.

Now, before you even start, can you think of how the problem should be solved? Imagine you didn't spend any time training a model. Instead, what if you wrote a single line of code that always predicts "no fraudulent transaction"?

def transaction(transaction_data):
    return 'No fraudulent transaction'

Well, guess what? Your “solution” will be 94% accurate!

Unfortunately, this accuracy is misleading.

  • On all of the non-fraudulent transactions, you would have 100% accuracy.

  • On the fraudulent transactions, you would have 0% accuracy.

  • Your overall accuracy would be high simply because most transactions are not fraudulent (not because your model is any good).
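Incidentally, scikit-learn ships exactly this baseline as DummyClassifier. A minimal sketch, assuming a train/test split (x_train, y_train, x_test) like the one we build later in this article:

from sklearn.dummy import DummyClassifier

# always predict the most frequent class seen during training
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(x_train, y_train)
dummy_predict = dummy.predict(x_test)

On a data set that is 94% non-fraudulent, this baseline scores 94% accuracy while never catching a single fraud.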

This is clearly a problem, since many machine learning algorithms are designed to maximize overall accuracy. In this article, we'll look at different techniques for handling imbalanced data.

Data

We will use the credit card fraud detection dataset in this article, which you can find here.

  • www.kaggle.com/mlg-ulb/cre…

After loading the data, we display the first five rows of the data set.
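A minimal loading sketch (the file name creditcard.csv is an assumption based on the Kaggle download; adjust the path as needed):

# import library
import pandas as pd

# load the dataset downloaded from Kaggle (file name assumed)
data = pd.read_csv('creditcard.csv')

# display the first five rows
data.head()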

# check the target variable: fraudulent vs. non-fraudulent transactions
data['Class'].value_counts()
# 0 -> non-fraudulent
# 1 -> fraudulent

# visualize the target variable
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.countplot(x='Class', data=data)
g.set_xticklabels(['Not Fraud', 'Fraud'])
plt.show()

You can clearly see the huge difference between the classes: 284,315 non-fraudulent transactions and only 492 fraudulent ones.

The metric trap

One of the main problems newcomers face when dealing with imbalanced data sets relates to the metrics used to evaluate their models. Simple metrics such as accuracy can be misleading: on a data set with highly imbalanced classes, a classifier that always "predicts" the most common class without any feature analysis will still score a high accuracy, which is obviously illusory.
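Before running the experiment, we need to separate the predictors from the target and hold out a test set. A minimal sketch (the 70/30 split and random_state are assumptions, not prescribed by the original article):

# import library
from sklearn.model_selection import train_test_split

# separate the predictors and the target
x = data.drop('Class', axis=1)
y = data['Class']

# hold out 30% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

These x, y, x_train, x_test, y_train, and y_test variables are reused throughout the rest of the article.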

Let's do this experiment using a simple XGBClassifier and no feature engineering:

# import library
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb_model = XGBClassifier().fit(x_train, y_train)

# predict
xgb_y_predict = xgb_model.predict(x_test)

# accuracy score
xgb_score = accuracy_score(y_test, xgb_y_predict)

print('Accuracy score is:', xgb_score)

# OUTPUT
# Accuracy score is: 0.992

We see 99% accuracy. We get such a high accuracy only because the model predicts almost every observation as 0 (non-fraudulent).

Resampling techniques

A widely used technique for dealing with highly imbalanced data sets is called resampling. It consists of removing samples from the majority class (undersampling) and/or adding more examples of the minority class (oversampling).

While balancing the classes has many benefits, these techniques also have downsides.

The simplest implementation of oversampling is to duplicate random records of the minority class, which can lead to overfitting.

The simplest implementation of undersampling involves deleting random records from the majority class, which can cause a loss of information.

Let’s implement this using the credit card fraud detection example.

We will first separate class 0 from class 1.

# class count
class_count_0, class_count_1 = data['Class'].value_counts()

# separate the classes
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]

# print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

1. Random undersampling

Undersampling can be defined as removing observations from the majority class. This is done until the majority and minority classes are balanced.

Undersampling can be a good option when you have a lot of data, like millions of rows. But one disadvantage of undersampling is that we may be deleting valuable information.

# randomly sample the majority class down to the minority class count
class_0_under = class_0.sample(class_count_1)

test_under = pd.concat([class_0_under, class_1], axis=0)

print("total class of 1 and 0:", test_under['Class'].value_counts())

# plot the count after under-sampling
test_under['Class'].value_counts().plot(kind='bar', title='count (target)')

2. Random over-sampling

Oversampling can be defined as adding more copies of the minority class. Oversampling can be a good option when you don't have a lot of data to work with.

A disadvantage to consider is that oversampling can lead to overfitting and poor generalization to the test set.

# randomly over-sample the minority class, with replacement, up to the majority class count
class_1_over = class_1.sample(class_count_0, replace=True)

test_over = pd.concat([class_1_over, class_0], axis=0)

print("total class of 1 and 0:", test_over['Class'].value_counts())

# plot the count after over-sampling
test_over['Class'].value_counts().plot(kind='bar', title='count (target)')

Balancing data with the imbalanced-learn Python module

Many more sophisticated resampling techniques have been proposed in the scientific literature.

For example, we can cluster the records of the majority class and do the undersampling by removing records from each cluster, thus seeking to preserve information. In oversampling, instead of creating exact copies of the minority class records, we can introduce small variations into those copies, creating more diverse synthetic samples.

Let’s apply some of these resampling techniques using the Python library Imbalanced-learn. It is compatible with scikit-learn and is part of the Scikit-learn-contrib project.

# install with: pip install imbalanced-learn
import imblearn
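As an aside, the cluster-based undersampling idea described above is available in imbalanced-learn as ClusterCentroids. A minimal sketch, reusing the x and y defined earlier (the random_state value is an arbitrary choice):

# import library
from imblearn.under_sampling import ClusterCentroids

# under-sample the majority class by replacing clusters of majority
# samples with their KMeans cluster centroids
cc = ClusterCentroids(random_state=42)
x_cc, y_cc = cc.fit_resample(x, y)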

3. Random undersampling with imblearn

RandomUnderSampler is a quick and easy way to balance the data by randomly selecting a subset of data for the targeted classes. It under-samples the majority class by randomly picking samples, with or without replacement.

# import library
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42, replacement=True)

# fit predictor and target variable
x_rus, y_rus = rus.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_rus))

4. Random oversampling with imblearn

One way to address imbalanced data is to generate new samples for the minority class. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples. RandomOverSampler provides such a scheme.

# import library
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# fit predictor and target variable
x_ros, y_ros = ros.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_ros))

5. Undersampling: Tomek links

Tomek links are pairs of instances that are very close to each other but belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which aids the classification process.

If two samples are each other’s nearest neighbors, a Tomek link exists

In the following code, we undersample the majority class by setting sampling_strategy='majority'.

# import library
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')

# fit predictor and target variable
x_tl, y_tl = tl.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_tl))

6. Synthetic Minority Oversampling Technique (SMOTE)

This technique synthesizes new samples for the minority class.

SMOTE works by picking a random point from the minority class and computing its k nearest neighbors. Synthetic points are then added between the chosen point and its neighbors.

SMOTE works in four simple steps:

  1. Select a minority class sample as the input vector

  2. Find its k nearest neighbors (specified with the k_neighbors argument of SMOTE())

  3. Choose one of these neighbors and place a synthetic point anywhere on the line connecting the point under consideration and its chosen neighbor

  4. Repeat the steps until the data is balanced

# import library
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_smote))

7. NearMiss

NearMiss is an undersampling technique. Instead of resampling the minority class, it uses a distance criterion to reduce the majority class to the same size as the minority class.

from imblearn.under_sampling import NearMiss

nm = NearMiss()

x_nm, y_nm = nm.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_nm))

8. Change the performance metric

Accuracy is not the best metric to use when evaluating imbalanced data sets, as it can be misleading.

Metrics that can provide better insight include:

  • Confusion matrix: a table showing the correct predictions and the types of incorrect predictions.

  • Precision: the number of true positives divided by all positive predictions. Precision is also called positive predictive value. It is a measure of a classifier's exactness. Low precision indicates a large number of false positives.

  • Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called sensitivity or the true positive rate. It is a measure of a classifier's completeness. Low recall indicates a large number of false negatives.

  • F1 score: the weighted average of precision and recall.

  • Area under the ROC curve (AUROC): AUROC represents the likelihood that your model will distinguish observations from the two classes.

In other words, if you pick a random observation from each class, what is the probability that your model will “order” them correctly?
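A minimal sketch of computing these metrics with scikit-learn, assuming y_test and a model's predictions y_predict are available:

# import library
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# summary of correct and incorrect predictions
print('Confusion matrix:\n', confusion_matrix(y_test, y_predict))

print('Precision:', precision_score(y_test, y_predict))
print('Recall:', recall_score(y_test, y_predict))
print('F1 score:', f1_score(y_test, y_predict))
print('ROCAUC score:', roc_auc_score(y_test, y_predict))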

9. Penalized algorithms (cost-sensitive training)

The next strategy is to use a penalized learning algorithm, which increases the cost of misclassifying the minority class.

A popular algorithm for this technique is Penalised-SVM.

During training, we can use the parameter class_weight='balanced' to penalize mistakes on the minority class in proportion to how under-represented it is.

If we want to enable probability estimation for the support vector machine algorithm, we also want to include the parameter probability=True.

Let's train a Penalised-SVM model on the original imbalanced data set:

# load library
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# class_weight='balanced' penalizes mistakes on the minority class
svc_model = SVC(class_weight='balanced', probability=True)

svc_model.fit(x_train, y_train)

svc_predict = svc_model.predict(x_test)

# check performance
print('ROCAUC score:', roc_auc_score(y_test, svc_predict))
print('Accuracy score:', accuracy_score(y_test, svc_predict))
print('F1 score:', f1_score(y_test, svc_predict))

10. Change the algorithm

While it's a good rule of thumb to try a variety of algorithms on every machine learning problem, it's especially beneficial for imbalanced data sets.

Decision trees often perform well on imbalanced data. In modern machine learning, tree ensembles (random forests, gradient boosted trees, etc.) almost always outperform single decision trees, so we'll jump straight to those:

Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.

# load library
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

# fit the predictor and target
rfc.fit(x_train, y_train)

# predict
rfc_predict = rfc.predict(x_test)

# check performance
print('ROCAUC score:', roc_auc_score(y_test, rfc_predict))
print('Accuracy score:', accuracy_score(y_test, rfc_predict))
print('F1 score:', f1_score(y_test, rfc_predict))
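As a side note, the ideas from sections 9 and 10 can be combined: RandomForestClassifier also accepts the class_weight='balanced' parameter. A hypothetical variation, not part of the original experiment:

# a cost-sensitive random forest (hypothetical variation)
rfc_balanced = RandomForestClassifier(class_weight='balanced')
rfc_balanced.fit(x_train, y_train)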

Advantages and disadvantages of undersampling

Advantages

  • When the training data set is large, it can help improve run time and storage problems by reducing the number of training samples.

Disadvantages

  • It can discard potentially useful information, which could be important for building rule classifiers.

  • The sample chosen by random undersampling may be a biased sample, which may lead to inaccurate results on the actual test data set.

Advantages and disadvantages of oversampling

Advantages

  • Unlike undersampling, this method leads to no information loss.
  • It usually outperforms undersampling.

Disadvantages

  • Because it duplicates minority-class events, it increases the likelihood of overfitting.

You can check the implementation of the code in my GitHub repository.

  • Github.com/benai9916/H…


Conclusion

To summarize, in this article we have seen various techniques for handling class imbalance in a data set. There are actually many approaches you can try when working with imbalanced data. I hope you found this article helpful.

Thanks for reading!

Original link: www.analyticsvidhya.com/blog/2020/0…
