Handling Class Imbalance in Data


1. What is the class imbalance problem

A slight difference in the number of training samples across classes usually has little effect, but a large difference causes trouble for the learning process. For example, if there are 998 negative examples but only 2 positive examples, a learning method can simply return a learner that always predicts new samples as negative and still achieve 99.8% accuracy. Such a learner is usually worthless, however, because it can never identify a positive example.

Class imbalance refers to the situation in classification tasks where the numbers of training samples of different classes differ greatly. Class imbalance is common in practical classification learning. For example, when a multi-class problem is solved by decomposition with the one-vs-rest (OvR) or many-vs-many (MvM) strategy, the resulting binary subtasks may themselves be imbalanced, so it is necessary to understand the basic methods for handling class imbalance.

 

2. Addressing class imbalance

2.1 Undersampling method

(1) What is the undersampling method

“Undersampling” operates directly on the majority class of the training set: some majority-class samples are removed so that the numbers of positive and negative examples become close, and learning is then carried out on the reduced set.

(2) Random undersampling method

Random undersampling, as the name implies, randomly selects a subset of samples from the majority class and removes those samples from the majority-class set; the remaining majority-class samples together with the minority class form the new data set.
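As an illustration, below is a minimal NumPy sketch of random undersampling (not from the original article); the function name, the arrays X and y, and the assumption that the majority class is labeled 0 are mine.

```python
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)   # positions of majority-class samples
    min_idx = np.flatnonzero(y != majority_label)   # positions of minority-class samples
    # keep only as many majority samples as there are minority samples
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

The imbalanced-learn library offers an equivalent RandomUnderSampler if you prefer a maintained implementation.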

Disadvantages:

Random undersampling balances the sample distribution by reducing the proportion of majority-class samples, but it has drawbacks. Because the retained sample set is smaller than the original one, some information is inevitably lost: deleting majority-class samples may cause the classifier to miss important information about the majority class.

To overcome the information loss caused by random undersampling and to maintain good classification performance on imbalanced data, the representative undersampling algorithms EasyEnsemble and BalanceCascade were proposed.

(3) Representative undersampling algorithm: EasyEnsemble

Algorithm steps:

1) Sample with replacement from the majority class n times, each time drawing a subset whose size is close to the number of minority-class samples; this yields n majority-class subsets.

2) Each majority-class subset is combined with all the minority-class samples to train a model, giving n models in total.

3) These n models are finally combined into an ensemble; the final prediction is the average of the outputs of the n models.

Figure 1: EasyEnsemble algorithm
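The following is a minimal sketch of the EasyEnsemble idea, assuming a binary problem with labels {0, 1}; it trains one scikit-learn AdaBoostClassifier per balanced subset, and all names here are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_fit(X, y, n_subsets=10, majority_label=0, random_state=0):
    """Train one AdaBoost model per balanced subset (majority subsample + all minority samples)."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    models = []
    for i in range(n_subsets):
        # draw (with replacement) as many majority samples as there are minority samples
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=True)
        idx = np.concatenate([sub_maj, min_idx])
        models.append(AdaBoostClassifier(random_state=i).fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    """Average the positive-class probabilities of all sub-models and threshold at 0.5."""
    avg = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (avg >= 0.5).astype(int)
```

The imbalanced-learn library also provides an EasyEnsembleClassifier built on the same idea.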

(4) Representative undersampling algorithm: BalanceCascade

The BalanceCascade algorithm is based on AdaBoost and uses AdaBoost as its base classifier. Its core idea is:

1) In each round, a training set containing equal numbers of majority-class and minority-class samples is used to train an AdaBoost base classifier.

2) The classifier is then used to predict all majority-class samples; the false positive rate is controlled by adjusting the classification threshold, and all correctly classified majority-class samples are removed.

3) Finally, the next iteration begins, further reducing the number of majority-class samples.

Figure 2: BalanceCascade algorithm
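Below is a simplified sketch of the BalanceCascade loop; it omits the threshold adjustment used to control the false positive rate in step 2, and the function name and the assumption that the majority class is labeled 0 are mine.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def balance_cascade_fit(X, y, n_rounds=5, majority_label=0, random_state=0):
    """Iteratively train AdaBoost models and drop the majority samples they already classify correctly."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    models = []
    for _ in range(n_rounds):
        if len(maj_idx) < len(min_idx):      # not enough majority samples left for a balanced set
            break
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([sub_maj, min_idx])
        model = AdaBoostClassifier(random_state=0).fit(X[idx], y[idx])
        models.append(model)
        # keep only the majority samples that this round's model still gets wrong
        preds = model.predict(X[maj_idx])
        maj_idx = maj_idx[preds != majority_label]
    return models
```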

Read more:

Liu X Y, Wu J, Zhou Z H. Exploratory undersampling for class-imbalance learning[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2009, 39(2): 539-550.

This paper proposes the two undersampling methods EasyEnsemble and BalanceCascade.

 

2.2 Oversampling method

(1) What is the oversampling method

Oversampling operates on the minority class of the training set: additional minority-class samples are generated so that the numbers of positive and negative examples become close, and learning is then carried out.

(2) Random oversampling method

Random oversampling randomly selects samples from the minority class (with replacement), copies the selected samples, and adds the copies to the minority-class set, thereby enlarging the minority class and obtaining a new, more balanced data set.
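For comparison with the undersampling sketch above, here is a minimal NumPy sketch of random oversampling; again the names and the assumption that the minority class is labeled 1 are illustrative.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, random_state=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    n_extra = len(maj_idx) - len(min_idx)           # how many copies are needed
    # sample with replacement: the same minority row may be copied several times
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    rng.shuffle(keep)
    return X[keep], y[keep]
```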

Disadvantages:

For random oversampling, copying minority-class samples to enlarge the data set increases the complexity of model training. It also easily leads to overfitting: because random oversampling simply duplicates the initial samples, the rules learned by the learner become too specific, which harms generalization.

To alleviate the overfitting problem caused by random oversampling while still balancing the data set, the representative oversampling algorithms SMOTE and Borderline-SMOTE were proposed.

(3) Representative oversampling algorithm: SMOTE

SMOTE (Synthetic Minority Oversampling Technique) is an improvement on random oversampling: because random oversampling simply duplicates minority-class samples, the training set contains many repeated samples, which easily causes overfitting. The basic idea of SMOTE is, for each minority-class sample $x$, to randomly select a sample $\hat{x}$ from its nearest neighbors ($\hat{x}$ is also a minority-class sample), and then to randomly pick a point on the line segment between $x$ and $\hat{x}$ as the newly synthesized minority-class sample.

The SMOTE procedure for synthesizing a new minority-class sample is described as follows:

1). For each minority-class sample $x$, compute the Euclidean distance from $x$ to every other sample in the minority class and obtain its k nearest neighbors.

2). Set a sampling multiplier N according to the imbalance ratio. For each minority-class sample $x$, randomly select several samples from its k nearest neighbors; denote a selected neighbor by $\hat{x}$.

3). For each randomly selected neighbor $\hat{x}$, construct a new sample together with $x$ according to the formula $x_{new} = x + \mathrm{rand}(0, 1) \times (\hat{x} - x)$.

Let’s describe again, graphically, the SMOTE algorithm.

1). Randomly select a minority-class sample $x$.

2). Find the k nearest neighbors of this minority-class sample $x$ (assume k = 5); the 5 neighbors are circled in the figure.

3). Randomly select one sample $\hat{x}$ from the k neighbors (circled in green).

4). Randomly pick a point on the line segment between the minority-class sample $x$ and the selected neighbor $\hat{x}$; that point is the newly synthesized sample (marked with a green plus sign).
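A minimal sketch of the synthesis step described above, i.e. the interpolation $x_{new} = x + \mathrm{rand}(0,1)\times(\hat{x}-x)$, assuming X_min holds only minority-class rows; imbalanced-learn's SMOTE class is a full implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic samples by interpolating between a chosen
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because a point is its own neighbour
    _, neigh = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # the minority sample x
        j = neigh[i][rng.integers(1, k + 1)]              # one of its k neighbours, x_hat
        gap = rng.random()                                # random point on the segment x -> x_hat
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```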

SMOTE avoids the duplicated samples of random oversampling and thus mitigates its tendency to overfit. In practice, the method has been shown to improve classifier performance. SMOTE has two drawbacks, however:

1) Since new samples are generated for every minority-class sample, overlapping samples are easily produced.

2) SMOTE suffers from over-generalization, mainly due to the way synthetic samples are produced. In particular, SMOTE generates the same number of synthetic samples for every original minority-class sample without considering the distribution of neighboring samples, which increases the likelihood of overlap between classes.

Reason for drawback 2): recalling the principle of SMOTE described above, when SMOTE creates a new artificial minority-class sample it only interpolates between close neighbors of the same class and does not consider the distribution of majority-class samples around the minority-class sample. As shown in Figure 3, the green plus signs 1 and 2 are located among majority-class samples and are closest to them, which makes it likely that they will be classified as majority-class samples. Figure 3 therefore shows that the sample-generation mechanism of SMOTE has a certain blindness.

Figure 3: Results of the SMOTE algorithm

In order to overcome the limitations of the above two points, a variety of adaptive sampling methods have been put forward, among which the representative one is Borderline-SMOTE algorithm.

Read more:

Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of artificial intelligence research, 2002, 16: 321-357.

This paper proposed the SMOTE algorithm.

(4) Introduction to the Borderline-SMOTE algorithm

The most distinctive part of the Borderline-SMOTE algorithm is the way it identifies minority-class seed samples. Borderline-SMOTE identifies seed samples as follows:

1) First, for each minority-class sample $x_i$, determine the set of its k nearest neighbors in the whole data set.

2) Then, for each sample $x_i$, count the number of majority-class samples among its k nearest neighbors, denoted $k'$.

3) Finally, select the samples $x_i$ that satisfy $k/2 \le k' < k$.

The inequality above means that only those samples $x_i$ whose k nearest neighbors contain more majority-class samples than minority-class samples are selected to form the “DANGER” set. The samples in the DANGER set therefore represent the borderline minority-class samples (the samples most likely to be misclassified). The SMOTE algorithm is then applied to the DANGER set to generate artificial minority-class samples near the boundary.

Note that if $k' = k$, that is, all k nearest neighbors of $x_i$ are majority-class samples, as for sample point C in Figure 4, then $x_i$ is regarded as noise and no synthetic samples are generated from it.

Figure 4: Data generation based on samples at the boundary

From the above we have a rough idea of the Borderline-SMOTE algorithm. To help you understand it more thoroughly, the flow chart below introduces it in detail.

Figure 5: Borderline-SMOTE algorithm flow chart

In the flow chart of Figure 5, the training sample set is denoted F; the steps below operate on the minority-class samples in F.

1) Step 1:

(i) For each sample in the minority-class sample set, calculate its k nearest neighbors in the training set F.

(ii) Then, according to these k nearest neighbors, classify each minority-class sample as follows:

  • If all k nearest neighbors are majority-class samples, the sample is defined as a noise sample and placed in the noise set.
  • Conversely, if all k nearest neighbors are minority-class samples, the sample lies far from the classification boundary (a safe sample) and is placed in the S set.
  • Finally, samples whose k nearest neighbors contain both majority-class and minority-class samples are regarded as borderline samples and placed in the B set.

2) Step 2:

(i) For each sample in the borderline set B, calculate its k nearest neighbors within the minority-class sample set, forming a neighbor set.

(ii) Randomly select s (1 < s < n) of these nearest neighbors.

(iii) Calculate the difference $d_j$ between the sample and each selected neighbor over all attributes.

(iv) Multiply each difference $d_j$ by a random number $r_j$ between 0 and 1.

(v) The artificially generated minority-class sample is then $x_{new} = x + r_j \times d_j$.

3) Step 3:

The process of Step 2 is repeated until the number of artificially generated minority-class samples meets the requirement and the sample set is balanced.
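Here is a minimal sketch of the seed-identification step (building the DANGER set); the function name and the label convention are mine, and the returned samples would then be fed to a SMOTE routine such as the one sketched earlier.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_set(X, y, minority_label=1, k=5):
    """Return minority samples whose k nearest neighbours in the whole training set
    are mostly, but not entirely, majority-class samples (the borderline samples)."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)       # +1: each minority point is its own neighbour
    _, neigh = nn.kneighbors(X_min)
    danger = []
    for x, idx in zip(X_min, neigh):
        n_maj = np.sum(y[idx[1:]] != minority_label)      # majority samples among the k neighbours
        if k / 2 <= n_maj < k:                            # borderline: not safe, not pure noise
            danger.append(x)
    return np.array(danger)
```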

Read more:

Han H, Wang W Y, Mao B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning[C]//International Conference on Intelligent Computing. Springer, Berlin, Heidelberg, 2005: 878-887.

This paper proposed the Borderline-SMOTE algorithm.

2.3 Cost-Sensitive Learning

(1) Cost matrix

Sampling algorithms address imbalanced learning at the data level. At the algorithm level, the main approach to imbalanced learning is cost-sensitive learning.

In practical tasks, different types of errors often have different consequences. In medical diagnosis, for example, wrongly diagnosing a patient as healthy and wrongly diagnosing a healthy person as a patient both look like “one mistake”, but the latter merely causes the inconvenience of further examination, while the former may cost the best opportunity to save a life. Similarly, an access control system that wrongly keeps authorized personnel outside the door only hurts the user experience, whereas wrongly letting a stranger in can cause a serious security incident; and mistaking normal use for abnormal use may annoy the user, while mistaking abnormal use for normal use may cause the user huge losses. To weigh the different consequences of different types of errors, we can assign them unequal costs.

The core element of cost-sensitive learning is the cost matrix, shown in Table 1, where $cost_{ij}$ denotes the cost of predicting a sample of class $i$ as class $j$. In general, $cost_{ii} = 0$; if misclassifying class 0 as class 1 causes a greater loss, then $cost_{01} > cost_{10}$; the greater the difference in loss, the greater the difference between $cost_{01}$ and $cost_{10}$. When $cost_{01} = cost_{10}$, the problem reduces to cost-insensitive learning.

Table 1: Cost matrix

(2) Cost sensitive learning method

Based on the above analysis of the cost matrix, cost-sensitive learning is mainly realized in the following three ways:

1). Starting from the learning model: modify a specific learning method so that it adapts to learning from imbalanced data. Researchers have proposed cost-sensitive versions of various learning models such as the perceptron, support vector machine, decision tree and neural network. Taking the cost-sensitive decision tree as an example, it can be adapted to imbalanced data in three places: the choice of the decision threshold, the choice of the split criterion, and pruning; the cost matrix can be introduced into all three (see the sketch after this list).

2). Starting from Bayesian risk theory: treat cost-sensitive learning as a post-processing of the classification results. A model is learned in the usual way, and its outputs are then adjusted so as to minimize the expected loss. The advantage of this approach is that it is independent of the particular classifier used; the obvious disadvantage is that it requires the classifier to output probabilities.

3). Starting from preprocessing: use the costs to adjust the sample weights so that the classifier becomes cost-sensitive. The AdaCost algorithm, a weight-updating strategy based on AdaBoost, is explained below.
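Before that, as a quick, hedged illustration of approach 1): scikit-learn's class_weight parameter lets a decision tree weight samples by class when evaluating splits, which is one simple way to inject a misclassification-cost ratio. The 10:1 ratio and the synthetic data below are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Treat a mistake on the minority class as 10 times more costly during split selection
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0).fit(X, y)
```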

(3) AdaCost algorithm

To understand the AdaCost algorithm, we first need to know the AdaBoost algorithm, shown in Figure 6. AdaBoost iterates repeatedly; each round learns a base classifier and updates the sample weights according to the current classifier's performance (see the red box in the figure): the weights of correctly classified samples are decreased and the weights of misclassified samples are increased. The final model is a weighted linear combination of the classifiers from all iterations; the more accurate a classifier, the larger its weight.

Figure 6: AdaBoost algorithm

AdaCost modifies the weight-updating strategy of AdaBoost. Its basic idea is to sharply increase the weights of misclassified samples with high cost and to decrease the weights of correctly classified high-cost samples only slightly; in short, for high-cost samples the weights increase greatly and decrease slowly. The sample weights are updated according to $D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t\,\beta_i\,y_i\,h_t(x_i))}{Z_t}$, where $\beta_+$ and $\beta_-$ denote the value of $\beta_i$ when the sample is classified correctly and incorrectly, respectively (the AdaCost paper recommends $\beta_+ = -0.5\,c_i + 0.5$ and $\beta_- = 0.5\,c_i + 0.5$, with $c_i$ the cost of sample $i$).
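The following sketch shows one AdaCost-style weight update, assuming binary labels, per-sample costs $c_i$ in [0, 1], and the $\beta$ values mentioned above; it is an illustration of the update rule only, not a full boosting loop.

```python
import numpy as np

def adacost_update(weights, y_true, y_pred, alpha, cost):
    """One AdaCost-style weight update: misclassified high-cost samples gain weight
    sharply, while correctly classified high-cost samples lose weight only slowly."""
    correct = (y_true == y_pred)
    # beta_plus = -0.5*c + 0.5 for correct predictions, beta_minus = 0.5*c + 0.5 for errors
    beta = np.where(correct, -0.5 * cost + 0.5, 0.5 * cost + 0.5)
    margin = np.where(correct, 1.0, -1.0)        # y_i * h_t(x_i) for labels in {-1, +1}
    new_w = weights * np.exp(-alpha * margin * beta)
    return new_w / new_w.sum()                   # normalise so the weights form a distribution
```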

 

2.4 Evaluation methods of unbalanced learning

(1) F1 measure

This part involves the evaluation method of the model. If you have not learned it, you can read the article about this part posted on my official account. At the same time, I also posted the link address for you to learn quickly.

Error rate, accuracy, precision, recall and F1 measurement are introduced in detail

ROC curve and AUC area understanding

Table 2: Confusion matrix of classification results

For example, in a cancer-prediction scenario, assume that samples without cancer are positive examples and samples with cancer are negative examples, and that the proportion of negative examples is very small (about 0.1%). If the classifier simply predicts every sample as positive, then both accuracy and precision are 99.9%. Clearly, accuracy, error rate and precision cannot reflect model performance on imbalanced data. The F1 measure takes into account both the precision and the recall of the minority class, so it can better measure model performance on imbalanced data.

 

(2) G-Mean

G-mean is another indicator that can evaluate model performance on imbalanced data. It is the geometric mean of the recall on the positive class and the recall on the negative class:

$G\text{-}mean = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}}$

(3) ROC curve and AUC area

My earlier article gives a comprehensive analysis of the ROC curve and the AUC. The ROC curve and AUC evaluate model performance on imbalanced data well.

ROC curve and AUC area understanding
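To make the three metrics concrete, here is a small scikit-learn sketch that computes the minority-class F1, the G-mean, and the ROC AUC on a synthetic imbalanced problem; the data and the logistic-regression model are placeholders of mine.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 1% positive samples
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]

f1 = f1_score(y_te, y_pred)                               # F1 of the minority (positive) class
sensitivity = recall_score(y_te, y_pred, pos_label=1)     # recall on the positive class
specificity = recall_score(y_te, y_pred, pos_label=0)     # recall on the negative class
g_mean = np.sqrt(sensitivity * specificity)               # G-mean
auc = roc_auc_score(y_te, y_score)                        # area under the ROC curve
print(f"F1={f1:.3f}  G-mean={g_mean:.3f}  AUC={auc:.3f}")
```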

3. How to choose

(1) When positive and negative samples are both very scarce, use data-synthesis methods such as the SMOTE and Borderline-SMOTE algorithms.

(2) When both positive and negative samples are sufficient and the class ratio is not extremely skewed, consider sampling or weighting methods.

Conclusion:

This article mainly introduced the algorithms and evaluation metrics commonly used for learning under class imbalance. The algorithms were presented at the data level and at the model level: at the data level, oversampling, undersampling and their improved variants; at the model level, cost-sensitive learning. The evaluation metrics covered the F1 measure, G-mean, and the ROC curve with its AUC.

 

References:

(1) Liu X Y, Wu J, Zhou Z H. Exploratory undersampling for class-imbalance learning[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2009, 39(2): 539-550.

(2) Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of artificial intelligence research, 2002, 16: 321-357.

(3) Han H, Wang W Y, Mao B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning[C]//International Conference on Intelligent Computing. Springer, Berlin, Heidelberg, 2005: 878-887.

(4) EasyEnsemble and BalanceCascade paper download address: cs.nju.edu.cn/zhouzh/zhou…

(5) SMOTE algorithm journal page: www.jair.org/index.php/j…

(6) SMOTE algorithm download address: www.jair.org/index.php/j…

(7) Borderline-SMOTE […

(8) Sampling methods for imbalanced learning – CSDN blog: https://blog.csdn.net/u011414200/article/details/50664266

(9) An introduction to machine learning methods for imbalanced data: https://www.jianshu.com/p/3e8b9f2764c8

(10) The imbalanced classification problem | the BalanceCascade method and its Python implementation: https://zhuanlan.zhihu.com/p/36093594