Processing methods for data imbalance in deep learning
0. Introduction to the problem
Class imbalance refers to a large difference between the proportions of samples of different classes in a classification learning task, which can significantly interfere with the learning process. Take a binary classification problem with 1000 samples, of which 5 are positive and 995 are negative. In this case the algorithm only needs to predict every sample as negative to reach an accuracy of 99.5%; although this accuracy looks very high, the result is worthless, because the learner cannot predict any positive sample. From this we can see that the imbalance problem leads to a high misclassification rate on the class with fewer samples; that is, a large proportion of minority-class samples are predicted to belong to the majority class.
1. Solutions
1. Undersampling: reduce the number of samples of the larger class so that the proportion of positive and negative samples is balanced.
2. Oversampling: increase the number of samples of the smaller class so that the proportion of positive and negative samples is balanced.
3. Leave the samples unchanged and move the classification threshold instead.
1.1. Undersampling
Random undersampling
Random undersampling refers to randomly selecting a part of the majority-class samples and deleting them. A big disadvantage of random undersampling is that it does not consider the distribution of the samples: because the selection is completely random, important information in the majority class may be deleted by mistake.
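As a concrete illustration, here is a minimal NumPy sketch of random undersampling. It assumes a feature matrix X, a binary label vector y, and that 0 is the majority label; the function name is only illustrative. The RandomUnderSampler class in the imbalanced-learn library offers a ready-made implementation of the same idea.

```python
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Randomly drop majority-class samples until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # keep only as many majority samples as there are minority samples
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([kept_maj, min_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]
```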
Some more advanced undersampling methods are introduced below.
EasyEnsemble and BalanceCascade
EasyEnsemble generates multiple sub-data sets by repeatedly drawing, with replacement, a part of the majority-class samples. Each subset is combined with the minority-class data to train a model, and the results of the resulting models are then combined for the final decision. This approach is very similar in principle to random forest.
BalanceCascade first uses random undersampling to generate a training set and trains a classifier; the correctly classified majority samples are then removed (not put back), the remaining majority samples are undersampled to generate a second training set, a second classifier is trained, and so on until a stopping condition is satisfied. The final model is likewise a combination of multiple classifiers.
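The following is a rough sketch of the EasyEnsemble idea, assuming binary labels with 0 as the majority class and 1 as the minority class. The method is usually described with AdaBoost base learners; a decision tree is used here only to keep the example short, and all helper names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble(X, y, n_subsets=10, majority_label=0, random_state=0):
    """Train one classifier per balanced subset and return the list of models."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    models = []
    for _ in range(n_subsets):
        # draw a majority subset (with replacement) of the same size as the minority class
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=True)
        idx = np.concatenate([sub_maj, min_idx])
        models.append(DecisionTreeClassifier(random_state=random_state).fit(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    """Average the per-model minority-class probabilities and threshold at 0.5."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)
```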
KNN-based undersampling
There are four KNN-based undersampling methods:
- NearMiss-1: select the majority samples whose average distance to their three nearest minority samples is the smallest
- NearMiss-2: select the majority samples whose average distance to their three farthest minority samples is the smallest
- NearMiss-3: for each minority sample, select a given number of the nearest majority samples, so as to ensure that every minority sample is surrounded by some majority samples
- Farthest distance: select the majority samples whose average distance to their three nearest minority samples is the largest
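For reference, the imbalanced-learn library implements these strategies in its NearMiss class. A small usage sketch on a synthetic, made-up data set might look like this:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss  # requires the imbalanced-learn package

# a toy imbalanced data set with roughly 5% positive samples
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# version=1/2/3 corresponds to NearMiss-1/2/3 described above
X_res, y_res = NearMiss(version=1, n_neighbors=3).fit_resample(X, y)
print("after :", Counter(y_res))
```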
1.2. Over-sampling
Random oversampling
Random oversampling refers to repeatedly sampling, with replacement, from the minority-class samples, so that the number of drawn samples exceeds the original number of minority samples. Some of the data will therefore be duplicated, and these duplicates increase the variance of the model and make it prone to overfitting.
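A minimal NumPy sketch of random oversampling, under the same illustrative assumptions as the undersampling example above (arrays X and y, with 1 as the minority label), could look like this; imbalanced-learn's RandomOverSampler provides an off-the-shelf equivalent.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, random_state=0):
    """Duplicate randomly chosen minority samples until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # draw with replacement, so some minority samples appear several times
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```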
Some more advanced oversampling methods are introduced below.
SMOTE algorithm
SMOTE stands for Synthetic Minority Oversampling Technique. It is an improvement on random oversampling: the basic idea of SMOTE is to analyse the minority samples, synthesise new samples from them, and add these to the data set.
The SMOTE algorithm uses the similarity between existing minority samples in feature space to construct artificial data. It can also be seen as assuming that a sample lying between two relatively close minority samples is still a minority sample. The specific procedure is as follows:
- Randomly select a minority sample x, compute its distance to every sample in the minority set, and obtain its k nearest neighbors.
- Set a sampling multiplier n according to the imbalance ratio of the data set. For each minority sample x, randomly select n samples from its k nearest neighbors.
- For each randomly selected neighbor x̂, take a random number in [0, 1], multiply it by the difference between x̂ and the feature vector of x, and add the result to x to form a new sample.
Expressed as a formula:
x_new = x + rand(0, 1) × (x̂ − x)
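A minimal NumPy/scikit-learn sketch of this synthesis step is shown below. X_min is assumed to be the matrix of minority-class samples, and the function name is illustrative; the SMOTE class in imbalanced-learn provides a full implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, random_state=0):
    """Generate n_new synthetic samples from the minority-class matrix X_min."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbors because each sample is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))      # a random minority sample x
        j = rng.choice(neigh[i][1:])      # one of its k nearest minority neighbors
        gap = rng.random()                # rand(0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```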
SMOTE avoids the simple copying of samples done by random oversampling and thus alleviates its tendency to overfit. However, the synthetic samples carry little new information, and because SMOTE produces the same number of synthetic samples for every original minority sample, it increases the possibility of overlap between classes.
Borderline-SMOTE algorithm
Borderline-SMOTE improves on SMOTE by generating new samples only for those minority samples whose k nearest neighbors contain more than half majority samples. These borderline samples are the ones most prone to misclassification, and placing artificial samples near them helps the classifier separate the minority class correctly. If a minority sample is entirely surrounded by majority samples (all of its k nearest neighbors belong to the majority class), it is regarded as a noise sample.
The condition for selecting a sample xi in Borderline-SMOTE is as follows:
k/2 ≤ | Si:k−NN ⋂ Smaj | < k
Its selection process is as follows:
- For each xi ∈ Smin, determine its set of nearest-neighbor samples, denoted Si:k−NN, where Si:k−NN ⊂ S
- For each sample xi, count how many samples in this nearest-neighbor set belong to the majority class, namely | Si:k−NN ⋂ Smaj |
- Choose the xi that satisfy the above inequality
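A small sketch of this borderline ("danger") selection, assuming NumPy arrays X and y with 1 as the minority label and an illustrative helper name, could look as follows; the BorderlineSMOTE class in imbalanced-learn implements the complete algorithm.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_samples(X, y, minority_label=1, k=5):
    """Return the indices of borderline minority samples (k/2 <= majority neighbors < k)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    min_indices = np.flatnonzero(y == minority_label)
    _, neigh = nn.kneighbors(X[min_indices])
    danger = []
    for row, i in zip(neigh, min_indices):
        # drop the sample itself, then count majority-class neighbors
        m = np.sum(y[row[1:]] != minority_label)
        if k / 2 <= m < k:
            danger.append(i)
    return np.asarray(danger)
```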
K-means-based oversampling
The K-means-based clustering oversampling method generally proceeds in the following steps:
- First, perform k-means clustering on the positive and negative classes separately
- After clustering, expand the smaller clusters using one of the oversampling methods described above
- Then, perform a balancing expansion between the positive and negative classes
This algorithm can alleviate not only the imbalance between classes but also the imbalance within classes.
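Below is a rough sketch of the within-class step for a single class, assuming X_cls is a NumPy matrix containing that class's samples. Plain duplication is used here for brevity where SMOTE or any other oversampler could equally be plugged in, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_oversample_class(X_cls, n_clusters=3, random_state=0):
    """Cluster one class, then grow every cluster to the size of its largest cluster."""
    rng = np.random.default_rng(random_state)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit_predict(X_cls)
    target = max(np.sum(labels == c) for c in range(n_clusters))
    parts = []
    for c in range(n_clusters):
        members = X_cls[labels == c]
        # duplicate random members until this cluster reaches the target size
        extra = rng.choice(len(members), size=target - len(members), replace=True)
        parts.append(np.vstack([members, members[extra]]))
    return np.vstack(parts)
```

After each class has been balanced internally in this way, the balance between the positive and negative classes can be restored with one of the oversampling methods described earlier.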
1.3. Classification threshold movement
In a binary classification problem, 0.5 is usually used as the threshold on the predicted probability: for example, a prediction probability greater than 0.5 is classified as class A, and one less than 0.5 as class B, so 0.5 is the classification threshold.
In a binary problem, if the probability that a sample is predicted as A is p, then the probability that it is B is 1 − p, and p/(1 − p) is the ratio between the two possibilities, known as the odds. If p/(1 − p) > 1, we consider the sample more likely to be class A than class B. However, if the numbers of positive and negative samples in the data set are unequal, there is also an observed odds. Suppose the data set contains M class-A samples and N class-B samples; then the observed odds is M/N (it equals 1 when the samples are balanced).
During classification, if the predicted odds p/(1 − p) is greater than the observed odds M/N, we classify the sample as A, instead of using 0.5 as the classification threshold (0.5 is the threshold only in the balanced case). From p/(1 − p) > M/N it follows that p > M/(M + N), so the prediction is class A only when p is greater than M/(M + N); in other words, M/(M + N) replaces 0.5 as the new classification threshold.
Using this principle, when faced with sample imbalance in classification learning, we can train directly on the original imbalanced samples and then change the decision rule at prediction time: with balanced samples we use 0.5 as the classification threshold, while with imbalanced samples we may, for example, require the predicted probability to exceed 0.8 before predicting the majority class.
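A tiny sketch of this threshold movement might look like this; the class counts and probabilities below are made up purely for illustration.

```python
import numpy as np

def rebalanced_predict(p_a, n_a, n_b):
    """Predict class A only when its probability exceeds the observed frequency M/(M+N)."""
    threshold = n_a / (n_a + n_b)   # M / (M + N) instead of 0.5
    return np.where(np.asarray(p_a) > threshold, "A", "B")

# e.g. 900 class-A samples and 100 class-B samples in the training data -> threshold 0.9
print(rebalanced_predict([0.55, 0.85, 0.95], n_a=900, n_b=100))   # ['B' 'B' 'A']
```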