This article originally appeared on Walker AI
Using deep learning for multi-class classification is a common task in both industry and research settings. In research, whether the task is NLP, CV, or TTS, data is usually abundant and clean. In real industrial environments, however, data issues are often the biggest obstacle for practitioners. Common problems include:
- Small data sample size
- Lack of data annotation
- The data is not clean and contains a lot of noise
- The number of samples per class is unevenly distributed, and so on
There are other problems besides these, which this article will not enumerate one by one. To address the fourth problem above, Google published the paper "Long-tail Learning via Logit Adjustment" in July 2020, which derives an adjustment to the cross-entropy function from the Balanced Error Rate (BER) and, building on the original cross entropy, achieves higher average classification accuracy. This article briefly interprets the core derivation of the paper, implements it with the Keras deep learning framework, and finally verifies the result with a simple MNIST handwritten-digit classification experiment. The interpretation covers the following four aspects:
- Basic concepts
- Core inference
- Code implementation
- Experimental results
1. Basic concepts
In deep-learning-based multi-class classification, it is often necessary to adjust the data, the network structure, the loss function, and the training parameters to obtain good classification results, and class-imbalanced data requires even more adjustment. In "Long-tail Learning via Logit Adjustment", the authors alleviate the low classification accuracy of rare classes caused by class imbalance and reach state-of-the-art (SOTA) results simply by adding label prior knowledge to the loss function. This section therefore briefly reviews four basic concepts: (1) the long-tailed distribution, (2) softmax, (3) cross entropy, and (4) BER.
1.1 Long-tailed distribution
If the classes in the training data are sorted from largest to smallest sample size and the result is plotted, class-imbalanced training data takes on a "head and tail" shape, as shown in the figure below:

The classes with large sample sizes form the "head", while the classes with small sample sizes form the "tail"; the class imbalance problem is very pronounced.
1.2 Softmax
Thanks to its normalizing behavior and simple derivative, softmax is often used as the activation of the last layer of a neural network in binary and multi-class classification to express the network's predicted output. This article does not elaborate on softmax and only gives its general formula:
$$q\left(c_{j}\right)=\frac{e^{z_{j}}}{\sum_{i=1}^{n} e^{z_{i}}}$$

In a neural network, $z_{j}$ is the output of the previous layer; $q\left(c_{j}\right)$ is the normalized output distribution of this layer; $\sum_{i=1}^{n} e^{z_{i}}$ is the sum of $e^{z_{i}}$ over all $n$ outputs.
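As a quick illustration, here is a minimal NumPy sketch of the softmax computation above (the input values are made up for the example):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)

z = np.array([2.0, 1.0, 0.1])   # example logits from the previous layer
q = softmax(z)                  # normalized distribution, sums to 1
print(q)                        # approximately [0.659 0.242 0.099]
```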
1.3 Cross entropy
This article does not derive the cross-entropy function in detail; for that, please refer to the information-theory literature. In binary and multi-class classification problems, the cross-entropy function and its variants are usually used as the loss function to be optimized. Its basic formula is:
$$H(p, q)=-\sum_{i} p\left(c_{i}\right) \log q\left(c_{i}\right)$$

In a neural network, $p\left(c_{i}\right)$ is the expected sample distribution, usually the one-hot encoded label; $q\left(c_{i}\right)$ is the output of the neural network, which can be regarded as the network's prediction for the sample.
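For concreteness, a minimal NumPy sketch of the cross entropy between a one-hot label and a predicted distribution (the values are illustrative only):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-8):
    # p: expected distribution (one-hot label); q: predicted distribution
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])        # one-hot label for class 1
q = np.array([0.242, 0.659, 0.099])  # softmax output of the network
print(cross_entropy(p, q))           # approximately 0.417, i.e. -log(0.659)
```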
1.4 BER
BER is the mean of the error rates of the positive and negative samples in binary classification. In multi-class classification, it is the equally weighted sum of the per-class error rates, which can be expressed as follows (see the paper):
$$\operatorname{BER}(f) \doteq \frac{1}{L} \sum_{y \in[L]} \mathbb{P}_{x \mid y}\left(y \notin \operatorname{argmax}_{y^{\prime} \in[L]} f_{y^{\prime}}(x)\right)$$

Here $f$ is the whole neural network; $f_{y^{\prime}}(x)$ is the network's output for class $y^{\prime}$ given input $x$; $y \notin \operatorname{argmax}_{y^{\prime} \in[L]} f_{y^{\prime}}(x)$ means that the network fails to predict the true label $y$; $\mathbb{P}_{x \mid y}$ gives the per-class error rate; and $\frac{1}{L}$ is the weight applied to every class.
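The definition can be turned into a small evaluation routine; the following NumPy sketch (an illustration, not code from the paper) computes the balanced error rate from true and predicted labels:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred, num_classes):
    # average the per-class error rates, so every class counts equally
    per_class_err = []
    for y in range(num_classes):
        mask = (y_true == y)
        if mask.sum() == 0:
            continue  # skip classes that do not appear in y_true
        per_class_err.append(np.mean(y_pred[mask] != y))
    return float(np.mean(per_class_err))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 0, 2])
print(balanced_error_rate(y_true, y_pred, 3))  # (1/3 + 1/2 + 0) / 3 ≈ 0.278
```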
2. Core inference
Following the idea of the paper, we first determine a neural network model:

$$f^{*} \in \operatorname{argmin}_{f: X \rightarrow \mathbb{R}^{L}} \operatorname{BER}(f)$$
$f^{*}$ is a neural network model that satisfies the BER condition, i.e., it minimizes the balanced error rate. Making predictions with this model, $\operatorname{argmax}_{y \in[L]} f_{y}^{*}(x)$, is equivalent to $\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\mathrm{bal}}(y \mid x)$: given the training data $x$, it finds the predicted label $y$ under the balanced distribution, in which every class is multiplied by its own weight. Abbreviated:

$$\operatorname{argmax}_{y \in[L]} f_{y}^{*}(x)=\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\mathrm{bal}}(y \mid x)$$
For $\mathbb{P}^{\mathrm{bal}}(y \mid x)$, it clearly holds that $\mathbb{P}^{\mathrm{bal}}(y \mid x) \propto \mathbb{P}(y \mid x) / \mathbb{P}(y)$, where $\mathbb{P}(y)$ is the label prior and $\mathbb{P}(y \mid x)$ is the conditional probability of the predicted label given the training data $x$. Combining this with what training a multi-class neural network actually estimates:
Following the above, suppose the output logits of the network are denoted $s^{*}: X \rightarrow \mathbb{R}^{L}$, so that $q\left(c_{i}\right)=\frac{e^{s_{i}^{*}}}{\sum_{i=1}^{n} e^{s_{i}^{*}}}$. It is then not hard to see that $\mathbb{P}(y \mid x) \propto \exp \left(s_{y}^{*}(x)\right)$. Combining this with $\mathbb{P}^{\mathrm{bal}}(y \mid x) \propto \mathbb{P}(y \mid x) / \mathbb{P}(y)$, $\mathbb{P}^{\mathrm{bal}}(y \mid x)$ can be written as:

$$\mathbb{P}^{\mathrm{bal}}(y \mid x) \propto \exp \left(s_{y}^{*}(x)\right) / \mathbb{P}(y)$$
Based on this equation, the paper gives two ways of optimizing $\mathbb{P}^{\mathrm{bal}}(y \mid x)$:
(1) Use $\operatorname{argmax}_{y \in[L]} \exp \left(s_{y}^{*}(x)\right) / \mathbb{P}(y)$: after the input $x$ has passed through all the network layers to produce a prediction, divide it by the prior $\mathbb{P}(y)$. This approach has been used in earlier work and achieves reasonable results.
(2) Use $\operatorname{argmax}_{y \in[L]} s_{y}^{*}(x)-\ln \mathbb{P}(y)$: subtract $\ln \mathbb{P}(y)$ from the logits obtained by passing the input $x$ through the network layers. This is the idea adopted in the paper.
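The two forms select the same label because the logarithm is strictly increasing; a short restatement (my own, not quoted from the paper):

$$\operatorname{argmax}_{y \in[L]} \frac{\exp \left(s_{y}^{*}(x)\right)}{\mathbb{P}(y)}=\operatorname{argmax}_{y \in[L]} \log \frac{\exp \left(s_{y}^{*}(x)\right)}{\mathbb{P}(y)}=\operatorname{argmax}_{y \in[L]}\left[s_{y}^{*}(x)-\ln \mathbb{P}(y)\right]$$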
Following the second idea, the paper directly gives a general formula, which it calls the logit adjustment loss.
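Reconstructed here from the paper's description, with $f_{y}(x)$ the logit for class $y$, $\pi_{y}$ the prior of class $y$, and $\tau$ an adjustment factor, it reads:

$$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)+\tau \log \pi_{y}}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)+\tau \log \pi_{y^{\prime}}}}$$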
Compare this with the conventional softmax cross entropy.
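On the same logits, the standard loss can be written as:

$$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)}}$$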
Essentially, an offset related to the label prior is applied to each logit (i.e., the output before the softmax activation).
3. Code implementation
The idea of the implementation is to add to the network's output logits an offset based on the prior, $\log \left(\frac{\pi_{y^{\prime}}}{\pi_{y}}\right)^{\tau}$. In practice, the adjustment factors are taken as $\tau=1$ and $\pi_{y^{\prime}}=1$ to keep things as simple as possible, and the logit adjustment loss simplifies accordingly.
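With $\tau=1$, the adjustment amounts to adding $\log \pi_{y}$ to each logit before the usual softmax cross entropy; written out (a reconstruction consistent with the implementation below):

$$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)+\log \pi_{y}}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)+\log \pi_{y^{\prime}}}}$$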
This can be implemented in the Keras framework as follows:
```python
import numpy as np
import keras.backend as K

def CE_with_prior(one_hot_label, logits, prior, tau=1.0):
    '''
    param: one_hot_label: one-hot encoded ground-truth labels
    param: logits: raw (pre-softmax) network outputs
    param: prior: class distribution of the real data, obtained by statistics
    param: tau: adjustment factor, default is 1
    return: loss
    '''
    log_prior = K.constant(np.log(prior + 1e-8))
    # align the dimensions of the prior with those of the logits
    for _ in range(K.ndim(logits) - 1):
        log_prior = K.expand_dims(log_prior, 0)
    # shift each logit by tau * log(prior) before the cross entropy
    logits = logits + tau * log_prior
    loss = K.categorical_crossentropy(one_hot_label, logits, from_logits=True)
    return loss
```
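A minimal usage sketch follows (assuming integer training labels `y_train`, one-hot labels `y_train_one_hot`, training images `x_train`, a class count `num_classes`, and a `model` whose last layer outputs raw logits; these names are placeholders, not from the original article):

```python
import numpy as np

# class prior estimated from the (imbalanced) training labels
counts = np.bincount(y_train, minlength=num_classes)
prior = counts / counts.sum()

# Keras expects a loss of the form loss(y_true, y_pred);
# the model's last layer must output logits, since from_logits=True is used
loss_fn = lambda y_true, y_pred: CE_with_prior(y_true, y_pred, prior, tau=1.0)

model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(x_train, y_train_one_hot, epochs=60, batch_size=128)
```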
4. Experimental results
The paper "Long-tail Learning via Logit Adjustment" itself compares a variety of methods for improving classification accuracy under long-tailed distributions and evaluates them on several datasets, performing better than existing methods; the detailed experiments can be found in the paper. To quickly verify the correctness of this implementation and the effectiveness of the method, a simple classification experiment was run on MNIST handwritten digits. The experimental setup is as follows:
| | Details |
|---|---|
| Training samples | Classes 0–4: 5,000 images per class; classes 5–9: 500 images per class |
| Test samples | Classes 0–9: 500 images per class |
| Runtime environment | Local CPU |
| Network structure | Convolution + max pooling + fully connected |
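For reference, a minimal Keras model matching the "convolution + max pooling + fully connected" structure in the table might look like the following sketch (the layer sizes are assumptions, not taken from the original experiment):

```python
from keras import layers, models

def build_model(num_classes=10):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        # no softmax here: CE_with_prior is applied with from_logits=True
        layers.Dense(num_classes),
    ])
    return model
```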
Under this setup, a comparative experiment was run to compare the classification network's performance when the standard multi-class cross entropy and the prior-adjusted cross entropy were used as the loss function, with the same number of training epochs (60). The results are as follows:
| | Standard multi-class cross entropy | Cross entropy with prior |
|---|---|---|
| Accuracy | 0.9578 | 0.9720 |
| Training curves | (figure omitted) | (figure omitted) |
PS: For more practical technical content, follow the WeChat official account [xingzhe_ai] and join the discussion with Walker AI!