1 Introduction

Hello friends, welcome to the Moon Inn. In a previous article [1], the author introduced how model loss is measured in single-label classification problems, namely the cross-entropy loss function, along with the common evaluation metrics for multi-class tasks and how to implement them [2]. In this article, the author will introduce in detail two common loss functions for multi-label classification tasks, as well as the model evaluation metrics used in multi-label scenarios.

2 Method 1

The first method replaces the softmax operation on the original output layer with a sigmoid operation, and then measures the error by the sigmoid cross entropy between the output layer and the labels. The formula is:


$$loss(y,\hat{y})=-\frac{1}{C}\sum_{i=1}^{C}\left[y^{(i)}\cdot\log\left(\frac{1}{1+\exp(-\hat{y}^{(i)})}\right)+\left(1-y^{(i)}\right)\cdot\log\left(\frac{\exp(-\hat{y}^{(i)})}{1+\exp(-\hat{y}^{(i)})}\right)\right]\;\;\;\;\;(1)$$

where $C$ is the number of categories, and $y^{(i)}$ and $\hat{y}^{(i)}$ denote the $i$-th components of the true label vector and of the raw network output (not passed through any activation function), respectively. Equation (1) gives the loss of a single sample; the final loss is its mean over all samples.

It can be seen from Equation (1) that this way of measuring the error is exactly the objective used in logistic regression to measure the gap between the predicted probability and the true label, here applied to each category independently.

2.1 NumPy implementation

According to Equation (1), the loss value can be computed with the following Python code:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_loss_v1(y_true, y_pred):
    t_loss = y_true * np.log(sigmoid(y_pred)) + \
             (1 - y_true) * np.log(1 - sigmoid(y_pred))  # [batch_size, num_class]
    loss = t_loss.mean(axis=-1)  # loss for each sample
    return -loss.mean()  # mean loss over the whole batch (other reductions also work)

if __name__ == '__main__':
    y_true = np.array([[1, 1, 0, 0], [0, 1, 0, 1]])
    y_pred = np.array([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]])
    print(compute_loss_v1(y_true, y_pred))  # 0.5926
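One caveat: computing np.log(sigmoid(y_pred)) directly can overflow or lose precision for large-magnitude logits. Below is a minimal numerically stable sketch of the same loss, using the identities $\log\sigma(z)=-\log(1+e^{-z})$ and $\log(1-\sigma(z))=-\log(1+e^{z})$; the name compute_loss_v1_stable is my own, not part of the original code:

def compute_loss_v1_stable(y_true, y_pred):
    # log(sigmoid(z)) = -logaddexp(0, -z); log(1 - sigmoid(z)) = -logaddexp(0, z)
    t_loss = y_true * (-np.logaddexp(0, -y_pred)) + \
             (1 - y_true) * (-np.logaddexp(0, y_pred))
    return -t_loss.mean(axis=-1).mean()

y_true = np.array([[1, 1, 0, 0], [0, 1, 0, 1]])
y_pred = np.array([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]])
print(compute_loss_v1_stable(y_true, y_pred))  # 0.5926, same as compute_loss_v1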

Of course, both TensorFlow 1.x and PyTorch ship built-in implementations of this loss.

2.2 TensorFlow implementation

In TensorFlow 1.x, it can be computed with the sigmoid_cross_entropy_with_logits method in the tf.nn module:

import tensorflow as tf  # TensorFlow 1.x

def sigmoid_cross_entropy_with_logits(labels, logits):
    loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    loss = tf.reduce_mean(loss, axis=-1)  # loss for each sample
    return tf.reduce_mean(loss)  # mean loss over the batch

if __name__ == '__main__':
    y_true = tf.constant([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=tf.float16)
    y_pred = tf.constant([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]], dtype=tf.float16)
    with tf.Session() as sess:
        loss = sess.run(sigmoid_cross_entropy_with_logits(y_true, y_pred))
        print(loss)  # 0.5926
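If you are on TensorFlow 2.x instead, the same tf.nn function still exists and runs eagerly, so no Session is needed; a minimal sketch (float32 here, so the last decimals may differ slightly from the float16 run above):

import tensorflow as tf  # assuming TensorFlow 2.x

y_true = tf.constant([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]], dtype=tf.float32)
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_pred)
print(tf.reduce_mean(loss).numpy())  # ≈ 0.5926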

Of course, once the model is trained, the predicted labels and the corresponding output values can be obtained with the following code:

def prediction(logits, K):
    y_pred = np.argsort(-logits, axis=-1)[:, :K]  # indices of the top-K outputs per sample
    print("Prediction label:", y_pred)
    p = np.vstack([logits[r, c] for r, c in enumerate(y_pred)])
    print("Predicted probability:", p)  # raw outputs; apply sigmoid to get probabilities

prediction(y_pred, 2)
# Prediction label: [[1 0]
#  [3 1]]
# Predicted probability: [[0.5 0.2]
#  [0.8 0.5]]
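Top-K prediction fixes the number of labels per sample in advance. In multi-label settings it is also common to threshold the sigmoid probabilities instead, so that each sample can receive a different number of labels. A minimal sketch (the helper name predict_by_threshold and the 0.5 cutoff are my own choices; sigmoid is the function defined in Section 2.1):

def predict_by_threshold(logits, threshold=0.5):
    probs = sigmoid(logits)
    return (probs > threshold).astype(int), probs

labels, probs = predict_by_threshold(y_pred)
print(labels)
# [[1 1 0 0]
#  [1 1 0 1]]  (a logit of exactly 0 gives probability 0.5, which is not > 0.5)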

2.3 PyTorch implementation

In PyTorch, the loss can be computed with the MultiLabelSoftMarginLoss class in the torch.nn module:

import torch
import torch.nn as nn

if __name__ == '__main__':
    y_true = torch.tensor([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=torch.float32)
    y_pred = torch.tensor([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]], dtype=torch.float32)
    loss = nn.MultiLabelSoftMarginLoss(reduction='mean')
    print(loss(y_pred, y_true))  # ≈ 0.5926
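As a side note, for a dense [batch_size, num_classes] target like this one, nn.BCEWithLogitsLoss with reduction='mean' averages over every element and therefore yields the same value; a minimal sketch:

# BCEWithLogitsLoss applies sigmoid + binary cross entropy to each label
bce = nn.BCEWithLogitsLoss(reduction='mean')
y_true = torch.tensor([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=torch.float32)
y_pred = torch.tensor([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]], dtype=torch.float32)
print(bce(y_pred, y_true))  # ≈ 0.5926, identical to MultiLabelSoftMarginLoss here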

Similarly, after the model is trained, inference can be done with the prediction function above. Note that in TensorFlow 1.x, sigmoid_cross_entropy_with_logits returns the elementwise losses, which the wrapper above then averages; in PyTorch, MultiLabelSoftMarginLoss returns the mean of all sample losses by default, and the reduction can be switched between 'mean' and 'sum' via the reduction argument.

3 Method 2

Besides the method introduced above, there is another commonly used loss function for multi-label classification. It is a direct extension of the cross-entropy loss used in single-label classification, of which single-label classification is a special case. The formula is:


$$loss(y,\hat{y})=-\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^q y^{(i)}_j\log\hat{y}^{(i)}_j\;\;\;\;\;(2)$$

where $y^{(i)}_j$ is the true value of the $j$-th category of the $i$-th sample, and $\hat{y}^{(i)}_j$ is the $j$-th output of the $i$-th sample after softmax; $m$ is the number of samples and $q$ the number of categories.

For example, for the following sample:

y_true = np.array([[1, 1, 0, 0], [0, 1, 0, 1]])
y_pred = np.array([[0.2, 0.5, 0.1, 0], [0.1, 0.5, 0, 0.8]])

After the softmax operation, the outputs are:

[[0.24549354 0.33138161 0.22213174 0.20099311]
 [0.18482871 0.27573204 0.16723993 0.37219932]]

Then, according to Equation (2), the loss over these two samples is:

$$loss=-\frac{1}{2}\left(1\cdot\log(0.245)+1\cdot\log(0.331)+1\cdot\log(0.275)+1\cdot\log(0.372)\right)\approx 2.393\;\;\;\;\;(3)$$

3.1 NumPy implementation

Following Equations (2) and (3), the loss value can be computed with the following Python code:

def softmax(x):
    s = np.exp(x)
    return s / np.sum(s, axis=-1, keepdims=True)

def compute_loss_v2(logits, y):
    logits = softmax(logits)
    print(logits)
    c = -(y * np.log(logits)).sum(axis=-1)  # sum of the losses over each sample's labels
    return np.mean(c)  # mean loss over all samples

y_true = np.array([[1, 1, 0, 0], [0, 1, 0, 1]])
y_pred = np.array([[0.2, 0.5, 0.1, 0], [0.1, 0.5, 0, 0.8]])
print(compute_loss_v2(y_pred, y_true))  # ≈ 2.3931
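In practice, np.log(softmax(x)) can underflow when one logit dominates. Below is a minimal numerically safer sketch using the log-sum-exp trick (the name compute_loss_v2_stable is my own):

def compute_loss_v2_stable(logits, y):
    # log softmax via log-sum-exp: log p_j = z_j - log(sum_k exp(z_k))
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(y * log_probs).sum(axis=-1).mean()

print(compute_loss_v2_stable(y_pred, y_true))  # ≈ 2.3931, same as compute_loss_v2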

3.2 TensorFlow implementation

In TensorFlow 1.x, it can be computed with the softmax_cross_entropy_with_logits_v2 method in the tf.nn module:

def softmax_cross_entropy_with_logits(labels, logits):
    loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
    return tf.reduce_mean(loss)

y_true = tf.constant([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=tf.float16)
y_pred = tf.constant([[0.2, 0.5, 0.1, 0], [0.1, 0.5, 0, 0.8]], dtype=tf.float16)
with tf.Session() as sess:
    loss = sess.run(softmax_cross_entropy_with_logits(y_true, y_pred))
    print(loss)  # 2.395 (float16)

3.3 PyTorch implementation

In PyTorch, I have not found a ready-made loss for this case, but it is straightforward to implement by hand:

def cross_entropy(logits, y):
    s = torch.exp(logits)
    probs = s / torch.sum(s, dim=1, keepdim=True)  # softmax
    c = -(y * torch.log(probs)).sum(dim=-1)
    return torch.mean(c)

y_true = torch.tensor([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=torch.float32)
y_pred = torch.tensor([[0.2, 0.5, 0.1, 0], [0.1, 0.5, 0, 0.8]])
loss = cross_entropy(y_pred, y_true)
print(loss)  # tensor(2.3931)
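If you want a numerically safer version, torch.nn.functional.log_softmax can replace the explicit exp/normalize/log sequence; a minimal sketch reusing the tensors above:

import torch.nn.functional as F

def cross_entropy_stable(logits, y):
    # log_softmax avoids computing log(softmax(x)) in two lossy steps
    return -(y * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(cross_entropy_stable(y_pred, y_true))  # tensor(2.3931)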

It is worth noting that because the frameworks run at different floating-point precisions (note the float16 tensors in the TensorFlow example), the final results differ slightly after a few decimal places.

4 Evaluation Metrics

4.1 Metrics that ignore partial correctness

(1) Exact Match Ratio

Exact match means that, for each sample, the prediction counts as correct only if the predicted label vector is identical to the true one; a single mismatched category makes the whole prediction wrong. The exact match ratio is therefore:


$$MR=\frac{1}{m}\sum_{i=1}^m I(y^{(i)}==\hat{y}^{(i)})\;\;\;\;\;(4)$$

where $m$ is the total number of samples and $I(\cdot)$ is the indicator function, equal to 1 when $y^{(i)}$ is identical to $\hat{y}^{(i)}$ and 0 otherwise. The higher the MR, the more accurate the classifier.

For example, given the following true and predicted values:

y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])

Then the corresponding MR is $0.333$, because only the second sample is predicted completely correctly. In sklearn, the calculation can be done directly with the accuracy_score method in the sklearn.metrics module [3], as shown below:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_true, y_pred))  # 0.33333333
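The same number also follows from a one-line NumPy check, since a sample only counts when every label matches:

print((y_true == y_pred).all(axis=1).mean())  # 0.3333...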

(2) 0-1 Loss

In addition to the exact match ratio, there is an evaluation criterion whose calculation is exactly the opposite: the 0-1 loss. Where the exact match ratio measures the fraction of samples predicted completely correctly, the 0-1 loss measures the fraction of samples whose predictions are not an exact match. For the predictions above, the 0-1 loss is therefore 0.667. The formula is:


$$L_{0-1}=\frac{1}{m}\sum_{i=1}^m I(y^{(i)}\neq\hat{y}^{(i)})\;\;\;\;\;(5)$$

In sklearn, the calculation can be done with the zero_one_loss method in the sklearn.metrics module [3], as shown below:

from sklearn.metrics import zero_one_loss
print(zero_one_loss(y_true, y_pred))  # 0.66666
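Since Equations (4) and (5) are complements of each other, the same value can also be obtained from accuracy_score:

print(1 - accuracy_score(y_true, y_pred))  # 0.6666...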

4.2 Metrics that account for partial correctness

As the two metrics above show, neither the exact match ratio nor the 0-1 loss gives any credit for partially correct predictions, which is clearly a harsh way to evaluate a model. For example, suppose the true label is [1, 0, 0, 1] and the model predicts [1, 0, 1, 0]: the prediction is not entirely correct, but it is not entirely wrong either. It is therefore sensible to account for partially correct predictions [4]. To this end, reference [5] proposed multi-label versions of Accuracy, Precision, Recall, and the $F_1$-measure.

(1) Accuracy

For accuracy, the calculation formula is as follows:


$$\text{Accuracy}=\frac{1}{m}\sum_{i=1}^{m}\frac{\lvert y^{(i)}\cap\hat{y}^{(i)}\rvert}{\lvert y^{(i)}\cup\hat{y}^{(i)}\rvert}\;\;\;\;\;(6)$$

It can be seen from Equation (6) that the accuracy averages a per-sample accuracy over all samples. For each sample, the accuracy is the number of labels predicted correctly divided by the number of labels that are positive in either the prediction or the ground truth (the Jaccard similarity of the two label sets). For example, if a sample's true label is [0, 1, 0, 1] and its predicted label is [0, 1, 1, 0], its accuracy is:


$$acc=\frac{1}{1+1+1}=\frac{1}{3}\;\;\;\;\;(7)$$

Therefore, for the following true and predicted results:

y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])

Its accuracy is:


$$\text{Accuracy}=\frac{1}{3}\times\left(\frac{1}{3}+\frac{2}{2}+\frac{1}{4}\right)\approx 0.5278\;\;\;\;\;(8)$$

The corresponding implementation code is [6] :

def Accuracy(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        p = sum(np.logical_and(y_true[i], y_pred[i]))  # size of the intersection
        q = sum(np.logical_or(y_true[i], y_pred[i]))   # size of the union
        count += p / q
    return count / y_true.shape[0]
print(Accuracy(y_true, y_pred))  # 0.52777
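Since Equation (6) is just the Jaccard similarity averaged over samples, newer sklearn versions can compute it directly; a sketch, assuming your version (0.21+) provides jaccard_score:

from sklearn.metrics import jaccard_score
print(jaccard_score(y_true, y_pred, average='samples'))  # 0.52777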

(2) Precision

For precision, the calculation formula is as follows:


$$\text{Precision}=\frac{1}{m}\sum_{i=1}^{m}\frac{\lvert y^{(i)}\cap\hat{y}^{(i)}\rvert}{\lvert\hat{y}^{(i)}\rvert}\;\;\;\;\;(9)$$

It can be seen from Equation (9) that the precision averages a per-sample precision over all samples. For each sample, the precision is the number of correctly predicted labels divided by the total number of predicted labels. For the sample with true label [0, 1, 0, 1] and predicted label [0, 1, 1, 0], the precision is:


$$pre=\frac{1}{1+1}=\frac{1}{2}\;\;\;\;\;(10)$$

Therefore, for the true and predicted results above, the precision is:


$$\text{Precision}=\frac{1}{3}\times\left(\frac{1}{2}+\frac{2}{2}+\frac{1}{2}\right)\approx 0.6666\;\;\;\;\;(11)$$

The corresponding implementation code is:

def Precision(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        if sum(y_pred[i]) == 0:  # skip samples with no predicted labels
            continue
        count += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_pred[i])
    return count / y_true.shape[0]
print(Precision(y_true, y_pred))  # 0.6666

(3) Recall

For recall, the calculation formula is as follows:


$$\text{Recall}=\frac{1}{m}\sum_{i=1}^{m}\frac{\lvert y^{(i)}\cap\hat{y}^{(i)}\rvert}{\lvert y^{(i)}\rvert}\;\;\;\;\;(12)$$

It can be seen from Equation (12) that the recall averages a per-sample recall over all samples. For each sample, the recall is the number of correctly predicted labels divided by the total number of true labels.

Therefore, for the following true and predicted results:

y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])

Its recall is:


$$\text{Recall}=\frac{1}{3}\times\left(\frac{1}{2}+\frac{2}{2}+\frac{1}{3}\right)\approx 0.6111\;\;\;\;\;(13)$$

The corresponding implementation code is:

def Recall(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        if sum(y_true[i]) == 0:  # skip samples with no true labels
            continue
        count += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_true[i])
    return count / y_true.shape[0]
print(Recall(y_true, y_pred))  # 0.6111

(4) $F_1$ value

For the $F_1$ value, the calculation formula is:


$$F_{1}=\frac{1}{m}\sum_{i=1}^{m}\frac{2\lvert y^{(i)}\cap\hat{y}^{(i)}\rvert}{\lvert y^{(i)}\rvert+\lvert\hat{y}^{(i)}\rvert}\;\;\;\;\;(14)$$

It can be seen from Equation (14) that $F_1$ likewise averages a per-sample value, here the harmonic mean of each sample's precision and recall. For the true and predicted results above, the $F_1$ value is:


$$F_1=\frac{2}{3}\left(\frac{1}{4}+\frac{2}{4}+\frac{1}{5}\right)\approx 0.6333\;\;\;\;\;(15)$$

The corresponding implementation code is:

def F1Measure(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        if (sum(y_true[i]) == 0) and (sum(y_pred[i]) == 0):
            continue
        p = sum(np.logical_and(y_true[i], y_pred[i]))
        q = sum(y_true[i]) + sum(y_pred[i])
        count += (2 * p) / q
    return count / y_true.shape[0]
print(F1Measure(y_true, y_pred))  # 0.6333

For all four metrics above, larger values indicate better classification. Equations (6), (9), (12) and (14) also show that, although the computations differ from their single-label counterparts, the underlying ideas are the same.

Of course, sklearn can also compute the last three metrics directly. The code is as follows:

from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_true=y_true, y_pred=y_pred, average='samples'))  # 0.6666
print(recall_score(y_true=y_true, y_pred=y_pred, average='samples'))  # 0.6111
print(f1_score(y_true, y_pred, average='samples'))  # 0.6333

(5) Hamming Loss

In addition to the six metrics introduced above, there is another, more intuitive measure: the Hamming Loss [3]. Its formula is:


$$\text{Hamming Loss}=\frac{1}{mq}\sum_{i=1}^{m}\sum_{j=1}^{q}I\left(y^{(i)}_{j}\neq\hat{y}^{(i)}_{j}\right)\;\;\;\;\;(16)$$

where $y^{(i)}_j$ is the $j$-th label of the $i$-th sample and $q$ is the number of categories.

As Equation (16) shows, the Hamming Loss is the fraction of mispredicted label entries among all label entries across all samples, so the smaller the value, the better the model. Therefore, for the following true and predicted results:

y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])

Its Hamming Loss is:


$$\text{Hamming Loss}=\frac{1}{3\times 4}(2+0+3)\approx 0.4166\;\;\;\;\;(17)$$

The corresponding implementation code is:

def Hamming_Loss(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        p = np.size(y_true[i] == y_pred[i])  # total number of labels
        q = np.count_nonzero(y_true[i] == y_pred[i])  # number of matching labels
        count += p - q  # number of mismatches
    return count / (y_true.shape[0] * y_true.shape[1])
print(Hamming_Loss(y_true, y_pred))  # 0.4166

This can also be done using the hamming_loss method in sklearn.metrics:

from sklearn.metrics import hamming_loss
print(hamming_loss(y_true, y_pred))  # 0.4166
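Equivalently, because the Hamming Loss is simply the fraction of mismatched label entries, one line of NumPy reproduces it:

print((y_true != y_pred).mean())  # 0.4166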

Of course, although seven evaluation metrics are introduced here, there are still other evaluation methods for multi-label classification; see reference [4] for details. For example, the multilabel_confusion_matrix method in the sklearn.metrics module produces one confusion matrix per label, from which per-class precision and recall can be computed and then averaged over the classes, as sketched below.
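A minimal sketch of that per-class computation (the zero-division guard via np.maximum is my own convention; sklearn itself reports such cases as 0 with a warning):

from sklearn.metrics import multilabel_confusion_matrix

# multilabel_confusion_matrix returns one 2x2 matrix per label: [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(y_true, y_pred)
tp, fp, fn = mcm[:, 1, 1], mcm[:, 0, 1], mcm[:, 1, 0]
precision_per_class = tp / np.maximum(tp + fp, 1)  # guard against 0/0
recall_per_class = tp / np.maximum(tp + fn, 1)
print(precision_per_class.mean())  # ≈ 0.5417 (macro precision)
print(recall_per_class.mean())     # 0.5 (macro recall)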

5 Conclusion

In this article, the author first introduced the two common loss functions used in multi-label classification tasks: the first is essentially the objective of a logistic regression model applied to each label independently, and the second is an extension of the cross-entropy loss used in single-label classification. The author then introduced a variety of metrics for evaluating multi-label classification results, including the exact match ratio, accuracy, precision, recall, and the Hamming Loss.

That is all for this article; thank you for reading! If you found the content above helpful, you are welcome to follow and share this public account! If you have any questions or suggestions, please add the author on WeChat ('NULls8') or leave a message. The green hills never change and the clear waters keep flowing; see you at the Moon Inn!

References

[1] To understand multiple classifications, we must talk about logistic regression

[2] The recall rate and F value under multi-classification task

[3] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

[4] Sorower, Mohammad S. "A Literature Survey on Algorithms for Multi-label Learning." (2010).

[5] Godbole, S., & Sarawagi, S. Discriminative Methods for Multi-labeled Classification. Lecture Notes in Computer Science, (2004), 22–30.

[6] mmuratarat.github.io/2020-01-25/…