
Sklearn provides a variety of model evaluation methods for multi-label classification scenarios. This article describes the common multi-label classification metrics in Sklearn. In multi-label classification, evaluation metrics fall into two categories: methods that do not consider partially correct predictions and methods that do.

First, we define example ground-truth data and predicted results; all subsequent examples are based on this data:

import numpy as np

y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])

Evaluation methods that do not consider partial correctness

Exact Match Ratio

Exact matching means that, for each sample, the prediction counts as correct only if the predicted labels are exactly the same as the true labels; if even one label differs, the prediction is considered incorrect. The accuracy is therefore calculated as:


$$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} I(\hat{y}_i = y_i)$$

Here, $I(x)$ is the indicator function: it equals 1 when $\hat{y}_i$ is exactly the same as $y_i$, and 0 otherwise.

The larger the value is, the higher the classification accuracy is.

from sklearn.metrics import accuracy_score

print(accuracy_score(y_true, y_pred))  # 0.33333333
print(accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2))))  # 0.5
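For intuition, here is a minimal NumPy sketch of the same exact match ratio, assuming the binary indicator arrays defined above (the function name is my own, not from Sklearn):

def exact_match_ratio(y_true, y_pred):
    # A sample counts as correct only when every one of its labels matches.
    return np.all(y_true == y_pred, axis=1).mean()

print(exact_match_ratio(y_true, y_pred))  # 0.3333...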

0-1 loss

In addition to the exact match ratio, there is another criterion whose calculation goes in the opposite direction: the 0-1 loss. The exact match ratio measures the proportion of samples whose predictions are completely correct, while the 0-1 loss measures the proportion of samples that are not predicted completely correctly, i.e. samples with at least one mislabeled category.

The formula is:


$$L_{0-1}(y, \hat{y}) = \frac{1}{m} \sum_{i=0}^{m-1} I(\hat{y}_i \neq y_i)$$

Here, $I(x)$ is again the indicator function, and $m$ is the number of samples.

from sklearn.metrics import zero_one_loss

print(zero_one_loss(y_true,y_pred)) # 0.66666
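Since the 0-1 loss counts exactly the samples that the exact match ratio does not, the two should sum to 1; a quick sanity check:

from sklearn.metrics import accuracy_score, zero_one_loss

# zero_one_loss (with the default normalize=True) is the complement of accuracy_score.
print(1 - accuracy_score(y_true, y_pred))  # 0.6666...
print(zero_one_loss(y_true, y_pred))       # 0.6666..., the same value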

Evaluation methods that consider partial correctness

As the two metrics above show, neither the exact match ratio nor the 0-1 loss gives any credit for partially correct predictions, which is clearly not ideal for evaluating a model. For example, suppose the correct labels are [1, 0, 0, 1] and the model predicts [1, 0, 1, 0]. Although the model did not predict every label correctly, it did predict some of them correctly, so it is advisable to take these partially correct predictions into account. Sklearn provides calculation methods for Precision, Recall and F1 in multi-label classification scenarios.

Precision

Precision here is actually the average precision over all samples. For each sample, precision is the proportion of correctly predicted labels among all labels predicted by the classifier.

The formula is:


$$P(y_s, \hat{y}_s) = \frac{\left| y_s \cap \hat{y}_s \right|}{\left| \hat{y}_s \right|}$$

$$\text{Precision} = \frac{1}{\left|S\right|} \sum_{s \in S} P(y_s, \hat{y}_s)$$

Here, $y_s$ is the set of true (positive) labels of sample $s$, $\hat{y}_s$ is the set of labels the classifier predicts as positive, and $S$ is the set of all samples.

For example, for a sample whose true labels are [0, 1, 0, 1] and predicted labels are [0, 1, 1, 0], the precision of this sample is:


$$\text{precision} = \frac{1}{1+1} = \frac{1}{2}$$

Therefore, for the real data and predicted results above, the precision is:


$$\text{Precision} = \frac{1}{3} \times \left(\frac{1}{2} + \frac{2}{2} + \frac{1}{2}\right) \approx 0.666$$

The corresponding code implementation is as follows:

def Precision(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        if sum(y_pred[i]) == 0:
            continue
        count += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_pred[i])
    return count / y_true.shape[0]

print(Precision(y_true, y_pred))  # 0.6666

The implementation in Sklearn is as follows:

from sklearn.metrics import precision_score
print(precision_score(y_true=y_true, y_pred=y_pred, average='samples')) # 0.6666

Recall

Recall is actually the average recall over all samples. For each sample, recall is the proportion of correctly predicted labels among all true labels.

The formula is:


$$R(y_s, \hat{y}_s) = \frac{\left| y_s \cap \hat{y}_s \right|}{\left| y_s \right|}$$

$$\text{Recall} = \frac{1}{\left|S\right|} \sum_{s \in S} R(y_s, \hat{y}_s)$$

As before, $y_s$ is the set of true labels of sample $s$, and $\hat{y}_s$ is the set of labels the classifier predicts as positive.

Therefore, for the real data and predicted results above, the recall is:


$$\text{Recall} = \frac{1}{3} \times \left(\frac{1}{2} + \frac{2}{2} + \frac{1}{3}\right) \approx 0.611$$

The corresponding code implementation is as follows:

def Recall(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        if sum(y_true[i]) == 0:
            continue
        count += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_true[i])
    return count / y_true.shape[0]

print(Recall(y_true, y_pred))  # 0.6111

The implementation in Sklearn is as follows:

from sklearn.metrics import recall_score
print(recall_score(y_true=y_true, y_pred=y_pred, average='samples'))# 0.6111

F1 value

The $F_1$ score is likewise computed as the average $F_1$ value over all samples.

The formula is:


$$F_\beta(y_s, \hat{y}_s) = \left(1 + \beta^2\right) \frac{P(y_s, \hat{y}_s) \times R(y_s, \hat{y}_s)}{\beta^2 P(y_s, \hat{y}_s) + R(y_s, \hat{y}_s)}$$

$$F_\beta = \frac{1}{\left|S\right|} \sum_{s \in S} F_\beta(y_s, \hat{y}_s)$$

When $\beta = 1$, this becomes the $F_1$ score. The formula is:


$$F_1 = \frac{1}{\left|S\right|} \sum_{s \in S} \frac{2 \, P(y_s, \hat{y}_s) \times R(y_s, \hat{y}_s)}{P(y_s, \hat{y}_s) + R(y_s, \hat{y}_s)} = \frac{1}{\left|S\right|} \sum_{s \in S} \frac{2 \left| y_s \cap \hat{y}_s \right|}{\left| \hat{y}_s \right| + \left| y_s \right|}$$

Therefore, for the real and predicted results above, the $F_1$ value is:


$$F_1 = \frac{2}{3} \times \left(\frac{1}{4} + \frac{1}{2} + \frac{1}{5}\right) \approx 0.633$$

The corresponding code implementation is as follows:

def F1Measure(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        if (sum(y_true[i]) == 0) and (sum(y_pred[i]) == 0):
            continue
        p = sum(np.logical_and(y_true[i], y_pred[i]))
        q = sum(y_true[i]) + sum(y_pred[i])
        count += (2 * p) / q
    return count / y_true.shape[0]

print(F1Measure(y_true, y_pred))  # 0.6333

The implementation in Sklearn is as follows:

from sklearn.metrics import f1_score
print(f1_score(y_true,y_pred,average='samples')) # 0.6333
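For the more general $F_\beta$ formula above, Sklearn also exposes fbeta_score; a small sketch (with beta=1 it should reduce to the f1_score result, and beta values below 1 weight precision more heavily than recall):

from sklearn.metrics import fbeta_score

# With beta=1, fbeta_score reduces to f1_score.
print(fbeta_score(y_true, y_pred, beta=1, average='samples'))    # 0.6333..., same as f1_score
# beta=0.5 emphasizes precision over recall.
print(fbeta_score(y_true, y_pred, beta=0.5, average='samples'))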

For the four metrics above, larger values indicate better classification performance. It can also be seen from the formulas that, although the calculation steps in the multi-label scenario differ from those in the single-label scenario, the underlying ideas are similar.

Hamming Score

Hamming Score is another way to calculate accuracy in multi-label classification scenarios. It measures the average accuracy across all samples. For each sample, accuracy is the ratio of the number of correctly predicted labels to the total number of labels in the union of the predicted and true labels.

The formula is:


$$\text{Accuracy} = \frac{1}{m} \sum_{i=1}^{m} \frac{\left| y_i \cap \hat{y}_i \right|}{\left| y_i \cup \hat{y}_i \right|}$$

For example, for a sample, the real label is [0, 1, 0, 1] and the prediction label is [0, 1, 1, 0]. Then the corresponding accuracy of the sample should be:


$$\text{accuracy} = \frac{1}{1+1+1} = \frac{1}{3}$$

Therefore, for the real data and predicted results above, its Hamming Score is:


$$\text{Accuracy} = \frac{1}{3} \times \left(\frac{1}{3} + \frac{2}{2} + \frac{1}{4}\right) \approx 0.5278$$

The corresponding code implementation is as follows:

import numpy as np

def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case
    http://stackoverflow.com/q/32239577/395857
    '''
    acc_list = []
    for i in range(y_true.shape[0]):
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        tmp_a = None
        if len(set_true) == 0 and len(set_pred) == 0:
            tmp_a = 1
        else:
            tmp_a = len(set_true.intersection(set_pred)) / float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)
    return np.mean(acc_list)


y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])


print('Hamming score: {0}'.format(hamming_score(y_true, y_pred))) # 0.5277
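Incidentally, this per-sample Hamming score is the same quantity as the sample-averaged Jaccard similarity, so Sklearn's jaccard_score with average='samples' should give the same result; a quick cross-check, not part of the original formulation:

from sklearn.metrics import jaccard_score

# Per-sample Jaccard similarity |intersection| / |union|, averaged over samples.
print(jaccard_score(y_true, y_pred, average='samples'))  # 0.5277...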

Hamming Loss

Hamming Loss measures the proportion of wrongly predicted labels among all labels over all samples, so the smaller the value, the better the model performs. The value ranges from 0 to 1: a value of 0 means the predictions are exactly the same as the true labels, and a value of 1 means the predictions are exactly the opposite of what we want.

The formula is:


$$L_{Hamming}(y, \hat{y}) = \frac{1}{m \cdot n_\text{labels}} \sum_{i=0}^{m-1} \sum_{j=0}^{n_\text{labels}-1} I(\hat{y}_j^{(i)} \neq y_j^{(i)})$$

Here, $m$ indicates the number of samples, and $n_\text{labels}$ indicates the number of labels.

Therefore, for the real and predicted results above, the Hamming Loss value is


$$\text{Hamming Loss} = \frac{1}{3 \times 4} \times (2 + 0 + 3) \approx 0.4166$$

The corresponding code implementation is as follows:

def Hamming_Loss(y_true, y_pred):
    count = 0
    for i in range(y_true.shape[0]):
        # Number of labels per sample
        p = np.size(y_true[i] == y_pred[i])
        # np.count_nonzero counts the non-zero elements in an array,
        # i.e. the number of correctly predicted labels in a single sample
        q = np.count_nonzero(y_true[i] == y_pred[i])
        print(f"{p}-->{q}")
        count += p - q
    print(f"Number of samples: {y_true.shape[0]}, number of labels: {y_true.shape[1]}")  # Number of samples: 3, number of labels: 4
    return count / (y_true.shape[0] * y_true.shape[1])

print(Hamming_Loss(y_true, y_pred))  # 0.4166


The implementation in Sklearn is as follows:

from sklearn.metrics import hamming_loss

print(hamming_loss(y_true, y_pred))  # 0.4166
print(hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2))))  # 0.75

Conclusion

In addition to the multi-label model evaluation methods presented above, Sklearn also provides others, such as multilabel_confusion_matrix and jaccard_score (formerly jaccard_similarity_score), which are not covered in detail here.
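For reference, a minimal sketch of multilabel_confusion_matrix on the same example data, which returns one 2x2 confusion matrix per label:

from sklearn.metrics import multilabel_confusion_matrix

# One 2x2 confusion matrix per label, in [[TN, FP], [FN, TP]] order.
print(multilabel_confusion_matrix(y_true, y_pred))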

Reference documentation

  • sklearn model_evaluation
  • Loss function and evaluation index in multi-label classification
  • Getting the accuracy for multi-label prediction in scikit-learn
  • multi-label classification with sklearn
  • Metrics for Multilabel Classification