Preface
Metrics play an important role in machine learning and deep learning. We start by selecting metrics suited to the problem so we can understand the baseline score of a particular model. In this blog, we examine the best and most commonly used metrics for multi-label classification, and how they differ.
Next, let's take a closer look at what multi-label classification is, in case you are not familiar with it. If we have data describing a dog, for example, we can predict both which breed it is and which pet category it belongs to.
In the case of object detection, multi-label classification gives us a list of all the objects in an image, as shown in the figure below. Here the classifier detects three objects. If there are 4 possible object classes in total, the prediction can be represented as the list [1 0 1 1] (the corresponding classes are [dog, person, bicycle, truck]). This kind of task is called multi-label classification.
The difference between multi-class classification and multi-label classification:
- Multi-class classification: a classification task with more than two classes, where each sample belongs to one and only one label. For example, a fruit can be an apple or an orange, but not both at the same time.
- Multi-label classification: each sample is assigned one or more labels. For example, a news article can be classified as both sports and entertainment. A minimal sketch of the two label formats is shown below.
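A minimal sketch with made-up data (the class names and vectors are purely illustrative):

# Multi-class: each sample gets exactly one label.
y_multiclass = ["apple", "orange", "apple", "banana"]

# Multi-label: each sample can carry several labels at once, often encoded as a
# binary vector over the label set [dog, person, bicycle, truck].
y_multilabel = [
    [1, 0, 1, 1],  # dog, bicycle and truck detected
    [0, 1, 0, 0],  # only a person detected
    [1, 1, 0, 0],  # a dog and a person detected
]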
Multi-label classification evaluation metrics
The most common metrics for multi-label classification are the following:
- Precision at k
- Avg precision at k
- Mean avg precision at k
- Sampled F1 Score
- Log Loss
Let’s look at the details of these metrics.
Precision at k (P@K)
Given a list of actual classes and predicted classes, Precision@K is defined as the number of correct predictions among the top k predicted elements, divided by k (the number of predictions considered). Values range from 0 to 1.
A code example is as follows:
def patk(actual, pred, k):
    # we return 0 if k is 0 because
    # we can't divide the number of common values by 0
    if k == 0:
        return 0

    # taking only the top k predictions of a class
    k_pred = pred[:k]

    # taking the set of the actual values
    actual_set = set(actual)

    # taking the set of the predicted values
    pred_set = set(k_pred)

    # taking the intersection of the predicted and the actual values
    common_values = actual_set.intersection(pred_set)
    result = len(common_values) / len(k_pred)
    print(f"K: {k}, true value: {actual_set}, predicted value: {pred_set}, "
          f"intersection: {common_values}, P@k: {result}")

    return result

# defining the values of the actual and the predicted class
y_true = [1, 2, 0]
y_pred = [1, 1, 0]

if __name__ == "__main__":
    print(patk(y_true, y_pred, 3))
The running results are as follows:
K: 3, true value: {0, 1, 2}, predicted value: {0, 1}, intersection: {0, 1}, P@k: 0.6666666666666666
0.6666666666666666
Average Precision at K (AP@K)
AP@k is defined as the average of all P@i values for i = 1 to k. For clarity, let's look at the code. Values range from 0 to 1.
import numpy as np

def apatk(actual, pred, k):
    # creating a list for storing the precision values at each cutoff up to k
    precision_ = []
    for i in range(1, k + 1):
        # calculating the precision at the different values of i
        # and appending them to the list (patk is the function defined above)
        precision_.append(patk(actual, pred, i))

    # return 0 if there are no values in the list
    if len(precision_) == 0:
        return 0

    # returning the average of all the precision values
    return np.mean(precision_)

# defining the values of the actual and the predicted class
y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

if __name__ == "__main__":
    for i in range(len(y_true)):
        for j in range(1, 4):
            print(
                f"""
y_true = {y_true[i]}
y_pred = {y_pred[i]}
AP@{j} = {apatk(y_true[i], y_pred[i], k=j)}"""
            )
        print("-----------")
The running results (with the per-k P@k trace printed by patk omitted for brevity) are as follows:
y_true = [1, 2, 0, 1]
y_pred = [1, 1, 0, 1]
AP@1 = 1.0
AP@2 = 0.75
AP@3 = 0.7222222222222222
-----------
y_true = [0, 4]
y_pred = [1, 4]
AP@1 = 0.0
AP@2 = 0.25
AP@3 = 0.3333333333333333
-----------
y_true = [3]
y_pred = [2]
AP@1 = 0.0
AP@2 = 0.0
AP@3 = 0.0
-----------
y_true = [1, 2]
y_pred = [1, 3]
AP@1 = 1.0
AP@2 = 0.75
AP@3 = 0.6666666666666666
-----------
Mean Average Precision at K (MAP@K)
The average of AP@k over the entire dataset is called MAP@k. It summarizes in a single number how accurate the predictions are overall. Values range from 0 to 1.
AP@k measures how well the model does on an individual sample's label list, while MAP@k measures how well it does across the whole dataset.
import numpy as np

def mapk(actual, pred, k):
    # creating a list for storing the average precision values
    average_precision = []
    # iterating through the whole data and calculating AP@k for each sample
    for i in range(len(actual)):
        ap = apatk(actual[i], pred[i], k)
        print(f"AP@k: {ap}")
        average_precision.append(ap)

    # returning the mean over all the samples
    return np.mean(average_precision)

# defining the values of the actual and the predicted class
y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

if __name__ == "__main__":
    print(mapk(y_true, y_pred, 3))
The running results are as follows:
K: 1, true value: {0, 1, 2}, predicted value: {1}, intersection: {1}, P@k: 1.0
K: 2, true value: {0, 1, 2}, predicted value: {1}, intersection: {1}, P@k: 0.5
K: 3, true value: {0, 1, 2}, predicted value: {0, 1}, intersection: {0, 1}, P@k: 0.6666666666666666
AP@k: 0.7222222222222222
K: 1, true value: {0, 4}, predicted value: {1}, intersection: set(), P@k: 0.0
K: 2, true value: {0, 4}, predicted value: {1, 4}, intersection: {4}, P@k: 0.5
K: 3, true value: {0, 4}, predicted value: {1, 4}, intersection: {4}, P@k: 0.5
AP@k: 0.3333333333333333
K: 1, true value: {3}, predicted value: {2}, intersection: set(), P@k: 0.0
K: 2, true value: {3}, predicted value: {2}, intersection: set(), P@k: 0.0
K: 3, true value: {3}, predicted value: {2}, intersection: set(), P@k: 0.0
AP@k: 0.0
K: 1, true value: {1, 2}, predicted value: {1}, intersection: {1}, P@k: 1.0
K: 2, true value: {1, 2}, predicted value: {1, 3}, intersection: {1}, P@k: 0.5
K: 3, true value: {1, 2}, predicted value: {1, 3}, intersection: {1}, P@k: 0.5
AP@k: 0.6666666666666666
0.4305555555555556
This is a bad score because there are a lot of errors in the prediction set.
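As a quick sanity check, the final number is simply the mean of the four AP@3 values printed above:

$$\text{MAP@3} = \frac{0.7222 + 0.3333 + 0.0 + 0.6667}{4} \approx 0.4306$$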
Sampled F1 Score
This metric first computes the F1 score for each sample in the data and then averages these F1 scores over all samples.
We will use scikit-learn's implementation in our code; see the sklearn.metrics.f1_score documentation for details. Values range from 0 to 1.
We first convert the labels to a binary 0-1 format and then compute the F1 score on it.
A code example is as follows:
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

def f1_sampled(actual, pred):
    # converting the multi-label targets into a binary 0-1 matrix
    # (the binarizer is fitted once so both matrices share the same columns)
    mlb = MultiLabelBinarizer()
    actual = mlb.fit_transform(actual)
    pred = mlb.transform(pred)
    print(f"Multi-label binarized classes: {mlb.classes_}")
    print(f"True values:\n{actual}\nPredicted values:\n{pred}")

    # computing the sample-averaged F1 score
    f1 = f1_score(actual, pred, average="samples")
    return f1

# defining the values of the actual and the predicted class
# there are five classes in total
y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

if __name__ == "__main__":
    print(f1_sampled(y_true, y_pred))
The running results are as follows:
Multi-label binarized classes: [0 1 2 3 4]
True values:
[[1 1 1 0 0]
 [1 0 0 0 1]
 [0 0 0 1 0]
 [0 1 1 0 0]]
Predicted values:
[[1 1 0 0 0]
 [0 1 0 0 1]
 [0 0 1 0 0]
 [0 1 0 1 0]]
0.45
We know that the F1 score is between 0 and 1, and here we get a score of 0.45. This is because the prediction set is bad. If we had a better prediction set, the value would be closer to 1.
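As a rough hand check of that 0.45 using the binarized rows above: the first sample has precision 2/2 and recall 2/3, giving F1 = 0.8; the remaining samples give 0.5, 0.0 and 0.5, so

$$F1_{\text{sampled}} = \frac{0.8 + 0.5 + 0.0 + 0.5}{4} = 0.45$$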
Log Loss
Log loss is also known as logistic loss or cross-entropy loss.
First, convert the targets to a binary 0-1 format, then compute the log loss for each label, and finally average these log losses. This is also known as the mean column-wise log loss.
The formula (in the form used by PyTorch's MultiLabelSoftMarginLoss) is:

$$\text{loss}(x, y) = -\frac{1}{C}\sum_{i} \Big[ y[i] \cdot \log\big(\sigma(x[i])\big) + (1 - y[i]) \cdot \log\big(1 - \sigma(x[i])\big) \Big]$$

where $i \in \{0, \dots, \text{x.nElement}() - 1\}$, $y[i] \in \{0, 1\}$, $\sigma$ is the sigmoid function, and $C$ is the number of classes.
In the formula, x corresponds to the predicted scores and y to the true labels.
This loss measures the error between the predicted probabilities and the true labels, just as it does in logistic regression.
A code example is as follows:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_loss_v1(y_true, y_pred):
    t_loss = y_true * np.log(sigmoid(y_pred)) + (1 - y_true) * np.log(1 - sigmoid(y_pred))
    loss = t_loss.mean(axis=-1)  # the loss of each sample, averaged over its labels
    return -loss.mean()          # the mean loss over all samples

if __name__ == '__main__':
    y_true = np.array([[1, 1, 0, 0], [0, 1, 0, 1]])
    y_pred = np.array([[0.2, 0.5, 0, 0], [0.1, 0.5, 0, 0.8]])
    print(compute_loss_v1(y_true, y_pred))
The running results are as follows:
0.5926539631803737
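As a cross-check, assuming PyTorch is available: torch.nn.MultiLabelSoftMarginLoss implements this same formula, so it should produce (almost) the same number on this data:

import torch
import torch.nn as nn

# the same data as above, as float tensors
y_true = torch.tensor([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=torch.float32)
y_pred = torch.tensor([[0.2, 0.5, 0.0, 0.0], [0.1, 0.5, 0.0, 0.8]], dtype=torch.float32)

# MultiLabelSoftMarginLoss(input, target): input holds the raw scores,
# target holds the 0/1 labels; the default reduction averages over the batch
loss_fn = nn.MultiLabelSoftMarginLoss()
print(loss_fn(y_pred, y_true))  # should be close to 0.5926539631803737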
Conclusion
For multi-label classification problems, we typically use MAP@K, the sampled F1 score, or log loss as the evaluation metric.
Reference documentation
- Metrics for Multi-Label Classification
- Loss function and evaluation index in multi-label classification
- Pytorch MultiLabelSoftMarginLoss
- sklearn multiclass-and-multilabel-classification