Author | Alvira Swalin, Compiled by | VK, Source | Medium

The second part of this series focuses on classification metrics

In the first article, we discussed some of the important metrics used in regression, their strengths and weaknesses, and their use cases. This part will focus on the metrics commonly used in classification, and on which metric should be chosen in a given context.

Definitions

Before discussing the pros and cons of each approach, let’s go over the basic terminology used in classification problems. If you are already familiar with these terms, you can skip this section.

  • Recall or TPR (True Positive Rate): number of items correctly identified as positive out of all actual positives = TP/(TP + FN)
  • Specificity or TNR (True Negative Rate): number of items correctly identified as negative out of all actual negatives = TN/(TN + FP)
  • Precision: number of items correctly identified as positive out of all items identified as positive = TP/(TP + FP)
  • False Positive Rate or Type I error: number of items incorrectly identified as positive out of all actual negatives = FP/(FP + TN)
  • False Negative Rate or Type II error: number of items incorrectly identified as negative out of all actual positives = FN/(FN + TP)

  • Confusion matrix: a table summarizing the TP, FP, TN, and FN counts of a classifier’s predictions

  • F1 score: harmonic mean of precision and recall. F1 = 2*Precision*Recall/(Precision + Recall)
  • Accuracy: percentage of all items correctly classified = (TP + TN)/(P + N), where P and N are the total numbers of actual positives and negatives (these definitions are illustrated in the short sketch below)
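To make these definitions concrete, here is a minimal Python sketch (not part of the original article; the TP/FP/TN/FN counts are hypothetical) that computes each quantity:

```python
# Minimal sketch: compute the metrics above from hypothetical TP/FP/TN/FN counts.
TP, FP, TN, FN = 40, 10, 45, 5  # made-up counts, for illustration only

recall      = TP / (TP + FN)             # TPR / sensitivity
specificity = TN / (TN + FP)             # TNR
precision   = TP / (TP + FP)
fpr         = FP / (FP + TN)             # Type I error rate
fnr         = FN / (FN + TP)             # Type II error rate
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(f"recall={recall:.3f}  specificity={specificity:.3f}  precision={precision:.3f}")
print(f"FPR={fpr:.3f}  FNR={fnr:.3f}  F1={f1:.3f}  accuracy={accuracy:.3f}")
```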

ROC-AUC score

The probabilistic interpretation of the ROC-AUC score is this: if we randomly pick one positive case and one negative case, the AUC is the probability that the classifier assigns a higher score to the positive case than to the negative case.

Mathematically, it is the area under the curve of sensitivity (TPR) versus FPR (1 - specificity). Ideally, we want both high sensitivity and high specificity, but in practice there is always a trade-off between the two.
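This interpretation is easy to check numerically. The sketch below (hypothetical labels and scores; scikit-learn is assumed to be available) compares the fraction of (positive, negative) pairs in which the positive case receives the higher score against sklearn’s roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9])

pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]

# Probability that a randomly chosen positive is ranked above a randomly
# chosen negative (ties counted as 1/2, matching the usual AUC definition).
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print("pairwise estimate:", np.mean(pairs))                 # 0.9375
print("roc_auc_score:    ", roc_auc_score(y_true, y_prob))  # 0.9375
```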

Some important characteristics of the ROC-AUC score are:

  • The value can range from 0 to 1. The AUC score of a random classifier on balanced data is 0.5.

  • The ROC-AUC score is independent of the classification threshold. This is not true of the F1 score: when the model outputs probabilities, a threshold must be chosen before the F1 score can be computed.

Log loss

Log loss is an accuracy measure that incorporates the idea of probabilistic confidence. For binary classification it is given by the following expression:
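Here N is the number of observations, y_i ∈ {0, 1} is the true label of observation i, and p_i is the predicted probability that y_i = 1:

$$\text{Log loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\Big]$$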

It takes into account the uncertainty of your predictions, based on how much they differ from the actual labels. In the worst case, suppose you predict a probability of 0.5 for every observation; the log loss then becomes -log(0.5) = 0.69.

So we can say that, considering the actual probabilities, anything above 0.6 is a very poor model.
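A quick numerical check of that worst-case value (a minimal sketch; the labels are hypothetical and scikit-learn is assumed):

```python
import math
import numpy as np
from sklearn.metrics import log_loss

# Predict a probability of 0.5 for every observation (hypothetical labels).
y_true = np.array([0, 1, 0, 1, 1, 0])
y_prob = np.full(len(y_true), 0.5)

print(-math.log(0.5))            # 0.6931...
print(log_loss(y_true, y_prob))  # same value: 0.6931...
```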

Case 1

Comparison of log loss with ROC-AUC and F1 score

In the case 1 example, model 1 does a better job of predicting the absolute probabilities, whereas model 2 predicts the probabilities in the correct increasing order. Let’s verify this with the actual scores:

If we consider log loss, model 2 gives the higher log loss, because its absolute probabilities are quite far from the actual labels. But this is in complete disagreement with the F1 and AUC scores, according to which model 2 has 100% accuracy.

Also, note that the F1 score varies with the threshold, and at the default threshold of 0.5 it prefers model 1 over model 2.
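The situation can be mimicked with a small sketch (the probabilities below are illustrative stand-ins, not the values from the article’s table; scikit-learn is assumed). Both models rank the observations perfectly, but model B’s absolute probabilities are poorly calibrated:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, f1_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Model A: well-calibrated absolute probabilities.
p_a = np.array([0.10, 0.20, 0.30, 0.70, 0.80, 0.90])
# Model B: same ordering, but the negatives are pushed above 0.5.
p_b = np.array([0.55, 0.60, 0.65, 0.70, 0.80, 0.90])

for name, p in (("Model A", p_a), ("Model B", p_b)):
    pred = (p >= 0.5).astype(int)  # default threshold of 0.5
    print(name,
          "| log loss:", round(log_loss(y_true, p), 3),
          "| ROC-AUC:", round(roc_auc_score(y_true, p), 3),
          "| F1 @ 0.5:", round(f1_score(y_true, pred), 3))
```

Model B gets the worse log loss, both models get a perfect ROC-AUC, and the F1 score at the default threshold of 0.5 prefers model A, which is exactly the pattern described above.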

Inferences from the examples above:

  • If you care about the absolute probability difference, use log loss.

  • If you care only about the final class prediction and do not want to tune a threshold, use the ROC-AUC score.

  • The F1 score is threshold-sensitive, and you need to tune the threshold before comparing models.

Case 2

How do these metrics handle class imbalance?

The only difference between the two models is their predictions for observations 13 and 14. Model 1 does a better job of classifying observation 13 (label 0), and model 2 does a better job of classifying observation 14 (label 1).

Our goal is to see which model better captures this difference when the classes are imbalanced (very little data with label 1). In problems like fraud detection or spam detection, the positive cases are always a small minority, and we want our model to predict the positive cases correctly, so we will sometimes prefer the model that classifies these positive cases correctly.

Clearly, log loss fails in this case, because according to log loss both models perform identically. This is because log loss is a symmetric function and does not discriminate between classes.

Both the F1 score and the ROC-AUC score prefer model 2 over model 1, so we can use either of them to handle class imbalance. But we have to dig a little deeper to see how they differ in the way they treat class imbalance.

In the first example, we see very few positive labels. In the second example, there are almost no negative labels. Let’s see how the F1 metric and ROC-AUC distinguish between these two cases.

The ROC-AUC score handles a minority of negative labels in the same way it handles a minority of positive labels. One interesting thing to note here is that the F1 score is almost the same for model 3 and model 4: because the positive labels are large in number, F1 only cares about the misclassification of positive labels.

Inferences from the above examples:

  • If you care about a class that is in the minority, regardless of whether it is the positive or the negative class, choose the ROC-AUC score.

When would you choose the F1 score over ROC-AUC?

When the positive class is small, the F1 score makes more sense. This is the common situation in fraud detection, where positive labels are scarce. We can understand this with the following example.

For example, on a dataset of size 10K, model (1) correctly predicts 5 of the 100 actual positive cases, while model (2) correctly predicts 90 of the 100 actual positive cases. Clearly, model (2) is doing a better job than model (1). Let’s see whether both the F1 score and the ROC-AUC score capture this difference.

  • F1 score of model (1) = 2*(1)*(0.05)/1.05 = 0.095

  • F1 score of model (2) = 2*(1)*(0.9)/1.9 = 0.947

Yes, the difference in F1 scores reflects the difference in model performance.

  • ROC-AUC of model (1) = 0.5

  • ROC-AUC of model (2) = 0.93

ROC-AUC also gives model (1) a decent score, which is not a good indicator of its performance. Therefore, be careful when choosing ROC-AUC for imbalanced datasets.
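The two F1 calculations above follow directly from precision and recall (precision is taken as 1.0 for both models, as the example assumes); a minimal sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Model (1): 5 of 100 actual positives found -> recall = 0.05, precision assumed 1.0
print(round(f1(1.0, 5 / 100), 3))   # 0.095
# Model (2): 90 of 100 actual positives found -> recall = 0.90, precision assumed 1.0
print(round(f1(1.0, 90 / 100), 3))  # 0.947
```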

Which metric should you use for multiclass classification?

We also have three types of non-binary classification:

  • Multiclass: classification tasks with more than two classes. Example: classify a set of fruit images into one of the following classes: apples, bananas, or oranges.
  • Multilabel: classify a sample into a set of target labels. Example: tag a blog post with one or more topics, such as technology, religion, or politics. Labels are independent of each other, and the relationships between them are not important.
  • Hierarchical: each class can be grouped with similar classes to create meta-classes, which can in turn be grouped again until we reach the root level (the set containing all the data). Examples include text classification and species classification.

In this post, we will discuss only the first type, multiclass classification.

For multiclass problems we have two types of averaging, micro-averaging and macro-averaging, and we will discuss the pros and cons of each. The most commonly used metrics for multiclass problems are the F1 score, average accuracy, and log loss; there is no mature multiclass version of the ROC-AUC score.

Multiclass log loss is defined as:
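Here N is the number of observations, M is the number of classes, y_ij is 1 if observation i belongs to class j (and 0 otherwise), and p_ij is the predicted probability that observation i belongs to class j:

$$\text{Log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$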

  • In micro-averaging, the true positives, false positives, and false negatives are summed across the different classes, and the metric is then computed from these aggregate counts.

  • In macro-averaging, the precision and recall are computed separately for each class and then averaged over the classes.

Micro-averaging is preferable when there is a class imbalance problem.
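The difference between the two schemes is easy to see on a small example (hypothetical imbalanced three-class labels; scikit-learn is assumed):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical imbalanced 3-class problem: class 0 dominates.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

# Per-class F1 scores; the macro average is simply their mean.
print("per-class F1:", f1_score(y_true, y_pred, average=None))
print("macro F1:    ", f1_score(y_true, y_pred, average="macro"))
# The micro average pools TP/FP/FN over all classes before computing F1.
print("micro F1:    ", f1_score(y_true, y_pred, average="micro"))
```

Here the minority classes are classified relatively poorly, so the macro-averaged F1 (about 0.80) comes out lower than the micro-averaged F1 (about 0.83), which is dominated by the majority class.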



Original link: medium.com/usf-msds/ch…
