In the article on understanding cross-validation, we talked about using AUC to compare good and bad models. But what exactly is AUC? How does it measure how good a model is? And are there other means of assessment besides AUC? In this article, we will discuss these questions.

Confusion matrix

To understand AUC, we start with another concept, the Confusion Matrix, a two-by-two table used to evaluate binary classification (predicting, for example, whether or not someone will have a heart attack, or whether a stock will rise or fall, and so on). What about problems with more than two classes? In fact, a multi-class problem can still be broken down into binary problems. Here is a confusion matrix used to determine heart disease:

Reading the confusion matrix vertically gives the actual situation: in the graph above, the number of people who actually have heart disease is True Positive + False Negative, and the number of people who don't have heart disease is False Positive + True Negative. Similarly, reading the confusion matrix horizontally gives the predictions: the model predicts True Positive + False Positive people as having heart disease, and False Negative + True Negative people as not having heart disease.

Reading in both directions at once: if the model predicts disease and the person actually has the disease, we call that True Positive; if the model predicts no disease and the person actually has no disease, that is called True Negative. Those are the two cells the model predicts correctly. There are also two kinds of prediction errors: False Positive means the model predicts disease but the person does not actually have it, and False Negative means the model predicts no disease but the person actually does have it.

That is a lot of concepts, but they are not hard to remember. As you can see, all of these names are based on the prediction: Positive when the model predicts disease, Negative when it predicts no disease, and True/False indicates whether that prediction turned out to be correct.
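As a minimal sketch of how these four cells can be counted in code (the label lists below are toy values invented for illustration, with 1 meaning heart disease and 0 meaning no heart disease):

```python
# Toy labels, invented for illustration: 1 = heart disease, 0 = no heart disease.
actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# Count the four confusion-matrix cells by comparing prediction with reality.
tp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)  # True Positive
fp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)  # False Positive
fn = sum(1 for a, p in zip(actual, predicted) if p == 0 and a == 1)  # False Negative
tn = sum(1 for a, p in zip(actual, predicted) if p == 0 and a == 0)  # True Negative

print(tp, fp, fn, tn)
```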

In the figure above, the cells the model predicts correctly are filled in green, and their proportion of all predictions is known as Accuracy:

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + False Negative + True Negative)

Accuracy alone is not enough to evaluate the quality of a model. For example, in the following case, although the accuracy reaches 80%, the model only predicts 50% of the actually diseased population correctly, which is obviously not good.

                            Actual heart disease    Actual no heart disease
Predicted heart disease              10                        10
Predicted no heart disease           10                        70
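A quick check of that claim in code, plugging the four cells of this table into the Accuracy formula (reading the rows as predictions and the columns as the actual condition):

```python
# Confusion-matrix cells from the table above.
tp, fp = 10, 10   # predicted heart disease: actually sick / actually healthy
fn, tn = 10, 70   # predicted no heart disease: actually sick / actually healthy

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.8, i.e. 80%, even though only half of the actual patients are caught
```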

Sensitivity and Specificity

Therefore, we need to introduce more measures. Sensitivity (also called Recall) is the probability of correctly predicting disease among the actual patients. Conveniently, the word sensitivity also suggests an allergic reaction, which associates it with disease and makes it easier to remember:

Sensitivity = True Positive / (True Positive + False Negative)

Since there is an indicator for the diseased (positive samples), there must also be one for the non-diseased (negative samples). Specificity plays this role: it is the probability of correctly predicting non-disease among the actually non-diseased population, i.e.

Specificity = True Negative / (True Negative + False Positive)

The word Specificity can be associated with immunity, and hence with the absence of disease, so it is also easy to remember.

These two indicators help you better compare models and make trade-offs. For example, when two models have similar Accuracy, if you care more about how well disease is predicted, you should choose the one with the higher Sensitivity; conversely, if you care more about how well the non-diseased are identified, you should choose the one with the higher Specificity.
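To make the trade-off concrete, here is a small sketch computing both indicators for the confusion matrix of the earlier 80%-accuracy example:

```python
# Cells from the earlier example: tp=10, fn=10, fp=10, tn=70.
tp, fn, fp, tn = 10, 10, 10, 70

sensitivity = tp / (tp + fn)  # success rate among actual patients
specificity = tn / (tn + fp)  # success rate among the actually healthy

print(sensitivity)  # 0.5
print(specificity)  # 0.875
```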

ROC curve, AUC and F1 Score

Furthermore, we can obtain more intuitive evaluation results by plotting these indicators, and the Receiver Operating Characteristic (ROC) curve is a commonly used plot of this kind.

As we know, the output of a classification model such as logistic regression is a probability between 0 and 1. We therefore also need a threshold to decide whether a sample is classified as sick or not. Usually the threshold is set to 0.5, so that a result greater than 0.5 is judged as sick, and otherwise as not sick.
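As a minimal sketch of this step (the probability values below are toy numbers for illustration):

```python
import numpy as np

probabilities = np.array([0.1, 0.4, 0.35, 0.8, 0.65])  # toy model outputs
threshold = 0.5

# Everything above the threshold is classified as sick (1), the rest as not sick (0).
predictions = (probabilities > threshold).astype(int)
print(predictions)  # [0 0 0 1 1]
```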

The threshold can be any value between 0 and 1. Every threshold has a corresponding confusion matrix, and from that confusion matrix we can obtain a pair of Sensitivity and Specificity values. We can then plot a point in a coordinate system with 1 - Specificity on the horizontal axis and Sensitivity on the vertical axis; connecting the points generated by all possible thresholds gives the ROC curve.

Let's look at a concrete example. Suppose we are studying mice and want to predict the probability of heart disease from a mouse's weight, and we model this with the logistic regression algorithm. The figure below shows the prediction results for ten mouse samples: red dots represent mice that are actually healthy, blue dots represent mice that actually have heart disease, the points are fitted with a logistic regression curve, and the horizontal line P = 0.5 marks the threshold of 0.5. The 5 mice above P = 0.5 are predicted to be sick, the other 5 are predicted to be healthy, and the prediction success rate is 80%:
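For readers who want to reproduce this kind of setup, here is a rough sketch; the weights and labels are invented for illustration and are not the actual data behind the figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical mouse weights (grams) and heart-disease labels, for illustration only.
weights = np.array([[18], [20], [22], [24], [25], [27], [29], [31], [33], [35]])
disease = np.array([ 0,    0,    0,    0,    1,    0,    1,    1,    1,    1 ])

model = LogisticRegression()
model.fit(weights, disease)

# Predicted probability of heart disease for each mouse.
probabilities = model.predict_proba(weights)[:, 1]
# Applying the 0.5 threshold gives the predicted classes.
print((probabilities > 0.5).astype(int))
```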

Next, we draw a ROC curve based on the data above. First, we set the threshold to 1, at which point all the mice are predicted to be disease-free. Based on the actual disease status of the samples (five diseased and five healthy mice), we obtain the following confusion matrix:

                            Actually sick    Actually healthy
Predicted sick                     0                  0
Predicted healthy                  5                  5
From this confusion matrix we can calculate one pair of Sensitivity and Specificity values. We then keep adjusting the threshold to obtain all the Sensitivity and Specificity pairs. Because there are only a few sample points, the thresholds can be sampled according to the sample points, with each threshold again represented by a horizontal line. The sampling of all thresholds looks like this:

Let’s list the confusion matrices corresponding to these thresholds:

Then we calculate the Sensitivity and 1 - Specificity corresponding to each of these confusion matrices:

Threshold Sensitivity 1 - Specificity
1 0 0
0.99 0.2 0
0.97 0.4 0
0.94 0.4 0.2
0.90 0.6 0.2
0.71 0.8 0.2
0.09 0.8 0.4
0.043 1.0 0.4
0.0061 1.0 0.6
0.0003 1.0 0.8
0 1.0 1.0

Using this table, we plot 1 - Specificity on the horizontal axis and Sensitivity on the vertical axis. When drawing ROC curves, the 1 - Specificity axis is usually labeled FPR (False Positive Rate) and the Sensitivity axis is labeled TPR (True Positive Rate), as follows:
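A rough sketch of how such a plot can be produced, using the (FPR, TPR) pairs straight from the table above (matplotlib is assumed to be available):

```python
import matplotlib.pyplot as plt

# (FPR, TPR) pairs from the table: FPR = 1 - Specificity, TPR = Sensitivity.
fpr = [0, 0, 0, 0.2, 0.2, 0.2, 0.4, 0.4, 0.6, 0.8, 1.0]
tpr = [0, 0.2, 0.4, 0.4, 0.6, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]

plt.plot(fpr, tpr, marker="o", label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="diagonal")  # reference diagonal
plt.xlabel("FPR (1 - Specificity)")
plt.ylabel("TPR (Sensitivity)")
plt.legend()
plt.show()
```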

The ROC curve has the following characteristics:

  1. Each point on the diagonal from (0, 0) to (1, 1) means that the probability of correctly predicting disease among actual patients (TPR) is equal to the probability of wrongly predicting disease among people without disease (FPR). For a model, the larger the TPR the better, and the smaller the FPR the better, so we want the ROC curve to move as far away from the diagonal as possible, toward the top-left corner.
  2. The ROC curve can also help us select an appropriate threshold: for the same TPR, the farther to the left a point on the ROC curve lies, the better, because farther left means a smaller FPR (see the sketch after this list).
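As one simple heuristic for that selection (an illustration, not something prescribed above), we can scan the table for the point that rises highest above the diagonal, i.e. the largest TPR - FPR:

```python
# (threshold, TPR, FPR) rows taken from the table above.
rows = [
    (1,      0.0, 0.0), (0.99,   0.2, 0.0), (0.97,   0.4, 0.0),
    (0.94,   0.4, 0.2), (0.90,   0.6, 0.2), (0.71,   0.8, 0.2),
    (0.09,   0.8, 0.4), (0.043,  1.0, 0.4), (0.0061, 1.0, 0.6),
    (0.0003, 1.0, 0.8), (0,      1.0, 1.0),
]

# Pick the threshold whose point lies highest above the diagonal (max TPR - FPR).
# Note the tie: 0.71 and 0.043 both give TPR - FPR = 0.6; max() keeps the first one.
best = max(rows, key=lambda r: r[1] - r[2])
print(best)  # (0.71, 0.8, 0.2)
```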

According to the first characteristic of the ROC curve, the closer the curve is to the upper-left corner, the better the model, which also means that a better model has a larger area under the curve. We call this Area Under the ROC Curve the AUC (Area Under Curve). With this concept, a single value can be used to measure the quality of a model. The AUC for the example model above is shown below:
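As a rough numerical sketch, the area can be estimated with the trapezoidal rule over the (FPR, TPR) points from the table (in practice a function such as sklearn.metrics.roc_auc_score is usually applied directly to labels and scores):

```python
# (FPR, TPR) points from the table, ordered by increasing FPR.
fpr = [0, 0, 0, 0.2, 0.2, 0.2, 0.4, 0.4, 0.6, 0.8, 1.0]
tpr = [0, 0.2, 0.4, 0.4, 0.6, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]

# Trapezoidal area under the piecewise-linear curve through these points.
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2 for i in range(len(fpr) - 1))
print(auc)  # ≈ 0.84 for these points
```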

Usually we use AUC to evaluate models, and since it is "usually", there must be exceptions: when the prevalence (or the proportion of positive samples) is very small, True Negative will be very large, and because it dominates the denominator of FPR, it makes FPR look small. To avoid this effect, we can replace FPR with another indicator: Precision

Precision = True Positive / (True Positive + False Positive)

Precision is the proportion of samples predicted as sick that are actually sick. Combining Precision with Sensitivity lets us focus on how well the model predicts the diseased (positive) samples, and another evaluation metric in machine learning, the F1 Score, is designed exactly for this:

F1 Score = 2 * Precision * Recall / (Precision + Recall)

In the formula above, Recall is the same as Sensitivity. Just as with AUC, when comparing two models, the larger the F1 Score, the better the prediction, and the F1 Score better reflects how well the positive samples are predicted.
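A small sketch computing Precision, Recall and F1 Score for the earlier 80%-accuracy example (tp = 10, fp = 10, fn = 10):

```python
tp, fp, fn = 10, 10, 10

precision = tp / (tp + fp)                          # 0.5
recall    = tp / (tp + fn)                          # 0.5, same as Sensitivity
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.5 0.5 0.5
```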

Conclusion

In this article we discussed the confusion matrix, the ROC curve, AUC, and the F1 Score using a medical example: whether or not someone has heart disease. We also saw how a ROC curve is drawn. Finally, we talked about AUC and F1 Score and the subtle differences between them.

It should be noted that this kind of binary evaluation is not limited to the disease / non-disease case. For generality, you can simply read "heart disease" in this article as "positive sample" and "no heart disease" as "negative sample".