AUC is a model evaluation metric that applies only to binary classification models. For binary classifiers there are many other evaluation metrics, such as Logloss, accuracy and precision. If you follow data mining competitions such as Kaggle, AUC and Logloss are by far the most common metrics for evaluating models. Why are AUC and Logloss used more often than accuracy? Because many machine learning models output a predicted probability rather than a class. To compute accuracy, these probabilities must first be converted into classes: you have to choose a threshold manually, and a sample whose predicted probability is above the threshold goes into one class while a sample below it goes into the other. This threshold therefore has a large effect on the computed accuracy. Using AUC or Logloss avoids having to convert the predicted probabilities into classes.
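As a quick illustration, here is a minimal sketch (using scikit-learn, with small hypothetical label and probability vectors that will also appear later in this article) showing that accuracy changes as the threshold moves, while AUC is computed from the raw probabilities and needs no threshold:

```python
# Minimal sketch: accuracy depends on the chosen threshold, AUC does not.
# The label and probability vectors are hypothetical example data.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [1, 1, 0, 0, 1]             # true classes
y_prob = [0.5, 0.6, 0.55, 0.4, 0.7]  # predicted probabilities of class 1

for threshold in (0.1, 0.45, 0.65):
    y_pred = [1 if p > threshold else 0 for p in y_prob]  # convert probabilities to classes
    print(threshold, accuracy_score(y_true, y_pred))      # 0.6, 0.8, 0.6 -- varies with threshold

print(roc_auc_score(y_true, y_prob))  # uses the probabilities directly, no threshold needed
```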
AUC is a model evaluation metric in machine learning. According to Wikipedia, AUC (area under the curve) is the area under the ROC curve. So, before understanding AUC, we need to understand what ROC is. Computing the ROC in turn requires the confusion matrix, so we start with the confusion matrix.
Confusion matrix
Suppose we have a task: given a set of patient samples, build a model to predict whether a tumor is malignant. A tumor is either benign or malignant, so this is a classic binary classification problem.
Let's say we use y = 1 to indicate that the tumor is benign and y = 0 to indicate that it is malignant. Then we can build the following table:

| | Actually benign (P) | Actually malignant (N) |
|---|---|---|
| Predicted benign (P) | TP | FP |
| Predicted malignant (N) | FN | TN |

In this table,
- TP (true positive) is the number of samples predicted to be benign that are actually benign;
- FN (false negative) is the number of samples predicted to be malignant that are actually benign;
- FP (false positive) is the number of samples predicted to be benign that are actually malignant;
- TN (true negative) is the number of samples predicted to be malignant that are actually malignant.
So, these four numbers form a matrix, called the confusion matrix.
Then, how do we use the confusion matrix to calculate the ROC? First we need to define the following two quantities:

FPR = FP / (FP + TN) is the fraction of all malignant tumors that are predicted to be benign. This is called the false positive rate. The false positive rate tells us how likely it is that a randomly chosen malignant tumor will be predicted to be benign. Obviously we want the FPR to be as small as possible.

TPR = TP / (TP + FN) is the fraction of all benign tumors that are predicted to be benign. This is called the true positive rate. The true positive rate tells us how likely it is that a randomly chosen benign tumor will be predicted to be benign. Obviously we want the TPR to be as large as possible.
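To make these definitions concrete, here is a minimal sketch in plain Python (the label vectors are hypothetical, with 1 = benign and 0 = malignant as above) that counts the four confusion matrix entries and computes TPR and FPR:

```python
# Minimal sketch: confusion-matrix counts and the two rates, for 0/1 labels.
def tpr_fpr(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # benign predicted benign
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # benign predicted malignant
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # malignant predicted benign
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # malignant predicted malignant
    tpr = tp / (tp + fn)  # fraction of actual positives predicted positive
    fpr = fp / (fp + tn)  # fraction of actual negatives predicted positive
    return tpr, fpr

# Hypothetical example: 5 samples, one malignant tumor misclassified as benign.
print(tpr_fpr([1, 1, 0, 0, 1], [1, 1, 1, 0, 1]))  # -> (1.0, 0.5)
```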
If we take FPR as the abscissa and TPR as the ordinate, we can get the following coordinate system:
Now, if you look at this, you might wonder, what’s the use of FPR and TPR? Let’s look at a couple of particular points.
The point (0, 1): FPR = 0, TPR = 1. FPR = 0 means FP = 0, that is, there are no false positives. TPR = 1 means FN = 0, that is, there are no false negatives. Isn't that the perfect situation? All predictions are correct: benign tumors are predicted to be benign, malignant tumors are predicted to be malignant, and the classification is 100 percent correct. This also reflects the meaning of FPR and TPR: as mentioned earlier, we want FPR to be as small as possible and TPR to be as large as possible.

The point (1, 0): FPR = 1, TPR = 0. This point is the exact opposite of the one above, so it is the worst case scenario: all predictions are wrong.

The point (0, 0): FPR = 0, TPR = 0, which means FP = 0 and TP = 0. This point corresponds to predicting every sample as malignant: no matter what sample I am given, I blindly predict a malignant tumor.

The point (1, 1): FPR = 1, TPR = 1. This point is the opposite of (0, 0): every sample is predicted to be a benign tumor.

After reviewing these four points, we can see that the closer a point is to the upper left corner, the better the model's predictions. Reaching the top left corner, the point (0, 1), is the perfect result.
The ROC curve
After introducing the confusion matrix, we can now understand how the Receiver Operating Characteristic (ROC) curve is defined.
We know that in a binary (0/1) classification model, the final output is usually a probability value representing the probability that the result is 1. So how do we ultimately decide whether an input x belongs to class 0 or class 1? We need a threshold: above it we classify the sample as 1, below it as 0. Different thresholds therefore lead to different classification results, that is, to different confusion matrices and thus different FPR and TPR values. As the threshold slowly moves from 0 to 1, many pairs of values (FPR, TPR) are produced; drawing them in the coordinate system gives the so-called ROC curve.
Let’s take an example. Let’s say we have 5 samples:
- The true categories are y = c(1, 1, 0, 0, 1).
- The probabilities with which the classifier predicts each sample to be class 1 are P = c(0.5, 0.6, 0.55, 0.4, 0.7).
As mentioned above, we need a threshold to convert the probabilities into categories before we can compute FPR and TPR, and different thresholds will give different FPR and TPR values. Suppose the threshold we choose is 0.1; then all five samples are classified as 1. If we choose 0.3, the result is still the same. If we choose 0.45 as the threshold, only sample 4 is classified as 0 and the rest are classified as 1. As we keep changing the threshold, we get different FPR and TPR values; connecting the (FPR, TPR) points gives the ROC curve.
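Before walking through the six cases one by one, here is a minimal sketch (plain Python, using the five samples above and classifying a sample as 1 when its predicted probability is strictly greater than the threshold) that sweeps the thresholds and prints the resulting (FPR, TPR) pairs:

```python
# Minimal sketch: sweep thresholds over the five-sample example and print (FPR, TPR).
y_true = [1, 1, 0, 0, 1]
y_prob = [0.5, 0.6, 0.55, 0.4, 0.7]

for threshold in (0.1, 0.4, 0.5, 0.55, 0.6, 0.7):
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    print(threshold, (fp / (fp + tn), tp / (tp + fn)))
# Prints the (FPR, TPR) points (1, 1), (0.5, 1), (0.5, 0.667), (0, 0.667), (0, 0.333), (0, 0)
```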
Let's now calculate the ROC curve for the example above.
Case 1
- The threshold is 0.1. Since the predicted probabilities are (0.5, 0.6, 0.55, 0.4, 0.7), the predicted categories are (1, 1, 1, 1, 1).

| | Actual P | Actual N |
|---|---|---|
| Predicted P | TP: 3 | FP: 2 |
| Predicted N | FN: 0 | TN: 0 |

- TPR = TP / (TP + FN) = 3/3 = 1
- FPR = FP / (FP + TN) = 2/2 = 1

So with the threshold set to 0.1, the point on our ROC curve is (1, 1).
Case 2
- The threshold is 0.4. Since the predicted probabilities are (0.5, 0.6, 0.55, 0.4, 0.7), the predicted categories are (1, 1, 1, 0, 1).

| | Actual P | Actual N |
|---|---|---|
| Predicted P | TP: 3 | FP: 1 |
| Predicted N | FN: 0 | TN: 1 |

- TPR = TP / (TP + FN) = 3/3 = 1
- FPR = FP / (FP + TN) = 1/2 = 0.5

So with the threshold set to 0.4, the point on our ROC curve is (0.5, 1).
Case 3
- The threshold is 0.5. Since the predicted probabilities are (0.5, 0.6, 0.55, 0.4, 0.7), the predicted categories are (0, 1, 1, 0, 1).

| | Actual P | Actual N |
|---|---|---|
| Predicted P | TP: 2 | FP: 1 |
| Predicted N | FN: 1 | TN: 1 |

- TPR = TP / (TP + FN) = 2/3 ≈ 0.667
- FPR = FP / (FP + TN) = 1/2 = 0.5

So with the threshold set to 0.5, the point on our ROC curve is (0.5, 0.667).
Case 4
- The threshold is 0.55. Since the predicted probabilities are (0.5, 0.6, 0.55, 0.4, 0.7), the predicted categories are (0, 1, 0, 0, 1).

| | Actual P | Actual N |
|---|---|---|
| Predicted P | TP: 2 | FP: 0 |
| Predicted N | FN: 1 | TN: 2 |

- TPR = TP / (TP + FN) = 2/3 ≈ 0.667
- FPR = FP / (FP + TN) = 0/2 = 0

So with the threshold set to 0.55, the point on our ROC curve is (0, 0.667).
Case 5
- The threshold is 0.6. Since the predicted probabilities are (0.5, 0.6, 0.55, 0.4, 0.7), the predicted categories are (0, 0, 0, 0, 1).

| | Actual P | Actual N |
|---|---|---|
| Predicted P | TP: 1 | FP: 0 |
| Predicted N | FN: 2 | TN: 2 |

- TPR = TP / (TP + FN) = 1/3 ≈ 0.333
- FPR = FP / (FP + TN) = 0/2 = 0

So with the threshold set to 0.6, the point on our ROC curve is (0, 0.333).
Case 6
- The threshold is 0.7. Since the predicted probabilities are (0.5, 0.6, 0.55, 0.4, 0.7), the predicted categories are (0, 0, 0, 0, 0).

| | Actual P | Actual N |
|---|---|---|
| Predicted P | TP: 0 | FP: 0 |
| Predicted N | FN: 3 | TN: 2 |

- TPR = TP / (TP + FN) = 0/3 = 0
- FPR = FP / (FP + TN) = 0/2 = 0

So with the threshold set to 0.7, the point on our ROC curve is (0, 0).
We can see that the points calculated at the different thresholds trace out the ROC curve, and the area under that curve is the final AUC value. This area is at most 1, and for any classifier that performs at least as well as random guessing, 0.5 <= AUC <= 1.
Now, maybe we're a little confused here, so a few points are worth noting:

- The threshold ranges over [0, 1]. As the threshold slowly moves from 1 down to 0, the FPR gets larger and larger, because there will be more and more false positives.
- If, for every sample, I make a random prediction, that is, a 0.5 probability of a benign tumor and a 0.5 probability of a malignant tumor, what does the curve look like? You can imagine that, if the data were uniform, the curve would be the line y = x.
- Notice that the curve must start at (0, 0) and end at (1, 1); recall the meaning of the four points discussed above.
- In fact, the ROC curve is not smooth but step-shaped. Why? Because the number of samples is finite, FPR and TPR can only change when at least one sample crosses the threshold, and they stay constant in between. In other words, the step size is 1 over the number of samples.
Once we have the ROC curve, we can calculate the area under the curve. The calculated area is the AUC value.
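As a minimal sketch (using scikit-learn, again with the five samples from the worked example; roc_curve, auc and roc_auc_score are standard scikit-learn functions), the AUC can be obtained either by integrating the (FPR, TPR) points with the trapezoidal rule or directly from the probabilities:

```python
# Minimal sketch: compute the AUC from the ROC points and directly from the probabilities.
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true = [1, 1, 0, 0, 1]
y_prob = [0.5, 0.6, 0.55, 0.4, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # the (FPR, TPR) points of the ROC curve
print(auc(fpr, tpr))                  # trapezoidal area under those points -> 5/6, about 0.833
print(roc_auc_score(y_true, y_prob))  # same value, computed directly
```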
Meaning of AUC value
Now that we know how to calculate the AUC value, we will of course ask what it means. Why do we go through so much trouble to compute this AUC?

Suppose we have a classifier whose output is the probability that an input sample is positive, so that every sample has a corresponding predicted probability. Then we can draw the following graph:

Here the horizontal axis is the predicted probability of being positive and the vertical axis is the number of samples. The blue region is the probability distribution of all negative samples, and the red region is the probability distribution of all positive samples. Obviously, for the best possible classification we want the red distribution to be concentrated as close to 1 as possible and the blue distribution as close to 0 as possible.

To evaluate the effectiveness of the classifier, you need to choose a threshold: predictions larger than the threshold are classified as positive and predictions smaller than it as negative. See the diagram below:

In this diagram the threshold is 0.5, so the samples to the left of it are classified as negative and those to the right as positive. The red region overlaps with the blue region, and with this threshold of 0.5 the computed accuracy is 90%.
Ok, now let’s introduce the ROC curve.
The ROC curve is shown in the upper left corner, where the horizontal axis is the False Positive Rate (FPR) and the vertical axis is the True Positive Rate (TPR). And then when we choose a different threshold, it corresponds to a point in the coordinate system.
When the threshold is 0.8, it corresponds to the point indicated by the arrow in the figure above.
When the threshold is 0.5, it corresponds to the point indicated by the arrow in the figure above.
In this way, different thresholds correspond to different points. Finally all the points can be connected together to form a curve, the ROC curve.
Now let's see what happens to the ROC curve as the overlap between the blue region and the red region changes.
In the figure above, the blue region does not overlap much with the red region, so you can see that the ROC curve is close to the upper left corner.
However, when the blue region overlaps the red region, the ROC curve is close to the y=x line.
In summary, if we want to use ROC to evaluate the classification quality of the classifier, we can evaluate it by calculating AUC (area under the ROC curve), which is the purpose of AUC.
In fact, the AUC is the probability that a randomly chosen positive example is ranked before a randomly chosen negative example.

For example, the AUC value of the first coordinate system above indicates that all positive examples are ranked before the negative ones, while the AUC value of the second indicates that 80 percent of the positive-negative pairs have the positive example ranked higher.
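This ranking interpretation is easy to check on the five-sample example: count, over all positive-negative pairs, how often the positive sample receives the higher predicted probability (ties counted as half). A minimal sketch:

```python
# Minimal sketch: AUC as the fraction of positive-negative pairs ranked correctly.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 0, 1]
y_prob = [0.5, 0.6, 0.55, 0.4, 0.7]

pos = [p for t, p in zip(y_true, y_prob) if t == 1]  # scores of positive samples
neg = [p for t, p in zip(y_true, y_prob) if t == 0]  # scores of negative samples

# For each (positive, negative) pair, score 1 if the positive is ranked higher, 0.5 on a tie.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
print(wins / (len(pos) * len(neg)))   # 5/6, about 0.833
print(roc_auc_score(y_true, y_prob))  # identical value
```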
We know that the threshold can vary, that is, the classification result depends on the threshold. Evaluating with AUC is better because it takes the variation of the threshold into account.
Another benefit is that the ROC curve has a nice property: it stays essentially the same when the distribution of positive and negative samples in the test set changes. In real data sets class imbalance often occurs, that is, there are far more negative samples than positive ones (or the opposite), and the distribution of positive and negative samples in the test data may also change over time.
In the figure above, (a) and (c) are ROC curves, and (b) and (d) are precision-recall curves. (a) and (b) show the results of the classifier on the original test set (with a balanced distribution of positive and negative samples), while (c) and (d) show the results after the number of negative samples in the test set is increased tenfold. It is clear that the ROC curve stays essentially unchanged, while the precision-recall curve changes dramatically.
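To illustrate this claim in code, here is a minimal sketch (scikit-learn, with a synthetic data set from make_classification and a simple logistic regression; all names and numbers are illustrative) that replicates the negative test samples 10 times: the ROC AUC is essentially unchanged, while average precision, a summary of the precision-recall curve, drops noticeably:

```python
# Minimal sketch: class imbalance leaves ROC AUC unchanged but hurts the precision-recall view.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, roughly balanced data; first half for training, second half for testing.
X, y = make_classification(n_samples=2000, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X[:1000], y[:1000])
X_test, y_test = X[1000:], y[1000:]
scores = clf.decision_function(X_test)

# Replicate the negative test samples 10x to simulate a heavily imbalanced test set.
neg = y_test == 0
X_imb = np.vstack([X_test] + [X_test[neg]] * 9)
y_imb = np.concatenate([y_test] + [y_test[neg]] * 9)
scores_imb = clf.decision_function(X_imb)

print(roc_auc_score(y_test, scores), roc_auc_score(y_imb, scores_imb))                      # ~equal
print(average_precision_score(y_test, scores), average_precision_score(y_imb, scores_imb))  # drops
```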
Good article summary: www.zhihu.com/question/39…