This article first appeared on my personal WeChat public account, TechFlow. Original writing isn't easy, so please consider following.


This is the 18th article in the machine learning series. Today we'll look at a few more metrics that are very important in the field.

Confusion matrix

In the last article, we covered the TP, FP, FN, and TN values before introducing the concepts of precision and recall. As a quick review: we shouldn't just memorize these metrics by rote, because they're easy to get wrong and easy to confuse. Instead, understand them through the English: T stands for true, meaning the prediction was correct, while F stands for false, meaning the prediction was wrong. P and N stand for positive and negative, the two classes (think of them as 1 and 0).

Since there are two classes, these metrics are clearly aimed at binary classification, which is also the most common scenario in machine learning.

The confusion matrix simply lays these four values out in a table so that we can easily read off the results and analyze them.
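To make the four cells concrete, here is a minimal sketch using scikit-learn (the labels below are invented for illustration; any 0/1 label arrays work):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# With classes ordered [0, 1], the table reads:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```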

Let’s take an example:


Suppose the confusion matrix of some model's predictions looks like the one above. From the numbers it's easy to see that most of the prediction errors fall in the cell containing 49, the false-negative cell. In other words, the model predicted a large number of positive samples as negative, which suggests its classification threshold is set too high. We can try lowering the threshold to improve the model and expand recall.

Conversely, if there are too many false positives, the model's threshold is too low and a large number of negative samples are being predicted as positive. To improve the model, we can consider raising the classification threshold.
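As a small sketch of this trade-off (the scores below are made up), we can turn a soft classifier's scores into hard predictions at several cutoffs and watch false negatives trade against false positives:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical scores; higher means "more likely positive"
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1])

for threshold in (0.8, 0.5, 0.25):
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FN={fn}, FP={fp}")

# A high threshold yields many false negatives; lowering it
# converts false negatives into false positives.
```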

What if there are a lot of false positives and a lot of false negatives?

There are many possible causes. Usually the model hasn't fully converged, or it isn't expressive enough. For example, with too many features, a weak model can't learn the hidden information they contain; consider a more complex model such as a neural network, or a more powerful one such as XGBoost. If the model is already complex enough, the training set may be too small for it to reach its full potential, in which case we can consider adding more samples.

Now that we understand the concept and use of confusion matrices, we can move on to ROC.

ROC

ROC stands for Receiver Operating Characteristic curve, a concept borrowed from signal processing. To be honest, I don't know much about signal processing or what the term originally meant there, but in machine learning it is a curve that plots TPR against FPR.

Let me flag the key terms: TPR, FPR, and the curve itself. TPR is the True Positive Rate, and FPR is the False Positive Rate.

The so-called true positive rate is just recall: the fraction of all positive samples that the model predicts to be positive.

TPR = TP / (TP + FN)


FPR, in turn, is the False Positive Rate: the fraction of all negative samples that get predicted as positive. The numerator is FP, and the denominator is FP plus TN.

FPR = FP / (FP + TN)
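To keep the two rates straight, here is a tiny helper (the function name is my own, not from any library) that computes both from the four counts:

```python
def tpr_fpr(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate)."""
    tpr = tp / (tp + fn)  # share of actual positives predicted positive
    fpr = fp / (fp + tn)  # share of actual negatives predicted positive
    return tpr, fpr
```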


I'd advise against memorizing TPR as recall. It is indeed the same quantity, but remembering it under another name adds an extra translation step. The two axes are easier to remember directly as FPR and TPR.

So the ROC curve puts FPR on the horizontal axis and TPR on the vertical axis, and it looks something like this:


AUC

Once ROC is understood, AUC is easy, because AUC is derived entirely from ROC. Its name stands for Area Under the Curve: the area under the ROC curve.

So how is the ROC actually worked out?

Let’s take an example. Suppose we now have a set of predictions:


Let’s list the confusion matrix for this model:


Substituting into the TPR and FPR formulas, we get TPR = 3 / (3 + 2) = 0.6 and FPR = 1 / (1 + 4) = 0.2.
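The original table of predictions is in the figure, but we can reconstruct labels consistent with these counts (TP = 3, FN = 2, FP = 1, TN = 4) and verify the arithmetic:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Reconstructed to match the counts above: 5 positives (3 caught),
# 5 negatives (1 falsely flagged)
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))  # TPR = 3 / (3 + 2) = 0.6
print(fp / (fp + tn))  # FPR = 1 / (1 + 4) = 0.2
```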

Plotting this point on the ROC axes, we get:


It looks like that, but it's a little odd; it doesn't really look like a curve. That's because our model's predictions are directly 0 or 1. For hard classifiers such as SVM or Naive Bayes, 0 is 0 and 1 is 1, and we end up with a polyline like this. Soft classifiers such as LR or GBDT, however, output floating-point scores that are classified against a threshold. As we adjust the threshold we get different results, and the plot looks much more like a curve.

Here’s another example:


This time the predictions are floating-point scores, which changes things. Since the predictions are scores, we can obtain different confusion matrices by setting different thresholds.

For example, if we set the threshold to 0.5, the confusion matrix is as follows:


This gives a TPR of 0.8 and an FPR of 0.4. If we relax the threshold further, we can raise recall, which raises TPR, but at the same time FPR also rises. For example, if we lower the threshold to 0.2, we can identify every positive example, but FPR goes up as well:


According to the confusion matrix above, TPR is 1.0 and FPR is 0.6. In other words, different thresholds give different TPR and FPR pairs. If the sample size is small, the resulting ROC may look jagged:
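The exact scores are in the figures, so the array below is a hypothetical set chosen to reproduce the two points we just computed (TPR 0.8 / FPR 0.4 near threshold 0.5, and TPR 1.0 / FPR 0.6 near threshold 0.2). scikit-learn's roc_curve performs the threshold sweep for us and returns the zigzag points:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Hypothetical floating-point scores consistent with the numbers above
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.55, 0.52, 0.4, 0.1, 0.05])

# roc_curve tries every distinct score as a threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))  # the corner points of the jagged ROC
print(auc(fpr, tpr))        # area under the polyline
```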


When the number of samples is large enough, the jagged steps become smoother and smoother; with some smoothing methods we get a relatively smooth curve, like the one shown earlier:
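To see this effect, here is a sketch on a large synthetic sample (the score distributions are my own invention): with enough points, the zigzag visually disappears.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)
# Synthetic scores: positives drawn slightly higher than negatives
pos_scores = rng.normal(0.6, 0.2, 10_000)
neg_scores = rng.normal(0.4, 0.2, 10_000)

y_true = np.concatenate([np.ones(10_000), np.zeros(10_000)])
scores = np.concatenate([pos_scores, neg_scores])

fpr, tpr, _ = roc_curve(y_true, scores)
plt.plot(fpr, tpr)  # with this many samples the curve looks smooth
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()
```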


A deeper understanding

Now we understand the concept of AUC: the area under the ROC curve, where each point on the curve is computed at a different threshold.

Let's combine the figure with the example above to understand this concept in depth. Notice that the ROC curve is monotonically non-decreasing: the larger the FPR, the larger the TPR. This is intuitive, because a larger FPR means we are predicting more samples as positive overall, so the fraction of positive samples we recall grows as well.

The point where FPR equals 1 and TPR equals 1 corresponds to predicting every sample as positive. In that case every positive example is obviously predicted correctly, so TPR is 1. Now consider the other extreme, where FPR is 0.

FPR equal to 0 means the false positive rate is 0: no negative sample is wrongly predicted as positive, which typically corresponds to the model predicting very few samples as positive. Therefore, the higher the TPR at a given FPR, the better the model.

Now that we understand the concept of AUC, we should ask: what does the AUC value represent, and what does it actually tell us?

Take a look at the following chart:


In the figure above, the area enclosed by the green curve is clearly larger than that of the pink curve, i.e., AUC1 > AUC2. As the figure shows, the larger the AUC, the larger the area under the curve. If we drop a vertical line at any point in [0, 1], fixing the FPR, then in general the curve with the larger AUC also has the larger TPR at that FPR (a counterexample appears in a later figure).

A larger TPR means that, with the same number of negative samples misclassified, the model correctly identifies more positive samples. This reflects the model's ability to separate positive and negative samples.

Why compare AUC instead of setting a threshold to compare TPR?

Because sometimes the situation is more complicated, like this:


The purple model does significantly better before point p, but the pink model does better after p. If we judged only by a single point, we could hardly capture a model's overall ability. AUC, by contrast, measures the model's overall ability to distinguish positive from negative samples.
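This is also why AUC is convenient in practice: comparing two models reduces to comparing one number. A hedged sketch (both score vectors are invented):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
model_a = np.array([0.9, 0.8, 0.3, 0.2, 0.7, 0.6, 0.1, 0.05])  # hypothetical
model_b = np.array([0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])  # hypothetical

# One number summarizing ranking quality across all thresholds
print(roc_auc_score(y_true, model_a))  # 0.75
print(roc_auc_score(y_true, model_b))  # 1.0
```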

Finally, let's think about one more question: what is the worst possible AUC? Is it 0?

No: the worst case for AUC is 0.5. If we guess positive and negative at random, then at any cutoff the fraction of positives we catch matches the fraction of negatives we wrongly flag, so TPR always equals FPR and the ROC we draw is a straight diagonal line, as shown below:


What if the computed AUC is less than 0.5? That means the model has probably learned a negative correlation between the features and the label. We can simply swap the 0 and 1 classes, and the recomputed AUC will come out above 0.5.
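Both claims are easy to sanity-check in code. The sketch below (the distributions are invented) scores random guesses and an anti-correlated model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100_000)

# Random guessing: TPR tracks FPR, so AUC hovers around 0.5
print(roc_auc_score(y_true, rng.random(100_000)))

# An anti-correlated model scores negatives higher than positives,
# so its AUC falls below 0.5 ...
bad_scores = (1 - y_true) * 0.3 + rng.random(100_000) * 0.7
print(roc_auc_score(y_true, bad_scores))

# ... and flipping its output yields exactly 1 - AUC, above 0.5
print(roc_auc_score(y_true, 1 - bad_scores))
```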

Conclusion

As we said in the previous article, in typical machine learning applications we tend to care more about positive examples. In click-through-rate prediction, search ranking, recommendation, and similar scenarios, we care about whether user clicks are predicted correctly, not about accuracy on non-clicks. In these settings, measuring accuracy or recall is not especially important, particularly when ranking and placement are involved. What we care about is whether the model gives high-quality content a high prediction score, so that it is placed near the top and users see it first. This is exactly where AUC best demonstrates a model's capability.

Therefore, in real industrial applications we may use AUC more than accuracy, precision, and recall. Of course, that doesn't mean the other concepts are unimportant; it mainly comes down to the application scenario. And since its applications are so broad, we are very likely to be asked about AUC in interviews, especially when a candidate's fundamentals are being examined. If asked, it's not enough to know what it means: we also need to know how it is computed, why it works, and even be able to reason about aspects we haven't considered before.

I hope everyone got something out of this. Original writing isn't easy, so I'll shamelessly ask for a like and a share. Let's all keep working to become better versions of ourselves.