This article first appeared on my personal WeChat public account, TechFlow. Original writing isn't easy; if you like it, please follow.


This is the 17th article in the machine learning series. Today we will talk about how to evaluate machine learning models.

In previous articles we introduced several models, including naive Bayes, KNN, K-means, EM, linear regression, and logistic regression. Now let's look at how to evaluate them.

Mean squared error

The concept is simple, and it is the same as the loss function used in regression models: it measures the deviation between the predicted values and the true values. If we use $y$ to denote the true value of a sample and $\hat{y}$ to denote the model's prediction, the mean squared error can be written as:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$
Here $m$ is the number of samples. It is a constant, and if we drop it we get the sum of squared errors instead. But without averaging, the sum grows with the number of samples and can become huge, so we usually take the mean.

MSE stands for mean squared error. As we know from statistics, variance measures how far samples oscillate around their mean: the larger the variance, the more unstable and volatile the quantity. In the regression setting we measure dispersion not from the mean but from the true values. The two ideas are very similar, so this should not be hard to grasp.

The smaller the MSE, the closer the model's predictions are to the true values. It can therefore serve both as the loss function during training and as a reference when we manually review model quality.
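As a quick sketch (plain NumPy, with made-up numbers purely for illustration), MSE can be computed like this:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared deviation of predictions from the truth."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Toy values: three samples, true vs. predicted.
print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 2.25) / 3 ≈ 0.833
```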

Regression models generally use mean squared error. Classification models are more involved and use several indicators, so let's look at them one by one.

TP, TN, FP and FN

These four values may look intimidating, but they are easy to understand once you read them in English. T stands for true and F for false; P stands for positive and N for negative. "Positive" and "negative" carry no value judgment here; they simply name the two classes. We could just as well call them 0 and 1, or A and B.

So the four values are simply the combinations of true/false and positive/negative, and anything with a T is a correct prediction. TP, for example, is a true positive: the test says positive, and the sample really is positive. Likewise TN is a true negative: the test says negative, and the sample really is negative. Anything with an F is a wrong prediction. FP is a false positive: the test says positive, but the sample is actually negative. FN is a false negative: the test says negative, but the sample is actually positive.
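Here is a minimal sketch of how the four counts fall out of two 0/1 label arrays (toy data; 1 marks the positive class):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual labels (made up)
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # model predictions (made up)

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, actually positive
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, actually negative
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, actually negative
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, actually positive
print(tp, tn, fp, fn)  # 3 3 1 1
```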

Medical testing makes these values concrete. TP and TN mean the test reagent works: whether the sample is positive or negative, the result is correct. FP and FN mean the reagent failed: either a healthy person tested positive, or a sick person tested negative. In a medical setting, a false positive is acceptable but a false negative is not. A false positive can simply be retested to settle the result, while a false negative means a sick patient slips through undetected, with serious consequences. That is why medical tests are generally tuned to be more sensitive than strictly necessary, so that no case escapes the net.

Recall

The English word "recall" captures the idea well, and its usual Chinese translation, 召回, matches the English meaning almost exactly. Some textbooks use a different translation that misses part of the idea; 召回 is the more faithful and elegant rendering.

Let's set up a scenario. Suppose A is a platoon leader with 10 soldiers under his command, and a task requires him to assemble all of them. At his command, eight show up, so the recall rate is 80%. Machine learning works the same way. In binary classification, the model generally cares about the positive class. Think of panning for gold: the sand is the negative class and has little value, while the gold is the positive class, the thing we want. Recall is the proportion of all positive cases that we correctly predict as positive.

Substituting TP, TN, FP and FN into this definition, we get:

$$\text{Recall} = \frac{TP}{TP + FN}$$
TP is the positives we predicted correctly, the part we recalled. So how many positives are there in total? The recalled positives plus the missed ones. A missed positive was predicted negative, but it is not actually negative, so it is a false negative: FN. There is a small twist here, but the key point is that recall only concerns positive cases; negatives do not enter into it. Just as when you pan for gold you care only about the gold, never the sand.
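Continuing the toy sketch above, recall is just one line:

```python
recall = tp / (tp + fn)  # recalled positives over all positives: 3 / (3 + 1)
print(recall)            # 0.75
```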

Precision and accuracy

You have probably met these two values in many machine learning books. You may remember the difference while reading and forget it immediately afterwards, or even mix them up entirely. That is not unusual; I have been there, and I once got it wrong in an interview.

A big part of the blame lies with translation: the two terms are rendered almost identically in Chinese, where 精确率 (precision) and 准确率 (accuracy) are nearly indistinguishable and feel like synonyms. In English the two words are clearly distinct. So to make sense of them, we should look at the English definitions rather than memorize a concept by rote.

Precision is defined as "the proportion of the true positives against all the positive results (both true positives and false positives)". In other words: of all samples predicted to be positive, the fraction that really are positive.

Accuracy is "the proportion of true results (both true positives and true negatives) in the population". In other words: the probability that a prediction, positive or negative, is correct.

From the English descriptions the difference is obvious. Both measure correct predictions, but precision restricts attention to the samples predicted positive, while accuracy covers all samples, positive and negative alike. Personally, I find it easier to think of them as "screening precision" versus "overall correctness". With just these two terms you might manage, but add the recall rate from above and it is easy to get dizzy.

Let’s take an example to make all three indicators clear.

Suppose that during the civil war between the Kuomintang and the Communist Party, the Communists want to catch spies of the Juntong (the KMT's military intelligence agency). They have narrowed the search to 100 people in a village and assigned A and B to find the spies. A picks out 18 people, of whom 12 are spies; B picks out 10 people, of whom 8 are spies. We know there are 20 spies in total. What are the recall, precision and accuracy of A and B?


Let's start with A, beginning with recall, which is simple. There are 20 spies and A finds 12 of them, so the recall is 12/20 = 0.6. Precision is the screening hit rate: A screened 18 people and 12 were correct, so the precision is 12/18 = 2/3. Accuracy is the overall correctness: A correctly identified 12 spies and correctly left alone 80 - 6 = 74 ordinary villagers, so the accuracy is (12 + 74) / 100 = 86%.

Now B: the recall is 8/20 = 0.4, the precision is 8/10 = 0.8, and the accuracy is (8 + 78) / 100 = 86%.

From this example we can derive the formulas for precision and accuracy.

Precision is the probability that a screened sample is correct: the number screened correctly divided by the total number screened. The correct ones are TP; the incorrectly screened ones are FP. So:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Accuracy is the overall correctness: everything judged correctly divided by the total number of samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
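To tie the formulas back to the spy example, here is the same arithmetic in Python (the counts follow directly from the numbers in the story):

```python
# 100 villagers: 20 spies (positives) and 80 civilians (negatives).
# A screened 18 people, 12 of them spies; B screened 10 people, 8 of them spies.
for name, tp, screened in [("A", 12, 18), ("B", 8, 10)]:
    fp = screened - tp  # civilians wrongly accused
    fn = 20 - tp        # spies missed
    tn = 80 - fp        # civilians correctly left alone
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / 100
    print(f"{name}: recall={recall:.3f}, precision={precision:.3f}, accuracy={accuracy:.2f}")
# A: recall=0.600, precision=0.667, accuracy=0.86
# B: recall=0.400, precision=0.800, accuracy=0.86
```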

Recall versus Precision

Let's continue with the example. Both A and B achieve a seemingly impressive accuracy of 86%. But notice that this number means very little: even if I caught no spies at all, I would still score 80% accuracy. The negative samples are so numerous that they inflate the overall figure; the larger the share of negatives, the higher the accuracy climbs, without telling us anything useful.
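To see how hollow that 86% is, consider a do-nothing baseline that accuses nobody:

```python
# "Catch no one" baseline on the same village: TP = 0, FP = 0, TN = 80, FN = 20.
baseline_accuracy = (0 + 80) / 100
print(baseline_accuracy)  # 0.8 -- respectable accuracy without finding a single spy
```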

The medical industry is a good example: for some diseases, detection accuracy is meaningless because the incidence rate is low and most samples are negative, so a test that detects nothing still scores high accuracy. Understanding this also protects you from deception, since plenty of fake-medicine and fake-device scammers try to impress consumers with accuracy figures.

In scenarios where negatives dominate and are not what we care about, we generally do not rely on accuracy, because it is swamped by the negative class. So which value should we use instead: recall or precision?

Let's return to the problem. A has the higher recall: he found 12 of the 20 spies. B is more precise: 8 of the 10 people he picked are spies, a very high hit rate. So who is the stronger of the two?

Plug the numbers in and you will see there is no right answer; it depends entirely on their boss. A boss focused on results, who would rather kill by mistake than let one escape, will obviously prefer A, who caught more spies. A compassionate boss will prefer B, since fewer mistakes means less harm to innocent people. So this is not a technical question but a philosophical one.

Which is better depends entirely on the perspective and the context, and changing the context changes the answer. In disease screening, I would want higher recall, so as to catch as many positive cases as possible; an imprecise result can always be re-tested to raise confidence, whereas a missed sample risks leaving a patient undiagnosed. In a risk-control scenario, precision matters more, because cheating is usually punished severely once detected: a wrongly punished user suffers real damage and may uninstall the app and never return. There, it is better to let a few offenders slip through than to punish the innocent.

Is there a metric that combines recall and precision? There is: the F1-score.

It is defined as:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Students familiar with mathematics will notice that the F1-score is simply the harmonic mean of recall and precision. We can use it to balance the two, which makes choices easier. Computing the F1-scores of A and B from before: A scores 0.632 and B scores 0.533, so on the whole A is the better spy-catcher.
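A quick check of those two numbers, using the precisions and recalls computed earlier:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(12 / 18, 12 / 20), 3))  # A: 0.632
print(round(f1_score(8 / 10, 8 / 20), 3))    # B: 0.533
```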

The trade-off between recall and precision

Continuing with the example: if you have built many models, you will have noticed that in machine learning recall and precision pull against each other. A model with high precision often has low recall, and one with high recall often has low precision; it is hard to raise both at once. Why is that?

Using a logistic regression model as an example, let’s look at the following graph:

(Figure: the same set of samples separated by three candidate decision boundaries, L1, L2 and L3.)
L1, L2 and L3 in the figure can be regarded as three different models. Clearly, L1 has the highest precision but the lowest recall, while L3 has the lowest precision but the highest recall.

Two things cause this. First, our samples contain noise: data near the decision boundary especially tends to bleed across classes because of measurement error. Second, the model's fitting capacity is limited. In this case we use a linear model, which can only split the samples with a straight line (a hyperplane in higher dimensions). Expanding recall inevitably pulls in more wrong samples, and raising precision inevitably shrinks recall.

It's a trade-off; we cannot have both and must choose. Of course, a model with stronger fitting ability, such as GBDT or a neural network, can do better, but not for free: the more complex the model, the more training samples it needs, and with too few samples a complex model struggles to converge.


This is a classic situation for algorithm engineers: we must make the right choice for the scenario at hand, whether to expand recall or to improve precision. Take logistic regression: by default, a threshold of 0.5 on the predicted probability decides positive versus negative. If we raise this threshold, precision improves, but recall drops. Conversely, if we lower it, we predict more positives, which means more negatives are misjudged as positives, and precision falls.
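A small sketch of this effect (the probabilities and labels below are made up, standing in for a trained logistic regression's outputs):

```python
import numpy as np

probs  = np.array([0.1, 0.3, 0.35, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95])  # assumed model scores
labels = np.array([0,   0,   1,    0,    1,    0,   1,   1,   1,   1])      # assumed true labels

for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    print(f"threshold={threshold}: precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")
# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.5: precision=0.83, recall=0.83
# threshold=0.7: precision=1.00, recall=0.67
```

Raising the threshold makes the model pickier, so precision climbs while recall drops; lowering it does the opposite.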

Conclusion

The concepts introduced today are not only machine learning fundamentals; they also come up frequently in interviews. Any qualified algorithm engineer must understand them.

If you ever get confused, think back to the spy-catching example; it is vivid, and it will help the definitions stick.

That's all for today's article. If you found it rewarding, please follow or share; your support means a great deal to me.