Preface

In the field of artificial intelligence, the effectiveness of machine learning models needs to be evaluated with various metrics. This article explains the performance evaluation metrics commonly used in machine learning; metrics specific to vector convolution and neural networks are not covered.

Training and recognition

Once a machine learning model has been built, that is, once training is complete, we can use it for classification and recognition.

For example, given a photo of an electric vehicle, the model can recognize it as an electric vehicle; given a photo of a motorcycle, it recognizes it as a motorcycle. The premise is that, during training, the model was repeatedly trained on a large number of electric vehicle and motorcycle photos.

But even if the model can recognize electric vehicles and motorcycles, that does not mean it will be correct 100 percent of the time. Of course, we want the recognition rate to be as high as possible: the higher the recognition accuracy, the better the model performs.

What specific indicators can be used to evaluate how well a model performs? Let's look closely at the following example.

For example, a test sample set S contains 100 photos in total: 60 photos of electric vehicles and 40 photos of motorcycles. These 100 photos are fed to the model (a binary classifier) for recognition. Our goal is to find all the electric vehicles among the 100 photos; the targets are the positive examples (Positives), and the non-targets are the negative examples (Negatives).

Suppose the model gives the following recognition results:

As can be seen from the table above, among the 100 photos the model identified 50 as electric vehicles and the remaining 50 as motorcycles. This differs from the actual situation (60 electric vehicles and 40 motorcycles), so there are some identification errors. Correctly identified photos are counted as TP and TN (T stands for True), and incorrectly identified photos are counted as FP and FN (F stands for False).

Of the 50 photos identified as electric vehicles, only 40 were correct (TP: actual electric vehicles) and 10 were wrong (FP: labelled electric vehicle but actually motorcycles). Likewise, of the 50 photos identified as motorcycles, 30 were actual motorcycles (TN) and 20 were actually electric vehicles (FN).

The above four counts (TP, FP, TN, FN) are commonly used as the basic parameters for evaluating model performance. Before explaining the meanings of TP, FP, TN and FN further, let's first look at the concepts of positive examples (positive samples) and negative examples (negative samples).

Positive and negative examples

Positive example (Positive): the object we are interested in.

Negative example (Negative): anything other than a positive example.

For example, in the case above we are interested in electric vehicles, so the electric vehicles are the positive examples and the remaining motorcycles are the negative examples.

For another example, suppose a forest contains three kinds of animals: antelope, reindeer and koala. If our goal is to identify antelopes, then the antelopes are the positive examples, and the reindeer and koalas are the negative examples.

Positive and negative examples, figure 1

Let's say we have a pile of number cards and our goal is to find the cards showing the number 8; then the cards with the number 8 are the positive examples and the rest are the negative examples.

Positive and negative examples, figure 2

Confusion matrix

With the concepts of positive and negative examples in place, we can now interpret TP, FN, TN and FP (T stands for True, F for False, P for Positive, and N for Negative):

Among these four basic parameters, true positives and true negatives are correct identification results given by the model, such as identifying an electric vehicle as an electric vehicle (true positive) or a motorcycle as a motorcycle (true negative). False positives and false negatives are wrong identification results, for example identifying a motorcycle as an electric vehicle (false positive) or an electric vehicle as a motorcycle (false negative). Among them, the true positive count (TP) is a key parameter for evaluating model performance, because it counts the useful results for the target we care about: the higher this value, the better.

It can be seen that, in a data set, the model's judgment results are related as follows:
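To make these relationships concrete, here is a minimal Python sketch using the counts from the electric vehicle example above (TP = 40, FP = 10, TN = 30, FN = 20); the variable names are illustrative, not from the original.

```python
# Confusion-matrix counts from the electric vehicle example (illustrative names).
TP, FP, TN, FN = 40, 10, 30, 20

S = TP + FP + TN + FN             # total number of test samples
actual_positives = TP + FN        # photos that really are electric vehicles
actual_negatives = FP + TN        # photos that really are motorcycles
predicted_positives = TP + FP     # photos the model labelled as electric vehicles
predicted_negatives = TN + FN     # photos the model labelled as motorcycles

print(S, actual_positives, actual_negatives)      # 100 60 40
print(predicted_positives, predicted_negatives)   # 50 50
```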

Next, let's look at the various metrics used to evaluate model performance.

Model performance metrics

1

Accuracy

Accuracy: the proportion of correctly identified samples, both positive (TP) and negative (TN), among all identified samples.

That is:

A = (TP + TN) / S

In the electric vehicle example above, TP + TN = 70 and S = 100, so the accuracy is:

A = 70/100 = 0.7

Generally speaking, the higher the accuracy, the better the model performance.
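A quick check of this arithmetic in Python, reusing the counts from the example above (the variable names are assumptions for illustration):

```python
TP, FP, TN, FN = 40, 10, 30, 20
S = TP + FP + TN + FN

accuracy = (TP + TN) / S   # (40 + 30) / 100
print(accuracy)            # 0.7
```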

2

Error rate

Error rate: the proportion of misidentified samples, both false positives (FP) and false negatives (FN), among all identified samples.

That is:

E = (FP + FN) / S

In the electric vehicle example above, FP + FN = 30 and S = 100, so the error rate is:

E = 30/100 = 0.3

It can be seen that accuracy and error rate evaluate the results from the positive and the negative side respectively, and the two values always sum to exactly 1: the higher the accuracy, the lower the error rate, and vice versa.
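The same example in Python, also showing that accuracy and error rate sum to 1 (a small sketch with assumed names):

```python
TP, FP, TN, FN = 40, 10, 30, 20
S = TP + FP + TN + FN

accuracy = (TP + TN) / S                    # 0.7
error_rate = (FP + FN) / S                  # 0.3
print(error_rate)                           # 0.3
print(round(accuracy + error_rate, 10))     # 1.0 (they always sum to 1)
```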

3

Precision

Precision: the ratio of correctly identified positives (TP) to all samples identified as positive, where the samples identified as positive are the correctly identified positives plus the incorrectly identified positives (TP + FP).

That is:

P = TP / (TP + FP)

In the electric vehicle example above, TP = 40 and TP + FP = 50. In other words, among the recognition results for the 100 photos, the model reported 50 electric vehicle targets, but only 40 of those 50 were identified correctly, so the precision is:

P = 40/50 = 0.8

Therefore, precision is the proportion of the reported targets that are correct. In the electric vehicle example, the model finds 50 targets, and precision asks what percentage of those 50 targets are genuine.
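The same precision calculation as a small Python sketch (illustrative names):

```python
TP, FP = 40, 10

precision = TP / (TP + FP)   # 40 of the 50 reported electric vehicles are real
print(precision)             # 0.8
```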

4

Recall

Recall: the proportion of correctly identified positives (TP) to the actual total number of positives, where the actual total number of positives equals the correctly identified positives plus the positives wrongly identified as negatives (TP + FN).

That is:

R = TP / (TP + FN)

Similarly, in the electric vehicle example above, TP = 40 and TP + FN = 60, so the recall is:

R = 40/60 = 0.67

In a certain sense, recall can be thought of as a retrieval rate: of the 60 actual targets, 40 were retrieved, a ratio of 40/60. Recall is therefore also a measure of completeness: of the 60 actual targets, how many were found, and in what proportion.
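The same recall calculation as a small Python sketch (illustrative names):

```python
TP, FN = 40, 20

recall = TP / (TP + FN)   # 40 of the 60 actual electric vehicles were found
print(round(recall, 2))   # 0.67
```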

As the formulas show, both precision and recall are closely tied to TP: the larger TP is, the higher both precision and recall are. Ideally, we want precision and recall to be as high as possible. However, high precision alone, or high recall alone, is not enough to show that a model performs well.

For example:

High precision model

As can be seen from the table above, the model identified 50 samples as positive and 200 as negative. All 50 reported positives are correct (all true positives, no false positives), so the precision P is 100%, which is very high. However, all 200 reported negatives are wrong (all false negatives), so the error rate is very high. Such a model in fact performs very poorly.

High recall model

As can be seen from the table above, the model identified 110 samples as positive and 0 as negative. Of the 110 reported positives, 10 are true positives (correctly identified) and 100 are false positives (incorrectly identified). On this test set the recall R works out to 100%, which looks very good: the data set contains 10 targets in total, and all of them were found (recalled). But at the same time the error rate E of the model's results is also very high, up to 91%, so this model likewise performs very poorly and is basically unusable.
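The two toy models above can be checked numerically; this is a small sketch using the counts read from the descriptions, with a helper function of our own naming:

```python
def metrics(TP, FP, TN, FN):
    """Return (precision, recall, error_rate) for one confusion matrix."""
    S = TP + FP + TN + FN
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    error_rate = (FP + FN) / S
    return precision, recall, error_rate

# High-precision model: 50 predicted positives, all correct; 200 predicted negatives, all wrong.
print(metrics(TP=50, FP=0, TN=0, FN=200))   # (1.0, 0.2, 0.8)

# High-recall model: 110 predicted positives (10 correct, 100 wrong); no predicted negatives.
print(metrics(TP=10, FP=100, TN=0, FN=0))   # (≈0.09, 1.0, ≈0.91)
```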

5

Precision-recall curve (PR curve)

In practice, precision and recall affect each other. In general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low. This is easy to understand: as described above, precision asks whether the targets reported by the model are correct, while recall asks whether all the actual targets have been found. To make the reported targets as correct as possible, the confidence threshold is raised; once the threshold goes up, fewer samples qualify as targets, some true targets inevitably slip through the net, and recall drops.

Conversely, to achieve a high recall so that no fish escapes the net (all targets are found), the threshold has to be lowered to capture every target; at the same time some false targets are caught as well, which lowers the precision.

For example, under two different thresholds (0.6 and 0.5), the model's recognition results for 15 images are as follows:

In the table above, 1 and 0 denote positive and negative examples respectively. A threshold (T) is set: when the confidence score is greater than the threshold, the sample is identified as positive; when it is less than the threshold, it is identified as negative. In the results above, the model reports 8 positives when the threshold T = 0.6 and 10 positives when T = 0.5.

Checking against the true labels, we can obtain the parameters (TP, FP, FN) under these two thresholds and calculate the recall (R) and precision (P), as follows:

It can be seen that different thresholds yield different values of recall (R) and precision (P), so each threshold gives a corresponding pair (R, P). For example, the two thresholds above give two pairs, (0.86, 0.75) and (1, 0.7). Taking many different thresholds yields many pairs (R, P). Plotting these points (R, P) and connecting them with a curve gives the PR curve.

The PR curve is therefore drawn with recall R on the horizontal axis and precision P on the vertical axis, as shown in the figure below:
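Below is a minimal sketch of how the (R, P) pairs behind a PR curve can be produced by sweeping the threshold. The confidence scores and labels are made up for illustration; they are not the 15-image table from the article.

```python
# Hypothetical confidence scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    1,    0]

def pr_at_threshold(scores, labels, t):
    """Precision and recall when samples with score > t are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return recall, precision

# One (R, P) pair per threshold; plotting these points gives the PR curve.
for t in [0.8, 0.7, 0.6, 0.5, 0.4]:
    print(t, pr_at_threshold(scores, labels, t))
```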

6

Average Precision (AP)

The area under the PR curve is called the Average Precision (AP); it represents the average precision as the recall goes from 0 to 1. How is AP calculated? Based on elementary calculus, it can be computed as an integral:

AP = ∫ P dR, integrating over R from 0 to 1

Obviously, this area cannot be greater than 1. The larger the area under the PR curve, the better the model's performance. A well-performing model keeps precision (P) at a high level while recall (R) increases, whereas a low-performance model usually has to sacrifice a lot of precision to raise recall. As shown in the figure below, there are two PR curves: PR1 corresponds to the better model, and the area under PR1 is clearly larger than the area under PR2. For PR1, P stays high as R increases; for PR2, P keeps falling as R increases, so a higher R can only be obtained by sacrificing P.

Besides computing AP by integration, interpolation is often used in practice. A common interpolation method is to select 11 precision points and take their average as the AP value.

How are the 11 precision points chosen? Usually a set of recall thresholds is fixed first, for example [0, 0.1, 0.2, …, 1]. For each threshold, the maximum precision among the points whose recall is at least that threshold (R ≥ 0, R ≥ 0.1, …, R ≥ 1) is taken, giving 11 maximum precision values (Pmax1, Pmax2, …, Pmax11).

That is:

AP = (Pmax1 + Pmax2 + … + Pmax11) / 11
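A small sketch of the 11-point interpolation described above, assuming we already have a list of (recall, precision) points from a threshold sweep; the function name and the sample points are ours.

```python
def eleven_point_ap(pr_points):
    """pr_points: list of (recall, precision) pairs measured at various thresholds."""
    p_max_values = []
    for r_threshold in [i / 10 for i in range(11)]:   # 0.0, 0.1, ..., 1.0
        # Maximum precision among points whose recall reaches the threshold.
        candidates = [p for r, p in pr_points if r >= r_threshold]
        p_max_values.append(max(candidates) if candidates else 0.0)
    return sum(p_max_values) / 11

# Example with the two (R, P) pairs from the article plus a made-up low-recall point.
print(eleven_point_ap([(0.3, 0.9), (0.86, 0.75), (1.0, 0.7)]))
```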

7

mAP (mean Average Precision)

AP measures the model's average precision on a single category, while mAP measures the average precision over all categories. Each category has its own AP; assuming there are n categories, there are n AP values, AP1, AP2, …, APn, and mAP is their average over all categories, namely:

mAP = (AP1 + AP2 + … + APn) / n
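mAP is then just the average of the per-class AP values; a minimal sketch with hypothetical per-class numbers:

```python
# Hypothetical per-class AP values (e.g. for 'electric_vehicle', 'motorcycle', 'bicycle').
ap_per_class = {"electric_vehicle": 0.83, "motorcycle": 0.76, "bicycle": 0.68}

mAP = sum(ap_per_class.values()) / len(ap_per_class)
print(round(mAP, 3))   # 0.757
```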

8

Comprehensive evaluation index F-Measure

F-measure, also known as F-score, is the weighted harmonic mean of recall R and precision P. As the name implies, it reconciles the tension between recall R and precision P. A coefficient α is introduced to weight R and P in the combined metric F, whose expression is:

F = (α² + 1) · P · R / (α² · P + R)

The most commonly used variant, F1, is the case where the coefficient α in the above formula equals 1, namely:

F1 = 2 · P · R / (P + R)

The maximum value of F1 is 1 and the minimum value is 0.
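Using the precision and recall from the electric vehicle example (P = 0.8, R ≈ 0.67), F1 can be computed directly; a small sketch:

```python
P = 40 / 50   # precision from the electric vehicle example
R = 40 / 60   # recall from the electric vehicle example

F1 = 2 * P * R / (P + R)
print(round(F1, 2))   # 0.73
```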

9

ROC curve and AUC

ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve)

The ROC curve is also known as the receiver operating characteristic curve. It is closely related to the true positive rate (TPR) and the false positive rate (FPR).

True positive rate (TPR): the proportion of correctly identified positives (TP) to the actual total number of positives; its value is the same as the recall. That is:

TPR = TP / (TP + FN)

False positive rate (FPR): the proportion of samples wrongly identified as positive (FP) to the actual total number of negatives. In other words, it is the proportion of actual negatives that are misjudged as positives. The formula is:

FPR = FP / (FP + TN)

With FPR on the horizontal axis and TPR on the vertical axis, the resulting curve is the ROC curve; it is drawn in a similar way to the PR curve. An example ROC curve is shown below:

In general, the closer the ROC curve is to the upper-left corner, the better.

The area under the ROC curve is the AUC. The larger the area, the better the model's classification performance. In the figure above, the AUC of the green classification model (0.83) is larger than that of the red one (0.65), so the green model classifies better. The green curve is also smoother than the red one; generally speaking, the smoother the ROC curve, the less the model overfits. Overall, the green model performs better than the red one.
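Here is a minimal sketch of how (FPR, TPR) points and the AUC can be obtained by sweeping a threshold, using the trapezoid rule for the area. The scores and labels are made up for illustration; this is not the green or red model from the figure.

```python
# Hypothetical confidence scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs for thresholds placed at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    points = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points(scores, labels)
print(pts)          # list of (FPR, TPR) points of the ROC curve
print(auc(pts))     # 0.8125
```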

10

Intersection over Union (IoU)

IoU stands for intersection over union, i.e., the ratio of an intersection to a union. Suppose there are two sets A and B; IoU equals the intersection of A and B divided by their union, expressed as follows:

IoU = (A ∩ B) / (A ∪ B)

In object detection, IoU is the intersection-over-union ratio of the prediction box and the ground-truth box. As shown in the figure below, in detecting the kitten, the purple box is the prediction (Prediction) and the red box is the ground truth (Ground truth).

Extracting the prediction box and the ground-truth box gives the figure below: their intersection is the hatched area in the lower-left figure, and their union is the blue filled area in the figure on the right. IoU is therefore:

the hatched (intersection) area on the left divided by the blue filled (union) area on the right.

Example of the intersection and union of the prediction box and the ground-truth box

In object detection tasks, a detection is usually counted as a recall when IoU ≥ 0.5. If the IoU threshold is set higher, recall decreases, but the predicted boxes are located more accurately.

Ideally, of course, the prediction box and the ground-truth box overlap as much as possible; if they overlap completely, the intersection and the union have the same area and IoU equals 1.
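A small sketch of IoU for two axis-aligned boxes in (x1, y1, x2, y2) format; the boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

prediction = (50, 50, 150, 150)     # hypothetical predicted box
ground_truth = (60, 60, 160, 160)   # hypothetical ground-truth box
print(iou(prediction, ground_truth))   # ≈0.68
print(iou(prediction, prediction))     # 1.0: complete overlap
```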

11

Top-1 and Top-K

Top-1: for an image, the prediction is counted as correct if the class with the highest recognition probability (i.e., confidence score) given by the model is the correct target. The target here is what we call the positive example.

Top-K: for an image, the prediction is counted as correct if the correct target (positive example) appears among the top K recognition probabilities (confidence scores) given by the model.

K can generally be up to on the order of 100; of course, the smaller it is, the more practically meaningful the result. For example, K = 5 gives Top-5, meaning that one of the top 5 confidence scores corresponds to the correct target; K = 100 gives Top-100, meaning that one of the top 100 confidence scores corresponds to the correct target (a correct positive example). So the larger K is, the easier the task becomes.

For example, on one data set we sort the confidence scores and take the top 5, with the following result:

In the table above, the threshold T = 0.45 is used, and the top-5 confidence scores are all greater than the threshold, so all five are identified as positive. For Top-1, the image with ID 4 has the highest score but its actual label is negative, so the Top-1 result is wrong. For Top-5, the top 5 confidence scores do include correctly identified targets, namely the images with IDs 2 and 20, so the Top-5 result is counted as correct.
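A minimal sketch of a Top-1 / Top-K check for a single image, assuming the model returns per-class confidence scores; the class names and scores are hypothetical.

```python
def top_k_correct(scores, true_label, k):
    """True if the true label is among the k classes with the highest scores."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return true_label in top_k

# Hypothetical confidence scores for one image.
scores = {"motorcycle": 0.48, "electric_vehicle": 0.45, "bicycle": 0.05, "car": 0.02}
true_label = "electric_vehicle"

print(top_k_correct(scores, true_label, k=1))   # False: Top-1 prediction is 'motorcycle'
print(top_k_correct(scores, true_label, k=5))   # True: correct class is within the top 5
```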

In common face recognition models, accuracy is the headline figure used in publicity. In fact, for the same model, each performance metric is not a fixed number; it changes with the application scenario, the size of the face database, and so on. There is therefore inevitably a gap between the accuracy achieved in real application scenarios and that measured in a laboratory environment, and to some extent the accuracy in real application scenarios is the more meaningful figure for evaluation.

Machine Learning Beginners

This public account was created by Dr. Huang Haiguang, who has more than 21,000 followers on Zhihu and whose GitHub ranks among the top 120 worldwide (more than 30,000 stars). The account is dedicated to popular-science articles on artificial intelligence and provides learning routes and basic materials for beginners. Original works include personal notes on machine learning, notes on deep learning, and more.
