Model evaluation is a core part of the entire modeling cycle. Choosing appropriate evaluation metrics makes model training and testing far more effective. This article introduces several common metrics for evaluating classification models. Understanding these metrics and being able to use them skillfully is enough to handle most beginner and intermediate data modeling work.
1. Error rate and accuracy
Error rate and accuracy are the two most commonly used evaluation metrics; they are intuitive and easy to understand.
The error rate is the proportion of incorrectly predicted samples among all samples; accuracy is the proportion of correctly predicted samples among all samples. The two sum to 1.
For example, suppose there are 80 positive examples and 20 negative examples among 100 test samples, and the model predicts every sample as positive. What are the model's error rate and accuracy?
Clearly the error rate is 20/100 = 20%, and the accuracy is 1 - 20% = 80%.
An 80% accuracy looks good, yet the model achieved it simply by predicting every sample as positive. The reason is severe class imbalance: once the class distribution is unbalanced, error rate and accuracy can no longer measure model performance effectively.
At this point we need to go a level deeper and ask whether there are more general metrics that can effectively evaluate a model on imbalanced samples.
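As a quick sanity check of how misleading accuracy is on imbalanced data, here is a minimal sketch in Python, reusing the 80/20 split and the all-positive prediction from the example above (the array construction itself is just for illustration):

from sklearn.metrics import accuracy_score
import numpy as np

# 80 positive (1) and 20 negative (0) samples, as in the example above
y_true = np.array([1] * 80 + [0] * 20)
# a "model" that predicts every sample as positive
y_pred = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.8 -> 80% accuracy from a useless model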
2. Confusion matrix, Precision, Recall
2.1 Confusion Matrix
Before introducing F1-score, ROC, and AUC, we need to introduce the well-known confusion matrix.
True Positive (TP): the number of positive samples the model predicts correctly, i.e. the sample is positive and the model predicts positive.
True Negative (TN): the number of negative samples the model predicts correctly, i.e. the sample is negative and the model predicts negative.
False Positive (FP): the number of negative samples the model predicts incorrectly, i.e. the sample is negative but the model predicts positive.
False Negative (FN): the number of positive samples the model predicts incorrectly, i.e. the sample is positive but the model predicts negative.
For example, suppose 100 people are tested as potential novel coronavirus carriers: 10 are truly positive (carriers) and 90 are truly negative (non-carriers). The instrument predicts 8 of the 10 positive samples as positive and 2 as negative, and predicts 80 of the 90 negative samples as negative and 10 as positive.
In this example, we can calculate (note that TP, TN, FP, and FN are all counts, not percentages):
TP = 8; TN = 80; FP = 10; FN = 2
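These four counts are exactly what sklearn's confusion_matrix returns. A minimal sketch reproducing the testing example above (the label arrays are constructed here purely for illustration):

from sklearn.metrics import confusion_matrix
import numpy as np

# 10 true positives (1) followed by 90 true negatives (0)
y_true = np.array([1] * 10 + [0] * 90)
# instrument: 8 of the 10 positives predicted positive, 2 negative;
# 80 of the 90 negatives predicted negative, 10 positive
y_pred = np.array([1] * 8 + [0] * 2 + [0] * 80 + [1] * 10)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 8 80 10 2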
The confusion matrix exists mainly to compute precision and recall, two concepts that are easily confused or hard to remember. A useful mnemonic: precision measures how exact the predicted positives are, while recall measures how many of the real positives are recalled.
2.2 Precision and recall
Precision is defined from the perspective of the predictions: among all samples predicted as positive, it is the proportion that are truly positive. The predicted positives consist of samples that are truly positive and predicted positive (TP) plus samples that are truly negative but predicted positive (FP). So precision = TP / (TP + FP) = 8 / (8 + 10) = 8/18.
Recall is defined from the perspective of the real samples: among all samples that are truly positive, it is the proportion that are correctly predicted as positive, i.e. recall = TP / (TP + FN).
In the example above, recall = 8 / (8 + 2) = 8/10.
The model's recall is good but its precision is low; it tends to predict potential virus carriers as positive. Is that a good model? Yes. In fact, for this scenario we should push its recall even higher, because for a novel coronavirus test a false positive only means re-testing, whereas a false negative means a missed carrier.
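Continuing the same example, precision and recall can be computed directly with sklearn, reusing the y_true and y_pred arrays sketched in the confusion matrix example above:

from sklearn.metrics import precision_score, recall_score
import numpy as np

y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 8 + [0] * 2 + [0] * 80 + [1] * 10)

print(precision_score(y_true, y_pred))  # 8 / 18 ≈ 0.444
print(recall_score(y_true, y_pred))     # 8 / 10 = 0.8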
3. F1-score, ROC, AUC and code implementation
3.1 F1-score
F1-score is a single score that combines precision and recall. Mathematically, F1-score is the harmonic mean of precision (P) and recall (R): F1 = 2 / (1/P + 1/R) = 2PR / (P + R).
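As a quick check with the numbers from the testing example above (P = 8/18, R = 8/10): F1 = 2 × (8/18) × (8/10) / (8/18 + 8/10) = (128/180) / (224/180) = 128/224 ≈ 0.571.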
3.2 ROC curve
The confusion matrix is also used to draw the Receiver Operating Characteristic (ROC) curve. The horizontal axis of the ROC curve is the false positive rate (FPR) and the vertical axis is the true positive rate (TPR), where FPR = FP / (FP + TN) and TPR = TP / (TP + FN). Note that TPR is computed exactly like recall.
With FPR on the horizontal axis and TPR on the vertical axis, the ROC curve can be drawn. Where do the individual (FPR, TPR) coordinate points come from? Before a model outputs a class label, it first outputs a class probability (or score) and then classifies using a default threshold of 0.5. If instead we take each predicted probability value in turn as the threshold, we obtain multiple sets of predictions; computing FPR and TPR for each set gives multiple coordinate pairs, from which the ROC curve is drawn.
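A minimal sketch of this thresholding process using sklearn's roc_curve, which returns one (FPR, TPR) pair per threshold; the toy labels and scores below are assumed purely for illustration:

from sklearn.metrics import roc_curve
import numpy as np

# toy ground truth and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7])

# each distinct score is tried as a threshold, yielding one (FPR, TPR) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")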
3.3 AUC
AUC is the area under the ROC curve, a single value obtained by integrating the curve along the horizontal axis. Its intuitive meaning: randomly draw one positive sample and one negative sample; let P1 be the score the model assigns to the positive sample and P2 the score it assigns to the negative sample. AUC is the probability that P1 > P2. The larger the AUC, the better the model separates the classes. AUC measures classifier performance well and is widely used.
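This pairwise-ranking interpretation can be checked empirically. The sketch below (with the same assumed toy scores as in the ROC example) compares sklearn's roc_auc_score with the fraction of (positive, negative) pairs in which the positive sample receives the higher score:

from sklearn.metrics import roc_auc_score
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7])

# fraction of (positive, negative) pairs where the positive sample scores higher
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs))                  # pairwise ranking probability, 8/9 ≈ 0.889
print(roc_auc_score(y_true, y_score))  # same value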
3.4 Example of using sklearn to calculate evaluation metrics
A. Load the sample data and build a logistic regression classification model
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# load the breast cancer dataset and fit a logistic regression classifier
X, y_true = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y_true)
# predicted class labels, used for accuracy / precision / recall / F1
y_pred = clf.predict(X)
# continuous decision scores, used for ROC / AUC
y_score = clf.decision_function(X)
B. Calculate the metrics
Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
Precision and Recall
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
F1-score, AUC
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
f1_score(y_true, y_pred)
# AUC is computed from the continuous decision scores rather than the predicted labels
roc_auc_score(y_true, y_score)