Model evaluation is a core part of the entire modeling cycle. Choosing appropriate evaluation metrics makes model training and testing far more effective. This article introduces several common metrics for evaluating classification models. Understanding these metrics and being able to use them skillfully is enough to handle most beginner and intermediate data modeling work.
1. Error rate and accuracy
Error rate and accuracy are the two most commonly used evaluation metrics; they are intuitive and easy to understand.
The error rate is the proportion of incorrectly predicted samples among all samples; accuracy is the proportion of correctly predicted samples among all samples. The two sum to 1.
For example, suppose there are 80 positive examples and 20 negative examples among 100 test samples, and the model predicts every sample as positive. What are the model's error rate and accuracy?
Clearly the error rate is 20/100 = 20%, and the accuracy is 1 - 20% = 80%.
An 80% accuracy looks good, yet the model achieved it simply by predicting every sample as positive. The reason is severe class imbalance: once the class distribution is unbalanced, error rate and accuracy can no longer measure model performance effectively.
At this point we need to go a level deeper and ask whether there are more general metrics that can effectively evaluate a model on imbalanced samples.
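As a quick sanity check of how misleading accuracy is on imbalanced data, here is a minimal sketch in Python, reusing the 80/20 split and the all-positive prediction from the example above (the array construction itself is just for illustration):

from sklearn.metrics import accuracy_score
import numpy as np

# 80 positive (1) and 20 negative (0) samples, as in the example above
y_true = np.array([1] * 80 + [0] * 20)
# a "model" that predicts every sample as positive
y_pred = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.8 -> 80% accuracy from a useless model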
2. Confusion matrix, Precision, Recall
2.1 Confusion Matrix
Before introducing F1-score, ROC, and AUC, we need to introduce the well-known confusion matrix.
True Positive (TP): the number of positive samples the model predicts correctly, i.e. the sample is positive and the model predicts positive.
True Negative (TN): the number of negative samples the model predicts correctly, i.e. the sample is negative and the model predicts negative.
False Positive (FP): the number of negative samples the model predicts incorrectly, i.e. the sample is negative but the model predicts positive.
False Negative (FN): the number of positive samples the model predicts incorrectly, i.e. the sample is positive but the model predicts negative.
For example, suppose 100 people are tested as potential novel coronavirus carriers: 10 are truly positive (carriers) and 90 are truly negative (non-carriers). The instrument predicts 8 of the 10 positive samples as positive and 2 as negative, and predicts 80 of the 90 negative samples as negative and 10 as positive.
In this example, we can calculate (note that TP, TN, FP, and FN are all counts, not percentages):
TP = 8; TN = 80; FP = 10; FN = 2
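These four counts are exactly what sklearn's confusion_matrix returns. A minimal sketch reproducing the testing example above (the label arrays are constructed here purely for illustration):

from sklearn.metrics import confusion_matrix
import numpy as np

# 10 true positives (1) followed by 90 true negatives (0)
y_true = np.array([1] * 10 + [0] * 90)
# instrument: 8 of the 10 positives predicted positive, 2 negative;
# 80 of the 90 negatives predicted negative, 10 positive
y_pred = np.array([1] * 8 + [0] * 2 + [0] * 80 + [1] * 10)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 8 80 10 2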
The confusion matrix exists mainly to compute precision and recall, two concepts that are easily confused or hard to remember. A useful mnemonic: precision measures how exact the predicted positives are, while recall measures how many of the real positives are recalled.
2.2 Precision and recall
Precision is defined from the perspective of the predictions: among all samples predicted as positive, it is the proportion that are truly positive. The predicted positives consist of samples that are truly positive and predicted positive (TP) plus samples that are truly negative but predicted positive (FP). So precision = TP / (TP + FP) = 8 / (8 + 10) = 8/18.
Recall is defined from the perspective of the real samples: among all samples that are truly positive, it is the proportion that are correctly predicted as positive, i.e. recall = TP / (TP + FN).
In the example above, recall = 8 / (8 + 2) = 8/10.
The model's recall is good but its precision is low; it tends to predict potential virus carriers as positive. Is that a good model? Yes. In fact, for this scenario we should push its recall even higher, because for a novel coronavirus test a false positive only means re-testing, whereas a false negative means a missed carrier.
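Continuing the same example, precision and recall can be computed directly with sklearn, reusing the y_true and y_pred arrays sketched in the confusion matrix example above:

from sklearn.metrics import precision_score, recall_score
import numpy as np

y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 8 + [0] * 2 + [0] * 80 + [1] * 10)

print(precision_score(y_true, y_pred))  # 8 / 18 ≈ 0.444
print(recall_score(y_true, y_pred))     # 8 / 10 = 0.8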
3. F1-score, ROC, AUC and code implementation
3.1 F1-score
F1-score is a single score that combines precision and recall. Mathematically, F1-score is the harmonic mean of precision (P) and recall (R): F1 = 2 / (1/P + 1/R) = 2PR / (P + R).
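As a quick check with the numbers from the testing example above (P = 8/18, R = 8/10): F1 = 2 × (8/18) × (8/10) / (8/18 + 8/10) = (128/180) / (224/180) = 128/224 ≈ 0.571.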
3.2 ROC curve
The confusion matrix is also used to draw the Receiver Operating Characteristic (ROC) curve. The horizontal axis of the ROC curve is the false positive rate (FPR) and the vertical axis is the true positive rate (TPR), where FPR = FP / (FP + TN) and TPR = TP / (TP + FN). Note that TPR is computed exactly like recall.
With FPR on the horizontal axis and TPR on the vertical axis, the ROC curve can be drawn. Where do the individual (FPR, TPR) coordinate points come from? Before a model outputs a class label, it first outputs a class probability (or score) and then classifies using a default threshold of 0.5. If instead we take each predicted probability value in turn as the threshold, we obtain multiple sets of predictions; computing FPR and TPR for each set gives multiple coordinate pairs, from which the ROC curve is drawn.
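A minimal sketch of this thresholding process using sklearn's roc_curve, which returns one (FPR, TPR) pair per threshold; the toy labels and scores below are assumed purely for illustration:

from sklearn.metrics import roc_curve
import numpy as np

# toy ground truth and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7])

# each distinct score is tried as a threshold, yielding one (FPR, TPR) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")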
3.3 AUC
AUC is the area under the ROC curve, a single value obtained by integrating the curve along the horizontal axis. Its intuitive meaning: randomly draw one positive sample and one negative sample; let P1 be the score the model assigns to the positive sample and P2 the score it assigns to the negative sample. AUC is the probability that P1 > P2. The larger the AUC, the better the model separates the classes. AUC measures classifier performance well and is widely used.
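This pairwise-ranking interpretation can be checked empirically. The sketch below (with the same assumed toy scores as in the ROC example) compares sklearn's roc_auc_score with the fraction of (positive, negative) pairs in which the positive sample receives the higher score:

from sklearn.metrics import roc_auc_score
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7])

# fraction of (positive, negative) pairs where the positive sample scores higher
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs))                  # pairwise ranking probability, 8/9 ≈ 0.889
print(roc_auc_score(y_true, y_score))  # same value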
3.4 Example of using sklearn to calculate evaluation metrics
A. Load the sample data and build a logistic regression classification model
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# load the breast cancer dataset and fit a logistic regression classifier
X, y_true = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y_true)
# predicted class labels, used for accuracy / precision / recall / F1
y_pred = clf.predict(X)
# continuous decision scores, used for ROC / AUC
y_score = clf.decision_function(X)
B. Calculate the metrics
Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
Precision and Recall
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
F1-score, AUC
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
f1_score(y_true, y_pred)
# AUC is computed from the continuous decision scores rather than the predicted labels
roc_auc_score(y_true, y_score)