For regression problems, models are usually evaluated with MSE, MAE, RMSE, and R^2. For classification problems, the simplest approach is to use accuracy. For example, the default score for classifiers in scikit-learn is accuracy.
Accuracy is easy to understand, but it can be misleading on extremely skewed data. In cancer prediction, for example, the ratio of healthy to sick people might be 10,000 to 1. On data this skewed, the simplest possible model, one that predicts "healthy" for every sample, already reaches about 99.99% accuracy without finding a single patient.
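The effect can be reproduced in a few lines. This is a minimal sketch on made-up labels (10,000 healthy, 1 sick); the arrays and variable names are assumptions for illustration, not data from a real screening program.

```python
import numpy as np

# Hypothetical screening data: 10,000 healthy people (0) and 1 sick person (1).
y_true = np.array([0] * 10_000 + [1])

# A "model" that ignores its input and always predicts healthy.
y_pred = np.zeros_like(y_true)

# Share of all samples that were predicted correctly.
accuracy = np.mean(y_pred == y_true)

# Number of truly sick people the model actually flagged.
sick_found = int(np.sum((y_pred == 1) & (y_true == 1)))

print(f"accuracy: {accuracy:.4%}")
print(f"sick people found: {sick_found}")
```

The accuracy looks excellent while the model finds nobody who is sick, which is why accuracy alone is not enough here.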
For this type of data, a classification model can be evaluated with the confusion matrix.
Confusion matrix
To explain the confusion matrix and terms such as precision and recall, we start with a binary classification problem as an example.
Actual / Predicted | 0 | 1 |
---|---|---|
0 | TN | FP |
1 | FN | TP |
- In the table above, the rows represent the actual values and the columns represent the predicted values.
- 0 means negative, 1 means positive.
- TN (True Negative): the actual value is negative and the predicted value is negative, so the prediction is correct.
- FP (False Positive): the actual value is negative but the predicted value is positive, so the prediction is wrong.
- FN (False Negative): the actual value is positive but the predicted value is negative, so the prediction is wrong.
- TP (True Positive): the actual value is positive and the predicted value is positive, so the prediction is correct.
This is a little abstract, so let’s work through a concrete example.
Actual / Predicted | 0 | 1 |
---|---|---|
0 | 9980 | 10 |
1 | 3 | 7 |
- 9,980 people did not have cancer, and the algorithm also predicted that they did not have cancer.
- Ten people didn’t have cancer, but the algorithm predicted they did.
- Three people had cancer, but the algorithm predicted they didn’t have cancer.
- Seven people had cancer, and the algorithm predicted that they had cancer.
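The table above can be rebuilt with scikit-learn's `confusion_matrix`. The sketch below assumes synthetic label arrays constructed only to match the four counts in the example; the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels that reproduce the counts in the table above.
# Groups are laid out in the order TN, FP, FN, TP.
y_true = np.array([0] * 9980 + [0] * 10 + [1] * 3 + [1] * 7)
y_pred = np.array([0] * 9980 + [1] * 10 + [0] * 3 + [1] * 7)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For binary 0/1 labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

scikit-learn uses the same layout as the table: rows are actual values and columns are predicted values.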
Precision
Precision is the fraction of correct predictions among all samples predicted to belong to the class of interest. Here the algorithm predicted cancer 17 times: 7 of those predictions were correct and 10 were wrong.
Precision = TP / (TP + FP) = 7 / (7 + 10) ≈ 41.2%, which means that out of every 100 positive predictions, about 41 are correct on average.
Recall
Recall is the fraction of samples in the class of interest that the algorithm finds. Here 10 people actually had cancer, and the algorithm identified 7 of them.
Recall = TP / (TP + FN) = 7 / (7 + 3) = 70%; that is, out of every 100 patients, the algorithm finds 70 on average and misses 30.
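The two ratios TP / (TP + FP) and TP / (TP + FN) are usually called precision and recall. As a quick check of the arithmetic, here is a small sketch using the counts from the cancer example; the variable names are illustrative.

```python
# Counts taken from the cancer example above.
tp, fp, fn = 7, 10, 3

# Of everyone predicted to have cancer, how many really do?
precision = tp / (tp + fp)

# Of everyone who really has cancer, how many did the model find?
recall = tp / (tp + fn)

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

The two ratios share the same numerator (TP) and differ only in which kind of error, FP or FN, appears in the denominator.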
F1-Score
Which metric matters more depends on the scenario. In stock prediction, where the model predicts whether a stock will rise or fall, the business wants the stocks it flags as rising to really rise, so precision matters more. In disease prediction, the requirement is to find every sick patient and miss none, so recall matters more: diagnosing a healthy person as sick may matter little, as long as no sick patient is diagnosed as healthy.
But what about situations where precision and recall both matter? The F1-score covers this case. F1 is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). The harmonic mean is high only when both values are high.
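The harmonic mean can be sketched in plain Python. The helper `f1` below is a hypothetical function written for this illustration (scikit-learn's own `f1_score` works on label arrays instead); the zero-handling branch is an assumption about how to treat the undefined case.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the cancer example: precision = 7/17, recall = 7/10.
f1_cancer = f1(7 / 17, 7 / 10)

# A lopsided pair: the arithmetic mean hides the weak recall, F1 does not.
f1_lopsided = f1(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print(f"F1 (cancer example): {f1_cancer:.4f}")
print(f"F1 (0.9, 0.1): {f1_lopsided:.2f} vs arithmetic mean {arithmetic_mean:.2f}")
```

Because the harmonic mean is dominated by the smaller value, a model cannot earn a high F1 by excelling at only one of the two metrics.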
Example
To demonstrate the three concepts above, we first build an extremely skewed data set. We use the handwritten digits data set that ships with scikit-learn, in which the ten digits 0 to 9 are roughly evenly distributed. To skew it, we label the digit 9 as class 1 and every other digit as class 0, so only about one sample in ten is positive.
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0
Use logistic regression to make predictions:
from sklearn.model_selection import train_test_split
# Fixed seed so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)
Because the data is heavily skewed, a model that predicts 0 for every sample would already reach about 90% accuracy. Accuracy only measures the share of samples predicted correctly overall; it does not show whether the model actually finds the samples of class 1. The sklearn.metrics module provides direct support for the confusion matrix, precision, recall, and F1-score:
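from sklearn.metrics import confusion_matrix
# Predict the test-set labels first; the metric functions below compare them with y_test
y_log_predict = log_reg.predict(X_test)
confusion_matrix(y_test, y_log_predict)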
from sklearn.metrics import precision_score
precision_score(y_test, y_log_predict)
from sklearn.metrics import recall_score
recall_score(y_test, y_log_predict)
from sklearn.metrics import f1_score
f1_score(y_test, y_log_predict)
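To see why these metrics are needed on this data set, it can help to score a baseline that always predicts class 0. The sketch below uses scikit-learn's `DummyClassifier` for that; the seed `666` and the variable names are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.data
# 1 for the digit 9, 0 for every other digit.
y = (digits.target == 9).astype(int)

# random_state=666 is an arbitrary seed chosen for this sketch.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# Baseline that always predicts the most frequent training class (0 here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_base_predict = baseline.predict(X_test)

base_accuracy = accuracy_score(y_test, y_base_predict)
base_recall = recall_score(y_test, y_base_predict)

print(f"baseline accuracy: {base_accuracy:.3f}")
print(f"baseline recall:   {base_recall:.3f}")
```

The baseline scores roughly 90% accuracy with a recall of zero, so a trained model should be judged by how far its recall and precision rise above this, not by accuracy alone.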
PR curve
For binary classification problems, we can shift the decision threshold to trade precision against recall. A sample is classified as 1 when its score is at or above the threshold, and as 0 otherwise. As the threshold rises, precision generally increases and recall decreases; as it falls, precision generally decreases and recall increases. Precision and recall pull against each other, so in general they cannot both be improved by moving the threshold.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0
from sklearn.model_selection import train_test_split
# Fixed seed so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
precisions = []
recalls = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), 0.1)
for threshold in thresholds:
    # Classify as 1 when the decision score reaches the current threshold
    y_predict = np.array(decision_scores >= threshold, dtype='int')
    precisions.append(precision_score(y_test, y_predict))
    recalls.append(recall_score(y_test, y_predict))
plt.plot(precisions, recalls)
plt.show()
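The manual threshold loop above can also be replaced by scikit-learn's `precision_recall_curve`, which evaluates the scores at every distinct threshold. This is a sketch rather than the author's method; the seed `666` and `max_iter=5000` are assumptions chosen here (the latter only to give the solver room to converge).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.data
# 1 for the digit 9, 0 for every other digit.
y = (digits.target == 9).astype(int)

# random_state=666 is an arbitrary seed chosen for this sketch.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# max_iter is raised above the default as an assumption, to help convergence.
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)

# precisions and recalls have one more element than thresholds:
# the final pair (precision=1, recall=0) has no corresponding threshold.
precisions, recalls, thresholds = precision_recall_curve(y_test, decision_scores)

print(len(precisions), len(recalls), len(thresholds))
```

The returned arrays can be passed to `plt.plot(recalls, precisions)` to draw the curve. Recall never increases as the threshold rises, which is the trade-off described above.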
The ROC curve
The receiver operating characteristic (ROC) curve describes the relationship between TPR and FPR. The related rates are defined as follows:
- TPR (True Positive Rate): the number of positive samples predicted to be positive divided by the actual number of positive samples: TPR = TP / (TP + FN). This is the same quantity as recall.
- TNR (True Negative Rate): the number of negative samples predicted to be negative divided by the actual number of negative samples: TNR = TN / (TN + FP)
- FPR (False Positive Rate): the number of negative samples predicted to be positive divided by the actual number of negative samples: FPR = FP / (TN + FP)
- FNR (False Negative Rate): the number of positive samples predicted to be negative divided by the actual number of positive samples: FNR = FN / (TP + FN)
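The four rates in the list above can be checked against the counts from the earlier cancer example (TN = 9980, FP = 10, FN = 3, TP = 7). This is a small arithmetic sketch with illustrative variable names.

```python
# Counts from the cancer confusion matrix earlier in the article.
tn, fp, fn, tp = 9980, 10, 3, 7

tpr = tp / (tp + fn)  # true positive rate (same as recall)
fnr = fn / (tp + fn)  # false negative rate
tnr = tn / (tn + fp)  # true negative rate
fpr = fp / (tn + fp)  # false positive rate

print(f"TPR = {tpr:.4f}, FNR = {fnr:.4f}")
print(f"TNR = {tnr:.4f}, FPR = {fpr:.4f}")
```

TPR and FNR share a denominator (all actual positives) and sum to 1; the same holds for TNR and FPR over the actual negatives. That is why the ROC curve only needs TPR and FPR.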
Example
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0
from sklearn.model_selection import train_test_split
# Fixed seed so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)
from sklearn.metrics import roc_curve
fprs, tprs, thresholds = roc_curve(y_test, decision_scores)
import matplotlib.pyplot as plt
plt.plot(fprs, tprs)
plt.show()
The area under the ROC curve (AUC), that is, the area enclosed by the curve and the bottom and right edges of the plot, is a standard measure of model quality. The larger the area, the better the model.
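The area can be computed with scikit-learn's `roc_auc_score`, or by integrating the curve returned by `roc_curve` with `auc`. The sketch below does both on the same skewed digits data; the seed `666` and `max_iter=5000` are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.data
# 1 for the digit 9, 0 for every other digit.
y = (digits.target == 9).astype(int)

# random_state=666 is an arbitrary seed chosen for this sketch.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# max_iter is raised above the default as an assumption, to help convergence.
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)

# Area under the ROC curve, computed directly from labels and scores.
auc_value = roc_auc_score(y_test, decision_scores)

# The same area, obtained by integrating the (FPR, TPR) points.
fprs, tprs, thresholds = roc_curve(y_test, decision_scores)
auc_from_curve = auc(fprs, tprs)

print(auc_value, auc_from_curve)
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives, so AUC gives a single number for comparing models across all thresholds.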