0. Overview

Whether a model is good is relative: it depends not only on the algorithm and the data, but also on the requirements of the task.
- Performance measures for regression tasks
Mean squared error (MSE) is the most commonly used performance measure for regression tasks. It measures the average squared difference between predicted and actual values:

E(f; D) = (1/m) * Σ_{i=1}^{m} (f(x_i) - y_i)^2

where f(x_i) is the prediction for sample x_i and y_i is its true value.
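As a quick illustration (not from the original article), here is a minimal sketch of computing MSE both by hand and with scikit-learn's mean_squared_error; the arrays y_true and y_pred are made-up toy values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: true targets and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Manual computation: the mean of the squared residuals
mse_manual = np.mean((y_pred - y_true) ** 2)

# The same value via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both print 0.375
```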
- Performance measures for classification tasks
The confusion matrix of a binary classifier:

| Ground truth \ Prediction | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | TP (true positive) | FN (false negative) |
| Actual negative | FP (false positive) | TN (true negative) |

- Accuracy: the proportion of correctly classified samples among all samples, (TP + TN) / total number of samples.
- Precision: the proportion of predicted positives that are actually positive, TP / (TP + FP).
- Recall: the proportion of actual positives that are correctly predicted, TP / (TP + FN).

Many machine learning methods classify by producing a real-valued score (or probability estimate) for each test sample and comparing it against a threshold: samples scoring above the threshold are classified as positive, and those below as negative. We can therefore sort the samples by this score, i.e. by how likely each one is to be a positive example. Classification then amounts to choosing a "cut point" in this ordering and labelling everything before it positive and everything after it negative.

The ROC curve evaluates the model by trying each score in turn as the threshold. It plots the true positive rate TPR = TP / (TP + FN) on the vertical axis against the false positive rate FPR = FP / (TN + FP) on the horizontal axis. The denominators of TPR and FPR are the actual numbers of positive and negative samples, which are fixed, so the two rates are proportional to TP and FP respectively.

In an example ROC plot, the point (0, 1) corresponds to the ideal model, in which every positive sample is ranked ahead of every negative sample, while a real model traces a curve between (0, 0) and (1, 1). At first the threshold is high and only the top few samples are predicted positive, so the number of false positives is 0 and the false positive rate is 0. As the threshold decreases, more samples are predicted positive, the numbers of true and false positives grow, and so do the true and false positive rates. Once the threshold is low enough that all remaining lower-ranked samples are actually negative, the true positive rate stops increasing and the false positive rate gradually rises to 1. The function roc_curve(y_true, y_score) in sklearn.metrics computes the FPR and TPR at each threshold.

AUC is the area under the ROC curve. It measures the quality of the ranking produced by the predictions and serves as a single summary indicator. When AUC = 1, the ROC curve is the polyline through (0, 0), (0, 1) and (1, 1), meaning every positive sample is ranked ahead of every negative sample. When AUC is close to 1/2, the ROC curve is close to the straight line from (0, 0) to (1, 1), meaning the ranking is close to completely random. AUC can be computed with auc(fpr, tpr) in sklearn.metrics.

Different kinds of errors can carry very different costs: diagnosing a sick patient as healthy means missed treatment, while diagnosing a healthy person as sick only costs an extra examination. The cost-sensitive error rate accounts for this by weighting FP and FN according to the cost of each error type before summing them and dividing by the number of samples.
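As an illustrative sketch (not part of the original article), the metrics above can be computed directly with scikit-learn; the label arrays below are made-up toy values, and the misclassification costs are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Confusion matrix: rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)

# A simple cost-sensitive error rate: weight FN and FP by assumed error costs
cost_fn, cost_fp = 10, 1  # assumed: a missed diagnosis costs far more than an extra examination
cost_sensitive_error = (cost_fn * fn + cost_fp * fp) / len(y_true)
print("cost-sensitive error:", cost_sensitive_error)
```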
- Code sample
Draw the ROC curve and calculate the AUC value
```python
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import the iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output (one column per class)
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Split the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

# Use an SVM to produce a score for each class
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Calculate the ROC curve and AUC for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    # roc_curve returns the FPR and TPR at every threshold; pass the i-th column of labels and scores
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Draw the ROC curve of class 1 and mark the AUC value
plt.figure()
lw = 2
plt.plot(fpr[1], tpr[1], color='darkorange', lw=lw,
         label='ROC curve (area = %0.4f)' % roc_auc[1])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
```
References:
- Machine Learning, Zhou Zhihua
- scikit-learn official documentation
Assignment: draw the ROC curve for the breast cancer data set and calculate the AUC value
```python
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Import the breast cancer data set
data = datasets.load_breast_cancer()
X = data.data
y = data.target

# Split the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
```
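One possible way to finish the assignment (a sketch of one solution, not the article's official answer), continuing from the split above: since the breast cancer target is already binary, an SVM's decision-function scores can be passed straight to roc_curve and auc.

```python
# Train a linear SVM and use its decision-function scores to rank the test samples
classifier = svm.SVC(kernel='linear', probability=True)
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute the FPR, TPR and the AUC value
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

# Draw the ROC curve and mark the AUC value
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC curve (area = %0.4f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic - breast cancer')
plt.legend(loc="lower right")
plt.show()
```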