Classification
Get the MNIST dataset
from sklearn.datasets import fetch_openml
import numpy as np
mnist = fetch_openml('mnist_784', version=1) # download MNIST from OpenML
mnist.keys()
Running results: the returned dictionary contains DESCR (a description of the dataset), data (an array with one row per instance and one column per feature), and target (an array of labels).
Get the data and labels
X, y = mnist['data'], mnist['target']
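As a quick sanity check (a minimal sketch; these are the standard shapes for mnist_784):
print(X.shape) # (70000, 784): 70,000 images with 784 features (28x28 pixels) each
print(y.shape) # (70000,): one label per image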
import matplotlib.pyplot as plt
import matplotlib as mpl
some_digit = np.array(X)[0] # the 0th image, as a 784-element array
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
Display the 0th image
Data conversion and dataset split
Because the labels are strings, convert them to unsigned 8-bit integers
y = y.astype(np.uint8)
The MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Train a binary classifier
Build the binary target
Here the original 0-9 labels are converted into a binary target: 5 or not-5
y_train_5 = (y_train == 5) # True for 5s, False for all other digits
y_test_5 = (y_test == 5)
Stochastic gradient descent classifier
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42) # random_state=42 fixes the random seed for reproducible results
sgd_clf.fit(X_train, y_train_5) # training
sgd_clf.predict([some_digit]) # predict on some_digit (the image displayed above)
Running results:
Performance measures
Measuring accuracy with cross-validation
K-fold stratified sampling:
from sklearn.model_selection import StratifiedKFold # K-fold stratified sampling
from sklearn.base import clone
skfolds = StratifiedKFold(n_splits=3) # 3 folds
for train_index, test_index in skfolds.split(X_train, y_train_5): # stratified splits keep the 5 / non-5 ratio in every fold
    clone_clf = clone(sgd_clf) # clone an unfitted copy of sgd_clf (the stochastic gradient descent classifier)
    # this fold's training data
    X_train_folds = np.array(X_train)[train_index]
    y_train_folds = y_train_5[train_index]
    # this fold's validation data
    X_test_folds = np.array(X_train)[test_index]
    y_test_folds = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds) # train on the fold's training data
    y_pred = clone_clf.predict(X_test_folds) # predict on the fold's validation data
    n_correct = sum(y_pred == y_test_folds)
    print(n_correct / len(y_pred))
Running results:
Cross-validation:
from sklearn.model_selection import cross_val_score # cross validation
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Running results:
Dumb classifier
from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator): # dumb classifier
    def fit(self, X, y=None):
        return self # "training" does nothing at all
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool) # predicts "not 5" no matter what the input is
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Running results: the accuracy looks high, but since 5s account for only about one tenth of the data, a classifier that always predicts "not 5" is still right about 90% of the time. This apparently good performance is an illusion.
Confusion matrix
Confusion matrix of the stochastic gradient descent classifier
Computing the confusion matrix requires predictions to compare against the actual targets. The test set stays untouched for now, so cross_val_predict is used instead: unlike cross_val_score, it returns each instance's prediction from the fold in which that instance was in the validation set.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)
Running results:
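In Scikit-Learn's convention, rows are actual classes and columns are predicted classes, so for this binary task the matrix is laid out as:
[[TN, FP],
 [FN, TP]]
where the first row is the non-5 (negative) class and the second row is the 5 (positive) class.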
The confusion matrix of a perfect classifier
y_train_perfect_predictions = y_train_5 # pretend the predictions are perfect
confusion_matrix(y_train_5, y_train_perfect_predictions) # the confusion matrix of a perfect classifier: only the diagonal is nonzero
Running results:
Precision and recall
Precision = TP / (TP + FP): of the instances predicted positive, the fraction that are truly positive. TP is the number of true positives (positive instances classified as positive); FP is the number of false positives (negative instances classified as positive).
Recall = TP / (TP + FN): the fraction of all actual positives that the classifier detects. FN is the number of false negatives (positive instances classified as negative).
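For example (toy numbers, not from this dataset): with TP = 8, FP = 2 and FN = 4, precision = 8 / (8 + 2) = 0.8 and recall = 8 / (8 + 4) ≈ 0.67.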
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)
Running results:
recall_score(y_train_5, y_train_pred)
Running results: precision_score is the probability that an image the classifier flags as a 5 really is one, and recall_score is the fraction of the actual 5s that it detects.
Precision and recall can be combined into a single metric, the F1 score, defined as their harmonic mean. The harmonic mean gives much more weight to low values, so a classifier only gets a high F1 score when both precision and recall are high.
F1 = 2 / (1/precision + 1/recall) = 2 × precision × recall / (precision + recall) = TP / (TP + (FN + FP)/2)
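Continuing the toy numbers above: F1 = 2 × 0.8 × 0.67 / (0.8 + 0.67) ≈ 0.73, which matches TP / (TP + (FN + FP)/2) = 8 / (8 + (4 + 2)/2) = 8/11.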
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
Running results: F1 favors classifiers whose precision and recall are similar.
The precision/recall trade-off
Raising the decision threshold increases precision but lowers recall; lowering the threshold increases recall but reduces precision.
y_scores = sgd_clf.decision_function([some_digit]) # the raw decision score instead of a class prediction
y_scores
Running results:
When the threshold is 0
threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
Running results:
The threshold is 8000
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
Running results:
Plotting how the threshold affects precision and recall
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function") # return decision scores instead of predictions
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlim(-45000, 45000)
    plt.ylim(0, 1)
    plt.legend()
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
Determining the threshold
Suppose you want at least 90% precision; first find the index of the corresponding threshold
np.argmax(precisions >= 0.90) # index of the first threshold whose precision reaches 90% (argmax returns the first True)
Running results:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
threshold_90_precision # the lowest threshold that gives at least 90% precision
y_train_pred_90 = (y_scores >= threshold_90_precision)
precision_score(y_train_5, y_train_pred_90)
recall_score(y_train_5, y_train_pred_90)
plt.plot(recalls, precisions) # the precision/recall (PR) curve
plt.show()
The ROC curve
The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR). The FPR is the fraction of negative instances incorrectly classified as positive, and equals 1 − TNR, where TNR is the true negative rate.
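In symbols: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). With the toy numbers used earlier plus, say, TN = 86: TPR = 8 / 12 ≈ 0.67 and FPR = 2 / (2 + 86) ≈ 0.023.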
from sklearn.metrics import roc_curve # computes TPR and FPR for many thresholds
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # the diagonal: a purely random classifier
plot_roc_curve(fpr, tpr)
plt.show()
There is a trade-off here too: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dashed line is the ROC curve of a purely random classifier; a good classifier stays as far from it as possible, toward the top-left corner.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores) # one way to compare classifiers is the area under the curve (AUC): a perfect classifier has ROC AUC = 1, a purely random one has ROC AUC = 0.5
Since the ROC curve can look very similar to the precision/recall (PR) curve, a rule of thumb: prefer the PR curve when the positive class is rare or when false positives matter more than false negatives, and the ROC curve otherwise.
Here the positive class (5s) is small compared with the negative class (non-5s), and the PR curve shows clearly that the classifier still has room for improvement.
Now train a random forest classifier and compare its ROC curve and ROC AUC score with those of the stochastic gradient descent classifier.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba") # random forests have no decision_function, so use predict_proba
roc_curve expects labels and scores; here the predicted probability of the positive class is used directly as the score.
y_probas_forest
y_score_forest = y_probas_forest[:, 1] # probability of the positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_score_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
Comparing the ROC curves, the random forest beats stochastic gradient descent.
roc_auc_score(y_train_5, y_score_forest)
Measure precision and recall
precision_score(y_train_5, y_score_forest > 0.5) # threshold the probabilities at 0.5 to turn them into label predictions
recall_score(y_train_5, y_score_forest > 0.5)
Multiclass classifier
OvR and OvO
OvR strategy: one-versus-the-rest (one binary classifier per class); OvO strategy: one-versus-one (one binary classifier per pair of classes)
When Scikit-Learn detects that a binary classification algorithm is being used for a multiclass task, it automatically runs OvR or OvO as appropriate. Here the sklearn.svm.SVC class is used to try an SVM classifier.
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train) # trained on the full 0-9 labels: a 10-class task, not the binary 5/non-5 one
svm_clf.predict([some_digit])
Internally, Scikit-Learn actually trained 45 binary classifiers (OvO: one per pair of the 10 classes). To verify this, call decision_function(), which returns 10 scores, one per class.
some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores
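The class with the highest score wins. A quick check (a minimal sketch using standard estimator attributes):
np.argmax(some_digit_scores) # index of the highest score
svm_clf.classes_ # the list of target classes, stored by every fitted classifier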
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC()) # Enforce OvR
ovr_clf.fit(X_train, y_train)
ovr_clf.predict([some_digit])
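Similarly, OvO can be forced with OneVsOneClassifier (a sketch, symmetric to the OvR code above; for 10 classes it trains N(N-1)/2 = 45 estimators):
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SVC()) # enforce OvO
ovo_clf.fit(X_train, y_train)
len(ovo_clf.estimators_) # 45 underlying binary classifiers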
Stochastic gradient descent and random forest
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
sgd_clf.decision_function([some_digit])
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # feature scaling
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64)) # fit_transform combines fit and transform: it learns each feature's mean and standard deviation, then scales the data (no model training happens here)
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
Error analysis
Suppose you have a promising model and want to improve it. One way is to analyze the types of errors it makes, i.e. why instances are misclassified.
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
Because the raw counts are hard to read, use matshow (which renders a matrix as an image; not quite the same thing as a heatmap) to get a graphical view of the confusion matrix.
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
Most images sit on the main diagonal, meaning they are mostly classified correctly (a good classifier has a bright diagonal).
row_sums = conf_mx.sum(axis=1, keepdims=True) # sum each row: the number of images in each actual class
norm_conf_mx = conf_mx / row_sums # convert counts into per-class error rates
Fill the diagonal with zeros to keep only the errors, then redraw (this removes the bright diagonal so the misclassifications stand out)
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
The column for class 8 is bright, so many images are wrongly classified as 8s; later optimization could target 8 (the book suggests collecting more training data that looks like 8s, or engineering features, e.g. an algorithm that counts closed loops).
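As one illustration of getting more training data, here is a minimal sketch of augmentation by shifting images (an assumption for illustration; shift_image is a hypothetical helper, not part of the article's code):
from scipy.ndimage import shift
def shift_image(image_1d, dx, dy): # hypothetical helper: shift a flattened 28x28 image
    image = image_1d.reshape(28, 28)
    shifted = shift(image, [dy, dx], cval=0) # fill exposed pixels with black
    return shifted.reshape(-1)
augmented_digit = shift_image(some_digit, 1, 0) # the 0th image, shifted one pixel to the right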
Misclassification between 3s and 5s
def plot_digits(instances, images_per_row=10, **options): # adapted from a function posted at https://github.com/ageron/handson-ml/issues/257
    size = 28
    images_per_row = max(1, min(len(instances), images_per_row)) # guard against a zero divisor
    # instances is a DataFrame here, so convert each row to an array before reshaping
    images = [np.array(instances.iloc[i]).reshape(size, size) for i in range(instances.shape[0])]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty))) # pad the last row with blank space
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=plt.cm.binary, **options)
    plt.axis("off")
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)] # 3s correctly classified as 3
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)] # 3s misclassified as 5
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)] # 5s correctly classified as 5
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)] # 5s misclassified as 3
plt.figure(figsize=(8, 8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_bb[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_ba[:25], images_per_row=5)
plt.show()
Multilabel classification
A classifier that outputs multiple labels per instance (the previous classifiers all output a single label)
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbors algorithm
y_train_large = (y_train >= 7) # call digits >= 7 "large"
y_train_odd = (y_train % 2 == 1) # odd digits
y_multilabel = np.c_[y_train_large, y_train_odd] # combine the two labels into a multilabel target
print(y_multilabel.shape)
print(y_multilabel)
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
The trained model predicts two labels for each image: whether the digit is large and whether it is odd. (This approach is not always recommended, since it increases the model's workload and can hurt accuracy; often it is better to predict the digit first, then derive whether it is large or odd.)
knn_clf.predict([some_digit])
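Since the 0th image is a 5 (not >= 7, and odd), the expected output is array([[False, True]]).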
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
# average="macro" computes the F1 score of each label and averages them, assuming all labels are equally important; average="weighted" instead gives each label a weight equal to its support
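For comparison, the support-weighted variant (same call, different average):
f1_score(y_multilabel, y_train_knn_pred, average="weighted") # weight each label's F1 by its number of positive instances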
Multioutput classification
This is a generalization of multilabel classification in which each label can itself be multiclass (have more than two possible values). Image denoising is used below to illustrate multioutput classification.
noise = np.random.randint(0, 100, (len(X_train), 784)) # generate noise (training set)
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784)) # generate noise (test set)
X_test_mod = X_test + noise
y_train_mod = X_train # the targets are the original, clean images
y_test_mod = X_test
some_index = 1 # pick an example index
plt.imshow(np.array(X_train_mod[some_index-1:some_index]).reshape((28, 28)), cmap="binary") # the noisy image
plt.axis("off")
plt.show()
plt.imshow(np.array(y_train_mod[some_index-1:some_index]).reshape((28, 28)), cmap="binary") # the clean target image
plt.axis("off")
plt.show()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict(X_test_mod[some_index-1:some_index])
plt.imshow(np.array(X_test_mod[some_index-1:some_index]).reshape(28, 28), cmap="binary") # the noisy input image
plt.axis("off")
plt.show()
plt.imshow(np.array(clean_digit).reshape(28, 28), cmap="binary") # the denoised output
plt.axis("off")
plt.show()