Machine Learning for Diabetes with Python
- By Susan Li
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: EmilyQiRabbit
- Proofreaders: luochen1992, zhmhhu
According to the Centers for Disease Control and Prevention, about one in seven adults in the United States today has diabetes, and by 2050 that proportion could soar to more than one in three. With that in mind, today we are going to learn how to use machine learning to help predict diabetes. Let's get started!
Data
The diabetes dataset comes from the UCI Machine Learning Repository and can be downloaded here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
diabetes = pd.read_csv('diabetes.csv')
print(diabetes.columns)
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
diabetes.head()
The diabetes dataset contains 768 data points, and each data point contains 9 features:
print("dimension of diabetes data: {}".format(diabetes.shape))
dimension of diabetes data: (768, 9)
The "Outcome" column is the feature we are going to predict: 0 means non-diabetic and 1 means diabetic. Of these 768 data points, 500 are labeled 0 and 268 are labeled 1:
print(diabetes.groupby('Outcome').size())
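A simple baseline worth noting before modeling (an added observation, implied by the counts above): always predicting the majority class 0 would already be right about 65% of the time (500/768 ≈ 0.651), so any useful model should beat that.
# hypothetical sanity check, assuming the diabetes DataFrame loaded above
baseline = (diabetes['Outcome'] == 0).mean()
print("Majority-class baseline accuracy: {:.3f}".format(baseline))  # ~0.651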
import seaborn as sns
sns.countplot(diabetes['Outcome'],label="Count")
The following figure is obtained:
diabetes.info()
K-nearest neighbors
The k-nearest neighbors algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training data set. To make a prediction for a new data point, the algorithm finds the closest data points in the training data set (its "nearest neighbors").
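To make the idea concrete, here is a tiny toy sketch (an illustration with made-up one-dimensional data, not part of the original analysis):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: class 0 clusters near 1-3, class 1 near 10-12
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
print(knn_toy.predict([[2.5]]))   # [0] - its three nearest neighbors all have label 0
print(knn_toy.predict([[10.5]]))  # [1]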
First, let's investigate whether we can confirm the connection between model complexity and accuracy:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.loc[:, diabetes.columns != 'Outcome'], diabetes['Outcome'], stratify=diabetes['Outcome'], random_state=66)
from sklearn.neighbors import KNeighborsClassifier
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(knn.score(X_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')
The following figure is obtained:
The plot shows the training and test set accuracy on the y-axis against the number of neighbors (n_neighbors) on the x-axis. If we choose a single nearest neighbor, the prediction on the training set is perfect. But as more neighbors are considered, the training accuracy drops, which indicates that the one-neighbor model is too complex. The best performance occurs at around nine neighbors.
The plot suggests we should choose n_neighbors=9. So here it is:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test)))
Accuracy of K-NN classifier on training set: 0.79
Accuracy of K-NN classifier on test set: 0.78
Logistic regression
Logistic regression is one of the most commonly used classification algorithms.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression().fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
Training set score: 0.781
Test set score: 0.771
The default of C=1 provides 78% accuracy on the training set and 77% accuracy on the test set.
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set accuracy: {:.3f}".format(logreg001.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(logreg001.score(X_test, y_test)))
Training set accuracy: 0.700
Test set accuracy: 0.703
Using C=0.01 results in lower accuracy on both the training set and the test set.
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
print("Training set accuracy: {:.3f}".format(logreg100.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(logreg100.score(X_test, y_test)))
Training set accuracy: 0.785
Test set accuracy: 0.766
Using C=100 results in slightly higher accuracy on the training set but slightly lower accuracy on the test set, confirming that a less regularized, more complex model may not generalize better than the default setting.
So we should stick with the default, C=1.
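As a side note beyond the original post, a more systematic way to choose C is a cross-validated grid search. A minimal sketch, assuming the X_train and y_train defined above and an illustrative, untuned grid:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}  # illustrative values, not tuned
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best C: {}".format(grid.best_params_['C']))
print("Best cross-validation accuracy: {:.3f}".format(grid.best_score_))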
Let's visualize the coefficients learned by the models with three different settings of the regularization parameter C.
Stronger regularization (C=0.01) pushes the coefficients closer and closer to zero. Inspecting the plot more closely, we can also see that the coefficient for the feature "DiabetesPedigreeFunction" is positive for C=100, C=1, and C=0.01. This means that, whichever model we look at, a high "DiabetesPedigreeFunction" value is associated with a diabetic sample.
diabetes_features = [x for i, x in enumerate(diabetes.columns) if i != 8]
plt.figure(figsize=(8, 6))
plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")
plt.xticks(range(diabetes.shape[1]), diabetes_features, rotation=90)
plt.hlines(0, 0, diabetes.shape[1])
plt.ylim(-5, 5)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.savefig('log_coef')
The following figure is obtained:
Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.714
The accuracy on the training set is 100%, while the test set accuracy is much worse. This is a sign that the tree is overfitting and will not generalize well to new data. Therefore, we need to prune the tree.
We set max_depth=3, limiting the depth of the tree to reduce overfitting. This leads to lower accuracy on the training set but an improvement on the test set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
Accuracy on training set: 0.773
Accuracy on test set: 0.740
Feature importance in decision trees
Feature importance rates how important each feature is for the decision the tree makes. It is a number between 0 and 1 for each feature, where 0 means "not used at all" and 1 means "perfectly predicts the target". The feature importances always sum to 1.
print("Feature importances:\n{}".format(tree.feature_importances_))
Feature importances: [0.04554275 0.6830362 0. 0. 0. 0.27142106 0. 0.]
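As a quick sanity check (an addition, assuming the fitted tree above), we can verify both properties:
import numpy as np
print(np.all(tree.feature_importances_ >= 0))            # True: importances are non-negative
print(np.isclose(tree.feature_importances_.sum(), 1.0))  # True: importances sum to 1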
Then we can visualize the feature importances:
def plot_feature_importances_diabetes(model):
    plt.figure(figsize=(8, 6))
    n_features = 8
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), diabetes_features)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)
plot_feature_importances_diabetes(tree)
plt.savefig('feature_importance')
The following figure is obtained:
The feature "Glucose" is by far the most important feature.
Random forests
Let’s apply a random forest of 100 trees to the diabetes data set:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(rf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(rf.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.786
The random forest, without any tuning, gives an accuracy of 78.6%, better than the logistic regression model or a single decision tree. However, we can still tweak the max_depth setting to see whether the result improves.
rf1 = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)
rf1.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(rf1.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(rf1.score(X_test, y_test)))
Accuracy on training set: 0.800
Accuracy on test set: 0.755
It does not, which means that the default parameters of the random forest already work well.
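One caveat worth adding: a single train/test split can be noisy, and cross-validation gives a more stable estimate. A minimal sketch, assuming the diabetes DataFrame from above:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = diabetes.loc[:, diabetes.columns != 'Outcome']
y = diabetes['Outcome']
rf_cv = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf_cv, X, y, cv=5)
print("Cross-validation accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))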
Feature importance in random forests
plot_feature_importances_diabetes(rf)
The following figure is obtained:
![](https://datascienceplus.com/wp-content/uploads/2018/03/diabetes_8.png)
Similarly to the single decision tree, the random forest also gives "Glucose" a high importance, but it additionally chooses "BMI" as the second most informative feature. The randomness in building the random forest forces the algorithm to consider many possible explanations, with the result that the random forest captures a much broader picture of the data than a single tree.
Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(random_state=0)
gb.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))
Accuracy on training set: 0.917
Accuracy on test set: 0.792
The model is likely to be overfitting. To reduce overfitting, we can apply stronger pre-pruning by limiting the maximum depth, or we can lower the learning rate:
gb1 = GradientBoostingClassifier(random_state=0, max_depth=1)
gb1.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gb1.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb1.score(X_test, y_test)))
Accuracy on training set: 0.804
Accuracy on test set: 0.781
gb2 = GradientBoostingClassifier(random_state=0, learning_rate=0.01).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gb2.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb2.score(X_test, y_test)))
Accuracy on training set: 0.802
Accuracy on test set: 0.776
Both methods of reducing model complexity lowered the training set accuracy, as expected. In this case, however, neither of them improved generalization on the test set.
We can visualize the feature importances to get more insight into our model, even though we are not really happy with it:
plot_feature_importances_diabetes(gb1)
The following figure is obtained:
We can see that the feature importances of the gradient boosted trees are somewhat similar to those of the random forest; in this case, it gives weight to all of the features.
Support vector machine
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
Accuracy on training set: 1.00
Accuracy on test set: 0.65
The model overfits substantially, with a perfect score on the training set and only 65% accuracy on the test set.
SVMs (support vector machines) require all features to vary on a similar scale. We need to rescale the data so that all features are approximately on the same scale:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# fit the scaler on the training data only, then apply the same transform to the test data
X_test_scaled = scaler.transform(X_test)
svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test_scaled, y_test)))
Accuracy on training set: 0.77
Accuracy on test set: 0.77
Scaling the data made a huge difference! Now the model is actually underfitting: training and test set performance are similar, but both are further from 100% accuracy. From here, we can try increasing C or gamma to fit a more complex model.
svc = SVC(C=1000)
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
Accuracy on training set: 0.790
Accuracy on test set: 0.797
Here, increasing C improves the model, bringing the test set accuracy to 79.7%.
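A note beyond the original post: to keep the scaling step coupled to the model (and make it impossible to accidentally fit the scaler on the test data), scikit-learn's Pipeline can bundle the two steps. A minimal sketch under the same train/test split:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# the pipeline fits the scaler on the training data only, then scales and fits the SVC
svc_pipe = make_pipeline(MinMaxScaler(), SVC(C=1000))
svc_pipe.fit(X_train, y_train)
print("Accuracy on test set: {:.3f}".format(svc_pipe.score(X_test, y_test)))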
Deep learning
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))
Accuracy on training set: 0.71
Accuracy on test set: 0.67
The accuracy of the multilayer perceptron (MLP) is not nearly as good as that of the other models, probably because of the scaling of the data. Neural networks also expect all input features to be scaled similarly, ideally with a mean of 0 and a variance of 1. We must rescale our data to meet these requirements.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# as before, fit the scaler on the training data only and reuse it for the test data
X_test_scaled = scaler.transform(X_test)
mlp = MLPClassifier(random_state=0)
mlp.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))
Accuracy on training set: 0.823
Accuracy on test set: 0.802
Let’s increase the number of iterations:
mlp = MLPClassifier(max_iter=1000, random_state=0)
mlp.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))
Accuracy on training set: 0.877
Accuracy on test set: 0.755
Increasing the number of iterations only improved the performance on the training set, not on the test set.
Let's increase the alpha parameter to add stronger regularization of the weights:
mlp = MLPClassifier(max_iter=1000, alpha=1, random_state=0)
mlp.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))
Accuracy on training set: 0.795
Accuracy on test set: 0.792
The result is good, but we are not able to improve the test set accuracy any further.
Therefore, our best model so far is the default neural network after scaling the data.
Finally, let's plot a heat map of the first-layer weights of the neural network learned on the diabetes data set.
plt.figure(figsize=(20, 5))
plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis')
plt.yticks(range(8), diabetes_features)
plt.xlabel("Columns in weight matrix")
plt.ylabel("Input feature")
plt.colorbar()
The following figure is obtained:
From the heat map, it is not easy to quickly tell which features have relatively low weights compared with the others.
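One way to summarize the heat map numerically (an addition, assuming the fitted mlp and the diabetes_features list from above) is to average the absolute first-layer weights per input feature:
import numpy as np
# mlp.coefs_[0] has shape (n_input_features, n_hidden_units)
avg_weight = np.abs(mlp.coefs_[0]).mean(axis=1)
for name, weight in zip(diabetes_features, avg_weight):
    print("{}: {:.3f}".format(name, weight))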
Conclusion
We practiced a variety of machine learning models for classification and regression, learned their strengths and weaknesses, and saw how to control model complexity for each of them. We found that for many algorithms, setting the right parameters is critical for good performance.
We should now know how to apply, tune, and analyze the models we tested above. Now it's your turn! Try applying any of these algorithms to the built-in data sets in scikit-learn or to any data set of your choice. Have fun with machine learning!
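As a starting point, here is a minimal sketch on scikit-learn's built-in breast cancer data set (an illustration, not from the original post):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)
model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("Test set accuracy: {:.3f}".format(model.score(X_te, y_te)))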
The source code for this blog post can be found here. I would love to receive your feedback and questions regarding the above.
Reference: Introduction to Machine Learning with Python