
Author: Tang Tang, PyTorch for Practical Computer Vision in Deep Learning

Blog: www.zhihu.com/people/Jaim…

There are many common, procedural skills in machine learning practice. Here we use a simple data set to demonstrate practical skills for handling and processing data, mainly covering:

  • Data preview
  • Data visualization
  • Find the optimal model
  • Using a Pipeline
  • Model tuning

The data set we use can be loaded directly from scikit-learn:

from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = datasets.load_iris()
col_name = data.feature_names
X = data.data
y = data.target

Data preview

When we first get the data, we need a general understanding of it: its distribution, types, dimensions, and so on. Only after forming a general impression of the data's structure can we go deeper into the analysis.

First, the two most common ways to preview data:

X = pd.DataFrame(X)
X.head(n=10)
X.sample(n=10)

The difference between them is that the former returns the first ten rows of the data, while the latter returns ten randomly sampled rows. The results look like this:

The first ten rows, extracted in order

Ten randomly sampled rows

View data dimensions:

X.shape

View the types of data features:

X.dtypes

View statistical summary of data:

X.describe()

These commands are used all the time, and they are very helpful for getting an overall picture of the data being analyzed.

Data visualization

A data preview gives us a basic understanding of the data; through data visualization we can go further and discover outliers, relationships between features, and how the data is distributed.

First, we can draw a box plot:

X.plot(kind='box')
plt.show()

Box plot

A box plot makes it easy to spot outliers hidden in the data.
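
If we want to go beyond eyeballing the plot, the 1.5 × IQR rule that the box plot whiskers are based on can be applied directly. The sketch below (the 1.5 threshold is just the conventional default) counts how many values each feature has outside the whiskers:

# Count potential outliers per feature using the 1.5 * IQR rule
q1 = X.quantile(0.25)
q3 = X.quantile(0.75)
iqr = q3 - q1
outlier_mask = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
print(outlier_mask.sum())  # number of flagged values in each column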

Draw histograms:

X.hist(figsize=(12, 5), xlabelsize=1, ylabelsize=1)
plt.show()

Histograms

The histograms clearly show how each feature is distributed, for example whether it is roughly normal.
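
As an optional numeric complement to the histograms, pandas can also report how far each feature is from a normal shape:

# Skewness near 0 and excess kurtosis near 0 suggest a roughly normal distribution
print(X.skew())
print(X.kurtosis())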

Draw density plots:

X.plot(kind="density", subplots=True, layout=(1, 4), figsize=(12, 5))
plt.show()

Density plots

The density plots show the distribution of each feature's values as a smooth density estimate.

Draw a feature scatter matrix:

pd.plotting.scatter_matrix(X, figsize=(10, 10))
plt.show()

Feature scatter matrix

From the scatter matrix we can see which features are obviously correlated with each other.

Draw a heat map of the feature correlations:

import numpy as np

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
cax = ax.matshow(X.corr(), vmin=-1, vmax=1, interpolation="none")
fig.colorbar(cax)
ticks = np.arange(0, 4, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(col_name)
ax.set_yticklabels(col_name)
plt.show()

Feature correlation heat map

This heat map conveys much the same information as the scatter matrix, but it is easier to read at a glance and the correlations are quantified.
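
If the exact numbers are needed rather than a picture, the quantified correlations behind the heat map can simply be printed from the correlation matrix:

# Pearson correlation matrix used by the heat map, rounded for readability
print(X.corr().round(2))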

Find the optimal model

When we do not know at the outset which model to use, the simplest idea is to start with several models using their default parameters, for example decision-tree-based ensembles, SVM, logistic regression and so on, run a round of training and evaluation for each, and then select the best one for tuning.

The code is as follows:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

models = []
models.append(("AB", AdaBoostClassifier()))
models.append(("GBM", GradientBoostingClassifier()))
models.append(("RF", RandomForestClassifier()))
models.append(("ET", ExtraTreesClassifier()))
models.append(("SVC", SVC()))
models.append(("KNN", KNeighborsClassifier()))
models.append(("LR", LogisticRegression()))
models.append(("GNB", GaussianNB()))
models.append(("LDA", LinearDiscriminantAnalysis()))

# Evaluate each model with 5-fold cross-validation
names = []
results = []
for name, model in models:
    # shuffle so that random_state takes effect (required by recent scikit-learn)
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    result = cross_val_score(model, X, y, scoring="accuracy", cv=kfold)
    names.append(name)
    results.append(result)
    print("{} Mean:{:.4f}(Std{:.4f})".format(name, result.mean(), result.std()))

The final result is as follows:

AB Mean:0.9133(Std0.0980)
GBM Mean:0.9133(Std0.0980)
RF Mean:0.9067(Std0.0998)
ET Mean:0.8933(Std0.1083)
SVC Mean:0.9333(Std0.0699)
KNN Mean:0.9133(Std0.0833)
LR Mean:0.7533(Std0.2621)
GNB Mean:0.9467(Std0.0340)
LDA Mean:0.9600(Std0.0490)

It can be seen that the linear discriminant analysis model achieves the highest classification accuracy here. On a real problem we would most likely pick this model for further tuning.
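
A common companion step at this point is to compare the cross-validation scores of all candidates visually, not just by their means. The sketch below reuses the results and names lists collected in the loop above (assuming they are still in scope) to draw one box per model:

# Box plot of cross-validation accuracy for each candidate model
plt.figure(figsize=(12, 5))
plt.boxplot(results)
plt.xticks(range(1, len(names) + 1), names)
plt.title("Cross-validation accuracy by model")
plt.show()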

Using a Pipeline

One thing to add: the value ranges of the features used here are already fairly uniform. If the data in your problem needs to be scaled or normalized, the Pipeline class in scikit-learn can chain that preprocessing together with the model.

The code is as follows:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipelines = []
pipelines.append(("ScalerET", Pipeline([("Scaler", StandardScaler()),
                                        ("ET", ExtraTreesClassifier())])))
pipelines.append(("ScalerGBM", Pipeline([("Scaler", StandardScaler()),
                                         ("GBM", GradientBoostingClassifier())])))
pipelines.append(("ScalerRF", Pipeline([("Scaler", StandardScaler()),
                                        ("RF", RandomForestClassifier())])))

names = []
results = []
for name, model in pipelines:
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    result = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results.append(result)
    names.append(name)
    print("{}: Error Mean:{:.4f} (Error Std:{:.4f})".format(
        name, result.mean(), result.std()))

Output result:

ScalerET: Error Mean:0.9133 (Error Std:0.0884)
ScalerGBM: Error Mean:0.9133 (Error Std:0.0980)
ScalerRF: Error Mean:0.9133 (Error Std:0.0884)

Of course, Pipeline is not limited to this: besides the scaling step, any other data-processing step can be chained in the same way.
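
As one illustration (the extra PCA step and its number of components are only an example, not part of the original experiment), a Pipeline can chain scaling, dimensionality reduction and a classifier, and the whole chain is evaluated with the same cross-validation call, reusing the imports from the blocks above:

from sklearn.decomposition import PCA

# Example only: scale -> reduce to 2 principal components -> classify
extended = Pipeline([("Scaler", StandardScaler()),
                     ("PCA", PCA(n_components=2)),
                     ("RF", RandomForestClassifier())])
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
result = cross_val_score(extended, X, y, cv=kfold, scoring="accuracy")
print("ScalerPCARF: Mean:{:.4f} (Std:{:.4f})".format(result.mean(), result.std()))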

Model tuning

Next comes model adjustment, that is, model tuning. Once we have picked the best model, further improving its generalization ability requires adjusting its parameters. A common method is grid search, which evaluates batches of parameter combinations and keeps the one that performs best. Let’s take the SVM we used earlier as an example.

The code is as follows:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0],
    "kernel": ["linear", "poly", "rbf", "sigmoid"]
}
model = SVC()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring="accuracy", cv=kfold)
grid_result = grid.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

The final result:

Best: 0.946667 using {'C': 1.0, 'kernel': 'linear'}
0.886667 (0.125786) with: {'C': 0.1, 'kernel': 'linear'}
0.926667 (0.090431) with: {'C': 0.1, 'kernel': 'poly'}
0.660000 (0.337573) with: {'C': 0.1, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 0.1, 'kernel': 'sigmoid'}
0.926667 (0.067987) with: {'C': 0.3, 'kernel': 'linear'}
0.933333 (0.081650) with: {'C': 0.3, 'kernel': 'poly'}
0.893333 (0.114310) with: {'C': 0.3, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 0.3, 'kernel': 'sigmoid'}
0.933333 (0.069921) with: {'C': 0.5, 'kernel': 'linear'}
0.926667 (0.090431) with: {'C': 0.5, 'kernel': 'poly'}
0.906667 (0.099778) with: {'C': 0.5, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 0.5, 'kernel': 'sigmoid'}
0.933333 (0.069921) with: {'C': 0.7, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 0.7, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 0.7, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 0.7, 'kernel': 'sigmoid'}
0.940000 (0.061101) with: {'C': 0.9, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 0.9, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 0.9, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 0.9, 'kernel': 'sigmoid'}
0.946667 (0.065320) with: {'C': 1.0, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 1.0, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 1.0, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 1.0, 'kernel': 'sigmoid'}
0.946667 (0.065320) with: {'C': 1.3, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 1.3, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 1.3, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 1.3, 'kernel': 'sigmoid'}
0.940000 (0.074237) with: {'C': 1.5, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 1.5, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 1.5, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 1.5, 'kernel': 'sigmoid'}
0.933333 (0.069921) with: {'C': 1.7, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 1.7, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 1.7, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 1.7, 'kernel': 'sigmoid'}
0.940000 (0.074237) with: {'C': 2.0, 'kernel': 'linear'}
0.920000 (0.100222) with: {'C': 2.0, 'kernel': 'poly'}
0.933333 (0.069921) with: {'C': 2.0, 'kernel': 'rbf'}
0.000000 (0.000000) with: {'C': 2.0, 'kernel': 'sigmoid'}

Compared with the SVM using default parameters, the model obtained after grid search has further improved its accuracy. If time and computing power allow, even more parameter combinations can be tried.
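
When the parameter grid grows much larger than this, one alternative worth knowing (not used in the original experiment) is scikit-learn's RandomizedSearchCV, which samples a fixed number of combinations from the same search space instead of trying them all. A minimal sketch, reusing param_grid and kfold from above:

from sklearn.model_selection import RandomizedSearchCV

# Evaluate 10 randomly sampled combinations instead of the full grid
random_search = RandomizedSearchCV(estimator=SVC(), param_distributions=param_grid,
                                   n_iter=10, scoring="accuracy", cv=kfold,
                                   random_state=42)
random_result = random_search.fit(X, y)
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))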

Conclusion

Finally, two more steps matter a great deal in machine learning practice: feature engineering and model ensembling. They are the most closely tied to how far the final model's generalization ability can be pushed, but neither can be explained clearly in a few words; mastering them takes a large amount of exploration and accumulated experience of your own.
