Today I will give you a brief introduction to sklearn, divided into the following sections:
- Introduction
- Basic overview
- Hands-on example: data import and preprocessing
- Hands-on example: training a model
- Model selection
- Data splitting
- Hyperparameter search
- Feature selection
- Pipeline
- Summary
Introduction
Sklearn's full name is scikit-learn. It is a very powerful third-party machine learning library for Python, covering everything from data preprocessing to model training. Using scikit-learn in practice can greatly reduce the time and amount of code we write, freeing up more energy to analyze the data distribution, adjust models, and tune hyperparameters. (sklearn is the package name.)
Basic overview
Sklearn provides methods for both supervised and unsupervised learning; the supervised side is generally used more. Most of the classes in sklearn fall into two categories: estimators and transformers.
An estimator is a model used to classify or regress on data. An estimator basically has the following methods:
- fit(x, y): trains the model on the given data and labels; training time depends on the parameter settings, the size of the dataset, and the characteristics of the data itself.
- score(x, y): scores the accuracy of the model on the given data (range 0 to 1). However, depending on the problem, the criterion for judging a model is not limited to simple accuracy; it may be recall, precision, or other metrics. In particular, when the sample classes are imbalanced, accuracy is a poor measure of model quality, so do not rely on score alone when evaluating a model.
- predict(x): predicts labels for the input data and returns them as a numpy array. We usually use this method on the test set, and the returned predictions are then used to evaluate the model.
A transformer is used for data processing, such as standardization, dimensionality reduction and feature selection. Its usage is similar to that of an estimator:
- fit(x, y): takes the inputs (and optionally labels) and computes how the data should be transformed.
- transform(x): returns the result of applying the computed transformation to the input x (without changing x).
- fit_transform(x, y): computes the transformation from x and returns the transformed x in one step (equivalent to fit followed by transform).
This is just a brief overview of sklearn's interface; the basic usage of most sklearn classes follows this pattern. However, different estimators expose different attributes: for example, a random forest has feature_importances_ to measure the importance of features, while logistic regression has coef_ for the regression coefficients, intercept_ for the intercept, and so on. The quality of a machine learning model depends not only on which model you choose, but also, to a large extent, on your hyperparameter settings. So be sure to check the official documentation when tuning hyperparameters with sklearn.
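A minimal sketch of this interface is shown below; the choice of estimator and the max_iter value are just for illustration, and the coefficients printed at the end are the attributes mentioned above.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

x, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)   # an estimator; max_iter raised only to ensure convergence
clf.fit(x, y)                             # train on data and labels
print(clf.score(x, y))                    # mean accuracy on the given data
print(clf.predict(x[:5]))                 # predicted labels, returned as a numpy array
print(clf.coef_, clf.intercept_)          # fitted coefficients and intercept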
Hands-on example: data import and preprocessing
Sklearn's datasets module provides sample datasets for classification, regression and so on, which make it easy to get familiar with sklearn.
As the code below shows, we load the iris classification dataset. load_iris() returns a dictionary-like object whose contents can be retrieved by key.
from sklearn.datasets import load_iris

dataset = load_iris()
data = dataset['data']              # feature matrix
label = dataset['target']           # labels
feature = dataset['feature_names']  # feature names
target = dataset['target_names']    # class names
print(target)
Now that we have loaded the data, the first thing to do is examine it. We can see that there are three class names in total, so this is a three-class classification problem. The next thing to look at is the features themselves; pandas' DataFrame is a good choice for that.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.column_stack((data, label)), columns=np.append(feature, 'label'))
df.head()  # preview the first few rows
We can see that each sample has four continuous features. Next we should check the proportion of missing values in the dataset. This step is very important, because missing values will cause problems when training the model.
df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df))  # proportion of missing values per column
Sklearn's preprocessing provides Imputer() to handle missing values, offering median, mean and most-frequent strategies to fill them in. However, filling is not always the right way to handle missing values, so be careful when you encounter them. Fortunately, there are no missing values in our dataset, which saves us that work. The next step is to check whether the classes in the dataset are balanced.
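As a rough sketch of what filling might look like (note: recent sklearn versions expose this as SimpleImputer in sklearn.impute, which replaces the older Imputer; the median strategy here is just an example):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')   # also accepts 'mean' or 'most_frequent'
data_filled = imputer.fit_transform(data)    # fills missing values column by column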
df['label'].value_counts()  # check the class proportions
Fortunately, our class proportions are exactly 1:1:1. If the class proportions were seriously imbalanced, we might need to correct for it, for example by oversampling or undersampling. The dataset now looks fine, but we should still standardize the data before training.
Before training the model, we need to preprocess the data. The preprocessing module in sklearn provides many classes for data standardization.
Standardizing the data not only speeds up model training; different standardization methods also bring different benefits.
from sklearn.preprocessing import StandardScaler

data = StandardScaler().fit_transform(data)  # z-score standardization
For example, z-score standardization converts the values of different features to the same scale, making them comparable. We used z-score standardization above; sklearn's preprocessing module offers other standardization methods as well. If you are interested, check the official documentation.
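As one example, here is a sketch with MinMaxScaler, which rescales each feature to a fixed range (the default [0, 1] is assumed here):

from sklearn.preprocessing import MinMaxScaler

data_minmax = MinMaxScaler().fit_transform(data)  # each feature rescaled to [0, 1]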
Hands-on example: training a model
After processing the data, we can train a model, using multinomial logistic regression as an example:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import classification_report

ss = ShuffleSplit(n_splits=1, test_size=0.2)  # one random 80/20 split (illustrative values)
for tr, te in ss.split(data, label):
    xr = data[tr]
    xe = data[te]
    yr = label[tr]
    ye = label[te]
    clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
    clf.fit(xr, yr)
    predict = clf.predict(xe)
    print(classification_report(ye, predict))
Note that the code above sets multi_class = 'multinomial', i.e. softmax regression; the other multiclass strategy sklearn offers for logistic regression is OvR (one-vs-rest).
OvR reduces multiclass logistic regression to binary logistic regression. Concretely, each time one class is chosen as the positive class and all remaining classes as the negative class, and a binary logistic regression model is trained for that class. This yields one binary model per class, and the final classification is the class whose model gives the highest score.
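If you want the OvR behaviour instead, one option (a sketch, reusing the split and the LogisticRegression import from the training example above) is to wrap the estimator in OneVsRestClassifier:

from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))  # one binary model per class
ovr_clf.fit(xr, yr)
print(ovr_clf.score(xe, ye))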
Model selection
For a classification task, we can choose a suitable model according to the figure above, but the choice of model is not absolute. In practice, you will often have to try many models and compare them to find the one that suits the problem.
Data splitting
We can split the dataset several times, using cross-validation or other splitting strategies, so that the reported performance is an average rather than the luck of a single split. Sklearn has many methods for splitting datasets, all of which live in model_selection:
- K-fold cross-validation:
  - KFold: standard k-fold cross-validation
  - StratifiedKFold: keeps the class proportions equal in every fold
- Leave-out methods:
  - LeaveOneOut: leave-one-out cross-validation
  - LeavePOut: leave-p-out (equivalent to leave-one-out when p = 1)
- Random split methods:
  - ShuffleSplit: randomly shuffled splits
  - StratifiedShuffleSplit: shuffled splits that keep the class proportions
All of the splitters above take the same parameters, except the leave-out methods:
- n_splits: the number of splits
- random_state: the random seed
Each of these splitting methods has its own advantages. The leave-out methods and k-fold cross-validation make full use of the data but are more expensive, while the random split methods are cheaper and make it easy to control the ratio of training set to test set (by setting train_size). For how to use a splitter, see the ShuffleSplit example above; the other splitters are used in the same way. See the official documentation for details.
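As a sketch, StratifiedKFold follows the same pattern as the ShuffleSplit example above, and cross_val_score can run the whole loop for you (this reuses the clf from the training example; the choice of five folds is just an example):

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(clf, data, label, cv=skf)  # one accuracy score per fold
print(scores.mean())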
Hyperparameter search
We obtained an initial model above, and its results look reasonably good, so now we should try to optimize it further.
We need to divide the data into three parts: a training set, a validation set and a test set. The training set is used to fit the model, the validation set results are used to tune it toward better generalization, and the test set finally estimates the model's generalization ability. If you only split the data into training and test sets, you may end up tuning the parameters that work best on that particular test set, and the model's real generalization ability may be weaker than its test-set score suggests.
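A minimal sketch of such a three-way split, using train_test_split twice (the 60/20/20 ratio is just an example, not something we actually apply to iris below):

from sklearn.model_selection import train_test_split

# carve out the test set first, then split the rest into training and validation sets
x_rest, x_test, y_rest, y_test = train_test_split(data, label, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25)  # 0.25 of 80% = 20%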
Since the iris dataset is not large, splitting it into three parts would leave too little training data and might not improve the model, so here we only give a brief introduction to hyperparameter tuning in sklearn.
model_selection also provides functions for automatic hyperparameter tuning, such as GridSearchCV.
from sklearn.model_selection import GridSearchCV

parameters = {'C': [0.01, 0.1, 1, 10]}  # example parameter grid
clf = LogisticRegression()
gs = GridSearchCV(clf, parameters)
gs.fit(data, label)
gs.best_params_
By passing in a dictionary of candidate values and comparing the estimator's performance with each combination, GridSearchCV finds the best parameters. Here the logistic regression is tuned for accuracy. We can also select parameters using different metrics; the available metrics are in sklearn.metrics.
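For example, here is a sketch that scores the search with macro-averaged F1 instead of accuracy, reusing the parameters grid from above (the five-fold cv is illustrative):

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(LogisticRegression(solver='lbfgs', multi_class='multinomial'),
                  parameters, scoring='f1_macro', cv=5)
gs.fit(data, label)
print(gs.best_params_, gs.best_score_)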
Feature selection
When there are too many or redundant features, feature selection not only speeds up training but can also remove the interference of harmful features. Sklearn's feature_selection module provides many feature selection functions, including univariate selection methods and recursive feature elimination. They are all transformers, so they follow the transformer interface described earlier.
Besides selecting features with feature_selection, we can also use models that perform feature selection themselves; for example, a random forest scores features by their importance. A short sketch of both approaches follows.
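An illustrative sketch of both routes, univariate selection and model-based selection (the k value and the random forest settings are just examples):

from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# univariate selection: keep the k features with the best ANOVA F-scores
data_uni = SelectKBest(f_classif, k=2).fit_transform(data, label)

# model-based selection: keep the features a random forest considers important
data_model = SelectFromModel(RandomForestClassifier(n_estimators=100)).fit_transform(data, label)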
Pipeline
Using Pipeline, you can chain the entire process from data processing to model training in sequence. The intermediate steps of a Pipeline must be transformers (i.e. they process data). The advantage of using a Pipeline is that it encapsulates the whole learning process, making it easy to re-run; the steps are specified as a list of (name, step) tuples.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
clf = LogisticRegression()
new_clf = Pipeline([('pca', pca), ('clf', clf)])
The composite estimator above will first use PCA to reduce the data to two dimensions and then fit a logistic regression on the result.
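The combined estimator can then be used like any other; here is a minimal usage sketch, reusing the iris data from above:

new_clf.fit(data, label)          # runs PCA, then fits the logistic regression
print(new_clf.predict(data[:5]))  # new samples pass through the same PCA before prediction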
Summary
This post is only a brief introduction to sklearn; some topics, such as feature extraction and dimensionality reduction, are not covered here, but they are described in detail in the official documentation.