Contents:
Why this tutorial?
1. Basic use case: train and test a classifier
2. More advanced use cases: preprocess data before training and testing classifiers
2.1 Standardize your data
2.2 Incorrect preprocessing patterns
2.3 Keep it simple, stupid: use scikit-learn's pipeline connector
3. When more is better than less: cross-validation rather than a single split
4. Hyperparameter optimization: fine-tune the internals of the pipeline
5. Summary: my scikit-learn pipeline in fewer than 10 lines of code (excluding import statements)
6. Heterogeneous data: when you work with data other than numbers
Introduction:
This article is translated from the following GitHub repository. Translator: Light city
Github.com/glemaitre/p…
Click to read the original to get the translated source code and explanation!
Enable inline mode
In this tutorial we will draw several figures, so we activate matplotlib's inline mode to display figures directly in the notebook.
# enable matplotlib inline mode
%matplotlib inline
import matplotlib.pyplot as plt
Why this tutorial?
Scikit-learn provides state-of-the-art machine learning algorithms. However, these algorithms cannot be applied directly to raw data; the raw data need to be preprocessed first. So, in addition to machine learning algorithms, scikit-learn provides a set of preprocessing methods. Furthermore, scikit-learn provides connectors for chaining these estimators (that is, transformers, regressors, classifiers, clusterers, etc.) into pipelines. In this tutorial, we will present the scikit-learn functionality that lets you pipeline estimators, evaluate those pipelines, tune them with hyperparameter optimization, and build complex preprocessing steps.
1. Basic use case: Train and test the classifier
For the first example, we will train and test a classifier on a dataset. We will use this example to recall the scikit-learn API.
We will use the digits dataset, a dataset of handwritten digits.
from sklearn.datasets import load_digits

# return_X_y defaults to False, in which case a Bunch object is returned.
# Set it to True to get a (data, target) tuple instead.
X, y = load_digits(return_X_y=True)
Each row of X contains the intensities of the 64 pixels of an image. For each sample in X, y gives the digit that was written.
plt.imshow(X[0].reshape(8, 8), cmap='gray')
# turn off the axis
plt.axis('off')
# formatted print
print('The digit in the image is {}'.format(y[0]))
Output:
The digit in the image is 0
In machine learning, we should evaluate our models by training and testing them on different sets of data. train_test_split is a utility function that splits the data into two independent sets. The stratify parameter forces the class distribution of the training and test sets to match that of the whole dataset.
# The stratify parameter ensures that the class distribution of the training
# and test sets matches that of the whole dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
Once we have separate training and test sets, we can learn a machine learning model with the fit method. We check its performance with the score method, which relies on the default accuracy metric.
# accuracy score of a linear logistic regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
Output:
Accuracy score of the LogisticRegression is 0.95
The scikit-learn API is consistent across classifiers. Therefore, we can easily replace the LogisticRegression classifier with a RandomForestClassifier. The changes are minimal and concern only the creation of the classifier instance.
# A RandomForestClassifier easily replaces the LogisticRegression classifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
Output:
Accuracy score of the RandomForestClassifier is 0.96
Practice
Complete the following exercise:
- Load the breast cancer dataset: import the load_breast_cancer function from sklearn.datasets.
# %load solutions/01_1_solutions.py
- Use sklearn.model_selection.train_test_split to split the dataset, reserving 30% of it for testing. Make sure the split is stratified (i.e. use the stratify parameter) and set random_state to 0.
# %load solutions/01_2_solutions.py
- Train a supervised classifier on the training data.
# %load solutions/01_3_solutions.py
- Use the fitted classifier to predict the class labels of the test set.
# %load solutions/01_4_solutions.py
- Compute the balanced accuracy on the test set. You will need to import balanced_accuracy_score from sklearn.metrics.
# %load solutions/01_5_solutions.py
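Taken together, the steps of this exercise could be sketched as follows. This is a hypothetical solution (the choice of LogisticRegression as the classifier and its parameters are assumptions, not the official answer, which lives in the solutions/ files):

```python
# A possible solution sketch for the exercise above (hypothetical).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# 1. load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# 2. split it, keeping 30% for testing, stratified, with random_state=0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 3. train a supervised classifier on the training data
clf = LogisticRegression(solver='lbfgs', max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

# 4. predict the class labels of the test set
y_pred = clf.predict(X_test)

# 5. compute the balanced accuracy on the test set
score = balanced_accuracy_score(y_test, y_pred)
print('Balanced accuracy: {:.2f}'.format(score))
```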
2. More advanced use case: preprocessing the data before training and testing the classifier
2.1 Standardize your data
Preprocessing may be required before learning a model. For example, a user may want to create hand-crafted features, or an algorithm may make prior assumptions about the data. In our case, the solver used by LogisticRegression expects normalized data. Therefore, we need to standardize the data before training the model. To observe this requirement, we will check the number of iterations needed to train the model.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))
Output:
LogisticRegression required 1841 iterations to be fitted
The MinMaxScaler transformer is used to normalize the data. The scaler should learn its statistics on the training set only (that is, the fit method) and then standardize both the training and test sets (that is, the transform method). Finally, we train and test the model on the normalized datasets.
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))
Output:
Accuracy score of the LogisticRegression is 0.96
LogisticRegression required 190 iterations to be fitted
With the normalized data, the model converges much faster than with the unnormalized data (it needs fewer iterations).
2.2 Incorrect preprocessing patterns
We have highlighted how to preprocess data and train a machine learning model properly. It is also interesting to look at the wrong ways of preprocessing data. There are two potential mistakes that are easy to make but also easy to detect.
The first pattern is to standardize the data before the entire data set is divided into training and test sets.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_train_prescaled, X_test_prescaled, y_train_prescaled, y_test_prescaled = train_test_split(
X_scaled, y, stratify=y, random_state=42)
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train_prescaled)
accuracy = clf.score(X_test_prescaled, y_test_prescaled)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
Output:
Accuracy score of the LogisticRegression is 0.96
The second pattern is to standardize the training and test sets independently. This amounts to calling the fit method on both the training and the test set. As a result, the training and test sets are standardized differently.
scaler = MinMaxScaler()
X_train_prescaled = scaler.fit_transform(X_train)
# wrong: fit is called a second time, this time on the test set
X_test_prescaled = scaler.fit_transform(X_test)
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train)
accuracy = clf.score(X_test_prescaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
Output:
Accuracy score of the LogisticRegression is 0.96
2.3 Keep it simple, stupid: use scikit-learn's pipeline connector
The two patterns shown above are data leakage problems, and such errors are hard to prevent when preprocessing has to be done manually. Therefore, scikit-learn introduced the Pipeline object. It sequentially connects several transformers and a classifier (or a regressor). We can create a pipeline like this:
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[('scaler', MinMaxScaler()),
                       ('clf', LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42))])
We see that the pipeline holds the parameters of both the scaler (for normalization) and the classifier. Sometimes naming every estimator in the pipeline is tedious; make_pipeline names each estimator automatically, using the lowercased class name.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(MinMaxScaler(),
LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42, max_iter=1000))
The pipeline exposes the same API. We use fit to train the classifier and score to check the accuracy. However, calling fit also calls the fit_transform method of all the transformers in the pipeline, and calling score (or predict and predict_proba) internally calls transform on all the transformers. This corresponds to the normalization procedure of section 2.1.
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))
Output:
Accuracy score of the Pipeline is 0.96
We can use get_params() to check all the parameters of the pipe.
pipe.get_params()
Output:
{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
tol=0.0001, verbose=0, warm_start=False),
 'logisticregression__C': 1.0,
 ...}
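As an aside (a small sketch, not part of the original tutorial), the nested parameters returned by get_params() follow the '&lt;step&gt;__&lt;parameter&gt;' convention, so set_params can change a pipeline hyperparameter in place:

```python
# Change the C hyperparameter of the logistic regression step in place,
# using the '<step>__<parameter>' naming convention.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', random_state=42))
pipe.set_params(logisticregression__C=10)
print(pipe.get_params()['logisticregression__C'])
```

This is exactly the mechanism that GridSearchCV uses to set candidate parameters on a pipeline, as we will see in section 4.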
Practice
Reuse the breast cancer dataset from the first exercise. Import SGDClassifier from sklearn.linear_model and create a pipeline combining it with the StandardScaler transformer imported from sklearn.preprocessing. Then train and test the pipeline.
# %load solutions/02_solutions.py
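One possible sketch for this exercise (hypothetical; the official answer is in solutions/02_solutions.py, and the SGDClassifier parameters below are assumptions):

```python
# Pipeline a StandardScaler with an SGDClassifier on the breast cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# chain the scaler and the classifier into a single estimator
pipe = make_pipeline(StandardScaler(),
                     SGDClassifier(max_iter=1000, random_state=42))
pipe.fit(X_train, y_train)
print('Accuracy: {:.2f}'.format(pipe.score(X_test, y_test)))
```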
3. When more is better than less: Cross-validate rather than split separately
Splitting the data is necessary to evaluate the performance of a statistical model. However, it reduces the number of samples available for learning the model. Therefore, cross-validation should be used whenever possible. Having multiple splits also provides information about the stability of the model.
Scikit-learn provides three functions: cross_val_score, cross_val_predict, and cross_validate. The last one gives more information about the fitting time and the training and test scores, and it can also return several scores at once.
from sklearn.model_selection import cross_validate
pipe = make_pipeline(MinMaxScaler(),
LogisticRegression(solver='lbfgs', multi_class='auto',
max_iter=1000, random_state=42))
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)
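For comparison (a minimal sketch, not in the original tutorial), cross_val_score returns only the test score of each fold:

```python
# cross_val_score: one test score per fold, nothing else.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', max_iter=1000,
                                        random_state=42))
test_scores = cross_val_score(pipe, X, y, cv=3)
print(test_scores)  # one accuracy value per fold
```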
Using the cross_validate function, we can quickly check the training and test scores and use pandas for quick plotting.
import pandas as pd
df_scores = pd.DataFrame(scores)
df_scores
Output: (a DataFrame with the fit time, score time, and train/test score of each fold)

df_scores[['train_score', 'test_score']].boxplot()

Output: (a boxplot of the train and test scores)
Practice
Use the pipeline from the previous exercise and evaluate it with cross-validation instead of a single split.
# %load solutions/03_solutions.py
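A possible sketch for this exercise (hypothetical; the official answer is in solutions/03_solutions.py), assuming the StandardScaler + SGDClassifier pipeline from the previous exercise:

```python
# Cross-validate the scaler + SGD pipeline instead of using a single split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(),
                     SGDClassifier(max_iter=1000, random_state=42))
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)
print(scores['test_score'])  # one score per fold
```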
4. Hyperparameter optimization: fine tune the inside of the pipe
Sometimes we want to find the parameters of a pipeline component that give the best accuracy. We have already seen that the parameters of a pipeline can be inspected with get_params().
pipe.get_params()
Output:
{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
tol=0.0001, verbose=0, warm_start=False),
 'logisticregression__C': 1.0,
 ...}
Hyperparameters can be optimized by exhaustive search. GridSearchCV provides such a utility, performing a cross-validated grid search over a parameter grid.
In the following example, we want to optimize the C and penalty parameters of the LogisticRegression classifier.
from sklearn.model_selection import GridSearchCV
pipe = make_pipeline(MinMaxScaler(),
LogisticRegression(solver='saga', multi_class='auto',
random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)
Output:
GridSearchCV(cv=3, error_score='raise-deprecating', ..., scoring=None, verbose=0)
When fitting the grid search object, it finds the best combination of parameters on the training set (using cross validation). We can obtain the results of a grid search by accessing the cv_results_ attribute. This property allows us to examine the impact of parameters on model performance.
df_grid = pd.DataFrame(grid.cv_results_)
df_grid
Output: (a DataFrame summarizing the grid-search results)
By default, the grid-search object also behaves as an estimator. Once it is fitted, calling score (or predict) uses the model refit with the best combination of parameters found.
grid.best_params_
Output:
{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}
Moreover, the grid search can be used like any other classifier to make predictions.
accuracy = grid.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(grid.__class__.__name__, accuracy))
Output:
Accuracy score of the GridSearchCV is 0.96
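As a small illustration (not in the original tutorial), once the grid search is fitted, the pipeline refit on the whole training set with the best parameters is also available via the best_estimator_ attribute. The reduced grid below is an assumption so that the sketch runs quickly:

```python
# Access the refit pipeline after a grid search via best_estimator_.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', max_iter=1000,
                                        random_state=42))
param_grid = {'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)

best_pipe = grid.best_estimator_   # a fitted Pipeline
print(grid.best_params_)
print('Test accuracy: {:.2f}'.format(best_pipe.score(X_test, y_test)))
```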
Note that, so far, we have performed the grid search on a single split only. However, as mentioned earlier, we may be interested in an outer cross-validation to estimate the performance of the model on different samples of the data and to check the potential variation in performance. Since the grid search is an estimator, we can use it directly inside the cross_validate function.
scores = cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores
Output: (a DataFrame of the nested cross-validation scores)
Practice
Reuse the previous pipeline on the breast cancer dataset and run a grid search to evaluate the difference between the hinge and log losses. In addition, fine-tune the penalty.
# %load solutions/04_solutions.py
5. Summary: my scikit-learn pipeline in fewer than 10 lines of code (excluding import statements)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
pipe = make_pipeline(MinMaxScaler(),
LogisticRegression(solver='saga', multi_class='auto', random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()
Output: (a boxplot of the train and test scores)
6. Heterogeneous data: When you work with data other than numbers
So far, we have used scikit-learn to train models on numerical data.
X
Output:
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
X is a NumPy array containing only floating point values. However, datasets can contain mixed types.
import os
data = pd.read_csv(os.path.join('data', 'titanic_openml.csv'), na_values='?')
data.head()
Output: (the first rows of the Titanic dataset)
The Titanic dataset contains categorical, textual, and numerical features. We will use this dataset to predict whether a passenger survived the Titanic.
Let's split the data into training and test sets and use the survived column as the target.
y = data['survived']
X = data.drop(columns='survived')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
First, try using the LogisticRegression classifier to see how well it performs.
clf = LogisticRegression()
clf.fit(X_train, y_train)
Alas, most classifiers are designed to work with numerical data. Therefore, we need to convert the categorical data into numerical features. The simplest approach is to one-hot encode each categorical feature with OneHotEncoder. Let's take the sex and embarked columns as an example. Note that we also encounter some missing data; we will use SimpleImputer to replace the missing values with a constant value.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
ohe = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
X_encoded = ohe.fit_transform(X_train[['sex', 'embarked']])
X_encoded.toarray()
Output:
array([[0., 1., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0.]])
In this way, categorical features can be encoded. However, we also want to standardize the numerical features. Therefore, we need to split the original data into two subsets and apply different preprocessing to each: (i) one-hot encoding of the categorical data and (ii) standard scaling (normalization) of the numerical data. We also need to handle missing values in both cases: for the categorical columns, we replace missing values with the constant string 'missing_values', which will be interpreted as its own category; for the numerical data, we replace missing values with the mean of the feature of interest.
- One-hot encoding of the categorical data
col_cat = ['sex', 'embarked']
col_num = ['age', 'sibsp', 'parch', 'fare']
X_train_cat = X_train[col_cat]
X_train_num = X_train[col_num]
X_test_cat = X_test[col_cat]
X_test_num = X_test[col_num]
- Standard scaling of numerical data (normalization)
from sklearn.preprocessing import StandardScaler
scaler_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
X_train_cat_enc = scaler_cat.fit_transform(X_train_cat)
X_test_cat_enc = scaler_cat.transform(X_test_cat)
scaler_num = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
X_train_num_scaled = scaler_num.fit_transform(X_train_num)
X_test_num_scaled = scaler_num.transform(X_test_num)
We apply these transformations to the training and test sets as we did in section 2.1, then stack the two preprocessed subsets back together.
import numpy as np
from scipy import sparse
X_train_scaled = sparse.hstack((X_train_cat_enc,
sparse.csr_matrix(X_train_num_scaled)))
X_test_scaled = sparse.hstack((X_test_cat_enc,
sparse.csr_matrix(X_test_num_scaled)))
With the transformation complete, we can now combine all numeric information. Finally, we use the LogisticRegression classifier as the model.
clf = LogisticRegression(solver='lbfgs')
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
Output:
Accuracy score of the LogisticRegression is 0.79
The above pattern of first transforming the data and then fitting/scoring the classifier is exactly the pattern of section 2.1. Therefore, we would like to use a pipeline for this purpose. However, we also want to treat different columns of the matrix differently. The ColumnTransformer transformer, or the make_column_transformer function, should be used: it automatically applies different pipelines to different columns.
from sklearn.compose import make_column_transformer
pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(SimpleImputer(), StandardScaler())
preprocessor = make_column_transformer((col_cat, pipe_cat), (col_num, pipe_num))
pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))
Output:
Accuracy score of the Pipeline is 0.79
In addition, it can also be used inside another pipeline. We can therefore use all the scikit-learn utilities, such as cross_validate or GridSearchCV.
pipe.get_params()
Output:
{'columntransformer': ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
transformer_weights=None,
                   transformers=[('pipeline-1', Pipeline(memory=None, ...))]),
 ...}
Putting it all together and visualizing:
pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(StandardScaler(), SimpleImputer())
preprocessor = make_column_transformer((col_cat, pipe_cat), (col_num, pipe_num))
pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))
param_grid = {'columntransformer__pipeline-2__simpleimputer__strategy': ['mean', 'median'],
              'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, scoring='balanced_accuracy', cv=5, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()
Output: (a boxplot of the train and test balanced-accuracy scores)
Practice
Complete the following exercise:
Load the adult dataset in ./data/adult_openml.csv. Build your own ColumnTransformer preprocessor and pipeline it with a classifier. Fine-tune it and check the prediction accuracy with cross-validation.
- Use pd.read_csv to read the adult dataset in ./data/adult_openml.csv.
# %load solutions/05_1_solutions.py
- Split the dataset into data and target. The target corresponds to the class column. For the data, drop the columns fnlwgt, capitalgain, and capitalloss.
# %load solutions/05_2_solutions.py
- The target is not encoded. Use sklearn.preprocessing.LabelEncoder to encode the classes.
# %load solutions/05_3_solutions.py
- Create a list containing the names of the categorical columns. Do the same for the numerical data.
# %load solutions/05_4_solutions.py
- Create a pipeline to one-hot encode the categorical data. For the numerical data, use KBinsDiscretizer; import it from sklearn.preprocessing.
# %load solutions/05_5_solutions.py
- Use make_column_transformer to create the preprocessor. Make sure to apply the right pipeline to the right columns.
# %load solutions/05_6_solutions.py
- Pipeline the preprocessor with a LogisticRegression classifier. Then define a grid search to find the best parameter C. Train and test this workflow in a cross-validation scheme using cross_validate.
# %load solutions/05_7_solutions.py
Author's official WeChat account: Machine learning beginners