Public account: You and the cabin | Author: Peter | Editor: Peter

Hello, I’m Peter

Scikit-learn is a well-known Python machine learning library that is widely used in data science areas such as statistical analysis and machine learning modeling.

  • Modeling is awesome: scikit-learn lets users implement a wide variety of supervised and unsupervised learning models
  • Multiple functions: sklearn can also be used for data preprocessing, feature engineering, dataset splitting, model evaluation, and other tasks
  • Rich data: built-in datasets such as the iris and Titanic data mean that finding data is no longer a worry

This article provides a quick and concise introduction to scikit-learn. For more information, see the official scikit-learn documentation.

  1. Built-in dataset usage
  2. Dataset splitting
  3. Data normalization and standardization
  4. Type encoding
  5. Modeling

The scikit-learn algorithm cheat sheet

The figure below, provided on the official website, summarizes the use of scikit-learn from four aspects: regression, classification, clustering, and dimensionality reduction.

Scikit-learn.org/stable/tuto…

Installation

To install scikit-learn, Anaconda is recommended, so you do not have to worry about configuration and environment issues. You can also install it directly with pip:

pip install scikit-learn

Data set generation

Sklearn has some excellent datasets built in, such as the iris data, the Boston housing price data, and the Titanic data.

import pandas as pd
import numpy as np

import sklearn 
from sklearn import datasets  # import data set

Classification data: the iris dataset

# iris data
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

What does the iris data actually look like? Each built-in dataset carries a lot of information.
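
For a quick look, something along these lines works (a minimal sketch using the standard Bunch attributes):

print(iris.keys())          # the available attributes: data, target, target_names, feature_names, DESCR, ...
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.feature_names)   # names of the four features
print(iris.target_names)    # the three iris species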

We can build the DataFrame we want from the data above, and we can also add the dependent variable (the target):
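
A minimal sketch, assuming pandas is imported as above:

df = pd.DataFrame(iris.data, columns=iris.feature_names)  # feature matrix as a DataFrame
df["target"] = iris.target  # add the dependent variable (class labels 0/1/2)
df.head()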

Regression data: Boston housing prices

The attributes we focus on:

  • data
  • target, target_names
  • feature_names
  • filename

A DataFrame can also be generated:
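
For example (a minimal sketch, mirroring the iris example above):

boston = datasets.load_boston()  # Boston housing price data
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)  # features as a DataFrame
df_boston["target"] = boston.target  # add the target (median home value)
df_boston.head()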

There are three ways to load the data:

Method 1

# call module
from sklearn.datasets import load_iris
data = load_iris()

# Import the data and labels
data_X = data.data
data_y = data.target 

Method 2

from sklearn import datasets
loaded_data = datasets.load_iris()  # Load the dataset

# Import the sample data
data_X = loaded_data.data
# Import the labels
data_y = loaded_data.target

Method 3

# Return the data and labels directly
data_X, data_y = load_iris(return_X_y=True)

Dataset usage summary

from sklearn import datasets  # import the library

boston = datasets.load_boston()  # Load the Boston housing price data
print(boston.keys())  # Check the keys (attributes): ['data', 'target', 'feature_names', 'DESCR', 'filename']
print(boston.data.shape, boston.target.shape)  # Look at the shape of the data and the target
print(boston.feature_names)  # See what the features are
print(boston.DESCR)  # Dataset description
print(boston.filename)  # File path
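
Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions you would need another regression dataset, for example:

from sklearn.datasets import fetch_california_housing  # available in current releases
housing = fetch_california_housing()
print(housing.data.shape, housing.target.shape)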

Data splitting

# import module
from sklearn.model_selection import train_test_split
# Data is divided into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
  data_X, 
  data_y, 
  test_size=0.2,
  random_state=111
)

# 150 * 0.8 = 120
len(X_train)

Data standardization and normalization

from sklearn.preprocessing import StandardScaler  # standardization
from sklearn.preprocessing import MinMaxScaler  # normalization

# standardization
ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)  # Fit and transform the data to be standardized

# normalization
mm = MinMaxScaler()
X_scaled = mm.fit_transform(X_train)
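
One point worth keeping in mind (a minimal sketch, not part of the snippet above): fit the scaler only on the training set, then transform the test set with the already-fitted scaler:

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # fit on the training set and transform it
X_test_scaled = ss.transform(X_test)        # reuse the fitted parameters on the test set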

Type encoding

Examples from the official website: scikit-learn.org/stable/modu…

Encoding numbers
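
A minimal sketch with LabelEncoder (following the pattern in the scikit-learn docs; the sample values are just illustrative):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([1, 2, 2, 6])                # learn the distinct values
print(le.classes_)                  # array([1, 2, 6])
print(le.transform([1, 1, 2, 6]))   # array([0, 0, 1, 2])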

Encoding strings
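
The same encoder works on strings; a minimal sketch, again following the docs:

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(list(le.classes_))                           # ['amsterdam', 'paris', 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))   # array([2, 2, 1])
print(list(le.inverse_transform([2, 2, 1])))       # ['tokyo', 'tokyo', 'paris']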

Modeling case

Import modules

from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis  # model
from sklearn.datasets import load_iris  # import data
from sklearn.model_selection import train_test_split  # Split the data
from sklearn.model_selection import GridSearchCV  # grid search
from sklearn.pipeline import Pipeline  # Pipeline operation

from sklearn.metrics import accuracy_score  # Score verification
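
NeighborhoodComponentsAnalysis and Pipeline are imported above but not used in the steps below; a minimal sketch of how they could be chained (along the lines of the scikit-learn docs, with illustrative parameter values):

nca = NeighborhoodComponentsAnalysis(random_state=42)
knn_pipe = Pipeline([("nca", nca), ("knn", KNeighborsClassifier(n_neighbors=3))])
knn_pipe.fit(X_train, y_train)         # learn the transformation, then fit KNN
print(knn_pipe.score(X_test, y_test))  # accuracy of the pipeline on the test set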

Model instantiation

# model instantiation
knn = KNeighborsClassifier(n_neighbors=5)

Train the model

knn.fit(X_train, y_train)
KNeighborsClassifier()

Test set prediction

y_pred = knn.predict(X_test)
y_pred  # Predicted values from the model
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 1, 2, 1, 1, 2, 0, 0, 2, 0, 2])

Score validation

There are two ways to verify model scores:

knn.score(X_test,y_test)
0.9333333333333333
accuracy_score(y_pred,y_test)
0.9333333333333333

Grid search

How to search for the best parameters:

from sklearn.model_selection import GridSearchCV

# Parameters to search
knn_paras = {"n_neighbors": [1, 3, 5, 7]}
# Default model
knn_grid = KNeighborsClassifier()

# Instantiate the grid search object
grid_search = GridSearchCV(
	knn_grid,
	knn_paras,
	cv=10  # 10-fold cross-validation
)
grid_search.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 5, 7]})
# The best estimator found by the search
grid_search.best_estimator_ 
KNeighborsClassifier(n_neighbors=7)
grid_search.best_params_


{'n_neighbors': 7}
grid_search.best_score_
0.975

Modeling based on search results

knn1 = KNeighborsClassifier(n_neighbors=7)

knn1.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)

As the following results show, the model built with the parameters found by grid search performs better than the model without grid search.

y_pred_1 = knn1.predict(X_test)

knn1.score(X_test,y_test)
1.0
accuracy_score(y_pred_1,y_test)
1.0