Public account: You and the cabin | Author: Peter | Editor: Peter
Hello, I’m Peter
Scikit-learn is a well-known Python machine learning library, widely used in data science for work such as statistical analysis and machine learning modeling.
- Powerful modeling: scikit-learn lets users implement a wide variety of supervised and unsupervised learning models
- Multiple functions: sklearn can also be used for data preprocessing, feature engineering, dataset splitting, model evaluation, and other tasks
- Rich data: built-in datasets such as Titanic and iris mean finding data is no longer a worry
This article provides a quick, concise introduction to scikit-learn; for more details, see the official scikit-learn documentation. It covers:
- Using the built-in datasets
- Dataset splitting
- Data normalization and standardization
- Type encoding
- The six steps of modeling
The scikit-learn algorithm cheat sheet
The figure below, provided on the official website, summarizes the use of scikit-learn from four angles: regression, classification, clustering, and dimensionality reduction.
Scikit-learn.org/stable/tuto…
Installation
To install scikit-learn, Anaconda is recommended, so you don't have to worry about configuration and environment issues. You can also install it directly with pip:
pip install scikit-learn
Data set generation
Sklearn has some excellent datasets built in, such as the iris data, housing price data, and Titanic data.
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets # import data set
Classification data – iris data
# iris data
iris = datasets.load_iris()
type(iris)
sklearn.utils.Bunch
What exactly does the iris data look like? Each built-in dataset carries a lot of information.
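For example, the iris Bunch can be inspected directly (the exact set of keys varies slightly across scikit-learn versions):
print(iris.keys())  # e.g. dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(iris.data.shape)  # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # names of the four measurements
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']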
We can build the DataFrame we want from the data above, and also add the dependent variable:
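The original snippet isn't reproduced here; a minimal sketch, using the iris Bunch loaded above (df_iris is just an illustrative name):
# Build a DataFrame from the feature matrix; column names come from feature_names
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the dependent variable as a new column
df_iris["target"] = iris.target
df_iris.head()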
Regression data – Boston home prices
The attributes we focus on:
- data
- target, target_names
- feature_names
- filename
DataFrame can also be generated:
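Again, the original code isn't shown; a minimal sketch along the same lines. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this assumes an older version:
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston["MEDV"] = boston.target  # median home value, the regression target
df_boston.head()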
There are three ways to load the data
Method 1
# import the loader
from sklearn.datasets import load_iris
data = load_iris()
# Import the data and labels
data_X = data.data
data_y = data.target
Method 2
from sklearn import datasets
loaded_data = datasets.load_iris()  # import the dataset
# Import the sample data
data_X = loaded_data.data
# Import the labels
data_y = loaded_data.target
Method 3
# return the data and labels directly
data_X, data_y = load_iris(return_X_y=True)
Dataset usage summary
from sklearn import datasets  # import the datasets module
boston = datasets.load_boston()  # import Boston housing price data
print(boston.keys())  # check the keys (attributes): ['data', 'target', 'feature_names', 'DESCR', 'filename']
print(boston.data.shape, boston.target.shape)  # look at the shape of the data
print(boston.feature_names)  # see what the features are
print(boston.DESCR)  # dataset description
print(boston.filename)  # file path
Data splitting
# import module
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
data_X,
data_y,
test_size=0.2,
random_state=111
)
# 150 * 0.8 = 120
len(X_train)
Data standardization and normalization
from sklearn.preprocessing import StandardScaler # standardization
from sklearn.preprocessing import MinMaxScaler # normalization
# standardization
ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)  # pass in the data to be standardized
# normalization
mm = MinMaxScaler()
X_scaled = mm.fit_transform(X_train)
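One caveat worth adding here (not in the original): fit the scaler on the training set only, then apply the same transformation to the test set, so that no test-set information leaks into training:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = ss.transform(X_test)  # reuse the training statistics on the test data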
Type encoding
Example from the official website: scikit-learn.org/stable/modu…
Encoding numbers
Encoding strings
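The original code under these two headings isn't reproduced here; a minimal sketch with LabelEncoder, following the pattern in the official docs:
from sklearn.preprocessing import LabelEncoder

# Encoding numbers: each distinct value is mapped to 0..n_classes-1
le = LabelEncoder()
le.fit([1, 2, 2, 6])
print(le.transform([1, 1, 2, 6]))  # [0 0 1 2]

# Encoding strings: works the same way for text labels
le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(le.classes_)  # ['amsterdam' 'paris' 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))  # [2 2 1]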
Modeling example
Import the modules
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis  # models
from sklearn.datasets import load_iris  # import data
from sklearn.model_selection import train_test_split  # split data
from sklearn.model_selection import GridSearchCV  # grid search
from sklearn.pipeline import Pipeline  # pipeline operations
from sklearn.metrics import accuracy_score  # score validation
Model instantiation
# model instantiation
knn = KNeighborsClassifier(n_neighbors=5)
Train the model
knn.fit(X_train, y_train)
KNeighborsClassifier()
Test set prediction
y_pred = knn.predict(X_test)
y_pred  # predicted values from the model
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 1, 2, 1, 1, 2, 0, 0, 2, 0, 2])
Score validation
There are two ways to verify model scores:
knn.score(X_test,y_test)
0.9333333333333333
accuracy_score(y_pred,y_test)
0.9333333333333333
Grid search
How to search for the best parameters:
from sklearn.model_selection import GridSearchCV
# search parameters
knn_paras = {"n_neighbors": [1, 3, 5, 7]}
# Default model
knn_grid = KNeighborsClassifier()
# Instantiate the grid search object
grid_search = GridSearchCV(
knn_grid,
knn_paras,
cv=10  # 10-fold cross-validation
)
grid_search.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 5, 7]})
The best parameters found by the search:
grid_search.best_estimator_
KNeighborsClassifier(n_neighbors=7)
grid_search.best_params_
{'n_neighbors': 7}
grid_search.best_score_
0.975
Modeling based on search results
knn1 = KNeighborsClassifier(n_neighbors=7)
knn1.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)
As the results below show, the model tuned with grid search performs better than the model without it:
y_pred_1 = knn1.predict(X_test)
knn1.score(X_test,y_test)
1.0
accuracy_score(y_pred_1,y_test)
1.0