Introduction

For beginners taking their first steps with machine learning algorithms, getting started quickly can be a struggle. The most common tools among people working in data science are R and Python. Each has its pros and cons, but Python has a strong advantage here because its scikit-learn library implements a wide range of machine learning algorithms behind a consistent interface.

Data Loading

Let’s assume the input is a feature matrix or a CSV file. First, the data should be loaded into memory. scikit-learn is implemented on top of NumPy arrays, so we use NumPy to load the CSV file. The following data is downloaded from the UCI Machine Learning Repository.

import numpy as np
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
# URL of the dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the eight feature columns from the target column
X = dataset[:, 0:8]
y = dataset[:, 8]

We will use this data set as a running example, with the feature matrix as X and the target variable as y.
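As a quick sanity check, we can print the array shapes (a minimal sketch; the expected values assume the Pima Indians dataset loaded above, which has 768 rows of eight features plus a target):

# confirm the shapes of the feature matrix and target vector
print(X.shape)  # expected (768, 8)
print(y.shape)  # expected (768,)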

Data Normalization

Gradient-based methods, which underlie many machine learning algorithms, are sensitive to the scale of the features. Before running such an algorithm, we should normalize or standardize the data: normalization rescales each sample to unit norm, while standardization transforms each feature to zero mean and unit variance. scikit-learn provides both:

from sklearn import preprocessing
# normalize: rescale each sample (row) to unit norm
normalized_X = preprocessing.normalize(X)
# standardize: transform each feature to zero mean and unit variance
standardized_X = preprocessing.scale(X)
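Note that neither call squeezes features into the 0-1 range; if that is what you want, MinMaxScaler does it. A minimal sketch of that alternative:

from sklearn.preprocessing import MinMaxScaler
# rescale each feature column to the [0, 1] range
scaler = MinMaxScaler()
rescaled_X = scaler.fit_transform(X)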

Feature Selection

When solving a real problem, the ability to select or construct the right features is particularly important; this is called feature selection or feature engineering. Feature selection is a creative process that relies heavily on intuition and domain expertise, but there are also many ready-made algorithms. For example, tree-based estimators can compute the informativeness of each feature:

from sklearn.ensemble import ExtraTreesClassifier
# fit an ensemble of extremely randomized trees
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
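Recursive feature elimination (RFE) is another ready-made option in scikit-learn; the following minimal sketch keeps three features (the base estimator and the number of features to keep are arbitrary choices here):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# recursively drop the weakest feature until three remain
rfe = RFE(LogisticRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # feature ranking (1 = selected)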

Using the Algorithms

Let’s take a quick tour of scikit-learn, which implements most of the basic machine learning algorithms.

Logistic Regression

Most problems can be reduced to binary classification. An advantage of this algorithm is that it can output the probability of class membership rather than just a label (a sketch of retrieving those probabilities follows the results below).

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, penalty='l2', random_state=None,
                   tol=0.0001)

             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
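Since the advantage cited above is that logistic regression yields class probabilities, here is a minimal sketch of retrieving them from the model fitted above (the slice of five samples is arbitrary):

# per-class membership probabilities for the first five samples
print(model.predict_proba(X[:5]))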

Naive Bayes

This is another well-known machine learning algorithm. It works by estimating the class-conditional distribution of the training data, and it performs well on multi-class classification problems.

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

GaussianNB()

             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]

K-Nearest Neighbors

The k-nearest neighbors algorithm is often used as one component of a larger classification method; for example, it can be used to score how informative features are, which makes it useful in feature selection (see the sketch after the results below).

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     n_neighbors=5, p=2, weights='uniform')

             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
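One way to read "using KNN to evaluate features" is as a wrapper method: score a candidate feature subset by the cross-validated accuracy of a KNN classifier trained on it. A minimal sketch under that interpretation (the column subset and cv=5 are arbitrary choices):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# score an arbitrary feature subset by KNN cross-validated accuracy
subset = X[:, [0, 1, 5]]
scores = cross_val_score(KNeighborsClassifier(), subset, y, cv=5)
print(scores.mean())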

Decision Tree

The Classification and Regression Trees (CART) algorithm is often used for classification or regression problems whose features carry categorical information, and it handles multi-class problems well.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

DecisionTreeClassifier(compute_importances=None, criterion='gini',
                       max_depth=None, max_features=None, min_density=None,
                       min_samples_leaf=1, min_samples_split=2,
                       random_state=None, splitter='best')

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]

Note that the perfect scores are an artifact of predicting on the very samples the tree was fitted on: an unpruned decision tree can memorize its training set. A held-out split gives a more honest estimate.
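A minimal sketch of such a held-out evaluation (the 70/30 split and the fixed random_state are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# hold out 30% of the samples so the tree is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
print(metrics.classification_report(y_test, tree.predict(X_test)))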

Support Vector Machine

SVM is a very popular machine learning algorithm, used mainly for classification. Like logistic regression, it can handle multi-class problems with a one-vs-rest scheme (see the sketch after the results below).

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]

As with the decision tree, the perfect fit here reflects evaluation on the training data.
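For an explicit one-vs-rest setup, scikit-learn's OneVsRestClassifier wrapper can be combined with SVC. A minimal sketch (on this binary dataset it changes little, but the pattern carries over to multi-class problems):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
# train one binary SVM per class and predict with the most confident one
ovr = OneVsRestClassifier(SVC())
ovr.fit(X, y)
print(ovr.score(X, y))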

In addition to classification and regression algorithms, scikit-learn provides more sophisticated techniques such as clustering, and implements methods for combining estimators, such as Bagging and Boosting.
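As one illustration of those combination techniques, a bagging ensemble of decision trees might look like the following minimal sketch (the base estimator and n_estimators=100 are arbitrary choices):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# bagging: fit many trees on bootstrap resamples and aggregate their votes
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
bagging.fit(X, y)
print(bagging.score(X, y))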

How to optimize algorithm parameters

A more difficult task is choosing good model parameters efficiently, which usually requires a search method. scikit-learn provides functions for this. The following example runs a standard grid search over parameter values:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Results:

GridSearchCV(cv=None,
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, solver='auto',
                             tol=0.001),
             estimator__alpha=1.0, estimator__copy_X=True,
             estimator__fit_intercept=True, estimator__max_iter=None,
             estimator__normalize=False, estimator__solver='auto',
             estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,
             n_jobs=1,
             param_grid={'alpha': array([1.00000e+00, 1.00000e-01, 1.00000e-02,
                                         1.00000e-03, 1.00000e-04, 0.00000e+00])},
             pre_dispatch='2*n_jobs', refit=True, score_func=None,
             scoring=None, verbose=0)
0.282118955686
1.0

Sometimes it is more effective to sample parameters at random from a given interval, evaluate the algorithm under each sample, and keep the best one.

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search in older versions
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

Results:

RandomizedSearchCV(cv=None,
                   estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                                   max_iter=None, normalize=False,
                                   solver='auto', tol=0.001),
                   estimator__alpha=1.0, estimator__copy_X=True,
                   estimator__fit_intercept=True, estimator__max_iter=None,
                   estimator__normalize=False, estimator__solver='auto',
                   estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,
                   n_jobs=1, param_distributions={'alpha': ...},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   scoring=None, verbose=0)
0.282118643885
0.988443794636

Summary

We’ve given you an overview of how to use the scikit-learn library, and we hope this summary helps beginners get grounded and move on to solving concrete machine learning problems.

When reprinting, please credit the author, Jason Ding, and link to the original: GitCafe blog (http://jasonding1354.gitcafe.io/), GitHub blog (http://jasonding1354.github.io/), CSDN blog (http://blog.csdn.net/jasonding1354), or Jianshu (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles).

