@[toc]

Basic Algorithm use (Sklearn)

So far this covers the use of specific algorithms, that is, some basic built-in algorithms called through sklearn. Because the operations are similar for all of them, there is a unified call specification for sklearn algorithm calls.

(PS: I will not give mathematical derivations for the basic algorithms here; there are many of them, and some I am already very familiar with and do not want to write out. This blog is a learning summary, condensed notes combined from a Bilibili course.)

Operator API call steps

Estimator

estimator = SomeOperatorAPI() instantiates the chosen operator

estimator.fit(x_train, y_train) performs the training and produces the corresponding model

estimator.predict(x_test) makes predictions on new data

estimator.score(x_test, y_test) evaluates the model
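As a hedged sketch of this unified pattern (KNeighborsClassifier and the iris data are used purely as placeholders here; any sklearn operator follows the same four steps):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier  # placeholder operator; others work the same way

# Load and split some data (the details vary by task)
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

estimator = KNeighborsClassifier()           # 1. instantiate the operator API
estimator.fit(x_train, y_train)              # 2. train the model
y_predict = estimator.predict(x_test)        # 3. predict on the test set
print(estimator.score(x_test, y_test))       # 4. evaluate the model (accuracy)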

Data analysis step by step

1. Load data

2. Data standardization and normalization

3. Data feature extraction, dimensionality reduction, etc.

4. Split the data into training and test sets

5. Use the operator API

6. Use estimator for training

7. Perform predictive evaluation on the data

8. Optimize according to test results

9. Save the optimized algorithm model

Next come some operators for processing data.

Classification algorithm

KNN algorithm

Description:

A sample belongs to a category if the majority of its k most similar samples (its k nearest neighbours, for a given value of k) in the feature space belong to that category.

Used for classification prediction! For example, categorizing map locations and predicting where users most like to gather (and then running ads there; see the Facebook example below: predicting Facebook check-ins).

Principle:

For a sample, calculate its distance to the other points (for example the Euclidean distance) so as to find the most similar points and assign the class based on them.
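To make the principle concrete, here is a minimal toy sketch (not sklearn's implementation) that labels one sample by majority vote among its k nearest neighbours under Euclidean distance; the points are made up for illustration:

import numpy as np
from collections import Counter

def knn_predict_one(sample, x_train, y_train, k=3):
    """Toy KNN: label a sample by majority vote of its k nearest training points."""
    distances = np.sqrt(((x_train - sample) ** 2).sum(axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                         # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]       # majority class among them

# Made-up 2D points: class 0 near the origin, class 1 around (5, 5)
x_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict_one(np.array([0.5, 0.5]), x_train, y_train, k=3))  # prints 0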

Advantages:

  • Easy to implement, no training required

Disadvantages:

  • Lazy algorithm: classifying test samples is computationally expensive and has a large memory overhead
  • The value of K must be specified, and classification accuracy cannot be guaranteed if K is chosen poorly (some tuning is needed; here, sklearn's grid search).

The corresponding API:

from sklearn.neighbors import KNeighborsClassifier

Code examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Use KNN to classify irises
def knn_iris():
    """Classification of iris by the KNN algorithm"""
    #1. Get the data
    iris=load_iris()
    #2. Divide the dataset
    x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,random_state=6)
    #3. Feature Engineering: Standardization
    transfer=StandardScaler()
    x_train=transfer.fit_transform(x_train)
    x_test=transfer.transform(x_test)
    #4.KNN algorithm predictor
    estimator=KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train,y_train)
    #5. Model evaluation
        #5.1 Method 1: Directly compare the real and predicted values
    y_predict=estimator.predict(x_test)
    print("y_predict:\n",y_predict)
    print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
        #5.2 Method 2: Calculation accuracy
    score=estimator.score(x_test,y_test)
    print("Accuracy: \n",score)


out:

y_predict:
 [0 2 0 0 2 1 1 0 2 1 2 2 1 1 2 1 1 2 1 1 1 0 0 2 0 1 1 1 2 0 1 0 1 0 0 1 2 1 2 2]
Directly compare the actual value and the predicted value:
 [ True  True  True  True  True  True False  True  True  True  True  True
   True  True  True False  True  True  True  True  True  True  True  True
   True  True  True  True  True  True  True  True  True False  True  True]
The accuracy rate is 0.9210526315789

The grid optimization

from sklearn.model_selection import GridSearchCV

Previously we hard-coded estimator = KNeighborsClassifier(n_neighbors=3).

Sklearn provides a way to search over the n_neighbors parameter automatically:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

def knn_iris_gsv():
    """Classify irises with the KNN algorithm, adding grid search and cross validation."""
    #1. Get the data
    iris=load_iris()
    #2. Divide the dataset
    x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,random_state=6)
    #3. Feature Engineering: Standardization
    transfer=StandardScaler()
    x_train=transfer.fit_transform(x_train)
    x_test=transfer.transform(x_test)
    #4.KNN algorithm predictor
    estimator=KNeighborsClassifier()

    # Add grid search and cross validation
    # Parameter preparation
    param_dict={"n_neighbors": [1,3,5,7,9,11]}# candidate values for the estimator parameter
    estimator=GridSearchCV(estimator,param_grid=param_dict,cv=10)# estimator: estimator object, cv: number of cross-validation folds, fit(): train on the data, score(): accuracy

    estimator.fit(x_train,y_train)
    #5. Model evaluation
        #5.1 Method 1: Directly compare the real and predicted values
    y_predict=estimator.predict(x_test)
    print("y_predict:\n",y_predict)
    print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
        #5.2 Method 2: Calculation accuracy
    score=estimator.score(x_test,y_test)
    print("Accuracy: \n",score)

    # View grid search and cross-validation results
    # Best parameter: best_params_
    print("Best parameter: \n", estimator.best_params_)
    # Best result: best_score_
    print("Best result: \n", estimator.best_score_)
    # Best estimator: best_estimator_
    print("Best estimator :\n", estimator.best_estimator_)
    # cross validation result: cv_results_
    print("Cross-validation result :\n", estimator.cv_results_)


Here’s the point:

param_dict={"n_neighbors": [1,3,5,7,9,11]}# candidate values for the estimator parameter
    estimator=GridSearchCV(estimator,param_grid=param_dict,cv=10)# estimator: estimator object, cv: number of cross-validation folds, fit(): train on the data, score(): accuracy


KNN case (Predicting Facebook check-in location)

This case is the positioning case of Facebook.

In this case, data cleaning with pandas is required. I will go into a bit more detail here, because I think the hard part is the data-processing part.

Data set:

Link: pan.baidu.com/s/1nojpx6ov… Extraction code: 6666

Data cleaning

First, the data looks like this:

   row_id       x       y  accuracy    time    place_id
0       0  0.7941  9.0809        54  470702  8523065625
1       1  5.9567  4.7968        13  186555  1757726713
2       2  8.3078  7.0407        74  322648  1137537235
3       3  7.3665  2.5165        65  704587  6567393236
4       4  4.0961  1.1307        31  472130  7440663949
(29118021, 6)

data = pd.read_csv("../input/train.csv")

Then we need to do basic data processing, that is, first transform the time format, and then extract the feature set x and the target set y.

That’s what it looks like when you extract it

Y looks like this

place_id
1014605271    28
1015645743     4
1017236154    31
1024951487     5
1028119817     4
Name: row_id, dtype: int64

So the code looks like this:

	# 1) Narrow the data range
data=data.query(" x<2.5 & x>2 & y<1.5 & y>1.0")
	# 2) Processing time characteristics
time_value=pd.to_datetime(data["time"],unit="s")
data["day"]=date.day
data["weekdaty"]=date.weekday
data["hour"]=date.hour
	# 3) Filter out locations with low frequency
place_count=data.groupby("place_id").count()["row_id"]
data_final = data[data["place_id"].isin(place_count[place_count>3].index.values)]
	# 4) Filter eigenvalues and target values
x=data_final[["x","y","accuracy","day","weekday","hour"]]
y=data_final["place_id"]

KNN handles predictions

This step follows the standard routine:

	# 5) Data set partitioning
x_train,x_test,y_train,y_test=train_test_split(x,y)

#3. Feature Engineering: Standardization
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)

#4.KNN algorithm predictor
estimator=KNeighborsClassifier()
# Add grid search and cross validation
# Parameter preparation
param_dict={"n_neighbors": [3,5,7,9]}# candidate values for the estimator parameter
estimator=GridSearchCV(estimator,param_grid=param_dict,cv=3)# estimator: estimator object, cv: number of cross-validation folds, fit(): train on the data, score(): accuracy
estimator.fit(x_train,y_train)

#5. Model evaluation
    #5.1 Method 1: Directly compare the real and predicted values
y_predict=estimator.predict(x_test)
print("y_predict:\n",y_predict)
print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
    #5.2 Method 2: Calculation accuracy
score=estimator.score(x_test,y_test)
print("Accuracy: \n",score)
# View grid search and cross-validation results
# Best parameter: best_params_
print("Best parameter: \n", estimator.best_params_)
# Best result: best_score_
print("Best result: \n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator :\n", estimator.best_estimator_)
# cross validation result: cv_results_
print("Cross-validation result :\n", estimator.cv_results_)

The complete code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
# 1. Get the data
data=pd.read_csv("./FBlocation/train.csv")

# 2. Basic data processing
	# 1) Narrow the data range
data=data.query(" x<2.5 & x>2 & y<1.5 & y>1.0")
	# 2) Processing time characteristics
time_value=pd.to_datetime(data["time"],unit="s")
data["day"]=date.day
data["weekdaty"]=date.weekday
data["hour"]=date.hour
	# 3) Filter out locations with low frequency
place_count=data.groupby("place_id").count()["row_id"]
data_final = data[data["place_id"].isin(place_count[place_count>3].index.values)]
	# 4) Filter eigenvalues and target values
x=data_final[["x","y","accuracy","day","weekday","hour"]]
y=data_final["place_id"]
	# 5) Data set partitioning
x_train,x_test,y_train,y_test=train_test_split(x,y)

#3. Feature Engineering: Standardization
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)

#4.KNN algorithm predictor
estimator=KNeighborsClassifier()
# Add grid search and cross validation
# Parameter preparation
param_dict={"n_neighbors": [3,5,7,9]}# candidate values for the estimator parameter
estimator=GridSearchCV(estimator,param_grid=param_dict,cv=3)# estimator: estimator object, cv: number of cross-validation folds, fit(): train on the data, score(): accuracy
estimator.fit(x_train,y_train)

#5. Model evaluation
    #5.1 Method 1: Directly compare the real and predicted values
y_predict=estimator.predict(x_test)
print("y_predict:\n",y_predict)
print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
    #5.2 Method 2: Calculation accuracy
score=estimator.score(x_test,y_test)
print("Accuracy: \n",score)
# View grid search and cross-validation results
# Best parameter: best_params_
print("Best parameter: \n", estimator.best_params_)
# Best result: best_score_
print("Best result: \n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator :\n", estimator.best_estimator_)
# cross validation result: cv_results_
print("Cross-validation result :\n", estimator.cv_results_)


Naive Bayes algorithm

This is a probabilistic kind of prediction: it asks for the probability that something falls into a certain category or range.

Naive: this refers to the simplifying assumption, made because the dataset may be incomplete, that each feature in the dataset is independent of the others!

  1. Joint probability: the probability that multiple conditions hold at the same time
  2. Conditional probability: the probability of event A given that event B has already occurred
  3. Mutual independence: if P(A,B) = P(A)P(B), event A is said to be independent of event B
  4. Bayes' formula: P(C|W) = P(W|C)P(C) / P(W)
  5. Naive: features are assumed to be independent of each other
  6. Naive Bayes algorithm = naive + Bayes
  7. Typical application: text classification (with words as features)
  8. Laplace smoothing: P(F1|C) = (numerator + α) / (denominator + αm), where α is a specified coefficient and m is the total number of distinct features appearing in the training documents (see the numeric sketch after this list)
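A small numeric sketch of the Bayes formula and Laplace smoothing; all counts here are invented for illustration and are not taken from any dataset:

# Toy text-classification counts (invented): suppose class C contains 100 word
# occurrences in total, the word "price" appears 8 times in class C, and the
# vocabulary has m = 500 distinct words.
alpha = 1            # Laplace smoothing coefficient
m = 500              # total number of distinct features in the training documents
count_word_in_C = 8
count_all_in_C = 100

# Without smoothing: P(word|C) = 8 / 100
p_plain = count_word_in_C / count_all_in_C
# With Laplace smoothing: P(word|C) = (8 + alpha) / (100 + alpha * m)
p_smooth = (count_word_in_C + alpha) / (count_all_in_C + alpha * m)
print(p_plain, p_smooth)  # 0.08 vs 0.015; smoothing avoids zero probabilities for unseen words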

Call API:

from sklearn.naive_bayes import MultinomialNB

Let’s go straight to the example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def nb_news():
    """ Classification of news with naive Bayesian algorithm :return: """
    #1. Get the data
    news = fetch_20newsgroups(subset="all")# subset="all" gets all the data; subset="train" gets only the training data
    #2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    #3. Feature Engineering: Text feature extraction - TFIDF
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    #4. Naive Bayes algorithm predictor flow
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    #5. Model evaluation
        # Method 1: Directly compare the real and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of real value and predicted value :\n", y_test == y_predict)
        # Method 2: Calculation accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy: \n", score)

y_predict:
 [15  2 16 ... 11 10 18]
Directly compare the real value with the predicted value:
 [False  True False ...  True  True  True]
The accuracy rate is 0.8503820033955858

(PS: these can also be grid-optimized!)

The decision tree

You know all about this stuff!

Reference up:www.bilibili.com/video/BV1Xp…

[Fundamentals of information Theory]

  1. Information (Shannon): that which eliminates random uncertainty
  2. Measure of information: information entropy
     2.1 Unit: bit
     2.2 Information gain: g(D|A) = H(D) - H(D|A)
     2.3 One basis for splitting a decision tree: information gain (bigger is better; a small numeric sketch follows)
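A minimal sketch of computing entropy and information gain for a made-up split (the 9/5 label counts below are purely illustrative):

import numpy as np

def entropy(labels):
    """Shannon entropy H(D) in bits for a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Made-up labels: D is the whole set, and feature A splits it into D1 and D2
D  = ["yes"] * 9 + ["no"] * 5
D1 = ["yes"] * 6 + ["no"] * 2
D2 = ["yes"] * 3 + ["no"] * 3

H_D = entropy(D)
H_D_given_A = (len(D1) / len(D)) * entropy(D1) + (len(D2) / len(D)) * entropy(D2)
print("g(D|A) =", H_D - H_D_given_A)  # information gain: H(D) - H(D|A)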

Advantages:

  • Visualization – Strong explanatory ability

Disadvantages:

  • Overly complex trees do not generalize well (overfitting)

Improvement:

  • Pruning CART algorithm
  • Random forests

Call API

from sklearn.tree import DecisionTreeClassifier

case

The decision tree classifies iris data

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,export_graphviz

def decision_iris():
    """Classification of irises with a decision tree"""
    #1. Get the data set
    iris = load_iris()
    #2. Divide the dataset
    x_train,x_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=22)
    #3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train,y_train)
    #4. Model evaluation
        # Method 1: Directly compare the real and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of real value and predicted value :\n", y_test == y_predict)
        # Method 2: Calculation accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy: \n", score)
    # Visualize decision trees
    export_graphviz(estimator,out_file="iris_tree.dot",feature_names=iris.feature_names)

Visualizing the decision tree

export_graphviz(estimator,out_file="iris_tree.dot",feature_names=iris.feature_names)

This generates a .dot file that you can paste into an online Graphviz viewer to render.
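As an alternative that avoids uploading the file anywhere, here is a minimal sketch using sklearn's own plot_tree with matplotlib (assumes a reasonably recent sklearn, matplotlib installed, and that estimator and iris from the function above are still in scope, e.g. returned from decision_iris()):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(estimator, feature_names=iris.feature_names, filled=True)  # draws the fitted tree
plt.savefig("iris_tree.png")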

Random forests

Reference up:www.bilibili.com/video/BV1H5…

In a word: let multiple decision trees run and combine their results.

Ensemble learning solves a single prediction problem by combining several models. It works by generating multiple classifiers/models that learn and make predictions independently; these predictions are finally combined into a composite prediction that is better than any single model's prediction. Random forest: a classifier containing multiple decision trees, in which both the training set and the features are chosen randomly.

API

from sklearn.ensemble import RandomForestClassifier

Data set:

Link: pan.baidu.com/s/1cocizyxt… Extraction code: 6666

Random Forest predicts the survival of Titanic passengers

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Fetch data
path="./titanic.csv"
titanic=pd.read_csv(path)
# Filter feature values and target values
x=titanic[["pclass","age","sex"]]
y=titanic["survived"]
#2. Data processing
	#2.1 Missing value processing
x["age"].fillna(x["age"].mean(),inplace=True)
	#2.2 Convert to a dictionary
x=x.to_dict(orient="records")
#3. Data set division
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=22)
#4. Dictionary feature extraction
transfer=DictVectorizer()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
estimator=RandomForestClassifier()
# Add grid search and cross validation
# Parameter preparation
param_dict={"n_estimators": [120,200,300,500,800,1200],"max_depth": [5,8,15,25,30]}# candidate values for the estimator parameters
estimator=GridSearchCV(estimator,param_grid=param_dict,cv=3)# estimator: estimator object, cv: number of cross-validation folds, fit(): train on the data, score(): accuracy
estimator.fit(x_train,y_train)
#5. Model evaluation
    #5.1 Method 1: Directly compare the real and predicted values
y_predict=estimator.predict(x_test)
print("y_predict:\n",y_predict)
print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
    #5.2 Method 2: Calculation accuracy
score=estimator.score(x_test,y_test)
print("Accuracy: \n",score)
# View grid search and cross-validation results
# Best parameter: best_params_
print("Best parameter: \n", estimator.best_params_)
# Best result: best_score_
print("Best result: \n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator :\n", estimator.best_estimator_)
# cross validation result: cv_results_
print("Cross-validation result :\n", estimator.cv_results_)


Regression and clustering algorithm

Linear regression

Refer to the up: www.bilibili.com/video/BV17T…

The teacher gave the definition:

[Linear regression]

  1. Regression problem: the target value is continuous data
  2. Definition: a functional relation (linear model) that links the features to the target value
     y = w1x1 + w2x2 + w3x3 + ... + wnxn + b = w^T x + b
     (a non-linear combination of features is also possible, e.g. y = w1x1 + w2x1^2 + w3x1^3 + w4x2^3 + ...)
  3. Objective: find the model parameters by minimizing the loss function (cost/objective function), e.g. least squares
  4. Optimizing the loss (optimization methods):
     a. GD (gradient descent): use all samples to compute the gradient
     b. SGD (stochastic gradient descent): consider only one sample at a time
     c. SAG (stochastic average gradient)
  5. Regression performance evaluation: mean squared error (MSE) (a NumPy sketch follows)
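A minimal NumPy sketch of the linear model, the closed-form least-squares solution (the normal equation discussed next), and the MSE evaluation, with made-up numbers:

import numpy as np

# Made-up data: 5 samples, 2 features, and targets generated as y = 1*x1 + 2*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 4.0, 11.0, 10.0, 15.0])

# Add a bias column so that y_hat = Xb @ w covers w^T x + b
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation (closed-form least squares): w = (X^T X)^-1 X^T y
w = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
y_hat = Xb @ w

mse = np.mean((y - y_hat) ** 2)  # mean squared error, the regression evaluation metric
print(w, mse)                    # w is approximately [1, 2, 0], mse is approximately 0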

Normal equations

API

from sklearn.linear_model import LinearRegression

Features: fast operation, suitable for small sample data

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.metrics import mean_squared_error

def linear1():
    """Predict Boston housing prices using the normal-equation optimization method"""
    #1. Get the data
    boston=load_boston()
    print("Characteristic Quantity: \n",boston.data.shape)
    #2. Divide the dataset
    x_train,x_test,y_train,y_test=train_test_split(boston.data,boston.target,random_state=22)
    # 3. Standardization
    transfer=StandardScaler()
    x_train=transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. The forecast
    estimator=LinearRegression()
    estimator.fit(x_train,y_train)
    #5. Get the model
    print("Normal equation - weighting coefficient is: \n",estimator.coef_)
    print("Normal equation - bias is: \n",estimator.intercept_)
    #6. Model evaluation
    y_predict=estimator.predict(x_test)
    print("Normal Equation - Forecast House prices: \n",y_predict)
    error=mean_squared_error(y_test,y_predict)
    print("Normal equation - mean square error is: \n",error)


Gradient descent

The code is almost the same as the one above:

from sklearn.linear_model import SGDRegressor

def linear2():
    """Predict Boston housing prices using the gradient-descent optimization method"""
    #1. Get the data
    boston=load_boston()
    #2. Divide the dataset
    x_train,x_test,y_train,y_test=train_test_split(boston.data,boston.target,random_state=22)
    # 3. Standardization
    transfer=StandardScaler()
    x_train=transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. The forecast
    estimator=SGDRegressor(learning_rate="constant",eta0=0.01,max_iter=10000)
    estimator.fit(x_train,y_train)
    #5. Get the model
    print("Gradient descent - weighting coefficient is: \n",estimator.coef_)
    print("Gradient descent - offset to: \n",estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Gradient Descent - Forecast House Prices: \n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent - mean square error: \n", error)


Ridge regression

Here it is mainly the first two methods that have problems, as follows:

[Under-fitting and over-fitting]

  1. Under-fitting: the model performs poorly on the training set and poorly on the test set
  2. Over-fitting: the model performs well on the training set but poorly on the test set. Cause: too many original features, some noisy features, and a model that is too complex because it tries to account for every data point. Solution: regularization. L1 (LASSO): loss function + lambda * penalty term sum(|w|). L2 (Ridge regression, more common): loss function + lambda * penalty term sum(w^2). (The two penalty terms are sketched after this list.)
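A minimal sketch of the two penalty terms (the exact scaling used inside sklearn's Lasso/Ridge may differ, so treat this as illustrative only):

import numpy as np

def l1_penalized_loss(y, y_hat, w, lam):
    """LASSO-style loss: MSE + lambda * sum(|w|) (L1 penalty)."""
    return np.mean((y - y_hat) ** 2) + lam * np.sum(np.abs(w))

def l2_penalized_loss(y, y_hat, w, lam):
    """Ridge-style loss: MSE + lambda * sum(w^2) (L2 penalty)."""
    return np.mean((y - y_hat) ** 2) + lam * np.sum(w ** 2)

# Made-up weights and predictions, just to show the penalties at work
w = np.array([0.5, -2.0, 0.0])
y, y_hat = np.array([1.0, 2.0]), np.array([1.1, 1.8])
print(l1_penalized_loss(y, y_hat, w, lam=0.1))
print(l2_penalized_loss(y, y_hat, w, lam=0.1))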

Ridge regression was introduced to solve this problem.

from sklearn.linear_model import Ridge

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def linear3():
    """Ridge regression to predict Boston housing prices"""
    #1. Get the data
    boston=load_boston()
    #2. Divide the dataset
    x_train,x_test,y_train,y_test=train_test_split(boston.data,boston.target,random_state=22)
    # 3. Standardization
    transfer=StandardScaler()
    x_train=transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. The forecast
    estimator=Ridge(alpha=0.5,max_iter=10000)
    estimator.fit(x_train,y_train)
    #5. Get the model
    print("Ridge regression - weight coefficient is: \n",estimator.coef_)
    print("Ridge regression - bias is: \n",estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Ridge Regression - Forecast Housing Price: \ N", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression - mean square error is: \n", error)



Logistic regression and dichotomy

[The principle of logistic regression]

  1. Input: The output of linear regression is the input of logistic regression
  2. Sigmoid: 1/(1+e^(-x))
  3. Loss function: log-likelihood loss
  4. Loss optimization: gradient descent (the sigmoid and log loss are sketched below)
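A minimal sketch of the sigmoid and the log-likelihood loss with made-up numbers (sklearn's LogisticRegression computes all of this internally):

import numpy as np

def sigmoid(z):
    """Map the linear-regression output to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p_pred, eps=1e-15):
    """Log-likelihood (cross-entropy) loss for binary labels in {0, 1}."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up linear outputs and labels, for illustration only
z = np.array([-2.0, -0.5, 0.3, 2.5])
y = np.array([0, 0, 1, 1])
p = sigmoid(z)
print(p)              # predicted probabilities of class 1
print(log_loss(y, p))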

[Binary classification]

This simply tells you whether the final result is true or not; there are exactly two values, true or false. Typical applications are disease prediction and the like, i.e. whether or not you are sick.

from sklearn.linear_model import LogisticRegression

This is not to be confused with the normal equation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

#1. Read data
path="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_name = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
data=pd.read_csv(path,names=column_name)
#2. Missing value handling
	# 2.1 Replace "?" with np.nan
data=data.replace(to_replace="?",value=np.nan)
	#2.2 Delete missing samples
data.dropna(inplace=True)
# Filter feature values and target values
x=data.iloc[:,1:-1]
y=data["Class"]
#3. Divide the data set
x_train,x_test,y_train,y_test=train_test_split(x,y)
#4. Feature engineering
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
#5. The estimator process
estimator=LogisticRegression()
estimator.fit(x_train,y_train)
# Model parameters of logistic regression: regression coefficients and bias
print("Regression coefficients:\n",estimator.coef_)
#6. Model evaluation
    # Method 1: Directly compare the real and predicted values
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of real value and predicted value :\n", y_test == y_predict)
    # Method 2: Calculation accuracy
score = estimator.score(x_test, y_test)
print("Accuracy: \n", score)


Classification evaluation

What we notice here is that the categories are not just arbitrary categories, but categories with true/false meaning. There is also a problem with the data source: the proportion of diseased samples in the dataset may be small, which can make the predictions for the diseased class inaccurate, so we need dedicated ways to evaluate this.

[Classification assessment methods]

  1. Confusion matrix: TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative
  2. Precision and Recall

Precision: of the samples predicted positive, the proportion that are truly positive. Recall: of the truly positive samples, the proportion predicted positive (the ability to find positive samples). F1-score: reflects the robustness of the model.
3. ROC curve and AUC (can evaluate performance under unbalanced samples):
  3.1 TPR = TP/(TP+FN): of all samples whose true class is 1, the proportion predicted as 1 (this is the recall)
  3.2 FPR = FP/(FP+TN): of all samples whose true class is 0, the proportion predicted as 1
  3.3 AUC can only be used to evaluate binary classification (as above) and is well suited to evaluating classifier performance under sample imbalance
A small numeric sketch of these metrics follows.
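A small numeric sketch of these definitions with invented confusion-matrix counts (in practice sklearn's classification_report and roc_auc_score compute them for you):

# Invented confusion-matrix counts, for illustration only
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)                  # of everything predicted positive, how much is truly positive
recall    = TP / (TP + FN)                  # of all true positives, how many were found (this is also the TPR)
f1        = 2 * precision * recall / (precision + recall)
fpr       = FP / (FP + TN)                  # of all true negatives, how many were wrongly flagged positive
print(precision, recall, f1, fpr)           # 0.8, ~0.889, ~0.842, ~0.182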

from sklearn.metrics import classification_report

from sklearn.metrics import roc_auc_score

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

#1. Read data
path="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_name = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
data=pd.read_csv(path,names=column_name)
#2. Missing value handling
	# 2.1 Replace "?" with np.nan
data=data.replace(to_replace="?",value=np.nan)
	#2.2 Delete missing samples
data.dropna(inplace=True)
# Filter feature values and target values
x=data.iloc[:,1:-1]
y=data["Class"]
#3. Divide the data set
x_train,x_test,y_train,y_test=train_test_split(x,y)
#4. Feature engineering
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
#5. The estimator process
estimator=LogisticRegression()
estimator.fit(x_train,y_train)
# Model parameters of logistic regression: regression coefficients and bias
print("Regression coefficients:\n",estimator.coef_)
#6. Model evaluation
    # Method 1: Directly compare the real and predicted values
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of real value and predicted value :\n", y_test == y_predict)
    # Method 2: Calculation accuracy
score = estimator.score(x_test, y_test)
print("Accuracy: \n", score)

# Check accuracy, recall, F1-score
report=classification_report(y_test,y_predict,labels=[2,4],target_names=["Benign","Malignant"])
print(report)
# AUC indicators
#y_true: The true category of each sample, must be marked 0 (negative example), 1 (positive example)
#y_test converted to 0, 1
y_true=np.where(y_test>3,1,0)
auc=roc_auc_score(y_true,y_predict)
print("auc:\n",auc)


K-means unsupervised clustering algorithm

[Unsupervised learning]

  1. Definition: no target value, i.e. unsupervised learning
  2. Included algorithms: clustering: K-Means (K-means clustering); dimensionality reduction: PCA

[K-means algorithm steps]

  1. Randomly set K points in the feature space as the initial cluster centers
  2. Calculate the distance from every other point to the K centers, and assign each point to the cluster of its nearest center

    1. Then recompute the new center point (the mean) of each cluster from the points assigned to it

    2. If the newly calculated center points are the same as the original ones, stop; otherwise repeat from step 2

    3. The advantages and disadvantages

      Advantages:

      Using iterative algorithm, intuitive and very practical

      Disadvantages:

      Easily converges to a local optimum (solution: run the clustering several times)

Note: Clustering is usually done before classification!
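A minimal sketch of the K-Means loop described above (assignment step plus centroid update) on made-up 2D points; sklearn's KMeans repeats this until the centers stop moving:

import numpy as np

# Made-up 2D points forming two obvious blobs, and K = 2 deliberately poor initial centers
points  = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4], [5.0, 5.0], [5.2, 4.8], [4.8, 5.1]])
centers = points[[0, 1]].astype(float)

for _ in range(10):  # usually stops earlier, once the centers no longer change
    # 1) assign each point to its nearest center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 2) recompute each center as the mean of the points assigned to it
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)  # ends up near (0.23, 0.2) and (5.0, 4.97), the means of the two blobs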

API:

from sklearn.cluster import KMeans

Evaluation:

from sklearn.metrics import silhouette_score

Here is an example:

The goal is to explore users' preferences for item-category subdivisions, i.e. to use cluster analysis to group users by the item categories they like.

This has already been done

Data set:

Link: pan.baidu.com/s/1P9xwvyYA… Extraction code: 6666

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
#1. Get the data
order_products=pd.read_csv("./data/order_products__prior.csv")
products=pd.read_csv("./data/products.csv")
orders=pd.read_csv("./data/orders.csv")
aisles=pd.read_csv("./data/aisles.csv")
# 2. Merge tables
# merge aisles with Products aisle and product_id
tab1=pd.merge(aisles,products,on=["aisle_id","aisle_id"])[:100]
tab2=pd.merge(tab1,order_products,on=["product_id","product_id"])[:100]
tab3=pd.merge(tab2,orders,on=["order_id","order_id"])[:100]
#3. Find the relationship between user_id and aisle
table=pd.crosstab(tab3["user_id"],tab3["aisle"])
data=table[:10000]# Take a portion to save time
# 4. PCA dimension reduction
# 4.1 Instantiate the transformer class
transfer=PCA(n_components=0.95)
# 4.2 Call fit_transform
data_new=transfer.fit_transform(data)

#5. The estimator process
estimator=KMeans(n_clusters=3)
estimator.fit(data_new)
y_predict=estimator.predict(data_new)
# Model evaluation - profile coefficient
silhouette_num=silhouette_score(data_new,y_predict)
print("Contour coefficient: \n",silhouette_num)


Model loading and saving


import joblib
# 1. Save
joblib.dump(estimator, 'test.pkl')
# 2. Load
estimator = joblib.load('test.pkl')

Summary map (rough)