@[toc]
Basic Algorithm use (Sklearn)
This section covers calling some basic algorithms through sklearn. Because sklearn exposes all of them through a unified estimator interface, the calling convention is the same for every algorithm.
(PS: I will not give mathematical derivations for these basic algorithms here; there is a lot of material and much of it I am already familiar with and do not want to write out. This blog is a condensed study summary of notes taken while learning from Bilibili videos.)
Operator API call steps
The estimator workflow:
- estimator = SomeAlgorithmAPI()
- estimator.fit(x_train, y_train) performs the computation and trains the model
- estimator.predict(x_test) predicts on the test set
- estimator.score(x_test, y_test) evaluates the model
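Put together, the generic workflow looks roughly like this (a minimal sketch; KNeighborsClassifier stands in for any operator API here, and x_train / y_train / x_test / y_test are assumed to already exist):
# A minimal sketch of the generic estimator workflow; the train/test splits are assumed to exist.
from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier()           # 1. instantiate an operator API
estimator.fit(x_train, y_train)              # 2. run the computation and train the model
y_predict = estimator.predict(x_test)        # 3. predict on the test set
score = estimator.score(x_test, y_test)      # 4. evaluate the model (accuracy)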
Data analysis step by step
1. Load data
2. Data standardization and normalization
3. Data feature extraction, dimensionality reduction, etc.
4. Data cutting and division, training set and test set
5. Use the operator API
6. Use estimator for training
7. Perform predictive evaluation on the data
8. Optimize according to test results
9. Save the optimized algorithm model
With that in place, here are the individual algorithms for processing data.
Classification algorithm
KNN algorithm
Description:
A sample belongs to a category if the majority of its k most similar samples (its k nearest neighbors) in the feature space belong to that category.
Used for classification prediction. For example, categorizing map locations to predict where users gather most (and then running ads there; see the Facebook example below: predicting Facebook check-in locations).
Principle:
For a given sample, compute its distance to the other points (for example the Euclidean distance) and use the nearest points to decide which category it belongs to.
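For instance, the Euclidean distance between two feature vectors can be computed like this (a small illustrative sketch; the two vectors are made up):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
distance = np.sqrt(np.sum((a - b) ** 2))   # equivalent to np.linalg.norm(a - b)
print(distance)                            # 5.0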
Advantages:
- Easy to implement, no training required
Disadvantages:
- Lazy algorithm: classifying test samples requires heavy computation and has a large memory overhead
- The value of K must be specified, and classification accuracy cannot be guaranteed if K is chosen poorly (some tuning is needed; here we use sklearn's grid search)
The corresponding API:
from sklearn.neighbors import KNeighborsClassifier
Code examples:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Use KNN to classify irises
def knn_iris():
    """Classify irises with the KNN algorithm"""
    # 1. Get the data
    iris = load_iris()
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. KNN algorithm estimator
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # 5.1 Method 1: directly compare the real and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of real value and predicted value:\n", y_test == y_predict)
    # 5.2 Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
out:
y_predict:
 [0 2 0 0 2 1 1 0 2 1 2 2 1 1 2 1 1 2 1 1 1 0 0 2 0 1 1 1 2 0 1 0 1 0 0 1 2 1 2 2]
Direct comparison of real value and predicted value:
 [ True True True True True True False True True True True True True True True False True True True True True True True True True True True True True True True True True False True True]
Accuracy:
 0.9210526315789
Grid search optimization
from sklearn.model_selection import GridSearchCV
Previously we hard-coded the parameter: estimator = KNeighborsClassifier(n_neighbors=3).
sklearn provides a way to search over the n_neighbors parameter automatically:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
def knn_iris_gsv():
    """Classify irises with the KNN algorithm, adding grid search and cross-validation"""
    # 1. Get the data
    iris = load_iris()
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. KNN algorithm estimator
    estimator = KNeighborsClassifier()
    # Add grid search and cross-validation
    # Parameter preparation
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}  # candidate estimator parameters
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)  # estimator: estimator object, cv: number of folds; fit(): feed in training data, score(): accuracy
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # 5.1 Method 1: directly compare the real and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of real value and predicted value:\n", y_test == y_predict)
    # 5.2 Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # View grid search and cross-validation results
    # Best parameter: best_params_
    print("Best parameter:\n", estimator.best_params_)
    # Best result: best_score_
    print("Best result:\n", estimator.best_score_)
    # Best estimator: best_estimator_
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results: cv_results_
    print("Cross-validation results:\n", estimator.cv_results_)
Here’s the point:
param_dict={"n_neighbors": [1.3.5.7.9.11],}# estimator parameters
estimator=GridSearchCV(estimator,param_grid=param_dict,cv=10)#estimator: estimator object, CV: several fold cross validation, FIT (): input training data, score(): accuracy
KNN case (Predicting Facebook check-in location)
This case is the Facebook check-in location prediction case.
It requires data cleaning with pandas. I will go into a bit more detail here because I think the data processing is the hard part.
Data set:
Link: pan.baidu.com/s/1nojpx6ov… Extraction code: 6666
Data cleaning
First, the data looks like this:
   row_id       x       y  accuracy    time    place_id
0       0  0.7941  9.0809        54  470702  8523065625
1       1  5.9567  4.7968        13  186555  1757726713
2       2  8.3078  7.0407        74  322648  1137537235
3       3  7.3665  2.5165        65  704587  6567393236
4       4  4.0961  1.1307        31  472130  7440663949
(29118021, 6)
data = pd.read_csv("../input/train.csv")
Then we do the basic data processing: first convert the time format, then extract the features x and the target y.
After grouping, the check-in counts per place_id (used later to filter out low-frequency locations) look like this:
place_id
1014605271 28
1015645743 4
1017236154 31
1024951487 5
1028119817 4
Name: row_id, dtype: int64
So the code looks like this:
# 1) Narrow the data range
data = data.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0")
# 2) Process the time feature
time_value = pd.to_datetime(data["time"], unit="s")
date = pd.DatetimeIndex(time_value)
data["day"] = date.day
data["weekday"] = date.weekday
data["hour"] = date.hour
# 3) Filter out locations with low check-in frequency
place_count = data.groupby("place_id").count()["row_id"]
data_final = data[data["place_id"].isin(place_count[place_count > 3].index.values)]
# 4) Filter the feature values and target values
x = data_final[["x", "y", "accuracy", "day", "weekday", "hour"]]
y = data_final["place_id"]
KNN handles predictions
This step is just the standard routine by now:
# 5) Data set partitioning
x_train,x_test,y_train,y_test=train_test_split(x,y)
#3. Feature Engineering: Standardization
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
#4.KNN algorithm predictor
estimator=KNeighborsClassifier()
# Add grid search and cross-validation
# Parameter preparation
param_dict = {"n_neighbors": [3, 5, 7, 9]}  # candidate estimator parameters
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)  # estimator: estimator object, cv: number of folds; fit(): feed in training data, score(): accuracy
estimator.fit(x_train,y_train)
#5. Model evaluation
#5.1 Method 1: Directly compare the real and predicted values
y_predict=estimator.predict(x_test)
print("y_predict:\n",y_predict)
print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
#5.2 Method 2: Calculation accuracy
score=estimator.score(x_test,y_test)
print("Accuracy: \n",score)
# View grid search and cross-validation results
# Best parameter: best_params_
print("Best parameter: \n", estimator.best_params_)
# Best result: best_score_
print("Best result: \n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator :\n", estimator.best_estimator_)
# cross validation result: cv_results_
print("Cross-validation result :\n", estimator.cv_results_)
The complete code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
# 1. Get the data
data=pd.read_csv("./FBlocation/train.csv")
# 2. Basic data processing
# 1) Narrow the data range
data = data.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0")
# 2) Process the time feature
time_value = pd.to_datetime(data["time"], unit="s")
date = pd.DatetimeIndex(time_value)
data["day"] = date.day
data["weekday"] = date.weekday
data["hour"] = date.hour
# 3) Filter out locations with low check-in frequency
place_count = data.groupby("place_id").count()["row_id"]
data_final = data[data["place_id"].isin(place_count[place_count > 3].index.values)]
# 4) Filter the feature values and target values
x = data_final[["x", "y", "accuracy", "day", "weekday", "hour"]]
y = data_final["place_id"]
# 5) Data set partitioning
x_train,x_test,y_train,y_test=train_test_split(x,y)
#3. Feature Engineering: Standardization
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
#4.KNN algorithm predictor
estimator=KNeighborsClassifier()
# Add grid search and cross-validation
# Parameter preparation
param_dict = {"n_neighbors": [3, 5, 7, 9]}  # candidate estimator parameters
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)  # estimator: estimator object, cv: number of folds; fit(): feed in training data, score(): accuracy
estimator.fit(x_train,y_train)
#5. Model evaluation
#5.1 Method 1: Directly compare the real and predicted values
y_predict=estimator.predict(x_test)
print("y_predict:\n",y_predict)
print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
#5.2 Method 2: Calculation accuracy
score=estimator.score(x_test,y_test)
print("Accuracy: \n",score)
# View grid search and cross-validation results
# Best parameter: best_params_
print("Best parameter: \n", estimator.best_params_)
# Best result: best_score_
print("Best result: \n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator :\n", estimator.best_estimator_)
# cross validation result: cv_results_
print("Cross-validation result :\n", estimator.cv_results_)
Naive Bayes algorithm
This is a probabilistic prediction method: it estimates the probability that a sample falls into a certain category.
"Naive" refers to the simplifying assumption that the features in the dataset are independent of each other (the dataset may be incomplete, and dependencies are ignored).
- Joint probability: the probability that multiple conditions hold at the same time
- Conditional probability: the probability of event A given that event B has already occurred
- Mutual independence: if P(A,B) = P(A)P(B), events A and B are said to be independent
- Bayes' formula: P(C|W) = P(W|C)P(C) / P(W)
- "Naive": the features are assumed to be mutually independent
- Naive Bayes algorithm = naive assumption + Bayes' formula
- Typical application: text classification (with words as features)
- Laplacian smoothing: P(F1|C) = (Ni + α) / (N + αm), where α is a specified smoothing coefficient and m is the total number of distinct features that appear in the training documents
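As a toy illustration of Bayes' formula (the numbers here are made up): if P(C) = 0.3, P(W|C) = 0.5 and P(W) = 0.2, then P(C|W) = 0.5 × 0.3 / 0.2 = 0.75.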
Call API:
from sklearn.naive_bayes import MultinomialNB
Let’s go straight to the example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
def nb_news():
    """Classify news with the naive Bayes algorithm"""
    # 1. Get the data
    news = fetch_20newsgroups(subset="all")  # subset="all" fetches all the data, "train" fetches only the training data
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    # 3. Feature engineering: text feature extraction - TF-IDF
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Naive Bayes algorithm estimator flow
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the real and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of real value and predicted value:\n", y_test == y_predict)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
y_predict:
 [15  2 16 ... 11 10 18]
Direct comparison of real value and predicted value:
 [False  True False ...  True  True  True]
Accuracy:
 0.8503820033955858
(PS: these can also be tuned with grid search!)
Decision trees
You know all about this stuff!
Reference (Bilibili video): www.bilibili.com/video/BV1Xp…
[Fundamentals of information Theory]
- Information (Shannon's definition): that which eliminates random uncertainty
- Measure of information: information entropy (unit: bit)
- Information gain: g(D, A) = H(D) - H(D|A)
- One basis for splitting nodes in a decision tree: information gain (bigger is better)
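A small illustrative sketch of these two quantities (the class distributions below are made up):
import numpy as np

def entropy(probs):
    """Information entropy H, in bits, of a class-probability distribution."""
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))

h_d = entropy([0.5, 0.5])                                             # H(D) = 1.0 bit
h_d_given_a = 0.5 * entropy([0.8, 0.2]) + 0.5 * entropy([0.2, 0.8])   # H(D|A), weighted by branch size
gain = h_d - h_d_given_a                                              # g(D, A) = H(D) - H(D|A), about 0.28
print(h_d, h_d_given_a, gain)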
Advantages:
- Visualization – Strong explanatory ability
Disadvantages:
- Overly complex trees do not generalize well (overfitting)
Improvement:
- Pruning CART algorithm
- Random forests
Call API
from sklearn.tree import DecisionTreeClassifier
case
The decision tree classifies iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,export_graphviz
def decision_iris():
    """Classify irises with a decision tree"""
    # 1. Get the dataset
    iris = load_iris()
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # 3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train, y_train)
    # 4. Model evaluation
    # Method 1: directly compare the real and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of real value and predicted value:\n", y_test == y_predict)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Visualize the decision tree
    export_graphviz(estimator, out_file="iris_tree.dot", feature_names=iris.feature_names)
Visualizing the decision tree
export_graphviz(estimator, out_file="iris_tree.dot", feature_names=iris.feature_names)
This generates a .dot file that you can paste into an online Graphviz viewer to render.
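Alternatively, if the graphviz Python package and the Graphviz binaries are installed locally (an assumption, not part of the original steps), the .dot file can be rendered offline, roughly like this:
import graphviz

with open("iris_tree.dot") as f:
    dot_source = f.read()
graphviz.Source(dot_source).render("iris_tree", format="png")  # writes iris_tree.png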
Random forests
Reference (Bilibili video): www.bilibili.com/video/BV1H5…
In a word: let many decision trees each make a prediction and combine the results.
Ensemble learning solves a single prediction problem by combining several models: multiple classifiers/models learn and predict independently, and their predictions are then combined into one composite prediction, which is usually better than any single model's prediction. Random forest: a classifier made up of multiple decision trees, where both the training set and the features are sampled randomly.
API
from sklearn.ensemble import RandomForestClassifier
Data set:
Link: pan.baidu.com/s/1cocizyxt… Extraction code: 6666
Random Forest predicts the survival of Titanic passengers
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Fetch data
path="./titanic.csv"
titanic=pd.read_csv(path)
# 1. Filter the feature values and target values
x = titanic[["pclass", "age", "sex"]]
y=titanic["survived"]
#2. Data processing
#2.1 Missing value processing
x["age"].fillna(x["age"].mean(),inplace=True)
#2.2 Convert to a dictionary
x=x.to_dict(orient="records")
#3. Data set division
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=22)
#4. Dictionary feature extraction
transfer=DictVectorizer()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
estimator=RandomForestClassifier()
# Add grid search and cross-validation
# Parameter preparation
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}  # candidate estimator parameters
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)  # estimator: estimator object, cv: number of folds; fit(): feed in training data, score(): accuracy
estimator.fit(x_train,y_train)
#5. Model evaluation
#5.1 Method 1: Directly compare the real and predicted values
y_predict=estimator.predict(x_test)
print("y_predict:\n",y_predict)
print("Direct comparison of real value and predicted value: \n",y_test==y_predict)
#5.2 Method 2: Calculation accuracy
score=estimator.score(x_test,y_test)
print("Accuracy: \n",score)
# View grid search and cross-validation results
# Best parameter: best_params_
print("Best parameter: \n", estimator.best_params_)
# Best result: best_score_
print("Best result: \n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator :\n", estimator.best_estimator_)
# cross validation result: cv_results_
print("Cross-validation result :\n", estimator.cv_results_)
Regression and clustering algorithm
Linear regression
Reference (Bilibili video): www.bilibili.com/video/BV17T…
The teacher gave the definition:
[Linear regression]
- Regression problems: the target value is continuous data
- Definition
Functional relation (linear model): establishes a relation between the features and the target value,
y = w1*x1 + w2*x2 + w3*x3 + ... + wn*xn + b = w^T x + b
(a nonlinear extension would be y = w1*x1 + w2*x1^2 + w3*x1^3 + w4*x2^3 + ...)
Objective: find the model parameters by minimizing the loss function (also called the cost or objective function), here the least-squares loss.
Optimization methods: GD (gradient descent) computes the gradient over all samples; SGD (stochastic gradient descent) considers only one sample at a time; SAG (stochastic average gradient).
Regression performance evaluation: mean squared error (MSE).
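A minimal numeric sketch of the linear model and the mean-squared-error loss it tries to minimize (the weights and samples below are made up):
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])   # feature matrix
y = np.array([5.1, 3.9, 9.2])                        # target values
w = np.array([1.0, 2.0])                             # weight coefficients
b = 0.0                                              # bias

y_hat = X @ w + b                                    # y = w1*x1 + w2*x2 + b
mse = np.mean((y - y_hat) ** 2)                      # mean squared error (the loss being minimized)
print(y_hat, mse)                                    # [5. 4. 9.], 0.02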
Normal equations
API
from sklearn.linear_model import LinearRegression
Features: fast operation, suitable for small sample data
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.metrics import mean_squared_error
def linear1():
    """Predict Boston housing prices with the normal-equation optimization method"""
    # 1. Get the data
    boston = load_boston()
    print("Number of features:\n", boston.data.shape)
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Normal equation - predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Normal equation - mean squared error:\n", error)
Gradient descent
The code is essentially the same as above:
from sklearn.linear_model import SGDRegressor
def linear2():
    """Predict Boston housing prices with the gradient-descent optimization method"""
    # 1. Get the data
    boston = load_boston()
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=10000)
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Gradient descent - predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent - mean squared error:\n", error)
Ridge regression
The motivation here is that the first two methods can run into the following problems:
[Under-fitting and over-fitting]
- Underfitting: poor performance on both the training set and the test set
- Overfitting: performs well on the training set but poorly on the test set. Cause: too many original features, some noisy features, and a model that is too complex because it tries to account for every data point. Solution: regularization. L1 (LASSO): loss function + λ·Σ|w| penalty term. L2 (ridge regression, more common): loss function + λ·Σw² penalty term.
Ridge regression was introduced to solve this problem.
from sklearn.linear_model import Ridge
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
def linear3():
    """Predict Boston housing prices with ridge regression"""
    # 1. Get the data
    boston = load_boston()
    # 2. Divide the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = Ridge(alpha=0.5, max_iter=10000)
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Ridge regression - weight coefficients:\n", estimator.coef_)
    print("Ridge regression - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Ridge regression - predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression - mean squared error:\n", error)
Logistic regression and dichotomy
[The principle of logistic regression]
- Input: The output of linear regression is the input of logistic regression
- Sigmoid: 1/(1+e^(-x))
- Loss function: log-likelihood loss
- Optimizing the loss: gradient descent
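A minimal sketch of the sigmoid function, which squashes the linear output into a probability in (0, 1):
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))   # 0.5
print(sigmoid(5))   # about 0.993; with a 0.5 threshold this is classified as the positive class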
[Dichotomy]
This just means the final result takes one of two values, true or false; a typical application is disease prediction, i.e. judging whether or not someone is sick.
from sklearn.linear_model import LogisticRegression
This is not to be confused with the normal equation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
#1. Read data
path="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
data=pd.read_csv(path,names=column_name)
#2. Missing value handling
# 2.1 Replace "?" with np.nan
data=data.replace(to_replace="?",value=np.nan)
#2.2 Delete missing samples
data.dropna(inplace=True)
# Filter the feature values and target values
x = data.iloc[:, 1:-1]
y=data["Class"]
#3. Divide the data set
x_train,x_test,y_train,y_test=train_test_split(x,y)
#4. Feature engineering
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
#5. The estimator process
estimator=LogisticRegression()
estimator.fit(x_train,y_train)
# Model parameters of logistic regression: regression coefficients and bias
print("Regression coefficients:\n", estimator.coef_)
print("Bias:\n", estimator.intercept_)
#6. Model evaluation
# Method 1: Directly compare the real and predicted values
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of real value and predicted value :\n", y_test == y_predict)
# Method 2: Calculation accuracy
score = estimator.score(x_test, y_test)
print("Accuracy: \n", score)
Classification evaluation
The point here is that the categories are not just arbitrary labels but true/false classes. There is also a problem with the data source: the proportion of diseased samples in the dataset may be small, which can make the predictions for the diseased class unreliable even when overall accuracy looks fine, so we need dedicated evaluation methods.
[Classification assessment methods]
- Confusion matrix: TP (true positive), FP (false positive), FN (false negative), TN (true negative)
- Precision and Recall
Precision: of the samples predicted positive, the proportion that are truly positive. Recall: of the truly positive samples, the proportion that are predicted positive (the ability to find all positive samples). F1-score: reflects the robustness of the model.
ROC curve and AUC (can measure performance under imbalanced samples): TPR = TP/(TP+FN), the proportion of samples whose true category is 1 that are predicted as 1 (this is the recall); FPR = FP/(FP+TN), the proportion of samples whose true category is 0 that are predicted as 1. AUC can only be used to evaluate binary classification (as above) and is well suited to evaluating classifier performance under sample imbalance.
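A small check of these definitions on a made-up confusion matrix (the TP/FP/FN/TN counts below are illustrative numbers, not results from the case):
TP, FP, FN, TN = 40, 10, 5, 45          # made-up confusion-matrix counts

precision = TP / (TP + FP)              # 0.8 - of the predicted positives, how many are truly positive
recall = TP / (TP + FN)                 # about 0.889 - this is also the TPR
fpr = FP / (FP + TN)                    # about 0.182
f1 = 2 * precision * recall / (precision + recall)   # about 0.842
print(precision, recall, fpr, f1)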
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
#1. Read data
path="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
data=pd.read_csv(path,names=column_name)
#2. Missing value handling
# 2.1 Replace "?" with np.nan
data=data.replace(to_replace="?",value=np.nan)
#2.2 Delete missing samples
data.dropna(inplace=True)
# Filter the feature values and target values
x = data.iloc[:, 1:-1]
y=data["Class"]
#3. Divide the data set
x_train,x_test,y_train,y_test=train_test_split(x,y)
#4. Feature engineering
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
x_test=transfer.transform(x_test)
#5. The estimator process
estimator=LogisticRegression()
estimator.fit(x_train,y_train)
# Model parameters of logistic regression: regression coefficients and bias
print("Regression coefficients:\n", estimator.coef_)
print("Bias:\n", estimator.intercept_)
#6. Model evaluation
# Method 1: Directly compare the real and predicted values
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of real value and predicted value :\n", y_test == y_predict)
# Method 2: Calculation accuracy
score = estimator.score(x_test, y_test)
print("Accuracy: \n", score)
# Check accuracy, recall, F1-score
report = classification_report(y_test, y_predict, labels=[2, 4], target_names=["Benign", "Malignant"])
print(report)
# AUC indicators
#y_true: The true category of each sample, must be marked 0 (negative example), 1 (positive example)
#y_test converts to 0, 1
y_true = np.where(y_test > 3, 1, 0)
auc=roc_auc_score(y_true,y_predict)
print("auc:\n",auc)
K-means unsupervised clustering algorithm
[Unsupervised learning]
- Definition: no target value, i.e. unsupervised learning
- Algorithms included: clustering (K-means), dimensionality reduction (PCA)
[K-means algorithm steps]
1. Randomly set K points in the feature space as the initial cluster centers.
2. Calculate the distance from every other point to the K centers, and assign each point to the nearest cluster center as its category.
3. Recompute the new center point (the mean) of each cluster from the points now assigned to it.
4. If the new center points are the same as the previous ones, stop; otherwise repeat from step 2.
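A minimal numpy sketch of one such iteration (assignment plus center update); the data and K below are made up, and this is not how the sklearn estimator is implemented:
import numpy as np

def kmeans_step(X, centers):
    """One K-means iteration: assign points to their nearest center, then recompute centers."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # distance of every point to every center
    labels = dists.argmin(axis=1)                                        # nearest-center label for each point
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
    return labels, new_centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centers = X[:2].copy()                      # K = 2 initial centers (normally chosen at random)
labels, centers = kmeans_step(X, centers)
print(labels, centers)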
The advantages and disadvantages
Advantages:
Using iterative algorithm, intuitive and very practical
Disadvantages:
Easily converges to a local optimum (solution: run the clustering multiple times)
Note: Clustering is usually done before classification!
API:
from sklearn.cluster import KMeans
Evaluation:
from sklearn.metrics import silhouette_score
Here is an example:
The goal is to explore users' preferences for subdivided item categories; in this case, we use cluster analysis to obtain groupings of the item categories users like.
This has already been done
Data set:
Link: pan.baidu.com/s/1P9xwvyYA… Extraction code: 6666
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
#1. Get the data
order_products=pd.read_csv("./data/order_products__prior.csv")
products=pd.read_csv("./data/products.csv")
orders=pd.read_csv("./data/orders.csv")
aisles=pd.read_csv("./data/aisles.csv")
# 2. Merge tables
# Merge aisles and products on aisle_id, then merge in order_products and orders
tab1 = pd.merge(aisles, products, on=["aisle_id", "aisle_id"])[:100]
tab2 = pd.merge(tab1, order_products, on=["product_id", "product_id"])[:100]
tab3 = pd.merge(tab2, orders, on=["order_id", "order_id"])[:100]
#3. Find the relationship between user_id and aisle
table=pd.crosstab(tab3["user_id"],tab3["aisle"])
data=table[:10000]# Take a portion to save time
# 4. PCA dimension reduction
# 4.1 Instantiate the transformer class
transfer=PCA(n_components=0.95)
# 4.2 Call fit_transform
data_new=transfer.fit_transform(data)
#5. The estimator process
estimator=KMeans(n_clusters=3)
estimator.fit(data_new)
y_predict=estimator.predict(data_new)
# Model evaluation - silhouette coefficient
silhouette_num = silhouette_score(data_new, y_predict)
print("Silhouette coefficient:\n", silhouette_num)
Model loading and saving
import joblib
# 1. Save
joblib.dump(estimator, 'test.pkl')
# 2. Load
estimator = joblib.load('test.pkl')