Python Machine Learning Notes
3-Day Quick Start to Python Machine Learning, from Dark Horse Programmer
www.bilibili.com/video/BV1nt…
1. Overview of Machine Learning
1.1 Overview of artificial intelligence
1.1.1 The relationship between machine learning, artificial intelligence and deep learning
Machine learning, artificial intelligence, and deep learning:
- Machine learning is an approach to artificial intelligence
- Deep learning is an evolution of machine learning
- Dartmouth Conference: the beginning of artificial intelligence
In 1956, a group of computer scientists met at the Dartmouth Conferences and coined the concept of “artificial intelligence.”
1.1.2 What can machine learning and deep learning do
Machine learning has many application scenarios; it can be said to permeate all walks of life: medical care, aviation, education, logistics, e-commerce, and so on.
- Used in the field of mining and prediction:
- Application scenarios: store sales forecast, quantitative investment, advertising recommendation, enterprise user classification…
- Used in the field of images:
- Application scenarios: street traffic sign detection, face recognition and so on
- Used in the field of natural language processing:
- Application scenarios: text classification, sentiment analysis, automatic chat, text detection and so on.
What matters now is mastering machine learning algorithms and related skills so as to solve problems from an industry perspective.
1.2 What is Machine learning
1.2.1 definition
Machine learning: automatically analyze data to obtain a model, and use the model to make predictions about unknown data.
1.2.2 Data set composition
- Structure: feature values + target value
Note:
- For each row of data we call it a sample.
- Some data may have no target value
1.3 Classification of machine learning algorithms
1.3.1 Supervised learning
- Classification problems: there is a target value, and the task is to judge the category
- Regression problems: there is a target value, and it is continuous data
1.3.2 Unsupervised learning
- Unsupervised learning: No target value
1.4 Machine learning development process
Machine learning development process:
- Obtain the data
- Data processing
- Feature engineering
- Machine learning algorithm training — obtain a model
- Model evaluation
- Application
1.5 Learning framework and material introduction
Make a few points clear:
- Algorithms are the core, data and calculation are the foundation
- Find your own positioning
Most algorithms are implemented by dedicated algorithm engineers; we just need to:
- Analyze a lot of data
- Analyze the specific business
- Apply common algorithms
- Feature engineering, parameter tuning, optimization
Machine learning libraries and frameworks:
2. Feature Engineering
2.1 Data sets
- Goal:
- Know that data sets are divided into training sets and test sets
- Use the sklearn data sets
- Application:
- None
2.1.1 Available data sets
Kaggle: www.kaggle.com/
UCI data sets: archive.ics.uci.edu/ml
Scikit-learn data sets: scikit-learn.org/stable/data…
1. Introduction to scikit-learn
- Machine learning tools for Python
- Scikit-learn includes the implementation of many well-known machine learning algorithms
- Scikit-learn is well documented, easy to use, and has a rich API
- Latest stable release 0.24
2. Installation
conda install -c conda-forge scikit-learn
3. Contents of scikit-learn
2.1.2 Sklearn dataset
1. Introduction to the scikit-learn dataset API
- sklearn.datasets
- Load to get popular data sets
- datasets.load_*()
- Get a small set of data contained in datasets
- datasets.fetch_*(data_home=None)
- The first argument of this function is data_home, which indicates the directory to which the data set is downloaded.
2. Return type of the sklearn data sets
- Load and fetch return data type datasets.base.Bunch(dictionary format)
from sklearn.datasets import load_iris


def datasets_demo():
    """Use the sklearn data sets."""
    # 1. Get the data
    iris = load_iris()
    print("Iris data set:\n", iris)
    print("Data set description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)
    return None
2.1.3 Data set division
The general data set of machine learning is divided into two parts:
- Training data: used for training and modeling
- Test data: Used during model validation to evaluate the validity of the model
Division ratio:
- Training set: 70% 80% 90%
- Test set: 30% 20% 10%
Data set partitioning API
- sklearn.model_selection.train_test_split(arrays, *options)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def datasets_demo():
    """Use the sklearn data sets and split them into training and test sets."""
    # 1. Get the data
    iris = load_iris()
    print("Iris data set description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)
    # 2. Split the data set: 80% training, 20% test
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
    print("Training set features:\n", x_train, x_train.shape)
    return None
2.2 Introduction to feature engineering
Learning Objectives:
- Understand the importance of feature engineering in machine learning
- Know the classification of feature engineering
2.2.1 Why is Feature Engineering needed?
It is widely held in the industry that data and features determine the upper limit of machine learning, and that models and algorithms merely approach that upper limit.
2.2.2 What is feature engineering
Feature engineering is the process of using professional background knowledge and skills to process data so that features can play a better role in machine learning algorithms.
Significance: Directly affect the effect of machine learning
2.2.3 The position of feature engineering compared with data processing
- Pandas: A tool for easy reading and basic manipulation of data formats
- Sklearn: provides a powerful interface for feature processing
2.3 Feature Extraction
Goal:
- Apply DictVectorizer to numericalize and discretize categorical features
- Apply CountVectorizer to numericalize text features
- Apply TfidfVectorizer to numericalize text features
- Distinguish between the two text feature extraction methods
2.3.1 Feature extraction
1. Convert arbitrary data (such as text or images) into numerical features that machine learning can use
Note: converting to numerical features helps the computer understand the data better
- Dictionary feature extraction (Feature discretization)
- Text feature extraction
- Image feature extraction
2. Feature extraction API
sklearn.feature_extraction
2.3.2 Dictionary feature extraction
Function: numericalize dictionary data
sklearn.feature_extraction.DictVectorizer(…)
Example:
from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    """Dictionary feature extraction."""
    data = [{'city': "Beijing", 'temperature': 100}, {'city': "Shanghai", 'temperature': 60}, {'city': "Shenzhen", 'temperature': 30}]
    # 1. Instantiate a transformer
    # sparse=False means do not return a sparse matrix (a sparse matrix stores only the non-zero values)
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("New data:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names())
    return None
Execution Result:
2.3.3 Text feature extraction
1. Function: numericalize text data
- sklearn.feature_extraction.text.CountVectorizer(stop_words=[]): returns a word-frequency matrix
- CountVectorizer.fit_transform(X) — X: text or an iterable of text strings; return value: a sparse matrix
- CountVectorizer.inverse_transform(X) — X: array or sparse matrix; return value: the data in its format before conversion
- CountVectorizer.get_feature_names(): returns a list of words
from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    """Text feature extraction."""
    data = ["lift is short,i like like python", "lift is long,i dislike python"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
2. Running results
3. Chinese extraction example. Note: because there are no spaces between Chinese words, the program cannot recognize individual words, so the text must first be separated with spaces. The example below uses manually space-separated text for verification; automatic word segmentation follows.
def count_chinese_demo():
    """Chinese text feature extraction (text pre-segmented with spaces)."""
    data = ["I love Tiananmen Square in Beijing", "The sun rises over Tiananmen Square"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
4. Chinese extraction: use jieba for automatic word segmentation
import jieba


def count_chinese_demo2():
    """Chinese text feature extraction with automatic word segmentation."""
    data = ["The so-called training instructor is actually a self-employed marketing professional infected with COVID-19. He infected 102 people, the oldest of whom was born in 1933.",
            "At the press conference, the relevant person in charge of the market supervision department of Jilin Province said that the relevant departments of the two places have carried out a joint investigation, and any illegal behavior will be severely punished.",
            "Under the current epidemic situation, health care workers are still tightly wrapped in thick protective clothing, fighting the virus on the front line, and community volunteers still stand guard downstairs for residents in gusts of cold wind; yet at this very moment someone held 'health training' activities to promote products, putting forward the claim that 'the elderly will be infected with the coronavirus'. Although done under the banner of caring for the health of the elderly, it amounts to seeking wealth at the cost of lives!"]
    # 1. Segment the text, joining the words with spaces
    data_new = []
    for item in data:
        data_new.append(" ".join(list(jieba.cut(item))))
    # 2. Instantiate a transformer
    transfer = CountVectorizer(stop_words=["102", "1933", "A"])
    # 3. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
Running results:
Limitation of the above method: when extracting features from articles, a word may appear many times in one article yet only rarely in other articles; such words matter for classification, but plain word counts cannot highlight them.
Solution: Use TfidfVectorizer
5. Tf-idf text feature extraction
- The main idea of TF-IDF: If a certain word or phrase has a high probability of occurrence in one article and rarely appears in other articles, it is considered that the word or phrase has good classification ability and is suitable for classification.
- Tf-idf: Used to evaluate the importance of a word to one document in a document set or a corpus.
- TF: Term Frequency
- IDF: inverse document frequency
- API: sklearn.feature_extraction.text.TfidfVectorizer()
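For reference, the basic tf-idf formula (the original note's formula image is not included here; scikit-learn's TfidfVectorizer additionally applies smoothing and normalization on top of this form):

tfidf(w, d) = tf(w, d) * idf(w)
idf(w) = log(total number of documents / number of documents containing the word w)

where tf(w, d) is how often the word w appears in document d.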
Case study:
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_demo():
    """Chinese text feature extraction using TF-IDF, with automatic word segmentation."""
    data = ["The so-called training instructor is actually a self-employed marketing professional infected with COVID-19. He infected 102 people, the oldest of whom was born in 1933.",
            "At the press conference, the relevant person in charge of the market supervision department of Jilin Province said that the relevant departments of the two places have carried out a joint investigation, and any illegal behavior will be severely punished.",
            "Under the current epidemic situation, health care workers are still tightly wrapped in thick protective clothing, fighting the virus on the front line, and community volunteers still stand guard downstairs for residents in gusts of cold wind; yet at this very moment someone held 'health training' activities to promote products, putting forward the claim that 'the elderly will be infected with the coronavirus'. Although done under the banner of caring for the health of the elderly, it amounts to seeking wealth at the cost of lives!"]
    # 1. Segment the text, joining the words with spaces
    data_new = []
    for item in data:
        data_new.append(" ".join(list(jieba.cut(item))))
    # 2. Instantiate a transformer
    transfer = TfidfVectorizer(stop_words=["102", "1933", "A"])
    # 3. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
Running results:
2.4 Feature preprocessing
Goal:
- Understand the characteristics of numerical data and categorical data
- MinMaxScaler is used to normalize the feature data
- StandardScaler is used to standardize the characteristic data
2.4.1 Introduction
What is feature preprocessing:
The process of transforming feature data into feature data more suitable for the algorithm model through some transformation functions
1. Contents
- Normalization
- Standardization
2. Feature preprocessing API
sklearn.preprocessing
Why do we need normalization?
- When feature units or scales differ greatly, or the variance of one feature is several orders of magnitude larger than that of the others, that feature can easily dominate the target result and prevent some algorithms from learning from the other features.
2.4.2 Normalization
1. Definition: transform the original data to map it into a given range (default [0, 1])
2. Formula
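The formula image is not reproduced in these notes; the standard min-max form, applied per feature column, is:

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

where min and max are the minimum and maximum of the column, and mx and mi are the bounds of the target range (1 and 0 by default).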
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """Normalization."""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 0:8]
    # 2. Instantiate a transformer
    transfer = MinMaxScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
Results:
Conclusion: when outliers exist (i.e., the maximum or minimum value is an outlier), the normalized values are inaccurate, so normalization is only suitable for traditional, precise, small-data scenarios.
2.4.3 Standardization
1. Definition: transform the original data so that each feature has a mean of 0 and a standard deviation of 1
2. Formula
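The formula image is not reproduced in these notes; the standard z-score form, applied per feature column, is:

x' = (x - mean) / σ

where mean is the column's average and σ its standard deviation.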
- For normalization: if there are outliers that affect the maximum and minimum, then the result obviously changes
- For standardization: with a sufficient amount of data, a small number of outliers have little influence on the mean, so the variance changes little.
3. API
- sklearn.preprocessing.StandardScaler()
4. Data calculation
Do the same with the above data
- Analysis:
- Instantiate StandardScaler
- Transform by fit_transform
import pandas as pd
from sklearn.preprocessing import StandardScaler


def stand_demo():
    """Standardization."""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 0:8]
    # 2. Instantiate a transformer
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
Execution Result:
Conclusion:
- The standard deviation reflects how concentrated the data are
- With enough samples, standardization is stable and suitable for modern, noisy, big-data scenarios
2.5 Feature dimension reduction
Goal:
- Apply VarianceThreshold to delete low-variance features
- Understand the characteristics and calculation of the correlation coefficient
- Implement feature selection using the correlation coefficient
2.5.1 Dimensionality reduction
Dimensionality reduction: The process of reducing the number of random variables (features) to obtain a set of “unrelated” principal variables under certain constraints
- Reduce the number of random variables
- Remove correlated features
- For example, the correlation between relative humidity and rainfall
During training we learn from feature values. If the features themselves are problematic, or the correlation between features is strong, this has a great influence on the algorithm's learning and prediction.
2.5.2 Two ways of dimensionality reduction
- Feature selection
- Principal component analysis (which can be understood as a feature extraction method)
2.5.3 What is feature selection
1. Definition: the data contains redundant or correlated variables (also called features, attributes, indicators, etc.); feature selection finds the main features among the original features.
2. Methods
- Filter: Mainly explore the characteristics of features themselves, and the correlation between features and target values
- Variance selection method: low variance filtering
- The correlation coefficient
- Embedded: Algorithms automatically select features (associations between features and target values)
- Decision tree: information entropy, information gain
- Regularization: L1, L2
- Deep learning: convolution, etc
3. Module
sklearn.feature_selection
4. Filter methods
4.1 Low-variance feature filtering
Delete some low-variance features, considering what the size of the variance means:
- Small feature variance: most samples of a feature have similar values
- Large feature variance: the value of a feature varies from one sample to another
- sklearn.feature_selection.VarianceThreshold(threshold=0.0)
- Delete all low-variance features
- VarianceThreshold.fit_transform(X)
- X: Numpy array format data [n_samples,n_features]
- Return value: features whose training-set variance is lower than threshold are deleted. The default keeps all features with non-zero variance, i.e., it deletes the features that have the same value in every sample.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features."""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 1:8]
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    # 4. Calculate the correlation coefficient between two variables
    r = pearsonr(data["gender"], data["position"])
    print("Correlation coefficient:\n", r)
    return None
Execution Result:
4.2 Correlation coefficient
- Pearson correlation coefficient
- A statistical indicator reflecting the degree of correlation between variables
- The value of the correlation coefficient is between -1 and +1, i.e. [-1, +1]
- When r is greater than 0, the two variables are positively correlated; when r is less than zero, the two variables are negatively correlated
- When the absolute value of r is equal to 1, it means that the two variables are completely correlated; when r=0, it means that the two variables are not correlated
- When 0 < |r| < 1, the two variables are correlated to some degree. The closer |r| is to 1, the closer the linear relationship between the two variables; the closer |r| is to 0, the weaker their linear correlation.
- A common three-level rule of thumb: |r| < 0.4 is low correlation; 0.4 <= |r| < 0.7 is significant correlation; 0.7 <= |r| < 1 is high linear correlation.
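For reference, the standard Pearson correlation coefficient formula (the original note's formula image is not included):

r = Σ(x_i - mean_x)(y_i - mean_y) / sqrt(Σ(x_i - mean_x)² * Σ(y_i - mean_y)²)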
from scipy.stats import pearsonr
2.6 Principal component analysis
Goal:
- PCA is used to reduce dimension of feature
Application:
- Principal component analysis between user and item category
2.6.1 Introduction
Principal component analysis (PCA) : the process of transforming high-dimensional data into low-dimensional data, in which the original data may be discarded and new variables created.
Function: data dimension compression, as far as possible to reduce the original data dimension (complexity), loss of a small amount of information.
Application: Regression analysis or cluster analysis.
API:
- sklearn.decomposition.PCA(n_components=None)
from sklearn.decomposition import PCA


def pca_demo():
    """PCA dimensionality reduction."""
    # 1. Prepare the data
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 2. Instantiate a transformer; n_components=0.95 keeps 95% of the variance
    transfer = PCA(n_components=0.95)
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
Execution Result:
2.6.2 Case
Explore the segmentation of user preferences for item categories
#Run jupyter notebook
jupyter notebook
3. Classification Algorithms
3.1 sklearn transformers and estimators
Learning objectives:
- Know sklearn's transformer and estimator workflow
3.1.1 Transformers
Steps of feature engineering:
- Instantiate (instantiate a Transformer class)
- Call fit_transform (fit and transform can also be called step by step, as described below)
Interfaces like this are called transformers. Transformers can be called in several forms:
- fit_transform
- fit
- transform
For example, standardization:
(x - mean) / std
- fit_transform: fit and transform in one step
- fit: computes the mean and standard deviation of each column
- transform: applies (x - mean) / std to do the final conversion
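A minimal sketch of this workflow on made-up data: fit_transform on the training data is equivalent to fit followed by transform, and the test data should only be transformed with the statistics learned from the training data:

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # made-up training data
x_test = np.array([[2.0, 25.0]])                             # made-up test data

transfer = StandardScaler()
x_train_scaled = transfer.fit_transform(x_train)  # fit (learn mean/std) + transform in one call
x_test_scaled = transfer.transform(x_test)        # reuse the mean/std learned from the training data
print(transfer.mean_)      # per-column means computed by fit
print(x_train_scaled)
print(x_test_scaled)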
3.1.2 Estimators
Estimators play an important role in sklearn; they are the class of APIs that implement the algorithms.
- Estimator for classification
- sklearn.neighbors: k-nearest neighbors algorithm
- sklearn.naive_bayes: naive Bayes
- sklearn.linear_model.LogisticRegression: logistic regression
- sklearn.tree: decision tree (random forest lives in sklearn.ensemble)
- Estimators for regression:
- sklearn.linear_model.LinearRegression: linear regression
- sklearn.linear_model.Ridge: ridge regression
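A minimal sketch of the common estimator workflow (instantiate, fit, predict, score), using KNeighborsClassifier on the iris data as an assumed example; the same pattern applies to the other estimators listed above, and the full KNN case appears in the next section:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

estimator = KNeighborsClassifier()           # 1. Instantiate an estimator
estimator.fit(x_train, y_train)              # 2. Fit: train the model
y_predict = estimator.predict(x_test)        # 3. Predict on new data
accuracy = estimator.score(x_test, y_test)   # 4. Evaluate
print(y_predict, accuracy)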
3.2 K-nearest neighbor algorithm
Learning objectives:
- Learn the distance formula used by the KNN algorithm
- Learn about KNN's hyperparameter K and how to choose its value
- Learn the pros and cons of KNN
- Apply KNeighborsClassifier to implement classification
- Understand accuracy as an evaluation criterion for classification algorithms
1. Principle of K-Nearest Neighbor algorithm (KNN)
K Nearest Neighbor algorithm is also called KNN algorithm
Definition: A sample belongs to a category if most of the k closest samples in the feature space (that is, the nearest neighbors in the feature space) belong to that category.
The distance between two samples can be measured by the Euclidean distance.
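For reference, the Euclidean distance between two samples a = (a1, a2, a3) and b = (b1, b2, b3) (the original note's formula image is not included):

d(a, b) = sqrt((a1 - b1)² + (a2 - b2)² + (a3 - b3)²)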
Using the KNN algorithm to classify irises:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_iris():
    """Classify irises with the KNN algorithm."""
    # 1. Get the data
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # use the training-set mean/std, do not refit on the test set
    # 4. KNN estimator
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None
Running results:
3.3 Model selection and tuning
Learning Objectives:
- Cross validation procedure
- Hyperparameter search procedure
- GridSearchCV is used to optimize the algorithm parameters
3.3.1 Cross validation
Cross validation: divide the training data into training and validation sets. For example, split the data into four parts and use one of them as the validation set; then run four rounds of tests, each time with a different validation set. This yields four sets of model results, whose average is taken as the final result. It is called 4-fold cross validation.
The data is divided into a training set and a test set, but to make the model results obtained from the training set more reliable, the training data is further split as described above.
Using the KNN algorithm to classify irises, adding grid search and cross validation:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_iris_gscv():
    """Classify irises with the KNN algorithm, adding grid search and cross validation."""
    # 1. Get the data
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier()
    # Add grid search and cross validation
    # Parameter candidates
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Best parameters
    print("Best parameters:\n", estimator.best_params_)
    # Best cross-validation score
    print("Best score:\n", estimator.best_score_)
    # Best estimator
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results
    print("Cross-validation results:\n", estimator.cv_results_)
    return None
Running results:
Facebook case
3.4 Naive Bayes algorithm
Learning Objectives:
- Conditional probability and joint probability
- Bayes’ formula, and feature independence
- Laplace smoothing coefficient
- Bayesian formula is used to calculate the probability
3.4.1 What is naive Bayes Classification
Classification with the naive Bayes algorithm produces probability values like the following; emails are then classified according to the size of these probabilities.
3.4.2 Basis of probability
1. Definition of Probability
- Probability is defined as the likelihood of an event happening
- P(X) : the value ranges from 0 to 1.
3.4.3 Joint probability, conditional probability and mutual independence
- Joint probability: the probability that multiple conditions hold at the same time
- Notation: P(A, B)
- Conditional probability: the probability of event A occurring given that event B has occurred
- Notation: P(A|B)
- Mutual independence: if P(A, B) = P(A)P(B), then event A and event B are said to be mutually independent
- <=> Event A and event B are independent of each other
3.4.4 Bayes’ formula
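The formula image is not reproduced in these notes. The standard Bayes formula, with C a category and W a feature (or combination of features), is:

P(C|W) = P(W|C) * P(C) / P(W)

With the Laplace smoothing coefficient α mentioned below, the conditional probabilities are commonly estimated as P(F1|C) = (Ni + α) / (N + α * m), where Ni is the number of times feature F1 appears under category C, N is the total feature count under category C, and m is the number of distinct features.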
3.4.5 API
- sklearn.naive_bayes.MultinomialNB(alpha=1.0)
- Naive Bayes classification
- Alpha: Laplace smoothing coefficient
3.4.6 Case
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def nb_news():
    """Classify news with the naive Bayes algorithm."""
    # 1. Get the data
    news = fetch_20newsgroups(subset="all")
    # 2. Split the data set: feature training set, feature test set, target training set, target test set
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    # 3. Feature engineering: TF-IDF text feature extraction
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Naive Bayes estimator
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None
Running results:
3.5 Decision Trees
Learning Objectives:
- The formula and role of information entropy
- The formula and role of information gain
- Use information gain to measure how much a feature reduces uncertainty
- Understand the implementation of the three kinds of decision tree algorithms
3.5.1 Understanding the decision tree
The idea behind decision trees is very simple: it is the conditional branch (if-else) structure used in programming. The earliest decision trees were classification learning methods that used this structure to split the data.
3.5.2 Principle of decision tree
1. Principle
- Information entropy, information gain
2. Definition of information entropy
- The technical term for H is information entropy, which is measured in bits
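For reference, the standard information entropy formula (the original note's formula image is not included):

H(X) = -Σ P(x_i) * log2(P(x_i)), summed over all possible values x_i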
3. One basis for decision tree splitting: information gain
Definition: the information gain g(D, A) of feature A with respect to training set D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A, namely: g(D, A) = H(D) - H(D|A)
3.5.4 Case
1. Classify iris by decision tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz


def decision_iris():
    """Classify irises with a decision tree."""
    # 1. Get the data set
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # 3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train, y_train)
    # 4. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Visualize the decision tree
    export_graphviz(estimator, out_file="iris_tree.dot")
    return None
Running results:
2. Survival prediction of Titanic passengers
3.5.6 Decision tree summary
- Advantages:
- Simple understanding and interpretation, tree visualization
- Disadvantages:
- Decision tree learners can create overly complex trees that fit the training data but do not generalize well; this is called overfitting
- Improvement:
- Pruning, e.g., the CART algorithm (already implemented in the decision tree API)
- Random forests
3.6 Ensemble learning method of random forest
Learning Objectives:
- The establishment process of every decision tree in random forest
- Why random sampling with replacement (bootstrap) is needed
- Hyperparameters of random forest
3.6.1 What is the ensemble learning method
Ensemble learning is to solve a single prediction problem by combining several models.
It works by generating multiple classifiers/models that each learn and make predictions independently. These predictions are then combined into a single composite prediction, which is therefore better than any prediction made by a single classifier.
3.6.2 What is random forest
Random forest is a classifier containing multiple decision trees, and its output categories are determined by the mode of the categories output by individual trees.
3.6.3 Random forest principle process
The learning algorithm creates each tree as follows:
- Let N be the number of training samples and M the number of features.
- Randomly draw one sample at a time, repeating N times (the same sample may be drawn more than once)
- Randomly select m features, with m << M, and build the decision tree on them
- Sampling is done with replacement (bootstrap sampling)
3.6.4 Random forest API
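The API screenshot is not included in these notes; a minimal sketch of the scikit-learn random forest classifier and its main hyperparameters, shown on the iris data as a stand-in for the Titanic case:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

# n_estimators: number of trees; criterion: split criterion; max_depth: maximum tree depth;
# bootstrap: whether to draw samples with replacement when building each tree
estimator = RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=None, bootstrap=True, random_state=22)
estimator.fit(x_train, y_train)
print("Accuracy:\n", estimator.score(x_test, y_test))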
3.6.5 Case
Titanic passenger survival prediction
4. Regression and Clustering Algorithms
4.1 Linear regression
Learning Objectives:
- The principle of linear regression
- Regression prediction is implemented using LinearRegression or SGDRegressor
- Evaluation criteria and formula of regression algorithm
4.1.1 Principle of linear regression
1. Application scenarios of linear regression
2. What is linear regression
Definition and Formula
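The formula image is not reproduced in these notes; the general linear regression formula is:

h(w) = w1*x1 + w2*x2 + w3*x3 + ... + b = w^T x + b

where w is the vector of weight (coefficient) values and b the bias (intercept).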
4.1.2 Loss and optimization principle of linear regression
1. Loss function: also known as cost, cost function and objective function
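The loss formula image is not reproduced in these notes; linear regression commonly uses the least squares loss:

J(w) = (h(x_1) - y_1)² + (h(x_2) - y_2)² + ... + (h(x_m) - y_m)² = Σ(h(x_i) - y_i)²

where y_i is the true value of the i-th training sample and h(x_i) its predicted value.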
2. Optimization method
- Normal equations: less used
- Gradient Descent
Case study:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def linear1():
    """Predict Boston housing prices with the normal equation optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    return None
from sklearn.linear_model import SGDRegressor


def linear2():
    """Predict Boston housing prices with the gradient descent optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    return None
Running results:
4.1.3 Regression performance evaluation
from sklearn.metrics import mean_squared_error
def linear1():
    """Predict Boston housing prices with the normal equation optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Normal equation - mean squared error:\n", error)
    return None
def linear2():
    """Predict Boston housing prices with the gradient descent optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent - mean squared error:\n", error)
    return None
4.1.4 Comparison between normal equation and gradient descent
| Gradient descent | Normal equation |
| --- | --- |
| Need to choose a learning rate | Not needed |
| Need to iterate to solve | Solved in one computation |
| Usable when the number of features is large | Requires solving the normal equation (matrix inversion), time complexity O(n³) |
- Choosing between them:
- Small-scale data:
- LinearRegression (does not address overfitting)
- Ridge regression
- Large-scale data: SGDRegressor
4.1.5 Extension
Optimization methods: GD, SGD, SAG
1. GD
The original gradient descent needs to compute the gradient over all the samples, which is a large amount of computation; hence the improved algorithms below.
2. SGD
Stochastic gradient descent: it considers only one training sample at a time.
- Advantages of SGD:
- efficient
- Easy to implement
- Disadvantages of SGD:
- SGD requires many hyperparameters: the regularization parameter, the number of iterations, etc.
- SGD is sensitive to feature standardization
3. SAG
Stochastic average gradient: because convergence can be slow, gradient-descent-based algorithms such as SAG have been proposed.
SAG optimization is available in ridge regression and logistic regression.
4.2 Underfitting and overfitting
Learning Objectives:
- Disadvantages of linear regression (without regularization)
- Causes and solutions of over-fitting and under-fitting
4.2.1 Introduction
1. Underfitting
2. Overfitting
Analysis:
- The first case: the machine learned too few features of swans, so its criteria are too crude to identify swans accurately
- The second case: the machine can basically tell swans apart, but unfortunately all the training pictures showed white swans, so it learned that swan feathers are white; when it later sees a swan with black feathers, it concludes it is not a swan
- Overfitting: a hypothesis that fits the training set better than other hypotheses but fails to fit the test data well is said to overfit (the model is too complex)
- Underfitting: a hypothesis that fits neither the training data nor the test data well is said to underfit (the model is too simple)
4.2.2 Causes and solutions
- Causes and solution of underfitting
- Reason: Learning too few features of data
- Solution: Increase the number of features in the data
- Causes and solutions of overfitting
- Reason: There are too many original features, some noisy features, and the model is too complicated because the model tries to take into account the data of each test point
- Solutions:
- regularization
Regularization: L2 regularization (common), L1 regularization
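For reference, the usual form of an L2-regularized loss (a plain-text reconstruction, since the original formula is not included):

J(w) = original loss + λ * Σ w_j²

L1 regularization instead adds λ * Σ |w_j|. The penalty term shrinks the weights, which reduces the influence of noisy features and keeps the model from becoming too complex.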
4.3 Improvement of linear regression — ridge regression
Learning Objectives:
- Learning the difference between the principle of ridge regression and linear regression
- Effects of regularization on weight parameters
- The difference between L1 and L2 regularization
4.3.1 Linear regression with L2 regularization — Ridge regression
Ridge regression is also a linear regression; to alleviate overfitting, it adds an L2 regularization penalty when building the regression equation.
1. API
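The API screenshot is not included in these notes; a minimal sketch of the scikit-learn ridge regression estimator and the parameters used in the case below (alpha is the regularization strength, solver selects the optimization method, max_iter caps the iterations of iterative solvers):

from sklearn.linear_model import Ridge

# Larger alpha means a stronger L2 penalty; solver="auto" picks a suitable solver (e.g. "sag" for large data sets)
estimator = Ridge(alpha=1.0, solver="auto", max_iter=None)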
2. Observe the change of regularization degree and its influence on the results
Case: Ridge regression to Boston housing price forecast
def linear3():
    """Predict Boston housing prices with ridge regression."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = Ridge(alpha=0.5, max_iter=10000)
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Ridge regression - weight coefficients:\n", estimator.coef_)
    print("Ridge regression - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression - mean squared error:\n", error)
    return None
4.4 Classification algorithms: logistic regression and binary classification
Learning Objectives:
- Loss function of logistic regression
- Optimization method of logistic regression
- The sigmoid function
- Application scenarios of logistic regression
- The difference between the precision and recall metrics
- The practical significance of the F1-score and recall
- How to evaluate when the samples are imbalanced
- The meaning of the ROC curve and the size of the AUC metric
- Apply classification_report to compute precision and recall
- Apply roc_auc_score to compute the AUC metric
4.4.1 Logistic regression
Definition: Logistic Regression is a classification model in machine learning. Although its name contains "regression" (it is indeed connected to regression, since its input is the output of a linear regression), it is a classification algorithm. Because it is simple and efficient, it is widely used in practice.
Application Scenarios:
- Click through rate
- Whether it is spam
- Whether sick
- Financial fraud
- False account
Note: from the examples above, logistic regression applies when we need to judge which of two categories a sample belongs to. Logistic regression is a powerful tool for binary classification problems.
4.4.2 Principle of logistic regression
1. Input: The output of linear regression is the input of logistic regression
2. Activation function
The sigmoid function:
g = 1/(1 + e^(-x))
Here the sigmoid's input x is the output of the linear regression, i.e., x = h(w) = w1*x1 + w2*x2 + w3*x3 + ... + b (in matrix form, w^T x + b)
3. Loss and optimization
Loss
The loss of logistic regression is called the log-likelihood loss.
Combining both cases gives the complete loss function.
We know that for -log(P), the larger P is, the smaller the result; the loss can be analyzed in these terms.
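For reference, the complete log-likelihood loss can be written as (a plain-text reconstruction, since the original formula image is not included):

cost(h(x), y) = -Σ [ y_i * log(h(x_i)) + (1 - y_i) * log(1 - h(x_i)) ]

where h(x_i) is the predicted probability that sample i belongs to class 1 and y_i is its true label (0 or 1).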
Optimize the loss
Gradient descent is likewise used to reduce the value of the loss function; this updates the weight parameters of the linear regression part that feeds the logistic regression, increasing the predicted probability for samples that truly belong to class 1 and decreasing it for samples that truly belong to class 0.
4.4.3 API of logistic regression
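The API content is not included in these notes; a minimal sketch of the scikit-learn logistic regression estimator (solver selects the optimization algorithm, penalty the regularization type, and C the inverse of the regularization strength):

from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)
# then the usual estimator workflow: estimator.fit(x_train, y_train), estimator.predict(x_test), estimator.score(x_test, y_test)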
4.4.4 Case
Cancer classification prediction – benign/malignant breast cancer tumor prediction
Analysis process:
- Obtain the data: add column names when reading
- Data processing: handle missing values
- Split the data set
- Feature engineering: dimensionless processing (standardization)
- Logistic regression estimator
- Model evaluation
Data address: archive.ics.uci.edu/ml/machine-…
4.4.5 Evaluation methods for classification
1. Accuracy and recall rate
- Confusion matrix
2. Precision and Recall
- Precision: the proportion of samples predicted as positive that are truly positive (how precise the positive predictions are)
- Recall: the proportion of truly positive samples that are predicted as positive (completeness; the ability to find the positive samples)
- F1-score: reflects the robustness of the model
3. Classification evaluation report API
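The API content is not included in these notes; a minimal sketch of scikit-learn's classification report, assuming y_test and y_predict come from an estimator as in the cases above (the labels 2/4 and the target names are illustrative values for the breast cancer case):

from sklearn.metrics import classification_report

# labels: the class values used in the target column; target_names: readable names for the report
report = classification_report(y_test, y_predict, labels=[2, 4], target_names=["benign", "malignant"])
print(report)  # precision, recall and F1-score per class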
4. Check the precision, recall, and F1-score for the cancer classification prediction (benign/malignant breast cancer tumor prediction)
4.4.6 ROC curve and AUC index
1. TPR and FPR
- (Recall rate) TPR = TP/(TP + FN)
- The percentage of all samples of true category 1 that are predicted to be category 1
- FPR = FP / (FP + TN)
- The percentage of all samples with a true category of 0 that are predicted to be category 1
2. ROC curve
- The horizontal axis of the ROC curve is the FPR (false positive rate) and the vertical axis is the TPR (true positive rate). When the two are equal, the classifier predicts category 1 with the same probability regardless of whether the true category is 1 or 0, and the AUC is 0.5
3. AUC indicators
- The probabilistic significance of AUC is the probability that a pair of positive and negative samples are randomly selected and the score of positive samples is greater than that of negative samples
- The minimum value of AUC is 0.5, and the maximum value is 1. The higher the value, the better
- AUC=1, perfect classifier, when using this prediction model, no matter what threshold is set, perfect prediction can be obtained. For the most part, there is no perfect classifier.
- 0.5 < AUC < 1: better than random guessing; the classifier has predictive value if the threshold is set properly
Note: The final AUC ranges between [0.5,1] and is closer to 1, the better
4.4.7 AUC computing API
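The API content is not included in these notes; a minimal sketch of scikit-learn's AUC computation (y_true and y_predict are assumed names: y_true must be the binary 0/1 true labels, y_predict the predicted labels or scores):

from sklearn.metrics import roc_auc_score

# Convert the true labels to 0/1 first if needed, e.g. y_true = (y_test == 4).astype(int) for the 2/4-labeled cancer data
auc = roc_auc_score(y_true, y_predict)
print("AUC:\n", auc)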
4.4.8 Summary
- AUC can only be used to evaluate binary classifiers
- AUC is very suitable for evaluating the performance of classifier with unbalanced samples
4.5 Model saving and loading
Learning Objectives:
- Joblib is used to save and load the model
4.5.1 sklearn model saving and loading API
4.5.2 Case
1. Model preservation
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib

# 4. Estimator
estimator = Ridge(alpha=0.5, max_iter=10000)
estimator.fit(x_train, y_train)
# Save the model
joblib.dump(estimator, "./my_ridge.pkl")
2. Load the model
# Load model
estimator = joblib.load("./my_ridge.pkl")
4.6 Unsupervised learning – K-means algorithm
Learning Objectives:
- Principle of k-means algorithm
- Evaluate K-means performance with the silhouette coefficient
- Advantages and disadvantages of K-means
4.6.1 What is Unsupervised learning
Unsupervised learning: No target value
4.6.2 Unsupervised learning includes algorithms
- clustering
- K-means (K-means clustering)
- Dimension reduction
- PCA
4.6.3 K-means principle
4.6.4 K-means API
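The API content is not included in these notes; a minimal sketch of scikit-learn's K-means estimator on made-up 2-D points (n_clusters is the number of clusters K; the comments summarize the standard iterative procedure):

from sklearn.cluster import KMeans

# Made-up 2-D points for illustration
data = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# K-means: 1) choose K initial centroids, 2) assign each point to its nearest centroid,
# 3) recompute each centroid as the mean of its assigned points, 4) repeat 2-3 until the centroids stop changing
estimator = KMeans(n_clusters=2)
estimator.fit(data)
print(estimator.labels_)           # cluster label of each sample
print(estimator.cluster_centers_)  # final centroids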
4.6.5 Case
K-means is used to cluster Instacart Market users
4.6.6 K-means performance evaluation indicators
1. Silhouette coefficient
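The formula image is not reproduced in these notes; the standard silhouette coefficient for a sample i is:

SC_i = (b_i - a_i) / max(b_i, a_i)

where a_i is the average distance from sample i to the other samples in its own cluster, and b_i is the average distance from sample i to the samples of the nearest other cluster.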
Conclusion:
- If b_i >> a_i: SC_i tends toward 1, which is good
- If b_i << a_i: SC_i tends toward -1, which is bad
- The silhouette coefficient lies in [-1, 1]; the closer to 1, the better the cohesion and separation
4.6.7 Silhouette coefficient API
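The API content is not included in these notes; a minimal sketch of scikit-learn's silhouette score, reusing the data and estimator names from the K-means sketch above:

from sklearn.metrics import silhouette_score

score = silhouette_score(data, estimator.labels_)  # average silhouette coefficient over all samples
print("Silhouette score:\n", score)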
4.6.8 Case
Using the silhouette coefficient to evaluate the Instacart Market user clustering produced by K-means
4.6.9 K-means summary
- Characteristics: an iterative algorithm that is intuitive, easy to understand, and very practical
- Disadvantage: it easily converges to a local optimum (mitigate this by running the clustering several times)
Note: Clustering is usually done before classification