Python Machine Learning Notes
3-Day Quick Start to Python Machine Learning, from Dark Horse Programmer
www.bilibili.com/video/BV1nt…
1. Overview of Machine Learning
1.1 Overview of artificial intelligence
1.1.1 The relationship between machine learning, artificial intelligence and deep learning
Machine learning, artificial intelligence, and deep learning:
- Machine learning is an approach to artificial intelligence
- Deep learning is an evolution of machine learning
- Dartmouth Conference: the beginning of artificial intelligence
In 1956, a group of computer scientists met at the Dartmouth Conferences and coined the concept of “artificial intelligence.”
1.1.2 What can machine learning and deep learning do
Machine learning has many application scenarios; it can be said to permeate all walks of life: medical care, aviation, education, logistics, e-commerce, and so on.
- Used in the field of mining and prediction:
- Application scenarios: store sales forecast, quantitative investment, advertising recommendation, enterprise user classification…
- Used in the field of images:
- Application scenarios: street traffic sign detection, face recognition and so on
- Used in the field of natural language processing:
- Application scenarios: text classification, sentiment analysis, automatic chat, text detection and so on.
What matters now is mastering machine learning algorithms and related skills so as to solve problems from an industry perspective.
1.2 What is Machine learning
1.2.1 definition
Machine learning: automatically analyze data to obtain a model, and use the model to make predictions about unknown data.
1.2.2 Data set composition
- Structure: feature values + target value
Note:
- For each row of data we call it a sample.
- Some data may have no target value
1.3 Classification of machine learning algorithms
1.3.1 Supervised learning
- Classification problems: there is a target value, and the task is to judge the category
- Regression problems: there is a target value, and it is continuous data
1.3.2 Unsupervised learning
- Unsupervised learning: No target value
1.4 Machine learning development process
Machine learning development process:
- Obtain the data
- Data processing
- Feature engineering
- Machine learning algorithm training — obtain a model
- Model evaluation
- Application
1.5 Learning framework and material introduction
Make a few points clear:
- Algorithms are the core, data and calculation are the foundation
- Find your own positioning
Most algorithms are implemented by dedicated algorithm engineers; we just need to:
- Analyze a lot of data
- Analyze the specific business
- Apply common algorithms
- Feature engineering, parameter tuning, optimization
Machine learning libraries and frameworks:
2. Feature Engineering
2.1 Data sets
- Goal:
- Know that data sets are divided into training sets and test sets
- Use the sklearn data sets
- Application:
- None
2.1.1 Available data sets
Kaggle: www.kaggle.com/
UCI data sets: archive.ics.uci.edu/ml
Scikit-learn data sets: scikit-learn.org/stable/data…
1. Introduction to scikit-learn
- Machine learning tools for Python
- Scikit-learn includes the implementation of many well-known machine learning algorithms
- Scikit-learn is well documented, easy to use, and has a rich API
- Latest stable release 0.24
2. Installation
conda install -c conda-forge scikit-learn
3. Contents of scikit-learn
2.1.2 Sklearn dataset
1. Introduction to the scikit-learn dataset API
- sklearn.datasets
- Load to get popular data sets
- datasets.load_*()
- Get a small set of data contained in datasets
- datasets.fetch_*(data_home=None)
- The first argument of this function is data_home, which indicates the directory to which the data set is downloaded.
2. Return type of the sklearn data sets
- Load and fetch return data type datasets.base.Bunch(dictionary format)
from sklearn.datasets import load_iris


def datasets_demo():
    """Use the sklearn data sets."""
    # 1. Get the data
    iris = load_iris()
    print("Iris data set:\n", iris)
    print("Data set description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)
    return None
2.1.3 Data set division
The general data set of machine learning is divided into two parts:
- Training data: used for training and modeling
- Test data: Used during model validation to evaluate the validity of the model
Division ratio:
- Training set: 70% 80% 90%
- Test set: 30% 20% 10%
Data set partitioning API
- sklearn.model_selection.train_test_split(arrays, *options)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def datasets_demo():
    """Use the sklearn data sets and split them into training and test sets."""
    # 1. Get the data
    iris = load_iris()
    print("Iris data set description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)
    # 2. Split the data set: 80% training, 20% test
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
    print("Training set features:\n", x_train, x_train.shape)
    return None
2.2 Introduction to feature engineering
Learning Objectives:
- Understand the importance of feature engineering in machine learning
- Know the classification of feature engineering
2.2.1 Why is Feature Engineering needed?
It is widely held in the industry that data and features determine the upper limit of machine learning, and that models and algorithms merely approach that upper limit.
2.2.2 What is feature engineering
Feature engineering is the process of using professional background knowledge and skills to process data so that features can play a better role in machine learning algorithms.
Significance: Directly affect the effect of machine learning
2.2.3 The position of feature engineering compared with data processing
- Pandas: A tool for easy reading and basic manipulation of data formats
- Sklearn: provides a powerful interface for feature processing
2.3 Feature Extraction
Goal:
- Apply DictVectorizer to numericalize and discretize categorical features
- Apply CountVectorizer to numericalize text features
- Apply TfidfVectorizer to numericalize text features
- Distinguish between the two text feature extraction methods
2.3.1 Feature extraction
1. Convert arbitrary data (such as text or images) into numerical features that machine learning can use
Note: converting to numerical features helps the computer understand the data better
- Dictionary feature extraction (Feature discretization)
- Text feature extraction
- Image feature extraction
2. Feature extraction API
sklearn.feature_extraction
2.3.2 Dictionary feature extraction
Function: numericalize dictionary data
sklearn.feature_extraction.DictVectorizer(…)
Example:
from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    """Dictionary feature extraction."""
    data = [{'city': "Beijing", 'temperature': 100}, {'city': "Shanghai", 'temperature': 60}, {'city': "Shenzhen", 'temperature': 30}]
    # 1. Instantiate a transformer
    # sparse=False means do not return a sparse matrix (a sparse matrix stores only the non-zero values)
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("New data:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names())
    return None
Execution Result:
2.3.3 Text feature extraction
1. Function: numericalize text data
- sklearn.feature_extraction.text.CountVectorizer(stop_words=[]): returns a word-frequency matrix
- CountVectorizer.fit_transform(X) — X: text or an iterable of text strings; return value: a sparse matrix
- CountVectorizer.inverse_transform(X) — X: array or sparse matrix; return value: the data in its format before conversion
- CountVectorizer.get_feature_names(): returns a list of words
from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    """Text feature extraction."""
    data = ["lift is short,i like like python", "lift is long,i dislike python"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
2. Running results
3. Chinese extraction example. Note: because there are no spaces between Chinese words, the program cannot recognize individual words, so the text must first be separated with spaces. The example below uses manually space-separated text for verification; automatic word segmentation follows.
def count_chinese_demo():
    """Chinese text feature extraction (text pre-segmented with spaces)."""
    data = ["I love Tiananmen Square in Beijing", "The sun rises over Tiananmen Square"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
4. Chinese extraction: use jieba for automatic word segmentation
import jieba


def count_chinese_demo2():
    """Chinese text feature extraction with automatic word segmentation."""
    data = ["The so-called training instructor is actually a self-employed marketing professional infected with COVID-19. He infected 102 people, the oldest of whom was born in 1933.",
            "At the press conference, the relevant person in charge of the market supervision department of Jilin Province said that the relevant departments of the two places have carried out a joint investigation, and any illegal behavior will be severely punished.",
            "Under the current epidemic situation, health care workers are still tightly wrapped in thick protective clothing, fighting the virus on the front line, and community volunteers still stand guard downstairs for residents in gusts of cold wind; yet at this very moment someone held 'health training' activities to promote products, putting forward the claim that 'the elderly will be infected with the coronavirus'. Although done under the banner of caring for the health of the elderly, it amounts to seeking wealth at the cost of lives!"]
    # 1. Segment the text, joining the words with spaces
    data_new = []
    for item in data:
        data_new.append(" ".join(list(jieba.cut(item))))
    # 2. Instantiate a transformer
    transfer = CountVectorizer(stop_words=["102", "1933", "A"])
    # 3. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
Running results:
Limitation of the above method: when extracting features from articles, a word may appear many times in one article yet only rarely in other articles; such words matter for classification, but plain word counts cannot highlight them.
Solution: Use TfidfVectorizer
5. Tf-idf text feature extraction
- The main idea of TF-IDF: If a certain word or phrase has a high probability of occurrence in one article and rarely appears in other articles, it is considered that the word or phrase has good classification ability and is suitable for classification.
- Tf-idf: Used to evaluate the importance of a word to one document in a document set or a corpus.
- TF: Term Frequency
- IDF: inverse document frequency
- API: sklearn.feature_extraction.text.TfidfVectorizer()
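For reference, the basic tf-idf formula (the original note's formula image is not included here; scikit-learn's TfidfVectorizer additionally applies smoothing and normalization on top of this form):

tfidf(w, d) = tf(w, d) * idf(w)
idf(w) = log(total number of documents / number of documents containing the word w)

where tf(w, d) is how often the word w appears in document d.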
Case study:
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_demo():
    """Chinese text feature extraction using TF-IDF, with automatic word segmentation."""
    data = ["The so-called training instructor is actually a self-employed marketing professional infected with COVID-19. He infected 102 people, the oldest of whom was born in 1933.",
            "At the press conference, the relevant person in charge of the market supervision department of Jilin Province said that the relevant departments of the two places have carried out a joint investigation, and any illegal behavior will be severely punished.",
            "Under the current epidemic situation, health care workers are still tightly wrapped in thick protective clothing, fighting the virus on the front line, and community volunteers still stand guard downstairs for residents in gusts of cold wind; yet at this very moment someone held 'health training' activities to promote products, putting forward the claim that 'the elderly will be infected with the coronavirus'. Although done under the banner of caring for the health of the elderly, it amounts to seeking wealth at the cost of lives!"]
    # 1. Segment the text, joining the words with spaces
    data_new = []
    for item in data:
        data_new.append(" ".join(list(jieba.cut(item))))
    # 2. Instantiate a transformer
    transfer = TfidfVectorizer(stop_words=["102", "1933", "A"])
    # 3. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None
Running results:
2.4 Feature preprocessing
Goal:
- Understand the characteristics of numerical data and categorical data
- MinMaxScaler is used to normalize the feature data
- StandardScaler is used to standardize the characteristic data
2.4.1 Introduction
What is feature preprocessing:
The process of transforming feature data into feature data more suitable for the algorithm model through some transformation functions
1. Contents
- Normalization
- Standardization
2. Feature preprocessing API
sklearn.preprocessing
Why do we need normalization?
- When feature units or scales differ greatly, or the variance of one feature is several orders of magnitude larger than that of the others, that feature can easily dominate the target result and prevent some algorithms from learning from the other features.
2.4.2 Normalization
1. Definition: transform the original data to map it into a given range (default [0, 1])
2. Formula
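The formula image is not reproduced in these notes; the standard min-max form, applied per feature column, is:

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

where min and max are the minimum and maximum of the column, and mx and mi are the bounds of the target range (1 and 0 by default).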
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """Normalization."""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 0:8]
    # 2. Instantiate a transformer
    transfer = MinMaxScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
Results:
Conclusion: when outliers exist (i.e., the maximum or minimum value is an outlier), the normalized values are inaccurate, so normalization is only suitable for traditional, precise, small-data scenarios.
2.4.3 Standardization
1. Definition: transform the original data so that each feature has a mean of 0 and a standard deviation of 1
2. Formula
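The formula image is not reproduced in these notes; the standard z-score form, applied per feature column, is:

x' = (x - mean) / σ

where mean is the column's average and σ its standard deviation.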
- For normalization: if there are outliers that affect the maximum and minimum, then the result obviously changes
- For standardization: with a sufficient amount of data, a small number of outliers have little influence on the mean, so the variance changes little.
3. API
- sklearn.preprocessing.StandardScaler()
4. Data calculation
Do the same with the above data
- Analysis:
- Instantiate StandardScaler
- Transform by fit_transform
import pandas as pd
from sklearn.preprocessing import StandardScaler


def stand_demo():
    """Standardization."""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 0:8]
    # 2. Instantiate a transformer
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
Execution Result:
Conclusion:
- The standard deviation reflects how concentrated the data are
- With enough samples, standardization is stable and suitable for modern, noisy, big-data scenarios
2.5 Feature dimension reduction
Goal:
- Apply VarianceThreshold to delete low-variance features
- Understand the characteristics and calculation of the correlation coefficient
- Implement feature selection using the correlation coefficient
2.5.1 Dimensionality reduction
Dimensionality reduction: The process of reducing the number of random variables (features) to obtain a set of “unrelated” principal variables under certain constraints
- Reduce the number of random variables
- Remove correlated features
- For example, the correlation between relative humidity and rainfall
During training we learn from feature values. If the features themselves are problematic, or the correlation between features is strong, this has a great influence on the algorithm's learning and prediction.
2.5.2 Two ways of dimensionality reduction
- Feature selection
- Principal component analysis (which can be understood as a feature extraction method)
2.5.3 What is feature selection
1. Definition: the data contains redundant or correlated variables (also called features, attributes, indicators, etc.); feature selection finds the main features among the original features.
2. Methods
- Filter: Mainly explore the characteristics of features themselves, and the correlation between features and target values
- Variance selection method: low variance filtering
- The correlation coefficient
- Embedded: Algorithms automatically select features (associations between features and target values)
- Decision tree: information entropy, information gain
- Regularization: L1, L2
- Deep learning: convolution, etc
3. Module
sklearn.feature_selection
4. Filter methods
4.1 Low-variance feature filtering
Delete some low-variance features, considering what the size of the variance means:
- Small feature variance: most samples of a feature have similar values
- Large feature variance: the value of a feature varies from one sample to another
- sklearn.feature_selection.VarianceThreshold(threshold=0.0)
- Delete all low-variance features
- VarianceThreshold.fit_transform(X)
- X: Numpy array format data [n_samples,n_features]
- Return value: features whose training-set variance is lower than threshold are deleted. The default keeps all features with non-zero variance, i.e., it deletes the features that have the same value in every sample.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features."""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 1:8]
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    # 4. Calculate the correlation coefficient between two variables
    r = pearsonr(data["gender"], data["position"])
    print("Correlation coefficient:\n", r)
    return None
Execution Result:
4.2 Correlation coefficient
- Pearson correlation coefficient
- A statistical indicator reflecting the degree of correlation between variables
- The value of the correlation coefficient is between -1 and +1, i.e. [-1, +1]
- When r is greater than 0, the two variables are positively correlated; when r is less than zero, the two variables are negatively correlated
- When the absolute value of r is equal to 1, it means that the two variables are completely correlated; when r=0, it means that the two variables are not correlated
- When 0 < |r| < 1, the two variables are correlated to some degree. The closer |r| is to 1, the closer the linear relationship between the two variables; the closer |r| is to 0, the weaker their linear correlation.
- A common three-level rule of thumb: |r| < 0.4 is low correlation; 0.4 <= |r| < 0.7 is significant correlation; 0.7 <= |r| < 1 is high linear correlation.
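For reference, the standard Pearson correlation coefficient formula (the original note's formula image is not included):

r = Σ(x_i - mean_x)(y_i - mean_y) / sqrt(Σ(x_i - mean_x)² * Σ(y_i - mean_y)²)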
from scipy.stats import pearsonr
2.6 Principal component analysis
Goal:
- PCA is used to reduce dimension of feature
Application:
- Principal component analysis between user and item category
2.6.1 Introduction
Principal component analysis (PCA) : the process of transforming high-dimensional data into low-dimensional data, in which the original data may be discarded and new variables created.
Function: data dimension compression, as far as possible to reduce the original data dimension (complexity), loss of a small amount of information.
Application: Regression analysis or cluster analysis.
API:
- sklearn.decomposition.PCA(n_components=None)
from sklearn.decomposition import PCA


def pca_demo():
    """PCA dimensionality reduction."""
    # 1. Prepare the data
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 2. Instantiate a transformer; n_components=0.95 keeps 95% of the variance
    transfer = PCA(n_components=0.95)
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None
Execution Result:
2.6.2 Case
Explore the segmentation of user preferences for item categories
#Run jupyter notebook
jupyter notebook
3. Classification Algorithms
3.1 sklearn transformers and estimators
Learning objectives:
- Know sklearn's transformer and estimator workflow
3.1.1 Transformers
Steps of feature engineering:
- Instantiate (instantiate a Transformer class)
- Call fit_transform (fit and transform can also be called step by step, as described below)
Interfaces like this are called transformers. Transformers can be called in several forms:
- fit_transform
- fit
- transform
For example, standardization:
(x - mean) / std
- fit_transform: fit and transform in one step
- fit: computes the mean and standard deviation of each column
- transform: applies (x - mean) / std to do the final conversion
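A minimal sketch of this workflow on made-up data: fit_transform on the training data is equivalent to fit followed by transform, and the test data should only be transformed with the statistics learned from the training data:

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # made-up training data
x_test = np.array([[2.0, 25.0]])                             # made-up test data

transfer = StandardScaler()
x_train_scaled = transfer.fit_transform(x_train)  # fit (learn mean/std) + transform in one call
x_test_scaled = transfer.transform(x_test)        # reuse the mean/std learned from the training data
print(transfer.mean_)      # per-column means computed by fit
print(x_train_scaled)
print(x_test_scaled)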
3.1.2 Estimators
Estimators play an important role in sklearn; they are the class of APIs that implement the algorithms.
- Estimator for classification
- sklearn.neighbors: k-nearest neighbors algorithm
- sklearn.naive_bayes: naive Bayes
- sklearn.linear_model.LogisticRegression: logistic regression
- sklearn.tree: decision tree (random forest lives in sklearn.ensemble)
- Estimators for regression:
- sklearn.linear_model.LinearRegression: linear regression
- sklearn.linear_model.Ridge: ridge regression
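A minimal sketch of the common estimator workflow (instantiate, fit, predict, score), using KNeighborsClassifier on the iris data as an assumed example; the same pattern applies to the other estimators listed above, and the full KNN case appears in the next section:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

estimator = KNeighborsClassifier()           # 1. Instantiate an estimator
estimator.fit(x_train, y_train)              # 2. Fit: train the model
y_predict = estimator.predict(x_test)        # 3. Predict on new data
accuracy = estimator.score(x_test, y_test)   # 4. Evaluate
print(y_predict, accuracy)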
3.2 K-nearest neighbor algorithm
Learning objectives:
- Learn the distance formula used by the KNN algorithm
- Learn about KNN's hyperparameter K and how to choose its value
- Learn the pros and cons of KNN
- Apply KNeighborsClassifier to implement classification
- Understand accuracy as an evaluation criterion for classification algorithms
1. Principle of K-Nearest Neighbor algorithm (KNN)
K Nearest Neighbor algorithm is also called KNN algorithm
Definition: A sample belongs to a category if most of the k closest samples in the feature space (that is, the nearest neighbors in the feature space) belong to that category.
The distance between two samples can be measured by the Euclidean distance.
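For reference, the Euclidean distance between two samples a = (a1, a2, a3) and b = (b1, b2, b3) (the original note's formula image is not included):

d(a, b) = sqrt((a1 - b1)² + (a2 - b2)² + (a3 - b3)²)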
Using the KNN algorithm to classify irises:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_iris():
    """Classify irises with the KNN algorithm."""
    # 1. Get the data
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # use the training-set mean/std, do not refit on the test set
    # 4. KNN estimator
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None
Running results:
3.3 Model selection and tuning
Learning Objectives:
- Cross validation procedure
- Hyperparameter search procedure
- GridSearchCV is used to optimize the algorithm parameters
3.3.1 Cross validation
Cross validation: divide the training data into training and validation sets. For example, split the data into four parts and use one of them as the validation set; then run four rounds of tests, each time with a different validation set. This yields four sets of model results, whose average is taken as the final result. It is called 4-fold cross validation.
The data is divided into a training set and a test set, but to make the model results obtained from the training set more reliable, the training data is further split as described above.
Using the KNN algorithm to classify irises, adding grid search and cross validation:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_iris_gscv():
    """Classify irises with the KNN algorithm, adding grid search and cross validation."""
    # 1. Get the data
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier()
    # Add grid search and cross validation
    # Parameter candidates
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Best parameters
    print("Best parameters:\n", estimator.best_params_)
    # Best cross-validation score
    print("Best score:\n", estimator.best_score_)
    # Best estimator
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results
    print("Cross-validation results:\n", estimator.cv_results_)
    return None
Running results:
Facebook case
3.4 Naive Bayes algorithm
Learning Objectives:
- Conditional probability and joint probability
- Bayes’ formula, and feature independence
- Laplace smoothing coefficient
- Bayesian formula is used to calculate the probability
3.4.1 What is naive Bayes Classification
Classification with the naive Bayes algorithm produces probability values like the following; emails are then classified according to the size of these probabilities.
3.4.2 Basis of probability
1. Definition of Probability
- Probability is defined as the likelihood of an event happening
- P(X) : the value ranges from 0 to 1.
3.4.3 Joint probability, conditional probability and mutual independence
- Joint probability: the probability that multiple conditions hold at the same time
- Notation: P(A, B)
- Conditional probability: the probability of event A occurring given that event B has occurred
- Notation: P(A|B)
- Mutual independence: if P(A, B) = P(A)P(B), then event A and event B are said to be mutually independent
- <=> Event A and event B are independent of each other
3.4.4 Bayes’ formula
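The formula image is not reproduced in these notes. The standard Bayes formula, with C a category and W a feature (or combination of features), is:

P(C|W) = P(W|C) * P(C) / P(W)

With the Laplace smoothing coefficient α mentioned below, the conditional probabilities are commonly estimated as P(F1|C) = (Ni + α) / (N + α * m), where Ni is the number of times feature F1 appears under category C, N is the total feature count under category C, and m is the number of distinct features.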
3.4.5 API
- sklearn.naive_bayes.MultinomialNB(alpha=1.0)
- Naive Bayes classification
- Alpha: Laplace smoothing coefficient
3.4.6 Case
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def nb_news():
    """Classify news with the naive Bayes algorithm."""
    # 1. Get the data
    news = fetch_20newsgroups(subset="all")
    # 2. Split the data set: feature training set, feature test set, target training set, target test set
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    # 3. Feature engineering: TF-IDF text feature extraction
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Naive Bayes estimator
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None
Running results:
3.5 Decision Trees
Learning Objectives:
- The formula and role of information entropy
- The formula and role of information gain
- Use information gain to measure how much a feature reduces uncertainty
- Understand the implementation of the three kinds of decision tree algorithms
3.5.1 Understanding the decision tree
The idea behind decision trees is very simple: it is the conditional branch (if-else) structure used in programming. The earliest decision trees were classification learning methods that used this structure to split the data.
3.5.2 Principle of decision tree
1. Principle
- Information entropy, information gain
2. Definition of information entropy
- The technical term for H is information entropy, which is measured in bits
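For reference, the standard information entropy formula (the original note's formula image is not included):

H(X) = -Σ P(x_i) * log2(P(x_i)), summed over all possible values x_i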
3. One basis for decision tree splitting: information gain
Definition: the information gain g(D, A) of feature A with respect to training set D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A, namely: g(D, A) = H(D) - H(D|A)
3.5.4 Case
1. Classify iris by decision tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz


def decision_iris():
    """Classify irises with a decision tree."""
    # 1. Get the data set
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # 3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train, y_train)
    # 4. Model evaluation
    # Method 1: directly compare the actual and predicted values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of actual and predicted values:\n", y_predict == y_test)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Visualize the decision tree
    export_graphviz(estimator, out_file="iris_tree.dot")
    return None
Running results:
2. Survival prediction of Titanic passengers
3.5.6 Decision tree summary
- Advantages:
- Simple understanding and interpretation, tree visualization
- Disadvantages:
- Decision tree learners can create overly complex trees that fit the training data but do not generalize well; this is called overfitting
- Improvement:
- Pruning, e.g., the CART algorithm (already implemented in the decision tree API)
- Random forests
3.6 Ensemble learning method of random forest
Learning Objectives:
- The establishment process of every decision tree in random forest
- Why random sampling with replacement (bootstrap) is needed
- Hyperparameters of random forest
3.6.1 What is the ensemble learning method
Ensemble learning is to solve a single prediction problem by combining several models.
It works by generating multiple classifiers/models that each learn and make predictions independently. These predictions are then combined into a single composite prediction, which is therefore better than any prediction made by a single classifier.
3.6.2 What is random forest
Random forest is a classifier containing multiple decision trees, and its output categories are determined by the mode of the categories output by individual trees.
3.6.3 Random forest principle process
The learning algorithm creates each tree as follows:
- Let N be the number of training samples and M the number of features.
- Randomly draw one sample at a time, repeating N times (the same sample may be drawn more than once)
- Randomly select m features, with m << M, and build the decision tree on them
- Sampling is done with replacement (bootstrap sampling)
3.6.4 Random forest API
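The API screenshot is not included in these notes; a minimal sketch of the scikit-learn random forest classifier and its main hyperparameters, shown on the iris data as a stand-in for the Titanic case:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

# n_estimators: number of trees; criterion: split criterion; max_depth: maximum tree depth;
# bootstrap: whether to draw samples with replacement when building each tree
estimator = RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=None, bootstrap=True, random_state=22)
estimator.fit(x_train, y_train)
print("Accuracy:\n", estimator.score(x_test, y_test))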
3.6.5 Case
Titanic passenger survival prediction
4. Regression and Clustering Algorithms
4.1 Linear regression
Learning Objectives:
- The principle of linear regression
- Regression prediction is implemented using LinearRegression or SGDRegressor
- Evaluation criteria and formula of regression algorithm
4.1.1 Principle of linear regression
1. Application scenarios of linear regression
2. What is linear regression
Definition and Formula
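The formula image is not reproduced in these notes; the general linear regression formula is:

h(w) = w1*x1 + w2*x2 + w3*x3 + ... + b = w^T x + b

where w is the vector of weight (coefficient) values and b the bias (intercept).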
4.1.2 Loss and optimization principle of linear regression
1. Loss function: also known as cost, cost function and objective function
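The loss formula image is not reproduced in these notes; linear regression commonly uses the least squares loss:

J(w) = (h(x_1) - y_1)² + (h(x_2) - y_2)² + ... + (h(x_m) - y_m)² = Σ(h(x_i) - y_i)²

where y_i is the true value of the i-th training sample and h(x_i) its predicted value.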
2. Optimization method
- Normal equations: less used
- Gradient Descent
Case study:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def linear1():
    """Predict Boston housing prices with the normal equation optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    return None
from sklearn.linear_model import SGDRegressor


def linear2():
    """Predict Boston housing prices with the gradient descent optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    return None
Running results:
4.1.3 Regression performance evaluation
from sklearn.metrics import mean_squared_error
def linear1():
    """Predict Boston housing prices with the normal equation optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Normal equation - mean squared error:\n", error)
    return None
def linear2():
    """Predict Boston housing prices with the gradient descent optimization method."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent - mean squared error:\n", error)
    return None
4.1.4 Comparison between normal equation and gradient descent
| Gradient descent | Normal equation |
| --- | --- |
| Need to choose a learning rate | Not needed |
| Need to iterate to solve | Solved in one computation |
| Usable when the number of features is large | Requires solving the normal equation (matrix inversion), time complexity O(n³) |
- Choosing between them:
- Small-scale data:
- LinearRegression (does not address overfitting)
- Ridge regression
- Large-scale data: SGDRegressor
4.1.5 Extension
Optimization methods: GD, SGD, SAG
1. GD
The original gradient descent needs to compute the gradient over all the samples, which is a large amount of computation; hence the improved algorithms below.
2. SGD
Stochastic gradient descent: it considers only one training sample at a time.
- Advantages of SGD:
- efficient
- Easy to implement
- Disadvantages of SGD:
- SGD requires many hyperparameters: the regularization parameter, the number of iterations, etc.
- SGD is sensitive to feature standardization
3. SAG
Stochastic average gradient: because convergence can be slow, gradient-descent-based algorithms such as SAG have been proposed.
SAG optimization is available in ridge regression and logistic regression.
4.2 Underfitting and overfitting
Learning Objectives:
- Disadvantages of linear regression (without regularization)
- Causes and solutions of over-fitting and under-fitting
4.2.1 Introduction
1. Underfitting
2. Overfitting
Analysis:
- The first case: the machine learned too few features of swans, so its criteria are too crude to identify swans accurately
- The second case: the machine can basically tell swans apart, but unfortunately all the training pictures showed white swans, so it learned that swan feathers are white; when it later sees a swan with black feathers, it concludes it is not a swan
- Overfitting: a hypothesis that fits the training set better than other hypotheses but fails to fit the test data well is said to overfit (the model is too complex)
- Underfitting: a hypothesis that fits neither the training data nor the test data well is said to underfit (the model is too simple)
4.2.2 Causes and solutions
- Causes and solution of underfitting
- Reason: Learning too few features of data
- Solution: Increase the number of features in the data
- Causes and solutions of overfitting
- Reason: There are too many original features, some noisy features, and the model is too complicated because the model tries to take into account the data of each test point
- Solutions:
- regularization
Regularization: L2 regularization (common), L1 regularization
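For reference, the usual form of an L2-regularized loss (a plain-text reconstruction, since the original formula is not included):

J(w) = original loss + λ * Σ w_j²

L1 regularization instead adds λ * Σ |w_j|. The penalty term shrinks the weights, which reduces the influence of noisy features and keeps the model from becoming too complex.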
4.3 Improvement of linear regression — ridge regression
Learning Objectives:
- Learning the difference between the principle of ridge regression and linear regression
- Effects of regularization on weight parameters
- The difference between L1 and L2 regularization
4.3.1 Linear regression with L2 regularization — Ridge regression
Ridge regression is also a linear regression; to alleviate overfitting, it adds an L2 regularization penalty when building the regression equation.
1. API
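The API screenshot is not included in these notes; a minimal sketch of the scikit-learn ridge regression estimator and the parameters used in the case below (alpha is the regularization strength, solver selects the optimization method, max_iter caps the iterations of iterative solvers):

from sklearn.linear_model import Ridge

# Larger alpha means a stronger L2 penalty; solver="auto" picks a suitable solver (e.g. "sag" for large data sets)
estimator = Ridge(alpha=1.0, solver="auto", max_iter=None)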
2. Observe the change of regularization degree and its influence on the results
Case: Ridge regression to Boston housing price forecast
def linear3():
    """Predict Boston housing prices with ridge regression."""
    # 1. Get the data
    boston = load_boston()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = Ridge(alpha=0.5, max_iter=10000)
    estimator.fit(x_train, y_train)
    # 5. Get the model
    print("Ridge regression - weight coefficients:\n", estimator.coef_)
    print("Ridge regression - bias:\n", estimator.intercept_)
    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression - mean squared error:\n", error)
    return None
4.4 Classification algorithms: logistic regression and binary classification
Learning Objectives:
- Loss function of logistic regression
- Optimization method of logistic regression
- The sigmoid function
- Application scenarios of logistic regression
- The difference between the precision and recall metrics
- The practical significance of the F1-score and recall
- How to evaluate when the samples are imbalanced
- The meaning of the ROC curve and the size of the AUC metric
- Apply classification_report to compute precision and recall
- Apply roc_auc_score to compute the AUC metric
4.4.1 Logistic regression
Definition: Logistic Regression is a classification model in machine learning. Although its name contains "regression" (it is indeed connected to regression, since its input is the output of a linear regression), it is a classification algorithm. Because it is simple and efficient, it is widely used in practice.
Application Scenarios:
- Click through rate
- Whether it is spam
- Whether sick
- Financial fraud
- False account
Note: from the examples above, logistic regression applies when we need to judge which of two categories a sample belongs to. Logistic regression is a powerful tool for binary classification problems.
4.4.2 Principle of logistic regression
1. Input: The output of linear regression is the input of logistic regression
2. Activation function
The sigmoid function:
g = 1/(1 + e^(-x))
Here the sigmoid's input x is the output of the linear regression, i.e., x = h(w) = w1*x1 + w2*x2 + w3*x3 + ... + b (in matrix form, w^T x + b)
3. Loss and optimization
Loss
The loss of logistic regression is called the log-likelihood loss.
Combining both cases gives the complete loss function.
We know that for -log(P), the larger P is, the smaller the result; the loss can be analyzed in these terms.
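For reference, the complete log-likelihood loss can be written as (a plain-text reconstruction, since the original formula image is not included):

cost(h(x), y) = -Σ [ y_i * log(h(x_i)) + (1 - y_i) * log(1 - h(x_i)) ]

where h(x_i) is the predicted probability that sample i belongs to class 1 and y_i is its true label (0 or 1).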
Optimize the loss
Gradient descent is likewise used to reduce the value of the loss function; this updates the weight parameters of the linear regression part that feeds the logistic regression, increasing the predicted probability for samples that truly belong to class 1 and decreasing it for samples that truly belong to class 0.
4.4.3 API of logistic regression
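The API content is not included in these notes; a minimal sketch of the scikit-learn logistic regression estimator (solver selects the optimization algorithm, penalty the regularization type, and C the inverse of the regularization strength):

from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)
# then the usual estimator workflow: estimator.fit(x_train, y_train), estimator.predict(x_test), estimator.score(x_test, y_test)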
4.4.4 Case
Cancer classification prediction – benign/malignant breast cancer tumor prediction
Analysis process:
- Obtain the data: add column names when reading
- Data processing: handle missing values
- Split the data set
- Feature engineering: dimensionless processing (standardization)
- Logistic regression estimator
- Model evaluation
Data address: archive.ics.uci.edu/ml/machine-…
4.4.5 Evaluation methods for classification
1. Accuracy and recall rate
- Confusion matrix
2. Precision and Recall
- Precision: the proportion of samples predicted as positive that are truly positive (how precise the positive predictions are)
- Recall: the proportion of truly positive samples that are predicted as positive (completeness; the ability to find the positive samples)
- F1-score: reflects the robustness of the model
3. Classification evaluation report API
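The API content is not included in these notes; a minimal sketch of scikit-learn's classification report, assuming y_test and y_predict come from an estimator as in the cases above (the labels 2/4 and the target names are illustrative values for the breast cancer case):

from sklearn.metrics import classification_report

# labels: the class values used in the target column; target_names: readable names for the report
report = classification_report(y_test, y_predict, labels=[2, 4], target_names=["benign", "malignant"])
print(report)  # precision, recall and F1-score per class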
4. Check the precision, recall, and F1-score for the cancer classification prediction (benign/malignant breast cancer tumor prediction)
4.4.6 ROC curve and AUC index
1. TPR and FPR
- (Recall rate) TPR = TP/(TP + FN)
- The percentage of all samples of true category 1 that are predicted to be category 1
- FPR = FP / (FP + TN)
- The percentage of all samples with a true category of 0 that are predicted to be category 1
2. ROC curve
- The horizontal axis of the ROC curve is the FPR (false positive rate) and the vertical axis is the TPR (true positive rate). When the two are equal, the classifier predicts category 1 with the same probability regardless of whether the true category is 1 or 0, and the AUC is 0.5
3. AUC indicators
- The probabilistic significance of AUC is the probability that a pair of positive and negative samples are randomly selected and the score of positive samples is greater than that of negative samples
- The minimum value of AUC is 0.5, and the maximum value is 1. The higher the value, the better
- AUC=1, perfect classifier, when using this prediction model, no matter what threshold is set, perfect prediction can be obtained. For the most part, there is no perfect classifier.
- 0.5 < AUC < 1: better than random guessing; the classifier has predictive value if the threshold is set properly
Note: The final AUC ranges between [0.5,1] and is closer to 1, the better
4.4.7 AUC computing API
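The API content is not included in these notes; a minimal sketch of scikit-learn's AUC computation (y_true and y_predict are assumed names: y_true must be the binary 0/1 true labels, y_predict the predicted labels or scores):

from sklearn.metrics import roc_auc_score

# Convert the true labels to 0/1 first if needed, e.g. y_true = (y_test == 4).astype(int) for the 2/4-labeled cancer data
auc = roc_auc_score(y_true, y_predict)
print("AUC:\n", auc)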
4.4.8 Summary
- AUC can only be used to evaluate binary classifiers
- AUC is very suitable for evaluating the performance of classifier with unbalanced samples
4.5 Model saving and loading
Learning Objectives:
- Joblib is used to save and load the model
4.5.1 sklearn model saving and loading API
4.5.2 Case
1. Model preservation
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib

# 4. Estimator
estimator = Ridge(alpha=0.5, max_iter=10000)
estimator.fit(x_train, y_train)
# Save the model
joblib.dump(estimator, "./my_ridge.pkl")
2. Load the model
# Load model
estimator = joblib.load("./my_ridge.pkl")
4.6 Unsupervised learning – K-means algorithm
Learning Objectives:
- Principle of k-means algorithm
- Evaluate K-means performance with the silhouette coefficient
- Advantages and disadvantages of K-means
4.6.1 What is Unsupervised learning
Unsupervised learning: No target value
4.6.2 Unsupervised learning includes algorithms
- clustering
- K-means (K-means clustering)
- Dimension reduction
- PCA
4.6.3 K-means principle
4.6.4 K-means API
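The API content is not included in these notes; a minimal sketch of scikit-learn's K-means estimator on made-up 2-D points (n_clusters is the number of clusters K; the comments summarize the standard iterative procedure):

from sklearn.cluster import KMeans

# Made-up 2-D points for illustration
data = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# K-means: 1) choose K initial centroids, 2) assign each point to its nearest centroid,
# 3) recompute each centroid as the mean of its assigned points, 4) repeat 2-3 until the centroids stop changing
estimator = KMeans(n_clusters=2)
estimator.fit(data)
print(estimator.labels_)           # cluster label of each sample
print(estimator.cluster_centers_)  # final centroids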
4.6.5 Case
K-means is used to cluster Instacart Market users
4.6.6 K-means performance evaluation indicators
1. Silhouette coefficient
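The formula image is not reproduced in these notes; the standard silhouette coefficient for a sample i is:

SC_i = (b_i - a_i) / max(b_i, a_i)

where a_i is the average distance from sample i to the other samples in its own cluster, and b_i is the average distance from sample i to the samples of the nearest other cluster.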
Conclusion:
- If b_i >> a_i: SC_i tends toward 1, which is good
- If b_i << a_i: SC_i tends toward -1, which is bad
- The silhouette coefficient lies in [-1, 1]; the closer to 1, the better the cohesion and separation
4.6.7 Silhouette coefficient API
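The API content is not included in these notes; a minimal sketch of scikit-learn's silhouette score, reusing the data and estimator names from the K-means sketch above:

from sklearn.metrics import silhouette_score

score = silhouette_score(data, estimator.labels_)  # average silhouette coefficient over all samples
print("Silhouette score:\n", score)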
4.6.8 Case
Using the silhouette coefficient to evaluate the Instacart Market user clustering produced by K-means
4.6.9 K-means summary
- Characteristics: an iterative algorithm that is intuitive, easy to understand, and very practical
- Disadvantage: it easily converges to a local optimum (mitigate this by running the clustering several times)
Note: Clustering is usually done before classification