Python Machine Learning Notes

A 3-day quick start to Python machine learning with Dark Horse Programmer

www.bilibili.com/video/BV1nt…

1 Overview of Machine Learning

1.1 Overview of artificial intelligence

1.1.1 The relationship between machine learning, artificial intelligence and deep learning

  • The relationship among machine learning, artificial intelligence, and deep learning

    • Machine learning is one approach to achieving artificial intelligence
    • Deep learning is an evolution of machine learning
  • The Dartmouth Conference – the beginning of artificial intelligence

    In 1956, a group of computer scientists met at the Dartmouth Conference and coined the term “artificial intelligence.”

1.1.2 What can machine learning and deep learning do

There are many application scenarios of machine learning, which can be said to permeate all walks of life. Medical care, aviation, education, logistics, e-commerce and so on.

  • Used in the field of mining and prediction:
    • Application scenarios: store sales forecasting, quantitative investment, advertising recommendation, enterprise user classification…
  • Used in the field of images:
    • Application scenarios: street traffic sign detection, face recognition, and so on
  • Used in the field of natural language processing:
    • Application scenarios: text classification, sentiment analysis, automatic chat, text detection, and so on.

What is important now is to master some machine learning algorithms and other skills to solve problems from an industry perspective.

1.2 What is machine learning?

1.2.1 Definition

Machine learning: automatically analyzing data to derive a model, and using the model to make predictions about unknown data.

1.2.2 Data set composition

  • Structure: feature values + target value

Note:

  • Each row of data is called a sample.
  • Some datasets have no target value.

1.3 Classification of machine learning algorithms

1.3.1 Supervised learning

  • Classification problems: there is a target value, and it is a category

  • Regression problems: there is a target value, and it is continuous data

1.3.2 Unsupervised learning

  • Unsupervised learning: No target value

1.4 Machine learning development process

Machine learning development process:

  1. Acquire the data
  2. Data processing
  3. Feature engineering
  4. Machine learning algorithm training – obtain a model
  5. Model evaluation
  6. Application

1.5 Learning framework and material introduction

Make a few points clear:

  • Algorithms are the core; data and computation are the foundation
  • Know your positioning

Most algorithms are implemented by dedicated algorithm engineers; we just need to:

  • Analyze a lot of data
  • Analyze the specific business
  • Apply common algorithms
  • Feature engineering, parameter tuning, optimization

Machine learning libraries and frameworks:

2 Feature Engineering

2.1 Datasets

  • Goals:
    • Know that datasets are divided into training sets and test sets
    • Use sklearn's built-in datasets
  • Application:
    • None

2.1.1 Available data sets

Kaggle at www.kaggle.com/

The UCI dataset is available at archive.ics.uci.edu/ml

Scikit-learn website: scikit-learn.org/stable/data…

1. Introduction to scikit-learn

  • A machine learning toolkit for Python
  • Scikit-learn includes implementations of many well-known machine learning algorithms
  • Scikit-learn is well documented, easy to use, and has a rich API
  • Latest stable release: 0.24

2. Installation

conda install -c conda-forge scikit-learn

3. Contents of scikit-learn

2.1.2 Sklearn dataset

1. scikit-learn dataset API

  • sklearn.datasets
    • Load and fetch popular datasets
    • datasets.load_*()
      • Get the small datasets bundled with sklearn
    • datasets.fetch_*(data_home=None)
      • The first argument, data_home, indicates the directory the dataset is downloaded to.

2. Return type of sklearn datasets

  • load and fetch both return a datasets.base.Bunch (dictionary-like format)

from sklearn.datasets import load_iris


def datasets_demo():
    """Demonstrate sklearn dataset usage. :return: None"""
    # 1. Get the dataset
    iris = load_iris()
    print("Iris dataset:\n", iris)
    print("Dataset description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)
    return None

2.1.3 Data set division

The general data set of machine learning is divided into two parts:

  • Training data: used for training and modeling
  • Test data: Used during model validation to evaluate the validity of the model

Division ratio:

  • Training set: 70% 80% 90%
  • Test set: 30% 20% 10%

Data set partitioning API

  • sklearn.model_selection.train_test_split(arrays, *options)

from sklearn.model_selection import train_test_split


def datasets_demo():
    """Demonstrate sklearn dataset usage and dataset splitting. :return: None"""
    # 1. Get the dataset
    iris = load_iris()
    print("Iris dataset:\n", iris)
    print("Dataset description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)
    # 2. Split the dataset: 20% of the samples are held out as the test set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
    print("Training set features:\n", x_train, x_train.shape)
    return None

2.2 Introduction to feature engineering

Learning Objectives:

  • Understand the importance of feature engineering in machine learning
  • Know the classification of feature engineering

2.2.1 Why is Feature Engineering needed?

It is widely held in the industry that data and features determine the upper limit of machine learning, and that models and algorithms merely approach that limit.

2.2.2 What is feature engineering

Feature engineering is the process of using professional background knowledge and skills to process data so that features can play a better role in machine learning algorithms.

Significance: it directly affects the performance of machine learning

2.2.3 The position of feature engineering compared with data processing

  • Pandas: A tool for easy reading and basic manipulation of data formats
  • Sklearn: provides a powerful interface for feature processing

2.3 Feature Extraction

Goal:

  • Apply DictVectorizer to convert categorical features into numerical, discretized features
  • Apply CountVectorizer to convert text features into numerical features
  • Apply TfidfVectorizer to convert text features into numerical features
  • Understand the difference between the two text feature extraction methods

2.3.1 Feature extraction

1. Convert arbitrary data, such as text or images, into digital features that can be used for machine learning

Note: converting data into numerical features helps computers understand the data better

  • Dictionary feature extraction (Feature discretization)
  • Text feature extraction
  • Image feature extraction

2. Feature extraction API

sklearn.feature_extraction

2.3.2 Dictionary feature extraction

Function: convert dictionary data into numerical features

sklearn.feature_extraction.DictVectorizer(…)

Example:

from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    """Dictionary feature extraction. :return: None"""
    data = [{'city': "Beijing", 'temperature': 100}, {'city': "Shanghai", 'temperature': 60}, {'city': "Shenzhen", 'temperature': 30}]
    # 1. Instantiate a transformer
    # sparse=False returns a dense array instead of a sparse matrix (which stores only the non-zero values)
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("New data:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names())
    return None

Execution Result:

2.3.3 Text feature extraction

1. Function: convert text data into numerical features

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[]) returns a word-frequency matrix
  • CountVectorizer.fit_transform(X): X is text or an iterable of text strings; returns a sparse matrix
  • CountVectorizer.inverse_transform(X): X is an array or sparse matrix; returns the data in its pre-conversion format
  • CountVectorizer.get_feature_names() returns a list of words
from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    """Text feature extraction. :return: None"""
    data = ["life is short,i like like python", "life is long,i dislike python"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

2. Running results

3. Chinese extraction example. Note: because there are no spaces between Chinese words, the program cannot recognize individual words, so the text must be segmented with spaces. The example below verifies this with manually segmented text; automatic segmentation with jieba follows.

def count_chinese_demo():
    """Chinese text feature extraction. :return: None"""
    data = ["I love Tiananmen Square in Beijing", "The sun rises over Tiananmen square."]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

4. Chinese extraction: use jieba for automatic word segmentation

import jieba


def count_chinese_demo2():
    """Chinese text feature extraction with automatic word segmentation. :return: None"""
    data = ["The so-called training instructor is actually a self-employed marketing professional infected with COVID-19. He infected 102 people, the oldest of whom was born in 1933.",
            "At the press conference, the relevant person in charge of the market supervision department of Jilin Province said that the relevant departments of the two places have carried out a joint investigation, and any illegal behavior will be severely punished.",
            "Amid the current epidemic, health care workers are still tightly wrapped in thick protective clothing, fighting the virus on the front line, and community volunteers still brave the cold wind to 'stand guard' downstairs for residents. Yet at this very moment, someone held a 'health training' event to promote products, claiming that the elderly would be infected with the coronavirus. Although done under the banner of caring for the health of the elderly, it is tantamount to seeking wealth at the cost of lives!"]
    # Word segmentation: join the segmented words with spaces
    data_new = []
    for item in data:
        data_new.append(" ".join(list(jieba.cut(item))))

    # 1. Instantiate a transformer
    transfer = CountVectorizer(stop_words=["102", "1933", "A"])
    # 2. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

Running results:

The above method extracts features as raw word counts. However, a word may appear many times in one article yet only rarely in other articles; such a word is important for classification, but raw counts do not capture this.

Solution: Use TfidfVectorizer

5. Tf-idf text feature extraction

  • The main idea of TF-IDF: If a certain word or phrase has a high probability of occurrence in one article and rarely appears in other articles, it is considered that the word or phrase has good classification ability and is suitable for classification.
  • Tf-idf: Used to evaluate the importance of a word to one document in a document set or a corpus.
  • TF: term frequency
  • IDF: inverse document frequency
  • API: sklearn.feature_extraction.text.TfidfVectorizer()
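The formula image is not included in these notes; for reference, the basic TF-IDF calculation (which TfidfVectorizer implements, up to normalization and smoothing details) is:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log\frac{N}{n_t}$$

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and n_t is the number of documents containing t.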

Case study:

from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_demo():
    """Chinese text feature extraction using TF-IDF, with automatic word segmentation. :return: None"""
    # Same data as in count_chinese_demo2 above
    data = ["The so-called training instructor is actually a self-employed marketing professional infected with COVID-19. He infected 102 people, the oldest of whom was born in 1933.",
            "At the press conference, the relevant person in charge of the market supervision department of Jilin Province said that the relevant departments of the two places have carried out a joint investigation, and any illegal behavior will be severely punished.",
            "Amid the current epidemic, health care workers are still tightly wrapped in thick protective clothing, fighting the virus on the front line, and community volunteers still brave the cold wind to 'stand guard' downstairs for residents. Yet at this very moment, someone held a 'health training' event to promote products, claiming that the elderly would be infected with the coronavirus. Although done under the banner of caring for the health of the elderly, it is tantamount to seeking wealth at the cost of lives!"]
    # Word segmentation: join the segmented words with spaces
    data_new = []
    for item in data:
        data_new.append(" ".join(list(jieba.cut(item))))

    # 1. Instantiate a transformer
    transfer = TfidfVectorizer(stop_words=["102", "1933", "A"])
    # 2. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

Running results:

2.4 Feature preprocessing

Goal:

  • Understand the characteristics of numerical data and categorical data
  • MinMaxScaler is used to normalize the feature data
  • StandardScaler is used to standardize the characteristic data

2.4.1 Introduction

What is feature preprocessing:

The process of transforming feature data into feature data more suitable for the algorithm model through some transformation functions

1. Contents

  • Normalization
  • Standardization

2. Feature preprocessing API

sklearn.preprocessing

Why do we need normalization?

  • When feature units or scales differ greatly, or one feature's variance is several orders of magnitude larger than the others', that feature tends to dominate the target result, and some algorithms cannot learn from the other features.

2.4.2 Normalization

1. Definition: transform the original data to map it into a given range (default [0, 1])

2. Formula
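The formula image is not included here; the standard min-max scaling formula (which is what MinMaxScaler computes) is:

$$X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \cdot (mx - mi) + mi$$

where max and min are the column maximum and minimum, and mx and mi are the bounds of the target interval (1 and 0 by default).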

import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """ Normalization :return: None """
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 0:8]
    # 2. Instantiate a transformer
    transfer = MinMaxScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Results:

Conclusion: when outliers exist, that is, when the maximum or minimum value is an outlier, the normalized values become inaccurate, so normalization is only suitable for traditional, precise, small-data scenarios.

2.4.3 Standardization

1. Definition: transform the original data so that each feature has mean 0 and standard deviation 1

2. Formula
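The formula image is not included here; the standardization formula used by StandardScaler is:

$$X' = \frac{x - \bar{x}}{\sigma}$$

where x̄ is the column mean and σ is the column standard deviation.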

  • For normalization: if outliers affect the maximum or minimum, the result obviously changes
  • For standardization: with a reasonably large amount of data, a few outliers have little influence on the mean, so the variance changes little

3. API

  • sklearn.preprocessing.StandardScaler()

4. Data calculation

Do the same with the above data

  • Analysis:
  1. Instantiate StandardScaler
  2. Transform the data via fit_transform
from sklearn.preprocessing import StandardScaler


def stand_demo():
    """ Standardization :return: None """
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 0:8]
    # 2. Instantiate a transformer
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Execution Result:

Conclusion:

Standard deviation: a measure of how concentrated the data is

With enough samples, standardization is stable, making it suitable for modern, noisy, big-data scenarios.

2.5 Feature dimension reduction

Goal:

  • Apply VarianceThreshold to remove low-variance features
  • Understand the characteristics and calculation of the correlation coefficient
  • Apply the correlation coefficient to perform feature selection

2.5.1 Dimensionality reduction

Dimensionality reduction: the process of reducing the number of random variables (features) under certain constraints to obtain a set of “uncorrelated” principal variables

  • Reduce the number of random variables

  • Relevant features
    • The correlation between relative humidity and rainfall
    • .

This matters because training learns from the feature values. If the features themselves are problematic or strongly correlated with each other, algorithm learning and prediction are heavily affected.

2.5.2 Two ways of dimensionality reduction

  • Feature selection
  • Principal component analysis (can be understood as a feature extraction method)

2.5.3 What is feature selection

1. Definition: data often contains redundant or correlated variables (also called features, attributes, indicators, etc.); feature selection aims to find the main features among the original features.

2. Methods

  • Filter: Mainly explore the characteristics of features themselves, and the correlation between features and target values
    • Variance selection method: low variance filtering
    • The correlation coefficient
  • Embedded: Algorithms automatically select features (associations between features and target values)
    • Decision tree: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc

3. Module

sklearn.feature_selection

4. Filter methods

4.1 Low-variance feature filtering

Remove some low-variance features, considering what the variance means here:

  • Small feature variance: most samples have similar values for that feature
  • Large feature variance: the feature's value varies a lot across samples
  • sklearn.feature_selection.VarianceThreshold(threshold=0.0)
    • Removes all low-variance features
    • Variance.fit_transform(X)
      • X: numpy array data of shape [n_samples, n_features]
      • Return value: features whose training-set variance is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. removes features that have the same value in all samples.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold


def variance_demo():
    """Filter out low-variance features. :return: None"""
    # 1. Get the data
    data = pd.read_csv('../data/dating.csv', encoding="gbk")
    data = data.iloc[:, 1:8]
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=10)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)

    # Calculate the correlation coefficient between two variables
    r = pearsonr(data["gender"], data["position"])
    print("Correlation coefficient:\n", r)
    return None

Execution Result:

4.2 Correlation coefficient

  • Pearson correlation coefficient
    • A statistical indicator reflecting the degree of correlation between variables

The value of the correlation coefficient is between -1 and +1, i.e. [-1,+1]

  • When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated
  • When |r| = 1, the two variables are perfectly correlated; when r = 0, there is no linear correlation
  • When 0 < |r| < 1, the two variables are correlated to some degree. The closer |r| is to 1, the stronger the linear relationship; the closer |r| is to 0, the weaker the linear relationship.
  • A common three-level rule of thumb: |r| < 0.4 indicates low correlation; 0.4 <= |r| < 0.7 indicates significant correlation; 0.7 <= |r| < 1 indicates high linear correlation.
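The formula image is not included here; the Pearson correlation coefficient of samples x and y is defined as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$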
from scipy.stats import pearsonr

2.6 Principal component analysis

Goal:

  • PCA is used to reduce dimension of feature

Application:

  • Principal component analysis between user and item category

2.6.1 Introduction

Principal component analysis (PCA): the process of transforming high-dimensional data into low-dimensional data; during this process some of the original data may be discarded and new variables created.

Function: compress the data's dimensionality, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.

Application: Regression analysis or cluster analysis.

API:

  • sklearn.decomposition.PCA(n_components=None)
from sklearn.decomposition import PCA


def pca_demo():
    """ PCA dimensionality reduction :return: None """
    # 1. Prepare the data
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 2. Instantiate a transformer
    # n_components as a float keeps that fraction of the variance; as an int it keeps that many components
    transfer = PCA(n_components=0.95)
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

Execution Result:

2.6.2 case

Explore the segmentation of user preferences for item categories

#Run jupyter notebook
jupyter notebook

3 Classification Algorithms

3.1 sklearn transformers and estimators

Learning objectives:

  • Know the transformer and estimator workflow in sklearn

3.1.1 Transformers

Steps of feature engineering:

  1. Instantiate (instantiate a transformer class)
  2. Call fit_transform (for building a word-frequency matrix from documents, fit and transform are not called at the same time)

Interfaces like these for feature processing are called transformers. Transformer calls take several forms:

  • fit_transform
  • fit
  • transform

For example, standardization:

(x - mean) / std

fit_transform() can be split into two steps:

fit() calculates the mean and standard deviation of each column

transform() applies (x - mean) / std to perform the final conversion
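A minimal sketch of this split (the values are made up for illustration): fit learns the statistics on the training data, and transform reuses them on new data.

from sklearn.preprocessing import StandardScaler

x_train = [[1.0], [2.0], [3.0]]
x_test = [[2.5]]

scaler = StandardScaler()
# fit_transform on the training set: learns the mean/std, then scales
x_train_scaled = scaler.fit_transform(x_train)
# transform only on the test set: reuses the training mean/std
x_test_scaled = scaler.transform(x_test)

print(scaler.mean_, x_train_scaled, x_test_scaled)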

3.1.2 Estimators

Estimators play an important role in sklearn; they are the class of APIs that implement the algorithms.

  • Estimators for classification:
    • sklearn.neighbors: k-nearest neighbors
    • sklearn.naive_bayes: naive Bayes
    • sklearn.linear_model.LogisticRegression: logistic regression
    • sklearn.tree: decision trees and random forests
  • Estimators for regression:
    • sklearn.linear_model.LinearRegression: linear regression
    • sklearn.linear_model.Ridge: ridge regression

3.2 K-nearest neighbor algorithm

Learning goals

  • Learn the distance formula of KNN algorithm
  • Study the hyperparameter K value of KNN algorithm and its value problem
  • Learn the pros and cons of KNN
  • The KNeighborsClassifier is used to implement classification
  • Understand the accuracy of classification algorithm evaluation criteria

1. Principle of K-Nearest Neighbor algorithm (KNN)

K Nearest Neighbor algorithm is also called KNN algorithm

Definition: A sample belongs to a category if most of the k closest samples in the feature space (that is, the nearest neighbors in the feature space) belong to that category.

The distance between two samples is usually measured by the Euclidean distance.
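The formula image is not included here; for two samples a and b with n features, the Euclidean distance is:

$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$$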

Using the KNN algorithm to classify irises:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


def knn_iris():
    """ Classify irises with the KNN algorithm :return: None """
    # 1. Get the data
    iris = load_iris()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the training-set mean/std for the test set
    # 4. KNN estimator
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the predicted values with the true values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_predict == y_test)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None

Running results:

3.3 Model selection and tuning

Learning Objectives:

  • Cross validation procedure
  • Hyperparameter search procedure
  • GridSearchCV is used to optimize the algorithm parameters

3.3.1 Cross validation

Cross-validation: divide the training data into a training part and a validation part. For example, split the data into four parts and use one part as the validation set; then run four rounds of tests, each time with a different validation set. This yields four sets of model results, and their average is taken as the final result. This is called 4-fold cross-validation.

The data is divided into training set and test set, but in order to make the model results obtained from the training set more accurate, the following processing is done:

Using the KNN algorithm to classify irises, adding grid search and cross-validation:

from sklearn.model_selection import GridSearchCV


def knn_iris_gscv():
    """ Classify irises with KNN, adding grid search and cross-validation :return: None """
    # 1. Get the data
    iris = load_iris()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier()

    # Add grid search and cross-validation
    # Parameter candidates
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)

    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the predicted values with the true values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_predict == y_test)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)

    # Best parameters
    print("Best parameters:\n", estimator.best_params_)
    # Best cross-validation score
    print("Best score:\n", estimator.best_score_)
    # Best estimator
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results
    print("Cross-validation results:\n", estimator.cv_results_)

    return None

Running results:

Facebook case

3.4 Naive Bayes algorithm

Learning Objectives:

  • Conditional probability and joint probability
  • Bayes’ formula, and feature independence
  • Laplace smoothing coefficient
  • Apply Bayes' formula to compute probabilities

3.4.1 What is naive Bayes Classification

With naive Bayes classification we obtain a probability value for each class, and mail (for example spam vs. not spam) is classified according to which probability is larger.

3.4.2 Probability basics

1. Definition of Probability

  • Probability is defined as the likelihood of an event happening
  • P(X) : the value ranges from 0 to 1.

3.4.3 Joint probability, conditional probability and mutual independence

  • Joint probability: the probability that multiple conditions hold at the same time
    • Notation: P(A, B)
  • Conditional probability: the probability of event A occurring given that event B has occurred
    • Notation: P(A|B)
  • Mutual independence: if P(A, B) = P(A)P(B), then event A and event B are said to be mutually independent
    • P(A, B) = P(A)P(B) <=> event A and event B are independent of each other

3.4.4 Bayes’ formula
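The formula images are not included here. With C a class and W the observed features, Bayes' formula can be written as:

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)}$$

With Laplace smoothing (coefficient α, typically 1), the per-feature probabilities in naive Bayes are estimated as

$$P(F_1 \mid C) = \frac{N_i + \alpha}{N + \alpha m}$$

where N_i is the number of times feature F_1 appears in documents of class C, N is the total count of all features in documents of class C, and m is the number of distinct features.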

3.4.5 API

  • sklearn.naive_bayes.MultinomialNB(alpha=1.0)
    • Naive Bayes classification
    • Alpha: Laplace smoothing coefficient

3.4.6 case

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def nb_news():
    """ Classify news articles with the naive Bayes algorithm :return: None """
    # 1. Get the data
    news = fetch_20newsgroups(subset="all")
    # 2. Split the dataset: feature train/test sets, target train/test sets
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    # 3. Feature engineering: TF-IDF text feature extraction
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Naive Bayes estimator
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: directly compare the predicted values with the true values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_predict == y_test)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None

Running results:

3.5 Decision Trees

Learning Objectives:

  • The formula and role of information entropy
  • The formula and role of information gain
  • Apply information gain to measure how much a feature reduces uncertainty
  • Understand the three decision tree algorithm implementations

3.5.1 Understanding the decision tree

The idea behind decision trees is simple: it mirrors the if-else conditional branch structure used in programming. The earliest decision trees were classification learning methods that use this structure to split the data.

3.5.2 Principle of decision tree

1. Principle

  • Information entropy, information gain

2. Definition of information entropy

  • The technical term for H is information entropy, measured in bits

3. One basis for decision tree splits: information gain

Definition: the information gain g(D, A) of feature A on training set D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A, namely:
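The formula images are not included here; the standard definitions are:

$$g(D, A) = H(D) - H(D \mid A), \qquad H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

where p_k is the proportion of samples in D belonging to class k.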

3.5.4 case

1. Classify iris by decision tree

from sklearn.tree import DecisionTreeClassifier, export_graphviz


def decision_iris():
    """ Classify irises with a decision tree :return: None """
    # 1. Get the dataset
    iris = load_iris()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # 3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train, y_train)
    # 4. Model evaluation
    # Method 1: directly compare the predicted values with the true values
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_predict == y_test)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)

    # Visualize the decision tree (the .dot file can be rendered with Graphviz)
    export_graphviz(estimator, out_file="iris_tree.dot")
    return None

Running results:

2. Survival prediction of Titanic passengers

3.5.6 Decision tree summary

  • Advantages:
    • Simple understanding and interpretation, tree visualization
  • Disadvantages:
    • Decision trees can build overly complex trees that fit the training data but do not generalize well; this is called overfitting
  • Improvement:
    • Pruning CART algorithm (already implemented in the decision tree API)
    • Random forests

3.6 Ensemble learning: random forests

Learning Objectives:

  • The establishment process of every decision tree in random forest
  • Why do you need random Bootstrap sampling
  • Hyperparameters of random forest

3.6.1 What is an ensemble learning method

Ensemble learning solves a single prediction problem by combining several models.

It works by generating multiple classifiers/models that each learn and make predictions independently. These predictions are finally combined into a single composite prediction, which is therefore better than the prediction of any individual classifier.

3.6.2 What is random forest

Random forest is a classifier containing multiple decision trees, and its output categories are determined by the mode of the categories output by individual trees.

3.6.3 Random forest principle process

The learning algorithm builds each tree as follows:

  • Let N be the number of training samples and M the number of features.
    • Randomly draw one sample at a time, repeated N times (duplicate samples may occur)
    • Randomly select m features, with m << M, and build the decision tree on them
  • This sampling with replacement is called bootstrap sampling

3.6.4 Random forest API
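The API details are not included in these notes; a minimal sketch using sklearn's RandomForestClassifier together with grid search (the parameter values are chosen for illustration only):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

# n_estimators: number of trees in the forest; max_depth: maximum depth of each tree
param_dict = {"n_estimators": [120, 200, 300], "max_depth": [5, 8, 15]}
estimator = GridSearchCV(RandomForestClassifier(), param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)

print("Accuracy:", estimator.score(x_test, y_test))
print("Best parameters:", estimator.best_params_)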

3.6.5 case

Titanic passenger survival prediction

4 Regression and Clustering Algorithms

4.1 Linear regression

Learning Objectives:

  • The principle of linear regression
  • Regression prediction is implemented using LinearRegression or SGDRegressor
  • Evaluation criteria and formula of regression algorithm

4.1.1 Principle of linear regression

1. Application scenarios of linear regression

2. What is linear regression

Definition and Formula

4.1.2 Loss and optimization principle of linear regression

1. Loss function: also known as cost, cost function and objective function

2. Optimization method

  • Normal equations: less used

  • Gradient Descent
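The formula images are not included here; the least-squares loss that linear regression minimizes, and the gradient descent update rule, are:

$$J(w) = \sum_{i=1}^{m}\left(h_w(x_i) - y_i\right)^2, \qquad w := w - \alpha \frac{\partial J(w)}{\partial w}$$

where h_w(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b is the model's prediction and α is the learning rate.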

Case study:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


def linear1():
    """ Predict Boston housing prices with the normal equation optimization method """
    # 1. Get the data
    boston = load_boston()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)

    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)

    # 6. Model evaluation
    return None
from sklearn.linear_model import SGDRegressor


def linear2():
    """ Predict Boston housing prices with the gradient descent optimization method """
    # 1. Get the data
    boston = load_boston()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Estimator
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)

    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)

    # 6. Model evaluation
    return None

Running results:

4.1.3 Regression performance evaluation
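The evaluation metric used below is the mean squared error; its formula (as computed by mean_squared_error) is:

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2$$

where ŷ_i is the predicted value and y_i the true value.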

from sklearn.metrics import mean_squared_error
def linear1():
    """ Predict Boston housing prices with the normal equation optimization method """
    # 1. Get the data
    boston = load_boston()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)

    # 5. Get the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - bias:\n", estimator.intercept_)

    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Normal equation - mean squared error:\n", error)

    return None
def linear2():
    """ Predict Boston housing prices with the gradient descent optimization method """
    # 1. Get the data
    boston = load_boston()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Estimator
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)

    # 5. Get the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - bias:\n", estimator.intercept_)

    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent - mean squared error:\n", error)
    return None

4.1.4 Comparison between normal equation and gradient descent

  • Gradient descent:
    • Requires choosing a learning rate
    • Requires iteration
    • Works well when the number of features is large
  • Normal equation:
    • No learning rate needed
    • Solved in one computation
    • Must solve the equation directly; time complexity O(n³)
  • How to choose:
    • Small-scale data:
      • LinearRegression (does not address overfitting)
      • Ridge regression
    • Large-scale data: SGDRegressor

4.1.5 Extension

Optimization methods: GD, SGD, SAG

1. GD

The original gradient descent (GD) computes the gradient over all training samples, which is computationally expensive; hence the improved algorithms below.

2. SGD

Stochastic gradient descent (SGD): it considers only one training sample at a time.

  • Advantages of SGD:
    • efficient
    • Easy to implement
  • Disadvantages of SGD:
    • SGD requires many hyperparameters: the regularization parameter, the number of iterations, etc.
    • SGD is sensitive to feature standardization

3. SAG

SAG (Stochastic Average Gradient): because stochastic gradient descent converges slowly, gradient-based algorithms such as SAG were proposed.

Ridge regression and logistic regression both support SAG optimization.

4.2 Underfitting and overfitting

Learning Objectives:

  • Disadvantages of linear regression (without regularization)
  • Causes and solutions of over-fitting and under-fitting

4.2.1 Introduction

1. Underfitting

2. Overfitting

Analysis:

  • The first case: because machine learning has too few swan features, the discrimination criteria are too rough to accurately identify swans
  • Case two: The machine can basically tell swans apart. Unfortunately, all the pictures of swans were white swans, and then the machine learned that the swan’s feathers were white, and then it saw a swan with black feathers and thought it wasn’t a swan.
  • Overfitting: a hypothesis that can obtain a better fit than other hypotheses on the training set but cannot fit the data well on the test data set is considered to be overfitting. (Model is too complex)
  • Underfitting: a hypothesis that does not fit the training set data well and does not fit the test data well is considered to be underfitting. (Model is too simple)

4.2.2 Causes and solutions

  • The reason of underfitting and the solution
    • Reason: Learning too few features of data
    • Solution: Increase the number of features in the data
  • Causes and solutions of overfitting
    • Reason: There are too many original features, some noisy features, and the model is too complicated because the model tries to take into account the data of each test point
    • Solutions:
      • regularization

Regularization: L2 regularization (common), L1 regularization

4.3 Improvement of linear regression — ridge regression

Learning Objectives:

  • Learning the difference between the principle of ridge regression and linear regression
  • Effects of regularization on weight parameters
  • The difference between L1 and L2 regularization

4.3.1 Linear regression with L2 regularization — Ridge regression

Ridge regression is also a linear regression. In order to solve the problem of overfitting, the regularization restriction is added to establish the regression equation.

1. API

2. Observe the change of regularization degree and its influence on the results

Case: Ridge regression to Boston housing price forecast

from sklearn.linear_model import Ridge


def linear3():
    """ Predict Boston housing prices with ridge regression """
    # 1. Get the data
    boston = load_boston()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Estimator
    estimator = Ridge(alpha=0.5, max_iter=10000)
    estimator.fit(x_train, y_train)

    # 5. Get the model
    print("Ridge regression - weight coefficients:\n", estimator.coef_)
    print("Ridge regression - bias:\n", estimator.intercept_)

    # 6. Model evaluation
    y_predict = estimator.predict(x_test)
    print("Predicted house prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression - mean squared error:\n", error)
    return None

4.4 Classification algorithms: logistic regression and binary classification

Learning Objectives:

  • The loss function of logistic regression
  • The optimization method of logistic regression
  • The sigmoid function
  • Application scenarios of logistic regression
  • The difference between the precision and recall metrics
  • The practical meaning of the F1-score and recall
  • How to evaluate a model when the samples are imbalanced
  • The meaning of the ROC curve and the AUC metric
  • Apply classification_report to compute precision and recall
  • Apply roc_auc_score to compute the AUC metric

4.4.1 Logistic regression

Definition: logistic regression is a classification model in machine learning. Although its name contains “regression”, it is a classification algorithm that is merely related to regression. Because it is simple and efficient, it is widely used in practice.

Application Scenarios:

  • Click through rate
  • Whether it is spam
  • Whether sick
  • Financial fraud
  • False account

Note: looking at the examples above, each is a judgment between two categories. Logistic regression is a powerful tool for solving binary classification problems.

4.4.2 Principle of logistic regression

1. Input: The output of linear regression is the input of logistic regression

2. Activation function

  • The sigmoid function:

    g = 1/(1 + e^(-x))

    The formula can also be written in matrix form.

    Note: the input x to the sigmoid function is the linear regression output, namely x = h(w) = w1x1 + w2x2 + w3x3 + … + b

3. Loss and optimization

  1. Loss

    The loss of logistic regression is called the log-likelihood loss

Combining the two cases gives the complete loss function.

We know that for -log(P), the larger P is, the smaller the result, which is how this loss can be analyzed.

  2. Optimizing the loss

    Gradient descent is again used to reduce the value of the loss function. The weight parameters of the preceding linear model are updated so as to increase the predicted probability for samples that truly belong to class 1 and decrease it for samples that belong to class 0.
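The loss formula image is not included here; the standard log-likelihood (cross-entropy) loss for logistic regression is:

$$\mathrm{cost}\big(h_w(x), y\big) = -\sum_{i=1}^{m}\Big[y_i \log\big(h_w(x_i)\big) + (1 - y_i)\log\big(1 - h_w(x_i)\big)\Big]$$

where h_w(x_i) is the predicted probability that sample i belongs to class 1 and y_i is its true label (0 or 1).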

4.4.3 API of logistic regression
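The API details are not included in these notes; the estimator is sklearn.linear_model.LogisticRegression, a minimal sketch of which is:

from sklearn.linear_model import LogisticRegression

# penalty: type of regularization (default "l2"); C: inverse of the regularization strength (default 1.0)
estimator = LogisticRegression(penalty="l2", C=1.0)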

4.4.4 case

Cancer classification prediction – benign/malignant breast cancer tumor prediction

Analysis process:

  1. To get the data

    Add names when reading

  2. The data processing

    Handling missing values

  3. Data set partitioning

  4. Characteristics of the engineering

    Dimensionless processing – standardization

  5. Logistic regression estimator

  6. Model to evaluate

Data address: archive.ics.uci.edu/ml/machine-…
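A minimal sketch of this workflow; the local file path and the column names are placeholders, and it assumes the UCI file uses '?' for missing values and a Class column for the labels:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# 1. Get the data, supplying column names (the raw file has no header row)
column_names = ["id"] + [f"feature_{i}" for i in range(1, 10)] + ["Class"]
data = pd.read_csv("breast-cancer-wisconsin.data", names=column_names)

# 2. Data processing: replace '?' with NaN and drop rows with missing values
data = data.replace(to_replace="?", value=np.nan)
data = data.dropna()

# 3. Split the dataset
x = data.iloc[:, 1:-1]
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y)

# 4. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 5. Logistic regression estimator
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# 6. Model evaluation
print("Accuracy:", estimator.score(x_test, y_test))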

4.4.5 Classification evaluation methods

1. Accuracy and recall rate

  • Confusion matrix

2. Precision and Recall

  • Precision: among the samples predicted to be positive, the proportion that are actually positive

  • Recall: among the samples that are actually positive, the proportion predicted to be positive (completeness; the ability to find positive samples)

  • F1-score: reflects the robustness of the model
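The formula image is not included here; the F1-score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$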

3. Classification report API

4. Compute the precision, recall, and F1-score for the cancer classification prediction (benign/malignant breast tumor prediction); see the sketch below.
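The report API is sklearn.metrics.classification_report; a minimal sketch continuing the cancer example above (the 2/4 labels and the fitted estimator are carried over from that sketch):

from sklearn.metrics import classification_report

y_predict = estimator.predict(x_test)
report = classification_report(y_test, y_predict, labels=[2, 4], target_names=["benign", "malignant"])
print(report)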

4.4.6 ROC curve and AUC index

1. TPR and FPR

  • (Recall rate) TPR = TP/(TP + FN)
    • The percentage of all samples of true category 1 that are predicted to be category 1
  • FPR = FP / (FP + TN)
    • The percentage of all samples with a true category of 0 that are predicted to be category 1

2. ROC curve

  • The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal, the classifier predicts class 1 with the same probability regardless of whether the true class is 1 or 0, and the AUC is 0.5.

3. AUC indicators

  • The probabilistic significance of AUC is the probability that a pair of positive and negative samples are randomly selected and the score of positive samples is greater than that of negative samples
  • The minimum value of AUC is 0.5, and the maximum value is 1. The higher the value, the better
  • AUC=1, perfect classifier, when using this prediction model, no matter what threshold is set, perfect prediction can be obtained. For the most part, there is no perfect classifier.
  • 0.5 < AUC < 1: better than random guessing; with a suitable threshold the classifier has predictive value

Note: The final AUC ranges between [0.5,1] and is closer to 1, the better

4.4.7 AUC computing API
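The API is sklearn.metrics.roc_auc_score(y_true, y_score); a minimal sketch continuing the cancer example (converting the assumed 2/4 labels to 0/1):

import numpy as np
from sklearn.metrics import roc_auc_score

# y_true must be binary (0/1): samples labeled 4 (malignant) become the positive class
y_true = np.where(y_test > 3, 1, 0)
print("AUC:", roc_auc_score(y_true, y_predict))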

4.4.8 summary

  • AUC can only be used to evaluate binary classifiers
  • AUC is well suited to evaluating classifier performance on imbalanced samples

4.5 Model saving and loading

Learning Objectives:

  • Joblib is used to save and load the model

4.5.1 SkLearn model saving and loading API

4.5.2 Case

1. Model preservation

import joblib  # in older scikit-learn versions: from sklearn.externals import joblib

# 4. Estimator
estimator = Ridge(alpha=0.5, max_iter=10000)
estimator.fit(x_train, y_train)

# Save the model
joblib.dump(estimator, "./my_ridge.pkl")

2. Load the model

# Load model
estimator = joblib.load("./my_ridge.pkl")

4.6 Unsupervised learning – K-means algorithm

Learning Objectives:

  • The principle of the K-means algorithm
  • The silhouette coefficient for K-means performance evaluation
  • Advantages and disadvantages of K-means

4.6.1 What is Unsupervised learning

Unsupervised learning: No target value

4.6.2 Algorithms included in unsupervised learning

  • clustering
    • K-means (K-means clustering)
  • Dimension reduction
    • PCA

4.6.3 K-means principle

4.6.4 K-means API
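The API is sklearn.cluster.KMeans(n_clusters=8); a minimal sketch on toy 2-D data (the data values are made up for illustration):

from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups
data = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# n_clusters: the number K of cluster centers.
# K-means iterates: assign each point to the nearest center, then move each center
# to the mean of its assigned points, until the centers stop changing.
estimator = KMeans(n_clusters=2)
estimator.fit(data)

print("Cluster labels:", estimator.labels_)            # cluster index assigned to each sample
print("Cluster centers:", estimator.cluster_centers_)  # coordinates of the two centers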

4.6.5 case

K-means is used to cluster Instacart Market users

4.6.6 K-means performance evaluation metric

1. Silhouette coefficient

Conclusion:

  • If b_i >> a_i: the coefficient tends to 1, which is good
  • If b_i << a_i: the coefficient tends to -1, which is bad
  • The silhouette coefficient lies in [-1, 1]; the closer it is to 1, the better the cohesion and separation
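The formula image is not included here; the silhouette coefficient of a sample i is:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where a_i is the mean distance from sample i to the other samples in its own cluster, and b_i is the mean distance from sample i to the samples in the nearest other cluster.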

4.6.7 Silhouette coefficient API
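The API is sklearn.metrics.silhouette_score(X, labels); continuing the K-means sketch above:

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples; closer to 1 is better
print("Silhouette score:", silhouette_score(data, estimator.labels_))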

4.6.8 case

Evaluating the Instacart Market user clustering produced by K-means with the silhouette coefficient

4.6.9 K-means summary

  • Characteristics: an iterative algorithm that is intuitive, easy to understand, and very practical
  • Disadvantage: easily converges to a local optimum (mitigated by running the clustering multiple times)

Note: Clustering is usually done before classification

5 Summary

5.1 Day 1

5.2 Day 2

5.3 Day 3