1. Ensemble Learning
1.1 What is ensemble learning
Ensemble learning solves a single prediction problem by building and combining several models. It works by training multiple classifiers/models that each learn and make predictions independently; these predictions are then combined into one composite prediction, which is therefore better than the prediction of any single model.
1.2 The two core tasks of machine learning
- How to fit the training data well — the underfitting (bias) problem, which Boosting mainly addresses (see the comparison in 3.1.2).
- How to generalize well to new data — the overfitting (variance) problem, which Bagging mainly addresses.
1.3 Boosting and Bagging in ensemble learning
As long as the performance of each single classifier is not too poor, the result of ensemble learning is generally better than that of any single classifier.
2. Bagging
2.1 Bagging ensemble principle
Goal: classify the circles and squares below
Implementation process:

- Sample different data sets (bootstrap sampling)
- Train a classifier on each sampled data set
- Combine the classifiers by equal-weight (majority) voting to get the final result (a minimal code sketch follows this list)
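A minimal sketch of this process, assuming scikit-learn's BaggingClassifier with a decision tree as the base learner; the synthetic dataset here only stands in for the circles-and-squares example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the circles and squares
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base tree is trained on a bootstrap sample; predictions are combined by majority vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=True, random_state=0)
bag.fit(x_train, y_train)
print("Bagging accuracy:", bag.score(x_test, y_test))
```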
2.2 Random forest construction process
In machine learning, a random forest is a classifier that contains multiple decision trees; its output class is the mode of the classes output by the individual trees.
Random forest = Bagging + decision tree
For example, if you train five trees and four of them give True and one gives False, the final vote is True.
Key steps in building a random forest (N is the number of training samples, M is the number of features):
1) Randomly select one sample at a time, with replacement, and repeat N times (duplicate samples are possible)
2) Randomly select m features, where m << M, and build a decision tree on them (a minimal sketch of these two steps appears after the questions below)
Thinking:

1. Why do we randomly sample the training set?

If there were no random sampling, every tree would be trained on the same training set, and the trained trees would produce exactly the same classification results.

2. Why sample with replacement?

If sampling were done without replacement, the training samples of each tree would be completely different and have no overlap, so every tree would be "biased" and absolutely "one-sided" (of course, this wording may not be strictly accurate); that is, the trees would be trained on very different data. Yet the final classification of a random forest depends on the votes of multiple trees (weak classifiers).
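A minimal sketch of the two construction steps above (bootstrap sampling plus a random feature subset per tree), assuming numpy and scikit-learn's DecisionTreeClassifier; it only illustrates the N-sample / m-feature idea and is not a replacement for RandomForestClassifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_forest(X, y, n_trees=10, random_state=0):
    rng = np.random.default_rng(random_state)
    N, M = X.shape
    m = max(1, int(np.sqrt(M)))  # m << M features per tree
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, N, size=N)            # bootstrap: N draws with replacement
        cols = rng.choice(M, size=m, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, X):
    # Equal-weight (majority) vote over the trees; assumes binary 0/1 labels
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)
```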
2.3 Random forest API
- sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
- n_estimators: integer, optional (default=10). The number of trees in the forest; commonly tried values: 120, 200, 300, 500, 800, 1200
- criterion: string, optional (default="gini"). The function used to measure the quality of a split
- max_depth: integer or None, optional (default=None). The maximum depth of the tree; commonly tried values: 5, 8, 15, 25, 30
- max_features="auto": the maximum number of features to consider when looking for the best split
  - If "auto", then max_features=sqrt(n_features).
  - If "sqrt", then max_features=sqrt(n_features) (same as "auto").
  - If "log2", then max_features=log2(n_features).
  - If None, then max_features=n_features.
- bootstrap: boolean, optional (default=True). Whether to use bootstrap sampling (sampling with replacement) when building trees
- min_samples_split: the minimum number of samples required to split an internal node
- min_samples_leaf: the minimum number of samples required at a leaf node
- Hyperparameters to tune: n_estimators, max_depth, min_samples_split, min_samples_leaf
2.4 Random forest prediction example
- Instantiate a random forest

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Random forest for prediction
rf = RandomForestClassifier()
```

- Define the candidate hyperparameter values

```python
param = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
```

- Grid search with GridSearchCV

```python
# Hyperparameter tuning
gc = GridSearchCV(rf, param_grid=param, cv=2)
gc.fit(x_train, y_train)
print("The accuracy of random forest prediction is:", gc.score(x_test, y_test))
```
Note:
- The construction process of the random forest
- The depth of the trees, the number of trees, and other settings need hyperparameter tuning
2.5 Advantages of Bagging ensembles
Bagging + decision tree / linear regression / logistic regression / deep learning / … = Bagging ensemble learning method
Ensemble learning methods formed this way:
- Improve generalization accuracy by about 2% over the original algorithm
- Are simple, convenient, and versatile
3. Boosting
3.1 Boosting ensemble principle
3.1.1 What is Boosting
Learning accumulates gradually, building from weak learners to a strong one.
In short: each time a weak learner is added, the overall performance improves.
Representative algorithms: Adaboost, GBDT, XGBoost
3.1.2 Implementation process
- Train the first learner
- Adjust the data distribution
- Train the second learner
- Adjust the data distribution again
- Continue training learners in turn, adjusting the data distribution after each one
Key points:
- How is the voting weight of each learner determined?
- How is the data distribution adjusted?
(The classic AdaBoost formulas for both are sketched below.)
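For reference, the classic AdaBoost algorithm answers these two questions as follows, where $\varepsilon_t$ is the weighted error of the t-th weak learner $h_t$, $w_{t,i}$ is the weight of sample $i$ in round $t$, labels $y_i \in \{-1,+1\}$, and $Z_t$ is a normalization factor:

$$
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},
\qquad
w_{t+1,i} = \frac{w_{t,i}\, e^{-\alpha_t\, y_i\, h_t(x_i)}}{Z_t}
$$

The voting weight $\alpha_t$ grows as the learner's error shrinks, and misclassified samples receive larger weights in the next round.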
Summary of the construction process of AdaBoost
Bagging ensembles vs. Boosting ensembles
- Difference 1: data
  - Bagging: each learner is trained on a sampled subset of the data;
  - Boosting: the importance (weight) of the data is adjusted according to the previous round's learning results.
- Difference 2: voting
  - Bagging: all learners vote with equal weight;
  - Boosting: learners vote with different weights (weighted voting).
- Difference 3: learning order
  - Bagging: learning is parallel, with no dependencies between learners;
  - Boosting: learning is sequential, with learners trained one after another.
- Difference 4: main purpose
  - Bagging: mainly used to improve generalization performance (combat overfitting, i.e. reduce variance);
  - Boosting: mainly used to improve fitting accuracy on the training data (combat underfitting, i.e. reduce bias).
3.1.3 API introduction
- from sklearn.ensemble import AdaBoostClassifier
- API link: Scikit-learn.org/stable/modu… (a short usage sketch follows)
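A minimal usage sketch, assuming the same x_train / x_test split as in the random forest example above; the n_estimators and learning_rate values are only illustrative:

```python
from sklearn.ensemble import AdaBoostClassifier

# Fits weak learners sequentially (decision stumps by default),
# re-weighting the training samples after each round
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0)
ada.fit(x_train, y_train)
print("AdaBoost accuracy:", ada.score(x_test, y_test))
```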
3.2 GBDT
GBDT (Gradient Boosting Decision Tree) is an iterative decision tree algorithm. It is composed of multiple decision trees, and the final answer is obtained by summing the conclusions of all the trees. When first proposed, it was regarded as an algorithm with strong generalization ability, and in recent years it has also attracted much attention as a machine learning model for search ranking.
GBDT = Gradient descent + Boosting + decision tree
3.2.1 Gradient
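As a reminder, the standard gradient descent update, with learning rate $\alpha$ and objective $F(\theta)$, is:

$$
\theta_{k+1} = \theta_k - \alpha \left.\frac{\partial F(\theta)}{\partial \theta}\right|_{\theta=\theta_k}
$$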
3.2.2 GBDT execution process
Applying the same gradient descent idea in function space, and taking each $h_i(x)$ in the update to be a decision tree model, the update becomes:
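(a standard formulation of the GBDT update, where $L$ is the loss and $F_{m-1}$ the current model)

$$
F_m(x) = F_{m-1}(x) + h_m(x),
\qquad
h_m(x) \approx -\left.\frac{\partial L\big(y, F(x)\big)}{\partial F(x)}\right|_{F=F_{m-1}}
$$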
GBDT = Gradient descent + Boosting + decision tree
3.2.3 Example
Predict the height of No. 5:
Serial number | Age (years) | Weight (kg) | Height (m) |
---|---|---|---|
1 | 5 | 20 | 1.1 |
2 | 7 | 30 | 1.3 |
3 | 21 | 70 | 1.7 |
4 | 30 | 60 | 1.8 |
5 | 25 | 65 | ? |
Step 1: Choose the loss function and compute the initial predicted value:
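With the squared-error loss (assumed here, as is standard for this kind of regression example), the initial prediction that minimizes the loss is the mean of the known heights:

$$
F_0(x) = \frac{1.1 + 1.3 + 1.7 + 1.8}{4} = 1.475
$$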
Step 2: Find the best split point
Splitting at age 21 gives a variance of 0.01 + 0.0025 = 0.0125
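A worked check of that number, using the residuals after $F_0 = 1.475$ ($-0.375$, $-0.175$, $0.225$, $0.325$ for samples 1–4): the split "age < 21" puts the first two residuals on the left and the last two on the right, so

$$
\text{Var}_{\text{left}} = \frac{(-0.375 + 0.275)^2 + (-0.175 + 0.275)^2}{2} = 0.01,
\qquad
\text{Var}_{\text{right}} = \frac{(0.225 - 0.275)^2 + (0.325 - 0.275)^2}{2} = 0.0025
$$

and the total is 0.01 + 0.0025 = 0.0125.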
Step 3: Obtain h1(x) by fitting the adjusted target values (the residuals)
Step 4: Solve for h2(x) in the same way
Results obtained:
No.5 Height = 1.475 + 0.03 + 0.275 = 1.78
3.2.4 Main implementation ideas of GBDT
- Use gradient descent to optimize the cost function;
- Use a one-level decision tree as the weak learner, with the negative gradient as the target value;
- Use the Boosting idea for the ensemble (a small code sketch on the height example follows).
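A minimal sketch of these three ideas on the height data from section 3.2.3, assuming scikit-learn's GradientBoostingRegressor with squared-error loss (so the negative gradient is simply the residual) and depth-1 trees as weak learners; the result should come out close to the 1.78 computed by hand, although the second tree may pick a slightly different split than the manual example:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Height example: features are (age, weight), target is height
X = np.array([[5, 20], [7, 30], [21, 70], [30, 60]], dtype=float)
y = np.array([1.1, 1.3, 1.7, 1.8])

# Two boosting rounds, one-level trees, no shrinkage, squared-error loss
gbdt = GradientBoostingRegressor(n_estimators=2, learning_rate=1.0, max_depth=1)
gbdt.fit(X, y)

print("Predicted height of No. 5:", gbdt.predict(np.array([[25.0, 65.0]]))[0])
```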
3.3 XGBoost
XGBoost = second-order Taylor expansion + Boosting + decision tree + regularization
- Interview question: What do you know about XGBoost? Please explain how it works
Key answer: second-order Taylor expansion, Boosting, decision tree, regularization
Boosting: XGBoost uses the Boosting idea to learn multiple weak learners iteratively.
Second-order Taylor expansion: In each round of learning, XGBoost performs a second-order Taylor expansion of the loss function and optimizes it using the first- and second-order gradients.
Decision tree: In each round of learning, XGBoost uses a decision tree algorithm as the weak learner.
Regularization: During optimization, XGBoost adds a penalty term to the loss function to prevent overfitting, limiting the number of leaf nodes of each decision tree and the values of those leaf nodes. (A compact formula putting these pieces together follows.)
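A compact way to state these points is the standard XGBoost objective, where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction, $T$ is the number of leaves of the new tree $f_t$, and $w_j$ are its leaf values:

$$
\text{Obj}^{(t)} \approx \sum_{i=1}^{n}\Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
$$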
3.4 Taylor expansion
The more terms that are kept in the Taylor expansion, the more accurately it approximates the original function.
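For reference, the Taylor expansion of $f$ around $x_0$ is:

$$
f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \cdots
$$

XGBoost keeps the terms up to second order.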