Ensemble learning concept
By building and combining multiple learners to accomplish a task, ensemble learning can often achieve better generalization performance than any single learner. Even when each individual model is weak and its prediction accuracy is low, combining them can improve accuracy significantly, to a level comparable with other kinds of strong models.
Individual learner
Individual learner concept
Individual learners are usually produced by training an existing learning algorithm. If the ensemble contains only individual learners of the same type (for example, an ensemble of decision trees contains only decision trees, and an ensemble of neural networks contains only neural networks), the ensemble is homogeneous, and its individual learners are called base learners. Correspondingly, a heterogeneous ensemble contains individual learners of different types, such as decision trees together with neural networks; such individual learners are generally called component learners, or simply individual learners, rather than base learners.
Boosting and Bagging
According to how the individual learners are generated, current ensemble learning methods can be roughly divided into two categories:
- Methods where there are strong dependencies between individual learners, which must therefore be generated serially: represented by the Boosting family of algorithms, such as AdaBoost, GBDT, and XGBoost.
- Methods where there is no strong dependence between individual learners, which can therefore be generated in parallel: represented by Bagging and Random Forest.
Boosting
Working mechanism: first train a base learner from the initial training set; then adjust the distribution of the training samples according to the performance of this base learner, so that the samples misclassified by the previous base learner receive more attention later on; then train the next base learner on the adjusted sample distribution, and repeat until the desired number of base learners is obtained.
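A minimal sketch of this mechanism, assuming scikit-learn's AdaBoostClassifier and a synthetic dataset chosen only for illustration:

```python
# Minimal Boosting sketch: AdaBoostClassifier re-weights samples internally
# so that later base learners focus on earlier mistakes.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```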
Bagging
Bagging is a representative parallel ensemble learning method; its algorithm works roughly as follows (a sketch follows the list):
- Data processing: clean and organize the data as appropriate for the task
- Random sampling: repeat T times, each time drawing a random subsample (with replacement) from the training set
- Individual training: train one individual learner on each subsample
- Classification decision: combine the individual predictions into the final classification by voting
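A minimal sketch of this procedure, assuming scikit-learn's BaggingClassifier with decision trees as the individual learners and an illustrative synthetic dataset:

```python
# Bagging sketch: T bootstrap samples, one decision tree trained per sample,
# predictions combined by voting. The dataset and T=25 are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # individual learner
    n_estimators=25,           # T repetitions of random sampling
    bootstrap=True,            # sample with replacement
    random_state=0,
)
bag.fit(X, y)
print("training accuracy:", bag.score(X, y))
```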
Combining strategies (averaging, voting, learning)
Averaging method
For numerical outputs (as in regression), the predictions of the individual learners are combined by simple or weighted averaging.
Voting method
For classification, the class labels predicted by the individual learners are combined by majority (or weighted) voting.
Learning method
When there is plenty of training data, a more powerful combination strategy is "learning", in which the combination itself is carried out by another learner. Stacking is a typical example. Here the individual learners are called primary learners, and the learner used for the combination is called the secondary learner or meta-learner. Stacking first trains the primary learners on the initial data set and then "generates" a new data set for training the secondary learner: in this new data set, the outputs of the primary learners are used as the input features of a sample, while the label of the original sample is kept as its label.
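A minimal Stacking sketch, assuming scikit-learn's StackingClassifier; the particular primary learners (a random forest and an SVM) and the logistic-regression meta-learner are illustrative choices, not prescribed by the text:

```python
# The primary learners' cross-validated predictions become the input features
# of the new data set on which the secondary (meta) learner is trained.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # secondary learner / meta-learner
    cv=5,  # out-of-fold predictions are used to build the meta-features
)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))
```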
Random forest idea
Random forest is the representative algorithm of Bagging; its principle is very similar to Bagging, with the following improvements:
- An ordinary decision tree selects the optimal splitting feature from all features of the samples, whereas a random forest first randomly selects a subset of the features and then chooses the optimal splitting feature from that subset. This further enhances the generalization ability of the model.
- The size of this feature subset is determined by cross-validation, which yields an appropriate value (see the sketch below).
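For example, a minimal sketch of picking the subset size by cross-validation, assuming scikit-learn's GridSearchCV and an illustrative candidate grid:

```python
# Search over max_features (the size of the random feature subset)
# with 5-fold cross-validation; the candidate values are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [2, 4, "sqrt", "log2", None]},
    cv=5,
)
search.fit(X, y)
print("best max_features:", search.best_params_)
```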
Random forest algorithm flow (a sketch follows the list):
- Draw n samples from the sample set by random sampling with replacement (bootstrap sampling)
- Randomly select k features from all features and build a decision tree on the sampled data using these features
- Repeat the above two steps m times, generating m decision trees that form the random forest
- For new data, each tree makes a prediction, and a final vote determines which class the data falls into.
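This flow maps directly onto scikit-learn's RandomForestClassifier; the values below (m = 100 trees, k = "sqrt" features) and the dataset are illustrative assumptions:

```python
# n_estimators = m trees, each grown on a bootstrap sample (steps 1 and 3);
# max_features = k features considered at each split (step 2);
# predict() takes the majority vote of the trees (step 4).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    bootstrap=True,
    random_state=0,
)
rf.fit(X, y)
print("predicted classes:", rf.predict(X[:5]))
```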
Extensions of random forest
Based on RF, there are many widely used algorithm variants, applied not only to classification but also to feature transformation, anomaly detection and so on.
1. Extra Trees (ET). The principle of extra trees is almost the same as RF, with the following differences:
- For the training set of each decision tree, RF uses random (bootstrap) sampling, while ET uses the original data set.
- After the candidate splitting features are selected, RF chooses an optimal split point based on the Gini index or information entropy, just like a traditional decision tree; ET instead picks a split point at random. Because split points are chosen randomly rather than optimally, the trees generated by ET are generally larger than those generated by RF. In other words, compared with RF the variance of the model decreases further while the bias increases further; in some cases ET has better generalization ability than RF.
2. Totally Random Trees Embedding (hereinafter TRTE). TRTE is an unsupervised data transformation method. It maps a low-dimensional data set into a high-dimensional representation so that the data can be used more effectively in classification and regression models. Just as kernel methods in support vector machines map low-dimensional data to a high-dimensional space, TRTE provides another way to achieve such a mapping.
3. Isolation Forest (hereinafter IForest). IForest is an anomaly detection method; it uses an approach similar to RF to detect outliers.
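The three variants have direct scikit-learn counterparts; a brief sketch, with the dataset and parameter values chosen only for illustration:

```python
# 1. Extra Trees: split thresholds are picked at random rather than optimally.
# 2. TRTE (RandomTreesEmbedding): unsupervised mapping to a high-dimensional
#    sparse representation.
# 3. Isolation Forest: anomaly detection; predict() marks outliers as -1.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, IsolationForest,
                              RandomTreesEmbedding)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print("extra trees accuracy:", et.score(X, y))

X_high = RandomTreesEmbedding(n_estimators=100, random_state=0).fit_transform(X)
print("embedded shape:", X_high.shape)

iso = IsolationForest(random_state=0).fit(X)
print("outlier labels:", iso.predict(X[:5]))
```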
Advantages and disadvantages
Advantages
- Training can be highly parallelized
- Because tree nodes split on randomly selected subsets of features, the model can still be trained efficiently when the feature dimension of the samples is very high.
- Random sampling is used and each model sees a different sample and feature set, which avoids overfitting to a certain extent; the trained model has low variance and strong generalization ability.
Disadvantages
- RF models can still overfit on sample sets with a lot of noise.
- Features with many distinct values (i.e. many possible split points) tend to have a greater influence on RF decisions, which can distort the fitted model.
sklearn parameters
RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
The criterion parameter specifies the feature-splitting criterion. The default is gini, and the alternative is entropy (information gain); both measure the impurity of a split and differ only in how it is calculated.
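A short usage sketch comparing the two criteria; the Iris dataset and the parameter values are illustrative assumptions:

```python
# Switching criterion changes only the impurity measure used at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    rf = RandomForestClassifier(n_estimators=100, criterion=criterion,
                                random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(criterion, "CV accuracy:", round(score, 3))
```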
Application scenarios
- In the banking world, random forest algorithms can be used to detect loyal customers, that is, customers who borrow frequently from the bank and pay on time, as well as fraudulent customers, that is, those who fail to pay on time and behave abnormally.
- In the medical field, random forest algorithms can be used to identify whether different ingredients in medicine are combined in the right way or to identify diseases by analyzing patients’ medical records.
- In the stock market, the random forest algorithm can be used to identify stock volatility behavior and to predict losses or gains.
- In e-commerce, the random forest algorithm can be used to predict, based on customers' shopping history, whether they will like the products recommended by the system.