Many of the decisions we make in life are based on the opinions of others, and decisions made by a group often produce better results than decisions made by any single member of the group, an effect known as the wisdom of crowds. Ensemble Learning builds on the same idea: it combines predictions from multiple models, aiming to perform better than any individual member of the ensemble and thereby improve predictive performance (model accuracy), which is the most important concern in many classification and regression problems.
Ensemble Learning combines several weak classifiers (or regressors) to produce a new, stronger classifier. (A weak classifier is one whose classification accuracy is only slightly better than random guessing, i.e., its error rate is below 0.5.)
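To see why combining weak classifiers can help, consider an idealized case in which the classifiers' errors are independent: the probability that a majority of them is wrong shrinks as more classifiers are added. The short Python sketch below works this out with a binomial calculation; the independence assumption rarely holds exactly in practice, which is why the diversity of the members matters so much.

```python
from math import comb

def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that a majority of n independent weak classifiers is wrong."""
    k_min = n_classifiers // 2 + 1  # wrong votes needed to flip the majority decision
    return sum(
        comb(n_classifiers, k) * error_rate**k * (1 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

# A single weak classifier with a 40% error rate vs. ensembles of growing (odd) size
for n in (1, 5, 11, 25):
    print(n, round(majority_vote_error(n, 0.4), 4))
# under independence, the ensemble error falls below 0.4 and keeps shrinking as n grows
```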
Ensemble learning combines the predictions of multiple skilful models, and the success of the approach depends on ensuring diversity among the weak learners. Ensembling unstable algorithms in particular can yield a fairly significant performance improvement. Ensemble learning is an idea rather than a single algorithm; ensemble approaches are popular and are often the preferred technique when achieving the best possible predictive performance is the most important outcome of a modeling project.
Why use ensemble learning
(1) Better performance: an ensemble can make better predictions and achieve better performance than any single contributing model. (2) Stronger robustness: an ensemble reduces the spread or dispersion of predictions and of model performance, smoothing the model's expected behavior. (3) More reasonable boundaries: weak classifiers differ from one another and therefore produce different classification boundaries; merging several weak classifiers yields a more reasonable boundary and reduces the overall error rate. (4) Adaptation to different sample sizes: for data sets that are too large or too small, different sample subsets can be drawn with replacement, different classifiers trained on those subsets, and the results combined. (5) Easier fusion: multiple heterogeneous feature data sets are difficult to fuse directly, so each data set can be modeled separately before fusing at the model level.
Bias and variance in machine learning modeling
Errors generated by machine learning models are typically described by two attributes: bias and variance.
Bias measures how closely the model can capture the mapping function between inputs and outputs. It reflects the rigidity of the model: the strength of the assumptions the model makes about the functional form of the mapping between inputs and outputs.
The variance of a model is the variation in its performance when it is fitted to different training data. It captures how much the details of the training data influence the model.
Ideally, we prefer models with low bias and low variance, which is, in fact, the goal of applying machine learning to any given predictive modeling problem. The bias and variance of model performance are linked: bias can often be reduced easily by increasing variance, and conversely, variance can easily be reduced by increasing bias.
Ensembles are used in predictive modeling problems to achieve better predictive performance than any single model. This can be understood as accepting a little extra bias in order to reduce the variance component of the prediction error (i.e., exploiting the bias–variance tradeoff).
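The variance-reduction effect can be observed directly. The minimal sketch below (assuming scikit-learn is available) compares a single, high-variance decision tree with a bagged ensemble of such trees on a synthetic regression task; the ensemble typically shows both a better mean score and a smaller spread across folds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

single_tree = DecisionTreeRegressor(random_state=0)
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)

for name, model in [("single tree", single_tree), ("bagged trees", bagged_trees)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    # the standard deviation across folds is a rough proxy for the variance of performance
    print(f"{name}: mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```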
The Bagging idea in ensemble learning
Bagging, short for Bootstrap Aggregating, involves fitting many learners on different samples of the same data set and averaging their predictions, in order to obtain diverse ensemble members by varying the training data.
The Bagging idea is an ensemble technique that trains N classifiers on N new data sets created by resampling the original data set with replacement. Duplicate examples are allowed in each model's training data.
When classifying a new sample, a Bagging-trained model uses a majority-voting strategy (for classification) or an averaging strategy (for regression) to compute the final result.
The weak learners (classifiers/regressors) used in Bagging can be models based on basic algorithms such as Linear Regression, Ridge, Lasso, Logistic Regression, Softmax, ID3, C4.5, CART, SVM, KNN, Naive Bayes, etc.
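As an illustration, the sketch below (assuming scikit-learn) bags decision trees: each tree is fitted on a bootstrap sample drawn with replacement, and the trees' predictions are combined by majority voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is fitted on a bootstrap sample (drawn with replacement,
# so duplicates are allowed); class predictions are combined by majority vote.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0,   # each bootstrap sample has the same size as the training set
    bootstrap=True,    # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))
```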
Random Forest
Random forest algorithm principle
Random forest is an algorithm that modifies the Bagging strategy. The method is as follows:
(1) Use the Bootstrap strategy to sample data from the sample set; (2) randomly select K features from all features and use them to build an ordinary decision tree; (3) repeat steps (1) and (2) to build multiple decision trees; (4) combine the decision trees into a random forest and make decisions on data by voting or averaging.
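A minimal sketch of these steps using scikit-learn's RandomForestClassifier (assumed available). Note that scikit-learn draws the random feature subset at each split rather than once per tree; max_features plays the role of K.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # (3) build many decision trees
    bootstrap=True,       # (1) each tree is trained on a bootstrap sample
    max_features="sqrt",  # (2) only K randomly chosen features are considered per split
    random_state=0,
)
forest.fit(X_train, y_train)
# (4) the forest aggregates the trees' votes to classify new samples
print("test accuracy:", forest.score(X_test, y_test))
```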
Random forest OOB Error
In a random forest, it turns out that about 1/3 of the original samples will not appear in a given Bootstrap sample, and therefore do not participate in building that decision tree. These samples are called out-of-bag (OOB) data, and they can be used in place of a test set to estimate the test error.
Once the random forest has been generated, its performance can be tested with the out-of-bag data. Suppose the total number of out-of-bag samples is O. Feed these O samples into the trained random forest classifier, which produces a classification for each of them. Because the true classes of these O samples are known, the forest's outputs can be compared with the correct classifications; if the number of samples the random forest classifier misclassifies is X, then the out-of-bag error is X/O.
Advantages: the OOB error has been shown to be unbiased, so the random forest algorithm needs neither cross-validation nor a separate test set to obtain an unbiased estimate of the test error.
Disadvantages: When the amount of data is small, the data set generated by Bootstrap sampling changes the distribution of the initial data set, which introduces estimation bias.
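In scikit-learn (assumed available) the OOB estimate can be requested directly by setting oob_score=True; a brief sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# accuracy estimated only on out-of-bag samples; 1 - oob_score_ corresponds to X/O above
print("OOB accuracy:", forest.oob_score_)
print("OOB error   :", 1 - forest.oob_score_)
```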
Random forest algorithm variants
The RF algorithm behaves well in practical applications and is widely used, mainly for classification, regression, feature transformation, anomaly detection, etc. The following are common RF variants: ·Extra Trees (ET) ·Totally Random Trees Embedding (TRTE) ·Isolation Forest (IForest)
Extra Trees (ET)
Extra-Trees (Extremely Randomized Trees) was proposed by Pierre Geurts et al. in 2006. It is a variant of RF and shares the same basic principle. However, there are two main differences between this algorithm and random forest:
(1) Random forest uses Bootstrap random sampling to form the training set of each sub-decision tree, applying the Bagging model; ET instead uses all the training samples to train each subtree, that is, each sub-decision tree of ET is trained on the original sample set.
(2) Random forest selects split features in the same way as a traditional decision tree (based on information gain, information gain ratio, Gini coefficient, mean squared error, etc.), while ET selects the split feature and split point completely at random.
For a single such decision tree, because its split features are chosen at random rather than optimally, its predictions are often not very accurate, but combining multiple decision trees can achieve a good prediction effect.
When the ET construction is complete, we can also apply all the training samples to obtain the error of the ET. Although the same training samples are used both to construct the trees and to make predictions, the split attributes are chosen at random, so completely different predictions are still obtained; comparing these predictions with the true response values of the samples gives the prediction error. Compared with random forest, all training samples in ET can be regarded as OOB samples, so computing the ET prediction error is also a computation of the OOB error.
Since Extra Trees choose split points of feature values at random, the resulting decision trees are generally larger than those generated by RF. In other words, the variance of the Extra Trees model is further reduced relative to RF. In some cases, ET has stronger generalization ability than random forest.
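A minimal comparison sketch with scikit-learn's ExtraTreesClassifier (assumed available); note that scikit-learn's implementation uses bootstrap=False by default, so each tree sees the full training set, matching point (1) above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# ET: no bootstrap by default (every tree sees all samples) and random split thresholds
et = ExtraTreesClassifier(n_estimators=200, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("ExtraTrees", et), ("RandomForest", rf)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```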
Totally Random Trees Embedding (TRTE)
TRTE is an unsupervised data transformation method. It maps low-dimensional data to a high-dimensional representation, so that the transformed data can be better used by classification and regression models.
The transformation process of the TRTE algorithm is similar to that of the RF algorithm: T decision trees are built to fit the data. When the trees have been constructed, the leaf node that each sample in the data set falls into is determined in each of the T decision subtrees, and the feature transformation is completed by converting this position information into a vector.
For example, suppose there are three decision trees, each with five leaf nodes. A sample x falls into the third leaf node of the first decision tree, the first leaf node of the second decision tree, and the fifth leaf node of the third decision tree. Then the encoding of x after mapping is (0,0,1,0,0, 1,0,0,0,0, 0,0,0,0,1), a 15-dimensional high-dimensional feature. After the features have been mapped to the higher-dimensional space, supervised learning can be carried out on them.
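scikit-learn exposes this transform as RandomTreesEmbedding; a brief sketch (the exact output dimensionality depends on how many leaves are actually grown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# T = 10 totally random trees; each sample is encoded by the one-hot position
# of the leaf node it falls into in every tree.
trte = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_high = trte.fit_transform(X)  # sparse one-hot matrix

print("original dimension:", X.shape[1])
print("embedded dimension:", X_high.shape[1])  # total number of leaves across the trees
```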
Isolation Forest (IForest)
IForest is an outlier detection algorithm that uses an approach similar to RF to detect outliers. The differences between the IForest algorithm and the RF algorithm are:
(1) In the random sampling step, only a small amount of data is generally needed; (2) when building a decision tree, the IForest algorithm randomly selects a split feature and a random split threshold for that feature; (3) the max_depth of the decision trees built by the IForest algorithm is generally relatively small.
The purpose of IForest is to detect outliers, so as long as abnormal data can be distinguished, a large amount of data is not needed; likewise, outlier detection generally does not require very large decision trees.
To judge whether a point is an outlier, the test sample x is passed through the T decision trees. The depth h_t(x) of the leaf node that x reaches in each tree is calculated, and then the average depth h(x) over the trees. The following formula gives the anomaly probability of the sample point x:
p(x, m) = 2^( -h(x) / c(m) ), where c(m) = 2(ln(m-1) + ξ) - 2(m-1)/m,
m is the number of samples used to build the trees and ξ is Euler's constant. The value of p(x, m) lies in [0, 1], and the closer it is to 1, the higher the probability that x is an outlier.
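A brief sketch with scikit-learn's IsolationForest (assumed available). Note that score_samples follows scikit-learn's own convention (lower means more anomalous) rather than returning p(x, m) directly, and predict marks detected outliers as -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# small subsamples per tree and shallow trees keep IForest cheap
iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
iforest.fit(X)

labels = iforest.predict(X)        # +1 for inliers, -1 for detected outliers
scores = iforest.score_samples(X)  # the lower the score, the more anomalous the sample
print("detected outliers:", int((labels == -1).sum()))
```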
Summary of advantages and disadvantages of random forest
In this AI lesson, we learned about Bagging and how it works, as well as bagging-based random forests. Finally, let’s summarize the advantages and disadvantages of random forest:
Advantages
(1) Training can be parallelized, which gives a speed advantage with large-scale training samples; (2) because the features used to split nodes are selected at random, training still performs well when the sample dimensionality is relatively high; (3) due to random sampling, the trained model has small variance and strong generalization ability; (4) simple to implement; (5) insensitive to partially missing features; (6) it can measure the importance of features.
Disadvantages
(1) With some noisy features, the model is prone to overfitting; (2) split features with a large number of distinct values can have a greater influence on the RF decision, which may affect the performance of the model.