This article was originally posted on the Product Manager’s AI Knowledge base
Random Forest (4 steps + 10 advantages and disadvantages)
Random forest is an ensemble algorithm built from decision trees, and it performs well in many situations.
This article introduces the basic concept of random forest, its 4 construction steps, a comparison test of 4 implementations, 10 advantages and disadvantages, and 4 application directions.
What is a random forest?
Random forest belongs to the Bagging (Bootstrap Aggregating) family of ensemble learning methods: ensemble learning includes Bagging, and random forest is a Bagging method whose base learners are decision trees.
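As a rough illustration of that relationship (a sketch using scikit-learn; the iris dataset and 100 trees are arbitrary choices, not part of the original article):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: each tree is trained on a bootstrap sample of the rows
# (BaggingClassifier's default base learner is a decision tree).
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random forest: bagging plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("bagging:", cross_val_score(bagging, X, y, cv=5).mean())
print("forest :", cross_val_score(forest, X, y, cv=5).mean())
```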
Decision Tree
Before explaining random forests, we need to mention decision trees. A decision tree is a very simple algorithm: it is highly interpretable and matches human intuition. It is a supervised learning algorithm based on if-then-else rules, and its logic is easy to express as a diagram.
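A minimal sketch of that if-then-else idea (the weather rule and temperature threshold below are made up purely for illustration):

```python
# A decision tree is essentially nested if-then-else rules learned from data.
def play_outside(weather: str, temperature_c: float) -> bool:
    if weather == "rainy":        # first split: weather
        return False
    elif temperature_c >= 15:     # second split: temperature threshold
        return True
    else:
        return False

print(play_outside("sunny", 20))  # True
print(play_outside("rainy", 25))  # False
```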
Random Forest (RF)
A random forest is composed of many decision trees, and the individual trees are not correlated with one another.
For a classification task, each new input sample is passed to every decision tree in the forest, and each tree produces its own classification result. The forest then takes the class that receives the most votes among the trees as the final result, as in the sketch below.
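A minimal sketch of that voting step, assuming we already have the label each tree predicted for one new sample (the labels below are hypothetical):

```python
from collections import Counter

# Hypothetical predictions from five trees for a single input sample
tree_votes = ["cat", "dog", "cat", "cat", "dog"]

# The forest outputs the class with the most votes
final_label, n_votes = Counter(tree_votes).most_common(1)[0]
print(final_label, n_votes)  # cat 3
```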
Four steps to construct a random forest
- If there are N training samples, draw N samples at random with replacement (draw one, put it back, then draw again). These N bootstrap samples form the training set at the root node of one decision tree.
- If each sample has M attributes, then whenever a node of the decision tree needs to be split, m attributes are selected at random from the M attributes, with m << M. Some strategy (such as information gain) is then used to choose one of these m attributes as the node's split attribute.
- As the tree grows, every node is split according to step 2 until no further split is possible (intuitively, if the best attribute for a node is the same one its parent already split on, the node has become a leaf and does not need to split again). Note that no pruning is performed while the tree is grown.
- Repeat steps 1–3 to build a large number of decision trees; together they form the random forest (see the sketch below).
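A rough sketch of these four steps in Python, assuming scikit-learn's DecisionTreeClassifier as the per-tree learner; max_features="sqrt" stands in for choosing m ≈ √M attributes at each split, and 100 trees is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
n_samples = X.shape[0]

forest = []
for _ in range(100):                                   # step 4: grow many trees
    idx = rng.integers(0, n_samples, size=n_samples)   # step 1: bootstrap sample with replacement
    tree = DecisionTreeClassifier(
        max_features="sqrt",  # step 2: m randomly chosen attributes per split (m << M)
        # step 3: grow the tree fully, with no pruning (the default behaviour)
    )
    tree.fit(X[idx], y[idx])
    forest.append(tree)

print(len(forest), "trees grown")
```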
Advantages and disadvantages of random forest
Advantages
- It can handle high-dimensional data (many features) without dimensionality reduction or feature selection
- It can determine the importance of features (see the sketch after this list)
- It can reveal how different features interact with each other
- It’s not easy to overfit
- Training is relatively fast and easy to parallelize
- It’s relatively simple to implement
- For unbalanced data sets, it balances errors.
- Accuracy can be maintained if a large proportion of features are missing.
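The feature-importance point above can be illustrated with scikit-learn's built-in impurity-based importances (the iris dataset and 200 trees are arbitrary choices for this sketch):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Importance scores sum to 1; higher means the feature mattered more in the splits
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```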
Disadvantages
- Random forests have been shown to overfit some noisy classification or regression problems.
- For data whose attributes take different numbers of values, attributes with more distinct values have a greater influence on the random forest, so the attribute weights (feature importances) it produces on such data are not reliable
Comparison test of 4 random forest implementations
Random forest is a common machine learning algorithm that can be used for both classification and regression problems. This section compares implementations of the random forest algorithm on four platforms: scikit-learn, Spark MLlib, DolphinDB, and XGBoost. The evaluation criteria include memory usage, running speed, and classification accuracy.
The test results are as follows:
DolphinDB was the fastest and XGBoost the worst.
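The original benchmark figures are not reproduced here, but the methodology (time the training run and measure accuracy) can be sketched for the scikit-learn implementation; the dataset size and parameters below are arbitrary and will not reproduce the DolphinDB or Spark MLlib numbers:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

start = time.perf_counter()
model.fit(X_train, y_train)
elapsed = time.perf_counter() - start

print(f"training time: {elapsed:.1f} s")
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```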
Four application directions of random forest
Random forests can be used in many places (a brief sketch follows the list):
- Classification of discrete values
- Regression of continuous values
- Unsupervised learning clustering
- Outlier detection
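A brief sketch of the first three directions in scikit-learn, on synthetic data; RandomTreesEmbedding is used here as one way to obtain an unsupervised, tree-based representation that a clustering algorithm can work on, and outlier detection is typically built on similar tree-ensemble ideas rather than on RandomForestClassifier directly:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              RandomTreesEmbedding)

# 1. Classification of discrete labels
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(Xc, yc)

# 2. Regression of continuous values
Xr, yr = make_regression(n_samples=200, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(Xr, yr)

# 3. Unsupervised tree embedding, usable as input to a clustering algorithm
emb = RandomTreesEmbedding(n_estimators=50, random_state=0).fit_transform(Xc)

print(clf.predict(Xc[:3]), reg.predict(Xr[:3]).round(1), emb.shape)
```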