This article was originally posted on the Product Manager’s AI Knowledge base

Random Forest (4 steps + 10 advantages and disadvantages)

Random forest is an ensemble algorithm built from decision trees that performs well in many situations.

This article introduces the basic concept of random forest, its 4 construction steps, a comparison test of 4 implementations, its 10 advantages and disadvantages, and 4 application directions.

What is a random forest?

Random forest belongs to the Bagging (Bootstrap Aggregating) branch of ensemble learning: ensemble learning includes Bagging, and Bagging includes random forest.

Decision Tree

Before explaining random forests, we need to mention decision trees. A decision tree is a very simple algorithm: it is highly interpretable and matches human intuition. It is a supervised learning algorithm based on if-then-else rules.
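To make the if-then-else idea concrete, here is a minimal sketch with hypothetical weather features (the feature names and rules are invented for illustration):

```python
# A toy decision tree expressed directly as if-then-else rules.
# The features (outlook, humidity) and the rules are hypothetical.
def play_tennis(outlook: str, humidity: str) -> str:
    if outlook == "sunny":
        if humidity == "high":
            return "no"   # sunny + high humidity -> don't play
        return "yes"      # sunny + normal humidity -> play
    if outlook == "rainy":
        return "no"       # rainy -> don't play
    return "yes"          # overcast -> play

print(play_tennis("sunny", "high"))  # prints "no"
```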

Random Forest (RF)

Random forest is composed of many decision trees, and there is no correlation between different decision trees.

For a classification task, each new input sample is passed to every decision tree in the forest, and each tree produces its own classification result. The class that receives the most votes among the trees is taken as the random forest's final result.
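A minimal sketch of this majority vote, with hypothetical per-tree predictions:

```python
# Majority vote over the (hypothetical) predictions of five trees.
from collections import Counter

tree_predictions = ["cat", "dog", "cat", "cat", "dog"]  # one vote per tree
final_result = Counter(tree_predictions).most_common(1)[0][0]
print(final_result)  # prints "cat", the class with the most votes
```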

Four steps to construct a random forest

  1. Given a training set of N samples, draw N samples with replacement (drawing one at a time and returning it before the next draw). These N selected samples are used as the training data at the root node of one decision tree.
  2. If each sample has M attributes, then whenever a node of the decision tree needs to be split, m attributes are randomly selected from the M attributes, with m << M. Some strategy (such as information gain) is then used to choose one of these m attributes as the node's split attribute.
  3. During tree construction, every node is split according to step 2 until no further split is possible (intuitively, if the attribute chosen for a node is the same one its parent node was split on, the node has reached a leaf and needs no further splitting). Note that no pruning is performed while the decision tree grows.
  4. Repeat steps 1-3 to build a large number of decision trees; together they form the random forest (a minimal sketch follows this list).
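Here is a minimal sketch of the four steps in Python, using scikit-learn's DecisionTreeClassifier as the base learner; it illustrates the procedure above rather than any library's actual internals:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape          # N samples, M attributes
    m = max(1, int(np.sqrt(n_features)))     # step 2: choose m with m << M
    forest = []
    for _ in range(n_trees):                 # step 4: build many trees
        # Step 1: bootstrap sample, N draws with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # Steps 2-3: grow an unpruned tree that considers only
        # m randomly chosen attributes at each split
        tree = DecisionTreeClassifier(max_features=m)
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest
```

Prediction then applies the majority vote from the previous section across all trees in the returned forest.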

Advantages and disadvantages of random forest

Advantages

  1. It can handle high-dimensional (feature-rich) data without dimensionality reduction or feature selection
  2. It can estimate the importance of features (see the sketch after this list)
  3. It can capture interactions between different features
  4. It is not prone to overfitting
  5. Training is relatively fast and easy to parallelize
  6. It is relatively simple to implement
  7. For imbalanced datasets, it can balance the errors
  8. Accuracy can be maintained even when a large proportion of feature values are missing
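As a quick sketch of advantages 2 and 5, assuming scikit-learn and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# n_jobs=-1 trains the trees in parallel (advantage 5)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)  # one importance score per feature (advantage 2)
```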

Disadvantages

  1. Random forests have been shown to overfit on some noisy classification or regression problems.
  2. For data with attributes that take different numbers of values, attributes with more distinct values have a greater influence on the random forest, so the attribute weights (feature importances) it produces on such data are not reliable.

Comparison test of 4 random forest implementations

Random forest is a common machine learning algorithm that can be used for both classification and regression problems. This section compares implementations of the random forest algorithm on four platforms (scikit-learn, Spark MLlib, DolphinDB and XGBoost), evaluating memory usage, running speed and classification accuracy.

The test results are as follows:

DolphinDB was the fastest and XGBoost the slowest.

Four application directions of random forest

Random forests can be used in many places:

  1. Classification of discrete values
  2. Regression of continuous values (directions 1 and 2 are sketched below)
  3. Clustering for unsupervised learning
  4. Outlier detection
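A minimal sketch of the first two directions, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Direction 1: classification of discrete values
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)

# Direction 2: regression of continuous values
Xr, yr = make_regression(n_samples=200, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)

print(clf.predict(Xc[:3]), reg.predict(Xr[:3]))
```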