This is the 16th day of my participation in the August More Text Challenge.

1. A brief introduction to random forest regression

class sklearn.ensemble.RandomForestRegressor(n_estimators='warn', criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)

All parameters, attributes and interfaces are consistent with the random forest classifier. The only difference is the difference between regression trees and classification trees: the impurity criterion is not the same.

2. Important parameters, attributes and interfaces

2.1 criterion

In a regression tree, criterion is the metric used to measure the quality of a split. Three criteria are supported:

  • "mse" (mean squared error): the reduction in MSE between the parent node and the leaf nodes is used as the criterion for feature selection. This method minimizes the L2 loss using the mean of the leaf nodes
  • "friedman_mse": Friedman's modified mean squared error, designed to handle problematic splits
  • "mae" (mean absolute error): uses the median of the leaf nodes to minimize the L1 loss

In regression trees, MSE is not only a split-quality criterion but also the most common metric for evaluating the regression itself: when using cross validation, mean squared error is usually chosen as the evaluation metric (in classification trees, the corresponding metric is the prediction accuracy returned by score). In regression, the smaller the MSE, the better.

Note, however, that the regression tree's score interface returns R squared, not MSE. And although mean squared error is always positive, sklearn reports it as neg_mean_squared_error when it is used as a scoring metric. sklearn takes the nature of each metric into account when computing model evaluation scores: mean squared error is a kind of error, which it classifies as a loss, and losses are represented as negative numbers. The true MSE is simply neg_mean_squared_error with the minus sign removed.

Finally, the predict_proba interface is not available for random forest regression. In regression there is no notion of the probability of a sample belonging to a certain category, so predict_proba does not exist.
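As a quick check of these conventions, here is a minimal sketch (on synthetic data from make_regression, an illustrative assumption, so it runs without any external dataset) showing that score returns R squared while neg_mean_squared_error scores come back negative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data so the example needs no external dataset
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# score() returns R squared, not MSE
r2 = reg.score(X_test, y_test)

# cross_val_score with 'neg_mean_squared_error' returns negative numbers;
# the true MSE is the result with the sign flipped
neg_mse = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_squared_error')
true_mse = -neg_mse
print(r2, neg_mse.mean(), true_mse.mean())
```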

2.2 Implementation of random forest

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; use fetch_california_housing on newer versions
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
regressor = RandomForestRegressor(n_estimators=120, random_state=200)
# Returns negative MSE scores: values closer to 0 are better
cross_val_score(regressor, boston.data, boston.target, cv=10,
                scoring='neg_mean_squared_error')
```

Another important use of random forests is filling in missing values.
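One way to do this (a sketch, not the only approach) is iterative imputation: treat each feature with missing values as a regression target and let a random forest predict the gaps. sklearn exposes this pattern through IterativeImputer, which accepts any regressor as its estimator; the data and missing-rate below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with ~10% of entries knocked out
X, _ = make_regression(n_samples=100, n_features=4, random_state=0)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan

# Each feature with missing values is modeled as a regression target
# using the other features; a random forest fills in the predictions.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X_missing)
print(np.isnan(X_filled).sum())  # no missing values remain
```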

3. The basic idea of tuning parameters

3.1 Generalization error

Let's talk about the correct way to tune parameters. The first step of model tuning is to identify the target: what are we trying to do? In general, the goal is to improve some model evaluation metric. For random forests, for example, we want to improve the model's accuracy on unknown data (measured by score or oob_score_). With this target in mind, we need to ask: what factors affect the model's accuracy on unknown data?

In machine learning, the measure of a model's accuracy on unknown data is called the generalization error. When the model performs poorly on unknown data (the test set or out-of-bag data), its generalization is insufficient, the generalization error is large, and the model is ineffective. The generalization error is driven by the structure (complexity) of the model. When the model is too complex, it overfits and its generalization ability suffers, so the generalization error is large. When the model is too simple, it underfits and its fitting ability suffers, so the error is again large. The generalization error is minimized only when the model's complexity is just right.

  • If the model is too complex or too simple, the generalization error will be high; what we want is the balance point in the middle
  • A model that is too complex overfits; a model that is too simple underfits
  • For tree models and tree ensembles, the deeper the tree and the more branches and leaves it has, the more complex the model
  • The goal for tree models and tree ensembles is therefore to reduce model complexity, moving the model to the left on the generalization-error curve

How each parameter affects model complexity:

  • n_estimators: increasing n_estimators improves performance up to a plateau; it does not affect the complexity of any single model
  • max_depth: can be tuned in either direction; by default trees grow to their maximum depth, i.e. the highest complexity. Decreasing max_depth reduces complexity, making the model simpler and moving it to the left on the curve
  • min_samples_leaf: either direction; the default minimum of 1 gives the highest complexity. Increasing min_samples_leaf reduces complexity, making the model simpler and moving it to the left
  • min_samples_split: either direction; the default minimum of 2 gives the highest complexity. Increasing min_samples_split reduces complexity, making the model simpler and moving it to the left
  • max_features: either direction; the default "auto" uses the square root of the total number of features, which sits in the middle of the complexity range. Decreasing max_features makes the model simpler and moves it to the left; increasing it makes the model more complex and moves it to the right
  • criterion: either direction; generally the default is used ("gini" for classification, "mse" for regression)
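The workflow above — pick a complexity-controlling parameter and search toward lower complexity — can be sketched with GridSearchCV; the parameter grid and synthetic data here are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# Tune one complexity-controlling parameter at a time, coarse to fine
param_grid = {"max_depth": [2, 4, 6, 8, None]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

# best_score_ is negative MSE; flip the sign to read the true MSE
print(search.best_params_, -search.best_score_)
```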

3.2 Variance vs. bias

In the figure below, each point is a prediction generated by one base estimator in the ensemble algorithm. The red dotted line represents the average of these predictions, while the blue line represents the true values of the data.

  • Bias: the difference between the model's predictions and the true values, i.e. the distance from each red dot to the blue line. In an ensemble algorithm, each base estimator has its own bias, and the bias of the ensemble estimator is the mean of the biases of all the base estimators. The more accurate the model, the lower the bias.
  • Variance: the spread between each of the model's outputs and the model's average prediction, i.e. the distance from each red dot to the red dotted line; it measures the model's stability. The more stable the model, the lower the variance.

The smaller the bias, the more accurate the model. The variance measures whether the individual predictions are close to one another: the smaller the variance, the more stable the model. A good model should be both accurate and stable on most unknown data. Our goal in tuning is to strike the best balance between variance and bias: although the two cannot be minimized simultaneously, the generalization error they jointly compose does have a minimum, and that minimum is what we are looking for. For a complex model, reduce the variance; for a relatively simple model, reduce the bias. The base estimators of a random forest all have low bias and high variance.
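The decomposition can be made concrete: the spread of the individual trees' predictions around the ensemble mean corresponds to the variance component, and the gap between the ensemble mean and the truth to the (squared) bias. A sketch on synthetic data (the dataset and sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Predictions of each base estimator (the red dots in the figure)
tree_preds = np.stack([t.predict(X_test) for t in forest.estimators_])
mean_pred = tree_preds.mean(axis=0)  # the red dotted line

# Spread of individual trees around the ensemble mean: the variance component
per_sample_variance = tree_preds.var(axis=0).mean()
# Gap between the ensemble mean and the truth: the (squared) bias component
squared_bias = ((mean_pred - y_test) ** 2).mean()
print(per_sample_variance, squared_bias)
```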

4. Bagging vs. Boosting

  • Base estimators:

  • Bagging: Independent, simultaneous operation
  • Boosting: sequential; later models give more weight to the samples that earlier models mispredicted
  • Sampling data set:
  • Bagging: sampling with replacement
  • Boosting: sampling with replacement, but sample weights are adjusted each round so that samples which are easy to mispredict receive more weight
  • Determine the outcome of integration:
  • Bagging: Average or majority rule
  • Boosting: Weighted average, models that perform better on the training set will have more weight
  • Goal:
  • Bagging: reduce variance and improve the overall stability of the model
  • Boosting: Reducing the bias and improving the overall accuracy of the model
  • When a single evaluator has overfitting problems:
  • Bagging: Can solve the problem of overfitting to some extent
  • Boosting: May exacerbate the overfitting problem
  • When the effectiveness of a single evaluator is weaker:
  • Bagging: Not very helpful
  • Boosting: It is likely to improve model performance
  • Representative algorithm:
  • Bagging: Random forest
  • Boosting: Gradient boosting trees (GBDT), AdaBoost
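A minimal side-by-side run of the two representative algorithms, on illustrative synthetic data (the dataset and hyperparameters are assumptions, not a benchmark):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "bagging (random forest)": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting (gradient boosting)": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

# Default scoring for regressors is R squared
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```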