First, let's understand the meaning of each parameter of the random forest model:

class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)





The values shown above are the defaults; the random forest classifier takes a similar set of parameters.
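For instance, here is a minimal sketch of fitting the regressor with all of these defaults; the make_regression toy dataset and the variable names are only for illustration:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy regression data, purely for illustration
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# Every parameter left at the defaults listed above
reg = RandomForestRegressor()
reg.fit(X, y)
print(reg.predict(X[:3]))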

n_estimators: The number of trees in the forest.


This parameter trades accuracy against efficiency: more trees generally give a more accurate and stable model, but training and prediction take longer. Even so, you should push this number as high as your time budget allows.
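As a rough illustration of that trade-off, here is a sketch that times fits at a few tree counts (the counts and toy data are arbitrary):

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

# More trees: usually better scores, always longer fit times
for n in [10, 50, 200]:
    start = time.time()
    reg = RandomForestRegressor(n_estimators=n, random_state=0).fit(X, y)
    print(n, 'trees:', round(time.time() - start, 2), 's, train R^2 =', round(reg.score(X, y), 3))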


criterion: The function used to measure the quality of a split. Options: 'mse' (mean squared error) and 'mae' (mean absolute error).


max_features: The number of features to consider when looking for the best split. Options: int (an exact number), float (a fraction of the features), or a string ('auto', 'sqrt', 'log2').


This setting applies to each individual tree: the larger the value, the more features each tree can consider at a split, and the better the model tends to perform. That is not guaranteed, but what is certain is that a larger value slows the algorithm down, so a balance has to be struck.
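A quick sketch of the three ways max_features can be given (the specific values are arbitrary):

from sklearn.ensemble import RandomForestRegressor

reg_int = RandomForestRegressor(max_features=4)       # consider exactly 4 features per split
reg_float = RandomForestRegressor(max_features=0.5)   # consider 50% of the features per split
reg_str = RandomForestRegressor(max_features='sqrt')  # consider sqrt(n_features) per split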


max_depth: int or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.


min_samples_split: The minimum number of samples required to split an internal node. int (an exact number) or float (a fraction of the samples).


min_samples_leaf: The minimum number of samples required at a leaf node. int (an exact number) or float (a fraction of the samples).


Smaller leaves make the model more susceptible to noise in the data. I usually set this value to something greater than 50, but you need to experiment to find what works best for your data.
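A sketch combining the three size-control parameters above; min_samples_leaf=50 follows the rule of thumb just mentioned, and the other numbers are only examples:

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(
    max_depth=10,           # cap the depth of each tree
    min_samples_split=100,  # an internal node needs at least 100 samples to split
    min_samples_leaf=50,    # every leaf must keep at least 50 samples
)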


min_weight_fraction_leaf: The minimum weighted fraction of the total sample weights required to be at a leaf node. [float, default=0.0]


max_leaf_nodes: Grow trees with at most this many leaf nodes, in best-first fashion; the best nodes are chosen by the relative reduction in impurity. If None, the number of leaves is unlimited. [int or None]


min_impurity_split: A threshold for early stopping of tree growth. A node is split if its impurity is above this threshold; otherwise it becomes a leaf. [float] (Newer scikit-learn versions deprecate this parameter in favor of min_impurity_decrease.)


min_impurity_decrease: A node is split only if the split decreases the impurity by at least this value. [float]


bootstrap: Whether bootstrap samples are used when building the trees. [True/False]


oob_score: Whether to use out-of-bag samples to estimate the generalization score; a built-in alternative to cross-validation. [True/False]
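A minimal sketch of the out-of-bag estimate (bootstrap must stay True for it to work; the toy data is only for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True, random_state=0)
clf.fit(X, y)

# Accuracy estimated on the samples each tree did not see
print('OOB score:', clf.oob_score_)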


n_jobs: The number of jobs to run in parallel for fit and predict. If -1, it is set to the number of CPU cores. [integer, optional (default=1)]


random_state: If int, the seed used by the random number generator; if a RandomState instance, the random number generator itself; if None, the generator is the RandomState instance used by np.random. [int, RandomState instance, or None, optional (default=None)]
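A small sketch showing that fixing random_state makes the forest reproducible (toy data again):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

clf_a = RandomForestClassifier(random_state=42).fit(X, y)
clf_b = RandomForestClassifier(random_state=42).fit(X, y)

# Same seed, same trees, same predictions
print((clf_a.predict(X) == clf_b.predict(X)).all())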


verbose: Controls the verbosity of the tree-building process. [int, optional (default=0)]


warm_start: When set to True, reuse the solution of the previous fit and add more estimators (here, more trees) to the ensemble; otherwise, fit a whole new forest. [True/False]
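A sketch of growing the forest incrementally with warm_start (toy data; the tree counts are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)                # fits the first 50 trees

clf.n_estimators = 100       # raise the target count...
clf.fit(X, y)                # ...and only the 50 new trees are fitted

print(len(clf.estimators_))  # 100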


class_weight: The 'balanced' mode uses the y labels to adjust weights automatically, making them inversely proportional to class frequencies in the input data: n_samples / (n_classes * np.bincount(y)). The 'balanced_subsample' mode is the same, except that the weights are computed from the bootstrap sample drawn for each tree.
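A small sketch verifying the 'balanced' formula by hand with numpy (the toy label vector is arbitrary):

import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])  # an imbalanced toy label vector

n_samples = len(y)
n_classes = len(np.unique(y))

# n_samples / (n_classes * np.bincount(y)) gives one weight per class
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # [0.75 1.5] -> the rare class gets the larger weight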

GridSearchCV() uses cross-validation: for a given estimator, you pass the names and candidate values of the parameters you want to tune as a dictionary, and it tries every combination and tells you which one is best. Here is a piece of my code.


from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Prepare training data and y values
X_train, y_train = …

# Preliminarily define the classifier
rfc = RandomForestClassifier(max_depth=2, random_state=0)

# Name the parameters to tune and the candidate values for each
tuned_parameters = [{'min_samples_leaf': [1, 2, 3, 4], 'n_estimators': [50, 100, 200]}]

# The magic tool makes its entrance; cv sets the number of cross-validation folds
clf = GridSearchCV(estimator=rfc, param_grid=tuned_parameters, cv=5, n_jobs=1)

# Fit the training set
clf.fit(X_train, y_train)

print('Best parameters:')
print(clf.best_params_)
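After fitting, clf.best_estimator_ holds a model refitted with the winning parameter combination, so you can use it directly; X_test below is a hypothetical placeholder for your own test data:

# The refitted best model; use it like any classifier (X_test is a placeholder)
best_rfc = clf.best_estimator_
predictions = best_rfc.predict(X_test)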

