
1 Random Forest Introduction

Ensemble learning is a very popular family of machine learning methods. It is not a single algorithm itself; instead, it builds multiple models on the data and integrates their modeling results. An ensemble algorithm takes the results of multiple estimators into account and aggregates them, obtaining a regression or classification result that is better than any single model. The model that integrates the individual models is called the ensemble estimator, and each model that makes up the ensemble is called a base estimator. Generally speaking, there are three kinds of ensemble algorithms: bagging, boosting, and stacking.

The core idea of bagging is to build several mutually independent base estimators and then determine the ensemble's result by averaging their predictions or by majority voting. The representative bagging model is the random forest. In boosting, the base estimators are related to one another and are built sequentially; its core idea is to combine the strength of weak estimators, repeatedly concentrating on the samples that are hard to predict, so as to form a strong estimator. Representative boosting models are AdaBoost and gradient boosting trees.
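
As a minimal sketch of the bagging idea (a toy example with made-up numbers, not from the original article), the snippet below hand-codes a majority vote over the predictions of five hypothetical base estimators:

import numpy as np

# predictions of 5 independent base estimators for 4 samples (rows = estimators)
base_preds = np.array([[0, 1, 1, 0],
                       [0, 1, 0, 0],
                       [1, 1, 1, 0],
                       [0, 0, 1, 0],
                       [0, 1, 1, 1]])

# majority vote per sample (column-wise): a sample is class 1 if more than half of the estimators say so
votes = (base_preds.sum(axis=0) > base_preds.shape[0] / 2).astype(int)
print(votes)   # -> [0 1 1 0]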

2 Important Parameters

2.1 n_estimators

The number of base estimators. This parameter has a monotonic effect on the accuracy of a random forest: the larger n_estimators is, the better the model tends to be. However, every model has a decision boundary; after n_estimators reaches a certain size, the accuracy of the random forest stops rising or starts to fluctuate. Moreover, the larger n_estimators is, the more computation and memory are required and the longer training takes. For this parameter, a balance should be struck between training cost and model performance.
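
A quick way to see this plateau is to plot cross-validated accuracy against n_estimators. The sketch below is only an illustration: it assumes the wine dataset that this article uses later, and the range of values and random_state are arbitrary choices.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

wine = load_wine()
scores = []
for n in range(1, 101, 10):
    rfc = RandomForestClassifier(n_estimators=n, random_state=0)
    # mean 10-fold cross-validated accuracy for this number of trees
    scores.append(cross_val_score(rfc, wine.data, wine.target, cv=10).mean())

plt.plot(range(1, 101, 10), scores)
plt.xlabel('n_estimators')
plt.ylabel('mean CV accuracy')
plt.show()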

2.2 RandomForestClassifier VS DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
wine.data      # feature matrix
wine.target    # class labels

# Compare results from a random forest and a decision tree on the same split
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3)
clf = DecisionTreeClassifier(random_state=200)
rfc = RandomForestClassifier(random_state=200)
clf = clf.fit(x_train, y_train)
rfc = rfc.fit(x_train, y_train)
score_c = clf.score(x_test, y_test)
score_r = rfc.score(x_test, y_test)
print('Tree Score:{}'.format(score_c), '\n', 'Random Forest Score:{}'.format(score_r))

On this split, the random forest scores higher than the decision tree.

2.3 Comparing them again with cross-validation (cross_val_score)

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

rfc = RandomForestClassifier(n_estimators=30)
rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10)

clf = DecisionTreeClassifier()
clf_s = cross_val_score(clf, wine.data, wine.target, cv=10)

plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), rfc_s, label='RandomForest')
plt.plot(range(1, 11), clf_s, label='Decision Tree')
plt.title('RandomForest VS Decision Tree')
plt.legend()
plt.show()

Across the ten folds of cross-validation, the decision tree occasionally matches the random forest, but overall the random forest is clearly better.

3 Bagging

When random_state is fixed, the random forest generates a fixed set of trees, but the trees are still different from one another; fixing it only removes the run-to-run randomness of the results. It can also be shown that the more randomness there is, the better bagging generally works. When using bagging, the base classifiers should be independent of and different from each other. But this approach has a strong limitation: when we need thousands of trees, the data does not necessarily provide thousands of features that would let us build that many different trees. So, besides random_state, we need other sources of randomness.
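
As a small sketch of this (it reuses the x_train/y_train split from section 2.2; n_estimators=10 and random_state=2 are arbitrary choices), inspecting the fitted trees through the estimators_ attribute shows that each tree receives its own derived random_state, which is why the trees differ even when the forest's random_state is fixed:

# each tree in a fitted forest is a DecisionTreeClassifier with its own random_state
rfc = RandomForestClassifier(n_estimators=10, random_state=2)
rfc = rfc.fit(x_train, y_train)
print([tree.random_state for tree in rfc.estimators_])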

3.1 bootstrap & oob_score

To make the base classifiers as different as possible, they are trained on different training sets, and the bagging method generates these different training sets through random sampling with replacement; the parameter bootstrap controls this sampling technique.

In an original training set containing n samples, we sample at random, one sample at a time, and put each sample back into the original training set before drawing the next one, which means the same sample can be drawn again. After n such draws we obtain a bootstrap set of n samples, the same size as the original training set. Because of the random sampling, every bootstrap set differs from the original data set and from every other bootstrap set. With an inexhaustible supply of different bootstrap sets, training our base classifiers on them naturally makes the base classifiers different from one another.

Sampling with replacement also leaves part of the training data unused: a sample is missed on a single draw with probability 1 - 1/n, so the probability that it is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.37, meaning each bootstrap set leaves out roughly 37% of the original samples. This left-out data is called out-of-bag (OOB) data, and besides the test set we split off at the beginning, it can serve as a test set for the ensemble algorithm. In other words, when using a random forest we do not have to split a training and test set at all, and can simply evaluate the model on the out-of-bag data. This is not always possible, however: when n and n_estimators are not large enough, it may happen that no data falls out of the bag, and then OOB data cannot be used to evaluate the model.

If we want to test with out-of-bag data, we need to set oob_score=True when instantiating. After training, we can use oob_score_, another important attribute of the random forest, to see the result of our test on the out-of-bag data:

# train on the full dataset and evaluate on the out-of-bag samples instead of a held-out test set
rfc = RandomForestClassifier(n_estimators=30, oob_score=True)
rfc = rfc.fit(wine.data, wine.target)
rfc.oob_score_   # accuracy measured on the out-of-bag data

4 Important Attributes and Interfaces

The interface of the random forest is exactly the same as that of the decision tree, so there are still the four commonly used interfaces: apply, fit, predict and score. In addition, you should pay attention to the random forest's predict_proba interface. It returns, for each test sample, the probability of being assigned to each class label, with one column per label. For a binary problem, a sample whose predict_proba value is greater than 0.5 is predicted as class 1, and one whose value is less than 0.5 as class 0. A traditional random forest determines the ensemble result by the bagging rule, that is, majority voting, whereas the random forest in sklearn averages the predict_proba output of every tree for each sample to obtain a mean probability, and then classifies the test sample from that average.

rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(x_train, y_train)
print('Random Forest Score:{}'.format(rfc.score(x_test, y_test)))
print('-' * 100)

# feature importances
print('Feature importances:{}'.format([*zip(wine.feature_names, rfc.feature_importances_)]))
print('-' * 100)

# index of the leaf node each sample falls into, in each tree
print('Leaf indices for the first two samples:{}'.format(rfc.apply(x_test)[:2]))
print('-' * 100)

# predicted labels for the test set
print('Predicted labels:{}'.format(rfc.predict(x_test)))
print('-' * 100)

# probability of each sample being assigned to each label
print('Predicted probabilities for the first five samples:{}'.format(rfc.predict_proba(x_test)[:5]))
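
To see the averaging described above in action, the short sketch below (an illustration under the assumption that rfc has just been fitted as in the previous block) compares the forest's predict_proba with the mean of the individual trees' probabilities; it is expected to print True:

import numpy as np

# stack the per-tree probabilities and average them across trees
tree_probas = np.stack([tree.predict_proba(x_test) for tree in rfc.estimators_])
print(np.allclose(tree_probas.mean(axis=0), rfc.predict_proba(x_test)))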

Note: before using a random forest, it is important to check that the classification trees making up the forest each reach at least 50% accuracy. Bagging only improves on its base estimators when they do better than random guessing; otherwise the ensemble can end up worse than a single tree.
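
The reason is easy to verify numerically. The sketch below is a minimal illustration, not from the original article; it assumes 25 trees that err independently with the same error rate and sums the binomial tail in which more than half of the trees are wrong:

import math

def ensemble_error(eps, n_trees=25):
    # the forest errs when more than half of its trees err; with independent
    # base errors eps this is a binomial tail probability
    return sum(math.comb(n_trees, i) * eps ** i * (1 - eps) ** (n_trees - i)
               for i in range(n_trees // 2 + 1, n_trees + 1))

print(ensemble_error(0.2))   # base error 20%: the ensemble error is far smaller
print(ensemble_error(0.6))   # base accuracy below 50%: the ensemble is even worse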