
Comparison between ordinary decision tree and random forest

Generate the moons data set

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
plt.show()

Plotting helper function

def plot_decision_boundary(model, X, y):
    # Build a grid that covers the feature space with a small margin
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100), np.linspace(x1_min, x1_max, 100))
    # Predict every grid point and reshape the result back onto the grid
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()])
    Z = Z.reshape(x0.shape)

    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()

Use decision trees for prediction

Build a decision tree and train

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf, X, y)

drawing

Since a decision tree splits on one feature at a time, its decision boundary is made up of straight, axis-parallel segments, as the prediction above shows.

Cross validation

from sklearn.model_selection import cross_val_score
print(cross_val_score(dt_clf, X, y, cv=5).mean())  # cv sets how many folds of cross-validation to run

# Stratified k-fold cross-validation, which splits the data set while preserving the original class ratio
from sklearn.model_selection import StratifiedKFold

strKFold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(dt_clf, X, y, cv=strKFold).mean())

# Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut

loout = LeaveOneOut()
print(cross_val_score(dt_clf, X, y, cv=loout).mean())

# ShuffleSplit lets you control the number of iterations and the train/test proportions of each split
# (some samples may fall into neither the training set nor the test set)
from sklearn.model_selection import ShuffleSplit

shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8)  # 8 iterations
print(cross_val_score(dt_clf, X, y, cv=shufspl).mean())

Use Voting Classifier to vote

Build the voting classifier

A VotingClassifier with voting='hard' performs hard voting: each base classifier casts one vote for a class and the majority wins.
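The code below relies on a train/test split (X_train, X_test, y_train, y_test) that is not shown in the article; a minimal sketch of how such a split could be produced (the parameters are an assumption, not the original ones):

from sklearn.model_selection import train_test_split

# Hypothetical split; the article does not show the parameters it actually used
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)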

from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

drawing

plot_decision_boundary(voting_clf, X, y)

Soft Voting classifier

A soft VotingClassifier averages the class probabilities produced by each base classifier and predicts the class with the highest average probability, so every base estimator must support predict_proba.

Build the voting classifier

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='soft')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

drawing
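The figure can be reproduced with the plotting helper defined earlier (a sketch, assuming voting_clf is the fitted soft-voting model):

plot_decision_boundary(voting_clf, X, y)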

Bagging

How it works

Build several models, train each one on a random subset of the data, and combine the predictions of all the models into the final prediction (for classification, usually by majority vote).
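A minimal hand-rolled sketch of that idea (illustrative only; it is not the article's code and assumes the X_train/X_test split shown earlier):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_models, n_sub = 10, 100
models = []
for _ in range(n_models):
    # Draw a bootstrap sample: n_sub indices chosen with replacement
    idx = np.random.randint(0, len(X_train), n_sub)
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Every model predicts, and the majority vote (labels are 0/1 here) is the final prediction
all_preds = np.array([m.predict(X_test) for m in models])
y_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print((y_pred == y_test).mean())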

Multiple decision tree model

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True)

bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)

drawing

Personally, I find this prediction similar to the result of a single decision tree: the boundary is still made of straight segments.

Out of Bag (OOB)

With bootstrap (with-replacement) sampling, each model is trained on only part of the data; the samples a given model never sees are its out-of-bag samples, and they can be used to evaluate the ensemble without setting aside a separate test set.
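As a side note (not from the article), the chance that a particular sample is never drawn in m draws with replacement from n samples is (1 - 1/n)^m, which approaches 1/e ≈ 0.37 when m = n; a quick check:

n = 500              # data set size
for m in (500, 100): # bootstrap sample sizes: full-size, and the article's max_samples=100
    p_oob = (1 - 1 / n) ** m
    print(f"m={m}: probability that a sample is out of bag = {p_oob:.3f}")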

Code implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),  # base classifier
                                n_estimators=500,          # number of classifiers
                                max_samples=100,           # number of training samples per model
                                bootstrap=True,            # sample with replacement
                                oob_score=True)            # evaluate on the out-of-bag samples

bagging_clf.fit(X, y)
bagging_clf.oob_score_

drawing

The result here is about the same; it might look better on a different data set.

Random forests

A random forest behaves like an ensemble of decision trees built with bagging; in addition, each split considers only a random subset of the features.

Code implementation

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, random_state=666, oob_score=True)

rf_clf.fit(X, y)
rf_clf.oob_score_

drawing

Extra trees (extremely randomized trees)

Like a random forest, an extra-trees ensemble is built from many decision trees. The main difference is that a random forest uses bagging (each tree is trained on a bootstrap sample), while extra trees by default use all the samples; the features are selected at random and the split thresholds are also chosen at random, and because the splits are random the results can in some respects be better than a random forest's.
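A quick way to compare the two on this data set (a sketch, assuming the X and y from above; the random_state value is arbitrary):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
et = ExtraTreesClassifier(n_estimators=500, bootstrap=True, oob_score=True, random_state=0)
rf.fit(X, y)
et.fit(X, y)
print(rf.oob_score_, et.oob_score_)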

Code implementation

from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(n_estimators=500, random_state=666, bootstrap=True, oob_score=True)
et_clf.fit(X, y)
et_clf.oob_score_

drawing

AdaBoost

After each model is trained, the weights of the samples it misclassified are increased, and the next model is trained on the re-weighted data, so later models focus on the examples earlier models got wrong.
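A minimal hand-rolled sketch of one such weight-update step (illustrative only; the doubling factor is not the real AdaBoost formula, and it assumes the X_train/y_train split from earlier):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Start with equal weights for every training sample
weights = np.ones(len(X_train)) / len(X_train)

clf1 = DecisionTreeClassifier(max_depth=2)
clf1.fit(X_train, y_train, sample_weight=weights)

# Increase the weights of the samples the first model got wrong
wrong = clf1.predict(X_train) != y_train
weights[wrong] *= 2          # illustrative factor; AdaBoost derives it from the weighted error
weights /= weights.sum()

# The next model trains on the re-weighted data
clf2 = DecisionTreeClassifier(max_depth=2)
clf2.fit(X_train, y_train, sample_weight=weights)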

Code implementation

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=6), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)

drawing

Gradient Boosting

Each new model is trained to correct the errors (the residuals) left by the previous models, rather than on re-weighted samples.
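The idea is easiest to see with regression trees; a minimal sketch in which each tree fits the residual of the previous ones (illustrative only, assuming the X, y, and X_test from above; the classifier version fits gradients of the log-loss instead):

from sklearn.tree import DecisionTreeRegressor

tree1 = DecisionTreeRegressor(max_depth=2)
tree1.fit(X, y)

# The second tree fits the errors left by the first
y2 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2)
tree2.fit(X, y2)

# The third tree fits the errors left by the first two
y3 = y2 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2)
tree3.fit(X, y3)

# The ensemble's prediction is the sum of the trees' predictions
y_pred = sum(tree.predict(X_test) for tree in (tree1, tree2, tree3))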

Code implementation

from sklearn.ensemble import GradientBoostingClassifier

gd_clf = GradientBoostingClassifier(max_depth=6, n_estimators=500)

gd_clf.fit(X_train, y_train)
gd_clf.score(X_test, y_test)

drawing