Comparison between ordinary decision tree and random forest
Generate the moons data set
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show()
Plotting function
def plot_decision_boundary(model, X, y):
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100), np.linspace(x1_min, x1_max, 100))
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()])
    Z = Z.reshape(x0.shape)
    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()
Use decision trees for prediction
Build a decision tree and train
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf,X,y)
Plot
Since a decision tree splits on one feature at a time, its decision boundary is made of straight, axis-aligned segments, as the plot shows.
Cross validation
from sklearn.model_selection import cross_val_score
print(cross_val_score(dt_clf, X, y, cv=5).mean())  # cv sets the number of cross-validation folds
# Stratified k-fold cross-validation, which splits the data while preserving the original class ratio in each fold
from sklearn.model_selection import StratifiedKFold
strKFold = StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
print(cross_val_score(dt_clf,X, y,cv=strKFold).mean())
# Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loout = LeaveOneOut()
print(cross_val_score(dt_clf,X, y,cv=loout).mean())
# ShuffleSplit lets you control the number of iterations and the train/test proportion of each split
# (some samples may end up in neither the training set nor the test set)
from sklearn.model_selection import ShuffleSplit
shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8)  # 8 iterations
print(cross_val_score(dt_clf,X, y,cv=shufspl).mean())
Use Voting Classifier to vote
Build the voting classifier
A hard VotingClassifier takes a majority vote over the class labels predicted by each base classifier.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

# hypothetical split: the original assumes X_train / X_test already exist
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
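For context, here is a quick comparison sketch of my own (not in the original article) that scores each base classifier on the same split and then the hard-voting ensemble, reusing the names defined above:
# Hypothetical check: each base classifier alone vs. the hard-voting ensemble
for name, clf in [('knn', KNeighborsClassifier(n_neighbors=7)),
                  ('gnb', GaussianNB()),
                  ('dt', DecisionTreeClassifier(max_depth=6))]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
print('voting', voting_clf.score(X_test, y_test))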
Plot
plot_decision_boundary(voting_clf,X,y)
Soft Voting classifier
A soft voting classifier averages the class probabilities produced by each base classifier and predicts the class with the highest average probability.
Build the voting classifier
voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='soft')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
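To see the mechanism, a small sketch of my own (not from the original) averages the fitted base estimators' predicted probabilities by hand and checks the result against the ensemble's own predictions:
import numpy as np

# Average the probability estimates of the fitted base estimators
probas = np.mean([est.predict_proba(X_test) for est in voting_clf.estimators_], axis=0)
manual_pred = np.argmax(probas, axis=1)
# Share of test points where the manual average agrees with the ensemble (should be 1.0)
print((manual_pred == voting_clf.predict(X_test)).mean())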
Plot
Bagging
How it works
Bagging builds several models, trains each one on a random subset of the data, and combines their predictions (for example by majority vote) to produce the final result, as sketched below.
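A minimal hand-rolled sketch of that idea (my own illustration, assuming the X and y from the moons data above; the sample counts are arbitrary):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
trees = []
for _ in range(50):                          # 50 bootstrap rounds (arbitrary choice)
    idx = rng.randint(0, len(X), size=100)   # draw 100 samples with replacement
    trees.append(DecisionTreeClassifier(max_depth=6).fit(X[idx], y[idx]))

# Majority vote over the individual trees' predictions
votes = np.array([t.predict(X) for t in trees])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((y_pred == y).mean())                  # accuracy of the hand-rolled ensemble on X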
Bagging multiple decision trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True)
bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)
Plot
To my eye this prediction is similar to a single decision tree's: the boundary is still made of straight, axis-aligned segments.
Out-of-bag evaluation (OOB)
The setup is the same bagging as above; because each bootstrap sample leaves some data undrawn, those out-of-bag samples can be used to evaluate the ensemble (oob_score=True), so no separate test set is needed.
Code implementation
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),  # base classifier
                                n_estimators=500,          # number of classifiers
                                max_samples=100,           # training samples per model
                                bootstrap=True,            # sample with replacement
                                oob_score=True)            # keep the out-of-bag score
bagging_clf.fit(X, y)
bagging_clf.oob_score_
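As a sanity check of my own (not in the original), the OOB estimate can be compared with a score on an explicit hold-out set; the two numbers should be in the same ballpark:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical comparison: out-of-bag estimate vs. a hold-out score
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                        max_samples=100, bootstrap=True, oob_score=True)
clf.fit(X_tr, y_tr)
print(clf.oob_score_)         # estimated from the out-of-bag samples
print(clf.score(X_te, y_te))  # measured on the hold-out set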
Plot
The result here is about the same; it might be more telling on a different data set.
Random forests
A random forest behaves much like the bagged decision-tree ensemble above; in addition, each split only considers a random subset of the features.
Code implementation
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=500, random_state=666, oob_score=True)
rf_clf.fit(X,y)
rf_clf.oob_score_
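Because a random forest evaluates a random subset of features at each split, the fitted model also records how much each feature contributed; a short sketch of my own using the feature_importances_ attribute sklearn exposes:
# Per-feature importance of the fitted forest (values sum to 1)
for name, imp in zip(['x0', 'x1'], rf_clf.feature_importances_):
    print(name, round(imp, 3))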
Plot
Extremely randomized trees (extra trees)
Both extremely randomized trees and random forests are ensembles of many decision trees. The main difference is that a random forest trains each tree on a bootstrap (bagging) sample, whereas extra trees by default use all the samples; features are still chosen at random, and the split thresholds are chosen at random as well, which in some cases gives better results than a random forest.
Code implementation
from sklearn.ensemble import ExtraTreesClassifier
et_clf = ExtraTreesClassifier(n_estimators=500, random_state=666, bootstrap=True, oob_score=True)
et_clf.fit(X,y)
et_clf.oob_score_
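For a quick side-by-side (my own addition), the OOB estimates of the two ensembles fitted above can be printed together:
# Compare the out-of-bag scores of the random forest and the extra trees
print('random forest:', rf_clf.oob_score_)
print('extra trees:  ', et_clf.oob_score_)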
Plot
AdaBoost
After each model is trained, the samples it misclassified are given larger weights, and the next model is trained on the reweighted data, so later models concentrate on the hard cases.
Code implementation
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=6),n_estimators=500)
ada_clf.fit(X_train,y_train)
ada_clf.score(X_test,y_test)
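To see the sequential nature of boosting, a small sketch of my own tracks the test accuracy as estimators are added, using the staged_score generator that AdaBoostClassifier provides:
# One score per boosting round: accuracy after the first round vs. after all rounds
staged = list(ada_clf.staged_score(X_test, y_test))
print(staged[0], staged[-1])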
Plot
Gradient Boosting
Each new model is fitted to the errors left by the previous models, so the ensemble gradually corrects its own mistakes; a hand-rolled sketch of this idea follows the sklearn code below.
Code implementation
from sklearn.ensemble import GradientBoostingClassifier
gd_clf = GradientBoostingClassifier(max_depth=6,n_estimators=500)
gd_clf.fit(X_train,y_train)
gd_clf.score(X_test,y_test)
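To make "fit the errors of the previous model" concrete, here is a hand-rolled sketch of my own using regression trees on the residuals (the simplest flavor of gradient boosting; the classifier above works on gradients of the log-loss instead):
from sklearn.tree import DecisionTreeRegressor

# Stage 1 fits the targets; stages 2 and 3 fit what is still unexplained (the residuals)
tree1 = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)
r1 = y_train - tree1.predict(X_train)
tree2 = DecisionTreeRegressor(max_depth=2).fit(X_train, r1)
r2 = r1 - tree2.predict(X_train)
tree3 = DecisionTreeRegressor(max_depth=2).fit(X_train, r2)

# The ensemble prediction is the sum of the three stages, thresholded back to a class label
y_pred = sum(t.predict(X_test) for t in (tree1, tree2, tree3))
print(((y_pred > 0.5).astype(int) == y_test).mean())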