Series catalog:

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (1) — Reading data

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (2) — Data preprocessing

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (3) — Feature Engineering

Training data split

The training data is split into a training set and a cross-validation set in a 7:3 ratio: x_train and y_train are used to fit the model, while x_test and y_test are used for cross-validation.

from sklearn.model_selection import train_test_split

data_train = data_train.set_index('UserI_Id')
y = data_train[data_train.columns[0]]
x = data_train[data_train.columns[1:]]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=10)

Random forest with default parameters

First, the random forest is trained with its default parameters and evaluated with the out-of-bag score. In each round of bagging's bootstrap sampling, roughly 36.8% of the rows in the training set are not drawn; these unsampled rows are called out-of-bag (OOB) data. Since they play no part in fitting the model, they can be used to estimate its generalization ability.
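The 36.8% figure is the limit of (1 - 1/n)^n, the probability that a given row is never drawn in n bootstrap draws, which tends to 1/e ≈ 0.368. A quick simulation in plain Python (independent of the article's data) confirms it:

```python
import random

def oob_fraction(n, trials=200):
    # Draw n samples with replacement from n rows; measure the share never drawn
    total = 0.0
    for _ in range(trials):
        sampled = {random.randrange(n) for _ in range(n)}
        total += 1 - len(sampled) / n
    return total / trials

print(round(oob_fraction(5000), 2))  # close to 1/e, i.e. about 0.37
```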

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(x_train, y_train)
print(rf)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics
y_train_pred = rf.predict(x_train)
y_train_predprob = rf.predict_proba(x_train)[:, 1]
print('Training set OOB score:', rf.oob_score_)
print('Training set AUC score: %f' % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training set accuracy:', accuracy_score(y_train, y_train_pred))
print('Training set precision:', precision_score(y_train, y_train_pred))
print('Training set recall:', recall_score(y_train, y_train_pred))
print('Training set F1 score:', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

y_test_pred = rf.predict(x_test)
y_test_predprob = rf.predict_proba(x_test)[:, 1]
print('Test set accuracy:', accuracy_score(y_test, y_test_pred))
print('Test set precision:', precision_score(y_test, y_test_pred))
print('Test set recall:', recall_score(y_test, y_test_pred))
print('Test set F1 score:', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))

With default parameters the out-of-bag score is 0.77, and there is a large gap between the F1 scores of the training set and the cross-validation set, so the model's generalization ability is weak. We therefore tune the parameters with a grid search.

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=0,
            verbose=0, warm_start=False)
Training set AUC score: 0.999003
Training set accuracy: 0.98693877551
Training set precision: 0.988970588235
Training set recall: 0.984947111473
Training set F1 score: 0.986954749287
             precision    recall  f1-score   support
          0       0.98      0.99      0.99      2442
          1       0.99      0.98      0.99      2458
avg / total       0.99      0.99      0.99      4900
[[2415   27]
 [  37 2421]]
Test set accuracy: 0.805714285714
Test set precision: 0.810176125245
Test set recall: 0.79462571977
Test set F1 score: 0.802325581395
             precision    recall  f1-score   support
          0       0.80      0.82      0.81      1058
          1       0.81      0.79      0.80      1042
avg / total       0.81      0.81      0.81      2100
[[864 194]
 [214 828]]
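Every precision/recall/F1 line in the output can be re-derived from the confusion matrix alone. As a sanity check, here is the test-set class-1 row recomputed from the matrix [[864 194] [214 828]] above:

```python
# Test-set confusion matrix from the output above:
# rows = true class, columns = predicted class
tn, fp = 864, 194
fn, tp = 214, 828

precision = tp / (tp + fp)  # 828 / 1022
recall = tp / (tp + fn)     # 828 / 1042
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.81 0.795 0.802
```

These match the test-set precision, recall, and F1 values printed above.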

Grid search over n_estimators, the number of weak learners (trees)

Searching from 10 to 100 with a step of 10, scoring by roc_auc with 5-fold cross-validation (oob_score=True, so out-of-bag samples are also used to evaluate the model), the best value is {'n_estimators': 90} with a score of 0.907140520132.

from sklearn.model_selection import GridSearchCV
param_test1 = {'n_estimators': range(10, 101, 10)}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(max_depth=8, max_features='sqrt', oob_score=True, random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(x_train, y_train)
print(gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_)
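For reference, cv_results_ holds one entry per candidate. A minimal sketch on synthetic data (a stand-in for the article's training set) showing how to rank candidates by mean cross-validation AUC, which is what best_params_ and best_score_ summarize:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem, purely illustrative
X, y = make_classification(n_samples=500, random_state=10)
gs = GridSearchCV(RandomForestClassifier(random_state=10),
                  param_grid={'n_estimators': [10, 30, 50]},
                  scoring='roc_auc', cv=3)
gs.fit(X, y)

# Rank all candidates by mean CV AUC, best first
for params, score in sorted(zip(gs.cv_results_['params'],
                                gs.cv_results_['mean_test_score']),
                            key=lambda t: -t[1]):
    print(params, round(score, 3))
```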

Next, grid search over max_depth, the maximum depth of each decision tree, and min_samples_split, the minimum number of samples required to split an internal node

The best parameters are {'min_samples_split': 45, 'max_depth': 8} with a score of 0.90777502455; the out-of-bag score rises from 0.77 to 0.828979591837.

# Note: the original search range for min_samples_split was garbled;
# a range containing the reported optimum (45) is assumed here.
param_test2 = {'max_depth': range(3, 14, 1), 'min_samples_split': range(15, 150, 15)}
gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=90, max_features='sqrt', oob_score=True, random_state=10),
                        param_grid=param_test2, scoring='roc_auc', iid=False, cv=5)
gsearch2.fit(x_train, y_train)
print(gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_)

rf1 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=45, max_features='sqrt',
                             oob_score=True, random_state=10)
rf1.fit(x_train, y_train)
print(rf1.oob_score_)

Jointly tune min_samples_split, the minimum number of samples required to split an internal node, and min_samples_leaf, the minimum number of samples required at a leaf node

min_samples_split cannot be fixed on its own yet, because it interacts with the other tree parameters, so we tune it together with min_samples_leaf. This yields {'min_samples_leaf': 10, 'min_samples_split': 70} with a score of 0.90753607636810929.

# Note: the step of the min_samples_split range was garbled in the original;
# a step of 20 is assumed (the range contains the reported optimum, 70).
param_test3 = {'min_samples_split': range(30, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=90, max_depth=8, max_features='sqrt', oob_score=True, random_state=10),
                        param_grid=param_test3, scoring='roc_auc', iid=False, cv=5)
gsearch3.fit(x_train, y_train)
print(gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_)

Tune max_features, the maximum number of features considered per split

The best value is {'max_features': 5} with a score of 0.90721677436061976.

param_test4 = {'max_features': range(3, 20, 2)}
gsearch4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70,
                                                         min_samples_leaf=10, oob_score=True, random_state=10),
                        param_grid=param_test4, scoring='roc_auc', iid=False, cv=5)
gsearch4.fit(x_train, y_train)
print(gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_)

Cross-validation performance of the tuned model

The out-of-bag score of the training set is now 0.82, about 5 percentage points higher, and the F1 scores of the training set and the cross-validation set are close to each other, both above 0.8. The model's generalization ability has improved.

rf2 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70, min_samples_leaf=10,
                             max_features=5, oob_score=True, random_state=10)
rf2.fit(x_train, y_train)
print(rf2)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics
y_train_pred = rf2.predict(x_train)
y_train_predprob = rf2.predict_proba(x_train)[:, 1]
print('Training set OOB score:', rf2.oob_score_)
print('Training set AUC score: %f' % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training set accuracy:', accuracy_score(y_train, y_train_pred))
print('Training set precision:', precision_score(y_train, y_train_pred))
print('Training set recall:', recall_score(y_train, y_train_pred))
print('Training set F1 score:', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

y_test_pred = rf2.predict(x_test)
y_test_predprob = rf2.predict_proba(x_test)[:, 1]
print('Test set accuracy:', accuracy_score(y_test, y_test_pred))
print('Test set precision:', precision_score(y_test, y_test_pred))
print('Test set recall:', recall_score(y_test, y_test_pred))
print('Test set F1 score:', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features=5, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=70, min_weight_fraction_leaf=0.0,
            n_estimators=90, n_jobs=1, oob_score=True, random_state=10,
            verbose=0, warm_start=False)
Training set OOB score: 0.82612244898
Training set AUC score: 0.922081
Training set accuracy: 0.838367346939
Training set precision: 0.817938931298
Training set recall: 0.871847030106
Training set F1 score: 0.844033083891
             precision    recall  f1-score   support
          0       0.86      0.80      0.83      2442
          1       0.82      0.87      0.84      2458
avg / total       0.84      0.84      0.84      4900
[[1965  477]
 [ 315 2143]]
Test set accuracy: 0.824761904762
Test set precision: 0.806363636364
Test set recall: 0.851247600768
Test set F1 score: 0.828197945845
             precision    recall  f1-score   support
          0       0.84      0.80      0.82      1058
          1       0.81      0.85      0.83      1042
avg / total       0.83      0.82      0.82      2100
[[845 213]
 [155 887]]
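The tuned configuration can be exercised end-to-end on synthetic data. This is a sketch only: make_classification stands in for the credit-risk dataset (not reproduced here), so the scores will differ from those above, but the parameter set and the 7:3 split mirror the article's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit-risk table: 7000 rows, binary target
X, y = make_classification(n_samples=7000, n_features=20, random_state=10)
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=10)

# The tuned parameters found by the grid searches above
rf2 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70,
                             min_samples_leaf=10, max_features=5,
                             oob_score=True, random_state=10)
rf2.fit(x_train, y_train)
print(round(rf2.oob_score_, 2))  # OOB accuracy on the synthetic data
```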

You might want to see more

Hadoop/CDH

Hadoop in Action (1) _ Building a pseudo-distributed Hadoop 2.x environment on Aliyun

Hadoop in Action (2) _ Deploying Hadoop in fully distributed mode on virtual machines

Hadoop in Action (3) _ Building CDH in fully distributed mode on virtual machines

Hadoop in Action (4) _ Hadoop cluster management and resource allocation

Hadoop in Action (5) _ Hadoop operation and maintenance experience

Hadoop in Action (6) _ Building an Eclipse development environment for Apache Hadoop

Hadoop in Action (7) _ Installing and configuring Hue on Apache Hadoop

Hadoop in Action (8) _ Adding the Hive service and Hive basics on CDH

Hadoop in Action (9) _ Hive and UDF development

Hadoop in Action (10) _ Sqoop import and extraction framework encapsulation


The WeChat official account "Data Analysis" shares notes on the self-cultivation of data scientists. Now that we have met, let's grow together.

When reprinting, please credit the WeChat official account "Data Analysis".


Reader Telegram group:

https://t.me/sspadluo