I. Purpose and requirements of the experiment

1) Purpose of the experiment

  • Understand the working mechanism of decision trees;
  • Master the decision tree model and the model improvement process;
  • Become familiar with and build a random forest classification model.

2) Experimental requirements

  • Write a complete source program for the experimental topic;
  • Analyze possible problems before running the program, and determine the debugging steps and testing methods;
  • Input a sufficient amount of test data and analyze the running results;
  • After the experiment, write the experiment report carefully, analyzing and summarizing the problems encountered.

II. Experimental environment (tools, configuration, etc.)

  • Hardware requirements: a computer;
  • Software required: the macOS operating system; the experiment is developed in Jupyter Notebook.

III. Experimental content (experimental scheme, experimental steps, design ideas, etc.)

1) Experimental scheme

  • Learn and follow the Blue Bridge cloud course to complete the experiment;
  • Construct a decision tree for the Adult dataset: load the data, process it, and then create a decision tree classifier;
  • Adjust the decision tree parameters, understand the meaning of these parameters, and improve the accuracy through repeated modeling and training;
  • Build a random forest classification model, using the random forest algorithm provided by scikit-learn to create the corresponding classification prediction model.

2) Experimental steps

  • Model definition:

    • Open the experimental environment of the Blue Bridge experiment building;
    • Import the modules required for the experiment.
  • Data processing and loading:

    • Load the data set;
    • Create a sample data set according to experimental requirements and divide training data and test data;
    • Draw a decision tree using the methods provided by scikit-learn;
    • Do the necessary cleaning of the data: convert the target value to a binary 0/1 value and cast Age to an integer type;
    • Preprocess the data: distinguish the categorical and continuous features in the dataset, fill in missing data, and one-hot encode the categorical features;
    • Create zero-valued features to align the test data with the training data.
  • Training model:

    • Implement the Shannon entropy calculation function entropy() according to the experimental content;
    • Implement the information gain function information_gain(root, left, right);
    • Implement the best_feature_to_split function based on information gain;
    • Implement a simple tree-building procedure by recursively calling the partition function, and output the entropy change at each step;
    • Construct the Adult dataset decision tree: load and read the dataset;
    • Create a decision tree classifier using the training data;
    • Tune the decision tree with a GridSearchCV grid search and return the best parameters;
    • Construct a random forest model with RandomForestClassifier.
  • Visualize the training process;

  • Test, and revise repeatedly.

3) Design ideas:

  • First of all, to understand the working mechanism of a simple decision tree in the first example, we need to create a sample dataset, specify some training and test data, build a decision tree based on information entropy from the training dataset, and draw the decision tree using the method provided by scikit-learn.

  • Next, to calculate entropy and information gain, we need to implement the Shannon entropy function, the information gain calculation function, and a partition function based on information gain, and then implement a simple tree-building strategy that outputs the entropy change at each step, which is very helpful for understanding how a decision tree is generated (the two formulas are written out after this list).

  • Next, the Adult dataset decision tree is constructed. This is a fairly complete example of processing the data after loading the dataset: converting the target value to a binary 0/1 value, converting floating-point features to integers, filling in missing data, and so on.

  • After the data is processed, a decision tree classifier is created using the training data, and then the parameters of the decision tree model are adjusted to check the accuracy on the test set.

  • Finally, a random forest classification model is built using the random forest algorithm provided by scikit-learn, and its classification predictions are evaluated on the test set.
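
For reference, the two quantities implemented later are the standard definitions of Shannon entropy and information gain (stated here only for clarity; S is a set of samples, p_i is the proportion of class i in S, and left/right are the two partitions produced by a split):

  H(S) = -Σ_i p_i · log2(p_i)
  IG(root, left, right) = H(root) - (|left| / |root|) · H(left) - (|right| / |root|) · H(right)

These are exactly what the entropy() and information_gain() functions below compute.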

IV. Experimental results

Simple example exercise

  1. Create the sample dataset
# Create a sample dataset and one-hot encode the categorical features
def create_df(dic, feature_list):
    out = pd.DataFrame(dic)
    out = pd.concat([out, pd.get_dummies(out[feature_list])], axis=1)
    out.drop(feature_list, axis=1, inplace=True)
    return out
# Ensure that the one-hot encoded features are present in both training and test data
def intersect_features(train, test):
    common_feat = list(set(train.keys()) & set(test.keys()))
    return train[common_feat], test[common_feat]

  2. Define the features
features = ['Looks', 'Alcoholic_beverage', 'Eloquence', 'Money_spent']

  3. Training data
df_train = {}
df_train['Looks'] = ['handsome', 'handsome', 'handsome', 'repulsive',
                     'repulsive', 'repulsive', 'handsome']
df_train['Alcoholic_beverage'] = [
    'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes']
df_train['Eloquence'] = ['high', 'low', 'average', 'average', 'low',
                         'high', 'average']
df_train['Money_spent'] = ['lots', 'little', 'lots', 'little', 'lots',
                           'lots', 'lots']
df_train['Will_go'] = LabelEncoder().fit_transform(
    ['+', '-', '+', '-', '-', '+', '+'])
df_train = create_df(df_train, features)
df_train

Figure 1. Training data


  4. Test data
df_test = {}
df_test['Looks'] = ['handsome', 'handsome', 'repulsive']
df_test['Alcoholic_beverage'] = ['no', 'yes', 'yes']
df_test['Eloquence'] = ['average', 'high', 'average']
df_test['Money_spent'] = ['lots', 'little', 'lots']
df_test = create_df(df_test, features)
df_test
# Keep only the one-hot encoded features that are present in both training and test data
y = df_train['Will_go']
df_train, df_test = intersect_features(train=df_train, test=df_test)

Figure 2. Test set

Figure 3. Training set

Figure 4. Test set


  5. Draw a decision tree based on information entropy
dt = DecisionTreeClassifier(criterion='entropy', random_state=17)
dt.fit(df_train, y)
tree_str = export_graphviz(
    dt, feature_names=df_train.columns, out_file=None, filled=True)
graph = pydotplus.graph_from_dot_data(tree_str)
SVG(graph.create_svg())
Figure 5. Decision tree

Figure 6. Grouping of balls


  6. Implement the Shannon entropy calculation function entropy()
from math import log
def entropy(a_list) :
    lst = list(a_list)
    size = len(lst) * 1.0
    entropy = 0
    set_elements = len(set(lst))
    if set_elements in [0, 1]:
        return 0
    for i in set(lst):
        occ = lst.count(i)
        entropy -= occ/size * log(occ/size, 2)
    return entropy

Figure 7. Entropy of the state given by the list balls_left

Figure 8. Entropy of a fair 6-sided die
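
As a quick sanity check of entropy() (a minimal illustrative sketch, not part of the original lab code), a fair coin gives 1 bit and a fair 6-sided die gives log2(6) ≈ 2.585 bits, which matches Figure 8:

# Entropy of a fair coin: two equally likely outcomes -> 1 bit
print(entropy([0, 1]))               # 1.0
# Entropy of a fair 6-sided die: six equally likely outcomes -> log2(6)
print(entropy([1, 2, 3, 4, 5, 6]))   # ≈ 2.585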


  7. Implement the information gain calculation function information_gain(root, left, right)
def information_gain(root, left, right) :
    return entropy(root) - 1.0 * len(left) / len(root) * entropy(left) \
                         - 1.0 * len(right) / len(root) * entropy(right)

Figure 9. Information gain


  8. Implement the partition function best_feature_to_split based on information gain
def best_feature_to_split(X, y) :
    """Information gain for each feature split"""
    out = []
    for i in X.columns:
        out.append(information_gain(y, y[X[i] == 0], y[X[i] == 1]))
    return out

  9. Implement a simple tree-building strategy by recursively calling best_feature_to_split and output the entropy change at each step
def btree(X, y) :
    clf = best_feature_to_split(X, y)
    param = clf.index(max(clf))
    ly = y[X.iloc[:, param] == 0]
    ry = y[X.iloc[:, param] == 1]
    print('Column_' + str(param) + ' N/Y? ')
    print('Entropy: ', entropy(ly), entropy(ry))
    print('N count:', ly.count(), '/', 'Y count:', ry.count())
    if entropy(ly) != 0:
        left = X[X.iloc[:, param] == 0]
        btree(left, ly)
    if entropy(ry) != 0:
        right = X[X.iloc[:, param] == 1]
        btree(right, ry)

Figure 10. Information gain for each feature split

Figure 11. Tree construction


Construct the Adult dataset decision tree

  1. Load the census income dataset
data_train = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult_train.csv', sep='; ')
data_test = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult_test.csv', sep='; ')

  2. Features of the census income dataset
  • Age – continuous numerical feature
  • Workclass – categorical feature
  • fnlwgt – continuous numerical feature
  • Education – categorical feature
  • Education_Num – continuous numerical feature
  • Martial_Status – categorical feature
  • Occupation – categorical feature
  • Relationship – categorical feature
  • Race – categorical feature
  • Sex – categorical feature
  • Capital_Gain – continuous numerical feature
  • Capital_Loss – continuous numerical feature
  • Hours_per_week – continuous numerical feature
  • Country – categorical feature
  • Target – income level, binary target value

  3. Do some necessary cleaning of the dataset and convert the target value to a binary 0/1 value:
# Remove incorrect rows from the test set
data_test = data_test[(data_test['Target'] == ' >50K.')
                      | (data_test['Target'] == ' <=50K.')]
# Encode the target as 0 and 1
data_train.loc[data_train['Target'] == ' <=50K', 'Target'] = 0
data_train.loc[data_train['Target'] == ' >50K', 'Target'] = 1
data_test.loc[data_test['Target'] == ' <=50K.', 'Target'] = 0
data_test.loc[data_test['Target'] == ' >50K.', 'Target'] = 1

Figure 12. Statistical indicators of characteristics and target values

Figure 13. Target distribution count of training dataset


  4. Plot the distribution of each feature:
fig = plt.figure(figsize=(25, 15))
cols = 5
rows = np.ceil(float(data_train.shape[1]) / cols)
for i, column in enumerate(data_train.columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    if data_train.dtypes[column] == object:
        data_train[column].value_counts().plot(kind="bar", axes=ax)
    else:
        data_train[column].hist(axes=ax)
        plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=0.7, wspace=0.2)

Figure 14. Distribution of each feature


  5. Convert the Age feature type from object to integer
data_test['Age'] = data_test['Age'].astype(int)

  6. Cast all floating-point features in the test data to integer types to match the training data:
data_test['fnlwgt'] = data_test['fnlwgt'].astype(int)
data_test['Education_Num'] = data_test['Education_Num'].astype(int)
data_test['Capital_Gain'] = data_test['Capital_Gain'].astype(int)
data_test['Capital_Loss'] = data_test['Capital_Loss'].astype(int)
data_test['Hours_per_week'] = data_test['Hours_per_week'].astype(int)

  7. Select the categorical and continuous feature variables from the dataset:
categorical_columns = [c for c in data_train.columns
                       if data_train[c].dtype.name == 'object']
numerical_columns = [c for c in data_train.columns
                     if data_train[c].dtype.name != 'object']
print('categorical_columns:', categorical_columns)
print('numerical_columns:', numerical_columns)

  8. Fill missing values with the median for continuous features and with the mode for categorical features:
for c in categorical_columns:
    data_train[c].fillna(data_train[c].mode()[0], inplace=True)
    data_test[c].fillna(data_train[c].mode()[0], inplace=True)
for c in numerical_columns:
    data_train[c].fillna(data_train[c].median(), inplace=True)
    data_test[c].fillna(data_train[c].median(), inplace=True)

  9. One-hot encode the categorical features so that all features in the dataset are numerical and can be passed to the model later:
data_train = pd.concat([data_train[numerical_columns],
                        pd.get_dummies(data_train[categorical_columns])], axis=1)
data_test = pd.concat([data_test[numerical_columns],
                       pd.get_dummies(data_test[categorical_columns])], axis=1)
set(data_train.columns) - set(data_test.columns)
data_train.shape, data_test.shape
# After one-hot encoding, the Holand-Netherlands country value does not appear in the test data.
# To match the training data, create a zero-valued feature for it:
data_test['Country_ Holand-Netherlands'] = 0
set(data_train.columns) - set(data_test.columns)

  10. Separate the target feature
X_train = data_train.drop(['Target'], axis=1)
y_train = data_train['Target']

X_test = data_test.drop(['Target'], axis=1)
y_test = data_test['Target']

Tune the decision tree model parameters

  1. Using the initial parameters max_depth=3, random_state=17:
tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)
tree_predictions = tree.predict(X_test)
accuracy_score(y_test, tree_predictions)

Figure 15. Accuracy with the initial parameters


Some important parameters


  • criterion

criterion is the parameter that determines how node impurity is calculated.

sklearn offers two options:

1) "entropy": use information entropy;

2) "gini": use the Gini coefficient.

Note that when "entropy" is used, sklearn actually calculates information gain based on the information entropy, that is, the difference between the information entropy of the parent node and that of the child nodes. Compared with the Gini coefficient, information entropy is more sensitive to impurity and penalizes it more strongly, but in practice the two criteria give essentially the same results. Information entropy is slower to compute than the Gini coefficient because the Gini coefficient does not involve logarithms. In addition, since information entropy is more sensitive to impurity, the decision tree grows more "finely" when it is used, so for high-dimensional data or data with a lot of noise, information entropy overfits easily, and the Gini coefficient is usually the better choice in that case. When the model underfits, that is, when it performs poorly on both the training set and the test set, information entropy is preferred. Of course, none of this is absolute.
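
As an illustration (a minimal sketch that reuses the X_train / y_train / X_test / y_test split prepared above; the exact scores depend on the data and the chosen depth), the two criteria can be compared directly:

# Compare Gini impurity and information entropy on the same split (illustrative only)
for crit in ['gini', 'entropy']:
    clf = DecisionTreeClassifier(criterion=crit, max_depth=3, random_state=17)
    clf.fit(X_train, y_train)
    print(crit, 'test accuracy:', accuracy_score(y_test, clf.predict(X_test)))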


  • random_state & splitter

random_state is the parameter that sets the random seed used when branching. Its default is None; the randomness is more apparent in high dimensions and almost unnoticeable in low dimensions (such as the iris dataset). splitter is also used to control randomness in the decision tree and takes two values: with "best", branching is still random, but the more important features are preferred when splitting (the importance can be inspected through the feature_importances_ attribute); with "random", branching is more random, and the tree becomes deeper and larger because it contains more unnecessary information, which reduces the fit to the training set. This is also a way to prevent overfitting: when you expect the model to overfit, these two parameters can help reduce the likelihood of overfitting while the tree is being built. Of course, once the tree is built, we still rely on the pruning parameters to prevent overfitting.
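
For example (a sketch under the same assumptions as above; with splitter='random' the tree usually grows deeper and fits the training data less tightly):

# splitter='best' (default) vs splitter='random' (illustrative only)
for split_mode in ['best', 'random']:
    clf = DecisionTreeClassifier(splitter=split_mode, random_state=17)
    clf.fit(X_train, y_train)
    print(split_mode, '-> depth:', clf.get_depth(),
          'test accuracy:', accuracy_score(y_test, clf.predict(X_test)))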


Pruning parameters

Left unchecked, a decision tree will grow until the impurity measure is optimal or no more features are available. Such a decision tree tends to overfit: it performs well on the training set and poorly on the test set. The sample data we collect can never be fully consistent with the overall population, so when a decision tree explains the training data too well, the rules it finds necessarily include noise from the training samples, and its fit to unseen data suffers.

In order to make the decision tree generalize better, we need to prune it. The pruning strategy has a great influence on the decision tree, and a correct pruning strategy is the core of optimizing the decision tree algorithm. sklearn provides several pruning parameters:


  • max_depth:

Limits the maximum depth of the tree; any branch deeper than the set depth is cut off. This is the most widely used pruning parameter and is very effective for high-dimensional, low-sample-size data. Each additional layer of the decision tree roughly doubles the demand for samples, so limiting tree depth is an effective way to limit overfitting. It is also very useful in ensemble algorithms. In practice, it is recommended to start from max_depth=3, check the fit, and then decide whether to increase the depth.
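
A quick way to see the effect of depth in practice (a minimal sketch on the training split prepared above, using 3-fold cross-validation; the candidate depths are illustrative, and cross_val_score is imported in the attached source program):

# Cross-validated accuracy for several candidate depths (illustrative only)
for depth in [3, 6, 9, 12]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=17)
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    print('max_depth =', depth, '-> mean CV accuracy =', round(score, 4))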


  • min_samples_leaf & min_samples_split

min_samples_leaf specifies that each child node produced by a split must contain at least min_samples_leaf training samples; otherwise the split will not happen, or the split will be made so that each child node keeps at least min_samples_leaf samples.

It is usually used together with max_depth; in regression trees it can make the model noticeably smoother. Setting this parameter too low invites overfitting, while setting it too high prevents the model from learning the data. In general, it is recommended to start from min_samples_leaf=5. If the sample sizes of the leaf nodes vary greatly, it is recommended to pass a floating-point number as a percentage of the sample size. This parameter also guarantees a minimum size for each leaf, which helps avoid low-variance, overfitted leaf nodes in regression problems. For classification problems with few classes, min_samples_leaf=1 is usually the best choice. min_samples_split specifies that a node must contain at least min_samples_split training samples before it may be split; otherwise the split will not happen.
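
For instance (a sketch with illustrative values, not tuned settings):

# Require at least 50 samples in every leaf and at least 100 samples to split a node (illustrative values)
clf = DecisionTreeClassifier(min_samples_leaf=50, min_samples_split=100, random_state=17)
clf.fit(X_train, y_train)
print('leaves:', clf.get_n_leaves(), 'depth:', clf.get_depth(),
      'test accuracy:', accuracy_score(y_test, clf.predict(X_test)))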


  • max_features & min_impurity_decrease

max_features limits the number of features considered when branching; features beyond the limit are discarded. Similar to max_depth, max_features is a pruning parameter used to limit overfitting on high-dimensional data, but its approach is more brute-force: it directly limits the number of features that can be used and forces the decision tree to stop. Without knowing the importance of each feature in the decision tree, imposing this parameter may lead to underfitting. If you want to avoid overfitting through dimensionality reduction, PCA, ICA or the dimensionality reduction methods in the feature selection module are recommended instead. min_impurity_decrease limits growth in a different way: a node is split only if the split decreases the impurity by at least the given amount.
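
A combined sketch (the parameter values are illustrative assumptions, not tuned settings):

# Consider at most 20 features per split and require each split to reduce impurity by at least 1e-4
clf = DecisionTreeClassifier(max_features=20, min_impurity_decrease=1e-4, random_state=17)
clf.fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))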


  2. Use a GridSearchCV grid search to tune the decision tree and return the best parameters:
tree_params = {'max_depth': range(8, 11)}
locally_best_tree = GridSearchCV(DecisionTreeClassifier(random_state=17),
                                 tree_params, cv=5)
locally_best_tree.fit(X_train, y_train)
print("Best params:", locally_best_tree.best_params_)
print("Best cross validation score", locally_best_tree.best_score_)

Figure 16. Search results


  3. Using the best parameters found above, test again:
tuned_tree = DecisionTreeClassifier(max_depth=9, random_state=17)
tuned_tree.fit(X_train, y_train)
tuned_tree_predictions = tuned_tree.predict(X_test)
accuracy_score(y_test, tuned_tree_predictions)

Figure 17. Test results after tuning


Build the random forest classification model

  1. Build the random forest classification model and make predictions:
rf = RandomForestClassifier(n_estimators=100, random_state=17)
rf.fit(X_train, y_train)
forest_predictions = rf.predict(X_test)
accuracy_score(y_test, forest_predictions)

Figure 18. Test results of the random forest classification model


Problems encountered and solutions

  • Problem: lack of understanding of the core of the decision tree algorithm and of the various parameters a decision tree can use.

  • Solution: studied material online and summarized the knowledge. I learned that a decision tree is a top-down, tree-shaped classification process over the sample data, consisting of nodes and directed edges. Nodes are divided into internal nodes and leaf nodes: each internal node represents a feature or attribute, each leaf node represents a class, and the edges represent the partitioning conditions. Starting from the top node, all samples are together; after the root node is split, the samples are divided among different child nodes and further divided according to the features of the child nodes, until every sample is assigned to some class. Building a decision tree is a process of recursively selecting internal nodes, computing the edges' partitioning conditions, and finally reaching the leaf nodes. In addition, I gained some understanding of the decision tree model in the sklearn library, and changed the parameter values many times in this experiment to complete it.
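
One way to make this node/edge structure concrete (a sketch that assumes the tuned_tree fitted above and sklearn's export_text helper, available in scikit-learn 0.21+) is to print the top of the learned tree as text, where each indented line is an internal node's feature test and each "class:" line is a leaf:

from sklearn.tree import export_text

# Print only the first two levels of the tuned decision tree
print(export_text(tuned_tree, feature_names=list(X_train.columns), max_depth=2))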


V. Attached source program

import warnings
import pydotplus
from io import StringIO
from IPython.display import SVG
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, cross_val_score
import collections
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 8)
warnings.filterwarnings('ignore')
def create_df(dic, feature_list):
    out = pd.DataFrame(dic)
    out = pd.concat([out, pd.get_dummies(out[feature_list])], axis=1)
    out.drop(feature_list, axis=1, inplace=True)
    return out

def intersect_features(train, test):
    common_feat = list(set(train.keys()) & set(test.keys()))
    return train[common_feat], test[common_feat]
features = ['Looks', 'Alcoholic_beverage', 'Eloquence', 'Money_spent']
df_train = {}
df_train['Looks'] = ['handsome', 'handsome', 'handsome', 'repulsive',
                     'repulsive', 'repulsive', 'handsome']
df_train['Alcoholic_beverage'] = [
    'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes']
df_train['Eloquence'] = ['high', 'low', 'average', 'average', 'low',
                         'high', 'average']
df_train['Money_spent'] = ['lots', 'little', 'lots', 'little', 'lots',
                           'lots', 'lots']
df_train['Will_go'] = LabelEncoder().fit_transform(
    ['+', '-', '+', '-', '-', '+', '+'])

df_train = create_df(df_train, features)
df_train
df_test = {}
df_test['Looks'] = ['handsome', 'handsome', 'repulsive']
df_test['Alcoholic_beverage'] = ['no', 'yes', 'yes']
df_test['Eloquence'] = ['average', 'high', 'average']
df_test['Money_spent'] = ['lots', 'little', 'lots']
df_test = create_df(df_test, features)
df_test
y = df_train['Will_go']
df_train, df_test = intersect_features(train=df_train, test=df_test)
df_train
df_test
dt = DecisionTreeClassifier(criterion='entropy', random_state=17)
dt.fit(df_train, y)
tree_str = export_graphviz(
    dt, feature_names=df_train.columns, out_file=None, filled=True)
graph = pydotplus.graph_from_dot_data(tree_str)
SVG(graph.create_svg())
balls = [1 for i in range(9)] + [0 for i in range(11)]
balls_left = [1 for i in range(8)] + [0 for i in range(5)]
balls_right = [1 for i in range(1)] + [0 for i in range(6)]
from math import log
def entropy(a_list):
    lst = list(a_list)
    size = len(lst) * 1.0
    entropy = 0
    set_elements = len(set(lst))
    if set_elements in [0, 1]:
        return 0
    for i in set(lst):
        occ = lst.count(i)
        entropy -= occ/size * log(occ/size, 2)
    return entropy
entropy(balls_left) 
entropy([1, 2, 3, 4, 5, 6])
def information_gain(root, left, right):
    """root - initial data; left and right - the two partitions of root"""
    return entropy(root) - 1.0 * len(left) / len(root) * entropy(left) \
                         - 1.0 * len(right) / len(root) * entropy(right)
information_gain(balls, balls_left, balls_right)
def best_feature_to_split(X, y):
    """Information gain for each feature split"""
    out = []
    for i in X.columns:
        out.append(information_gain(y, y[X[i] == 0], y[X[i] == 1]))
    return out
def btree(X, y):
    clf = best_feature_to_split(X, y)
    param = clf.index(max(clf))
    ly = y[X.iloc[:, param] == 0]
    ry = y[X.iloc[:, param] == 1]
    print('Column_' + str(param) + ' N/Y? ')
    print('Entropy: ', entropy(ly), entropy(ry))
    print('N count:', ly.count(), '/', 'Y count:', ry.count())
    if entropy(ly) != 0:
        left = X[X.iloc[:, param] == 0]
        btree(left, ly)
    if entropy(ry) != 0:
        right = X[X.iloc[:, param] == 1]
        btree(right, ry)
best_feature_to_split(df_train, y)
btree(df_train, y)

# Construct the Adult dataset decision tree
data_train = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult_train.csv', sep='; ')
data_train.tail()
data_test = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult_test.csv', sep='; ')
data_test.tail()
# Remove incorrect rows from the test set
data_test = data_test[(data_test['Target'] == ' >50K.')
                      | (data_test['Target'] == ' <=50K.')]
# Encode the target as 0 and 1
data_train.loc[data_train['Target'] == ' <=50K', 'Target'] = 0
data_train.loc[data_train['Target'] == ' >50K', 'Target'] = 1
data_test.loc[data_test['Target'] == ' <=50K.', 'Target'] = 0
data_test.loc[data_test['Target'] == ' >50K.', 'Target'] = 1
data_test.describe(include='all').T
data_train['Target'].value_counts()
fig = plt.figure(figsize=(25, 15))
cols = 5
rows = np.ceil(float(data_train.shape[1]) / cols)
for i, column in enumerate(data_train.columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    if data_train.dtypes[column] == object:
        data_train[column].value_counts().plot(kind="bar", axes=ax)
    else:
        data_train[column].hist(axes=ax)
        plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=0.7, wspace=0.2)
data_train.dtypes
data_test.dtypes
data_test['Age'] = data_test['Age'].astype(int)
data_test['fnlwgt'] = data_test['fnlwgt'].astype(int)
data_test['Education_Num'] = data_test['Education_Num'].astype(int)
data_test['Capital_Gain'] = data_test['Capital_Gain'].astype(int)
data_test['Capital_Loss'] = data_test['Capital_Loss'].astype(int)
data_test['Hours_per_week'] = data_test['Hours_per_week'].astype(int)
categorical_columns = [c for c in data_train.columns
                       if data_train[c].dtype.name == 'object']
numerical_columns = [c for c in data_train.columns
                     if data_train[c].dtype.name != 'object']
print('categorical_columns:', categorical_columns)
print('numerical_columns:', numerical_columns)
for c in categorical_columns:
    data_train[c].fillna(data_train[c].mode()[0], inplace=True)
    data_test[c].fillna(data_train[c].mode()[0], inplace=True)

for c in numerical_columns:
    data_train[c].fillna(data_train[c].median(), inplace=True)
    data_test[c].fillna(data_train[c].median(), inplace=True)
data_train = pd.concat([data_train[numerical_columns],
                        pd.get_dummies(data_train[categorical_columns])], axis=1)

data_test = pd.concat([data_test[numerical_columns],
                       pd.get_dummies(data_test[categorical_columns])], axis=1)
set(data_train.columns) - set(data_test.columns)
data_train.shape, data_test.shape
data_test['Country_ Holand-Netherlands'] = 0
set(data_train.columns) - set(data_test.columns)
data_train.head(2)
data_test.head(2)
X_train = data_train.drop(['Target'], axis=1)
y_train = data_train['Target']

X_test = data_test.drop(['Target'], axis=1)
y_test = data_test['Target']
tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)
tree_predictions = tree.predict(X_test)
accuracy_score(y_test, tree_predictions)
tree_params = {'max_depth': range(8, 11)}

locally_best_tree = GridSearchCV(DecisionTreeClassifier(random_state=17),
                                 tree_params, cv=5)

locally_best_tree.fit(X_train, y_train)
print("Best params:", locally_best_tree.best_params_)
print("Best cross validation score", locally_best_tree.best_score_)
tuned_tree = DecisionTreeClassifier(max_depth=9, random_state=17)
tuned_tree.fit(X_train, y_train)
tuned_tree_predictions = tuned_tree.predict(X_test)
accuracy_score(y_test, tuned_tree_predictions)
rf = RandomForestClassifier(n_estimators=100, random_state=17)
rf.fit(X_train, y_train)
forest_predictions = rf.predict(X_test) 
accuracy_score(y_test, forest_predictions)