Original link: tecdat.cn/?p=9326
Original source: Tuoduan Data Tribe WeChat public account
In this article, I will use decision trees (for classification) in Python. Emphasis will be placed on the basics and understanding of the final decision tree.
Imports
So first, let’s do some imports.
from __future__ import print_function
import os
import subprocess
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
Data
Next, we need some data to work with. I will use the well-known iris data set, which contains a variety of measurements for several different iris species. Both pandas and scikit-learn can import this data easily, and I'll use pandas to write a function that imports it from a CSV file. The purpose is to demonstrate how scikit-learn can be used together with pandas. So, we define a function to get the iris data:
def get_iris_data():
    """Get the iris data, from a local csv or the pandas repo."""
    if os.path.exists("iris.csv"):
        print("-- iris.csv found locally")
        df = pd.read_csv("iris.csv", index_col=0)
    else:
        print("-- trying to download from github")
        fn = ("https://raw.githubusercontent.com/pydata/pandas/"
              "master/pandas/tests/data/iris.csv")
        try:
            df = pd.read_csv(fn)
        except Exception:
            exit("-- Unable to download iris.csv")
        with open("iris.csv", 'w') as f:
            print("-- writing to local iris.csv file")
            df.to_csv(f)
    return df
- This function first tries to read the data locally, using os.path.exists(). If iris.csv is found in the local directory, it is read with pd.read_csv().
- If no local iris.csv is found, the data is downloaded from the URL and a local copy is written.
The next step is to get the data and look at it with the head() and tail() methods. So, first get the data:
df = get_iris_data()
-- iris.csv found locally
And then:
print("* df.head()", df.head(), sep="\n", end="\n\n")
print("* df.tail()", df.tail(), sep="\n", end="\n\n")

* df.head()
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

* df.tail()
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica
From this information, we can discuss our goal: to predict iris types given features SepalLength, SepalWidth, PetalLength, and PetalWidth.
Preprocessing
To pass this data to scikit-learn, we need to encode the Name column as integers. To do this, we'll write another function, encode_target, that returns the modified data frame along with a list of the target (class) names:
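The function body is not shown in the post; the signature matches the call used later (encode_target(df, "Name")), but the body below is a minimal reconstruction, assuming it simply maps each class name to an integer:

def encode_target(df, target_column):
    """Add a numeric 'Target' column encoding target_column.

    Returns the modified data frame and a list of the target
    (class) names.
    """
    df_mod = df.copy()  # work on a copy, leave the input untouched
    targets = df_mod[target_column].unique()
    map_to_int = {name: n for n, name in enumerate(targets)}
    df_mod["Target"] = df_mod[target_column].map(map_to_int)
    return (df_mod, targets)

df2, targets = encode_target(df, "Name")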
Let’s see what we have:
* df2.head()
Target Name
0 0 Iris-setosa
1 0 Iris-setosa
2 0 Iris-setosa
3 0 Iris-setosa
4 0 Iris-setosa
* df2.tail()
Target Name
145 2 Iris-virginica
146 2 Iris-virginica
147 2 Iris-virginica
148 2 Iris-virginica
149 2 Iris-virginica
* targets
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Next, we get the feature column names:
features = list(df2.columns[:4])
print("* features:", features, sep="\n")
* features:
['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
Fitting the decision tree with scikit-learn
Now we can use the DecisionTreeClassifier imported above to fit a decision tree, as follows:
- We use simple indexing to extract the X and y data from the data frame.
- The decision tree is initialized with two parameters: min_samples_split=20 requires at least 20 samples in a node for it to be split, and random_state=99 seeds the random number generator.
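The fitting code itself is not shown in the post; a minimal sketch consistent with the two bullets above:

y = df2["Target"]
X = df2[features]
# min_samples_split=20: a node must contain at least 20 samples to be split
# random_state=99: seed the random number generator for reproducible trees
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(X, y)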
Visualizing the tree
We can generate a graphic with the visualize_tree function below, which:
- writes a dot file using the export_graphviz method imported from scikit-learn above; this file is used to generate the graphic,
- generates the PNG image dt.png.
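The function body is not included in the post; the sketch below follows the two steps above. It assumes the graphviz dot executable is installed and on the PATH; the fn argument (defaulting to "dt", hence dt.png) is inferred from the calls that appear later in the post:

def visualize_tree(tree, feature_names, fn="dt"):
    """Create a PNG of the tree using graphviz's dot command-line tool."""
    dotfile = fn + ".dot"
    pngfile = fn + ".png"
    # step 1: write a dot file with scikit-learn's export_graphviz
    with open(dotfile, "w") as f:
        export_graphviz(tree, out_file=f, feature_names=feature_names)
    # step 2: call dot to render the PNG image
    try:
        subprocess.check_call(["dot", "-Tpng", dotfile, "-o", pngfile])
    except Exception:
        exit("Could not run dot (graphviz) to produce the visualization")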
Run the function:
visualize_tree(dt, features)
Results
We can use this figure to understand the patterns the decision tree found:
- All data (all rows) starts at the top of the tree.
- All features are considered to see how to split the data in the most useful way – using the Gini metric by default.
- At the top, we see that the most useful condition is PetalLength <= 2.4500.
- This splitting continues until:
- only one class remains after a split, or
- the resulting node has fewer than 20 samples (the min_samples_split setting).
Pseudocode for a decision tree
Finally, we consider generating pseudocode that represents the learned decision tree, using a get_code function (sketched below).
- The target names can be passed to the function and are included in the output.
- The spacer_base argument makes the output easier to read.
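The body of get_code is not shown in the post; the sketch below reconstructs it from the described behaviour and from the output format that follows (nested if/else blocks with "return <class> ( N examples )" leaves). Walking the fitted tree's tree_ arrays is standard scikit-learn, but the exact original implementation is an assumption:

def get_code(tree, feature_names, target_names, spacer_base="    "):
    """Print pseudocode for a fitted decision tree."""
    left = tree.tree_.children_left
    right = tree.tree_.children_right
    threshold = tree.tree_.threshold
    # leaves have feature == -2; the name looked up for them is never printed
    features = [feature_names[i] for i in tree.tree_.feature]
    # note: older scikit-learn stores raw class counts in tree_.value;
    # newer versions may store per-node fractions, which would need rescaling
    value = tree.tree_.value

    def recurse(node, depth):
        spacer = spacer_base * depth
        if threshold[node] != -2:  # internal node: print the split
            print(spacer + "if ( " + features[node] + " <= " +
                  str(threshold[node]) + " ) {")
            recurse(left[node], depth + 1)
            print(spacer + "}\n" + spacer + "else {")
            recurse(right[node], depth + 1)
            print(spacer + "}")
        else:  # leaf: print every class with samples at this node
            target = value[node]
            for i, v in zip(np.nonzero(target)[1],
                            target[np.nonzero(target)]):
                print(spacer + "return " + str(target_names[i]) +
                      " ( " + str(int(v)) + " examples )")

    recurse(0, 0)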
Applied to the iris data, the output is:
get_code(dt, features, targets)

if ( PetalLength <= 2.45000004768 ) {
    return Iris-setosa ( 50 examples )
}
else {
    if ( PetalWidth <= 1.75 ) {
        if ( PetalLength <= 4.94999980927 ) {
            if ( PetalWidth <= 1.65000009537 ) {
                return Iris-versicolor ( 47 examples )
            }
            else {
                return Iris-virginica ( 1 examples )
            }
        }
        else {
            return Iris-versicolor ( 2 examples )
            return Iris-virginica ( 4 examples )
        }
    }
    else {
        if ( PetalLength <= 4.85000038147 ) {
            return Iris-versicolor ( 1 examples )
            return Iris-virginica ( 2 examples )
        }
        else {
            return Iris-virginica ( 43 examples )
        }
    }
}
Compare this to the graphic output above – this is just a different representation of the decision tree.
Cross-validation of decision trees in Python
Imports
First, all of the imports:
from __future__ import print_function
import os
import subprocess
from time import time
from operator import itemgetter
from scipy.stats import randint
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
# GridSearchCV, RandomizedSearchCV and cross_val_score now live in
# sklearn.model_selection (older scikit-learn releases had them in
# sklearn.grid_search and sklearn.cross_validation)
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
The main additions are GridSearchCV and RandomizedSearchCV from sklearn.model_selection (sklearn.grid_search in older scikit-learn releases), plus some helpers:
- time, to time the searches,
- itemgetter, to sort results,
- scipy.stats.randint, to generate random integers for the parameter distributions.
Now we can start writing the functions. These include the ones from the first part of the post:
- get_code – write pseudocode for the decision tree,
- visualize_tree – generate a graphic of the decision tree,
- encode_target – process the raw data for use with scikit-learn,
- get_iris_data – fetch iris.csv from the network if needed and write a copy to the local directory.
New functions
Next, let's add some new functions to run grid and random searches and to report on the best parameters found. The first is report. This function takes the output of a grid or random search, prints a report of the top models and returns the best parameter settings.
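The body of report is not shown in the post; here is a sketch written against the modern cv_results_ interface (an assumption: the original post predates cv_results_ and used the older grid_scores_ attribute):

def report(cv_results, n_top=3):
    """Print the n_top best-ranked models of a parameter search and
    return the best parameter settings found."""
    for rank in range(1, n_top + 1):
        for i in np.flatnonzero(cv_results["rank_test_score"] == rank):
            print("Model with rank: {0}".format(rank))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                cv_results["mean_test_score"][i],
                cv_results["std_test_score"][i]))
            print("Parameters: {0}\n".format(cv_results["params"][i]))
    best_idx = np.flatnonzero(cv_results["rank_test_score"] == 1)[0]
    return cv_results["params"][best_idx]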
Grid search
Next is run_gridsearch. This function takes
- the features X,
- the targets y,
- a (decision tree) classifier clf,
- a dictionary of parameters to try, param_grid,
- the number of cross-validation folds cv, default 5.
param_grid is the set of parameter settings that will be tested; be careful not to list too many choices, since every combination is tried.
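The body of run_gridsearch is likewise not shown; a sketch under the same assumptions:

def run_gridsearch(X, y, clf, param_grid, cv=5):
    """Exhaustive search over param_grid with cv-fold cross-validation;
    returns the best parameter settings found."""
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=cv)
    start = time()
    grid_search.fit(X, y)
    print(("\nGridSearchCV took {:.2f} seconds "
           "for {:d} candidate parameter settings.").format(
        time() - start, len(grid_search.cv_results_["params"])))
    return report(grid_search.cv_results_, 3)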
Random search
Next comes the run_randomsearch function, which samples parameters from a specified list or distribution. Like the grid search, its arguments are:
- the features X,
- the targets y,
- a (decision tree) classifier clf,
- the number of cross-validation folds cv, default 5,
- the number of random parameter settings to try, n_iter_search, default 20.
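Again a sketch, since the body is not included in the post:

def run_randomsearch(X, y, clf, param_dist, cv=5, n_iter_search=20):
    """Randomized search over param_dist with cv-fold cross-validation;
    returns the best parameter settings found."""
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                       n_iter=n_iter_search, cv=cv)
    start = time()
    random_search.fit(X, y)
    print(("\nRandomizedSearchCV took {:.2f} seconds "
           "for {:d} candidate parameter settings.").format(
        time() - start, n_iter_search))
    return report(random_search.cv_results_, 3)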
Ok, so we’ve defined all the functions.
Cross-validation
Getting the data
Next, let's use the search methods set up above to find suitable parameter settings. We start with some preliminary preparation, getting the data and building the target data:
print("\n-- get data:")
df = get_iris_data()
print("")
features = ["SepalLength", "SepalWidth",
"PetalLength", "PetalWidth"]
df, targets = encode_target(df, "Name")
y = df["Target"]
X = df[features]
-- get data:
-- iris.csv found locally
First cross-validation
In all of the examples below, I will use 10-fold cross-validation, which:
- divides the data into 10 parts,
- fits the model on 9 of the parts,
- tests the accuracy on the remaining part.
This is repeated for all 10 combinations of parts, with the current parameter settings, producing 10 estimates of model accuracy. The mean and standard deviation of the 10 scores are usually reported.
print("-- 10-fold cross-validation "
"[using setup from previous post]")
dt_old = DecisionTreeClassifier(min_samples_split=20,
random_state=99)
dt_old.fit(X, y)
scores = cross_val_score(dt_old, X, y, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(),
scores.std()),
end="\n\n" )
-- 10-fold cross-validation [using setup from previous post]
mean: 0.960 (std: 0.033)
0.960 is not bad. It means that the mean accuracy (the percentage of correct classifications by the trained model) is 96%. That is already quite accurate, but let's see if we can find better parameters.
Applying grid search
First, I'll try a grid search. The dictionary param_grid provides the different parameter settings to test:
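The dictionary itself is not printed in the post; a grid consistent with the reported output below (288 = 2 x 3 x 4 x 3 x 4 combinations, covering exactly the parameters that appear in the results) might look like:

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20]}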
print("-- Grid Parameter Search via 10-fold CV")
dt = DecisionTreeClassifier()
ts_gs = run_gridsearch(X, y, dt, param_grid, cv=10)

-- Grid Parameter Search via 10-fold CV

GridSearchCV took 5.02 seconds for 288 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 10, 'max_leaf_nodes': 5, 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1}

Model with rank: 2
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 20, 'max_leaf_nodes': 5, 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1}

Model with rank: 3
Mean validation score: 0.967 (std: 0.033)
Parameters: {'min_samples_split': 10, 'max_leaf_nodes': 5, 'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1}
On most runs, several parameter settings tie with a mean score of 0.967. That's an improvement from 96% to 96.7%! We can inspect the best parameter settings, ts_gs, as follows:
print("\n-- Best Parameters:")
for k, v in ts_gs.items():
print("parameter: {:<20s} setting: {}".format(k, v))
-- Best Parameters:
parameter: min_samples_split setting: 10
parameter: max_leaf_nodes setting: 5
parameter: criterion setting: gini
parameter: max_depth setting: None
parameter: min_samples_leaf setting: 1
And reproduce the cross-validation result:
print("\n\n-- Testing best parameters [Grid]...")
dt_ts_gs = DecisionTreeClassifier(**ts_gs)
scores = cross_val_score(dt_ts_gs, X, y, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(),
                                          scores.std()),
      end="\n\n")

-- Testing best parameters [Grid]...
mean: 0.967 (std: 0.033)
Next, let's generate pseudocode for the best tree:
print("\n-- get_code for best parameters [Grid]:", end="\n\n")
dt_ts_gs.fit(X, y)
get_code(dt_ts_gs, features, targets)

-- get_code for best parameters [Grid]:

if ( PetalWidth <= 0.800000011921 ) {
    return Iris-setosa ( 50 examples )
}
else {
    if ( PetalWidth <= 1.75 ) {
        if ( PetalLength <= 4.94999980927 ) {
            if ( PetalWidth <= 1.65000009537 ) {
                return Iris-versicolor ( 47 examples )
            }
            else {
                return Iris-virginica ( 1 examples )
            }
        }
        else {
            return Iris-versicolor ( 2 examples )
            return Iris-virginica ( 4 examples )
        }
    }
    else {
        return Iris-versicolor ( 1 examples )
        return Iris-virginica ( 45 examples )
    }
}
We can also make a graph of the decision tree:
visualize_tree(dt_ts_gs, features, fn="grid_best")
Applying random search
Next, we try to find good parameters using the random search method. In this example, I use 288 samples so that the same number of parameter settings is tested as in the grid search above:
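The call is not shown in the post; a sketch, with param_dist as an assumed set of distributions mirroring the grid above (scipy.stats.randint supplies the integer distributions):

print("-- Random Parameter Search via 10-fold CV")
param_dist = {"criterion": ["gini", "entropy"],
              "min_samples_split": randint(2, 21),
              "max_depth": randint(1, 21),
              "min_samples_leaf": randint(1, 21),
              "max_leaf_nodes": randint(2, 21)}
dt = DecisionTreeClassifier()
ts_rs = run_randomsearch(X, y, dt, param_dist,
                         cv=10, n_iter_search=288)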
As with the grid search, this typically finds multiple parameter settings with a mean accuracy of 0.967, or 96.7%. The best parameters from the cross-validated random search are:
print("\n-- Best Parameters:")
for k, v in ts_rs.items():
print("parameters: {:<20s} setting: {}".format(k, v))
-- Best Parameters:
parameters: min_samples_split setting: 12
parameters: max_leaf_nodes setting: 5
parameters: criterion setting: gini
parameters: max_depth setting: 19
parameters: min_samples_leaf setting: 1
Also, we can test the best parameters again:
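The code for this check is not shown; a sketch mirroring the grid-search case:

print("\n\n-- Testing best parameters [Random]...")
dt_ts_rs = DecisionTreeClassifier(**ts_rs)
scores = cross_val_score(dt_ts_rs, X, y, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(), scores.std()),
      end="\n\n")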
-- Testing best parameters [Random]...
mean: 0.967 (std: 0.033)
To see what the learned decision tree looks like, we can generate pseudocode for the best random-search result and visualize the tree:
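The get_code call mirrors the grid-search case above:

dt_ts_rs.fit(X, y)
get_code(dt_ts_rs, features, targets)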
visualize_tree(dt_ts_rs, features, fn="rand_best")
Conclusion
We used cross-validated grid search and random search to tune the parameters of a decision tree. In both cases the improvement, from 96% to 96.7%, was small; on more complex problems the payoff can be much larger. A few final notes:
- After the best parameter settings have been found by a cross-validated search, a final model is typically trained on all of the data using those parameters.
- Conventional wisdom holds that, in practice, random search is more efficient than grid search; exhaustive grid searches simply take too long, so this makes sense.
- The basic cross-validation ideas developed here apply to many other scikit-learn models: random forests, logistic regression, SVMs, and so on.