Original link: tecdat.cn/?p=9859

Original source: Tuo End Data Tribe public account

 

Overview

This paper is about tree-based regression and classification methods.

Tree-based methods are simple to understand and useful for interpretation, but they generally cannot compete with the best supervised learning methods in predictive accuracy. We therefore also introduce bagging, random forests, and boosting. Each of these approaches grows many trees and then combines them to produce a single consensus prediction. We will see that combining a large number of trees can greatly improve prediction accuracy, at the expense of some interpretability.

Decision trees can be applied to regression and classification problems. We will consider regression first.

Decision tree basics: regression

Let’s start with a simple example:

We predict the Salary of baseball players.

The result is a series of splitting rules. The first split sends observations with Years < 4.5 to the left branch and the rest to the right. If we code this model, we'll see that the relationship ends up being slightly more complex than that.

library(tree)
library(ISLR)
attach(Hitters)

# remove observations with missing values
Hitters <- na.omit(Hitters)

# Salary is right-skewed; inspect it, then log-transform to correct the distribution
hist(Hitters$Salary)

Hitters$Salary <- log(Hitters$Salary)
hist(Hitters$Salary)
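
The summary below refers to a fitted object called tree.fit, which the original post does not show being created. A minimal sketch, using the formula reported in the summary output itself:

# Fit a regression tree for (log) Salary on Hits and Years
tree.fit <- tree(Salary ~ Hits + Years, data = Hitters)
plot(tree.fit)
text(tree.fit, pretty = 0)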

summary(tree.fit)
## 
## Regression tree:
## tree(formula = Salary ~ Hits + Years, data = Hitters)
## Number of terminal nodes:  7 
## Residual mean deviance:  0.271 = 69.1 / 255 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -2.2400  -0.2980  -0.0365   0.0000   0.3230   2.1500 

Now we discuss how prediction trees are built by partitioning the feature space. The process is roughly as follows:

  1. Find the variable/split point that best separates the response, i.e. the one that yields the lowest RSS.
  2. Split the data into two leaves at this first node.
  3. Within each leaf, again find the best variable/split point, and continue splitting recursively.

The goal is to find the partition into J regions that minimizes the RSS. However, it is computationally infeasible to consider every possible partition of the feature space into J regions, so we take a top-down, greedy approach. It is top-down because we start with all observations in a single region; it is greedy because at each step the best split is chosen for that particular step, rather than looking ahead for a split that would produce a better tree at some later step.

Once all regions are created, we use the average of the training observations in each region to predict the dependent variable of a given test observation.
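
To make the greedy step concrete, here is a small illustrative sketch (not from the original post) that finds the single best cut point for one numeric predictor by minimizing the RSS of the two resulting regions; the prediction within each region is simply the mean of its training observations:

# One greedy step of recursive binary splitting:
# try every candidate cut point of x and keep the one with the lowest RSS.
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints between observed values
  rss  <- sapply(cuts, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(rss)], rss = min(rss))
}

# Example: the best first split of (log) Salary on Years
best_split(Hitters$Years, Hitters$Salary)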

Pruning

Although the model above can produce good predictions on the training data, the basic tree approach tends to overfit, leading to poor test performance, because the resulting tree is often too complex. A smaller tree with fewer splits usually has lower variance, is easier to interpret, and gives lower test error, at the cost of a little bias. One possible approach is to grow the tree only as long as the decrease in RSS from each split exceeds some (high) threshold, but this is short-sighted: a split that seems worthless early on may be followed by a very good split later.

Therefore, a better strategy is to grow a large tree and then prune it back to obtain a good subtree.

The cost complexity pruning algorithm, also known as weakest link pruning, provides a way to do this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a non-negative tuning parameter alpha.
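
For reference (the formula is not printed in the original post), for each value of alpha the corresponding subtree T minimizes

\[
\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha \, |T| ,
\]

where |T| is the number of terminal nodes, R_m is the region belonging to the m-th terminal node, and \hat{y}_{R_m} is the mean of the training responses in that region. Larger values of alpha penalize complexity more heavily and therefore select smaller subtrees.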

 


trees <- tree(Salary ~ ., data = train)
plot(trees)
text(trees, pretty = 0)
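
The trees object above is fit on a training set called train, and the plot below uses a cross-validated object called cv.trees; neither is shown being created in the original post. A minimal sketch, assuming a simple 50/50 random split of Hitters and the default deviance-based cross-validation from the tree package:

# Assumed train/test split (the original split is not shown)
set.seed(1)
idx   <- sample(nrow(Hitters), nrow(Hitters) %/% 2)
train <- Hitters[idx, ]
test  <- Hitters[-idx, ]

# 10-fold cross-validation over the cost-complexity sequence
cv.trees <- cv.tree(trees)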


plot(cv.trees)

The tree with seven terminal nodes appears to have the lowest cross-validated deviance. However, a tree of that size is barely pruned at all, so we can choose a smaller tree and trade a little bias for a simpler, lower-variance model; here we prune to about four terminal nodes.

prune.trees <- prune.tree(trees, best = 4)
plot(prune.trees)
text(prune.trees, pretty = 0)

Use the pruned tree to make predictions on the test set.
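
The yhat vector used below holds the pruned tree's predictions on the test set; a minimal sketch (object names follow the split sketch above):

yhat <- predict(prune.trees, newdata = test)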

mean((yhat - test$Salary)^2)
## [1] 0.3531

Classification tree

Classification trees are very similar to regression trees, except that the response they predict is qualitative rather than quantitative.

To grow a classification tree we use the same recursive binary splitting, but RSS can no longer serve as the splitting criterion. A natural alternative is the classification error rate; however, although intuitive, it turns out not to be sufficiently sensitive for growing the tree.

In practice, two other measures are preferable, even though they are numerically quite similar to each other:

The Gini index is a measure of the total variance across the K classes.

The cross-entropy takes a value near zero if the proportions of training observations in each class within a given node are all close to zero or one.
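
For reference (these standard definitions are not printed in the original post), writing \hat{p}_{mk} for the proportion of training observations in node m that belong to class k:

\[
G_m = \sum_{k=1}^{K} \hat{p}_{mk} \, (1 - \hat{p}_{mk}),
\qquad
D_m = - \sum_{k=1}^{K} \hat{p}_{mk} \, \log \hat{p}_{mk} .
\]

Both the Gini index G_m and the cross-entropy D_m are small when the node is pure, that is, when every \hat{p}_{mk} is close to 0 or 1.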

Either of these two measures is typically used to evaluate splits when growing the tree; when pruning, however, the classification error rate is preferable if the prediction accuracy of the final pruned tree is the goal.

To demonstrate this, we will use the Heart data set. It contains the binary outcome AHD for 303 patients who presented with chest pain, coded Yes or No according to whether heart disease is present.
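
The Heart data are not included in the ISLR package; a minimal sketch of loading them, assuming the Heart.csv file that accompanies the ISLR book has been downloaded to the working directory:

# File name and location are assumptions
Heart <- read.csv("Heart.csv", stringsAsFactors = TRUE)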

dim(Heart)
[1] 303 15

 

So far this is a very complicated tree. Let's use cross-validation with the misclassification score to determine whether a pruned version improves the fit.
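
The trees and cv.trees objects used below are not shown being created in the original post. A minimal sketch, assuming a roughly 50/50 train/test split of Heart and misclassification-based cross-validation (consistent with the $method element of the output below):

# Assumed split; a test set of 151 patients matches the confusion matrix totals later on
set.seed(2)
idx   <- sample(nrow(Heart), 152)
train <- Heart[idx, ]
test  <- Heart[-idx, ]

# Fit the classification tree, dropping the row-index column (assumed to be named X)
trees <- tree(AHD ~ . - X, data = train)

# Cross-validate using the misclassification error rate as the pruning criterion
cv.trees <- cv.tree(trees, FUN = prune.misclass)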

cv.trees
## $size
## [1] 16  9  5  3  2  1
## 
## $dev
## [1] 44 45 42 41 41 81
## 
## $k
## [1] -Inf  0.0  1.0  2.5  5.0 37.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

It looks like a tree with about four terminal nodes has the lowest cross-validated misclassification error. Let's see what that tree looks like; again, we prune with prune.misclass.

prune.trees <- prune.misclass(trees, best = 4)
plot(prune.trees)
text(prune.trees, pretty = 0)
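
The confusion matrix below looks like output from the caret package's confusionMatrix() function; a minimal sketch of producing it from the pruned tree's test-set predictions (object names are assumptions):

library(caret)

# Predict the class of each test patient and compare with the truth
tree.pred <- predict(prune.trees, test, type = "class")
confusionMatrix(tree.pred, test$AHD)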

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  72  24
##        Yes 10  45
## 
##                Accuracy : 0.775
##                  95% CI : (0.7, 0.839)
##     No Information Rate : 0.543
##     P-Value [Acc > NIR] : 2.86e-09
## 
##                   Kappa : 0.539
##  Mcnemar's Test P-Value : 0.0258
## 
##             Sensitivity : 0.878
##             Specificity : 0.652
##          Pos Pred Value : 0.750
##          Neg Pred Value : 0.818
##              Prevalence : 0.543
##          Detection Rate : 0.477
##    Detection Prevalence : 0.636
##       Balanced Accuracy : 0.765
## 
##        'Positive' Class : No

Here we achieve an accuracy of roughly 78% on the test set.

So why split at all? Each split increases node purity, which can translate into better predictions on test data.

Trees and linear models

The best model always depends on the problem at hand. If the relationship can be approximated by a linear model, linear regression is likely to dominate. Conversely, if we have complex, highly non-linear relationships between features and y, the decision tree may outperform the traditional approach.

Advantages/Disadvantages

Advantages:

  • Trees are easier to explain than linear regression.
  • They more closely mirror human decision-making.
  • They are easy to display graphically.
  • They can handle qualitative predictors without creating dummy variables.

Disadvantages:

  • Trees generally do not achieve the same predictive accuracy as other methods, but ensemble approaches such as bagging, random forests, and boosting can substantially improve performance.

Other examples
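
The summary output below refers to a classification tree called sales.tree fit to the Carseats data from the ISLR package. A minimal sketch of how it could have been produced; the High/Low response, the threshold of 8 (thousand units sold), and the 241-observation training split are assumptions inferred from the output:

# Turn unit sales into a binary High/Low response (threshold is an assumption)
Carseats$High <- factor(ifelse(Carseats$Sales > 8, "High", "Low"))

# Training set of 241 stores; the remaining 159 are held out for testing
set.seed(3)
idx   <- sample(nrow(Carseats), 241)
train <- Carseats[idx, ]
test  <- Carseats[-idx, ]

# Fit the classification tree, excluding Sales itself, and summarize it
sales.tree <- tree(High ~ . - Sales, data = train)
summary(sales.tree)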

 

## Variables actually used in tree construction:
## [1] "Price"       "CompPrice"   "Age"         "Income"      "ShelveLoc"   "Advertising"
## Number of terminal nodes:  19
## Residual mean deviance:  0.414 = 92 / 222
## Misclassification error rate: 0.0996 = 24 / 241

Here we see that the training error rate is about 10%. We use plot() to display the tree structure and text() to display the node labels.

plot(sales.tree)
text(sales.tree, pretty = 0)

Let’s look at how the full tree handles the test data.
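
A minimal sketch of producing the test-set confusion matrix shown below, assuming the caret package and the train/test split from the sketch above:

library(caret)

# Predict High/Low for the held-out stores and tabulate against the truth
full.pred <- predict(sales.tree, test, type = "class")
confusionMatrix(full.pred, test$High)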

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High   56  12
##       Low    23  68
## 
##                Accuracy : 0.78
##                  95% CI : (0.707, 0.842)
##     No Information Rate : 0.503
##     P-Value [Acc > NIR] : 6.28e-13
## 
##                   Kappa : 0.559
##  Mcnemar's Test P-Value : 0.091
## 
##             Sensitivity : 0.709
##             Specificity : 0.850
##          Pos Pred Value : 0.824
##          Neg Pred Value : 0.747
##              Prevalence : 0.497
##          Detection Rate : 0.352
##    Detection Prevalence : 0.428
##       Balanced Accuracy : 0.779
## 
##        'Positive' Class : High

A test accuracy of about 78% is decent, but we may be able to improve it with cross-validation.

Here we see that the lowest misclassification error is achieved at four terminal nodes, so we prune the tree back to that size (see the sketch below).
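
A minimal sketch of the cross-validation and pruning step (object names are assumptions):

# Cross-validate on misclassification error, prune to four terminal nodes,
# and re-evaluate the pruned tree on the test set
cv.sales    <- cv.tree(sales.tree, FUN = prune.misclass)
prune.sales <- prune.misclass(sales.tree, best = 4)
confusionMatrix(predict(prune.sales, test, type = "class"), test$High)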

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High   52  20
##       Low    27  60
## 
##                Accuracy : 0.704
##                  95% CI : (0.627, 0.774)
##     No Information Rate : 0.503
##     P-Value [Acc > NIR] : 2.02e-07
## 
##                   Kappa : 0.408
##  Mcnemar's Test P-Value : 0.381
## 
##             Sensitivity : 0.658
##             Specificity : 0.750
##          Pos Pred Value : 0.722
##          Neg Pred Value : 0.690
##              Prevalence : 0.497
##          Detection Rate : 0.327
##    Detection Prevalence : 0.453
##       Balanced Accuracy : 0.704
## 
##        'Positive' Class : High

This didn’t really improve our classification, but we greatly simplified the model.
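
The output below appears to come from fitting a CART model with the caret package; a minimal sketch of a call that could produce this kind of summary (the control settings, tuning behaviour, and seed are assumptions):

library(caret)

# 10-fold cross-validation, selecting the complexity parameter cp by ROC
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(4)
caret.tree <- train(High ~ . - Sales, data = train,
                    method = "rpart", metric = "ROC", trControl = ctrl)
caret.tree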

## CART 
## 
## 241 samples
##  10 predictors
##   2 classes: 'High', 'Low' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 217, 217, 216, 217, 217, 217, ... 
## 
## Resampling results across tuning parameters:
## 
##   cp    ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   0.06  0.7  0.7   0.7   0.1     0.2      0.1
##   0.1   0.6  0.7   0.6   0.2     0.2      0.2
##   0.4   0.5  0.3   0.8   0.09    0.3      0.3
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.06.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High   56  21
##       Low    23  59
## 
##                Accuracy : 0.723
##                  95% CI : (0.647, 0.791)
##     No Information Rate : 0.503
##     P-Value [Acc > NIR] : 1.3e-08
## 
##                   Kappa : 0.446
##  Mcnemar's Test P-Value : 0.88
## 
##             Sensitivity : 0.709
##             Specificity : 0.738
##          Pos Pred Value : 0.727
##          Neg Pred Value : 0.720
##              Prevalence : 0.497
##          Detection Rate : 0.352
##    Detection Prevalence : 0.484
##       Balanced Accuracy : 0.723
## 
##        'Positive' Class : High

The prediction accuracy was reduced by choosing simpler trees.

