PytLab is a columnist for the Python Chinese community, working mainly on applications in scientific computing and high-performance computing; his main languages are Python, C, and C++. He is familiar with numerical algorithms (optimization methods, Monte Carlo algorithms, etc.), parallelization (multi-threaded and multi-process parallelism with MPI, OpenMP, etc.), and Python optimization techniques, and often writes C++ extensions for Python. Zhihu column: Chemical dog code brick daily

blog:pytlab.org

github:github.com/PytLab

Preface

Recently I have begun to focus on applications of these algorithms, so with this summary the study of algorithm principles comes to a pause. This article mainly covers the implementation of decision trees for regression problems, including regression trees and model trees, introduces pre-pruning and post-pruning techniques (and their implementation) to prevent tree overfitting, and finally compares the regression tree with standard linear regression.

Body

In a previous article I summarized using decision trees for class prediction. Intuitively, a tree structure is the most natural way to handle classification: we recursively select the best splitting feature in the data, split the training data on it to grow the tree, and stop when a terminating condition is reached, yielding a trained decision tree. We can then inspect the trained model visually and use it to classify data.

In general, the feature-selection and splitting methods for decision trees include the ID3, C4.5, and C5.0 algorithms and CART. "Machine Learning Algorithm Practice - Decision Tree" introduced the ID3 and C4.5 algorithms and used ID3 to handle a classification problem. This article mainly uses decision trees to solve regression problems, based on the CART (Classification And Regression Trees) algorithm.

CART

CART is a binary recursive partitioning technique. It divides the current sample set into two sub-sample sets (for classification, the Gini index is used as the splitting criterion), so that every non-leaf node has exactly two branches. The decision tree generated by CART is therefore a binary tree with a simple structure.

A classification tree is used when the target variable is discrete: the binary tree partitions the data into discrete classes. A regression tree is used when the target variable is continuous: the tree is grown by choosing a value of the optimal splitting feature and dividing the data into the samples greater than and the samples less than that value.

Feature selection and optimal split point selection

When we use a decision tree for regression, we repeatedly select one value of one feature as a split point and generate subtrees. The selection criterion is to make the two resulting partitions of the data as pure as possible.

1. For discrete data, we can determine the best split point by computing the change in Gini impurity between the two partitions.
2. For continuous variables, we compute the least-squares residual, i.e. we select the feature and split value that minimize the variance of the split data. Intuitively, we want the target values within each of the two partitions to be as similar as possible.

Termination conditions for tree splitting

With a method for selecting split features and optimal split points, the tree can be grown accordingly, but what are the termination conditions for splitting?

1. All target values in the node are identical; since they are already the same, there is no need to split further.
2. The depth of the tree reaches the pre-specified maximum.
3. The error (variance) reduction achieved by the best split is smaller than a preset threshold.
4. The amount of data in the node is smaller than a preset threshold.

Python implementation of regression trees

This section implements a simple regression tree in Python, performs regression on given data, and visualizes the regression curve and tree structure. The complete code can be found at github.com/PytLab/MLBo…

The first part is data loading. All the test data here come from Machine Learning in Action; they have a regular format and are loaded in the same way. Because this is tree regression, the independent variables and the dependent variable are kept in the same two-dimensional array:
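A minimal loader along these lines (a sketch; the tab-separated format follows the Machine Learning in Action data files, and the function name is an assumption):

```python
def load_data(filename):
    """Load tab-separated data; each row holds the features followed by the target value."""
    dataset = []
    with open(filename, 'r') as f:
        for line in f:
            # Independent and dependent variables stay together in one 2-D array (list of rows).
            dataset.append([float(val) for val in line.strip().split('\t')])
    return dataset
```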

After the split feature and split value are found, the data need to be divided in order to construct the subtrees or leaf nodes:
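A sketch of the binary split (names are assumptions; rows whose feature value is below the split value go left, the rest go right):

```python
def split_dataset(dataset, feat_idx, value):
    """Split the rows into two subsets by comparing one feature against a value."""
    ldata, rdata = [], []
    for row in dataset:
        if row[feat_idx] < value:
            ldata.append(row)
        else:
            rdata.append(row)
    return ldata, rdata
```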

Next comes the key step: selecting the optimal split feature and split value. Here we exhaustively search for the split point that minimizes the variance of the data after splitting:
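One way to write that exhaustive search (a sketch; the function and parameter names other than err_tolerance and n_tolerance, which the article mentions, are assumptions):

```python
import numpy as np

def choose_best_feature(dataset, fleaf, ferr, opt):
    """Try every (feature, value) pair and keep the split with the lowest total error."""
    data = np.asarray(dataset, dtype=float)
    m, n = data.shape
    err_tolerance, n_tolerance = opt['err_tolerance'], opt['n_tolerance']
    err = ferr(data)  # error before splitting
    best_feat_idx, best_feat_val, best_err = 0, 0, float('inf')
    for feat_idx in range(n - 1):          # the last column is the target
        for feat_val in set(data[:, feat_idx]):
            ldata, rdata = split_dataset(data.tolist(), feat_idx, feat_val)
            if len(ldata) < n_tolerance or len(rdata) < n_tolerance:
                continue                   # split leaves too few samples on one side
            new_err = ferr(ldata) + ferr(rdata)
            if new_err < best_err:
                best_feat_idx, best_feat_val, best_err = feat_idx, feat_val, new_err
    # Stop splitting if the error reduction is too small (or no valid split was found):
    # signal this by returning None and a leaf value instead of a split.
    if err - best_err < err_tolerance:
        return None, fleaf(dataset)
    return best_feat_idx, best_feat_val
```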

There are two conditions under which we stop selecting splits: one is when the size of either sub-dataset after splitting is smaller than a given value; the other is when the variance reduction achieved by the chosen optimal split point is smaller than a given value.

fleaf is a function reference used to create leaf nodes, and it differs between tree types. For the regression tree in this part, a leaf node holds the mean of the data assigned to it, while for the model tree the function returns the regression coefficients fitted on that data. ferr computes the impurity of a data set and also differs between tree models: for regression trees it computes the variance of the data set to judge its purity, while for model trees we need to measure how well a linear model fits the data, namely the sum of squared residuals of the linear model.
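For the regression tree these two callbacks could look as follows (a sketch; the model-tree versions appear later):

```python
import numpy as np

def fleaf(dataset):
    """Regression-tree leaf: the mean of the target (last) column."""
    data = np.asarray(dataset, dtype=float)
    return float(np.mean(data[:, -1]))

def ferr(dataset):
    """Regression-tree impurity: total squared deviation of the target column
    (variance times the number of samples)."""
    data = np.asarray(dataset, dtype=float)
    return float(np.var(data[:, -1]) * data.shape[0])
```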

Then comes the most important part, the function that generates the regression tree. The tree structure is created recursively, and the recursion bottoms out when no new split point can be selected:
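A recursive construction sketch using the helpers above, storing the tree as a nested dict (the key names and the default option values are assumptions):

```python
def create_tree(dataset, fleaf, ferr, opt=None):
    """Recursively build the tree; recursion stops when no valid split is found."""
    if opt is None:
        opt = {'err_tolerance': 1, 'n_tolerance': 4}  # illustrative defaults
    # Choose the best split; feat_idx of None means "make a leaf here".
    feat_idx, value = choose_best_feature(dataset, fleaf, ferr, opt)
    if feat_idx is None:
        return value
    # Split the data and build the two subtrees recursively.
    ldata, rdata = split_dataset(dataset, feat_idx, value)
    return {
        'feat_idx': feat_idx,
        'feat_val': value,
        'left': create_tree(ldata, fleaf, ferr, opt),
        'right': create_tree(rdata, fleaf, ferr, opt),
    }
```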

Using the regression tree to fit data

Here we use ready-made piecewise data as training data to generate the regression tree. All data used in this article can be found at github.com/PytLab/MLBo…

Visualize the data points:

Create the regression tree and visualize it

Looking at this piecewise data, a regression tree should fit it well. We create one:
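A usage sketch with the functions above (the file name 'ex0.txt' is an assumption; the actual data file in the repository may be named differently):

```python
dataset = load_data('ex0.txt')  # assumed file name for the piecewise training data
tree = create_tree(dataset, fleaf, ferr, opt={'n_tolerance': 4, 'err_tolerance': 1})
print(tree)
```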

The resulting regression tree structure, represented as a Python dictionary:

Here I again use Graphviz to visualize the regression tree, similar to the dotify function in the earlier article on classification trees, with the leaf-node labels slightly modified. We can recursively obtain the dot file corresponding to the decision tree; see the implementation of the dotify function at github.com/PytLab/MLBo…

Then get the tree structure diagram:

Generate regression tree picture:

The numbers on each node mean: feature index: feature split value.

Drawing the regression curve of the regression tree

With the regression tree in hand, we can draw its regression curve to see whether it fits the piecewise data well:
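Drawing the curve requires a prediction routine that walks the tree for each query point. A sketch (the tree_predict name and the plotting details are assumptions, and the data is assumed to have one feature column followed by the target):

```python
import matplotlib.pyplot as plt
import numpy as np

def tree_predict(x, tree):
    """Walk the tree until a leaf is reached and return the leaf value."""
    if not isinstance(tree, dict):
        return tree
    if x[tree['feat_idx']] < tree['feat_val']:
        return tree_predict(x, tree['left'])
    return tree_predict(x, tree['right'])

# Evaluate the tree on a dense grid of x values and overlay the curve on the scatter plot.
data = np.asarray(dataset, dtype=float)
x_grid = np.linspace(data[:, 0].min(), data[:, 0].max(), 500)
y_grid = [tree_predict([x], tree) for x in x_grid]

plt.scatter(data[:, 0], data[:, -1], s=10, label='data')
plt.plot(x_grid, y_grid, c='r', label='regression tree')
plt.legend()
plt.show()
```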

Tree pruning

Before introducing tree pruning, the code from the previous part is used to fit two similar groups of data. The visualized data and regression curves are shown below (data file left & data file right):

The distributions of the two data sets are basically the same, yet the regression trees built from them are quite different. The tree for the left data set has only two branches, while the tree for the right one has many; in the extreme case a branch is created for every data point, and the regression tree becomes very large. Below are visualizations of the two regression trees:

If a tree has too many nodes, the model may be overfitting the data. We therefore need to reduce the complexity of the decision tree to avoid overfitting; this process is pruning. Pruning techniques are divided into pre-pruning and post-pruning.

Pre-pruning

Pre-pruning works by adjusting parameters before the decision tree is generated, so that splitting is controlled during tree construction. For example, the function that creates the regression tree above takes an opt parameter containing n_tolerance and err_tolerance, which control when tree splitting stops: raising the minimum amount of data per leaf node or raising the error tolerance makes splitting terminate earlier. When the error-change tolerance is raised to 2000, the regression tree and regression curve look as follows:
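A sketch of the corresponding call, assuming the data for the right-hand plot has been loaded into dataset (the n_tolerance value is illustrative; 2000 is the tolerance mentioned above):

```python
# Loosen the stopping criteria so the tree splits far less often (pre-pruning).
tree_prepruned = create_tree(dataset, fleaf, ferr,
                             opt={'n_tolerance': 4, 'err_tolerance': 2000})
```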

Post-pruning

Pre-pruning requires parameters to be specified in advance, whereas post-pruning is a more attractive technique because it prunes automatically using test data, without user intervention; we do, however, need to write a pruning function to carry it out.

The general idea of post-pruning is: for a subtree, we try to merge its left and right children (nodes) and use the test data to compute the error before and after the merge; if the error after merging is smaller than before, the subtree can be merged.

Collapsing a tree: we collapse a tree recursively by merging its subtrees and returning the tree's average value.
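A sketch of that collapse helper (the function name is an assumption):

```python
def collapse(tree):
    """Recursively collapse a (sub)tree into a single number: the average of its two halves."""
    if not isinstance(tree, dict):
        return tree
    return (collapse(tree['left']) + collapse(tree['right'])) / 2.0
```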

Python implementation of post-pruning:
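A post-pruning sketch following the idea above (assumes leaves are plain numbers, i.e. a regression tree, and that test_data is a list of rows; the function name is an assumption):

```python
def postprune(tree, test_data):
    """Merge leaf pairs bottom-up whenever merging lowers the squared error on the test data."""
    if not isinstance(tree, dict):
        return tree
    if not test_data:
        # No test data reaches this subtree: collapse it to its average value.
        return collapse(tree)
    # Route the test data through this node's split and prune the children first.
    ldata, rdata = split_dataset(test_data, tree['feat_idx'], tree['feat_val'])
    tree['left'] = postprune(tree['left'], ldata)
    tree['right'] = postprune(tree['right'], rdata)
    # If both children are now leaves, check whether merging them reduces the error.
    if not isinstance(tree['left'], dict) and not isinstance(tree['right'], dict):
        err_no_merge = (sum((row[-1] - tree['left']) ** 2 for row in ldata) +
                        sum((row[-1] - tree['right']) ** 2 for row in rdata))
        merged_val = (tree['left'] + tree['right']) / 2.0
        err_merge = sum((row[-1] - merged_val) ** 2 for row in test_data)
        if err_merge < err_no_merge:
            return merged_val  # merge: replace the subtree with a single leaf value
    return tree
```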

Let's look at the effect of post-pruning, using test data on a tree that was not pre-pruned:

The output shows that a total of 8 pruning operations were performed. Let’s visualize the tree before and after pruning:

The size of the tree did decrease.

The model tree

In the previous part, the mean of the data in each partition is stored in the leaf node and used as the predicted value for samples that fall into it. In this part, a linear model is stored in each leaf node instead, so the tree is made up of multiple linear models, which is clearly better than forcing the model to use plain averages.

A model tree that uses multiple linear functions for regression is more interpretable than a large tree built from many averages. Using linear models also reduces the size of the tree: an average only covers data locally, while a linear model can cover all data that share a linear relationship. Model trees also tend to have higher prediction accuracy.

Creating a model tree

The idea of the model tree is exactly the same as that of the regression tree; only the ways of generating leaf nodes and computing the data error (impurity) differ. For a leaf node in a model tree, we perform linear regression on the partitioned data to obtain the regression coefficients, rather than simply computing the data's mean. The impurity is not simply the variance of the data but the sum of squared residuals of the linear model.

To compute the linear model for each leaf node, we also need a standard linear regression function, linear_regression. Below are the Python implementations of ferr and fleaf for the model tree:
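A sketch of the three functions (linear_regression is the name used in the text; the fleaf_model and ferr_model names are assumptions):

```python
import numpy as np

def linear_regression(dataset):
    """Ordinary least squares on one node's data; returns (w, X, y)."""
    data = np.asarray(dataset, dtype=float)
    # Prepend a column of ones so the model has an intercept term.
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])
    y = data[:, -1]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, X, y

def fleaf_model(dataset):
    """Model-tree leaf: the regression coefficients fitted on the node's data."""
    w, _, _ = linear_regression(dataset)
    return w

def ferr_model(dataset):
    """Model-tree impurity: residual sum of squares of the node's linear fit."""
    w, X, y = linear_regression(dataset)
    residuals = y - X @ w
    return float(residuals @ residuals)
```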

Apply model trees to piecewise linear data

This part uses the piecewise linear data prepared in advance to build the model tree, and the data points are visualized as follows:

Now we use this data to build a model tree:
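A usage sketch, reusing create_tree with the model-tree callbacks (the file name and the option values are assumptions):

```python
dataset = load_data('exp2.txt')  # assumed file name for the piecewise linear data
model_tree = create_tree(dataset, fleaf_model, ferr_model,
                         opt={'n_tolerance': 4, 'err_tolerance': 0.1})
print(model_tree)
```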

The resulting tree structure:

Visualization:

Draw the regression curve:

As the model tree shows, only two branches are needed for this data, and the depth of the tree is only 2 levels.

When x < 0.304, the linear model y = 3.47 + 1.19x is used for regression; when x > 0.304, the linear model y = 0.0017 + 1.20x is used.

Regression tree vs. linear regression

In this part we fit the same data set with standard linear regression and with a regression tree, and compute the correlation coefficient on the same test set to compare the regression performance of the two models.

For the data I again use an existing data set from Machine Learning in Action. The data visualization is as follows:

Now we fit this data with standard linear regression and with a regression tree, and compute the correlation coefficient R² between the model predictions and the test samples (see github.com/PytLab/MLBo… for the complete code).

Correlation coefficient calculation:
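A sketch of such a computation using NumPy's corrcoef (the helper name get_corrcoef is an assumption):

```python
import numpy as np

def get_corrcoef(y_pred, y_true):
    """Squared Pearson correlation coefficient between predictions and true targets."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.corrcoef(y_pred, y_true)[0, 1] ** 2)
```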

Correlation coefficient obtained:

Draw the regression curves for linear regression and tree regression (yellow: tree regression, red: linear regression):

It is clear that tree regression is more effective than a simple linear model when predicting complex data.

Conclusion

This article introduced the application of decision trees to regression on continuous values, and implemented regression tree visualization, pruning, and model trees, together with output of the corresponding tree structures. A Python implementation of the model tree was given and tested by regression on piecewise linear data. Finally, the regression tree model was compared with a simple standard linear regression model.

References

"Machine Learning in Action"

"CART classification and regression tree: principle and implementation"

