Preface

This article implements a simple version of GBDT; the code is at https://github.com/Freemanzxp/GBDT_Simple_Tutorial. First, a common intuition: suppose someone is 30 years old. In the first iteration we fit the age with 20 years and find the error is 10 years, so the result of the first iteration is 20. In the second iteration we fit the remaining error with 6 years and find the error is still 4 years, so the result of the second iteration is 6. In the third iteration we fit the remaining error with 3 years and the difference is only 1 year, so the result of the third iteration is 3. In the fourth iteration we fit the remaining error with 1 year, and there is no error left. Finally we add up the four fitted values, 20 + 6 + 3 + 1 = 30 years, which is the final fitted result. If the iterations were not used up, we could keep iterating, and the fitted age error would shrink with every iteration. In the end the ages fitted in each round are summed to give the model's output. The example in this article follows exactly this logic.
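This additive intuition can be written as a few lines of Python (the per-round fits of 20, 6, 3 and 1 are hard-coded from the toy example above, purely for illustration):

```python
# Illustrative sketch of the additive-fitting idea from the age example above.
target = 30
fits = [20, 6, 3, 1]              # value fitted in each iteration
prediction = 0
for f in fits:
    prediction += f               # accumulate the fitted values
    residual = target - prediction
    print(f"fitted {f}, running prediction {prediction}, remaining error {residual}")
# After the loop: prediction == 30 and residual == 0.
```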

1/ Data introduction

As shown in the table below, we have sample data whose features are age and weight, with height as the label value. There are 5 records: the first four are training samples, and the last one is the sample to be predicted.
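A minimal reconstruction of this data set in Python is sketched below. The heights come from the text; the age/weight values are only inferred from the split points discussed later in this article, and the test sample's weight is a pure placeholder, so treat all feature values as assumptions:

```python
import numpy as np

# Training samples 0-3: features (age, weight), label = height.
# Heights are stated in the text; ages/weights are assumed from the split
# points mentioned later and may differ from the original table.
X_train = np.array([[5, 20],
                    [7, 30],
                    [21, 70],
                    [30, 60]], dtype=float)
y_train = np.array([1.1, 1.3, 1.7, 1.8])

# Sample 4: the record to be predicted. Age 25 is stated later in the text;
# the weight here is a placeholder.
X_test = np.array([[25, 65]], dtype=float)
```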

2/ Training phase

<1> Parameter Settings

Learning rate: learning_rate = 0.1. Number of iterations: n_trees = 5, i.e. the number of decision trees to generate. Tree depth: max_depth = 3, which determines how many times each tree is split. The learning rate and the number of iterations belong to the Boosting ensemble framework, while the tree depth belongs to the CART regression decision tree.
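For reference, these three hyperparameters map directly onto scikit-learn's gradient boosting regressor. This is only an analogy: the linked repository implements GBDT from scratch, and its depth counting may not match scikit-learn's exactly.

```python
from sklearn.ensemble import GradientBoostingRegressor

# The three parameters above, expressed with scikit-learn (analogy only;
# the tutorial repository implements the algorithm by hand).
model = GradientBoostingRegressor(
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    n_estimators=5,      # number of boosting iterations / trees (n_trees)
    max_depth=3,         # maximum depth of each CART regression tree
)
```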

<2> Initialize the weak learner

The initial weak learner is f0(x) = (1.1 + 1.3 + 1.7 + 1.8) / 4 = 1.475, which is simply the mean of the label values of all training data.
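In code, the initial learner is just a constant equal to the label mean:

```python
import numpy as np

y_train = np.array([1.1, 1.3, 1.7, 1.8])

# The initial weak learner predicts one constant: the mean of the labels.
f0 = y_train.mean()
print(f0)   # 1.475
```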

<3> Number of iterations n_trees

For m = 1, 2, ..., M: since we set the number of iterations n_trees = 5, we have M = 5. The fitted value given by our initial learner f0(x) is 1.475 (the initial learner fits every record with the mean). The errors (residuals) can then be obtained from the true values and the fitted values, as shown in the figure below:

At this point, we take these residuals as the target values of the four training samples to train the weak learner f1(x), i.e. to build the first regression decision tree.
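A minimal sketch of this step: the targets for the first tree are the residuals between the true labels and the constant prediction f0(x).

```python
import numpy as np

y_train = np.array([1.1, 1.3, 1.7, 1.8])
f0 = y_train.mean()                 # 1.475

# Residuals: the values the first tree f1(x) is trained to fit.
residuals = y_train - f0
print(residuals)                    # [-0.375 -0.175  0.225  0.325]
```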

The next step is to find the best feature and the best split point on that feature. For every candidate split, from age = 5 up to weight = 70, we compute the squared error of the two groups produced by the split: se_left is the sum of squared errors of the left node, se_right is the sum of squared errors of the right node, and se_sum is their sum. The split point with the smallest se_sum is the best split point, and it also determines the best feature. If several split points share the same se_sum, one of them is chosen at random. The se_sum of all candidate split points is listed below:
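A sketch of this split search, assuming the candidate thresholds are the feature values themselves and that samples whose feature value is below the threshold go left (the exact convention in the repository may differ, and the feature values here are the assumed ones from earlier):

```python
import numpy as np

def sse(values):
    """Sum of squared errors around the group mean (0 for an empty group)."""
    return float(((values - values.mean()) ** 2).sum()) if len(values) else 0.0

def best_split(X, y):
    """Return (feature index, threshold, se_sum) with the smallest se_sum."""
    best = None
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            left, right = y[X[:, feat] < thr], y[X[:, feat] >= thr]
            se_sum = sse(left) + sse(right)        # se_left + se_right
            if best is None or se_sum < best[2]:
                best = (feat, thr, se_sum)
    return best

# With the assumed feature values and the round-1 residuals:
X = np.array([[5, 20], [7, 30], [21, 70], [30, 60]], dtype=float)
residuals = np.array([-0.375, -0.175, 0.225, 0.325])
print(best_split(X, residuals))   # ~(0, 21.0, 0.025) -> split on age < 21
```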

The minimum se_sum above is 0.025, but it is achieved at two split points: age = 21 and weight = 60. We therefore pick one of them at random; here we pick age = 21, and our first tree now looks like this:

We set max_depth = 3 for every tree, and the depth of the tree is currently only 2, so we need to split again, this time splitting both the left and the right node. Note: within one tree, every split of a node is performed only on the data set owned by that node, and the splitting rule is the same: compute the sum of squared errors and pick the split point with the smallest value. The left node contains only samples 0 and 1; according to the table below, age = 7 is selected:

The right node contains only samples 2 and 3; according to the table below, age = 30 is selected (weight = 70 would work equally well):
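The same computation applies inside each child, restricted to that child's own samples. A small self-contained check (feature values assumed as before):

```python
import numpy as np

def sse(v):
    # Sum of squared errors around the group mean; 0 for an empty group.
    v = np.asarray(v, dtype=float)
    return float(((v - v.mean()) ** 2).sum()) if v.size else 0.0

# Left child: samples 0 and 1 (residuals -0.375, -0.175), assumed ages 5 and 7.
print(sse([]) + sse([-0.375, -0.175]))   # split "age < 5": 0.02
print(sse([-0.375]) + sse([-0.175]))     # split "age < 7": 0.0 -> age 7 chosen

# Right child: samples 2 and 3 (residuals 0.225, 0.325), assumed ages 21 and 30.
print(sse([0.225]) + sse([0.325]))       # split "age < 30": 0.0 -> age 30 chosen
                                         # (a weight split gives the same se_sum)
```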

After another round of splitting, our first tree now looks like this:

At this point the depth of the tree satisfies the setting. One more thing we still need to do is assign a value to each leaf node to fit the residuals. This works the same way as initializing the learner above: take the squared loss, set its derivative to zero, and after simplification the value of each leaf node turns out to be the mean of the label values falling into that leaf. Note that the label value here is not the original y, but the residual y − f0(x) to be fitted in this round. At this point, our first tree looks like this:
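For completeness, the leaf-value rule stated above follows from minimizing the squared loss within each leaf R over the residuals r_i = y_i − f0(x_i):

```latex
% Leaf value under the squared loss: minimize, differentiate, set to zero.
c^{*} = \arg\min_{c} \sum_{x_i \in R} (r_i - c)^2, \qquad
\frac{\partial}{\partial c} \sum_{x_i \in R} (r_i - c)^2
  = -2 \sum_{x_i \in R} (r_i - c) = 0
\;\Longrightarrow\;
c^{*} = \frac{1}{|R|} \sum_{x_i \in R} r_i
```

That is, each leaf's value is simply the average residual of the samples that land in it.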

Repeat the above procedure to generate a total of 5 trees (a compact sketch of this training loop is given after the tree figures below):

The second tree:

The third tree:

The fourth tree:

The fifth tree:
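A compact sketch of the whole training loop described above, using scikit-learn's DecisionTreeRegressor as the base tree purely for brevity (the repository builds its CART trees by hand, and its depth counting may differ slightly from scikit-learn's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_gbdt(X, y, n_trees=5, learning_rate=0.1, max_depth=3):
    """Sketch of the boosting loop: each tree fits the current residuals."""
    f0 = float(np.mean(y))                  # initial constant learner
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction          # targets for this round's tree
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(tree)
        prediction += learning_rate * tree.predict(X)   # shrink each tree's output
    return f0, trees

def predict_gbdt(f0, trees, X, learning_rate=0.1):
    """Initial value plus the shrunken sum of all tree predictions."""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```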

4/ Finally, predict the height of the sample to be tested

In f1(x), the age of sample 4 is 25, which is greater than the split value of 21 and less than 30, so it is predicted to be 0.2250. In f2(x), the … omitted here … so it is predicted to be 0.2025. Why 0.2025? This follows from the second tree; you can simply run the code on GitHub to check.

In f3(x), sample 4 … omitted here … so it is predicted to be 0.1823. In f4(x), sample 4's … omitted here … so it is predicted to be 0.1640. In f5(x), sample 4's … omitted here … so it is predicted to be 0.1476.

Final prediction: F(x) = 1.475 + 0.1 × (0.2250 + 0.2025 + 0.1823 + 0.1640 + 0.1476) = 1.56714
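As a quick check, the final prediction is the initial value plus the learning rate times the sum of the five leaf values that sample 4 falls into:

```python
f0 = 1.475
learning_rate = 0.1
leaf_values = [0.2250, 0.2025, 0.1823, 0.1640, 0.1476]   # one per tree, from above

prediction = f0 + learning_rate * sum(leaf_values)
print(round(prediction, 5))   # 1.56714
```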