Original link: tecdat.cn/?p=4281

Original source: Tuoduan Data Tribe WeChat public account

 

If we average the results of many such models, their combination can sometimes yield a better model than any of the individual parts. This is how an ensemble model works.

Let's build a very small ensemble of three simple decision trees to illustrate:

Each of these trees makes classification decisions based on different variables.
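As a toy illustration of the voting idea (the three trees and their answers below are invented, not taken from the data):

> # three hypothetical trees each classify the same passenger
> votes <- c(tree1 = "Survived", tree2 = "Died", tree3 = "Survived")

> # the ensemble returns the majority decision
> names(which.max(table(votes)))

[1] "Survived"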

A random forest grows its trees much deeper than the decision trees above; in fact, the default is to grow each tree out as far as possible. To keep these fully grown trees from all being identical, random forests inject randomness in two ways.

The first trick is bagging: randomly sampling the rows of your training set with replacement. This is easy to simulate in R with the sample function. Suppose we want to bag a 10-row training set.

> sample(1:10, replace = TRUE)

[1] 3 1 9 1 7 10 10 2 2 9

In this simulation, you get a different sample of rows each time you run the command. On average, about 37% of the rows are left out of a bootstrap sample. Because of these repeated and omitted rows, each decision tree grown with bagging is slightly different.
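We can check that exclusion rate empirically; a quick sketch (the 10-row set and the number of repetitions are arbitrary):

> # average fraction of the 10 rows missing from each bootstrap sample
> excluded <- replicate(10000, sum(!(1:10 %in% sample(1:10, replace = TRUE))) / 10)

> mean(excluded)   # roughly 0.35 for 10 rows; approaches 1/e ≈ 0.37 for large training sets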

The second source of randomness works on the variables rather than the rows. Instead of considering the entire pool of available variables at each split, a random forest looks at only a subset of them, usually the square root of the number available. In our example we have 10 variables, so a subset of three variables at each split makes sense.
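A quick sketch of that rule (the variable names here are placeholders, not columns from the data set):

> p <- 10

> mtry <- floor(sqrt(p))   # 3 variables considered at each split

> sample(paste0("var", 1:p), mtry)   # e.g. "var7" "var2" "var9"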

Through these two sources of randomness, the ensemble consists of trees that are each unique and that all classify somewhat differently. As in our small example, each tree is asked to classify a given passenger, the votes are counted (there may be hundreds or thousands of trees), and the majority decision wins.

R's random forest algorithm has a few restrictions that our decision trees did not have. The first is that we must clean up the missing values in the data set. rpart has the great advantage of using surrogate variables when it encounters an NA value: a lot of Age values are missing from our data set, and if any of our decision trees split on Age, the tree searches for another variable that splits in a similar way and uses it instead. Random forests cannot do this, so we need to find a way to fill in these values manually.

Take a look at the Age variable in the combined data frame:

> summary(combi$Age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.17   21.00   28.00   29.88   39.00   80.00     263

263 of the 1309 values are missing, a whopping 20%! We can subset on the rows where Age is known with !is.na(). We also want to use the method="anova" version of the decision tree now, because we are no longer predicting a class but a continuous variable. So let's grow a tree on the subset of the data with known ages, and then use it to fill in the missing values:
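The fitted tree, Agefit, is not shown in this excerpt; a minimal sketch, assuming rpart and a predictor set like the one built up elsewhere in this series (the exact formula is an assumption):

> library(rpart)

> # anova tree grown only on the rows where Age is known
> Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
+                 data = combi[!is.na(combi$Age), ], method = "anova")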

> combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age),])

You can check the summary again, and all of those NA values are gone.

Now let’s look at a summary of the entire data set and see if there are any other problem variables that we didn’t notice before:

> summary(combi)

 

> summary(combi$Embarked)

          C    Q    S
    2   270  123  914

Two passengers have a blank value for Embarked. First, we need to find out which rows they are; we can use which() for this:

> which(combi$Embarked == '')

[1] 62 830

Then we simply replace these two values and re-encode the variable as a factor:
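The replacement itself is not shown in this excerpt; a minimal sketch, assuming we fill both blanks with the most common port, "S" (914 of the 1309 passengers):

> combi$Embarked[c(62, 830)] <- "S"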

> combi$Embarked <- factor(combi$Embarked)

The other problem variable is Fare; let's take a look:

> summary(combi$Fare)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   7.896  14.450  33.300  31.280 512.300       1

Only one passenger has an NA, so let's find out which one it is and replace it with the median fare:

> which(is.na(combi$Fare))

[1] 1044
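A minimal sketch of the replacement itself, using the median of the non-missing fares:

> combi$Fare[which(is.na(combi$Fare))] <- median(combi$Fare, na.rm = TRUE)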

OK, our data frame is now cleaned up. Now for the second restriction: random forests in R can only digest factors with up to 32 levels, and our FamilyID variable has almost double that. We can take two paths here: either convert the levels to their underlying integers (using the unclass() function) and have the trees treat them as continuous variables, or manually reduce the number of levels to keep the factor under the threshold.

We take the second approach: copy the variable, collapse the rarer levels (see the assumed step inserted in the snippet below), and then convert it back to a factor:

> combi$FamilyID2 <- combi$FamilyID
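> # (assumed step, not shown in the original excerpt) collapse small families so the
> # factor stays under 32 levels; FamilySize and the 'Small' label come from the
> # feature engineering earlier in this series
> combi$FamilyID2 <- as.character(combi$FamilyID2)

> combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'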

> combi$FamilyID2 <- factor(combi$FamilyID2)

We are now down to 22 levels, so we can safely split the test and training sets back apart. Then install and load the randomForest package:

> install.packages('randomForest')

> library(randomForest)

Set the random seed:

> set.seed(415)

The actual number doesn't matter; you just need to use the same seed each time so that the same random numbers are generated inside the random forest function and your results are reproducible.

Now we are ready to run our model. The syntax is similar to that of a decision tree.

> fit <- randomForest( )
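The arguments were left out of the call above; a minimal sketch of what it might look like, assuming the predictor set built up in this series, importance = TRUE (needed for the importance plot below) and ntree = 2000 (all three are assumptions):

> fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
+                       Embarked + Title + FamilySize + FamilyID2,
+                     data = train, importance = TRUE, ntree = 2000)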

Instead of specifying method="class" as we did with rpart, we force the model to predict a classification by temporarily converting our target variable to a factor with only two levels.

If you are working with a larger data set, you may want to reduce the number of trees, at least initially, limit the complexity of each tree with nodesize, and reduce the number of rows sampled with sampsize.
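For example (the values and the shortened formula here are purely illustrative, not recommendations):

> fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Title,
+                     data = train, ntree = 500, nodesize = 10, sampsize = 500)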

So let’s look at which variables are important:

> varImpPlot(fit)

 

Our Title variable takes the lead on both importance metrics (mean decrease in accuracy and mean decrease in Gini). We should be very happy to see that the rest of the engineered variables are also doing very well.

The predict function works in a similar way to the decision tree's, and we can build the submission file in exactly the same way.

> Prediction <- predict(fit, test)

> # assemble the two-column submission frame (PassengerId, Survived)
> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)

> write.csv(submit, file = "firstforest.csv", row.names = FALSE)

Let’s try a forest of conditional inference trees.

So go ahead and install and load the party package.

> install.packages('party')

> library(party)

Build the model in a similar way to our random forest:

> set.seed(415)

> fit <- cforest( )
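The call is again left blank above; a sketch, assuming the same formula as before (using the original FamilyID, since cforest copes with its larger number of levels) and the cforest_unbiased control settings (the ntree and mtry values are assumptions):

> fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
+                  Embarked + Title + FamilySize + FamilyID,
+                data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))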

Conditional inference trees can handle factors with more levels than random forests can. Let's make another prediction:

> Prediction <- predict(fit, test, OOB=TRUE, type = "response")
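To turn this into a submission file, the same pattern as before applies (the file name is just an example):

> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)

> write.csv(submit, file = "ciforest.csv", row.names = FALSE)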

 

 

If you have any questions, please leave a message below!

 


Most popular insights

1. Why do employees resign? Insights from a decision tree model

2. Tree-based methods in R: decision trees and random forests

3. Using scikit-learn and pandas in Python

4. Machine learning: running random forest data analysis reports in SAS

5. Improving airline customer satisfaction with random forests and text mining in R

6. Machine learning for precise fast-fashion sales time series

7. Identifying changing stock market conditions with machine learning: an application of hidden Markov models

8. Python machine learning: building a recommendation system (matrix factorization for collaborative filtering)

9. Predicting bank customer churn with PyTorch machine learning classification in Python