Author | Guo Xiaofa    Editor | Township
Ghouls, Goblins, and Ghosts
Opening
This article walks through the entire process of solving a Kaggle classification problem with decision trees and random forests in R. It assumes an understanding of decision trees and random forests from machine learning, as well as basic R syntax.
Background
About Kaggle
Kaggle is an online platform for data mining and prediction competitions that anyone can enter. Companies or organizers publish the background, data, and goals of real-world problems on Kaggle; participants analyze the data and submit answers, and are ultimately ranked by accuracy. Some Kaggle contests carry prize money, while others serve as practice grounds for data mining. The one we tackle this time is Ghouls, Goblins, and Ghosts.
About Ghouls, Goblins, and Ghosts
Contest page: www.kaggle.com/c/ghouls-go… The background: in their spare time, researchers have measured the characteristics of the 900 monsters in their institute, and have already labeled 371 of them as ghouls, goblins, or ghosts. Our job is to classify the remaining monsters.
This is a classification problem, and we'll solve it with decision trees below.
About decision trees
If you don't know what a decision tree is, please Google it yourself, or start with www.cnblogs.com/leoo2sk/arc…
Observing the data
A first look at the data
train.csv and test.csv are downloaded from the contest page and placed in the directory D:/RData/Ghost/. The first step, of course, is to see what the data look like and which features we can work with. We read the data in, merge the two files into a single data.frame, and distinguish the training set from the test set with an added variable data_set. The advantage of merging into one data.frame is that every subsequent feature transformation takes effect on the training and test sets simultaneously.
CSV ("D:/RData/Ghost/train.csv") train$data_set <- 'train' test <- Read. CSV ("D:/RData/Ghost/test.csv") test$data_set <- 'test' <- factor(all$data_set)Copy the code
Let’s look at the variables in the dataset:
```r
str(all)
```

The results:

```
'data.frame':   900 obs. of  8 variables:
 $ id            : int  0 1 2 4 5 7 8 11 12 19 ...
 $ bone_length   : num  0.355 0.576 0.468 0.777 0.566 ...
 $ rotting_flesh : num  0.351 0.426 0.354 0.509 0.876 ...
 $ hair_length   : num  0.466 0.531 0.812 0.637 0.419 ...
 $ has_soul      : num  0.781 0.44 0.791 0.884 0.636 ...
 $ color         : Factor w/ 6 levels "black","blood",..: 4 1 ...
 $ type          : Factor w/ 3 levels "Ghost","Ghoul",..: 3 2 1 2 3 2 1 2 3 1 ...
 $ data_set      : chr  "train" "train" "train" "train" ...
```
Besides id and data_set, there are six other variables in the dataset.
Meaning of each field in the dataset:
- id – the monster's ID
- bone_length – average bone length (normalized to 0-1)
- rotting_flesh – percentage of rotting flesh
- hair_length – average hair length (normalized to 0-1)
- has_soul – percentage of soul
- color – body color, one of 'white', 'black', 'clear', 'blue', 'green', 'blood'
- type – monster type, one of 'Ghost', 'Goblin', 'Ghoul'
The results of summary(all):

```r
summary(all)
```

```
       id         bone_length     rotting_flesh     hair_length   
 Min.   :  0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000 
 1st Qu.:224.8   1st Qu.:0.3321   1st Qu.:0.4024   1st Qu.:0.3961 
 Median :449.5   Median :0.4268   Median :0.5053   Median :0.5303 
 Mean   :449.5   Mean   :0.4291   Mean   :0.5050   Mean   :0.5222 
 3rd Qu.:674.2   3rd Qu.:0.5182   3rd Qu.:0.6052   3rd Qu.:0.6450 
 Max.   :899.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000 
    has_soul        color          type       data_set 
 Min.   :0.0000   black:104   Ghost :117   test :529  
 1st Qu.:0.3439   blood: 21   Ghoul :129   train:371  
 Median :0.4655   blue : 54   Goblin:125             
 Mean   :0.4671   clear:292   NA's  :529             
 3rd Qu.:0.5892   green: 95                          
 Max.   :1.0000   white:334                          
```
The raw values of the four variables bone_length, rotting_flesh, hair_length, and has_soul have already been standardized. The variables color, type, and data_set are factors. From the distribution of the data_set variable we can see 371 records in the training set and 529 in the test set. The variable type is the response variable for this classification task.
The data is tidy overall, with no missing values apart from the test-set labels we need to predict, so we can move straight on to modeling.
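A quick sanity check (not in the original article) can confirm this; only the type column should contain missing values, and only for the 529 unlabeled test rows:

```r
# Not in the original: count missing values per column.
# Only type should be non-zero (the 529 unlabeled test rows).
colSums(is.na(all))
```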
Preliminary data analysis
We first look at how each variable is distributed within each monster class, to manually screen for variables that correlate with the classification.
```r
library(ggplot2)

train <- all[all$data_set == 'train', ]

# Analyze the four numeric variables: bone_length, rotting_flesh,
# hair_length, has_soul
source('D:/RData/comm/multiplot.R')  # multiplot() helper; see the attachment for its code
plots <- list()
name_pic <- c("bone_length", "rotting_flesh", "hair_length", "has_soul")
for (i in 1:length(name_pic)) {
  p <- ggplot(train, aes_string(x = "type", y = name_pic[i], fill = "type")) +
    geom_boxplot() +
    guides(fill = FALSE) +
    ggtitle(paste(name_pic[i], "Vs type")) +
    xlab("Creature")
  plots[[i]] <- p
}
multiplot(plotlist = plots, cols = 2)
```
The plots show that Ghost and Ghoul are well separated on all four variables, while Goblin sits in the middle and overlaps with the features of the other two classes.
Viewing the plots, we can preliminarily infer the following:

- Goblin will probably end up with the lowest classification accuracy.
- Ghost and Ghoul are relatively well separated on hair_length and has_soul, so these two variables should rank high in importance in subsequent models.

The other variables, such as rotting_flesh, are less discriminative. Next, let's see whether color is related to monster type.
```r
ggplot(train, aes(x = color, fill = type)) +
  geom_bar() +
  theme(text = element_text(size = 14))
```
The chart strongly suggests that the monsters have no particular color preference: within each color, the three types appear in roughly equal proportions.
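A contingency table (not in the original article) makes this claim concrete; each row below shows the share of the three types within one color:

```r
# Not in the original: type proportions within each color.
# Rows sum to 1; roughly equal shares across a row mean that color
# carries little information about type.
round(prop.table(table(train$color, train$type), margin = 1), 2)
```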
Next, let's look at scatter plots between the variables to get an intuitive feel for their correlations.
```r
pairs(~ bone_length + rotting_flesh + hair_length + has_soul,
      data = train, col = train$type,
      labels = c("Bone Length", "Rotting Flesh", "Hair Length", "Soul"))
```
The three monster types are so intertwined across these variables that human eyes run out of ideas; we'll leave the rest to the computer.
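As a numeric companion to the scatter plots (not in the original article), the correlation matrix of the four features:

```r
# Not in the original: pairwise correlations of the four numeric features
num_vars <- c("bone_length", "rotting_flesh", "hair_length", "has_soul")
round(cor(train[, num_vars]), 2)
```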
Model training
Basic model
With the features taken care of (in truth we haven't done anything to them yet, sweat), let's start feeding them into a model. We use the CART decision tree from R's rpart package to classify the samples.
First, set the control parameters of the decision tree
```r
library(rpart)
library(rpart.plot)

# minsplit  -- the minimum number of samples a node must contain for it
#              to be split further; otherwise the node stops splitting
# minbucket -- the minimum number of samples in a leaf node
# maxdepth  -- the maximum depth of the decision tree
# xval      -- the number of cross-validations
# cp        -- the complexity parameter: a split must improve the fit by
#              at least cp to be attempted
tc <- rpart.control(minsplit = 20, minbucket = 10, maxdepth = 10, xval = 5, cp = 0.005)
```
We simply add all the features to the model
```r
# Set the random seed for reproducibility
set.seed(223)

# Model formula
fm.base <- type ~ bone_length + rotting_flesh + hair_length + has_soul + color

# method: chosen according to the type of the response variable --
#   continuous: "anova", categorical: "class", counts: "poisson",
#   survival analysis: "exp"
# control: the control parameters defined above
mod.base <- rpart(formula = fm.base, data = train, method = "class", control = tc)
```
The fitted model contains a lot of information; one of the more useful pieces is variable.importance, which gives the importance of each feature in the trained model.
```r
par(mai = c(0.5, 1, 0.5, 0.5))
barplot(mod.base$variable.importance, horiz = TRUE, cex.names = 0.8, las = 2)
```
As the bar chart shows, hair_length and has_soul are the most important, while color contributes nothing, which is consistent with the boxplots above.
So let’s draw the final decision tree.
```r
rpart.plot(mod.base, branch = 1, under = TRUE, faclen = 0, type = 0)
```
Let’s see how accurate the model is on the training set
```r
# type = 'class' returns the predicted class as a factor
pred.base <- predict(mod.base, train, type = 'class')

# Confusion matrix on the training set
table(train$type, pred.base)
##         pred.base
##          Ghost Ghoul Goblin
##   Ghost    105     4      8
##   Ghoul      2   112     15
##   Goblin    17    27     81

# Training accuracy: correct predictions are on the diagonal
sum(diag(table(train$type, pred.base))) / nrow(train)
## [1] 0.8032345
```
The accuracy on the training set is about 80%, which looks good. Let's strike while the iron is hot: predict on the test set and submit the answers.
```r
test <- all[all$data_set == 'test', ]
pred.base <- predict(mod.base, test, type = 'class')

# Generate the submission file
res <- data.frame(id = test$id, type = pred.base)
write.csv(res, file = "D:/RData/Ghost/res_basic.csv", row.names = FALSE)
```
We submitted the result file res_basic.csv to Kaggle and soon got the score of our first submission on the public leaderboard: an accuracy of 0.68431. That's a little low, which means we still have a long way to go.
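The gap between the 80% training accuracy and the leaderboard score hints at overfitting. One way to see it (not shown in the original) is rpart's built-in cross-validation: since we set xval = 5 in rpart.control, printcp() reports a cross-validated error that is less optimistic than the training accuracy:

```r
# Not in the original: rpart stores cross-validation results because
# we set xval = 5 in rpart.control. xerror is the cross-validated
# relative error -- a less optimistic estimate than training accuracy.
printcp(mod.base)
plotcp(mod.base)  # xerror as a function of the complexity parameter cp
```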
Feature reprocessing
The dataset has few variables, and the importance plot above shows that hair_length and has_soul matter most, so combining them should yield a useful new feature. We take the pairwise products of the four numeric variables (6 new variables) and their squares (4 more), giving 10 new variables in total, and then retrain the decision tree.
```r
# Pairwise products of the four numeric variables
all$bone_rot  = all$bone_length   * all$rotting_flesh
all$bone_hair = all$bone_length   * all$hair_length
all$bone_soul = all$bone_length   * all$has_soul
all$rot_hair  = all$rotting_flesh * all$hair_length
all$rot_soul  = all$rotting_flesh * all$has_soul
all$hair_soul = all$hair_length   * all$has_soul

# Squares of the four numeric variables
all$bone2 = all$bone_length   ^ 2
all$rot2  = all$rotting_flesh ^ 2
all$hair2 = all$hair_length   ^ 2
all$soul2 = all$has_soul      ^ 2
```
Looking at the distributions of the new variables, the hair_soul variable in the bottom-right corner separates Ghost from Ghoul almost completely, and also improves the separation from Goblin.
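The original doesn't show the code behind this figure; here is a sketch, assuming it reuses the earlier multiplot/boxplot loop on the new variables:

```r
# A sketch, assuming the same boxplot pattern as before
train <- all[all$data_set == 'train', ]  # refresh train to pick up the new columns
plots <- list()
name_pic <- c("bone_rot", "bone_hair", "bone_soul", "rot_hair", "rot_soul",
              "hair_soul", "bone2", "rot2", "hair2", "soul2")
for (i in 1:length(name_pic)) {
  p <- ggplot(train, aes_string(x = "type", y = name_pic[i], fill = "type")) +
    geom_boxplot() + guides(fill = FALSE) +
    ggtitle(paste(name_pic[i], "Vs type")) + xlab("Creature")
  plots[[i]] <- p
}
multiplot(plotlist = plots, cols = 2)
```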
Retrain the model, compute on the test set, and submit for validation.
```r
train <- all[all$data_set == 'train', ]
test  <- all[all$data_set == 'test', ]

# New formula including the engineered features
fm.more <- type ~ bone_length + rotting_flesh + hair_length + has_soul + color +
  bone_rot + bone_hair + bone_soul + rot_hair + rot_soul + hair_soul +
  rot2 + hair2 + soul2 + bone2

mod.more <- rpart(formula = fm.more, data = train, method = "class", control = tc)
```
Let’s look at the importance of the variables in the new model.
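The plotting command is omitted here in the original; presumably it is the same importance barplot as for the base model:

```r
# Presumably the same barplot as before (command not shown in the original)
par(mai = c(0.5, 1, 0.5, 0.5))
barplot(mod.more$variable.importance, horiz = TRUE, cex.names = 0.8, las = 2)
```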
The newly added variable hair_soul has the highest importance, meaning it contributes the most to the model.
```r
pred.more <- predict(mod.more, test, type = 'class')
res <- data.frame(id = test$id, type = pred.more)
write.csv(res, file = "D:/RData/Ghost/res_more.csv", row.names = FALSE)
```
After this submission, the accuracy rate increased to 0.71078.
Random forests
As the saying goes, two heads are better than one. Machine learning has a similar technique: combining models into an ensemble. For decision trees, the random forest is a simple ensemble method: bagging is used to build a forest of many decision trees, each grown independently of the others. When a new sample arrives, every tree in the forest classifies it separately, and the class that receives the most votes becomes the forest's prediction.
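To make the voting idea concrete before the caret pipeline below, here is a minimal sketch (not part of the original workflow) that fits a forest directly with the randomForest package, the same engine caret uses for method = "rf":

```r
# A minimal sketch, not in the original: fit a random forest directly
library(randomForest)

set.seed(223)
# 500 bagged trees; each split considers a random subset of features
rf.direct <- randomForest(fm.more, data = train, ntree = 500, importance = TRUE)
print(rf.direct)  # reports the out-of-bag (OOB) error estimate and confusion matrix
```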
```r
library(caret)

set.seed(223)

# method = "cv" -- k-fold cross-validation
# number        -- the k in k-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)

# Train the random forest; tuneLength = 3 tries three candidate values
# of the tuning parameter (mtry for method = "rf")
mod.rf <- train(fm.more, data = train, method = "rf",
                trControl = ctrl, tuneLength = 3)
```
Predicted results
```r
pred.rf <- predict(mod.rf, test)
res <- data.frame(id = test$id, type = pred.rf)
write.csv(res, file = "D:/RData/Ghost/res_rf.csv", row.names = FALSE)
```
This submission scored 0.72023.
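It can also be worth inspecting what caret settled on (not in the original article):

```r
# Not in the original: inspect the tuned caret model
print(mod.rf$bestTune)  # the mtry value selected by cross-validation
print(varImp(mod.rf))   # caret's variable-importance ranking
```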
Reference books
- Introduction to Data Mining
- Machine Learning with R Cookbook
- Machine Learning Using R