Background
Titanic: Machine Learning from Disaster – Kaggle
Someone recommended this competition to me two years ago, but when I opened the page back then I had no idea how to approach it.
Two years later, when I opened the page again, the Titanic Tutorial – Kaggle was laid out clearly, and even a complete beginner could follow it. What was blinding my eyes back then~
Target
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Data
Titanic: Machine Learning from Disaster – Kaggle
- train.csv
- Survived: 1 = yes, 0 = no
- test.csv
- gender_submission.csv: an example submission, i.e. the prediction format
- PassengerId: the IDs from test.csv
- Survived: the predicted result
Guide to get started and follow along
Titanic Tutorial – Kaggle
- Join the Competition Here!
- Submit an initial result
- Notebook
Learning Model
An excerpt of the site's explanation; the details come later.
- random forest model
- constructed of several “trees”
- that will individually consider each passenger’s data
- and vote on whether the individual survived.
- Then, the random forest model makes a democratic decision:
- the outcome with the most votes wins!
sklearn.ensemble.RandomForestClassifier
The Titanic tutorial uses the RandomForestClassifier algorithm. Before learning the algorithm itself, I noticed that its class lives in scikit-learn's ensemble module, so I wanted to get to know ensemble first.
ensemble
A number of things considered as a group
It sounds like a combination.
How to get started with Ensemble Learning? – Zhihu mentions that there are three common ensemble learning frameworks: Bagging, Boosting, and Stacking.
As can be seen from the API Reference – scikit-learn 0.22.1 documentation, there are algorithms under each of these frameworks.
Random Forest is an algorithm in the Bagging framework. I'll just try to understand this one for now and deal with the other frameworks when I come across them later. But before that, we need to know what ensemble learning is.
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
This interpretation can be taken literally: combine multiple algorithms to achieve better predictive performance than any single one of them.
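To make "combining multiple algorithms" concrete, here is a minimal sketch of my own (not from the tutorial): scikit-learn's VotingClassifier lets three different algorithms vote on toy data.

# A toy sketch of combining several algorithms; the dataset and choices are mine.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each constituent algorithm predicts; the majority class wins.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))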
Bagging framework
sklearn.ensemble.BaggingClassifier – scikit-learn 0.22.1 documentation
A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction.
The general idea is:
- Randomly sample a subset from the source dataset
- Train a classifier on this subset
- Repeat the above steps several times
- Then aggregate the predicted results of all the classifiers (by averaging or voting)
- To form the final prediction
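As a minimal sketch of those five steps (my own toy example; the parameter values are arbitrary), scikit-learn's BaggingClassifier does all of this in one class:

# A toy sketch of the Bagging procedure described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # the classifier trained on each subset
    n_estimators=10,   # repeat the sample-and-train steps 10 times
    max_samples=0.5,   # each random subset contains half of the data
    bootstrap=True,    # sample with replacement
    random_state=0,
)
bagging.fit(X, y)

# The final prediction aggregates the 10 classifiers' votes.
print(bagging.predict(X[:5]))

(Note: base_estimator was renamed to estimator in newer scikit-learn versions; the name above matches the 0.22.1 documentation cited here.)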
The questions are:
- How many subsets do I sample, i.e. how many classifiers do I end up with?
- What algorithm does the random sampling use?
- What are the advantages and disadvantages of averaging versus voting when aggregating the classifiers' results?
- How is each classifier trained?
I don't know yet.
The Random Forest mentioned above is an algorithm in the Bagging framework. Now let's see how this algorithm answers some of my questions.
The Random Forest algorithm
1.11. Ensemble methods — scikit-learn 0.22.1 documentation
The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
First of all, this algorithm averages the predictions of the individual classifiers. A forest of what? A forest of trees, of course, and here the trees are decision trees, so this algorithm is actually based on randomized decision trees.
random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
That is, Random Forest builds a decision tree as each classifier and then merges their predictions.
How are the classifiers built? Let's take the Titanic code as an example and try to understand:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

y = train_data["Survived"]                      # training target: who survived
features = ["Pclass", "Sex", "SibSp", "Parch"]  # feature columns
X = pd.get_dummies(train_data[features])        # dummy-encode the string columns

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
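The tutorial then predicts on the test set and writes the submission file described under Data above. Roughly like this (reproduced from memory, so treat it as a sketch rather than the exact tutorial code):

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

# Encode the test features the same way as the training features.
X_test = pd.get_dummies(test_data[features])
predictions = model.predict(X_test)

# PassengerId comes from test.csv; Survived is the predicted result.
output = pd.DataFrame({"PassengerId": test_data.PassengerId,
                       "Survived": predictions})
output.to_csv("my_submission.csv", index=False)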
- y: the training target, i.e. the set of who survived the disaster
- features: the characteristic values of these passengers, such as sex and class
- X: the dummy data generated by get_dummies

Why not use train_data[features] directly?
If we use train_data[features] directly, printing X looks like this:
   Pclass     Sex  SibSp  Parch
0       3    male      1      0
1       1  female      1      0
If we continue modeling with X, we will get an error:
ValueError: could not convert string to float: 'male'
Obviously, train_data[features] cannot be used directly, because the Sex field is a string while the model requires floats.
The purpose of get_dummies() is to convert such string fields into numbers. As you can see from the printout below, the Sex field is split into two fields, Sex_female and Sex_male, with values 0 and 1 respectively.
     Pclass  SibSp  Parch  Sex_female  Sex_male
0         3      1      0           0         1
1         1      1      0           1         0
2         3      0      0           1         0
3         1      1      0           1         0
4         3      0      0           0         1
..      ...    ...    ...         ...       ...
886       2      0      0           0         1
887       1      0      0           1         0
888       3      1      2           1         0
889       1      0      0           0         1
890       3      0      0           0         1
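To see get_dummies in isolation, here is a tiny standalone example of my own with made-up data:

import pandas as pd

# One string column becomes one column per distinct value.
df = pd.DataFrame({"Sex": ["male", "female", "female"]})
print(pd.get_dummies(df))
#    Sex_female  Sex_male
# 0           0         1
# 1           1         0
# 2           1         0
# (older pandas prints 0/1; pandas >= 2.0 prints True/False booleans)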
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
- What do these parameters mean?
- n_estimators: the number of decision trees
- max_depth: the maximum depth of each decision tree
- random_state: controls the random number generation; it may be used together with other parameters such as shuffle. As I understand it, this makes the randomization controlled, so that running the model multiple times still produces the same result each time, right? (see the quick check after this list)
- To make a randomized algorithm deterministic (i.e. running it multiple times will produce the same result), an arbitrary integer random_state can be used
For details about how to tune the parameters, refer to the Tuning guidelines.
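A quick way to check that understanding of random_state: train the same model twice with the same seed and compare the predictions (my own sanity check, on toy data):

# Fixing random_state makes the "random" forest reproducible across runs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

a = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1).fit(X, y)
b = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1).fit(X, y)

# Same seed, same forest, same predictions.
print((a.predict(X) == b.predict(X)).all())  # True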
Application scenarios of Random Forest
Since it is a classification algorithm, many classification scenarios naturally suit it; it can also be applied to regression problems (a sketch follows below).
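For the regression case, scikit-learn provides a parallel class, RandomForestRegressor; a minimal sketch on toy data (my own example, not from the tutorial):

# The regression counterpart averages the trees' numeric predictions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=1)
reg.fit(X, y)
print(reg.predict(X[:3]))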
The Random Forest Algorithm: A Complete Guide – Built In gives a practical analogy:
- You are deciding where to travel and ask a friend
- The friend asks what you liked and disliked about previous trips
- And gives some suggestions on that basis
- This provides fodder for your decision making
- Same procedure: you ask another friend
- And another friend
- …
Similarly, if you have several job offers and are hesitating over which one to accept, or you have looked at a few houses and need to decide which one to choose, it seems you could give this algorithm a try.
I learned some unfamiliar code
- pandas.DataFrame.head: returns the first few rows of the dataset
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()  # shows the first 5 rows by default
- pandas.DataFrame.loc
# Select the Survived column of all male passengers.
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men) / len(men)  # proportion of men who survived
Reference
- How to get started with Ensemble Learning? – Zhihu
- Ensemble learning – Wikipedia
- The Random Forest Algorithm: A Complete Guide – Built In