Background
Titanic: Machine Learning from Disaster – Kaggle
Someone recommended this competition to me two years ago, but when I opened the page back then I had no idea how to approach it.
Two years later, when I opened the page again, the Titanic Tutorial – Kaggle was laid out clearly, and even a complete beginner could follow it. What was blinding my eyes back then~
Target
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Data
Titanic: Machine Learning from Disaster – Kaggle
- train.csv
- Survived: 1 = yes, 0 = no
- test.csv
- gender_submission.csv: an example submission, i.e. the prediction format
- PassengerId: the IDs from test.csv
- Survived: the predicted result
Guide to get started and follow along
Titanic Tutorial – Kaggle
- Join the Competition Here!
- Submit an initial result
- Notebook
Learning Model
An excerpt of the site's explanation; the details come later.
- random forest model
- constructed of several “trees”
- that will individually consider each passenger’s data
- and vote on whether the individual survived.
- Then, the random forest model makes a democratic decision:
- the outcome with the most votes wins!
sklearn.ensemble.RandomForestClassifier
The Titanic tutorial uses the RandomForestClassifier algorithm. Before learning the algorithm itself, I noticed that its class lives in scikit-learn's ensemble module, so I wanted to get to know ensemble first.
ensemble
A number of things considered as a group
It sounds like a combination.
How to get started with Ensemble Learning? – Zhihu mentions that there are three common ensemble learning frameworks: Bagging, Boosting, and Stacking.
As can be seen from the API Reference – scikit-learn 0.22.1 documentation, there are algorithms under each of these frameworks.
Random Forest is an algorithm in the Bagging framework. I'll just try to understand this one for now and deal with the other frameworks when I come across them later. But before that, we need to know what ensemble learning is.
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
This interpretation can be taken literally: combine multiple algorithms to achieve better predictive performance than any single one of them.
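To make "combining multiple algorithms" concrete, here is a minimal sketch of my own (not from the tutorial): scikit-learn's VotingClassifier lets three different algorithms vote on toy data.

# A toy sketch of combining several algorithms; the dataset and choices are mine.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each constituent algorithm predicts; the majority class wins.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))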
Bagging framework
sklearn.ensemble.BaggingClassifier – scikit-learn 0.22.1 documentation
A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction.
The general idea is:
- Randomly sample a subset from the source dataset
- Train a classifier on this subset
- Repeat the above steps several times
- Then aggregate the predicted results of all the classifiers (by averaging or voting)
- To form the final prediction
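As a minimal sketch of those five steps (my own toy example; the parameter values are arbitrary), scikit-learn's BaggingClassifier does all of this in one class:

# A toy sketch of the Bagging procedure described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # the classifier trained on each subset
    n_estimators=10,   # repeat the sample-and-train steps 10 times
    max_samples=0.5,   # each random subset contains half of the data
    bootstrap=True,    # sample with replacement
    random_state=0,
)
bagging.fit(X, y)

# The final prediction aggregates the 10 classifiers' votes.
print(bagging.predict(X[:5]))

(Note: base_estimator was renamed to estimator in newer scikit-learn versions; the name above matches the 0.22.1 documentation cited here.)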
The questions are:
- How many subsets do I sample, i.e. how many classifiers do I end up with?
- What algorithm does the random sampling use?
- What are the advantages and disadvantages of averaging versus voting when aggregating the classifiers' results?
- How is each classifier trained?
I don't know yet.
The Random Forest mentioned above is an algorithm in the Bagging framework. Now let's see how this algorithm answers some of my questions.
The Random Forest algorithm
1.11. Ensemble methods — scikit-learn 0.22.1 documentation
The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
First of all, this algorithm averages the predictions of the individual classifiers. A forest of what? A forest of trees, of course, and here the trees are decision trees, so this algorithm is actually based on randomized decision trees.
random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
That is, Random Forest builds a decision tree as each classifier and then merges their predictions.
How are the classifiers built? Let's take the Titanic code as an example and try to understand:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

y = train_data["Survived"]                      # training target: who survived
features = ["Pclass", "Sex", "SibSp", "Parch"]  # feature columns
X = pd.get_dummies(train_data[features])        # dummy-encode the string columns

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
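The tutorial then predicts on the test set and writes the submission file described under Data above. Roughly like this (reproduced from memory, so treat it as a sketch rather than the exact tutorial code):

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

# Encode the test features the same way as the training features.
X_test = pd.get_dummies(test_data[features])
predictions = model.predict(X_test)

# PassengerId comes from test.csv; Survived is the predicted result.
output = pd.DataFrame({"PassengerId": test_data.PassengerId,
                       "Survived": predictions})
output.to_csv("my_submission.csv", index=False)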
- y: the training target, i.e. the set of who survived the disaster
- features: the characteristic values of these passengers, such as sex and class
- X: the dummy data generated by get_dummies

Why not use train_data[features] directly?
If we use train_data[features] directly, printing X looks like this:
   Pclass     Sex  SibSp  Parch
0       3    male      1      0
1       1  female      1      0
If we continue modeling with X, we will get an error:
ValueError: could not convert string to float: 'male'
Obviously, train_data[features] cannot be used directly, because the Sex field is a string while the model requires floats.
The purpose of get_dummies() is to convert such string fields into numbers. As you can see from the printout below, the Sex field is split into two fields, Sex_female and Sex_male, with values 0 and 1 respectively.
     Pclass  SibSp  Parch  Sex_female  Sex_male
0         3      1      0           0         1
1         1      1      0           1         0
2         3      0      0           1         0
3         1      1      0           1         0
4         3      0      0           0         1
..      ...    ...    ...         ...       ...
886       2      0      0           0         1
887       1      0      0           1         0
888       3      1      2           1         0
889       1      0      0           0         1
890       3      0      0           0         1
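To see get_dummies in isolation, here is a tiny standalone example of my own with made-up data:

import pandas as pd

# One string column becomes one column per distinct value.
df = pd.DataFrame({"Sex": ["male", "female", "female"]})
print(pd.get_dummies(df))
#    Sex_female  Sex_male
# 0           0         1
# 1           1         0
# 2           1         0
# (older pandas prints 0/1; pandas >= 2.0 prints True/False booleans)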
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
- What do these parameters mean?
- n_estimators: the number of decision trees
- max_depth: the maximum depth of each decision tree
- random_state: controls the random number generation; it may be used together with other parameters such as shuffle. As I understand it, this makes the randomization controlled, so that running the model multiple times still produces the same result each time, right? (see the quick check after this list)
- To make a randomized algorithm deterministic (i.e. running it multiple times will produce the same result), an arbitrary integer random_state can be used
For details about how to tune the parameters, refer to the Tuning guidelines.
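A quick way to check that understanding of random_state: train the same model twice with the same seed and compare the predictions (my own sanity check, on toy data):

# Fixing random_state makes the "random" forest reproducible across runs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

a = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1).fit(X, y)
b = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1).fit(X, y)

# Same seed, same forest, same predictions.
print((a.predict(X) == b.predict(X)).all())  # True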
Application scenarios of Random Forest
Since it is a classification algorithm, many classification scenarios naturally suit it; it can also be applied to regression problems (a sketch follows below).
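For the regression case, scikit-learn provides a parallel class, RandomForestRegressor; a minimal sketch on toy data (my own example, not from the tutorial):

# The regression counterpart averages the trees' numeric predictions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=1)
reg.fit(X, y)
print(reg.predict(X[:3]))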
The Random Forest Algorithm: A Complete Guide – Built In gives a practical analogy:
- You are deciding where to travel and ask a friend
- The friend asks what you liked and disliked about previous trips
- And gives some suggestions on that basis
- This provides fodder for your decision making
- Same procedure: you ask another friend
- And another friend
- …
Similarly, if you have several job offers and are hesitating over which one to accept, or you have looked at a few houses and need to decide which one to choose, it seems you could give this algorithm a try.
I learned some unfamiliar code
- pandas.DataFrame.head: returns the first few rows of the dataset
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()  # shows the first 5 rows by default
- pandas.DataFrame.loc
# Select the Survived column of all male passengers.
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men) / len(men)  # proportion of men who survived
Reference
- How to get started with Ensemble Learning? – Zhihu
- Ensemble learning – Wikipedia
- The Random Forest Algorithm: A Complete Guide – Built In