The Titanic, a British passenger ship sailing from Southampton to its destination of New York, sank on an April morning in 1912 after colliding with an iceberg. With an estimated 2,224 passengers and crew on board, more than 1,500 people died, making it one of the deadliest peacetime maritime disasters in modern history.

Today we will build a machine learning model on the famous Titanic data set. The data set records each passenger’s economic status, gender, age, and so on. We will combine these features into a model that predicts, from those parameters, whether a given person would have survived. You can even use it to estimate your own chances of survival.

1. Prepare

Before you begin, make sure Python and pip are installed on your computer. If not, please see this article to install Python: Super Detailed Python Installation Guide.

On Windows, open Cmd (Start – Run – Cmd). On macOS, open Terminal (press Command + Space and type Terminal).

Of course, I recommend using the VSCode editor: copy the code from this article and install the dependency modules in the terminal at the bottom of the editor. It is a very comfortable way to work; see The Best Companion to Python Programming – VSCode Detailed Guide.

Type the following command to install the dependency modules we need:

pip install numpy
pip install pandas
pip install seaborn
pip install matplotlib  
pip install scikit-learn

If you see “Successfully installed XXX”, the installation succeeded. Oh, and don’t forget to download the data set. You can download it from Kaggle, or you can download the full data and code from Titanic.

2. Analyze basic data

Before we apply machine learning, we need to do some general data analysis, such as missing value detection, counting the features, and basic correlation analysis.

2.1 Missing values

The first step is missing value detection. A data set like this is almost certain to have missing values, and we should understand exactly what is missing before starting any machine learning analysis.

This is where a tool from an earlier article comes in handy: Seven Lines of Code to Skillfully Visualize Missing Data in a Python Chart. Generated heat map:

As can be seen, Cabin and Age have the most missing values, and these two columns may need to be dropped if necessary. Columns with only a few missing values are easier to deal with and can be handled by filling in the gaps. Heat map code:
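A minimal sketch of how such a missing-value check could look. The small `sample` frame is made-up stand-in data (in practice you would load the Kaggle train.csv with `pd.read_csv`), and the helper names `missing_summary` and `plot_missing_heatmap` are hypothetical, not taken from the original listing.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv: the column names follow the Kaggle Titanic data,
# the values are made up for illustration.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})


def missing_summary(df):
    """Return the number of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)


def plot_missing_heatmap(df):
    """Draw one cell per value; missing cells get a contrasting color."""
    # Imported here so missing_summary works without the plotting stack.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
    plt.show()


missing = missing_summary(sample)
print(missing)
```

Calling `plot_missing_heatmap(sample)` opens the chart; the printed summary gives the same information as plain counts.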

2.2 Finding characteristic variables

In this section, we will focus on finding out which variables make passengers more likely to survive, such as age and gender, position on board, and so on.

First, we analyze age and gender; the following chart is drawn from the training set:

As you can see, the male mortality rate is noticeably higher, reflecting the general principle of letting women and children escape first. For males between the ages of 5 and 18, the chances of survival appear to be very low, although this may be due to the small number of people in that age group on board.

Moving on, do cabin class and boarding location affect survival? Embarked is the port of embarkation and Pclass is the passenger class, where 1 means first class.

It can be seen that first-class passengers had a higher survival rate than other passengers. Moreover, among those who boarded at port C, men had a higher survival rate than women, which makes one doubt the morals of the port C passengers.

One more thing: are passengers with more relatives aboard more likely to survive?

As you can see, those with between one and three relatives aboard are the most likely to survive, while those with more than three fare much worse.

The code for this visualization is as follows:
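A rough sketch of the kind of analysis behind these charts, assuming the Kaggle column names (Survived, Sex, Pclass, SibSp, Parch). The rows in `sample` are made up, the `relatives` column (SibSp plus Parch) is an assumption about how relatives were counted, and `plot_survival` is a hypothetical helper using seaborn’s `barplot` and `pointplot`.

```python
import pandas as pd

# Made-up rows standing in for the training set.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "SibSp": [1, 1, 0, 0, 0, 4],
    "Parch": [0, 0, 0, 0, 0, 2],
})

# Assumption: "relatives" is siblings/spouses plus parents/children aboard.
with_relatives = sample.assign(relatives=sample["SibSp"] + sample["Parch"])

# The mean of a 0/1 column is the survival rate of each group.
by_sex = with_relatives.groupby("Sex")["Survived"].mean()
by_class = with_relatives.groupby("Pclass")["Survived"].mean()
by_relatives = with_relatives.groupby("relatives")["Survived"].mean()


def plot_survival(df):
    """Plot survival rate by class and sex, then by number of relatives."""
    # Imported here so the group-by summaries work without the plotting stack.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.barplot(data=df, x="Pclass", y="Survived", hue="Sex")
    plt.show()
    sns.pointplot(data=df, x="relatives", y="Survived")
    plt.show()


print(by_sex, by_class, by_relatives, sep="\n\n")
```

With the real training set in place of `sample`, `plot_survival(with_relatives)` draws the two charts.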

3. Machine learning prediction

First, we need to preprocess the data based on the analysis we just did, removing the columns that do not help our model, starting with PassengerId. Cabin has too many missing values, so we drop it here too. Name would have to be converted to numbers before it could be analyzed; to simplify the process, it is also removed. The last column to drop is Ticket, which is nearly unique per passenger and tells us little, so we get rid of it as well.
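As a small illustration of this step, here is a sketch that drops the four columns with pandas. The column names follow the Kaggle data; the two rows and the `DROPPED` constant are made up for the example.

```python
import pandas as pd

# Two made-up rows with the columns discussed above.
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
    "Ticket": ["T-0001", "T-0002"],
    "Cabin": [None, "C85"],
})

# Columns judged unhelpful for the model in the text above.
DROPPED = ["PassengerId", "Cabin", "Name", "Ticket"]

# drop returns a new frame; the original sample is left untouched.
cleaned = sample.drop(columns=DROPPED)
print(cleaned.columns.tolist())
```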

3.1 Complete the missing data

And of course we have to fill in the missing values. Missing ages are filled with random numbers drawn using the mean and standard deviation of the known ages. Missing embarkation points are replaced with “S”.

3.2 Converting data to numbers

Here, we need to convert three columns in total:

1. Ticket price (Fare), from floating point to integer
2. Gender (Sex) to a number
3. Port of embarkation (Embarked) to a number

I have to say that pandas is really convenient: a single map call does the job.
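One way this filling step might be written, as a sketch. It assumes the random ages are whole numbers drawn uniformly between mean minus one standard deviation and mean plus one standard deviation; the exact range, the `fill_missing` helper, and the `seed` parameter are assumptions, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 30.0],
    "Embarked": ["S", "C", None, "Q", None],
})


def fill_missing(df, seed=0):
    """Fill Age with random integers near the mean and Embarked with "S"."""
    out = df.copy()
    mean, std = out["Age"].mean(), out["Age"].std()
    is_missing = out["Age"].isnull()

    # Assumed range: whole numbers within one standard deviation of the mean.
    rng = np.random.default_rng(seed)
    low, high = int(mean - std), int(mean + std)
    random_ages = rng.integers(low, high + 1, size=int(is_missing.sum()))

    out.loc[is_missing, "Age"] = random_ages
    out["Age"] = out["Age"].astype(int)
    out["Embarked"] = out["Embarked"].fillna("S")
    return out


filled = fill_missing(sample)
print(filled)
```

Fixing the seed keeps the filled values reproducible between runs, which makes later accuracy numbers easier to compare.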
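A short sketch of the conversion with `astype` and `map`. The specific codes in `SEX_CODES` and `PORT_CODES` are assumptions chosen for illustration; the original mapping is not shown in the text.

```python
import pandas as pd

sample = pd.DataFrame({
    "Fare": [7.25, 71.2833, 8.05],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Hypothetical code tables; any consistent numbering would work.
SEX_CODES = {"male": 0, "female": 1}
PORT_CODES = {"S": 0, "C": 1, "Q": 2}

encoded = sample.copy()
# astype(int) truncates the fractional part of the fare.
encoded["Fare"] = encoded["Fare"].fillna(0).astype(int)
# map replaces each value by its entry in the dictionary.
encoded["Sex"] = encoded["Sex"].map(SEX_CODES)
encoded["Embarked"] = encoded["Embarked"].map(PORT_CODES)
print(encoded)
```

Note that `map` turns any value missing from the dictionary into NaN, so missing values should be filled before this step.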

3.3 Grouping individual values into ranges

Since each age is an individual number, such values carry little meaning when data is scarce. We need to group ages into ranges, and the same goes for the ticket price. We convert both together:
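A sketch of this grouping using `pd.cut`. The bin edges in `AGE_EDGES` and `FARE_EDGES` are hypothetical and only illustrate the idea; the ranges used in the original code are not given in the text.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [4, 17, 29, 30, 71],
    "Fare": [7, 8, 26, 71, 512],
})

# Hypothetical bin edges; each bin includes its left edge and excludes its right.
AGE_EDGES = [0, 12, 18, 30, 45, 60, 120]
FARE_EDGES = [0, 8, 15, 31, 100, 1000]

binned = sample.copy()
# labels=False returns the index of the bin instead of an interval label.
binned["Age"] = pd.cut(sample["Age"], bins=AGE_EDGES, labels=False, right=False)
binned["Fare"] = pd.cut(sample["Fare"], bins=FARE_EDGES, labels=False, right=False)
print(binned)
```

Values outside the outermost edges would come back as NaN, so the edges should cover the whole range of the data.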

3.4 Creating a Model

This is finally the key point, yet it is the easiest part of the whole third section, because the sklearn module has wrapped up everything we need. All we have to do is call the module, pass in the data to train, and test.

We use the random forest model (honestly, without sklearn, writing this model would make me go bald). For an introduction to random forests, which essentially address the overfitting problem of decision trees, this easy-to-understand article is a good start: blog.csdn.net/mao_xiao_fe…

The code for training and testing is as follows:
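A minimal sketch of training and scoring a random forest with scikit-learn. To stay self-contained it uses randomly generated stand-in data with an invented survival rule rather than the preprocessed Titanic frame, so the printed numbers say nothing about the real data set; the hyperparameters are also just illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features (all columns already numeric).
rng = np.random.default_rng(42)
n = 400
features = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.integers(0, 7, size=n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
    "Fare": rng.integers(0, 6, size=n),
    "Embarked": rng.integers(0, 3, size=n),
})

# Invented rule: survival odds rise when Sex == 1 and for better classes.
p_survive = 0.15 + 0.5 * features["Sex"] + 0.1 * (3 - features["Pclass"])
labels = (rng.random(n) < p_survive).astype(int)

# Hold out a quarter of the rows so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

Scoring on the rows the model was trained on tends to look better than scoring on held-out rows, so it is worth checking which of the two a reported accuracy refers to.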

The accuracy is as follows:

>> python 1.py
0.9034792368125701

Let’s plug in our own data and see if we would survive. The final data format looks like this. Just append a row with your own data to the test data:

For example, I would have traveled second class (realistically third, but if I am on the Titanic anyway, why not second?), so Pclass is 2. Sex is 1 and Age falls in range 3. SibSp is the number of siblings and spouses aboard the ship and Parch is the number of parents and children aboard; since I am probably traveling alone, both are set to 0. Fare should be 2 and Embarked is 0.
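Appending such a row could look like the sketch below. The column order in `FEATURES` and the two existing test rows are assumptions for illustration; the values in `me` are the ones described above.

```python
import pandas as pd

# Assumed column order of the preprocessed test data.
FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Two made-up rows standing in for the existing test data.
test_rows = pd.DataFrame(
    [[3, 0, 2, 0, 0, 0, 2], [1, 1, 5, 1, 0, 3, 0]],
    columns=FEATURES,
)

# The row described above: second class, Sex 1, age range 3, traveling alone.
me = pd.DataFrame(
    [{"Pclass": 2, "Sex": 1, "Age": 3, "SibSp": 0,
      "Parch": 0, "Fare": 2, "Embarked": 0}],
    columns=FEATURES,
)

# ignore_index renumbers the rows so the new one simply comes last.
with_me = pd.concat([test_rows, me], ignore_index=True)
print(with_me.tail(1))
```

A trained model’s `predict` can then be called on `with_me.tail(1)` to get the prediction for the appended row.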

>> python 1.py  
1

God, I survived; that was not easy (jokingly, I wonder if it was the change of cabin). Give it a try yourself. The complete code is too long to publish here. You can find it in the backend of the Python useful library.

This article draws on Towardsdatascience.com/predicting-…

That’s the end of this article. If you enjoyed today’s Python tutorial, please keep following us, and if it helped, please give us a thumbs up below. If you have any questions, leave them in the comments below and we will answer them patiently!

