Kaggle
Kaggle is a platform for data scientists to share data, exchange ideas and compete. It is often viewed as unsuitable for beginners, or as a rough route into the field.
There is some truth to that. Kaggle competitions do pose challenges for people like you and me who are just starting out. As a junior data scientist, I couldn’t help but start my journey by searching for interesting data sets on Kaggle, which is how I found the Titanic data set.
Titanic
The data set contains information about passengers on the Titanic.
I used Python to visualize and explore the data set, then trained a set of classifiers with scikit-learn to predict a passenger’s chances of survival. The model is saved with pickle and deployed as a web application on a local host using Flask. Finally, I used AWS to host it online.
The code can be found on GitHub.
1. Data check
First things first. I imported the data into a pandas DataFrame. It includes passenger ID, survival status, ticket class, name, sex, age, number of siblings and spouses on board, number of parents and children on board, ticket number, passenger fare, cabin number and port of embarkation. The first five rows of data are shown in the figure below.
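To make this step concrete, here is a minimal sketch of the loading step. The inline rows below are the familiar first few records, standing in for a `pd.read_csv("train.csv")` call on the actual Kaggle file:

```python
import pandas as pd

# A few illustrative rows with the same columns as Kaggle's train.csv
# (normally you would load the full file with pd.read_csv("train.csv")).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived":    [0, 1, 1, 1, 0],
    "Pclass":      [3, 1, 3, 1, 3],
    "Name":        ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
                    "Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath",
                    "Allen, Mr. William Henry"],
    "Sex":         ["male", "female", "female", "female", "male"],
    "Age":         [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp":       [1, 1, 0, 1, 0],
    "Parch":       [0, 0, 0, 0, 0],
    "Ticket":      ["A/5 21171", "PC 17599", "STON/O2. 3101282",
                    "113803", "373450"],
    "Fare":        [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin":       [None, "C85", None, "C123", None],
    "Embarked":    ["S", "C", "S", "S", "S"],
})
print(df.head())
```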
What you can immediately observe:
- Every PassengerId is unique.
- Survived is the target we want to infer.
- Name is probably not useful as-is.
- Ticket is a free-form ticket identifier.
- Missing values are marked as NaN.
For the sake of simplicity, I decided to drop the Ticket field for now. It may contain useful information, but extensive feature engineering would be required to extract it. Let’s start with the easiest steps.
Now let’s take a closer look at the missing data. The variables Embarked and Fare have only a few missing entries, but about 20 percent of passengers’ ages are not recorded. This could pose a problem, as Age may be one of the key predictors in the data set: “women and children first” was the norm at the time, and reports show that they were indeed saved first. More than 77% of Cabin entries are missing and unlikely to be helpful, so let’s remove that column first.
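A quick sketch of this check, on toy values standing in for the real data (the counts here are illustrative, not the actual Titanic totals):

```python
import pandas as pd
import numpy as np

# Minimal frame standing in for the Titanic data (toy values)
df = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin":    [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S"],
    "Fare":     [7.25, 71.28, 7.93, 53.10, np.nan],
})

# Count missing entries per column
missing = df.isnull().sum()
print(missing)

# Cabin is mostly missing in the full data set, so drop it
df = df.drop(columns=["Cabin"])
```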
2. Data visualization
The pair plot (not shown below) is usually my first choice at the beginning of a data visualization task, because it is very helpful for very little code. A single line of seaborn.pairplot() gives you a grid of $n^2$ plots (of which $n(n+1)/2$ are distinct), where n is the number of variables. It gives you a basic picture of the relationship between each pair of variables and of the distribution of each variable itself. Let’s look at the different variables.
First, the relationship between the target variable Survived and each predictor is examined one by one. With seaborn.countplot(), we find that most passengers travelled in third class, which is not surprising; in general, they were less likely to survive. Even with this single predictor, all else unknown, we can infer that first-class passengers are more likely to survive and third-class passengers less so.
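A sketch of that count plot, again on toy counts rather than the real totals:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")                      # render off-screen
import seaborn as sns

df = pd.DataFrame({                        # toy stand-in for the full data
    "Pclass":   [1, 1, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1, 0],
})

# One bar per class, split by survival outcome
ax = sns.countplot(data=df, x="Pclass", hue="Survived")
ax.figure.savefig("pclass_counts.png")

counts = df["Pclass"].value_counts()       # third class dominates
```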
At the same time, women and children were more likely to survive, consistent with the “women and children first” account mentioned earlier. If we consider only the three variables Pclass, Sex and Age, young female passengers in first class would be the most likely to survive.
The density plots from seaborn.kdeplot() are a little harder to interpret. The Fare distributions for both the “survived” and “did not survive” groups have a wide span, while the “did not survive” group has a smaller mean and variance. It is worth noting the interesting tail in the “survived” distribution: three passengers paid $512 each for first-class tickets. They all boarded at Cherbourg, and they all survived.
The port of embarkation also seems to play a role in determining who survives. Most passengers embarked at Southampton, the first leg of the journey, and their survival rate is the lowest. Maybe they were assigned cabins farther from the exits, or perhaps spending more time on board made people relaxed or tired. Or maybe it is just an indirect effect of a third variable, such as fewer women, children and first-class passengers boarding at the first port. Further investigation is needed.
If you prefer tables to graphs, we can also summarize the data with pandas.DataFrame.groupby() and take the mean within each group. However, I don’t see a clear pattern in the Parch table below.
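A sketch of that groupby summary. Averaging the 0/1 Survived column within each group gives the survival rate per group (toy values again):

```python
import pandas as pd

df = pd.DataFrame({                        # toy stand-in for the full data
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 target per class = survival rate per class
rates = df.groupby("Pclass")["Survived"].mean()
print(rates)   # 1 -> 1.00, 2 -> 0.50, 3 -> 0.25
```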
The correlation matrix rendered with seaborn.heatmap() illustrates the strength of the correlation between any two variables. As you can see, Sex and Survived have the highest correlation, while Fare and Pclass are strongly (negatively) correlated. SibSp and Parch don’t seem to play a big role in predicting a person’s chances of survival, even though our intuition tells us otherwise.
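A sketch of that heatmap. Note that Sex has to be numerically encoded before it can appear in a correlation matrix (toy values below):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")                      # render off-screen
import seaborn as sns

df = pd.DataFrame({                        # toy stand-in for the full data
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Fare":     [7.3, 71.3, 53.1, 8.1, 13.0, 7.9],
    "Sex":      [0, 1, 1, 0, 1, 1],        # 0 = male, 1 = female (encoded)
})

corr = df.corr()
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
ax.figure.savefig("corr.png")
```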
3. Fill in missing data
We found many missing entries in our earlier data check. For example, we don’t seem to know how much Thomas Storey, 60, paid for his ticket. Intuition tells us that ticket prices depend largely on ticket class and port of embarkation, and we can cross-check this with the correlation matrix above. So we simply take the average fare of third-class passengers who embarked at Southampton. This is just an educated guess and may be wrong, but it is good enough. Remember that it is impossible to get noise-free data, and machine learning models should be robust to noise.
There were also two women whose port of embarkation we don’t know. Ticket class is closely related to fare, and since they both paid $80 for first-class tickets, my guess is that they boarded at Cherbourg (C in the figure).
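Both fills can be sketched in a few lines of pandas (toy values; in the real data the relevant group average is computed the same way):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({                        # toy stand-in for the full data
    "Pclass":   [3, 3, 3, 1, 1],
    "Embarked": ["S", "S", "S", np.nan, "C"],
    "Fare":     [7.0, 9.0, np.nan, 80.0, 80.0],
})

# Missing Fare: fill with the mean fare of third-class Southampton passengers
mask = (df["Pclass"] == 3) & (df["Embarked"] == "S")
df["Fare"] = df["Fare"].fillna(df.loc[mask, "Fare"].mean())

# Missing Embarked: the $80 first-class passengers most plausibly
# boarded at Cherbourg
df["Embarked"] = df["Embarked"].fillna("C")
```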
If a variable has only a few missing entries, we can use the technique above and make an educated guess by filling in the most likely value. It would be much more dangerous to do the same when far more data is missing, as with the 20 percent of Age values.
We have guessed as much as we can at this point. Since we discarded Cabin and filled in the other missing entries, we can use all the remaining variables to infer the missing Age values with a random forest regressor: train it on the 80% of passengers whose Age is known, then predict the remaining 20%.
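A sketch of that imputation with scikit-learn. The data here is synthetic (Age is generated from the other columns plus noise) just to make the mechanics runnable; the feature list is an assumption standing in for the real predictors:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "SibSp":  rng.integers(0, 3, n),
    "Fare":   rng.uniform(5, 100, n),
})
# Synthetic Age loosely tied to the other columns (stand-in for real data)
df["Age"] = 50 - 8 * df["Pclass"] + 0.1 * df["Fare"] + rng.normal(0, 3, n)
df.loc[rng.choice(n, 40, replace=False), "Age"] = np.nan   # ~20% missing

features = ["Pclass", "SibSp", "Fare"]
known = df[df["Age"].notna()]
unknown = df[df["Age"].isna()]

# Fit on rows with a known Age, predict the rows where it is missing
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(known[features], known["Age"])
df.loc[df["Age"].isna(), "Age"] = reg.predict(unknown[features])
```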
4. Feature engineering
A passenger’s title can be extracted from the Name field. While most titles are “Mr.”, “Mrs.” and “Miss”, there are also some less common ones, such as “Dr.”, “Reverend” and “Colonel”, some of which occur only once or twice, like “Ms.” and “Dona”. Such rare titles don’t help with model training: to find patterns, you need data. Let’s group these relatively rare titles under a single “Rare” category.
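One way to sketch this extraction: the title sits between the comma and the first period in each name, so a small regex picks it out, and anything outside a short whitelist is relabeled “Rare” (the exact whitelist below is an assumption):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Behr, Rev. Karl Howell",     # example of a rare title
])

# Title = text between the comma and the period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

common = {"Mr", "Mrs", "Miss", "Master"}  # assumed whitelist of common titles
titles = titles.where(titles.isin(common), "Rare")
print(titles.tolist())   # ['Mr', 'Mrs', 'Miss', 'Rare']
```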
Categorical data requires extra care before model training. A classifier cannot handle string inputs such as “Mr” or “Southampton”. Although we could map them to integers such as (‘Mr’, ‘Miss’, ‘Mrs’, ‘Rare’) → (1, 2, 3, 4), there is no inherent hierarchy among titles; being a doctor doesn’t make you superior. To avoid misleading the machine and accidentally building a biased model, we should one-hot encode them instead, so the four titles become:
(1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1)
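In pandas this is a one-liner with `get_dummies`, which turns each category into its own 0/1 column with no implied ordering:

```python
import pandas as pd

titles = pd.Series(["Mr", "Miss", "Mrs", "Rare"], name="Title")

# One 0/1 indicator column per title
onehot = pd.get_dummies(titles, prefix="Title")
print(onehot)
```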
I also decided to add two new variables, FamilySize and IsAlone. FamilySize = SibSp + Parch + 1 makes sense because whole families would stay together on board. Travelling alone may also be a key factor: you may be more prone to rash decisions, or more flexible in a disaster without a family to care for. Adding the variables one at a time, I found that each improved the model’s overall predictive power.
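Both features are one line each:

```python
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Passenger plus siblings/spouses plus parents/children on board
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
# 1 if the passenger travelled with no family at all
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
```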
5. Model evaluation
I tried the most popular classifiers I know of: random forest, support vector machine, k-nearest neighbors, AdaBoost and so on.
XGBoost came out on top with 87% accuracy. To improve the robustness of the final prediction, we train a group of classifiers with different properties and obtain the result by majority voting.
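That majority-vote setup is what scikit-learn's `VotingClassifier` provides. A minimal sketch, using synthetic stand-in data and three assumed base models (the original also included XGBoost, omitted here to keep the example to scikit-learn alone):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",   # each model casts one vote; majority wins
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```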
Finally, I submitted the predictions to Kaggle and got an 80% accuracy score. Not bad, and there is always room for improvement. For example, there is surely some useful information hidden in Cabin and Ticket, which we discarded for simplicity, and we could engineer more features. But I’ll leave that for now.
6. Deploy as a Web application
Flask is an easy-to-use Web framework in Python.
```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "<h1>Write something here.</h1>"

app.run(host='0.0.0.0', port=60000)
```
You can browse it on your local host.
What else do we need? We want people to fill out a form to collect the required data and pass it to the machine learning model. The model will have an output, and we will redirect the user to that page.
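The flow can be sketched as a single route that serves the form on GET and runs the model on POST. The `predict_survival` helper below is a hypothetical stand-in for the real pickled scikit-learn model, and the field names are assumptions (note the form above encodes 0 as Female):

```python
from flask import Flask, request

app = Flask(__name__)

def predict_survival(pclass, sex, fare):
    """Hypothetical stand-in for the pickled model.
    Toy rule only: first-class women (sex == 0) survive."""
    return int(sex == 0 and pclass == 1)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        # Pull the submitted fields and feed them to the model
        pclass = int(request.form["pclass"])
        sex = int(request.form["sex"])
        fare = float(request.form["fare"])
        outcome = predict_survival(pclass, sex, fare)
        label = "Survived" if outcome else "Did not survive"
        return f"<h1>Predicted: {label}</h1>"
    return "<form method='post'>...</form>"
```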
We’ll use WTForms to build a form in Python, with a single form defined by a class that looks like this:
```python
from wtforms import (Form, TextField, validators, SubmitField,
                     DecimalField, IntegerField, SelectField)

class ReusableForm(Form):
    sex = SelectField('Sex:', choices=[('1', 'Male'), ('0', 'Female')],
                      validators=[validators.InputRequired()])
    fare = DecimalField('Passenger Fare:', default=33, places=1,
                        validators=[validators.InputRequired(),
                                    validators.NumberRange(min=0, max=512,
                                        message='Fare must be between 0 and 512')])
    submit = SubmitField('Predict')
```
I found an HTML template from WillKoehrsen and built on top of it.
7. Cloud hosting
The page is now viewable through my local host and everything works. The last step is hosting it online. There are three major cloud hosting services: AWS, GCP and Azure. AWS is by far the most popular, so I chose its 12-month free tier.
I used my private key to connect to a Linux server instance, moved my repository to the server, ran my script, and it worked!
Not bad for me…