This article documents an introductory machine learning exercise: binary classification with a decision tree, tested on the classic Titanic competition from Kaggle.

1. Data sets

The Titanic data set from Kaggle is used. It is split into:

  • Training Set (train.csv)
  • Test set (test.csv)
  • Submission template (gender_submission.csv)

Since accessing Kaggle requires getting around network restrictions in some regions, the original data set has also been uploaded to GitHub.

2. Data processing

First import the training set and check the data:

from sklearn.tree import DecisionTreeClassifier  # import the decision tree classifier
from sklearn.model_selection import cross_val_score,train_test_split,GridSearchCV  # cross validation, train/test split, and grid search
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('/Users/liz/code/jupyter-notebook/sklearn/1- DecisionTree/Titanic_train.csv') # import data set
data.head()  # display the first five rows of the data set
[out]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

From the data shown above, the task is to use the Survived column as the label and the remaining columns as the features, i.e. to predict the label from the known features. In practical terms, this data set is about predicting whether each passenger survived based on the known data.

data.info()  # look at the whole training set
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Data analysis
  1. The output above shows 891 rows. The features with missing values are Age, Cabin, and Embarked; the non-numeric features are Name, Sex, Ticket, Cabin, and Embarked.
  2. When using the existing features to predict survival, some features that are troublesome to process and carry little information are dropped. For example, Name and Ticket can be omitted, because a passenger's name and ticket number have little effect on whether they survived. Another reason is that both are non-numeric, which makes them complicated to encode (the model ultimately needs all inputs in numeric form).
  3. Because Cabin has a large number of missing values, it is dropped as well, for the same reasons as above.
  4. Although Sex is also non-numeric, gender clearly affects the chance of survival in practice, so it is kept.
  5. Fill in the missing values and convert the remaining non-numeric data to numeric form.
data.drop(['Name','Cabin','Ticket'],inplace=True,axis=1)  # drop the Name, Cabin and Ticket columns
data.loc[:,'Age'] = data['Age'].fillna(int(data['Age'].mean()))  # fill missing Age values with the mean age
# Embarked has only two missing values, so those rows are simply dropped
data = data.dropna()
data = data.reset_index(drop = True)
data['Sex'] = (data['Sex'] == 'male').astype(int)  # 1 = male, 0 = female
tags = data['Embarked'].unique().tolist()  # tags: ['S', 'C', 'Q']
data.iloc[:,data.columns == 'Embarked'] = data['Embarked'].apply(lambda x : tags.index(x))  # encode Embarked as its index in tags
data.info()  # check the data again
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Sex            889 non-null int64
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Fare           889 non-null float64
Embarked       889 non-null int64
dtypes: float64(2), int64(7)
x = data.iloc[:,data.columns != 'Survived']  # everything except the Survived column becomes the feature matrix x
y = data.iloc[:,data.columns == 'Survived']  # the Survived column becomes the label y
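A quick sanity check (not part of the original notebook) confirms that the cleaned data has the expected shape and no remaining missing values:

print(x.shape, y.shape)  # expected: (889, 8) (889, 1)
print(data.isnull().sum().sum())  # expected: 0, no missing values remain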
3. Model training

Idea: use cross-validation to evaluate the model, and use grid search to find the best combination of hyperparameters for the decision tree.
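Before the grid search, cross_val_score (imported at the top but not shown in use in the original snippets) can give a quick baseline for an untuned tree; a minimal sketch, assuming the same 10-fold setting as the grid search below:

# Baseline: 10-fold cross validation on an untuned tree (sketch, not part of the original code)
clf_baseline = DecisionTreeClassifier(random_state=30)
baseline_scores = cross_val_score(clf_baseline, x, y, cv=10)
print(baseline_scores.mean())  # mean accuracy before any tuning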

# Grid search: a technique that lets us tune multiple parameters at once; essentially an enumeration over the parameter grid.
# parameters: the grid of candidate values to search over.
parameters = {'splitter': ('best','random')
             ,'criterion': ('gini','entropy')
             ,'max_depth': [*range(1,10)]
             ,'min_samples_leaf': [*range(1,50,5)]
             ,'min_impurity_decrease': [*np.linspace(0,0.5,20)]
             }  # example grid; the more parameters to tune, the longer the search takes
clf = DecisionTreeClassifier(random_state=30)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)  # x_train/y_train are not defined in the original snippets; a 0.3 hold-out split is assumed here
GS = GridSearchCV(clf,parameters,cv=10)  # cv=10: 10-fold cross validation
GS = GS.fit(x_train,y_train)

# Optimal parameters
GS.best_params_
out:
    {'criterion': 'gini', 'max_depth': 3, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'splitter': 'best'}
    
# Best score
GS.best_score_

After determining the optimal parameter values, train the model:

# Train the model, instantiating the classifier with the best parameter values found above
clf_model = DecisionTreeClassifier(criterion='gini'
                                  ,max_depth=3
                                  ,min_samples_leaf=1
                                  ,min_impurity_decrease=0
                                  ,splitter='best'
                                  )
clf_model = clf_model.fit(x,y)
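matplotlib is imported at the top but never used in the snippets above. To inspect the fitted tree, a minimal sketch using sklearn.tree.plot_tree (available from scikit-learn 0.21; this visualization is not part of the original code):

# Visualize the trained tree (sketch; assumes scikit-learn >= 0.21)
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 6))
plot_tree(clf_model, feature_names=list(x.columns), class_names=['died', 'survived'], filled=True)
plt.show()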

Export model:

# Export model
from sklearn.externals import joblib
joblib.dump(clf_model,'/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/clf_model.m')
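Note that sklearn.externals.joblib is deprecated and was removed in scikit-learn 0.23; on newer versions the standalone joblib package provides the same dump/load API:

# Equivalent export with the standalone joblib package (sketch for newer scikit-learn versions)
import joblib
joblib.dump(clf_model, '/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/clf_model.m')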

Processing of test sets:

data_test = pd.read_csv('/Users/liz/code/jupyter-notebook/sklearn/1- DecisionTree/Titanic_test.csv')
data_test.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

# The test set is processed the same way as the training set and must keep the same features.
data_test.drop(['Name','Ticket','Cabin'],inplace=True,axis=1)
data_test['Age'] = data_test['Age'].fillna(int(data_test['Age'].mean()))
data_test['Fare'] = data_test['Fare'].fillna(int(data_test['Fare'].mean()))
data_test.loc[:,'Sex'] = (data_test['Sex'] == 'male').astype(int)
tags = data_test['Embarked'].unique().tolist()
data_test['Embarked'] = data_test['Embarked'].apply(lambda x : tags.index(x))
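One caveat with the block above: tags is rebuilt from the test set, so the integer codes for Embarked only line up with the training set if the categories happen to appear in the same order. A safer variant (a sketch, not the original code) replaces the last two lines by reusing the order observed in the training data:

# Reuse the Embarked order from the training set so both sets share the same encoding (sketch)
train_tags = ['S', 'C', 'Q']  # the order produced by data['Embarked'].unique().tolist() on the training set
data_test['Embarked'] = data_test['Embarked'].apply(lambda x: train_tags.index(x))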

At this point the test set has been preprocessed; now load the exported model and run it on the test data:

# Load the exported model and predict on the test data set
model = joblib.load('/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/clf_model.m')
Survived = model.predict(data_test) # Test results
# Generate data
Survived = pd.DataFrame({'Survived':Survived})  # wrap the predictions in a DataFrame so they can be exported as CSV later
PassengerId = data_test.iloc[:,data_test.columns == 'PassengerId']  # slice out the PassengerId column
gender_submission = pd.concat([PassengerId,Survived],axis=1)  # join Survived to PassengerId column-wise

# Export data
gender_submission.index = np.arange(1, len(gender_submission)+1) # index starts at 1
gender_submission.to_csv('/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/gender_submission.csv',index=False) # index=False, the index is not displayed when exporting

Exported file:

PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
. . .
413 1305 0
414 1306 1
415 1307 0
416 1308 0
417 1309 0

418 rows × 2 columns

Submit your results to Kaggle for final score:

The final score is 0.77990, which is not high; the top of the leaderboard is a perfect score. This article is only meant as an introduction to machine learning and to Kaggle.

The final source code and the Kaggle data sets have been uploaded to my GitHub repository, along with some notes collected from around the web. The repository will be updated continuously...

Appendix

GitHub repository: github.com/ChemLez/ML-…