This article documents an introductory machine learning exercise: binary classification with a decision tree, tested on the classic Titanic competition from Kaggle.

1. Data sets

The Titanic data set from Kaggle is used. It is split into:

  • Training Set (train.csv)
  • Test set (test.csv)
  • Submission template (gender_submission.csv)

Since accessing Kaggle requires getting around network restrictions in some regions, the original data set has also been uploaded to GitHub.

2. Data processing

First import the training set and check the data:

from sklearn.tree import DecisionTreeClassifier  # import the decision tree classifier
from sklearn.model_selection import cross_val_score,train_test_split,GridSearchCV  # cross validation, train/test split, and grid search
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('/Users/liz/code/jupyter-notebook/sklearn/1- DecisionTree/Titanic_train.csv') # import data set
data.head()  # display the first five rows of the data set
[out]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

From the data shown above, the task is to use the Survived column as the label and the remaining columns as the features, i.e. to predict the label from the known features. In practical terms, this data set is about predicting whether each passenger survived based on the known data.

data.info()  # look at the whole training set
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Data analysis
  1. The output above shows 891 rows. The features with missing values are Age, Cabin, and Embarked; the non-numeric features are Name, Sex, Ticket, Cabin, and Embarked.
  2. When using the existing features to predict survival, some features that are troublesome to process and carry little information are dropped. For example, Name and Ticket can be omitted, because a passenger's name and ticket number have little effect on whether they survived. Another reason is that both are non-numeric, which makes them complicated to encode (the model ultimately needs all inputs in numeric form).
  3. Because Cabin has a large number of missing values, it is dropped as well, for the same reasons as above.
  4. Although Sex is also non-numeric, gender clearly affects the chance of survival in practice, so it is kept.
  5. Fill in the missing values and convert the remaining non-numeric data to numeric form.
data.drop(['Name','Cabin','Ticket'],inplace=True,axis=1)  # drop the Name, Cabin and Ticket columns
data.loc[:,'Age'] = data['Age'].fillna(int(data['Age'].mean()))  # fill missing Age values with the mean age
# Embarked has only two missing values, so those rows are simply dropped
data = data.dropna()
data = data.reset_index(drop = True)
data['Sex'] = (data['Sex'] == 'male').astype(int)  # 1 = male, 0 = female
tags = data['Embarked'].unique().tolist()  # tags: ['S', 'C', 'Q']
data.iloc[:,data.columns == 'Embarked'] = data['Embarked'].apply(lambda x : tags.index(x))  # encode Embarked as its index in tags
data.info()  # check the data again
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Sex            889 non-null int64
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Fare           889 non-null float64
Embarked       889 non-null int64
dtypes: float64(2), int64(7)
x = data.iloc[:,data.columns != 'Survived']  # everything except the Survived column becomes the feature matrix x
y = data.iloc[:,data.columns == 'Survived']  # the Survived column becomes the label y
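A quick sanity check (not part of the original notebook) confirms that the cleaned data has the expected shape and no remaining missing values:

print(x.shape, y.shape)  # expected: (889, 8) (889, 1)
print(data.isnull().sum().sum())  # expected: 0, no missing values remain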
3. Model training

Idea: use cross-validation to evaluate the model, and use grid search to find the best combination of hyperparameters for the decision tree.
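Before the grid search, cross_val_score (imported at the top but not shown in use in the original snippets) can give a quick baseline for an untuned tree; a minimal sketch, assuming the same 10-fold setting as the grid search below:

# Baseline: 10-fold cross validation on an untuned tree (sketch, not part of the original code)
clf_baseline = DecisionTreeClassifier(random_state=30)
baseline_scores = cross_val_score(clf_baseline, x, y, cv=10)
print(baseline_scores.mean())  # mean accuracy before any tuning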

# Grid search: a technique that lets us tune multiple parameters at once; essentially an enumeration over the parameter grid.
# parameters: the grid of candidate values to search over.
parameters = {'splitter': ('best','random')
             ,'criterion': ('gini','entropy')
             ,'max_depth': [*range(1,10)]
             ,'min_samples_leaf': [*range(1,50,5)]
             ,'min_impurity_decrease': [*np.linspace(0,0.5,20)]
             }  # example grid; the more parameters to tune, the longer the search takes
clf = DecisionTreeClassifier(random_state=30)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)  # x_train/y_train are not defined in the original snippets; a 0.3 hold-out split is assumed here
GS = GridSearchCV(clf,parameters,cv=10)  # cv=10: 10-fold cross validation
GS = GS.fit(x_train,y_train)

# Optimal parameters
GS.best_params_
out:
    {'criterion': 'gini', 'max_depth': 3, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'splitter': 'best'}
    
# Best score
GS.best_score_

After determining the optimal parameter values, train the model:

# Train the model, instantiating the classifier with the best parameter values found above
clf_model = DecisionTreeClassifier(criterion='gini'
                                  ,max_depth=3
                                  ,min_samples_leaf=1
                                  ,min_impurity_decrease=0
                                  ,splitter='best'
                                  )
clf_model = clf_model.fit(x,y)
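matplotlib is imported at the top but never used in the snippets above. To inspect the fitted tree, a minimal sketch using sklearn.tree.plot_tree (available from scikit-learn 0.21; this visualization is not part of the original code):

# Visualize the trained tree (sketch; assumes scikit-learn >= 0.21)
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 6))
plot_tree(clf_model, feature_names=list(x.columns), class_names=['died', 'survived'], filled=True)
plt.show()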

Export model:

# Export model
from sklearn.externals import joblib
joblib.dump(clf_model,'/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/clf_model.m')
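Note that sklearn.externals.joblib is deprecated and was removed in scikit-learn 0.23; on newer versions the standalone joblib package provides the same dump/load API:

# Equivalent export with the standalone joblib package (sketch for newer scikit-learn versions)
import joblib
joblib.dump(clf_model, '/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/clf_model.m')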

Processing of test sets:

data_test = pd.read_csv('/Users/liz/code/jupyter-notebook/sklearn/1- DecisionTree/Titanic_test.csv')
data_test.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

# The test set is processed the same way as the training set and must keep the same features.
data_test.drop(['Name','Ticket','Cabin'],inplace=True,axis=1)
data_test['Age'] = data_test['Age'].fillna(int(data_test['Age'].mean()))
data_test['Fare'] = data_test['Fare'].fillna(int(data_test['Fare'].mean()))
data_test.loc[:,'Sex'] = (data_test['Sex'] == 'male').astype(int)
tags = data_test['Embarked'].unique().tolist()
data_test['Embarked'] = data_test['Embarked'].apply(lambda x : tags.index(x))
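One caveat with the block above: tags is rebuilt from the test set, so the integer codes for Embarked only line up with the training set if the categories happen to appear in the same order. A safer variant (a sketch, not the original code) replaces the last two lines by reusing the order observed in the training data:

# Reuse the Embarked order from the training set so both sets share the same encoding (sketch)
train_tags = ['S', 'C', 'Q']  # the order produced by data['Embarked'].unique().tolist() on the training set
data_test['Embarked'] = data_test['Embarked'].apply(lambda x: train_tags.index(x))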

At this point the test set has been preprocessed; now load the exported model and run it on the test data:

# Load the exported model and predict on the test data set
model = joblib.load('/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/clf_model.m')
Survived = model.predict(data_test) # Test results
# Generate data
Survived = pd.DataFrame({'Survived':Survived})  # wrap the predictions in a DataFrame so they can be exported as CSV later
PassengerId = data_test.iloc[:,data_test.columns == 'PassengerId']  # slice out the PassengerId column
gender_submission = pd.concat([PassengerId,Survived],axis=1)  # join Survived to PassengerId column-wise

# Export data
gender_submission.index = np.arange(1, len(gender_submission)+1) # index starts at 1
gender_submission.to_csv('/Users/liz/Code/jupyter-notebook/sklearn/1- DecisionTree/gender_submission.csv',index=False) # index=False, the index is not displayed when exporting

Exported file:

PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
. . .
413 1305 0
414 1306 1
415 1307 0
416 1308 0
417 1309 0

418 rows × 2 columns

Submit your results to Kaggle for final score:

The final score is 0.77990, which is not high; the top of the leaderboard is a perfect score. This article is only meant as an introduction to machine learning and to Kaggle.

The final source code and the Kaggle data sets have been uploaded to my GitHub repository, along with some notes collected from around the web. The repository will be updated continuously...

Appendix

GitHub repository: github.com/ChemLez/ML-…