Decision tree algorithm –

DecisionTreeClassifier(criterion='entropy')

  • criterion='entropy' – information gain, as in the ID3 algorithm; in practice the results differ little from C4.5
  • criterion='gini' – Gini impurity, as in the CART algorithm (scikit-learn's default)
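A minimal sketch of the two criterion settings, using the iris dataset as a stand-in (assumes scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion='entropy' -> information gain (ID3-style splits)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0)
# criterion='gini' -> Gini impurity (CART, the default)
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=0)

clf_entropy.fit(X, y)
clf_gini.fit(X, y)
print(clf_entropy.score(X, y), clf_gini.score(X, y))
```

Both criteria usually produce very similar trees; the choice rarely changes accuracy much.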

Key processes for survival prediction

  1. Preparation stage

    1. Data exploration – Analysis of data quality
      1. info() – basic information about the data table: number of rows and columns, the data type of each column, and how complete the data is
      2. describe() – statistics for the numeric columns: count, mean, standard deviation, minimum, maximum
      3. describe(include=['O']) – the same statistics for string (object) columns
      4. head() – the first few rows of data
      5. tail() – the last few rows of data
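The exploration calls above, sketched on a small made-up DataFrame (the column names are illustrative, not from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, 38.0, None, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

df.info()                          # rows, columns, dtypes, non-null counts
print(df.describe())               # count, mean, std, min, max for numeric columns
print(df.describe(include=['O']))  # count, unique, top, freq for object (string) columns
print(df.head(2))                  # first rows
print(df.tail(2))                  # last rows
```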
    2. Data cleaning
      1. df['XX'].fillna(df['XX'].mean(), inplace=True) for numeric columns; for categorical columns, fill with the most frequent value, e.g. df['XX'].fillna(df['XX'].value_counts().index[0], inplace=True)
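A runnable sketch of both fill strategies (the column names 'Age' and 'Embarked' are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, None, 30.0, 38.0],
    'Embarked': ['S', 'C', 'S', None],
})

# Numeric column: fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Categorical column: fill missing values with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().index[0])
print(df)
```

Assignment (`df['Age'] = ...`) is used instead of `inplace=True`, which avoids chained-assignment warnings in recent pandas versions.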
    3. Feature selection – data dimension reduction to facilitate subsequent classification operations
      1. Filter out meaningless columns
      2. Filter out columns with more missing values
      3. Put the remaining features into the feature vector
      4. Data conversion – convert string columns to numeric columns for later processing, using the DictVectorizer class

        DictVectorizer represents symbolic (string) values as 0/1 numbers
        1. Instantiate a converter: dvec = DictVectorizer(sparse=False) – sparse=False means the result is a regular matrix rather than a sparse one; string categories are one-hot encoded so that categories are treated fairly, with none ranked above another
        2. Call fit_transform() on the data converted to a list of dicts with to_dict(orient='records')
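The two steps above, sketched end to end (column names are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'Sex': ['male', 'female'], 'Age': [22.0, 38.0]})

# sparse=False -> plain ndarray; string values get one-hot columns
dvec = DictVectorizer(sparse=False)
features = dvec.fit_transform(df.to_dict(orient='records'))

print(dvec.feature_names_)  # numeric columns pass through; 'Sex' becomes Sex=female / Sex=male
print(features)
```

Note the orient value is 'records' (plural); recent pandas versions reject the misspelling 'record'.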
  2. Classification stage

    1. Decision tree model

      1. Import the decision tree model
      2. Instantiate the decision tree
      3. Generate the decision tree by fitting it to the training data
    2. Model evaluation & prediction
      1. Prediction – Decision tree outputs prediction results
      2. Evaluation – when the true labels are known, use clf.score(features, labels); when they are not, estimate accuracy with K-fold cross-validation (cross_val_score)
    3. Decision tree visualization
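The classification stage above can be sketched as follows, again using iris as a stand-in for the real survival data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X, y)                      # the decision tree is generated by fitting

pred = clf.predict(X)              # prediction: the tree outputs predicted labels
print(clf.score(X, y))             # accuracy when the true labels are known

# Without held-out labels, estimate accuracy with K-fold cross-validation
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())
```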
  3. Drawing stage – GraphViz

    1. Install graphviz first
    2. Import the graphviz package – import graphviz
    3. Import export_graphviz in sklearn
    4. Use export_graphviz to export the fitted decision tree model in DOT format
    5. Load the DOT data as a data source with graphviz
    6. Render and display the graph
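Steps 3–6 above, sketched on a small fitted tree (assumes the `graphviz` Python package is installed; actually rendering the image additionally needs the Graphviz system binaries):

```python
import graphviz
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2).fit(X, y)

# Export the fitted tree as a DOT-format string
dot_data = export_graphviz(clf, out_file=None)
# Get the data source with graphviz
graph = graphviz.Source(dot_data)
# graph.render('decision_tree', format='png')  # writes decision_tree.png (needs the dot binary)
```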

K-fold cross-validation: most of the samples are used for training and a small portion for validating the classifier. The data is split into K parts; in each round, 1/K of the data is held out for validation and the rest is used for training. This is repeated K times, using each part in turn, and the K accuracies are averaged.

  • Divide the data set evenly into K equal parts

  • Use one part as the test data and the rest as the training data

  • Calculate the test accuracy

  • Repeat steps 2 and 3 with a different test set each time
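The four steps above, sketched with sklearn's KFold (K=5, iris as a stand-in dataset):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # divide into K equal parts

accuracies = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                      # rest as training data
    accuracies.append(clf.score(X[test_idx], y[test_idx]))   # test accuracy per fold

print(np.mean(accuracies))  # average over the K rounds
```

`cross_val_score(clf, X, y, cv=5)` wraps this whole loop in one call.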