Decision tree algorithm –

DecisionTreeClassifier(criterion='entropy')

  • criterion='entropy' – information gain, as in the ID3 algorithm; in practice the results differ little from C4.5
  • criterion='gini' – Gini impurity, as in the CART algorithm (scikit-learn's default)
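A minimal sketch of the two criterion settings, using the iris dataset as a stand-in (assumes scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion='entropy' -> information gain (ID3-style splits)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0)
# criterion='gini' -> Gini impurity (CART, the default)
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=0)

clf_entropy.fit(X, y)
clf_gini.fit(X, y)
print(clf_entropy.score(X, y), clf_gini.score(X, y))
```

Both criteria usually produce very similar trees; the choice rarely changes accuracy much.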

Key processes for survival prediction

  1. Preparation stage

    1. Data exploration – Analysis of data quality
      1. info() – basic information about the data table: number of rows and columns, the data type of each column, and how complete the data is
      2. describe() – statistics for the numeric columns: count, mean, standard deviation, minimum, maximum
      3. describe(include=['O']) – the same statistics for string (object) columns
      4. head() – the first few rows of data
      5. tail() – the last few rows of data
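The exploration calls above, sketched on a small made-up DataFrame (the column names are illustrative, not from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, 38.0, None, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

df.info()                          # rows, columns, dtypes, non-null counts
print(df.describe())               # count, mean, std, min, max for numeric columns
print(df.describe(include=['O']))  # count, unique, top, freq for object (string) columns
print(df.head(2))                  # first rows
print(df.tail(2))                  # last rows
```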
    2. Data cleaning
      1. df['XX'].fillna(df['XX'].mean(), inplace=True) for numeric columns; for categorical columns, fill with the most frequent value, e.g. df['XX'].fillna(df['XX'].value_counts().index[0], inplace=True)
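A runnable sketch of both fill strategies (the column names 'Age' and 'Embarked' are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, None, 30.0, 38.0],
    'Embarked': ['S', 'C', 'S', None],
})

# Numeric column: fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Categorical column: fill missing values with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().index[0])
print(df)
```

Assignment (`df['Age'] = ...`) is used instead of `inplace=True`, which avoids chained-assignment warnings in recent pandas versions.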
    3. Feature selection – data dimension reduction to facilitate subsequent classification operations
      1. Filter out meaningless columns
      2. Filter out columns with more missing values
      3. Put the remaining features into the feature vector
      4. Data conversion – convert string columns to numeric columns for later processing, using the DictVectorizer class

        DictVectorizer represents symbolic (string) values as 0/1 numbers
        1. Instantiate a converter: dvec = DictVectorizer(sparse=False) – sparse=False means the result is a regular matrix rather than a sparse one; string categories are one-hot encoded so that categories are treated fairly, with none ranked above another
        2. Call fit_transform() on the data converted to a list of dicts with to_dict(orient='records')
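The two steps above, sketched end to end (column names are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'Sex': ['male', 'female'], 'Age': [22.0, 38.0]})

# sparse=False -> plain ndarray; string values get one-hot columns
dvec = DictVectorizer(sparse=False)
features = dvec.fit_transform(df.to_dict(orient='records'))

print(dvec.feature_names_)  # numeric columns pass through; 'Sex' becomes Sex=female / Sex=male
print(features)
```

Note the orient value is 'records' (plural); recent pandas versions reject the misspelling 'record'.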
  2. Classification stage

    1. Decision tree model

      1. Import the decision tree model
      2. Instantiate the decision tree
      3. Generate the decision tree by fitting it to the training data
    2. Model evaluation & prediction
      1. Prediction – Decision tree outputs prediction results
      2. Evaluation – when the true labels are known, use clf.score(features, labels); when they are not, estimate accuracy with K-fold cross-validation (cross_val_score)
    3. Decision tree visualization
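The classification stage above can be sketched as follows, again using iris as a stand-in for the real survival data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X, y)                      # the decision tree is generated by fitting

pred = clf.predict(X)              # prediction: the tree outputs predicted labels
print(clf.score(X, y))             # accuracy when the true labels are known

# Without held-out labels, estimate accuracy with K-fold cross-validation
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())
```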
  3. Drawing stage – GraphViz

    1. Install graphviz first
    2. Import the graphviz package – import graphviz
    3. Import export_graphviz in sklearn
    4. Use export_graphviz to export the fitted decision tree model in DOT format
    5. Load the DOT data as a data source with graphviz
    6. Render and display the graph
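Steps 3–6 above, sketched on a small fitted tree (assumes the `graphviz` Python package is installed; actually rendering the image additionally needs the Graphviz system binaries):

```python
import graphviz
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2).fit(X, y)

# Export the fitted tree as a DOT-format string
dot_data = export_graphviz(clf, out_file=None)
# Get the data source with graphviz
graph = graphviz.Source(dot_data)
# graph.render('decision_tree', format='png')  # writes decision_tree.png (needs the dot binary)
```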

K-fold cross-validation: most of the samples are used for training and a small portion for validating the classifier. The data is split into K parts; in each round, 1/K of the data is held out for validation and the rest is used for training. This is repeated K times, using each part in turn, and the K accuracies are averaged.

  • Divide the data set evenly into K equal parts

  • Use one part as the test data and the rest as the training data

  • Calculate the test accuracy

  • Repeat steps 2 and 3 with a different test set each time
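The four steps above, sketched with sklearn's KFold (K=5, iris as a stand-in dataset):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # divide into K equal parts

accuracies = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                      # rest as training data
    accuracies.append(clf.score(X[test_idx], y[test_idx]))   # test accuracy per fold

print(np.mean(accuracies))  # average over the K rounds
```

`cross_val_score(clf, X, y, cv=5)` wraps this whole loop in one call.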