Decision tree algorithm
DecisionTreeClassifier(criterion='entropy') – the criterion parameter chooses the splitting measure: 'entropy' is information entropy, based on the ID3 algorithm (in practice the results differ little from C4.5); 'gini' is the Gini index, used by the CART algorithm
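A minimal sketch of selecting the two criteria (variable names are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# criterion='entropy' -> ID3-style information-gain splitting
clf_entropy = DecisionTreeClassifier(criterion='entropy')

# criterion='gini' (the default) -> CART-style Gini index splitting
clf_gini = DecisionTreeClassifier(criterion='gini')
```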
Key processes for survival prediction
- Preparation stage
- Data exploration – analysis of data quality
- info() – basic information about the data table: number of rows, number of columns, data type of each column, data completeness
- describe() – statistics of the table: count, mean, standard deviation, minimum, maximum
- describe(include=['O']) – statistics for string (non-numeric) columns
- head() – the first few rows of data
- tail() – the last few rows of data
- Data cleaning
- df['XX'].fillna(df['XX'].mean(), inplace=True) for numeric columns; df['XX'].fillna(df['XX'].value_counts().idxmax(), inplace=True) to fill a categorical column with its most frequent value
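The exploration and cleaning steps above can be sketched on a toy DataFrame (the column names Age/Fare/Embarked are illustrative assumptions, not the real data set):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the survival data set
df = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 8.05, np.nan],
    'Embarked': ['S', 'C', 'S', None],
})

# Data exploration
df.info()                          # rows, columns, dtypes, completeness
print(df.describe())               # count, mean, std, min, max (numeric columns)
print(df.describe(include=['O']))  # statistics for object (string) columns
print(df.head())                   # first few rows
print(df.tail())                   # last few rows

# Data cleaning: numeric gaps -> column mean,
# categorical gaps -> most frequent value
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().idxmax())
```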
- Feature selection – data dimension reduction to facilitate subsequent classification operations
- Filter out meaningless columns
- Filter out columns with many missing values
- Put the remaining features into the feature vector
- Data conversion – convert character-type columns to numeric columns for easier subsequent operations – use the DictVectorizer class to convert
DictVectorizer: represents symbols as 0/1 numbers – instantiate a converter: dvec = DictVectorizer(sparse=False) – sparse=False means no sparse matrix is used (a sparse matrix stores only the positions of non-zero values); one-hot encoding treats categories equally, with no ordering between them
- Call fit_transform() on the result of to_dict(orient='records'), which converts the DataFrame to a list of dicts
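A short sketch of the conversion (the Sex/Age columns are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Illustrative feature table with one categorical and one numeric column
features = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                         'Age': [22.0, 38.0, 26.0]})

# sparse=False returns a dense ndarray instead of a sparse matrix
dvec = DictVectorizer(sparse=False)

# to_dict(orient='records') turns the DataFrame into a list of dicts,
# which is the input format fit_transform expects
X = dvec.fit_transform(features.to_dict(orient='records'))

print(dvec.feature_names_)  # ['Age', 'Sex=female', 'Sex=male']
print(X)                    # Sex one-hot encoded, Age passed through
```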
- Classification stage
- Decision tree model
- Import the decision tree model
- Generate the decision tree by fitting the training data
- Model evaluation & prediction
- Prediction – the decision tree outputs the prediction results
- Evaluation – when the true labels are known: clf.score(features, labels); when they are not known: K-fold cross validation – cross_val_score
- Decision tree visualization
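The classification stage can be sketched end to end; the iris data set stands in for the survival features here, and clf is an assumed variable name:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the survival features/labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the training data to generate the decision tree
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

# Prediction: the tree outputs a label for each test row
pred = clf.predict(X_test)

# Evaluation when the true labels are known
acc = clf.score(X_test, y_test)

# When no held-out labels are available: K-fold cross validation
cv_scores = cross_val_score(clf, X, y, cv=10)
print(acc, cv_scores.mean())
```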
- Drawing stage – GraphViz
- Install graphviz first
- Import the graphviz package – import graphviz
- Import export_graphviz from sklearn
- First use export_graphviz to export the data of the decision tree model to be displayed
- Load the exported data with graphviz.Source
- Display the graph
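A sketch of the drawing steps; the rendering lines are left as comments because they require the graphviz package and the Graphviz binary to be installed:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Train a small tree to export (iris stands in for the survival data)
iris = load_iris()
clf = DecisionTreeClassifier(criterion='entropy').fit(iris.data, iris.target)

# Export the trained tree as DOT-format text
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True)

# With graphviz installed, the DOT data can be rendered and displayed:
#   import graphviz
#   graph = graphviz.Source(dot_data)
#   graph.view()              # open the rendered graph
#   graph.render('tree')      # or write it to a file
print(dot_data[:80])
```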
K-fold cross validation – most of the samples are used for training and a small portion for validating the classifier; over K rounds, 1/K of the data is selected for validation each time and the rest is used for training, taking each part in turn and averaging the results
1. Divide the data set evenly into K equal parts
2. Use one part as the test data and the rest as the training data
3. Calculate the test accuracy
4. Repeat steps 2 and 3 with a different test part each time
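The four steps above can be sketched with sklearn's KFold (iris stands in for the data set):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
K = 10

# Step 1: divide the data set evenly into K parts (shuffled first)
kf = KFold(n_splits=K, shuffle=True, random_state=42)

accuracies = []
for train_idx, test_idx in kf.split(X):
    # Step 2: one part is the test set, the rest is the training set
    clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    # Step 3: calculate the test accuracy on the held-out part
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# Step 4 is the loop itself: each fold serves as the test set once;
# finally, average the K accuracies
print(np.mean(accuracies))
```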