Preface

Titanic is a classic data-analysis project. Its small dataset and relatively clean structure make it well suited for practicing modeling and parameter tuning. Titanic was also the author's first data-mining project; returning to this starting point a few months later, I feel keenly that there is still a long way to go and that I should keep moving forward. This article revisits the earlier project with a more complete and systematic approach.

1.1 Jupyter setup, package imports, and dataset loading

Import related modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import ConvergenceWarning
import sklearn
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

To prevent garbled Chinese characters in plots, set a Chinese (SimHei) font for Matplotlib and Seaborn.

mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')

Set the number of rows and columns displayed in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_rows', 30)
pd.set_option('max_columns', 30)

Load data.

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train.shape, df_test.shape
----------------------------
((891, 12), (418, 11))
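The pandas_profiling package imported in 1.1 is not exercised anywhere below; if you want an automated EDA report of the raw training set, a minimal sketch (ProfileReport is the library's standard entry point; the output filename is arbitrary):

# Optional: generate an automated profiling report for the training set
profile = pandas_profiling.ProfileReport(df_train)
profile.to_file('train_report.html')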

1.2 Exploratory analysis

1.2.1 Preview the dataset

  • Extract the target variable (Survived) as the prediction target
  • Append the test set to the training set in preparation for data cleaning and feature engineering
targets = df_train.Survived
combined = df_train.drop('Survived', axis=1).append(df_test)
  • Preview data set
combined.head(5).append(combined.tail(5))
----------------------------
     PassengerId  Pclass                                               Name     Sex   Age  SibSp  Parch              Ticket      Fare Cabin Embarked
0              1       3                            Braund, Mr. Owen Harris    male  22.0      1      0           A/5 21171    7.2500   NaN        S
1              2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0            PC 17599   71.2833   C85        C
2              3       3                             Heikkinen, Miss. Laina  female  26.0      0      0    STON/O2. 3101282    7.9250   NaN        S
3              4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0              113803   53.1000  C123        S
4              5       3                           Allen, Mr. William Henry    male  35.0      0      0              373450    8.0500   NaN        S
413         1305       3                                 Spector, Mr. Woolf    male   NaN      0      0           A.5. 3236    8.0500   NaN        S
414         1306       1                       Oliva y Ocana, Dona. Fermina  female  39.0      0      0            PC 17758  108.9000  C105        C
415         1307       3                       Saether, Mr. Simon Sivertsen    male  38.5      0      0  SOTON/O.Q. 3101262    7.2500   NaN        S
416         1308       3                                Ware, Mr. Frederick    male   NaN      0      0              359309    8.0500   NaN        S
417         1309       3                           Peter, Master. Michael J    male   NaN      1      1                2668   22.3583   NaN        C
  • Preview relevant statistics and missing data
combined.describe()
----------------------------
       PassengerId       Pclass          Age        SibSp        Parch         Fare
count  1309.000000  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000
mean    655.000000     2.294882    29.881138     0.498854     0.385027    33.295479
std     378.020061     0.837836    14.413493     1.041658     0.865560    51.758668
min       1.000000     1.000000     0.170000     0.000000     0.000000     0.000000
25%     328.000000     2.000000    21.000000     0.000000     0.000000     7.895800
50%     655.000000     3.000000    28.000000     0.000000     0.000000    14.454200
75%     982.000000     3.000000    39.000000     1.000000     0.000000    31.275000
max    1309.000000     3.000000    80.000000     8.000000     9.000000   512.329200
  • Previewing data types
combined.info()
----------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  1309 non-null   int64
 1   Pclass       1309 non-null   int64
 2   Name         1309 non-null   object
 3   Sex          1309 non-null   object
 4   Age          1046 non-null   float64
 5   SibSp        1309 non-null   int64
 6   Parch        1309 non-null   int64
 7   Ticket       1309 non-null   object
 8   Fare         1308 non-null   float64
 9   Cabin        295 non-null    object
 10  Embarked     1307 non-null   object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
  • Number and distribution of missing values
df_train.isnull().sum()
missing_pct = df_train.isnull().sum() * 100 / len(df_train)
missing = pd.DataFrame({
    'name': df_train.columns,
    'missing_pct': missing_pct,
})
missing.sort_values(by='missing_pct', ascending=False).head()
----------------------------
                    name  missing_pct
Cabin              Cabin    77.104377
Age                  Age    19.865320
Embarked        Embarked     0.224467
PassengerId  PassengerId     0.000000
Survived        Survived     0.000000

1.2.2 Practical significance of features

  • PassengerId: a unique identifier for each passenger
  • Pclass: passenger class, with three possible values: 1, 2, 3 (first, second, and third class)
  • Name: passenger's name
  • Sex: gender
  • Age: age
  • SibSp: number of siblings and spouses traveling with the passenger
  • Parch: number of parents and children traveling with the passenger
  • Ticket: ticket number
  • Fare: ticket fare
  • Cabin: cabin number (largely missing)
  • Embarked: port of embarkation, with three possible values: S, C, and Q (verified in the sanity check below)
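A quick sanity check of the categorical levels listed above — a minimal sketch, run against the `combined` frame from 1.2.1:

# Confirm the documented levels: Pclass {1, 2, 3}, Sex {female, male}, Embarked {C, Q, S}
for col in ['Pclass', 'Sex', 'Embarked']:
    print(col, sorted(combined[col].dropna().unique()))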

1.2.3 Quantity and distribution of predicted values

  • From the distribution of the target variable, the survival rate of passengers on board was 38.4%, lower than the mortality rate of 61.6%; the classes are not severely imbalanced
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
sns.countplot('Survived', data=df_train, ax=ax[0], palette=['g', 'r'])
df_train['Survived'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[1], colors=['g', 'r'])
ax[0].set_ylabel('')
ax[0].set_xlabel('Survived')
ax[1].set_ylabel('')
ax[1].set_xlabel('Survived')
plt.show()

1.2.4 Sex, age and survival

  • The survival rate of men is much lower than that of women, which may be related to the concept of women and children first
  • In terms of age, men in their 20s were the most likely to die, while children under the age of 10 were given priority
df_train['Died'] = 1 - df_train['Survived']
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
df_train.groupby('Sex')[['Survived', 'Died']].sum().plot.bar(ax=ax[0], color=['m', 'c'], stacked=True)
plt.ylabel('count')
df_train.groupby('Survived')['Age'].plot.hist(ax=ax[1])
plt.xlabel('Age')
plt.ylabel('count')
plt.legend()
plt.show()

1.2.5 Ticket price, age and survival

  • Fares peak for passengers aged roughly 30 to 40, and passengers who paid more than 100 had a noticeably higher chance of survival
plt.figure(figsize=(30, 10))
ax = plt.subplot()
ax.scatter(df_train[df_train['Survived'] == 1]['Age'], df_train[df_train['Survived'] == 1]['Fare'],
           color='purple', s=df_train[df_train['Survived'] == 1]['Fare'])
ax.scatter(df_train[df_train['Survived'] == 0]['Age'], df_train[df_train['Survived'] == 0]['Fare'],
           color='k', s=df_train[df_train['Survived'] == 0]['Fare'])
plt.show()

1.2.6 Class, embarkation point, fare and survival

  • People with a Pclass of 1 have higher average ticket prices
  • The average ticket price is slightly higher for those with a Pclass of 2 than for those with a Pclass of 3
plt.figure()
plt.ylabel('Average fare')
df_train.groupby('Pclass').mean()['Fare'].plot(kind='bar', color=['orange', 'g', 'c'], figsize=(30, 10))

  • Across embarkation points, Pclass=3 passengers always have the highest mortality rate, while the survival rate of Pclass=1 passengers is consistently above their mortality rate
  • Pclass 2 and 3 passengers mostly paid fares below 100, with survival and mortality roughly even. Among Pclass 1 passengers, survival and death rates are close in the low-fare range, while survival approaches 100% in the high-fare range, suggesting that a higher fare bought escape priority
  • Passengers who boarded at Q paid fares below 100, with survival and mortality roughly even; passengers who boarded at S with fares above 100 were more likely to survive; and among those who boarded at C, survivors outnumbered the dead
fig, ax = plt.subplots(3, 1, figsize=(20, 9))
sns.violinplot(ax=ax[0], x='Embarked', y='Pclass', hue='Survived', data=df_train, split=True, palette={0: 'r', 1: 'g'})
sns.violinplot(ax=ax[1], x='Pclass', y='Fare', hue='Survived', data=df_train, split=True, palette={0: 'r', 1: 'g'})
sns.violinplot(ax=ax[2], x='Embarked', y='Fare', hue='Survived', data=df_train, split=True, palette={0: 'r', 1: 'g'})

1.3 Data Cleaning

1.3.1 Missing Value processing

  • The characteristics of missing values are as follows:
                    name  missing_pct
Cabin              Cabin    77.463713
Age                  Age    20.091673
Embarked        Embarked     0.152788
Fare                Fare     0.076394
PassengerId  PassengerId     0.000000
  • Define a function that looks for missing values
# Helper used throughout this article: report that a feature has been processed
def status(feature):
    print(f'{feature} is ok')

def get_missing():
    global combined
    missing_pct = combined.isnull().sum() * 100 / len(combined)
    missing = pd.DataFrame({
        'name': combined.columns,
        'missing_pct': missing_pct,
    })
    return missing.sort_values(by='missing_pct', ascending=False).head()
  • Fill missing ages with the median age, conditioned on sex and passenger class
def get_age(row):
    global df_train
    # Median age of each (Sex, Pclass) group in the training set
    train_age_median = df_train.groupby(['Sex', 'Pclass']).median().reset_index()[['Sex', 'Pclass', 'Age']]
    condition = (
        (train_age_median['Sex'] == row['Sex']) &
        (train_age_median['Pclass'] == row['Pclass'])
    )
    return train_age_median[condition]['Age'].values[0]

def fill_age():
    global combined
    combined['Age'] = combined.apply(
        lambda row: get_age(row) if np.isnan(row['Age']) else row['Age'],
        axis=1)
    status('Age')
    return combined

combined = fill_age()
----------------------------
Age is ok
  • For null values in Cabin, fill them with ‘U’ to indicate unknown
combined['Cabin']=combined['Cabin'].fillna('U')
  • Fill missing Embarked values with the mode ('S')
combined['Embarked']=combined['Embarked'].fillna('S')
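The line above hard-codes 'S' because it is the most frequent port; a minimal equivalent sketch that computes the mode instead:

# Equivalent: fill with the computed mode rather than the hard-coded 'S'
combined['Embarked'] = combined['Embarked'].fillna(combined['Embarked'].mode()[0])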
  • Fill the missing Fare value with the training-set mean
combined['Fare']=combined['Fare'].fillna(df_train['Fare'].mean())
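Since Fare is strongly right-skewed (see the outlier discussion in 1.3.2), the median is a more robust alternative to the mean; a sketch:

# Alternative: the median is less sensitive to the extreme fares discussed below
combined['Fare'] = combined['Fare'].fillna(df_train['Fare'].median())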

1.3.2 Outlier handling

  • Delete PassengerId
combined.drop('PassengerId',axis=1,inplace=True)
  • Fare contains outliers, but the EDA showed that higher fares correlate with a higher probability of survival. These values therefore carry signal for the prediction target and should not be removed lightly
num_type = combined.select_dtypes(exclude=['object'])  # ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
num = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
plt.figure(figsize=(16, 8))
# The original call was truncated here; a boxplot of the numeric columns is assumed
sns.boxplot(data=combined[num])

1.4 Feature Engineering

1.4.1 Feature correlation screening

  • Correlations between features do not appear to be high enough to require special treatment
plt.figure(figsize=(15, 8))
sns.pairplot(df_train.drop('PassengerId', axis=1))
plt.show()
plt.figure(figsize=(15, 8))
sns.heatmap(df_train.drop('PassengerId', axis=1).corr(), annot=True)
plt.show()

1.4.2 Handling Name

  • Extract the title from Name, such as Miss, Mrs, etc.
Title_Dictionary = {
    'Capt': 'Officer',
    'Col': 'Officer',
    'Don': 'Royalty',
    'Dr': 'Officer',
    'Jonkheer': 'Royalty',
    'Lady': 'Royalty',
    'Major': 'Officer',
    'Master': 'Master',
    'Miss': 'Miss',
    'Mlle': 'Miss',
    'Mme': 'Mrs',
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Ms': 'Mrs',
    'Rev': 'Officer',
    'Sir': 'Royalty',
    'the Countess': 'Royalty',
}

# Extract the title from Name and map it through the dictionary
def get_titles():
    global combined, Title_Dictionary
    combined['Title'] = combined['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    combined['Title'] = combined.Title.map(Title_Dictionary)
    combined.drop('Name', axis=1, inplace=True)
    status('Title')
    return combined

combined = get_titles()

# One-hot encode the extracted titles
titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
combined = pd.concat([combined, titles_dummies], axis=1)
combined.drop('Title', axis=1, inplace=True)
status('names')

1.4.3 Handling Sex

  • 1 represents male and 0 represents female
def modify_Sex():
    global combined
    combined['Sex']=combined.Sex.map({'male':1,'female':0})
    status('sex')
    return combined
combined  = modify_Sex()

1.4.4 Handling Pclass, Embarked, and Cabin

  • Cabin preprocessing: keep only the first letter (the deck)
combined['Cabin'] = combined['Cabin'].map(lambda e : e[0])
  • One-hot encoding is applied
def dummies_coder():
    global combined
    for name in ['Embarked', 'Cabin', 'Pclass']:
        df_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, df_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
        status(name)
    return combined

combined = dummies_coder()
----------------------------
Embarked is ok
Cabin is ok
Pclass is ok

1.4.5 Handling SibSp and Parch

  • Categorize passengers by family size

  • Singleone denotes a passenger traveling alone, SmallFamily a family of two to four people, and BigFamily a family of more than four
def modify_Family():
    global combined
    combined['Family_size'] = combined['Parch'] + combined['SibSp'] + 1
    combined['Singleone'] = combined['Family_size'].map(lambda s: 1 if s == 1 else 0)
    combined['BigFamily'] = combined['Family_size'].map(lambda s: 1 if s > 4 else 0)
    combined['SmallFamily'] = combined['Family_size'].map(lambda s: 1 if 1 < s <= 4 else 0)
    combined.drop(['Parch', 'SibSp', 'Family_size'], axis=1, inplace=True)
    status('family')
    return combined

combined = modify_Family()
----------------------------
family is ok

1.4.6 Handling Ticket

  • Extract the alphabetic prefix from Ticket
  • One-hot encoding is applied
def preperform_Ticket(ticket):
    ticket = ticket.replace('.','')
    ticket = ticket.replace('/','')
    ticket = ticket.split()
    ticket = map(lambda t : t.strip(),ticket)
    ticket = list(filter(lambda t : not t.isdigit(),ticket))
    if len(ticket) >0:
        return ticket[0]
    else:
        return  'xxx'
def modify_Ticket():
    global combined
    combined['Ticket'] = combined['Ticket'].map(preperform_Ticket)
    tickets_dummies = pd.get_dummies(combined['Ticket'],prefix = 'Ticket')
    combined = pd.concat([combined,tickets_dummies],axis=1)
    combined.drop('Ticket',axis=1,inplace = True)
    status('ticket')
    return combined
combined = modify_Ticket()

1.4.7 Handling Fare and Age

  • Bin the continuous values into discrete buckets
def modify_df(feature, threshold_values):
    global combined
    combined.loc[combined[feature] < threshold_values[0], feature] = 0
    combined.loc[(combined[feature] >= threshold_values[0]) & (combined[feature] < threshold_values[1]), feature] = 1
    combined.loc[(combined[feature] >= threshold_values[1]) & (combined[feature] < threshold_values[2]), feature] = 2
    combined.loc[combined[feature] >= threshold_values[2], feature] = 3
    return combined

age_threshold_values = [15, 30, 45]
fare_threshold_values = [15, 30, 100]
combined = modify_df('Age', age_threshold_values)
combined = modify_df('Fare', fare_threshold_values)
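The manual thresholding above can also be expressed with pandas' built-in binning; a minimal equivalent sketch using pd.cut (right=False makes the intervals left-closed, matching the >= / < logic above):

# Equivalent binning with pd.cut; labels 0-3 correspond to the manual thresholds above
combined['Age'] = pd.cut(combined['Age'], bins=[-np.inf, 15, 30, 45, np.inf],
                         labels=[0, 1, 2, 3], right=False).astype(int)
combined['Fare'] = pd.cut(combined['Fare'], bins=[-np.inf, 15, 30, 100, np.inf],
                          labels=[0, 1, 2, 3], right=False).astype(int)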

1.5 Modeling and parameter tuning

  • Import related modules
import sklearn
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier, Pool, cv
from sklearn import metrics
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
import time  # StratifiedKFold and time were missing from the original imports but are used below
  • Partition data set
train = combined[:891]
test = combined[891:]
targets = np.array(targets)
x_train, x_val, y_train, y_val = train_test_split(train, targets, test_size=0.2, random_state=2021)

1.5.1 Baseline models and scoring

  • Training and evaluation using basic models
lg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
extree = ExtraTreesClassifier()
gbdt = GradientBoostingClassifier()
knn = KNeighborsClassifier()
models = [lg_cv, rf, extree, knn, gbdt]
for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('accuracy-score :', metrics.accuracy_score(y_val, predict_val))
    val_proba = model.predict_proba(x_val)
    print(f'auc:{roc_auc_score(y_val, val_proba[:, 1])}')
    print('*' * 50)
----------------------------
LogisticRegressionCV()
accuracy-score : 0.7597765363128491
auc:0.7911450182576942
**************************************************
RandomForestClassifier()
accuracy-score : 0.7094972067039106
auc:0.7766692749087115
**************************************************
ExtraTreesClassifier()
accuracy-score : 0.7039106145251397
auc:0.744131455399061
**************************************************
KNeighborsClassifier()
accuracy-score : 0.7206703910614525
auc:0.7631716223265518
**************************************************
GradientBoostingClassifier()
accuracy-score : 0.7430167597765364
auc:0.8000782472613458
**************************************************

1.5.2 RandomForest parameter tuning

  • Tune the parameters with GridSearchCV
model_rf = RandomForestClassifier()
parameter_grid = {
    'max_depth': [4, 6, 8, 10],
    'n_estimators': [10, 30, 50, 100],
    'max_features': ['sqrt', 'auto', 'log2'],
    'min_samples_split': [2, 3, 10],
    'min_samples_leaf': [1, 3, 10],
    'bootstrap': [True, False],
}
cross_validation = StratifiedKFold(n_splits=5)
auc_score = make_scorer(roc_auc_score, average='micro')
grid_search = GridSearchCV(model_rf, cv=cross_validation, param_grid=parameter_grid,
                           scoring=auc_score, verbose=1)
grid_search.fit(x_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
----------------------------
Best parameters: {'bootstrap': False, 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 30}
  • Rebuild the model with the tuned parameters and evaluate it on the validation set; the AUC improves after tuning
  • Plot the ROC curve
model_rf = RandomForestClassifier(**grid_search.best_params_)
model_rf.fit(x_train, y_train)

def roc(model, x, y, name):
    # Predicted probability of the positive class
    y_proba = model.predict_proba(x)[:, 1]
    fpr, tpr, threshold = metrics.roc_curve(y, y_proba)
    roc_auc = metrics.auc(fpr, tpr)
    print(f'{name} AUC: {roc_auc}')
    plt.figure(figsize=(15, 7))
    plt.plot(fpr, tpr, 'b', label=name + ' AUC = %0.4f' % roc_auc)
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.legend(loc='best')
    plt.title('ROC')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    # Plot the diagonal reference line
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()

roc(model_rf, x_val, y_val, 'validation set')
----------------------------
validation set AUC: 0.7916666666666666

![image.png](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/46bacac3cc7c4940996947570462d5c8~tplv-k3u1fbpfcp-watermark.image)

1.5.3 XGBoost parameter tuning

  • Tune XGBoost with Bayesian optimization
def BO_xgb(x, y):
    t1 = time.time()  # time.clock() was removed in Python 3.8
    def xgb_cv(max_depth, gamma, min_child_weight, max_delta_step, subsample, colsample_bytree):
        paramt = {'booster': 'gbtree',
                  'max_depth': int(max_depth),
                  'gamma': gamma,
                  'eta': 0.1,
                  'objective': 'binary:logistic',
                  'nthread': 4,
                  'eval_metric': 'auc',
                  'subsample': max(min(subsample, 1), 0),
                  'colsample_bytree': max(min(colsample_bytree, 1), 0),
                  'min_child_weight': min_child_weight,
                  'max_delta_step': int(max_delta_step),
                  'seed': 1001}
        model = XGBClassifier(**paramt)
        res = cross_val_score(model, x, y, scoring='roc_auc', cv=5).mean()
        return res
    cv_params = {'max_depth': (1, 30),
                 'gamma': (0.001, 10.0),
                 'min_child_weight': (0, 20),
                 'max_delta_step': (0, 10),
                 'subsample': (0.1, 1.0),
                 'colsample_bytree': (0.1, 1.0)}
    xgb_op = BayesianOptimization(xgb_cv, cv_params)
    xgb_op.maximize(n_iter=20)
    print(xgb_op.max)
    t2 = time.time()
    print('time:', (t2 - t1))
    return xgb_op.max

result = BO_xgb(x_train, y_train)
----------------------------
{'target': 0.9039100745280519, 'params': {'colsample_bytree': 1.0, 'gamma': 0.001, 'max_delta_step': 1.5409323299196414, 'max_depth': 3.216932691991951, 'min_child_weight': 0.0, 'subsample': 1.0}}
time: 18.490272999999434

  • Visualize the ROC curves; the gap between the training and validation sets persists, so the model can still be improved
best_params = result['params']
best_params['max_depth'] = int(best_params['max_depth'])
model_xgb = XGBClassifier(**best_params)
model_xgb.fit(x_train, y_train)

def roc(m, xt, yt, namet, xv, yv, namev):
    # Predicted probabilities of the positive class for both sets
    yt_pred = m.predict_proba(xt)[:, 1]
    yv_pred = m.predict_proba(xv)[:, 1]
    # Compute ROC metrics
    fprt, tprt, thresholdt = metrics.roc_curve(yt, yt_pred)
    roc_auct = metrics.auc(fprt, tprt)
    print(namet + ' AUC:{}'.format(roc_auct))
    fprv, tprv, thresholdv = metrics.roc_curve(yv, yv_pred)
    roc_aucv = metrics.auc(fprv, tprv)
    print(namev + ' AUC:{}'.format(roc_aucv))
    # Draw both ROC curves
    ax = plt.subplot()
    ax.plot(fprt, tprt, 'r', label=namet + ' AUC = %0.4f' % roc_auct)
    ax.plot(fprv, tprv, 'g', label=namev + ' AUC = %0.4f' % roc_aucv)
    plt.ylim(0, 1)
    plt.xlim(0, 1)
    plt.legend(loc='best')
    plt.title('ROC')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # Plot the diagonal reference line
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()

plt.subplot(1, 1, 1)
roc(model_xgb, x_val, y_val, 'validation set', x_train, y_train, 'training set')
----------------------------
validation set AUC:0.799621804903495
training set AUC:0.9331065759637188
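One common way to narrow the train/validation gap noted above is early stopping. A hedged sketch, assuming an xgboost release whose fit() still accepts eval_set, eval_metric, and early_stopping_rounds (newer releases move early_stopping_rounds to the XGBClassifier constructor):

# Train with early stopping: stop adding trees once validation AUC stops improving
model_es = XGBClassifier(**best_params, n_estimators=500)
model_es.fit(x_train, y_train,
             eval_set=[(x_val, y_val)],
             eval_metric='auc',
             early_stopping_rounds=30,
             verbose=False)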

1.5.4 Model fusion

  • Use the RandomForest and XGBoost models for prediction and average their predicted probabilities
trained_models = [model_xgb,model_rf]
predictions = []
for model in trained_models:
    predictions.append(model.predict_proba(test)[:, 1])
predictions_df = pd.DataFrame(predictions).T
predictions_df['out'] = predictions_df.mean(axis=1)
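The mean above weights both models equally. Since XGBoost reached a higher validation AUC than RandomForest here, a weighted blend is a common variant; a minimal sketch (the 0.6/0.4 weights are illustrative, not tuned):

# Hypothetical weighted blend; column 0 is model_xgb and column 1 is model_rf,
# per the order of trained_models above. If used, this replaces the equal-weight 'out'.
predictions_df['out'] = 0.6 * predictions_df[0] + 0.4 * predictions_df[1]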

1.5.5 Model prediction and output result file

  • Output result file
abc = pd.read_csv('gender_submission.csv')
df_out = pd.DataFrame()
df_out['PassengerId'] = abc['PassengerId']
df_out['Survived'] = predictions_df['out'].map(lambda x: 1 if x >= 0.5 else 0)
df_out.to_csv('422titanic.csv', index=False)
  • Admittedly, the resulting score is not high; there is still plenty of room for improvement.