Original link: tecdat.cn/?p=18266

Original source: Tuoduan Data Tribe (拓端数据部落) WeChat official account

 

In this article, we fit a generalized linear model (GLM), a polynomial regression, and a generalized additive model (GAM) in R to predict who survived the 1912 sinking of the Titanic.
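The examples below assume the data is already loaded in a data frame called titanic. A minimal loading sketch, assuming a local copy of the classic Kaggle training file (the file name is an assumption; the original does not show this step):

titanic = read.csv("train.csv")  # hypothetical file name: any copy of the Titanic passenger data works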


str(titanic)

The data variable is:

  • Survived: Passenger survival indicator (1 if alive)
  • Pclass: Passenger class
  • Sex: Passenger gender
  • Age: Age of passenger
  • SibSp: Number of siblings/spouses
  • Parch: Number of parents/children
  • Embarked: Embarkation port
  • Name: Name of passenger

The last variable is not of much use here, so let's drop it:

titanic = titanic[,1:7]
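One caveat before modeling: the regressions below treat Pclass as a factor (their outputs show Pclass2 and Pclass3 coefficients), so if Pclass was read as an integer, convert it first. A one-line sketch, assuming the column names listed above:

titanic$Pclass = as.factor(titanic$Pclass)  # treat 1/2/3 as classes, not numbers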

Now, let’s answer the question:

What percentage of the passengers survived?

The short answer is:

 

mean(titanic$Survived)
[1] 0.3838384

It can be found in the contingency table below

table(titanic$Survived)/nrow(titanic)

        0         1 
0.6161616 0.3838384 

That is, 38.38% of the passengers survived. The same number can also be obtained from a logistic regression with no explanatory variable (in other words, a regression on the constant only). The regression gives:

 
glm(Survived ~ 1, family = binomial, data = titanic)

Coefficients:
(Intercept)  
    -0.4733  

Degrees of Freedom: 890 Total (i.e. Null);  890 Residual
Null Deviance:      1187 
Residual Deviance: 1187    AIC: 1189

Given the estimate of β₀, the probability of survival is exp(β₀)/(1 + exp(β₀)); note that β̂₀ = -0.4733 is just log(0.3838/0.6162), the log-odds of the empirical survival rate. We can check this by computing

 

exp(-0.4733)/(1 + exp(-0.4733))
[1] 0.3838355

We can also use the predict function

predict(glm(Survived ~ 1, family = binomial, data = titanic), type = "response")[1]
        1 
0.3838384 

The same also works with probit regression; for an intercept-only model the link function does not matter, since the fitted probability must reproduce the sample mean:

reg = glm(Survived ~ 1, family = binomial(link = "probit"), data = titanic)
predict(reg, type = "response")[1]
        1 
0.3838384 

What percentage of first-class passengers survived?

Let's look only at the passengers in first class:

 

mean(titanic$Survived[titanic$Pclass == "1"])
[1] 0.6296296

About 63% survived. We can run a logistic regression of survival on passenger class:

reg = glm(Survived ~ Pclass, family = binomial, data = titanic)

Coefficients:
(Intercept)      Pclass2      Pclass3  
     0.5306      -0.6394      -1.6704  

Degrees of Freedom: 890 Total (i.e. Null);  888 Residual
Null Deviance:      1187 
Residual Deviance: 1083    AIC: 1089

Since first class is the reference level, its survival probability comes from the intercept alone:

exp(0.5306)/(1 + exp(0.5306))
[1] 0.629623

Or with the predict function:

predict(reg, newdata = data.frame(Pclass = "1"), type = "response")
        1 
0.6296296 

We can also try a probit regression; we get the same result:

reg = glm(Survived ~ Pclass, family = binomial(link = "probit"), data = titanic)
predict(reg, newdata = data.frame(Pclass = "1"), type = "response")
        1 
0.6296296 

Chi-square independence test: what is the value of the test statistic for survival versus passenger class?

The command for the chi-square test is as follows:

chisq.test(table(titanic$Survived, titanic$Pclass))

	Pearson's Chi-squared test

data:  table(titanic$Survived, titanic$Pclass)
X-squared = 102.89, df = 2, p-value < 2.2e-16

We have a contingency table of counts N_ij. If the two variables were independent, the expected counts would be N*_ij = (N_i. x N_.j)/n, and the test statistic is the sum over all cells of (N_ij - N*_ij)^2 / N*_ij. Each cell's contribution to the statistic can be read from the signed Pearson residuals (N_ij - N*_ij)/sqrt(N*_ij):

          1         2         3
0 -4.601993 -1.537771  3.993703
1  5.830678  1.948340 -5.059981
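These residuals are stored in the object returned by chisq.test, so they are easy to reproduce; summing their squares recovers the test statistic:

test = chisq.test(table(titanic$Survived, titanic$Pclass))
test$residuals         # signed Pearson residuals, as shown above
sum(test$residuals^2)  # the chi-square statistic, 102.89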

This is very informative: the two strong positive residuals correspond to the pairs "survived"/"first class" and "did not survive"/"third class", and the two strong negative ones to "survived"/"third class" and "did not survive"/"first class". We can visualize these values in the mosaic plot below,

 

 


mosaicplot(table(titanic$Survived, titanic$Pclass), shade = TRUE, las = 3)

 

Next, we fit a logistic regression and predict the survival probabilities of two simulated passengers.

Let’s say we have two passengers

newbase = data.frame(Pclass = as.factor(c(1, 3)),
                     Sex = as.factor(c("female", "male")),
                     Age = c(17, 20),
                     SibSp = c(1, 0),
                     Parch = c(2, 0))

Let's first run a regression on all the variables:


 
reg = glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
          family = binomial, data = titanic)
summary(reg)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  16.830381 607.655774   0.028  0.97790    
Pclass2      -1.268362   0.298428  -4.250 2.14e-05 ***
Pclass3      -2.493756   0.296219  -8.419  < 2e-16 ***
Sexmale      -2.641145   0.222801 -11.854  < 2e-16 ***
Age          -0.043725   0.008294  -5.272 1.35e-07 ***
SibSp        -0.355755   0.128529  -2.768  0.00564 ** 
Parch        -0.044628   0.120705  -0.370  0.71159    
EmbarkedC   -12.260112 607.655693  -0.020  0.98390    
EmbarkedQ   -13.104581 607.655894  -0.022  0.98279    
EmbarkedS   -12.687791 607.655674  -0.021  0.98334    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 632.67  on 704  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 652.67

Number of Fisher Scoring iterations: 13

Two variables (Parch and Embarked) are not significant. Let's remove them:


 
reg = glm(Survived ~ Pclass + Sex + Age + SibSp,
          family = binomial, data = titanic)
summary(reg)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.334201   0.450700   9.617  < 2e-16 ***
Pclass2     -1.414360   0.284727  -4.967 6.78e-07 ***
Pclass3     -2.652618   0.285832  -9.280  < 2e-16 ***
Sexmale     -2.627679   0.214771 -12.235  < 2e-16 ***
Age         -0.044760   0.008225  -5.442 5.27e-08 ***
SibSp       -0.380190   0.121516  -3.129  0.00176 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 636.56  on 708  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 648.56

Number of Fisher Scoring iterations: 5
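To confirm that dropping Parch and Embarked is harmless, we can compare the two nested models with a likelihood-ratio test. A small sketch (the object names reg_full and reg_small are mine, not from the original):

reg_full  = glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
                family = binomial, data = titanic)
reg_small = glm(Survived ~ Pclass + Sex + Age + SibSp,
                family = binomial, data = titanic)
# deviance difference 636.56 - 632.67 = 3.89 on 4 df: far from significant
anova(reg_small, reg_full, test = "Chisq")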

When we have a continuous variable like age, we can also fit a polynomial regression:


 
reg = glm(Survived ~ Pclass + Sex + poly(Age, 3) + SibSp,
          family = binomial, data = titanic)
summary(reg)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)     3.0213     0.2903  10.408  < 2e-16 ***
Pclass2        -1.3306     0.2842  -4.786 1.70e-06 ***
Pclass3        -2.5569     0.2853  -8.962  < 2e-16 ***
Sexmale        -2.6582     0.2176 -12.216  < 2e-16 ***
poly(Age, 3)1 -17.7668     3.2583  -5.453 4.96e-08 ***
poly(Age, 3)2   6.0044     3.0021   2.000 0.045491 *  
poly(Age, 3)3  -5.9181     3.0992  -1.910 0.056188 .  
SibSp          -0.5041     0.1317  -3.828 0.000129 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 627.55  on 706  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 643.55

Number of Fisher Scoring iterations: 5

But the parameters become hard to interpret, because poly() uses orthogonal polynomials. Since the third-order term is (marginally) significant here, let's run the regression manually with raw powers of age:


reg = glm(Survived ~ Pclass + Sex + Age + I(Age^2) + I(Age^3) + SibSp,
          family = binomial, data = titanic)
summary(reg)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.616e+00  6.565e-01   8.554  < 2e-16 ***
Pclass2     -1.360e+00  2.842e-01  -4.786 1.70e-06 ***
Pclass3     -2.557e+00  2.853e-01  -8.962  < 2e-16 ***
Sexmale     -2.658e+00  2.176e-01 -12.216  < 2e-16 ***
Age         -1.905e-01  5.528e-02  -3.446 0.000569 ***
I(Age^2)     4.290e-03  1.854e-03   2.314 0.020669 *  
I(Age^3)    -3.520e-05  1.843e-05  -1.910 0.056188 .  
SibSp       -5.041e-01  1.317e-01  -3.828 0.000129 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 627.55  on 706  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 643.55

Number of Fisher Scoring iterations: 5

As you can see, the p-values are the same. In short, it makes sense to let survival depend on age through a non-linear function. We can visualize this function:


# rebuild the fitted age effect from the raw-polynomial coefficients above
xage = seq(0, 80, by = .1)
yage = -1.905e-01*xage + 4.290e-03*xage^2 - 3.520e-05*xage^3
plot(xage, yage, xlab = "Age", ylab = "", type = "l")

 

In fact, we can use splines instead. Generalized additive models (GAM) are the perfect tool for this kind of visualization:

library(gam)
regam = gam(Survived ~ Pclass + Sex + bs(Age) + SibSp,
            family = binomial, data = titanic)
summary(regam)

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 964.516 on 713 degrees of freedom
Residual Deviance: 627.5525 on 706 degrees of freedom
AIC: 643.5525
(177 observations deleted due to missingness)

Number of Local Scoring Iterations: 4

Anova for Parametric Effects
            Df Sum Sq Mean Sq  F value    Pr(>F)    
Pclass       2  26.72  13.361  11.3500 1.407e-05 ***
Sex          1 131.57 131.573 111.7678 < 2.2e-16 ***
bs(Age)      3  22.76   7.588   6.4455 0.0002620 ***
SibSp        1  14.66  14.659  12.4525 0.0004445 ***
Residuals  706 831.10   1.177                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can now look at how the age variable is transformed:
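A sketch of the corresponding term plot, assuming the gam package's plot method and its terms argument:

plot(regam, terms = "bs(Age)")  # partial effect of age on the logit scale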

 

We find that the transformation is very close to our third-order polynomial. We can add confidence bands to check whether the function could in fact be linear:
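The same sketch, now with pointwise standard-error bands (again assuming the gam package's plot method):

plot(regam, terms = "bs(Age)", se = TRUE)  # adds roughly +/- 2 standard-error bands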

 

We now have three models. Finally, let's compute the predicted survival probabilities for our two simulated passengers:

predict(reg, newdata = newbase, type = "response")
predict(regam, newdata = newbase, type = "response")

You can see that Leonardo DiCaprio's character has about a 12% chance of survival (a third-class ticket, given his age, and no family on board).

