Original link: tecdat.cn/?p=18266
Original source: Tuoduan Data Tribe (WeChat public account)
In this article, we build generalized linear models (GLM), polynomial regressions and generalized additive models (GAM) in R to predict who survived the sinking of the Titanic in 1912.
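The original does not show how the data are loaded; here is a minimal sketch, assuming a CSV file (the name "titanic.csv" is a placeholder) whose columns match the list below:

```r
# Assumed loading step (not shown in the original); "titanic.csv" is a placeholder
titanic = read.csv("titanic.csv")
titanic$Pclass = as.factor(titanic$Pclass)  # passenger class treated as a factor ("1","2","3")
```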
```r
str(titanic)
```
The data contains the following variables:

- Survived: passenger survival indicator (1 if the passenger survived)
- Pclass: passenger class
- Sex: passenger sex
- Age: age of the passenger
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Embarked: port of embarkation
- Name: name of the passenger
The last variable is not of much use, so let's delete it:

```r
titanic = titanic[,1:7]
```
Now, let’s answer the question:
What percentage of the passengers survived?
The short answer is:

```r
mean(titanic$Survived)
[1] 0.3838384
```
It can also be found in the contingency table below:

```r
table(titanic$Survived)/nrow(titanic)

        0         1 
0.6161616 0.3838384 
```
That is, 38.38% of the passengers survived. The same number can be obtained from a logistic regression with no explanatory variables (in other words, a regression on a constant only).
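The call itself is not shown in the original; one consistent with the output below is:

```r
# Intercept-only logistic regression (call reconstructed from the output that follows)
reg = glm(Survived ~ 1, family = binomial, data = titanic)
reg
```

The regression gives: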
```r
Coefficients:
(Intercept)  
    -0.4733  

Degrees of Freedom: 890 Total (i.e. Null);  890 Residual
Null Deviance:      1187 
Residual Deviance: 1187   AIC: 1189
```
Given the estimated value of $\beta_0$, the predicted probability of survival is
$$\widehat{p}=\frac{e^{\widehat{\beta}_0}}{1+e^{\widehat{\beta}_0}}$$
which we can check by hand:

```r
exp(-0.4733)/(1 + exp(-0.4733))
[1] 0.3838355
```
We can also use the predict function:

```r
predict(glm(Survived ~ 1, family = binomial, data = titanic), type = "response")[1]
        1 
0.3838384
```
The same result holds with a probit regression:

```r
reg = glm(Survived ~ 1, family = binomial(link = "probit"), data = titanic)
predict(reg, type = "response")[1]
        1 
0.3838384
```
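Under the probit link, the fitted probability is $\Phi(\widehat{\beta}_0)$; a quick check by hand (not in the original):

```r
# Phi(beta0), the probit analogue of the logistic transformation above
pnorm(coef(reg))
```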
What percentage of first-class passengers survived?
We look only at first-class passengers (computing, for instance, the mean of Survived among Pclass == "1"):

```r
mean(titanic$Survived[titanic$Pclass == "1"])
[1] 0.6296296
```
About 63% survived. We can also run a logistic regression on passenger class.
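The call is reconstructed here, as the original shows only the output:

```r
# Logistic regression on passenger class (call inferred from the output below)
reg = glm(Survived ~ Pclass, family = binomial, data = titanic)
reg
```

The regression gives: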
```r
Coefficients:
(Intercept)      Pclass2      Pclass3  
     0.5306      -0.6394      -1.6704  

Degrees of Freedom: 890 Total (i.e. Null);  888 Residual
Null Deviance:      1187 
Residual Deviance: 1083   AIC: 1089
```
Since first class is the reference level, the survival probability in first class comes from the intercept alone:

```r
exp(0.5306)/(1 + exp(0.5306))
[1] 0.629623
```
Or with predict:

```r
predict(reg, newdata = data.frame(Pclass = "1"), type = "response")
        1 
0.6296296
```
We can try a probit regression, and we get the same result:

```r
reg = glm(Survived ~ Pclass, family = binomial(link = "probit"), data = titanic)
predict(reg, newdata = data.frame(Pclass = "1"), type = "response")
        1 
0.6296296
```
Chi-square independence test: what is the value of the test statistic for the association between survival and passenger class?

The command for the chi-square test is as follows:
```r
chisq.test(table(Survived, Pclass))

        Pearson's Chi-squared test

data:  table(Survived, Pclass)
X-squared = 102.89, df = 2, p-value < 2.2e-16
```
We have a contingency table of counts $N_{i,j}$. If the two variables were independent, the expected counts would be
$$N^{\perp}_{i,j}=\frac{N_{i,\cdot}\,N_{\cdot,j}}{N}$$
and the test statistic is
$$X^2=\sum_{i,j}\frac{(N_{i,j}-N^{\perp}_{i,j})^2}{N^{\perp}_{i,j}}$$
We can look at each cell's contribution to the statistic, i.e. the Pearson residuals $(N_{i,j}-N^{\perp}_{i,j})/\sqrt{N^{\perp}_{i,j}}$:
```r
chisq.test(table(Survived, Pclass))$residuals

          1         2         3
0 -4.601993 -1.537771  3.993703
1  5.830678  1.948340 -5.059981
```
This is quite informative: the two large positive residuals indicate a strong positive association between "survived" and "first class", and between "did not survive" and "third class"; the two large negative residuals indicate a strong negative association between "survived" and "third class", and between "did not survive" and "first class". We can visualize these values in the figure below:
```r
mosaicplot(table(Survived, Pclass), shade = TRUE, las = 3)
```
Next, we run a logistic regression and predict the survival probabilities of two simulated passengers. Let's say we have the following two passengers:

```r
# The original listing was cut off here; these five variables suffice for the models below
newbase = data.frame(Pclass = as.factor(c(1, 3)),
                     Sex    = as.factor(c("female", "male")),
                     Age    = c(17, 20),
                     SibSp  = c(1, 0),
                     Parch  = c(2, 0))
```
Let's first run a regression on all the variables.
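A call consistent with the output below (reconstructed; the original omits it):

```r
# Full model: all remaining explanatory variables
reg = glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
          family = binomial, data = titanic)
summary(reg)
```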
```r
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  16.830381 607.655774   0.028  0.97790    
Pclass2      -1.268362   0.298428  -4.250 2.14e-05 ***
Pclass3      -2.493756   0.296219  -8.419  < 2e-16 ***
Sexmale      -2.641145   0.222801 -11.854  < 2e-16 ***
Age          -0.043725   0.008294  -5.272 1.35e-07 ***
SibSp        -0.355755   0.128529  -2.768  0.00564 ** 
Parch        -0.044628   0.120705  -0.370  0.71159    
EmbarkedC   -12.260112 607.655693  -0.020  0.98390    
EmbarkedQ   -13.104581 607.655894  -0.022  0.98279    
EmbarkedS   -12.687791 607.655674  -0.021  0.98334    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 632.67  on 704  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 652.67

Number of Fisher Scoring iterations: 13
```
Two variables are not significant, Parch and Embarked. Let's remove them.
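Again, a call consistent with the output below (reconstructed):

```r
# Reduced model without Parch and Embarked
reg = glm(Survived ~ Pclass + Sex + Age + SibSp,
          family = binomial, data = titanic)
summary(reg)
```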
```r
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.334201   0.450700   9.617  < 2e-16 ***
Pclass2     -1.414360   0.284727  -4.967 6.78e-07 ***
Pclass3     -2.652618   0.285832  -9.280  < 2e-16 ***
Sexmale     -2.627679   0.214771 -12.235  < 2e-16 ***
Age         -0.044760   0.008225  -5.442 5.27e-08 ***
SibSp       -0.380190   0.121516  -3.129  0.00176 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 636.56  on 708  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 648.56

Number of Fisher Scoring iterations: 5
```
When we have a continuous variable such as age, we can fit a polynomial regression.
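A call consistent with the output below (reconstructed; note that poly() does not accept missing values, so we assume rows with missing Age were dropped first, which matches the 714 observations in the output):

```r
# poly() cannot handle NAs, so drop rows with missing Age beforehand (assumed step)
titanic2 = titanic[!is.na(titanic$Age), ]
reg = glm(Survived ~ Pclass + Sex + poly(Age, 3) + SibSp,
          family = binomial, data = titanic2)
summary(reg)
```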
```r
Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      3.0213     0.2903  10.408  < 2e-16 ***
Pclass2         -1.3306     0.2842  -4.786 1.70e-06 ***
Pclass3         -2.5569     0.2853  -8.962  < 2e-16 ***
Sexmale         -2.6582     0.2176 -12.216  < 2e-16 ***
poly(Age, 3)1  -17.7668     3.2583  -5.453 4.96e-08 ***
poly(Age, 3)2    6.0044     3.0021   2.000 0.045491 *  
poly(Age, 3)3   -5.9181     3.0992  -1.910 0.056188 .  
SibSp           -0.5041     0.1317  -3.828 0.000129 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 627.55  on 706  degrees of freedom
AIC: 643.55

Number of Fisher Scoring iterations: 5
```
But interpreting the parameters becomes complicated. We note that the third-order term matters here, so let's run the same regression manually, with explicit powers of Age.
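A call consistent with the output below (reconstructed):

```r
# Same cubic fit, written with explicit powers of Age
reg = glm(Survived ~ Pclass + Sex + Age + I(Age^2) + I(Age^3) + SibSp,
          family = binomial, data = titanic2)
summary(reg)
```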
```r
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.616e+00  6.565e-01   8.554  < 2e-16 ***
Pclass2     -1.360e+00  2.842e-01  -4.786 1.70e-06 ***
Pclass3     -2.557e+00  2.853e-01  -8.962  < 2e-16 ***
Sexmale     -2.658e+00  2.176e-01 -12.216  < 2e-16 ***
Age         -1.905e-01  5.528e-02  -3.446 0.000569 ***
I(Age^2)     4.290e-03  1.854e-03   2.314 0.020669 *  
I(Age^3)    -3.520e-05  1.843e-05  -1.910 0.056188 .  
SibSp       -5.041e-01  1.317e-01  -3.828 0.000129 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 627.55  on 706  degrees of freedom
AIC: 643.55

Number of Fisher Scoring iterations: 5
```
As you can see, the p-values are the same, since the two parametrizations describe the same model. In short, it makes sense to transform age through a non-linear function, and we can visualize that function.
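The original does not show how xage and yage below were built; one way (a sketch) is to evaluate the fitted age polynomial on a grid of ages:

```r
# Sketch (assumed): evaluate the fitted cubic in Age on a grid
b    = coef(reg)
xage = seq(0, 80, by = .1)
yage = b["Age"] * xage + b["I(Age^2)"] * xage^2 + b["I(Age^3)"] * xage^3
```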
```r
plot(xage, yage, xlab = "Age", ylab = "", type = "l")
```
In fact, we can use splines. Generalized additive models (GAM) are a perfect tool for visualizing this transformation.
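The summary below has the format of the gam package (note the "Local Scoring" iterations); a plausible reconstruction of the fit, with the spline basis bs(Age) inferred from the 3 degrees of freedom in the Anova table, is:

```r
library(gam)      # fits GAMs by local scoring; assumed from the output format
library(splines)  # for the B-spline basis bs()
regam = gam(Survived ~ Pclass + Sex + bs(Age) + SibSp,
            family = binomial, data = titanic)
summary(regam)
```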
```r
(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 964.516 on 713 degrees of freedom
Residual Deviance: 627.5525 on 706 degrees of freedom
AIC: 643.5525 
177 observations deleted due to missingness

Number of Local Scoring Iterations: 4 

Anova for Parametric Effects
            Df Sum Sq Mean Sq  F value    Pr(>F)    
Pclass       2  26.72  13.361  11.3500 1.407e-05 ***
Sex          1 131.57 131.573 111.7678 < 2.2e-16 ***
bs(Age)      3  22.76   7.588   6.4455 0.0002620 ***
SibSp        1  14.66  14.659  12.4525 0.0004445 ***
Residuals  706 831.10   1.177                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
We can then see how the age variable is transformed, and we find that the transformation is close to our third-order polynomial. We can add confidence bands to check that the function is not actually linear:
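For instance (a sketch using plot.Gam from the gam package, which draws each fitted term with pointwise standard-error bands):

```r
# Plot the fitted transformation of each term, with standard-error bands
plot(regam, se = TRUE)
```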
We now have three models. Finally, we compute the predictions for the two simulated passengers:

```r
predict(reg,   newdata = newbase, type = "response")
predict(regam, newdata = newbase, type = "response")
```
We can see that Leonardo DiCaprio would have had roughly a 12% chance of surviving (a third-class ticket, given his age, and no family on board).