Original link:tecdat.cn/?p=21625
Original source: Tuoduan Data Tribe (tecdat) WeChat public account
We know how to compute confidence intervals for parameters whose estimators follow a known distribution (Student's t, normal): we multiply the standard error by the appropriate t- or z-quantile. But what if no suitable distribution is available? Fortunately, there is a method that can produce a confidence interval for almost any parameter: the bootstrap.
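Before moving to regression, here is a minimal sketch (not from the original post; the simulated data and the 1,000 resamples are arbitrary choices) contrasting the classical t-based interval for a mean with a bootstrap percentile interval:

set.seed(1)
z=rexp(30)                                              # 30 arbitrary observations (illustration only)
# classical 90% interval for the mean: estimate +/- t-quantile * standard error
mean(z)+qt(c(.05,.95),df=length(z)-1)*sd(z)/sqrt(length(z))
# bootstrap percentile interval: resample with replacement, recompute the mean, take empirical quantiles
m=replicate(1000,mean(sample(z,replace=TRUE)))
quantile(m,c(.05,.95))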
Here we use the bootstrap to obtain confidence intervals for a prediction, illustrated with a simple linear regression.
> plot(cars)
> reg=lm(dist~speed,data=cars)
> x=21   # speed at which we want a prediction
> points(x,predict(reg,newdata=data.frame(speed=x)),pch=19)
This gives a single point forecast. If we want to attach a confidence interval to this prediction, we have to account for the estimation error of the parameters.
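For reference, under the Gaussian linear model the textbook 90% confidence interval for the conditional mean at a new value x is

\hat{y}(x) \pm t_{n-2,\,0.95}\,\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}

which is what predict(reg, interval="confidence", level=.9) computes; the bootstrap approach below does not rely on this distributional assumption.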
Prediction confidence interval
Let's start with the confidence interval of the prediction:
> n=nrow(cars)
> Yx=rep(NA,500)
> for(s in 1:500){
+ indice=sample(1:n,size=n,replace=TRUE)          # resample the observations with replacement
+ regb=lm(dist~speed,data=cars[indice,])          # refit the regression on the bootstrap sample
+ Yx[s]=predict(regb,newdata=data.frame(speed=x))
+ points(x,Yx[s],pch=19,col="blue")
+ }
The blue points are the possible predicted values obtained by resampling from our observed data. Note that, under the assumption of normally distributed residuals (so that the estimated slope and intercept of the regression line are Gaussian), the 90% confidence interval is obtained as follows:
> predict(reg,interval="confidence",
+ level=.9,newdata=data.frame(speed=x))
Here we can look at the distribution of the predicted values over the 500 generated datasets, and compare the empirical quantiles with those obtained under the normality assumption:
> hist(Yx,proba=TRUE)
> D=density(Yx)
> I=which((D$x>=quantile(Yx,.05))&(D$x<=quantile(Yx,.95)))   # region between the empirical 5% and 95% quantiles
> polygon(c(D$x[I],rev(D$x[I])),c(D$y[I],rep(0,length(I))),col="light blue",border=NA)
> boxplot(Yx,horizontal=TRUE)
We can see that the empirical quantiles are comparable to those obtained under the normality assumption.
> quantile(Yx,c(.05,.95))
      5%      95%
58.63689 70.31281
> predict(reg,interval="confidence",
+ level=.9,newdata=data.frame(speed=x))
       fit      lwr      upr
1 65.00149 59.65934 70.34364
Possible values of the variable of interest
Now let's look at another type of interval: an interval for the possible values of the variable of interest itself. This time, in addition to drawing new samples and computing predictions, we also add a resampled noise term to each draw, to generate possible values of the response.
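For comparison, the normal-theory 90% prediction interval (what predict(reg, interval="prediction", level=.9) returns) adds the residual variance to the previous formula:

\hat{y}(x) \pm t_{n-2,\,0.95}\,\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}

The bootstrap version below mimics this by adding a resampled residual to each bootstrap prediction.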
> for(s in 1:500){
+ indice=sample(1:n,size=n,replace=TRUE)
+ base=cars[indice,]
+ regb=lm(dist~speed,data=base)                   # refit on the bootstrap sample
+ erreur=residuals(regb)
+ E=sample(erreur,size=1)                         # one resampled residual, used as noise
+ Yx[s]=predict(regb,newdata=data.frame(speed=x))+E
+ }
Here we can compare (graphically, to start) the values obtained by resampling with those obtained under the normality assumption:
> hist(Yx,proba=TRUE)
> D=density(Yx)
> I=which((D$x>=quantile(Yx,.05))&(D$x<=quantile(Yx,.95)))
> polygon(c(D$x[I],rev(D$x[I])),c(D$y[I],rep(0,length(I))),col="light blue",border=NA)
> boxplot(Yx,horizontal=TRUE)
> abline(v=U[1,2:3],col="red")   # bounds of the normal-theory prediction interval (U computed below)
Numerically, we obtain the following comparison:
> quantile(Yx,c(.05,.95))
      5%      95%
44.43468 96.01357
> U=predict(reg,interval="prediction",
+ level=.9,newdata=data.frame(speed=x))
> U
       fit      lwr      upr
1 67.63136 45.16967 90.09305
This time there is a slight asymmetry to the right. Clearly we cannot assume Gaussian residuals, since the positive deviations are larger than the negative ones. This makes sense given the nature of the data (a braking distance cannot be negative).
Let us now turn to the use of regression models for claims reserving (run-off triangles). To work with (approximately) independent observations, we use incremental payments rather than cumulative payments.
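The post does not show the triangle object itself; as a minimal sketch, assuming the cumulative payments are stored in a 6x6 matrix called PAID (name and shape assumed) with NA in the not-yet-observed cells, an incremental triangle INC can be obtained by differencing along the rows:

# PAID: cumulative run-off triangle (object name assumed; rows = occurrence years, columns = development lags)
INC = PAID
INC[,2:ncol(PAID)] = PAID[,2:ncol(PAID)] - PAID[,1:(ncol(PAID)-1)]   # row-wise differences give incremental payments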
We can then build a dataset in which the explanatory variables are the row (occurrence year) and the column (development year) of the triangle.
> y=as.vector(INC)               # stack the incremental triangle column by column (INC as in the sketch above)
> base=data.frame(
+ y,
+ ai=rep(2000:2005,6),           # occurrence year (row of the triangle)
+ bj=rep(0:5,each=6))            # development year (column of the triangle)
> head(base,12)
y ai bj
1 3209 2000 0
2 3367 2001 0
3 3871 2002 0
4 4239 2003 0
5 4929 2004 0
6 5217 2005 0
7 1163 2000 1
8 1292 2001 1
9 1474 2002 1
10 1678 2003 1
11 1865 2004 1
12 NA 2005 1
We can then fit a first regression model on the logarithm of the incremental payments, i.e. a log-normal model:
> reg1=lm(log(y)~as.factor(ai)+as.factor(bj),data=base)
> summary(reg1)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         7.9471     0.1101  72.188 6.35e-15 ***
as.factor(ai)2001   0.1604     0.1109   1.447
as.factor(ai)2002   0.2718     0.1208   2.250  0.04819 *
as.factor(ai)2003   0.5904     0.1342   4.399  0.00134 **
as.factor(ai)2004   0.5535     0.1562   3.543  0.00533 **
as.factor(ai)2005   0.6126     0.2070   2.959  0.01431 *
as.factor(bj)1     -0.9674
as.factor(bj)2     -4.2329     0.1208 -35.038 8.50e-12 ***
as.factor(bj)3     -5.0571     0.1342 -37.684
as.factor(bj)5     -4.9026     0.2070 -23.685 4.08e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1753 on 10 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.9975,    Adjusted R-squared: 0.9949
F-statistic: 391.7 on 10 and 10 DF,  p-value: 1.338e-11

> py=exp(predict(reg1,
+ newdata=base)+summary(reg1)$sigma^2/2)
> matrix(py,6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 2871.2 1091.3 41.7 18.3  7.8 21.3
[2,] 3370.8 1281.2 48.9 21.5  9.2 25.0
[3,] 3768.0 1432.1 54.7 24.0 10.3 28.0
[4,] 5181.5 1969.4 75.2 33.0 14.2 38.5
[5,] 4994.1 1898.1 72.5 31.8 13.6 37.1
[6,] 5297.8 2013.6 76.9 33.7 14.5 39.3
> sum(py[is.na(y)])
[1] 2481.857
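The term summary(reg1)$sigma^2/2 in the exponent is the usual log-normal bias correction: if log Y is Gaussian with mean \mu and variance \sigma^2, then

\mathbb{E}[Y] = \exp\!\left(\mu + \tfrac{\sigma^2}{2}\right)

so predictions on the original payment scale are obtained as exp(fitted value + \hat{\sigma}^2/2) rather than by simply exponentiating the fitted values.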
This is slightly different from the result of the chain ladder method, but remains comparable. We can also try a Poisson regression:
> reg2=glm(y~
+ as.factor(ai)+
+ as.factor(bj),data=base,
+ family=poisson)
> summary(reg2)

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)
(Intercept)        8.05697    0.01551  519.426  < 2e-16 ***
as.factor(ai)2001  0.06440    0.02090
as.factor(ai)2002  0.20242    0.02025    9.995  < 2e-16 ***
as.factor(ai)2003  0.31175    0.01980   15.744  < 2e-16 ***
as.factor(ai)2004  0.44407    0.01933   22.971  < 2e-16 ***
as.factor(ai)2005  0.50271    0.02079   24.179  < 2e-16 ***
as.factor(bj)1    -0.96513    0.01359  -70.994  < 2e-16 ***
as.factor(bj)2    -4.14853    0.06613  -62.729  < 2e-16 ***
as.factor(bj)3    -5.10499    0.12632  -40.413  < 2e-16 ***
as.factor(bj)4    -5.94962    0.24279  -24.505  < 2e-16 ***
as.factor(bj)5    -5.01244    0.21877  -22.912  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 46695.269  on 20  degrees of freedom
Residual deviance:    30.214  on 10  degrees of freedom
  (15 observations deleted due to missingness)
AIC: 209.52

Number of Fisher Scoring iterations:

> py2=predict(reg2,newdata=base,type="response")
> sum(py2[is.na(y)])
[1] 2426.985
The prediction is now in very close agreement with the one obtained by the chain ladder method. Klaus Schmidt and Angela Wünsche established the link between the chain ladder method, marginal-sum estimation and maximum likelihood estimation in 1998.
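For reference, here is a minimal chain ladder sketch so the total above can be compared directly; it assumes the cumulative triangle is stored in the matrix PAID introduced earlier (an assumed name), with NAs in the not-yet-observed lower-right part.

nc = ncol(PAID)
# development factors: ratio of column sums over the rows observed in both columns
lambda = sapply(1:(nc-1), function(j) sum(PAID[1:(nc-j), j+1]) / sum(PAID[1:(nc-j), j]))
# complete the triangle column by column using the development factors
FULL = PAID
for(j in 2:nc) FULL[, j] = ifelse(is.na(FULL[, j]), FULL[, j-1] * lambda[j-1], FULL[, j])
# total reserve = predicted ultimates minus the latest observed diagonal
sum(FULL[, nc]) - sum(PAID[row(PAID) + col(PAID) == nc + 1])

It is a classical result that the Poisson GLM with occurrence-year and development-year factors reproduces the chain ladder reserve, which is why the total of 2426.985 obtained above matches the chain ladder figure.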