Original link: tecdat.cn/?p=14528

Original source: Tuoduan Data Tribe official account

 

When a value is missing, a common recommendation is to replace it with -1 and then add an indicator variable equal to 1 when the original value was missing. This way, no variable and no observation has to be deleted.
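As a minimal sketch of that recoding, on a hypothetical vector x (the names x_ind and x_fix are ours, for illustration):

x <- c(0.3, NA, 0.7, NA)
x_ind <- as.numeric(is.na(x))     # indicator: 1 where the value was missing
x_fix <- ifelse(is.na(x), -1, x)  # missing values replaced by the fixed value -1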


Video: handling missing values with linear regression imputation


Here we simulate data from a model and then set some values of x1 to NA. The general recommendation is to replace each missing value with -1 and then fit the model with a missingness indicator. By default, R's policy is to delete observations with missing values: if 50% of the x1 values are undefined, half of the rows are dropped from the regression.

n=1000
x1=runif(n)
x2=runif(n)
e=rnorm(n,.2)
y=1+2*x1-x2+e
alpha=.05
indice=sample(1:n,size=round(n*alpha))  # rows where x1 will be missing
base=data.frame(y=y,x1=x1,x2=x2)        # x2 must be in the data frame too
base$x1[indice]=NA
reg=lm(y~x1+x2,data=base)               # lm() silently drops incomplete rows
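To check that lm() really dropped the incomplete rows (a quick look at the objects defined above; nobs() is the standard accessor for the number of observations actually used):

nobs(reg)            # 950: the 50 rows with a missing x1 were removed
sum(is.na(base$x1))  # 50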
 

We run the simulation m = 1,000 times and look at the distribution of the estimator of the coefficient of x1,

m=1000
B=rep(NA,m)
for(s in 1:m){
  x1=runif(n)
  x2=runif(n)
  y=1+2*x1-x2+rnorm(n,.2)
  base=data.frame(y=y,x1=x1,x2=x2)
  base$x1[sample(1:n,size=round(n*.5))]=NA     # 50% of x1 missing
  B[s]=coefficients(lm(y~x1+x2,data=base))[2]  # estimated coefficient of x1
}
hist(B,probability=TRUE,col=rgb(0,0,1,.4),border="white",xlab="missing values = 50%")
lines(density(B),lwd=2,col="blue")
abline(v=2,lty=2,col="red")
 

 

Of course, with a lower proportion of missing values there are fewer deleted observations, so the variance of the estimator is smaller.
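A quick way to check this claim is to compare the spread of the estimator for two missing rates (a sketch re-using the simulation above; the helper name sd_for_rate is ours):

sd_for_rate <- function(alpha, m=1000, n=1000){
  B <- rep(NA, m)
  for(s in 1:m){
    x1 <- runif(n); x2 <- runif(n)
    y <- 1 + 2*x1 - x2 + rnorm(n, .2)
    base <- data.frame(y=y, x1=x1, x2=x2)
    base$x1[sample(1:n, size=round(n*alpha))] <- NA
    B[s] <- coefficients(lm(y ~ x1 + x2, data=base))[2]
  }
  sd(B)   # standard deviation of the m estimates
}
sd_for_rate(.5)   # 50% missing: larger spread
sd_for_rate(.05)  # 5% missing: smaller spread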

 

Now let's try the following strategy: replace each missing value with a fixed value and add an indicator variable,

B=rep(NA,m)
for(s in 1:m){
  x1=runif(n)
  x2=runif(n)
  y=1+2*x1-x2+rnorm(n,.2)
  base=data.frame(y=y,x1=x1,x2=x2)
  indice=sample(1:n,size=round(n*.5))
  base$x1[indice]=-1                     # fixed value instead of NA
  reg=lm(y~x1+x2+I(x1==(-1)),data=base)  # indicator for "was missing"
  B[s]=coefficients(reg)[2]
}
hist(B,probability=TRUE,col=rgb(0,0,1,.4),border="white")
lines(density(B),lwd=2,col="blue")
abline(v=2,lty=2,col="red")

 

 

The distribution does not change very much. The same holds when the rate of missing values goes down to 5%,

 

For example, with 5% of the values missing, we obtain

 

If we look at the sample, and in particular at the points with missing values, we can see that

 

so far the missing values were drawn completely at random, independently of everything else,

x1=runif(n)
x2=runif(n)
y=1+2*x1-x2+rnorm(n,.2)
indice=sample(1:n,size=round(n*alpha))  # missing completely at random
clr=rep("black",n)
clr[indice]="red"
plot(x1,y,col=clr)

 

 

(here the missing values, one third of the observations, are the red points). But we can instead assume that, for example, the largest values of x1 are the most likely to be missing,

x1=runif(n)
x2=runif(n)
y=1+2*x1-x2+rnorm(n,.2)
indice=sample(1:n,size=round(n*alpha),prob=x1^3)  # large x1 more likely missing
clr=rep("black",n)
clr[indice]="red"
plot(x1,y,col=clr)
 

 

One might then wonder what the estimator gives in this case.

 

It does not change much, although on close inspection we can see larger differences. What happens, now, with the strategy that replaces the missing values with a fixed value and adds an indicator,



for(s in 1:m){
  x1=runif(n)
  x2=runif(n)
  y=1+2*x1-x2+rnorm(n,.2)
  base=data.frame(y=y,x1=x1,x2=x2)
  indice=sample(1:n,size=round(n*alpha),prob=x1^3)  # missing not at random
  base$x1[indice]=-1
  reg=lm(y~x1+x2+I(x1==(-1)),data=base)
  B[s]=coefficients(reg)[2]
}
 

 

This time, we have a biased estimator.
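A quick numeric check of this bias, from the simulated estimates above:

mean(B) - 2   # average gap between the estimated slope of x1 and its true value 2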

set.seed(1)
indice=sample(1:n,size=round(n*alpha),prob=x1^3)
base$x1[indice]=-1
reg1=lm(y~x1+x2+I(x1==(-1)),data=base)
coefficients(reg1)
      (Intercept)                x1                x2 I(x1 == (-1))TRUE 
        1.0988005         1.7454385        -0.5149477         3.1000668 
base$x1[indice]=NA
reg2=lm(y~x1+x2,data=base)
coefficients(reg2)
(Intercept)          x1          x2 
  1.1123953   1.8612882  -0.6548206 
 

A better strategy, as mentioned above, is imputation: predict a value for each missing observation of the predictor. The simplest approach is to fit a linear model of x1 on x2, calibrated on the rows where x1 is observed, and to use its predictions to fill in the missing values. The regression of interest is then estimated on this completed data set.

for(s in 1:m){
  x1=runif(n)
  x2=runif(n)
  y=1+2*x1-x2+rnorm(n,.2)
  base=data.frame(y=y,x1=x1,x2=x2)
  indice=sample(1:n,size=round(n*alpha),prob=x1^3)
  base$x1[indice]=NA
  reg0=lm(x1~x2,data=base[-indice,])                   # model for x1, fitted on the observed rows
  base$x1[indice]=predict(reg0,newdata=base[indice,])  # impute the missing values
  reg=lm(y~x1+x2,data=base)
  B[s]=coefficients(reg)[2]
}
hist(B,probability=TRUE,col=rgb(0,0,1,.4),border="white")
lines(density(B),lwd=2,col="blue")
abline(v=2,lty=2,col="red")
 

 

In the numerical example, we get

base$x1[indice]=NA
reg0=lm(x1~x2,data=base[-indice,])
base$x1[indice]=predict(reg0,newdata=base[indice,])
reg3=lm(y~x1+x2,data=base)
coefficients(reg3)
(Intercept)          x1          x2 
  1.1593298   1.8612882  -0.6320339 
 

This approach at least corrects the bias.

And if we look closely, the slope of x1 is exactly the same as with the first method, the one that deletes the rows with missing values. This makes sense: the imputed values lie exactly on the regression line of x1 on x2, so they carry no information beyond x2 and leave the estimated coefficient of x1 unchanged.



summary(reg3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.15933    0.06649  17.435  < 2e-16 ***
x1           1.86129    0.21967   8.473  < 2e-16 ***
x2          -0.63203    0.20148  -3.137  0.00176 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.051 on 997 degrees of freedom
Multiple R-squared: 0.1094,	Adjusted R-squared: 0.1076 
F-statistic: 61.23 on 2 and 997 DF,  p-value: < 2.2e-16
 

 
summary(reg2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.11240    0.06878  16.173  < 2e-16 ***
x1           1.86129    0.21666   8.591  < 2e-16 ***
x2          -0.65482    0.20820  -3.145  0.00172 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.037 on 797 degrees of freedom
  (200 observations deleted due to missingness)
F-statistic: 55.5 on 2 and 797 DF,  p-value: < 2.2e-16
 

Instead of a linear regression, another imputation method can be used, for instance one based on the k nearest neighbours.
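The kpp() function used below is not shown in this excerpt. A minimal sketch consistent with the call kpp(j, base0, k=5), assuming neighbours are matched on the fully observed covariate x2, could be:

kpp <- function(j, base, k=5){
  # candidate donors: rows where x1 is observed
  donors <- setdiff(which(!is.na(base$x1)), j)
  # distance from observation j to the donors, on x2
  d <- abs(base$x2[donors] - base$x2[j])
  voisins <- donors[order(d)[1:k]]   # the k nearest neighbours
  mean(base$x1[voisins])             # impute with their average x1
}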

On the simulated data, we obtain

base0=base
for(j in indice) base0$x1[j]=kpp(j,base0,k=5)
reg4=lm(y~x1+x2,data=base0)
coefficients(reg4)
(Intercept)          x1          x2 
   1.197944    1.804220   -0.806766 
 

If we look at the 1,000 simulations, we see

for(s in 1:m){
  x1=runif(n)
  x2=runif(n)
  y=1+2*x1-x2+rnorm(n,.2)
  base=data.frame(y=y,x1=x1,x2=x2)
  indice=sample(1:n,size=round(n*alpha),prob=x1^3)
  base$x1[indice]=NA
  base0=base
  for(j in indice) base0$x1[j]=kpp(j,base0,k=5)
  reg=lm(y~x1+x2,data=base0)
  B[s]=coefficients(reg)[2]
}
hist(B,probability=TRUE,col=rgb(0,0,1,.4),border="white")
lines(density(B),lwd=2,col="blue")
abline(v=2,lty=2,col="red")
 

 

The bias here seems weaker than without imputation; in other words, the imputation approach seems to me more robust than the strategy that replaces the NAs with an arbitrary value and adds an indicator to the regression.
