Original link:tecdat.cn/?p=6358
Original source: Tuoduan Data Tribe (tecdat) WeChat official account
Multiple imputation has become a common method for handling missing data. Suppose some values of a covariate X are missing and we consider using multiple imputation to fill them in. The natural next question is: should the outcome variable Y be included as a covariate in the imputation model for X?
Stata
To illustrate these concepts, we simulate a small data set in Stata, with no missing data initially:
set obs 100
gen x = rnormal()
gen y = x + 0.25 * rnormal()
twoway (scatter y x) (lfit y x)
Scatter plot of Y versus X before any data is missing
Next, we set 50 of X’s 100 observations to be missing:
gen xmiss = (_n <= 50)
replace x = . if xmiss == 1
The imputation model
In this article we have two variables, Y and X. The analysis model is some type of regression of Y on X (that is, Y is the dependent variable and X is a covariate). We want to generate imputations that give valid estimates of the parameters of the Y | X model.
Imputing X while ignoring Y
Suppose we use a regression model to impute X, but do not include Y as a covariate in the imputation model. We can easily do this in Stata, generating one imputed value for each missing value, and then plotting Y against X, using the imputed X where it was missing and the observed X otherwise:
mi set wide
mi register imputed x
mi impute regress x, add(1)
Y versus X, where the missing X values were imputed ignoring Y
The plot clearly shows the problem with imputing the missing values of X while ignoring Y: among the observations where X was imputed, there is no association between Y and X, when in fact there should be.
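A short calculation makes the size of this distortion concrete. It is a sketch under assumptions that are not stated in the article: X is missing completely at random with probability p, the imputed values X* are drawn independently of Y from a distribution with the same mean and variance σ² as X, R = 1 marks an observed X, and β denotes the slope of Y on X in the complete data.

```latex
\begin{align*}
\tilde X &= R\,X + (1-R)\,X^{*}, \qquad P(R=0)=p,\\
\operatorname{Cov}(Y,\tilde X) &= (1-p)\operatorname{Cov}(Y,X) + p\operatorname{Cov}(Y,X^{*})
  = (1-p)\operatorname{Cov}(Y,X),\\
\operatorname{Var}(\tilde X) &= \sigma^{2},\\
\frac{\operatorname{Cov}(Y,\tilde X)}{\operatorname{Var}(\tilde X)} &= (1-p)\,\beta .
\end{align*}
```

With half of X missing (p = 0.5) and a true slope of β = 1, as in the simulated data, the slope fitted to the filled-in data would be expected to be about 0.5 under these assumptions.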
Imputing X taking the outcome into account
Suppose instead that we impute X taking the outcome Y into account, as a covariate in the imputation model for X. The X | Y imputation model is fitted using the individuals whose X is observed. Since we assume that X is missing at random given Y, this complete-case fit is valid. Therefore, if X and Y are in fact associated, we should (in expectation) find that association in the complete-case fit, and the imputed values will reflect it.
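For the simulated data the imputation model can be written out explicitly. The following is a worked sketch, assuming a normal linear imputation model and using notation (γ₀, γ₁, τ²) introduced here for illustration. The numbers are population values implied by X ~ N(0, 1) and Y = X + 0.25ε, and the fit on the 50 complete cases would only approximate them.

```latex
\begin{align*}
X \mid Y &\sim N\!\left(\gamma_0 + \gamma_1 Y,\ \tau^{2}\right),\\
\gamma_1 &= \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(Y)}
          = \frac{1}{1 + 0.25^{2}} \approx 0.941, \qquad \gamma_0 = 0,\\
\tau^{2} &= \operatorname{Var}(X) - \gamma_1^{2}\operatorname{Var}(Y)
          = 1 - \frac{1}{1.0625} \approx 0.059 .
\end{align*}
```

Because the residual variance τ² is small relative to Var(X) = 1, values imputed from this model track Y closely, which is what preserves the Y–X association.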
Continuing with our simulated data set, we first discard the previously generated imputation and then re-impute X, this time including Y as a covariate in the imputation model:
mi set M = 0
mi impute regress x = y, add(1)
Y versus X, where Y was used to impute the missing X values
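The contrast between the two imputation models can also be checked numerically. The following is a minimal sketch in Python rather than Stata, using only the standard library. The function names `ols` and `simulate`, the seed, and the sample sizes are assumptions made for illustration. It is also a simplification: it performs a single stochastic regression imputation with the fitted parameters held fixed, whereas proper multiple imputation would also reflect the uncertainty in those parameters.

```python
import random


def ols(xs, ys):
    """Least-squares fit of ys on xs: returns (intercept, slope, residual_sd)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def simulate(n=100, seed=1):
    """Return the slope of y on x after imputing x (a) ignoring y, (b) using y."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + 0.25 * rng.gauss(0, 1) for xi in x]

    n_miss = n // 2                      # x is treated as missing in the first half
    x_obs, y_obs = x[n_miss:], y[n_miss:]
    y_mis = y[:n_miss]

    # (a) Ignore y: draw from a normal with the observed mean and sd of x.
    m = sum(x_obs) / len(x_obs)
    s = (sum((v - m) ** 2 for v in x_obs) / (len(x_obs) - 1)) ** 0.5
    x_ignore = [rng.gauss(m, s) for _ in range(n_miss)] + x_obs

    # (b) Use y: regress x on y in the complete cases, then draw
    # prediction plus residual noise for each missing x.
    a, b, sd = ols(y_obs, x_obs)
    x_use = [a + b * yi + rng.gauss(0, sd) for yi in y_mis] + x_obs

    return ols(x_ignore, y)[1], ols(x_use, y)[1]


slope_ignoring_y, slope_using_y = simulate(n=100, seed=1)
print(slope_ignoring_y, slope_using_y)
```

With a sample as small as 100 the two slopes vary from seed to seed; increasing `n` makes the pattern clearer, with the slope under approach (a) shrinking toward roughly half the true value of 1 and the slope under approach (b) staying close to 1.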
Variable selection in multiple imputation
The general rule for selecting variables for the imputation model is that every variable involved in the analysis model must be included, either as a variable being imputed or as a covariate in the imputation model.