Endogeneity, the subject of this introduction, can severely bias regression estimates. I will focus on simulating endogeneity caused by an omitted variable. In subsequent articles in this series, I will simulate other specification problems such as heteroscedasticity, multicollinearity, and collider bias.
Data generation process
Consider the data generation process (DGP) for some outcome variable $y$:

$$y = a + bx + cz + e$$

For this simulation, I set values for the parameters $a$, $b$, and $c$, and I simulate two positively correlated independent variables, $x$ and $z$ (N = 500).
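```r
# simulation parameters
set.seed(144)
ss = 500        # sample size (N = 500)
trials = 5000   # number of simulation iterations
a = 50; b = .5; c = .01
# NOTE: d and h (intercept and slope linking z to x) are used below, but
# their values are missing from this listing; define them before running.

# generate two positively correlated independent variables
x = rnorm(n = ss, mean = 1000, sd = 50)
z = d + h * x + rnorm(ss, 0, 10)
```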
Simulation
The simulation will estimate the following two models. The first model is correct and contains all the terms in the actual DGP. The second model omits a variable, $z$, that is present in the DGP. Instead, $z$ is absorbed into the error term $u$.

$$y = a + bx + cz + e \tag{1}$$

$$y = a + bx + u \tag{2}$$

The second model will produce a biased estimate of $b$. The variance can be biased as well. That's because $x$ is endogenous, which is a fancy way of saying it is correlated with the error term. Because $\mathrm{cor}(x, z) \neq 0$ and $u = cz + e$, it follows that $\mathrm{cor}(x, u) \neq 0$. To illustrate this point, I ran a simulation of 5,000 iterations below. For each iteration, I construct the outcome variable $y$ using the DGP. I then run regressions to estimate $b$, first with model 1, then with model 2.
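```r
sim = function(endog) {
  # draw a fresh error term, then build the outcome from the DGP
  e = rnorm(n = ss, mean = 0, sd = 10)
  y = a + b * x + c * z + e
  # select which model to estimate
  if (endog == TRUE) {
    fit = lm(y ~ x)       # model 2: z omitted
  } else {
    fit = lm(y ~ x + z)   # model 1: correct specification
  }
  return(fit$coefficients)
}

# run the simulation for each model
sim_results = t(replicate(trials, sim(endog = FALSE)))
sim_results_endog = t(replicate(trials, sim(endog = TRUE)))
```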
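The endogeneity claim can be checked directly. The following is a minimal sketch, not the post's own code: every `_demo` name is introduced here for illustration, `d_demo = 25` and `h_demo = 0.9` are hypothetical values, and `c_demo = 0.5` is deliberately larger than the post's `c = .01` so that the correlation is easy to see in a single sample. It builds the composite error $u = cz + e$ of model 2 and compares its correlation with $x$ against that of the pure error $e$.

```r
set.seed(1)
n_demo <- 5000                     # larger than the post's N, to reduce noise
c_demo <- 0.5                      # ASSUMPTION: illustrative, not the post's c
d_demo <- 25; h_demo <- 0.9        # ASSUMPTION: hypothetical values

x_demo <- rnorm(n_demo, mean = 1000, sd = 50)
z_demo <- d_demo + h_demo * x_demo + rnorm(n_demo, 0, 10)
e_demo <- rnorm(n_demo, 0, 10)

# error term of model 2: the omitted z is folded into it
u_demo <- c_demo * z_demo + e_demo

cor_xe <- cor(x_demo, e_demo)      # close to zero: x is exogenous in model 1
cor_xu <- cor(x_demo, u_demo)      # clearly positive: x is endogenous in model 2
print(c(cor_x_e = cor_xe, cor_x_u = cor_xu))
```

The first correlation should hover around zero, while the second is far from it, which is what makes the slope on $x$ in model 2 biased.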
Simulation results
The simulation produces two different sampling distributions for the estimate of $b$. Note that I have set the true value to $b = 0.5$. When $z$ is not omitted, the simulation produces the green sampling distribution, centered on the true value. The average across all simulations is 0.4998. When $z$ is omitted, the simulation produces the red sampling distribution, centered around 0.5895. It deviates from the true value of 0.5 by 0.0895. In addition, the variance of the biased sampling distribution is much smaller than the true variance around $b$. This undermines the ability to make any meaningful inferences about the true parameter.
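A self-contained sketch of how the two sampling distributions could be summarized is given here. It is an illustration under stated assumptions, not the post's code: the names ending in `0` are introduced here, `d0 = 25` and `h0 = 0.9` are hypothetical values (so the numbers will not reproduce the 0.5895 reported above), and only 500 iterations are run to keep it quick.

```r
set.seed(2024)
n_obs <- 500                       # sample size, as in the post
n_rep <- 500                       # fewer than the post's 5,000 iterations
a0 <- 50; b0 <- 0.5; c0 <- 0.01    # parameter values from the post
d0 <- 25; h0 <- 0.9                # ASSUMPTION: hypothetical values

# regressors are drawn once and held fixed across iterations
x0 <- rnorm(n_obs, mean = 1000, sd = 50)
z0 <- d0 + h0 * x0 + rnorm(n_obs, 0, 10)

one_draw <- function() {
  y0 <- a0 + b0 * x0 + c0 * z0 + rnorm(n_obs, 0, 10)
  c(full  = unname(coef(lm(y0 ~ x0 + z0))["x0"]),   # model 1
    short = unname(coef(lm(y0 ~ x0))["x0"]))        # model 2, z omitted
}

draws <- t(replicate(n_rep, one_draw()))

# mean and standard deviation of the slope estimate under each model
summary_tab <- rbind(mean = colMeans(draws), sd = apply(draws, 2, sd))
print(round(summary_tab, 4))
```

Under these assumed values the `full` column should be centered near 0.5 with a wide spread, while the `short` column should be shifted upward with a tighter spread, mirroring the pattern described above.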
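The bias can also be derived analytically. Consider that in model 1 (described above), $x$ and $z$ are related in the following way, where $v$ is a random disturbance:

$$z = d + hx + v \tag{3}$$

Substitute equation 3 into equation 1 and rearrange:

$$y = a + bx + c(d + hx + v) + e = (a + cd) + (b + ch)x + (e + cv) \tag{4}$$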
When you omit the variable $z$, it is actually equation 4 that gets estimated. As you can see, the estimate of $b$ is biased by the quantity $ch$. In this case, because $x$ and $z$ are positively correlated by construction and their slope coefficients are positive, the bias is going to be positive. According to the parameters of the simulation, the "true" bias should be $ch$. The simulated distribution of the bias is centered around 0.0895, which is very close to the true bias.
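The relationship between the two slope estimates also holds exactly within any single sample: the slope from the short regression equals the slope on $x$ from the full regression plus the estimated coefficient on $z$ times the slope from regressing $z$ on $x$. The sketch below checks this identity numerically. The `_id` names are introduced here for illustration, and the values 25 and 0.9 used to build `z_id` are hypothetical.

```r
set.seed(7)
n_id <- 500
x_id <- rnorm(n_id, mean = 1000, sd = 50)
z_id <- 25 + 0.9 * x_id + rnorm(n_id, 0, 10)   # ASSUMPTION: 25 and 0.9 are hypothetical
y_id <- 50 + 0.5 * x_id + 0.01 * z_id + rnorm(n_id, 0, 10)

fit_full  <- lm(y_id ~ x_id + z_id)   # model 1
fit_short <- lm(y_id ~ x_id)          # model 2, z omitted
fit_aux   <- lm(z_id ~ x_id)          # sample analogue of equation 3

b_full  <- unname(coef(fit_full)["x_id"])
c_full  <- unname(coef(fit_full)["z_id"])
h_hat   <- unname(coef(fit_aux)["x_id"])
b_short <- unname(coef(fit_short)["x_id"])

# short-regression slope implied by the full and auxiliary regressions
implied_short <- b_full + c_full * h_hat
print(c(b_short = b_short, implied = implied_short))
```

The two printed numbers agree up to floating-point error, which is the sample counterpart of the $b + ch$ term in equation 4.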
The above derivation also lets us determine the direction of the bias from two things we may know: the sign of the correlation between $x$ and $z$, and the sign of $c$ (the true partial effect of $z$ on $y$). If both have the same sign, the estimate of $b$ will be biased upward. If the signs differ, the estimate of $b$ will be biased downward.
Conclusion
The above example is general, but it has specific applications. For example, if we believe that an individual's income is a function of years of education and years of experience, then omitting one of these variables will bias the slope estimate of the other.
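The sign rule for the direction of the bias can be checked with a short sketch. All names and values here are assumptions chosen for illustration (standardized regressors, large coefficients, one large sample per case) so that the direction is unambiguous. They are not the post's parameters.

```r
set.seed(99)
n_sign <- 5000
b_true <- 0.5

# all four sign combinations of c (effect of z on y) and h (slope of z on x)
signs <- expand.grid(c_val = c(-2, 2), h_val = c(-1, 1))

signs$bias <- mapply(function(c_val, h_val) {
  x_s <- rnorm(n_sign)
  z_s <- h_val * x_s + rnorm(n_sign)
  y_s <- 1 + b_true * x_s + c_val * z_s + rnorm(n_sign)
  # slope from the regression that omits z, minus the true slope
  unname(coef(lm(y_s ~ x_s))["x_s"]) - b_true
}, signs$c_val, signs$h_val)

signs$direction <- ifelse(signs$bias > 0, "upward", "downward")
print(signs)
```

Rows where `c_val` and `h_val` share a sign should show an upward bias, and rows where they differ should show a downward bias.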