Original link: tecdat.cn/?p=22805

Original source: Tecdat (拓端数据部落) official WeChat account

 

Why dummy variables?

Most data can be measured numerically, such as height and weight. However, categorical variables such as gender, season and location cannot. Instead, we represent them with dummy variables, which take the value 1 when an observation belongs to a category and 0 otherwise.

Example: Gender

Let’s assume that the effect of x on y differs between men and women.

For men, y = 10 + 5x + e.

For women, y = 5 + x + e.

Here e is a random error with mean zero. So in the true relationship between y and x, gender affects both the intercept and the slope.

First, let’s generate the data we need.

# true slope: men = 5, women = 1 (gender: 1 = male, 0 = female; e is the error term)
d$y <- ifelse(d$gender == 1, 10 + 5 * d$x + e, 5 + d$x + e)
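The data-generation step can be sketched end to end as follows. Only the ifelse() rule comes from the text; the seed, the sample size n, the range of x and the error standard deviation are assumptions chosen for illustration.

```r
set.seed(42)                       # assumed seed, only for reproducibility
n <- 200                           # assumed sample size
d <- data.frame(
  gender = rbinom(n, 1, 0.5),      # 1 = male, 0 = female
  x      = runif(n, 0, 10)         # assumed range for x
)
e <- rnorm(n, mean = 0, sd = 1)    # mean-zero error; sd = 1 is an assumption

# Gender shifts both the intercept (10 vs. 5) and the slope (5 vs. 1)
d$y <- ifelse(d$gender == 1, 10 + 5 * d$x + e, 5 + d$x + e)
head(d)
```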

First, we can look at the relationship between x and y and color the data by gender.

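library(ggplot2)  # scatter plot colored by gender (ggplot2 assumed)
ggplot(d, aes(x, y, color = factor(gender))) + geom_point()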

Clearly, the relationship between y and x cannot be described by a single line. We need two: one for men and one for women.

If we simply regress y on x and gender, the result is

The estimated coefficient of x is incorrect.
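A minimal sketch of that misspecified fit, using simulated data that follows the rule stated earlier (the seed, sample size and range of x are assumptions). Because the additive model forces both groups to share one slope, the estimate lands between the two true slopes.

```r
set.seed(42)                                  # assumed seed
n <- 200                                      # assumed sample size
d <- data.frame(gender = rbinom(n, 1, 0.5), x = runif(n, 0, 10))
d$y <- ifelse(d$gender == 1, 10 + 5 * d$x, 5 + d$x) + rnorm(n)

# Additive model: gender may shift the intercept, but there is only one slope
fit_additive <- lm(y ~ x + gender, data = d)
slope_x <- coef(fit_additive)[["x"]]
slope_x   # falls between the true slopes of 1 and 5, matching neither group
```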

The correct specification lets gender affect both the intercept and the slope.

Or add a dummy variable using the following method.
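The two routes can be sketched side by side, under the same simulation assumptions as before. The formula y ~ x * gender expands to x + gender + x:gender; the hand-built column gender_x (a name introduced here for illustration) should give identical coefficients.

```r
set.seed(42)                                  # assumed seed
n <- 200                                      # assumed sample size
d <- data.frame(gender = rbinom(n, 1, 0.5), x = runif(n, 0, 10))
d$y <- ifelse(d$gender == 1, 10 + 5 * d$x, 5 + d$x) + rnorm(n)

# Route 1: let the formula build the interaction term
fit_formula <- lm(y ~ x * gender, data = d)

# Route 2: construct the interaction dummy by hand
d$gender_x <- d$gender * d$x
fit_manual <- lm(y ~ x + gender + gender_x, data = d)

coef(fit_formula)
coef(fit_manual)
```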

The model states that for women (gender = 0), the estimated relationship is y = 5.20 + 0.99x; for men (gender = 1), it is y = 5.20 + 0.99x + 4.5 + 4.02x, that is, y = 9.70 + 5.01x, which is very close to the true relationship.
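The arithmetic behind those two fitted lines can be written out directly. The four numbers are the estimates quoted above; the variable and function names are only illustrative.

```r
b0       <- 5.20   # intercept (women)
b_x      <- 0.99   # slope on x (women)
b_gender <- 4.5    # extra intercept for men
b_inter  <- 4.02   # extra slope for men (the x:gender term)

# Intercept and slope implied by the model for a given gender (0 or 1)
fitted_line <- function(gender) {
  c(intercept = b0 + b_gender * gender,
    slope     = b_x + b_inter * gender)
}

fitted_line(0)   # intercept 5.20, slope 0.99
fitted_line(1)   # intercept 9.70, slope 5.01
```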

Next, let’s try two dummy variables: gender and location.

Dummy variables of gender and location

Gender doesn’t matter, but location does

Let’s get some data where gender doesn’t matter, but location does.
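The article does not show how this data set is built, so the following is only a sketch. The two cities are the ones named later in the text; the coefficients, seed and sample size are hypothetical. The key point is that y is generated from location and x alone, so gender has no effect by construction.

```r
set.seed(7)                                   # assumed seed
n <- 200                                      # assumed sample size
d <- data.frame(
  gender   = sample(c("0", "1"), n, replace = TRUE),
  location = sample(c("Toronto", "Chicago"), n, replace = TRUE),
  x        = runif(n, 0, 10)
)
e <- rnorm(n)                                 # mean-zero error

# Hypothetical coefficients: only location enters the rule, gender does not
d$y <- ifelse(d$location == "Chicago", 5 + 3 * d$x + e, 1 + 1 * d$x + e)
table(d$gender, d$location)
```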

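Plot the relationship between x and y, coloring the data by gender and splitting the panels by location.

library(ggplot2)  # one panel per location, points colored by gender (ggplot2 assumed)
ggplot(d, aes(x, y, color = gender)) + geom_point() + facet_grid(. ~ location)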

The effect of gender on y does not appear to be significant. But when you compare the Chicago data with the Toronto data, both the intercept and the slope differ.

If we ignore the effects of gender and location, the model will be

R-squared is pretty low.

We know gender doesn’t matter, but we’ll just throw it in and see if it makes a difference.

As expected, the effect of gender was not significant.
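These two fits can be sketched on simulated data in which gender truly has no effect (all coefficients, the seed and the sample size are hypothetical). Adding gender should barely move R-squared, and its coefficient row can be inspected for significance.

```r
set.seed(7)                                   # assumed seed
n <- 200                                      # assumed sample size
d <- data.frame(
  gender   = sample(c("0", "1"), n, replace = TRUE),
  location = sample(c("Toronto", "Chicago"), n, replace = TRUE),
  x        = runif(n, 0, 10)
)
d$y <- ifelse(d$location == "Chicago", 5 + 3 * d$x, 1 + 1 * d$x) + rnorm(n)

fit_pooled <- lm(y ~ x, data = d)             # ignores gender and location
fit_gender <- lm(y ~ x + gender, data = d)    # adds the irrelevant gender dummy

r2 <- c(pooled      = summary(fit_pooled)$r.squared,
        with_gender = summary(fit_gender)$r.squared)
r2
summary(fit_gender)$coefficients["gender1", ]  # estimate, std. error, t, p-value
```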

Now let’s look at the effect of location.

Location matters a lot. But our model specification basically says that location only shifts the intercept.

What if location changes both the intercept and the slope?

You can try this too.
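A hedged sketch of that comparison, again on hypothetical simulated data: one model lets location shift only the intercept, the other lets it change the slope as well through an x:location interaction.

```r
set.seed(7)                                   # assumed seed
n <- 200                                      # assumed sample size
d <- data.frame(
  location = sample(c("Toronto", "Chicago"), n, replace = TRUE),
  x        = runif(n, 0, 10)
)
# Hypothetical truth: location changes both the intercept and the slope
d$y <- ifelse(d$location == "Chicago", 5 + 3 * d$x, 1 + 1 * d$x) + rnorm(n)

fit_shift <- lm(y ~ x + location, data = d)   # location shifts the intercept only
fit_both  <- lm(y ~ x * location, data = d)   # location shifts intercept and slope

r2 <- c(shift = summary(fit_shift)$r.squared,
        both  = summary(fit_both)$r.squared)
r2
anova(fit_shift, fit_both)                    # F-test for the extra x:location term
```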

Gender doesn’t matter, and location changes the intercept and slope.

Gender and location both matter. Two locations

Now let’s generate some data in which both gender and location matter. Let’s start with two locations.

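# ... (the opening branch, for gender == "1" in Toronto, is cut off in the source)
ifelse(d$gender == "0" & d$location == "Toronto", 1 + 1 * d$x + e,
ifelse(d$gender == "1" & d$location == "Chicago", 2 + 20 * d$x + e,
ifelse(d$gender == "0" & d$location == "Chicago", 2 + 2 * d$x + e, NA))))

library(ggplot2)  # one panel per location, points colored by gender (ggplot2 assumed)
ggplot(d, aes(x, y, color = gender)) + geom_point() + facet_grid(. ~ location)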
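One way to model such data is a full three-way interaction, sketched below. The Toronto branch for gender "1" is not legible in the source, so the rule 1 + 10 * x used here is purely an assumption, as are the seed and sample size; group_slope() is a helper introduced for illustration that reads each group's slope off the fitted model.

```r
set.seed(1)                                   # assumed seed
n <- 400                                      # assumed sample size
d <- data.frame(
  gender   = sample(c("0", "1"), n, replace = TRUE),
  location = sample(c("Toronto", "Chicago"), n, replace = TRUE),
  x        = runif(n, 0, 10)
)
e <- rnorm(n)
d$y <- ifelse(d$gender == "1" & d$location == "Toronto", 1 + 10 * d$x + e,  # assumed branch
       ifelse(d$gender == "0" & d$location == "Toronto", 1 + 1 * d$x + e,
       ifelse(d$gender == "1" & d$location == "Chicago", 2 + 20 * d$x + e,
       ifelse(d$gender == "0" & d$location == "Chicago", 2 + 2 * d$x + e, NA))))

# Every gender-by-location group gets its own intercept and slope
fit <- lm(y ~ x * gender * location, data = d)

# Fitted slope for one group: change in prediction when x goes from 0 to 1
group_slope <- function(g, loc) {
  nd <- data.frame(x = c(0, 1), gender = g, location = loc)
  unname(diff(predict(fit, newdata = nd)))
}
group_slope("1", "Chicago")   # should be close to 20
group_slope("0", "Toronto")   # should be close to 1
```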

Gender and location matter. Five locations

Finally, let’s try a model with five locations.

# ... (the opening branches are cut off in the source)
ifelse(d$gender == "1" & d$location == "Chicago", 2 + 10 * d$x + e,
ifelse(d$gender == "0" & d$location == "Chicago", 2 + 2 * d$x + e,
ifelse(d$gender == "1" & d$location == "New York", 3 + 15 * d$x + e,
ifelse(d$gender == "0" & d$location == "New York", 3 + 5 * d$x + e,
ifelse(d$gender == "1" & d$location == "Beijing", 8 + 30 * d$x + e,
ifelse(d$gender == "0" & d$location == "Beijing", 8 + 2 * d$x + e,
ifelse(d$gender == "1" & d$location == "Shanghai",
# ... (the remaining branches are cut off in the source)

library(ggplot2)  # one panel per location, points colored by gender (ggplot2 assumed)
ggplot(d, aes(x, y, color = gender)) + geom_point() + facet_grid(. ~ location)
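With five locations, R does not need five hand-made columns: a factor with k levels is expanded into k - 1 dummy variables, with the remaining level acting as the baseline. The short sketch below shows this with model.matrix(); treating Toronto as the fifth city is an assumption, since the start of the code above is cut off.

```r
# Five locations; Toronto is assumed to be the fifth
location <- factor(c("Toronto", "Chicago", "New York", "Beijing", "Shanghai"))

# Default coding: the first level alphabetically (Beijing) is the baseline
X <- model.matrix(~ location)
colnames(X)

# Choose a different baseline, e.g. Toronto, with relevel()
location_t <- relevel(location, ref = "Toronto")
X_t <- model.matrix(~ location_t)
colnames(X_t)
```

Each dummy coefficient in a fitted model is then read as the difference from the baseline location.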

 

So, if you think certain factors (gender, location, season, etc.) might affect the relationship between your explanatory variables and the response, include them in the model as dummy variables.

