
Multiple linear regression

We talked about simple linear regression earlier. Simple linear regression is a regression with only one independent variable, but most of the time the dependent variable is affected by multiple independent variables, and that is where multiple linear regression comes in.

Multiple linear regression has two or more independent variables. For example, our salary is affected not only by years of work, but also by industry and position; the salary value is determined by all of these factors together.

The general form of the multiple linear regression model is


y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n

As you can see, multiple linear regression is very similar to simple linear regression; the only difference is the number of independent variables.
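To make the formula concrete, here is a minimal sketch with made-up data (not from the article): after fitting scikit-learn's LinearRegression, intercept_ holds b_0 and coef_ holds b_1 through b_n.

```python
# Minimal sketch with made-up data: intercept_ is b_0 and coef_ holds b_1 ... b_n.
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables (e.g. years of work and position level), one dependent variable (salary)
X = np.array([[1, 1], [2, 1], [3, 2], [4, 2], [5, 3]])
y = np.array([30, 35, 45, 50, 62])

regressor = LinearRegression().fit(X, y)
print(regressor.intercept_)   # b_0
print(regressor.coef_)        # [b_1, b_2]
```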

Multiple linear regression relies on several assumptions:

  1. Linearity
  2. Homoscedasticity
  3. Multivariate normality
  4. Independence of errors
  5. Lack of multicollinearity

Dummy variable trap

As mentioned earlier, non-numeric fields need to be converted into dummy variables. Take the location field as an example, and suppose it appears in the following data:

| price | location   |
| ----- | ---------- |
| 1     | Location A |
| 2     | Location B |
| 3     | Location B |
| 4     | Location A |

The location field contains categorical values such as Location A and Location B, which cannot be used directly in the calculation, so we convert them into dummy variables:

| Location A | Location B |
| ---------- | ---------- |
| 1          | 0          |
| 0          | 1          |
| 0          | 1          |
| 1          | 0          |

Thus we represent the location column with dummy variables, which is equivalent to adding the following terms to the model:


b_2 x_2 + b_3 x_3

However, this introduces a problem: the values of the Location A and Location B columns are redundant, which is exactly the multicollinearity problem.

When Location A = 0, the row is already known to be Location B, so there is no need for an additional Location B = 1 column to express it. The relationship between the Location A and Location B columns can be written as


D_2 = 1 - D_1

Therefore, when the original feature has M categories, converting it into M dummy variables creates perfect collinearity between the variables, so we do not actually need to add the Location B column to the model. To sum up, a feature with M categories should be converted into M-1 dummy variables.
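In practice this can be done directly with pandas. The sketch below uses made-up data mirroring the table above: get_dummies with drop_first=True produces only M-1 dummy columns and so avoids the dummy variable trap.

```python
# Illustrative sketch with made-up data: encode a categorical column without the dummy variable trap.
import pandas as pd

df = pd.DataFrame({
    "price": [1, 2, 3, 4],
    "location": ["A", "B", "B", "A"],
})

# drop_first=True drops the first category ("A"); the remaining "location_B" column is enough,
# since Location A is simply location_B == 0, so no collinearity is introduced.
dummies = pd.get_dummies(df, columns=["location"], drop_first=True)
print(dummies)
```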

Building the model

A problem may have many candidate variables, but we should only add the ones that have a significant effect on the dependent variable to the model. So how do we determine which variables are significant enough to keep? There are five common approaches to building a multiple linear regression model:

  1. All-in
  2. Backward Elimination
  3. Forward Selection
  4. Bidirectional Elimination
  5. Score Comparison

All-in and score comparison are the rough-and-ready approaches: all-in simply puts every variable into the model and lets them all participate in the calculation, while score comparison tries every possible combination of variables and keeps the combination with the best score.
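As a rough illustration of score comparison (not code from the article), the sketch below fits every combination of candidate columns and keeps the one with the best adjusted R-squared. It assumes X is a NumPy array whose first column is the constant term and y is the target vector; note the cost grows exponentially with the number of variables.

```python
# Illustrative sketch of score comparison: try every subset of candidate columns
# and keep the one with the highest adjusted R-squared.
from itertools import combinations
import statsmodels.regression.linear_model as sm

def best_subset(X, y):
    candidates = range(1, X.shape[1])                # column 0 is the constant term
    best_score, best_cols = None, None
    for k in range(1, X.shape[1]):
        for cols in combinations(candidates, k):
            model = sm.OLS(endog=y, exog=X[:, [0, *cols]]).fit()
            if best_score is None or model.rsquared_adj > best_score:
                best_score, best_cols = model.rsquared_adj, [0, *cols]
    return best_cols, best_score
```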

The more commonly used approaches are backward elimination, forward selection, and bidirectional elimination; together they are also called stepwise regression.

Backward elimination

The idea of backward elimination is to first choose a significance level and then build the model with all variables. In each round, find the variable with the largest p-value; if that p-value is greater than the significance level, remove the variable and refit. Repeat until every remaining variable's p-value is below the significance level, which gives the final model.

graph TB
    A[1. Set significance level] --> B[2. Build the model with all variables]
    B --> C[3. Compute the p-value of each variable]
    C --> |largest p-value greater than the significance level| D[4. Remove the variable with the largest p-value]
    C --> |all p-values below the significance level| E[5. Finalize the model]
    D --> C
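The loop in the flowchart can also be automated. The following is only an illustrative sketch, not the article's implementation; it assumes X is a NumPy array that already contains a leading column of ones for the constant term and y is the target vector.

```python
# Illustrative sketch of automated backward elimination.
# X is assumed to already contain a leading column of ones for the constant term b_0.
import numpy as np
import statsmodels.regression.linear_model as sm

def backward_eliminate(X, y, significance_level=0.05):
    columns = list(range(X.shape[1]))        # start with every column (all-in)
    while True:
        model = sm.OLS(endog=y, exog=X[:, columns]).fit()
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] <= significance_level or len(columns) == 1:
            return columns, model            # every remaining variable is significant
        del columns[worst]                   # drop the variable with the largest p-value
```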

Forward selection

Backward elimination removes the low-impact variables from the full set, while forward selection takes the opposite approach and picks out the high-impact variables one by one. First, we still choose a significance level. Next, we fit a simple linear regression for each variable and add the one with the smallest p-value (provided it is below the significance level) to the model. Then we repeatedly try adding each of the remaining variables, each time keeping the one that gives the smallest p-value, until no remaining variable falls below the significance level.

graph TB
    A[1. Set significance level] --> B[2. Fit a simple linear regression for each variable]
    B --> C[3. Add the variable with the smallest p-value to the model]
    C --> D[4. Try each remaining variable and find the smallest p-value]
    D --> |a variable falls below the significance level| C
    D --> |no variable falls below the significance level| E[5. Finalize the model]
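A corresponding illustrative sketch of forward selection, under the same assumptions as before (X contains a leading constant column, y is the target vector):

```python
# Illustrative sketch of forward selection: repeatedly add the candidate variable
# with the smallest p-value, as long as it stays below the significance level.
import statsmodels.regression.linear_model as sm

def forward_select(X, y, significance_level=0.05):
    selected = [0]                                   # always keep the constant column
    remaining = list(range(1, X.shape[1]))
    while remaining:
        best_p, best_col = None, None
        for col in remaining:
            model = sm.OLS(endog=y, exog=X[:, selected + [col]]).fit()
            p = model.pvalues[-1]                    # p-value of the candidate variable
            if best_p is None or p < best_p:
                best_p, best_col = p, col
        if best_p >= significance_level:
            break                                    # no candidate is significant any more
        selected.append(best_col)
        remaining.remove(best_col)
    return selected, sm.OLS(endog=y, exog=X[:, selected]).fit()
```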

Bidirectional elimination

Bidirectional elimination combines backward elimination and forward selection. First, set two significance levels, one for adding variables and one for removing them. Then, in each round, add new variables by forward selection using the entry level, and remove old variables by backward elimination using the removal level. When no new variable can be added and no old variable can be removed, the model is established.

graph TB
    A[1. Set significance levels] --> B[2. Forward selection]
    B --> C[3. Backward elimination]
    C --> D[4. Did forward selection or backward elimination change anything?]
    D --> |no| E[5. Finalize the model]
    D --> |yes| B
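An illustrative sketch of bidirectional elimination under the same assumptions, alternating a forward step and a backward step with separate entry and removal significance levels (again, this is not the article's code):

```python
# Illustrative sketch of bidirectional elimination.
# X is assumed to contain a leading constant column; y is the target vector.
import statsmodels.regression.linear_model as sm

def bidirectional_eliminate(X, y, enter_level=0.05, remove_level=0.05):
    selected = [0]                                   # always keep the constant column
    changed = True
    while changed:
        changed = False
        # Forward step: add the most significant candidate if it passes the entry level
        remaining = [c for c in range(1, X.shape[1]) if c not in selected]
        best_p, best_col = None, None
        for col in remaining:
            p = sm.OLS(endog=y, exog=X[:, selected + [col]]).fit().pvalues[-1]
            if best_p is None or p < best_p:
                best_p, best_col = p, col
        if best_col is not None and best_p < enter_level:
            selected.append(best_col)
            changed = True
        # Backward step: drop the least significant selected variable if it fails the removal level
        if len(selected) > 1:
            pvalues = sm.OLS(endog=y, exog=X[:, selected]).fit().pvalues
            worst = int(pvalues[1:].argmax()) + 1    # skip the constant column
            if pvalues[worst] > remove_level:
                del selected[worst]
                changed = True
    return selected
```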

Backward elimination implementation – Python

Multiple linear regression is still fitted with the sklearn.linear_model.LinearRegression model.

Since the formula contains a constant term b_0, we use np.ones to add a column of ones to the training set so that b_0 is included in the OLS calculation.

# Multiple linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression

# Create and fit the regressor
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Regression prediction
y_pred = regressor.predict(X_test)

# Backward elimination
import statsmodels.regression.linear_model as sm

# Add a column of ones for the constant term b_0 (the training set has 40 rows)
X_train = np.append(arr=np.ones((40, 1), dtype=float), values=X_train, axis=1)

# Significance level: 0.05
# step 1 - all in
# X_opt = np.array(object=X_train[:, [0, 1, 2, 3, 4, 5]], dtype=float)

# step 2 - remove x2
# X_opt = np.array(object=X_train[:, [0, 1, 3, 4, 5]], dtype=float)

# step 3 - remove x1
# X_opt = np.array(object=X_train[:, [0, 3, 4, 5]], dtype=float)

# step 4 - remove x4
# X_opt = np.array(object=X_train[:, [0, 3, 5]], dtype=float)

# step 5 - remove x5
X_opt = np.array(object=X_train[:, [0, 3]], dtype=float)

# Fit OLS on the selected columns and check the p-values in the summary
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
summary = regressor_OLS.summary()

After each fit, inspect regressor_OLS.summary() and check the p-value of each variable to decide whether a variable needs to be removed.

OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.945
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     652.4
Date:                Fri, 18 Jun 2021   Prob (F-statistic):           1.56e-25
Time:                        11:25:33   Log-Likelihood:                -423.09
No. Observations:                  40   Df Model:                            1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.842e+04   2842.717     17.032      0.000    4.27e+04    5.42e+04
x1             0.8516      0.033     25.542      0.000       0.784       0.919
==============================================================================
Omnibus:                       13.132   Jarque-Bera (JB):               16.254
Skew:                          -0.991   Prob(JB):                     0.000295
Kurtosis:                       5.413   Cond. No.                     1.57e+05
==============================================================================

Finally, the model is established and can be used to make predictions on the data.