Red Stone’s personal website: Redstonewill.com
Linear regression and logistic regression are usually the first algorithms people learn when studying predictive modeling. Because of their popularity, many analysts assume they are the only forms of regression, and those who know a bit more consider them the two most important of all regression models.
The truth is that there are many forms of regression, each suited to its own context. In this article, I will introduce the seven most common regression models in simple terms. I hope it gives you a broader, more comprehensive understanding of regression, rather than just knowing how to apply linear and logistic regression to practical problems.
This article covers the following topics:
- What is regression analysis?
- Why use regression analysis?
- What are the types of regression?
  - Linear Regression
  - Logistic Regression
  - Polynomial Regression
  - Stepwise Regression
  - Ridge Regression
  - Lasso Regression
  - ElasticNet Regression
- How to choose the appropriate regression model?
1. What is regression analysis?
Regression analysis is a predictive modeling technique that studies the relationship between dependent variables (targets) and independent variables (predictors). This technique is used in forecasting, time series modelling and finding causal relationships between variables. For example, the study of the relationship between reckless driving and the frequency of traffic accidents can be solved by regression analysis.
Regression analysis is an important tool for data modeling and analysis. The figure below illustrates fitting a curve to discrete data points: the curve is chosen so that the differences between the data points and the corresponding points on the curve are, taken together, as small as possible. More details are introduced below.
2. Why regression analysis?
As mentioned above, regression analysis can estimate the relationship between two or more variables. Here’s a simple example:
Let’s say you want to estimate a company’s sales growth based on current economic conditions. You have recent company figures showing that sales have been growing at about two and a half times the rate of the economy. Using this insight, we can predict the company’s future sales based on current and past information.
There are many benefits to using regression models, such as:
- It reveals significant relationships between the dependent variable and the independent variables.
- It reveals the strength of the effect of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured at different scales, such as the effects of price changes and the effects of the amount of promotional activity. This has the advantage of helping the market researcher/data analyst/data scientist evaluate and select the best set of variables for the prediction model.
3. What are the regression types?
There are many regression techniques that can be used to make predictions. These techniques are mainly characterized by three factors: the number of independent variables, the type of the dependent variable, and the shape of the regression line. We will discuss these in detail in the following sections.
For creative people, it is possible to combine the above parameters and even create new regressions. But before we do that, let’s take a look at some of the most common regressions.
1) Linear Regression
Linear regression is the most well-known modeling technique and one of the first choices when learning predictive modeling. In this technique, the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear in nature.
Linear regression establishes the relationship between the dependent variable (Y) and one or more independent variables (X) by using the best fitting line (also known as the regression line).
It is expressed as Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. Given the independent variable X, this expression can be used to compute the predicted value of the dependent variable Y.
The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, while simple linear regression has only one. The next question is: how do we obtain the best fit line?
How do we obtain the best fit line (i.e. determine the values of a and b)?
This problem is easily solved using the Least Squares method, the most common algorithm for fitting a regression line. It finds the best fitting line by minimizing the sum of the squared vertical distances between each data point and the line. Because the errors are squared before being summed, positive and negative errors do not cancel out.
We can use the metric R-Square to evaluate the performance of the model.
Key points:
- There must be a linear relationship between the independent variables and the dependent variable.
- Multiple regression can suffer from multicollinearity, autocorrelation, and heteroscedasticity.
- Linear regression is very sensitive to outliers. Outliers can seriously distort the regression line and the final forecast.
- Multicollinearity increases the variance of the coefficient estimates and makes them sensitive to small changes in the model, resulting in unstable coefficient estimates.
- When there are multiple independent variables, we can choose the most important ones using forward selection, backward elimination, or stepwise selection.
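As a minimal sketch of the ideas above (a least squares fit of a and b, evaluated with R-Square), here is what simple linear regression might look like in Python; scikit-learn and NumPy are assumed to be available, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data following Y = a + b*X + e with a = 4.0 and b = 2.5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one independent variable
y = 4.0 + 2.5 * X[:, 0] + rng.normal(0, 1, 100)    # dependent variable with noise e

model = LinearRegression()     # fits a and b by ordinary least squares
model.fit(X, y)

print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])
print("R-Square:", r2_score(y, model.predict(X)))  # goodness of fit
```

With multiple independent variables, X would simply have more columns; the fitting procedure stays the same.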
2) Logistic Regression
Logistic regression is used to calculate the probability of Success or Failure. Logistic regression should be used when the dependent variable is binary (0/1, True/False, Yes/No). Here, Y ranges from 0 to 1, which can be represented by the following equation.
odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
Where p is the probability of the event occurring. You might be wondering, “Why use the logarithm of the odds in the equation?”
Because the dependent variable here follows a binomial distribution, we need a function that maps the output into [0, 1], and the logit function does exactly that. In the equation above, the best parameters are obtained by maximum likelihood estimation rather than by minimizing the squared error as in linear regression.
Key points:
- Logistic regression is widely used in classification problems.
- Logistic regression does not require a linear relationship between the dependent variable and the independent variables. It can handle various types of relationships because it applies a nonlinear log transformation to the predicted odds.
- To avoid overfitting and underfitting, we should include all significant variables. A good practice to ensure this is to estimate the logistic regression using stepwise selection.
- The larger the training sample, the better: when the sample size is small, maximum likelihood estimation performs worse than ordinary least squares.
- The independent variables should not be correlated with one another, i.e. there should be no multicollinearity. However, in analysis and modeling we can choose to include interaction effects of categorical variables.
- If the values of the dependent variable are ordinal, the model is called ordinal logistic regression.
- If the dependent variable has more than two categories, the model is called multinomial logistic regression.
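To make the logit formulation above concrete, here is a minimal sketch of fitting a binary logistic regression with scikit-learn; the data is synthetic and generated from a known logistic model, so the library call and the numbers are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome: P(y=1) follows a logistic function of two variables
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                            # two independent variables
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0] - X[:, 1])))   # true event probabilities
y = rng.binomial(1, p)                                   # 0/1 dependent variable

clf = LogisticRegression()   # parameters estimated by maximum likelihood
clf.fit(X, y)

print("intercept b0:", clf.intercept_[0])
print("coefficients b1, b2:", clf.coef_[0])
print("P(y=1) for a new point:", clf.predict_proba([[0.2, -0.3]])[0, 1])
```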
3) Polynomial Regression
For a regression equation, if the power of an independent variable is greater than 1, it is a polynomial regression equation, for example:
y = a + b*x^2
In polynomial regression, the best fitting line is not a straight line, but a curve fitting data points.
Key points:
There may be a temptation to fit higher-degree polynomials to reduce the error, but this easily leads to overfitting. Always plot the fitted curve, paying attention to whether it reflects the true distribution of the sample. The example below helps illustrate this.
Pay particular attention to both ends of the curve to see whether the shapes and trends are reasonable. Higher-degree polynomials can produce strange extrapolations.
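A polynomial regression can be fit by expanding the features and then running ordinary linear regression on them. This is a minimal sketch assuming scikit-learn; degree 2 matches the y = a + b*x^2 form above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following a quadratic trend y = a + b*x^2 plus noise
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(150, 1))
y = 1.0 + 0.8 * x[:, 0] ** 2 + rng.normal(0, 0.5, 150)

# Expand x into [1, x, x^2], then fit by least squares
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

print("prediction at x = 2:", poly_model.predict([[2.0]])[0])
```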
4) Stepwise Regression
Stepwise regression is used when we deal with multiple independent variables. In this technique, the independent variables are selected by an automatic procedure, with no manual intervention.
Stepwise regression identifies important variables by examining statistics such as R-squared, t-statistics, and the AIC metric. It fits the regression model step by step, adding or removing covariates according to a specified criterion. The common stepwise regression methods are:
- Standard stepwise regression does two things: it adds or removes independent variables at each step.
- Forward selection starts with the most significant independent variable and adds further variables at each step.
- Backward elimination starts with all of the model’s independent variables and removes the least significant variable at each step.
The goal of this modeling technique is to achieve maximum predictive power with the fewest independent variables. It is also one of the methods for handling high-dimensional data sets.
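scikit-learn has no built-in AIC-based stepwise procedure, so the following is only a rough sketch of forward selection written with statsmodels (assumed to be installed): at each step it adds the variable that lowers the AIC the most and stops when no addition helps.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X, y):
    """Greedy forward selection: add the variable that reduces AIC most at each step."""
    remaining = list(X.columns)
    selected = []
    best_aic = np.inf
    while remaining:
        scores = []
        for col in remaining:
            design = sm.add_constant(X[selected + [col]])
            scores.append((sm.OLS(y, design).fit().aic, col))
        aic, col = min(scores)
        if aic >= best_aic:      # no further AIC improvement: stop
            break
        best_aic = aic
        selected.append(col)
        remaining.remove(col)
    return selected

# Synthetic example: only x1 and x2 actually drive y
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 * X["x1"] - 3 * X["x2"] + rng.normal(0, 1, 200)
print(forward_select(X, y))   # should keep only the informative variables x1 and x2
```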
5) Ridge Regression
Ridge regression is a technique used when the data suffer from multicollinearity (highly correlated independent variables). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are so large that the estimates can be far from the true values. By adding a degree of bias to the regression estimates, ridge regression effectively reduces the variance.
Earlier we introduced the linear regression equation y = a + b*x. That equation also has an error term, and the complete equation can be expressed as:
y = a + b*x + e, where e is the error term, i.e. the value needed to correct for the prediction error between the observed and predicted values.
y = a + b1*x1 + b2*x2 + ... + e, for multiple independent variables.
In a linear equation, the prediction error can be decomposed into two subcomponents. The first is due to bias and the second is due to variance. Prediction errors can occur due to either or both of these components. Here we will discuss errors due to variance.
Ridge regression addresses the multicollinearity problem through the shrinkage parameter λ. Its objective can be written as:
minimize ||y − Xβ||² + λ||β||²
There are two terms in this expression. The first is the least squares term, and the second is the sum of the squares of the coefficients β, multiplied by the shrinkage parameter λ. The purpose of adding the second term is to shrink the magnitude of the coefficients β and thereby reduce the variance.
Key points:
- The assumptions of ridge regression are the same as those of least squares regression, except that normality is not assumed.
- Ridge regression shrinks the coefficient values but never all the way to zero, so it does not perform feature selection.
- It is a regularization method that uses the L2 penalty.
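As a small sketch of the effect, here is ridge regression applied to two almost identical (multicollinear) variables; scikit-learn is assumed, and its alpha parameter plays the role of the shrinkage parameter λ:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two highly correlated independent variables (x2 is almost a copy of x1)
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.01, 200)
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(0, 1, 200)

ridge = Ridge(alpha=1.0)     # alpha is the L2 shrinkage strength (lambda)
ridge.fit(X, y)
print("shrunk coefficients:", ridge.coef_)   # both variables kept, but stabilized
```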
6) Lasso Regression
Similar to ridge regression, lasso (Least Absolute Shrinkage and Selection Operator) regression penalizes the absolute values of the regression coefficients. In addition, it can reduce variability and improve the accuracy of linear regression models. Its objective can be written as:
minimize ||y − Xβ||² + λ Σ|β|
Lasso regression differs from ridge regression in that the penalty uses the sum of the absolute values of the coefficients rather than their squares. This form of penalty (equivalently, a constraint on the sum of the absolute values of the estimates) causes some regression coefficient estimates to be exactly zero. The larger the penalty, the closer the estimates are shrunk toward zero, so the lasso in effect selects a subset of the n variables.
Key points:
- The assumptions of lasso regression are the same as those of least squares regression, except that normality is not assumed.
- Lasso regression helps with feature selection by shrinking some coefficients exactly to zero.
- It is a regularization method that uses the L1 penalty.
- If a set of independent variables is highly correlated, lasso regression will select only one of them and shrink the rest to zero.
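The sparsity described above is easy to see in a small sketch; assuming scikit-learn, the L1 penalty drives most of the irrelevant coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Ten candidate features, but only two of them actually matter
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, 200)

lasso = Lasso(alpha=0.1)     # alpha is the L1 penalty strength (lambda)
lasso.fit(X, y)
print("coefficients:", np.round(lasso.coef_, 2))  # most entries are exactly zero
```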
7) ElasticNet Regression
ElasticNet regression is a hybrid of ridge regression and lasso regression that uses both the L2 and L1 penalties. ElasticNet is useful when there are several correlated features: lasso is likely to pick one of them at random, while ElasticNet is likely to keep both.
A practical advantage of trading off between ridge and lasso is that ElasticNet inherits some of ridge regression’s stability under rotation.
Key points:
- In the case of highly correlated variables, it supports group effects.
- There is no limit on the number of variables it can select.
- It has two shrinkage parameters, λ1 and λ2.
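The sketch below shows the idea with scikit-learn’s ElasticNet, which exposes the two penalties through alpha (overall strength) and l1_ratio (the mix between L1 and L2); the correlated features are synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Six features arranged as three highly correlated pairs
rng = np.random.default_rng(6)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + rng.normal(0, 0.05, (200, 3))])
y = base @ np.array([2.0, -1.5, 1.0]) + rng.normal(0, 1, 200)

# alpha controls overall penalty strength; l1_ratio mixes the L1 and L2 penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print("coefficients:", np.round(enet.coef_, 2))  # correlated pairs tend to be kept together
```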
In addition to the seven most commonly used regression techniques, you can also look at other models, such as Bayesian, ecological, and robust regression.
4. How to choose an appropriate regression model?
Life is usually easy when you only know one or two techniques. One training institute I know of tells its students: if the outcome is continuous, use linear regression; if the outcome is binary, use logistic regression! However, the more options are available, the harder it is to choose the right one. A similar situation occurs when choosing a regression model.
Among the many types of regression models, it is important to select the most appropriate technique based on the types of the independent and dependent variables, the dimensionality of the data, and other essential characteristics of the data. Here are some suggestions for choosing the right regression model:
- Data exploration is an indispensable part of building a prediction model. It should be the first step before choosing a model, for example to determine the relationships among the variables and their influence.
- To compare how well different models fit, we can analyze different metrics, such as the statistical significance of the parameters, R-squared, adjusted R-squared, AIC, BIC, and the error term. Another is Mallows’ Cp criterion, which checks a model for possible bias by comparing it with all possible submodels (or a careful selection of them).
- Cross-validation is the best way to evaluate a prediction model. You can split the data set into two groups (a training set and a validation set) and measure prediction accuracy by the simple mean squared error between the observed and predicted values (a minimal sketch follows this list).
- If the data set has several confounding variables, you should not use an automatic model selection method, because you may not want to put all of them into the model at the same time.
- It also depends on your goal. A simpler model is easier to implement than a highly statistically significant one.
- Regression regularization methods (Lasso, Ridge, and ElasticNet) work well when the data set is high-dimensional and the independent variables are multicollinear.
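As referenced in the cross-validation point above, here is a minimal sketch of scoring a model by mean squared error over several folds; scikit-learn is assumed and the model and data are only illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + rng.normal(0, 1, 200)

# 5-fold cross-validation scored by (negative) mean squared error
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("mean squared error per fold:", np.round(-scores, 2))
```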
Conclusion:
By now, I hope you have an overall picture of regression. These regression techniques should be selected and applied according to the characteristics of the data. One of the best ways to figure out which regression to use is to examine your variables, that is, whether they are discrete or continuous.
In this article, I discussed seven types of regression and the key points associated with each. As a newcomer to this field, I recommend that you learn these techniques and try the models in real-world applications.
Original link:
45 Questions to test a Data Scientist on Regression (Skill Test — Regression Solution)