Original link: tecdat.cn/?p=23449

Original source: Tuoduan Data Tribe official account

This article answers common questions about logistic regression: how it differs from linear regression, how to fit and evaluate these models in R with the glm() function, and more.

Logistic regression is a machine learning technique from the field of statistics. It is a powerful statistical method for modeling binomial outcomes with one or more explanatory variables. It measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities with a logistic function, the cumulative logistic distribution.

This R tutorial will guide you through the simple implementation of logistic regression.

  • You will first explore the theory behind logistic regression: you will learn how it differs from linear regression and what a logistic regression model looks like. You will also encounter multinomial and ordinal logistic regression.
  • Next, you will solve a logistic regression problem in R: you will not only explore a data set, but also use R’s powerful glm() function to fit a logistic regression model, evaluate the results, and deal with overfitting.

Tip: If you are interested in taking your linear regression skills to the next level, consider also taking our R language course!

Regression analysis: An introduction

Logistic regression is a regression analysis technique. Regression analysis is a set of statistical procedures that you can use to estimate relationships between variables. More specifically, you use this set of techniques to model and analyze the relationship between a dependent variable and one or more independent variables. Regression analysis helps you understand how the typical value of the dependent variable changes when one independent variable is adjusted and the others are fixed.

As you have read, there are various regression techniques. You can tell them apart by looking at three things: the number of independent variables, the type of dependent variables, and the shape of the regression line.

Linear regression

Linear regression is one of the most widely known modeling techniques. In short, it lets you use a linear relationship to predict the (average) value of Y for a given value of X, using a straight line. This line is called the “regression line”.

So the linear regression model is y = ax + b. The model assumes that the dependent variable Y is quantitative. In many cases, however, the dependent variable is qualitative or, in other words, categorical. For example, gender is qualitative and can be male or female.

Predicting a qualitative response to an observation can be referred to as classifying that observation because it involves assigning observations to a category or rank. On the other hand, methods often used for classification first predict the probability of each category of qualitative variables as a basis for classification.

Linear regression does not predict probability. For example, if you use linear regression to model a binary dependent variable, the resulting model may not limit the predicted Y values to 0 and 1. This is where logistic regression comes in, where you can get a probability score that reflects the probability of an event happening.

Logistic regression

Logistic regression is an example of a classification technique that you can use to predict a qualitative response. More specifically, logistic regression models the probability that the response falls into a particular category.

What this means is that if you want to do gender categorization, where the response gender falls into one of two categories, male or female, you will use a logistic regression model to estimate the probability that the gender falls into a particular category.

For example, the probability that gender is female given long hair can be written as:

Pr(gender = female | longhair)

abbreviated p(longhair), a value between 0 and 1. Then, for any given value of longhair, a prediction can be made for gender.

Given an explanatory variable X and a dependent variable Y, how should you model the relationship between p(X) = Pr(Y = 1 | X) and X? The linear regression model would represent these probabilities as:

p(X) = β0 + β1X

The problem with this approach is that whenever a straight line is fit to a binary dependent variable coded as 0 or 1, we can in principle always predict p(X) < 0 for some values of X and p(X) > 1 for others.

To avoid this problem, you can model p(X) with the logistic function, whose output lies between 0 and 1 for all values of X:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

The logistic function always produces an S-shaped curve, so whatever the value of X, we get a sensible prediction.
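To see the S-shape concretely, here is a minimal base-R sketch that plots the logistic function for arbitrary illustrative coefficients (β0 = −1 and β1 = 2 are made up, not values from this tutorial):

# Plot p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X)) for illustrative coefficients
b0 <- -1; b1 <- 2
curve(exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x)), from = -4, to = 4,
      xlab = "X", ylab = "p(X)")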

The logistic function can also be rearranged as:

p(X) / (1 − p(X)) = e^(β0 + β1X)

The quantity p(X) / (1 − p(X)) is called the odds, and it can take any value between 0 and infinity. Odds close to 0 and to infinity indicate that p(X) is very low and very high, respectively.

By taking the logarithm of both sides of this equation, you get:

log(p(X) / (1 − p(X))) = β0 + β1X

The left-hand side is called the logit. In a logistic regression model, increasing X by one unit changes the log odds by β1. Whatever the value of X, if β1 is positive then increasing X is associated with increasing p(X), and if β1 is negative then increasing X is associated with decreasing p(X).

The coefficients β0 and β1 are unknown and must be estimated based on available training data. For logistic regression, you can use maximum likelihood, a powerful statistical technique. Let’s look at your example of gender categorization again.

You look for estimates of β0 and β1 such that, plugged into the model for p(X), they produce a number close to 1 for all female samples and a number close to 0 for all non-female samples.

This can be formalized by a mathematical equation called the likelihood function:

ℓ(β0, β1) = ∏ p(xi) × ∏ (1 − p(xi′))

where the first product runs over observations with yi = 1 and the second over observations with yi′ = 0.

The estimates of β0 and β1 are chosen to maximize this likelihood function. Once the coefficients have been estimated, you can simply compute the probability of being female for any value of longhair. In general, maximum likelihood is a very good approach to fitting non-linear models.
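As an aside, here is a minimal sketch of the log-likelihood that maximum likelihood maximizes, written as a plain R function (the names loglik, b0, b1, x, and y are illustrative only):

# Log-likelihood of candidate coefficients (b0, b1) for a 0/1 response y
loglik <- function(b0, b1, x, y) {
  p <- 1 / (1 + exp(-(b0 + b1 * x)))      # logistic function
  sum(y * log(p) + (1 - y) * log(1 - p))  # sum of per-observation log-likelihoods
}

In practice you do not maximize this by hand; R’s glm() does it for you, as you will see below.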

Multinomial logistic regression

So far, this tutorial has only focused on binomial logistic regression, since you were classifying instances as male or female. The multinomial logistic regression model is a simple extension of the binomial model that you can use when the outcome variable has more than two nominal (unordered) categories.

In multinomial logistic regression, the outcome variable is dummy-coded into multiple 1/0 variables. There is a variable for every category but one, so if there are M categories there will be M − 1 dummy variables. Each category’s dummy variable takes the value 1 for cases in that category and 0 otherwise. One category, the reference category, needs no dummy variable of its own, because it is uniquely identified by all the other variables being 0.
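A quick way to see this dummy coding in R is model.matrix(); the toy factor f below is invented for illustration:

# A factor with M = 3 levels yields M - 1 = 2 dummy columns;
# the first level, "a", serves as the reference category
f <- factor(c("a", "b", "c", "b", "a"))
model.matrix(~ f)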

Multinomial logistic regression then estimates a separate binary logistic regression model for each of those dummy variables. The result is M − 1 binary logistic regression models. Each one conveys the effect of the predictors on the probability of success in that category, relative to the reference category.

Ordinal logistic regression

Besides multinomial logistic regression, there is also ordinal logistic regression, another extension of binomial logistic regression. Ordinal regression is used to predict a dependent variable with “ordered” categories from independent variables. You can already see this in the name of this type of logistic regression, since “ordinal” refers to the order of the categories.

In other words, it is used to analyze the relationship between a dependent variable with multiple ordered levels and one or more independent variables.

For example, suppose you are conducting customer interviews to evaluate satisfaction with a newly launched product. Your task is to ask respondents a question whose answers lie somewhere between very satisfied and very dissatisfied. To summarize the answers well, you grade them as very dissatisfied, dissatisfied, neutral, satisfied, and very satisfied. This preserves the natural order of the categories; a sketch of fitting such a model in R follows below.
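As a hedged sketch, ordinal logistic regression can be fit in R with polr() from the MASS package; the satisfaction levels and the age predictor below are made-up toy data, not part of this tutorial’s data set:

# Ordinal logistic regression on toy data with MASS::polr
library(MASS)
lvls <- c("very dissatisfied", "dissatisfied", "neutral",
          "satisfied", "very satisfied")
set.seed(1)
d <- data.frame(
  satisfaction = factor(sample(lvls, 100, replace = TRUE),
                        levels = lvls, ordered = TRUE),
  age = rnorm(100, mean = 40, sd = 10)  # hypothetical predictor
)
fit <- polr(satisfaction ~ age, data = d, Hess = TRUE)
summary(fit)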

Logistic regression in R with glm()

In this section, you will work through an example of binary logistic regression. You will use the ISLR package, which supplies the data set, and the glm() function, which is used to fit generalized linear models in general and will be used here to fit a logistic regression model.

Load the data

The first thing to do is install and load the ISLR package, which has all the data sets you want to use.
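A minimal sketch of that step (skip the install line if ISLR is already installed):

install.packages("ISLR")  # once
library(ISLR)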

In this tutorial, you will use the Smarket data set, which records daily returns for the Standard & Poor’s 500 stock index between 2001 and 2005.

Explore the data

Let’s explore it. names() is useful for seeing what is on the data frame, head() gives a glimpse of the first few rows, and summary() is also useful.
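A sketch of those three calls on the Smarket data frame:

names(Smarket)    # variable names
head(Smarket)     # first few rows
summary(Smarket)  # per-variable summary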

 

The summary() function gives you a brief per-variable summary of the data frame. You can see the volume, today’s return, and the Direction variable. You will use Direction as the dependent variable, because it shows whether the market went up or down since the previous day.

Visualization of data

Data visualization is probably the fastest and most useful way to summarize and understand your data. You’ll start by exploring numerical variables individually.

A histogram displays a numeric variable divided into bins whose heights show the number of instances falling into each bin. Histograms are useful for getting a sense of an attribute’s distribution.

for (i in 1:8) hist(Smarket[, i], main = names(Smarket)[i])

This is extremely difficult to see, but most variables show a Gaussian or double-Gaussian distribution.

You can look at the distribution of the data in a different way using box-and-whisker plots. The box covers the middle 50% of the data, the line shows the median, and the whiskers show the reasonable extent of the data. Any point beyond the whiskers is an outlier.

for (i in 1:8) boxplot(Smarket[, i], main = names(Smarket)[i])

All the Lag variables and Today have a similar range. Beyond that, there is no sign of outliers.

Missing data has a big impact on modeling. You can therefore use a missingness map to get a quick idea of the amount of missing data in the dataset. The x-axis shows attributes and the y-axis shows instances. Horizontal lines indicate missing data for an instance, and vertical blocks indicate missing data for an attribute.


library(Amelia)  # the truncated mis(...) call is presumably Amelia's missmap()
missmap(Smarket, col = c("blue", "red"))

There is no missing data in this dataset!

Let’s begin by computing the correlation between each pair of numeric variables. These pairwise correlations can be plotted as a correlation matrix to get an idea of which variables change together.


correlations <- cor(Smarket[, 1:8])  # pairwise correlations of the numeric columns
library(corrplot)
corrplot(correlations, method = "circle")

In this dot representation, blue stands for positive correlation and red for negative; the larger the dot, the stronger the correlation. You can see that the matrix is symmetric and that the diagonal is perfectly positively correlated, because it shows the correlation of each variable with itself. Apart from that, none of the variables appear correlated.

Now let’s plot the data. The pairs() function draws the variables in Smarket as a scatterplot matrix. Here, Direction, your binary dependent variable, is used as the color indicator.
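A sketch of that call, coloring points by Direction:

pairs(Smarket, col = Smarket$Direction)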

There doesn’t seem to be much correlation here. The class variable is derived from Today, the daily return, which is what separates the Up days from the Down days.

Let’s look at the density distribution of each variable, broken down by Direction. Like the scatterplot matrix above, a density plot by Direction can help to show the separation of Up and Down. It also helps to understand how the two directions overlap for a variable.

# Presumably caret's featurePlot(), with x, y and scales defined inline
library(caret)
featurePlot(x = Smarket[, 1:8], y = Smarket$Direction, plot = "density",
            scales = list(x = list(relation = "free"), y = list(relation = "free")))

As you can see, the directional values of all these variables overlap, which means that it’s hard to predict a rise or a fall with just one or two variables.

Build the logistic regression model

Now you call glm() and store the fit in an object called glm.fit. The first argument you pass to the function is an R formula. In this case, the formula says that Direction is the dependent variable, while the Lag and Volume variables are the predictors. As you saw in the introduction, glm() is generally used to fit generalized linear models.

In this case, however, you need to make it clear that you want to fit a logistic regression model. You do this by setting the family argument to binomial. This tells glm() to fit a logistic regression model rather than one of the many other models it can fit.
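A sketch of the call just described (the formula follows the text: Direction on the Lag variables and Volume; the fit object is named glm.fit, as in the rest of the tutorial):

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, family = binomial)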

Next, you can run summary(), which tells you something about the fit.
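Continuing the sketch above:

summary(glm.fit)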

As you can see, summary() returns the estimate, standard error, z-score, and p-value for each coefficient. None of the coefficients appear significant. It also reports the null deviance (the deviance for the intercept-only model) and the residual deviance (the deviance for the model with all the predictors). The difference between the two is small, on six degrees of freedom.

You assign the result of predict() on glm.fit, with type = "response", to glm.probs. This predicts on the training data you used to fit the model and gives a vector of fitted probabilities.
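A sketch of that step:

glm.probs <- predict(glm.fit, type = "response")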

If you look at the first five probabilities, they’re pretty close to 50%.

glm.probs[1:5]

Now we will predict whether the market goes up or down based on the lags and the other predictors. In particular, we will turn the probabilities into classifications by thresholding at 0.5. To do this, we use the ifelse() command.

glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")

glm.pred is a vector of labels: where glm.probs is greater than 0.5, glm.pred is "Up"; otherwise it is "Down".

Here, you attach the data frame Smarket and cross-tabulate glm.pred against the true Direction. You can also take the mean of their agreement.
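A sketch of those calls:

attach(Smarket)
table(glm.pred, Direction)   # confusion matrix on the training data
mean(glm.pred == Direction)  # proportion classified correctly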

In the table, the instances on the diagonal are correctly classified, and those off the diagonal are misclassified. It looks like you made a lot of mistakes: the mean gives a correct-classification rate of only 0.52.

Create training and test samples

How could you do better? It is a good strategy to divide the data into training sets and test sets.

train <- Year < 2005
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, family = binomial, subset = train)
glm.probs <- predict(glm.fit, newdata = Smarket[!train, ], type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")

Let’s look at this code block in detail.

  • train is TRUE for all years less than 2005 and FALSE otherwise.
  • You then refit the model with glm(), adding subset = train, so that the model is fit only on data from before 2005.
  • You then use predict() again to fill glm.probs for the remaining data, from 2005 onwards. For the new data you pass Smarket indexed by !train (!train is TRUE where the year is 2005 or later). You set type = "response" to predict probabilities.
  • Finally, you use ifelse() again on glm.probs to generate the Up/Down labels in glm.pred.

You now make a new variable holding the test subset of the dependent variable and call it Direction.2005. The dependent variable is still Direction. You make a table and compute the mean on this new test set.

Direction.2005 <- Direction[!train]
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005)

It’s worse than it was before. How did this happen?

Solve the problem of overfitting

Well, you might be overfitting the data. To fix this, you fit a smaller model, using Lag1, Lag2, and Lag3 as the predictors and leaving out all the other variables. The rest of the code is the same.

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3,
               data = Smarket, family = binomial, subset = train)

Well, you got a 59% classification rate, which isn’t too bad. Using smaller models seems to work better.

Finally, you run summary() on glm.fit to see whether anything has changed noticeably.
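Continuing the sketch:

summary(glm.fit)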

Nothing became significant, but the p-values are at least smaller, consistent with the improved predictive performance.

Conclusion

This concludes the R tutorial on building binomial logistic regression models with the glm() function and family = binomial. glm() does not assume a linear relationship between the dependent and independent variables; it does, however, assume a linear relationship between the link function (the logit) and the independent variables. I hope you learned something valuable!

