(Note: This is a post that attempts to explain the intuition behind logistic regression to readers who are not entirely familiar with statistics, so you probably won't find any serious math work here.)
What does that mean?
1. Unlike linear regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, its output is the probability that the given input point belongs to a certain class. For simplicity, suppose we have only two classes (for multi-class problems, you can look at multinomial logistic regression), and the probability in question is $P_+$, the probability that a data point belongs to the '+' class. Of course, $P_- = 1 - P_+$. Therefore, the output of logistic regression always lies in [0, 1].
2. The central premise of logistic regression is the assumption that your input space can be separated into two nice "regions," one for each class, by a linear (read: straight) boundary. So what does a "linear" boundary mean? In two dimensions, it's a straight line, with no bending. In three dimensions, it's a plane. And so on. This boundary is determined by your input data and the learning algorithm. But for this premise to hold, the data points must be separable into the two regions above by a linear boundary. If your data points do satisfy this constraint, they are said to be linearly separable. Look at the picture below.
This dividing plane is called a linear discriminant, because 1. its function is linear, and 2. it helps the model "discriminate" between points belonging to different classes. (Note: If your points are not linearly separable in the original concept space, you could consider converting the feature vectors into a higher-dimensional space by adding dimensions for interaction terms, higher-degree terms, and so on. Using a linear algorithm in such a higher-dimensional space gives you some of the benefits of nonlinear function learning, since the boundary would be nonlinear if plotted back in the original input space.)
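To make that note concrete, here is a minimal sketch in Python (the data and the choice of squared terms are made up purely for illustration) of how adding higher-degree dimensions can turn a nonlinearly separable problem into a linearly separable one:

```python
import numpy as np

# Made-up example: '+' points inside a circle of radius 1, '-' points outside.
# No straight line in (x1, x2) can separate these two classes.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Add squared terms as extra dimensions: (x1, x2) -> (x1, x2, x1^2, x2^2).
X_expanded = np.hstack([X, X ** 2])

# In the expanded 4D space, the class boundary x1^2 + x2^2 = 1 is a *linear*
# function of the new features, so a linear discriminant can now separate them.
```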
========== X ===========
First, let's try to understand the geometric meaning of "dividing" the input space into two separate regions. Consider two input variables x1 and x2 (a simpler setting than the 3D diagram shown above). The corresponding boundary function will look something like

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$$

(It is crucial to note that x1 and x2 are both input variables, and the output variable is not a part of the conceptual space, unlike in techniques such as linear regression.) Consider a point (a, b). Plugging the values of x1 and x2 into the boundary function, we get its output, $\beta_0 + \beta_1 a + \beta_2 b$. Now, depending on where (a, b) lies, there are three possibilities (a small numeric sketch follows the list):
1. (a, b) lies in the region defined by the '+' class. Then $\beta_0 + \beta_1 a + \beta_2 b$ will be positive, lying somewhere in $(0, \infty)$. Mathematically, the greater the magnitude of this value, the greater the distance between the point and the boundary; intuitively, the greater the probability that (a, b) belongs to the '+' class. Therefore, $P_+$ will lie in (0.5, 1).
2. (a, b) lies in the region defined by the '-' class. Now, $\beta_0 + \beta_1 a + \beta_2 b$ will be negative, lying in $(-\infty, 0)$. But as in the positive case, the higher the absolute value of the function's output, the higher the probability that (a, b) belongs to the '-' class. $P_+$ will now lie in (0, 0.5).
3. (a, b) lies on the linear boundary itself. In this case, $\beta_0 + \beta_1 a + \beta_2 b = 0$. This means the model cannot really say whether (a, b) belongs to the '+' class or the '-' class. As a result, $P_+$ will be exactly 0.5.
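Here is a tiny numeric sketch of these three cases in Python (the coefficient values are made up purely for illustration):

```python
# Hypothetical boundary function with made-up coefficients:
# beta0 = 1.0, beta1 = -2.0, beta2 = 0.5
def boundary(x1, x2):
    return 1.0 - 2.0 * x1 + 0.5 * x2

print(boundary(-1.0, 1.0))  #  3.5 -> positive: (a, b) is on the '+' side
print(boundary(2.0, 0.0))   # -3.0 -> negative: (a, b) is on the '-' side
print(boundary(0.5, 0.0))   #  0.0 -> (a, b) lies exactly on the boundary
```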
So now we have a function that outputs a value in $(-\infty, \infty)$ given an input data point. But how do we map this to $P_+$, which lies in [0, 1]? The answer lies in the odds function. Let P(X) denote the probability of event X occurring. Then the odds ratio, OR(X), is defined as

$$OR(X) = \frac{P(X)}{1 - P(X)}$$

which is basically the ratio of the probability of the event occurring to the probability of it not occurring. Clearly, probability and odds convey exactly the same information. But as P(X) goes from 0 to 1, OR(X) goes from 0 to $\infty$.
However, we are still not quite there, because our boundary function gives values from $-\infty$ to $\infty$. So what we do is take the logarithm of OR(X), called the log-odds function. Mathematically, as OR(X) goes from 0 to $\infty$, $\log(OR(X))$ goes from $-\infty$ to $\infty$!
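A quick numeric sketch of this chain of mappings (the probability values are arbitrary):

```python
import math

for p in [0.1, 0.5, 0.9]:
    odds = p / (1 - p)         # OR(X): ranges over (0, infinity)
    log_odds = math.log(odds)  # log(OR(X)): ranges over (-infinity, infinity)
    print(f"P(X) = {p}: odds = {odds:.2f}, log-odds = {log_odds:.2f}")

# P(X) = 0.1: odds = 0.11, log-odds = -2.20
# P(X) = 0.5: odds = 1.00, log-odds = 0.00
# P(X) = 0.9: odds = 9.00, log-odds = 2.20
```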
So we finally have a way to interpret what happens when you plug an input feature vector into the boundary function. The boundary function actually defines the log-odds of the '+' class in our model. So, in our two-dimensional example, given a point (a, b), logistic regression does the following:
Step 1. Compute the value of the boundary function (or, equivalently, of the log-odds function), $\beta_0 + \beta_1 a + \beta_2 b$. Let's call this value t.
Step 2. Compute the odds ratio, $OR_+ = e^t$ (because t is the logarithm of $OR_+$).
Step 3. Knowing $OR_+$, compute $P_+$ using the simple relation

$$P_+ = \frac{OR_+}{1 + OR_+}$$

In fact, once you know t from Step 1, you can combine Steps 2 and 3 to get

$$P_+ = \frac{e^t}{1 + e^t} = \frac{1}{1 + e^{-t}}$$

The RHS of the above equation is called the logistic function. Hence the name of the learning model :-).
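Put together in code, the three steps look like this minimal sketch (reusing the same made-up coefficients as before):

```python
import math

def predict_p_plus(a, b, beta0=1.0, beta1=-2.0, beta2=0.5):
    t = beta0 + beta1 * a + beta2 * b  # Step 1: boundary function = log-odds
    or_plus = math.exp(t)              # Step 2: OR+ = e^t
    return or_plus / (1 + or_plus)     # Step 3: P+ = OR+ / (1 + OR+)

print(predict_p_plus(-1.0, 1.0))  # ~0.97: deep inside the '+' region
print(predict_p_plus(0.5, 0.0))   #  0.5 : exactly on the boundary
```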
========== X ===========
We now understand the intuition behind logistic regression, but one question remains: how does it learn the boundary function $\beta_0 + \beta_1 x_1 + \beta_2 x_2$, that is, the coefficients? The mathematical work behind this is beyond the scope of this post, but here's a rough idea. Consider a function g(x), where x is a data point in the training set. g(x) can be defined simply: if x is part of the '+' class, $g(x) = P_+$ (where $P_+$ is the output given by the logistic regression model); if x is part of the '-' class, $g(x) = 1 - P_+$. Intuitively, g(x) quantifies the probability that your model correctly classifies the training point x. Therefore, if you average g(x) over the entire training data, you get the likelihood that the system would correctly classify a random data point, regardless of its class. Simplifying things a little, logistic regression learning tries to maximize this "average" g(x). The method used is called maximum likelihood estimation (for obvious reasons). Unless you are a mathematician, you can get by without understanding how the optimization happens, as long as you have a good idea of what is being optimized, mainly because most statistics or ML libraries have built-in methods to do it.
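For instance, here is a minimal sketch using scikit-learn's built-in implementation (the toy training data is made up; note that scikit-learn adds L2 regularization to the plain maximum likelihood objective by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: two input variables, labels 1 = '+' and 0 = '-'.
X = np.array([[0.2, 0.1], [0.4, 0.3], [0.3, 0.2],
              [2.0, 1.8], [2.2, 2.1], [1.9, 2.0]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression()
model.fit(X, y)  # the (regularized) maximum likelihood optimization happens here

print(model.intercept_, model.coef_)       # learned beta0 and (beta1, beta2)
print(model.predict_proba([[0.3, 0.25]]))  # [P-, P+] for a new point
```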
========== X ===========
So far so good! Like all of my blog posts, I hope this one helps some people who are trying to learn things on their own via Google to understand the often-misunderstood technique of logistic regression.