Disclaimer: This article is original work by Luo Zhouyang. Please credit the author and source when reprinting.

This article contains my notes from reading the Logistic Regression chapter of Speech and Language Processing, the classic Stanford textbook. I recommend reading the original text.

Logistic regression can be used for both binary and multinomial classification. Despite its name, logistic regression is a classification algorithm, not a regression algorithm. Logistic regression is a discriminative classifier, while naive Bayes is a generative classifier.

Discriminative classifiers and generative classifiers

To distinguish the two kinds of classifiers, consider a simple example: telling whether the animal in a photo is a cat or a dog.

A generative model tries to understand what cats are and what dogs are, and then makes its judgment based on that understanding. A discriminative model, on the other hand, only learns how to tell the two animals apart, not what they are.

For a more concrete mathematical comparison, first look at the naive Bayes classification formula:

$$\hat{c} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$

A generative model (for example, naive Bayes) uses a **likelihood** term $P(d \mid c)$ to compute this. That term describes how to generate the features of a document if we knew it belonged to class $c$. A discriminative model, by contrast, tries to compute $P(c \mid d)$ directly.

Components of a probabilistic machine learning classifier

A probabilistic machine learning classifier has the following components (a small code sketch follows this list):

  • A feature representation of each input
  • A classification function that estimates the class of the current input, such as sigmoid or softmax
  • An objective function, usually one that minimizes error on the training set, such as the cross-entropy loss
  • An algorithm that optimizes the objective function, such as SGD
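
As a rough sketch of my own (not from the original text), here is how those four components might look in code for binary logistic regression; all the names below are illustrative:

```python
import numpy as np

# 1. Feature representation: each input is a feature vector x.
x = np.array([1.0, 0.5, -0.2])

# 2. Classification function: sigmoid maps a real-valued score to a probability.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3. Objective function: cross-entropy loss between prediction and label.
def cross_entropy(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# 4. Optimization algorithm: one SGD step on a parameter vector.
def sgd_step(theta, grad, lr=0.1):
    return theta - lr * grad
```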

Sigmoid

The goal of binary logistic regression is to train a classifier that makes a binary decision, and the sigmoid function is one way to do this.

Logistic regression makes its decisions by learning two parameters, $w$ and $b$, from the training set.

The class-estimation formula of logistic regression is as follows:

$$z = \left(\sum_{i=1}^{n} w_i x_i\right) + b$$

The two parameters to be learned, $w$ and $b$, appear directly in the equation above.

In linear algebra, this weighted sum is usually written as a **dot product**:

$$z = w \cdot x + b$$

The result $z$ is a floating-point number, but in binary classification there are only the two classes 0 and 1, so how do we decide whether this $z$ belongs to class 0 or class 1?

Let's see what the sigmoid function looks like:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

(Figure: the S-shaped curve of the sigmoid function.)

As you can see, the sigmoid's output lies in (0, 1) and the curve is symmetric about the point (0, 0.5), which gives us a natural decision boundary:

  • class 0 when $\sigma(z) \le 0.5$ (that is, $z \le 0$)
  • class 1 when $\sigma(z) > 0.5$ (that is, $z > 0$)

The sigmoid function has many nice properties:

  • Its input can be any real number in $(-\infty, +\infty)$, while its output lies in $(0, 1)$, which makes it a natural representation of a probability.
  • It is nearly linear around $z = 0$, and it changes very little for very negative or very positive inputs.

At this point, we can compute the probabilities of class 1 and class 0:

$$P(y=1 \mid x) = \sigma(w \cdot x + b)$$

$$P(y=0 \mid x) = 1 - \sigma(w \cdot x + b)$$
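
As a quick illustration (my own code, not the book's), this is how the score, the class probabilities, and the decision might be computed for a single example; the weights and features are made up:

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector, for illustration only.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([2.0, 0.5, 1.0])

z = np.dot(w, x) + b             # z = w . x + b
p_y1 = sigmoid(z)                # P(y = 1 | x)
p_y0 = 1.0 - p_y1                # P(y = 0 | x)
y_hat = 1 if p_y1 > 0.5 else 0   # decision boundary at 0.5

print(p_y1, p_y0, y_hat)
```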

Cross-entropy loss function

Speaking of loss functions, you might first think of the mean squared error (MSE) loss:

$$L_{MSE}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$$

This loss is used a lot in linear regression, but when applied to probabilistic classification it becomes hard to optimize (mainly because the objective is then non-convex).

Instead we use conditional maximum likelihood estimation: choose the parameters $w$ and $b$ that maximize the log probability of the true labels $y$ given the observations $x$ in the training data.

Since the class label follows a Bernoulli distribution, we can easily write:

$$p(y \mid x) = \hat{y}^{\,y}\,(1 - \hat{y})^{\,1-y}$$

This works because when $y = 1$, $p(y \mid x) = \hat{y}$, and when $y = 0$, $p(y \mid x) = 1 - \hat{y}$.

Taking the logarithm gives the log probability:

$$\log p(y \mid x) = y \log \hat{y} + (1 - y)\log(1 - \hat{y})$$

Training maximizes this log probability. If we negate both sides, the maximization problem becomes a minimization problem; that is, the goal of training is to minimize:

$$-\log p(y \mid x) = -\bigl[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,\bigr]$$

And because $\hat{y} = \sigma(w \cdot x + b)$, our **negative log likelihood loss** formula is:

$$L_{CE}(\hat{y}, y) = -\bigl[\,y \log \sigma(w \cdot x + b) + (1 - y)\log\bigl(1 - \sigma(w \cdot x + b)\bigr)\bigr]$$

This is also called the **cross-entropy loss**, because the formula is exactly the cross-entropy between the true distribution $y$ and the estimated distribution $\hat{y}$.

Therefore, over a whole batch of $m$ examples, the average loss is:

$$Cost(w, b) = \frac{1}{m}\sum_{i=1}^{m} L_{CE}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)$$
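
Here is a minimal sketch (mine) of the average cross-entropy loss over a batch; the `eps` clipping is a numerical safeguard and not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Average cross-entropy loss over a batch.

    y_hat: predicted probabilities P(y=1|x), shape (m,)
    y:     true labels in {0, 1}, shape (m,)
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Example with made-up predictions and labels.
y_hat = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])
print(binary_cross_entropy(y_hat, y))
```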

Gradient descent

The goal of gradient descent is to minimize the loss, which can be expressed as:

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{m}\sum_{i=1}^{m} L_{CE}\bigl(f(x^{(i)}; \theta),\, y^{(i)}\bigr)$$

For our logistic regression, $\theta$ is $w$ and $b$.

So how do we minimize this loss? Gradient descent is one way to find the minimum: it uses derivatives to find the direction in which the function decreases fastest.

The loss function of logistic regression is convex, so it has only one minimum and no local minima, which means optimization can find the global minimum.

To get a feel for this process, picture the two-dimensional case in which the loss is plotted against a single weight $w$.

The optimization process then moves $w$ a small step in the direction opposite to the gradient each time. As a formula:

$$w^{t+1} = w^{t} - \eta \frac{d}{dw} L\bigl(f(x; w), y\bigr)$$

The $\eta$ above determines how big that small step is; it is called the **learning rate**.

In this one-dimensional case, the gradient $\frac{d}{dw} L\bigl(f(x; w), y\bigr)$ is just a scalar.

What if there are $n$ dimensions? Then the gradient is a vector, as follows:

$$\nabla_{\theta} L\bigl(f(x; \theta), y\bigr) = \begin{bmatrix} \frac{\partial}{\partial w_1} L\bigl(f(x; \theta), y\bigr) \\ \frac{\partial}{\partial w_2} L\bigl(f(x; \theta), y\bigr) \\ \vdots \\ \frac{\partial}{\partial w_n} L\bigl(f(x; \theta), y\bigr) \end{bmatrix}$$

So our parameter update is:

$$\theta^{t+1} = \theta^{t} - \eta\, \nabla_{\theta} L\bigl(f(x; \theta), y\bigr)$$
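
Here is a bare-bones gradient descent loop (my own sketch), assuming we already have a function that returns the gradient of the loss at the current parameters:

```python
import numpy as np

def gradient_descent(theta, grad_fn, lr=0.1, steps=100):
    """Repeatedly step in the direction opposite to the gradient.

    theta:   initial parameter vector
    grad_fn: function that returns the gradient of the loss at theta
    lr:      learning rate (the eta above)
    """
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(theta0, lambda t: 2 * t))  # approaches [0, 0]
```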

The gradient of logistic regression

The loss of logistic regression is:

$$L_{CE}(\hat{y}, y) = -\bigl[\,y \log \sigma(w \cdot x + b) + (1 - y)\log\bigl(1 - \sigma(w \cdot x + b)\bigr)\bigr]$$

Taking the derivative with respect to each weight $w_j$, we have:

$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \bigl[\sigma(w \cdot x + b) - y\bigr]\, x_j$$

For a batch of data, the gradient is:

$$\frac{\partial Cost(w, b)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m} \bigl[\sigma\bigl(w \cdot x^{(i)} + b\bigr) - y^{(i)}\bigr]\, x_j^{(i)}$$
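
With the formulas above, the batch gradient can be written compactly using matrix operations. This is my own vectorized sketch, with hypothetical data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradients(w, b, X, y):
    """Gradient of the average cross-entropy loss for logistic regression.

    X: feature matrix, shape (m, n)
    y: labels in {0, 1}, shape (m,)
    Returns (dCost/dw, dCost/db).
    """
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)   # predicted P(y=1|x) for each example
    error = y_hat - y            # sigma(w . x + b) - y
    grad_w = X.T @ error / m     # average of [sigma(w . x + b) - y] * x_j
    grad_b = error.mean()        # same expression with x_j = 1
    return grad_w, grad_b

# Tiny made-up batch.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1, 0, 1])
print(lr_gradients(np.zeros(2), 0.0, X, y))
```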

Regularization

A model trained as above may overfit. To address this, we need a technique called regularization.

Regularization places a constraint on the weights: more specifically, while maximizing the log probability $\sum_{i} \log P\bigl(y^{(i)} \mid x^{(i)}\bigr)$, we also penalize the weights $w$.

So our objective can be described by the following formula:

$$\hat{w} = \arg\max_{w} \sum_{i=1}^{m} \log P\bigl(y^{(i)} \mid x^{(i)}\bigr) - \alpha R(w)$$

Here, $R(w)$ is the regularization term.

From the equation above, the regularization term penalizes large weights: among models that perform similarly, we tend to choose the one with smaller weights. "Smaller" here also means fewer active features, i.e. a weight vector $w$ with more zeros.

Common regularization methods include L2 regularization and L1 regularization.

L2 regularization uses the (squared) Euclidean norm of the weights, with the formula:

$$R(w) = \lVert w \rVert_2^2 = \sum_{j=1}^{n} w_j^2$$

L1 regularization uses the Manhattan norm, with the formula:

$$R(w) = \lVert w \rVert_1 = \sum_{j=1}^{n} \lvert w_j \rvert$$

So what are the pros and cons of L2 versus L1 regularization? (A small code sketch follows this list.)

  • L2 regularization is easier to optimize, because its derivative is simply $2w_j$, whereas the derivative of L1 is discontinuous at 0.
  • L2 regularization prefers many small weights, while L1 regularization tolerates a few larger weights but pushes more weights to exactly 0; in other words, L1 regularization tends to produce a sparse weight vector.
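
Here is a small sketch of my own comparing the two penalties and the L2 gradient; `alpha` plays the role of the regularization strength $\alpha$ above:

```python
import numpy as np

def l2_penalty(w, alpha):
    """alpha * sum_j w_j^2, the L2 regularization term."""
    return alpha * np.sum(w ** 2)

def l1_penalty(w, alpha):
    """alpha * sum_j |w_j|, the L1 regularization term."""
    return alpha * np.sum(np.abs(w))

def l2_grad(w, alpha):
    """Gradient of the L2 penalty: alpha * 2 * w_j for each weight."""
    return alpha * 2 * w

# Compare the two penalties on the same weight vector.
w = np.array([0.0, 0.5, -2.0])
print(l2_penalty(w, alpha=0.1), l1_penalty(w, alpha=0.1), l2_grad(w, alpha=0.1))
```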

Both L1 and L2 regularization have a Bayesian interpretation: L1 regularization can be interpreted as placing a Laplace prior on the weights, while L2 regularization corresponds to assuming the weights follow a Gaussian (normal) distribution with mean 0 ($\mu = 0$).

The Gaussian prior on each weight is:

$$P(w_j) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(w_j - \mu_j)^2}{2\sigma_j^2}\right)$$

According to Bayes' rule, the weights can be estimated by:

$$\hat{w} = \arg\max_{w} \prod_{i=1}^{m} P\bigl(y^{(i)} \mid x^{(i)}\bigr) \times P(w)$$

Computing the prior with the Gaussian distribution above, we get:

$$\hat{w} = \arg\max_{w} \prod_{i=1}^{m} P\bigl(y^{(i)} \mid x^{(i)}\bigr) \times \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(w_j - \mu_j)^2}{2\sigma_j^2}\right)$$

Setting $\mu_j = 0$ and $2\sigma_j^2 = 1$ and taking the logarithm, we recover the L2-regularized objective:

$$\hat{w} = \arg\max_{w} \sum_{i=1}^{m} \log P\bigl(y^{(i)} \mid x^{(i)}\bigr) - \alpha \sum_{j=1}^{n} w_j^2$$

Multinomial logistic regression

So far we have only been talking about binary classification. What if we want more categories? Then we need multinomial logistic regression, also called softmax regression or the maxent classifier.

In the multiclass case the set of categories has more than two classes, so we replace sigmoid with a function that computes a probability for every output class: softmax, the generalization of sigmoid.

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Here, $1 \le i \le K$.

So, for the input

$$z = [z_1, z_2, \ldots, z_K]$$

we have:

$$\mathrm{softmax}(z) = \left[\frac{e^{z_1}}{\sum_{j=1}^{K} e^{z_j}},\; \frac{e^{z_2}}{\sum_{j=1}^{K} e^{z_j}},\; \ldots,\; \frac{e^{z_K}}{\sum_{j=1}^{K} e^{z_j}}\right]$$

Clearly, the denominator of softmax is a sum over all the components, so softmax outputs a probability for each class, and those probabilities add up to 1.
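
A common way to implement softmax (my own sketch) subtracts the maximum score before exponentiating; this does not change the result, because softmax is shift-invariant, but it avoids overflow:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = z - np.max(z)        # numerical stability; softmax is shift-invariant
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())            # the probabilities, and their sum (1.0)
```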

As with sigmoid, we plug $z = w \cdot x + b$ into softmax to get the class probabilities:

$$p(y = c \mid x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}}$$

Notice that $w$ and $b$ now each correspond to a particular class, so we write them as $w_c$ and $b_c$.

Similarly, our loss function becomes the generalized version:

$$L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\}\, \log p(y = k \mid x) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\}\, \log \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}}$$

Here, $\mathbb{1}\{y = k\}$ equals 1 when $y = k$ and 0 otherwise.

From this, the following derivative can be obtained (derivation omitted):

$$\frac{\partial L_{CE}}{\partial w_k} = -\bigl(\mathbb{1}\{y = k\} - p(y = k \mid x)\bigr)\, x = -\left(\mathbb{1}\{y = k\} - \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}}\right) x$$
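
Putting the pieces together, here is a sketch (mine, with hypothetical names) of that gradient for a single example; `W` holds one weight vector per class:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_regression_grad(W, b, x, y):
    """Gradient of the multiclass cross-entropy loss for one example.

    W: weight matrix, shape (K, n), one row per class
    b: bias vector, shape (K,)
    x: feature vector, shape (n,)
    y: true class index in {0, ..., K-1}
    """
    p = softmax(W @ x + b)      # p(y = k | x) for every class k
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0            # 1{y = k}
    err = p - one_hot           # -(1{y = k} - p(y = k | x))
    grad_W = np.outer(err, x)   # gradient w.r.t. each class weight vector w_k
    grad_b = err
    return grad_W, grad_b

# Tiny example with 3 classes and 2 features.
W = np.zeros((3, 2))
b = np.zeros(3)
print(softmax_regression_grad(W, b, np.array([1.0, -0.5]), y=2))
```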

Food for thought

  • Isn't logistic regression very similar to a neural network? Can you describe their similarities and differences?

Contact me