In machine learning theory, the concept of “maximum likelihood estimation” comes up constantly. The mechanics of solving a maximum likelihood problem are so simple that we often ignore the principle behind it. My understanding of machine learning algorithms improved substantially once I thoroughly understood the idea behind maximum likelihood estimation. Let’s take a casual, easy look at the ideas behind maximum likelihood estimation and its application to machine learning. Of course, this article will be updated as my understanding of the theory continues to deepen.

Maximum Likelihood Estimation (MLE)

Suppose we want to evaluate the height of the whole country’s population. One premise is that height follows a normal distribution (as shown in Figure 1). For a normal distribution, we care about two parameters: the mean $\mu$ and the variance $\sigma^2$. So how do we obtain the mean and variance of the height of the whole country? It is impossible to measure the height of 1.3 billion people one by one. What we can do is randomly find some people of different ages and genders, measure their heights, let their heights stand in for the height level of the whole country, and then estimate the national mean and variance from them. This is probably what anyone would think of! But why should these people’s heights be representative of the nation’s? There is an exquisite idea hidden here that deserves careful thought.

Figure 1: Probability density curve of the normal distribution

Allow me to change the subject for a moment. Suppose an old hunter goes hunting with a young apprentice, and we know that one of them shot a rabbit. If you had to guess who shot it, most people would simply guess the old hunter, and that is exactly the idea of maximum likelihood.

Maximum likelihood principle

Assume that our sampling is ideal and representative. The maximum likelihood principle can then be stated from two directions: an event with high probability is more likely to occur in a single observation (the hunting problem); conversely, an event that did occur in an observation should be assumed to have high probability (the height problem).

Now let’s go back to the height problem and take a slightly more theoretical look at maximum likelihood estimation.

Suppose we take a random sample of $n$ individuals from the entire population, and $x_1, x_2, \ldots, x_n$ are their heights; these are the heights we have observed. According to the maximum likelihood principle above, the probability of the samples we see occurring should be the largest, or, put differently, since we have already drawn these people and observed their heights, they must be the most likely to occur. There is an implicit assumption here that we are optimistic about the samples we draw, that is, the samples are ideal and reasonable. Now let’s walk through the process of maximum likelihood estimation (nothing beyond high-school math is needed):

Suppose the national height follows a normal distribution, but we do not know the specific parameters of the distribution: the mean $\mu$ and the variance $\sigma^2$. Since we cannot measure the height of every one of the 1.3 billion people in the country, we measured the heights of $n$ people chosen at random: $x_1, x_2, \ldots, x_n$. We hope to estimate the parameters of the distribution using the idea of maximum likelihood.

The probability density function of the normal distribution is:

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Step 1: write down the likelihood function:

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Step 2: take the logarithm of the likelihood function:

$$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

Step 3: find the parameters at which the log-likelihood attains its maximum, by taking partial derivatives with respect to $\mu$ and $\sigma^2$:

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu), \qquad \frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2$$

Setting both derivatives to zero:

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0, \qquad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$$

Step 4: solve for the estimated parameters:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$$

More generally, the likelihood function can be written as $L(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$, where $\theta$ corresponds to $(\mu, \sigma^2)$ in the height problem above. Now that we have observed $x_1, x_2, \ldots, x_n$, they should be the samples most likely to show up. Assuming each person’s height depends only on himself, i.e. the samples are independent of one another, the joint probability is obtained by multiplying the individual probabilities, and according to the idea of maximum likelihood it should be as large as possible. The parameter that maximizes the likelihood, $\hat{\theta} = \arg\max_{\theta} L(\theta)$, is our best estimate. We could also say that the larger the sample size $n$, the closer $\hat{\theta}$ tends to be to the real value.
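
To make the derivation concrete, here is a minimal Python sketch (assuming NumPy is available; the heights are simulated, not real national data) that computes the closed-form estimates $\hat{\mu}$ and $\hat{\sigma}^2$ from Step 4 and checks that nudging the parameters away from them does not increase the log-likelihood:

```python
import numpy as np

# Hypothetical height sample (cm); in the article this would be the n people
# measured at random from the whole population.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=1000)

# Closed-form maximum likelihood estimates from Step 4:
# mu_hat is the sample mean, sigma2_hat is the (1/n) sample variance.
mu_hat = heights.mean()
sigma2_hat = ((heights - mu_hat) ** 2).mean()
print(f"mu_hat     = {mu_hat:.3f}")
print(f"sigma2_hat = {sigma2_hat:.3f}")

def log_likelihood(mu, sigma2, x):
    # ln L from Step 2, summed over all observations.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Sanity check: a nearby parameter value should not do better.
print(log_likelihood(mu_hat, sigma2_hat, heights) >=
      log_likelihood(mu_hat + 1.0, sigma2_hat, heights))  # True
```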

1. How do we show that the samples we drew are best fitted by our estimate (our model)?

A: By finding the parameters that maximize the likelihood function.

2. Which set of parameters is the best one we are looking for?

A: The parameters at which the likelihood function attains its maximum.

In everyday English, there is generally no distinction between the words likelihood and probability. In statistics, however, the chance of observing some data given specific values of a model’s parameters (roughly speaking, the parameters determine the model) is called probability; the same quantity, viewed as a function of the parameter values once the data have already been observed, is called likelihood.
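
A tiny numerical illustration of this distinction, sketched under the assumption that SciPy is installed and using made-up numbers: the very same density value is read as a probability (density) of the data when the parameters are fixed, and as a likelihood of the parameters when the observed data point is fixed.

```python
from scipy.stats import norm

x = 172.0                 # one observed height (made-up)
mu, sigma = 170.0, 8.0    # one candidate set of parameters

# Probability (density): parameters fixed, ask about the data.
print(norm.pdf(x, loc=mu, scale=sigma))

# Likelihood: data fixed, ask about the parameters.
for candidate_mu in (160.0, 170.0, 180.0):
    print(candidate_mu, norm.pdf(x, loc=candidate_mu, scale=sigma))
```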

OK, so once we understand maximum likelihood estimation, re-examining our machine learning algorithms brings a bit of an aha moment.

Maximum likelihood in machine learning

Let’s look at the simplest binary classification model in machine learning, logistic regression, which can be expressed as follows:

$$P(Y=1 \mid x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}, \qquad P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x)}$$

where $w$ is the model parameter and $x$ is the sample, i.e. $x \in \mathbb{R}^n$, $Y \in \{0, 1\}$.

How are the parameters learned? The answer is maximum likelihood estimation together with gradient descent.

Given the training data $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{0, 1\}$, maximum likelihood estimation is used to estimate the model parameters.

Set:

$$P(Y=1 \mid x) = \pi(x), \qquad P(Y=0 \mid x) = 1 - \pi(x)$$

The likelihood function is:

$$\prod_{i=1}^{N} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$$

The log-likelihood is:

$$L(w) = \sum_{i=1}^{N} \Big[ y_i \log \pi(x_i) + (1 - y_i) \log\big(1 - \pi(x_i)\big) \Big]$$

Maximizing $L(w)$ gives the estimate $\hat{w}$ of the parameter $w$.

Parameter learning thus becomes an optimization problem with the log-likelihood as the objective function, and the optimum can be found with the common gradient descent method (applied to the negative log-likelihood).
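
Below is a minimal, self-contained sketch of that procedure in Python (NumPy assumed; the toy data set, learning rate, and step count are made up for illustration): gradient ascent on the log-likelihood $L(w)$, whose gradient with respect to $w$ is $\sum_i (y_i - \pi(x_i))\, x_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.5, n_steps=2000):
    """Gradient ascent on the log-likelihood L(w), i.e. gradient descent on -L(w)."""
    n_samples, n_features = X.shape
    Xb = np.hstack([X, np.ones((n_samples, 1))])  # append a bias column so w includes an intercept
    w = np.zeros(n_features + 1)
    for _ in range(n_steps):
        pi = sigmoid(Xb @ w)        # pi(x_i) = P(Y = 1 | x_i) under the current w
        grad = Xb.T @ (y - pi)      # gradient of L(w) = sum_i [y_i log pi_i + (1 - y_i) log(1 - pi_i)]
        w += lr * grad / n_samples  # ascent step, averaged over samples for stability
    return w

# Made-up toy data: labels follow a noisy linear rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.5, size=500) > 0).astype(float)

w_hat = fit_logistic_regression(X, y)
probs = sigmoid(np.hstack([X, np.ones((500, 1))]) @ w_hat)
print("train accuracy:", ((probs > 0.5).astype(float) == y).mean())
```

In practice one usually minimizes $-L(w)$ with a library optimizer, but the update above performs the same computation.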

3. A deeper look at the connection between machine learning and maximum likelihood

In machine learning and deep learning, because any given sample is always finite, the empirical distribution $\hat{p}_{\text{data}}$ of the sample data is used in place of the distribution of all real data $p_{\text{data}}$, and the probability distribution given by the model is $p_{\text{model}}(x; \theta)$. Consider a data set of $m$ samples $\mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently from the unknown real data-generating distribution $p_{\text{data}}(x)$.

Let $p_{\text{model}}(x; \theta)$ be a parametric family of probability distributions over the same space, indexed by $\theta$. Ideally, $p_{\text{model}}(x; \theta)$ maps any input $x$ to a real number estimating the true probability $p_{\text{data}}(x)$. In layman’s terms, we want our machine learning model to predict all the data correctly, not just the data in the training set but also data not included in the training set, which is almost impossible.

The maximum likelihood estimate of $\theta$ is:

$$\theta_{\text{ML}} = \arg\max_{\theta} p_{\text{model}}(\mathbb{X}; \theta) = \arg\max_{\theta} \prod_{i=1}^{m} p_{\text{model}}(x^{(i)}; \theta)$$

Taking the logarithm, which does not change the location of the maximum, turns the product into a sum:

$$\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta)$$

Dividing by $m$ does not change the argmax either, so this can be rewritten as an expectation with respect to the empirical distribution $\hat{p}_{\text{data}}$:

$$\theta_{\text{ML}} = \arg\max_{\theta} \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta)$$

Now, one way to measure the difference between two probability distributions is the KL divergence. Suppose we measure the gap between the probability distribution of the sample data $\hat{p}_{\text{data}}$ and the probability distribution of the trained model $p_{\text{model}}$:

$$D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \big[ \log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x) \big]$$

Ideally, the smaller the KL divergence between the two, the better the model has learned the training data. Here $\hat{p}_{\text{data}}$ is the data distribution of the training set and can be regarded as fixed, so minimizing the KL divergence is equivalent to minimizing:

$$-\mathbb{E}_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x)$$

If the formula above looks familiar, that is because it is the most commonly used loss function in machine learning: cross entropy.
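
Here is a small numerical sketch of why the two criteria pick the same model (the three-point distributions below are made-up numbers): KL divergence and cross entropy differ only by the fixed entropy of $\hat{p}_{\text{data}}$, so they rank candidate models identically.

```python
import numpy as np

# Empirical distribution of the training data over a small discrete space,
# and two candidate model distributions (all hypothetical numbers).
p_data = np.array([0.5, 0.3, 0.2])
q_good = np.array([0.48, 0.32, 0.20])
q_bad  = np.array([0.10, 0.10, 0.80])

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

for q in (q_good, q_bad):
    # KL(p||q) and H(p, q) differ only by the constant entropy H(p),
    # so either criterion prefers q_good over q_bad.
    print(kl(p_data, q), cross_entropy(p_data, q))
```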

For this section, please refer to Section 5.5 of Deep Learning (the “flower book”). Some symbols have been changed to make it easier to follow.

Finally, let me compare the cross-entropy formula from information theory with the cross-entropy form commonly used in machine learning, to better understand the above:

The form of cross entropy in information theory:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

The form of cross entropy commonly seen in deep learning (binary case):

$$L = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \Big]$$

where $y_i$ is the true label from the training set, corresponding to $p$, and $\hat{y}_i$ is the model’s predicted value, corresponding to $q$.
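
For completeness, here is a minimal sketch of the deep-learning form above as a function (NumPy assumed; the labels and predictions are made-up): it is just the average negative log-likelihood of the labels under the model’s predicted probabilities.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood of the labels under the predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # hypothetical model outputs
print(binary_cross_entropy(y_true, y_pred))
```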

References

1. Li Hang (2012). Statistical Learning Methods. Tsinghua University Press, Beijing.

2. Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press.