How to understand the EM algorithm intuitively
Preface
Readers who know some data mining may know that the EM algorithm is one of the top ten data mining algorithms, a topic that machine learning and data mining can hardly get around. Yet, like the KMP algorithm in data structures, it looks simple but somehow refuses to be understood, and it cannot be bypassed either: a love-hate relationship that anyone reading this article can probably empathize with.
For a long time I have insisted that when you study a concept and feel you do not understand it, nine times out of ten it is not because you are not smart enough; it is because the material you are reading is not accessible or easy to follow (and if it still doesn't click, ask someone).
Before writing this EM note, I read a lot of material; some of it is relatively accessible, but most of it is not easy to follow. This article strives to be both easy to understand and complete, covering principle, derivation and application. The goal is that you will understand it even if every other EM article has left you confused.
Therefore, this article starts with fewer formulas and more examples; once you understand the essence and purpose of the EM algorithm, the formulas will come naturally. If you have any questions, feel free to leave a comment. Thanks.
1. What is maximum likelihood
What exactly is an EM algorithm? Wikipedia explains:
The expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates or maximum a posteriori estimates of parameters in a probabilistic model, where the model depends on unobserved hidden (latent) variables.
The EM algorithm proceeds by alternating between two steps:
- The first step, the expectation step (E), uses the current estimate of the hidden variables to compute the expectation of the log-likelihood;
- The second step, the maximization step (M), maximizes the expected log-likelihood found in the E step to compute new parameter values. The parameter estimates found in the M step are then used in the next E step, and the process alternates until convergence.
1.1 Likelihood function
Do you get it? Nine times out of ten you don't, because you probably don't know what a maximum likelihood estimate is, right? To understand maximum likelihood estimation, we have to start with the likelihood function. But what is a likelihood function?
In mathematical statistics, the likelihood function is a function of the parameters of a statistical model; it expresses how plausible different parameter values are given the model. "Likelihood" has a meaning similar to "probability" or "possibility": both refer to how likely an event is to happen.
And maximum likelihood is the same thing as maximum possible.
Say one of your classmates goes hunting with a hunter, and a hare darts past. A shot rings out and the hare drops. If you had to guess whose bullet hit it, you would assume it was the hunter's, since the hunter has a much greater chance of making the hit than your classmate.
The inference made in this example embodies the basic idea of maximum likelihood method.
Most of the time we infer results from known conditions; maximum likelihood estimation goes the other way: the result is already known, and we look for the condition (parameter value) that makes that result most likely to occur, and take it as our estimate.
See, probability is inferring outcomes from conditions, whereas likelihood is inferring conditions from outcomes. In this sense, the likelihood function can be understood as the inverse of conditional probability.
Given a parameter B, the probability of event A occurring can be written as $P(A \mid B)$.

Bayes' formula tells us that

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

Now we turn it around: event A has already happened. Using the likelihood function $L(B \mid A) = P(A \mid B)$, we assess how plausible different values of the parameter B are.
Once formulas appear you might get confused, and then recall what I said at the beginning: isn't there an explanation that is easy to understand?
The answer, of course, is yes. Let's work through a concrete example.
1.2 Example of likelihood function: given sample X, find the parameter θ
Since Google's Go program AlphaGo beat the human world champion Lee Sedol 4-1, the tide of artificial intelligence (AI) has been surging in self-driving cars, face recognition, security monitoring, medical imaging and other fields. July Online, which focuses on AI education, had accumulated 100,000 AI students by the end of 2017.
Suppose we need to measure the height distribution of these 100,000 male and female students of July Online. How would we do that? Counting them one by one is impossible given the size, so we sample at random: draw 100 boys and 100 girls from the 100,000 students and measure each of their heights.
For this problem, we can abstract it through mathematical modeling
- First, we draw the heights of 100 boys and 100 girls from the 100,000 students and form a sample set X = {x1, x2, …, xN}, where xi is the height of the i-th person drawn and N = 100 is the number of samples per group.
- Assume that the boys' heights follow a normal distribution $N(\mu_1, \sigma_1^2)$ and the girls' heights follow another normal distribution $N(\mu_2, \sigma_2^2)$.
- But the mean μ and the variance σ² of these two distributions are unknown (don't ask me what a mean or a variance is; please check Google or Wikipedia). Denote the unknown parameters by θ = [μ, σ]ᵀ.
- Now we need maximum likelihood estimation (MLE): using the measured heights of the 100 boys and 100 girls, i.e. the sample set X, estimate the unknown parameters θ of the two normal distributions. The problem is: X is known and θ is unknown; in other words, we want $p(\theta \mid X)$.
Because the boys' heights all follow the same Gaussian distribution $p(x \mid \theta)$, the probability of drawing boy A (with his height) is $p(x_A \mid \theta)$ and the probability of drawing boy B is $p(x_B \mid \theta)$; since the draws are independent, the probability of drawing both boy A and boy B is $p(x_A \mid \theta) \cdot p(x_B \mid \theta)$.
Similarly, the probability of drawing all 100 boys' samples at once from the population with distribution $p(x \mid \theta)$, i.e. the joint probability of the 100 samples in the sample set X (the product of their individual probabilities), is expressed as:

$$L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$

By the way, some articles write $p(x \mid \theta)$ while others write $p(x; \theta)$; the essence is the same regardless of which notation you use. If Bayes' formula is involved, the former, $p(x \mid \theta)$, is clearer.
Out of all the boys at July Online, I happened to draw these 100 and not anyone else, which suggests that, among the whole population, these 100 boys are the ones with the highest probability of appearing, and that probability is exactly the likelihood function L(θ) above. How do we make this precise? In other words, which θ maximizes L(θ)?
1.3 Maximum likelihood means maximum possibility
Suppose we find a parameter $\hat{\theta}$ that maximizes the likelihood function L(θ) (that is, the probability of drawing these 100 boys' heights is highest at $\hat{\theta}$); then $\hat{\theta}$ is the "most likely" parameter value, namely the maximum likelihood estimator of θ, written as:

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

Here L(θ) is a product of many terms; for ease of analysis, we define the log-likelihood function, which turns the product into a sum:

$$H(\theta) = \ln L(\theta) = \sum_{i=1}^{N} \ln p(x_i \mid \theta)$$
Now we need to maximize the likelihood function L(θ) of θ, and then the θ corresponding to the maximum is our estimate.
To find the extremum of a function, the most straightforward approach, based on undergraduate calculus, is to take the derivative, set it to zero, and solve the resulting equation for θ (assuming, of course, that L(θ) is continuously differentiable). But what if θ is a vector containing several parameters? Then we take the partial derivative of L(θ) with respect to each parameter, i.e. the gradient; n unknown parameters give n equations, the solution of this system of equations is the extremum point of the likelihood function, and this yields the values of the n parameters.
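For instance, for a single normal distribution as in the height example, setting the derivative of the log-likelihood with respect to μ to zero (and likewise for σ²) gives the familiar closed-form solution; this standard worked case is added here as an illustration:

$$
\frac{\partial}{\partial \mu}\sum_{i=1}^{N}\ln\!\left(\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right)
=\sum_{i=1}^{N}\frac{x_i-\mu}{\sigma^2}=0
\;\Longrightarrow\;
\hat{\mu}=\frac{1}{N}\sum_{i=1}^{N}x_i,\qquad
\hat{\sigma}^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{\mu})^2 .
$$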
General steps for estimating the maximum likelihood function:
- Write down the likelihood function;
- Take the logarithm of the likelihood function and simplify;
- Take the derivative, set it to zero, and obtain the likelihood equation;
- Solve the likelihood equation; its solution gives the parameter estimates (a small numerical sketch follows).
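As a concrete illustration of these four steps for the height example, here is a minimal Python sketch. The "true" values (172 cm mean, 6 cm standard deviation) and the use of simulated data are assumptions made purely for illustration, since the MLE of a single normal distribution has the closed form derived above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "observed" sample: heights (cm) of 100 boys.
# In the article's setting these would be measured; here they are simulated
# from N(172, 6^2) so that the estimate can be checked against the truth.
heights = rng.normal(loc=172.0, scale=6.0, size=100)

# Maximum likelihood estimates for a normal distribution
# (obtained by setting the gradient of the log-likelihood to zero):
mu_hat = heights.mean()                                  # sample mean
sigma_hat = np.sqrt(((heights - mu_hat) ** 2).mean())    # divides by N, not N-1

print(f"estimated mu = {mu_hat:.2f} cm, estimated sigma = {sigma_hat:.2f} cm")
```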
2. A popular understanding of the EM algorithm
2.1 Complex case of maximum likelihood estimation
We already know that maximum likelihood estimation can be summed up in one sentence: know the result and deduce the condition θ backwards.
For example, as above, to figure out the height distributions of boys and girls among the 100,000 students of July Online:
- First, we draw the heights of 100 boys and 100 girls from the 100,000 students and form a sample set X = {x1, x2, …, xN}, where xi is the height of the i-th person drawn and N = 100 is the number of samples per group.
- Assume that the boys' heights follow a normal distribution $N(\mu_1, \sigma_1^2)$ and the girls' heights follow another normal distribution $N(\mu_2, \sigma_2^2)$, but the mean μ and variance σ² of these two distributions are unknown; denote the unknown parameters by θ = [μ, σ]ᵀ.
- Now we need maximum likelihood estimation (MLE): using the measured heights of the 100 boys and 100 girls, i.e. the sample set X, estimate the unknown parameters θ of the two normal distributions. The problem is: X is known and θ is unknown; in other words, we want $p(\theta \mid X)$.
The goal of maximum likelihood estimation is to find the parameter θ that best explains the observed result; but maximum likelihood estimation requires that we face only one probability distribution, or at least know which distribution produced each observation; we just don't know that distribution's parameters.
But now let's make it a bit more complicated: suppose the 100 boys and the 100 girls are mixed together. We have height data for 200 people, but we do not know whether each of the 200 is a boy or a girl; gender acts like a hidden variable.
That is awkward, because normally we can only judge whether a given person is more likely to be a boy or a girl once we know the normal distributions of male and female heights; and conversely, we can only estimate the parameters of those distributions accurately once we know each person's gender.
The EM algorithm was created to solve exactly this kind of more complicated maximum likelihood estimation.
2.2 Hidden variables in the EM algorithm
Understanding hidden variables is critical for the EM algorithm, just as understanding the meaning of the NEXT array is essential for understanding KMP.
In general, Y denotes the observed random variable data and Z denotes the hidden random variable data (it is called hidden because we cannot observe which distribution each result came from). Y and Z together are called the complete data, and Y alone the incomplete data.
By now you may have noticed the main problem the EM algorithm faces: there is hidden data Z. If we knew Z, the problem could be solved with ordinary maximum likelihood estimation. So how do we make Z known?
Let’s take a second everyday example.
Say you are a chef in a five-star hotel and need to divide the food in your pan evenly between two plates. With only one plate there would be nothing to discuss, but with two plates, since you cannot estimate exactly how much should go on each one, you cannot divide the food perfectly evenly in one go.
Solution:
- The chef pours the food from the pan into the two plates, sees which plate has more, and moves some food from the fuller plate to the other, repeating until the two plates hold roughly the same amount. In this example, the result of even division is the "observed data Z", how much food should go on each plate to achieve even division is the "parameter to be estimated θ", and the chef's feel for portioning is the "probability distribution".
- If there were only one plate, the probability distribution would be determined (everyone has a feel for "pour everything in the pan onto one plate"), but with two plates the feel for "how much food should go on each plate" is somewhat vague. Still, the method above gets us to the final goal.
The idea of EM algorithm is:
- Give θ an initial value yourself (since I don't know how much food each plate needs for the two plates to share the pan equally, I make a guess);
- Given the observed data and the current parameter θ, compute the expectation of the conditional distribution of the unobserved data Z (in the previous step the food was poured into the two plates by feel; in this step, judge that feel from "both plates have food" and "how much is currently on each plate");
- With Z estimated in the previous step, use maximum likelihood estimation to find a better θ′ (having calibrated your feel, work out how much food each plate should hold, and redistribute accordingly);
- Since the results of steps 2 and 3 may not be optimal, repeat them until convergence (keep repeating until the two plates hold roughly the same amount of food).
The second step above is called the E step (computing an expectation) and the third step the M step (maximization), hence the name EM.
2.3 A third example of the EM algorithm: coin tossing
In their EM tutorial in Nature Biotechnology (Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897), the authors use a coin-tossing example to illustrate the idea of the EM algorithm.
With two coins A and B, their probabilities of landing heads can be estimated directly if we know which coin was flipped in each round (see figure a below).
If we do not know whether A or B was flipped each time, and we only observe 5 rounds of 10 flips each (50 flips in total), then we cannot directly estimate the probabilities of heads. This is where the EM algorithm comes in (see figure b below).
Maybe at first glance, you don’t understand. That’s okay. Let’s generalize this example.
So again, take two coins A and B whose probabilities of landing heads are PA and PB respectively. To estimate these two probabilities, we flip coins A and B in turn, five flips per round, for a total of five rounds:
| Coin | Result of 5 flips |
|---|---|
| A | 3 heads, 2 tails |
| B | 2 heads, 3 tails |
| A | 1 head, 4 tails |
| B | 3 heads, 2 tails |
| A | 2 heads, 3 tails |
Coin A was flipped 15 times in total, with 3 heads in round 1, 1 head in round 3, and 2 heads in round 5, so PA is easy to estimate; PB likewise:
PA = (3 + 1 + 2) / 15 = 0.4,  PB = (2 + 3) / 10 = 0.5
The question is: what if we do not know whether each round used coin A or coin B (i.e. the coin identity is a hidden variable), and we observe the same five rounds of flips? The results are:
| Coin | Result of 5 flips |
|---|---|
| Unknown | 3 heads, 2 tails |
| Unknown | 2 heads, 3 tails |
| Unknown | 1 head, 4 tails |
| Unknown | 3 heads, 2 tails |
| Unknown | 2 heads, 3 tails |
OK, so this is interesting. Now our goal is still the same, we still estimate PA and PB, what do we need to do?
Clearly we now have a hidden variable, the coin identity; call it z, a 5-dimensional vector (z1, z2, z3, z4, z5) recording which coin was used in each round. For example, z1 says whether round 1 used coin A or coin B.
- But if we don’t know this variable z, we can’t estimate PA and PB, so we have to estimate Z before we can estimate PA and PB.
- But to estimate Z, we also need to know PA and PB, so that we can use the law of maximum likelihood to estimate Z. Isn’t that a chicken-and-egg problem? How to break it?
The answer is to initialize PA and PB randomly, use them to estimate z, and then estimate new values of PA and PB from z by the maximum likelihood rule. If the new PA and PB come out the same as the values we initialized, what does that tell us?
This shows that the PA and PB we initialized are a pretty good estimate!
That is: we initialize PA and PB, estimate z from them by maximum likelihood, and then estimate new values P1 and P2 by maximum likelihood; if P1 and P2 come out equal to the PA and PB we started with, those values are very likely the true ones. The procedure thus contains two interacting maximum likelihood estimations.
And what if the newly estimated PA and PB differ a lot from our initial values? Then we simply keep iterating with the new P1 and P2 until the process converges.
So let's just assign some arbitrary initial values, say: the probability that coin A lands heads, PA = 0.2; the probability that coin B lands heads, PB = 0.7.
Then let's see which coin most likely produced round 1. If it was coin A, the probability of 3 heads and 2 tails is 0.2×0.2×0.2×0.8×0.8 = 0.00512; if it was coin B, the probability is 0.7×0.7×0.7×0.3×0.3 = 0.03087. Computing the corresponding probabilities for the other four rounds in the same way, we get the following table (bold marks the larger probability):
| Round | If coin A | If coin B |
|---|---|---|
| 1 | 0.00512 (= 0.2×0.2×0.2×0.8×0.8, 3 heads 2 tails) | **0.03087** (3 heads 2 tails) |
| 2 | **0.02048** (= 0.2×0.2×0.8×0.8×0.8, 2 heads 3 tails) | 0.01323 (2 heads 3 tails) |
| 3 | **0.08192** (= 0.2×0.8×0.8×0.8×0.8, 1 head 4 tails) | 0.00567 (1 head 4 tails) |
| 4 | 0.00512 (= 0.2×0.2×0.2×0.8×0.8, 3 heads 2 tails) | **0.03087** (3 heads 2 tails) |
| 5 | **0.02048** (= 0.2×0.2×0.8×0.8×0.8, 2 heads 3 tails) | 0.01323 (2 heads 3 tails) |
So by the maximum likelihood rule: round 1 was most likely coin B, round 2 most likely coin A, round 3 most likely coin A, round 4 most likely coin B, and round 5 most likely coin A.
We therefore take the rounds where A is more likely (rounds 2, 3 and 5), add up the heads in those rounds, and divide by the total number of flips attributed to A, 15 (3 rounds of 5 flips each); B is handled the same way. This hard assignment serves as our estimate of z, from which we estimate new values of PA and PB by the maximum likelihood rule:
PA = (2 + 1 + 2) / 15 = 0.33,  PB = (3 + 3) / 10 = 0.6
Assuming we are omniscient and know that the coin used in each round is as listed at the beginning of this section, the maximum likelihood estimates of PA and PB would be 0.4 and 0.5 (call these the true values). Now compare the initial PA and PB with the newly estimated ones:
| Initialized PA | Estimated PA | True PA | Initialized PB | Estimated PB | True PB |
|---|---|---|---|---|---|
| 0.2 | 0.33 | 0.4 | 0.7 | 0.6 | 0.5 |
See? Our estimates of PA and PB are already much closer to their true values than the initial guesses were! That is the magic of the EM algorithm.
As you might expect, we can keep going: use the estimated PA and PB to re-estimate z, then use z to estimate new PA and PB, and so on. After repeated iterations we eventually arrive at PA = 0.4 and PB = 0.5, and further iterations no longer change them; at that point we have found the maximum likelihood estimates of PA and PB.
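Here is a minimal sketch of the procedure just described. Like the walkthrough above, it uses a hard assignment in the E step (each round is attributed entirely to the more likely coin), and it reproduces the first-iteration estimates PA ≈ 0.33 and PB = 0.6; the textbook EM algorithm would instead weight each round by the posterior probability of each coin, which changes the later iterates.

```python
# Five rounds of five flips, recorded as (heads, tails); the coin identity is hidden.
rounds = [(3, 2), (2, 3), (1, 4), (3, 2), (2, 3)]

def round_likelihood(p, heads, tails):
    """Probability of this round's outcome if the coin's heads-probability is p."""
    return (p ** heads) * ((1 - p) ** tails)

pA, pB = 0.2, 0.7  # arbitrary initial guesses, as in the text
for step in range(10):
    # E-step (hard version): attribute each round to the coin that explains it better.
    a_heads = a_flips = b_heads = b_flips = 0
    for heads, tails in rounds:
        if round_likelihood(pA, heads, tails) >= round_likelihood(pB, heads, tails):
            a_heads += heads; a_flips += heads + tails
        else:
            b_heads += heads; b_flips += heads + tails
    # M-step: re-estimate pA and pB by maximum likelihood on the attributed rounds.
    if a_flips: pA = a_heads / a_flips
    if b_flips: pB = b_heads / b_flips
    print(f"iteration {step + 1}: pA = {pA:.3f}, pB = {pB:.3f}")
```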
3. Formula derivation of the EM algorithm
3.1 Objective function of EM algorithm
Remember the beginning of Section 1.2?
The probability of drawing the 100 samples at once from a population with distribution $p(x \mid \theta)$, i.e. the joint probability of the samples in the sample set X, was written as

$$L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$

Now suppose we have a sample set $\{x^{(1)}, \ldots, x^{(m)}\}$ containing m independent samples, and we want to find the hidden class $z^{(i)}$ of each sample such that p(x, z) is maximized. The log-likelihood of p(x, z) is

$$\ell(\theta) = \sum_{i=1}^{m} \log p\bigl(x^{(i)}; \theta\bigr) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)$$

The first equality takes the logarithm of the likelihood; the second writes the marginal probability of each sample as a sum of the joint distribution over every possible class z of that sample. Solving this directly is usually hard because of the hidden variable z, but it becomes easy once z is determined.
For parameter estimation we still, in essence, want the θ that maximizes the likelihood function. The only difference from ordinary maximum likelihood is that the likelihood now contains an unknown variable z, as in formula (1) below. In other words, our goal is to find θ and z that jointly maximize L(θ). You might think: it's just one more unknown, so why not take partial derivatives with respect to both θ and z, set them to zero, and solve as before?
Essentially we do need to maximize the likelihood in equation (1), but it contains the logarithm of a sum, whose derivative is very messy (imagine differentiating the composite function log(f1(x) + f2(x) + f3(x) + …)), so solving for the unknowns z and θ this way is difficult.
So we multiply and divide each term inside the sum by the same function, namely a probability distribution $Q_i(z^{(i)})$ over the hidden variable z whose probabilities sum to 1, $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$; this gives equation (2). But equation (2) still contains a "logarithm of a sum" and cannot be solved directly. What now? Here comes the magic moment: applying Jensen's inequality we obtain equation (3), which is a sum of logarithms and therefore easy to differentiate.
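In the standard notation (following the JerryLead notes cited in the references), the chain of formulas (1), (2), (3) referred to here reads as follows; the last line is where Jensen's inequality is applied:

$$
\begin{aligned}
\ell(\theta) \;=\; \sum_{i=1}^{m}\log p\bigl(x^{(i)};\theta\bigr)
 &= \sum_{i=1}^{m}\log\sum_{z^{(i)}} p\bigl(x^{(i)},z^{(i)};\theta\bigr) &&(1)\\
 &= \sum_{i=1}^{m}\log\sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr)\,\frac{p\bigl(x^{(i)},z^{(i)};\theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)} &&(2)\\
 &\ge \sum_{i=1}^{m}\sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr)\,\log\frac{p\bigl(x^{(i)},z^{(i)};\theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)} &&(3)
\end{aligned}
$$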
From Equation (2) to equation (3), there are two magic points:
- When computing the maximum of equation (2), equation (2) itself is hard to handle, so we treat equation (3) as a convenient lower bound of equation (2); we push this lower bound up as far as possible, and when equation (3) becomes equal to equation (2), we have indirectly obtained the maximum of equation (2).
- Jensen's inequality: where does it come from?
3.2 Jensen’s Inequality
Let f be a function whose domain is the set of real numbers.
- If $f''(x) \ge 0$ for all real x, then f is convex.
- When x is a vector, if the Hessian matrix H of f is positive semi-definite ($H \succeq 0$), then f is convex.
- If $f''(x) > 0$ for all x (or $H \succ 0$), then f is called strictly convex.
Jensen’s inequality is stated as follows:
If f is a convex function and X is a random variable, then E[f(X)] ≥ f(E[X]); colloquially, the expectation of the function is greater than or equal to the function of the expectation.
In particular, if f is strictly convex, equality holds if and only if P(X = E[X]) = 1, i.e. X is a constant; in that case E[f(X)] = f(E[X]).
As shown in the figure below
In the figure, the solid curve f is a convex function (the region above its graph is a convex set), and X is a random variable taking the value a with probability 0.5 and the value b with probability 0.5 (like flipping a coin). The expected value of X is the midpoint of a and b, and from the figure it is clear that E[f(X)] ≥ f(E[X]).
Of course, if f is (strictly) concave, which holds if and only if -f is (strictly) convex, the inequality is reversed: E[f(X)] ≤ f(E[X]).
Back to formula (2): f(x) = log x is a concave function (its second derivative is -1/x² < 0).
In equation (2), the term $\sum_{z^{(i)}} Q_i(z^{(i)})\,\frac{p(x^{(i)},\,z^{(i)};\,\theta)}{Q_i(z^{(i)})}$ is exactly the expectation of $\frac{p(x^{(i)},\,z^{(i)};\,\theta)}{Q_i(z^{(i)})}$ under the distribution $Q_i$. Why? Recall the Lazy Statistician rule for expectations:
Let Y = g(X) be a function of the random variable X (g continuous).
(1) If X is a discrete random variable with distribution law $P(X = x_k) = p_k$, k = 1, 2, …, and $\sum_k g(x_k)\,p_k$ converges absolutely, then $E(Y) = \sum_k g(x_k)\,p_k$.
(2) If X is a continuous random variable with probability density f(x), and $\int_{-\infty}^{\infty} g(x)\,f(x)\,dx$ converges absolutely, then $E(Y) = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$.
In other words, if E(X) = Σ x·p(x) and g(X) is a function of X, then E(g(X)) = Σ g(x)·p(x). Here we treat $\frac{p(x^{(i)},\,z^{(i)};\,\theta)}{Q_i(z^{(i)})}$ as the random quantity and $Q_i(z^{(i)})$ as its distribution; applying Jensen's inequality for the concave function log then yields the inequality of formula (3).
OK, at this point it is easy to differentiate equation (3); but (2) and (3) are related by an inequality, so the maximum of (3) is not necessarily the maximum of (2), and what we really want is the maximum of (2). What do we do?
The inequality of Equations (2) and (3) above can be written as: the likelihood function L(θ)>=J(z,Q). As stated at the end of Section 3.1, we can increase L(θ) by continuously maximizing the lower bound J, eventually reaching the maximum value of L(θ).
In the figure above, we first fix θ and adjust Q(z) so that the lower bound J(z,Q) rises until it equals L(θ) at that θ (green curve to blue curve); then we fix Q(z) and adjust θ so that the lower bound J(z,Q) reaches its maximum (moving from θt to θt+1); then we fix θ again and adjust Q(z)… and so on, until convergence to the θ* that maximizes the likelihood function L(θ).
There are two questions:
- When is the lower bound J(z,Q) equal to L(θ) at this point θ?
- Why does it have to converge?
For the first question: Jensen's inequality holds with equality when the random variable is a constant. In other words, for equation (3) to hold with equality we need

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$

where c is a constant that does not depend on $z^{(i)}$. Rearranging this as $p(x^{(i)}, z^{(i)}; \theta) = c\,Q_i(z^{(i)})$ and summing over all values of z gives

$$\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = c \sum_{z^{(i)}} Q_i(z^{(i)}) = c$$

because, as mentioned earlier, $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$ (Q is the probability distribution of the hidden variable z, so its probabilities sum to 1). It follows that

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{c} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
Suddenly everything is clear! We have derived that the Q(z) which pulls the lower bound up to touch L(θ) at a fixed θ is exactly the conditional (posterior) probability of z, which settles the question of how to choose Q(z). This is the E step: it establishes a lower bound on L(θ).
The next M step fixes Q(z) and adjusts θ to maximize the lower bound J(z,Q) of L(θ); after all, with Q(z) fixed, the lower bound can still be pushed higher.
These are the steps of the EM algorithm.
3.3 Process and proof of the EM algorithm
To sum up, the expectation-maximization (EM) algorithm is a maximum likelihood estimation method for the parameters of a probabilistic model when the data are incomplete or have missing values (i.e. there are hidden variables).
In other words, when we do not know the hidden variable z (for example, whether a round used coin A or coin B), we cannot directly determine the heads probability θ of coin A or coin B; in that case we can proceed as follows.
1. Randomly initialize the distribution parameter θ.
2. Repeat until convergence:
   - E step (compute the Q function): for each i, compute the posterior probability of the hidden variable given the parameters from the last iteration (in effect, the expectation of the hidden variable) and use it as the current estimate: $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$.
   - M step: maximize the resulting lower bound on the likelihood to obtain the new parameter value: $\theta = \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)},\,z^{(i)};\,\theta)}{Q_i(z^{(i)})}$.
In this way we alternately solve for Q(z) given θ and for θ given Q(z); through the iterations, the parameter θ ends up maximizing the likelihood function L(θ). A generic sketch of this loop is given below.
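In code, the loop just described has roughly the following shape. This is a generic sketch, not tied to any particular model; the callables `e_step` and `m_step` stand for the model-specific computations and are assumptions of this sketch (θ is also assumed to be a plain numpy array so that convergence can be checked with a norm).

```python
import numpy as np

def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop.

    e_step(x, theta) -> Q, the posterior distribution of the hidden variable z
                        given the data and the current parameters (E step).
    m_step(x, Q)     -> new theta maximizing the expected complete-data
                        log-likelihood under Q (M step).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        Q = e_step(x, theta)                  # E step: Q(z) = p(z | x; theta)
        new_theta = np.asarray(m_step(x, Q))  # M step: push the lower bound up in theta
        if np.linalg.norm(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta
```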
And then, the second question from the previous section, does it converge? Or, how do you make sure EM converges?
Here, I annotate and explain with direct reference to JerryLead’s machine learning notes.
Assume that $\theta^{(t)}$ and $\theta^{(t+1)}$ are the parameters after iterations t and t+1 of EM (yes, the superscript t denotes the iteration). If we can prove that $\ell(\theta^{(t)}) \le \ell(\theta^{(t+1)})$, i.e. that the log-likelihood increases monotonically, then the iterations will eventually reach a maximum of the likelihood.
So let’s prove it.
After $\theta^{(t)}$ is obtained, we perform the E step and choose

$$Q_i^{(t)}\bigl(z^{(i)}\bigr) = p\bigl(z^{(i)} \mid x^{(i)}; \theta^{(t)}\bigr)$$

This step guarantees that, for the given $\theta^{(t)}$, Jensen's inequality holds with equality, namely

$$\ell\bigl(\theta^{(t)}\bigr) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta^{(t)}\bigr)}{Q_i^{(t)}\bigl(z^{(i)}\bigr)}$$

Then comes the M step: we fix $Q_i^{(t)}(z^{(i)})$, treat θ as the variable, and maximize the expression above with respect to θ, obtaining $\theta^{(t+1)}$. After some derivation, the following chain of relations, (4), (5) and (6), can be established:
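In the same notation as above, the reconstructed chain reads (with $Q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$ the E-step choice at iteration t):

$$
\begin{aligned}
\ell\bigl(\theta^{(t+1)}\bigr)
 &\ge \sum_i\sum_{z^{(i)}} Q_i^{(t)}\bigl(z^{(i)}\bigr)\log\frac{p\bigl(x^{(i)},z^{(i)};\theta^{(t+1)}\bigr)}{Q_i^{(t)}\bigl(z^{(i)}\bigr)} &&(4)\\
 &\ge \sum_i\sum_{z^{(i)}} Q_i^{(t)}\bigl(z^{(i)}\bigr)\log\frac{p\bigl(x^{(i)},z^{(i)};\theta^{(t)}\bigr)}{Q_i^{(t)}\bigl(z^{(i)}\bigr)} &&(5)\\
 &= \ell\bigl(\theta^{(t)}\bigr) &&(6)
\end{aligned}
$$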
To explain step (4): obtaining $\theta^{(t+1)}$ is only a maximization; by itself it does not make (4) an equality. The equality would hold only if Q were fixed and re-derived by an E step at $\theta^{(t+1)}$.
Moreover, by the lower-bound formula derived earlier, for all $Q_i$ and θ it holds that

$$\ell(\theta) \ge \sum_i \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}$$

Step (5) uses the definition of the M step: $\theta^{(t+1)}$ is exactly the θ that maximizes the lower bound with $Q^{(t)}$ fixed, so the bound evaluated at $\theta^{(t+1)}$ is at least the bound at $\theta^{(t)}$; hence (5) holds. And (6) follows from the equality established by the E step at $\theta^{(t)}$.
This proves that $\ell(\theta)$ increases monotonically. A practical way to declare convergence is to stop when $\ell(\theta)$ no longer changes, or changes only very little.
Explain (4), (5) and (6) again.
- First, the inequality in (4) holds for all parameters θ, while its equality condition requires fixing a particular θ and choosing Q as the corresponding posterior. Step (4) keeps Q fixed at $Q^{(t)}$ while changing θ to $\theta^{(t+1)}$, so equality is not guaranteed.
- The step from (4) to (5) is the definition of the M step, and the step from (5) to (6) is the equality guaranteed by the previous E step. In other words, the E step raises the lower bound until it touches $\ell(\theta)$ at the current θ; then we notice that the lower bound can still be raised, so the M step pulls it up again, though it can no longer touch $\ell(\theta)$ at the new θ; then another E step brings the lower bound back up to touch it, and so on, until the maximum is reached.
Are we done? Not yet: in the M step, how do we actually find the maximizing θ? There are many ways to find an extremum; let us expand on one of them here.
First, solving for Q(z) and substituting it in, we obtain:
(7)
Among them:
(8)
(9)
Maximizing this directly is still a bit of a hassle, but it can be maximized by iterating:
- First, according to equation (9), find the conditional distribution from the current parameters;
- Then plug it into equation (7) and obtain:
(10)
In this way, maximizing the joint-distribution term maximizes equation (10); then the two steps are repeated.
M step is obviously the maximization step, but what about E step? According to Equation (10),
In fact, the E step takes the conditional expectation given X, i.e. the posterior expectation, which makes Jensen's inequality in equation (3) hold with equality: it lifts the smaller side of the inequality up to the larger side, one kind of enlargement. The M step then maximizes the joint distribution, using zero-gradient conditions, the Lagrange method and similar tools to find the extremum point, another enlargement. As long as the likelihood function is bounded and the zero-gradient point found in the M step is a maximum point, we can keep enlarging all the way to the end.
3.4 Another understanding of EM algorithm
If we define

$$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}$$

then from the previous derivation we know that $L(\theta) \ge J(Q, \theta)$. EM can be regarded as coordinate ascent on J: the E step fixes θ and optimizes Q, and the M step fixes Q and optimizes θ.
Coordinate ascent:
In the iterative optimization path in the figure, you can see that each step moves closer to the optimum, and each segment of the path is parallel to a coordinate axis, because each step optimizes only one variable.
This is like finding the extremum of a function in the x-y plane when the function cannot be differentiated with respect to both variables at once, so gradient descent does not apply directly; but once one variable is fixed, the extremum over the other can be found by differentiation. So we fix one variable at a time, take the extremum over the other, and gradually approach the overall extremum.
Corresponding to EM: the E step fixes θ and optimizes Q; the M step fixes Q and optimizes θ; alternating the two pushes the objective toward its maximum. A toy numeric illustration of coordinate ascent follows.
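As a toy illustration of coordinate ascent (the concave quadratic below is made up and has nothing to do with any particular model), each update fixes one variable and maximizes over the other in closed form, and the objective rises monotonically toward its maximum at (1, 2):

```python
def f(x, y):
    # A made-up concave quadratic; its maximum value 7 is attained at (x, y) = (1, 2).
    return -(x**2 + y**2 + x*y - 4*x - 5*y)

x, y = 0.0, 0.0
for step in range(8):
    x = (4 - y) / 2.0   # fix y, maximize over x  (solves df/dx = 0)
    y = (5 - x) / 2.0   # fix x, maximize over y  (solves df/dy = 0)
    print(f"step {step + 1}: x = {x:.4f}, y = {y:.4f}, f = {f(x, y):.4f}")
```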
4. EM application: estimating the two unknown parameters of pLSA
In fact, the EM algorithm has many applications; the best known are the Gaussian mixture model (GMM), clustering, HMMs and so on.
This article focuses on using the EM algorithm to estimate the two unknown parameters of the topic model pLSA.
4.1 Document generation under pLSA model
An article often has multiple topics, but each topic has a different probability of appearing in the document. For example, when introducing a country, it is common to introduce a number of topics, such as education, economy and transportation. So how are documents generated in pLSA?
Suppose you want to write M documents, and since a document is made up of different words, you need to identify the words in each position in each document.
Suppose there are K candidate topics and V candidate words, and let's play a game of rolling dice.
1. Suppose that, for each document you write, you make one K-sided "document-topic" die (rolling it yields one of the K topics) and K V-sided "topic-word" dice (each of these K dice corresponds to one topic, and each face of a die corresponds to one of the V candidate words).
   - For example, let K = 3 and make a "document-topic" die with the three topics education, economy, transportation. Then let V = 3 and make three 3-sided "topic-word" dice: the three faces of the education die carry the words university, teacher, course; the three faces of the economy die carry market, enterprise, finance; and the three faces of the transportation die carry high-speed rail, car, plane.
2. For each word, first roll the "document-topic" die to choose a topic; having obtained the topic, roll the "topic-word" die corresponding to that topic to choose the word to write down.
   - For instance, roll the document-topic die and suppose (with some probability) the topic comes up as education; then roll the education topic-word die and obtain (with some probability) one of its words, say university.
- The dice-rolling process for producing a word can be summarized as: "first choose a topic with a certain probability, then choose a word with a certain probability". At the start there were three topics to choose from: education, economy and transportation. Why was education chosen? The choice is random, but the randomness follows a probability distribution: for example, the education topic is chosen with probability 0.5, the economy topic with probability 0.3, and the transportation topic with probability 0.2, so the distribution over the three topics is {education: 0.5, economy: 0.3, transportation: 0.2}. We call the probability distribution of each topic z appearing in document d the topic distribution, and it is a multinomial distribution.
- Similarly, after the education topic is drawn from the topic distribution, there are three candidate words, university, teacher and course, each with its own probability of being chosen: say university with probability 0.5, teacher with probability 0.3, and course with probability 0.2, so the word distribution is {university: 0.5, teacher: 0.3, course: 0.2}. We call the probability distribution of each word w under topic z the word distribution, and it is also a multinomial distribution.
- Therefore, choosing a topic and choosing a word are two random processes: first the topic "education" is drawn from the topic distribution {education: 0.5, economy: 0.3, transportation: 0.2}, and then the word "university" is drawn from the word distribution of the education topic, {university: 0.5, teacher: 0.3, course: 0.2}.
3. Finally, you repeatedly roll the "document-topic" die and the "topic-word" dice, N times in total (producing N words), to complete one document; repeating this M times produces M documents.
The process above is abstracted into the pLSA document generation model. In this process we do not care about the order in which words appear, so pLSA is a bag-of-words approach. Specifically, the model assumes that a set of co-occurring words is associated with a latent topic category. We also define:
- $P(d_i)$: the probability of selecting document $d_i$ from the corpus.
- $P(w_j \mid d_i)$: the probability of word $w_j$ appearing in a given document $d_i$.
  - How do we compute this? For a massive corpus, after segmenting all documents into words we obtain a vocabulary, so that each document is a collection of words; for each word, the probability of its appearing in a document is its number of occurrences divided by the total number of words in that document.
- $P(z_k \mid d_i)$: the probability of topic $z_k$ appearing in a given document $d_i$.
- $P(w_j \mid z_k)$: the probability of word $w_j$ appearing under a given topic $z_k$; the more closely a word is related to the topic, the larger this conditional probability.
Using the first, third and fourth probabilities above, the "document-word" generation model proceeds as follows:
- Select a document $d_i$ with probability $P(d_i)$;
- Having selected document $d_i$, choose a latent topic $z_k$ from its topic distribution with probability $P(z_k \mid d_i)$;
- Having selected $z_k$, choose a word $w_j$ from that topic's word distribution with probability $P(w_j \mid z_k)$.
So the whole document-generation process in pLSA is: select a topic for the document, then let the topic generate the words. A small generative sketch follows.
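Here is a small sketch of the dice-rolling process just described, using the education/economy/transportation example from above. The topic distribution and the education word distribution follow the numbers in the text; the word distributions of the other two topics are made up here just so the sketch runs.

```python
import numpy as np

rng = np.random.default_rng(1)

topics = ["education", "economy", "transportation"]
topic_dist = [0.5, 0.3, 0.2]   # "document-topic" die for one document: P(z | d)

# "Topic-word" dice: P(w | z) for each topic.
word_dist = {
    "education":      (["university", "teacher", "course"],  [0.5, 0.3, 0.2]),
    "economy":        (["market", "enterprise", "finance"],  [0.4, 0.3, 0.3]),
    "transportation": (["high-speed rail", "car", "plane"],  [0.4, 0.3, 0.3]),
}

def generate_document(n_words):
    doc = []
    for _ in range(n_words):
        z = rng.choice(topics, p=topic_dist)        # roll the document-topic die
        vocab, probs = word_dist[z]
        doc.append(rng.choice(vocab, p=probs))      # roll the topic-word die for z
    return doc

print(generate_document(10))
```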
4.2 Inferring the topic distribution from the documents
Conversely, now that the documents have been generated, how do we infer their topics from them? This process of inferring the hidden topics (their distributions) from the observed documents (the reverse of generating them) is exactly the purpose of topic modeling: to automatically discover the topics (distributions) in a document collection.
In other words, humans write various articles based on the document generation model and then throw them to the computer, which sees a series of articles that have already been written. Now the computer needs to summarize the topic of the article according to a series of words seen in the article, and then obtain the different probability of occurrence of each topic: topic distribution. That is, document D and word W are observable, but topic Z is hidden.
As shown in the figure below (in the figure, colored D and W represent observable variables, uncolored Z represents unknown hidden variables, N represents a total of N words in a document, and M represents M documents) :
In the figure above, document d and word w are the samples we observe (the samples are random, and the parameters are unknown but fixed, so pLSA belongs to the frequentist school of thought; this differs from the LDA introduced in reference 8, where the samples are fixed and the parameters are unknown random variables following some distribution, so LDA belongs to the Bayesian school). Hence, for any document, its probability $P(d_i)$ can be computed in advance.
So, from a large amount of known document-word data, we want to train the document-topic probabilities $P(z_k \mid d_i)$ and the topic-word probabilities $P(w_j \mid z_k)$, as in the following formula:

$$P(w_j \mid d_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid d_i)$$

Therefore, the generation probability of each word in a document is:

$$P(d_i, w_j) = P(d_i)\,P(w_j \mid d_i) = P(d_i)\sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid d_i)$$

Since $P(d_i)$ and $P(w_j \mid d_i)$ can be computed in advance, while $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ are unknown, $\theta = \bigl(P(w_j \mid z_k),\, P(z_k \mid d_i)\bigr)$ is the set of parameters we want to estimate. Put simply, we want to find the θ that maximizes the probability of the observed words.
What methods can we use for the estimation? Common parameter estimation methods include maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP), Bayesian estimation and so on. Because the parameters to be estimated involve the hidden variable z, we can consider the EM algorithm.
4.3 Estimating the two unknown parameters of pLSA with the EM algorithm
First, let us describe the two unknown quantities to be estimated from a matrix point of view.
- Use $\varphi_k$ to denote the multinomial distribution of words under topic $z_k$; it can be represented as a vector whose j-th element $\varphi_{k,j}$ is the probability $P(w_j \mid z_k)$ that word $w_j$ appears under topic $z_k$.
- Use $\theta_i$ to denote the multinomial distribution of topics in document $d_i$; it can be represented as a vector whose k-th element $\theta_{i,k}$ is the probability $P(z_k \mid d_i)$ that topic $z_k$ appears in document $d_i$.
In this way, the quantities $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ conveniently become two matrices. In other words, what we ultimately want to solve for are these two matrices:

$$\Phi = [\varphi_1, \ldots, \varphi_K], \qquad \Theta = [\theta_1, \ldots, \theta_M]$$
Since the words are independent of each other, the distribution of the N words in a whole document is:

$$P\bigl(\vec{w} \mid d_i\bigr) = \prod_{j=1}^{N} P(w_j \mid d_i)^{\,n(d_i, w_j)}$$

Moreover, the documents are also independent of one another, so the word distribution over the whole corpus is (M documents in the corpus, N words per document):

$$P\bigl(\vec{W} \mid \vec{d}\bigr) = \prod_{i=1}^{M} \prod_{j=1}^{N} P(w_j \mid d_i)^{\,n(d_i, w_j)}$$

where $n(d_i, w_j)$ is the frequency of word $w_j$ in document $d_i$ and $n(d_i)$ is the total number of words in document $d_i$.
Thus we obtain the log-likelihood function of the word distribution over the whole corpus:

$$\mathcal{L} = \sum_{i=1}^{M} \sum_{j=1}^{N} n(d_i, w_j)\,\log P(d_i, w_j)
 = \sum_{i=1}^{M}\Bigl[\, n(d_i)\log P(d_i) + \sum_{j=1}^{N} n(d_i, w_j)\,\log \sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid d_i)\Bigr]$$
Now we need to maximize this log-likelihood to solve for the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$. For maximum likelihood estimation with hidden variables like this, we can use the EM algorithm, which consists of two steps, the E-step and the M-step.
- E-step: assume the parameters are known, and compute the posterior probability of the hidden variable under those parameters.
Using Bayes' rule, we obtain:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\,P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\,P(z_l \mid d_i)}$$

- M-step: substitute the posterior probability of the hidden variable, maximize the expected complete-data log-likelihood, and solve for the corresponding parameters.
Look at the log-likelihood obtained above: the term $n(d_i)\log P(d_i)$ depends only on the document lengths and $P(d_i)$, which can be computed separately, so dropping it does not affect the maximization. In addition, substituting the E-step posterior $P(z_k \mid d_i, w_j)$, we only need to maximize the following expected complete-data log-likelihood:

$$E\bigl[\mathcal{L}^{c}\bigr] = \sum_{i=1}^{M} \sum_{j=1}^{N} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j)\,\log\bigl[P(w_j \mid z_k)\,P(z_k \mid d_i)\bigr]$$

This is an extremum problem for a function of several variables, subject to the constraints

$$\sum_{j=1}^{N} P(w_j \mid z_k) = 1, \qquad \sum_{k=1}^{K} P(z_k \mid d_i) = 1$$
Readers familiar with convex optimization will know that the usual way to handle such constrained extremum problems is the method of Lagrange multipliers: by introducing Lagrange multipliers, the constraints are folded into the (multivariate) objective function, turning the problem into an unconstrained extremum problem.
Here we introduce two sets of Lagrange multipliers, one for each family of constraints, and write down the Lagrangian function.
Since we want to solve for the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$, we take the partial derivatives of the Lagrangian with respect to each of them and set the partial derivatives to zero.
Eliminating the Lagrange multipliers, we can finally estimate the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}{\sum_{m=1}^{N} \sum_{i=1}^{M} n(d_i, w_m)\,P(z_k \mid d_i, w_m)}, \qquad
P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}{n(d_i)}$$
To sum up, in pLSA:
- Since $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ are unknown, we use the EM algorithm to estimate them.
- Here $\varphi_{k,j}$ denotes the probability $P(w_j \mid z_k)$ that word $w_j$ appears under topic $z_k$, and $\theta_{i,k}$ denotes the probability $P(z_k \mid d_i)$ that topic $z_k$ appears in document $d_i$; so the former make up the "topic-word" matrix Φ (topics generate words) and the latter the "document-topic" matrix Θ (documents generate topics).
- Finally, iterating the E-step and M-step until convergence yields the estimates of Φ and Θ. A compact code sketch follows.
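Putting the E-step and M-step above together, here is a compact numpy sketch of pLSA estimation on a document-term count matrix. The variable names and the tiny made-up corpus are assumptions of the sketch: `n` holds the counts n(d_i, w_j), `theta` holds P(z_k | d_i), and `phi` holds P(w_j | z_k).

```python
import numpy as np

def plsa(n, K, n_iter=100, seed=0):
    """EM for pLSA on a document-term count matrix n of shape (M, V)."""
    rng = np.random.default_rng(seed)
    M, V = n.shape
    theta = rng.random((M, K)); theta /= theta.sum(axis=1, keepdims=True)  # P(z|d)
    phi = rng.random((K, V));   phi /= phi.sum(axis=1, keepdims=True)      # P(w|z)
    for _ in range(n_iter):
        # E step: posterior P(z | d, w) for every (d, w) pair, shape (M, V, K).
        post = theta[:, None, :] * phi.T[None, :, :]        # P(z|d) * P(w|z)
        post /= post.sum(axis=2, keepdims=True) + 1e-12     # normalize over z
        # M step: re-estimate phi and theta from the expected counts n(d,w) P(z|d,w).
        expected = n[:, :, None] * post
        phi = expected.sum(axis=0).T                        # shape (K, V)
        phi /= phi.sum(axis=1, keepdims=True) + 1e-12
        theta = expected.sum(axis=1)                        # shape (M, K)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
    return theta, phi

# Tiny made-up corpus: 4 documents over a 6-word vocabulary, K = 2 topics.
n = np.array([[5, 3, 2, 0, 0, 0],
              [4, 4, 1, 0, 1, 0],
              [0, 0, 1, 4, 3, 2],
              [0, 1, 0, 3, 4, 3]], dtype=float)
theta, phi = plsa(n, K=2)
print(np.round(theta, 2))   # document-topic mixtures P(z|d)
print(np.round(phi, 2))     # topic-word distributions  P(w|z)
```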
Finally, it is worth mentioning that adding a Bayesian framework on top of the pLSA model gives LDA. For more details on LDA, see reference 8.
5. References and recommended reading
1. JerryLead: The EM Algorithm
2. Zouxy09: From maximum likelihood to the EM algorithm, a gentle introduction
3. How to explain the EM algorithm simply, with an example?
4. Milter: How to understand the EM algorithm intuitively?
5. The nine levels of understanding the EM algorithm
6. July Online open course: The EM algorithm in 18 minutes
7. Dr. Tang's five-hour dissection and derivation of the EM algorithm
8. A general understanding of the LDA topic model
6. Afterword
Yesterday we discussed a collaboration during the day, and in the evening I finally finished writing these EM algorithm notes in an Internet cafe in Chengdu. Making the formulas accessible is not easy; it took a lot of effort and constant revision.
The revision history is as follows:
- On the evening of August 25, first round of revisions: improved the derivation of the formula for the probability distribution Q(z) of the hidden variable z, i.e. the conditional probability.
- On the morning of August 26, second round of revisions: improved the formula derivation of the EM algorithm, including solving for Q(z) and substituting it in to solve for θ, then substituting θ back into Q(z);
- On the morning of August 26, third round of revisions: improved the iterative solution of θ in the EM algorithm.
- On the evening of August 26, fourth round of revisions: simplified the descriptions of the EM examples to make the line of thought clearer.
- On the morning of August 28, fifth round of revisions: added the estimation of the two unknown parameters of pLSA with the EM algorithm, so that readers become familiar not only with the essence and derivation but also with the application.
July, August 2018.