Original author: Andrew Ng | Translation: CycleUser | Proofreading: XiaoDong_Wang

Related links: project address · Zhihu column · CS229, Stanford University · NetEase open course video with Chinese subtitles

Chapter 7b

Mixtures of Gaussians and the EM Algorithm

In the notes for this chapter, we will discuss density estimation using the EM (Expectation-Maximization) algorithm.

As always, suppose we are given a training set $\{x^{(1)},\dots,x^{(m)}\}$. Since this is an unsupervised learning setting, these samples come without class labels.

We wish to model the data by specifying a joint distribution $p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\,p(z^{(i)})$. Here $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ (i.e. $z^{(i)}$ follows a multinomial distribution with parameter $\phi$, where $\phi_j \ge 0$, $\sum_{j=1}^k \phi_j = 1$, and the parameter $\phi_j$ gives $p(z^{(i)} = j)$), and $x^{(i)}|z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$ (translator's note: $x^{(i)}|z^{(i)}=j$ follows a normal distribution with parameters $\mu_j$ and $\Sigma_j$). We let $k$ denote the number of values that $z^{(i)}$ can take. Thus, our model assumes that each $x^{(i)}$ was generated by first choosing $z^{(i)}$ at random from $\{1,\dots,k\}$, and then drawing $x^{(i)}$ from one of $k$ Gaussian distributions, which one depending on $z^{(i)}$. This is called the mixture of Gaussians model. Note also that the $z^{(i)}$'s are latent random variables, meaning that their values may be hidden or unobserved. This is what makes the estimation problem difficult.
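To make the generative story concrete, here is a minimal numpy sketch (not from the original notes; the parameter values below are made up for illustration) that samples data exactly as described: first draw $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, then draw $x^{(i)}$ from the Gaussian selected by $z^{(i)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters for k = 2 Gaussians in 2 dimensions.
phi = np.array([0.3, 0.7])                      # p(z = j); entries sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])         # mean mu_j of each component
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # covariance Sigma_j of each component

m = 500
# First draw z^{(i)} ~ Multinomial(phi) ...
z = rng.choice(len(phi), size=m, p=phi)
# ... then draw x^{(i)} | z^{(i)} = j ~ N(mu_j, Sigma_j).
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
```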

The parameters of our model are thus $\phi$, $\mu$, and $\Sigma$. To estimate them, we can write down the likelihood of the data:

$$
\begin{aligned}
\ell(\phi,\mu,\Sigma) &= \sum_{i=1}^m \log p(x^{(i)}; \phi,\mu,\Sigma) \\
&= \sum_{i=1}^m \log \sum_{z^{(i)}=1}^k p(x^{(i)}|z^{(i)}; \mu,\Sigma)\,p(z^{(i)}; \phi)
\end{aligned}
$$

However, if we try to solve for the parameters by setting the derivatives of this expression to zero, we will find that it is impossible to obtain the maximum likelihood estimates in closed form. (Try it yourself if you don't believe it.)
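As an illustration only, here is a small numpy/scipy sketch of evaluating this log-likelihood; note that the sum over the possible values of $z^{(i)}$ sits inside the logarithm, which is exactly what blocks a closed-form solution when we set derivatives to zero.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, phi, mu, Sigma):
    """Evaluate l(phi, mu, Sigma) = sum_i log sum_j p(x_i | z_i = j) p(z_i = j).

    A sketch only; assumed shapes: x is (m, n), phi is (k,), mu is (k, n), Sigma is (k, n, n).
    """
    k = phi.shape[0]
    # per_component[i, j] = p(x^{(i)} | z^{(i)} = j; mu, Sigma) * p(z^{(i)} = j; phi)
    per_component = np.column_stack([
        phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
        for j in range(k)
    ])
    # Sum over the latent variable inside the log, then sum over the data points.
    return np.sum(np.log(per_component.sum(axis=1)))
```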

The random variable $z^{(i)}$ indicates which of the $k$ Gaussian distributions each $x^{(i)}$ came from. Note that if we knew the $z^{(i)}$'s, the maximum likelihood problem would be much simpler. The likelihood could then be written in the following form:

$$
\ell(\phi,\mu,\Sigma)=\sum_{i=1}^m \log p(x^{(i)}|z^{(i)}; \mu,\Sigma) + \log p(z^{(i)}; \phi)
$$

Maximizing this with respect to $\phi$, $\mu$, and $\Sigma$ gives the parameters:

$$
\begin{aligned}
\phi_j &= \frac 1m\sum_{i=1}^m 1\{z^{(i)}=j\}, \\
\mu_j &= \frac{\sum_{i=1}^m 1\{z^{(i)}=j\}\,x^{(i)}}{\sum_{i=1}^m 1\{z^{(i)}=j\}}, \\
\Sigma_j &= \frac{\sum_{i=1}^m 1\{z^{(i)}=j\}\,(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m 1\{z^{(i)}=j\}}.
\end{aligned}
$$
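Here is a minimal numpy sketch of these label-based estimates (an illustration, not code from the notes; it assumes every component is assigned at least one point and indexes components as $0,\dots,k-1$):

```python
import numpy as np

def mle_with_known_z(x, z, k):
    """Closed-form maximum likelihood estimates when the labels z^{(i)} are observed.

    Assumed shapes: x is (m, n); z holds integers in {0, ..., k-1}.
    """
    m, n = x.shape
    phi = np.zeros(k)
    mu = np.zeros((k, n))
    Sigma = np.zeros((k, n, n))
    for j in range(k):
        mask = (z == j)                       # indicator 1{z^{(i)} = j}
        phi[j] = mask.mean()                  # fraction of points with label j
        mu[j] = x[mask].mean(axis=0)          # average of the points labeled j
        centered = x[mask] - mu[j]
        Sigma[j] = centered.T @ centered / mask.sum()
    return phi, mu, Sigma
```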

In fact, we have already seen that if the $z^{(i)}$'s were known, this maximum likelihood estimation would be almost identical to the parameter estimation we did for the Gaussian discriminant analysis model, the only difference being that here the $z^{(i)}$'s play the role of the class labels$^1$.

$^1$ There are also some minor differences in the formulas, first because the $z^{(i)}$'s here are generalized to be multinomial rather than Bernoulli, and second because a different $\Sigma_j$ is used for each Gaussian.

However, in our density estimation problem the $z^{(i)}$'s are not known. What can we do? The EM (Expectation-Maximization) algorithm is an iterative algorithm with two main steps. Applied to our problem, in the $E$-step it tries to "guess" the values of the $z^{(i)}$'s. Then in the $M$-step, it updates the parameters of the model based on the guesses from the previous step. Since in the $M$-step we pretend that those guesses were correct, the maximization becomes easy. Here is the algorithm:

Repeat the following process until convergence: {

($E$-step) For each $i, j$, set

$$
w_j^{(i)} := p(z^{(i)}=j|x^{(i)}; \phi,\mu,\Sigma)
$$

($M$-step) Update the parameters:

$$
\begin{aligned}
\phi_j &= \frac 1m\sum_{i=1}^m w_j^{(i)}, \\
\mu_j &= \frac{\sum_{i=1}^m w_j^{(i)}x^{(i)}}{\sum_{i=1}^m w_j^{(i)}}, \\
\Sigma_j &= \frac{\sum_{i=1}^m w_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}.
\end{aligned}
$$

}
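For concreteness, the following numpy/scipy sketch implements one version of this loop (this is not the notes' own code: the initialization, the small ridge added to each covariance, and the fixed iteration count in place of a convergence test are all choices made here for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, k, n_iter=100, seed=0):
    """EM for a mixture of Gaussians; a sketch of the E- and M-steps above.

    Assumed shapes: x is (m, n); components are indexed 0, ..., k-1.
    """
    rng = np.random.default_rng(seed)
    m, n = x.shape
    # Naive initialization: uniform weights, random points as means, shared covariance.
    phi = np.full(k, 1.0 / k)
    mu = x[rng.choice(m, size=k, replace=False)]
    Sigma = np.array([np.cov(x, rowvar=False) + 1e-6 * np.eye(n) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: w[i, j] = p(z^{(i)} = j | x^{(i)}; phi, mu, Sigma) via Bayes' rule.
        w = np.column_stack([
            phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
            for j in range(k)
        ])
        w /= w.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters using the soft weights w.
        Nj = w.sum(axis=0)                        # effective number of points per component
        phi = Nj / m
        mu = (w.T @ x) / Nj[:, None]
        for j in range(k):
            centered = x - mu[j]
            Sigma[j] = (w[:, j, None] * centered).T @ centered / Nj[j]
            Sigma[j] += 1e-6 * np.eye(n)          # small ridge for numerical stability

    return phi, mu, Sigma, w
```

Running `phi, mu, Sigma, w = em_gmm(x, k=2)` on data generated as in the earlier sampling sketch returns the fitted parameters along with the final soft assignments $w_j^{(i)}$.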

In the $E$-step, we calculate the posterior probability of the $z^{(i)}$'s given $x^{(i)}$ and the current settings of the parameters. Using Bayes' rule, we get the following formula:

$$
p(z^{(i)}=j|x^{(i)}; \phi,\mu,\Sigma)= \frac{p(x^{(i)}|z^{(i)}=j; \mu,\Sigma)\,p(z^{(i)}=j; \phi)} {\sum_{l=1}^k p(x^{(i)}|z^{(i)}=l; \mu,\Sigma)\,p(z^{(i)}=l; \phi)}
$$

In the formula above, $p(x^{(i)}|z^{(i)}=j; \mu,\Sigma)$ is obtained by evaluating the density of a Gaussian distribution with mean $\mu_j$ and covariance $\Sigma_j$ at $x^{(i)}$; $p(z^{(i)} = j; \phi)$ is given by $\phi_j$, and so on. The values $w_j^{(i)}$ calculated in the $E$-step represent our "soft guesses" for the values of $z^{(i)}$$^2$.

$^2$ The term "soft" refers to the fact that our guesses are probabilities and take values in $[0, 1]$; in contrast, a "hard" guess would be a single best value, such as one taken from the set $\{0,1\}$ or $\{1,\dots,k\}$.

In addition, you should compare the updates in the $M$-step with the formulas we had when the $z^{(i)}$'s were known exactly. They are identical, except that the indicator functions indicating which Gaussian each data point belongs to have been replaced by the $w_j^{(i)}$'s.

The $EM$ algorithm is also reminiscent of the $K$-means clustering algorithm, except that instead of the "hard" cluster assignments of $K$-means we now have the "soft" assignments $w_j^{(i)}$. Like the $K$-means algorithm, the $EM$ algorithm is also susceptible to local optima, so reinitializing with several different initial parameters may be a good idea.

The $EM$ algorithm has a very natural interpretation as repeatedly trying to guess the unknown $z^{(i)}$'s; but how does this algorithm come about, and can we guarantee certain properties of it, such as convergence? In the notes of the next chapter, we will describe a more general view of the $EM$ algorithm, which will allow us to easily apply it to other estimation problems that also have latent variables, and which will allow us to guarantee convergence.