
Hidden Markov Model

The hidden Markov model (HMM) is also a generative model. It is a probabilistic model that describes transitions between hidden states and the probability that each hidden state emits an observation.

Concepts

What is a hidden state?

A hidden Markov chain randomly generates a sequence of states that are hidden, that is, not directly observable from the outside. We call these hidden states, hereinafter simply states.

The probability of transition between states is represented by the state transition probability matrix $A$:


$$A=[a_{ij}]_{N\times N},\qquad a_{ij}=P(i_{t+1}=q_j \mid i_t=q_i),\quad i=1,\dots,N;\ j=1,\dots,N$$

$a_{ij}$ is the probability of moving from state $q_i$ at time $t$ to state $q_j$ at time $t+1$.

What is an observation?

An observation is the outward manifestation of a hidden state: each state emits a corresponding observation, so the state sequence produces a corresponding observation sequence.

The observation probability matrix is represented by $B$:


$$B=[b_{j}(k)]_{N\times M},\qquad b_j(k)=P(o_t=v_k \mid i_t=q_j),\quad k=1,\dots,M;\ j=1,\dots,N$$

$b_j(k)$ is the probability that state $q_j$ generates observation $v_k$ at time $t$.

The initial state probability distribution, represented by $\pi$, gives the probability of each state at the initial time $t=1$.

In fact, an HMM never observes the state transition sequence itself, only the sequence of observations the states emit. An HMM can therefore be written as a triple $\lambda=(A,B,\pi)$.
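To make the triple concrete, here is a minimal sketch of how $\lambda=(A,B,\pi)$ could be represented with NumPy arrays; the two states and three observation symbols are hypothetical toy values, not from the source.

```python
import numpy as np

# Toy HMM with N = 2 hidden states and M = 3 observation symbols
# (all numbers here are made up, purely for illustration).

# State transition matrix A: A[i, j] = P(i_{t+1} = q_j | i_t = q_i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Observation matrix B: B[j, k] = P(o_t = v_k | i_t = q_j)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# Initial state distribution pi: pi[i] = P(i_1 = q_i)
pi = np.array([0.6, 0.4])

# Every row of A and B, and pi itself, must sum to 1.
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```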

An HMM must satisfy the following assumptions:

  • Markov property: the state of the chain at any time $t$ depends only on the state at the previous time, and is independent of the states and observations at all other times.
  • Observation independence: the observation at time $t$ depends only on the state at time $t$, independent of other states and observations.
  • The states, although hidden, must be able to be estimated (at least roughly) from the observations.

Probability calculation

Given an HMM $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,\dots,o_T)$ of length $T$, compute the probability $P(O\mid\lambda)$ of the observation sequence.
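Since $P(O\mid\lambda)=\sum_I P(O,I\mid\lambda)$, this probability can in principle be computed by enumerating all $N^T$ state sequences. Below is a brute-force sketch, assuming the `A`, `B`, `pi` arrays from the earlier snippet and `O` given as a list of observation indices; it is exponential in $T$ and useful only as a sanity check for the algorithms that follow:

```python
import itertools

def brute_force_prob(O, A, B, pi):
    """P(O|lambda) by summing P(O, I|lambda) over all N^T state sequences."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for I in itertools.product(range(N), repeat=T):
        p = pi[I[0]] * B[I[0], O[0]]          # pi_{i_1} * b_{i_1}(o_1)
        for t in range(1, T):
            p *= A[I[t - 1], I[t]] * B[I[t], O[t]]
        total += p
    return total
```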

The forward algorithm

Forward probability: given the model $\lambda$, the forward probability $\alpha_t(i)$ is the joint probability of the partial observation sequence $o_1,\dots,o_t$ up to time $t$ together with being in state $q_i$ at time $t$. It is computed recursively, moving forward in time from $t=1$:


$$\alpha_t(i)=P(o_1,o_2,\dots,o_t,i_t=q_i \mid \lambda)$$

Probability of the observation sequence:


$$t=1:\quad \alpha_1(i)=\pi_i\,b_i(o_1),\quad i=1,2,\dots,N$$

$$t=1,2,\dots,T-1:\quad \alpha_{t+1}(i)=\Big[\sum_{j=1}^N \alpha_t(j)\,a_{ji}\Big]\,b_i(o_{t+1}),\quad i=1,2,\dots,N$$

$$P(O\mid\lambda)=\sum_{i=1}^N \alpha_T(i)$$

$\alpha_{t+1}(i)$ is the forward probability of being in state $q_i$ at time $t+1$.
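A minimal NumPy sketch of this recursion, with the same array conventions as the earlier snippets (`O` is a list of observation indices, so all times are zero-based):

```python
def forward(O, A, B, pi):
    """Forward algorithm: returns alpha (T x N) and P(O|lambda) in O(T * N^2)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                # alpha_1(i) = pi_i * b_i(o_1)
    for t in range(T - 1):
        # alpha_{t+1}(i) = [sum_j alpha_t(j) * a_{ji}] * b_i(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha, alpha[-1].sum()             # P(O|lambda) = sum_i alpha_T(i)
```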

The backward algorithm

Backward probability: symmetrically, the recursion runs backward in time from $t=T$. Specifically, given the model $\lambda$, $\beta_t(i)$ is the probability of the partial observation sequence $o_{t+1},\dots,o_T$ from time $t+1$ to $T$, given that the state at time $t$ is $q_i$.

Probability of the observation sequence:


$$t=T:\quad \beta_T(i)=1,\quad i=1,2,\dots,N$$

$$t=T-1,T-2,\dots,1:\quad \beta_t(i)=\sum_{j=1}^N a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j),\quad i=1,2,\dots,N$$

$$P(O\mid\lambda)=\sum_{i=1}^N \pi_i\,b_i(o_1)\,\beta_1(i)$$
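A matching sketch of the backward recursion; as a sanity check, the forward, backward, and brute-force computations should all agree on $P(O\mid\lambda)$:

```python
def backward(O, A, B, pi):
    """Backward algorithm: returns beta (T x N) and P(O|lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                            # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta, (pi * B[:, O[0]] * beta[0]).sum()

O = [0, 2, 1]                                 # a hypothetical observation index sequence
alpha, p_fwd = forward(O, A, B, pi)
beta, p_bwd = backward(O, A, B, pi)
assert np.isclose(p_fwd, p_bwd)
assert np.isclose(p_fwd, brute_force_prob(O, A, B, pi))
```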

Combining forward and backward probabilities, the probability of the observation sequence can equivalently be written, for any $t$, as:


$$P(O\mid\lambda)=\sum_{i=1}^N\sum_{j=1}^N \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j),\quad t=1,\dots,T-1$$

Given $\lambda$ and the observation sequence $O$, the probability of being in state $q_i$ at time $t$, denoted $\gamma_t(i)$, is:


$$\gamma_t(i)=P(i_t=q_i\mid O,\lambda)=\dfrac{P(i_t=q_i,O\mid\lambda)}{P(O\mid\lambda)}=\dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\,\beta_t(j)}$$

Given $\lambda$ and the observation sequence $O$, the probability of being in state $q_i$ at time $t$ and in state $q_j$ at time $t+1$, denoted $\xi_t(i,j)$, is:


$$\xi_t(i,j)=P(i_t=q_i,i_{t+1}=q_j\mid O,\lambda)=\dfrac{P(i_t=q_i,i_{t+1}=q_j,O\mid\lambda)}{P(O\mid\lambda)}=\dfrac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{i=1}^N\sum_{j=1}^N \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}$$
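Both posteriors fall out of one pass of the forward and backward sketches above. A vectorized sketch (with zero-based time indices, so `xi` has shape $(T-1, N, N)$):

```python
def gamma_xi(O, A, B, pi):
    """gamma[t, i] = P(i_t = q_i | O, lambda),
    xi[t, i, j] = P(i_t = q_i, i_{t+1} = q_j | O, lambda)."""
    alpha, prob = forward(O, A, B, pi)
    beta, _ = backward(O, A, B, pi)
    gamma = alpha * beta / prob               # alpha_t(i) * beta_t(i) / P(O|lambda)
    # Element [t, i, j] of xi is
    # alpha_t(i) * a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j) / P(O|lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, O[1:]].T[:, None, :] * beta[1:, None, :]) / prob
    return gamma, xi
```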

Learning

There are two kinds of learning methods: supervised learning and unsupervised learning.

The distinction depends on the type of training data. If the training data contains both observation sequences and the corresponding state sequences, the parameters can be learned by supervised learning, i.e. maximum likelihood estimation. If the training data contains only observation sequences, unsupervised learning is used, i.e. the EM algorithm.

Obviously, the goal of learning is to estimate the three parameters of the model $\lambda=(A,B,\pi)$.

Based on maximum likelihood estimation

Suppose the training data contains observation sequences $O$ and the corresponding state sequences $I$, i.e. $\{(O_1,I_1),(O_2,I_2),\dots,(O_S,I_S)\}$, $S$ pairs in total.

The maximum likelihood method can be used to estimate the parameters of the model.

The state transition probability $a_{ij}$ can be estimated as follows:


$$\hat{a}_{ij}=\dfrac{A_{ij}}{\sum_{j=1}^N A_{ij}},\qquad i=1,\dots,N;\ j=1,\dots,N$$

where $A_{ij}$ is the number of times in the sample that the state is $i$ at some time $t$ and transitions to state $j$ at time $t+1$; that is, how many times state $i$ moves to state $j$ at the next time step. This is fairly intuitive.

And the estimate of the observation probability $b_j(k)$ is:


$$\hat{b}_j(k)=\dfrac{B_{jk}}{\sum_{k=1}^M B_{jk}},\qquad j=1,\dots,N;\ k=1,\dots,M$$

where $B_{jk}$ is the number of times in the sample that state $q_j$ generates observation $v_k$.

The simple estimate of the initial state probability $\pi_i$ is the relative frequency of initial state $q_i$ across all $S$ samples.
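A minimal counting sketch of these three estimators, assuming states and observations are already encoded as integer indices (a hypothetical helper, not from the book):

```python
def supervised_mle(state_seqs, obs_seqs, N, M):
    """Count-based MLE of (A, B, pi) from S paired state/observation sequences."""
    A_cnt, B_cnt, pi_cnt = np.zeros((N, N)), np.zeros((N, M)), np.zeros(N)
    for I, O in zip(state_seqs, obs_seqs):
        pi_cnt[I[0]] += 1                     # initial state counts
        for t in range(len(I) - 1):
            A_cnt[I[t], I[t + 1]] += 1        # transitions i -> j
        for i, o in zip(I, O):
            B_cnt[i, o] += 1                  # state j emitting v_k
    return (A_cnt / A_cnt.sum(axis=1, keepdims=True),
            B_cnt / B_cnt.sum(axis=1, keepdims=True),
            pi_cnt / pi_cnt.sum())
```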

EM-based algorithm

Recall the setting: now we have only $S$ observation sequences of length $T$, $\{O_1,\dots,O_S\}$, and the goal is still to learn the model parameters.

Because the state sequence $I$ is unknown, the HMM becomes a probabilistic model with hidden variables:


$$P(O\mid\lambda)=\sum_I P(O\mid I,\lambda)\,P(I\mid\lambda)$$

So the question is, how do you learn the parameters of this model?

The answer is the EM algorithm.

We know that the EM algorithm is divided into an E step and an M step.

E step

E stands for expectation. First we initialize a set of parameters $a_{ij}^{(0)}, b_j(k)^{(0)}, \pi_i^{(0)}$. Then we define a function $Q$, the expectation of $\log P(O,I\mid\lambda)$ taken over the hidden variable $I$:


$$Q(\lambda,\hat{\lambda})=\sum_I \log P(O,I\mid\lambda)\,P(O,I\mid\hat{\lambda})$$

where $\hat{\lambda}$ denotes the current estimate of the parameters. Expanding further,


$$\begin{aligned} Q(\lambda,\hat{\lambda})&=\sum_I \log P(O,I\mid\lambda)\,P(O,I\mid\hat{\lambda})\\ &=\sum_I \log \big[\pi_{i_1}b_{i_1}(o_1)a_{i_1 i_2}b_{i_2}(o_2)\cdots a_{i_{T-1}i_T}b_{i_T}(o_T)\big]\,P(O,I\mid\hat{\lambda}) \\ &=\sum_I \log \pi_{i_1}\,P(O,I\mid\hat{\lambda})+\sum_I\Big(\sum_{t=1}^{T-1}\log a_{i_t i_{t+1}}\Big)P(O,I\mid\hat{\lambda})+\sum_I\Big(\sum_{t=1}^{T}\log b_{i_t}(o_t)\Big)P(O,I\mid\hat{\lambda}) \end{aligned}$$

The E step simply computes this function.

M step

M stands for maximization: maximize each of the three terms above separately.

Each maximization uses the Lagrange multiplier method with the probability normalization constraints; the derivation is somewhat involved, so refer to the book for the details, which are not expanded here.

Finally, solving yields the parameter estimates:


$$\pi_i=\dfrac{P(O,i_1=i\mid\hat{\lambda})}{P(O\mid\hat{\lambda})}$$

$$a_{ij}=\dfrac{\sum_{t=1}^{T-1} P(O,i_t=i,i_{t+1}=j\mid\hat{\lambda})}{\sum_{t=1}^{T-1} P(O,i_t=i\mid\hat{\lambda})}$$

$$b_{j}(k)=\dfrac{\sum_{t=1}^{T}P(O,i_t=j\mid\hat{\lambda})\,I(o_t=v_k)}{\sum_{t=1}^{T} P(O,i_t=j\mid\hat{\lambda})}$$
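These updates constitute the Baum-Welch algorithm. Below is a compact sketch of one EM iteration, reusing the `gamma_xi` helper from the probability-calculation section; it is an illustrative implementation under the same toy conventions, not the book's exact pseudocode, and in practice it is iterated until the parameters converge:

```python
def baum_welch_step(obs_seqs, A, B, pi):
    """One EM (Baum-Welch) update of (A, B, pi) from unlabeled observation sequences."""
    N, M = B.shape
    A_num, A_den = np.zeros((N, N)), np.zeros(N)
    B_num, B_den = np.zeros((N, M)), np.zeros(N)
    pi_new = np.zeros(N)
    for O in obs_seqs:
        gamma, xi = gamma_xi(O, A, B, pi)     # E step, per sequence
        pi_new += gamma[0]                    # expected initial-state counts
        A_num += xi.sum(axis=0)               # sum_t xi_t(i, j)
        A_den += gamma[:-1].sum(axis=0)       # sum_{t<T} gamma_t(i)
        for t, o in enumerate(O):
            B_num[:, o] += gamma[t]           # only times where o_t = v_k
        B_den += gamma.sum(axis=0)
    return (A_num / A_den[:, None],           # M step: normalize expected counts
            B_num / B_den[:, None],
            pi_new / len(obs_seqs))
```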

References

  1. Li Hang, Statistical Learning Methods (统计学习方法)