
This series of articles is adapted from 52nlp (I Love Natural Language Processing: www.52nlp.cn/), whose original series is "HMM Learning Best Practices"; this post is a substantial reorganization of that material. See also: blog.csdn.net/itplus/arti…

This article mainly sorts out the three basic problems of the hidden Markov model and the forward algorithm.

2. The Three Problems

2.1 Hidden Markov Models

A hidden Markov model is a triplet (π, A, B).

  

Each probability in the state transition matrix and the confusion matrix is time-independent: the matrices do not change as the system evolves over time. In fact, this is the least realistic assumption the Markov model makes about the real world.

2.2 Applications

Once a system can be described as an HMM, it can be used to solve three basic problems. The first two are pattern recognition problems: finding the probability of an observation sequence given an HMM (evaluation), and searching for the hidden state sequence most likely to have generated an observation sequence (decoding). The third problem is learning an HMM from a given sequence of observations (learning).

Question 1: Calculating probability (evaluation)

Consider the following problem: we have several hidden Markov models describing different systems (that is, several (π, A, B) triples) and a sequence of observations. We want to know which HMM most likely produced the given observation sequence. For the seaweed example, we might have a "summer" model and a "winter" model, since conditions differ from season to season; we might then want to determine the current season from an observed sequence of seaweed moisture. We use the forward algorithm to calculate the probability of an observation sequence given an HMM, and then select the most appropriate HMM.

Conclusion: the first problem is to find the probability of a given observation sequence under a given model (the parameters of the HMM are known).

Question 2: Prediction (decoding)

Search for the most likely sequence of hidden states given an observation sequence.

A related problem, and one of the most interesting, is searching for the hidden state sequence that generated an output sequence. In many cases we are more interested in the hidden states of the model, because they represent something valuable that is usually not directly observable. Consider the case of seaweed and the weather: a blind hermit can only sense the state of the seaweed, but he wants to know the state of the weather, which in this case is the hidden state. We use the Viterbi algorithm to search for the most likely hidden state sequence given the known observation sequence and the HMM.

We know the observation sequence and the parameters of the HMM, and we want to find the state sequence that most likely corresponds to this observation sequence.

Question 3: Learning

Generating a hidden Markov model from an observation sequence. The third problem, the most difficult of the HMM problems, is to estimate the most appropriate hidden Markov model based on a sequence of observations (from a known set) and the set of hidden states associated with it; that is, to find the parameters (π, A, B) of the HMM.

When the matrices A and B cannot be measured (estimated) directly, the forward-backward algorithm is used for learning (parameter estimation); this is also the common situation in practical applications.

A hidden Markov model, described by a vector and two matrices (π, A, B), is of great value for real systems: although it is often only an approximation, it is amenable to analysis.

3. The Forward Algorithm

3.1 Exhaustive Search (Brute Force)

Given a hidden Markov model, that is, given the model parameters (π, A, B), we want to find the probability of an observation sequence. Taking the weather as an example, we have an HMM describing the weather and the closely related humidity states of seaweed, and we also have an observation sequence of seaweed humidity. Suppose the observed seaweed humidity for 3 consecutive days is (dry, damp, soggy), and each of those three days could be sunny, cloudy, or rainy. The observation sequence and the hidden states can be viewed as a grid:

Each column in the grid shows the possible weather states, and each state in a column is linked to every state in the adjacent column. Transitions between states are assigned probabilities by the state transition matrix. Below each column is the observed state at that point in time, and the probability of that observed state given any hidden state is provided by the confusion matrix.

It can be seen that one way to calculate the probability of an observation sequence is to enumerate every possible hidden state sequence and add up the probabilities of the observation sequence occurring together with each. For the weather example above, there are 3^3 = 27 possible weather sequences, so the probability of the observation sequence is:

Pr(dry,damp,soggy | HMM) = Pr(dry,damp,soggy, sunny,sunny,sunny) + Pr(dry,damp,soggy, sunny,sunny,cloudy) + Pr(dry,damp,soggy, sunny,sunny,rainy) + … + Pr(dry,damp,soggy, rainy,rainy,rainy)

It looks simple, but this algorithm is infeasible in practice. In the example above the numbers of states and observations are small, but in real problems the number of hidden states can be very large. With N states, K possible observations, and an observation sequence of total length T, the total number of paths is N^T, and the time complexity is O(T·N^T): exponential. The amount of computation quickly becomes unbearable.
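To make the brute-force idea concrete, here is a minimal sketch that enumerates every hidden-state path. The parameter names (`pi`, `A`, `B`) and the integer coding of observations are illustrative conventions, not from the original article:

```python
from itertools import product

def brute_force_prob(pi, A, B, obs):
    """Sum P(observations, path) over every possible hidden-state path.

    pi:  initial state probabilities, length N
    A:   N x N state transition matrix, A[i][j] = P(next state j | state i)
    B:   N x K confusion matrix, B[j][k] = P(observation k | state j)
    obs: observation sequence as integer indices, length T

    Enumerates all N**T paths, so the cost is O(T * N**T).
    """
    N, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):
        # Joint probability of this particular hidden path and the observations
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total
```

With 3 weather states and 3 days this sums over 27 paths, matching the expansion above; for realistic N and T the loop becomes hopeless.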

So we need to find an algorithm to solve this problem with low time complexity.

3.2 Forward algorithm

The forward algorithm is based on dynamic programming. That is, we find a recursion over local states, building step by step from the solutions of subproblems to the solution of the whole problem.

For dynamic programming, see the earlier introduction to illustrated algorithms, part 9: dynamic programming, the knapsack problem, and trip optimization.

The forward algorithm is usually introduced via the definition and derivation of the forward probability.

Define the forward probability at time t as the joint probability that the observed sequence up to time t is o1, o2, …, ot and that the hidden state at time t is qi. It is denoted:

αt(i) = P(o1, o2, …, ot, it = qi | λ)

I will not paste the full derivation here, since the formulas can be a headache. Let's look at it another way.

We first define the partial (local) probability, which is the probability of reaching an intermediate state in the grid. We will then show how to calculate these local probabilities when t = 1 and when t > 1.

Suppose an observation sequence of length T is: o1, o2, …, oT.

Again, using the weather example, we can calculate the probability of reaching some intermediate state in the grid as a sum of the probabilities of all possible paths to that state.

For example, the local probability of being in the "cloudy" state at t=2 is calculated along the following paths:

We define the local probability of being in state j at time t as αt(j); this local probability is calculated as follows:

At the final observation time, the local probability of each state includes the probability of reaching that state through all possible paths; for the grid above, the final local probabilities are calculated along the following paths:

Thus, the sum of these final local probabilities equals the sum of the probabilities of all possible paths through the grid, which is the probability of the observation sequence given the hidden Markov model (HMM).

Here’s the derivation, if you’re familiar with it, you can skip it.

1. Calculate the local probability at t=1

In particular, when t = 1 there are no paths into the current state, so the probability of being in a state at t = 1 is simply its initial probability. The local probability at t = 1 is therefore the initial probability of the state multiplied by the corresponding observation probability:

α1(j) = π(j) · bj(o1)

2. Calculate the local probability when t>1

Looking at the formula recursively: the factor to the left of the multiplication sign is Pr(observed state | hidden state j), and the factor on the right is Pr(all paths into state j at time t).

To calculate the probability of all paths to a certain state, we can calculate the probability of each path to that state and sum them, for example:

The number of paths to be calculated grows exponentially as the observation sequence lengthens, but at time t the probabilities of all paths into each state are already known. Therefore, we can define the local probabilities at time t+1 in terms of the local probabilities at time t, that is:

αt+1(j) = bj(ot+1) · Σ(i=1..N) αt(i) · aij

N indicates the number of hidden states.

So the probability just calculated equals the product of the corresponding observation probability (that is, the probability of the symbol observed at time t+1 given state j) and the sum of the probabilities of reaching that state at that time, which in turn comes from multiplying each local probability from the previous step by the corresponding state transition probability.

We can now recursively calculate the probability of an observation sequence given a hidden Markov model (HMM): compute α1, use α1 to compute α2, use α2 to compute α3, and so on until t = T. The probability of the observation sequence given the HMM equals the sum of the local probabilities at time t = T:

P(O | λ) = Σ(i=1..N) αT(i)
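The recursion above can be sketched as follows, using the same hypothetical parameter conventions as the brute-force sketch (`pi`, `A`, `B`, integer-coded observations):

```python
def forward_prob(pi, A, B, obs):
    """Forward algorithm: P(obs | HMM) in O(T * N^2) time.

    pi:  initial state probabilities, length N
    A:   N x N state transition matrix
    B:   N x K confusion matrix
    obs: observation sequence as integer indices
    """
    N = len(pi)
    # t = 1: alpha_1(j) = pi(j) * b_j(o_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
    # t -> t+1: alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) * a_{ij}
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(N))
                 for j in range(N)]
    # P(O | lambda) = sum over final local probabilities
    return sum(alpha)
```

Note that for long sequences these probabilities underflow floating point; practical implementations rescale each αt or work in log space.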

Time complexity comparison

Since each recursion builds only on the previous time step, each computation involves just two adjacent time points. There are N states at each time point, so the recursion between two adjacent times takes N^2 computations; over T steps this adds up to a total complexity of O(TN^2), much better than the brute-force method.

Note: the time complexity of exhaustive search is about 2T·N^T, while that of the forward algorithm is N^2·T, where T is the length of the observation sequence and N is the number of hidden states.

Conclusion

Our goal is to calculate the probability of an observation sequence under a given hidden Markov model (HMM). The forward algorithm uses recursion to avoid exhaustively evaluating every path through the grid.

Given this algorithm, it can be used directly to determine which of several hidden Markov models best describes a known observation sequence: each is evaluated with the forward algorithm, and the one giving the highest probability is selected.

If you think you have understood it, try this example:

This is a box and ball model, an example from Statistical Learning Methods by Li Hang.

Suppose we have three boxes, each containing both red and white balls. The numbers of balls in the three boxes are:

box                     1   2   3
number of red balls     5   4   7
number of white balls   5   6   3

Balls are drawn from the boxes as follows. At the start, the probability of drawing from the first box is 0.2, from the second box 0.4, and from the third box 0.4. A ball is drawn according to these probabilities and put back. Then we move from the current box to the next box and draw again. The rules are: if the current box is the first, then with probability 0.5 we stay in the first box and continue drawing, with probability 0.2 we move to the second box, and with probability 0.3 we move to the third box. If the current box is the second, then with probability 0.5 we stay in the second box, with probability 0.3 we move to the first box, and with probability 0.2 we move to the third box. If the current box is the third, then with probability 0.5 we stay in the third box, with probability 0.2 we move to the first box, and with probability 0.3 we move to the second box. Repeating this for three draws yields an observation sequence of ball colors:

O = {red, white, red}. Find the probability of obtaining the observation sequence "red, white, red".

Note that in this process, the observer can only see the color sequence of the ball, but not the box from which the ball was taken.

Then according to the definition of our HMM model above, our observation set is:

V={red, white}, M=2

Our set of states is:

Q={box 1, box 2, box 3}, N=3

The length of the observation sequence and the state sequence is 3.

The initial state distribution is π, the state transition probability matrix is A, and the observation probability matrix is B.
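Reading the parameters off the problem statement (π from the initial box choice, A from the movement rules, B from the ball counts), the forward recursion answers the question. A sketch in plain Python:

```python
# Parameters read off from the box-and-ball problem statement
pi = [0.2, 0.4, 0.4]          # initial probability of choosing each box
A = [[0.5, 0.2, 0.3],         # A[i][j] = P(next box j | current box i)
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],              # B[j] = (P(red), P(white)) for box j+1
     [0.4, 0.6],
     [0.7, 0.3]]
obs = [0, 1, 0]               # red, white, red

# Forward recursion: alpha_1(j) = pi(j) * b_j(o_1), then
# alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) * a_{ij}
alpha = [pi[j] * B[j][obs[0]] for j in range(3)]
for o in obs[1:]:
    alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(3))
             for j in range(3)]

print(round(sum(alpha), 5))   # P(red, white, red) = 0.13022
```

This matches the answer worked out for this example in Li Hang's Statistical Learning Methods, P(O | λ) ≈ 0.13022.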