Markov
The Markov model, unlike regression and classification models that deal with mutually independent samples, is used to handle time-series data, i.e. data whose samples are related in time. The core idea of Markov is: "the current state depends only on the state at the previous moment, and has nothing to do with any earlier state". Personally, I think this is the Markov model's biggest problem; as for why, I'll get to it later in the article. Here is an example to explain the Markov model in detail
Now suppose there is a person who is in one of three states every day: eating, sleeping, or playing. This is the legendary state distribution
Now you want to know his state on the $n$th day: is he eating, playing, or sleeping, and with what probability for each of these states?
I'm going to assume he changes state every day with fixed probabilities. For example, if he plays today, then tomorrow he eats with probability 0.6, sleeps with probability 0.3, and plays again with probability 0.1. From such probabilities we can construct the following matrix

This matrix is the state transition probability matrix $P$, and it remains constant: the state transition probability matrix on day 1 is the same as on day 2 and on every day after that
So with this matrix, plus the state distribution on day 1, you can work out the probability distribution for any later day
For a very simple example, suppose the probabilities that the person is eating, playing, and sleeping on October 1 are $S_1=(0.6,0.2,0.2)$

The state distribution vector for October 2 is $S_2=S_1P=(0.22,0.4,0.38)$

$S_3=S_2P=(0.236,0.496,0.268)$ (related only to $S_2$)

The state distribution vector for October 4 is $S_4=S_3P$

In general, the state distribution vector $S_n=S_{n-1}P$ (related only to $S_{n-1}$)

So much for the naive Markov model. To conclude: to compute the state $S_n$ on day $n$, you only need to compute $S_1P^{n-1}$
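To make the recursion concrete, here is a minimal NumPy sketch of $S_n=S_1P^{n-1}$. Note that the article's transition matrix was shown in an image that is not reproduced here, so only the "play" row below comes from the text; the "eat" and "sleep" rows are made-up placeholders for illustration:

```python
import numpy as np

# State order: (eat, play, sleep).
# Only the "play" row is given in the article; the other two rows are
# assumed placeholders, since the original matrix was in an image.
P = np.array([
    [0.2, 0.4, 0.4],   # from "eat"   (assumed)
    [0.6, 0.1, 0.3],   # from "play": eat 0.6, play 0.1, sleep 0.3 (given)
    [0.3, 0.5, 0.2],   # from "sleep" (assumed)
])

S1 = np.array([0.6, 0.2, 0.2])  # day-1 distribution from the example

def state_on_day(n: int) -> np.ndarray:
    """Return S_n = S_1 @ P^(n-1), the state distribution on day n."""
    return S1 @ np.linalg.matrix_power(P, n - 1)

for day in (1, 2, 3, 10):
    print(day, state_on_day(day))
```

With the assumed rows the numbers will differ from those in the example above; the point is the mechanics: each extra day is one more multiplication by $P$.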
As for the problem with Markov that I mentioned at the beginning, I don't know if you noticed it, but I think its biggest flaw is that it pays too little attention to earlier data. A very simple example: suppose someone is walking home from the mall, and we look at her path
Suppose we divide the path into five equal moments (the blue dots) and think about this person's decision process: she surely planned from the start to go straight home from the mall, a long-term plan. What does Markov predict? The Markov model treats this as a short-term decision: it thinks the person only suddenly decided to go home at the green dot. What do you think
Personally, I feel Markov can be quite problematic for processing time-series data. I discovered this while writing my paper: I originally planned to make Markov the main part of the paper, but later switched to a strategy of a model that can predict long-term trends plus a Markov correction. If you're interested, you can read my paper
HMM (Hidden Markov Model)
"Hidden" means unobservable. To borrow a saying: "an otaku knows winter has come when he sees the oil packet in his instant noodles turn solid." The season (winter) is the hidden state; the otaku (the observer) doesn't go out, so he can't see the weather directly and can only see the instant-noodle seasoning (the observation). An HMM requires not only fixed transition probabilities between the hidden states themselves, but also a fixed probability relationship between the observation sequence and the hidden state sequence
Let $Q$ be the set of all possible (hidden) states and $V$ be the set of all possible observations:

$$Q=\{q_1,q_2,\dots,q_N\},\quad V=\{v_1,v_2,\dots,v_M\}$$

where $N$ is the number of possible states and $M$ is the number of possible observations.

$I$ is a state sequence of length $T$, and $O$ is the corresponding observation sequence:

$$I=(i_1,i_2,\dots,i_T),\quad O=(o_1,o_2,\dots,o_T)$$
$A$ is the (hidden) state transition probability matrix:

$$A=[a_{ij}]_{N\times N}$$

where

$$a_{ij}=P(i_{t+1}=q_j\mid i_t=q_i),\quad i,j=1,2,\dots,N$$

is the probability of moving to state $q_j$ at time $t+1$ given that the chain is in state $q_i$ at time $t$. For example, if it is winter now, the probabilities of switching to spring, summer, or fall at the next moment
$B$ is the observation probability matrix:

$$B=[b_j(k)]_{N\times M}$$

where

$$b_j(k)=P(o_t=v_k\mid i_t=q_j),\quad k=1,2,\dots,M;\ j=1,2,\dots,N$$

is the probability of generating observation $v_k$ given that the chain is in state $q_j$ at time $t$. For example, the probability that the oil packet turns solid given that it is winter
$\pi$ is the initial state probability vector:

$$\pi=(\pi_i)$$

where

$$\pi_i=P(i_1=q_i),\quad i=1,2,\dots,N$$

is the probability of being in state $q_i$ at time $t=1$. For example, the probability that we start in spring, summer, fall, or winter
The hidden Markov model is determined by the initial state probability vector $\pi$, the state transition probability matrix $A$, and the observation probability matrix $B$. $\pi$ and $A$ determine the state sequence, and $B$ determines the observation sequence. Therefore, the hidden Markov model $\lambda$ can be expressed in triple notation, i.e.

$$\lambda=(A,B,\pi)$$

$A$, $B$, and $\pi$ are called the three elements of the hidden Markov model
The state transition probability matrix $A$ and the initial state probability vector $\pi$ determine the hidden Markov chain and generate the unobservable state sequence. The observation probability matrix $B$ determines how observations are generated from states and, combined with the state sequence, determines how the observation sequence is generated
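To make the three elements concrete, here is a minimal sketch of the season/noodle analogy as NumPy arrays. All of the numbers are hypothetical; the article does not give values for this example:

```python
import numpy as np

Q = ["spring", "summer", "fall", "winter"]   # hidden states, N = 4
V = ["oil liquid", "oil solid"]              # observations,  M = 2

# A[i, j]: probability of moving from season Q[i] to season Q[j]
# at the next moment (hypothetical values)
A = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.0, 0.7, 0.3, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.3, 0.0, 0.0, 0.7]])

# B[i, k]: probability of observing V[k] given season Q[i]
# (hypothetical values; solid oil is most likely in winter)
B = np.array([[0.8, 0.2],    # spring
              [0.9, 0.1],    # summer
              [0.6, 0.4],    # fall
              [0.1, 0.9]])   # winter

# pi[i]: probability that the chain starts in season Q[i]
pi = np.array([0.25, 0.25, 0.25, 0.25])

model = (A, B, pi)   # lambda = (A, B, pi), the three elements
```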
By definition, the hidden Markov model makes two basic assumptions:
- Homogeneous Markov assumption. That is, the state of the hidden Markov chain at any time $t$ depends only on its state at the previous moment, and is independent of the states and observations at all other moments and of the time $t$ itself:

$$P(i_t\mid i_{t-1},o_{t-1},\dots,i_1,o_1)=P(i_t\mid i_{t-1}),\quad t=1,2,\dots,T$$

- Observation independence assumption. That is, the observation at any moment depends only on the state of the Markov chain at that moment, and is independent of other observations and states:

$$P(o_t\mid i_T,o_T,i_{T-1},o_{T-1},\dots,i_{t+1},o_{t+1},i_t,i_{t-1},o_{t-1},\dots,i_1,o_1)=P(o_t\mid i_t)$$
Hidden Markov models can be used for tagging, where states correspond to tags. The tagging problem is to predict the tag sequence corresponding to a given observation sequence. Data for the tagging problem can be assumed to be generated by a hidden Markov model, so we can use the learning and prediction algorithms of the hidden Markov model for tagging
Let’s look at an example of a hidden Markov model
Suppose there are four boxes, each containing red and white balls. The numbers of red and white balls in each box are given in the table below
Box number | 1 | 2 | 3 | 4
---|---|---|---|---
Number of red balls | 5 | 3 | 6 | 8
Number of white balls | 5 | 7 | 4 | 2
Generate a sequence of observations of the ball colors as follows:

- At the start, select one of the four boxes at random with equal probability, draw a ball at random from that box, record its color, and put it back
- Then move from the current box to the next box according to the rule: if the current box is box 1, the next box must be box 2; if the current box is box 2 or 3, move to the box on the left or on the right with probability 0.4 and 0.6 respectively; if the current box is box 4, either stay in box 4 or move to box 3, each with probability 0.5
- After the transition is determined, draw a ball at random from the new box, record its color, and put it back
- Repeat this five times to obtain a sequence of observations of the ball colors
In this process, the observer can only see the sequence of ball colors, not the boxes from which the balls are drawn; that is, the box sequence cannot be observed

This example has two random sequences: the sequence of boxes (the state sequence) and the sequence of ball colors (the observation sequence). The former is hidden; only the latter is observable. This is an example of a hidden Markov model. From the given conditions, the state set, the observation set, the sequence length, and the three elements of the model can all be determined
Boxes correspond to states, and the set of states is:

$$Q=\{\text{box 1},\text{box 2},\text{box 3},\text{box 4}\},\quad N=4$$

The ball colors correspond to observations, and the set of observations is:

$$V=\{\text{red},\text{white}\},\quad M=2$$

The lengths of the state sequence and the observation sequence are $T=5$.

The initial probability distribution is

$$\pi=(0.25,0.25,0.25,0.25)^T$$

The state transition probability distribution is

$$A=\begin{bmatrix}0&1&0&0\\0.4&0&0.6&0\\0&0.4&0&0.6\\0&0&0.5&0.5\end{bmatrix}$$

The observation probability distribution is

$$B=\begin{bmatrix}0.5&0.5\\0.3&0.7\\0.6&0.4\\0.8&0.2\end{bmatrix}$$
Generation of the observation sequence

According to the definition of the hidden Markov model, the generation process of an observation sequence $O=(o_1,o_2,\dots,o_T)$ of length $T$ is described as follows:

Input: hidden Markov model $\lambda=(A,B,\pi)$, observation sequence length $T$

Output: observation sequence $O=(o_1,o_2,\dots,o_T)$

(1) Generate state $i_1$ according to the initial state distribution $\pi$

(2) Set $t=1$

(3) Generate $o_t$ according to the observation probability distribution $b_{i_t}(k)$ of state $i_t$

(4) Generate state $i_{t+1}$ according to the state transition probability distribution $\{a_{i_ti_{t+1}}\}$ of state $i_t$, where $i_{t+1}=1,2,\dots,N$

(5) Set $t=t+1$; if $t<T$, return to step (3); otherwise, terminate. A minimal simulation of this procedure is sketched below
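Here is a minimal simulation of the generation procedure, using the $\pi$, $A$, and $B$ reconstructed above for the four-box model (indices are 0-based in code, so box 1 is row 0):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.25, 0.25, 0.25, 0.25])     # equal initial probability
A = np.array([[0.0, 1.0, 0.0, 0.0],         # box 1 -> box 2
              [0.4, 0.0, 0.6, 0.0],         # box 2 -> box 1 / box 3
              [0.0, 0.4, 0.0, 0.6],         # box 3 -> box 2 / box 4
              [0.0, 0.0, 0.5, 0.5]])        # box 4 -> box 3 / box 4
B = np.array([[0.5, 0.5],                   # columns: P(red), P(white)
              [0.3, 0.7],
              [0.6, 0.4],
              [0.8, 0.2]])

def generate(T: int):
    """Generate a hidden state sequence and an observation sequence."""
    states, obs = [], []
    i = rng.choice(4, p=pi)                               # initial state from pi
    for _ in range(T):
        obs.append(rng.choice(["red", "white"], p=B[i]))  # observation from B
        states.append(i + 1)                              # record 1-based box id
        i = rng.choice(4, p=A[i])                         # transition from A
    return states, obs

states, obs = generate(5)
print("hidden boxes :", states)   # not visible to the observer
print("observations :", obs)      # the only thing the observer sees
```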
Three Basic Problems of Hidden Markov Model (Key points)
The hidden Markov model has three basic problems:

- Probability calculation problem. Given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, compute the probability $P(O\mid\lambda)$ of the observation sequence appearing under the model
- Learning problem. Given an observation sequence $O=(o_1,o_2,\dots,o_T)$, estimate the parameters of the model $\lambda=(A,B,\pi)$ so that the probability $P(O\mid\lambda)$ of the observation sequence under the model is maximized; that is, estimate the parameters by maximum likelihood
- Prediction problem, also known as the decoding problem. Given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, find the state sequence $I=(i_1,i_2,\dots,i_T)$ that maximizes the conditional probability $P(I\mid O)$; that is, given the observation sequence, find the most likely corresponding state sequence
HMM probability calculation method
This section introduces the forward and backward algorithms for computing the observation sequence probability $P(O\mid\lambda)$. First, we look at direct computation, which is conceptually feasible but computationally infeasible
1. Direct calculation method
Given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, compute $P(O\mid\lambda)$. The most direct way is to compute it straight from the definition of probability: enumerate all possible state sequences $I=(i_1,i_2,\dots,i_T)$ of length $T$, compute the joint probability $P(O,I\mid\lambda)$ of each state sequence $I$ with the observation sequence $O=(o_1,o_2,\dots,o_T)$, and then sum over all possible state sequences to obtain $P(O\mid\lambda)$.

The probability of a state sequence $I=(i_1,i_2,\dots,i_T)$ is

$$P(I\mid\lambda)=\pi_{i_1}a_{i_1i_2}a_{i_2i_3}\cdots a_{i_{T-1}i_T}$$

For a fixed state sequence $I=(i_1,i_2,\dots,i_T)$, the probability of the observation sequence $O=(o_1,o_2,\dots,o_T)$ is

$$P(O\mid I,\lambda)=b_{i_1}(o_1)b_{i_2}(o_2)\cdots b_{i_T}(o_T)$$

Then, summing $P(O\mid I,\lambda)P(I\mid\lambda)$ over all possible state sequences $I$ gives the probability $P(O\mid\lambda)$ of the observation sequence $O$, i.e.

$$P(O\mid\lambda)=\sum_IP(O\mid I,\lambda)P(I\mid\lambda)=\sum_{i_1,i_2,\dots,i_T}\pi_{i_1}b_{i_1}(o_1)a_{i_1i_2}b_{i_2}(o_2)\cdots a_{i_{T-1}i_T}b_{i_T}(o_T)$$

However, this computation is enormous: its time complexity is of order $O(TN^T)$, so the algorithm is not feasible
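As a sanity check on the formula above, here is a brute-force sketch of the direct method, reusing the four-box model's `pi`, `A`, and `B` from the simulation sketch earlier. The observation sequence `O` is a hypothetical one (red = 0, white = 1). With $N=4$ and $T=5$ this is still cheap, but the $N^T$ enumeration is exactly what makes the method infeasible in general:

```python
from itertools import product

O = [0, 1, 0, 0, 1]   # hypothetical observations: red, white, red, red, white

def direct(A, B, pi, O):
    """P(O|lambda) by enumerating all N^T state sequences (O(T N^T))."""
    N, T = len(pi), len(O)
    total = 0.0
    for I in product(range(N), repeat=T):          # every state sequence
        p = pi[I[0]] * B[I[0], O[0]]               # pi_{i1} * b_{i1}(o1)
        for t in range(1, T):
            p *= A[I[t-1], I[t]] * B[I[t], O[t]]   # a_{i(t-1),it} * b_{it}(ot)
        total += p
    return total

print(direct(A, B, pi, O))   # feasible here only because N = 4 and T = 5
```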
Below are efficient algorithms for computing the observation sequence probability $P(O\mid\lambda)$: the forward algorithm and the backward algorithm
2. Forward algorithm
Let's first define the forward probability. Given the hidden Markov model $\lambda$, the probability that the partial observation sequence up to time $t$ is $o_1,o_2,\dots,o_t$ and the state at time $t$ is $q_i$ is called the forward probability, denoted

$$\alpha_t(i)=P(o_1,o_2,\dots,o_t,i_t=q_i\mid\lambda)$$

The forward probability $\alpha_t(i)$ and the observation sequence probability $P(O\mid\lambda)$ can then be obtained recursively
Input: hidden Markov model $\lambda$, observation sequence $O$

Output: observation sequence probability $P(O\mid\lambda)$

(1) Initial values:

$$\alpha_1(i)=\pi_ib_i(o_1),\quad i=1,2,\dots,N$$

(2) Recursion: for $t=1,2,\dots,T-1$,

$$\alpha_{t+1}(i)=\left[\sum_{j=1}^N\alpha_t(j)a_{ji}\right]b_i(o_{t+1}),\quad i=1,2,\dots,N$$

(3) Termination:

$$P(O\mid\lambda)=\sum_{i=1}^N\alpha_T(i)$$
In the forward algorithm, step (1) initializes the forward probabilities as the joint probability of the state $i_1=q_i$ and the observation $o_1$ at the initial moment. Step (2) is the recursion formula for the forward probability: it computes the forward probability of observing $o_1,o_2,\dots,o_t,o_{t+1}$ up to time $t+1$ and being in state $q_i$ at time $t+1$, as shown in the figure below

In the square brackets of the formula in step (2): since $\alpha_t(j)$ is the forward probability of observing $o_1,o_2,\dots,o_t$ up to time $t$ and being in state $q_j$ at time $t$, the product $\alpha_t(j)a_{ji}$ is the joint probability of observing $o_1,o_2,\dots,o_t$ up to time $t$, being in state $q_j$ at time $t$, and reaching state $q_i$ at time $t+1$. Summing this product over all $N$ possible states $q_j$ at time $t$ gives the joint probability of observing $o_1,o_2,\dots,o_t$ and being in state $q_i$ at time $t+1$. Multiplying the value in the square brackets by the observation probability $b_i(o_{t+1})$ gives exactly the forward probability $\alpha_{t+1}(i)$ of observing $o_1,o_2,\dots,o_t,o_{t+1}$ up to time $t+1$ and being in state $q_i$ at time $t+1$. Step (3) gives the formula for $P(O\mid\lambda)$: because

$$\alpha_T(i)=P(o_1,o_2,\dots,o_T,i_T=q_i\mid\lambda)$$

it follows that

$$P(O\mid\lambda)=\sum_{i=1}^N\alpha_T(i)$$
For example, consider the box-and-ball model $\lambda=(A,B,\pi)$ with state set $Q=\{1,2,3\}$ and observation set $V=\{\text{red},\text{white}\}$.

With $T=3$ and $O=(\text{red},\text{white},\text{red})$, use the forward algorithm to compute $P(O\mid\lambda)$.
**Solution:** according to the forward algorithm:
(1) Calculate the initial value
(2) Recursive calculation
(3) Termination
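Here is a minimal forward-algorithm sketch for this example. The article's $A$, $B$, and $\pi$ for the three-box model were shown in images, so the values below are my assumption based on the classic version of this example; with them the result is approximately 0.13022:

```python
import numpy as np

A  = np.array([[0.5, 0.2, 0.3],     # assumed parameters (the originals
               [0.3, 0.5, 0.2],     # were shown in images in the article)
               [0.2, 0.3, 0.5]])
B  = np.array([[0.5, 0.5],          # columns: P(red), P(white)
               [0.4, 0.6],
               [0.7, 0.3]])
pi = np.array([0.2, 0.4, 0.4])
O  = [0, 1, 0]                      # red, white, red

def forward(A, B, pi, O):
    """Forward algorithm: returns P(O | lambda)."""
    alpha = pi * B[:, O[0]]               # (1) alpha_1(i) = pi_i * b_i(o1)
    for o in O[1:]:                       # (2) recursion over t
        alpha = (alpha @ A) * B[:, o]     #     [sum_j alpha_t(j) a_ji] * b_i(o)
    return alpha.sum()                    # (3) P(O|lambda) = sum_i alpha_T(i)

print(forward(A, B, pi, O))   # ~0.13022 with the assumed parameters
```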
3. Backward algorithm
Given the hidden Markov model $\lambda$, the probability that, given the state $q_i$ at time $t$, the partial observation sequence from $t+1$ to $T$ is $o_{t+1},o_{t+2},\dots,o_T$ is called the backward probability, denoted

$$\beta_t(i)=P(o_{t+1},o_{t+2},\dots,o_T\mid i_t=q_i,\lambda)$$

The backward probability $\beta_t(i)$ and the observation sequence probability $P(O\mid\lambda)$ can likewise be obtained recursively
Input: hidden Markov model $\lambda$, observation sequence $O$

Output: observation sequence probability $P(O\mid\lambda)$

(1) Initial values:

$$\beta_T(i)=1,\quad i=1,2,\dots,N$$

(2) Recursion: for $t=T-1,T-2,\dots,1$,

$$\beta_t(i)=\sum_{j=1}^Na_{ij}b_j(o_{t+1})\beta_{t+1}(j),\quad i=1,2,\dots,N$$

(3) Termination:

$$P(O\mid\lambda)=\sum_{i=1}^N\pi_ib_i(o_1)\beta_1(i)$$
Step (1) initializes the backward probabilities by setting $\beta_T(i)=1$ for every state $q_i$ at the final moment. Step (2) is the recursion formula for the backward probability, as shown in the figure below

To compute the backward probability $\beta_t(i)$ of the observation sequence $o_{t+1},o_{t+2},\dots,o_T$ after time $t+1$, given that the state at time $t$ is $q_i$, we only need to consider, over all $N$ possible states $q_j$ at time $t+1$: the transition probability (the $a_{ij}$ term), the probability of observing $o_{t+1}$ in that state (the $b_j(o_{t+1})$ term), and then the backward probability of the observation sequence after state $q_j$ (the $\beta_{t+1}(j)$ term). Step (3), which gives $P(O\mid\lambda)$, follows the same idea as step (2), except that the initial probability $\pi_i$ takes the place of the transition probability
Using the definitions of the forward and backward probabilities, the observation sequence probability $P(O\mid\lambda)$ can be written uniformly as

$$P(O\mid\lambda)=\sum_{i=1}^N\sum_{j=1}^N\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j),\quad t=1,2,\dots,T-1$$
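Continuing the forward-algorithm sketch above (same assumed `A`, `B`, `pi`, and `O`), here is a minimal backward-algorithm sketch; it should return the same probability as the forward pass, which is a handy consistency check:

```python
import numpy as np

def backward(A, B, pi, O):
    """Backward algorithm: returns P(O | lambda)."""
    beta = np.ones(len(pi))                # (1) beta_T(i) = 1
    for o in reversed(O[1:]):              # (2) recursion, t = T-1, ..., 1
        beta = A @ (B[:, o] * beta)        #     sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return np.sum(pi * B[:, O[0]] * beta)  # (3) sum_i pi_i b_i(o1) beta_1(i)

print(backward(A, B, pi, O))   # matches forward(A, B, pi, O)
```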