
[This article is a translation of Section 8.4 of “Speech and Language Processing” (SLP3), with some of the translator’s own notes added.]

1. Markov Chains

Markov assumption: when predicting the future, the past does not matter; only the present does.


$$P(q_i = a \mid q_1 \ldots q_{i-1}) = P(q_i = a \mid q_{i-1}) \qquad (8.3)$$

A Markov chain is mainly composed of the following parts:

| Formula | Note |
| --- | --- |
| $Q = q_1 q_2 \ldots q_N$ | A set of $N$ states |
| $A = a_{11} a_{12} \ldots a_{N1} \ldots a_{NN}$ | The transition probability matrix: $a_{ij}$ is the probability of going from state $i$ to state $j$ |
| $\pi = \pi_1, \pi_2, \ldots, \pi_N$ | The initial probability distribution over the states: $\pi_i$ is the probability that the Markov chain starts in state $i$. Some states $j$ may have $\pi_j = 0$, meaning they cannot be initial states of the Markov chain. Also $\sum_{i=1}^{N} \pi_i = 1$ |

Example: weather event sequences

```
State = [hot, cold, warm]
π = [0.1, 0.7, 0.2]   // the probability that the initial state of the Markov chain is state 1 (hot) is 0.1
```

Transition matrix A:

| state (i \ j) | hot | cold | warm |
| --- | --- | --- | --- |
| hot | 0.6 | 0.1 | 0.3 |
| cold | 0.1 | 0.8 | 0.1 |
| warm | 0.3 | 0.1 | 0.6 |

Q: Use the probabilities above to compute the probability of each of the following sequences (a short code sketch that does this follows below):

```
hot hot hot hot
cold hot cold hot
```
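The probability of a state sequence under a Markov chain is simply the initial probability of the first state multiplied by the transition probabilities along the sequence. Below is a minimal Python sketch (a translator-style illustration, using the state names and numbers from the tables above):

```python
# Minimal sketch: scoring a state sequence under the weather Markov chain above.
pi = {"hot": 0.1, "cold": 0.7, "warm": 0.2}
A = {
    "hot":  {"hot": 0.6, "cold": 0.1, "warm": 0.3},
    "cold": {"hot": 0.1, "cold": 0.8, "warm": 0.1},
    "warm": {"hot": 0.3, "cold": 0.1, "warm": 0.6},
}

def sequence_probability(seq):
    """P(q_1 ... q_T) = pi[q_1] * prod_t A[q_{t-1}][q_t]."""
    prob = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= A[prev][cur]
    return prob

print(sequence_probability(["hot", "hot", "hot", "hot"]))    # 0.1 * 0.6^3 = 0.0216
print(sequence_probability(["cold", "hot", "cold", "hot"]))  # 0.7 * 0.1 * 0.1 * 0.1 = 0.0007
```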

2. The Hidden Markov Model

We can easily use a Markov chain to compute the probability of a sequence of observable events. In some cases, however, the events we are interested in are hidden and cannot be observed directly. For example, in text we observe the words directly, but we cannot directly see their part-of-speech tags; we have to infer the tags from the word sequence. We call these tags hidden because they cannot be observed directly.

A hidden Markov model (HMM) allows us to talk about both observed events (such as the words we see in the input) and hidden events (such as part-of-speech tags) that we think of as causal factors in our probabilistic model.

2.1 HMM Components

The HMM consists of the following parts:

| Formula | Note |
| --- | --- |
| $Q = q_1 q_2 \ldots q_N$ | A set of $N$ states |
| $A = a_{11} a_{12} \ldots a_{N1} \ldots a_{NN}$ | The transition probability matrix: $a_{ij}$ is the probability of going from state $i$ to state $j$, such that for any $i$, $\sum_{j=1}^{N} a_{ij} = 1$ |
| $O = o_1 o_2 \ldots o_T$ | A sequence of $T$ observations, each drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$ |
| $B = b_i(o_t)$ | A sequence of observation likelihoods, also called emission probabilities: $b_i(o_t)$ is the probability of observation $o_t$ being generated from state $q_i$ |
| $\pi = \pi_1, \pi_2, \ldots, \pi_N$ | The initial probability distribution over the states: $\pi_i$ is the probability that the Markov chain starts in state $i$. Some states $j$ may have $\pi_j = 0$, meaning they cannot be initial states of the Markov chain. Also $\sum_{i=1}^{N} \pi_i = 1$ |
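Taken together, an HMM is fully specified by $(Q, A, B, \pi)$. Below is a minimal sketch of how these pieces might be held in code (just an illustrative container added by the translator, not anything from the book); tuple-keyed dictionaries are used so the later sketches can share the same format:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class HMM:
    states: List[str]                    # Q: the N hidden states
    pi: Dict[str, float]                 # pi[i]      = initial probability of state i
    A: Dict[Tuple[str, str], float]      # A[(i, j)]  = P(state j | previous state i)
    B: Dict[Tuple[str, str], float]      # B[(i, o)]  = P(observation o | state i)
```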

A first-order hidden Markov model makes two simplifying assumptions:

  1. As with first-order Markov chains, the probability of a particular state depends only on the previous state:


$$P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1}) \qquad (8.6)$$

  2. The probability of an output observation $o_i$ depends only on the state $q_i$ that produced it, and not on any other states or observations:


$$P(o_i \mid q_1, \ldots, q_i, \ldots, q_T, o_1, \ldots, o_i, \ldots, o_T) = P(o_i \mid q_i) \qquad (8.7)$$

2.2 Components of an HMM tagger

An HMM tagger has two components: the A matrix and the B matrix.

The A matrix contains the tag transition probabilities $P(t_i \mid t_{i-1})$: the probability of the current tag given the previous tag. For example, a modal verb such as will is very likely to be followed by a verb in its base form (VB), such as race, so we expect this probability to be high. We compute the maximum likelihood estimate of this transition probability by counting, in a labeled corpus, how often the first tag is followed by the second:


$$P(t_i \mid t_{i-1}) = \dfrac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad (8.8)$$

Part-of-speech tag set: www.ibm.com/docs/zh/wca…

For example, in the WSJ corpus, MD occurs 13,124 times, and it is followed by VB 10,471 times, so its MLE is estimated as


$$P(VB \mid MD) = \dfrac{C(MD, VB)}{C(MD)} = \dfrac{10471}{13124} = 0.80 \qquad (8.9)$$
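In code, the MLE of Eq. 8.8 is just a ratio of counts over tag bigrams. A rough sketch follows (the function name and corpus format are the translator’s assumptions; the input would be a tagged corpus such as WSJ, which is not reproduced here):

```python
from collections import Counter

def transition_mle(tag_sequences):
    """Estimate P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) from lists of tag sequences."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {(prev, cur): count / context_counts[prev]
            for (prev, cur), count in bigram_counts.items()}

# With counts like C(MD) = 13124 and C(MD, VB) = 10471, this gives
# P(VB | MD) = 10471 / 13124 ≈ 0.80, as in Eq. 8.9.
```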

Examples of part-of-speech tagging tasks

In an HMM tagger, the probabilities are estimated by counting over a tagged training corpus. In this example we use the tagged WSJ corpus.

The B matrix contains the emission probabilities $P(w_i \mid t_i)$: the probability that, given a tag $t_i$ (for example MD), the corresponding word is a given word $w_i$ (for example will). The MLE of the emission probability is


$$P(w_i \mid t_i) = \dfrac{C(t_i, w_i)}{C(t_i)} \qquad (8.10)$$

Among the 13,124 occurrences of MD in the WSJ corpus, the corresponding word is will 4,046 times:


$$P(will \mid MD) = \dfrac{C(MD, will)}{C(MD)} = \dfrac{4046}{13124} = 0.31 \qquad (8.11)$$
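The emission probabilities of Eq. 8.10 can be estimated in the same way, from (word, tag) co-occurrence counts. Another rough sketch under the same assumptions:

```python
from collections import Counter

def emission_mle(tagged_sentences):
    """Estimate P(w_i | t_i) = C(t_i, w_i) / C(t_i) from sentences of (word, tag) pairs."""
    pair_counts = Counter()
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return {(tag, word): count / tag_counts[tag]
            for (tag, word), count in pair_counts.items()}

# With C(MD) = 13124 and C(MD, will) = 4046, this gives
# P(will | MD) = 4046 / 13124 ≈ 0.31, matching Eq. 8.11.
```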

Remember, this likelihood term is not asking “What is the most likely tag for the word will?” That would be the posterior probability $P(MD \mid will)$. Instead, $P(will \mid MD)$ answers the slightly counterintuitive question: “If we were going to generate an MD, how likely is it that this modal would be will?”

As shown in the figure, by analogy with the Markov chain there are three hidden states: VB (verb), MD (modal verb), and NN (noun). The observations $o_i$ are words such as “will” and “back”. The figure also shows the transition probability matrix $A$ and the observation likelihoods $B$.

For models that contain hidden variables, such as the HMM, decoding is the task of determining the sequence of hidden states corresponding to a sequence of observations, i.e.:

Decoding: Given an HMM $\lambda = (A, B)$ and an observation sequence $O = o_1, o_2, \ldots, o_T$ as input, find the most probable state sequence $Q = q_1 q_2 q_3 \ldots q_T$.

For part-of-speech tagging, the goal of HMM decoding is: given a sequence of $n$ words $w_1 \ldots w_n$, choose the most probable tag sequence $t_1 \ldots t_n$:


$$\hat{t}_{1:n} = \argmax_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \qquad (8.12)$$

In the HMM we instead use Bayes’ rule to rewrite this as:


$$\hat{t}_{1:n} = \argmax_{t_1 \ldots t_n} \dfrac{P(w_1 \ldots w_n \mid t_1 \ldots t_n) P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \qquad (8.13)$$

Furthermore, we can drop the denominator $P(w_1^n)$, since it is the same for every tag sequence and does not change the argmax:


$$\hat{t}_{1:n} = \argmax_{t_1 \ldots t_n} P(w_1 \ldots w_n \mid t_1 \ldots t_n) P(t_1 \ldots t_n) \qquad (8.14)$$

The HMM tagger makes two further simplifying assumptions:

  1. The probability of a word depends only on its own tag and is independent of neighboring words and tags:


$$P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \qquad (8.15)$$

  2. The probability of a tag depends only on the previous tag, rather than on the entire tag sequence (the bigram assumption):


$$P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \qquad (8.16)$$

Plugging these two simplifications into Eq. 8.14 gives the following equation for the most probable tag sequence from a bigram tagger:


$$\hat{t}_{1:n} = \argmax_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \argmax_{t_1 \ldots t_n} \prod_{i=1}^{n} \overbrace{P(w_i \mid t_i)}^{\text{emission}} \; \overbrace{P(t_i \mid t_{i-1})}^{\text{transition}} \qquad (8.17)$$
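Read literally, Eq. 8.17 says: score every candidate tag sequence by the product of its emission and transition probabilities and keep the best one. The brute-force sketch below does exactly that, purely to make the formula concrete; the `trans` and `emit` dictionaries are assumed to be the tuple-keyed tables produced by the estimation sketches above, with a `<s>` start symbol for the first transition:

```python
from itertools import product

def brute_force_tag(words, tagset, trans, emit, start="<s>"):
    """Exhaustively score every tag sequence with prod_i P(w_i | t_i) * P(t_i | t_{i-1})."""
    best_tags, best_score = None, 0.0
    for tags in product(tagset, repeat=len(words)):   # exponential -- illustration only
        score, prev = 1.0, start
        for word, tag in zip(words, tags):
            # emission probability times transition probability for this position
            score *= emit.get((tag, word), 0.0) * trans.get((prev, tag), 0.0)
            prev = tag
        if score > best_score:
            best_tags, best_score = tags, score
    return best_tags
```

For a 5-word sentence and a 7-tag tagset this already enumerates $7^5 = 16807$ sequences, which is why the Viterbi algorithm of the next section is used in practice.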

2.3 The Viterbi Algorithm

The decoding algorithm for HMMs is the Viterbi algorithm, shown in the figure below. As an instance of dynamic programming, the Viterbi algorithm resembles the dynamic-programming minimum edit distance algorithm of Chapter 2 of SLP3.

The Viterbi algorithm first builds a probability matrix, or lattice, with one column for each observation $o_t$ and one row for each state in the state graph. Each column (each observation $o_t$) therefore has a cell for each state $q_i$ of the single combined automaton. The figure below shows the lattice for the sentence Janet will back the bill.

Each cell of the lattice, $v_t(j)$, represents the probability that the HMM $\lambda$ is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_1, \ldots, q_{t-1}$. The value of each cell $v_t(j)$ is computed by recursively taking the most probable path into that cell. Formally, the probability of each cell is


$$v_t(j) = \max_{q_1 \ldots q_{t-1}} P(q_1 \ldots q_{t-1}, o_1, o_2 \ldots o_t, q_t = j \mid \lambda) \qquad (8.18)$$

The maximization over all possible previous state sequences, $\max_{q_1 \ldots q_{t-1}}$, expresses taking the most probable path. Like other dynamic programming algorithms, Viterbi fills in each cell recursively. Given that we have already computed the probability of being in every state at time $t-1$, the Viterbi probability at time $t$ is the maximum over the extensions of all the paths that lead to the current cell. For a given state $q_j$ at time $t$, the value $v_t(j)$ is computed as


$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i) \, a_{ij} \, b_j(o_t) \qquad (8.19)$$

| Formula | Note |
| --- | --- |
| $v_{t-1}(i)$ | The Viterbi path probability from the previous time step |
| $a_{ij}$ | The transition probability from previous state $q_i$ to current state $q_j$ |
| $b_j(o_t)$ | The observation likelihood of observation $o_t$ given the current state $j$ |
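Putting the recurrence of Eq. 8.19 together with backpointers gives the usual Viterbi decoder. The sketch below is one possible implementation (a translator’s illustration, not the book’s pseudocode); `pi` maps each tag to its initial probability, and `A` and `B` are tuple-keyed dictionaries in the format of the earlier estimation sketches:

```python
def viterbi(words, tagset, pi, A, B):
    """Most probable tag sequence for `words` under HMM parameters (pi, A, B).

    pi: dict tag -> initial probability
    A : dict (prev_tag, tag) -> transition probability
    B : dict (tag, word)     -> emission probability
    """
    # Initialization: v_1(j) = pi_j * b_j(o_1)
    V = [{tag: pi.get(tag, 0.0) * B.get((tag, words[0]), 0.0) for tag in tagset}]
    backptr = [{tag: None for tag in tagset}]
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(words)):
        V.append({})
        backptr.append({})
        for tag in tagset:
            best_prev = max(tagset, key=lambda prev: V[t - 1][prev] * A.get((prev, tag), 0.0))
            best_prob = V[t - 1][best_prev] * A.get((best_prev, tag), 0.0)
            V[t][tag] = best_prob * B.get((tag, words[t]), 0.0)
            backptr[t][tag] = best_prev
    # Termination: pick the best final state, then follow the backpointers.
    best_last = max(tagset, key=lambda tag: V[-1][tag])
    tags = [best_last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(backptr[t][tags[-1]])
    return list(reversed(tags))
```

Run on ["Janet", "will", "back", "the", "bill"] with the probabilities of Figures 8.12 and 8.13, the backtrace should return the tag sequence NNP MD VB DT NN described below.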

Example: Janet will back the bill. The goal is to obtain the correct tag sequence:

```
Janet/NNP will/MD back/VB the/DT bill/NN
```

The figure above (Figure 8.11) shows the possible tags for each word and highlights the correct part-of-speech sequence for the sentence among these hidden states. Tag (state) to word (observation) pairs with probability 0 are shown in gray (for example, DT cannot produce Janet).

Now let’s work through the example in detail.

Figure 8.12 shows the matrix of transition probabilities $a_{ij}$ between the states:

For example, the probability that one tag is MD and the next is VB is 0.7968:


$$P(VB \mid MD) = 0.7968$$

Figure 8.13 shows the probabilities $b_i(o_t)$: the likelihood of an observation given a tag.

Figure 8.14 is a fleshed-out version of Figure 8.11: the Viterbi lattice used to compute the best hidden state sequence for the observation sequence “Janet will back the bill”.

There are $N = 5$ state columns. Starting in column 1 (for the word Janet), we set the Viterbi value of each cell to the product of the $\pi$ transition probability (the initial probability of state $i$, which we can read from the $\langle s \rangle$ row of Figure 8.12), for example $P(NNP \mid start) = 0.2767$ (the probability that NNP is the starting tag is 0.2767), and the observation likelihood of the word Janet given that cell’s tag. Most of the cells in this column are zero, for example $v_1(7) = P(Janet \mid DT) = 0$, since the word Janet cannot be any of those tags. The reader should check this against Figure 8.14.

Next, each cell in the will column is updated. For each state, we compute $viterbi[s, t]$ according to Eq. 8.19, taking the maximum over the extensions of all the paths from the previous column that lead to the current cell. Values are shown for the MD, VB, and NN cells. Each cell takes the maximum of the 7 values from the previous column, each multiplied by the corresponding transition probability; most of these are zero in this case. That maximum is then multiplied by the observation likelihood of will given the current tag. In this case the final value, 2.772e-8, comes from the NNP state of the previous column. The reader can fill in the remaining cells of Figure 8.14 and backtrace to confirm that the Viterbi algorithm returns the correct state sequence NNP MD VB DT NN.