1.2 Probability Theory

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements and through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations of pattern recognition. When combined with decision theory, discussed in Section 1.5, it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.

We will introduce the basic concepts of probability theory through a simple example. Suppose we have two boxes, one red and one blue; in the red box we have two apples and six oranges, and in the blue box we have three apples and one orange. This is illustrated in Figure 1.9. Now suppose we randomly pick one of the boxes, and from that box we randomly select an item of fruit; having observed which sort of fruit it is, we replace it in the box from which it came. We can imagine repeating this process many times. Let us suppose that in doing so we pick the red box 40 percent of the time and the blue box 60 percent of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

Figure 1.9 We use a simple example of two colored boxes, each containing fruit (apples shown in green, oranges shown in orange), to introduce the basic concepts of probability.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by $B$. This random variable can take one of two possible values, namely $r$ (corresponding to the red box) or $b$ (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by $F$. It can take either of the values $a$ (for apple) or $o$ (for orange).

To begin with, we define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is $4/10$, and the probability of selecting the blue box is $6/10$. We write these as $p(B = r) = 4/10$ and $p(B = b) = 6/10$. Note that, by definition, probabilities must lie in the interval $[0, 1]$. Also, if the events are mutually exclusive and if they include all possible outcomes (for instance, in this example the box must be either red or blue), then the probabilities of those events must sum to one.
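As a quick illustration of this frequency definition, the short Python sketch below (our own, not part of the original example) repeatedly selects a box with probability $4/10$ for red and $6/10$ for blue and prints the empirical fraction of red selections, which approaches $p(B = r) = 4/10$ as the number of trials grows.

```python
import random

# Simulate choosing the red box with probability 4/10 and the blue box with 6/10,
# and watch the empirical fraction of red selections approach p(B = r) = 0.4.
random.seed(0)

for n_trials in (100, 10_000, 1_000_000):
    red_count = sum(1 for _ in range(n_trials) if random.random() < 0.4)
    print(f"N = {n_trials:>9}: fraction of red selections = {red_count / n_trials:.4f}")
```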

We can now ask questions such as: "What is the overall probability that the selection procedure will pick an apple?" or "Given that we have chosen an orange, what is the probability that the box we chose was the blue one?". We can answer questions such as these, and indeed far more complex questions associated with pattern recognition problems, once we have mastered the two elementary rules of probability, the sum rule and the product rule. Having obtained these rules, we shall return to our boxes of fruit example.

In order to derive the rules of probability, consider the slightly more general example shown in Figure 1.10 involving two random variables $X$ and $Y$ (which could, for instance, be the box and fruit variables considered above). We shall suppose that $X$ can take any of the values $x_i$, where $i = 1, \ldots, M$, and that $Y$ can take the values $y_j$, where $j = 1, \ldots, L$. Consider a total of $N$ trials in which we sample both of the variables $X$ and $Y$, and let the number of trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Similarly, let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value of $Y$) be denoted by $c_i$, and let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

Figure 1.10 We can derive the sum and product rules of probability by considering two random variables, $X$, taking the values $\{x_i\}$ where $i = 1, \ldots, M$, and $Y$, taking the values $\{y_j\}$ where $j = 1, \ldots, L$. In this illustration we have $M = 5$ and $L = 3$. If we consider a total number $N$ of instances of these variables, then we denote the number of instances where $X = x_i$ and $Y = y_j$ by $n_{ij}$, which is the number of points in the corresponding cell of the array. The number of points in column $i$, corresponding to $X = x_i$, is denoted $c_i$, and the number of points in row $j$, corresponding to $Y = y_j$, is denoted $r_j$.

The probability that $X$ takes the value $x_i$ and $Y$ takes the value $y_j$ is written $p(X = x_i, Y = y_j)$ and is called the joint probability of $X = x_i$ and $Y = y_j$. It is given by the number of points falling in the cell $i, j$ as a fraction of the total number of points, and hence


$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} \tag{1.5}$$

Here we are implicitly considering the limit $N \rightarrow \infty$. Similarly, the probability that $X$ takes the value $x_i$ irrespective of the value of $Y$ is written as $p(X = x_i)$ and is given by the fraction of the total number of points that fall in column $i$, so that


$$p(X = x_i) = \frac{c_i}{N} \tag{1.6}$$

Because the number of instances in column $i$ in Figure 1.10 is just the sum of the number of instances in each cell of that column, we have $c_i = \sum_j n_{ij}$, and therefore, from (1.5) and (1.6), we have


$$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) \tag{1.7}$$

This is the sum rule of probability. Note that $p(X = x_i)$ is sometimes called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case $Y$).

If we consider only those instances for which $X = x_i$, then the fraction of such instances for which $Y = y_j$ is written $p(Y = y_j \mid X = x_i)$ and is called the conditional probability of $Y = y_j$ given $X = x_i$. It is obtained by finding the fraction of those points in column $i$ that fall in cell $i, j$, and hence is given by


$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i} \tag{1.8}$$

From (1.5), (1.6), and (1.8), we can then derive the following relationship


$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i) \tag{1.9}$$

This is the product rule of probability.

So far we have been quite careful to make a distinction between a random variable, such as the box $B$ in the fruit example, and the values that the random variable can take, for example $r$ if the box were the red one. Thus the probability that $B$ takes the value $r$ is written $p(B = r)$. Although this helps to avoid ambiguity, it leads to a rather cumbersome notation, and in many cases there will be no need for such pedantry. Instead, we may simply write $p(B)$ to denote a distribution over the random variable $B$, or $p(r)$ to denote the distribution evaluated for the particular value $r$, provided that the interpretation is clear from the context.

With this more compact notation, we can write down the two basic rules of probability theory in the following form.

The Rules of Probability

Sum rule $$p(X) = \sum_Y p(X, Y) \tag{1.10}$$

Product rule $$p(X, Y) = p(Y \mid X)\, p(X) \tag{1.11}$$

Here $p(X, Y)$ is a joint probability and is verbalized as "the probability of $X$ and $Y$". Similarly, the quantity $p(Y \mid X)$ is a conditional probability and is verbalized as "the probability of $Y$ given $X$", whereas the quantity $p(X)$ is a marginal probability and is simply "the probability of $X$". These two simple rules form the basis for all of the probabilistic machinery that we use throughout this book.
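As a concrete illustration of these two rules, the following sketch (using an arbitrary count table of our own, in the spirit of Figure 1.10, rather than any data from the text) builds the fractions $n_{ij}/N$ and checks the sum rule (1.7) and the product rule (1.9) numerically.

```python
import numpy as np

# An arbitrary table of counts n_ij in the spirit of Figure 1.10
# (L = 3 rows for the values of Y, M = 5 columns for the values of X).
n_ij = np.array([[3, 7, 2, 5, 1],
                 [4, 6, 8, 2, 3],
                 [1, 2, 5, 6, 5]], dtype=float)
N = n_ij.sum()

joint = n_ij / N            # p(X = x_i, Y = y_j), equation (1.5)
p_x = joint.sum(axis=0)     # sum rule (1.7): marginal p(X = x_i)
p_y_given_x = joint / p_x   # conditional p(Y = y_j | X = x_i), equation (1.8)

# Product rule (1.9): the joint equals the conditional times the marginal.
assert np.allclose(joint, p_y_given_x * p_x)
print("marginal p(X):", p_x)
```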

From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain the following relationship between conditional probabilities


$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} \tag{1.12}$$

This relationship is known as Bayes' theorem and plays a central role in pattern recognition and machine learning. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of the quantities appearing in the numerator


$$p(X) = \sum_Y p(X \mid Y)\, p(Y) \tag{1.13}$$

We can view the denominator in Bayes' theorem as the normalization constant required to ensure that the conditional probability on the left-hand side of (1.12), summed over all values of $Y$, equals one.

In Figure 1.11 we show a simple example involving a joint distribution over two variables to illustrate the concepts of marginal and conditional distributions. A finite sample of $N = 60$ data points has been drawn from the joint distribution and is shown in the top left panel. In the top right panel is a histogram of the fractions of data points having each of the two values of $Y$. From the definition of probability, these fractions would equal the corresponding probabilities $p(Y)$ in the limit $N \rightarrow \infty$. We can view the histogram as a simple way to model a probability distribution given only a finite number of points drawn from that distribution. Modelling distributions from data lies at the heart of statistical pattern recognition and will be explored in great detail in this book. The remaining two panels in Figure 1.11 show the corresponding histogram estimates of $p(X)$ and $p(X \mid Y = 1)$.

Figure 1.11 An illustration of a distribution over two variables, $X$, which takes 9 possible values, and $Y$, which takes 2 possible values. The top left panel shows a sample of 60 points drawn from the joint probability distribution of these variables. The remaining panels show histogram estimates of the marginal distributions $p(X)$ and $p(Y)$, as well as of the conditional distribution $p(X \mid Y = 1)$, which corresponds to a single row of the top left panel.
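The construction behind Figure 1.11 can be mimicked with a few lines of code. The sketch below is our own illustration; the joint distribution used to produce the actual figure is not given, so an arbitrary one is made up here. It draws $N = 60$ points and forms histogram estimates of $p(X)$, $p(Y)$, and $p(X \mid Y = 1)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary joint distribution p(X, Y) with 9 values of X and 2 values of Y
# (the distribution behind Figure 1.11 is not specified, so this one is made up).
joint = rng.random((2, 9))
joint /= joint.sum()

# Draw N = 60 samples of the pair (Y, X) from the joint distribution.
N = 60
flat = rng.choice(joint.size, size=N, p=joint.ravel())
y_samples, x_samples = np.unravel_index(flat, joint.shape)

# Histogram (fraction-of-points) estimates of the marginals and of a conditional.
p_x_est = np.bincount(x_samples, minlength=9) / N
p_y_est = np.bincount(y_samples, minlength=2) / N
mask = (y_samples == 0)                               # let index 0 play the role of Y = 1
p_x_given_y1_est = np.bincount(x_samples[mask], minlength=9) / mask.sum()

print("estimate of p(X):        ", np.round(p_x_est, 2))
print("estimate of p(Y):        ", np.round(p_y_est, 2))
print("estimate of p(X | Y = 1):", np.round(p_x_given_y1_est, 2))
```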

Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue box are given, respectively, by


$$p(B = r) = 4/10 \tag{1.14}$$

$$p(B = b) = 6/10 \tag{1.15}$$

Note that these satisfy $p(B = r) + p(B = b) = 1$.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is $3/4$, and so $p(F = a \mid B = b) = 3/4$. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box


$$p(F = a \mid B = r) = 1/4 \tag{1.16}$$

$$p(F = o \mid B = r) = 3/4 \tag{1.17}$$

$$p(F = a \mid B = b) = 3/4 \tag{1.18}$$

$$p(F = o \mid B = b) = 1/4 \tag{1.19}$$

Notice that these probabilities are normalized, so


$$p(F = a \mid B = r) + p(F = o \mid B = r) = 1 \tag{1.20}$$

and similarly

$$p(F = a \mid B = b) + p(F = o \mid B = b) = 1 \tag{1.21}$$

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple


$$p(F = a) = p(F = a \mid B = r)\, p(B = r) + p(F = a \mid B = b)\, p(B = b) = \frac{1}{4} \times \frac{4}{10} + \frac{3}{4} \times \frac{6}{10} = \frac{11}{20} \tag{1.22}$$

from which it follows, using the sum rule, that $p(F = o) = 1 - 11/20 = 9/20$.
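The arithmetic in (1.22), and the value of $p(F = o)$ that follows from it, can be checked with a few lines of Python (a sketch of our own, using only the probabilities given in (1.14)–(1.19)).

```python
# Prior probabilities of the boxes, equations (1.14)-(1.15), and conditional
# probabilities of the fruit given the box, equations (1.16)-(1.19).
p_box = {"r": 4 / 10, "b": 6 / 10}
p_fruit_given_box = {("a", "r"): 1 / 4, ("o", "r"): 3 / 4,
                     ("a", "b"): 3 / 4, ("o", "b"): 1 / 4}

# Sum and product rules, equation (1.22): p(F = a) = sum over B of p(F = a | B) p(B).
p_apple = sum(p_fruit_given_box[("a", box)] * p_box[box] for box in p_box)
print(p_apple)        # 0.55, i.e. 11/20
print(1 - p_apple)    # 0.45, i.e. 9/20 = p(F = o)
```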

Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve this problem of reversing the conditional probability by using Bayes' theorem, which gives


$$p(B = r \mid F = o) = \frac{p(F = o \mid B = r)\, p(B = r)}{p(F = o)} = \frac{3}{4} \times \frac{4}{10} \times \frac{20}{9} = \frac{2}{3} \tag{1.23}$$

From the sum rule, it then follows that $p(B = b \mid F = o) = 1 - 2/3 = 1/3$.
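Continuing in the same spirit, a minimal numerical check of (1.23) reads as follows (again a sketch of our own, written to be self-contained so it can be run on its own).

```python
# Posterior over the box, given that the selected fruit is an orange (Bayes' theorem, 1.23).
p_box = {"r": 4 / 10, "b": 6 / 10}
p_orange_given_box = {"r": 3 / 4, "b": 1 / 4}

# Denominator p(F = o), obtained from the sum rule as in (1.13).
p_orange = sum(p_orange_given_box[box] * p_box[box] for box in p_box)

posterior = {box: p_orange_given_box[box] * p_box[box] / p_orange for box in p_box}
print(posterior)   # {'r': 0.666..., 'b': 0.333...}, i.e. 2/3 and 1/3
```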

We can interpret Bayes' theorem in the following important way. If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information available to us is the probability $p(B)$. We call this the prior probability because it is the probability available before we observe the identity of the fruit. Once we are told that the fruit is an orange, we can then use Bayes' theorem to compute the probability $p(B \mid F)$, which we call the posterior probability because it is the probability obtained after we have observed $F$. Note that in this example, the prior probability of selecting the red box was $4/10$, so that we were more likely to select the blue box than the red one. However, once we have observed that the chosen fruit is an orange, we find that the posterior probability of the red box is now $2/3$, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, because the proportion of oranges is much higher in the red box than in the blue box, and so the observation that the fruit was an orange provides significant evidence in favor of the red box; indeed, the evidence is strong enough to outweigh the prior probability and make it more likely that the red box was chosen rather than the blue one.

Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that $p(X, Y) = p(X)\, p(Y)$, then $X$ and $Y$ are said to be independent. From the product rule, we see that $p(Y \mid X) = p(Y)$, and so the conditional distribution of $Y$ given $X$ is indeed independent of the value of $X$. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then $p(F \mid B) = p(F)$, so that the probability of selecting, say, an apple would be independent of which box is chosen.
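To see the independence condition in action, the sketch below (our own variation on the example, with both boxes containing equal proportions of apples and oranges) confirms that the joint distribution then factorizes and that $p(F \mid B)$ coincides with $p(F)$.

```python
# Variation on the fruit example: both boxes contain the same proportions of fruit,
# so F and B are independent and p(F, B) = p(F) p(B).
p_box = {"r": 4 / 10, "b": 6 / 10}
p_fruit_given_box = {"r": {"a": 0.5, "o": 0.5},
                     "b": {"a": 0.5, "o": 0.5}}   # same proportions in both boxes

# Marginal p(F) via the sum rule.
p_fruit = {f: sum(p_fruit_given_box[b][f] * p_box[b] for b in p_box) for f in ("a", "o")}

for b in p_box:
    for f in ("a", "o"):
        joint = p_fruit_given_box[b][f] * p_box[b]          # product rule
        assert abs(joint - p_fruit[f] * p_box[b]) < 1e-12   # factorization p(F, B) = p(F) p(B)

print(p_fruit)   # {'a': 0.5, 'o': 0.5}, equal to p(F | B) for either box
```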