1.2.3 Bayesian probability
So far in this chapter, we have viewed probabilities in terms of the frequencies of random, repeatable events. We shall refer to this as the classical or frequentist interpretation of probability. We now turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.
Consider an uncertain event, for example whether the moon was once in its own orbit around the sun, or whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to define a notion of probability as we did earlier in the context of the boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, such as a new form of diagnostic information gathered from a novel Earth-observation satellite, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gases. In such circumstances, we would like to be able to quantify our expression of uncertainty, to make precise revisions of that uncertainty in the light of new evidence, and subsequently to be able to take optimal actions or decisions as a consequence. All of this can be achieved through the elegant, and very general, Bayesian interpretation of probability.
The use of probability to represent uncertainty, however, is not an ad hoc choice but is inevitable if we are to respect common sense while making rational, coherent inferences. For instance, Cox (1946) showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common-sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. This provided the first rigorous demonstration that probability theory could be regarded as an extension of Boolean logic to situations involving uncertainty. Numerous other authors have proposed different sets of properties or axioms that such measures of uncertainty should satisfy. In each case, the resulting numerical quantities behave precisely according to the rules of probability. It is therefore natural to refer to these quantities as (Bayesian) probabilities.
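For reference, the rules that Cox’s axioms recover are simply the sum and product rules stated earlier in this chapter, written here in the notation used there:

$$
p(X) = \sum_{Y} p(X, Y), \qquad p(X, Y) = p(Y \mid X)\, p(X).
$$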
In the field of pattern recognition, too, it is helpful to have a more general notion of probability. Consider the example of polynomial curve fitting discussed in Section 1.1. It would seem reasonable to apply the frequentist notion of probability to the random values of the observed variables $t_n$. However, we would also like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters $\mathbf{w}$. We shall see that, from a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters such as $\mathbf{w}$, or indeed in the choice of model itself.
Bayes’ theorem now acquires a new significance. Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data. As we shall see in detail later, we can adopt a similar approach when making inferences about quantities such as the parameters $\mathbf{w}$ in the polynomial curve fitting example. We capture our assumptions about $\mathbf{w}$, before observing the data, in the form of a prior probability distribution $p(\mathbf{w})$. The effect of the observed data $\mathcal{D} = \{t_1, \ldots, t_N\}$ is expressed through the conditional probability $p(\mathcal{D}\mid\mathbf{w})$, and we shall see later, in Section 1.2.5, how this can be represented explicitly. Bayes’ theorem, which takes the form

$$
p(\mathbf{w}\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})} \tag{1.43}
$$

then allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed $\mathcal{D}$, in the form of the posterior probability $p(\mathbf{w}\mid\mathcal{D})$.
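As a small worked instance of (1.43) (a toy illustration of ours, with a single scalar parameter rather than the curve-fitting vector $\mathbf{w}$), let $w \in [0, 1]$ be the unknown probability of heads for a coin, take a uniform prior $p(w) = 1$, and suppose the data $\mathcal{D}$ consist of $N$ tosses of which $m$ came up heads. Then

$$
p(\mathcal{D}\mid w) \propto w^{m}(1-w)^{N-m},
\qquad
p(w\mid\mathcal{D}) = \frac{w^{m}(1-w)^{N-m}}{\int_{0}^{1} u^{m}(1-u)^{N-m}\,\mathrm{d}u}
= \frac{(N+1)!}{m!\,(N-m)!}\, w^{m}(1-w)^{N-m},
$$

so the flat prior is reshaped by the data into a posterior peaked near $m/N$.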
The quantity $p(\mathcal{D}\mid\mathbf{w})$ on the right-hand side of Bayes’ theorem is evaluated for the observed data set $\mathcal{D}$ and can be viewed as a function of the parameter vector $\mathbf{w}$, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector $\mathbf{w}$. Note that the likelihood is not a probability distribution over $\mathbf{w}$, and its integral with respect to $\mathbf{w}$ does not (necessarily) equal one.
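The coin example above makes this point concrete (again our illustration, not the text’s): integrating the likelihood over the parameter gives

$$
\int_{0}^{1} w^{m}(1-w)^{N-m}\,\mathrm{d}w = \frac{m!\,(N-m)!}{(N+1)!},
$$

which for $N = 2$, $m = 1$ equals $1/6$, not $1$. By contrast, for any fixed $w$, the probabilities of all possible data sets do sum to one: $\sum_{m=0}^{N} \binom{N}{m}\, w^{m}(1-w)^{N-m} = 1$.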
Given this definition of the likelihood function, we can state Bayes’ theorem in words:

$$
\text{posterior} \propto \text{likelihood} \times \text{prior},
$$
where all of these quantities are viewed as functions of $\mathbf{w}$. The denominator in (1.43) is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one. Indeed, integrating both sides of (1.43) with respect to $\mathbf{w}$, we can express the denominator in Bayes’ theorem in terms of the prior distribution and the likelihood function:

$$
p(\mathcal{D}) = \int p(\mathcal{D}\mid\mathbf{w})\, p(\mathbf{w})\,\mathrm{d}\mathbf{w}.
$$
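A minimal numerical sketch of this machinery, using the coin-toss toy model introduced above (the data values are arbitrary and chosen only for illustration): the grid sum plays the role of the integral defining $p(\mathcal{D})$, and the resulting posterior integrates to one.

```python
import numpy as np

# Toy data: N coin tosses of which m came up heads (assumed values).
N, m = 10, 7

# Discretize the single parameter w so that integrals become grid sums.
w = np.linspace(0.0, 1.0, 1001)
dw = w[1] - w[0]

prior = np.ones_like(w)                  # uniform prior p(w) on [0, 1]
likelihood = w**m * (1.0 - w)**(N - m)   # p(D|w) for the observed data

# Denominator of Bayes' theorem: p(D) = integral of p(D|w) p(w) dw.
evidence = np.sum(likelihood * prior) * dw

# Posterior p(w|D) = p(D|w) p(w) / p(D).
posterior = likelihood * prior / evidence

print(np.sum(posterior) * dw)   # ~1.0: the posterior is a valid density over w
print(w[np.argmax(posterior)])  # posterior mode, close to m/N = 0.7
```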
In both the Bayesian and the frequentist paradigms, the likelihood function $p(\mathcal{D}\mid\mathbf{w})$ plays a central role. However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, $\mathbf{w}$ is considered to be a fixed parameter whose value is determined by some form of ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets $\mathcal{D}$. By contrast, from the Bayesian viewpoint there is only a single data set $\mathcal{D}$ (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over $\mathbf{w}$.
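The coin-toss illustration makes this contrast explicit (again our example): one common frequentist estimator, maximum likelihood, reports the single value of $w$ that maximizes $p(\mathcal{D}\mid w)$, whereas the Bayesian treatment reports the entire posterior distribution over $w$:

$$
w_{\mathrm{ML}} = \underset{w}{\arg\max}\; p(\mathcal{D}\mid w) = \frac{m}{N}
\qquad \text{versus} \qquad
p(w\mid\mathcal{D}) \propto w^{m}(1-w)^{N-m}.
$$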