Abstract: This article introduces the origins of Bayes' theorem and uses simple examples to explain how Bayesian inference works, covering its basic principles and the meaning of each term in the formula.

How does Bayesian inference work?

Bayesian inference is a way of getting sharper predictions from data. It is particularly useful when you don't have as much data as you would like and want to extract every bit of predictive strength from what you do have.

Although it is sometimes described with an air of reverence, Bayesian inference is neither magic nor mysterious. The mathematics can get detailed, but the idea behind it is entirely accessible. In short, Bayesian inference lets you draw stronger conclusions from your data by folding in what you already know about the answer.

Bayesian inference is based on the ideas of Thomas Bayes, a Nonconformist Presbyterian minister who wrote two books, one on theology and one on probability. His work includes the original form of what is now called Bayes' theorem, applied to the problem of inference, the technical term for making educated guesses. Bayes' ideas were popularized by another minister, Richard Price, who recognized their importance, refined them, and published them. For this reason, Bayes' theorem might more accurately and historically be called the Bayes-Price rule.

Bayesian inference at the movie theater




Imagine a moviegoer has dropped their ticket in the theater. All you can see is the back of their head: you know they have long hair, but you can't tell their gender. To get their attention, do you call out "Excuse me, ma'am" or "Excuse me, sir"? Given what you know about men's and women's hairstyles where you live, you might assume this is a woman. (In this simplification, there are only two hair lengths and two genders.)

Now consider the same person standing in line for the men's restroom. With this extra piece of information, you would probably guess it's a man. This use of common sense and background knowledge happens without a second thought. Bayesian inference captures it mathematically, so that we can make more accurate predictions.




To put numbers to the movie theater dilemma, assume the theater holds about half men and half women: 100 people in all, 50 men and 50 women. Among the women, half (25) have long hair and half (25) have short hair. Among the men, 48 have short hair and 2 have long hair. With 25 long-haired women and only 2 long-haired men, it is a safe bet that the ticket holder is a woman.




Now say 100 people are lined up for the men's restroom: 98 men and 2 women accompanying their partners. One of the two women has long hair and the other has short hair. The proportions of long- and short-haired men are the same as before, but now there are 98 men, so 94 have short hair and 4 have long hair. With 1 long-haired woman and 4 long-haired men, the safe bet is now that the ticket holder is a man. This is a concrete example of the basic principle of Bayesian inference: knowing a key piece of context in advance, that the ticket holder was standing in the men's restroom line, lets us make a better prediction about them.

Making Bayesian inference precise requires four concepts: probability, conditional probability, joint probability, and marginal probability.

Probability




The probability of an outcome is the number of ways that outcome can occur divided by the total number of possible outcomes. The probability that a moviegoer is a woman is 50 women out of 100 moviegoers, which is 0.5, or 50%. The same holds for men.




In the men's restroom line, the breakdown is 0.02 for women and 0.98 for men.
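In code, these probabilities are just counts divided by totals. Here is a minimal sketch, using the hypothetical theater and restroom-line populations described above:

```python
# A minimal sketch: probability as (count of outcome) / (total count),
# using the hypothetical populations from the example above.
theater = {"women": 50, "men": 50}
restroom_line = {"women": 2, "men": 98}

def probability(counts, outcome):
    """P(outcome) = count of outcome / total count."""
    return counts[outcome] / sum(counts.values())

print(probability(theater, "women"))        # 0.5
print(probability(restroom_line, "women"))  # 0.02
```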

Conditional probability




Conditional probability answers the question, "If I know that a person is a woman, what is the probability that she has long hair?" Conditional probabilities are calculated the same way as probabilities, but they look only at the subset of cases that meet a certain condition. In this case, P(long hair | woman), the probability that a person has long hair given that she is a woman, equals the number of women with long hair divided by the total number of women. This is 0.5, whether we look at the men's restroom line or the entire theater.




By the same arithmetic, the conditional probability that a person has long hair given that he is a man, P(long hair | man), is 0.04 (equivalently, P(short hair | man) is 0.96), regardless of whether he is in the line or in the theater.




One of the most important things to remember about conditional probability is that P(A | B) is not the same as P(B | A). For example, P(cute | puppy) is different from P(puppy | cute). If what I'm holding is a puppy, the probability that it is cute is very high. If what I'm holding is cute, the probability that it is a puppy is only moderate: it could be a kitten, a rabbit, or something else.
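The asymmetry is easy to see numerically. Here is a minimal sketch using the theater counts from above:

```python
# A minimal sketch showing that P(A | B) differs from P(B | A),
# using the theater counts from the example above.
long_haired_women, short_haired_women = 25, 25
long_haired_men, short_haired_men = 2, 48

women = long_haired_women + short_haired_women        # 50
long_haired = long_haired_women + long_haired_men     # 27

p_long_given_woman = long_haired_women / women        # 0.5
p_woman_given_long = long_haired_women / long_haired  # ~0.93

print(p_long_given_woman, p_woman_given_long)         # clearly not the same
```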

Joint probability




Joint probability answers the question, "What is the probability that someone is a woman and has short hair?" Finding it is a two-step process. First, take the probability that a person is a woman, P(woman). Then take the probability that she has short hair, given that she is a woman, P(short hair | woman). Multiplying the two gives the joint probability: P(woman with short hair) = P(woman) × P(short hair | woman). Using this approach, we can calculate what we already knew: P(woman with long hair) is 0.25 among all moviegoers, but only 0.01 in the men's restroom line.




P(man with long hair) is 0.02 among all moviegoers, but about 0.04 in the men's restroom line.




Unlike conditional probability, joint probability does not care about order: P(A and B) is the same as P(B and A). The probability of having milk and a jelly doughnut is the same as the probability of having a jelly doughnut and milk.
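A minimal sketch of the two-step multiplication, using the proportions from the example above:

```python
# A minimal sketch of joint probability: P(A and B) = P(A) * P(B | A),
# using the proportions from the example above.
p_woman, p_man = 0.5, 0.5
p_long_given_woman, p_long_given_man = 0.5, 0.04

p_woman_and_long = p_woman * p_long_given_woman   # 0.25 in the theater
p_man_and_long = p_man * p_long_given_man         # 0.02 in the theater

# In the men's restroom line the same formula gives different numbers:
p_woman_and_long_line = 0.02 * 0.5                # 0.01
p_man_and_long_line = 0.98 * 0.04                 # ~0.04
```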

Marginal probability




Marginal probability answers the question, "What is the probability that someone has long hair?" To find it, we add up the probabilities of all the different ways it can happen: men with long hair plus women with long hair. The marginal probability P(long hair) is 0.27 among all moviegoers, but 0.05 in the men's restroom line.
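A minimal sketch of that sum, reusing the joint probabilities computed above:

```python
# A minimal sketch of marginal probability: sum the joint probabilities
# over every way the outcome can occur (numbers from the example above).
p_long_theater = 0.25 + 0.02          # women + men with long hair = 0.27
p_long_line = 0.01 + 0.98 * 0.04      # ~0.05 in the men's restroom line
print(p_long_theater, round(p_long_line, 2))
```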

Bayes’ theorem

What we really care about is answering questions like, "If we know a person has long hair, what is the probability that this person is a woman (or a man)?" This is the conditional probability P(man | long hair). We already know its reverse, P(long hair | man), but because conditional probabilities are not reversible, knowing one does not directly give us the other.

Fortunately, Thomas Bayes noticed something useful about this.




Recalling how joint probabilities are calculated, we can write the joint probability of "man with long hair" in two ways: P(man with long hair) = P(man) × P(long hair | man), and P(long hair and man) = P(long hair) × P(man | long hair). Because joint probability does not care about order, these two expressions are equal.




A bit of algebra then solves for the quantity we want: P(man | long hair) = P(man) × P(long hair | man) / P(long hair).




Replace "man" and "long hair" with "A" and "B", and you get Bayes' theorem: P(A | B) = P(A) × P(B | A) / P(B).
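Written as a tiny Python function, a minimal sketch (the argument names are just mnemonics):

```python
# A minimal sketch of Bayes' theorem: P(A | B) = P(A) * P(B | A) / P(B).
def bayes(p_a, p_b_given_a, p_b):
    """Return the posterior P(A | B)."""
    return p_a * p_b_given_a / p_b

# Theater example: P(woman | long hair) = 0.5 * 0.5 / 0.27 ≈ 0.93
print(bayes(0.5, 0.5, 0.27))
```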




Finally, to resolve the movie ticket dilemma, we apply Bayes' theorem to our question about the ticket holder.




First, we expand the marginal probability: P(long hair) = P(man) × P(long hair | man) + P(woman) × P(long hair | woman).




Then we compute the probability that a person is a man, given that they have long hair. For the crowd in the men's restroom line, P(man | long hair) is about 0.8. This confirms our intuition that the person who lost the ticket is probably a man. Bayes' theorem has captured our intuition about the situation. More importantly, it has incorporated our pre-existing knowledge that there are far more men than women in the men's restroom line, and used that prior knowledge to update our beliefs about the situation.
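Putting the pieces together, here is a minimal sketch of the calculation for the men's restroom line, using the numbers assumed in the example above:

```python
# A minimal sketch of the full calculation for the men's restroom line,
# using the numbers from the example above.
p_man, p_woman = 0.98, 0.02
p_long_given_man, p_long_given_woman = 0.04, 0.5

# Marginal probability of long hair in the line.
p_long = p_man * p_long_given_man + p_woman * p_long_given_woman  # ~0.05

# Bayes' theorem: P(man | long hair) = P(man) * P(long hair | man) / P(long hair).
p_man_given_long = p_man * p_long_given_man / p_long
print(round(p_man_given_long, 2))  # ~0.8
```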

Probability distributions

The movie theater dilemma shows the mechanics of Bayesian inference, but in data science applications it is most often used to interpret measured data. By combining prior knowledge with measurements, stronger conclusions can be drawn from small data sets. The details of how this works are shown below, but first we need to be explicit about what a "probability distribution" is.

Think of probability as a pot of coffee holding exactly enough to fill one cup. If there is only one cup, filling it is no problem, but if there is more than one, you have to decide how to divide the coffee among them. You can split it however you like, as long as all of the coffee ends up in one cup or another. In the movie theater, one cup might represent women and another might represent men.




Alternatively, we could use four cups to represent the distribution over all combinations of gender and hair length. In both cases, the total amount of coffee adds up to exactly one cup.




Usually we set the cups side by side and view the amounts of coffee as a histogram. The distribution shows how strongly we believe in each possibility.




If you flip a coin and hide the result, your belief is split evenly between heads and tails.




If you roll a die and hide the result, your belief about the number on top is spread evenly across the six faces.




If you buy a Powerball ticket, your belief that you will win sits very close to zero. Coin flips, die rolls, and Powerball drawings are all examples of measuring and collecting data.
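In code, such discrete beliefs can be written as a minimal sketch like this (the Powerball jackpot odds of roughly 1 in 292 million are the commonly quoted figure):

```python
# A minimal sketch: beliefs as discrete probability distributions,
# represented as plain Python dictionaries whose values sum to 1.
coin = {"heads": 0.5, "tails": 0.5}
die = {face: 1 / 6 for face in range(1, 7)}
powerball = {"win": 1 / 292_000_000, "lose": 1 - 1 / 292_000_000}  # approximate jackpot odds
```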




You can also hold beliefs about measurements you have not yet seen. Consider the heights of adults in the United States. If you were told someone had been measured, your belief about their height would form a distribution: probably between 150 and 200 cm, and most likely between 180 and 190 cm.




The distribution can be broken into finer bins; you can think of it as pouring the coffee into more, smaller cups to represent a finer-grained set of beliefs.




Eventually the number of imaginary cups becomes so large that the analogy breaks down; at that point the distribution is continuous. The mathematics changes a little, but the underlying idea remains useful: the distribution shows how your belief is spread across the possibilities.

Now, expressed in terms of probability distributions, Bayes' theorem can be used to interpret data.




Bayesian inference at the veterinarian

Imagine weighing a dog at the vet. It is difficult to get an accurate reading because the dog wiggles on the scale, and an accurate reading matters: if the dog has gained weight, its food must be cut back, and vice versa.

At the last weigh-in, three measurements were obtained: 13.9 lb, 17.5 lb, and 14.1 lb. From these we can calculate the mean, standard deviation, and standard error, and build a distribution for the dog's actual weight.




This distribution represents our belief about the dog's weight using this approach: a normal distribution with a mean of 15.2 pounds and a standard error of 1.2 pounds (the actual measurements appear as white lines in the original figure). Unfortunately, the curve is uncomfortably wide. Although it peaks at 15.2 pounds, it says the weight could easily be as low as 13 pounds or as high as 17 pounds, far too wide a range to make any decision with confidence. When faced with results like this, the usual move is to go back and collect more data, but in some cases that is not feasible or is too expensive.
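For reference, here is a minimal sketch of how those summary numbers fall out of the three measurements, using Python's standard library:

```python
# A minimal sketch of the summary statistics from the three measurements above.
import statistics

measurements = [13.9, 17.5, 14.1]  # pounds

mean = statistics.mean(measurements)            # ~15.2
stdev = statistics.stdev(measurements)          # sample standard deviation, ~2.0
std_error = stdev / len(measurements) ** 0.5    # ~1.2

print(round(mean, 1), round(std_error, 1))
```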

This is where Bayes' theorem comes in: it helps make a small data set as useful as possible. Before applying it, it is worth revisiting the formula and reviewing the terminology.




Replace "A" and "B" with "w" (weight) and "m" (measurements): P(w | m) = P(m | w) × P(w) / P(m). Each of the four terms represents a different part of the process.

The prior, P(w), represents our prior beliefs; in this case, what we believe about the dog's weight before putting it on the scale.

The likelihood, P(m | w), is the probability of obtaining our measurements given a particular weight. It is also called the likelihood of the data.

The posterior, P(w | m), is the probability of a given weight, taking the measurements into account. This is what we are most interested in.

The probability of the data, P(m), is the probability that any given data point would be measured. For now, we will assume it is constant.

To start, assume we know nothing: the dog might weigh 13 pounds, 15 pounds, 1 pound, or a million pounds, and we will let the data speak for itself. This means assuming a uniform prior, that is, a prior probability distribution with the same value for every weight. It simplifies Bayes' theorem to P(w | m) = P(m | w), up to a normalizing constant.




At this point, we can take every possible value of the dog's weight and calculate the likelihood of getting our three measurements. If the dog weighed a thousand pounds, the measurements would be wildly improbable; if it actually weighs 14 or 16 pounds, they are quite likely. Stepping through each hypothetical weight and computing the likelihood of the measurements gives us P(m | w), and because the prior is uniform, this is also the posterior P(w | m).
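Here is a minimal sketch of that computation, assuming normally distributed measurement noise; the noise standard deviation of 2.0 lb is an illustrative choice, not a value given in the article:

```python
# A minimal sketch: likelihood of the three measurements over a grid of
# candidate weights, assuming normally distributed measurement noise.
import math

measurements = [13.9, 17.5, 14.1]
noise_sd = 2.0  # illustrative assumption for the scale's noise

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

weights = [w / 10 for w in range(100, 201)]  # candidate weights, 10.0 .. 20.0 lb
likelihood = [
    math.prod(normal_pdf(m, w, noise_sd) for m in measurements)
    for w in weights
]

# With a uniform prior, the posterior is just the normalized likelihood.
total = sum(likelihood)
posterior = [l / total for l in likelihood]
print(weights[posterior.index(max(posterior))])  # peaks near the sample mean, ~15.2
```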

We have used Bayes' theorem, but we are still no closer to a useful estimate. To fix that, we drop the assumption that the prior is uniform. The prior distribution represents what we believe about something before taking any measurements; a uniform prior says every possible outcome is equally likely, which is rarely the case.




In the dog's case, I do have more information. At the last visit it weighed 14.2 pounds, and although my arms are not a very sensitive scale, it does not feel noticeably heavier or lighter to me. So I believe the dog's weight is around 14.2 pounds, which I express as a normal distribution centered at 14.2 pounds with a standard deviation of 0.5 pounds.




Now that we have a prior, we can repeat the process of computing the posterior. To do this, we take a candidate weight, say 17 pounds, compute the likelihood of the measurements given that the dog actually weighs 17 pounds, and multiply it by the prior probability of 17 pounds; then we repeat this for every other possible weight. Because the prior concentrates our belief in the 13 to 15 pound range, it behaves very differently from the uniform prior.




Doing this for every possible weight gives a new posterior distribution. The peak of the posterior is called the maximum a posteriori estimate, or MAP; in this case the MAP is 14.1 pounds. This is noticeably different from what we calculated with the uniform prior, and the peak is much narrower, which lets us make a more confident estimate. We can now see that the dog's weight has not changed much, so the amount of food it gets should stay the same.
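A minimal sketch of the same grid computation with the non-uniform prior is shown below. The measurement-noise standard deviation is still an illustrative assumption, so the exact MAP it prints will not necessarily match the 14.1 lb quoted above, but the qualitative behavior, a narrow posterior pulled toward the prior, is the same:

```python
# A minimal sketch of the posterior with a non-uniform prior and its MAP.
import math

measurements = [13.9, 17.5, 14.1]
noise_sd = 2.0                    # assumed measurement noise (illustrative)
prior_mean, prior_sd = 14.2, 0.5  # prior belief from the last weigh-in

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

weights = [w / 10 for w in range(100, 201)]
posterior = []
for w in weights:
    likelihood = math.prod(normal_pdf(m, w, noise_sd) for m in measurements)
    prior = normal_pdf(w, prior_mean, prior_sd)
    posterior.append(likelihood * prior)  # unnormalized posterior

map_weight = weights[posterior.index(max(posterior))]
print(map_weight)  # near the prior mean; exact value depends on the assumed noise level
```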

By combining what we already knew with what we measured, we can make a more confident and more accurate estimate. Bayesian inference lets us get good use out of a very small data set. The prior assigned an extremely low probability to the 17.5 lb measurement; this is almost like rejecting it as an outlier, but instead of doing that rejection by intuition and common sense, Bayes' theorem lets us do it mathematically.

As a side note, we assumed P(m) was uniform, but if we happened to know that the scale was biased in some way, we could reflect that in P(m). If the scale only returned even numbers, or gave a random reading on every third attempt, we could craft P(m) to reflect this, which would improve the accuracy of the posterior.

Avoiding the Bayesian trap

The dog-weighing example shows the advantages of Bayesian inference, but it also has a pitfall. It improves our estimate by making assumptions about the answer, yet the whole point of measuring something is to learn about it. If we assume we already know the answer, we risk censoring the data.

If we had started with a strong assumption that the dog weighs between 13 and 15 pounds, we would never detect that its weight had actually dropped to 12.5 pounds. A prior that assigns zero probability to that outcome means that no matter how many measurements we take, every reading below 13 pounds will be ignored.

Fortunately, there is a way to hedge our bets and avoid blindly eliminating possibilities: assign at least a small probability to every outcome. Then, if the dog really does weigh 1,000 pounds, the measurements we collect can still show up in the posterior. This is one reason normal distributions are popular as priors: although most of their mass is concentrated in a small region, they have long tails that never go completely to zero, no matter how far out you go.
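A minimal sketch of why this matters, with purely illustrative numbers:

```python
# A minimal sketch: a prior of exactly zero at some weight forces the posterior
# there to zero no matter what the data say, while a tiny-but-nonzero prior
# still lets strong evidence show up in the posterior.
likelihood_at_12_5 = 0.3   # illustrative: the data strongly support 12.5 lb

hard_prior_at_12_5 = 0.0   # prior that rules 12.5 lb out entirely
soft_prior_at_12_5 = 1e-6  # long-tailed prior keeps a sliver of belief

print(likelihood_at_12_5 * hard_prior_at_12_5)  # 0.0 -> 12.5 lb can never be believed
print(likelihood_at_12_5 * soft_prior_at_12_5)  # > 0  -> evidence can accumulate
```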

This article is translated from "How Bayesian Inference Works" by Brandon Rohrer.

For more details, see the Data Science and Robots Blog.