So far, we have covered support vector machines (SVM), decision trees, KNN, Bayes, linear regression and logistic regression. As for the other algorithms, please allow Taoye to owe you those for now; they will be made up later when there is time and opportunity.

The updates so far have also received some praise from readers. It is not much, but thank you very much for your support, and I hope everyone who reads the series finds it rewarding.

The entire series was written by Taoye by hand, with reference to a number of books and open-source materials. The total length of the series is around 150,000 words (including source code), and it will continue to be filled in over time. More technical articles can be found on Taoye's official account: Cynical Coder. The documents may be circulated freely, but please do not modify their contents.

If there is anything in the article you don't understand, just ask, and Taoye will reply as soon as he sees it. You are also welcome to privately nudge Taoye for updates on: Cynical Coder. Taoye's personal contact information is also on the official account; there are some things Taoye can only whisper to you there (# ‘O’)

To improve the reading experience, this series of hands-on machine learning articles has been compiled into PDF and HTML, available for free by replying [666] on the official account [Cynical Coder].

Probability theory plays a very important role in machine learning. Taoye's knowledge of probability theory dates back to undergraduate courses and much of it has been forgotten, so here we only do a quick review, just enough to understand the Bayesian algorithms used in machine learning.

This series has so far covered support vector machines (SVM), decision trees, and K-Nearest Neighbors (KNN). Now let's play with Bayesian algorithms. The other articles in the series are listed below; take whichever you need.

  • Machine Learning in Action — Dissecting support vector machines and hand-deriving linear SVM: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action — Dissecting support vector machines (SVM): the SMO optimization: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action — Understand it or not, you'll get it: nonlinear support vector machines (SVM): www.zybuluo.com/tianxingjia…
  • Machine Learning in Action — Taoye tells you what kind of "ghost" a decision tree is: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action — Kids, come and play with your decision tree: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action — A female classmate asks Taoye how to clear the KNN level: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action — Bayes in plain words: should the "melon-eating masses" get a good melon or a bad one: www.zybuluo.com/tianxingjia…

This article mainly includes the following parts:

  • What exactly is Bayes, and is it really that impressive? This part introduces some basic knowledge about Bayes, along with the author's personal take on it. It is not difficult and should be easy to follow if you read carefully.
  • The second part introduces the theory behind Bayesian decision making, including conditional probability, total probability, Bayesian inference and so on, and uses some vivid cases and exercises to help you understand it. Note in particular that the last case (the pots and the stones) is important for understanding what Bayes really means in practice.
  • The third part walks through Bayes in practice with a watermelon example, which Taoye has titled: should the "melon-eating masses" get a good melon or a bad one. "Melon-eating masses" is meant literally here, not as the internet-slang meme. This part also explains in detail how nominal data and numerical data are handled, as well as the commonly used "smoothing" technique, the Laplacian correction. And, of course, as usual, the whole thing is implemented in code.

1. What exactly is Bayes, and is it really that impressive?

Bayes' theorem is named after the British mathematician Thomas Bayes (1702-1761). It describes the relationship between two conditional probabilities, such as $P(A|B)$ and $P(B|A)$. By the multiplication rule we immediately get $P(A \cap B) = P(A)P(B|A) = P(B)P(A|B)$, which can be rearranged into $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.

Thomas Bayes (1702-1761) was an 18th-century British theologian, mathematician, statistician and philosopher, and the founder of Bayesian statistics.

Here is a picture of the great man himself to open the article:

The advantages of the Bayesian algorithm are that it is simple and easy to understand, learns efficiently, still works with relatively little data, and can handle multi-class problems.

As for the disadvantages: in the general Bayesian setting, attributes are allowed to be correlated with one another, which makes the computation relatively complicated and not so easy to handle. Naive Bayes was therefore derived to drop the relationships between attributes (in a word, no relationship at all), i.e. attributes are assumed to be completely independent of each other. In practice we know complete independence is rarely true, yet even so naive Bayes has a wide range of applications.

What does it mean for the attributes mentioned above to be independent of each other?

Independence is a staple of probability theory. It means there is no relationship between the two: I don't care about you, please don't care about me, you go your way and I go mine, and what one of us does has no impact whatsoever on the other. (Key point to understand)

If you have some doubts about the above understanding, Taoye will introduce it in detail in the following cases to help you understand independence.

In addition, the book Machine Learning in Action mentions that the applicable data type for this algorithm is nominal data. In practice, however, it can be applied to numerical data as well as nominal data.

A quick word on nominal and numerical data: nominal data generally represents discrete attributes, e.g. describing height not in centimeters but simply as tall or short. Numerical data generally represents continuous attributes, e.g. height as 170cm, 175cm, 180cm and so on.

In Bayesian algorithms, different types of data are handled differently. Nominal data can be handled directly through frequencies: how many tall people are there? How many short people? For numerical (continuous) data we generally turn to a probability density function, assuming the continuous data follows a Gaussian distribution. (Don't worry if this is unclear for now; we will dig into the application of the Gaussian distribution in detail later.)

These are the basic concepts involved in Bayesian algorithms. Taoye has kept things as colloquial as possible, so they should not be too difficult for an attentive reader. It's fine to remain a little skeptical; let's now see who Bayes really is.

2. Bayesian decision making theory

Conditional probability:

We actually first met conditional probability back in middle school. Taoye still remembers being a bit of a rascal when learning this material, constantly bantering with the teacher in class.


Taoye even dug out the textbook for you: Mathematics Elective 2-3. If you get a chance to pick this book up again, it is quite fun, even a little exciting to think about. There are two volumes, A and B, the purple-cover and blue-cover "books of the gods":

Okay, okay, let’s do a quick review of conditional probability!

The figure above is what we call a Venn diagram (spell it however you like). It mainly helps us visualize relationships between sets. From it we can clearly see that, given that event B has occurred, the probability that event A occurs, $P(A|B)$, is (if the notation is unfamiliar, just think of it in terms of areas):


$$P(A|B) = \frac{P(A \cap B)}{P(B)} \tag{2-1}$$

In honor of that "book of the gods", let's look at one of its problems.

High school Mathematics B version elective 2-3

Q: Roll a red die and a blue die. Let event B = "the blue die shows 3 or 6" and event A = "the sum of the two dice is greater than 8". The question: given that the blue die shows 3 or 6, what is the probability that event A occurs?

Each die has 6 possible outcomes, so rolling two dice gives 36 possible outcomes in total. There are 5 outcomes in which A and B occur simultaneously, i.e. $P(A \cap B) = \frac{5}{36}$, and there are 12 outcomes in which B occurs, so $P(B) = \frac{12}{36}$. We can therefore obtain the value of $P(A|B)$ as follows:


$$\begin{aligned} P(A|B) & = \frac{P(A \cap B)}{P(B)} \\ & = \frac{\frac{5}{36}}{\frac{12}{36}} = \frac{5}{12} \end{aligned}$$

How about that? Pretty simple, right? This is the conditional probability formula we learned in middle school; a quick brute-force check is sketched below.
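The sketch simply enumerates all 36 equally likely outcomes of the two dice (the event definitions follow the problem statement; the variable names are mine):

```python
from itertools import product

# Enumerate all 36 equally likely (red, blue) outcomes of the two dice
outcomes = list(product(range(1, 7), repeat=2))

B = [(r, b) for r, b in outcomes if b in (3, 6)]     # blue die shows 3 or 6
A_and_B = [(r, b) for r, b in B if r + b > 8]        # ... and the sum is greater than 8

print(len(A_and_B), len(B), len(A_and_B) / len(B))   # 5 12 0.41666... = 5/12
```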

The conditional probability used in the Bayesian algorithm is slightly different, with a small twist. What is that twist?

We already obtained $P(A|B) = \frac{P(A \cap B)}{P(B)}$ above. Rearranging it gives:


$$P(A \cap B) = P(A|B) \, P(B) \tag{2-2}$$

In the same way:


$$P(A \cap B) = P(B|A) \, P(A) \tag{2-3}$$

So:


$$P(A|B) \, P(B) = P(B|A) \, P(A) \tag{2-4}$$

That is:


$$P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} \tag{2-5}$$

Equation 2-5 is the conditional probability formula used in Bayes' theorem.

Total probability formula:

Suppose the sample space S is the union of two events A and A′, as shown below:

In the figure above, event A and event A’ together constitute the sample space S.

In this case, event B can be divided into two parts, as shown below:

That is:


$$P(B) = P(B \cap A) + P(B \cap A^{'}) \tag{2-6}$$

From the above derivation, it can be known that:


$$P(B \cap A) = P(B|A)P(A) \tag{2-7}$$

so


$$P(B) = P(B|A)P(A) + P(B|A^{'})P(A^{'}) \tag{2-8}$$

This is the total probability formula. It says that if A and A′ form a partition of the sample space, then the probability of event B equals the sum, over those two events, of each event's probability times the conditional probability of B given that event.

Substituting this total probability formula into the conditional probability formula above gives another way of writing conditional probability:


$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^{'})P(A^{'})} \tag{2-9}$$
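Sticking with the dice problem from earlier, a short sketch can confirm Equation 2-9 numerically: computing $P(A|B)$ through the total probability decomposition of $P(B)$ gives the same 5/12 as the direct calculation (the helper names are mine):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))            # (red, blue)
A  = [rb for rb in outcomes if sum(rb) > 8]                # sum greater than 8
Ac = [rb for rb in outcomes if sum(rb) <= 8]               # complement of A
in_B = lambda rb: rb[1] in (3, 6)                          # blue die shows 3 or 6

P_A,  P_Ac  = Fraction(len(A), 36), Fraction(len(Ac), 36)
P_B_given_A  = Fraction(sum(map(in_B, A)),  len(A))
P_B_given_Ac = Fraction(sum(map(in_B, Ac)), len(Ac))

posterior = P_B_given_A * P_A / (P_B_given_A * P_A + P_B_given_Ac * P_Ac)
print(posterior)   # 5/12, the same value as the direct calculation
```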

Bayesian inference

By deforming the conditional probability formula, the following form can be obtained:


$$P(A|B) = P(A)\frac{P(B|A)}{P(B)} \tag{2-10}$$

We call $P(A)$ the "prior probability", i.e. our judgment of how likely A is before event B occurs; at this point only event A is considered, and it has nothing to do with B. $P(A|B)$ is called the "posterior probability", i.e. our re-assessment of how likely A is after B has occurred, now taking event B into account. $\frac{P(B|A)}{P(B)}$ is called the "likelihood" term and acts as an adjustment factor that brings the predicted probability closer to the true probability.

Therefore, conditional probability can be understood as the following formula:


Posterior probability = prior probability × adjustment factor

That is what Bayesian inference means: we start with an estimated prior probability, then incorporate the result of an experiment to see whether the experiment strengthens or weakens the prior, arriving at a posterior probability that is closer to reality.

If the likelihood term $\frac{P(B|A)}{P(B)} > 1$, the prior probability is strengthened and event A becomes more likely; if it equals 1, event B gives no help in judging the likelihood of event A; and if it is less than 1, the prior probability is weakened and event A becomes less likely.

The content above is adapted from Ruan Yifeng's blog: www.ruanyifeng.com/blog/2011/0…

To further understand Bayesian inference, let’s take a look at the following example:


Example adapted from: Machine Learning in Action

There are two identical pots, pot 1 and pot 2. Pot 1 holds four stones: 2 white and 2 black. Pot 2 holds three stones: 1 white and 2 black. We pick a pot at random and draw one stone from it, and it turns out to be white. Which pot is the white stone more likely to have come from?

Taoye adapted this problem from an example in Machine Learning in Action, mainly to make it easier to see how Bayes applies to classification problems. So, which pot is it more likely to have come from?

There are only two pots, pot 1 or pot 2, but "more likely" describes a possibility, which is essentially a classification problem: whichever pot the stone is more likely to have come from is the pot we classify it into.

In other words, we can treat the color of the stone as the sample's attribute, and the pot it came from as the sample's label. We then compute the probability that the stone came from pot 1 and from pot 2 respectively, and classify the stone into whichever pot gives the larger probability.

We can analyze it by the above conditional probability formula, which is reproduced as follows:


$$P(A|B) = P(A)\frac{P(B|A)}{P(B)}$$

For this problem, let event A = "the stone came from pot 1" and event B = "a white stone was drawn".

Since the two pots are exactly the same, we have:


$$P(A) = \frac{1}{2}$$

$P(B|A)$ is the probability of drawing a white stone from pot 1. Pot 1 holds four stones, two of which are white, so:


$$P(B|A) = \frac{2}{4} = \frac{1}{2}$$

$P(B)$ is simply the probability of drawing a white stone overall. There are 7 stones in total, 3 of which are white, so:


$$P(B) = \frac{3}{7}$$

Putting this together, the conditional probability we want, i.e. the probability that the stone came from pot 1 given that it is white, is:


$$\begin{aligned} P(A|B) & = P(A)\frac{P(B|A)}{P(B)} \\ & = \frac{1}{2}\cdot\frac{\frac{1}{2}}{\frac{3}{7}}=\frac{7}{12} \end{aligned}$$

Similarly, letting event C = "the stone came from pot 2", we can compute the corresponding conditional probability:


$$\begin{aligned} P(C|B) & = P(C)\frac{P(B|C)}{P(B)} \\ & = \frac{1}{2}\cdot\frac{\frac{1}{3}}{\frac{3}{7}}=\frac{7}{18} \end{aligned}$$

For this problem, the prior probabilities are $P(A) = P(C) = \frac{1}{2}$. After applying the adjustment factor (the likelihood term), we obtain the posterior probabilities $P(A|B) = \frac{7}{12}$ and $P(C|B) = \frac{7}{18}$. In other words, before the white stone was drawn, events A and C were equally likely; after the white stone is drawn, the probability of event A is strengthened while that of event C is weakened. Since $P(A|B) > P(C|B)$, we prefer to classify the white stone as coming from pot 1.

The paragraph above is very important to understand, because it captures what naive Bayes actually does: given that all the attributes of the test sample are known, we use the Bayesian formula to compute the probability of the sample belonging to each class, and that determines the final classification of the test sample.

Some readers may have noticed something odd: for this problem, the white stone is classified into either pot 1 or pot 2, so intuitively we should have $P(A|B) + P(C|B) = 1$. Yet the two values we computed do not sum to 1, which feels completely unreasonable. Annoying...

First of all, readers who raised this question deserve credit: it shows they are thinking carefully while reading. In fact, those readers have overlooked the question of the "domain". Our intuition that $P(A|B) + P(C|B) = 1$ is based on the whole domain of seven stones. But think about what conditional probability does: the domain has changed. It is no longer the whole, but has been split into two sub-domains, so the two probabilities computed this way need not sum to 1. (Key point to understand)

As for this "domain" issue, it is something Taoye thought about independently while studying and has not yet checked against any authoritative material, so its correctness cannot be fully guaranteed. If you have questions, feel free to comment below or message Taoye privately.

One last word of caution: really understanding what this problem means matters a lot for understanding Bayesian algorithms.
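To wrap this section up, here is a tiny sketch that reproduces the two posteriors above, following the text's own figures (including $P(B)=\frac{3}{7}$ taken over all seven stones); the variable names are mine:

```python
from fractions import Fraction

p_pot1 = p_pot2 = Fraction(1, 2)        # the two pots are chosen with equal probability
p_white_given_pot1 = Fraction(2, 4)     # pot 1: 2 white stones out of 4
p_white_given_pot2 = Fraction(1, 3)     # pot 2: 1 white stone out of 3
p_white = Fraction(3, 7)                # 3 white stones among all 7, as used in the text

p_pot1_given_white = p_pot1 * p_white_given_pot1 / p_white
p_pot2_given_white = p_pot2 * p_white_given_pot2 / p_white
print(p_pot1_given_white, p_pot2_given_white)   # 7/12 7/18 -> classify the white stone into pot 1
```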


3. Should the melon-eating masses get a good melon or a bad melon?

In this section we mainly use an example from Zhou Zhihua's "watermelon book" to understand how Bayes is applied, and then solve the problem in code. The overall procedure is roughly the same as in the examples above; the key is learning to adapt to different problems and to transform them. As the Book of Changes says: when things reach an impasse, they change; once changed, they flow; and what flows endures. This point is quite important, especially for students.

Now let’s get down to business.

Example source: Zhou Zhihua, Machine Learning, Chapter 4

To keep the reading from getting boring, or rather to ease into the case, Taoye has made up a short story as an introduction.

Note: the meaning of “melon eating masses” is purely literal, not the “meme” of Internet slang.

Once upon a time, there was a mountain on which there was a temple, and in the temple there were people eating melons. (It’s kind of nice to sing, HHHHHH)

The temple also kept hundreds of watermelons for everyone to eat. At first the melon-eating crowd happily devoured one watermelon after another. But after a certain number of melons, one of them noticed that some melons were good and some were bad, and got confused: good grief, could these bad melons have gone off??

So what to do? He wanted to avoid the bad melons every time, or at least pick good ones more often than bad ones. To that end, he collected the attributes and labels of the watermelons tasted so far, in order to judge whether a melon is good or bad.

The watermelon data collected by the melon-eating masses are as follows:

There are 17 watermelons in total in this data set, including 8 good ones and 9 bad ones.

The melon-eater case is a little more complicated than the stone-and-pot case, but only a little. Multiple attributes are involved, and besides nominal data there is also numerical data; how do we handle these different types of data? In addition, what if an attribute value of the watermelon we are testing never appears among the 17 samples; what do we do then?

All of the above are the problems we need to solve in this section.

  • Handling multiple attributes

In the stone-and-pot case the only attribute was color, whereas in the melon-eater case the attributes are color, root, knock sound, texture, navel, touch, density and sugar content.

From the previous hands-on machine learning articles, we know that when a sample has multiple attributes, we can treat them as a whole. What whole? Right, a feature vector.

We can write the feature vector as $x = (\text{texture}, \text{color}, \text{root}, \text{knock sound}, \text{navel}, \text{touch}, \text{density}, \text{sugar content})$ and denote the label (good melon or bad melon) by $c$. Given the attribute values of a watermelon sample to be tested, we need to judge whether it is a good melon or a bad one. By Bayes' theorem, we have:


$$P(c|x)=\frac{P(c)P(x|c)}{P(x)} \tag{3-1}$$

From Equation 3-1 it is easy to see that the main difficulty in estimating the posterior probability $P(c|x)$ via the Bayes formula is the class-conditional probability $P(x|c)$: it is a joint probability over all attributes, which is hard to estimate directly from a limited number of training samples. Moreover, the space of attribute combinations is essentially a Cartesian product, which is not computationally friendly. (Readers can look up the Cartesian product themselves; Taoye will introduce it later when the opportunity arises.)

To avoid this problem, the "naive Bayes classifier" adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be independent of one another and to have no influence on each other. I don't care about you, please don't care about me, you go your way and I go mine; what one attribute does has no impact on the others.

Does the earlier discussion of independence make a bit more sense now?

Based on the assumption of attribute condition independence, formula 3-1 can be rewritten as:


$$P(c|x)=\frac{P(c)P(x|c)}{P(x)}=\frac{P(c)}{P(x)}\prod_{i=1}^8 P(x_i|c) \tag{3-2}$$

Note that, for a given watermelon sample to be tested, $P(x)$ is the same no matter which class we evaluate: when distinguishing good melons from bad melons, the same $P(x)$ appears in both posteriors. In other words, $P(x)$ has no influence on which class ends up with the larger posterior, so to decide whether the melon is good or bad we do not need to compute $P(x)$ at all. This gives us:


$$h_{nb}(x) = \arg\max_c \; P(c)\prod_{i=1}^8 P(x_i|c) \tag{3-3}$$

This is the naive Bayes expression. It says: when comparing different classes, compute $P(c)\prod_{i=1}^8 P(x_i|c)$ for each class; the label with the largest value is the classification result we want.

Not hard, right? You should be able to follow this.

  • How different data types are handled

Looking at the data, the attributes fall into two categories. One is nominal data: color, root, knock sound, texture, navel and touch; the other is numerical data: density and sugar content. Nominal data can be regarded as discrete and numerical data as continuous, and in the Bayesian algorithm the two types are handled differently.

For a discrete attribute, let $D_{c,x_i}$ denote the set of samples in $D_c$ whose value on the $i$-th attribute is $x_i$. The conditional probability $P(x_i|c)$ can then be estimated as:


$$P(x_i|c)=\frac{|D_{c,x_i}|}{|D_c|} \tag{3-4}$$

In other words, it’s a calculation of frequency.

For continuous attributes we turn to a probability density function. Assume $p(x_i|c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2)$, where $\mu_{c,i}$ and $\sigma_{c,i}^2$ are the mean and variance of the $i$-th attribute among the samples of class $c$. Then:


$$p(x_i|c)=\frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}\exp\left(-\frac{(x_i-\mu_{c,i})^2}{2\sigma_{c,i}^2}\right) \tag{3-5}$$

That is to say, $p(x_i|c)$ is computed by assuming that, within each class, the numerical values of that attribute follow a Gaussian distribution, and then evaluating that Gaussian density at $x_i$.
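As a quick sanity check of Equation 3-5, a few lines of NumPy reproduce the kind of density value computed by hand in the next subsection (the mean 0.574 and standard deviation 0.121 are the good-melon density statistics used there):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Equation 3-5: the normal density N(mean, std^2) evaluated at x."""
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)

# p(density = 0.697 | good melon), with the good-melon density mean and
# standard deviation quoted in the worked example below
print(gaussian_pdf(0.697, 0.574, 0.121))   # roughly 1.96
```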

OK, with these two questions settled, we can now compute whether the melon-eating masses got a good melon or a bad one.

Suppose the watermelon obtained by the melon-eating masses has the attributes shown below; we judge its quality through Bayes:

We first compute the prior probabilities $P(c)$. There are 17 melons in total, 8 good and 9 bad, so:


$$\begin{aligned} & P(\text{good melon}=\text{yes}) = \frac{8}{17} = 0.471 \\ & P(\text{good melon}=\text{no}) = \frac{9}{17} = 0.529 \end{aligned}$$

Then, for each attribute, we estimate the conditional probability $P(x_i|c)$:


$$\begin{aligned}
& P_{\text{green}|\text{yes}} = P(\text{color}=\text{green} \mid \text{good melon}=\text{yes}) = \frac{3}{8} = 0.375 \\
& P_{\text{green}|\text{no}} = P(\text{color}=\text{green} \mid \text{good melon}=\text{no}) = \frac{3}{9} = 0.333 \\
& P_{\text{curled}|\text{yes}} = P(\text{root}=\text{curled} \mid \text{good melon}=\text{yes}) = \frac{5}{8} = 0.625 \\
& P_{\text{curled}|\text{no}} = P(\text{root}=\text{curled} \mid \text{good melon}=\text{no}) = \frac{3}{9} = 0.333 \\
& P_{\text{muffled}|\text{yes}} = P(\text{knock}=\text{muffled} \mid \text{good melon}=\text{yes}) = \frac{6}{8} = 0.750 \\
& P_{\text{muffled}|\text{no}} = P(\text{knock}=\text{muffled} \mid \text{good melon}=\text{no}) = \frac{4}{9} = 0.444 \\
& P_{\text{clear}|\text{yes}} = P(\text{texture}=\text{clear} \mid \text{good melon}=\text{yes}) = \frac{7}{8} = 0.875 \\
& P_{\text{clear}|\text{no}} = P(\text{texture}=\text{clear} \mid \text{good melon}=\text{no}) = \frac{2}{9} = 0.222 \\
& P_{\text{sunken}|\text{yes}} = P(\text{navel}=\text{sunken} \mid \text{good melon}=\text{yes}) = \frac{5}{8} = 0.625 \\
& P_{\text{sunken}|\text{no}} = P(\text{navel}=\text{sunken} \mid \text{good melon}=\text{no}) = \frac{2}{9} = 0.222 \\
& P_{\text{hard-smooth}|\text{yes}} = P(\text{touch}=\text{hard-smooth} \mid \text{good melon}=\text{yes}) = \frac{6}{8} = 0.750 \\
& P_{\text{hard-smooth}|\text{no}} = P(\text{touch}=\text{hard-smooth} \mid \text{good melon}=\text{no}) = \frac{6}{9} = 0.667
\end{aligned}$$

$$\begin{aligned}
P_{\text{density}:0.697|\text{yes}} & = p(\text{density}=0.697 \mid \text{good melon}=\text{yes}) \\
& = \frac{1}{\sqrt{2\pi}\cdot 0.121}\exp\left(-\frac{(0.697-0.574)^2}{2\cdot 0.121^2}\right) = 1.962 \\
P_{\text{density}:0.697|\text{no}} & = p(\text{density}=0.697 \mid \text{good melon}=\text{no}) \\
& = \frac{1}{\sqrt{2\pi}\cdot 0.184}\exp\left(-\frac{(0.697-0.496)^2}{2\cdot 0.184^2}\right) = 1.194 \\
P_{\text{sugar}:0.460|\text{yes}} & = p(\text{sugar}=0.460 \mid \text{good melon}=\text{yes}) \\
& = \frac{1}{\sqrt{2\pi}\cdot 0.094}\exp\left(-\frac{(0.460-0.279)^2}{2\cdot 0.094^2}\right) = 0.669 \\
P_{\text{sugar}:0.460|\text{no}} & = p(\text{sugar}=0.460 \mid \text{good melon}=\text{no}) \\
& = \frac{1}{\sqrt{2\pi}\cdot 0.102}\exp\left(-\frac{(0.460-0.154)^2}{2\cdot 0.102^2}\right) = 0.042
\end{aligned}$$

A note here: in Zhou Zhihua's watermelon book there is an error in the calculation of $P_{\text{sunken}|\text{yes}}$; the actual result should be 0.625, not 0.750, which readers can verify for themselves. Also note that the standard deviations used above are the population standard deviations computed by np.std in the code below (e.g. 0.121 rather than the 0.129 quoted in the book), so the Gaussian density values differ slightly from the book's.

We can now compute the scores for this melon being a good melon and a bad melon:


$$\begin{aligned}
& P(\text{good melon}=\text{yes}) \times P_{\text{green}|\text{yes}} \times P_{\text{curled}|\text{yes}} \times P_{\text{muffled}|\text{yes}} \times P_{\text{clear}|\text{yes}} \\
& \quad\quad \times P_{\text{sunken}|\text{yes}} \times P_{\text{hard-smooth}|\text{yes}} \times P_{\text{density}:0.697|\text{yes}} \times P_{\text{sugar}:0.460|\text{yes}} = 0.046 \\
& P(\text{good melon}=\text{no}) \times P_{\text{green}|\text{no}} \times P_{\text{curled}|\text{no}} \times P_{\text{muffled}|\text{no}} \times P_{\text{clear}|\text{no}} \\
& \quad\quad \times P_{\text{sunken}|\text{no}} \times P_{\text{hard-smooth}|\text{no}} \times P_{\text{density}:0.697|\text{no}} \times P_{\text{sugar}:0.460|\text{no}} = 4.36 \times 10^{-5}
\end{aligned}$$

Since $0.046 > 4.36 \times 10^{-5}$, we classify this sample as a "good melon".
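One practical note before turning to the code: a product of many probabilities smaller than 1 can underflow to zero as the number of attributes grows, so implementations often compare sums of logarithms instead. A minimal sketch using the figures just computed (the two lists simply collect the prior and the eight per-attribute terms from above):

```python
import numpy as np

# Prior followed by the eight per-attribute terms computed above, for each class
good = [0.471, 0.375, 0.625, 0.750, 0.875, 0.625, 0.750, 1.962, 0.669]
bad  = [0.529, 0.333, 0.333, 0.444, 0.222, 0.222, 0.667, 1.194, 0.042]

log_score_good, log_score_bad = np.sum(np.log(good)), np.sum(np.log(bad))
print(log_score_good, log_score_bad)   # roughly -3.1 versus -10.0
print("good melon" if log_score_good >= log_score_bad else "bad melon")
```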

Below, we can use code to describe the above Bayes test process.

First, establish an establish_data method to prepare data:

Define the calc_label_count and calc_p_c methods to count the number of samples with each label and to compute the frequency of each class in the sample set, i.e. $P(c)$ for each class:

The program output is as follows, consistent with our manual calculation above:

Following the Bayesian classification process, we also need to define the calc_dispersed_p_xi_c and calc_continuity_p_xi_c methods to compute the value of $P(x_i|c)$, for discrete data and continuous data respectively.

Note that to compute $p(x_i|c)$ for continuous data we must also know the mean and standard deviation of the data in advance, so a calc_mean_standard method is defined for this purpose. The code for these three core methods is as follows (all fairly simple):

The calculation results are shown in the figure below:

As we can see, the Bayes algorithm judges this melon to be a good melon, which matches its actual label, so the prediction is correct. Of course, this code only predicts a single watermelon sample; readers can extend it to predict multiple samples and measure the accuracy of this Bayes classifier, as sketched after the complete code below.

Complete code:

```python
import numpy as np

"""
    Author: Taoye
    WeChat official account: Cynical Coder
    Explain: Generate the sample attributes and the corresponding labels
    Return:
        x_data: attributes of the data samples, 8 attributes each
        y_label: labels corresponding to the samples (0 = good melon, 1 = bad melon)
"""
def establish_data():
    x_data = [[1, 1, 1, 1, 1, 1, 0.697, 0.460],
              [2, 1, 2, 1, 1, 1, 0.774, 0.376],
              [2, 1, 1, 1, 1, 1, 0.634, 0.264],
              [1, 1, 2, 1, 1, 1, 0.608, 0.318],
              [3, 1, 1, 1, 1, 1, 0.556, 0.215],
              [1, 2, 1, 1, 2, 2, 0.403, 0.237],
              [2, 2, 1, 2, 2, 2, 0.481, 0.149],
              [2, 2, 1, 1, 2, 1, 0.437, 0.211],
              [2, 2, 2, 2, 2, 1, 0.666, 0.091],
              [1, 3, 3, 1, 3, 2, 0.243, 0.267],
              [3, 3, 3, 3, 3, 1, 0.245, 0.057],
              [3, 1, 1, 3, 3, 2, 0.343, 0.099],
              [1, 2, 1, 2, 1, 1, 0.639, 0.161],
              [3, 2, 2, 2, 1, 1, 0.657, 0.198],
              [2, 2, 1, 1, 2, 2, 0.360, 0.370],
              [3, 1, 1, 3, 3, 1, 0.593, 0.042],
              [1, 1, 2, 2, 2, 1, 0.719, 0.103]]
    y_label = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    return np.array(x_data), np.array(y_label)

"""
    Explain: Count the number of samples carrying each label
"""
def calc_label_count(y_label):
    label_count_0, label_count_1 = 0, 0
    for label in y_label:
        if int(label) == 0: label_count_0 += 1
        if int(label) == 1: label_count_1 += 1
    return label_count_0, label_count_1

"""
    Explain: Compute the frequency of each class in the sample set, i.e. P(c) for each class
"""
def calc_p_c(y_label):
    data_number = y_label.shape[0]
    label_count_0, label_count_1 = calc_label_count(y_label)
    return label_count_0 / data_number, label_count_1 / data_number

"""
    Explain: Compute P(x_i|c) for one nominal attribute of the test sample
"""
def calc_dispersed_p_xi_c(test_data, x_data, y_label, attribute_index):
    label_count_0, label_count_1 = calc_label_count(y_label)
    attribute_count_0, attribute_count_1 = 0, 0
    for item in x_data[:label_count_0]:          # the first label_count_0 samples are good melons
        if test_data[attribute_index] == item[attribute_index]: attribute_count_0 += 1
    for item in x_data[label_count_0:]:          # the remaining samples are bad melons
        if test_data[attribute_index] == item[attribute_index]: attribute_count_1 += 1
    return attribute_count_0 / label_count_0, attribute_count_1 / label_count_1

"""
    Explain: Compute the mean and standard deviation of the continuous attributes (density, sugar content)
"""
def calc_mean_standard(x_data):
    mean_value_0, mean_value_1 = np.mean(x_data[:8, 6:8], axis=0), np.mean(x_data[8:, 6:8], axis=0)
    std_value_0, std_value_1 = np.std(x_data[:8, 6:8], axis=0), np.std(x_data[8:, 6:8], axis=0)
    return mean_value_0, mean_value_1, std_value_0, std_value_1

"""
    Explain: Gaussian probability density, i.e. Equation 3-5
"""
def calc_gaussian(data, mean_value, std_value):
    return (1 / (np.sqrt(2 * np.pi) * std_value)) * (np.e ** ((-(data - mean_value) ** 2) / (2 * std_value ** 2)))

"""
    Explain: Compute p(x_i|c) for the two continuous attributes of the test sample
"""
def calc_continuity_p_xi_c(test_data, x_data):
    mean_value_0, mean_value_1, std_value_0, std_value_1 = calc_mean_standard(x_data)
    pxi_density_0 = calc_gaussian(test_data[6], mean_value_0[0], std_value_0[0])
    pxi_density_1 = calc_gaussian(test_data[6], mean_value_1[0], std_value_1[0])
    pxi_sugar_0 = calc_gaussian(test_data[7], mean_value_0[1], std_value_0[1])
    pxi_sugar_1 = calc_gaussian(test_data[7], mean_value_1[1], std_value_1[1])
    return pxi_density_0, pxi_density_1, pxi_sugar_0, pxi_sugar_1

if __name__ == "__main__":
    test_data = [1, 1, 1, 1, 1, 1, 0.697, 0.460]
    x_data, y_label = establish_data()
    attr0 = calc_dispersed_p_xi_c(test_data, x_data, y_label, 0)
    attr1 = calc_dispersed_p_xi_c(test_data, x_data, y_label, 1)
    attr2 = calc_dispersed_p_xi_c(test_data, x_data, y_label, 2)
    attr3 = calc_dispersed_p_xi_c(test_data, x_data, y_label, 3)
    attr4 = calc_dispersed_p_xi_c(test_data, x_data, y_label, 4)
    attr5 = calc_dispersed_p_xi_c(test_data, x_data, y_label, 5)
    print("P(x_i|c) for the nominal attributes:", attr0, attr1, attr2, attr3, attr4, attr5)
    pxi_density_0, pxi_density_1, pxi_sugar_0, pxi_sugar_1 = calc_continuity_p_xi_c(test_data, x_data)
    print("p(x_i|c) for the continuous attributes:", pxi_density_0, pxi_density_1, pxi_sugar_0, pxi_sugar_1)
    p1, p2 = calc_p_c(y_label)
    print("Prior probabilities P(c):", p1, p2)
    p_good_melon = p1 * attr0[0] * attr1[0] * attr2[0] * attr3[0] * attr4[0] * attr5[0] * pxi_density_0 * pxi_sugar_0
    p_bad_melon = p2 * attr0[1] * attr1[1] * attr2[1] * attr3[1] * attr4[1] * attr5[1] * pxi_density_1 * pxi_sugar_1
    print("Scores for classifying as good melon and bad melon:", p_good_melon, p_bad_melon)
    print("Good melon" if p_good_melon >= p_bad_melon else "Bad melon")
```
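As suggested above, if you want to check how the classifier does on more than one sample, a minimal sketch (reusing the functions from the complete code and simply measuring training-set accuracy over the 17 melons) might look like this:

```python
import numpy as np

def predict(sample, x_data, y_label):
    """Classify one sample with the naive Bayes pipeline defined above; returns 0 (good) or 1 (bad)."""
    p_good, p_bad = calc_p_c(y_label)
    for i in range(6):  # the six nominal attributes
        p_xi_good, p_xi_bad = calc_dispersed_p_xi_c(sample, x_data, y_label, i)
        p_good, p_bad = p_good * p_xi_good, p_bad * p_xi_bad
    density_good, density_bad, sugar_good, sugar_bad = calc_continuity_p_xi_c(sample, x_data)
    return 0 if p_good * density_good * sugar_good >= p_bad * density_bad * sugar_bad else 1

x_data, y_label = establish_data()
predictions = np.array([predict(sample, x_data, y_label) for sample in x_data])
print("Training-set accuracy:", np.mean(predictions == y_label))
```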

At the beginning of this section we raised three questions, two of which have now been solved; let's tackle the last one.

  • What if an attribute value of the watermelon we are testing never appears among the 17 samples?

That is, if some attribute value never co-occurs with a certain class in the training set, then discriminating with naive Bayes as described above runs into trouble. For example, when training naive Bayes on the watermelon dataset, for a test sample with "knock = crisp" we have:


$$P_{\text{crisp}|\text{yes}} = P(\text{knock}=\text{crisp} \mid \text{good melon}=\text{yes}) = \frac{0}{8} = 0$$

Therefore, no matter what the sample's other attributes are, even if it obviously looks like a good melon on every other attribute, the classification result will be "good melon = no", which is clearly unreasonable.

To prevent the information carried by the other attributes from being "erased" by attribute values that never appear in the training set, the probability estimates are usually "smoothed", most commonly with the "Laplacian correction". Specifically, let N denote the number of possible classes in the training set D, and $N_i$ the number of possible values of the $i$-th attribute; the prior estimate and Equation 3-4 are then corrected to:


$$\hat{P}(c) = \frac{|D_c| + 1}{|D| + N} \qquad \hat{P}(x_i|c)=\frac{|D_{c,x_i}|+1}{|D_c|+N_i}$$

For example, in this section's example, the corrected prior probabilities can be estimated as:


$$\hat{P}(\text{good melon}=\text{yes}) = \frac{8+1}{17+2} = 0.474, \quad \hat{P}(\text{good melon}=\text{no}) = \frac{9+1}{17+2} = 0.526$$

There is nothing difficult about the Laplacian correction; it mainly deals with the case where a particular attribute value is absent from the data sample. Readers can use it to improve the complete code above; a rough sketch follows.
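Here is one way the Laplacian correction might be folded into the code above. The function names mirror the complete code but are my own smoothed variants, and value_number stands for $N_i$, the number of possible values of the attribute:

```python
import numpy as np

def calc_p_c_laplace(y_label, class_number=2):
    """Smoothed prior: (|D_c| + 1) / (|D| + N), with N = class_number classes."""
    data_number = y_label.shape[0]
    label_count_0 = int(np.sum(y_label == 0))
    label_count_1 = data_number - label_count_0
    return ((label_count_0 + 1) / (data_number + class_number),
            (label_count_1 + 1) / (data_number + class_number))

def calc_dispersed_p_xi_c_laplace(test_data, x_data, y_label, attribute_index, value_number):
    """Smoothed likelihood for a nominal attribute: (|D_{c,x_i}| + 1) / (|D_c| + N_i)."""
    label_count_0 = int(np.sum(y_label == 0))        # good melons come first in x_data
    label_count_1 = y_label.shape[0] - label_count_0
    count_0 = int(np.sum(x_data[:label_count_0, attribute_index] == test_data[attribute_index]))
    count_1 = int(np.sum(x_data[label_count_0:, attribute_index] == test_data[attribute_index]))
    return ((count_0 + 1) / (label_count_0 + value_number),
            (count_1 + 1) / (label_count_1 + value_number))

# e.g. the smoothed P(knock = crisp | good melon): knock sound has 3 possible values,
# so the estimate becomes (0 + 1) / (8 + 3) ~= 0.091 instead of 0.
```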

There is actually a bit more to say about Bayesian algorithms, but for reasons of space we will save that for a later article.

If this article has helped you, please like it and share it

That's all for this article; no more rambling.

I am Taoye. I love studying and sharing, and I am keen on all kinds of technology. Outside of study I like anime, playing chess, listening to music and chatting. I hope to use words to record my growth and everyday life, and to meet more like-minded friends along the way. You are welcome to visit my WeChat official account: Cynical Coder.

I’ll see you next time. Bye

References:

[1] Machine Learning in Action, Peter Harrington, Posts and Telecommunications Press
[2] Statistical Learning Methods, Li Hang, 2nd Edition, Tsinghua University Press
[3] Machine Learning, Zhou Zhihua, Tsinghua University Press
[4] High School Mathematics (Edition B), Elective 2-3
[5] Bayesian inference and its Internet applications: http://www.ruanyifeng.com/blog/2011/08/bayesian_inference_part_one.html

Recommended reading

  • Machine Learning in Action — A female classmate asks Taoye how to clear the KNN level
  • Machine Learning in Action — Understand it or not, you'll get it: nonlinear support vector machines
  • Machine Learning in Action — Taoye takes a look at support vector machines
  • print("Hello, NumPy!")
  • Doing nothing but eating? Taoye infiltrates a shady platform's headquarters, and the truth behind it is chilling
  • "Tai Hua database" — What actually happens under the hood when a SQL statement executes?