3.1 The basic concept of Naive Bayes

3.1.1 What is Bayes

I'm sure that if you are reading this article, you know or have heard of Bayes' theorem. The English mathematician **Thomas Bayes** first proposed the theorem in a paper published in 1763, and naive Bayes is an application of this theorem.

Bayesian inference is quite different from other statistical inference methods. It is based on subjective judgment: you may estimate a value first without objective evidence and then revise the estimate based on actual results. Precisely because of this strong subjectivity, it has been criticized by many statisticians.

Bayesian inference requires a lot of computation, so for a long time it was not widely used; it only gained real importance with the birth of the computer. It has been found that many quantities cannot be judged objectively in advance, and the large data sets of the Internet age, combined with high-speed computing power, make it easy to verify such quantities. This has created the conditions for applying Bayesian inference, which is becoming increasingly powerful. The author's ability is limited, and despite consulting a large amount of literature I cannot give a precise definition of naive Bayes, so I can only quote the definition given by an authority here.

Concept: Naive Bayes is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent. For a given training data set, the joint probability distribution of the input and output is first learned under the conditional independence assumption. Then, based on this model, for a given input $x$, Bayes' theorem is used to find the output $y$ with the highest posterior probability.

[Note] The above is the definition of naive Bayes from Statistical Learning Methods by Li Hang.

After reading the definition above, many readers may still be a little confused about how naive Bayes is actually used for classification. To describe intuitively how naive Bayes works, as in the previous chapter, we will introduce naive Bayes through an example.

At a certain university, the ratio of male to female students is 2 to 1. Question: if you meet a student wearing a military training uniform on campus, what is the probability that the student is male, and what is the probability that the student is female? To answer this, let's first talk about the prior probability, the conditional probability, the total probability formula, and the posterior probability.

3.1.2 Prior probability

Prior probability refers to a probability obtained from previous experience and analysis; it can often be computed, for example, with the total probability formula (discussed later).

Let's use the example above to explain what a prior probability is. Let the event that a student on campus is male be $Y=y_{boy}$ and the event that the student is female be $Y=y_{girl}$; let wearing a military uniform be the event $X=x_1$ and not wearing one be the event $X=x_0$.

So there are four prior probabilities in this example: $P(X=x_1)$ and $P(X=x_0)$, $P(Y=y_{boy})$ and $P(Y=y_{girl})$. From previous experience we know that the ratio of boys to girls at this school is usually 2:1, so $P(Y=y_{boy}) = 2/3$ and $P(Y=y_{girl}) = 1/3$. However, the prior probabilities $P(X=x_1)$ and $P(X=x_0)$ cannot be obtained directly and need to be computed with the total probability formula. Before we look at the total probability formula, let's look at conditional probability.

3.1.3 Conditional probability

Conditional probability is the core of Bayes' theorem. The so-called conditional probability is the probability that event $A$ occurs given that event $B$ has occurred, written $P(A|B)$.

FIG. 1 Venn diagram

According to the Venn diagram, it can be clearly seen that when event $B$ has occurred, the probability that event $A$ also occurs is $P(A\cap B)$ divided by $P(B)$:

$$P(A|B) = \frac{P(A\cap B)}{P(B)}$$

As a result,

$$P(A\cap B) = P(A|B)\,P(B)$$

And by the same token,

$$P(A\cap B) = P(B|A)\,P(A)$$

So,

$$P(A|B)\,P(B) = P(B|A)\,P(A)$$

which gives

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

That’s the formula for conditional probability.
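
To make the formula concrete, here is a minimal Python sketch; the pair of events (based on a die roll) is made up purely for illustration. It estimates $P(A|B)$ by simulation and compares it with the exact value given by $P(A\cap B)/P(B)$:

```python
import random

random.seed(0)
n = 100_000

count_B = 0        # trials where B occurs (die roll is even)
count_A_and_B = 0  # trials where both A (roll > 3) and B occur

for _ in range(n):
    roll = random.randint(1, 6)
    if roll % 2 == 0:          # event B
        count_B += 1
        if roll > 3:           # event A
            count_A_and_B += 1

# Simulated conditional probability P(A|B)
p_A_given_B_sim = count_A_and_B / count_B

# Exact value from the formula P(A|B) = P(A ∩ B) / P(B) = (2/6) / (3/6)
p_A_given_B_exact = (2 / 6) / (3 / 6)

print(p_A_given_B_sim, p_A_given_B_exact)  # both close to 0.666...
```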

By rearranging the conditional probability formula, the following form can be obtained:

$$P(A|B) = P(A)\,\frac{P(B|A)}{P(B)}$$

We refer to $P(A)$ as the prior probability, that is, our judgment of the likelihood of $A$ before event $B$ occurs. $P(A|B)$ is called the posterior probability, that is, our re-evaluation of the likelihood of $A$ after event $B$ has occurred. $P(B|A)/P(B)$ is called the likelihood function (Likelihood); it is an adjustment factor that makes the estimated probability closer to the true probability. Therefore, conditional probability can be understood as the following formula:

Posterior probability = prior probability x adjustment factor

We first estimate a prior probability, then incorporate the experimental results to see whether the experiment strengthens or weakens the prior probability, arriving at a posterior probability that is closer to reality.

Here, if the likelihood function $P(B|A)/P(B) > 1$, the prior probability is strengthened and event $A$ becomes more likely; if the likelihood function equals 1, event $B$ provides no help in judging the likelihood of event $A$; if the likelihood function is less than 1, the prior probability is weakened and event $A$ becomes less likely.

Well, a more formal expression is as follows.

**Conditional probability**: the probability that event $X=x$ occurs given that event $Y=y$ has already occurred, written $P(X=x|Y=y)$. The conditional probability is computed as:

$$P(X=x|Y=y) = \frac{P(X=x,\,Y=y)}{P(Y=y)}$$

Here $P(X=x,\,Y=y)$ is the joint probability, the probability that the two events occur together, while $P(Y=y)$ and $P(X=x)$ are prior probabilities.

Let's use the example to illustrate this. "The probability that a male student at this school wears a military uniform is 1/2" means "given that the student is male, the probability of wearing a military uniform is 1/2", which is a conditional probability: $P(X=x_1|Y=y_{boy}) = 1/2$.

Similarly, "the probability that a female student wears a military uniform is 2/3" is also a conditional probability: $P(X=x_1|Y=y_{girl}) = 2/3$.

3.1.4 Total probability formula

Suppose the sample space $S$ is the union of two events $A$ and $A'$.

Figure 2

In the figure above, the red part is event A and the green part is event A’, which together form the sample space S. In this case, event B can be divided into two parts.

Figure 3

namely,

$$P(B) = P(B\cap A) + P(B\cap A')$$

From the derivation in the previous section, we know that

$$P(B\cap A) = P(B|A)\,P(A), \qquad P(B\cap A') = P(B|A')\,P(A')$$

So,

$$P(B) = P(B|A)\,P(A) + P(B|A')\,P(A')$$

This is the total probability formula. It means that if $A$ and $A'$ constitute a partition of the sample space, then the probability of event $B$ equals the sum, over these two events, of the probability of each event multiplied by the conditional probability of $B$ given that event. Substituting this into the conditional probability formula from the previous section yields another way of writing the conditional probability:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A')\,P(A')}$$

The formal expression is as follows.

**Total probability formula**: if the events $Y=y_1, Y=y_2, \dots, Y=y_n$ form a complete set of events, that is, they are mutually exclusive and their union is the whole sample space, then for the event $X=x$:

$$P(X=x) = \sum_{i=1}^{n} P(X=x|Y=y_i)\,P(Y=y_i)$$

Therefore, for the above example, the total probability formula gives:

$$P(X=x_1) = P(X=x_1|Y=y_{boy})\,P(Y=y_{boy}) + P(X=x_1|Y=y_{girl})\,P(Y=y_{girl}) = \frac{1}{2}\times\frac{2}{3} + \frac{2}{3}\times\frac{1}{3} = \frac{5}{9}$$

So, regardless of gender, the probability that a student on campus wears a military uniform is 5/9, and the probability of not wearing one is 4/9.
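
As a quick sanity check of this calculation, here is a minimal Python sketch; the variable names are only illustrative:

```python
# Priors and conditional probabilities from the example
p_boy, p_girl = 2 / 3, 1 / 3              # P(Y=y_boy), P(Y=y_girl)
p_uniform_given_boy = 1 / 2               # P(X=x_1 | Y=y_boy)
p_uniform_given_girl = 2 / 3              # P(X=x_1 | Y=y_girl)

# Total probability formula: P(X=x_1) = sum over y of P(X=x_1 | Y=y) * P(Y=y)
p_uniform = p_uniform_given_boy * p_boy + p_uniform_given_girl * p_girl
print(p_uniform)  # 0.5555... = 5/9
```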

3.1.5 Posterior probability

The posterior probability is the probability that, given that event $X=x$ has already occurred, it occurred because of event $Y=y$. In the example above, it answers the question "given that a person is wearing a military uniform, what is the probability that the person is male (or female)?" Formally, the posterior probability is $P(Y=y_{boy}|X=x_1)$.

The calculation of the posterior probability is based on the prior probability: the posterior probability can be computed from the prior probability and the likelihood function using Bayes' formula.

**Bayes' formula** is as follows:

$$P(Y=y_{boy}|X=x_1) = \frac{P(X=x_1|Y=y_{boy})\,P(Y=y_{boy})}{P(X=x_1)}$$

Here $P(Y=y_{boy}|X=x_1)$ is the posterior probability we want, $P(X=x_1|Y=y_{boy})$ is a conditional probability, and $P(Y=y_{boy})$ is a prior probability.

The Bayesian algorithm uses the information above to compute the posterior probability, and classification is carried out according to the value of the posterior probability.

Using the above example, the posterior probabilities are:

$$P(Y=y_{boy}|X=x_1) = \frac{\frac{1}{2}\times\frac{2}{3}}{\frac{5}{9}} = \frac{3}{5}, \qquad P(Y=y_{girl}|X=x_1) = \frac{\frac{2}{3}\times\frac{1}{3}}{\frac{5}{9}} = \frac{2}{5}$$

In other words, given that a person is wearing a military uniform, the probability that the student is male is 3/5 and the probability that the student is female is 2/5. If the question is "decide whether the student is male or female", it becomes a classification problem. Since the posterior probability computed by Bayes' formula is higher for male than for female, that is,

$$P(Y=y_{boy}|X=x_1) > P(Y=y_{girl}|X=x_1)$$

we can classify the student as male (in practice, when using naive Bayes for classification, the denominator need not be computed, since it is the same for every class).
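
Continuing the example, the whole classification step can be sketched in a few lines of Python (the names and numbers are just those of this toy example):

```python
# Priors and conditionals from the example
p_boy, p_girl = 2 / 3, 1 / 3
p_uniform_given_boy, p_uniform_given_girl = 1 / 2, 2 / 3

# Numerators of Bayes' formula for each class
score_boy = p_uniform_given_boy * p_boy     # (1/2) * (2/3) = 1/3
score_girl = p_uniform_given_girl * p_girl  # (2/3) * (1/3) = 2/9

# Full posteriors: divide by the total probability P(X=x_1) = 5/9
p_uniform = score_boy + score_girl
print(score_boy / p_uniform, score_girl / p_uniform)  # ~0.6, ~0.4

# For classification the denominator can be skipped: compare numerators only
print("boy" if score_boy > score_girl else "girl")     # boy
```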

So far, we have used an example to illustrate the basic steps and simple principles of classification with naive Bayes. Next we will discuss the principles of naive Bayes a little more formally; this is just a summary of what we have already seen, so don't be intimidated by the formulas. Okay, let's move on.

Reference: www.ruanyifeng.com/blog/2011/0…

3.2 Naive Bayes inference process

For the sample set:

$$T = \{(x^{(1)}, y_1), (x^{(2)}, y_2), \dots, (x^{(m)}, y_m)\}, \qquad x^{(i)} = (x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$$

where $m$ means there are $m$ samples and $n$ means there are $n$ features. $y_i,\ i=1,2,\dots,m$ denotes the class of the $i$-th sample, taking values in $\{C_1, C_2, \dots, C_K\}$.

**Prior probability**:

$$P(Y=C_k), \qquad k=1,2,\dots,K$$

**Conditional probability** (based on the conditional independence assumption):

$$P(X=x|Y=C_k) = P(X_1=x_1, \dots, X_n=x_n|Y=C_k) = \prod_{j=1}^{n} P(X_j=x_j|Y=C_k)$$

**Posterior probability**:

$$P(Y=C_k|X=x) = \frac{P(X=x|Y=C_k)\,P(Y=C_k)}{P(X=x)}$$

Substituting the conditional probability above and expanding the denominator with the total probability formula gives:

$$P(Y=C_k|X=x) = \frac{P(Y=C_k)\prod_{j=1}^{n} P(X_j=x_j|Y=C_k)}{\sum_{k=1}^{K} P(Y=C_k)\prod_{j=1}^{n} P(X_j=x_j|Y=C_k)}$$

The above formula is the basic formula of naive Bayes classification. Thus, the naive Bayes classifier can be expressed as:

$$y = f(x) = \arg\max_{C_k} \frac{P(Y=C_k)\prod_{j=1}^{n} P(X_j=x_j|Y=C_k)}{\sum_{k=1}^{K} P(Y=C_k)\prod_{j=1}^{n} P(X_j=x_j|Y=C_k)}$$

Note that since the denominator is the same for every $C_k$, the classifier can be simplified to:

$$y = \arg\max_{C_k} P(Y=C_k)\prod_{j=1}^{n} P(X_j=x_j|Y=C_k)$$

3.3 Parameter estimation of Naive Bayes

3.3.1 Maximum likelihood estimation

For the sample set, we can use maximum likelihood estimation to calculate the prior probability:

$$P(Y=C_k) = \frac{\sum_{i=1}^{m} I(y_i=C_k)}{m}, \qquad k=1,2,\dots,K$$

where $\sum_{i=1}^{m} I(y_i=C_k)$ counts the samples of class $C_k$; that is, the prior probability is the frequency of class $C_k$ in the sample set.

Let the set of possible values of the $j$-th feature $x_j$ be $\{a_{j1}, a_{j2}, \dots, a_{jS_j}\}$. The maximum likelihood estimate of the **conditional probability** is:

$$P(X_j=a_{jl}|Y=C_k) = \frac{\sum_{i=1}^{m} I(x^{(i)}_j=a_{jl},\ y_i=C_k)}{\sum_{i=1}^{m} I(y_i=C_k)}$$

That is, within the sub-sample set of samples belonging to class $C_k$, the conditional probability is the frequency of samples whose $j$-th feature takes the value $a_{jl}$.
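
As an illustration, here is a minimal Python sketch of these counting estimates on a tiny made-up data set (all feature values and labels are invented for the example):

```python
from collections import Counter, defaultdict

# Tiny made-up training set: each row is (features, label)
X = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"),
     ("rainy", "hot"), ("sunny", "hot")]
y = ["no", "yes", "yes", "no", "no"]
m = len(y)

# Prior probabilities: P(Y=C_k) = count(C_k) / m
class_counts = Counter(y)
priors = {c: class_counts[c] / m for c in class_counts}

# Conditional probabilities: P(X_j = a_jl | Y = C_k)
# = count(samples of class C_k whose feature j equals a_jl) / count(C_k)
cond_counts = defaultdict(Counter)   # (class, feature index) -> value counts
for xi, yi in zip(X, y):
    for j, value in enumerate(xi):
        cond_counts[(yi, j)][value] += 1

conditionals = {
    key: {v: n / class_counts[key[0]] for v, n in counter.items()}
    for key, counter in cond_counts.items()
}

print(priors)                   # {'no': 0.6, 'yes': 0.4}
print(conditionals[("no", 0)])  # {'sunny': 0.666..., 'rainy': 0.333...}
```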

3.3.2 Bayesian estimation

When maximum likelihood estimation is used, the probability value to be estimated may be 0, which affects the calculation of the posterior probability and biases the classification. To solve this problem, Bayesian estimation is adopted. The Bayesian estimate of the conditional probability is:

$$P_\lambda(X_j=a_{jl}|Y=C_k) = \frac{\sum_{i=1}^{m} I(x^{(i)}_j=a_{jl},\ y_i=C_k) + \lambda}{\sum_{i=1}^{m} I(y_i=C_k) + S_j\lambda}$$

where $S_j$ is the number of possible values of the $j$-th feature $X_j$. In the formula, $\lambda \ge 0$; this is equivalent to adding a positive count $\lambda$ to the frequency of every possible value of the random variable. When $\lambda = 0$ it reduces to the maximum likelihood estimate, and when $\lambda = 1$ it is called Laplace smoothing. For any $l=1,2,\dots,S_j$ and $k=1,2,\dots,K$:

$$P_\lambda(X_j=a_{jl}|Y=C_k) > 0, \qquad \sum_{l=1}^{S_j} P_\lambda(X_j=a_{jl}|Y=C_k) = 1$$

Similarly, the Bayesian estimate of the prior probability is:

$$P_\lambda(Y=C_k) = \frac{\sum_{i=1}^{m} I(y_i=C_k) + \lambda}{m + K\lambda}$$
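
As a small illustration of the effect of smoothing, the following Python sketch compares the maximum likelihood estimate with the Bayesian (Laplace, $\lambda=1$) estimate on some made-up counts:

```python
# Made-up counts: in class C_k there are 10 samples, and the j-th feature
# has S_j = 3 possible values with observed counts 7, 3 and 0.
counts = {"a_j1": 7, "a_j2": 3, "a_j3": 0}
n_k = sum(counts.values())   # number of samples in class C_k
S_j = len(counts)            # number of possible values of feature j
lam = 1.0                    # Laplace smoothing

for value, n in counts.items():
    mle = n / n_k                          # maximum likelihood estimate
    bayes = (n + lam) / (n_k + S_j * lam)  # Bayesian (smoothed) estimate
    print(value, mle, bayes)
# a_j3: the MLE is 0.0, while the smoothed estimate is 1/13 > 0
```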

3.4 Algorithm process of Naive Bayes

The following shows the procedure of the naive Bayes algorithm using maximum likelihood estimation for parameter estimation.

Algorithm (Naive Bayes algorithm)

The input: the training set

$$T = \{(x^{(1)}, y_1), (x^{(2)}, y_2), \dots, (x^{(m)}, y_m)\}$$

where $y_i,\ i=1,2,\dots,m$ denotes the class of the $i$-th sample, taking values in $\{C_1, C_2, \dots, C_K\}$.

The output: the class of the instance $x^{test}$ to be predicted.

(1) Compute the **prior probabilities**. Find the number of classes $K$, and for each class $C_k$ compute $P(Y=C_k)$, i.e., the frequency of class $C_k$ in the whole sample set:

$$P(Y=C_k) = \frac{\sum_{i=1}^{m} I(y_i=C_k)}{m}, \qquad k=1,2,\dots,K$$

(2) Compute the **conditional probabilities**. Divide the sample set into $K$ sub-sample sets by class. For the sub-sample set belonging to class $C_k$, compute the probability that feature $X_j$ takes the value $a_{jl}$, $P(X_j=a_{jl}|Y=C_k)$: it is the ratio of the number of samples in this subset whose $j$-th feature equals $a_{jl}$ to the total number of samples in this subset:

$$P(X_j=a_{jl}|Y=C_k) = \frac{\sum_{i=1}^{m} I(x^{(i)}_j=a_{jl},\ y_i=C_k)}{\sum_{i=1}^{m} I(y_i=C_k)}$$

(3) For the instance to be predicted, $x^{test}$, compute its (unnormalized) posterior probability for each class $C_k$:

$$P(Y=C_k)\prod_{j=1}^{n} P(X_j=x^{test}_j|Y=C_k), \qquad k=1,2,\dots,K$$

The class with the largest value is the predicted class of the instance.
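
Putting the three steps together, here is a minimal, self-contained Python sketch of this algorithm for categorical features; the data set, feature values, and function names are made up for illustration, and Laplace smoothing from Section 3.3.2 can be switched on by passing $\lambda>0$ as `lam`:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y, lam=0.0):
    """Estimate priors and conditionals by counting
    (lam=0 gives maximum likelihood, lam=1 gives Laplace smoothing)."""
    m, n = len(y), len(X[0])
    classes = sorted(set(y))
    class_counts = Counter(y)

    # Distinct values of each feature, needed for the smoothing denominator S_j
    feature_values = [sorted({xi[j] for xi in X}) for j in range(n)]

    # Step (1): prior probabilities P(Y=C_k)
    priors = {c: (class_counts[c] + lam) / (m + lam * len(classes))
              for c in classes}

    # Step (2): conditional probabilities P(X_j = a_jl | Y = C_k)
    value_counts = defaultdict(Counter)      # (class, j) -> Counter of values
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[(yi, j)][v] += 1

    conditionals = {}
    for c in classes:
        for j in range(n):
            S_j = len(feature_values[j])
            for v in feature_values[j]:
                conditionals[(c, j, v)] = (
                    (value_counts[(c, j)][v] + lam)
                    / (class_counts[c] + lam * S_j)
                )
    return priors, conditionals, classes

def predict(x, priors, conditionals, classes):
    """Step (3): pick the class with the largest unnormalized posterior."""
    scores = {}
    for c in classes:
        score = priors[c]
        for j, v in enumerate(x):
            # feature values never seen in training get probability 0 here
            score *= conditionals.get((c, j, v), 0.0)
        scores[c] = score
    return max(scores, key=scores.get)

# Tiny made-up example: predict gender from (clothing, shoes)
X = [("uniform", "boots"), ("uniform", "sneakers"), ("plain", "boots"),
     ("uniform", "boots"), ("plain", "sneakers"), ("uniform", "sneakers")]
y = ["boy", "boy", "boy", "boy", "girl", "girl"]

priors, conditionals, classes = train_naive_bayes(X, y, lam=1.0)
print(predict(("uniform", "boots"), priors, conditionals, classes))  # boy
```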

3.5 Algorithm summary of Naive Bayes

The main principles of the naive Bayes algorithm have been summarized above; here we summarize the advantages and disadvantages of the algorithm. The main advantages of naive Bayes are:

1) The naive Bayes model originates from classical mathematical theory and has stable classification efficiency.
2) It performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training; in particular, when the amount of data exceeds memory, training can be done incrementally in batches.
3) It is not sensitive to missing data and the algorithm is relatively simple; it is often used for text classification.

Main disadvantages of naive Bayes:

1) In theory, the naive Bayes model has the smallest error rate compared with other classification methods. In practice this is not always the case, because naive Bayes assumes that the attributes are independent of each other given the class, an assumption that often does not hold; when the number of attributes is large or the correlation between attributes is strong, the classification performance degrades. Naive Bayes performs best when the attribute correlations are small. For this reason, algorithms such as semi-naive Bayes improve on it modestly by taking partial correlations into account.
2) The prior probability must be known, and the prior often depends on assumptions; there can be many possible models for these assumptions, so in some cases the prediction performance is poor because of the assumed prior model.
3) Because the posterior probability is determined from the prior and the data, classification decisions have a certain error rate.
4) It is very sensitive to the representation of the input data.

Reference: [1] Li Hang, Statistical Learning Methods.

Video: www.youtube.com/watch?v=DNv…