Today’s article is about the naive Bayes model, one of the classic models in machine learning. It is very simple and well suited for beginners.

The naive Bayes model, as the name suggests, is closely related to Bayes’ theorem. We introduced Bayes’ theorem in our three doors article, so let’s briefly review Bayes’ formula:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

We treat P(A) and P(B) as prior probabilities; Bayes’ formula then computes a posterior probability from priors and a conditional probability. In other words, it reasons from effect back to cause: we explore the causes of an event based on what has already happened. The naive Bayes model is built on this principle, and the principle is so simple that it can be summed up in one sentence: when a sample may belong to several categories, we simply choose the one with the highest probability.

Since the task is to select the category a sample belongs to, the naive Bayes model is obviously a classification algorithm.

Before we get into the details of the algorithm, let’s go over a few concepts. Some of them were introduced in previous articles, so treat this as a review.


Prior probability


Prior probability is actually pretty easy to understand, and we can take the term at face value. To put it bluntly, a prior probability is one we can work out beforehand, such as the probability of a coin landing heads, of hitting a red light, or of rain tomorrow.

Some of these probabilities can be obtained experimentally, and some can be estimated from past experience. Either way, the probabilities of these events are relatively clear: they can be determined before the model does any exploring, which is why they are called prior probabilities.


Posterior probability


Posterior probability is, intuitively, the opposite of prior probability: it is something we cannot get directly from experiments or past experience. It refers to the probability that an event was caused by one particular cause rather than another.

For example, the probability of a student passing an exam can be measured: we can test a student multiple times, or test students in batches. But suppose students can choose either to review or to play games before the exam. Obviously, reviewing will raise the probability of passing, while playing games may lower it or may not change it much; we don’t know. Now suppose we know that Xiao Ming passed the exam, and we want to know whether he reviewed beforehand. That is a posterior probability.

Logically, it is the mirror image of conditional probability. A conditional probability is the probability that event B occurs given that event A occurs, while a posterior probability is the probability that event A occurred given that event B is known to have occurred.


Likelihood estimation


This is another tricky term, and hardly any article about Bayes fails to mention it. Yet few articles actually make the concept clear.

Likelihood is semantically very close to probability; the distinction was probably only made in translation. The mathematical expressions of the two are also very close: both can be written as P(x | θ).

Probability asks: given the parameter θ, what is the probability that event x occurs? Likelihood, in contrast, focuses on the parameter θ given that event x has occurred. So, naturally, likelihood estimation is about estimating parameters from observed outcomes, and the maximum likelihood estimate is the value of θ that makes event x most likely to occur.

To take a very simple example, suppose we have an opaque box containing some black balls and some white balls, but we don’t know how many of each. To explore the ratio, we draw 10 balls from the box, one at a time with replacement. Suppose the result is 7 black and 3 white. What is the proportion of black balls in the box?

This question couldn’t be easier; isn’t it a primary-school problem? Since 7 of the 10 balls drawn are black, the proportion of black balls is 70%. What’s wrong with that?

On the surface nothing is wrong, but it isn’t quite right. The result of an experiment does not represent the probability itself. Put simply, a box with 70% black balls can produce 7 black and 3 white, but so can a box with 50% black balls. How can we be sure the proportion of black balls must be 70%?
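
To make this concrete, here is a minimal sketch (the 70% and 50% figures are the ones from the example above) that computes how likely the observed result of 7 black out of 10 is under each assumed proportion:

from math import comb

# Probability of drawing exactly 7 black balls in 10 draws (with replacement)
# under two different assumed proportions of black balls in the box.
for p in (0.7, 0.5):
    prob = comb(10, 7) * p**7 * (1 - p)**3
    print(f"P(7 black out of 10 | proportion = {p}) = {prob:.4f}")

# About 0.2668 for p = 0.7 and 0.1172 for p = 0.5: both boxes can produce
# the observed result, so the observation alone does not pin down the ratio.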

This is where the likelihood function comes in.


Likelihood function


We plug the black-and-white ball experiment into the likelihood framework above. The final result of the experiment is event X, and what we are trying to figure out, the proportion of black balls, is the parameter θ. Since we draw with replacement, the probability of drawing a black ball stays constant, so according to the binomial distribution we can write the probability of event X as:

$$f(\theta) = \theta^{7}(1 - \theta)^{3}$$

This is our likelihood function, or probability function. It reflects the probability of event X occurring under different values of the parameter θ. What we need to do is find the value of θ that maximizes this function.

This is a very simple calculation: we take the derivative of f(θ), set the derivative equal to 0, and solve for θ. The result, of course, is that the function reaches its maximum at θ = 0.7.
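
If you would rather not do the calculus by hand, here is a small sketch (using sympy, which the original article does not use) that carries out the same steps symbolically:

import sympy as sp

# Symbolically maximize the likelihood f(theta) = theta^7 * (1 - theta)^3.
theta = sp.symbols('theta')
f = theta**7 * (1 - theta)**3

critical_points = sp.solve(sp.diff(f, theta), theta)  # solve f'(theta) = 0
print(critical_points)  # the solutions are 0, 7/10 and 1; the likelihood peaks at theta = 7/10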

We can also plot this function of θ to get an intuitive sense of its shape:

import numpy as np
import matplotlib.pyplot as plt

# evaluate the likelihood f(theta) = theta^7 * (1 - theta)^3 over theta in [0, 1]
x = np.linspace(0, 1, 100)
y = np.power(x, 7) * np.power(1 - x, 3)

plt.plot(x, y)
plt.xlabel('value of theta')
plt.ylabel('value of f(theta)')
plt.show()

So this shows our intuition was right, but for a subtler reason: it is not that drawing 70% black balls means the proportion of black balls in the box is 70%; rather, a proportion of 70% is the value that makes the result we observed most likely.


The model


Now comes the big part, which is Bayes’ formula:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

We then apply a variation of the formula. Assume that the set of all events that could lead to B is C; clearly, if C contains m events, it can be written as C = {A_1, A_2, ..., A_m}.

then

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{m} P(B \mid A_j)\,P(A_j)}$$

When we look for the cause of event B, we consider every possible cause in the set C, compute the probability of each, and then select the one with the highest probability as the answer.
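
As a minimal sketch (the numbers are made up purely for illustration), suppose there are two possible causes A1 and A2 with known priors and conditional probabilities; picking the most probable cause looks like this:

# Hypothetical priors P(A_i) and conditionals P(B | A_i), made up for illustration.
priors = {'A1': 0.3, 'A2': 0.7}
cond = {'A1': 0.8, 'A2': 0.2}   # P(B | A_i)

# Denominator of Bayes' formula: P(B) = sum_j P(B | A_j) * P(A_j)
p_b = sum(cond[a] * priors[a] for a in priors)

# Posterior P(A_i | B) for each cause, then pick the most probable one.
posteriors = {a: cond[a] * priors[a] / p_b for a in priors}
print(posteriors)                       # {'A1': ~0.632, 'A2': ~0.368}
print(max(posteriors, key=posteriors.get))  # 'A1'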

Classification works the same way: for a sample x, we compute the probability that it falls into each category and pick the category with the highest probability as the final prediction. This naive idea is the basis of the naive Bayes model.

We assume that X = (a_1, a_2, ..., a_n), where each a_j represents the feature of one dimension of sample X. Similarly, we have a set of categories Y = {y_1, y_2, ..., y_k}, where each y_i represents a specific category. All we have to do is figure out the probability that X belongs to each category y_i, and choose the one with the highest probability as the final classification result.

We write the probability formula according to Bayes’ formula:

$$P(y_i \mid X) = \frac{P(X \mid y_i)\,P(y_i)}{P(X)}$$

Here P(X) is a constant: it is the same for every y_i, so we can ignore it. We just have to focus on the numerator.

Here we make an important assumption: we assume that the feature values of the different dimensions of sample X are independent of each other.

It is a very naive assumption, but it is also a very important one; without it, the probability would be so complicated that it would be almost impossible to compute. It is precisely because of this naive assumption that the model is called the naive Bayes model. Of course, the English name is naive Bayes, so in theory the “naive” could just as well be read as “innocent”.

Now that we have this assumption, we can just expand the formula:

$$P(X \mid y_i)\,P(y_i) = P(y_i)\prod_{j=1}^{n} P(a_j \mid y_i)$$

Here P(y_i) is a prior probability, which we can obtain from experiments or past experience, while P(a_j | y_i) cannot be obtained directly, so we need to estimate it statistically.

If a_j is discrete, it is easy: we just count, among the samples in which y_i occurs, the proportion in which a_j takes each value. Say that in our data y_i occurs N times, and among those samples a_j takes the value in question M times; then obviously:

$$P(a_j \mid y_i) = \frac{M}{N}$$

To prevent M = 0 (which would make the whole product zero), we can add a smoothing parameter λ to the numerator and denominator, so the final result is written as:

$$P(a_j \mid y_i) = \frac{M + \lambda}{N + S\lambda}$$

where S is the number of distinct values that a_j can take (λ = 1 is the common Laplace smoothing).
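
As a small sketch (with made-up observations and λ = 1), estimating this smoothed conditional probability from counts might look like this:

from collections import Counter

# Hypothetical values of feature a_j among the N samples labeled y_i.
samples_with_yi = ['sunny', 'rainy', 'sunny', 'sunny', 'rainy']

counts = Counter(samples_with_yi)
N = len(samples_with_yi)
S = 3        # number of values a_j can take: sunny / rainy / cloudy
lam = 1.0    # smoothing parameter; lambda = 1 is Laplace smoothing

def p_feature_given_class(value):
    M = counts[value]                  # times a_j == value among the y_i samples
    return (M + lam) / (N + S * lam)

print(p_feature_given_class('sunny'))   # (3 + 1) / (5 + 3) = 0.5
print(p_feature_given_class('cloudy'))  # unseen value: (0 + 1) / (5 + 3) = 0.125
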
But what if a_j is a continuous value? A continuous feature can take infinitely many values, so obviously we cannot count a probability for every one of them; it is impossible to collect that many samples. What should we do in that case?

Continuous values are not a problem: we can assume the variable follows a normal distribution, and its normal distribution curve is then the probability distribution of the variable.

Taking the normal distribution curve as an example, the cumulative percentage at a point x is the area under the curve between negative infinity and x. This area lies between 0 and 1, and we can use the density f(x) to represent the probability of the feature taking the value x. In fact, assuming that the variables in different dimensions are normally distributed is also the idea behind the Gaussian mixture model (GMM).

In other words, for a discrete feature we represent the probability by a counted ratio, and for a continuous feature we compute it from the assumed normal distribution. Either way, we can obtain P(X | y_i) as the product of the n per-feature probabilities, and finally we compare the probabilities across all categories y and choose the highest one as the classification result.
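
For the continuous case, a minimal sketch might look like this (the mean and standard deviation here are made-up numbers standing in for statistics estimated from the training samples of one class):

import math

def gaussian_density(x, mu, sigma):
    # Normal probability density N(mu, sigma^2) evaluated at x,
    # used as P(a_j | y_i) for a continuous feature.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical per-class statistics for one continuous feature.
mu, sigma = 170.0, 8.0
print(gaussian_density(175.0, mu, sigma))   # density of the feature value 175 under this class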

All of this works, but there is one small problem.

Each P(a_j | y_i) is a floating-point number, and it may be very small, yet we need to compute the product of n of them. Because of limited floating-point precision, once the product falls below that precision we can no longer compare the sizes of two probabilities.

To solve this problem, we apply a transformation to the floating-point product: we take the logarithm of both sides, which turns the product of many floating-point numbers into a sum:

$$\log\Big(P(y_i)\prod_{j=1}^{n} P(a_j \mid y_i)\Big) = \log P(y_i) + \sum_{j=1}^{n} \log P(a_j \mid y_i)$$

Since the logarithm is a monotonically increasing function, we can compare the log-transformed results directly, which avoids the precision problem.
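
Putting the pieces together, here is a minimal sketch of the scoring step (all the numbers are made up; a real model would estimate the priors and per-feature probabilities from training data as described above):

import math

# Hypothetical estimated probabilities for two classes and three feature values of one sample.
priors = {'spam': 0.4, 'ham': 0.6}
feature_probs = {
    'spam': [0.05, 0.002, 0.01],   # P(a_j | spam) for each of the sample's features
    'ham':  [0.04, 0.03,  0.001],  # P(a_j | ham)
}

def log_score(label):
    # log P(y_i) + sum_j log P(a_j | y_i): the log of the Bayes numerator.
    return math.log(priors[label]) + sum(math.log(p) for p in feature_probs[label])

scores = {label: log_score(label) for label in priors}
print(scores)
print(max(scores, key=scores.get))   # predicted class: the one with the highest score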

That covers the principle of the naive Bayes model. Its application to text classification will be shared in a following article.

Writing these articles is not easy. If you got something out of this one, please consider following.