Preface
This article grew out of small lectures I gave to new members of my data science and artificial intelligence group; the material has since been reorganized, revised, and trimmed. It aims to be one of the most comprehensive and easy-to-understand articles on naive Bayes on the web today.
Some small talk about Bayes
Both in everyday life and in scientific work we constantly estimate probabilities: the probability of rain tomorrow, the probability of winning the lottery, and so on. Probability is probability. But in artificial intelligence there are two distinct schools of thought about it, Bayesianism and frequentism. Frequentists hold that events have an inherent, objective frequency, and our job is to discover that natural frequency. Bayesians say that probabilities follow some distribution which we can estimate from data. These differences are beside the point, though; we are going to learn the Bayesian approach today simply because it works. What makes it work?
What makes it useful is that we can work out the inverse probability. Suppose one box holds three white balls and two red balls, and another box holds three red balls and two white balls. If we draw a red ball, we can reason backwards about which box the red ball most likely came from. In other words, if you know it rained today, you can guess what the weather was like yesterday. Of course, unless you have amnesia you already know yesterday's weather; the point is that in real life we often do not know the cause of an event, and that is exactly when Bayes' theorem comes in handy. And it is very useful.
Forward probability and backward probability
Even if you have never studied probability theory, you probably know the answer to this: if a box holds three white balls and two red balls, what is the probability of drawing a white ball, and what is the probability of drawing a red ball?
The obvious answer is 3/5 for a white ball and 2/5 for a red ball, and 3/5 + 2/5 = 1, since the drawn ball must be either white or red.
Now suppose there is a second box holding three red balls and two white balls, and we draw one ball from the two boxes combined. What is the probability of a white ball, and what is the probability of a red ball?
The answer is again obvious: the probability of a white ball is (3 + 2)/(5 + 5) = 1/2, and the probability of a red ball is also 1/2.
Now we can turn the question around. Suppose we drew a white ball: what is the probability that it came from the first box? And what is the probability that it came from the second box?
This is also easy to calculate. Flipping the question around, there are three white balls in the first box and two in the second, five white balls in total, so the probability that the white ball came from the first box is 3/(3 + 2) = 3/5, and the probability that it came from the second box is 2/5.
So which box did the white ball most likely come from? The first.
At this point Bayes' theorem has effectively appeared. Now we can turn the inverse reasoning we just did into mathematical language, a formula.
The probability that a white ball came from the first box equals the number of white balls in the first box divided by the total number of white balls.
Note that even though we are counting balls, to make the formula more general we convert the counts into probabilities:
P(white ball came from the first box) = P(white ball and first box) / P(white ball)
Next, let the ball's color be B and the box be A, so the white ball is B1 and the first box is A1. Changing the symbols is not enough, though; we also need to express the relationship between box and ball in mathematical form. "The white ball came from the first box" means: given the white ball B1, the probability that it came from box A1, written A1 | B1. Conversely, the proportion of white balls in the first box is B1 | A1. Concretely, by the multiplication rule, the joint probability is P(A1, B1) = P(A1) * P(B1 | A1).
What about "the number of white balls in the first box divided by the number of all white balls"? That is (total number of balls * proportion of balls in box one * proportion of white balls within box one) / (total number of balls * proportion of white balls among all balls), which as a formula is n * P(A1) * P(B1 | A1) / (n * P(B1)) = P(A1) * P(B1 | A1) / P(B1).
So we obtain: P(A1 | B1) = P(A1) * P(B1 | A1) / P(B1) = ((1/2) * (3/5)) / (1/2) = 3/5.
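To check the arithmetic, here is a minimal Python sketch of the two-box example; the names box1 and box2 are just labels for illustration.

# Minimal sketch: Bayes' rule on the two-box example.
# Box 1: 3 white, 2 red. Box 2: 2 white, 3 red. Each box is chosen with probability 1/2.
p_box = {"box1": 0.5, "box2": 0.5}                  # P(A)
p_white_given_box = {"box1": 3 / 5, "box2": 2 / 5}  # P(B1 | A)

# P(B1) = sum over boxes of P(A) * P(B1 | A)
p_white = sum(p_box[b] * p_white_given_box[b] for b in p_box)

# P(A1 | B1) = P(A1) * P(B1 | A1) / P(B1)
p_box1_given_white = p_box["box1"] * p_white_given_box["box1"] / p_white
print(p_white, p_box1_given_white)  # 0.5 and 0.6, i.e. 1/2 and 3/5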
At this point we are halfway to naive Bayes. What is the other half? The "naive" part, of course; but before that, it helps to abstract the formula a little further.
If we abstract A1 to y_i, an arbitrary class (the "box"), and B1 to x, the observed feature (the "ball"), then no matter how many boxes and balls there are, the same calculation works. Since what we want is the class with the highest probability, we can rewrite the formula as follows:
y = argmax_{y_i} P(y_i) * P(x | y_i) / P(x). Because we only want the candidate with the largest probability, and P(x) is the same for every y_i, the denominator P(x) can be dropped.
What if there are multiple features? This is where the "naive" in naive Bayes shows up: we simply assume the features are conditionally independent. Think of tossing coins. The probability that the first coin lands heads is one half; when you toss a second coin, its probability of heads or tails is still one half and has nothing to do with the first coin. So the probability that two coins both land tails is 1/2 * 1/2 = 1/4.
The same goes for our features: whatever value x1 takes has no bearing on the probability of x2, so we compute their joint probability simply by multiplying the individual conditional probabilities. At this point we get: y = argmax_y P(y) * P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y).
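As a sketch of this decision rule (with completely made-up class names and probabilities), it can be written directly in Python:

# Naive Bayes decision rule on two binary features; all numbers are illustrative.
# priors[y] is P(y); cond[y][i][v] is P(x_i = v | y).
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham":  [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}

def predict(x):
    # Score each class by P(y) * P(x_1 | y) * P(x_2 | y); P(x) is omitted
    # because it is the same for every class and cannot change the argmax.
    scores = {y: priors[y] * cond[y][0][x[0]] * cond[y][1][x[1]] for y in priors}
    return max(scores, key=scores.get), scores

print(predict((1, 0)))  # returns the class with the highest unnormalized posterior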
Laplace once said, "Probability theory is nothing more than common sense reduced to calculation." Laplace had a point.
Probability distributions
A probability distribution describes how the probability of an outcome varies. For a coin-toss experiment, for example, the outcome is either 0 or 1, and the binomial distribution describes how the number of heads varies as the tosses accumulate.
Then there is the Gaussian distribution, which is not determined by x alone but by a mean and a variance.
And there is the multinomial distribution; the ball example fits it, since each probability is simply a ratio of counts.
The binomial distribution arises from n independent repetitions of a Bernoulli trial. Each trial has only two possible outcomes, the two outcomes are mutually exclusive, each trial is independent of all the others, and the probability of the event stays constant across trials. Such a series of trials is called an n-fold Bernoulli experiment, and when n = 1 the binomial distribution reduces to the 0-1 (Bernoulli) distribution.
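The three distributions can be sketched with numpy's random generator; the parameter values below are only examples.

import numpy as np

rng = np.random.default_rng(0)

# Binomial: number of heads in 10 tosses of a fair coin, repeated 5 times.
heads = rng.binomial(n=10, p=0.5, size=5)

# Multinomial: 10 draws with replacement where white has probability 3/5
# and red has probability 2/5, as in the box example.
counts = rng.multinomial(n=10, pvals=[3 / 5, 2 / 5])

# Gaussian: determined by its mean and variance (here mean 0, standard deviation 1).
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(heads, counts, samples)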
Maximum likelihood estimation and the probability distributions of naive Bayes
Now look at Bayes' formula again and see whether anything is missing.
The obvious problem is that in the real world we do not know the distribution of the conditional probability P(x | y), the link from x to y.
Although P(y) can be obtained directly from the data, P(x | y), the distribution of x conditioned on y, is unknown: the relationship between x and y might be a binomial distribution, a multinomial distribution, or a Gaussian distribution. At this point we have to choose a distribution to compute with.
What probability distributions are there? The binomial, the multinomial, and the Gaussian. Suppose we assume the class-conditional distribution is multinomial; then the parameters \theta_{yi} are estimated by a smoothed maximum likelihood estimate, i.e. smoothed relative frequency counts:
\hat{\theta}_{yi} = (N_{yi} + \alpha) / (N_y + \alpha n)
What is special about this formula is that there are n feature counts, we add \alpha to each count in the numerator, and we add \alpha n to the denominator. This avoids the situation where, when the n conditional probabilities are multiplied together, one of them happens to be zero: the whole product would then be zero and the result would be wrong.
What we ultimately want is only the class with the maximum probability, and adding \alpha to each count and \alpha n to the denominator does not change the ordering of the results, so the zero-probability problem is solved. When \alpha = 1 this is called Laplace smoothing.
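A minimal sketch of the smoothed frequency count, assuming a small toy vocabulary; smoothed_probs is a hypothetical helper written just for this example, not a library function.

from collections import Counter

def smoothed_probs(values, vocabulary, alpha=1.0):
    # Add alpha to every count so values never seen in the data still get a
    # small nonzero probability; alpha = 1 is Laplace smoothing.
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}

# "c" never appears, but its smoothed probability is nonzero instead of 0.
print(smoothed_probs(["a", "a", "b"], vocabulary=["a", "b", "c"]))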
“This is what I said” – Laplace “this is not what I said” – Lu Xun
For the other conditional probabilities, other distribution formulas are used. With the Gaussian distribution, for example, the parameters \sigma_y and \mu_y are estimated by maximum likelihood, and the likelihood is
P(x_i | y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)
Or, with the Bernoulli (0-1) distribution:
P(x_i | y) = P(i | y) x_i + (1 - P(i | y))(1 - x_i)
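A tiny sketch of that Bernoulli likelihood with illustrative probabilities; bernoulli_likelihood is a hypothetical helper written just for this example.

def bernoulli_likelihood(x, p_given_y):
    # x is a binary feature vector; p_given_y[i] is P(x_i = 1 | y).
    # Multiply P(x_i | y) = p_i * x_i + (1 - p_i) * (1 - x_i) over all features.
    likelihood = 1.0
    for xi, pi in zip(x, p_given_y):
        likelihood *= pi * xi + (1 - pi) * (1 - xi)
    return likelihood

print(bernoulli_likelihood(x=[1, 0, 1], p_given_y=[0.8, 0.3, 0.5]))  # 0.8 * 0.7 * 0.5 = 0.28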
So what is maximum likelihood estimation?
Maximum likelihood estimation simply takes, as its answer, the parameter values that best match the observed data. For the Gaussian distribution, for example, even though we do not know the true variance and mean, we can take the variance and mean of the data we actually observe and use them as the estimates.
Say we flip a special coin whose probability of heads is 0.6, although we do not know that. Suppose one experiment is ten tosses in a row, we run four experiments, and the numbers of heads are {3, 5, 7, 8}. Only "God" knows the true probability is 0.6, but humans have human methods: we can estimate the probability from the experimental results, (3 + 5 + 7 + 8)/(10 * 4) = 0.575. Following maximum likelihood estimation, we take 0.575 as the probability, which is a good approximation.
And as the number of experiments increases, our estimate naturally gets closer and closer to the true probability.
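Here is a small sketch of the coin example, plus a simulation (with an assumed true probability of 0.6) showing the estimate drifting toward the true value as the number of experiments grows.

import numpy as np

rng = np.random.default_rng(42)
true_p = 0.6

heads = [3, 5, 7, 8]                   # the four experiments from the text
print(sum(heads) / (10 * len(heads)))  # 0.575, the maximum likelihood estimate

for n_experiments in (4, 40, 400, 4000):
    counts = rng.binomial(n=10, p=true_p, size=n_experiments)
    print(n_experiments, counts.sum() / (10 * n_experiments))  # approaches 0.6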
Error analysis
Although the naive Bayes algorithm described above works well in many fields, beyond the errors common to all machine learning methods, naive Bayes has some substantial sources of error of its own. The first is the numerical error of multiplication.
The numerator of our decision rule is a product of many small decimals, which can underflow to zero. When there are many features we can switch to the log-likelihood, i.e. log P(y) + \sum_i log P(x_i | y): since log(abc) = log(a) + log(b) + log(c), the product of small decimals becomes a sum of logs, and the underflow problem of multiplying decimals disappears.
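A short sketch of the underflow problem and the log fix; the probability values are artificial.

import math

probs = [1e-5] * 100          # pretend these are P(x_i | y) for 100 features

product = 1.0
for p in probs:
    product *= p              # underflows to 0.0 in 64-bit floats

log_sum = sum(math.log(p) for p in probs)  # log(abc) = log(a) + log(b) + log(c)

print(product)   # 0.0
print(log_sum)   # about -1151.3, still perfectly usable for comparing classes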
In addition, the modeling error of the algorithm itself can be large. The most basic assumption, the "naive" one, is that the features are conditionally independent (each following its assumed distribution). If there are dependencies between features, we may need other algorithms, such as TAN, SPODE, or AODE.
Naive Bayes classifiers in sklearn
The sklearn documentation reads as follows:
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality. On the other hand, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba should not be taken too seriously.
The three naive Bayes classifiers in sklearn are the Gaussian naive Bayes classifier (GaussianNB), the multinomial naive Bayes classifier (MultinomialNB), and the Bernoulli naive Bayes classifier (BernoulliNB, corresponding to the two-point / Bernoulli distribution).
They are used as follows:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()  # the iris dataset used throughout this example
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
Note that the Gaussian naive Bayes classifier has essentially no hyperparameters to tune, but its fit method accepts sample weights through the sample_weight argument. The multinomial naive Bayes classifier has a prior smoothing factor alpha, and the Bernoulli naive Bayes classifier has both a smoothing factor alpha and a binarize threshold parameter.
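Here is a minimal sketch of all three classifiers and the parameters just mentioned, reusing the iris data from above; the uniform weights and the binarize threshold of 3.0 are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

iris = load_iris()
X, y = iris.data, iris.target

# GaussianNB: no smoothing factor to tune, but fit() accepts sample weights.
weights = np.ones(len(y))
gnb = GaussianNB().fit(X, y, sample_weight=weights)

# MultinomialNB: alpha is the prior smoothing factor (1.0 = Laplace smoothing).
mnb = MultinomialNB(alpha=1.0).fit(X, y)

# BernoulliNB: alpha plus a binarize threshold that turns features into 0/1.
bnb = BernoulliNB(alpha=1.0, binarize=3.0).fit(X, y)

print(gnb.score(X, y), mnb.score(X, y), bnb.score(X, y))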
They also support incremental training through the partial_fit method.
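A sketch of incremental training with partial_fit: the data is shuffled and fed in batches, and the full set of classes is passed so the classifier knows them from the first batch.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
rng = np.random.default_rng(0)
perm = rng.permutation(len(iris.target))   # shuffle so each batch mixes classes
X, y = iris.data[perm], iris.target[perm]

clf = GaussianNB()
for X_batch, y_batch in zip(np.array_split(X, 3), np.array_split(y, 3)):
    clf.partial_fit(X_batch, y_batch, classes=np.unique(iris.target))

print(clf.score(X, y))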
References
[1] Professor Zhou Zhihua
[2] scikit-learn official documentation