Introduction

I read an article about a programmer abroad who wrote some great shell scripts, including one that texted his wife when he got off work at night, one that automatically brewed coffee, and one that automatically scanned emails sent by a DBA. So I wanted to do something interesting with what I had learned. My plan is as follows:

  1. First, write a Scrapy spider to scrape jokes from a website
  2. Then write a shell script that automatically grabs the latest jokes at 6 a.m. every day
  3. Then use a naive Bayes model to determine whether each joke is an adult joke
  4. If it is an adult joke, have the script send it to your buddy’s email automatically
  5. If it is not an adult joke, have the script send it to your girlfriend’s email automatically

In this series of articles, you will learn:

  1. The mathematics behind naive Bayes
  2. How the algorithm works step by step (with examples)
  3. How to implement the algorithm quickly with scikit-learn

This series will also touch on Scrapy, pandas, and NumPy. I will focus on the implementation of step 3 above in detail and put the rest of the source code on GitHub for your reference. Now, let’s enjoy this amazing journey.
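As a quick preview of the scikit-learn part (item 3 above), here is a minimal sketch, with made-up sample texts and labels, of how a bag-of-words naive Bayes classifier might be wired up; the real joke classifier is built later in the series.

```python
# A minimal sketch (not the final joke classifier): bag-of-words + naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = adult joke, 0 = harmless joke.
texts = ["a clean pun about cats",
         "a dirty joke for adults only",
         "another harmless joke about dogs"]
labels = [0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # word-count feature vectors

clf = MultinomialNB(alpha=1.0)        # alpha=1.0 is Laplace smoothing (see below)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["a new joke about cats"])))
```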

Overview of naive Bayes

As we all know, naive Bayes is a simple but very powerful linear classifier. It has been very successful at filtering spam and diagnosing diseases. It is called “naive” only because it assumes that the features are independent of each other, an assumption that rarely holds in real life. Yet even when the assumption is violated, it still performs very well, especially on small samples. However, when there are strong correlations between features or the classification problem is nonlinear, the naive Bayes model classifies poorly.

The mathematics behind naive Bayes

Posterior probability

In order to better understand how the naive Bayes classifier works, we first need to understand Bayes’ rule. It can be stated as the following formula:

$$P(\text{outcome} \mid \text{phenomenon}) = \frac{P(\text{phenomenon} \mid \text{outcome}) \, P(\text{outcome})}{P(\text{phenomenon})}$$

We can explain the above formula with the example of predicting whether or not it will rain.

We can see that if we want to predict the probability of an outcome given an observed phenomenon, we must know: 1. the probability of the phenomenon given that outcome; 2. the probability of the outcome itself; 3. the probability of the phenomenon.
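As a worked illustration (the numbers below are invented purely for this example), suppose the outcome is “rain” and the phenomenon is “dark clouds in the morning”, with $P(\text{clouds} \mid \text{rain}) = 0.8$, $P(\text{rain}) = 0.3$, and $P(\text{clouds}) = 0.4$. Then:

$$P(\text{rain} \mid \text{clouds}) = \frac{P(\text{clouds} \mid \text{rain}) \, P(\text{rain})}{P(\text{clouds})} = \frac{0.8 \times 0.3}{0.4} = 0.6$$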

In practical applications, we never have just one phenomenon; in spam classification, for example, the feature vector may contain thousands of words. Here I will introduce some mathematical notation to unify the rest of the discussion: let $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ be a feature vector with $d$ features, and let $\omega_1, \omega_2, \ldots, \omega_m$ be the possible classes.

The goal of naive Bayes is to compute the posterior probability $P(\omega_j \mid \mathbf{x})$ for each class separately and choose the class with the largest one:

$$\hat{\omega} = \arg\max_{j} P(\omega_j \mid \mathbf{x}) = \arg\max_{j} \frac{P(\mathbf{x} \mid \omega_j) \, P(\omega_j)}{P(\mathbf{x})}$$

Now I’m going to explain each of the three probabilities on the right-hand side in turn.

Conditional probability

Two random variables are independent if knowing that one of them has occurred does not change your belief about how likely the other is. The simplest example is flipping a coin: the result of the first flip does not affect the probability of getting heads on the second flip (which stays at 0.5).

In the naive Bayes model, the features are assumed to be conditionally independent given the class. For example, for a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, I can factor the class-conditional probability as follows:

$$P(\mathbf{x} \mid \omega_j) = P(x_1 \mid \omega_j) \, P(x_2 \mid \omega_j) \cdots P(x_d \mid \omega_j) = \prod_{i=1}^{d} P(x_i \mid \omega_j)$$

The probability $P(x_i \mid \omega_j)$ can be understood as: the probability of observing feature $x_i$ given that the sample belongs to class $\omega_j$. The probability of each feature in the feature vector can be obtained by maximum-likelihood estimation, which simply computes the frequency of that feature within a class. The formula is as follows:

$$\hat{P}(x_i \mid \omega_j) = \frac{N_{x_i, \omega_j}}{N_{\omega_j}}$$

where $N_{x_i, \omega_j}$ is the number of samples of class $\omega_j$ that contain feature $x_i$, and $N_{\omega_j}$ is the total number of samples of class $\omega_j$.

Now, I’ll illustrate the above concepts with a simple spam-classification example. Say I have four emails, two of which are spam, and the feature vector contains four features. The details are as follows:

| Sample | love | buy | deal | cat | Spam? |
|--------|------|-----|------|-----|-------|
| 1      | 1    | 0   | 0    | 1   | no    |
| 2      | 0    | 1   | 1    | 0   | yes   |
| 3      | 0    | 1   | 0    | 1   | yes   |
| 4      | 1    | 0   | 1    | 1   | no    |

Now suppose, taking the feature vector of sample 2 as an example, I want to find $P(\mathbf{x} = (0, 1, 1, 0) \mid \text{spam})$. By conditional independence and maximum-likelihood estimation from the table, we can find the probability:

$$P(\mathbf{x} \mid \text{spam}) = P(\text{love}=0 \mid \text{spam}) \cdot P(\text{buy}=1 \mid \text{spam}) \cdot P(\text{deal}=1 \mid \text{spam}) \cdot P(\text{cat}=0 \mid \text{spam}) = \frac{2}{2} \cdot \frac{2}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$
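To make the arithmetic concrete, here is a small sketch that reproduces these maximum-likelihood estimates directly from the table above (the variable names are my own):

```python
import numpy as np

# The four emails from the table: columns are love, buy, deal, cat.
X = np.array([
    [1, 0, 0, 1],   # sample 1, not spam
    [0, 1, 1, 0],   # sample 2, spam
    [0, 1, 0, 1],   # sample 3, spam
    [1, 0, 1, 1],   # sample 4, not spam
])
y = np.array([0, 1, 1, 0])          # 1 = spam, 0 = not spam
features = ["love", "buy", "deal", "cat"]

spam = X[y == 1]                    # rows labelled spam

# Maximum-likelihood estimate: P(word=1 | spam) = count in spam / number of spam samples
p_word_given_spam = spam.sum(axis=0) / len(spam)
for name, p in zip(features, p_word_given_spam):
    print(f"P({name}=1 | spam) = {p}")

# Conditional independence: P(x | spam) for sample 2's feature vector (0, 1, 1, 0)
x = np.array([0, 1, 1, 0])
p_x_given_spam = np.prod(np.where(x == 1, p_word_given_spam, 1 - p_word_given_spam))
print("P(x | spam) =", p_x_given_spam)   # 1 * 1 * 0.5 * 0.5 = 0.25
```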

Prior probability

In fact, the idea behind the prior probability is very simple. In the spam example above, we can use maximum-likelihood estimation to obtain $P(\text{spam}) = \frac{2}{4} = \frac{1}{2}$. The general formula is as follows:

$$\hat{P}(\omega_j) = \frac{N_{\omega_j}}{N}$$

where $N_{\omega_j}$ is the number of samples of class $\omega_j$ and $N$ is the total number of samples.

From the posterior-probability formula above, we can see that if the prior follows a uniform distribution, the posterior depends entirely on the conditional probability and the evidence probability. And since the evidence probability is a constant, the posterior then depends entirely on the conditional probability.
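To spell this out in the notation introduced earlier: if the prior is uniform over the $m$ classes, it drops out of the comparison, leaving only the conditional probability:

$$P(\omega_j) = \frac{1}{m} \quad\Rightarrow\quad \arg\max_{j} \frac{P(\mathbf{x} \mid \omega_j) \, P(\omega_j)}{P(\mathbf{x})} = \arg\max_{j} P(\mathbf{x} \mid \omega_j)$$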

Note: I think that in some classification applications, the prior probabilities should come from experts in the application field; you cannot derive them purely from the frequencies observed in the sample. For example, if my training set happens to contain more rainy-day samples than sunny-day samples, that does not mean rainy days are more likely than sunny days in real life.

Evidence probability

The evidence probability is class-independent. For example, in the spam example above, if I want to know the probability of the phenomenon “deal appears”, I just need to compute how often “deal” appears across all of my samples, regardless of which class they belong to.

In fact, we do not need to compute this probability at all: we only want to know which class has the highest posterior probability, and the evidence term is the same constant for every class, so it does not affect the final decision.
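In the notation above, the decision rule therefore simplifies by dropping the evidence term:

$$\hat{\omega} = \arg\max_{j} \frac{P(\mathbf{x} \mid \omega_j) \, P(\omega_j)}{P(\mathbf{x})} = \arg\max_{j} P(\mathbf{x} \mid \omega_j) \, P(\omega_j)$$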

Additive smoothing

Returning to the email example above, suppose we want to find $P(\text{love}=1 \mid \text{spam})$. The word “love” does not appear in any spam sample, so this conditional probability is 0, and if any one conditional probability is 0, the whole posterior probability becomes 0. To avoid zero probabilities, we can add a smoothing term and change the conditional-probability formula above to the following form:

$$\hat{P}(x_i \mid \omega_j) = \frac{N_{x_i, \omega_j} + \alpha}{N_{\omega_j} + \alpha d}$$

  • $\alpha$ is the additive smoothing parameter: $\alpha < 1$ is called Lidstone smoothing, and $\alpha = 1$ is called Laplace smoothing
  • $d$ is the number of features
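As a quick sketch of the effect, using the table above and the formula just given (the choice of $\alpha$ and the variable names are my own):

```python
import numpy as np

# Spam rows of the table above: columns are love, buy, deal, cat.
spam = np.array([
    [0, 1, 1, 0],   # sample 2
    [0, 1, 0, 1],   # sample 3
])

alpha = 1.0                     # Laplace smoothing; 0 < alpha < 1 would be Lidstone
d = spam.shape[1]               # number of features (4)
counts = spam.sum(axis=0)       # how often each word appears in spam

p_mle = counts / len(spam)                               # unsmoothed estimates
p_smoothed = (counts + alpha) / (len(spam) + alpha * d)  # smoothed estimates

print("P(love=1 | spam) without smoothing:", p_mle[0])       # 0.0
print("P(love=1 | spam) with smoothing:   ", p_smoothed[0])  # (0 + 1) / (2 + 4) ≈ 0.167
```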

Conclusion

Now we have covered the mathematical background of naive Bayes and how its algorithm works. In the next article, I will focus on the naive Bayes model and techniques for text classification, laying a good foundation for our classification of adult jokes.