Naive Bayes is a classification method based on Bayes’ theorem and the assumption of conditional independence between features. As a supervised learning method grounded in probability theory, it is widely used in natural language processing and plays an important role in machine learning. In an earlier project, a naive Bayes classifier was applied to the analysis and processing of sentiment words and achieved good results. This article introduces the theoretical basis of naive Bayes classification and then walks through its practical use.

Before studying naive Bayes classification and starting the sentiment analysis itself, we first need to understand its mathematical basis: Bayes’ theorem.

Bayes’ theorem

Bayes’ theorem is about the conditional probability of random events A and B, and the formula is as follows:


P(A|B) = \frac{P(A)P(B|A)}{P(B)}

In the above formula, each term represents the following meaning:

  • P(A): the prior probability, i.e. the probability that event A occurs without any additional conditions, also called the base probability; it can be seen as an initial, subjective estimate of how likely A is
  • P(A|B): the probability that A occurs given that B has occurred, also called the posterior probability
  • P(B|A): the likelihood, also called the conditional likelihood
  • P(B): the probability that B occurs regardless of whether A occurs, also called the marginal likelihood or the normalizing constant

According to the above explanation, Bayes’ theorem can be expressed as:

Posterior probability = prior probability × likelihood / normalizing constant

Generally speaking, when we cannot directly determine the probability of an event, we can use the probabilities of related, observable events that reflect its essential attributes to estimate it. Put simply, the more supporting evidence we observe for a certain attribute, the more likely the event becomes. This reasoning process is also called Bayesian inference.

In some references, the ratio P(B|A)/P(B) is called the likelihood function and acts as an adjustment factor: it uses the new information B to adjust the prior probability (the subjective judgment) of event A so that it moves closer to the true probability. Bayes’ theorem can then also be understood as:

Probability of A after new information = prior probability of A * adjustment caused by new information

An example helps make this process more intuitive. Suppose we have recorded, over a period of time, how weather and temperature influenced the choice of activity, as shown below:

Weather     Temperature   Activity
Sunny       Very high     Swimming
Sunny       High          Football
Cloudy      Moderate      Fishing
Cloudy      Moderate      Swimming
Sunny       Low           Swimming
Cloudy      Low           Fishing

Now calculate the probability of going swimming on a sunny day with moderate temperature. According to Bayes’ theorem, the calculation process is as follows:

P(swim | sunny, moderate)
  = P(sunny, moderate | swim) · P(swim) / P(sunny, moderate)
  = P(sunny | swim) · P(moderate | swim) · P(swim) / [P(sunny) · P(moderate)]
  = (2/3 · 1/3 · 1/2) / (1/2 · 1/3)
  = 2/3

So the probability of going out for a swim is 2/3. This is how Bayes’ theorem is used to compute the probability of an event from a set of given features.
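As a quick sanity check of the arithmetic, here is a minimal Java sketch; the counts are hard-coded from the six records in the table above, and the class and variable names are only illustrative:

public class BayesExample {
    public static void main(String[] args) {
        // Counts taken from the six records in the table above
        double total = 6.0;            // total number of records
        double swim = 3.0;             // records whose activity is swimming
        double sunnyAndSwim = 2.0;     // swimming records that are also sunny
        double moderateAndSwim = 1.0;  // swimming records with moderate temperature
        double sunny = 3.0;            // sunny records
        double moderate = 2.0;         // moderate-temperature records

        double pSwim = swim / total;                         // P(swim) = 1/2
        double pSunnyGivenSwim = sunnyAndSwim / swim;        // P(sunny|swim) = 2/3
        double pModerateGivenSwim = moderateAndSwim / swim;  // P(moderate|swim) = 1/3
        double pSunny = sunny / total;                       // P(sunny) = 1/2
        double pModerate = moderate / total;                 // P(moderate) = 1/3

        // Bayes' theorem with the two features treated as conditionally independent
        double posterior = pSunnyGivenSwim * pModerateGivenSwim * pSwim / (pSunny * pModerate);
        System.out.println(posterior);  // prints 0.666..., i.e. 2/3
    }
}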

The idea behind Bayesian analysis plays an important role in predicting the probability of an event from accumulated evidence. When we want to predict something, we first infer a prior probability from existing experience and knowledge, and then keep adjusting that probability as new evidence accumulates. This whole process of accumulating evidence to arrive at the probability of an event is called Bayesian analysis. The underlying idea of Bayes can thus be summarized as: if you had all the information about a thing, you could calculate an objective probability for it.

In addition, the following formula can be obtained by transforming Bayes’ formula:


P(B_i|A) = \frac{P(B_i)P(A|B_i)}{\sum_{j=1}^{n} P(B_j)P(A|B_j)}

Here B1, B2, ..., Bn form a complete event group (a partition of the sample space). Given that event A has occurred, the formula expresses the probability of each possible “cause” Bi having led to A.
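For example, with only two mutually exclusive causes B1 and B2 (as in the two-category classification discussed below), the denominator expands into just two terms:

P(B_1|A) = \frac{P(B_1)P(A|B_1)}{P(B_1)P(A|B_1) + P(B_2)P(A|B_2)}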

Naive Bayes

Before studying naive Bayes, we first need to understand Bayesian classification in general. A Bayesian classifier estimates the probability of an object belonging to each category and, by comparing these probabilities, predicts the category it most likely belongs to. It is built directly on Bayes’ theorem, and Bayesian classifiers show high classification accuracy when dealing with large data sets.

When a Bayesian classifier handles a sample X of unknown category, it computes the probability P(Ci|X) that X belongs to each category Ci, and then chooses the category with the largest probability. Suppose there are two feature variables x and y, and two categories C1 and C2. Combining this with Bayes’ theorem:

  • If P(C1|x,y) > P(C2|x,y), then C1 is more likely than C2 given that x and y are observed, so the sample should belong to category C1
  • Conversely, if P(C1|x,y) < P(C2|x,y), then the sample should belong to category C2
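More generally, with categories C1, C2, ..., Cm, this decision rule simply picks the category whose posterior probability is largest:

\hat{C} = \arg\max_{i} P(C_i|x,y)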

As a powerful predictive modeling algorithm, the naive Bayes model simplifies Bayes’ theorem by assuming that the feature attributes of the target are independent of each other, which is why it is described as “naive”. In practice, correlation between attributes reduces classification accuracy somewhat, yet the method remains very effective for many complex problems.

Consider a sample data set D whose feature attribute set is X = {x1, x2, ..., xd}, and whose class variable can take the values Y = {y1, y2, ..., ym}; that is, D can be divided into m categories. Assuming that x1, x2, ..., xd are mutually independent given the class, Bayes’ theorem gives:


\begin{array}{rl} P(y_i|x_1,x_2,\cdots,x_d) &= \frac{P(x_1,x_2,\cdots,x_d|y_i)\cdot P(y_i)}{P(x_1,x_2,\cdots,x_d)} \\ &= \frac{P(x_1|y_i)\cdot P(x_2|y_i)\cdots P(x_d|y_i)\cdot P(y_i)}{P(x_1,x_2,\cdots,x_d)} \\ &= \frac{\prod_{j=1}^{d}P(x_j|y_i)\cdot P(y_i)}{\prod_{j=1}^{d}P(x_j)} \end{array}

For the same test sample, the denominator P(X) is fixed, so when comparing posterior probabilities we only need to compare the numerators.

It is worth clarifying the difference between Bayes’ theorem, Bayesian classification, and naive Bayes. Bayes’ theorem is the theoretical foundation that solves the inverse-probability problem in probability theory; Bayesian classifiers are designed on top of it; and naive Bayes is one particular Bayesian classifier, the simplest and most commonly used one. The following diagram shows the relationship between them:

In practice, naive Bayes is widely used in text classification, spam filtering, sentiment prediction and phishing-website detection. To train a naive Bayes model, we work from a labeled training set and calculate the prior probability of each category and the conditional probability of each attribute given each category. Once these are computed, the resulting probability model can apply Bayes’ theorem to predict new data.
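Concretely, both quantities can be estimated from counts on the training set. A common (unsmoothed) estimate, assuming N training samples of which N_{y_i} belong to category y_i and N_{y_i,x_j} of those contain attribute value x_j, is:

P(y_i) \approx \frac{N_{y_i}}{N}, \qquad P(x_j|y_i) \approx \frac{N_{y_i,x_j}}{N_{y_i}}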

Bayesian inference is very similar to how the human brain works, which is part of why it became a foundation of machine learning. The brain’s decision process starts with a subjective judgment about something, then collects new information and uses it to refine that judgment: if the new information agrees with the judgment, its credibility increases; if it does not, its credibility decreases.

Code implementation

With a basic understanding of the theory, we can now look at how naive Bayes is applied to sentiment analysis in text processing. The main steps are as follows:

  • Segment the text of the training set and the test set into words, and label each sentence in both sets with its sentiment category by subjective judgment
  • Train on the training set: count the frequency of each word within each category, and calculate the frequency of each category in the training samples as well as the conditional probability (i.e. the likelihood) of each feature attribute for each category
  • Apply the trained model to the samples of the test set, computing the probability of each sample under each category according to Bayesian classification
  • Compare the probabilities of the categories and predict the sentiment category that the text most likely belongs to

Represented as a flow chart:

1. Preparation

First, prepare the data set. It uses review data from a hotel, divided by subjective attitude into the two categories to be classified, namely “Praise” and “Bad review”. Each line is a sentence that has already been segmented into words and labeled with its sentiment. The data format is as follows:

At the head of each row a “Praise” or “Bad review” label is added; the label and the words are separated by a tab, and the words themselves are separated by spaces. The data set is split so that 80% is used as the training set and the remaining 20% as the test set, with the split kept as random as possible.
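Purely as an illustration of this line format (the words below are invented placeholders, not lines taken from the actual hotel data set), a training line would look like the following, with a tab after the label and single spaces between words:

Praise	room clean service warm breakfast rich
Bad review	facility old sound insulation poor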

2. Training phase

The training stage mainly consists of counting word frequencies. We read the training set and count how often each word occurs within its category; these counts are later used to compute the probability of each word under each category, i.e. the relationship between vocabulary and subjective sentiment:

private static void train() {
    Map<String, Integer> parameters = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new FileReader(trainingData))) {  // training set data
        String sentence;
        while (null != (sentence = br.readLine())) {
            String[] content = sentence.split("\t| ");  // split on tab or space
            // count the category label itself
            parameters.put(content[0], parameters.getOrDefault(content[0], 0) + 1);
            // count each word under this category, keyed as "category-word"
            for (int i = 1; i < content.length; i++) {
                parameters.put(content[0] + "-" + content[i],
                        parameters.getOrDefault(content[0] + "-" + content[i], 0) + 1);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    saveModel(parameters);
}

The trained model is saved to a file so that it can be reused next time without retraining:

private static void saveModel(Map<String, Integer> parameters) {
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(modelFilePath))) {
        parameters.keySet().stream().forEach(key -> {
            try {
                bw.append(key + "\t" + parameters.get(key) + "\r\n");
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        bw.flush();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Opening the saved model file, the data has the following format:

Praise-free to send	3
Bad review-really annoying	1
Praise-gifts	3
Bad review-dirty mess	6
Praise-solve	15
Bad review-get ripped off	1
......

Because the trained model is persisted here, the same classification task can later be run directly on top of these counts without re-scanning the training set; for tasks that require faster classification this effectively improves the overall speed.

3. Load the model

Loading the trained model:

private static HashMap<String, Integer> parameters = null;  // stores the model (word-frequency counts)
private static Map<String, Double> catagory = null;         // category totals and prior probabilities
private static String[] labels = {"Praise", "Bad review", "Total", "priorGood", "priorBad"};

private static void loadModel() throws IOException {
    parameters = new HashMap<>();
    List<String> parameterData = Files.readAllLines(Paths.get(modelFilePath));
    parameterData.stream().forEach(parameter -> {
        String[] split = parameter.split("\t");
        String key = split[0];
        int value = Integer.parseInt(split[1]);
        parameters.put(key, value);
    });

    calculateCatagory();  // compute category totals and prior probabilities
}

The words are then grouped by category, the total word frequencies of the “Praise” and “Bad review” categories are counted, and the prior probabilities are calculated from them:

// Calculate the total word frequency of each category in the model
public static void calculateCatagory() {
    catagory = new HashMap<>();
    double good = 0.0;  // total word frequency of the "Praise" category
    double bad = 0.0;   // total word frequency of the "Bad review" category
    double total;       // total word frequency

    for (String key : parameters.keySet()) {
        Integer value = parameters.get(key);
        if (key.contains("Praise-")) {
            good += value;
        } else if (key.contains("Bad review-")) {
            bad += value;
        }
    }
    total = good + bad;
    catagory.put(labels[0], good);
    catagory.put(labels[1], bad);
    catagory.put(labels[2], total);
    catagory.put(labels[3], good / total);  // prior probability of "Praise"
    catagory.put(labels[4], bad / total);   // prior probability of "Bad review"
}

View the statistics after execution:

The total frequency of the words under “Praise” is 46,316, the total frequency of the words under “Bad review” is 77,292, and the total word frequency of the training set is 123,608. The prior probabilities can be calculated from these:

Prior probability of a category = total word frequency of all entries in that category / total word frequency of all entries
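Plugging in the totals reported above gives, approximately:

P(Praise) = 46316 / 123608 ≈ 0.375
P(Bad review) = 77292 / 123608 ≈ 0.625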

4. Test phase

In the test stage, we load the test set prepared in advance and predict the subjective sentiment of each segmented comment sentence:

private static void predictAll() {
    double accuracyCount = 0.;  // number of correct predictions
    int amount = 0;             // total number of test samples

    try (BufferedWriter bw = new BufferedWriter(new FileWriter(outputFilePath))) {
        List<String> testData = Files.readAllLines(Paths.get(testFilePath));  // test set data
        for (String instance : testData) {
            String conclusion = instance.substring(0, instance.indexOf("\t"));  // the existing label
            String sentence = instance.substring(instance.indexOf("\t") + 1);
            String prediction = predict(sentence);  // predicted result

            bw.append(conclusion + ":" + prediction + "\r\n");
            if (conclusion.equals(prediction)) {
                accuracyCount += 1.;
            }
            amount += 1;
        }
        // calculate the accuracy
        System.out.println("accuracyCount: " + accuracyCount / amount);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

During testing, the predict method below is called to classify each sentence. Before implementing the calculation, let’s revisit the formula above and simplify it for use in the program:


P(y_i|x_1,x_2,\cdots,x_d) = \frac{\prod_{j=1}^{d}P(x_j|y_i)\cdot P(y_i)}{\prod_{j=1}^{d}P(x_j)}

For the same prediction sample the denominator is identical, so we only need to compare the numerator, i.e. the product of P(xj|yi) over all features multiplied by P(yi). To simplify this further, we can take the logarithm of the product, turning it into a sum:


\log_2\left[\prod_{j=1}^{d}P(x_j|y_i)\cdot P(y_i)\right] = \sum_{j=1}^{d}\log_2 P(x_j|y_i) + \log_2 P(y_i)

In this way, comparing probabilities is reduced to comparing the sum of the log prior and the log likelihoods. The prior probabilities were already calculated and saved in the previous steps, so only the likelihood of each word under each category needs to be computed here. predict is implemented as follows:

private static String predict(String sentence) {
    String[] features = sentence.split(" ");  // the sentence is already segmented by spaces

    // Compute the log score for "Praise" and "Bad review" separately
    double good = likelihoodSum(labels[0], features) + Math.log(catagory.get(labels[3]));
    double bad = likelihoodSum(labels[1], features) + Math.log(catagory.get(labels[4]));
    return good >= bad ? labels[0] : labels[1];
}

Here the likelihoodSum method is called to calculate the sum of the log likelihoods:

// Calculate the sum of log likelihood probabilities for one category
public static double likelihoodSum(String label, String[] features) {
    double p = 0.0;
    Double total = catagory.get(label) + 1;  // add-one smoothing of the denominator
    for (String word : features) {
        Integer count = parameters.getOrDefault(label + "-" + word, 0) + 1;  // add-one smoothing of the numerator
        // probability of the word given the category: word frequency divided by the category's total word frequency
        p += Math.log(count / total);
    }
    return p;
}

When calculating the likelihood, a word that never appears in the training set for a category would get a probability of 0, making the whole product 0. To prevent this, one is added to both the numerator and the denominator as a simple smoothing step.
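Written as a formula, the estimate the code above effectively uses is the following, where N_{y,w} is the frequency of word w under category y and N_y is the total word frequency of category y:

P(w|y) \approx \frac{N_{y,w} + 1}{N_y + 1}

Note that textbook Laplace (add-one) smoothing usually adds the vocabulary size to the denominator rather than 1; the simpler variant here is sufficient to keep every probability nonzero, which is all that is needed for comparing the two categories.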

Finally, the steps above are called from the main function. If the probability of “Praise” computed for a sample is greater than or equal to the probability of “Bad review”, the sample is classified as “Praise”; otherwise it is classified as “Bad review”. This completes the whole training and testing process:

public static void main(String[] args) throws IOException {
    train();
    loadModel();
    predictAll();
}

After executing all the code, you can see that the accuracy is 93.35%.

By comparing the labels in the final output document with the predicted results, we can see that the accuracy of the predicted results is very high.

5. Summary

There is still room for improvement in the example above. For instance, we could build a sentiment lexicon in advance and extract only sentiment words during feature extraction, removing the influence of useless words (such as prepositions) on classification; selecting only representative key words as features would also yield higher classification efficiency. In addition, the sentiment words of the test set could be added to the lexicon while it is being built, which avoids the situation where the likelihood of a word under a category is zero and simplifies the smoothing step.

Naive Bayes classification also lends itself well to growing training sets. As more training data arrives, new samples can be added on top of the existing trained model, or the counts for existing samples can be adjusted, which makes incremental updates of the model possible.
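As a minimal sketch of what such an incremental update could look like on top of the word-frequency model shown earlier (the method name and its reuse of the existing parameters map, saveModel and calculateCatagory are assumptions for illustration, not part of the original project):

// Merge newly labelled, already-segmented sentences into the existing counts
// and persist the model again; no full retraining on the old data is needed.
private static void incrementalTrain(List<String> newSentences) {
    for (String sentence : newSentences) {
        String[] content = sentence.split("\t| ");      // label, then words
        parameters.merge(content[0], 1, Integer::sum);   // bump the category count
        for (int i = 1; i < content.length; i++) {
            parameters.merge(content[0] + "-" + content[i], 1, Integer::sum);
        }
    }
    saveModel(parameters);   // reuse the existing persistence step
    calculateCatagory();     // refresh totals and prior probabilities
}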

If this article was helpful to you, you are welcome to follow the author’s WeChat official account.