1. Foreword

If you want to work in data mining or machine learning, you need to master the common machine learning algorithms. They include the following:

  • Supervised learning algorithms: logistic regression, linear regression, decision trees, naive Bayes, K-nearest neighbors, support vector machines, ensemble algorithms such as AdaBoost, etc.
  • Unsupervised algorithms: clustering, dimensionality reduction, association rules, PageRank, etc.

To understand the principles in detail, I have read the "watermelon book" (Zhou Zhihua's Machine Learning), Statistical Learning Methods, and other machine learning texts, and I have taken some machine learning courses. But the wording always felt abstruse, I lacked the patience to read them through, and theory is everywhere while practice matters most. So here I want to use the simplest possible language to write a "plain-language theory + practice" series on machine learning algorithms.

In my opinion, understanding the idea behind an algorithm and how to use it is more important than understanding its mathematical derivation. The idea gives you an intuitive feel for why the algorithm is reasonable; the mathematical derivation just expresses that reasonableness in more rigorous language. For example, you can describe how sweet a pear is by saying its sugar content is 90%, but only by taking a bite yourself can you truly feel how sweet it is, and truly understand what 90% sugar content means. If these machine learning algorithms are pears, the primary purpose of this series is to take a bite out of them. There are also a few other goals:

  • Test my own understanding of the algorithms and summarize the theory behind them.
  • Learn the core ideas of these algorithms in an enjoyable way, find the fun in studying them, and lay a foundation for studying them in depth.
  • Pair the theory of each topic with a practical case, so the knowledge can really be applied; this both exercises programming ability and deepens the grasp of the theory.
  • Gather all my previous notes and references in one place, so they are easy to look up later.

The process of learning algorithms should not only yield theory, but also bring fun and the ability to solve practical problems!

Today is the fifth article in this plain-language theory + practice series: the naive Bayes algorithm. The scenario this algorithm fits best is text classification, a routine task in natural language processing, but it is not only used for text classification; the Bayesian method has proved to be a very general and powerful reasoning framework. Through today's study, you can quickly master the calculation principle and workflow of naive Bayes, and then use them to complete a text classification task.

The outline is as follows:

  • Bayes' principle (don't fear the unknown; infer the unknown from the known)
  • How naive Bayes classification works (discrete and continuous data cases)
  • Naive Bayes classification in practice (text classification; here you will learn the TF-IDF technique and get to know word segmentation)

OK, let's go!

2. Naive Bayes? Let’s start with Bayes’ principle.

Many of you have heard of Bayes' principle, right? Where? In probability and statistics class, of course. Some may say, "I've forgotten everything I learned in probability and statistics." That doesn't matter; this is plain-language machine learning, so naturally we learn the essence of the algorithm in plain language. But first you need to understand Bayes' principle. Don't worry, there are no complicated formulas here, just a small example, and you'll find that you have unknowingly learned the core idea of Bayes' principle. Yes, it's that magical. Don't believe it? Then keep reading.

Bayes' principle was developed by the English mathematician Thomas Bayes. Bayes was a remarkable man, a bit like Van Gogh: it was only after his death that a paper he had written on inductive reasoning was dug out by a friend and published. It turned out to be one of the most famous papers in the history of science, and its ideas directly influenced statistics for more than two centuries. (Impressive, though unfortunately Bayes never got to see it.)

Where does Bayes' principle come from? Bayes wrote his paper to solve a problem called "inverse probability": how to make mathematically sound guesses when there is not much reliable evidence. There's a term here, "inverse probability". What does it mean?

The so-called "inverse probability" is defined relative to "forward probability".


You already know forward probability: for example, if I have 5 balls in a bag, 3 black and 2 white, and I randomly pick one out, what is the probability that it is black? You answer immediately: 3/5.

Ha ha, that's forward probability: easy to understand, but it usually assumes a god's-eye view where the full picture is known (we know beforehand that there are 5 balls in the bag).

But suppose we only know in advance that the bag contains black and white balls, without knowing how many of each. Can we infer, from the colors of the balls we draw out, how many black and white balls are in the bag? That is inverse probability.

It was this ordinary-looking question that influenced statistical theory for nearly 200 years. This is because, unlike other methods of statistical inference, Bayes' principle builds on subjective judgment: without knowing all the objective facts, we can still estimate a value first and then keep revising it based on actual observations.

Well, I guess you might be a bit confused right now! Here's a simple example that will make you fall in love with Bayes and show you how it works:


A school is 60% boys and 40% girls. The boys always wear trousers, while half of the girls wear trousers and half wear skirts. With this information we can easily calculate "the probability that a randomly chosen student wears trousers, and the probability that they wear a skirt", which is the "forward probability" calculation mentioned above. But suppose you are walking on campus and come across a student wearing trousers (unfortunately you are so short-sighted that you can only see that they are wearing trousers, but cannot determine their gender). Can you deduce the probability that they are male?

Look at the example above. Can you work it out? It seems to involve reverse reasoning.

We may not be good at formalizing the Bayesian problem yet, but we should be good at the equivalent problem stated in terms of counts. So let's rephrase the question: walking randomly on campus, you meet N people wearing trousers (whose gender you still cannot see). Among these N people, how many are boys and how many are girls?

Isn't that easy, you say? Just figure out how many people in the school wear trousers, and among them, how many are boys and how many are girls.

Ha ha, cool, so let’s work it out: let’s say there are M people in this school.

  1. First of all, let’s figure out how many boys and how many girls there are in this school


60% are boys and 40% are girls, so the number of boys is M * P(male) = M * 60%, and the number of girls is M * P(female) = M * 40%. Easy enough?

  2. Next let's figure out: among the boys, how many wear trousers?


From the information above, all boys wear trousers; that is, as long as someone is male, he wears trousers. Notice the premise here, "given that the person is male": this is a conditional probability, P(trousers | male) = 100%. The number of boys who wear trousers is then the number of boys times this probability: M * P(male) * P(trousers | male)

  3. In the same way, how many girls wear trousers?


From the information above, half of the girls wear trousers and half wear skirts, that is, P(trousers | female) = 50%. This is also a conditional probability, because the premise is that the person is female. The number of girls who wear trousers is then: M * P(female) * P(trousers | female)

  4. So the number of people in this school who wear trousers is the number of boys who wear trousers plus the number of girls who wear trousers


Number of people who wear trousers = M * P(male) * P(trousers | male) + M * P(female) * P(trousers | female)

  5. So among the people who wear trousers, what proportion are boys and what proportion are girls?


What we want is the proportion of males among the trousers-wearers and the proportion of females among the trousers-wearers; this time the premise is "wearing trousers", so these are P(male | trousers) and P(female | trousers). How do we get them? Simple: we know the total number of trousers-wearers, and we know how many of them are male and how many are female, so:

  • P(male | trousers) = [M * P(male) * P(trousers | male)] / [M * P(male) * P(trousers | male) + M * P(female) * P(trousers | female)]
  • P(female | trousers) = [M * P(female) * P(trousers | female)] / [M * P(male) * P(trousers | male) + M * P(female) * P(trousers | female)]

  6. Look at the expressions above: the M in the numerator and the denominator cancels out, and we get:


  • P(male | trousers) = [P(male) * P(trousers | male)] / [P(male) * P(trousers | male) + P(female) * P(trousers | female)]
  • P(female | trousers) = [P(female) * P(trousers | female)] / [P(male) * P(trousers | male) + P(female) * P(trousers | female)]

Actually, that's the end of the example, and that is the answer to the question above. You might ask: what exactly are these formulas? Before that, I want to make sure you understand the example. If not, let me explain a few concepts:

  • Prior probability: the probability of an event judged from experience. For example, boys make up 60% of the school and girls 40%; this is simply a fact, with no conditions attached. Another example: the rainy season in South China is from June to July, which is experience summarized from past years' climate, and the probability of rain in that period is much higher than at other times. These are all prior probabilities.
  • Conditional probability: the probability that event A occurs given that another event B has already occurred, written P(A | B) and read "the probability of A given B". For example, the probability that a boy wears trousers, P(trousers | male), or the probability that a girl wears trousers, P(trousers | female).
  • Posterior probability: the probability of a cause, inferred after the effect has been observed. For example, above, having seen someone wearing trousers, I infer the probability that this person is male, P(male | trousers), or female, P(female | trousers). A posterior probability is a kind of conditional probability.

Are you okay with the three probabilities above? You can test yourself:


Suppose your girlfriend finds a flirty text message from another woman on your phone, and she starts thinking in terms of the three probabilities. Decide which kind each of the following is:

  • The probability that you are having an affair, in the absence of any other information;
  • The probability that there is a flirty text on your phone, given that you are having an affair;
  • The probability that you are having an affair, given that a flirty text was found on your phone.

Can you match these three to the three kinds of probability? If so, you've understood the concepts and the example, so let's get down to business.

Let's write down the two posterior probabilities from the example above again:


  • P(male | trousers) = [P(male) * P(trousers | male)] / [P(male) * P(trousers | male) + P(female) * P(trousers | female)]
  • P(female | trousers) = [P(female) * P(trousers | female)] / [P(male) * P(trousers | male) + P(female) * P(trousers | female)]

P(male) and P(female) are prior probabilities; P(trousers | male) and P(trousers | female) are conditional probabilities; and P(male | trousers) and P(female | trousers) are posterior probabilities.

If we write trousers = A, male = B1, and female = B2, then this is exactly the famous Bayes formula: P(B1 | A) = P(B1) * P(A | B1) / [P(B1) * P(A | B1) + P(B2) * P(A | B2)]. In its more general form, for categories B1, B2, ..., Bn: P(Bi | A) = P(Bi) * P(A | Bi) / [P(B1) * P(A | B1) + ... + P(Bn) * P(A | Bn)]. No wonder Laplace said that probability theory is simply common sense expressed in mathematics.

In fact, Bayes' principle is all about solving for a posterior probability. Using what? Bayes' formula.

Now, doesn't it feel like you've understood Bayes' principle without much brain-burning? What this means is that when there is a conditional probability we cannot compute directly, we transform it through Bayes' formula. Don't fear the unknown: infer the unknown from the known.
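As a quick sanity check, here is a minimal Python sketch (my own, not from the original example) that plugs the school numbers into the formula above:

p_male, p_female = 0.6, 0.4              # prior probabilities
p_trousers_given_male = 1.0              # P(trousers | male)
p_trousers_given_female = 0.5            # P(trousers | female)

# Total probability of meeting someone in trousers
p_trousers = p_male * p_trousers_given_male + p_female * p_trousers_given_female

# Posterior probabilities
p_male_given_trousers = p_male * p_trousers_given_male / p_trousers
p_female_given_trousers = p_female * p_trousers_given_female / p_trousers

print(p_male_given_trousers)    # 0.75
print(p_female_given_trousers)  # 0.25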

However, the seemingly mundane Bayes formula has a very profound principle behind it.

I won't say much more here in case you get sleepy, but if you're interested, check out my plain-language blog post on Bayes (see the references). My goal is not just to understand the principle but to actually use it, so even without fully grasping every last detail I'm not afraid of it. Next up is naive Bayes, and we'll run through it with examples.

3. Naive Bayes

So with Bayes’ principle behind us, let’s look at the algorithm that we’re going to focus on, naive Bayes.

It is a simple but surprisingly powerful predictive modeling algorithm. It is called naive Bayes because it assumes that the input variables are independent of each other. That is a strong assumption which often does not hold in reality, yet the technique still works very well on the vast majority of complex problems.


What are the input variables here? They are like the gender feature we just used: in a real problem there is usually not only gender but also height, weight, and many other input features, and the classification is done from these features using Bayes' formula. What naive Bayes does is assume that these features (height, weight, gender, and so on) are mutually independent, i.e. uncorrelated. Then, when computing the probability of all three features occurring together, we can compute each one separately: P(ABC) = P(A) * P(B) * P(C). That's the reason for the name.

The naive Bayes model consists of two types of probabilities:

  • The probability P(Cj) of each category;
  • The conditional probability P(Ai | Cj) of each attribute given its category.

Here's another example to illustrate category probability and conditional probability:


Let's say I have seven chess pieces, three white and four black. The probability that a piece is white is 3/7 and the probability that it is black is 4/7; these are category probabilities. Now suppose I put these seven pieces into two boxes: box A has two white pieces and two black pieces, and box B has one white piece and two black pieces. Then the probability of drawing a white piece from box A is 1/2, and the probability of drawing a black piece from box A is 1/2; these are conditional probabilities, i.e. probabilities under a certain condition, such as "the piece comes from box A". Now suppose I draw a white piece. What is the probability that it came from box A? Can you do the math? You can't? Then the Bayes formula above was learned for nothing!
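Here is a minimal Python sketch of that calculation, under the assumption that each of the seven pieces is equally likely to be the one drawn, so the prior for box A is 4/7 and for box B is 3/7:

p_box_a, p_box_b = 4 / 7, 3 / 7       # priors: 4 of the 7 pieces are in box A, 3 are in box B
p_white_given_a = 2 / 4               # box A holds 2 white pieces out of 4
p_white_given_b = 1 / 3               # box B holds 1 white piece out of 3

p_white = p_box_a * p_white_given_a + p_box_b * p_white_given_b   # = 3/7, the category probability of white

p_a_given_white = p_box_a * p_white_given_a / p_white
print(p_a_given_white)   # 2/3 ≈ 0.667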

The calculation is just the Bayes formula again, as sketched above.

To train a naive Bayes model, we first need training data together with the class label of each sample. The two kinds of probabilities above, the category probabilities and the conditional probabilities, can both be estimated from that training data. Once they have been computed, the probabilistic model can use Bayes' principle to make predictions on new data. (There will be cases later.)

Also, it should be noted that Bayes' principle, Bayesian classification, and naive Bayes are not the same thing:


Bayes' principle is the broadest concept: it solves the "inverse probability" problem in probability theory, and it is on this theory that people built Bayesian classifiers. Naive Bayes classification is one kind of Bayesian classifier, and also the simplest and most commonly used one. It is "naive" because it assumes the attributes are mutually independent, which limits how well it matches reality: if the attributes are correlated, classification accuracy drops. The good news is that, in most cases, naive Bayes classification still works well.

OK, now that we understand naive Bayes, let's go through two cases and experience how naive Bayes does its calculations (for the detailed derivation of naive Bayes, see my Statistical Learning Methods notes in the references below).

4. How Naive Bayes classification works

Naive Bayes classification is a common Bayesian classification method. When we see a stranger in daily life, the first thing we do is judge their gender, and that judgment is a classification process. Based on past experience, we usually judge from height, weight, shoe size, hair length, clothing, voice, and so on. The "experience" here is a trained gender-judgment model, whose training data are the various people we meet in daily life together with their actual genders.

4.1 Discrete Data Case

The data we encounter can be divided into two types: discrete data and continuous data. What is discrete data? "Discrete" means non-continuous, with clear boundaries: the integers 1, 2, 3 are discrete data, whereas a quantity that can take any value between 1 and 3 is continuous data.

Take a table of samples based on previous experience, where each person's height, weight, and shoe size are recorded as categories such as "high", "medium", and "low", together with their gender. Now given a new sample, height "high", weight "medium", shoe size "medium", is this person male or female? This is where you can see the simplicity of naive Bayes: for each gender, multiply the class prior by the conditional probability of each observed attribute value (estimated by counting frequencies in the table), and pick the gender with the larger product. A minimal sketch with made-up counts follows.
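Since the original table is not reproduced here, the counts below are hypothetical and only illustrate the procedure; the function name is mine, not from any library:

samples = [
    # (height, weight, shoe size, gender) -- hypothetical training rows, NOT the original table
    ('high',   'high',   'large',  'male'),
    ('high',   'medium', 'large',  'male'),
    ('medium', 'medium', 'medium', 'male'),
    ('high',   'low',    'medium', 'female'),
    ('low',    'low',    'small',  'female'),
    ('low',    'medium', 'small',  'female'),
]

def naive_bayes_score(query, gender):
    """P(gender) * product over attributes of P(attribute value | gender), estimated by counting."""
    rows = [s for s in samples if s[3] == gender]
    score = len(rows) / len(samples)                                 # class prior P(Cj)
    for i, value in enumerate(query):
        score *= sum(1 for r in rows if r[i] == value) / len(rows)   # conditional P(Ai | Cj)
    return score

query = ('high', 'medium', 'medium')
print('male:  ', naive_bayes_score(query, 'male'))      # 0.5 * 2/3 * 2/3 * 1/3 ≈ 0.074
print('female:', naive_bayes_score(query, 'female'))    # 0.5 * 1/3 * 1/3 * 1/3 ≈ 0.019 -> classify as male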

4.2 Continuous Data Case

In real life, the values we get are continuous, such as exact heights, weights, and shoe sizes for each sample. Now if I give you a new sample with height 180, weight 120, and shoe size 41, can you tell me whether the person is male or female?

The formula is the same as above. The difficulty is that height, weight, and shoe size are continuous variables, so the probabilities cannot be counted the way they are for discrete values. And because the sample size is too small, we cannot bucket the values into intervals either. What to do?

At this point we can assume that, for each gender, height, weight, and shoe size are normally distributed, and compute the mean and variance from the samples; that gives us the density function of the normal distribution. With the density function, we can plug in a value and get the density at that point. (When evaluating a continuous random variable at a specific value, we look at the value of the probability density function at that point: the higher it is, the more likely the value. The density value is not itself a probability; it can only be used for comparison.)


For example, suppose male height follows a normal distribution with mean 179.5 and standard deviation 3.697. Then the "probability" (density value) of a male height of 180 is 0.1069.

How do we compute that? This is where a tool comes in handy: a function in Excel:


NORMDIST(x, mean, standard_dev, cumulative)

  • x: the value at which to evaluate the normal distribution;
  • mean: the mean of the normal distribution;
  • standard_dev: the standard deviation of the normal distribution;
  • cumulative: a logical value (TRUE or FALSE) that determines the form of the function. When TRUE, the function returns the cumulative distribution function; when FALSE, it returns the probability density.

Here we use NORMDIST(180, 179.5, 3.697, 0) = 0.1069. Similarly, we can calculate that the density for a male weight of 120 is 0.000382324, and the density for a male shoe size of 41 is 0.120304111.
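If you prefer Python to Excel, the same density value can be reproduced with SciPy (a minimal sketch; it assumes SciPy is installed and reuses only the mean and standard deviation given above for male height):

from scipy.stats import norm

# Density of a male height of 180 under N(mean=179.5, std=3.697), matching NORMDIST(180, 179.5, 3.697, 0)
print(norm.pdf(180, loc=179.5, scale=3.697))   # ≈ 0.1069
# Weight and shoe size are handled the same way: estimate a mean and standard deviation per class
# from the samples, evaluate norm.pdf at the new value, and multiply the three densities together.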

So we can calculate: P(A1 A2 A3 | C1) = P(A1 | C1) * P(A2 | C1) * P(A3 | C1) = 0.1069 * 0.000382324 * 0.120304111 = 4.9169e-6

In the same way we can calculate the likelihood for female: P(A1 A2 A3 | C2) = P(A1 | C2) * P(A2 | C2) * P(A3 | C2) = 0.00000147489 * 0.015354144 * 0.120306074 = 2.7244e-9

It’s clear that this set of data is more likely to be classified as male than female.

Ha ha, isn't the calculation principle simple? Next, let's check whether it has really sunk in by putting naive Bayes to work on a real task. Before the practice, recall the workflow of a naive Bayes classifier: estimate the prior and conditional probabilities from the training data, then use Bayes' formula to pick the most probable class for new samples.

5. Naive Bayes text classification

Naive Bayes classification is often used for text classification, especially for languages such as English. Typical applications include spam filtering, sentiment prediction, recommendation systems, and so on.

But before doing the classification, it is worth introducing a bit of text processing and modeling.

5.1 Naive Bayes in SKLearn

sklearn (scikit-learn) provides three naive Bayes classification algorithms: GaussianNB, MultinomialNB, and BernoulliNB.


These three algorithms suit different scenarios, and we should choose among them according to the type of the feature variables (a short sketch follows the list):

  • Gaussian naive Bayes (GaussianNB): the feature variables are continuous and follow a Gaussian distribution, such as a person's height or an object's length.
  • Multinomial naive Bayes (MultinomialNB): the feature variables are discrete and follow a multinomial distribution. In document classification, the features are typically word counts or the TF-IDF values of words. Note that the multinomial distribution has no negative values, so when preparing the input do not standardize the data with StandardScaler; normalize it with MinMaxScaler instead.
  • Bernoulli naive Bayes (BernoulliNB): the feature variables are Boolean, following a 0/1 distribution; in document classification the feature is whether a word occurs. Bernoulli naive Bayes works at document granularity: a feature is 1 if the word appears in the document and 0 otherwise. Multinomial naive Bayes works at word granularity, counting how many times a word occurs in a document. Gaussian naive Bayes suits continuous features that follow a normal (Gaussian) distribution, such as natural quantities like height and weight. For text classification, use multinomial or Bernoulli naive Bayes.
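Here is a minimal sketch of instantiating the three classifiers on toy data (the arrays are made up purely to show which input type each classifier expects):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[180, 60], [175, 58], [160, 48], [158, 45]])    # continuous features -> GaussianNB
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 2]])  # non-negative counts / TF-IDF -> MultinomialNB
X_bool = (X_counts > 0).astype(int)                                # 0/1 word-presence features -> BernoulliNB
y = [1, 1, 0, 0]

print(GaussianNB().fit(X_cont, y).predict([[170, 55]]))
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))
print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1]]))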

5.2 What is the TF-IDF value?

Explaining this would take quite a bit of space, so I won't do it separately here; please refer to my other blog post on TF-IDF (see the references). Here I'll just show how to compute it with tools.

5.3 How to Compute TF-IDF?

In sklearn we can directly use the TfidfVectorizer class, which computes the TF-IDF vector values of words for us. Note that when sklearn takes the logarithm in this class, it uses base e, not base 10.

How do I create the TfidfVectorizer class?

TfidfVectorizer(stop_words=stop_words, token_pattern=token_pattern)

When creating it, there are two constructor parameters: we can customize the stop words stop_words and the tokenization rule token_pattern. Note the data types: stop_words is a list, and token_pattern is a regular expression.

What are stop words? Stop words are words that are useless for classification: they usually have a high term frequency (TF) but a low IDF, so they cannot help separate the classes. To save space and computing time, we treat them as stop words, telling the machine that these words do not need to be counted.

Once the TfidfVectorizer has been created, we can call fit_transform, which computes the matrix for us and returns a text matrix giving the TF-IDF value of each word in each document.

After fit_transform has fitted the model, we can access more attributes: for example, the word-to-ID mapping (a dictionary), the IDF values, and the configured stop words stop_words.

Here's a quick example:

Suppose we have four documents:

  • Document 1: This is the Bayes document;
  • Document 2: This is the second second document;
  • Document 3: And the third one;
  • Document 4: Is this the document?

Now we want to find out which words appear in these documents and what the TF-IDF value of each word is in each document.

  1. First we create the TfidfVectorizer class:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
  2. Then we put the four documents into a list, documents, and let the tfidf_vec we created fit documents to get the TF-IDF matrix:
documents = [
    'this is the bayes document',
    'this is the second second document',
    'and the third one',
    'is this the document'
]
tfidf_matrix = tfidf_vec.fit_transform(documents)

Output all the distinct words in the documents:

print('Distinct words:', tfidf_vec.get_feature_names())
# Result: ['and', 'bayes', 'document', 'is', 'one', 'second', 'the', 'third', 'this']

Print the id value for each word:

print('ID of each word:', tfidf_vec.vocabulary_)
# Result: {'this': 8, 'is': 3, 'the': 6, 'bayes': 1, 'document': 2, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

Output the TF-IDF value of each word in each document; the order within each row vector follows the word IDs:

print('TF-IDF value of each word:', tfidf_matrix.toarray())
# [[0.         0.63314609 0.40412895 0.40412895 0.         0.
#   0.33040189 0.         0.40412895]
#  [0.         0.         0.27230147 0.27230147 0.         0.85322574
#   0.22262429 0.         0.27230147]
#  [0.55280532 0.         0.         0.         0.55280532 0.
#   0.28847675 0.55280532 0.        ]
#  [0.         0.         0.52210862 0.52210862 0.         0.
#   0.42685801 0.         0.52210862]]

5.4 How to Classify Documents

If we want to classify documents, there are two important stages:

  1. Data preparation based on word segmentation: segment the text, compute word weights, and remove stop words;
  2. Apply naive Bayes classification: train a naive Bayes classifier on the training set, apply the classifier to the test set, compare the predictions with the actual labels, and finally obtain the classification accuracy on the test set.

These modules are introduced as follows:

  • Module 1: Word segmentation. In the preparation phase, the most important step is word segmentation. How do we segment documents? English documents and Chinese documents use different segmentation tools.

    For English documents, the most commonly used is the NLTK package, which provides English stop words, tokenization, and part-of-speech tagging.

import nltk
word_list = nltk.word_tokenize(text)   # tokenize
nltk.pos_tag(word_list)                # part-of-speech tagging

For Chinese documents, the most commonly used is the jieba package, which provides Chinese stop words and word segmentation.

import jieba
word_list = jieba.cut(text)   # segment Chinese text
  • Module 2: Loading the stop-word table. We need to read the stop-word file ourselves: find a common Chinese stop-word list online, save it as stop_words.txt, and then use Python's file-reading functions to load it into the stop_words list.
import io
stop_words = [line.strip() for line in io.open('stop_words.txt', encoding='utf-8').readlines()]
  • Module 3: Calculate the weights of words

Create the TfidfVectorizer class directly, then use its fit_transform method to obtain the TF-IDF feature space features; you can think of the selected segmented words as the features. fit_transform computes the feature vector of each document in this space and gives us the feature space features.

tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
features = tf.fit_transform(train_contents)

Here the max_df parameter describes the maximum document frequency of a word. max_df=0.5 means that a word appearing in more than 50% of the documents carries so little discriminative information that it is not used as a feature. min_df is rarely set, and is usually kept small.
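A tiny, made-up corpus to show what max_df=0.5 does (the documents here are hypothetical, not from the dataset used later):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the dog ran', 'the bird flew', 'a fish swam']
vec = TfidfVectorizer(max_df=0.5)
vec.fit(docs)
print(sorted(vec.vocabulary_))   # 'the' appears in 3 of the 4 documents (75% > 50%), so it is dropped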

  • Module 4: Generating the naive Bayes classifier. We pass the feature matrix of the training set, train_features, and the corresponding training labels, train_labels, to the Bayesian classifier clf, which automatically builds a classifier that fits this feature space and these labels.

Here we use a multinomial Bayes classifier, where alpha is the smoothing parameter. Why smoothing? Because if a word never appears in the training samples, its probability would be computed as zero. But the training set is only a sample of the whole population, and we cannot conclude that an event has probability 0 in the population just because we never observed it. To solve this, we need smoothing.

When alpha = 1, this is Laplace smoothing: add 1 to the count when computing the probability of a word that has never occurred. When the training sample is large, the change caused by adding 1 is negligible, and the zero-probability problem is avoided.

When 0 < alpha < 1, this is Lidstone smoothing. For Lidstone smoothing, the smaller alpha is, the more iterations are needed and the higher the accuracy. We can set alpha to 0.001.
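To make the smoothing concrete, here is a tiny numeric illustration with made-up counts (this is the standard additive-smoothing estimate, not code from the original post):

alpha = 1                      # 1 -> Laplace smoothing; 0 < alpha < 1 -> Lidstone smoothing
count_word_in_class = 0        # the word never appears in this class's training documents
total_words_in_class = 10000   # total word count of the class (hypothetical)
vocabulary_size = 5000         # number of distinct words (hypothetical)

p = (count_word_in_class + alpha) / (total_words_in_class + alpha * vocabulary_size)
print(p)   # ≈ 6.7e-5 instead of 0, so the product of word probabilities never collapses to zero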

# Multinomial naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)
  • Module 5: Using the generated classifier to make predictions. First we need the feature matrix of the test set. The method is to create another TfidfVectorizer with the same stop_words and max_df, and with vocabulary set to the vocabulary of the training set, then call fit_transform on the contents of the test set to obtain the test feature matrix test_features.
test_tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=train_vocabulary)
test_features=test_tf.fit_transform(test_contents)

Then we use the trained classifier to make predictions on the new data: pass the test feature matrix test_features to the predict function and get predicted_labels.

What the predict function does is compute the posterior probability of every category and pick the category with the largest one, as shown below.
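In code this is a single call, using the clf and test_features from the snippets above:

predicted_labels = clf.predict(test_features)   # for each test document, the class with the largest posterior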

  • Module 6: Computing accuracy. Computing the accuracy is really an evaluation of the classification model. We can use the metrics module in sklearn, which provides the accuracy_score function to compare the predicted results with the actual results and report the accuracy of the model.
from sklearn import metrics
print(metrics.accuracy_score(test_labels, predicted_labels))

5.5 Text classification in practice

Chinese document dataset click here to download.

Data description:


There are four categories of documents: women, sports, literature, and campus. The training set is in the train folder, the test set in the test folder, and the stop words in the stop folder.

Naive Bayes classification is used to train on the training set and validate on the test set, and the accuracy on the test set is reported.

Ok, step by step, follow the previous ideas:

  1. Import packages
import os

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.metrics import accuracy_score
  2. Load stop words
LABEL_MAP = {'sports': 0, 'women': 1, 'literature': 2, 'campus': 3}

"""Stop loading words"""
with open('./text classification/stop/stopword.txt'.'rb') as f:
    STOP_WORDS = [line.strip() for line in f.readlines()]
  3. Load the dataset
"""Define a function to load data"""
def load_data(path):
    """Base_path: base path return: segmentation list, tag list"""
    
    documents = []
    labels = []

    for label_dir in os.listdir(path):             # iterate over the label directories
        file_path = os.path.join(path, label_dir)
        for file in os.listdir(file_path):          # iterate over the text files under each label directory
            labels.append(LABEL_MAP[label_dir])
            filename = os.path.join(file_path, file)
            with open(filename, 'rb') as fr:
                content = fr.read()
                word_list = list(jieba.cut(content))
                words = [wl for wl in word_list if wl not in STOP_WORDS]
                documents.append(' '.join(words))
    
    return documents, labels
"""Load data"""
train_x, train_y = load_data('./text classification/train')
test_x, test_y = load_data('./text classification/test')
  4. Calculate the word weights
"""Compute the TF-IDF matrix"""
tfidf_vec = TfidfVectorizer(stop_words=STOP_WORDS, max_df=0.5)
new_train_x = tfidf_vec.fit_transform(train_x)        # fit the TF-IDF model on the training set
test_tfidf_vec = TfidfVectorizer(stop_words=STOP_WORDS, max_df=0.5, vocabulary=tfidf_vec.vocabulary_)
new_test_x = test_tfidf_vec.fit_transform(test_x)
  5. Build the models and predict (comparing three Bayesian approaches here)
"""Build a model"""
bayes_model = {}

bayes_model['MultinomialNB'] = MultinomialNB(alpha=0.001)
bayes_model['BernoulliNB'] = BernoulliNB(alpha=0.001)
bayes_model['GaussianNB'] = GaussianNB()

for item in bayes_model.keys():
    clf = bayes_model[item]
    clf.fit(new_train_x.toarray(), train_y)
    pred = clf.predict(new_test_x.toarray())
    print(item, "accuracy_score: ", accuracy_score(test_y, pred))

The final result is as follows:

MultinomialNB accuracy_score:  0.91
BernoulliNB accuracy_score:  0.9
GaussianNB accuracy_score:  0.89

6. Summary

I'm finally done. Oh my god, I didn't expect this to be so long, so let's wrap up quickly. Today we started from Bayes' principle and, through real-life examples, arrived at the great Bayes formula; Bayes' principle is all about using that formula to find a posterior probability.

Then we introduced naive Bayes and its advantages, and explained its calculation process through two cases.

Then came the text classification practice: before the hands-on part we introduced some text processing knowledge, such as word segmentation and the principle and computation of TF-IDF, and then completed the actual task.

I hope that through today's study you can master the usage and principle of naive Bayes.

Reference:

  • Note.youdao.com/noteshare?i…
  • Note.youdao.com/noteshare?i…
  • Blog.csdn.net/sdutacm/art…
  • Note.youdao.com/noteshare?i…
  • Cloud.tencent.com/developer/a…
  • Blog.csdn.net/qq_27009517…