Preface

In this article, the naive Bayes algorithm is used to perform sentiment analysis and prediction on Douban Top250 movie reviews.

Recently, I have been learning how to handle positive and negative sentiment in natural language, but most of the tutorials I can find are sentiment analyses of IMDB movie reviews on Kaggle.

So here I use the most basic naive Bayes algorithm to perform sentiment analysis and prediction on Douban movie reviews.

Here I referred to github.com/aeternae/IM… Thanks a million.

Naive Bayes classifier

Bayesian classification is the general name for a class of classification algorithms. They are all based on Bayes' theorem, hence the collective name.

This kind of algorithm is often used for article classification and for filtering spam mail and spam comments. Naive Bayes works well for these tasks and is very cheap to run.

Given a known conditional probability, how do we obtain the probability with the two events swapped? That is, knowing P(A|B), how do we calculate P(B|A)?

P(B|A) denotes the probability that event B occurs given that event A has already occurred, and it is called the conditional probability of B given A.

Naive Bayes’ formula

P(B|A) = P(A|B) × P(B) / P(A)
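This follows directly from the definition of joint probability: since P(A∩B) = P(A|B)P(B) = P(B|A)P(A), dividing both sides by P(A) gives the formula above.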
An easy-to-understand video tutorial

YouTube: www.youtube.com/watch?v=Aqo…

Here’s an unfortunate example

If we want to know the relationship between being a programmer and going bald, we can use the naive Bayes formula to calculate it.

We now want P(bald | programmer), that is, the probability that a programmer will go bald.

I will never go bald ((o(゚ ゚)o))!!

Plug this into naive Bayes’ formula:

P(bald | programmer) = P(programmer | bald) × P(bald) / P(programmer)
The known data are shown in the following table:

| Name | Profession | Bald? |
| --- | --- | --- |
| Kratos | God of War | Yes |
| Agent 47 | Assassin | Yes |
| Saitama | Hero | Yes |
| Thanos | Director of the family planning office | Yes |
| Jason Statham | Tough guy | Yes |
| Some 996 programmer | Programmer | Yes |
| Me | Programmer | No |

Based on the naive Bayes formula, we can calculate from the table above:

P(bald | programmer) = P(programmer | bald) × P(bald) / P(programmer) = (1/6 × 6/7) / (2/7) = 1/2
This example simply describes the basic use of naive Bayes’ formula.
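As a quick sanity check, here is the same arithmetic in Python, with the counts read directly off the table above:

# Toy check of Bayes' formula using the counts from the table
p_programmer_given_bald = 1 / 6  # one bald programmer among the six bald people
p_bald = 6 / 7                   # six of the seven people are bald
p_programmer = 2 / 7             # two of the seven people are programmers

p_bald_given_programmer = p_programmer_given_bald * p_bald / p_programmer
print(p_bald_given_programmer)   # 0.5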

Next, I will use Douban Top250 movie reviews to train a naive Bayes model and predict whether a review is positive or negative.

Sentiment analysis of Douban Top250 movie reviews

First of all, I need a corpus of Top250 movie reviews. I used Scrapy to crawl about 50,000 reviews for training and validation.

Douban film review crawler github.com/3inchtime/d…

Once we have the corpus, we can start the actual development.

Jupyter is recommended here for development.

The following code can be seen on my Github and your suggestions are welcome.

Github.com/3inchtime/d…

First, load the corpus:

# -*- coding: utf-8 -*-
import random
import numpy as np
import csv
import jieba


file_path = './data/review.csv'
jieba.load_userdict('./data/userdict.txt')

# Read the corpus saved in CSV format
def load_corpus(corpus_path):
    with open(corpus_path, 'r') as f:
        reader = csv.reader(f)
        rows = [row for row in reader]

        
    review_data = np.array(rows).tolist()
    random.shuffle(review_data)

    review_list = []
    sentiment_list = []
    for words in review_data:
        review_list.append(words[1])
        sentiment_list.append(words[0])

    return review_list, sentiment_list

Before training, the data set is generally shuffled to break up the original ordering and randomize the samples, which helps avoid overfitting. So the random.shuffle() method is used to shuffle the data.

jieba.load_userdict('./data/userdict.txt'): here I built a custom dictionary to prevent some inaccurate jieba segmentations, which improves the accuracy by about 1%.

For example, jieba may split a negation like “不喜欢” (“don't like”) into “不” (“not”) and “喜欢” (“like”), which makes such a review likely to be predicted as positive.

So I collected many words like this in my own custom dictionary to squeeze out a bit more accuracy; a sketch of the idea follows.
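Below is a minimal sketch; the dictionary entry and the exact segmentations are illustrative assumptions, not the contents of the real userdict.txt.

# -*- coding: utf-8 -*-
import jieba

# Without a custom entry, jieba may split the negation off the verb:
print(jieba.lcut('不喜欢这部电影'))  # e.g. ['不', '喜欢', '这部', '电影']

# userdict.txt uses jieba's "word [freq] [tag]" format, one entry per line,
# e.g. a line such as: 不喜欢 9
jieba.load_userdict('./data/userdict.txt')
print(jieba.lcut('不喜欢这部电影'))  # e.g. ['不喜欢', '这部', '电影']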

Then the corpus is split into a test set and a training set at a ratio of 1:4:

n = len(review_list) // 5

train_review_list, train_sentiment_list = review_list[n:], sentiment_list[n:]
test_review_list, test_sentiment_list = review_list[:n], sentiment_list[:n]

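For reference, scikit-learn can do the same split; this is an alternative to the manual slicing above, not the original code:

from sklearn.model_selection import train_test_split

# 20% test / 80% train; shuffling is handled by scikit-learn here
train_review_list, test_review_list, train_sentiment_list, test_sentiment_list = \
    train_test_split(review_list, sentiment_list, test_size=0.2, shuffle=True)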

Word segmentation

1. Use jieba to segment the corpus and remove the stopwords.

import re
import jieba


stopword_path = './data/stopwords.txt'


def load_stopwords(file_path):
    stop_words = []
    with open(file_path, encoding='UTF-8') as words:
       stop_words.extend([i.strip() for i in words.readlines()])
    return stop_words


def review_to_text(review):
    stop_words = load_stopwords(stopword_path)
    # Keep only Chinese and English characters
    review = re.sub("[^\u4e00-\u9fa5^a-z^A-Z]", '', review)
    review = jieba.cut(review)
    # Remove the stopwords
    if stop_words:
        all_stop_words = set(stop_words)
        words = [w for w in review if w not in all_stop_words]
    else:
        words = list(review)

    return words

# Reviews used for training
review_train = [' '.join(review_to_text(review)) for review in train_review_list]
# Positive/negative labels for the training reviews
sentiment_train = train_sentiment_list

# Reviews used for testing
review_test = [' '.join(review_to_text(review)) for review in test_review_list]
# Positive/negative labels for the test reviews
sentiment_test = test_sentiment_list

TF-IDF and word frequency vectorization

TF-IDF is a weighting technique commonly used in information retrieval and data mining. It scores the importance of a word by how often the word appears in a document and how many documents in the corpus contain it.

Its advantage is that it can filter out words that are common but uninformative, while retaining the important words that characterize the text.
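In its classic form the weight is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing t. scikit-learn computes a smoothed, normalized variant of this.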

Use CountVectorizer() to turn a document into a vector of term counts, recording how often each term appears in the text.

The CountVectorizer class converts the words in the text into a term-frequency matrix: element a[i][j] is the frequency of word j in document i. Its fit_transform() method counts the occurrences of each word.

TfidfTransformer is then used to compute the TF-IDF value of each word produced by the vectorizer.
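A tiny standalone sketch of what these two steps produce; the two "documents" here are made up for illustration:

# Toy illustration of CountVectorizer + TfidfTransformer on two fake documents
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['好看 喜欢 喜欢', '难看 不喜欢']
counts = CountVectorizer().fit_transform(docs)    # the term-frequency matrix a[i][j]
tfidf = TfidfTransformer().fit_transform(counts)  # reweight the counts by TF-IDF
print(counts.toarray())  # raw per-document term counts
print(tfidf.toarray())   # per-document TF-IDF weights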

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

count_vec = CountVectorizer(max_df=0.8, min_df=3)

tfidf_vec = TfidfVectorizer()

# Pipeline chains all the steps together, which makes it easy to reuse the
# same parameter set on new data sets (such as the test set).
def MNB_Classifier():
    return Pipeline([
        ('count_vec', CountVectorizer()),
        ('mnb', MultinomialNB())
    ])


The max_df parameter acts as a threshold: when building the corpus vocabulary, words with a document frequency higher than max_df are not treated as keywords.

If this parameter is a float, it is the fraction of documents in the corpus that contain the word; if it is an int, it is an absolute document count.

min_df works like max_df, except that words with a document frequency lower than min_df are not treated as keywords.
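A small sketch of the pruning effect; the four "documents" and the min_df value here are illustrative, not the article's corpus:

# Illustration of max_df / min_df vocabulary pruning on fake documents
from sklearn.feature_extraction.text import CountVectorizer

docs = ['电影 好看', '电影 难看', '电影 一般', '电影 好看 好看']
# '电影' appears in 4/4 documents (1.0 > 0.8), so max_df prunes it;
# '难看' and '一般' each appear in only one document (< 2), so min_df prunes them.
vec = CountVectorizer(max_df=0.8, min_df=2)
vec.fit(docs)
print(vec.vocabulary_)  # only '好看' survives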

In this way, we have constructed the Pipeline used for training and testing.

The training set is then trained with pipeline.fit().

Then pipeline.score() is used directly to predict and score the test set.

mnbc_clf = MNB_Classifier()

# Get training
mnbc_clf.fit(review_train, sentiment_train)

# Test set accuracy
print('Test set Accuracy: {}'.format(mnbc_clf.score(review_test, sentiment_test)))


This completes the entire process from training to testing.

In general, the accuracy on the test set is about 79%-80%.

This is because a large number of positive reviews contain negative sentiment words. For example, this review of the documentary The Cove:

I don’t think most people who watch this film know that China’s baiji dolphin has already been extinct for eight years, or that only about 1,000 finless porpoises are left in the Yangtze River. Instead of lamenting and cursing how the Japanese hunt dolphins, it would be better to do something practical to protect the Yangtze finless porpoise, which will also go extinct within a few years. The Chinese are in no position to claim they do any better than the Japanese.

So removing this kind of negative-sounding positive review would improve the accuracy.

Save the trained model

import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

model_export_path = './data/bayes.pkl'

# Same settings as count_vec above
vectorizer = CountVectorizer(max_df=0.8, min_df=3)
tfidftransformer = TfidfTransformer()

# First convert to a word-frequency matrix, then compute the TF-IDF values
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(review_train))
# Multinomial naive Bayes classifier
clf = MultinomialNB().fit(tfidf, sentiment_train)

with open(model_export_path, 'wb') as file:
    d = {
        "clf": clf,
        "vectorizer": vectorizer,
        "tfidftransformer": tfidftransformer,
    }
    pickle.dump(d, file)

Use the trained model to predict review sentiment

Here I paste the entire source code directly. The code is very simple: I packaged the whole processing logic into a class, so it is very easy to use.

Clone it from my Github if necessary.

# -*- coding: utf-8 -*-
import re
import pickle

import numpy as np
import jieba


class SentimentAnalyzer(object):
    def __init__(self, model_path, userdict_path, stopword_path):
        self.clf = None
        self.vectorizer = None
        self.tfidftransformer = None
        self.model_path = model_path
        self.stopword_path = stopword_path
        self.userdict_path = userdict_path
        self.stop_words = []
        self.tokenizer = jieba.Tokenizer()
        self.initialize()

    # Load model
    def initialize(self):
        with open(self.stopword_path, encoding='UTF-8') as words:
            self.stop_words = [i.strip() for i in words.readlines()]

        with open(self.model_path, 'rb') as file:
            model = pickle.load(file)
            self.clf = model['clf']
            self.vectorizer = model['vectorizer']
            self.tfidftransformer = model['tfidftransformer']
        if self.userdict_path:
            self.tokenizer.load_userdict(self.userdict_path)

    # Filter URLs and irrelevant characters out of the text
    def replace_text(self, text):
        # Strip URLs ending in .com/.cn (reconstructed pattern)
        text = re.sub(r'((https?|ftp|file)://)?[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]\.(com|cn)', '', text)
        # Strip whitespace, quotes and similar noise characters
        text = text.replace('\u3000', '').replace('\xa0', '').replace('“', '').replace('”', '')
        text = text.replace(' ', '').replace('↵', '').replace('\n', '').replace('\r', '').replace('\t', '').replace('）', '')
        # Split into sentences on Chinese end-of-sentence punctuation
        text_corpus = re.split('[！。？…]', text)
        return text_corpus

    # Affective analysis calculation
    def predict_score(self, text_corpus):
        # participle
        docs = [self.__cut_word(sentence) for sentence in text_corpus]
        new_tfidf = self.tfidftransformer.transform(self.vectorizer.transform(docs))
        predicted = self.clf.predict_proba(new_tfidf)
        # Round to three decimal places
        result = np.around(predicted, decimals=3)
        return result

    # jieba participle
    def __cut_word(self, sentence):
        words = [i for i in self.tokenizer.cut(sentence) if i not in self.stop_words]
        result = ' '.join(words)
        return result

    def analyze(self, text):
        text_corpus = self.replace_text(text)
        result = self.predict_score(text_corpus)

        neg = result[0][0]
        pos = result[0][1]

        print('negative: {} positive: {}'.format(neg, pos))


Simply instantiate the analyzer and call the analyze() method.

# -*- coding: utf-8 -*-
from native_bayes_sentiment_analyzer import SentimentAnalyzer


model_path = './data/bayes.pkl'
userdict_path = './data/userdict.txt'
stopword_path = './data/stopwords.txt'
corpus_path = './data/review.csv'


analyzer = SentimentAnalyzer(model_path=model_path, stopword_path=stopword_path, userdict_path=userdict_path)
text = "A disappointing Nolan movie that feels more like a mishmash of Inception. I knew it was going to be a movie that couldn't surpass the first two, but I didn't expect it to be this bad. The loss of rhythm control and the ambiguous positioning of the characters are definitely the wounds of the whole film."
analyzer.analyze(text=text)


Github.com/3inchtime/d…

All the code above has been pushed to my Github, and your suggestions are welcome.