Preface
In this post, I use the naive Bayes algorithm for sentiment analysis and prediction on reviews of the Douban Top250 movies.
Recently I have been learning how to handle positive and negative sentiment in natural language, but most of the tutorials I could find are sentiment analyses of the IMDB movie reviews on Kaggle.
So here I apply the most basic naive Bayes algorithm to sentiment analysis and prediction of Douban movie reviews.
Here I referred to github.com/aeternae/IM…, thanks a million.
Naive Bayes classifier
Bayesian classification is the general name for a family of classification algorithms that are all based on Bayes' theorem, hence the collective name.
These algorithms are often used for document classification, such as filtering spam email and spam comments; naive Bayes performs well and is very cheap to run.
The core question is: given one conditional probability, how do we obtain the probability with the two events swapped? That is, knowing P(A|B), how do we compute P(B|A)?
P(B|A) denotes the probability that event B occurs given that event A has already occurred, and is called the conditional probability of B given A.
Naive Bayes’ formula
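In symbols, Bayes' theorem reads:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$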
An easy-to-understand video tutorial:
YouTube: www.youtube.com/watch?v=Aqo…
Here's an unfortunate example.
If we want to know the relationship between being a programmer and going bald, we can work it out with the naive Bayes formula.
We want P(bald | programmer), that is, the probability that a programmer will go bald.
I will never go bald ((o(゚ ゚)o))!!
Plugging this into naive Bayes' formula:
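$$P(\text{bald} \mid \text{programmer}) = \frac{P(\text{programmer} \mid \text{bald})\,P(\text{bald})}{P(\text{programmer})}$$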
The known data are shown in the following table:

| Name | Profession | Bald? |
| --- | --- | --- |
| Kratos | God of War | Yes |
| Agent 47 | Assassin | Yes |
| Saitama | Superhero | Yes |
| Thanos | Director of the Family Planning Office | Yes |
| Jason Statham | Tough guy | Yes |
| Some 996 programmer | Programmer | Yes |
| Me | Programmer | No |
From the naive Bayes formula and the table above we can calculate:
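There are 7 people in total: 6 are bald, 2 are programmers, and exactly 1 of the bald people is a programmer. So P(programmer | bald) = 1/6, P(bald) = 6/7 and P(programmer) = 2/7, which gives:

$$P(\text{bald} \mid \text{programmer}) = \frac{\frac{1}{6} \cdot \frac{6}{7}}{\frac{2}{7}} = \frac{1}{2}$$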
This example simply illustrates the basic use of the naive Bayes formula.
Next, I will use the Douban Top250 movie reviews to train a naive Bayes model and predict whether a review is positive or negative.
Sentiment analysis of Douban Top250 movie reviews
First of all, I need a corpus of Top250 movie reviews, so I used Scrapy to crawl about 50,000 reviews for training and validation.
Douban movie review crawler: github.com/3inchtime/d…
Once we have the corpus, we can start the actual development.
I recommend using Jupyter for this kind of exploratory work.
All of the following code can be found on my GitHub, and your suggestions are welcome.
github.com/3inchtime/d…

First, load the corpus:
```python
# -*- coding: utf-8 -*-
import random
import csv

import numpy as np
import jieba

file_path = './data/review.csv'
jieba.load_userdict('./data/userdict.txt')


# Read the corpus saved in CSV format
def load_corpus(corpus_path):
    with open(corpus_path, 'r') as f:
        reader = csv.reader(f)
        rows = [row for row in reader]
    review_data = np.array(rows).tolist()
    # Shuffle the samples so the train/test split is randomized
    random.shuffle(review_data)
    review_list = []
    sentiment_list = []
    for words in review_data:
        review_list.append(words[1])
        sentiment_list.append(words[0])
    return review_list, sentiment_list
```
Before training, the dataset is generally shuffled to break up the original order of the samples and randomize the data, which helps avoid overfitting; this is what random.shuffle() does here.

With jieba.load_userdict('./data/userdict.txt') I load a custom dictionary that fixes some of jieba's inaccurate segmentations, which improves accuracy by about 1%.

For example, jieba splits "don't like" (不喜欢) into "not" (不) and "like" (喜欢), which makes such a review very likely to be predicted as positive.

So I collected many words like this into my own custom dictionary to gain a little more accuracy.
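A minimal sketch of the idea; the phrase and the dictionary entry here are illustrative, not taken from my actual userdict.txt:

```python
import jieba

# Depending on jieba's built-in dictionary, the negation may be split
# off from the verb: ['不', '喜欢'], i.e. "not" + "like", which looks
# positive to the classifier
print(jieba.lcut('不喜欢'))

# userdict.txt lists one word per line, e.g. a line containing: 不喜欢
jieba.load_userdict('./data/userdict.txt')

# With the entry loaded, the phrase stays a single negative token
print(jieba.lcut('不喜欢'))
```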
Then the whole corpus is split into test and training sets at a ratio of 1:4:
```python
n = len(review_list) // 5
train_review_list, train_sentiment_list = review_list[n:], sentiment_list[n:]
test_review_list, test_sentiment_list = review_list[:n], sentiment_list[:n]
```
Word segmentation

Use jieba to segment the corpus and remove the stopwords.
```python
import re
import jieba

stopword_path = './data/stopwords.txt'


def load_stopwords(file_path):
    stop_words = []
    with open(file_path, encoding='UTF-8') as words:
        stop_words.extend([i.strip() for i in words.readlines()])
    return stop_words


def review_to_text(review):
    stop_words = load_stopwords(stopword_path)
    # Keep only Chinese characters and English letters
    review = re.sub("[^\u4e00-\u9fa5a-zA-Z]", '', review)
    review = jieba.cut(review)
    # Remove the stopwords
    if stop_words:
        all_stop_words = set(stop_words)
        words = [w for w in review if w not in all_stop_words]
    else:
        words = list(review)
    return words


# Reviews used for training
review_train = [' '.join(review_to_text(review)) for review in train_review_list]
# Positive/negative labels for the training reviews
sentiment_train = train_sentiment_list

# Reviews used for testing
review_test = [' '.join(review_to_text(review)) for review in test_review_list]
# Positive/negative labels for the test reviews
sentiment_test = test_sentiment_list
```
TF-IDF and word frequency vectorization

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It measures how important a word is to the corpus based on how often the word appears in a text and in how many documents of the corpus it appears.
Its advantage is that it filters out words that are common but uninformative, while keeping the important words that shape the meaning of the text.
CountVectorizer() turns a document into a vector that counts how often each term appears in the text.
The CountVectorizer class converts the words of the text into a word frequency matrix, whose element a[i][j] is the frequency of word j in document i. Its fit_transform method counts the occurrences of each word.
TfidfTransformer then computes the TF-IDF value of every word produced by the CountVectorizer.
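A minimal sketch of these two steps on a made-up two-document corpus (the sentences are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two tiny, already-segmented "documents"
corpus = ['电影 非常 好看', '电影 剧情 糟糕']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # the vocabulary
print(counts.toarray())                    # word frequency matrix a[i][j]

# Convert the raw counts into TF-IDF weights
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray())
```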
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

count_vec = CountVectorizer(max_df=0.8, min_df=3)
tfidf_vec = TfidfVectorizer()


# A Pipeline chains all the steps together, which makes it easy to reuse
# the same parameter set on a new data set (such as the test set)
def MNB_Classifier():
    return Pipeline([
        ('count_vec', CountVectorizer()),
        ('mnb', MultinomialNB())
    ])
```
The max_df parameter acts as a threshold: when the keyword vocabulary of the corpus is built, a word whose document frequency is above max_df is not kept as a keyword.
If the parameter is a float, it is the share of documents in the corpus that contain the word; if an int, it is an absolute document count.
min_df works the same way, except that a word whose document frequency is below min_df is not kept as a keyword.
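A quick illustration of the document-frequency cutoff on a made-up three-document corpus (illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# '好看' appears in 3 of 3 documents, i.e. a document frequency of 1.0
corpus = ['好看 电影', '好看 剧情', '好看 糟糕']

# With max_df=0.8, any word present in more than 80% of documents is dropped
vec = CountVectorizer(max_df=0.8)
vec.fit(corpus)
print(vec.get_feature_names_out())  # ['剧情', '电影', '糟糕'], '好看' is gone
```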
With this we have successfully built a Pipeline for training and testing.
The training set is trained with pipeline.fit(), and then pipeline.score() predicts on the test set and scores the result.
```python
mnbc_clf = MNB_Classifier()

# Train the model
mnbc_clf.fit(review_train, sentiment_train)

# Accuracy on the test set
print('Test set accuracy: {}'.format(mnbc_clf.score(review_test, sentiment_test)))
```
This completes the whole process from training to prediction.
The accuracy on the test set is roughly 79%-80%.
The reason is that a large number of positive reviews contain negative sentiment words, such as this review of the documentary The Cove:

> I don't think most people who watch this film have any idea that China's baiji dolphin has been extinct for eight years, or that there are only about 1,000 finless porpoises left in the Yangtze River. Instead of lamenting and cursing how the Japanese hunt dolphins, it would be better to do something practical to protect the Yangtze finless porpoise, which will also be extinct in a few years. We Chinese are in no position to claim we do any better than the Japanese.

So removing positive reviews like this one could improve the accuracy.
Save the trained model

First convert the reviews into a word frequency matrix and then compute the TF-IDF values. The snippet below re-declares the vectorizer and transformer with the same parameters used earlier, and the export path matches the model path used in the prediction section:
```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

model_export_path = './data/bayes.pkl'
vectorizer = CountVectorizer(max_df=0.8, min_df=3)
tfidftransformer = TfidfTransformer()

# Word frequency matrix first, then the TF-IDF values
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(review_train))
# Multinomial naive Bayes classifier
clf = MultinomialNB().fit(tfidf, sentiment_train)

# Pickle the classifier together with the fitted vectorizers
with open(model_export_path, 'wb') as file:
    d = {
        "clf": clf,
        "vectorizer": vectorizer,
        "tfidftransformer": tfidftransformer,
    }
    pickle.dump(d, file)
```
Use the trained model to predict review sentiment

Here I paste the entire source code. The code is very simple: the whole processing logic is wrapped in a class, so it is very easy to use.
Clone it from my GitHub if needed.
```python
# -*- coding: utf-8 -*-
import re
import pickle

import numpy as np
import jieba


class SentimentAnalyzer(object):
    def __init__(self, model_path, userdict_path, stopword_path):
        self.clf = None
        self.vectorizer = None
        self.tfidftransformer = None
        self.model_path = model_path
        self.stopword_path = stopword_path
        self.userdict_path = userdict_path
        self.stop_words = []
        self.tokenizer = jieba.Tokenizer()
        self.initialize()

    # Load the model, the stopwords and the custom dictionary
    def initialize(self):
        with open(self.stopword_path, encoding='UTF-8') as words:
            self.stop_words = [i.strip() for i in words.readlines()]
        with open(self.model_path, 'rb') as file:
            model = pickle.load(file)
            self.clf = model['clf']
            self.vectorizer = model['vectorizer']
            self.tfidftransformer = model['tfidftransformer']
        if self.userdict_path:
            self.tokenizer.load_userdict(self.userdict_path)

    # Strip URLs and irrelevant characters, then split the text into sentences
    def replace_text(self, text):
        text = re.sub(r'((https?|ftp|file)://)?[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]\.(com|cn)', '', text)
        text = text.replace('\u3000', '').replace('\xa0', '').replace('“', '').replace('”', '')
        text = text.replace(' ', '').replace('↵', '').replace('\n', '').replace('\r', '').replace('\t', '').replace('）', '')
        text_corpus = re.split('[！。？!?]', text)
        return text_corpus

    # Sentiment scoring
    def predict_score(self, text_corpus):
        # Word segmentation
        docs = [self.__cut_word(sentence) for sentence in text_corpus]
        new_tfidf = self.tfidftransformer.transform(self.vectorizer.transform(docs))
        predicted = self.clf.predict_proba(new_tfidf)
        # Round to three decimal places
        result = np.around(predicted, decimals=3)
        return result

    # jieba word segmentation
    def __cut_word(self, sentence):
        words = [i for i in self.tokenizer.cut(sentence) if i not in self.stop_words]
        result = ' '.join(words)
        return result

    def analyze(self, text):
        text_corpus = self.replace_text(text)
        result = self.predict_score(text_corpus)
        neg = result[0][0]
        pos = result[0][1]
        print('neg: {}, pos: {}'.format(neg, pos))
```
Simply instantiate the analyzer and call the analyze() method.
```python
# -*- coding: utf-8 -*-
from native_bayes_sentiment_analyzer import SentimentAnalyzer

model_path = './data/bayes.pkl'
userdict_path = './data/userdict.txt'
stopword_path = './data/stopwords.txt'
corpus_path = './data/review.csv'

analyzer = SentimentAnalyzer(model_path=model_path, stopword_path=stopword_path, userdict_path=userdict_path)
text = "A disappointing Nolan movie that feels more like a mishmash of Inception. I knew it was going to be a movie that couldn't surpass Prequel 2, but I didn't expect it to be this bad. The loss of rhythm control and the ambiguous positioning of the characters are definitely the wounds of the whole film."
analyzer.analyze(text=text)
```
github.com/3inchtime/d…
All of the code above has been pushed to my GitHub; your suggestions are welcome.