Author: Yishui Hancheng, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV
Blog: yishuihancheng.blog.csdn.net
Sentiment analysis methods fall into three broad categories: dictionary-based methods, supervised machine learning methods, and unsupervised machine learning methods. The dictionary-based approach builds a sentiment dictionary covering various emotions and defines scoring rules: the text is segmented into words, the words are matched against the dictionary, and the positive and negative sentiment words found in the text are counted to compute a sentiment value, yielding the proportions of positive, negative, and neutral sentiment for each film review. The sentiment value is then used to judge the overall sentiment polarity of the review text. Machine learning methods, by contrast, require a large amount of manually annotated corpus as a training set; a model is trained with a machine learning algorithm and then used to classify the sentiment polarity of new texts.
The roadmap of this work is as follows:
1. Build a Python crawler to collect Douban review data, and perform word segmentation and part-of-speech tagging on the data; the user rating attached to each review serves as weak annotation information.
2. Based on HowNet and NTUSD, use PMI (pointwise mutual information) to build a sentiment dictionary for the film domain. Sentiment words are located in each review with this dictionary, and negation words and degree adverbs in front of them are searched to compute a sentiment value (see the sketch after this list).
3. Perform classification with a support vector machine (SVM), evaluate the final classification performance with accuracy and related metrics, and present the results with word clouds and analysis tables.
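To make step 2 concrete, here is a minimal sketch of dictionary-based sentiment scoring with negation and degree-adverb handling. The word lists and weights are illustrative assumptions, not the dictionaries actually used in this project:

```python
# Minimal sketch of dictionary-based sentiment scoring (illustrative word lists).
POS_WORDS = {"精彩", "感动", "好看"}            # hypothetical positive words
NEG_WORDS = {"无聊", "失望", "难看"}            # hypothetical negative words
NEGATIONS = {"不", "没", "没有"}                # negation words flip polarity
DEGREE = {"非常": 2.0, "很": 1.5, "有点": 0.5}  # degree adverbs scale the score

def sentence_score(words):
    """Score a segmented sentence: +1/-1 per sentiment word, scaled by a
    preceding degree adverb and flipped by a preceding negation word."""
    score = 0.0
    for i, w in enumerate(words):
        base = 1.0 if w in POS_WORDS else -1.0 if w in NEG_WORDS else 0.0
        if base == 0.0:
            continue
        # look back up to 3 tokens for negations and degree adverbs
        for prev in words[max(0, i - 3):i]:
            if prev in NEGATIONS:
                base = -base
            elif prev in DEGREE:
                base *= DEGREE[prev]
        score += base
    return score

print(sentence_score(["剧情", "非常", "精彩"]))   # -> 2.0
print(sentence_score(["特效", "不", "好看"]))     # -> -1.0
```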
This system is mainly an implementation and adaptation of the following papers; if you want a deeper understanding, you can refer to them for the underlying ideas.
Introduction to the system implementation:
The main work of this project is to build a crawler to collect the raw review data of the Douban movies "The Wandering Earth" and "Pegasus", to analyze and mine that data, to construct a domain sentiment dictionary for film reviews, and to use the user rating attached to each review as weak annotation information so that the review data can be labeled automatically, saving manual labeling cost. Finally, sentiment analysis of the film review data is completed with a support vector machine (SVM) model.
The schematic diagram of the system architecture is shown below:
The functions of each module are described as follows:
1. Data acquisition: write a crawler to crawl and store the review data of the specified movies.
2. Clean the obtained data and parse it into JSON data objects that are more convenient for us to use.
3. Perform jieba word segmentation on the comment information in the data after parsing, and perform part-of-speech tagging at the same time.
4. Obtain the NTUSD and HowNet sentiment dictionaries, then merge and clean them to obtain the base sentiment dictionary data.
5. Generate top-k positive and negative word lists for the film review data based on high-frequency vocabulary mining.
6. Compute the sentiment polarity of newly mined words.
7. Combine the sentiment polarity computed from the review text with the user's rating data to annotate the sample data.
8. Extract the feature data for the five specified dimensions.
9. Build the SVM model for training and classification testing, compute the evaluation metrics, and visualize the results.
Having introduced the system as a whole and each of its modules, we now turn to the implementation. First, the data acquisition module; the crawler start-up code is shown below:
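Since the original crawler code is presented as a screenshot, the following is only a minimal sketch of how such a Douban short-comment crawler might look; the URL pattern, CSS selectors, and movie IDs are assumptions and should be verified against the live pages:

```python
# Minimal crawler sketch (assumed URL pattern and CSS selectors).
import time
import json
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}          # browser-like UA to avoid being blocked
MOVIES = {"The Wandering Earth": "26266893",     # assumed Douban movie IDs
          "Pegasus": "30163509"}

def crawl_comments(movie_id, pages=5):
    """Crawl a few pages of short comments for one movie."""
    comments = []
    for page in range(pages):
        url = (f"https://movie.douban.com/subject/{movie_id}/comments"
               f"?start={page * 20}&limit=20&status=P&sort=new_score")
        resp = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        for item in soup.select(".comment-item"):            # assumed selector
            text = item.select_one(".short")
            rating = item.select_one(".rating")
            comments.append({
                "comment": text.get_text(strip=True) if text else "",
                "rating": rating["class"][0] if rating else None,  # e.g. "allstar40"
            })
        time.sleep(2)                                         # be polite to the server
    return comments

if __name__ == "__main__":
    data = {name: crawl_comments(mid) for name, mid in MOVIES.items()}
    with open("douban_comments.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```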
The code above crawls the review data of the two movies "The Wandering Earth" and "Pegasus", both popular films from this year's Chinese New Year season.
Screenshots of some of the review data are shown below:
The Wandering Earth
Pegasus
The next step is data parsing and jieba word segmentation. Screenshots of some results are as follows:
The Wandering Earth
Pegasus
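As a reference for the segmentation step just mentioned, here is a minimal sketch using jieba's POS-tagging interface; the input file name follows the crawler sketch above and is an assumption:

```python
# Sketch: jieba word segmentation with POS tagging on the crawled comments.
import json
import jieba.posseg as pseg

with open("douban_comments.json", encoding="utf-8") as f:   # assumed file name
    data = json.load(f)

segmented = []
for movie, comments in data.items():
    for c in comments:
        # pseg.cut yields pairs with .word and .flag, e.g. ("电影", "n")
        pairs = [(p.word, p.flag) for p in pseg.cut(c["comment"])]
        segmented.append({"movie": movie, "tokens": pairs, "rating": c.get("rating")})

print(segmented[0] if segmented else "no data")
```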
In parallel with word segmentation, the sentiment dictionary needs to be constructed. The required dictionary data (HowNet and NTUSD) can be downloaded directly from the Internet and combined, so it is not introduced in detail here:
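A minimal sketch of this merge step, assuming the HowNet and NTUSD word lists have been downloaded as plain-text files (the file names are placeholders):

```python
# Sketch: merge HowNet and NTUSD word lists into base positive/negative dictionaries.
def load_words(path):
    """Read one sentiment word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# placeholder file names for the downloaded dictionaries
pos_words = load_words("hownet_pos.txt") | load_words("ntusd_pos.txt")
neg_words = load_words("hownet_neg.txt") | load_words("ntusd_neg.txt")

# drop words that appear in both lists to avoid contradictory labels
ambiguous = pos_words & neg_words
pos_words -= ambiguous
neg_words -= ambiguous

print(len(pos_words), "positive words,", len(neg_words), "negative words")
```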
The following shows the word frequency statistics computed from the segmentation results; part of the result data is shown below:
The Wandering Earth
Pegasus
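A minimal sketch of this word-frequency step, reusing the segmented tokens from the jieba sketch above:

```python
# Sketch: word frequency statistics over the segmented comments.
from collections import Counter

counter = Counter()
for item in segmented:                      # 'segmented' from the jieba step above
    counter.update(w for w, flag in item["tokens"] if len(w) > 1)  # skip single characters

print(counter.most_common(20))              # top-20 high-frequency words
```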
Next, high-frequency positive and negative sentiment words are mined with the PMI method to construct a sentiment dictionary for the film review domain. Segmented text data from other domains can be processed in the same way.
The specific implementation code of domain emotion dictionary construction is shown as follows:
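The original implementation is shown as a screenshot; as a reference, here is a minimal sketch of PMI-based polarity scoring, where a candidate word's polarity is the difference between its PMI with known positive seeds and its PMI with known negative seeds (SO-PMI). The seed sets and the `segmented` corpus are carried over from the earlier sketches and are assumptions:

```python
# Sketch: SO-PMI polarity scoring for candidate words (document-level co-occurrence).
import math
from collections import Counter

docs = [[w for w, _ in item["tokens"]] for item in segmented]   # one token list per comment
N = len(docs)

def doc_freq(word):
    return sum(1 for d in docs if word in d)

def co_freq(w1, w2):
    return sum(1 for d in docs if w1 in d and w2 in d)

def pmi(w1, w2):
    """Pointwise mutual information based on document co-occurrence, add-one smoothed."""
    p_xy = (co_freq(w1, w2) + 1) / N
    p_x = (doc_freq(w1) + 1) / N
    p_y = (doc_freq(w2) + 1) / N
    return math.log2(p_xy / (p_x * p_y))

def so_pmi(word, pos_seeds, neg_seeds):
    """Semantic orientation: positive if the word co-occurs more with positive seeds."""
    return (sum(pmi(word, s) for s in pos_seeds) -
            sum(pmi(word, s) for s in neg_seeds))

# high-frequency candidate words not already in the base dictionary (assumed seed lists)
pos_seeds = list(pos_words)[:20]
neg_seeds = list(neg_words)[:20]
candidates = [w for w, _ in Counter(w for d in docs for w in d).most_common(500)
              if w not in pos_words and w not in neg_words and len(w) > 1]

domain_pos = {w for w in candidates if so_pmi(w, pos_seeds, neg_seeds) > 0}
domain_neg = {w for w in candidates if so_pmi(w, pos_seeds, neg_seeds) < 0}
```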
After the domain sentiment dictionary is built, the original sample data can be annotated by combining the dictionary-based sentiment score with the weak-label information from users' ratings. Since the code is long, only part of the implementation is shown here:
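A minimal sketch of this combination, assuming a review is labeled positive only when the dictionary score and the user rating agree; the rating format, thresholds, and agreement rule are assumptions:

```python
# Sketch: weak labeling by combining dictionary sentiment score and user rating.
def rating_to_stars(rating_class):
    """Douban rating classes look like 'allstar40' -> 4 stars (assumed format)."""
    if rating_class and rating_class.startswith("allstar"):
        return int(rating_class[len("allstar"):]) // 10
    return None

labeled = []
for item in segmented:
    words = [w for w, _ in item["tokens"]]
    score = sentence_score(words)                  # dictionary-based score from the earlier sketch
    stars = rating_to_stars(item.get("rating"))
    if stars is None:
        continue
    # keep only samples where the two weak signals agree, to reduce label noise
    if score > 0 and stars >= 4:
        labeled.append((words, 1))                 # positive
    elif score < 0 and stars <= 2:
        labeled.append((words, 0))                 # negative
```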
After sample annotation combining the two methods above is complete, feature extraction can be carried out on the text data. The feature extraction code is shown below:
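The original code is shown as a screenshot and the five feature dimensions are not listed in the text, so the sketch below simply uses TF-IDF vectorization as an illustrative stand-in for turning the labeled comments into a feature matrix:

```python
# Sketch: vectorize the labeled comments (TF-IDF as an illustrative stand-in
# for the project's five-dimensional features).
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [" ".join(words) for words, _ in labeled]   # space-joined tokens
y = [label for _, label in labeled]

vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(texts)                 # sparse feature matrix
```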
The code above extracts and vectorizes the feature data; with that done, we can move on to model construction, testing, and visual analysis of the results.
The word cloud visualization of the film review data is shown below:
The Wandering Earth
Pegasus
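A minimal sketch of generating such a word cloud with the wordcloud package; the font path is an assumption needed for rendering Chinese text:

```python
# Sketch: word cloud of high-frequency words (font_path is an assumption for Chinese rendering).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(font_path="simhei.ttf",      # a Chinese font available on the system
               background_color="white",
               width=800, height=600)
wc.generate_from_frequencies(dict(counter.most_common(200)))   # 'counter' from the frequency step

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```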
The system then performs topic mining on the film review data with the LDA topic model, and the visual analysis results are as follows:
The Wandering Earth
Pegasus
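A minimal sketch of the LDA topic mining step with gensim; the number of topics and passes are assumptions:

```python
# Sketch: LDA topic mining over the segmented comments (num_topics is an assumption).
from gensim import corpora
from gensim.models import LdaModel

token_docs = [[w for w, _ in item["tokens"] if len(w) > 1] for item in segmented]
dictionary = corpora.Dictionary(token_docs)
corpus = [dictionary.doc2bow(doc) for doc in token_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)
```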
After that, the support-rate curves of the two films are visualized as follows:
The system uses the SVM model, and the code of the model is as follows:
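The original model code is presented as a screenshot; the sketch below shows a plausible scikit-learn version, with the kernel, split ratio, and metric set as assumptions:

```python
# Sketch: SVM training and evaluation with scikit-learn (kernel and split ratio are assumptions).
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "F_value": f1_score(y_test, y_pred),
}
print(metrics)
```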
The computed values of the four evaluation metrics for the model's classification are as follows:
{"recall": 1.0."F_value": 0.8979477087433231."precision": 0.814795918367347."accuracy": 0.814795918367347}
The visualization is as follows:
Taking all the metrics together, the classification performance of the model is good. This concludes the work described in this article. I am glad to have had the chance to review what I have learned and write something to share; if you find the content useful or enlightening, I hope for your encouragement and support. Thank you!