How do you run sentiment analysis on a batch of comments and visualize the results on a timeline? Public opinion analysis is not hard. Let's implement it in Python.
Pain points
You are the regional manager of a hot pot restaurant chain, and you pay close attention to customer reviews. You used to worry that customers didn't like writing reviews. Lately the restaurants have become popular, with more branches and more customers writing reviews, so you have a new pain: there are too many reviews to read.
From me, you learned about sentiment analysis, a handy automated tool, and suddenly you saw the light.
You find your branch's page on a popular review site and ask an assistant to collect the reviews and their posting dates. Because the assistant doesn't know how to use a crawler, the comments have to be copied from the web page and pasted into Excel. By the end of the day, there are only 27. (Note that we use real review data here. To avoid causing trouble for the business under review, the name of the restaurant has been replaced with “RESTAURANT A”.)
Luckily you only want to run an experiment, so this will do. You use the Chinese sentiment analysis tool I introduced earlier to get the sentiment score of each comment in turn. When the first results come out, you are thrilled and feel you have found the ultimate tool for public opinion analysis.
But the good times never last. You soon realize that if you have to run the program separately for each comment, you might as well read the comments yourself.
What to do?
The sequence
Of course there is a way. The data frame introduced in the article “To Borrow or Not to Borrow: How to Use Python and Machine Learning to Help You Make Decisions” can process many records at once, which improves efficiency.
But that's not enough. We can also plot the sentiment analysis results as a time series, so you can see the trend at a glance: are people more or less satisfied with your restaurant lately?
What we humans are best at processing is images. Over a long evolutionary history we were forced to keep improving our ability to process images quickly and accurately, or be eliminated by the environment. Hence the saying “a picture is worth a thousand words”.
Preparation
First, you need to install the Anaconda suite. See “How to Make a Word Cloud in Python” for detailed steps.
Download the Excel file restaurant-comments.xlsx here.
Open it in Excel. If everything looks fine, move the file to our working directory, demo.
Because this example analyzes Chinese comments, the package we use is SnowNLP. For the basics of sentiment analysis, see “How to Use Python to Do Sentiment Analysis”.
Open your system's terminal (macOS, Linux) or command prompt (Windows), change to our working directory, demo, and run the following commands.
pip install snownlp
pip install ggplot
The operating environment is configured.
At a terminal or command prompt type:
jupyter notebook
If the Jupyter Notebook works correctly, we’re ready to start coding.
Code
Let's create a new Python 2 notebook in Jupyter Notebook and name it time-series.
First, we import pandas, the data frame analysis tool, under the short alias pd for convenience.
import pandas as pd
Next, read the Excel data file:
df = pd.read_excel("restaurant-comments.xlsx")
Let’s see if the contents are complete:
df.head()
The results are as follows:
Notice the time column here. If the time format in your Excel file is the same as the one here, pandas will be smart enough to recognize it as a time format.
If, on the other hand, your time values are only precise to the day, such as “2017-04-20”, pandas will treat them as strings, and they cannot be used for time series analysis. The fix is to add the following two lines:
from dateutil import parser
df["date"] = df.date.apply(parser.parse)
In this way, you have the correct time data.
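As an alternative to dateutil, pandas has its own converter. The following is a minimal sketch, assuming day-precision strings like the “2017-04-20” example above; the sample rows are made up for illustration and only the column names follow the article's data frame.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the article's data.
df = pd.DataFrame({
    "comments": ["good", "slow service", "great"],
    "date": ["2017-04-20", "2017-04-21", "2017-04-23"],
})

# Convert the whole string column to datetime values in one call.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# The date column should now report a datetime64 dtype instead of object.
print(df.dtypes)
```

Passing an explicit format is optional, but it makes the conversion fail loudly if a cell does not match the expected pattern.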
Once we have confirmed the data is complete, we can run the sentiment analysis. Let's start with a small experiment on the comment in the first row.
text = df.comments.iloc[0]
Then we call the SnowNLP sentiment analysis tool.
from snownlp import SnowNLP
s = SnowNLP(text)
Here’s SnowNLP’s analysis:
s.sentiments
The result is:
0.6331975099099649
The sentiment score is computed correctly. Next, we need to define a function that processes all the comments in bulk.
def get_sentiment_cn(text):
    # Return SnowNLP's positive-sentiment score for one comment.
    s = SnowNLP(text)
    return s.sentiments
Then we use the powerful apply method in pandas to process all the comments in one go, storing the resulting sentiment scores in a new column of the data frame called sentiment.
df["sentiment"] = df.comments.apply(get_sentiment_cn)
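One practical wrinkle: a blank Excel cell arrives in pandas as a missing value rather than a string, which a text scorer is not designed to handle. The sketch below shows one possible guard. The names safe_score and toy_score are hypothetical; toy_score is only a stand-in so the example runs without SnowNLP, and in the notebook you would pass get_sentiment_cn as the scorer instead.

```python
import pandas as pd

def toy_score(text):
    # Hypothetical stand-in for get_sentiment_cn, so this runs without SnowNLP.
    return 0.9 if "good" in text else 0.1

def safe_score(text, scorer=toy_score):
    # Skip missing or blank comments and mark them as NaN instead of scoring.
    if not isinstance(text, str) or not text.strip():
        return float("nan")
    return scorer(text)

# Made-up comments, including a missing cell and a whitespace-only cell.
df = pd.DataFrame({"comments": ["good food", None, "   ", "long wait"]})
df["sentiment"] = df["comments"].apply(safe_score)

print(df)
```

Because pandas skips NaN values in mean() and median() by default, unscored rows would not distort the summary statistics.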
Let's look at the sentiment analysis results:
df.head()
The new sentiment column has been generated. As explained earlier, SnowNLP's result ranges from 0 to 1 and represents the probability that the sentiment is positive. Looking at the first few rows, we find that customers' comments about this branch on the review site are generally positive, and some are very positive.
But observing a small amount of data can lead to biased conclusions. Let's take the average of all the sentiment scores using the mean() function.
df.sentiment.mean()
The result is:
0.7114015318571119
The result is over 0.7, indicating that customers have a positive attitude towards the store as a whole.
Next, let's look at the median, using the median() function.
df.sentiment.median()
The result is:
0.9563139038622388
We found something interesting: not only is the median higher than the mean, it is almost 1 (completely positive).
This means that most of the reviews are overwhelmingly positive, but a few outliers pull the average down significantly.
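To see why a handful of low scores separates the mean from the median, here is a small sketch using Python's standard statistics module. The scores are invented for illustration and are not taken from the restaurant data.

```python
from statistics import mean, median

# Hypothetical scores: five near 1, plus two very low outliers.
scores = [0.98, 0.97, 0.99, 0.96, 0.95, 0.02, 0.05]

avg = mean(scores)
mid = median(scores)

# The median stays with the bulk of the data; the mean is dragged down.
print("mean:  ", round(avg, 3))
print("median:", round(mid, 3))
```

The median only depends on the middle of the sorted list, so two extreme values barely move it, while every value contributes equally to the mean.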
Let's use a time series visualization of sentiment to see when these outliers occurred and how low their scores really are.
We need the ggplot plotting toolkit. It was originally available only in R, making users of other data analysis tools drool with envy. Fortunately, it was soon ported to Python.
We import the plotting functions from ggplot and tell Jupyter Notebook to display images inline.
%pylab inline
from ggplot import *
There may be some warning messages here. It doesn’t matter, just ignore it.
Now let's draw the graph. Type the following line.
ggplot(aes(x="date", y="sentiment"), data=df) + geom_point() + geom_line(color = 'blue') + scale_x_date(labels = date_format("%Y-%m-%d"))
You can see how concise and user-friendly ggplot's plotting syntax is. You just tell Python which data frame to use, which column goes on the horizontal axis and which on the vertical axis, draw the points first and then the line, and specify the line's color. Then you state the format in which dates should appear on the x axis. All the parameters read much like natural language, intuitive and easy to understand.
After execution, you can see the resulting graph.
In the graph, we see that many comments have extremely high sentiment scores. At the same time, we can clearly spot some very low points, where the sentiment score is close to zero. Python judges these comments to contain almost no positive sentiment.
Looking at the timeline, in the most recent period a fairly serious negative review appears almost every few days.
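If the ggplot package does not install or import cleanly in your environment, a similar chart can be drawn with matplotlib. This is a rough sketch of a substitute technique, not the article's method, and the dates and scores below are made up for illustration.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the same two columns the article plots.
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-04-01", "2017-04-05", "2017-04-09", "2017-04-12"]),
    "sentiment": [0.95, 0.10, 0.99, 0.90],
}).sort_values("date")

fig, ax = plt.subplots(figsize=(8, 4))

# A blue line with a marker at each comment, like geom_line plus geom_point.
ax.plot(df["date"], df["sentiment"], color="blue", marker="o")

# Format the x-axis ticks as year-month-day.
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.set_xlabel("date")
ax.set_ylabel("sentiment")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

In a Jupyter notebook with inline plotting enabled the figure appears below the cell; elsewhere you could call fig.savefig with a file name of your choice.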
As a manager, you may be on pins and needles, eager to find out what happened. You don't have to hunt through the data frame or the Excel file for the comments with the lowest sentiment scores. pandas data frames give you good sorting capabilities. Suppose you want to find the comment with the lowest sentiment score; do it like this:
# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement.
df.sort_values(['sentiment'])[:1]
The result is:
The sentiment score is almost zero! However, the data frame display truncates the comment. Let's print the comment in full.
print(df.sort_values(['sentiment']).iloc[0].comments)
The full comment is as follows:
This time I went on Valentine's Day itself. I had never gone out on the actual day before, not because I didn't have a boyfriend, but because everywhere feels crowded, so I would deliberately avoid it. This time I was really craving Restaurant A, so I went out on the day anyway. Around four in the afternoon I checked and the queue number was already past one hundred. Driving from home takes an hour with traffic, so I took a number online two hours in advance. When we got there I saw only thirty-odd numbers ahead of us, and I thought there would surely be no problem and we could eat after a short wait. I did not expect the tragedy. From the moment we sat down in the waiting area, a number was called only about every twenty minutes, and many times along the way I wanted to leave, ha ha. In the end we waited until 9:00. The waiters felt less attentive than when it is quiet, but that is to be expected: one person looks after several tables, and with so many people on a holiday they must be very tired. So I mostly ran the errands myself and did not ask the waiter for much, only for help with the shrimp paste. As for the environment, the hygiene felt good, but it was a bit too noisy. The taste was the same as always. The most considerate thing about Restaurant A is that, seeing we had waited more than two hours, they came over and gave us a discount card that could be used on the spot, which felt very good. It lives up to the name of Restaurant A and is more considerate than most. This time I simply picked the wrong day. Next time I will book in advance, or avoid holidays. It is just too popular!
Reading this, you can see that the customer did have a bad experience: the wait was so long that the word “tragedy” was used. Other factors mentioned were less attentive service and a noisy environment. It was the presence of these words that made the sentiment score so low.
Fortunately, the customer was very understanding and spoke positively of the branch's considerate gestures.
As you can see from this example, while sentiment analytics can help you automate a lot of things, you can’t rely on it entirely.
Natural language analysis should look not only at keywords that express strong emotion but also at many other factors such as phrasing and context. These are at the research frontier of natural language processing, and we look forward to scientists' findings being applied to improve the accuracy of sentiment analysis.
However, even if the current automated processing of sentiment analysis is not very accurate, it can still help you quickly locate anomalies that may be problematic. In terms of efficiency, it is much higher than manual processing.
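Locating possible problem comments can be sketched as a simple filter. The 0.3 cutoff and the sample rows below are assumptions chosen for illustration; a sensible threshold depends on your own data.

```python
import pandas as pd

# Hypothetical scored comments; column names follow the article's data frame.
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-04-01", "2017-04-05", "2017-04-09", "2017-04-12"]),
    "comments": ["great broth", "waited two hours", "friendly staff", "too noisy"],
    "sentiment": [0.95, 0.02, 0.99, 0.25],
})

THRESHOLD = 0.3  # assumed cutoff, not a value from the article

# Keep only low-scoring comments, worst first.
flagged = df[df["sentiment"] < THRESHOLD].sort_values("sentiment")

# Print each flagged comment in full rather than as a truncated preview.
for row in flagged.itertuples(index=False):
    print(row.date.date(), round(row.sentiment, 2), row.comments)
```

The flagged rows are candidates for a manual read, not verdicts, since a low score can come from a comment that is understanding overall.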
Having read this comment, you breathe a sigh of relief. Summing up the lesson, you decide to carry on with considerate service. You also realize that you can collect data on how long customers wait, and use data analysis to give customers waiting for a table a more reasonable estimate of the waiting time. That way customers won't have to wait until very late.
Congratulations, manager! In the age of data intelligence, you’re already moving in the right direction.
Now it’s time for you to read the next negative comment carefully…
Discussion
Besides sentiment analysis and time series visualization, how else do you think Chinese comment data could be mined? Besides review sites, what other data sources do you know of for public opinion analysis? Feel free to leave a comment and share your thoughts so we can discuss them together.
If you liked this article, please give it a thumbs up. You can also follow and pin my official WeChat account “Nkwangshuyi”.
If you're interested in data science, check out my tutorial index post, “How to Get Started in Data Science Effectively”, where you'll find more interesting problems and solutions.