Preface
Drop a green leaf into the cup and sigh that the years pass like a dream; pour in a pool of clear spring water and watch it drift through all its rises and falls.
Let me explain why I decided to write this Python tutorial. Today my manager called me into the office and we drank two cups of tea, but I am not a tea person and have never studied tea. So I put together this tutorial to learn about tea with Python and keep that knowledge from slipping away. You should at least be clear about the most common tea-drinking etiquette, so that you do not embarrass yourself in the future.
Start
This article uses Python to crawl tea review data, explore it, and extract keywords with TextRank, TF-IDF, and an LDA topic model.
Source code at the end of the article
I found a website about tea:
chaping.chayu.com/?bid=1
Data acquisition
Open the tea review section from the home page and you can see the basic information for every tea. The results span multiple pages, so crawl all of them to get the basic fields: title, rating, brand, origin, tea type, detail link, and ID:
Then follow each detail link and crawl one level deeper to get the recommendation index, the overall review, and all rankings for each tea:
Finally, crawl the corresponding user reviews, as many pages as possible, with the fields reviewer, reviewer level, rating, comment, and comment time:
The results are saved to two CSV files, one for the tea information and comment.csv for the reviews:
That is the whole crawler: XPath extraction plus multi-process crawling. The logic is not complex, and the detailed implementation can be found in the source code.
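To make the XPath step concrete, here is a minimal sketch rather than the actual crawler. It uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup; real pages usually call for a tolerant HTML parser. The sample page, its class names, and the helper names `parse_listing` and `parse_pages_in_parallel` are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from multiprocessing import Pool

# Hypothetical, well-formed stand-in for one listing page. The real page
# structure, class names, and attributes are NOT known from the article.
SAMPLE_PAGE = """
<ul>
  <li class="tea" data-id="101">
    <a class="title" href="/tea/101">Brand A Tieguanyin</a>
    <span class="score">8.6</span>
    <span class="origin">Fujian</span>
  </li>
  <li class="tea" data-id="102">
    <a class="title" href="/tea/102">Brand B Pu'er</a>
    <span class="score">9.1</span>
    <span class="origin">Yunnan</span>
  </li>
</ul>
"""


def parse_listing(page_source):
    """Extract one dict per tea from a listing page using XPath-style queries."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for item in root.findall(".//li[@class='tea']"):
        link = item.find("./a[@class='title']")
        rows.append({
            "id": item.get("data-id"),
            "title": link.text.strip(),
            "link": link.get("href"),
            "score": float(item.find("./span[@class='score']").text),
            "origin": item.find("./span[@class='origin']").text.strip(),
        })
    return rows


def parse_pages_in_parallel(pages, workers=2):
    """Parse already-downloaded pages in several processes (hypothetical helper)."""
    with Pool(workers) as pool:
        return pool.map(parse_listing, pages)
```

Calling `parse_listing(SAMPLE_PAGE)` returns two rows. `parse_pages_in_parallel` should be called from under an `if __name__ == "__main__":` guard, since multiprocessing may re-import the script in each worker.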
Data analysis
Roughly 30,000 records were collected in total, so we can start exploring.
First, look at the title, which is made up of the brand and the tea name. Keep only the name part and draw a word cloud.
Black tea, White Peony (Bai Mudan), Tieguanyin, green tea, Maojian, and other familiar tea names appear most often:
Tea scores range from 0 to 10. The scores are grouped into segments of two points each, and a histogram is drawn.
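The two-point segmentation can be sketched in a few lines. This is only an illustration in plain Python: the function name `bin_scores` and the sample scores are made up, and treating each segment as closed on the left (with a score of exactly 10 placed in the last segment) is my assumption, not something the article states.

```python
from collections import Counter


def bin_scores(scores, width=2, max_score=10):
    """Count scores per segment: 0-2, 2-4, ..., 8-10 (left-closed by assumption)."""
    labels = [f"{lo}-{lo + width}" for lo in range(0, max_score, width)]
    counts = Counter()
    for score in scores:
        # A score equal to max_score would overflow, so clamp it into the last segment.
        idx = min(int(score // width), len(labels) - 1)
        counts[labels[idx]] += 1
    return [(label, counts[label]) for label in labels]


# Made-up scores, for illustration only.
sample_scores = [9.2, 8.5, 7.9, 3.5, 10, 6.0, 8.0]
print(bin_scores(sample_scores))
```

The returned list of (segment, count) pairs can be passed straight to whatever charting library draws the histogram.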
The scores turn out to be quite high overall, and only a few teas score below 4. I picked out those low-scoring teas, and their overall reviews are indeed not very kind:
These days almost every kind of tea is sold under a dedicated brand, so count the brands and draw a word cloud.
Dou Ji Tea Industry, Tea, Dayi, Tianfu Tea, and other brands stand out. Even people who know nothing about tea have probably heard of these brands on the street:
Each tea has its own place of origin, so draw a heat map of the origins.
Yunnan accounts for the most teas, with as many as several thousand varieties. I looked it up: Yunnan is one of the most important tea origins and is regarded as the oldest home of tea.
Fujian, with more than a thousand years of tea culture, is also one of the most important tea-producing areas in China.
On this site, tea is divided into ten major categories: Pu'er, green tea, black tea, oolong, dark tea, white tea, scented tea, yellow tea, bagged tea, and instant tea. Each major category is subdivided into many smaller ones, and a histogram of the counts is drawn for each category.
Pu'er has the most subcategories, followed by green tea and black tea.
Search popularity indirectly reflects how popular a tea is, so I selected the 10 most-searched teas and pulled out their details.
The top tea is a classic Pu'er, and Pu'er is also the category with the most varieties. It might be worth buying some to try.
Review times are grouped by year and month, and a trend chart compares the number of reviews in each year and month.
Reviewer activity kept rising from 2014 to 2017 and then declined:
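Grouping reviews by year and month might look like the sketch below. The timestamp format `%Y-%m-%d %H:%M:%S` is an assumption, since the article does not show what the comment time field looks like, and `reviews_per_month` is a made-up helper name.

```python
from collections import Counter
from datetime import datetime


def reviews_per_month(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return sorted ((year, month), count) pairs for a list of timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, fmt)  # fmt is an assumed format
        counts[(dt.year, dt.month)] += 1
    return sorted(counts.items())


# Made-up timestamps, for illustration only.
sample_times = ["2017-01-03 10:00:00", "2016-12-30 08:15:00", "2017-01-20 21:40:00"]
print(reviews_per_month(sample_times))
```

Sorting by (year, month) keeps the pairs in chronological order, which is the order a trend chart needs on its x-axis.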
Pandas, Stylecloud, Jieba, and Pyecharts are used to perform the exploratory analysis.
Keyword extraction
The collected data contains two text fields: the overall review, which is the site's summary comment on each tea, and the individual user comments. Both fields are used for text keyword extraction.
For the overall reviews, the goal is to group teas with similar reviews together. The KMeans clustering algorithm can do this, but the overall review is text, so it has to be turned into numbers first.
The keywords in each overall review are extracted first, using the TextRank algorithm. TextRank splits the text into sentences and words, links words that appear close to each other, iteratively assigns each word a weight based on those links, and takes the highest-scoring words as keywords.
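The idea behind TextRank can be sketched in plain Python. This is a simplified, unweighted version written for illustration, not the implementation used in the article: it assumes the text has already been segmented into a list of words, and the function name, window size, and sample tokens are all my own choices.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=3, damping=0.85, iters=50, top_n=5):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Link each word to the (window - 1) words that follow it.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every word starts with the same score, then scores flow along the links.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iters):
        new_scores = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - damping) + damping * incoming
        scores = new_scores

    # Highest score first; ties are broken alphabetically to stay deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Made-up, pre-segmented review text.
tokens = ["aroma", "sweet", "aroma", "smooth", "aroma", "mellow", "aroma", "fresh"]
print(textrank_keywords(tokens, top_n=3))
```

A word that co-occurs with many different words collects score from all of them, which is why the well-connected word ranks first in this toy input.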
The keywords are then vectorized, cosine similarity is computed between the vectors, and finally a clustering algorithm divides the teas into two categories.
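These three steps can be sketched without any libraries. Note the swap: instead of a library KMeans, this uses a small two-cluster k-means variant that assigns each vector to the center with the highest cosine similarity. The deterministic initialization, the function names, and the sample keyword lists are all assumptions for illustration.

```python
import math


def vectorize(keyword_lists):
    """Turn each keyword list into a 0/1 vector over the shared vocabulary."""
    vocab = sorted({w for kws in keyword_lists for w in kws})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for kws in keyword_lists:
        vec = [0.0] * len(vocab)
        for w in kws:
            vec[index[w]] = 1.0
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_means_cosine(vectors, iters=10):
    """Split vectors into two clusters, assigning by cosine similarity."""
    # Assumed initialization: the first vector and the one least similar to it.
    first = vectors[0]
    second = min(vectors, key=lambda v: cosine(first, v))
    centers = [first, second]
    labels = []
    for _ in range(iters):
        labels = [max(range(2), key=lambda k: cosine(v, centers[k])) for v in vectors]
        for k in range(2):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up keyword lists, one per tea.
keywords = [
    ["aroma", "taste", "smooth"],
    ["taste", "aroma", "sweet"],
    ["color", "shape", "leaf"],
    ["shape", "color", "raw"],
]
print(two_means_cosine(vectorize(keywords)))
```

Teas whose keyword lists overlap end up with the same label, so the taste-oriented and appearance-oriented lists in this toy input fall into different clusters.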
Category 1 is evaluated mainly in terms of taste: aroma, flavor, mouthfeel, smoothness, and so on.
Category 2 is evaluated mainly in terms of appearance: shape, leaf strips, color, raw materials, and so on:
For the user comments, keywords are first extracted with the TF-IDF algorithm, which combines two parts, TF and IDF.
TF (term frequency) measures how often a word appears within a single comment.
IDF (inverse document frequency) counts how many comments contain the word out of all comments and maps that to a score: the fewer comments a word appears in, the higher its score.
Finally, each word is scored as TF * IDF, and the 10 highest-scoring keywords are selected:
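The TF and IDF steps above can be sketched as follows. This is one common textbook formulation (TF as count divided by comment length, IDF as the natural log of total comments over comments containing the word) and is only an assumption about the variant; libraries often use smoothed or precomputed IDF values. The comments are assumed to be segmented into word lists already.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=10):
    """For each tokenized comment, return its top_n (word, score) pairs."""
    n_docs = len(docs)

    # Document frequency: in how many comments does each word appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results


# Made-up, pre-segmented comments.
comments = [["tea", "sweet", "sweet"], ["tea", "bitter"], ["tea", "smooth"]]
for keywords in tfidf_keywords(comments):
    print(keywords)
```

In this toy input the word "tea" appears in every comment, so its IDF is log(1) = 0 and it drops to the bottom of each ranking, which is exactly the filtering effect TF-IDF is meant to have.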
The second method extracts keywords with the LDA topic model. The number of topics has to be chosen before the keywords can be extracted. Here, 1 topic is used and its top 10 keywords are selected:
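To show what "choose the number of topics, then read off the top words" means, here is a toy LDA trained with collapsed Gibbs sampling in plain Python. It is a teaching sketch, not the library call the article used; the function name, the hyperparameters `alpha` and `beta`, the iteration count, and the sample comments are all assumptions.

```python
import random
from collections import Counter


def lda_top_words(docs, num_topics=1, top_n=10, iters=30,
                  alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns the top words of each topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each occurrence's topic given all the others.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Rank words inside each topic by how often they were assigned to it.
    return [
        sorted(vocab, key=lambda w: (-topic_word[k][w], w))[:top_n]
        for k in range(num_topics)
    ]


# Made-up, pre-segmented comments.
comments = [["sweet", "aroma", "sweet"], ["sweet", "smooth"], ["aroma", "sweet"]]
print(lda_top_words(comments, num_topics=1, top_n=10))
```

With a single topic, every word is necessarily assigned to that topic, so in this sketch the ranking reduces to plain word frequency. More than one topic is needed before the model starts separating words into themes.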
Source code
Source code is available here