Besides family reunions, moon cakes are the other big topic of this year’s Mid-Autumn Festival. Whose moon cakes are the most popular? Which flavors sell best?
In this article I will crawl the mooncake sales data on taobao.com with you and then use data analysis to find out who is the “king of mooncakes” this year. Before we crawl and analyze the data, though, let’s take a look at the history of mooncakes.
1. The history of mooncakes
It is said that as early as the Yin and Zhou Dynasties, people in Jiangsu and Zhejiang made a kind of “Taishi cake”, thin at the edges and thick in the middle, in memory of Grand Tutor (Taishi) Wen Zhong. It can be regarded as the ancestor of the moon cake. As for why Wen Zhong was commemorated, I do not know.
Later, in the Northern Song Dynasty, the royal family took a fancy to it and ate it at the Mid-Autumn Festival. It has to be said that the royal family and the aristocracy had more than ten times the publicity power of ordinary people, which did a great deal to spread moon cakes down the centuries.
Seeing that it was popular and wanting to ride the trend, the aristocrats decided it needed a name, so they came up with a formal name, “moon ball”, and an informal one, “small cake”. I wonder why they couldn’t just combine the two.
As for the name “moon cake”, it did not appear in books until the Southern Song Dynasty. What follows is presumably the inspirational story of the moon cake working its way, step by step, to the center of the Mid-Autumn Festival.
It is well known that the four traditional schools of Chinese moon cakes are Cantonese-style (Guang), Beijing-style (Jing), Suzhou-style (Su), and Chaoshan-style (Chao) moon cakes. As times have changed, many new types have appeared, such as ice moon cakes, seafood moon cakes, ice cream moon cakes, and so on.
2. Data acquisition
I took the mooncakes listed on Taobao.com as the target in order to get recent mooncake sales across the country. Target link: s.taobao.com/search?q= moon cake…
Tools & Modules:
Tools: Python 3.7 + Sublime Text
Modules: requests, jieba, matplotlib, WordCloud, imread, pandas, numpy, etc.
The purpose of the analysis is to look at the sales totals for different keywords, the distribution of moon cake prices and sales, and moon cake sales in different provinces.
The detailed code is as follows:
import re

import requests


# Download a web page; return "" if the request fails
def get_html_text(url):
    try:
        res = requests.get(url, timeout=30)
        res.raise_for_status()
        res.encoding = res.apparent_encoding
        return res.text
    except requests.RequestException:
        return ""


# Parse the page and save the data
def parse_page(html):
    try:
        # Capture groups return the quoted values directly, so eval() is not needed
        plt = re.findall(r'"view_price":"([\d.]*)"', html)
        tlt = re.findall(r'"raw_title":"(.*?)"', html)
        loc = re.findall(r'"item_loc":"(.*?)"', html)
        sale = re.findall(r'"view_sales":"(.*?)"', html)
        for i in range(len(plt)):
            price = plt[i]
            title = tlt[i]
            # Keep only the first part of the location (the province)
            location = loc[i].split(' ')[0]
            # Keep only the leading digits of the sales text
            sales = re.match(r'\d+', sale[i]).group(0)
            print(price)
            # File name assumed to match the data.txt read by the analysis code
            with open("data.txt", 'a', encoding='utf-8') as f:
                f.write(title + ',' + price + ',' + sales + ',' + location + '\n')
    except Exception as exc:
        # Report the problem instead of printing an empty line
        print("parse failed:", exc)


def main():
    goods = "moon cakes"  # search keyword
    depth = 100  # number of result pages to request
    # requests needs the scheme (https://) in the URL
    start_url = 'https://s.taobao.com/search?q=' + goods
    for i in range(depth):
        try:
            # The s parameter advances by 44 for each page
            url = start_url + '&s=' + str(44 * i)
            print('url=', url)
            html = get_html_text(url)
            parse_page(html)
        except Exception:
            continue


main()
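To make the extraction step concrete, here is a minimal sketch that runs the same kind of regular expressions on a small hand-written string. The SAMPLE_HTML text and every value in it are invented for illustration and only imitate the "key":"value" layout that the patterns look for; the real page markup may differ. The helper name extract_items is hypothetical as well.

```python
import re

# Hypothetical fragment imitating the "key":"value" pairs the scraper looks for.
# The titles, prices and sales figures below are made up for illustration.
SAMPLE_HTML = (
    '{"raw_title":"Cantonese egg yolk mooncake gift box","view_price":"89.00",'
    '"item_loc":"Guangdong Guangzhou","view_sales":"1200 paid"},'
    '{"raw_title":"Five kernel mooncake","view_price":"19.90",'
    '"item_loc":"Beijing","view_sales":"35 paid"}'
)


def extract_items(html):
    """Return one dict per item found in html."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    locations = re.findall(r'"item_loc":"(.*?)"', html)
    sales = re.findall(r'"view_sales":"(.*?)"', html)

    items = []
    for price, title, location, sale in zip(prices, titles, locations, sales):
        digits = re.match(r'\d+', sale)  # leading digits of the sales text
        items.append({
            "title": title,
            "price": float(price),
            "province": location.split(' ')[0],
            "sales": int(digits.group(0)) if digits else 0,
        })
    return items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Returning dictionaries rather than writing straight to a file keeps the parsing step easy to check on its own before any crawling is done.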
Properties of the Response object
- r.status_code: the return status of the HTTP request. 200 means the request succeeded; 404 means the requested page was not found.
- r.text: the HTTP response content as a string, that is, the page content corresponding to the URL.
- r.encoding: the encoding of the response content guessed from the HTTP header.
- r.apparent_encoding: the encoding of the response content inferred from the content itself (a fallback encoding).
3. Data cleaning preview
As can be seen from the above figure, the average price of mooncakes across the whole platform is about 90 yuan, the most expensive mooncake costs as much as 9,999 yuan, and the highest sales volume is 355,444 (figures reflect the data at the time of crawling).
4. Data analysis and visualization
Cantonese-style moon cakes remain in vogue, and egg yolk and lotus seed paste flavors are much loved
Conclusion:
Cantonese-style moon cakes and gift boxes account for a high proportion of the listings. In terms of flavor, egg yolk has a very high share, higher than both lotus seed paste and five kernel; other flavors include bean paste, fruit, and ham. Among brands, Beijing Daoxiang Village and Guangdong Huamei rank first. Keywords such as gift box, enterprise, employee, group buying, and wholesale suggest that Taobao is also one of the channels through which companies buy moon cakes to give to their employees.
The detailed code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
# Top-level chart imports and Overlap belong to the pyecharts 0.5.x API
from pyecharts import Geo, Style, Line, Bar, Overlap
from wordcloud import WordCloud, ImageColorGenerator
from os import path
from pylab import mpl
import jieba

f = open(r"C:\Users\Administrator\Desktop\data.txt", encoding='utf-8')

df = pd.read_csv(f, sep=',', names=['title', 'price', 'sales', 'location'])

title = df.title.values.tolist()

# Segment each title into words
title_s = []
for line in title:
    title_cut = jieba.lcut(line)
    title_s.append(title_cut)

title_clean = []

# Stop word list (shown here in translation; the crawled titles are Chinese,
# so a working list must hold the Chinese forms of these words)
stopwords = ["moon cakes", "gift", "taste", "box", "bag mail", "", " ", "gifts", "big",
             "Mid-Autumn festival", "moon cakes", "2", "bread", "rong", "more", "a", "taste", "jin", "send", "", "old",
             "Beijing", "Yunnan", "Internet celebrity old"]

# Remove the stop words
for line in title_s:
    line_clean = []
    for word in line:
        if word not in stopwords:
            line_clean.append(word)
    title_clean.append(line_clean)

title_clean_dist = []

# Remove duplicate words within each title
for line in title_clean:
    line_dist = []
    for word in line:
        if word not in line_dist:
            line_dist.append(word)
    title_clean_dist.append(line_dist)

# Flatten all titles into a single list of words
allwords_clean_dist = []
for line in title_clean_dist:
    for word in line:
        allwords_clean_dist.append(word)

df_allwords_clean_dist = pd.DataFrame({'allwords': allwords_clean_dist})

# Count how often each filtered, de-duplicated word occurs
word_count = df_allwords_clean_dist.allwords.value_counts().reset_index()
word_count.columns = ['word', 'count']

# Image whose non-white area defines the shape of the word cloud
background_image = plt.imread('1.jpg')

wc = WordCloud(width=1024, height=768, background_color='white',
               mask=background_image, font_path=r"C:\simhei.ttf", max_font_size=400,
               random_state=50)

# Build the cloud from the 100 most frequent words
wc = wc.fit_words({x[0]: x[1] for x in word_count.head(100).values})

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save the image next to this script
d = path.dirname(__file__)
wc.to_file(path.join(d, "yuebing.png"))
Knowledge:
font_path: string // Font path. To display text in a particular font, give the path to the font file including its suffix, for example: font_path = 'simhei.ttf';
mask: nd-array or None (default=None) // If mask is None, the word cloud is drawn on a rectangular canvas of the given width and height. If mask is not None, the width and height settings are ignored and the shape of the mask is used instead. Areas that are pure white (#FFFFFF) are not drawn on, and the word cloud fills the rest. For example, with bg_pic = imread('an image.png'), the background of the image must be pure white (#FFFFFF) and the shape to be filled must be some other color. You can use Photoshop to copy the shape onto a pure white canvas and then save it;
stopwords: set of strings or None // The words to be excluded. If None, the built-in STOPWORDS list is used;
background_color: color value (default="black") // Background color. For example, background_color='white' gives a white background;
max_font_size: int or None (default=None) // The maximum font size to display;
fit_words(frequencies) // Generate the word cloud from word frequencies (frequencies is a dictionary)
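The frequency dictionary handed to fit_words can be built in several ways. The sketch below uses only the standard library and assumes the titles have already been split into words (the article does this with jieba). The sample titles, the stop-word set, and the helper name keyword_frequencies are illustrative assumptions, not data from the crawl.

```python
from collections import Counter

# Hypothetical, already-tokenized titles; real titles would come from jieba.lcut.
SAMPLE_TITLES = [
    ["Cantonese", "egg yolk", "mooncake", "gift box", "gift box"],
    ["Cantonese", "lotus paste", "mooncake"],
    ["five kernel", "mooncake", "gift box"],
]
SAMPLE_STOPWORDS = {"mooncake"}


def keyword_frequencies(titles, stopwords, top_n=100):
    """Count in how many titles each keyword appears, ignoring stop words."""
    counter = Counter()
    for words in titles:
        # dict.fromkeys drops repeats within one title and keeps word order
        unique_words = list(dict.fromkeys(w for w in words if w not in stopwords))
        counter.update(unique_words)
    return dict(counter.most_common(top_n))


frequencies = keyword_frequencies(SAMPLE_TITLES, SAMPLE_STOPWORDS)
print(frequencies)
```

The result maps each word to its count, which is the dictionary shape that fit_words takes.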
Total sales for each keyword
(Explanation: taking the keyword “Cantonese style” as an example, we add up the sales volume of every product whose title contains “Cantonese style”, which gives the total sales of Cantonese-style products.)
As can be seen from the above figure, keywords such as gift box, Cantonese style, egg yolk, lotus seed paste, five kernel, Daoxiang Village, and Huamei rank high, which also shows that the Cantonese-style moon cake is the king of moon cakes. With nearly 7 million actual payers, it remains firmly in vogue. Although they originated in Guangzhou, Cantonese-style mooncakes, with their soft crust and varied fillings, have become popular all over the country.
The detailed code is as follows:
# For each keyword, sum the sales of the titles that contain it
w_s_sum = []
for w in word_count.word:
    i = 0
    s_list = []
    for t in title_clean_dist:
        if w in t:
            s_list.append(df.sales[i])
        i += 1
    w_s_sum.append(sum(s_list))

df_w_s_sum = pd.DataFrame({'w_s_sum': w_s_sum})
df_word_sum = pd.concat([word_count, df_w_s_sum], axis=1, ignore_index=True)
df_word_sum.columns = ['word', 'count', 'w_s_sum']
df_word_sum.sort_values('w_s_sum', inplace=True, ascending=True)
# Keep the 30 keywords with the highest total sales
df_w_s = df_word_sum.tail(30)

attr = df_w_s['word']
v1 = df_w_s['w_s_sum']

bar = Bar("Moon cake keywords sales distribution map")
bar.add("keyword", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=False)

overlap = Overlap()
overlap.add(bar)
overlap.render('Moon cake keywords_sales distribution map.html')
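The keyword-to-sales aggregation can also be illustrated on toy data with the standard library alone. This is a sketch under stated assumptions: the rows below are invented, and sales_by_keyword is a hypothetical helper, not part of the article's script.

```python
from collections import defaultdict

# Hypothetical rows: (de-duplicated title keywords, sales volume).
SAMPLE_ROWS = [
    (["Cantonese", "egg yolk", "gift box"], 1200),
    (["Cantonese", "lotus paste"], 800),
    (["five kernel", "gift box"], 35),
]


def sales_by_keyword(rows):
    """Sum the sales of every product whose title contains each keyword."""
    totals = defaultdict(int)
    for words, sales in rows:
        for word in set(words):  # count each product once per keyword
            totals[word] += sales
    return dict(totals)


totals = sales_by_keyword(SAMPLE_ROWS)
# List the keywords from highest to lowest total sales
for word, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(word, total)
```

This version makes a single pass over the titles instead of one pass per keyword, which helps once the keyword list grows long.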
Most products have sales under 3,000, accounting for as much as 90%
As can be seen from the figure above, only a few products have sales of more than 100,000. There are 8 of them in total, and 6 of those have sales of more than 300,000. In today’s internet-celebrity economy, hit products are king and a single product can dominate. As the saying goes, internet fame is the marketing and the hit is the product: a good product amplified by marketing and operations can deliver ten times the return, whereas a company with marketing but no good product will struggle to last. To create a hit with the help of the internet-celebrity economy, the chosen product must have distinctive features of its own, and during sales, customer reviews play a crucial role both in search ranking and in converting visitors into orders.
Consumption downgrade? Products priced at 10-100 yuan account for 50%
The number of products falls as the price rises: the higher the price, the fewer products are on sale. Low-priced products are in the majority. Most products are priced between 10 and 100 yuan, followed by 100-200 yuan, and then by products priced above 8,000 yuan.
The detailed code is as follows:
# Continues the script above (pd, Bar and Overlap are already imported)
f = open(r"C:\Users\Administrator\Desktop\data.txt", encoding='utf-8')

df = pd.read_csv(f, sep=',', names=['title', 'price', 'sales', 'location'])

print(df.sort_values(by='price'))

price_info = df[['price', 'location']]

# The bin edges are aligned with the ten labels below: the extra 300 edge is dropped
# and an open upper edge is added so that prices above 8000 are counted too
bins = [0, 10, 50, 100, 150, 200, 500, 1000, 5000, 8000, float('inf')]
level = ['0-10', '10-50', '50-100', '100-150', '150-200', '200-500', '500-1000', '1000-5000', '5000-8000', '8000+']

# Count the number of products in each price range
price_stage = pd.cut(price_info['price'], bins=bins, labels=level).value_counts().sort_index()
print(price_stage)

attr = price_stage.index
v1 = price_stage.values

bar = Bar("Price range & Distribution of mooncake types and quantities")
bar.add("", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=False)

overlap = Overlap()
overlap.add(bar)
overlap.render('Price range & Type and quantity distribution of mooncakes.html')
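By default, pd.cut assigns each value to a right-closed interval, so a price of exactly 10 lands in the 0-10 range rather than in 10-50. The sketch below mimics that binning with the standard library only, to show how the edges and labels line up. The edges, labels, sample prices, and the helper name price_label are chosen for illustration and are not the article's data.

```python
from bisect import bisect_left
from collections import Counter

# Bin edges and labels chosen for illustration; there is one label per interval.
EDGES = [0, 10, 50, 100, 200, float("inf")]
LABELS = ["0-10", "10-50", "50-100", "100-200", "200+"]


def price_label(price, edges=EDGES, labels=LABELS):
    """Return the label of the right-closed bin (low, high] containing price."""
    index = bisect_left(edges, price) - 1
    if 0 <= index < len(labels):
        return labels[index]
    return None  # price is outside every bin (for example 0 or a negative value)


SAMPLE_PRICES = [9.9, 10, 19.9, 89.0, 100, 168.0, 9999.0]
counts = Counter(price_label(p) for p in SAMPLE_PRICES)
for label in LABELS:
    print(label, counts[label])
```

The number of labels must be one less than the number of edges, and an open upper edge keeps the most expensive items from falling outside every bin.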
Postscript:
As can be seen from the picture above, Cantonese-style flavors account for 80% of the top 15 across the whole platform. Cantonese mooncakes are sold all over the country. There are so many kinds of mooncakes, so why are Cantonese mooncakes so popular in China? The outer layer of a Cantonese mooncake is a syrup crust, made from wheat flour, syrup, vegetable oil, lye water, and other ingredients and then baked. This is not a traditional Chinese pastry technique, and the reason lies in the origin of the Cantonese mooncake. Mooncakes became popular in Guangdong later than in other parts of China, in the late Qing Dynasty. By then, the Shamian area of Guangzhou had become a British and French concession as a result of the Opium War, and all kinds of Western cake shops had set up in Guangzhou. The Cantonese mooncake, with its baked syrup crust, is in fact a product of learning from Western pastry-making.
One of the most important fillings of Cantonese mooncakes is lotus seed paste. As early as 1889, a pastry shop called “Lianxiang Lou” in the west of Guangzhou cooked lotus seeds into lotus paste for use as a filling, and its crisp, fragrant pastries were very popular. Later, the makers at Lianxiang Lou shaped the filled pastry into mooncakes, which gradually became the Cantonese mooncake.
Have you eaten any delicious moon cakes this year?