I am participating in the Mid-Autumn Festival Creative Submission Contest. For details, see: Mid-Autumn Festival Creative Submission Contest
Analyzing a wave of mooncakes with Python: what did I conclude?
Reply “moon Cake” in the backend of the public account “Jie Ge’s IT Journey” to get the complete data for this article.
The 15th day of the eighth lunar month is almost here, bringing the annual Mid-Autumn Festival. The festival originated from ancient moon worship and has a long history.
As the Mid-Autumn Festival approaches, every region has its own customs, but they mostly come down to worshipping the moon, admiring the moon, viewing lanterns, and eating mooncakes. The mooncake I remember best is the “five kernel” flavor, a favorite of the elders in my family. I still remember how much I hated the red and green candied shreds in five-kernel mooncakes as a child.
Later I discovered some delicious mooncakes too. Today I searched online and saw many flavors I had never tasted. I was dazzled, so I decided to use Python to find out which flavors are the most popular, as a reference for anyone who has not bought mooncakes yet.
Tools: Python + Bokeh
1. Purpose of analysis
- 1) Which kind of mooncake sells the most?
- 2) What is the price range of mooncakes?
- 3) Which are the TOP 15 brands with a good reputation?
- 4) Which are the TOP 10 most delicious mooncake flavors?
- 5) Which are the TOP 10 shops that sell the most mooncakes?
- 6) Price comparison of popular mooncake brands
- 7) Brand recommendations for different mooncake flavors (automatic)
2. Data acquisition
Data source: I searched JD.com (Jingdong) for the keyword [moon cake] and used automated collection software to gather more than 2,000 records, including mooncake title, store name, brand, price, sales volume, category, and origin.
# Import the related libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np

# SimHei lets matplotlib render Chinese labels
sns.set(font='SimHei', style='darkgrid')
data = pd.read_excel('C:/Users/Cherich/Desktop/moon cake data.xlsx')
data.info()
Store name, brand, mooncake category, and origin all have missing values. The first two have few missing values, so those rows can simply be deleted. The mooncake category is important, so it is worth filling in. Origin can be analyzed from the values that exist, without much impact on the final result.
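Before deciding which columns to drop and which to fill, it helps to count the missing values per column. Here is a minimal sketch on a tiny made-up table; the column names mirror the ones used in this article, but the rows are invented for illustration and are not the collected data.

```python
import pandas as pd

# Toy rows, invented for illustration only (not the article's data).
toy = pd.DataFrame({
    "title": ["Cantonese lotus paste mooncake", "Su-style pork mooncake", "Snow skin gift box"],
    "shop": ["Shop A", None, "Shop C"],
    "category": ["Cantonese", None, None],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop rows only where a rarely-missing column (here: shop) is empty.
cleaned = toy.dropna(subset=["shop"])
print(len(cleaned))
```

Columns with only a few gaps are cheap to drop this way, while a column such as category, which is missing more often, is a better candidate for filling.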
3. Data cleaning
3.1 Filling Categories
There are usually two ways to fill missing values: one is to predict them with a machine learning algorithm; the other is to apply rules. Most titles contain the mooncake category, so a simple string check on the title is enough to fill it in.
data.head()
# Collect the first two characters of every category name that already exists
categorys = data.groupby('category')
category_list = [i[0][:2] for i in categorys]
category_list[-5] = 'clocking'
print(category_list)
# ['Beijing type', 'other', 'ice cream', 'ice cream', 'black ice', 'card vouchers', 'desktop', 'card vouchers', 'Hong Kong', 'dian type', 'boom type', ...

# Rows whose category is missing (copy to avoid SettingWithCopyWarning)
datas = data[data['category'].isnull()].copy()

def add_category(df):
    # Default value kept from the source for titles that match no category
    name = 'category_list'
    for j in category_list:
        if j in str(df):
            name = str(j)
    return name

# Fill the missing categories from the title, then merge back with the complete rows
datas['category'] = datas['title'].apply(add_category)
datas1 = data[data['category'].notnull()]
datas2 = pd.concat([datas1, datas])
datas2
3.2 Filling flavors
The flavor also appears in the title and is filled in the same way. Because there is no separate flavor field, the fill relies on flavor keywords. I have to say, there really are a lot of mooncake flavors!
tastes = ["ice", "ice cream", "the egg yolk with lotus-paste", "red bean", "black sesame seed", "ham", "salt and pepper", "durian", "rose", "flow heart", "cheese", "beef", "fruit", "crisp skin", "five ren", "coconut", "date jam", "peach kernel"]

def add_taste(df):
    name = ''
    for j in tastes:
        if j in str(df):
            # Stop at the first flavor keyword found in the title
            name = str(j)
            break
        else:
            # The fallback value is garbled in the source; 'other' is assumed here
            name = 'other'
    return name

# There is no flavor column yet, so derive it from the title
datas2['taste'] = datas2['title'].apply(add_taste)
datas2.head()
3.3 Deleting data whose store name is empty
datas2.dropna(subset=['shop'],inplace=True)
datas2.info()
3.4 Mark the price range
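def price(df):
    # Map a single price to its range label
    lable = ''
    if 0 < df <= 50:
        lable = '0~50 yuan'
    elif 50 < df <= 100:
        lable = '50~100 yuan'
    elif 100 < df <= 150:
        lable = '100~150 yuan'
    elif 150 < df <= 200:
        lable = '150~200 yuan'
    else:
        # This label is garbled in the source; reconstructed from the pattern above
        lable = 'over 200 yuan'
    return lable

datas2['price_lable'] = datas2['price'].apply(price)
datas2.head()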
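As an alternative to a hand-written if/elif chain, the same kind of price ranges can be assigned with pandas.cut. This is only a sketch under assumptions, not the method used in this article: the prices are invented and the label strings are illustrative.

```python
import pandas as pd

# Invented prices for illustration only.
prices = pd.Series([12.0, 50.0, 99.9, 150.0, 268.0])

# Right-closed bins: (0, 50], (50, 100], (100, 150], (150, 200], (200, inf]
bins = [0, 50, 100, 150, 200, float("inf")]
labels = ["0~50 yuan", "50~100 yuan", "100~150 yuan", "150~200 yuan", "over 200 yuan"]

# Each price falls into exactly one bin; the upper edge of a bin is included.
price_label = pd.cut(prices, bins=bins, labels=labels, right=True)
print(price_label.tolist())
```

One difference to keep in mind: pandas.cut returns a missing value for prices outside every bin (for example 0 or negative values), whereas an else branch would put them in the last range.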
4. Data visualization
4.1 Price range of mooncakes
# Count the mooncakes in each price range
las = datas2.groupby(datas2['price_lable']).size()
las.sort_values(ascending=True, inplace=True)

plt.figure(figsize=(8, 6), dpi=80)
# The title text is garbled in the source; the section heading is used instead
plt.title(label='Price range of mooncakes', fontsize=20)
size = 0.3
# Donut chart: wedge width below 1 leaves a hole in the middle
patches, l_text, p_text = plt.pie(las.values, labels=las.index, shadow=True,
                                  colors=plt.cm.coolwarm_r(np.linspace(0, 1, len(las.index))),
                                  wedgeprops=dict(width=size, edgecolor='w'),
                                  autopct='%.2f%%', startangle=300)
plt.show()
31% of the mooncakes cost less than 50 yuan, the largest share of any range, so affordable options are plentiful. Still, 24% cost more than 200 yuan. Are mooncakes really that expensive?
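The percentages in the pie chart can also be read off as numbers. A minimal sketch, assuming a column of price-range labels; the values below are invented and will not reproduce the 31% and 24% reported above.

```python
import pandas as pd

# Invented labels for illustration only.
labels = pd.Series(["0~50", "0~50", "50~100", "over 200",
                    "0~50", "over 200", "100~150", "50~100"])

# Fraction of rows in each price range, as percentages rounded to 2 decimals.
share = (labels.value_counts(normalize=True) * 100).round(2)
print(share)
```

value_counts(normalize=True) divides each count by the total number of rows, which is the same calculation the pie chart's autopct performs.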
4.2 Comparison of sales volume of mooncakes
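# The excluded category value is garbled in the source; 'other' is assumed here
big_category = datas2[datas2['category'] != 'other'].groupby('category')
category = [i for i, j in big_category]
numbers = [j['sales'].sum() for i, j in big_category]

plt.figure(figsize=(5, 4), dpi=80)
# The title text is lost in the source; the section heading is used instead
plt.title(label='Sales volume by mooncake category', fontsize=18)
plt.bar(category, numbers, color=plt.cm.coolwarm_r(np.linspace(0, 1, len(numbers))))
plt.xticks(rotation=45)
plt.grid()
plt.show()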
The top three categories are Cantonese-style, Su-style, and Hong Kong-style mooncakes. Out of curiosity, I looked up what makes each of them so good!
Cantonese-style mooncakes: a thin crust with a crust-to-filling ratio of about 1:4; the fillings are mainly coconut, lotus seed paste, egg yolk, and bean paste; they are rich in oil and soft in texture.
Hong Kong-style mooncakes are similar to Cantonese-style ones because the two regions are close, but they build on the Cantonese style with improvements. Low fat and low oil are their defining features. If you are watching your diet, these are the ones to get!
Su-style mooncakes are popular in Jiangsu, Zhejiang, and Shanghai. Their biggest feature is the flaky crust: crisp outside, soft inside, richly layered, and more fragrant the longer you chew.
4.3 TOP 15 Brands with Good Reputation
# Total sales per brand, sorted from highest to lowest
shop = datas2.groupby(datas2['brand'])
shop_dic = {i: j['sales'].sum() for i, j in shop}
shop_dic = sorted(shop_dic.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
ins = []
val = []
for i, j in shop_dic[:15]:
    ins.append(i.split()[0])
    val.append(j)
# print(ins)
vals = [round(datas2[datas2['brand'] == z]['price'].mean()) for z in ins]
# print(vals)

plt.figure(figsize=(8, 4), dpi=80)
plt.title(label='TOP 15 brands by sales', fontsize=20)
# Reverse the lists so the best-selling brand is drawn at the top
s = plt.barh(ins[::-1], val[::-1], height=0.9, color=plt.cm.coolwarm_r(np.linspace(0, 1, len(ins))))
plt.grid()
plt.show()
Beijing Daoxiangcun ranks first in sales among all brands, followed by Huamei, Wufangzhai, Yuanlong…
4.4 TOP 10 Delicious mooncake flavors
# The excluded flavor value is garbled in the source; 'other' is assumed here
shop = datas2[datas2['taste'] != 'other'].groupby('taste')
shop_dic = {i: j['sales'].sum() for i, j in shop}
shop_dic = sorted(shop_dic.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
ins = []
val = []
for i, j in shop_dic[:15]:
    ins.append(i)
    val.append(j)

plt.figure(figsize=(8, 4), dpi=80)
plt.title(label='TOP 15 hot mooncakes', fontsize=18)
plt.bar(ins, val, color=plt.cm.coolwarm_r(np.linspace(0, 1, len(ins))))
plt.xticks(rotation=45)
plt.grid()
plt.show()
Comparing sales by flavor, the popular flavors are egg yolk lotus seed paste, flow heart, five kernel, bean paste, ham…
4.5 TOP 10 Shops with the highest mooncake Sales
# Total sales per shop, sorted from highest to lowest
shop = datas2.groupby(datas2['shop'])
shop_dic = {i: j['sales'].sum() for i, j in shop}
shop_dic = sorted(shop_dic.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
ins = []
val = []
for i, j in shop_dic[:10]:
    ins.append(i)
    val.append(j)

plt.figure(figsize=(8, 4), dpi=80)
plt.title(label='TOP 10 most sold stores', fontsize=18)
# Reverse the lists so the best-selling shop is drawn at the top
plt.barh(ins[::-1], val[::-1], height=0.9, color=plt.cm.coolwarm_r(np.linspace(0, 1, len(ins))))
plt.grid()
plt.show()
Although Daoxiangcun ranks first in sales by brand, the Huamei flagship store ranks first in sales by store.
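A brand can lead in total sales while none of its individual shops does, because its sales are spread across several stores. The sketch below shows that effect on invented numbers; the brand and shop names are placeholders, not values from the collected data.

```python
import pandas as pd

# Invented rows for illustration only; brand and shop names are placeholders.
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B"],
    "shop": ["A store 1", "A store 2", "A store 3", "B flagship", "B flagship"],
    "sales": [40, 35, 30, 50, 45],
})

# Total sales per brand and per shop, largest first.
by_brand = toy.groupby("brand")["sales"].sum().sort_values(ascending=False)
by_shop = toy.groupby("shop")["sales"].sum().sort_values(ascending=False)

print(by_brand.index[0])  # brand with the highest total
print(by_shop.index[0])   # shop with the highest total
```

Here brand A wins on total sales, yet the single best-selling shop belongs to brand B, which concentrates its sales in one flagship store.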
4.6 Word cloud of brand sales
from wordcloud import WordCloud
from PIL import Image

li = [each for each in datas2['brand'].values]

def func_pd(words):
    # Count how often each brand appears
    count_result = pd.Series(words).value_counts()
    return count_result.to_dict()

frequencies = func_pd(li)

plt.figure(figsize=(10, 8), dpi=80)
# font_path must point to a font that can render Chinese characters
wordcloud = WordCloud(font_path="STSONG.TTF", background_color='#E6E6FA', width=700, height=350).fit_words(frequencies)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
4.7 Price comparison of popular brand mooncakes
# Brand strings are translated stand-ins; they must match the values in the 'brand' column of your data
brand = ['Huamei', 'Daoxiangcun', 'Wufangzhai', 'Meixin', 'hang fa lau', 'guangzhou restaurant', 'glory', 'yuen long', 'gold', 'yuen', 'haagen-dazs', 'Pan Xiangji', 'YOTIME', 'golden nine', 'Zhongda Huinong']

# Prices of the selected brands, grouped by brand
groups = datas2[datas2['brand'].isin(brand)].groupby('brand')['price']

plt.figure(figsize=(8, 4), dpi=80)
plt.title('hot brand price comparison', fontsize=18)

labels = 'Huamei', 'Daoxiangcun', 'guangzhou restaurant', 'YOTIME', 'golden nine', 'Wufangzhai'
box_1, box_2, box_3, box_4, box_5, box_6 = [groups.get_group(name) for name in labels]

bplot = plt.boxplot([box_1, box_2, box_3, box_4, box_5, box_6], patch_artist=True, showmeans=True, labels=labels)
colors = plt.cm.coolwarm_r(np.linspace(0, 1, len(labels)))
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)
plt.grid(False)
plt.show()
Looking at the prices of a few popular brands, we can see:
Every brand has some high outliers. These are premium gift boxes aimed at buyers with more purchasing power.
Huamei, Daoxiangcun, and Wufangzhai have the lowest average prices, which suggests that their high sales are partly explained by price.
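The average prices behind this comparison can be computed directly, and because a few expensive gift boxes pull the mean upward, it is worth looking at the median alongside it. A minimal sketch on invented prices; the brand names are placeholders.

```python
import pandas as pd

# Invented prices for illustration only; brand names are placeholders.
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B", "B"],
    "price": [30.0, 40.0, 500.0, 80.0, 90.0, 100.0],
})

# Mean, median, and maximum price per brand.
summary = toy.groupby("brand")["price"].agg(["mean", "median", "max"])
print(summary)
```

In this toy table brand A has the higher mean only because of one 500 yuan item; its median is well below brand B's. The same caution applies when reading average prices from a box plot with high outliers.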
4.8 Brand Recommendation of Different Flavors of Moon Cakes (automatic)
I wanted readers to pick their favorite flavor and automatically get a recommendation of the brands that sell well for it. I chose Bokeh for this linked interaction. Bokeh is an interactive visualization library for browsers that makes it quick to build interactive charts, dashboards, and data applications.
# Written for Bokeh 2.x; in Bokeh 3.x, Panel is named TabPanel and plot_width/plot_height are width/height
import pandas as pd
from bokeh.models.widgets import Panel
from bokeh.models.widgets import Tabs
import warnings
warnings.filterwarnings('ignore')
from bokeh.io import curdoc
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Select
from bokeh.layouts import row
import matplotlib as mpl
mpl.rcParams['font.family'] = 'SimHei'

data = pd.read_excel('C:/Users/cherich/Desktop/moon cake data.xlsx')
# Brand strings are translated stand-ins; they must match the values in the 'brand' column of your data
brand = ['Huamei', 'Daoxiangcun', 'Wufangzhai', 'Meixin', 'hang fa lau', 'guangzhou restaurant', 'the food', 'glory', 'yuen long', 'gold statue', 'Yuen Long', 'Haagen-Dazs', 'Pan Xiangji', 'YOTIME', 'Golden Nine', 'Zhongda Huinong', 'gongdelin']
data = data[data['brand'].isin(brand)]

# ------------------------------------------------------------------
# Create drop-down widget: select
types = list(data['taste'].unique())
# The initial value is garbled in the source; the first flavor is assumed here
select1 = Select(options=types, value=types[0])

# Number of listings per brand for the selected flavor, top 15
data_qudao = data[data.taste == select1.value]
data_qudao_a = data_qudao.groupby('brand').size().sort_values(ascending=False).head(15)
print(data_qudao_a)
data_qudao_b = data_qudao_a.to_frame(name='num')
data_qudao_b['ind'] = data_qudao_b.index

source1 = ColumnDataSource(data={'x': data_qudao_b['ind'], 'y': data_qudao_b['num']})
TOOLTIPS = [("taste", "@x"), ("sales", "@y")]
p1 = figure(title='Brand recommendation by mooncake flavor',
            x_range=data_qudao_a.index.to_list(),
            plot_width=620, plot_height=500,
            x_axis_label='brand', y_axis_label='sales',
            tooltips=TOOLTIPS)
# The bar width is garbled in the source; 0.5 is assumed here
p1.vbar('x', width=0.5, bottom=0, top='y', source=source1, color='#BCD2EE')

def update_plot1(attr, old, new):
    # Recompute the brand ranking whenever a new flavor is selected
    yr = select1.value
    data_qudao = data[data.taste == yr]
    data_qudao_a = data_qudao.groupby('brand').size().sort_values(ascending=False).head(15)
    data_qudao_b = data_qudao_a.to_frame(name='num')
    data_qudao_b['ind'] = data_qudao_b.index
    # Update the categorical axis so brands missing from the initial view are shown
    p1.x_range.factors = data_qudao_a.index.to_list()
    source1.data = {'x': data_qudao_b['ind'], 'y': data_qudao_b['num']}
    p1.title.text = '%s: brand recommendation' % yr

select1.on_change('value', update_plot1)
layout2 = row(select1, p1)
tab1 = Panel(child=layout2, title='flavorings')
layout = Tabs(tabs=[tab1])
curdoc().add_root(layout)
Start the Bokeh service:
bokeh serve --show aa.py
Choose your favorite flavor and the chart automatically shows the top-selling brands.
5. Conclusion
1. The largest share of mooncakes (31%) is priced under 50 yuan, which is very affordable.
2. Cantonese-style mooncakes are the most popular, followed by Hong Kong-style and Su-style mooncakes.
3. The brands with a good reputation are Huamei, Daoxiangcun, Wufangzhai, and Meixin.
4. The popular flavors are egg yolk lotus seed paste, flow heart, five kernel, bean paste, ham, and ice skin.
That is all for this article.
Original writing is not easy. If you found this article useful, please like, comment on, or share it. That motivates me to produce more high-quality articles. Thank you!
By the way, please follow me (it is free!) so that you can find me again next time.
See you next time!