I don't know how many of you are staying put for the New Year this year. Even if we can't go back to our hometowns, we still have to buy New Year goods and send gifts to family members and elders. So, out of curiosity, I used a crawler to collect data from Taobao, then combined Python data analysis with a third-party visualization platform to see what people bought for the Spring Festival. The results are shown in the dashboard below:
The dashboard above was built with the third-party visualization tool FineBI after the data was cleaned. The rest of this article walks through the Python implementation, in five steps:
- Analysis approach
- Crawling the data
- Data cleaning
- Data visualization and analysis
- Conclusions and recommendations
First: the analysis approach
Today's work is mainly exploratory analysis. First, review the available fields: title (from which the category is extracted), price, sales volume, store name, and shipping place. Then break each dimension down into questions and choose a chart for each:
Category:
- Which are the top 10 categories by sales volume? (Table or horizontal bar chart)
- Which categories are the most popular, that is, appear most often? (Word cloud)
Price:
- What is the price range distribution of New Year goods? (Ring chart, to observe proportions)
Sales volume and store name:
- Which are the top 10 stores by sales volume? (Bar chart)
- Linked with category: for example, selecting nuts shows the sales ranking of the stores that sell them. (Linkage, using a third-party tool)
Shipping place:
- Which cities have the highest sales? (Map)
Second: crawling the data
Selenium is used to drive a real browser and simulate clicks. This requires Selenium and a matching browser driver to be installed; here I am using Google Chrome.
pip install selenium
Once installation succeeds, run the code below, enter the keyword “New Year goods”, scan the QR code to log in, and wait while the program slowly collects the data.
# coding=utf8
import csv
import re
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


def search_product(key_word):
    """Search for key_word and return the total number of result pages."""
    # Locate the search box and type the keyword
    browser.find_element(By.ID, "q").send_keys(key_word)
    # Locate the search button and click it
    browser.find_element(By.CLASS_NAME, 'btn-search').click()
    # Maximize the window so the login QR code is easy to scan
    browser.maximize_window()
    # Wait 15 seconds to leave time for scanning the code
    time.sleep(15)
    # Locate the pager text that contains the total page count
    page_info = browser.find_element(By.XPATH, '//div[@class="total"]').text
    # findall() returns a list, even though there is only one element
    page = re.findall(r"(\d+)", page_info)[0]
    return page


def get_data():
    """Append one CSV row per product on the current results page."""
    # Found through page analysis: every product sits under the "items" node
    items = browser.find_elements(
        By.XPATH, '//div[@class="items"]/div[@class="item J_MouserOnverReq "]')
    for item in items:
        # Product title
        pro_desc = item.find_element(By.XPATH, './/div[@class="row row-2 title"]/a').text
        # Price
        pro_price = item.find_element(By.XPATH, './/strong').text
        # Number of buyers
        buy_num = item.find_element(By.XPATH, './/div[@class="deal-cnt"]').text
        # Store name
        shop = item.find_element(By.XPATH, './/div[@class="shop"]/a').text
        # Shipping place
        address = item.find_element(By.XPATH, './/div[@class="location"]').text
        # print(pro_desc, pro_price, buy_num, shop, address)
        with open('{}.csv'.format(key_word), mode='a', newline='', encoding='utf-8-sig') as f:
            csv_writer = csv.writer(f, delimiter=',')
            csv_writer.writerow([pro_desc, pro_price, buy_num, shop, address])


def main():
    browser.get('https://www.taobao.com/')
    page = search_product(key_word)
    print(page)
    get_data()
    page_num = 1
    while int(page) != page_num:
        print("*" * 100)
        print("Crawling page {}".format(page_num + 1))
        # Each results page holds 44 items, so the offset is page_num * 44
        browser.get('https://s.taobao.com/search?q={}&s={}'.format(key_word, page_num * 44))
        browser.implicitly_wait(25)
        get_data()
        page_num += 1
    print()


if __name__ == '__main__':
    key_word = input("Please enter the product you want to search: ")
    option = Options()
    # Selenium 4 style: the driver path goes through Service, the options through options=
    service = Service(executable_path=r"C:\Users\cherich\AppData\Local\Google\Chrome\Application\chromedriver.exe")
    browser = webdriver.Chrome(service=service, options=option)
    main()
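The paging logic in the crawler can be tried without opening a browser. The following is a minimal sketch, assuming the pager text contains the page count as its first integer and that each results page holds 44 items, as the `s` offset in the code suggests. The helper names `parse_page_count` and `build_page_urls`, the sample pager text, and the use of `quote` to percent-encode the keyword are illustrative additions, not part of the original script.

```python
import re
from urllib.parse import quote

ITEMS_PER_PAGE = 44  # offset step used for the `s` query parameter


def parse_page_count(page_info):
    """Return the first integer found in the pager text."""
    match = re.search(r"\d+", page_info)
    if match is None:
        raise ValueError("no page count in {!r}".format(page_info))
    return int(match.group())


def build_page_urls(key_word, total_pages):
    """Build the search URL for every results page after the first one."""
    encoded = quote(key_word)  # percent-encode non-ASCII keywords
    return [
        "https://s.taobao.com/search?q={}&s={}".format(encoded, page * ITEMS_PER_PAGE)
        for page in range(1, total_pages)
    ]


if __name__ == "__main__":
    pages = parse_page_count("total 3 pages")  # hypothetical pager text
    for url in build_page_urls("nuts", pages):
        print(url)
```

Building the list of URLs up front makes it easy to check how many requests a keyword will trigger before starting a long crawl.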
The collection results are as follows:
The data is now ready. Extracting categories from the titles is time-consuming, so I recommend using the already-cleaned data directly.
The idea is to segment each title into words, apply named entity recognition to tag the nouns, and pick out the category name, such as nuts or tea.
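As a rough illustration of the category step, here is a simplified sketch that replaces segmentation and named entity recognition with a plain keyword lookup. The vocabulary `KEYWORD_TO_CATEGORY` and the English sample titles are hypothetical; for real Chinese titles, the regular-expression tokenizer would have to be replaced by a word segmentation library such as jieba.

```python
import re

# Hypothetical vocabulary mapping a title word to a category name
KEYWORD_TO_CATEGORY = {
    "nuts": "nuts",
    "pistachios": "nuts",
    "almonds": "nuts",
    "tea": "tea",
    "pastry": "pastry",
    "cookies": "pastry",
}


def extract_category(title, default="other"):
    """Return the category of the first known keyword in the title."""
    # Whole-word tokens avoid false hits such as "tea" inside "steak"
    for token in re.findall(r"[a-z]+", title.lower()):
        if token in KEYWORD_TO_CATEGORY:
            return KEYWORD_TO_CATEGORY[token]
    return default


if __name__ == "__main__":
    for title in ["Gift box of mixed nuts 750g", "Premium green tea tin", "Beef steak"]:
        print(title, "->", extract_category(title))
```

Titles with no known keyword fall into an "other" bucket, which can be dropped later when counting category frequencies.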
Third: data cleaning
The cleaning here was done almost entirely in Excel. The data set is small, so Excel is very efficient for tasks such as creating a price range column. Once the data is cleaned, you can visualize it with a third-party tool, or continue the analysis in Python if you prefer.
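For readers who would rather build the price range column in Python than in Excel, a minimal sketch follows. The bin edges and labels are assumptions chosen for illustration (the article does not list its exact ranges), and `price_range` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical bin edges in yuan; adjust them to match your own ranges
EDGES = [100, 200, 300, 500]
LABELS = ["0-100", "100-200", "200-300", "300-500", "500+"]


def price_range(price):
    """Map a price to its range label; each range includes its lower bound."""
    return LABELS[bisect_right(EDGES, price)]


if __name__ == "__main__":
    for raw in ["39.90", "100", "268", "1299"]:
        # Crawled prices arrive as text, so convert them before binning
        print(raw, "->", price_range(float(raw)))
```

Using `bisect_right` keeps the lookup to a single line and makes the boundary rule explicit: a price of exactly 100 lands in the 100-200 range.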
Fourth: data visualization and analysis
1. Read the file
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from ast import literal_eval
from wordcloud import WordCloud

# Use a font that can render Chinese labels
mpl.rcParams['font.family'] = 'SimHei'

# Load the cleaned data set
datas = pd.read_csv('./Years.csv', encoding='gbk')
datas
2. Visualization: word cloud
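# Collect every comma-separated keyword into one flat list
# (the column name 'category' is an assumption; the original name was lost)
li = []
for each in datas['category'].values:
    new_list = str(each).split(',')
    li.extend(new_list)


def func_pd(words):
    """Count how often each word occurs and return a word -> count dict."""
    count_result = pd.Series(words).value_counts()
    return count_result.to_dict()


frequencies = func_pd(li)
# Drop the catch-all label so it does not dominate the cloud
frequencies.pop('other', None)

plt.figure(figsize=(10, 4), dpi=80)
wordcloud = WordCloud(font_path="STSONG.TTF", background_color='white',
                      width=700, height=350).fit_words(frequencies)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Chart description: in the word cloud, the most popular (most frequent) categories appear in the largest font. In order, they are nuts, tea, pastry, and so on.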
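The word cloud is driven by a plain mapping from word to frequency, so the counting step can be checked on its own. Here is a small sketch using only the standard library, assuming each cell holds comma-separated keywords; `keyword_frequencies` and the sample cells are hypothetical.

```python
from collections import Counter


def keyword_frequencies(cells, drop=("other",)):
    """Count comma-separated keywords across all cells, skipping `drop` labels."""
    counts = Counter()
    for cell in cells:
        for word in str(cell).split(","):
            word = word.strip()
            if word and word not in drop:
                counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = ["nuts,tea", "nuts", "other", "tea,nuts"]  # made-up cells
    print(keyword_frequencies(sample))
```

The resulting dict has the same word-to-count shape that `fit_words` receives in the code above.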
3. Visualization: Draw ring chart
# plt.pie(x, labels, autopct, shadow, startangle, colors, explode)
# Number of products in each price range
food_type = datas.groupby('price range').size()
plt.figure(figsize=(8, 4), dpi=80)
explodes = [0, 0, 0, 0, 0.2, 0.1]  # defined in the original but never passed to plt.pie
size = 0.3
# A wedge width smaller than the radius turns the pie into a ring
plt.pie(food_type, radius=1, labels=food_type.index, autopct='%.2f%%',
        colors=['#F4A460', '#D2691E', '#CDCD00', '#FFD700', '#EEE5DE'],
        wedgeprops=dict(width=size))
plt.legend(food_type.index, bbox_to_anchor=(1.5, 1.0))
plt.show()
A ring chart is similar to a pie chart: it shows each part's share of the whole. About 33% of New Year goods are priced at 0-100 yuan and another 33% at 100-200 yuan, so most New Year goods cost less than 200 yuan.
4. Visualization: Draw bar charts
# Total sales per store, top 10
# (the column name 'store name' is an assumption; the original name was lost)
data = datas.groupby(by='store name')['sales'].sum().sort_values(ascending=False).head(10)
plt.figure(figsize=(10, 4), dpi=80)
colors = ['#F4A460', '#D2691E', '#CDCD00', '#EEE5DE', '#EEB4B4', '#FFA07A', '#FFD700']
plt.bar(data.index, data.values, color=colors)
plt.xticks(rotation=45)
plt.show()
Chart description: this is the ranking of stores by sales volume. The flagship store of Three Squirrels comes first, so it seems people like to snack on dried goods such as nuts during the Spring Festival.
5. Visualization: Draw horizontal bars
# Total sales per category, top 10
foods = datas.groupby(by='category')['sales'].sum().sort_values(ascending=False).head(10)
# Sort ascending so the largest bar is drawn at the top
foods.sort_values(ascending=True, inplace=True)
plt.figure(figsize=(10, 4), dpi=80)
plt.xlabel('sales')
plt.title('Recommended shopping list for New Year goods', fontsize=18)
colors = ['#F4A460', '#D2691E', '#CDCD00', '#CD96CD', '#EEE5DE', '#EEB4B4', '#FFA07A', '#FFD700']
plt.barh(foods.index, foods.values, color=colors, height=1)
plt.show()
In the ranking of categories by sales volume, nuts come first, which supports the guess above that people like to eat nuts.
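Both rankings above follow the same pattern: group the rows, sum the sales, sort in descending order, and keep the first ten. The sketch below shows that pattern without pandas, as a way to sanity-check the numbers on a few rows. The function name `top_n_by_sales`, the field names, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def top_n_by_sales(rows, key, n=10):
    """Sum 'sales' per `key` and return the n largest (name, total) pairs."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["sales"]
    # Sort by total sales, largest first, then keep the top n
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    rows = [  # made-up rows, not real crawl results
        {"store": "A", "category": "nuts", "sales": 5},
        {"store": "B", "category": "tea", "sales": 7},
        {"store": "A", "category": "nuts", "sales": 4},
    ]
    print(top_n_by_sales(rows, "store", n=2))
    print(top_n_by_sales(rows, "category", n=2))
```

Passing a different `key` switches between the store ranking and the category ranking, mirroring the two `groupby` calls.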
Fifth: conclusions and recommendations
- Popular New Year goods on Taobao: nuts, tea, pastries, cookies, candy, white wine, walnuts, mutton, sea cucumber, wolfberry.
- Recommended list (by sales volume): nuts, snacks, pastries, biscuits, tea, sweets, pine nuts, dates, cakes, braised snacks, melon seeds, milk, walnuts.
- Price reference: more than 66% of Spring Festival goods are priced between 0 and 200 yuan.
- Popular stores: Three Squirrels, Tmall Supermarket, Baicaowei, Liangpin Shop.