
The Dragon Boat Festival is coming. Traveling? Going home? Visiting friends and family? Either way, you will need to bring zongzi. So:

  • What brand of zongzi do you choose?
  • What flavor of zongzi do you choose?
  • What price range do you choose?

This year, Huang used Python to crawl the “zongzi data” on JD.com (Jingdong) and analyze it. Let’s see what turns up! This article covers three aspects: data crawling, data cleaning, and data visualization. Together they make up a small but complete data analysis project that lets you put your knowledge to comprehensive use.

Here’s the overall plan:

  • Site to crawl: www.jd.com/

  • Crawl description: We searched JD.com for “zongzi”, which returns about 100 pages of results. The fields we crawl come partly from the first-level (search results) page and partly from the second-level (product detail) page;

  • Crawling approach: first parse the first-level page, then parse the second-level pages, and finally handle paging;

  • Crawl fields: name (title), price, brand (store), category (taste) of zongzi;

  • Tools used: requests + lxml + pandas + time + re + pyecharts

  • Page parsing: XPath
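To make the first-level parsing step concrete, here is a minimal sketch that runs the same kind of XPath queries on a small inline snippet. The markup, the `example.com` links, and the function name `parse_listing` are hypothetical; only the class names are taken from the XPath expressions used in this article, and the real JD.com page is much larger and may differ.

```python
from lxml import etree

# Hypothetical, simplified markup modeled on the class names in the article's
# XPath expressions. It is NOT a copy of the real JD.com page.
SAMPLE_HTML = (
    '<html><body>'
    '<div class="gl-item">'
    '<div class="p-price"><strong><i>39.90</i></strong></div>'
    '<div class="p-name p-name-type-2"><a href="//item.example.com/1.html">'
    '<em>Fresh <font>meat</font> zongzi gift box</em></a></div>'
    '</div>'
    '<div class="gl-item">'
    '<div class="p-price"><strong><i>128.00</i></strong></div>'
    '<div class="p-name p-name-type-2"><a href="//item.example.com/2.html">'
    '<em>Bean paste zongzi</em></a></div>'
    '</div>'
    '</body></html>'
)


def parse_listing(html_text):
    """Return (name, price, detail_url) tuples from a search-results page."""
    tree = etree.HTML(html_text)
    # Price text sits inside <strong><i>...</i></strong>.
    prices = [str(p) for p in tree.xpath('//div/div[@class="p-price"]/strong/i/text()')]
    # string(.) flattens nested tags (such as highlighted keywords) into one string.
    name_nodes = tree.xpath('//div/div[@class="p-name p-name-type-2"]/a/em')
    names = [str(node.xpath('string(.)')) for node in name_nodes]
    # The links are protocol-relative, so a scheme is prepended.
    hrefs = tree.xpath('//div/div[@class="p-name p-name-type-2"]/a/@href')
    links = ["http:" + str(h) for h in hrefs]
    return list(zip(names, prices, links))


rows = parse_listing(SAMPLE_HTML)
for row in rows:
    print(row)
```

The detail links collected here are what the second-level step would request one by one.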

The final effect is as follows:

Data crawl

JD.com pages are generally loaded dynamically. In other words, an ordinary request only gets the first 30 items on a page (each page holds 60 items).

For this article, I used a very basic method and crawled only the first 30 items on each page (feel free to crawl all the data yourself if you’re interested).
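The crawler code in this article steps the `page` parameter by 2 (1, 3, 5, ...), which suggests that each visible results page spans two `page` values, with the even one being the lazily loaded half. That reading is an assumption drawn from the code, not something the site documents. A small sketch of building the URL list under that assumption (the helper name `build_page_urls` is hypothetical):

```python
from urllib.parse import quote

BASE = "https://search.jd.com/Search"


def build_page_urls(keyword, pages):
    """Build one search URL per visible results page.

    Assumption: visible page n corresponds to page=2n-1, as implied by the
    range(1, 200, 2) used in the article's crawler.
    """
    kw = quote(keyword)  # percent-encode the UTF-8 bytes of the keyword
    return [
        f"{BASE}?keyword={kw}&qrst=1&wq={kw}&stock=1&page={2 * n - 1}"
        for n in range(1, pages + 1)
    ]


url_list = build_page_urls("粽子", 100)
print(url_list[0])
print(url_list[-1])
```

Percent-encoding the keyword with `quote` yields the same `%E7%B2%BD%E5%AD%90` string that appears in the article's hard-coded URL.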

So which fields does this article crawl? Here is a demonstration; if you’re interested, you can crawl more fields and do a more detailed analysis.



Here is the crawler code:

    import re
    import time

    import chardet
    import pandas as pd
    import requests
    from lxml import etree


    def get_CI(url):
        # The source truncates this string after "(KHTML,"; any ordinary browser User-Agent should do.
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
        rqg = requests.get(url, headers=headers)
        rqg.encoding = chardet.detect(rqg.content)['encoding']
        html = etree.HTML(rqg.text)

        # Price
        p_price = html.xpath('//div/div[@class="p-price"]/strong/i/text()')

        # Name
        p_name = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/em')
        p_name = [str(p_name[i].xpath('string(.)')) for i in range(len(p_name))]

        # Links to the second-level (detail) pages
        deep_ur1 = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/@href')
        deep_url = ["http:" + i for i in deep_ur1]

        # From here on, fetch the second-level page of every product
        brands_list = []
        kinds_list = []
        for i in deep_url:
            rqg = requests.get(i, headers=headers)
            rqg.encoding = chardet.detect(rqg.content)['encoding']
            html = etree.HTML(rqg.text)

            # Brand (the leading part of this XPath is garbled in the source; only the clear part is kept)
            brands = html.xpath('//ul[@id="parameter-brand"]/li/@title')
            brands_list.append(brands)

            # Category (the label text was lost in the source; the Chinese label "类别：" is assumed)
            kinds = re.findall('>类别：(.*?)</li>', rqg.text)
            kinds_list.append(kinds)

        data = pd.DataFrame({'name': p_name, 'price': p_price, 'brand': brands_list, 'category': kinds_list})
        return data


    x = "https://search.jd.com/Search?keyword=%E7%B2%BD%E5%AD%90&qrst=1&wq=%E7%B2%BD%E5%AD%90&stock=1&page="
    url_list = [x + str(i) for i in range(1, 200, 2)]
    res = pd.DataFrame(columns=['name', 'price', 'brand', 'category'])

    # Paging: crawl the pages one by one
    for url in url_list:
        res0 = get_CI(url)
        res = pd.concat([res, res0])
        time.sleep(3)

    # Save the data
    res.to_csv('aliang.csv', encoding='utf_8_sig')

And the data that I finally crawled looks like this.

Data cleaning

As the figure above shows, the data is fairly tidy, so only a few simple operations are needed.

First, read the data with pandas.

    import pandas as pd

    df = pd.read_excel("end.xlsx", index_col=False)
    df.head()

The results are as follows:



The “brand” and “category” fields were stored as lists, so their values are wrapped in square brackets. We strip the brackets from both fields.

    df["brand"] = df["brand"].apply(lambda x: x[1:-1])
    df["category"] = df["category"].apply(lambda x: x[1:-1])
    df.head()
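To see what the bracket-stripping step does, here is a small sketch on made-up values (the sample brands and categories are hypothetical, not taken from the crawled data). It also shows an optional variant using `str.strip`, which is not part of the original cleaning and additionally removes the quotes left inside the brackets.

```python
import pandas as pd

# Hypothetical sample rows: list values that were saved as text.
df = pd.DataFrame({
    "brand": ["['Wufangzhai']", "[]"],
    "category": ["['fresh meat zongzi']", "['bean paste zongzi']"],
})

# Slicing off the first and last character drops the brackets only.
sliced = df["brand"].apply(lambda x: x[1:-1])

# Optional variant: strip brackets and quotes from both ends in one step.
cleaned = df["brand"].str.strip("[]'")

print(list(sliced))
print(list(cleaned))
```

An empty list (`[]`) becomes an empty string either way, so products without a brand show up as a blank label in later counts.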

The results are as follows:

① Top 10 zongzi brands (stores)

    df["brand"].value_counts()[:10]

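② Top 5 flavors of zongzi

    def func1(x):
        # The keyword and label literals were lost in the source; "sweet" is assumed here.
        # The crawled labels are in Chinese, so use the matching Chinese keyword on real data.
        if "sweet" in x:
            return "sweet zongzi"
        else:
            return x

    df["category"] = df["category"].apply(func1)
    # The slice is garbled in the source; [:5] follows the Top 5 heading.
    df["category"].value_counts()[:5]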
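The idea in this step is to merge several spellings of one flavor into a single label before counting. Here is a minimal sketch on made-up labels; the keyword `"sweet"`, the merged label, and the function name `merge_flavor` are illustrative assumptions, not the article's actual values.

```python
import pandas as pd


def merge_flavor(label, keyword="sweet", merged="sweet zongzi"):
    """Map every label containing the keyword to one merged label."""
    return merged if keyword in label else label


# Hypothetical category labels.
flavors = pd.Series([
    "sweet zongzi",
    "honey sweet zongzi",
    "fresh meat zongzi",
    "fresh meat zongzi",
    "fresh meat zongzi",
    "bean paste zongzi",
])

# value_counts sorts by frequency, so head(2) gives the two most common flavors.
top = flavors.apply(merge_flavor).value_counts().head(2)
print(top)
```

Without the merge, the two sweet variants would be counted separately and the sweet flavor would rank lower than it should.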

The results are as follows:

③ Division of sales price range of zongzi
    def price_range(x):
        # The lower branches are garbled in the source; a single bin up to 100 yuan is assumed here.
        if x <= 100:
            return '<=100 yuan'
        elif x <= 300:
            return '100-300 yuan'
        elif x <= 500:
            return '300-500 yuan'
        elif x <= 1000:
            return '500-1000 yuan'
        else:
            return '>1000 yuan'

    df["price_range"] = df["price"].apply(price_range)
    df["price_range"].value_counts()
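The same kind of price bucketing can also be written with `pandas.cut`, which is a swapped-in alternative to the if/elif function rather than the article's method. The sample prices are made up, and the lowest bin edge is an assumption because the article's lower ranges are not recoverable.

```python
import pandas as pd

# Hypothetical prices in yuan.
prices = pd.Series([39.9, 100.0, 128.0, 300.0, 499.0, 888.0, 1280.0])

# Bin edges; with the default right=True each bin is (low, high],
# which matches the "x <= upper bound" style of comparison.
bins = [0, 100, 300, 500, 1000, float("inf")]
labels = ["<=100 yuan", "100-300 yuan", "300-500 yuan", "500-1000 yuan", ">1000 yuan"]

ranges = pd.cut(prices, bins=bins, labels=labels)
counts = ranges.value_counts()
print(counts)
```

Note that a price of exactly 0 would fall outside the first bin and become missing; passing `include_lowest=True` would cover that case.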

The results are as follows:



Because the dataset is small, with few fields and little messy data, no deduplication or missing-value filling was done here. You can crawl more fields and more data yourself for a fuller analysis.

Data visualization

As the saying goes, a table beats words and a chart beats a table. Visualization helps reveal the information “hidden” behind the data.

Extension: Of course, this is just a starting point. I did not collect much data or many fields. As an exercise for readers who are learning, try using more data and more fields for a more thorough analysis.

Here, we visualize the data to answer the following questions:

  • ① Top 10 bar chart of zongzi shops;
  • ② Top 5 bar chart of zongzi flavors;
  • ③ Pie chart of zongzi price ranges;
  • ④ Word cloud of zongzi product names.

To keep the article’s layout manageable, the code for the visualization section is left to the end of this article.
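As a rough idea of what a Top-N bar chart might look like in pyecharts, here is a hedged sketch. It is not the author's visualization code; the helper names `top_n` and `render_top_bar`, the sample brand list, the series name, and the output file name are all hypothetical.

```python
from collections import Counter


def top_n(values, n):
    """Return (labels, counts) for the n most frequent values."""
    pairs = Counter(values).most_common(n)
    return [label for label, _ in pairs], [count for _, count in pairs]


def render_top_bar(labels, counts, path="top_brands.html"):
    """Write a bar chart to an HTML file with pyecharts."""
    # Imported lazily so the counting helper works without pyecharts installed.
    from pyecharts import options as opts
    from pyecharts.charts import Bar

    bar = (
        Bar()
        .add_xaxis(labels)
        .add_yaxis("listings", counts)
        .set_global_opts(title_opts=opts.TitleOpts(title="Top zongzi brands"))
    )
    return bar.render(path)


# Hypothetical brand column values.
brands = ["Wufangzhai", "Sanquan", "Wufangzhai", "Beijing Daoxiangcun", "Wufangzhai", "Sanquan"]
labels, counts = top_n(brands, 2)
print(labels, counts)
```

Calling `render_top_bar(labels, counts)` writes the chart as an HTML page that can be opened in a browser.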

① Top10 bar chart of zongzi sales shops

Conclusion analysis: Last year we analyzed some mooncake data, and brands such as “Wufangzhai” and “Beijing Daoxiangcun” are still fresh in memory; they are long-established makers of both mooncakes and zongzi. As for “Sanquan” and “Sinian”, I always thought they only made dumplings and tangyuan. Are their zongzi worth a try? There are also some newer brands, such as “Zhu Laoda” and “Daoxiang Sifang”, which you can look up yourself. When buying, choose carefully; brand matters.

② Top5 bar chart of zongzi flavor ranking

Conclusion analysis: As I remember it, what I ate most as a child was “sweet zongzi”; it was not until junior middle school that I learned zongzi could contain meat. As the chart shows, plenty of shops sell “fresh meat zongzi”; as a gift, it still comes across as high-end. There are also flavors such as “jujube zongzi” and “bean paste zongzi” that I have hardly ever tasted. If you were giving zongzi as a gift, which flavor would you pick?

③ Pie chart of zongzi sales price ranges

Conclusion: Here I deliberately subdivided the price ranges. The pie chart is quite realistic: the Dragon Boat Festival comes only once a year, so sellers still go for small margins and quick turnover, and nearly 80% of zongzi are priced under 100 yuan. There are also some mid-range zongzi priced between 100 and 300 yuan. Above 300 yuan, I don’t think it is worth it; I, for one, would not spend that much on zongzi.

④ Word cloud of zongzi product names

Conclusion: From the word cloud we can roughly see the sellers’ selling points. It is a holiday, after all, and words such as “gift” reflect the festive atmosphere. “Pork” and “bean paste” reflect the flavors of zongzi. Some listings also pitch zongzi as a good “breakfast” choice, and “group purchase” is supported too.
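A word cloud is driven by word frequencies, so the preparation step is essentially counting words in the product names. Here is a minimal sketch on made-up English titles; the titles and the function name `word_frequencies` are hypothetical. Real JD.com titles are Chinese and would need a word segmenter first, which is outside this sketch.

```python
import re
from collections import Counter


def word_frequencies(titles, top=2):
    """Count lowercase words across all titles and return the most common ones."""
    words = []
    for title in titles:
        words.extend(re.findall(r"[a-z]+", title.lower()))
    return Counter(words).most_common(top)


# Hypothetical product titles.
titles = [
    "Wufangzhai pork zongzi gift box",
    "Bean paste zongzi gift",
    "Sweet zongzi breakfast",
]

freq = word_frequencies(titles, top=2)
print(freq)
```

The result is a list of (word, count) pairs, which is the shape a word-cloud chart typically takes as input.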

⑤ Combining the charts into a dashboard

The visualizations in this article are drawn with the pyecharts library. Each chart is made individually first, and the charts are then combined into a single attractive dashboard. To learn how it is made, you can send a private message to get the code!
