preface

Chinese Valentine's Day, the Qixi (Tanabata) Festival (also known as the Qiqiao Festival or the Seventh Sister's Birthday; English name: the Double Seventh Festival), is celebrated on the seventh day of the seventh lunar month. It has many customs, such as praying to the moon, praying to the Weaver Girl, eating seasonal fruit, and praying for a happy marriage.

As Chinese Valentine's Day approaches, many people have begun preparing a gift for their girlfriend or boyfriend. A gift is a medium for conveying emotion, expressing one's wishes and intentions, yet choosing one is a difficult decision for many. This article uses Python to crawl product pages from Taobao and analyzes the best-selling gifts to produce a list for your reference.

Programming instructions

According to different keywords, crawl Taobao to obtain product information (using "Qixi Festival gift", "Qixi Festival gift for boyfriend", "Qixi Festival gift for girlfriend", etc.), analyze the collected data to produce a Qixi Festival gift list, and show the frequency and proportion of different gifts through word cloud visualization.

Data crawl

πŸ•ΈοΈ Website composition analysis

A crawler cannot work without understanding the website, so first observe how the site's URLs are composed. When you enter the keyword "Qixi Festival gift" into the search box, you will find that the value of the q parameter in the URL is the keyword you typed, as shown in the figure below:

Therefore, you can construct the URL as follows:

```python
q_value = "Qixi Festival gift"
url = "https://s.taobao.com/search?q={}&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20210802&ie=utf8&bcoffset=5&p4ppushleft=2%2C48&ntoffset=5&s=44".format(q_value)
```

It is also possible to copy the URL directly, but the downside of that approach is that you have to re-open the page and copy the URL every time you want to crawl another category of goods. By constructing the URL from the q_value variable, you only need to modify q_value to fetch another category; for example, to crawl the keyword "Chinese Valentine's Day gift for boyfriend", just make the following change:

```python
q_value = "Chinese Valentine's Day gift for boyfriend"
```
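Since the keyword usually contains non-ASCII characters, it is safer to percent-encode it when constructing such URLs yourself. A minimal sketch using only the standard library (the parameter set is trimmed down from the search URL above purely for illustration):

```python
from urllib.parse import quote

def build_search_url(keyword, page_index=0):
    # percent-encode the (possibly Chinese) keyword for use in a query string
    return "https://s.taobao.com/search?q={}&ie=utf8&s={}".format(
        quote(keyword), page_index * 44)

print(build_search_url("七夕礼物", 1))
# https://s.taobao.com/search?q=%E4%B8%83%E5%A4%95%E7%A4%BC%E7%89%A9&ie=utf8&s=44
```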

πŸ•ΈοΈ Web structure analysis

Using the browser's developer tools to inspect the structure of the web page, you can see that the product information is embedded in a script tag in the page source:


For this reason, we first use the requests library to fetch the page content. Note that when requesting the page we need to include cookie and user-agent values in the request headers, otherwise we will not get a valid response. To obtain them, click the current request under the Network tab of the developer tools (if no request is listed there, refresh the page), find the cookie and user-agent values under the Headers tab, and copy them. Construct the request headers as follows:

```python
headers = {
    # replace with your own user-agent value
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62",
    # replace with the cookie value you just copied
    "cookie": "... JSESSIONID=4990DB1E8783DF90266F5159209F8E3A"
}
```

The following figure shows an example of getting the cookie value (the user-agent value is obtained in a similar way: just find the user-agent field of the request headers in the Headers tab):


After getting the response, we use BeautifulSoup4 and the regular expression library re to parse the page and extract the product information:

```python
import re
import requests
from bs4 import BeautifulSoup

response = requests.get(url=url, headers=headers)
response.raise_for_status()
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
results = soup.find_all('script')
information = str(results[7])

# get product titles
raw_title = re.findall(r'"raw_title":"(.*?)"', information)
# get prices
view_price = re.findall(r'"view_price":"(.*?)"', information)
# get sales volumes
view_sales = re.findall(r'"view_sales":"(.*?)"', information)
# get shipping addresses
item_loc = re.findall(r'"item_loc":"(.*?)"', information)
# get shop names
nick = re.findall(r'"nick":"(.*?)"', information)
# get detail page URLs
detail_url = re.findall(r'"detail_url":"(.*?)"', information)
# get cover image URLs
pic_url = re.findall(r'"pic_url":"(.*?)"', information)
```

It is important to note that the raw page data contains undecoded Unicode escape sequences, so printing the detail page address or the cover image address directly shows the escaped form:

print(detail_url[4])
# //detail.tmall.com/item.htm?id\u003d628777347316\u0026ns\u003d1\u0026abbucket\u003d17 

Therefore, you need to use the following methods to decode correctly:

decodeunichars = detail_url[4].encode('utf-8').decode('unicode-escape')
print(decodeunichars)
# //detail.tmall.com/item.htm?id=628777347316&ns=1&abbucket=17 
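The decoding step can be verified in isolation: unicode-escape turns literal \uXXXX sequences back into the characters they encode. A self-contained sketch:

```python
# a detail URL as it appears in the raw page data, with \u escape sequences
raw = "//detail.tmall.com/item.htm?id\\u003d628777347316\\u0026ns\\u003d1"
decoded = raw.encode('utf-8').decode('unicode-escape')
print(decoded)  # //detail.tmall.com/item.htm?id=628777347316&ns=1
```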

πŸ•ΈοΈ Write data to a CSV file

To write data to a CSV file, first create the CSV file and write the header:

```python
import csv

file = open('gift.csv', 'w', encoding='utf-8-sig', newline='')
csv_head = csv.writer(file)
# write the header row
header = ['raw_title', 'view_price', 'view_sales', 'salary', 'item_loc', 'nick', 'detail_url', 'pic_url']
csv_head.writerow(header)
file.close()
```

Then, since commas separate fields in a CSV file, we need to preprocess the obtained data and replace any commas inside values:

def precess(item):
    return item.replace(',', ' ') 

Finally, after preprocessing the data, write the data into a CSV file:

```python
for i in range(len(raw_title)):
    with open('gift.csv', 'a+', encoding='utf-8-sig') as f:
        f.write(precess(raw_title[i]) + ',' + precess(view_price[i]) + ',' +
                precess(view_sales[i]) + ',' + precess(item_loc[i]) + ',' +
                precess(nick[i]) + ',' + precess(detail_url[i]) + ',' +
                precess(pic_url[i]) + '\n')
```
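As an aside, an alternative to replacing commas by hand is to let the csv module quote fields itself: csv.writer escapes commas inside values automatically, so no information is lost. A sketch with made-up column values:

```python
import csv

# a title containing a comma would break naive string joining,
# but csv.writer quotes the field automatically
row = ["Plush bear, extra large", "59.00", "1000+ paid", "Beijing"]
with open('gift_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(row)

with open('gift_demo.csv', encoding='utf-8-sig', newline='') as f:
    read_back = next(csv.reader(f))
print(read_back)  # the comma in the title survives the round trip
```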

πŸ•ΈοΈ Crawl to more pages by observing the page links

Look at the URLs of pages 2 and 3:

```
https://s.taobao.com/search?q=Qixi+Festival+gift&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20210805&ie=utf8&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s=44
https://s.taobao.com/search?q=Qixi+Festival+gift&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20210805&ie=utf8&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s=88
```

As you can see, the only difference between the URLs is the s parameter: it is 44 on the second page and 88 on the third. Combined with the fact that each page shows 44 products, we can use the following construction to crawl 20 pages of products:

```python
url_pattern = "https://s.taobao.com/search?q={}&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20210805&ie=utf8&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s={}"
for i in range(20):
    url = url_pattern.format(q, i * 44)
```

πŸ•ΈοΈ crawler complete code

```python
import re
import requests
import time
from bs4 import BeautifulSoup
import csv
import os

q = "Qixi Festival gift"
url_pattern = "https://s.taobao.com/search?q={}&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20210805&ie=utf8&bcoffset=2&ntoffset=2&p4ppushleft=2%2C48&s={}"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62",
    # replace with the cookie value obtained using the method above
    "cookie": "..."
}

def analysis(item, results):
    pattern = re.compile(item, re.I | re.M)
    return pattern.findall(results)

def analysis_url(item, results):
    pattern = re.compile(item, re.I | re.M)
    result_list = pattern.findall(results)
    for i in range(len(result_list)):
        result_list[i] = result_list[i].encode('utf-8').decode('unicode-escape')
    return result_list

def precess(item):
    return item.replace(',', ' ')

if not os.path.exists("gift.csv"):
    file = open('gift.csv', 'w', encoding='utf-8-sig', newline='')
    csv_head = csv.writer(file)
    # write the header row
    header = ['raw_title', 'view_price', 'view_sales', 'salary', 'item_loc', 'nick', 'detail_url', 'pic_url']
    csv_head.writerow(header)
    file.close()

for i in range(100):
    time.sleep(25)
    url = url_pattern.format(q, i * 44)
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('script')
    information = str(results[7])
    # split the page data into per-product chunks
    all_goods = analysis(r'"raw_title":"(.*?)"shopLink"', information)
    for good in all_goods:
        # get the product title; skip entries without one
        raw_title = analysis(r'(.*?)","pic_url"', good)
        if not raw_title:
            continue
        # get the price
        view_price = analysis(r'"view_price":"(.*?)"', good)
        if not view_price:
            view_price.append('0.00')
        # get the sales volume
        view_sales = analysis(r'"view_sales":"(.*?)"', good)
        if not view_sales:
            view_sales.append('0 paid')
        # get the shipping address
        item_loc = analysis(r'"item_loc":"(.*?)"', good)
        if not item_loc:
            item_loc.append('unknown address')
        # get the shop name
        nick = analysis(r'"nick":"(.*?)"', good)
        if not nick:
            nick.append('no shop name')
        # get the detail page URL
        detail_url = analysis_url(r'"detail_url":"(.*?)"', good)
        if not detail_url:
            detail_url.append('no link')
        # get the cover image URL
        pic_url = analysis_url(r'"pic_url":"(.*?)"', good)
        if not pic_url:
            pic_url.append('no cover')
        with open('gift.csv', 'a+', encoding='utf-8-sig') as f:
            f.write(precess(raw_title[0]) + ',' + precess(view_price[0]) + ',' +
                    precess(view_sales[0]) + ',' + precess(item_loc[0]) + ',' +
                    precess(nick[0]) + ',' + precess(detail_url[0]) + ',' +
                    precess(pic_url[0]) + '\n')
```

πŸ•ΈοΈ Data crawl result


Data analysis and visualization

Perhaps many readers clicked in not only to learn the technique but also to find out, while they are at it, what gift to give their boyfriend or girlfriend. Don't worry, the part you care about most is coming. Next, word cloud visualization is used to analyze two cases: excluding and including sales.

🎁 Chinese Valentine's Day gift list

Without considering sales:

```python
from os import path
import numpy as np
import pandas as pd
import jieba
from PIL import Image
from wordcloud import WordCloud

df = pd.read_csv('gift.csv', encoding='utf-8-sig',
                 usecols=['raw_title', 'view_price', 'view_sales', 'salary',
                          'item_loc', 'nick', 'detail_url', 'pic_url'])
raw_title_list = df['raw_title'].values
raw_title = ','.join(raw_title_list)
with open('text.txt', 'a+') as f:
    f.writelines(raw_title)

d = path.dirname(__file__)
# read the whole text
text = open(path.join(d, 'text.txt')).read()
# word segmentation
text_split = jieba.cut(text)
# remove modifier words
stopwords = ["Qixi", "Chinese Valentine's Day", "Valentine's Day", "boyfriend",
             "girlfriend", "boys", "girls", "gift", "birthday", "creative",
             "practical", "friend", "man", "wife", "husband", "direct selling",
             "girlfriends", "marriage", "to send"]
text_split_no = []
for word in text_split:
    if word not in stopwords:
        text_split_no.append(word)
text = ' '.join(text_split_no)

# use an image as the word cloud mask
picture_mask = np.array(Image.open(path.join(d, "beijing.jpg")))
wc = WordCloud(
    # set the font path (required to render Chinese text)
    font_path=r'C:\Windows\Fonts\simsun.ttc',
    background_color="white",
    max_words=4000,
    mask=picture_mask)
# generate the word cloud and save it
wc.generate(text)
wc.to_file(path.join(d, "result.jpg"))
```
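The stopword-filtering step above works the same on any token list; using a set gives constant-time membership tests. A tiny self-contained sketch (with English tokens for illustration):

```python
# a set makes "word not in stopwords" an O(1) lookup
stopwords = {"gift", "creative", "practical"}
tokens = ["necklace", "gift", "music", "box", "creative", "lipstick"]
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['necklace', 'music', 'box', 'lipstick']
```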


Next, consider sales:

```python
def get_sales(item):
    # strip the trailing '人付款' ("people paid")
    tmp = item[:-3]
    tmp = tmp.replace('+', '')
    if '万' in tmp:  # '万' means ten thousand
        tmp = str(int(float(tmp.replace('万', '')) * 10000))
    tmp = int(tmp)
    # scale down so each title is repeated once per 100 sales
    tmp = tmp / 100.0
    if tmp <= 0:
        tmp = 0
    return round(tmp)

raw_title_list = df['raw_title'].values
view_sales_list = df['view_sales'].values
for i in range(len(raw_title_list)):
    for j in range(get_sales(view_sales_list[i])):
        with open('text.txt', 'a+') as f:
            f.writelines(raw_title_list[i] + ',')
# the rest of the code is the same as above
```
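The sales-string conversion can be tested on its own. The sketch below assumes the raw strings look like '1.5万+人付款' ("1.5万+ people paid", where 万 means ten thousand); unlike get_sales above, it returns the raw count rather than a down-scaled repetition weight:

```python
def parse_sales(item):
    """Convert a sales string like '1.5万+人付款' to an integer count."""
    tmp = item[:-3]             # drop the trailing '人付款' ("people paid")
    tmp = tmp.replace('+', '')  # '1000+' -> '1000'
    if '万' in tmp:             # '万' means ten thousand
        return int(float(tmp.replace('万', '')) * 10000)
    return int(tmp)

print(parse_sales('1.5万+人付款'))  # 15000
```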


It can be seen that the difference is fairly obvious. Finally, the segmentation results are sorted; after manually removing invalid words, the top 15 gifts are classified and compiled:

gift = {}
for i in text_split_no:
	if gift.get(i,0):
		gift[i] += 1
	else:
		gift[i] = 1
sorted_gift = sorted(gift.items(), key=lambda x:x[1],reverse=True) 
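For reference, the counting-and-sorting loop above is equivalent to collections.Counter, whose most_common method returns the top entries directly:

```python
from collections import Counter

# illustrative tokens standing in for the segmented titles
text_split_no = ["necklace", "bracelet", "necklace", "lipstick", "necklace", "bracelet"]
top = Counter(text_split_no).most_common(2)
print(top)  # [('necklace', 3), ('bracelet', 2)]
```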

❤️ Gift list ❤️

Doll / plush toy / huggable bear
Candy
Necklace
Chocolate
Album / souvenir book
Snacks
Bracelet / bangle
Rose
Pendant
Night light
Music box
Ring / pair of rings
Lipstick
Hand string / red string
Earrings

👧 Gift list for girlfriends

In the same way, change the keyword to "girlfriend + gift" to get the ❤️ gift list for girlfriends ❤️:

Night light
Album / commemorative album / picture frame
Lettering gift
Music box
Necklace / bracelet / bangle
Lipstick
Watch
Bag
Glass slipper
Doll

👦 Gift list for boyfriends

Finally, change the keyword to "boyfriend + gift" to get the ❤️ gift list for boyfriends ❤️:

Engraved Coke bottle
Album / commemorative album / photo frame
Bouquet
Hand ornaments
Key chain
Woodcut
Love letter
Embroidery
Basketball

As you can see, there are some differences between gifts for girlfriends and gifts for boyfriends. Of course, part of the discrepancy may be that many titles are worded so creatively that they do not name the product at all.

Final words

Of course, this article is for analysis practice only, and the results are for reference only. If nothing on the list appeals to you, you can also opt for a red envelope or empty your partner's shopping cart. Whatever gift you give your girlfriend or boyfriend, what matters is conveying your own ❤️ intention ❤️.
