I put together a small crawler example in my spare time today. By crawling user reviews on JD.com, we can analyze the data and answer questions such as which bra color is most popular among women and the average cup size of Chinese women (for reference only).
Open the developer tools and switch to the Network tab. On the user review page we can see the browser issue a request like this:
Analyzing the request, we find three main parameters: productId, page and pageSize. The last two are paging parameters; productId is the ID of a product, and with it we can fetch that product's review records. So we only need the productId of each product to easily obtain its reviews.
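For example, a single review request has roughly this shape (the productId below is a made-up placeholder, and the numeric suffix of the callback name can vary between products):

https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv53282&productId=1234567&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1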
Next we analyze the source code of the search page. There we find that each product sits in an li tag, and each li tag carries a data-pid attribute whose value is the product's productId.
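In simplified form, the relevant markup looks something like this (a sketch; the real page carries many more attributes and nested tags):

<li data-pid="1234567">
    <!-- product title, price, thumbnail ... -->
</li>
<li data-pid="7654321">
    ...
</li>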
With a general understanding of the process, we can start our crawler work.
First we need to extract the product ids from the search page, which will supply the productId values for crawling user reviews later. key_word is the search keyword, here it is "bra".
import requests
import re

def find_product_id(key_word):
    """Query commodity IDs."""
    jd_url = 'https://search.jd.com/Search'
    product_ids = []
    # crawl the first 3 pages of search results
    for i in range(1, 4):
        param = {'keyword': key_word, 'enc': 'utf-8', 'page': i}
        response = requests.get(jd_url, params=param)
        # product ids
        ids = re.findall('data-pid="(.*?)"', response.text, re.S)
        product_ids += ids
    return product_ids
With the product ids from the first three pages collected into a list, we can now crawl the reviews.
Analyzing the Preview tab, we find that the response for user reviews is a JSONP string: a JSON object wrapped in a callback function call. Once we strip the wrapper characters, we get the JSON object we want.
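Concretely, the response body looks roughly like this (a sketch; the numeric suffix of the callback name can vary):

fetchJSON_comment98vv53282({"productCommentSummary": {...}, "comments": [...]});

Stripping the leading fetchJSON_comment98vv53282( and the trailing ); leaves plain JSON that json.loads can parse.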
The comments field of this JSON object contains the review records we ultimately want.
""" Get comment content """
def get_comment_message(product_id):
urls = ['https://sclub.jd.com/comment/productPageComments.action?' \
'callback=fetchJSON_comment98vv53282&' \
'productId={}' \
'&score=0&sortType=5&' \
'page={}' \
'&pageSize=10&isShadowSku=0&rid=0&fold=1'.format(product_id, page) for page in range(1.11)]
for url in urls:
response = requests.get(url)
html = response.text
# delete useless characters
html = html.replace('fetchJSON_comment98vv53282('.' ').replace('); '.' ')
data = json.loads(html)
comments = data['comments']
t = threading.Thread(target=save_mongo, args=(comments,))
t.start()
This method builds the URLs for the first 10 pages of reviews into the urls list, then loops over them to fetch the review records page by page, starting a thread to store the comment data in MongoDB.
We continue to analyze the review records and find the two fields we want:

- productColor: the product color
- productSize: the product size
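So a single record in the comments array looks roughly like this (a trimmed sketch with illustrative values; real records carry many more fields):

{
    "content": "...",
    "creationTime": "2019-06-01 12:00:00",
    "productColor": "black",
    "productSize": "B"
}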
import pymongo

# mongo client
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# jd database
db = client.jd
# the product collection is created automatically on first insert
product_db = db.product

# save to mongo
def save_mongo(comments):
    for comment in comments:
        product_data = {}
        # color, cleaned by the flush_data method below
        product_data['product_color'] = flush_data(comment['productColor'])
        # size
        product_data['product_size'] = flush_data(comment['productSize'])
        # review content
        product_data['comment_content'] = comment['content']
        # creation time
        product_data['create_time'] = comment['creationTime']
        # insert into mongo
        product_db.insert_one(product_data)
Because every product describes its colors and sizes a little differently, we do some simple data cleaning to make the statistics easier. This code is very non-pythonic, but it's just a little demo, so you can ignore that.
def flush_data(data):
    # map the raw color description to a normalized color
    if 'skin' in data:
        return 'nude'
    if 'black' in data:
        return 'black'
    if 'purple' in data:
        return 'purple'
    if 'powder' in data:
        return 'pink'
    if 'blue' in data:
        return 'blue'
    if 'white' in data:
        return 'white'
    if 'grey' in data:
        return 'grey'
    if 'champagne' in data:
        return 'champagne'
    if 'amber' in data:
        return 'amber'
    if 'red' in data:
        return 'red'
    # map the raw size description to a cup size
    if 'A' in data:
        return 'A'
    if 'B' in data:
        return 'B'
    if 'C' in data:
        return 'C'
    if 'D' in data:
        return 'D'
Now that the functionality of these modules has been written, we just need to connect them together
# create a thread lock
lock = threading.Lock()

# comment-fetching worker thread
def spider_jd(ids):
    while True:
        # lock before touching the shared list
        lock.acquire()
        if not ids:
            lock.release()
            break
        # take the first element and remove it so it is not fetched twice
        id = ids.pop(0)
        # release the lock
        lock.release()
        # fetch the reviews
        get_comment_message(id)

product_ids = find_product_id('bra')
# start 4 threads to fetch reviews
for i in range(1, 5):
    t = threading.Thread(target=spider_jd, args=(product_ids,))
    # start the thread
    t.start()
The lock in the code above prevents two threads from consuming the same element of the shared list.
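As an aside, the same work distribution can be written without a manual lock: Python's queue.Queue is thread-safe out of the box. A minimal sketch of that alternative (not used in this demo):

import queue
import threading

def spider_jd_queue(id_queue):
    while True:
        try:
            # Queue.get_nowait() is thread-safe, no explicit lock needed
            id = id_queue.get_nowait()
        except queue.Empty:
            break
        get_comment_message(id)

id_queue = queue.Queue()
for pid in find_product_id('bra'):
    id_queue.put(pid)
for i in range(4):
    threading.Thread(target=spider_jd_queue, args=(id_queue,)).start()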
Check MongoDB after running:
After obtaining the results, we can use the Matplotlib library to visualize the data and make it more intuitive.
import pymongo
import matplotlib.pyplot as plt

client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# jd database
db = client.jd
# product collection
product_db = db.product
# count the following colors
labels = ['nude', 'black', 'purple', 'pink', 'blue', 'white', 'grey', 'champagne', 'red']
sizes = []
for i in labels:
    num = product_db.count_documents({'product_color': i})
    sizes.append(num)
# display color of each slice
colors = ['bisque', 'black', 'purple', 'pink', 'blue', 'white', 'gray', 'peru', 'red']
# labeldistance: how far the label text sits from the center, 1.1 means 1.1x the radius
# autopct: format of the text inside each slice, '%3.1f%%' keeps one decimal place
# shadow: whether the pie has a shadow
# startangle: rotate counterclockwise so the first slice starts at 90 degrees
# pctdistance: how far the percentage text sits from the center
# pie() returns the wedge patches, the label texts (l_text) and the percentage texts (p_text)
patches, l_text, p_text = plt.pie(sizes, labels=labels, colors=colors,
                                  labeldistance=1.1, autopct='%3.1f%%', shadow=False,
                                  startangle=90, pctdistance=0.6)
# change the text size by iterating over each text object and calling set_size
for t in l_text:
    t.set_size(30)
for t in p_text:
    t.set_size(20)
# make the x and y scales equal so the pie is round
plt.axis('equal')
plt.title("Underwear color scale map", fontproperties="SimHei")
plt.legend()
plt.show()
Running the code, we find that nude (skin tone) is the most popular color, with black in second place.
Next, let's take a look at the distribution of sizes, which we display as a bar chart.
index = ["A", "B", "C", "D"]
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
db = client.jd
product_db = db.product
value = []
for i in index:
    num = product_db.count_documents({'product_size': i})
    value.append(num)
plt.bar(index, height=value, color="green", width=0.5)
plt.show()
After running it, we find that size B is the most common.
Finally, you are welcome to follow my public account (PYTHon3xxx), where I post different practical Python content every day.