I am participating in the Mid-Autumn Festival Creative Submission contest; see the Mid-Autumn Festival Creative Submission Contest page for details. This is my first crawler, and discussion and exchange are welcome.

Preface

This article uses Python to crawl the prices and details of Vipshop mooncakes. If you have not used Python before, install it first. The requests third-party library is the usual starting point for crawlers, but the major e-commerce platforms have anti-crawler mechanisms, and the page data cannot be obtained with requests alone. This article therefore uses Selenium.

Third-party libraries

This section describes how to install the third-party libraries and how they are used in the program.

Selenium

Selenium runs tests directly in the browser, just as a real user would, and supports most major browsers, including Internet Explorer (7–11), Firefox, Safari, Chrome, and Opera. We can use it to simulate user clicks when visiting a website and to get past some complex authentication scenarios. Because Selenium drives a real browser, it renders and parses the JavaScript directly, bypassing most parameter construction and anti-crawling measures.

1. Installation: pip is available once Python is installed. Install Selenium with: pip install selenium

2. Download chromedriver: download chromedriver.exe. Note: its version must match your Chrome browser version, and the file needs to be placed in the Python directory.
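
If chromedriver is not found automatically, you can also point Selenium at it explicitly. A minimal sketch, assuming Selenium 4 and an illustrative path (replace it with wherever you put chromedriver.exe):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Illustrative path -- replace with the actual location of chromedriver.exe
service = Service(r"C:\Python39\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get("https://www.vip.com")
print(driver.title)  # quick check that the browser is being driven
driver.quit()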

BeautifulSoup

BeautifulSoup parses a web page into a structured tree so that we can retrieve the data we need, such as the product name, price, and item number.

Install it with pip: pip install beautifulsoup4
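
As a quick illustration of how it works, here is a minimal sketch parsing a made-up HTML snippet (the class names are invented, not Vipshop's real markup):

from bs4 import BeautifulSoup

# Invented snippet for demonstration only
html = '<div class="goods"><span class="name">Mooncake gift box</span><span class="price">99</span></div>'
bs = BeautifulSoup(html, "lxml")  # the lxml parser also needs: pip install lxml
print(bs.select_one('.name').get_text())   # Mooncake gift box
print(bs.select_one('.price').get_text())  # 99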

Crawling the mooncake data

Accessing the page

Open the Vipshop website, search for mooncakes, and copy the resulting URL: category.vip.com/suggest.php…

After installing Python, open CMD, run python, and enter the following code to check that it executes and opens the browser correctly.

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://category.vip.com/suggest.php?keyword=%E6%9C%88%E9%A5%BC&ff=235|12|1|1"  # Vipshop search URL for mooncakes
driver = webdriver.Chrome()  # opens a Chrome window driven by chromedriver
driver.get(url)

Parse the page to get the product data

Press F12 to open the browser's developer tools and hover over a complete product block to inspect the page structure. Then use BeautifulSoup's select method to obtain all the mooncake products on the current page and print the result.

html = driver.page_source  # HTML after JavaScript has rendered
bs = BeautifulSoup(html, "lxml")
course_data = bs.select('div[data-product-id]')  # every product block carries a data-product-id attribute
print(course_data)
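
One caveat: the product list is rendered by JavaScript, so page_source can be captured before the products exist. If select returns an empty list, add an explicit wait first; a minimal sketch with Selenium's WebDriverWait, reusing the same selector:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one product block to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-product-id]'))
)
html = driver.page_source  # now safe to hand to BeautifulSoup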

Parse detailed product data

To get the item code, you need to open each detail page. The for loop therefore opens each item in turn, uses BeautifulSoup to extract the data, and then builds a dictionary holding the useful fields.

The size-parsing logic is verbose and can be skimmed; I wrote it out plainly so it is easy to read later. I'm a front-end developer and rarely write Python.

list_data = []  # holds one dict per product
for each_item in course_data:
    detailUrl = each_item.find("a")  # link to the product detail page
    id = each_item.attrs['data-product-id']
    goodsName = each_item.find("div", class_="c-goods-item__name")
    driver.get('https:' + detailUrl.attrs["href"])  # open the detail page
    html1 = driver.page_source
    bs1 = BeautifulSoup(html1, "lxml")
    size_arr = []  # collects the size/specification labels
    sizes = bs1.find_all("li", class_="size-list-item J-sizeID")
    if sizes:
        for size_item in sizes:
            size = size_item.find('span', class_="size-list-item-name")
            size_arr.append(size.getText())
    elif bs1.find_all("li", class_="selector_opt"):
        sizes = bs1.find_all("li", class_="selector_opt")
        for size_item in sizes:
            size_arr.append(size_item.find('a').getText())
    elif bs1.find_all("li", class_="size-list-item J-sizeID sli-selected size-list-item-small"):
        sizes = bs1.find_all("li", class_="size-list-item J-sizeID sli-selected size-list-item-small")
        for size_item in sizes:
            size = size_item.find('span', class_="size-list-item-name")
            size_arr.append(size.getText())
    elif bs1.find_all("li", class_="size-list-item J-sizeID sli-selected"):
        sizes = bs1.find_all("li", class_="size-list-item J-sizeID sli-selected")
        for size_item in sizes:
            size = size_item.find('span', class_="size-list-item-name")
            size_arr.append(size.getText())
    else:
        size_arr = ['not available']
    infoCode = bs1.find("p", class_="other-infoCoding").getText()
    size_str = ', '.join(size_arr)  # join the size list into one string
    goods_dict = {"goodsName": goodsName.getText(), "detailUrl": detailUrl.attrs["href"],
                  "id": id, "infoCode": infoCode, "size": size_str}
    list_data.append(goods_dict)

Output the data

Normally we would save the crawled data to a database. Since this is only a demo, we write the mooncake data to a text file for the time being.


print(list_data)
with open('mooncake_classes.txt', "a+", encoding="utf-8") as f:  # write the product information to a text file
    for text in list_data:
        print(text)
        f.write('Trade Name: ' + text['goodsName'] + ' Product ID: ' + text['id'] +
                ' Article No.: ' + text['infoCode'] + ' Specification: ' + text['size'] + '\n')
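
For reference, writing the same records into a database is not much harder. A minimal sketch using Python's built-in sqlite3 module (the file and table names are made up for illustration):

import sqlite3

conn = sqlite3.connect('mooncakes.db')  # creates the file if it does not exist
conn.execute('CREATE TABLE IF NOT EXISTS mooncakes (id TEXT, name TEXT, info_code TEXT, size TEXT, detail_url TEXT)')
conn.executemany(
    'INSERT INTO mooncakes VALUES (?, ?, ?, ?, ?)',
    [(g['id'], g['goodsName'], g['infoCode'], g['size'], g['detailUrl']) for g in list_data]
)
conn.commit()
conn.close()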

Conclusion

If you have followed the article this far, try getting the price data yourself; it works the same way as getting the name. This demo only crawls the first page. To crawl every page, simulate a click on the next-page button at the bottom, the same way the details are opened. Discussion and exchange are welcome.
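
As a starting point for the pagination exercise, here is a sketch of clicking a next-page button with Selenium. The '.next-page' selector is a placeholder; read the real class name from the developer tools:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# '.next-page' is a placeholder selector -- inspect the page with F12 for the real one
next_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.next-page'))
)
next_button.click()
html = driver.page_source  # parse the new page with BeautifulSoup as before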