Hello, I’m Zeroing

Today we will introduce how to use Python to crawl JD.com product listings. The data includes the product title, price, publisher, author, and other information.

The core libraries used in this crawler are Selenium and PyQuery: Selenium drives the browser to simulate visiting the web pages, and PyQuery parses the page source to extract the data. Let's take a look at the final result first.

After the script starts, Selenium automatically opens the JD.com page and turns through the product listings, while the extracted data is returned in the background as the browser pages along.

Before introducing the main program, here is a brief introduction to the Selenium package.

Installing Selenium

Selenium is primarily a testing tool for web applications: it can drive a browser through a series of steps that simulate human actions. For example, it can automatically work through online courses, automatically fill in forms, and automatically look up express tracking numbers on a web page. It currently supports Java, Python, C#, Ruby, and other languages.

When crawling web pages, some page data is rendered via Ajax. Weibo and Toutiao, for example, have no next-page link, and the page-turning effect is achieved by refreshing data in place. This kind of data is not placed directly in the HTML: user actions trigger JavaScript embedded in the page, which requests data stored as JSON and finally renders it.

For this kind of web page collection, there are generally two ideas:

  • 1. Use the developer tools to find the hidden links that serve the JSON data, then extract it with the conventional requests approach (see the sketch after this list);
  • 2. Use Selenium to simulate human operations and capture the data directly;
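
For the first idea, a minimal sketch might look like the following; the endpoint URL and the response shape are made-up placeholders, since the real ones depend on what the developer tools reveal for the target site.

```python
import requests

# idea 1, sketched: hit the hidden JSON endpoint found via developer tools
# (the URL and the 'items' key below are made-up placeholders)
resp = requests.get('https://example.com/api/items?page=1',
                    headers={'User-Agent': 'Mozilla/5.0'})
for item in resp.json()['items']:
    print(item)
```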

Because it behaves like a real user, the Selenium approach can effectively get around some of the anti-crawling measures on a web page.

Python uses Selenium through the selenium package, which can be installed with the pip command:

```
pip install selenium
```

Selenium currently supports both Chrome and Firefox. Chrome is recommended because there is more documentation about it on the web.

However, besides making sure Chrome is installed, you also need the chromedriver.exe tool (the core of Selenium is WebDriver, and chromedriver.exe is Chrome's WebDriver).

The ChromeDriver version must match your Chrome browser version; download it to your local PC.

Download address is as follows: chromedriver.chromium.org/downloads
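
Once both are in place, a quick sanity check like the one below should open JD.com and print its page title. The chromedriver.exe path is an assumption; point executable_path at wherever you saved the file.

```python
from selenium import webdriver

# assumes chromedriver.exe is in the working directory; adjust the path otherwise
driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get('https://www.jd.com/')
print(driver.title)  # a printed title confirms the driver and browser versions match
driver.quit()
```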

Crawler logic

To use Selenium to simulate manual operation and capture JD.com data (Python books are captured here as an example), the following steps are needed:

  • 1. Drive the browser and open the JD.com website;
  • 2. Locate the search box, clear it, fill in the product name (Python books), and click the search button next to it;
  • 3. On the product page, capture the data, then drive Selenium to turn the page and capture all the pages in turn;

First we need to initialize: create the Chrome WebDriver and the data storage file (here I use a TXT file).

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def __init__(self, item_name, txt_path):
    self.url = 'https://www.jd.com/'  # JD.com home page
    self.item_name = item_name
    self.txt_file = open(txt_path, encoding='utf-8', mode='w+')
    options = webdriver.ChromeOptions()  # Chrome options
    # developer mode, to avoid being identified as an automated browser
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    self.browser = webdriver.Chrome(executable_path="C:/Program Files/Google/Chrome/Application/chromedriver.exe",
                                    options=options)
    self.wait = WebDriverWait(self.browser, 2)
```

The webdriver.Chrome() method creates the driver for the Chrome browser; assign the local path of the chromedriver.exe you downloaded to the executable_path parameter.

When the browser opens a web page, it may load slowly due to the network speed, so the WebDriverWait method is used to create a wait object: each time wait.until() is called, the browser waits up to 2 seconds for the condition to be met before performing the next operation.
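
As a standalone illustration (the same call appears in run() below), the wait object is used like this, assuming browser is an existing webdriver.Chrome instance:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 2)  # browser: an existing webdriver.Chrome instance
# block until the product list is present, or raise TimeoutException after 2 s
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList > ul > li')))
```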

After the initialization, the main program simulates the access, input, click, and other operations; I've wrapped all of it in a run() function.

```python
def run(self):
    self.browser.get(self.url)
    input_edit = self.browser.find_element(By.CSS_SELECTOR, '#key')  # locate the search box
    input_edit.clear()
    input_edit.send_keys(self.item_name)
    search_button = self.browser.find_element(By.CSS_SELECTOR, '#search > div > div.form > button')
    search_button.click()  # click the search button
    time.sleep(2)
    html = self.browser.page_source  # get the page source
    self.parse_html(html)
    current_url = self.browser.current_url  # get the current page URL
    initial_url = str(current_url).split('&pvid')[0]
    for i in range(1, 100):
        try:
            print('Parsing page ----------------{}'.format(str(i)))
            next_page_url = initial_url + '&page={}&s={}&click=0'.format(str(i * 2 + 1), str(i * 60 + 1))
            print(next_page_url)
            self.browser.get(next_page_url)
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList > ul > li')))
            html = self.browser.page_source
            self.parse_html(html)  # extract the data from the page
        except Exception as e:
            print('Error Next page', e)
    self.txt_file.close()  # close the storage file
```

First, the get() method visits the JD.com home page; then the search box and search button are located in the page as input_edit and search_button, the keyword is typed in, and the click operation is performed.

If you can't locate a web element tag, you can use the browser developer mode, which takes the following steps (CSS selector is used as the example here; a small usage sketch follows the list):

  • 1. Click the pick-element button in the upper left corner of developer mode;
  • 2. Click the element you want to select on the page;
  • 3. In the HTML source area, right-click the highlighted tag and choose Copy;
  • 4. Select Copy selector;
  • 5. Paste it wherever you need it;
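
The copied selector is passed straight to find_element(). For example, copying the JD.com search box this way yields '#key' (browser here stands for the driver created earlier):

```python
from selenium.webdriver.common.by import By

# the copied CSS selector goes directly into find_element
input_edit = browser.find_element(By.CSS_SELECTOR, '#key')  # JD.com search box
```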

For the page-turning operation, the next-page URL is constructed here according to the pattern of JD.com's search URLs.

Page 5

```
https://search.jd.com/Search?keyword=%E4%BB%A3%E6%A3%AE%E9%93%B5&qrst=1&suggest=7.def.0.base&wq=%E4%BB%A3%E6%A3%AE%E9%93%B5&stock=1&page=9&s=241&click=0
```

Page 6

```
https://search.jd.com/Search?keyword=%E4%BB%A3%E6%A3%AE%E9%93%B5&qrst=1&suggest=7.def.0.base&wq=%E4%BB%A3%E6%A3%AE%E9%93%B5&stock=1&page=11&s=301&click=0
```

If you look closely, the only difference between the page 5 and page 6 URLs lies in two parameters, page and s:

  • page differs by 2;

  • s differs by 60;

Following this rule, we change the page and s parameters to build the URLs for the first 100 pages of JD.com product listings and complete the data capture.
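
To make the arithmetic explicit, here is a small sketch of that rule; build_page_url is a helper name of my own, not part of the crawler class:

```python
# result page n (counting from 1) uses page = 2n - 1 and s = 60(n - 1) + 1
def build_page_url(initial_url, n):
    return initial_url + '&page={}&s={}&click=0'.format(2 * n - 1, 60 * (n - 1) + 1)

# n=5 gives ...&page=9&s=241&click=0, n=6 gives ...&page=11&s=301&click=0
```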

For the data extraction part, I use the parse_html function.
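
The full parse_html is in the source code mentioned at the end of the article; below is only a minimal sketch of the PyQuery approach. The selectors for the title and price ('.p-name em', '.p-price i') are my assumptions about JD.com's product-list markup and may need adjusting.

```python
from pyquery import PyQuery as pq

def parse_html(self, html):
    doc = pq(html)
    # each product on the search result page is an li under #J_goodsList
    for item in doc('#J_goodsList ul li').items():
        title = item.find('.p-name em').text()  # product title
        price = item.find('.p-price i').text()  # product price
        self.txt_file.write('{}\t{}\n'.format(title, price))
```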

To make the program friendlier to use, I packaged all the functions into a class. The user only needs to supply two parameters, the name of the goods to be collected and the path of the storage file, to complete the data crawl.
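
Putting it together, usage would look something like the following; JdSpider is a stand-in name, since the real class name lives in the full source code:

```python
if __name__ == '__main__':
    # JdSpider is a hypothetical name for the class described above
    spider = JdSpider(item_name='python books', txt_path='jd_books.txt')
    spider.run()
```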

Finally, the crawled data is written into the TXT file; the result is as follows.

Summary

Although Selenium effectively cracks some anti-crawling mechanisms on the web side, it does not work on every site. On Lagou, for example, when you use Selenium to drive the browser and simulate page turning, the site can recognize the non-human operation and temporarily block and warn your IP.

For the complete source code involved in this article, follow the WeChat official account: Xiao Zhang Python, and reply with the keyword: Jingdong goods in the background to get it!

Well, that’s all for this article. Thank you for reading!