In this article, we will use Selenium to simulate a user operating the browser and crawl product information from Jingdong (JD.com). As usual, here is a screenshot of the result first:

1. Web page analysis

(1) Preliminary analysis

The blogger originally planned to write a crawler that could collect information for every kind of product, but during the analysis it turned out that different product categories use different page structures.

So we dropped that idea and only crawl laptop products here.

If you need to crawl other product categories, just change the data-extraction rules; interested readers can give it a try.

All right, let’s get started!

First of all, open the laptop search page in the Chrome browser; it is easy to see that the page is loaded dynamically.

When the page first opens there are only 30 items; dragging the page down to the bottom loads another 30.

So we can use Selenium to simulate the browser scrolling the page down and obtain the information of all the products on the page

>>> browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
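For instance, a minimal scroll-then-wait sketch (the gl-item class and the 30 + 30 = 60 item count come from the page as it looked when this was written, so treat them as assumptions to verify):

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8')

# scroll to the bottom so the lazily loaded second half of the list is requested
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")

# wait until all 60 items (30 initial + 30 lazily loaded) are present in the DOM
WebDriverWait(browser, 10).until(
    lambda b: len(b.find_elements_by_xpath('//li[@class="gl-item"]')) >= 60)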

(2) Simulated page turning

In addition, we found that the site has 100 pages in total

We could construct the URL of each page directly, but here we use Selenium to simulate the browser's page-turning behavior instead

Scroll the page down to the bottom and you will find a button for the next page; we just need to locate that element and click it to turn the page

>>> browser.find_element_by_xpath('//a[@class="pn-next" and @onclick]').click()
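Putting the scroll and the click together, a minimal page-turning sketch looks like this (assuming the browser object and imports from the full code below; the turn_page() method there adds exception handling on top of it):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

wait = WebDriverWait(browser, 10)

# scroll to the bottom so the "next page" button is rendered, then click it
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
wait.until(EC.element_to_be_clickable((By.XPATH, '//a[@class="pn-next"]'))).click()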

(3) Data acquisition

Next, we need to parse each page to extract the data we want. We can use Selenium to select the elements; the fields are (a short extraction sketch follows the list):

  • Product ID: browser.find_elements_by_xpath('//li[@data-sku]'), used to construct the product link
  • Product price: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[2]/strong/i')
  • Product name: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[3]/a/em')
  • Number of comments: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[4]/strong')
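
Put together, a minimal per-page extraction sketch looks like this (assuming browser already points at a results page; the item.jd.com link pattern is the same one used in parse_page() below):

# SKU -> product detail link
skus = [e.get_attribute('data-sku') for e in browser.find_elements_by_xpath('//li[@data-sku]')]
links = ['https://item.jd.com/{sku}.html'.format(sku=sku) for sku in skus]

# price, name and comment count come straight from the element text
prices = [e.text for e in browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[2]/strong/i')]
names = [e.text for e in browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[3]/a/em')]
comments = [e.text for e in browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[4]/strong')]

# one row per product: (link, price, name, comment)
rows = list(zip(links, prices, names, comments))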

2. Coding implementation

Well, the analysis is simple: the basic idea is to use Selenium to simulate the browser's behavior. Here is the code:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.common.exceptions
import json
import csv
import time

class JdSpider():
    def open_file(self):
        self.fm = input('Please input file save format (txt, json, csv): ')
        while self.fm != 'txt' and self.fm != 'json' and self.fm != 'csv':
            self.fm = input('Input error, please re-enter file save format (txt, json, csv): ')
        if self.fm == 'txt':
            self.fd = open('Jd.txt', 'w', encoding='utf-8')
        elif self.fm == 'json':
            self.fd = open('Jd.json', 'w', encoding='utf-8')
        elif self.fm == 'csv':
            self.fd = open('Jd.csv', 'w', encoding='utf-8', newline='')

    def open_browser(self):
        self.browser = webdriver.Chrome()
        self.browser.implicitly_wait(10)
        self.wait = WebDriverWait(self.browser,10)

    def init_variable(self):
        self.data = zip()       # placeholder; replaced with real rows in parse_page()
        self.isLast = False

    def parse_page(self):
        try:
            skus = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//li[@class="gl-item"]')))
            skus = [item.get_attribute('data-sku') for item in skus]
            links = ['https://item.jd.com/{sku}.html'.format(sku=item) for item in skus]
            prices = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[2]/strong/i')))
            prices = [item.text for item in prices]
            names = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[3]/a/em')))
            names = [item.text for item in names]
            comments = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[4]/strong')))
            comments = [item.text for item in comments]
            self.data = zip(links,prices,names,comments)
        except selenium.common.exceptions.TimeoutException:
            print('parse_page: TimeoutException')
            self.parse_page()
        except selenium.common.exceptions.StaleElementReferenceException:
            print('parse_page: StaleElementReferenceException')
            self.browser.refresh()

    def turn_page(self):
        try:
            self.wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@class="pn-next"]'))).click()
            time.sleep(1)
            self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            time.sleep(2)
        except selenium.common.exceptions.NoSuchElementException:
            self.isLast = True
        except selenium.common.exceptions.TimeoutException:
            print('turn_page: TimeoutException')
            self.turn_page()
        except selenium.common.exceptions.StaleElementReferenceException:
            print('turn_page: StaleElementReferenceException')
            self.browser.refresh()

    def write_to_file(self):
        if self.fm == 'txt':
            for item in self.data:
                self.fd.write('----------------------------------------\n')
                self.fd.write('link: ' + str(item[0]) + '\n')
                self.fd.write('price: ' + str(item[1]) + '\n')
                self.fd.write('name: ' + str(item[2]) + '\n')
                self.fd.write('comment: ' + str(item[3]) + '\n')
        if self.fm == 'json':
            temp = ('link', 'price', 'name', 'comment')
            for item in self.data:
                json.dump(dict(zip(temp,item)),self.fd,ensure_ascii=False)
        if self.fm == 'csv':
            writer = csv.writer(self.fd)
            for item in self.data:
                writer.writerow(item)

    def close_file(self):
        self.fd.close()

    def close_browser(self):
        self.browser.quit()

    def crawl(self):
        self.open_file()
        self.open_browser()
        self.init_variable()
        print('Start crawling')
        self.browser.get('https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8')
        time.sleep(1)
        self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(2)
        count = 0
        while not self.isLast:
            count += 1
            print('Crawling page ' + str(count) + '......')
            self.parse_page()
            self.write_to_file()
            self.turn_page()
        self.close_file()
        self.close_browser()
        print('End of crawl')

if __name__ == '__main__':
    spider = JdSpider()
    spider.crawl()

There are a few points in the code worth noting for later reference:

1. self.fd = open('Jd.csv', 'w', encoding='utf-8', newline='')

When opening the CSV file, remember to add the parameter newline=''; otherwise blank lines will appear in the file, which gets in the way of subsequent data processing.
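
A minimal sketch of the point, written to a throwaway demo.csv with placeholder values purely for illustration:

import csv

# without newline='' the csv module's own '\r\n' row ending gets translated again
# by the file object on Windows, leaving a blank line after every row
with open('demo.csv', 'w', encoding='utf-8', newline='') as fd:
    writer = csv.writer(fd)
    writer.writerow(['link', 'price', 'name', 'comment'])
    writer.writerow(['<link>', '<price>', '<name>', '<comment>'])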

2. self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")

When simulating the browser scrolling down the page, the data may not have been updated in time, so a StaleElementReferenceException is often raised.

Generally speaking, there are two ways to deal with it:

  • Call time.sleep() after the operation to give the browser plenty of time to load
  • Catch the exception and handle it accordingly (see the sketch after this list)
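
A minimal sketch that combines the two ideas, re-querying the elements after a short pause when the DOM has been re-rendered (the price XPath is reused from above; the retry count is just an illustrative choice):

import time
import selenium.common.exceptions

def get_prices(browser, retries=3):
    # re-run the lookup after a pause whenever the referenced nodes have gone stale
    for _ in range(retries):
        try:
            elements = browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[2]/strong/i')
            return [e.text for e in elements]
        except selenium.common.exceptions.StaleElementReferenceException:
            time.sleep(2)   # give the browser time to finish re-rendering, then retry
    return []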

3. skus = [item.get_attribute('data-sku') for item in skus]

When selecting elements with XPath syntax in Selenium, the attribute value of a node cannot be fetched directly; you have to call the get_attribute() method on the selected element instead.
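
In other words, an XPath that points at the attribute itself does not work as a Selenium locator; you select the element nodes first and then read the attribute from each of them (a minimal sketch):

# this does NOT work in Selenium: the expression returns attributes, not elements
# browser.find_elements_by_xpath('//li[@data-sku]/@data-sku')

# select the element nodes, then read the attribute value with get_attribute()
items = browser.find_elements_by_xpath('//li[@data-sku]')
skus = [item.get_attribute('data-sku') for item in items]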

4. Running the browser headless speeds up the crawl; just enable headless mode when starting the browser.

opt = webdriver.chrome.options.Options()
opt.set_headless()                              # newer Selenium versions use opt.add_argument('--headless') instead
browser = webdriver.Chrome(chrome_options=opt)  # newer versions take the keyword argument options=opt