Life is short, I use Python.

Previous portal:

Python crawler (1): The beginning

Python crawler (2): Pre-preparation (1) basic library installation

Python crawler (3): Pre-preparation (2) Linux basics

Python crawler (4): Pre-preparation (3) Docker basics

Python crawler (5): Pre-preparation (4) database basics

Python crawler (6): Pre-preparation (5) crawler framework installation

Python crawler (7): HTTP basics

Python crawler (8): Web basics

Python crawler (9): Crawler basics

Python crawler (10): Sessions and Cookies

Python crawler (11): Urllib

Python crawler (12): Urllib

Python crawler (13): Urllib

Python crawler (14): Urllib

Python crawler (15): Urllib

Python crawler (16): Urllib

Python crawler (17): Basic usage of Requests

Python crawler (18): Advanced Requests operations

Python crawler (19): XPath basic operations

Python crawler (20): Advanced XPath

Python crawler (21): Parsing library Beautiful Soup

Python crawler (22): Beautiful Soup

Python crawler (23): Getting started with the pyQuery parsing library

Python crawler (24): 2019 Douban movie rankings

Python crawler (25): Crawling stock information

Python crawler (26): You can't even afford a second-hand house in Shanghai

Python crawler (27): The automated testing framework Selenium, from getting started to giving up (part 1)

Python crawler (28): The automated testing framework Selenium, from getting started to giving up (part 2)

Python crawler (29): Using Selenium to grab product information from a large e-commerce site

Python crawler (30): Proxy basics

Python crawler (31): Building a simple proxy pool yourself

Python crawler (32): Introduction to the asynchronous request library AIOHTTP

Python crawler (33): Crawler framework Scrapy basics (part 1)

Python crawler (34): Crawler framework Scrapy basics (part 2)

Python crawler (35): Crawler framework Scrapy basics (part 3): Selector

Python crawler (36): Crawler framework Scrapy, Downloader Middleware

Python crawler (37): Introduction to the JavaScript rendering service Scrapy-Splash

Introduction

Scrapy fetches pages in much the same way the Requests library does: it sends HTTP requests directly and does not execute the JavaScript that renders dynamic pages.

In earlier articles we used Selenium to drive a real browser and grab pages that are rendered dynamically by JavaScript. Selenium can be hooked into Scrapy in just the same way.

With this approach we no longer need to care about which requests a page fires off while loading, or about the rendering process itself; we simply grab the final rendered result. What you see is what you crawl.

Example

A small goal

First, let's set a small goal. In a previous article we used Selenium to grab product information from a certain e-commerce site; in this article we will use the same site again, and we thank that site for providing us with material.

Preparation

Please make sure you have installed Scrapy, Selenium, and the browser driver that Selenium requires. If you haven't, please check the earlier articles in this series.
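If anything is missing, a typical setup looks roughly like the following. This assumes Chrome as the browser; ChromeDriver has to be downloaded separately, must match your local Chrome version, and should be available on the PATH:

pip install scrapy selenium pymongo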

New project

Create a new Scrapy project; let's call it scrapy_selenium_demo.

scrapy startproject scrapy_selenium_demo

Remember to run the command in a directory you like, preferably one whose path contains only English characters.
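If the command succeeds, Scrapy generates a project skeleton roughly like the one below (the exact layout may vary slightly with your Scrapy version); the files we will be editing are items.py, middlewares.py, pipelines.py and settings.py:

scrapy_selenium_demo/
    scrapy.cfg
    scrapy_selenium_demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py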

Create a new Spider with the following command:

scrapy genspider jd www.jd.com

Next, edit settings.py and set ROBOTSTXT_OBEY to False; otherwise we won't be able to fetch the product data, because the site's robots.txt does not allow it to be crawled.

ROBOTSTXT_OBEY = False

Defining data structures

The first step is to define the data structure we are going to scrape as an Item:

import scrapy


class ProductItem(scrapy.Item):
    collection = 'products'

    image = scrapy.Field()
    price = scrapy.Field()
    name = scrapy.Field()
    commit = scrapy.Field()
    shop = scrapy.Field()
    icons = scrapy.Field()

Here we define six fields, exactly the same as in the earlier Selenium example, plus a collection attribute, which is the name of the MongoDB collection the data will be stored in.

Spider

Next, we’ll define our Spider, starting with a start_requests() method, as shown in the following example:

# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class JdSpider(Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']

    def start_requests(self):
        base_url = 'https://search.jd.com/Search?keyword=iPhone&ev=exbrand_Apple'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/79.0.3945.117 Safari/537.36',
            'referer': 'https://www.jd.com/'
        }
        for page in range(1, self.settings.get('MAX_PAGE') + 1, 2):
            url = base_url + '&page=' + str(page)
            yield Request(url=url, callback=self.parse, headers=headers)

The maximum page number is controlled by MAX_PAGE, which needs to be added to settings.py as well:

MAX_PAGE = 3

In start_requests() we build, by simple URL concatenation, every page we need to visit. The loop steps by 2 because of the site's paging rule: the page parameter in the URL increases by 2 for each visible results page.
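As a quick illustration (this snippet is not part of the project, just a demonstration of the paging logic), with MAX_PAGE = 3 the loop yields requests for these two URLs:

# standalone illustration of the URL concatenation in start_requests()
base_url = 'https://search.jd.com/Search?keyword=iPhone&ev=exbrand_Apple'
for page in range(1, 3 + 1, 2):
    print(base_url + '&page=' + str(page))
# https://search.jd.com/Search?keyword=iPhone&ev=exbrand_Apple&page=1
# https://search.jd.com/Search?keyword=iPhone&ev=exbrand_Apple&page=3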

Integrating Selenium

Next we need to actually fetch the pages for these requests, and we do that with Selenium.

The way to hook it in is through a Downloader Middleware. Example code is as follows:

# -*- coding: utf-8 -*-
from logging import getLogger

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware(object):
    def __init__(self, timeout=None, service_args=[]):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        # run Chrome in headless (windowless) mode
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        self.driver = webdriver.Chrome(service_args=service_args, chrome_options=chrome_options)
        self.driver.set_window_size(1400, 700)
        self.driver.implicitly_wait(self.timeout)
        self.driver.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.driver, self.timeout)

    def __del__(self):
        self.driver.close()

    def process_request(self, request, spider):
        self.logger.debug('Chrome is Starting')
        try:
            # the page number comes from request.meta and defaults to 1
            page = request.meta.get('page', 1)
            self.driver.get(request.url)
            if page > 1:
                # type the page number into the pager input and click the jump button
                input = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, '//*[@id="J_bottomPage"]/span[2]/input')))
                button = self.wait.until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="J_bottomPage"]/span[2]/a')))
                input.clear()
                input.send_keys(page)
                button.click()
            # returning an HtmlResponse here short-circuits Scrapy's own downloader
            return HtmlResponse(url=request.url, body=self.driver.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   service_args=crawler.settings.get('CHROME_SERVICE_ARGS'))

Add the Downloader Middleware configuration to settings.py as follows:

DOWNLOADER_MIDDLEWARES = {
   'scrapy_selenium_demo.middlewares.SeleniumMiddleware': 543,
}
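Note that the middleware's from_crawler() also reads SELENIUM_TIMEOUT and CHROME_SERVICE_ARGS from the settings, so they should be defined in settings.py as well. The values below are only an assumption (a 30-second wait and no extra ChromeDriver arguments); adjust them as you see fit:

# assumed values; tune the timeout to your network conditions
SELENIUM_TIMEOUT = 30
CHROME_SERVICE_ARGS = []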

Parsing the page

The Downloader Middleware returns an HtmlResponse built from the page source rendered by Chrome, and Scrapy hands it straight to the Spider, where we parse it like this:

def parse(self, response):
    products = response.css('#J_goodsList .gl-item .gl-i-wrap')
    for product in products:
        item = ProductItem()
        item['image'] = product.css('.p-img a img::attr("src")').extract_first()
        item['price'] = product.css('.p-price i::text').extract_first()
        item['name'] = product.css('.p-name em::text').extract_first()
        item['commit'] = product.css('.p-commit a::text').extract_first()
        item['shop'] = product.css('.p-shop a::text').extract_first()
        item['icons'] = product.css('.p-icons .goods-icons::text').extract_first()
        yield item

Storing to MongoDB

We add a MongoPipeline to save the data to MongoDB, as follows:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB')
                   )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # store the item in the collection named by the Item's "collection" attribute,
        # falling back to the class name if it is not defined
        name = getattr(item, 'collection', item.__class__.__name__)
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

Add the corresponding configuration to settings.py:

ITEM_PIPELINES = {
   'scrapy_selenium_demo.pipelines.MongoPipeline': 300,
}
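The pipeline's from_crawler() likewise reads MONGO_URI and MONGO_DB from the settings, so these also need to be added to settings.py. A minimal sketch, assuming a local MongoDB instance and a database named after the project:

# assumed values: local MongoDB, database named after the project
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'scrapy_selenium_demo'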

Now that we’re done with the main program, we can run the crawler with the following command:

scrapy crawl jd
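To confirm that the data actually landed in MongoDB, a quick sanity check with pymongo might look like this (assuming the MONGO_URI and MONGO_DB values sketched above, and the 'products' collection name defined on the Item):

# quick sanity check, run after the crawl finishes
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['scrapy_selenium_demo']
print(db['products'].count_documents({}))   # number of products stored
print(db['products'].find_one())            # peek at one stored document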

The results won't be pasted here; the code has been uploaded to the code repositories, and interested readers can grab it from there.

Sample code

All of the code in this series is available on GitHub and Gitee.

Example code -Github

Example code -Gitee