Life is short. I use Python
Previous installments in this series:
Learn Python crawler (1): The beginning
Learn Python crawler (2): Pre-preparation (1) Basic library installation
Learn Python crawler (3): Pre-preparation (2) Linux basics
Learn Python crawler (4): Pre-preparation (3) Docker basics
Learn Python crawler (5): Pre-preparation (4) Database basics
Learn Python crawler (6): Pre-preparation (5) Crawler framework installation
Learn Python crawler (7): HTTP basics
Learn Python crawler (8): Web basics
Learn Python crawler (9): Crawler basics
Learn Python crawler (10): Session and Cookies
Learn Python crawler (11): Urllib
Learn Python crawler (12): Urllib
Learn Python crawler (13): Urllib
Learn Python crawler (14): Urllib
Learn Python crawler (15): Urllib
Learn Python crawler (16): Urllib
Learn Python crawler (17): Basic usage of Requests
Learn Python crawler (18): Advanced Requests operations
Learn Python crawler (19): Basic Xpath operations
Learn Python crawler (20): Advanced Xpath
Learn Python crawler (21): Parsing library Beautiful Soup
Learn Python crawler (22): Beautiful Soup
Learn Python crawler (23): Getting started with the parsing library pyQuery
Learn Python crawler (24): 2019 Douban movie rankings
Learn Python crawler (25): Crawling stock information
Learn Python crawler (26): You can't even afford a second-hand house in Shanghai
Learn Python crawler (27): Selenium, an automated testing framework, from getting started to giving up (1)
Learn Python crawler (28): Selenium, an automated testing framework, from getting started to giving up (2)
Learn Python crawler (29): Selenium fetches product information from a large e-commerce site
Learn Python crawler (30): Proxy basics
Learn Python crawler (31): Build a simple proxy pool yourself
Learn Python crawler (32): Basics of the asynchronous request library AIOHTTP
Learn Python crawler (33): Introduction to the crawler framework Scrapy (1)
Learn Python crawler (34): Introduction to the crawler framework Scrapy (2)
Learn Python crawler (35): Introduction to the crawler framework Scrapy (3) Selector
Learn Python crawler (36): Introduction to the crawler framework Scrapy (4) Downloader Middleware
Learn Python crawler (37): Introduction to the crawler framework Scrapy (5) Spider Middleware
Learn Python crawler (38): Introduction to the crawler framework Scrapy (6) Item Pipeline
Learn Python crawler (39): Introduction to the JavaScript rendering service Scrapy-Splash
Learn Python crawler (40): Crawler framework Scrapy docking Selenium
Introduction
Apart from connecting Scrapy with Selenium, is there another way to crawl pages that are rendered dynamically by JavaScript?
In this article, we introduce how to connect Scrapy with Splash to crawl such dynamically rendered pages.
Example
Preparation
Please make sure you have correctly installed the Splash service as well as the scrapy-splash library. If you have not, see the previous article, Introduction to the JavaScript rendering service Scrapy-Splash, for installation instructions.
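As a rough reminder (the previous article covers this in detail), a typical setup runs Splash as a Docker container and installs the scrapy-splash library with pip:

docker run -d -p 8050:8050 scrapinghub/splash
pip3 install scrapy-splash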
New project
Create a new Scrapy project named scrapy_splash_demo:
scrapy startproject scrapy_splash_demo
Remember to run this in a directory of your choosing, preferably one whose path contains only English characters.
Create a new Spider with the following command:
scrapy genspider jd www.jd.com
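For reference, after these two commands the project should look roughly like this (the default Scrapy template, plus the generated jd.py):

scrapy_splash_demo/
├── scrapy.cfg
└── scrapy_splash_demo/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── jd.py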
This Spider won't implement anything complicated; after all, the main topic of this article is how to connect Scrapy to Splash. (The other main reason, of course, is that the author is lazy...)
Configuration
For the configuration, you can refer to the official GitHub repository: https://github.com/scrapy-plugins/scrapy-splash.
Add the address of the Splash service to settings.py.
SPLASH_URL = 'http://localhost:8050/'
If the Splash service is run on a remote server, configure the address of the remote server. For example, if the Splash service is run on 172.16.15.177, the configuration is as follows:
SPLASH_URL = 'http://172.16.15.177:8050/'
Next, you need to add several downloader middlewares to DOWNLOADER_MIDDLEWARES, as follows:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
We also need to add a spider middleware to SPIDER_MIDDLEWARES, as follows:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
Next, we need to configure a deduplication filter class, SplashAwareDupeFilter, as follows:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
We also need to configure the cache storage class, SplashAwareFSCacheStorage, as follows:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
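Putting the above together, the Splash-related portion of settings.py ends up looking like this:

SPLASH_URL = 'http://localhost:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'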
And then we can get to work.
Sending the request
With the configuration in place, we can construct a SplashRequest object and pass in the relevant parameters. Scrapy forwards the request to Splash, Splash loads and renders the page, and the result comes back as the Response. This Response is already the Splash-rendered output and can be handed directly to the Spider for parsing.
Let’s take a look at the official example, as follows:
yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)
Here a SplashRequest object is constructed directly; its first two parameters are the target URL and the callback method. We can also pass rendering parameters through args, such as the wait time, which in this case is 0.5 seconds.
For more details, refer to the GitHub repository: https://github.com/scrapy-plugins/scrapy-splash.
Alternatively, we can use a plain scrapy.Request and pass the Splash configuration through the meta attribute. Here's an example:
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,      # optional, default is False
        'magic_response': False,        # optional, default is True
    }
})
These two ways of sending the request are equivalent; use whichever you prefer.
The Lua script used in this article is the same as the one in the previous article. The details of the Lua script are as follows:
function main(splash, args)
  splash:go("https://www.jd.com/")
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
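If you want to check the script before wiring it into Scrapy, a minimal sketch is to POST it straight to Splash's /execute endpoint with the requests library (assuming Splash is listening on localhost:8050) and inspect the keys of the JSON that comes back:

import requests

lua_script = """
function main(splash, args)
  splash:go("https://www.jd.com/")
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
"""

# Splash accepts the script and its arguments as a JSON POST body.
resp = requests.post('http://localhost:8050/execute', json={'lua_source': lua_script})
print(resp.status_code)
print(sorted(resp.json().keys()))  # expect: ['cookies', 'har', 'jpeg', 'url']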
Next, in the Spider we define the Lua script and use SplashRequest to send the request through Splash; that's all there is to it. The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
  splash:go(args.url)
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
"""


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']

    def start_requests(self):
        url = 'https://www.jd.com/'
        # The default endpoint is render.html, which returns the rendered HTML;
        # to run the Lua script above instead, pass endpoint='execute' and
        # args={'lua_source': lua_script}.
        yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.debug(response.text)
When the Spider is ready, you can run the crawler with the following command:
scrapy crawl jd
The specific results won't be posted here; the Spider simply prints the response data to the log. But if you look at the printed data carefully, you can see that the content originally rendered dynamically by JavaScript has been printed as well, which shows that our Scrapy-Splash integration was successful.
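As a slightly more concrete check than reading raw HTML in the log, parse() could pull a couple of things out of the rendered page. A hypothetical tweak (not part of the original Spider) might look like this:

    def parse(self, response):
        # Selectors see JavaScript-generated content because Splash returned
        # the rendered DOM rather than the raw HTML.
        self.logger.info('page title: %s', response.css('title::text').get())
        self.logger.info('anchor tags on rendered page: %d', len(response.css('a')))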
Sample code
All of the code in this series will be available on Github and Gitee.
Example code -Github
Example code -Gitee