Life is short. I use Python
Previous installments in this series:
Learn Python crawler (1): The beginning
Learn Python crawler (2): Pre-preparation (1) Basic library installation
Learn Python crawler (3): Pre-preparation (2) Linux basics
Learn Python crawler (4): Pre-preparation (3) Docker basics
Learn Python crawler (5): Pre-preparation (4) Database basics
Learn Python crawler (6): Pre-preparation (5) Crawler framework installation
Learn Python crawler (7): HTTP basics
Learn Python crawler (8): Web basics
Learn Python crawler (9): Crawler basics
Learn Python crawler (10): Session and Cookies
Learn Python crawler (11): Urllib
Learn Python crawler (12): Urllib
Learn Python crawler (13): Urllib
Learn Python crawler (14): Urllib
Learn Python crawler (15): Urllib
Learn Python crawler (16): Urllib
Learn Python crawler (17): Basic usage of Requests
Learn Python crawler (18): Advanced Requests operations
Learn Python crawler (19): Basic Xpath operations
Learn Python crawler (20): Advanced Xpath
Learn Python crawler (21): Parsing library Beautiful Soup
Learn Python crawler (22): Beautiful Soup
Learn Python crawler (23): Getting started with the parsing library pyQuery
Learn Python crawler (24): 2019 Douban movie rankings
Learn Python crawler (25): Crawling stock information
Learn Python crawler (26): You can't even afford a second-hand house in Shanghai
Learn Python crawler (27): Selenium, an automated testing framework, from getting started to giving up (1)
Learn Python crawler (28): Selenium, an automated testing framework, from getting started to giving up (2)
Learn Python crawler (29): Selenium fetches product information from a large e-commerce site
Learn Python crawler (30): Proxy basics
Learn Python crawler (31): Build a simple proxy pool yourself
Learn Python crawler (32): Basics of the asynchronous request library AIOHTTP
Learn Python crawler (33): Introduction to the crawler framework Scrapy (1)
Learn Python crawler (34): Introduction to the crawler framework Scrapy (2)
Learn Python crawler (35): Introduction to the crawler framework Scrapy (3) Selector
Learn Python crawler (36): Introduction to the crawler framework Scrapy (4) Downloader Middleware
Learn Python crawler (37): Introduction to the crawler framework Scrapy (5) Spider Middleware
Learn Python crawler (38): Introduction to the crawler framework Scrapy (6) Item Pipeline
Learn Python crawler (39): Introduction to the JavaScript rendering service Scrapy-Splash
Learn Python crawler (40): Crawler framework Scrapy docking Selenium
Introduction
Apart from connecting Scrapy with Selenium, is there another way to crawl pages that are rendered dynamically by JavaScript?
In this article, we introduce how to connect Scrapy with Splash to crawl such dynamically rendered pages.
Example
Preparation
Please make sure you have correctly installed the Splash service as well as the scrapy-splash library. If you have not, see the previous article, Introduction to the JavaScript rendering service Scrapy-Splash, for installation instructions.
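As a rough reminder (the previous article covers this in detail), a typical setup runs Splash as a Docker container and installs the scrapy-splash library with pip:

docker run -d -p 8050:8050 scrapinghub/splash
pip3 install scrapy-splash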
New project
Create a new Scrapy project named scrapy_splash_demo:
scrapy startproject scrapy_splash_demo
Remember to run this in a directory of your choosing, preferably one whose path contains only English characters.
Create a new Spider with the following command:
scrapy genspider jd www.jd.com
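For reference, after these two commands the project should look roughly like this (the default Scrapy template, plus the generated jd.py):

scrapy_splash_demo/
├── scrapy.cfg
└── scrapy_splash_demo/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── jd.py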
This Spider won't implement anything complicated; after all, the main topic of this article is how to connect Scrapy to Splash. (The other main reason, of course, is that the author is lazy...)
Configuration
For the configuration, you can refer to the official GitHub repository: https://github.com/scrapy-plugins/scrapy-splash.
Add the address of the Splash service to settings.py.
SPLASH_URL = 'http://localhost:8050/'
If the Splash service is run on a remote server, configure the address of the remote server. For example, if the Splash service is run on 172.16.15.177, the configuration is as follows:
SPLASH_URL = 'http://172.16.15.177:8050/'
Next, you need to add several downloader middlewares to DOWNLOADER_MIDDLEWARES, as follows:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
We also need to add a spider middleware to SPIDER_MIDDLEWARES, as follows:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
Next, we need to configure a deduplication filter class, SplashAwareDupeFilter, as follows:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
We also need to configure the cache storage class, SplashAwareFSCacheStorage, as follows:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
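Putting the above together, the Splash-related portion of settings.py ends up looking like this:

SPLASH_URL = 'http://localhost:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'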
And then we can get to work.
Sending the request
With the configuration in place, we can construct a SplashRequest object and pass in the relevant parameters. Scrapy forwards the request to Splash, Splash loads and renders the page, and the result comes back as the Response. This Response is already the Splash-rendered output and can be handed directly to the Spider for parsing.
Let’s take a look at the official example, as follows:
yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)
Here a SplashRequest object is constructed directly; its first two parameters are the target URL and the callback method. We can also pass rendering parameters through args, such as the wait time, which in this case is 0.5 seconds.
For more details, refer to the GitHub repository: https://github.com/scrapy-plugins/scrapy-splash.
Alternatively, we can use a plain scrapy.Request and pass the Splash configuration through the meta attribute. Here's an example:
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,      # optional, default is False
        'magic_response': False,        # optional, default is True
    }
})
These two ways of sending the request are equivalent; use whichever you prefer.
The Lua script used in this article is the same as the one in the previous article. The details of the Lua script are as follows:
function main(splash, args)
  splash:go("https://www.jd.com/")
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
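If you want to check the script before wiring it into Scrapy, a minimal sketch is to POST it straight to Splash's /execute endpoint with the requests library (assuming Splash is listening on localhost:8050) and inspect the keys of the JSON that comes back:

import requests

lua_script = """
function main(splash, args)
  splash:go("https://www.jd.com/")
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
"""

# Splash accepts the script and its arguments as a JSON POST body.
resp = requests.post('http://localhost:8050/execute', json={'lua_source': lua_script})
print(resp.status_code)
print(sorted(resp.json().keys()))  # expect: ['cookies', 'har', 'jpeg', 'url']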
Next, in the Spider we define the Lua script and use SplashRequest to send the request through Splash; that's all there is to it. The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
  splash:go(args.url)
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
"""


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']

    def start_requests(self):
        url = 'https://www.jd.com/'
        # The default endpoint is render.html, which returns the rendered HTML;
        # to run the Lua script above instead, pass endpoint='execute' and
        # args={'lua_source': lua_script}.
        yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.debug(response.text)
When the Spider is ready, you can run the crawler with the following command:
scrapy crawl jd
The specific results won't be posted here; the Spider simply prints the response data to the log. But if you look at the printed data carefully, you can see that the content originally rendered dynamically by JavaScript has been printed as well, which shows that our Scrapy-Splash integration was successful.
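As a slightly more concrete check than reading raw HTML in the log, parse() could pull a couple of things out of the rendered page. A hypothetical tweak (not part of the original Spider) might look like this:

    def parse(self, response):
        # Selectors see JavaScript-generated content because Splash returned
        # the rendered DOM rather than the raw HTML.
        self.logger.info('page title: %s', response.css('title::text').get())
        self.logger.info('anchor tags on rendered page: %d', len(response.css('a')))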
Sample code
All of the code in this series will be available on Github and Gitee.
Example code -Github
Example code -Gitee