1 Installation (on Linux)

First, install Docker:

curl -sSL https://get.daocloud.io/docker | sh   # install Docker with the DaoCloud script
su root                                         # switch to the root user
systemctl start docker                          # start Docker
systemctl restart docker                        # restart Docker

2 Pull the image

sudo docker pull scrapinghub/splash

3 Start the container:

sudo docker run -p 8050:8050 -p 5023:5023 --restart=always scrapinghub/splash

Splash now listens on 0.0.0.0 and binds ports 8050 (HTTP) and 5023 (Telnet).

Splash is now running. If you need to access it remotely, open the inbound and outbound rules for these ports in the Aliyun server's security group.
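A quick way to confirm the service is up is to request the render.html endpoint and check for a 200 response. A minimal sketch, assuming Splash was started on this same machine (the target URL is only an example):

import requests

# Assumes Splash runs locally; for remote access use the server's public IP instead of localhost
splash = "http://localhost:8050/render.html"

resp = requests.get(splash, params={"url": "https://example.com", "wait": 1}, timeout=30)
print(resp.status_code)   # 200 means the page was rendered
print(resp.text[:200])    # beginning of the rendered HTML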

For details about Splash usage, see:

splash-cn-doc.readthedocs.io/zh_CN/lates…

To make Docker start automatically on boot, run: systemctl enable docker.service

When starting the container with docker run, use the --restart parameter.

Example: docker run -d --name mysql -p 3306:3306 --restart=always -v /var/lib/mysql:/var/lib/mysql -v /etc/localtime:/etc/localtime 39.106.193.240:9100/joss/mysql:5.7

always: restart the container regardless of its exit status.

If --restart=always was not specified at run time, use docker update to set it.

Command: docker update --restart=always <container ID>

Example: docker update --restart=always 9bb3df5a70bf

Miscellaneous commands

# View running Docker containers on Linux

docker ps

# Kill a Docker container
docker kill 338****0d

If requests that should repeat are only issued once (the loop stops after a single pass), check whether dont_filter is set on scrapy.Request, since Scrapy's duplicate filter drops repeated URLs by default; see the sketch below.
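For illustration, a minimal sketch (the spider name is made up) of re-requesting the same page by disabling the duplicate filter:

import scrapy


class LoopSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate dont_filter
    name = 'loop_demo'
    start_urls = ['https://www.guazi.com/sh/buy/']

    def parse(self, response):
        # Without dont_filter=True, Scrapy's duplicate filter would drop this
        # request because the same URL has already been crawled
        yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)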

Requests with Splash (render.html)

# The combination of native Requests and Splash
import requests
from fake_useragent import UserAgent
splash_url = "Http://192.168.59.103:8050/render.html? url={}&wait=1"
url = 'https://www.guazi.com/sh/buy/'
headers = {"User-Agent": UserAgent().random}
response = requests.get(splash_url.format(url), headers=headers)
response.encoding='utf-8'
print(response.text)

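The same pattern works with Splash's other rendering endpoints. As a small sketch under the same placeholder host (the output filename is arbitrary), render.png returns a screenshot of the rendered page:

import requests
from fake_useragent import UserAgent

# render.png returns the rendered page as PNG bytes
splash_url = "http://192.168.59.103:8050/render.png?url={}&wait=1"
url = 'https://www.guazi.com/sh/buy/'

response = requests.get(splash_url.format(url), headers={"User-Agent": UserAgent().random})
with open('guazi.png', 'wb') as f:   # write the screenshot in binary mode
    f.write(response.content)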

Requests with Splash (Lua script)

Execute Lua code via the execute endpoint:
import requests
from fake_useragent import UserAgent
from urllib.parse import quote

url = "https://www.guazi.com/sh/buy/"
lua_script = '''
function main(splash, args)
    splash:go('{}')
    splash:wait(2)
    return splash:html()
end
'''.format(url)
splash_url = "Http://192.168.59.103:8050/execute? lua_source={}".format(quote(lua_script))

headers = {"User-Agent": UserAgent().random}
print(splash_url)
response = requests.get(splash_url, headers=headers)
response.encoding = 'utf-8'
print(response.text)

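Instead of URL-encoding the script by hand, you can let requests build the query string and pass the target URL as a separate Splash argument, which the script reads back through args.url. A minimal sketch under the same placeholder host:

import requests
from fake_useragent import UserAgent

lua_script = '''
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    return splash:html()
end
'''

# requests encodes lua_source and url for us; extra query parameters are
# exposed inside the script as fields of args (here args.url)
response = requests.get(
    "http://192.168.59.103:8050/execute",
    params={"lua_source": lua_script, "url": "https://www.guazi.com/sh/buy/"},
    headers={"User-Agent": UserAgent().random},
)
response.encoding = 'utf-8'
print(response.text)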

Scrapy with Splash

Add the address of the Splash service to the settings.py of the corresponding Scrapy project:

SPLASH_URL = 'http://192.168.59.103:8050'
# Add the Splash downloader middlewares to DOWNLOADER_MIDDLEWARES and set the priority of HttpCompressionMiddleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Add the Splash SplashDeduplicateArgsMiddleware to SPIDER_MIDDLEWARES

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# You can also set the Splash-aware duplicate filter -- DUPEFILTER_CLASS

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# If you use Scrapy's HTTP cache (scrapy.contrib.httpcache.FilesystemCacheStorage), switch to the Splash-aware storage

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'


# Then write the spider (way 1)
import scrapy
from scrapy_splash import SplashRequest


class BaiduSpider(scrapy.Spider):
    name = 'guazi'
    allowed_domains = ['guazi.com']
    start_urls = ['https://www.guazi.com/sh/buy/']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0], dont_filter=True, args={'wait': 1})

    def parse(self, response):
        print(response.text)

        
# Way 2

import scrapy
from scrapy_splash import SplashRequest


class BaiduSpider(scrapy.Spider):
    name = 'guazi2'
    allowed_domains = ['guazi.com']
    start_urls = ['https://www.guazi.com/sh/buy/']

    def start_requests(self):
        lua_script = '''
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            return {html = splash:html()}
        end
        '''
        yield SplashRequest(url=self.start_urls[0], endpoint='execute', args={'lua_source': lua_script})

    def parse(self, response):
        print(response.text)

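SplashRequest also forwards other standard Splash rendering arguments through args. A hedged sketch (the spider name is made up and the timeout/images values are only illustrative): raising the render timeout and skipping image downloads can speed up rendering.

import scrapy
from scrapy_splash import SplashRequest


class GuaziArgsSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only
    name = 'guazi_args'
    allowed_domains = ['guazi.com']
    start_urls = ['https://www.guazi.com/sh/buy/']

    def start_requests(self):
        # wait, timeout and images are standard Splash rendering arguments
        # for the default render.html endpoint
        yield SplashRequest(
            self.start_urls[0],
            args={'wait': 1, 'timeout': 60, 'images': 0},
        )

    def parse(self, response):
        print(response.text)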

For more details, see the documentation:

splash-cn-doc.readthedocs.io/zh_CN/lates…