Introduction

scrapyd_pyppeteer: Python 3.8 + Selenium + Pyppeteer + Scrapy + scrapyd-client + LogParser, for using Pyppeteer asynchronously inside Scrapy.

In short, Pyppeteer is a Python port of Puppeteer, with asyncio support and Docker-friendly deployment.

There is very little about this setup on the Internet, but I finally figured it out.

Splash clusters render pages poorly, and Selenium Grid suffers long-term instability from memory problems, so Pyppeteer with asyncio is an important supplement.

Quick start

1. Start Scrapyd

# dockerhub
# docker run -p 6800:6800 chinaclark1203/scrapyd_pyppeteer

# ali cloud
docker run -p 6800:6800 registry.cn-hangzhou.aliyuncs.com/luzihang/scrapyd_pyppeteer

Scrapyd Startup log:

[2020-07-02 17:41:46,768] INFO in logparser.run: LogParser version: 0.8.2
[2020-07-02 17:41:46,768] INFO in logparser.run: Use 'logparser -h' to get help
[2020-07-02 17:41:46,769] INFO in logparser.run: Main pid: 10
[2020-07-02 17:41:46,769] INFO in logparser.run: Check out the config file below for more advanced settings.
********************************************************************************
Loading settings from /usr/local/lib/python3.8/site-packages/logparser/settings.py
********************************************************************************
[2020-07-02 17:41:46,770] DEBUG in logparser.run: Reading settings from command line: Namespace(delete_json_files=True, disable_telnet=False, main_pid=0, scrapyd_logs_dir='/code/logs', scrapyd_server='127.0.0.1:6800', sleep='10', verbose=False)
[2020-07-02 17:41:46,770] DEBUG in logparser.run: Checking config
[2020-07-02 17:41:46,770] INFO in logparser.run: SCRAPYD_SERVER: 127.0.0.1:6800
[2020-07-02 17:41:46,770] ERROR in logparser.run: Check config fail:
SCRAPYD_LOGS_DIR not found: '/code/logs'
Check and update your settings in /usr/local/lib/python3.8/site-packages/logparser/settings.py
2020-07-02T17:41:46+0800 [-] Loading /usr/local/lib/python3.8/site-packages/scrapyd/txapp.py...
2020-07-02T17:41:47+0800 [-] Scrapyd web console available at http://0.0.0.0:6800/
2020-07-02T17:41:47+0800 [-] Loaded.
2020-07-02T17:41:47+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 18.9.0 (/usr/local/bin/python 3.8.3) starting up.
2020-07-02T17:41:47+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2020-07-02T17:41:47+0800 [-] Site starting on 6800
2020-07-02T17:41:47+0800 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7f680d7bf790>
2020-07-02T17:41:47+0800 [Launcher] Scrapyd 1.2.0 started: max_proc=200, runner='scrapyd.runner'

2. Visit Scrapyd

Open http://127.0.0.1:6800 in a browser.
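Besides the web console, the same port serves Scrapyd's JSON API (`daemonstatus.json` for a health check, `schedule.json` to queue a spider run). A minimal sketch of building those calls with the standard library; the base URL matches the port mapping above, while the project and spider names are placeholders:

```python
from urllib.parse import urlencode
from urllib.request import Request

SCRAPYD = "http://127.0.0.1:6800"  # assumption: container published on the default port

def daemonstatus_url(base=SCRAPYD):
    # GET /daemonstatus.json reports pending/running/finished job counts
    return f"{base}/daemonstatus.json"

def schedule_request(project, spider, base=SCRAPYD):
    # POST /schedule.json queues one run of `spider` in `project`
    data = urlencode({"project": project, "spider": spider}).encode()
    return Request(f"{base}/schedule.json", data=data, method="POST")

print(daemonstatus_url())  # http://127.0.0.1:6800/daemonstatus.json
```

Pass the returned `Request` to `urllib.request.urlopen` (or use `curl`) once the container is up.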

3. Go into the container and run the test code

➜ ~ docker ps
CONTAINER ID   IMAGE                                            COMMAND             CREATED         STATUS         PORTS                    NAMES
2c3aba2b8d2b   192.168.95.55:7777/scrapyhub/scrapyd_pyppeteer   "./entrypoint.sh"   4 minutes ago   Up 4 minutes   0.0.0.0:6800->6800/tcp   bold_gagarin
7a299c33e17c   joyzoursky/python-chromedriver:3.7               "bash"              2 days ago      Up 2 days                               magical_galileo
➜ ~ docker exec -it 2c3aba2b8d2b bash
root@2c3aba2b8d2b:/code# python
Python 3.8.3 (default, Jun  9 2020, 17:39:39)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> from pyppeteer import launch
>>> async def main():
...     browser = await launch(headless=True, executablePath='/usr/bin/google-chrome', args=['--no-sandbox', '--disable-dev-shm-usage'])
...     page = await browser.newPage()
...     await page.setViewport(viewport={'width': 1280, 'height': 800})
...     await page.setExtraHTTPHeaders(
...         {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'})
...     # await page.goto('http://www.jcfc.cn/')
...     await page.goto('https://httpbin.org/get')
...     # await page.goto('https://news.baidu.com/')
...     # await page.screenshot(path='example.png', fullPage=True)
...     await asyncio.sleep(5)
...     return await page.content()
...
>>> res = asyncio.get_event_loop().run_until_complete(main())
>>> print(res)
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US",
    "Host": "httpbin.org",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5efe9ca3-a9d80461ef0dc7725f7e1539"
  },
  "origin": "13.67.73.63",
  "url": "https://httpbin.org/get"
}
</pre></body></html>
>>>

Test code:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=True, executablePath='/usr/bin/google-chrome',
                           args=['--no-sandbox', '--disable-dev-shm-usage'])
    page = await browser.newPage()
    await page.setViewport(viewport={'width': 1280, 'height': 800})
    await page.setExtraHTTPHeaders(
        {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'})
    # await page.goto('http://www.jcfc.cn/')
    await page.goto('https://httpbin.org/get')
    # await page.goto('https://news.baidu.com/')
    # await page.screenshot(path='example.png', fullPage=True)
    await asyncio.sleep(5)
    content = await page.content()
    await browser.close()  # release the Chrome process
    return content

res = asyncio.get_event_loop().run_until_complete(main())
print(res)
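The script above fetches a single page, but the point of Pyppeteer over Selenium Grid is that such coroutines can run concurrently on one event loop. A minimal sketch of that concurrency pattern using only asyncio; the `fetch` coroutine is a stand-in introduced for illustration (it sleeps instead of rendering), not part of pyppeteer:

```python
import asyncio

async def fetch(url, delay=0.01):
    # Stand-in for a Pyppeteer page fetch: sleep instead of rendering.
    await asyncio.sleep(delay)
    return f"content of {url}"

async def crawl(urls):
    # gather() runs all fetches concurrently, so total wall time is
    # roughly max(delay) rather than sum(delay).
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(crawl(["https://httpbin.org/get", "https://news.baidu.com/"]))
print(results)
```

Swapping the body of `fetch` for the `launch`/`goto`/`content` sequence above gives a concurrent headless-Chrome crawler.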

4. Pair with a scheduling platform such as ScrapydWeb, and deploy applications through Scrapyd
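Deploying through scrapyd-client assumes a `[deploy]` target in the Scrapy project's `scrapy.cfg`. A minimal sketch; the target name, project name, and URL are placeholders for your own setup:

```ini
[settings]
default = myproject.settings

[deploy:docker]
url = http://127.0.0.1:6800/
project = myproject
```

With this in place, `scrapyd-deploy docker -p myproject` builds an egg and uploads it to the containerized Scrapyd, after which ScrapydWeb (with LogParser already running in the image) can schedule and monitor the jobs.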