Introduction

scrapyd_pyppeteer: Python 3.8 + Selenium + Pyppeteer + Scrapy + scrapyd-client + LogParser, for using Pyppeteer asynchronously inside Scrapy.

In short, Pyppeteer is a Python port of Puppeteer, with asyncio support and Docker-friendly deployment.

There is very little about this setup on the Internet, but I finally figured it out.

Splash clusters render pages poorly, and Selenium Grid suffers long-term instability from memory problems, so Pyppeteer with asyncio is an important supplement.

Quick start

1. Start Scrapyd

# dockerhub
# docker run -p 6800:6800 chinaclark1203/scrapyd_pyppeteer

# ali cloud
docker run -p 6800:6800 registry.cn-hangzhou.aliyuncs.com/luzihang/scrapyd_pyppeteer

Scrapyd Startup log:

[2020-07-02 17:41:46,768] INFO in logparser.run: LogParser version: 0.8.2
[2020-07-02 17:41:46,768] INFO in logparser.run: Use 'logparser -h' to get help
[2020-07-02 17:41:46,769] INFO in logparser.run: Main pid: 10
[2020-07-02 17:41:46,769] INFO in logparser.run: Check out the config file below for more advanced settings.
********************************************************************************
Loading settings from /usr/local/lib/python3.8/site-packages/logparser/settings.py
********************************************************************************
[2020-07-02 17:41:46,770] DEBUG in logparser.run: Reading settings from command line: Namespace(delete_json_files=True, disable_telnet=False, main_pid=0, scrapyd_logs_dir='/code/logs', scrapyd_server='127.0.0.1:6800', sleep='10', verbose=False)
[2020-07-02 17:41:46,770] DEBUG in logparser.run: Checking config
[2020-07-02 17:41:46,770] INFO in logparser.run: SCRAPYD_SERVER: 127.0.0.1:6800
[2020-07-02 17:41:46,770] ERROR in logparser.run: Check config fail:
SCRAPYD_LOGS_DIR not found: '/code/logs'
Check and update your settings in /usr/local/lib/python3.8/site-packages/logparser/settings.py
2020-07-02T17:41:46+0800 [-] Loading /usr/local/lib/python3.8/site-packages/scrapyd/txapp.py...
2020-07-02T17:41:47+0800 [-] Scrapyd web console available at http://0.0.0.0:6800/
2020-07-02T17:41:47+0800 [-] Loaded.
2020-07-02T17:41:47+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 18.9.0 (/usr/local/bin/python 3.8.3) starting up.
2020-07-02T17:41:47+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2020-07-02T17:41:47+0800 [-] Site starting on 6800
2020-07-02T17:41:47+0800 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7f680d7bf790>
2020-07-02T17:41:47+0800 [Launcher] Scrapyd 1.2.0 started: max_proc=200, runner='scrapyd.runner'

2. Visit Scrapyd

Open http://127.0.0.1:6800 in a browser.
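Besides the web console, the same port serves Scrapyd's JSON API (`daemonstatus.json` for a health check, `schedule.json` to queue a spider run). A minimal sketch of building those calls with the standard library; the base URL matches the port mapping above, while the project and spider names are placeholders:

```python
from urllib.parse import urlencode
from urllib.request import Request

SCRAPYD = "http://127.0.0.1:6800"  # assumption: container published on the default port

def daemonstatus_url(base=SCRAPYD):
    # GET /daemonstatus.json reports pending/running/finished job counts
    return f"{base}/daemonstatus.json"

def schedule_request(project, spider, base=SCRAPYD):
    # POST /schedule.json queues one run of `spider` in `project`
    data = urlencode({"project": project, "spider": spider}).encode()
    return Request(f"{base}/schedule.json", data=data, method="POST")

print(daemonstatus_url())  # http://127.0.0.1:6800/daemonstatus.json
```

Pass the returned `Request` to `urllib.request.urlopen` (or use `curl`) once the container is up.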

3. Go into the container and run the test code

➜ ~ docker ps
CONTAINER ID   IMAGE                                            COMMAND             CREATED         STATUS         PORTS                    NAMES
2c3aba2b8d2b   192.168.95.55:7777/scrapyhub/scrapyd_pyppeteer   "./entrypoint.sh"   4 minutes ago   Up 4 minutes   0.0.0.0:6800->6800/tcp   bold_gagarin
7a299c33e17c   joyzoursky/python-chromedriver:3.7               "bash"              2 days ago      Up 2 days                               magical_galileo
➜ ~ docker exec -it 2c3aba2b8d2b bash
root@2c3aba2b8d2b:/code# python
Python 3.8.3 (default, Jun  9 2020, 17:39:39)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> from pyppeteer import launch
>>> async def main():
...     browser = await launch(headless=True, executablePath='/usr/bin/google-chrome', args=['--no-sandbox', '--disable-dev-shm-usage'])
...     page = await browser.newPage()
...     await page.setViewport(viewport={'width': 1280, 'height': 800})
...     await page.setExtraHTTPHeaders(
...         {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'})
...     # await page.goto('http://www.jcfc.cn/')
...     await page.goto('https://httpbin.org/get')
...     # await page.goto('https://news.baidu.com/')
...     # await page.screenshot(path='example.png', fullPage=True)
...     await asyncio.sleep(5)
...     return await page.content()
...
>>> res = asyncio.get_event_loop().run_until_complete(main())
>>> print(res)
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US",
    "Host": "httpbin.org",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5efe9ca3-a9d80461ef0dc7725f7e1539"
  },
  "origin": "13.67.73.63",
  "url": "https://httpbin.org/get"
}
</pre></body></html>
>>>

Test code:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=True, executablePath='/usr/bin/google-chrome',
                           args=['--no-sandbox', '--disable-dev-shm-usage'])
    page = await browser.newPage()
    await page.setViewport(viewport={'width': 1280, 'height': 800})
    await page.setExtraHTTPHeaders(
        {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'})
    # await page.goto('http://www.jcfc.cn/')
    await page.goto('https://httpbin.org/get')
    # await page.goto('https://news.baidu.com/')
    # await page.screenshot(path='example.png', fullPage=True)
    await asyncio.sleep(5)
    content = await page.content()
    await browser.close()  # release the Chrome process
    return content

res = asyncio.get_event_loop().run_until_complete(main())
print(res)
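The script above fetches a single page, but the point of Pyppeteer over Selenium Grid is that such coroutines can run concurrently on one event loop. A minimal sketch of that concurrency pattern using only asyncio; the `fetch` coroutine is a stand-in introduced for illustration (it sleeps instead of rendering), not part of pyppeteer:

```python
import asyncio

async def fetch(url, delay=0.01):
    # Stand-in for a Pyppeteer page fetch: sleep instead of rendering.
    await asyncio.sleep(delay)
    return f"content of {url}"

async def crawl(urls):
    # gather() runs all fetches concurrently, so total wall time is
    # roughly max(delay) rather than sum(delay).
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(crawl(["https://httpbin.org/get", "https://news.baidu.com/"]))
print(results)
```

Swapping the body of `fetch` for the `launch`/`goto`/`content` sequence above gives a concurrent headless-Chrome crawler.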

4. Pair with a scheduling platform such as ScrapydWeb, and deploy applications through Scrapyd
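Deploying through scrapyd-client assumes a `[deploy]` target in the Scrapy project's `scrapy.cfg`. A minimal sketch; the target name, project name, and URL are placeholders for your own setup:

```ini
[settings]
default = myproject.settings

[deploy:docker]
url = http://127.0.0.1:6800/
project = myproject
```

With this in place, `scrapyd-deploy docker -p myproject` builds an egg and uploads it to the containerized Scrapyd, after which ScrapydWeb (with LogParser already running in the image) can schedule and monitor the jobs.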