Pyppeteer overview

Puppeteer is the Google Chrome team’s official headless Chrome tool. It is a Node library that provides a high-level API for controlling Chrome over the DevTools Protocol without a visible UI, and it can also be configured to drive full (non-headless) Chrome. Chrome has long been the dominant browser, so headless Chrome is on track to become the industry standard for automated testing of web applications. With Puppeteer you can drive Chrome on Linux for a wide range of applications. Pyppeteer is a Python port of Puppeteer that lets you control headless Chrome from Python.

Pyppeteer on GitHub

Problems in use

  • The close() method provided by the Pyppeteer API cannot really shut down the browser, leaving many zombie processes behind
  • A websockets version that is too new causes the error pyppeteer.errors.NetworkError: Protocol error Network.getCookies: Target closed
  • Chromium hangs when many pages are opened at the same time
  • The browser window is large but the rendered content area is small
  • Running tasks in a while True loop makes the Python script consume more and more memory until the program appears to hang

Solutions to each of these problems are provided below.

Using Pyppeteer

Pyppeteer installation

python3 -m pip install pyppeteer

Chromium is downloaded automatically the first time you use Pyppeteer; alternatively, download the latest browser build yourself and specify its path in the code. Chromium download address

Simple introduction


from pyppeteer import launch
import asyncio

async def main():
    # Create a browser
    browser = await launch({
        'executablePath': '/path/to/your/Chromium.app/Contents/MacOS/Chromium',  # the Chromium you downloaded
    })
    # Open a page; the same browser can open multiple pages
    page = await browser.newPage()
    await page.goto('https://baidu.com')  # visit the specified page
    await page.screenshot(path='example.png')  # take a screenshot
    await page.close()  # close the page
    await browser.close()  # close the browser (opening multiple pages was found to generate many zombie processes)

asyncio.get_event_loop().run_until_complete(main())

Running the code above produces a screenshot of the page. If you hit the error pyppeteer.errors.NetworkError: Protocol error Network.getCookies: Target closed while running it, downgrade the websockets package:

pip uninstall websockets                       # remove the current websockets
pip install websockets==6.0 --force-reinstall  # install version 6.0

Important parameters and methods


import asyncio
from pyppeteer import launch


async def intercept_request(req):
    # Do not load CSS, image, or other static resources
    if req.resourceType in ["image", "media", "eventsource", "websocket", "stylesheet", "font"]:
        await req.abort()  # abort the request
    else:
        res = {
            "method": req.method,
            "url": req.url,
            "data": "" if req.postData == None else req.postData,
            "res": "" if req.response == None else req.response
        }
        print(res)  # print the request content
        await req.continue_()  # parameters can be passed here to redirect the request or change its headers

async def intercept_response(res):
    resourceType = res.request.resourceType
    # Intercept Ajax responses to get their data
    if resourceType in ['xhr']:
        resp = await res.json()
        print(resp)  # the data could be stored with MySQL, Redis, or a dedicated class

async def main():
    # Create a browser
    browser = await launch({
        'executablePath': '/path/to/your/Chromium.app/Contents/MacOS/Chromium',  # the Chromium you downloaded
        'headless': False,  # turn off headless mode, mainly for debugging in a test environment
        'devtools': True,   # auto-open Chromium DevTools for each tab (setting this forces headless off)
        'args': [
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',  # --no-sandbox is required when running inside Docker
            '--disable-setuid-sandbox',
            '--disable-gpu',
        ],
        'dumpio': True,  # pipe the browser process's stderr and stdout to the main application; if True, Chromium console output is printed by the main application
    })
    # Open a page; the same browser can open multiple pages
    page = await browser.newPage()
    # Whether to enable JS; with enabled=False nothing is rendered. Enable it if the page makes Ajax requests
    await page.setJavaScriptEnabled(enabled=True)
    # Enable request interception so the two callbacks registered below run before each request is sent and when each response arrives
    await page.setRequestInterception(value=True)
    page.on('request', intercept_request)    # request content
    page.on('response', intercept_response)  # response content
    await page.goto('https://baidu.com')  # visit the specified page
    await page.screenshot(path='example.png')  # take a screenshot
    await page.close()  # close the page
    await browser.close()  # close the browser (opening multiple pages was found to generate many zombie processes)

asyncio.get_event_loop().run_until_complete(main())

Zombie processes

Cause analysis

When a parent process creates a new child with the fork() system call, the kernel allocates an entry for the child in the process table and stores its information there, including the identity of its parent process. When the child later exits (for example via the exit call), it is not actually destroyed; it leaves behind a data structure known as a zombie (the exit system call terminates the process, but it only turns a normal process into a zombie rather than destroying it completely). At this point the process table entry holds the process’s exit code, the CPU time it consumed during execution, and similar data, which is kept until the system passes it to the parent process. A defunct process therefore exists in the window after the child has terminated but before the parent has read that data. The zombie child has already given up almost all of its memory, has no executable code, and cannot be scheduled; it only keeps a slot in the process table so that other processes can collect its exit status, and it no longer occupies any other storage.

If the parent neither installs a SIGCHLD handler that calls wait() or waitpid() for the terminated child nor explicitly ignores the signal, the child remains a zombie. If the parent itself terminates, init automatically adopts the child and reaps it, so the zombie can still be removed. Nginx, for example, runs as a background daemon by default and works exactly this way: first, Nginx creates a child process; second, the original Nginx process exits; third, the Nginx child is adopted by the init process.
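As a minimal illustration of this mechanism (a sketch, not part of the article's setup; Unix only, using only the standard os module), the script below forks a child that exits immediately; until the parent reaps it with waitpid(), ps lists the child as defunct:

import os
import time

pid = os.fork()
if pid == 0:
    # Child: exit immediately; it now stays a zombie until the parent reaps it.
    os._exit(0)
else:
    # Parent: while this sleep runs, `ps -o pid,stat,cmd` shows the child with state Z (defunct).
    time.sleep(10)
    reaped, status = os.waitpid(pid, 0)  # reading the exit status removes the zombie's process-table entry
    print('reaped child %d, exit status %d' % (reaped, os.WEXITSTATUS(status)))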

Inside a Docker container, the process started by CMD runs as PID 1, the container’s init process. One way to keep zombies under control is therefore to run the task script under bash, so that bash is PID 1 and reaps any orphaned Chrome children:
CMD ["/bin/bash", "-c", "set -e && your task script"]

But there are problems with this approach: it does not end processes gracefully. Suppose you send SIGTERM to bash with kill. Bash terminates, but it does not forward SIGTERM to its child processes! When bash exits, the kernel ends all remaining processes in the container by sending them SIGKILL, so processes that had not finished are killed without a chance to clean up. SIGKILL cannot be caught, so there is no way for a process to end cleanly. Suppose the application is busy writing a file; the file may be corrupted if the application is killed in the middle of the write. An unclean shutdown is a very bad thing, like pulling the plug on a server. But why care whether the init process receives SIGTERM? Because docker stop sends SIGTERM to the init process, and docker stop is supposed to stop the container cleanly so that you can start it again later with docker start.
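For reference, here is a minimal sketch (an assumption, not the article's approach) of what a clean SIGTERM shutdown looks like from the Python side. Note that with the /bin/bash -c wrapper above, bash would still need to forward the signal for this handler to ever run.

import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Ask the main loop to stop after the current iteration instead of dying mid-write.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # ... run one crawl iteration here (placeholder) ...
    time.sleep(1)

print('received SIGTERM, exiting cleanly')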

The while True loop causes the program to freeze

Fetching data continuously in a while True loop makes the Python process slowly use more and more memory and become slower and slower. I tried explicitly del-ing each variable after use and calling gc.collect() at the end of each loop iteration, but the program still froze after running for a while. After reading articles about Python memory management and garbage collection for several days, I still had no good answer.

The ultimate solution

After several days of testing, I decided to manage the Chrome browser and the processing of its data in a child process. A process is the operating system’s basic unit of resource allocation and scheduling: it has its own memory space and CPU resources, and when it ends the system cleans up and reclaims its PCB (process control block). The implementation subclasses Process to create a custom process class and terminates the child process at the end of each while-loop iteration, as sketched below.
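A minimal sketch of the pattern (hypothetical names; the full implementation follows below): each loop iteration runs in a fresh child process, so whatever memory it used is returned to the operating system when the child exits.

from multiprocessing import Process

class CrawlerProcess(Process):  # stripped-down stand-in for the Browser class shown later
    def __init__(self, urls):
        Process.__init__(self)
        self.urls = urls

    def run(self):
        # launch Chrome here, crawl self.urls, save the results ...
        pass

if __name__ == '__main__':
    while True:
        p = CrawlerProcess(['https://www.baidu.com/'])
        p.start()
        p.join(30)         # give the child at most 30 seconds
        if p.is_alive():
            p.terminate()  # kill it if it hangs; the OS reclaims all of its memory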

Interprocess communication

Although managing the crawler in a child process effectively controls memory usage, returning the crawled data to the parent is another problem. You can of course write to MySQL or Redis directly in the child process, which suits a single-node crawler but not a distributed multi-node one: a distributed crawler typically scrapes data on many nodes and stores it from one node. Here the data is shared through a Manager, which copies the child process’s results back to the main process. There are pitfalls, however, when using a Manager with mutable types such as list and dict, as sketched below.
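The pitfall, in short, is that in-place changes to a mutable element fetched from a Manager proxy are not written back; the element has to be reassigned. A small illustrative sketch (hypothetical, not from the article):

from multiprocessing import Process, Manager

def worker(shared):
    shared[0]['num'] += 1   # mutates a local copy only: the change is NOT propagated to the parent
    item = shared[0]
    item['num'] += 1
    shared[0] = item        # reassigning the element goes through the proxy and IS propagated

if __name__ == '__main__':
    manager = Manager()
    shared = manager.list([{'num': 0}])
    p = Process(target=worker, args=(shared,))
    p.start()
    p.join()
    print(shared[0])        # prints {'num': 1}: only the reassigned value made it back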

Final code and environment

Building the image

Dockerfile

FROM centos:7
RUN set -ex \
    # Install the required components
    && yum install -y wget tar libffi-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make initscripts \
    && wget https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tgz \
    && tar -zxvf Python-3.6.0.tgz \
    && cd Python-3.6.0 \
    && ./configure prefix=/usr/local/python3 \
    && make \
    && make install \
    && make clean \
    && rm -rf /Python-3.6.0* \
    && yum install -y epel-release \
    && yum install -y python-pip
# Make python3 the default python
RUN set -ex \
    # Back up the old Python binaries
    && mv /usr/bin/python /usr/bin/python27 \
    && mv /usr/bin/pip /usr/bin/pip-python2.7 \
    # Point the default python at python3
    && ln -s /usr/local/python3/bin/python3.6 /usr/bin/python \
    && ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
# Fix yum failure caused by python version modification
RUN set -ex \
    && sed -i "s#/usr/bin/python#/usr/bin/python2.7#" /usr/bin/yum \
    && sed -i "s#/usr/bin/python#/usr/bin/python2.7#" /usr/libexec/urlgrabber-ext-down \
    && yum install -y deltarpm
# Base environment configuration
RUN set -ex \
    # Change the system time zone to UTC+8
    && rm -rf /etc/localtime \
    && ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
    && yum install -y vim \
    # Install cron for scheduled tasks
    && yum -y install cronie
# Support Chinese
RUN localedef -c -f UTF-8 -i zh_CN zh_CN.utf8
# Chrome browser dependency
RUN yum install kde-l10n-Chinese -y
RUN yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 -y
RUN yum install ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc -y
# Update PIP version
RUN pip install --upgrade pip
ENV LC_ALL zh_CN.UTF-8
RUN mkdir -p /usr/src/scrapy
COPY requirements.txt /usr/src/scrapy
RUN pip install -i https://pypi.douban.com/simple/ -r /usr/src/scrapy/requirements.txt

docker-compose file

version: '3.3'
services:
  scrapy:
    privileged: true
    build: scrapy
    tty: true
    volumes:
      - type: bind
        source: ./
        target: /usr/src/scrapy
    ports:
      - "9999:9999"
    networks:
      scrapynet:
        ipv4_address: 172.19.0.8
    command: [/bin/bash, -c, set -e && python /usr/src/scrapy/job.py]
  
networks:
  scrapynet:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 172.19.0.0/24

Notes on command: [/bin/bash, -c, set -e && python /usr/src/scrapy/job.py]:

  • /bin/bash keeps zombie processes from piling up: bash runs as PID 1 and reaps orphaned children. The set -e prefix stops bash from treating the task as a single simple command and exec()ing it directly, which would replace bash as PID 1.
  • python /usr/src/scrapy/job.py is the actual task script

Crawler script based on Pyppeteer

import asyncio, random, psutil, os, signal, time, subprocess, gc, platform  # platform added so the Browser class can detect the OS
from pyppeteer import launcher
from multiprocessing import Process
from multiprocessing import Manager
# Remove the --enable-automation switch so sites cannot detect that the browser is WebDriver-controlled
launcher.AUTOMATION_ARGS.remove("--enable-automation")
from pyppeteer import launch

class AjaxData(object):
    response_data=[]

    @classmethod
    def init(cls):
        cls.response_data.clear()

    @classmethod
    def save(cls,data):
        cls.response_data.append(data)

    @classmethod
    def get_data(cls):
        return cls.response_data

async def intercept_request(req):
    if req.resourceType in ["image"]:
        await req.abort()
    else:
        res = {
            "method": req.method,
            "url": req.url,
            "data": "" if req.postData == None else req.postData,
            "res": "" if req.response == None else req.response
        }
        print(res)
        await req.continue_()


async def intercept_response(res):
    resourceType = res.request.resourceType
    if resourceType in ['xhr']:
        # The real data would be returned here
        # resp = await res.json()
        # Test data
        resp = {'num': random.randint(1, 200)}
        AjaxData.save(resp)
        del resp

class newpage(object):
    width, height = 1920, 1080
    def __init__(self, page_url,chrome_browser):
        self.url = page_url
        self.browser = chrome_browser

    async def run(self):
        t = random.randint(1, 4)
        tt = random.randint(t, 10)
        await asyncio.sleep(tt)
        try:
            page = await self.browser.newPage()
            await page.setUserAgent(
                userAgent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/70.0.3521.2 Safari/537.36')
            await page.setViewport(viewport={'width': self.width, 'height': self.height})
            # Whether to enable JS. If enabled is set to False, no rendering effect is available
            await page.setJavaScriptEnabled(enabled=True)
            await page.setRequestInterception(value=True)
            page.on('request', intercept_request)
            page.on('response', intercept_response)
            await page.goto(self.url, options={'timeout': 30000})
            await page.waitFor(selectorOrFunctionOrTimeout=1000)
            try:
                await page.close()
                return self.url
            except BaseException as err:
                return "close_newpage: {0}".format(err)
        except BaseException as err:
            return "newpage: {0}".format(err)

class Browser(Process):
    width, height = 1920.1080
    browser = None
    system = ' '
    pid = 0
    is_headless = True
    url_list = []
    returnlist=[]

    def __init__(self, urls, return_list):
        Process.__init__(self)
        self.system = platform.system()  # assumed fix: detect the OS so kill() can pick the right branch (system was never set in the original)
        self.url_list = urls
        self.returnlist = return_list

    # Custom kill() method: kill the main Chrome process and let init (PID 1) adopt and reap its zombie children
    def kill(self, name: str = ' '):
        if self.system == 'Windows':
            # win platform
            subprocess.Popen("taskkill /F /IM chrome.EXE ", shell=True)
        else:
            # Linux platform
            # Check whether the process exists
            if self.pid > 0 and psutil.pid_exists(self.pid):
                # Check whether the process is running
                p = psutil.Process(self.pid)
                print('Browser state: %s' % p.status())
                if p.status() != psutil.STATUS_ZOMBIE:
                    try:
                        pgid = os.getpgid(self.pid)
                        # Force end
                        os.kill(self.pid, signal.SIGKILL)
                        # os.kill(pgid, signal.SIGKILL)
                        print("End process: %d" % self.pid)
                        print("Parent process is: %d" % pgid)
                        print("Browser state: %d" % self.browser.process.wait())
                    except BaseException as err:
                        print("close: {0}".format(err))
                del p
            # check to see if there are other processes
            for proc in psutil.process_iter():
                if name in proc.name():
                    try:
                        pgid = os.getpgid(proc.pid)
                        os.kill(proc.pid, signal.SIGKILL)
                        print('Killed pid: %d pgid: %d name: %s' % (proc.pid, pgid, proc.name()))
                        del pgid
                    except BaseException as err:
                        print("kill: {0}".format(err))
        time.sleep(3)

    # Open browser
    async def newbrowser(self):
        try:
            self.browser = await launch({
                'headless': self.is_headless,
                'devtools': not self.is_headless,
                'dumpio': True,
                'autoClose': True,
                # 'userDataDir': './userdata',
                'handleSIGTERM': True,
                'handleSIGHUP': True,
                # 'executablePath': 'C:/Users/zhang/Desktop/chrome-win/chrome.exe',
                'args': [
                    '--no-sandbox',  # --no-sandbox is required when running inside Docker
                    '--disable-gpu',
                    '--disable-extensions',
                    '--hide-scrollbars',
                    '--disable-bundled-ppapi-flash',
                    '--mute-audio',
                    '--disable-setuid-sandbox',
                    '--disable-xss-auditor',
                    '--window-size=%d,%d' % (self.width, self.height)
                ]
            })
        except BaseException as err:
            print("launch: {0}".format(err))

        print('---- Open your browser ----')

    async def open(self):
        await self.newbrowser()
        self.pid = self.browser.process.pid
        try:
            tasks = [asyncio.ensure_future(newpage(url,self.browser).run()) for url in self.url_list]
            for task in asyncio.as_completed(tasks):
                result = await task
                print('Task ret: {}'.format(result))
            del tasks[:]
        except BaseException as err:
            print("open: {0}".format(err))
        # browser.close() cannot completely exit the Chrome process, so the kill() method above is used afterwards to kill the main Chrome process and let init (PID 1) reap its zombie children
        await self.browser.close()

    def run(self):
        AjaxData.init()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.open())
        self.returnlist.extend(AjaxData.get_data())
        print('---- Close browser ----')
        self.kill('chrom')

if __name__ == '__main__':
    url_list = [
        'https://www.baidu.com/',
        'https://www.baidu.com/',
        'https://www.baidu.com/',
        'https://www.baidu.com/',
    ]
    while True:
        manager = Manager()
        return_list = manager.list()
        # Never stop performing tasks
        p = Browser(url_list, return_list)
        p.start()
        p.join(30)
        if p.is_alive() == True:
            p.terminate()
            print('Force the child process to close.... ')
        else:
            print('Child process closed... ')
        # Print the data returned by the child process
        print(return_list)
        # to clean up
        del p
        del return_list[:]
        gc.collect()