Pyppeteer overview
Puppeteer is the Google Chrome team's official headless Chrome tool. It is a Node library that provides a high-level API to drive Chrome over the DevTools Protocol without a visible front end, and it can also be configured to run full (non-headless) Chrome. Because Chrome has long been the dominant browser, headless Chrome is set to become the de facto standard for automated testing of web applications. With Puppeteer you can drive Chrome on Linux and in a wide range of applications. Pyppeteer is a Python port of Puppeteer that lets you control Chrome from Python.
GitHub: Pyppeteer
Problems in use
- The close() command provided by the Pyppeteer API does not really shut the browser down, leaving behind a large number of zombie processes.
- A too-new websockets version causes the error pyppeteer.errors.NetworkError: Protocol error Network.getCookies: Target closed.
- Chromium freezes when many pages are opened at the same time.
- The browser window is large but the rendered content area is small.
- Looping through tasks with while True makes the Python script consume more and more memory until the program appears to hang.
The sections below give solutions to each of these problems.
Using Pyppeteer
Pyppeteer installation
python3 -m pip install pyppeteer
Pyppeteer downloads Chromium automatically the first time it is used; alternatively, you can download the latest build yourself and point to it with the browser path in your code. Chromium download address
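If you want to check where Pyppeteer keeps its bundled Chromium, or trigger the download up front rather than on the first launch(), something like the following works on the pyppeteer versions I have used (treat the helper names as version-dependent assumptions):

import pyppeteer
from pyppeteer import chromium_downloader

# Path where pyppeteer expects its bundled Chromium executable
print(pyppeteer.executablePath())

# Download Chromium now (skipped if it is already present) instead of on the first launch()
if not chromium_downloader.check_chromium():
    chromium_downloader.download_chromium()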
A simple example
from pyppeteer import launch
import asyncio

async def main():
    # Create a browser
    browser = await launch({
        'executablePath': '/path/to/your/Chromium.app/Contents/MacOS/Chromium',
    })
    # Open a page; the same browser can open multiple pages
    page = await browser.newPage()
    await page.goto('https://baidu.com')         # visit the specified page
    await page.screenshot(path='example.png')    # take a screenshot
    await page.close()                           # close the page
    await browser.close()                        # close the browser (opening many pages leaves a lot of zombie processes)

asyncio.get_event_loop().run_until_complete(main())
Running the code above produces a screenshot of the page. If it fails with pyppeteer.errors.NetworkError: Protocol error Network.getCookies: Target closed, the error can be fixed by downgrading the websockets package:
pip uninstall websockets                          # remove the current websockets
pip install websockets==6.0 --force-reinstall     # install version 6.0
Important parameters and methods
import asyncio
from pyppeteer import launch

async def intercept_request(req):
    # Do not load CSS, image and other static resources
    if req.resourceType in ["image", "media", "eventsource", "websocket", "stylesheet", "font"]:
        await req.abort()    # abort the request
    else:
        res = {
            "method": req.method,
            "url": req.url,
            "data": "" if req.postData == None else req.postData,
            "res": "" if req.response == None else req.response
        }
        print(res)    # print the request content
        # continue_() can take overrides to redirect the request or change its headers
        await req.continue_()

async def intercept_response(res):
    resourceType = res.request.resourceType
    # Intercept Ajax requests to get their data
    if resourceType in ['xhr']:
        resp = await res.json()
        print(resp)    # you could store the data in MySQL, Redis, or a dedicated class instead

async def main():
    # Create a browser
    browser = await launch({
        'executablePath': '/path/to/your/Chromium.app/Contents/MacOS/Chromium',
        'headless': False,    # turn off headless mode, mainly for debugging in a test environment
        'devtools': True,     # open Chromium devtools (used together with headless off)
        'args': [
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',   # --no-sandbox is required when running inside Docker
            '--disable-setuid-sandbox',
            '--disable-gpu',
        ],
        # Pipe the browser process's stderr and stdout to the main application;
        # if True, chromium console output is printed by the main application
        'dumpio': True,
    })
    # Open a page; the same browser can open multiple pages
    page = await browser.newPage()
    # Whether to enable JS; with enabled=False nothing is rendered, so enable it if the page makes Ajax requests
    await page.setJavaScriptEnabled(enabled=True)
    # Enable interception so the two registered callbacks run before the browser
    # sends a request and when it receives the response
    await page.setRequestInterception(value=True)
    page.on('request', intercept_request)      # request content
    page.on('response', intercept_response)    # response content
    await page.goto('https://baidu.com')           # visit the specified page
    await page.screenshot(path='example.png')      # take a screenshot
    await page.close()                             # close the page
    await browser.close()                          # close the browser (opening many pages leaves a lot of zombie processes)

asyncio.get_event_loop().run_until_complete(main())
Zombie processes
Cause analysis
When a parent process creates a new child with the fork() system call, the kernel allocates an entry for the child in the process table and stores information there, including the identity of its parent. When the child later exits (for example via the exit call), it is not destroyed immediately; it is left behind as a data structure known as a zombie (the exit system call ends the process, but only turns a normal process into a zombie process, it does not destroy it completely). At that point the process-table entry holds the process's exit code, the CPU time it used, and similar data, which is kept until the parent collects it. A defunct process therefore exists in the window after the child has terminated but before the parent has read its status. The zombie has given up almost all of its memory, has no executable code and cannot be scheduled; it merely keeps a slot in the process table so that other processes can read its exit status, and it no longer occupies any other storage.

If the parent neither installs a SIGCHLD handler that calls wait() or waitpid() for the terminated child nor explicitly ignores the signal, the child stays a zombie. If the parent itself terminates, init automatically adopts the child and reaps it, so the zombie can still be removed.

Nginx is a good example of how a background daemon is created: first the original Nginx process forks a child, then the original process exits, and finally the Nginx child is adopted by the init process.
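To make the mechanism concrete, here is a minimal sketch of my own (not from the original article): the child exits immediately, but until the parent calls os.waitpid() its entry stays in the process table marked as a zombie.

import os
import time

pid = os.fork()
if pid == 0:
    # Child: exit immediately
    os._exit(0)

# Parent: the child has already exited, but it has not been reaped yet,
# so ps shows its state as Z (zombie / <defunct>)
time.sleep(1)
os.system("ps -o pid,stat,comm -p %d" % pid)

# Reaping the child removes the zombie entry from the process table
os.waitpid(pid, 0)
print("child %d reaped" % pid)

In a container, this adoption-and-reaping behaviour is what the following approach relies on: run the task through bash so that bash is PID 1 and reaps orphaned children.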
CMD ["/bin/bash"."-c"."Set-e && Your task script"]
But this approach has a problem: it does not end the process gracefully. Suppose you send SIGTERM to bash with kill. Bash terminates, but it does not forward SIGTERM to its child processes. When bash exits, the kernel ends every remaining process in the container, force-killing with SIGKILL anything that has not terminated cleanly. SIGKILL cannot be caught, so those processes have no chance to shut down cleanly. Imagine an application that is busy writing a file: if it is killed uncleanly in the middle of the write, the file may be corrupted. An unclean shutdown is a very bad thing; it is like pulling the plug on a server. But why care whether the init process receives SIGTERM? Because docker stop sends SIGTERM to the init process: docker stop is supposed to stop the container cleanly so that it can be started again later with docker start.
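To see what a graceful shutdown needs in practice, here is a small wrapper of my own (an illustration, not the article's code): it starts the real task as a child, forwards SIGTERM to it when docker stop arrives, and waits for it so the task can finish writing its files before the container exits.

import signal
import subprocess
import sys

# Start the real task as a child process (the script path is just an example)
child = subprocess.Popen([sys.executable, "/usr/src/scrapy/job.py"])

def forward_sigterm(signum, frame):
    # docker stop sends SIGTERM to PID 1; forward it so the task can shut down cleanly
    child.terminate()

signal.signal(signal.SIGTERM, forward_sigterm)

# Waiting here both lets the task exit on its own terms and reaps it, avoiding a zombie
sys.exit(child.wait())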
The while True loop causes the program to freeze
When tasks are fetched continuously in a while True loop, the Python process slowly consumes more and more memory and becomes slower. I tried explicitly calling del on each variable after use and then calling gc.collect() at the end of every iteration, but the program still froze after running for a while. After several days of reading articles on Python garbage collection I still had no real clue.
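Before changing the architecture, it is worth confirming where the memory goes. A small helper of my own, using psutil (already a dependency of the final script), prints the resident memory of the current process so you can watch it grow across iterations:

import gc
import os
import psutil

def log_memory(tag=''):
    # Print the resident set size (RSS) of the current Python process in MB
    rss = psutil.Process(os.getpid()).memory_info().rss
    print('%s RSS: %.1f MB' % (tag, rss / 1024 / 1024))

# Example usage inside the crawl loop:
# while True:
#     do_one_round()          # hypothetical crawl step
#     gc.collect()
#     log_memory('after gc')  # RSS keeps climbing despite del / gc.collect()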
The ultimate fix
After several more days of testing, I decided to run the Chrome browser and the related data processing inside a child process. A process is the basic unit of resource allocation and scheduling in the operating system; it has its own independent memory space and CPU resources, and when it exits the system cleans up and reclaims its PCB. The implementation is a custom class that inherits from Process, with the child process terminated at the end of every while iteration.
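Stripped down to a sketch (the full Pyppeteer version is at the end of this article, and crawl() here is just a placeholder), the pattern looks like this: each round of the loop runs in a fresh child process, so whatever memory that round allocated is returned to the OS when the child exits.

from multiprocessing import Process
import time

def crawl(urls):
    # Placeholder for the real Pyppeteer work
    time.sleep(1)

class Worker(Process):
    def __init__(self, urls):
        Process.__init__(self)
        self.urls = urls

    def run(self):
        # Everything allocated here belongs to the child process
        # and is returned to the OS when the child exits
        crawl(self.urls)

if __name__ == '__main__':
    url_list = ['https://www.baidu.com/']
    while True:
        p = Worker(url_list)
        p.start()
        p.join(30)           # give each round at most 30 seconds
        if p.is_alive():
            p.terminate()    # force-stop a stuck round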
Interprocess communication
Managing the crawler with a child process keeps memory under control, but then the question is how to return the crawled data. You can of course write to MySQL or Redis directly from the child process, which is fine for a single-node crawler but not for a distributed, multi-node one: a distributed crawler typically scrapes on many nodes and stores data from a single node. Here the data is shared through a Manager, which copies the child process's results back to the main process. There are pitfalls, however, when a Manager handles mutable types such as lists and dicts.
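A small sketch of my own shows the pitfall: a Manager proxy only sees operations performed on the proxy itself, so mutating a dict that you pulled out of a manager.list() is silently lost; you have to put the modified object back (or append/extend through the proxy).

from multiprocessing import Manager, Process

def worker_broken(shared):
    shared.append({'num': 1})   # OK: append is called on the proxy
    shared[0]['num'] = 99       # LOST: shared[0] returns a copy, the proxy never sees the change

def worker_fixed(shared):
    item = shared[1]            # take the element out ...
    item['num'] = 99            # ... modify the local copy ...
    shared[1] = item            # ... and assign it back through the proxy, which does propagate

if __name__ == '__main__':
    manager = Manager()
    shared = manager.list()

    p = Process(target=worker_broken, args=(shared,))
    p.start(); p.join()

    shared.append({'num': 1})
    p = Process(target=worker_fixed, args=(shared,))
    p.start(); p.join()

    print(list(shared))         # [{'num': 1}, {'num': 99}] - only the reassigned change survived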
Final code and environment
Building the image
Dockerfile
FROM centos:7
RUN set -ex \
# Preinstall the required components
&& yum install -y wget tar libffi-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make initscripts \
&& wget https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tgz \
&& tar -zxvf Python-3.6.0.tgz \
&& cd Python-3.6.0 \
&& ./configure prefix=/usr/local/python3 \
&& make \
&& make install \
&& make clean \
&& rm -rf /Python-3.6.0* \
&& yum install -y epel-release \
&& yum install -y python-pip
# Set python3 as the default
RUN set -ex \
# Back up the old Python version
&& mv /usr/bin/python /usr/bin/python27 \
&& mv /usr/bin/pip /usr/bin/pip-python2.7 \
# Point the defaults at python3
&& ln -s /usr/local/python3/bin/python3.6 /usr/bin/python \
&& ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
# Fix yum failure caused by python version modification
RUN set -ex \
&& sed -i "S # / usr/bin/python# / usr/bin/python2.7 #" /usr/bin/yum \
&& sed -i "S # / usr/bin/python# / usr/bin/python2.7 #" /usr/libexec/urlgrabber-ext-down \
&& yum install -y deltarpm
# Base environment configuration
RUN set -ex \
# Change the system time zone to GMT+8
&& rm -rf /etc/localtime \
&& ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
&& yum install -y vim \
# Install the cron scheduled-task component
&& yum -y install cronie
# Support Chinese
RUN localedef -c -f UTF-8 -i zh_CN zh_CN.utf8
# Chrome browser dependency
RUN yum install kde-l10n-Chinese -y
RUN yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 -y
RUN yum install ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc -y
# Update PIP version
RUN pip install --upgrade pip
ENV LC_ALL zh_CN.UTF-8
RUN mkdir -p /usr/src/scrapy
COPY requirements.txt /usr/src/scrapy
RUN pip install -i https://pypi.douban.com/simple/ -r /usr/src/scrapy/requirements.txt
docker-compose file
version: '3.3'
services:
  scrapy:
    privileged: true
    build: scrapy
    tty: true
    volumes:
      - type: bind
        source: ./
        target: /usr/src/scrapy
    ports:
      - "9999:9999"
    networks:
      scrapynet:
        ipv4_address: 172.19.0.8
    command: [/bin/bash, -c, set -e && python /usr/src/scrapy/job.py]

networks:
  scrapynet:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 172.19.0.0/24
Notes on command: [/bin/bash, -c, set -e && python /usr/src/scrapy/job.py]:
- /bin/bash runs as PID 1 in the container and reaps zombie processes; the set -e && prefix turns the command line into more than a single simple command, which stops bash from replacing itself with the task via exec() and losing that reaping role.
- python /usr/src/scrapy/job.py is the actual worker script.
Crawler script based on Pyppeteer
import asyncio, random, psutil, os, signal, time, subprocess, gc
from pyppeteer import launcher
from multiprocessing import Process
from multiprocessing import Manager

# Hook: remove --enable-automation before importing launch, so sites cannot detect webdriver automation
launcher.AUTOMATION_ARGS.remove("--enable-automation")
from pyppeteer import launch

class AjaxData(object):
    response_data = []

    @classmethod
    def init(cls):
        cls.response_data.clear()

    @classmethod
    def save(cls, data):
        cls.response_data.append(data)

    @classmethod
    def get_data(cls):
        return cls.response_data

async def intercept_request(req):
    if req.resourceType in ["image"]:
        await req.abort()
    else:
        res = {
            "method": req.method,
            "url": req.url,
            "data": "" if req.postData == None else req.postData,
            "res": "" if req.response == None else req.response
        }
        print(res)
        await req.continue_()

async def intercept_response(res):
    resourceType = res.request.resourceType
    if resourceType in ['xhr']:
        # Real returned data:
        # resp = await res.json()
        # Test data:
        resp = {'num': random.randint(1, 200)}
        AjaxData.save(resp)
        del resp
class newpage(object):
    width, height = 1920, 1080

    def __init__(self, page_url, chrome_browser):
        self.url = page_url
        self.browser = chrome_browser

    async def run(self):
        t = random.randint(1, 4)
        tt = random.randint(t, 10)
        await asyncio.sleep(tt)
        try:
            page = await self.browser.newPage()
            await page.setUserAgent(
                userAgent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/70.0.3521.2 Safari/537.36')
            await page.setViewport(viewport={'width': self.width, 'height': self.height})
            # Whether to enable JS; with enabled=False nothing is rendered
            await page.setJavaScriptEnabled(enabled=True)
            await page.setRequestInterception(value=True)
            page.on('request', intercept_request)
            page.on('response', intercept_response)
            await page.goto(self.url, options={'timeout': 30000})
            await page.waitFor(selectorOrFunctionOrTimeout=1000)
            try:
                await page.close()
                return self.url
            except BaseException as err:
                return "close_newpage: {0}".format(err)
        except BaseException as err:
            return "newpage: {0}".format(err)
class Browser(Process):
    width, height = 1920, 1080
    browser = None
    system = ''
    pid = 0
    is_headless = True
    url_list = []
    returnlist = []

    def __init__(self, urls, return_list):
        Process.__init__(self)
        self.url_list = urls
        self.returnlist = return_list

    # A wrapped kill() method: kill the main Chrome process and let init (PID 1) take over and reap its zombie children
    def kill(self, name: str = ''):
        if self.system == 'Windows':
            # Windows platform
            subprocess.Popen("taskkill /F /IM chrome.EXE ", shell=True)
        else:
            # Linux platform
            # Check whether the process exists
            if self.pid > 0 and psutil.pid_exists(self.pid):
                # Check whether the process is still running
                p = psutil.Process(self.pid)
                print('Browser state: %s' % p.status())
                if p.status() != psutil.STATUS_ZOMBIE:
                    try:
                        pgid = os.getpgid(self.pid)
                        # Force the process to end
                        os.kill(self.pid, signal.SIGKILL)
                        # os.kill(pgid, signal.SIGKILL)
                        print("Ended process: %d" % self.pid)
                        print("Process group: %d" % pgid)
                        print("Browser exit status: %d" % self.browser.process.wait())
                    except BaseException as err:
                        print("close: {0}".format(err))
                del p
            # Check whether any other matching processes are left over
            for proc in psutil.process_iter():
                if name in proc.name():
                    try:
                        pgid = os.getpgid(proc.pid)
                        os.kill(proc.pid, signal.SIGKILL)
                        print('Killed pid: %d pgid: %d name: %s' % (proc.pid, pgid, proc.name()))
                        del pgid
                    except BaseException as err:
                        print("kill: {0}".format(err))
            time.sleep(3)
    # Open the browser
    async def newbrowser(self):
        try:
            self.browser = await launch({
                'headless': self.is_headless,
                'devtools': not self.is_headless,
                'dumpio': True,
                'autoClose': True,
                # 'userDataDir': './userdata',
                'handleSIGTERM': True,
                'handleSIGHUP': True,
                # 'executablePath': 'C:/Users/zhang/Desktop/chrome-win/chrome.exe',
                'args': [
                    '--no-sandbox',    # --no-sandbox is required when running inside Docker
                    '--disable-gpu',
                    '--disable-extensions',
                    '--hide-scrollbars',
                    '--disable-bundled-ppapi-flash',
                    '--mute-audio',
                    '--disable-setuid-sandbox',
                    '--disable-xss-auditor',
                    '--window-size=%d,%d' % (self.width, self.height)
                ]
            })
        except BaseException as err:
            print("launch: {0}".format(err))
        print('---- Browser opened ----')

    async def open(self):
        await self.newbrowser()
        self.pid = self.browser.process.pid
        try:
            tasks = [asyncio.ensure_future(newpage(url, self.browser).run()) for url in self.url_list]
            for task in asyncio.as_completed(tasks):
                result = await task
                print('Task ret: {}'.format(result))
            del tasks[:]
        except BaseException as err:
            print("open: {0}".format(err))
        # browser.close() cannot completely end the Chrome processes, which is why the wrapped kill()
        # above force-kills the main Chrome process and lets init (PID 1) reap its zombie children
        await self.browser.close()

    def run(self):
        AjaxData.init()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.open())
        self.returnlist.extend(AjaxData.get_data())
        print('---- Closing browser ----')
        self.kill('chrom')
if __name__ == '__main__':
    url_list = [
        'https://www.baidu.com/',
        'https://www.baidu.com/',
        'https://www.baidu.com/',
        'https://www.baidu.com/',
    ]
    # Keep performing the task forever
    while True:
        manager = Manager()
        return_list = manager.list()
        p = Browser(url_list, return_list)
        p.start()
        p.join(30)
        if p.is_alive() == True:
            p.terminate()
            print('Forced the child process to close....')
        else:
            print('Child process closed...')
        # Print the data returned by the child process
        print(return_list)
        # Clean up
        del p
        del return_list[:]
        gc.collect()