The chattering bird channel for your friends
1. Puppeteer
Introduction and Installation
Puppeteer is a Node library that provides a high-level API for controlling Chromium through the DevTools protocol. After Google released headless Chrome, I abandoned Selenium, because Puppeteer is so friendly to Node.js developers: it installs with npm i and, I assumed, needs no other dependencies (I was too naive O(╥﹏╥)O; it is actually not that simple).
My development machine runs macOS and the server runs CentOS 7. On macOS installation really is simple: just run npm i puppeteer. If the installation fails, try one of the following workarounds:
# 1. Set an environment variable to skip the Chromium download (no longer worked as of 2018-09-03)
set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1
# 2. Download only the module without running install scripts; Chromium must then be downloaded separately (worked as of 2018-09-03)
npm i --save puppeteer --ignore-scripts
# 3. Since v1.7.0 Puppeteer also ships the puppeteer-core package, which contains only the core library and does not download Chromium by default
npm i puppeteer-core
# If puppeteer cannot be installed, the Taobao mirror is recommended
npm config set registry="https://registry.npm.taobao.org"
If you downloaded Chromium yourself, add the following options when launching the headless browser:
this.browser = await puppeteer.launch({
  // On macOS this should be "xxx/Chromium.app/Contents/MacOS/Chromium"; on Linux, "/usr/bin/chromium-browser"
  executablePath: 'path to your Chromium install',
  // Disable the sandbox
  args: ['--no-sandbox', '--disable-dev-shm-usage']
});
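As a minimal sketch of the configuration above, the helper below builds the launch options for a self-downloaded Chromium. The function name buildLaunchOptions and its parameters are hypothetical. The Linux default is the path mentioned above; on macOS the caller has to supply the real path, because it depends on where Chromium was unpacked.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// Only executablePath and args come from the article; the rest is illustrative.
function buildLaunchOptions(platform, chromiumPath) {
  // Default Linux location mentioned in the article; other platforms must pass a path.
  const defaults = { linux: '/usr/bin/chromium-browser' };
  const executablePath = chromiumPath || defaults[platform];
  if (!executablePath) {
    throw new Error(`No Chromium path known for platform "${platform}"; pass one explicitly`);
  }
  return {
    executablePath,
    args: ['--no-sandbox', '--disable-dev-shm-usage'],
  };
}

console.log(buildLaunchOptions('linux'));
```

It could then be used as puppeteer.launch(buildLaunchOptions(process.platform, process.env.CHROMIUM_PATH)), where CHROMIUM_PATH is an assumed environment variable name, not something Puppeteer reads by itself.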
See the Puppeteer use cases to learn more about Puppeteer.
2. Techniques
Lazy loading screenshot
When taking screenshots or crawling, we often run into pages that load their data lazily, so the first screen does not show all the information. The way to beat lazy loading is to keep scrolling until the bottom is reached. And if the lazy loading has no bottom? Try calling the page's API directly. If you know other clever approaches, please point them out.
page.evaluate(pageFunction, ...args): this function lets us use the built-in DOM selectors
pageFunction is the function to be evaluated in the page context; ...args are the arguments passed to pageFunction
const result = await page.evaluate((param1, param2, param3) => {
  return Promise.resolve(8 + param1 + param2 + param3);
}, param1, param2, param3);
// You can also pass a string:
console.log(await page.evaluate('1 + 2')); // output "3"
const x = 10;
console.log(await page.evaluate(`1 + ${x}`)); // output "11"
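One consequence of passing parameters this way is that pageFunction cannot see variables from the surrounding Node.js scope; anything it needs has to arrive through ...args. The sketch below is only a simulation of that behavior, not Puppeteer's implementation: evaluateLikePage is a hypothetical stand-in that rebuilds the function from its source text and round-trips the arguments through JSON. Unlike the real page.evaluate, it is synchronous and does not return a Promise.

```javascript
// Simulation only: mimics how a function and its arguments are shipped to the page.
// The function travels as source text and the arguments as serialized values,
// so nothing from the Node.js closure survives the trip.
function evaluateLikePage(pageFunction, ...args) {
  const source = pageFunction.toString();
  // JSON stands in for the serialization of arguments.
  const wireArgs = JSON.parse(JSON.stringify(args));
  // Rebuild the function from its source, outside the caller's scope.
  const rebuilt = new Function(`return (${source});`)();
  return rebuilt(...wireArgs);
}

function demo() {
  const offset = 10; // exists only in this Node.js-side scope

  // Works: the value is passed explicitly as an argument.
  const ok = evaluateLikePage((a, b) => a + b, 1, offset);

  // Fails: the rebuilt function cannot reach `offset` through the closure.
  let error = null;
  try {
    evaluateLikePage(() => 1 + offset);
  } catch (e) {
    error = e.name;
  }
  return { ok, error };
}

console.log(demo());
```

The first call returns 11, while the second throws a ReferenceError, which is why the examples above pass param1, param2, and param3 explicitly.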
Code: using lazy loading on Jianshu as an example
/**
 * Automatically scroll a lazy-loaded page
 */
const path = require('path');
const puppeteer = require('puppeteer-core');
const log = console.log;

(async () => {
  const browser = await puppeteer.launch({
    // puppeteer-core does not bundle Chromium: uncomment and adjust this path
    // executablePath: path.join(__dirname, './chromium/Chromium.app/Contents/MacOS/Chromium'),
    // Turn off headless mode so the browser window opens
    headless: false,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();
  await page.goto('https://www.jianshu.com/u/40909ea33e50');
  await autoScroll(page);
  // Full-page screenshot
  await page.screenshot({
    path: 'auto_scroll.png',
    type: 'png',
    fullPage: true
  });
  await browser.close();
})();

async function autoScroll(page) {
  log('[AutoScroll begin]');
  await page.evaluate(async () => {
    await new Promise((resolve, reject) => {
      // Total distance scrolled so far
      let totalHeight = 0;
      // Distance to scroll down on each tick
      let distance = 100;
      // Scroll once every 100 ms
      let timer = setInterval(() => {
        let scrollHeight = document.body.scrollHeight;
        // Perform the scroll
        window.scrollBy(0, distance);
        totalHeight += distance;
        // Stop once the scrolled distance reaches the current page height
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
  log('[AutoScroll done]');
  // Once lazy loading has finished, a complete screenshot can be taken or the data crawled
  // do what you like ...
}
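For pages whose lazy loading never reaches a bottom, one option is to cap the number of scroll steps. The sketch below isolates that stopping rule so it can run without a browser. The names scrollUntilSettled and maxSteps and the two fake pages are hypothetical; in a real script the same loop would live inside page.evaluate and call window.scrollBy and document.body.scrollHeight.

```javascript
// Hypothetical stopping rule for auto-scrolling: stop at the bottom OR after maxSteps.
// `view` is any object exposing scrollBy(distance) and scrollHeight(), so the
// logic can be exercised without a browser.
function scrollUntilSettled(view, { distance = 100, maxSteps = 200 } = {}) {
  let totalHeight = 0;
  let steps = 0;
  while (steps < maxSteps) {
    view.scrollBy(distance);
    totalHeight += distance;
    steps += 1;
    // Same check as the article's loop: has the scrolled distance reached the current height?
    if (totalHeight >= view.scrollHeight()) {
      return { steps, reachedBottom: true };
    }
  }
  return { steps, reachedBottom: false };
}

// Fake page that grows by 500px every time it is scrolled, so it never ends.
function makeEndlessPage() {
  let height = 1000;
  return {
    scrollBy() { height += 500; },
    scrollHeight() { return height; },
  };
}

// Fake page with a fixed height of 1000px.
function makeFinitePage() {
  return { scrollBy() {}, scrollHeight() { return 1000; } };
}

console.log(scrollUntilSettled(makeFinitePage()));
console.log(scrollUntilSettled(makeEndlessPage(), { maxSteps: 50 }));
```

The finite page finishes after 10 steps; the endless one gives up after 50 with reachedBottom set to false, at which point the caller can decide whether a partial screenshot is good enough.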
Precise element screenshots
A precise screenshot, as the name suggests, captures exactly the area an element occupies on the page. In Puppeteer this is done with the clip parameter of screenshot, which positions the capture using the element's coordinates relative to the window (x, y) and its size (width, height). Of course, the element selector has to be accurate first; otherwise no amount of screenshot precision will help.
- page.screenshot parameter: clip
- element.getBoundingClientRect(): returns the position of the element relative to the viewport (the returned object includes left, top, width, and height); the details are easy to find online
- page.$eval: runs document.querySelector within the page and passes the matched element as the first argument to pageFunction
const path = require('path');
const puppeteer = require('puppeteer-core');
const log = console.log;

(async () => {
  const browser = await puppeteer.launch({
    // puppeteer-core does not bundle Chromium: uncomment and adjust this path
    // executablePath: path.join(__dirname, './chromium/Chromium.app/Contents/MacOS/Chromium'),
    // Turn off headless mode so the browser window opens
    headless: false,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();
  await page.goto('https://www.jianshu.com/');
  const pos = await getElementBounding(page, '.board');
  // Screenshot of the clip area
  await page.screenshot({
    path: 'element_bounding.png',
    type: 'png',
    clip: {
      x: pos.left,
      y: pos.top,
      width: pos.width,
      height: pos.height
    }
  });
  await browser.close();
})();

async function getElementBounding(page, element) {
  log('[GetElementBounding]: ', element);
  const pos = await page.$eval(element, e => {
    // Equivalent to running this inside page.evaluate:
    // document.querySelector(element).getBoundingClientRect()
    const {left, top, width, height} = e.getBoundingClientRect();
    return {left, top, width, height};
  });
  log('[Element position]: ', JSON.stringify(pos, undefined, 2));
  return pos;
}
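The conversion from the bounding rectangle to the clip object can be pulled into a small helper that also guards against a selector matching an element with no area. This is a sketch under assumptions: rectToClip is a hypothetical name, and rounding outward to whole pixels is one possible choice, not something the article or Puppeteer prescribes.

```javascript
// Hypothetical helper: turns the {left, top, width, height} object returned by
// getElementBounding into a clip object, rejecting elements that have no area
// (for example a selector that matched a hidden element).
function rectToClip(rect) {
  if (!(rect.width > 0 && rect.height > 0)) {
    throw new Error('Element has no visible area; check the selector');
  }
  // Round outward to whole pixels so sub-pixel edges are not cut off.
  const x = Math.floor(rect.left);
  const y = Math.floor(rect.top);
  return {
    x,
    y,
    width: Math.ceil(rect.left + rect.width) - x,
    height: Math.ceil(rect.top + rect.height) - y,
  };
}

console.log(rectToClip({ left: 10.4, top: 20.6, width: 100.2, height: 50 }));
```

With it, the screenshot call becomes page.screenshot({path: 'element_bounding.png', type: 'png', clip: rectToClip(pos)}).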
OK, at this point we can take screenshots of most elements; what remains are elements that sit inside an inner-scrolling container.
Screenshots of inner scroll elements
Inner scrolling: in contrast to traditional window scrolling, the main scroll bar belongs to the page content (or to some element) rather than to the browser window. The most common case is an admin dashboard, where the left sidebar and the right content area each have their own scroll bar.
Imagine opening NetEase Cloud Music: there are two inner scroll bars on the first screen, and to see more playlists we have to drag a scroll bar down. Taking a screenshot inside an inner-scrolling area works the same way: scroll until the target element comes into view, then use its window coordinates to take a precise screenshot.
Steps:
- Get the coordinates of the target element and determine whether it is within the currently visible area. If it is already in the window, no scrolling is needed
- Because this is inner scrolling, the target element must be wrapped by a parent element that has a scroll bar; scrolling that parent indirectly brings the target into view. So this step determines the parent element's selector
- Scroll the parent element from within the page (using window.scrollBy, or by setting scrollLeft / scrollTop) until the target is fully visible in the window
- Because the parent has scrolled, get the coordinates of the target element again (getBoundingClientRect)
- Take a screenshot with the new coordinates
One small detail is how to tell whether an element has a scroll bar. If an element has no horizontal scroll bar, setting its scrollLeft has no effect; only scrolling the window will work.
// If scrollWidth is greater than clientWidth, a horizontal scroll bar is present
element.scrollWidth > element.clientWidth
// If scrollHeight is greater than clientHeight, a vertical scroll bar is present
element.scrollHeight > element.clientHeight
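The visibility check from the first step and the scroll bar test can both be written as small pure functions, which makes them easy to try outside the browser. This is a sketch: scrollDelta and scrollableAxes are hypothetical names, and handling elements that sit above or to the left of the viewport (negative offsets) is an extension not described in the steps above.

```javascript
// Hypothetical helper for the visibility check, extended to elements that sit
// above or to the left of the viewport (negative offsets).
// `pos` is a getBoundingClientRect-style object, `viewport` is {width, height}.
function scrollDelta(pos, viewport) {
  const axis = (start, size, limit) => {
    if (start < 0) return start;                            // hidden before the viewport: scroll back
    if (start + size > limit) return start + size - limit;  // hidden after the viewport: scroll forward
    return 0;                                               // already fully visible
  };
  return {
    x: axis(pos.left, pos.width, viewport.width),
    y: axis(pos.top, pos.height, viewport.height),
  };
}

// The scroll bar test from the text, applied to any element-like object.
function scrollableAxes(el) {
  return {
    horizontal: el.scrollWidth > el.clientWidth,
    vertical: el.scrollHeight > el.clientHeight,
  };
}

const viewport = { width: 1920, height: 600 };
console.log(scrollDelta({ left: 0, top: 900, width: 200, height: 30 }, viewport));
console.log(scrollableAxes({ scrollWidth: 234, clientWidth: 234, scrollHeight: 2000, clientHeight: 600 }));
```

An element whose bottom edge is at 930px in a 600px-high viewport needs the parent scrolled down by 330px, and a container whose scrollHeight exceeds its clientHeight can take that scroll through scrollTop.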
Example code: using the official Node.js documentation as an example, take a screenshot of the TTY entry in the left column
/**
 * Take a screenshot of the li node that contains TTY in the left column
 */
const path = require('path');
const puppeteer = require('puppeteer-core');
const log = console.log;

(async () => {
  const browser = await puppeteer.launch({
    executablePath: path.join(__dirname, './chromium/Chromium.app/Contents/MacOS/Chromium'),
    // Turn off headless mode so the browser window opens
    headless: false,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();
  await page.setViewport({width: 1920, height: 600});
  const viewport = page.viewport();
  // Official Node.js API documentation site.
  // Avoid page.waitFor(1000): 1000 is a magic number that makes the code fragile.
  // The wait conditions are passed to goto itself, because a separate waitForNavigation
  // call made after goto has resolved would wait for a navigation that never happens.
  await page.goto('https://nodejs.org/dist/latest-v10.x/docs/api/', {
    // 20-second timeout
    timeout: 20000,
    // Consider the page loaded once there are no more network connections
    waitUntil: ['domcontentloaded', 'networkidle0']
  });
  // step 2: determine the selector of the inner-scrolling parent element
  const containerEle = '#column2';
  // step 1: determine the target element selector
  const targetEle = '#column2 ul:nth-of-type(2) li:nth-of-type(40)';
  // step 1: get the coordinates of the target element in the current window
  let pos = await getElementBounding(page, targetEle);
  // Use the built-in DOM selector
  const ret = await page.evaluate(async (viewport, pos, element) => {
    // step 1: determine whether the target element is currently visible
    const sumX = pos.width + pos.left;
    const sumY = pos.height + pos.top;
    // Distance to move along the X and Y axes
    const x = sumX <= viewport.width ? 0 : sumX - viewport.width;
    const y = sumY <= viewport.height ? 0 : sumY - viewport.height;
    const el = document.querySelector(element);
    // step 3: scroll the element into the viewport
    // Check whether the parent element can scroll on each axis; if it cannot, scroll the window instead
    // If scrollWidth is greater than clientWidth, a horizontal scroll bar is present
    if (el.scrollWidth > el.clientWidth) {
      el.scrollLeft += x;
    } else {
      window.scrollBy(x, 0);
    }
    // If scrollHeight is greater than clientHeight, a vertical scroll bar is present
    if (el.scrollHeight > el.clientHeight) {
      el.scrollTop += y;
    } else {
      window.scrollBy(0, y);
    }
    return [el.scrollHeight, el.clientHeight];
  }, viewport, pos, containerEle);
  // step 4: the target was outside the window and inside the inner-scrolling parent,
  // so its coordinates must be fetched again. Scrolling triggers no navigation,
  // so there is nothing to wait for with waitForNavigation here.
  pos = await getElementBounding(page, targetEle);
  // step 5: take the screenshot
  await page.screenshot({
    path: 'scroll_and_bounding.png',
    type: 'png',
    clip: {
      x: pos.left,
      y: pos.top,
      width: pos.width,
      height: pos.height
    }
  });
  await browser.close();
})();

// Same helper as in the previous example
async function getElementBounding(page, element) {
  const pos = await page.$eval(element, e => {
    const {left, top, width, height} = e.getBoundingClientRect();
    return {left, top, width, height};
  });
  log('[Element position]: ', JSON.stringify(pos, undefined, 2));
  return pos;
}
3. Pitfalls: installing Chromium on Linux
It turns out that installing Chromium in a Linux environment is an unforgettable experience. Installing puppeteer automatically downloads Chromium, and for well-known reasons the download often fails. After switching the mirror source, Chromium downloads successfully, but it reports various errors on startup, caused by missing system dependencies on Linux. After installing the required dependencies, the code runs smoothly. However, the screenshots show Chinese text as rows of boxes. OK, install a Chinese font package, and the Chinese characters display normally!
Best practices after the pitfalls
- Install Chromium and the npm package separately: install only puppeteer-core and point executablePath at a self-downloaded Chromium. This greatly speeds up npm install
- Switch the Linux mirror source to Alibaba's mirror to download Chromium quickly
- Deploy the project with Docker, to avoid the situation where everything works in local development but all kinds of problems appear in production
- Avoid page.waitFor(1000): 1000 milliseconds is only a rough estimate. It is better to let the program decide for itself when the page is ready
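As a sketch of what letting the program decide can look like, the helper below waits for a concrete condition using page.waitForSelector and page.waitForFunction, both part of Puppeteer's Page API. The helper name waitForList, the selector, and the item count are illustrative assumptions; which condition actually signals readiness depends on the page being crawled.

```javascript
// Sketch: wait for a concrete condition instead of sleeping for a fixed time.
// `page` is expected to be a Puppeteer Page; the selector and the minimum
// item count are illustrative assumptions.
async function waitForList(page, itemSelector, minItems, timeout = 20000) {
  // Resolves as soon as at least one matching element exists in the DOM.
  await page.waitForSelector(itemSelector, { timeout });
  // Resolves once the in-page predicate returns a truthy value.
  await page.waitForFunction(
    (selector, count) => document.querySelectorAll(selector).length >= count,
    { timeout },
    itemSelector,
    minItems
  );
}
```

A call such as await waitForList(page, 'li', 10) returns as soon as ten list items exist and throws a timeout error after 20 seconds otherwise, so a slow page fails loudly instead of producing a half-loaded screenshot.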
Related solutions:
- The official troubleshooting guide
- Installing the dependency libraries on CentOS
yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc -y
- Installation tips for Alpine
# Use the Alibaba mirror source
echo "https://mirrors.aliyun.com/alpine/edge/main" > /etc/apk/repositories
echo "https://mirrors.aliyun.com/alpine/edge/community" >> /etc/apk/repositories
echo "https://mirrors.aliyun.com/alpine/edge/testing" >> /etc/apk/repositories
# Install Chromium and its dependencies, including Chinese font support
apk -U --no-cache update
apk -U --no-cache --allow-untrusted add zlib-dev xorg-server dbus ttf-freefont chromium wqy-zenhei@edge -f
Once installed, Chromium has to be launched with the sandbox disabled, although this is not officially recommended.
Linux sandbox: in computer security, a sandbox is a mechanism for isolating programs in order to limit the permissions of untrusted processes. Sandboxing is typically used to run untested or untrusted programs, so that they cannot interfere with the execution of other programs.
- --no-sandbox: run without the sandbox
- --disable-dev-shm-usage: by default, Docker runs a container with a /dev/shm shared memory space of 64MB. This is usually too small for Chrome and causes Chrome to crash when rendering large pages. To fix this, run the container with docker run --shm-size=1gb to increase the size of /dev/shm. Starting with Chrome 65, you can instead launch the browser with the --disable-dev-shm-usage flag, which writes the shared memory files to /tmp rather than /dev/shm.
const browser = await puppeteer.launch({
  args: ['--no-sandbox', '--disable-dev-shm-usage']
});
4. Deploying the project in a Docker container
Toward the end of the project, I found that Chromium had to be installed for every deployment, and that unexpected problems could crop up each time. To save time for more meaningful work, the deployment process above was streamlined with shell scripts and a Docker container.
Docker development process
- Determine the base image
- Write a Dockerfile based on the base image
- Build the project image from the Dockerfile
- Push the built image to a Docker registry (for a private deployment, export the image directly and import it into the customer's environment)
- Pull the project image on the test/production machine, then create and run a Docker container
- Verify that the project runs properly
The following uses the deployment of a Puppeteer-based service as an example.
Determine the base image
# Search for the Node image with docker search
docker search node
Visit Docker Hub for more detailed descriptions and the available versions
# Here node:10-alpine is chosen as the base image
docker pull node:10-alpine
Write the Dockerfile
(This walkthrough is not exhaustive; please look up more detailed material online.)
FROM: specifies the base image; it must be the first non-comment instruction in the Dockerfile (only ARG may come before it)
FROM <image name>
FROM node:10-alpine
MAINTAINER: sets the author of the image
MAINTAINER <author name>
# MAINTAINER is not recommended; use LABEL to specify the image author instead
LABEL MAINTAINER="zhangqiling"
RUN: executes a command in shell or exec form. The RUN instruction adds a new layer on top of the current image, and the committed result is used by the next instruction in the Dockerfile
RUN <command>
# RUN can execute any command, then creates and commits a new layer on top of the current image
RUN echo "https://mirrors.aliyun.com/alpine/edge/main" > /etc/apk/repositories
# When a command spans several lines, end each line with \ to continue it
RUN apk -U add \
    zlib-dev \
    xorg-server
The intermediate images created by RUN instructions are cached and reused in the next build. If you do not want to use the cached images, pass the --no-cache flag at build time, as in docker build --no-cache.
CMD: provides the container's default command. A Dockerfile may effectively contain only one CMD instruction; if there is more than one, only the last one takes effect
# There are three forms
CMD ["executable", "param1", "param2"]
CMD ["param1", "param2"]
CMD command param1 param2
COPY: copies files or directories from the build context into the image
COPY <src>... <dest>
COPY ["<src>", "<dest>"]
# Copy the project into my_app
COPY . /workspase/my_app
ADD: also copies files or directories from the build context into the image
ADD <src>... <dest>
ADD ["<src>", "<dest>"]
In contrast to COPY, the <src> of ADD can be a URL. Also, if the source is a local compressed tar archive, Docker automatically extracts it.
WORKDIR: Specifies the working directory of the RUN, CMD, and ENTRYPOINT commands
WORKDIR /workspase/my_app
ENV: Sets environment variables
# Two ways
ENV <key> <value>
ENV <key>=<value>
VOLUME: creates a mount point, so that a directory inside the container can hold data mounted from the host or from other containers
VOLUME ["/data"]
EXPOSE: specifies the port the container listens on at run time
EXPOSE <port>
Here is a Dockerfile example that has been tested
A few notes:
- Use the Alibaba Cloud mirror in China to speed up dependency installation
- Chinese is not displayed by default; the free WenQuanYi Chinese font must be installed, and that package can only be found at https://mirrors.aliyun.com/alpine/edge/testing/
- The default time zone in the container is not UTC+8, which affects log output. The time zone needs to be reset
- Running npm install in a Docker container on a CentOS machine reports an error; after setting npm config set unsafe-perm true it installs smoothly. The reason is still unclear. (Docker on macOS does not have this problem)
# Pull the node image
FROM node:10-alpine
# Set the image author
LABEL MAINTAINER="[email protected]"
# Use the Alibaba Cloud mirror in China; install Chromium 68, the free WenQuanYi Chinese font, and other dependencies
RUN echo "https://mirrors.aliyun.com/alpine/v3.8/main/" > /etc/apk/repositories \
    && echo "https://mirrors.aliyun.com/alpine/v3.8/community/" >> /etc/apk/repositories \
    && echo "https://mirrors.aliyun.com/alpine/edge/testing/" >> /etc/apk/repositories \
    && apk -U --no-cache update && apk -U --no-cache --allow-untrusted add \
    zlib-dev \
    xorg-server \
    dbus \
    ttf-freefont \
    chromium \
    wqy-zenhei@edge \
    bash \
    bash-doc \
    bash-completion -f
# Set the time zone
RUN rm -rf /etc/localtime && ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
# Set environment variables
ENV NODE_ENV production
# Create a directory for the project code
RUN mkdir -p /workspace
# Specify the working directory for the RUN, CMD, and ENTRYPOINT instructions
WORKDIR /workspace
# Copy all files from the current path on the host to the working directory in the image
COPY . /workspace
# Clear the npm cache
RUN npm cache clean --force && npm cache verify
# If set to true, disallow UID/GID switching when running package scripts
# RUN npm config set unsafe-perm true
# Install pm2
RUN npm i pm2 -g
# Install dependencies
RUN npm install
# Expose the port
EXPOSE 3000
# Run command
ENTRYPOINT pm2-runtime start docker_pm2.json
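The ENTRYPOINT above starts pm2-runtime with docker_pm2.json, a file the article does not show. The sketch below is one plausible shape for it, built as a JavaScript object and printed as JSON. It uses standard pm2 process-file fields (apps, name, script, instances, env), but the app name, the entry script, and the PORT variable are placeholders and assumptions, not the author's actual values.

```javascript
// Assumption: the article does not show docker_pm2.json. This is one plausible
// shape for it, using standard pm2 process-file fields; the app name and entry
// script are placeholders, not the author's values.
const pm2Config = {
  apps: [
    {
      name: 'puppeteer-service', // placeholder
      script: './app.js',        // placeholder entry point
      instances: 1,
      env: {
        NODE_ENV: 'production',  // matches ENV NODE_ENV in the Dockerfile
        PORT: 3000,              // assumed variable; matches EXPOSE 3000
      },
    },
  ],
};

// Print the JSON that would be saved as docker_pm2.json.
console.log(JSON.stringify(pm2Config, null, 2));
```

Whether the service actually reads PORT depends on the application code; the Dockerfile itself only documents the port with EXPOSE.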
Thanks to the following articles
- A tool that emulates browser behavior
- Scrolling: do you really understand it?
- Puppeteer documentation in Chinese
- An introduction to Linux sandbox technology
- How to write a good Dockerfile: Dockerfile best practices