Selenium is a well-known web test automation framework that supports the major browsers, provides a rich API, and is often used as a crawler tool. However, Selenium has obvious drawbacks: it is slow, it is demanding about version configuration, and, most troublesome of all, the matching browser driver has to be updated constantly.

Pyppeteer, another web test automation tool, is much easier to install, configure and run than Selenium, although it supports fewer browsers.

01. Pyppeteer overview

Before introducing Pyppeteer, let’s talk about Puppeteer. Puppeteer is a Node.js tool from Google that controls Chrome through its API using JavaScript code. It can be used for data crawling, automated testing of web applications, and similar tasks.

Pyppeteer is the Python version of Puppeteer. Below is a brief introduction to the two things Pyppeteer builds on, the Chromium browser and the asyncio framework:

1). Chromium

Chromium is a standalone browser that comes out of the project Google started in order to develop its own browser, Google Chrome. It can be seen as the experimental version of Chrome. Chromium is not as stable as Chrome, but it has richer features and is updated faster, usually with a new development build every few hours.

Pyppeteer’s web automation is based on Chromium, and thanks to this Pyppeteer is very easy to install and configure, as we’ll see later.

2). asyncio

asyncio is Python’s asynchronous coroutine library. It has been part of the standard library since version 3.4 and gives Python built-in support for asynchronous IO. It has been called Python’s most ambitious library, and the official Python documentation describes it in detail.
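To make the idea of coroutines concrete, here is a minimal sketch that uses only the standard library. The coroutine name wait_and_return and the 0.2-second delays are made up for illustration, and asyncio.run assumes Python 3.7 or later. Three coroutines wait at the same time, so the total is close to one delay rather than three.

```python
import asyncio
import time


async def wait_and_return(name, seconds):
    # asyncio.sleep yields control so other coroutines can run meanwhile.
    await asyncio.sleep(seconds)
    return name


async def main():
    start = time.perf_counter()
    # gather schedules all three coroutines and waits for every result,
    # returning the results in the order the coroutines were passed in.
    results = await asyncio.gather(
        wait_and_return("a", 0.2),
        wait_and_return("b", 0.2),
        wait_and_return("c", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```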

02. Installation and use

1). Minimal installation

Run pip install pyppeteer to install the pyppeteer library. Then run the pyppeteer-install command, which automatically downloads a compatible version of the Chromium browser to Pyppeteer’s default location.

If you don’t run the pyppeteer-install command, the Chromium browser will be automatically downloaded and installed the first time you use Pyppeteer, with the same effect. In general, Pyppeteer eliminates the need for driver configuration compared to Selenium.

Of course, the automatic Chromium installation may fail for some reason. In that case you can install it manually. First, download the Chromium package that matches your system from one of the following URLs:

'linux': 'https://storage.googleapis.com/chromium-browser-snapshots/Linux_x64/575458/chrome-linux.zip'
'mac': 'https://storage.googleapis.com/chromium-browser-snapshots/Mac/575458/chrome-mac.zip'
'win32': 'https://storage.googleapis.com/chromium-browser-snapshots/Win/575458/chrome-win32.zip'
'win64': 'https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/575458/chrome-win32.zip'
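As a small convenience, the sketch below picks the matching URL from the list above for the machine it runs on. The names CHROMIUM_URLS and platform_key are hypothetical helpers written for this example and are not part of Pyppeteer. Note that the 64-bit check looks at the Python interpreter, which is assumed here to match the operating system.

```python
import sys

# URLs copied from the list above (Chromium snapshot revision 575458).
BASE = "https://storage.googleapis.com/chromium-browser-snapshots"
CHROMIUM_URLS = {
    "linux": BASE + "/Linux_x64/575458/chrome-linux.zip",
    "mac": BASE + "/Mac/575458/chrome-mac.zip",
    "win32": BASE + "/Win/575458/chrome-win32.zip",
    "win64": BASE + "/Win_x64/575458/chrome-win32.zip",
}


def platform_key(platform=None, is_64bit=None):
    """Map sys.platform to one of the keys used in CHROMIUM_URLS."""
    if platform is None:
        platform = sys.platform
    if is_64bit is None:
        # True when the running Python interpreter is a 64-bit build.
        is_64bit = sys.maxsize > 2**32
    if platform.startswith("linux"):
        return "linux"
    if platform == "darwin":
        return "mac"
    if platform.startswith("win"):
        return "win64" if is_64bit else "win32"
    raise OSError("Unsupported platform: " + platform)


print(CHROMIUM_URLS[platform_key("linux")])
```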


Then unpack the archive into the directory that Pyppeteer uses for Chromium. For the default directory on Windows and on other systems, see the following figure:

2). Use

Try it out when you’re done. In the main function, first create a browser object, then open a new tab, visit Baidu’s home page, take a screenshot of the current page and save it as “example.png”, and finally close the browser. As mentioned above, Pyppeteer is built on asyncio, so it has to be used with the async/await structure.
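The steps just described can be sketched as follows. This is a minimal sketch rather than the article’s original listing; it assumes pyppeteer is installed and that Chromium can be downloaded or is already present. The try/finally block is an addition here so that the browser is closed even if the page fails to load.

```python
import asyncio


async def main():
    # Imported inside the function so that this module can still be loaded
    # on a machine where pyppeteer is not installed.
    from pyppeteer import launch

    browser = await launch()  # headless Chromium by default
    try:
        page = await browser.newPage()  # open a new tab
        await page.goto("https://www.baidu.com")  # visit Baidu's home page
        await page.screenshot({"path": "example.png"})  # save a screenshot
    finally:
        await browser.close()


# To run the example: asyncio.run(main())
```

Nothing happens until main() is actually run, for example with asyncio.run(main()) on Python 3.7 or later.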



No browser window appears while the program runs. This is because Pyppeteer uses a headless browser by default. If you want the browser window to be displayed, pass the parameter headless=False to the launch function. After the program has run, the captured web page image appears in the same directory:

03. In practice: crawling fund data asynchronously


We have been saying that Pyppeteer is a very efficient web test automation tool. Its key advantage is that it is built on asyncio: almost all of its methods are coroutines, so it supports asynchronous execution naturally and is very convenient for building asynchronous programs.

Here’s how sequential execution compares to asynchronous execution:

1). Fund crawling

The following figure shows the historical net value data of a fund. The page content is loaded by JavaScript, so it cannot be obtained directly with Requests. Instead, we can fetch the data by simulating browser operations. (In fact, there is an API for obtaining fund net value data. This task is only a demonstration and has no practical value.)

To make the effect more obvious, we crawl the net value data of the last 20 trading days for the first 50 funds on the fund list page (below).

2). Sequential execution

The basic idea of the program is to create a browser object and a page, then visit the net value data page of each fund in turn and scrape the data. The core code is as follows:
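The sequential pattern can be illustrated with a minimal sketch. It is not the article’s original program: fetch_fund is a hypothetical stand-in that only sleeps to simulate loading and parsing one fund page, and the fund codes and the 0.05-second delay are made up. With a real browser, the sleep would be replaced by a page visit followed by parsing.

```python
import asyncio
import time


async def fetch_fund(code, delay=0.05):
    """Hypothetical stand-in for visiting and parsing one fund page."""
    await asyncio.sleep(delay)  # simulates the page load time
    return {"code": code, "rows": 20}


async def crawl_sequentially(codes):
    results = []
    for code in codes:
        # Each await must finish before the next fund is requested,
        # so the total time is roughly the sum of all delays.
        results.append(await fetch_fund(code))
    return results


codes = ["%06d" % n for n in range(1, 11)]  # ten made-up fund codes
start = time.perf_counter()
results = asyncio.run(crawl_sequentially(codes))
elapsed = time.perf_counter() - start
print(len(results), "funds in", round(elapsed, 2), "seconds")
```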

The get_data() function in the code parses the net value data page and transforms the data, and the get_all_codes() function obtains the fund codes of all open-end funds (more than 6000 in total). The program uses the async/await structure only because Pyppeteer’s methods are coroutines and the program must be built this way. Inside the callurl_and_getData() function, the net value data of the funds is still fetched one after another.

To exclude the time needed to open the browser, we only timed page access and data fetching. The result was 12.08 seconds.


3). Asynchronous execution

Now let’s change the program. The functions and what they do remain the same. The main change is to turn the loop over fundList into asyncio task objects. The core code is as follows:
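A minimal sketch of this change, again not the article’s original program: fetch_fund is a hypothetical stand-in that only sleeps to simulate one page visit, and the fund codes are made up. With a real browser, each task would presumably need its own page object (an assumption of this sketch), because a single page cannot load several URLs at once.

```python
import asyncio
import time


async def fetch_fund(code, delay=0.05):
    """Hypothetical stand-in for visiting and parsing one fund page."""
    await asyncio.sleep(delay)  # simulates the page load time
    return {"code": code, "rows": 20}


async def crawl_concurrently(codes):
    # Wrap every coroutine in a task so that all of them start right away.
    tasks = [asyncio.create_task(fetch_fund(code)) for code in codes]
    # gather waits for all tasks and keeps the results in input order.
    return await asyncio.gather(*tasks)


codes = ["%06d" % n for n in range(1, 11)]  # ten made-up fund codes
start = time.perf_counter()
results = asyncio.run(crawl_concurrently(codes))
elapsed = time.perf_counter() - start
print(len(results), "funds in", round(elapsed, 2), "seconds")
```

Because all the waits overlap, the total time in this sketch is close to a single delay instead of the sum of all ten.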

The timing again starts once the browser is open, and the runtime is 2.18 seconds, which is about 5.5 times faster than sequential execution. As you can imagine, if the amount of data to crawl is large and sequential execution takes 10 hours, asynchronous execution may take less than 2 hours. The optimization effect is very obvious.