Reference article:

  • One technique a day: how to correctly remove the window.navigator.webdriver value in Selenium
  • Pyppeteer usage summary
  • Notes on logging in to Taobao with Pyppeteer

Pyppeteer is a Python port of Puppeteer, the headless-browser automation library. Headless browsers are widely used in automated testing, and they are also a handy tool for crawlers.

The biggest advantage of Puppeteer (and of headless browsers in general) is that it is an overwhelming answer to JS encryption: the crawler can ignore front-end encryption entirely, and for sites that require login it can simulate clicks and save cookies (a sketch of this follows). In many cases, front-end encryption is the hardest part of a site for a crawler to overcome. Puppeteer has disadvantages too. The biggest is efficiency: it is far less efficient than API-based crawlers, and even headless Chromium uses a lot of memory. Having to manage browser startup and shutdown is an extra maintenance burden as well.
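For example, a login flow can be automated and its session persisted. Here is a minimal sketch, assuming a hypothetical login page and selectors (page.type, page.click, page.cookies, and waitForNavigation are real pyppeteer APIs; everything site-specific below is made up):

import asyncio
import json

from pyppeteer import launch


async def save_login_cookies():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://example.com/login')  # hypothetical login page
    await page.type('#username', 'user')          # hypothetical selectors
    await page.type('#password', 'pass')
    # click and wait for the post-login navigation together, to avoid a race
    await asyncio.gather(page.waitForNavigation(), page.click('#submit'))
    cookies = await page.cookies()                # list of cookie dicts
    with open('cookies.json', 'w', encoding='utf-8') as f:
        json.dump(cookies, f)
    await browser.close()


asyncio.get_event_loop().run_until_complete(save_login_cookies())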

In this article, we will write a simple demo that crawls data from the Pinduoduo (PDD) search page. The final result looks like this:

We save the raw data for all API requests:

The following is an example JSON file:

The development environment

  • Python 3.6+

Python 3.7 is best, because asyncio gained the handy asyncio.run() method in 3.7, as the comparison below shows.
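For reference, here is the same coroutine driven both ways (they are alternatives, not a sequence):

import asyncio

async def main():
    print('hello')

# Python 3.7+: one call creates, runs, and closes the event loop
asyncio.run(main())

# Python 3.6: manage the event loop yourself
# asyncio.get_event_loop().run_until_complete(main())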

  • Install pyppeteer

Please check the official documentation if you have problems with the installation.

python3 -m pip install pyppeteer
  • Install Chromium

As you know, the network environment in China is complicated. If you rely on the Chromium build bundled with pyppeteer, the download takes a very long time, so we install Chromium manually and specify executablePath in the program.

Download: www.chromium.org/getting-inv…

Hello World

Pyppeteer’s Hello World app visits example.com and takes a screenshot:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({
        # Note: the executable path differs between Windows, Linux, and macOS
        'executablePath': 'path/to/your/Chromium.app/Contents/MacOS/Chromium',
    })
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

This section describes the most important pyppeteer interfaces.

pyppeteer.launch

Launches the browser. You can pass in a dictionary to configure a number of options, for example:

browser = await pyppeteer.launch({
    'headless': False,  # turn off headless mode
    'devtools': True,   # open Chromium's DevTools
    'executablePath': 'path/to/your/Chromium.app/Contents/MacOS/Chromium',
    'args': [
        '--disable-extensions',
        '--hide-scrollbars',
        '--disable-bundled-ppapi-flash',
        '--mute-audio',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-gpu',
    ],
    'dumpio': True,
})

All the optional args parameters are listed here: peter.sh/experiments…

What dumpio does: it pipes the headless browser process’s stderr and stdout to the main application. If it is set to True, Chromium console output will be printed in the main application.
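A minimal sketch to try it out; per the paragraph above, Chromium's piped output should show up in your terminal alongside your own prints:

import asyncio
from pyppeteer import launch


async def main():
    # dumpio=True pipes the Chromium process's stdout/stderr to this process
    browser = await launch({'dumpio': True})
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.evaluate('console.log("hello from Chromium")')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())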

Injecting JS scripts

Use page.evaluate, for example:

await page.evaluate("""() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => false
        }
    })
}""")

We will see that this step is critical: a Puppeteer-driven browser (presumably for policy reasons) sets window.navigator.webdriver to true, telling the site "I am a browser driven by webdriver". Smarter sites (with better anti-crawler measures) use this to decide whether a visitor is a crawler. For the details, see the first reference article: One technique a day: how to correctly remove the window.navigator.webdriver value in Selenium.

This is equivalent to typing the same JS code into the DevTools console. You can also check the effect from Python, as below.
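A quick way to verify the patch is to evaluate the property before and after injection; this minimal sketch reuses only APIs already shown above:

import asyncio
from pyppeteer import launch


async def check_webdriver_flag():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    print(await page.evaluate('navigator.webdriver'))  # True in headless mode
    await page.evaluate("""() => {
        Object.defineProperties(navigator, {
            webdriver: { get: () => false }
        })
    }""")
    print(await page.evaluate('navigator.webdriver'))  # now False
    await browser.close()


asyncio.get_event_loop().run_until_complete(check_webdriver_flag())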

You can also load a js file:

await page.addScriptTag(path=path_to_your_js_file)

Many useful operations can be done by injecting JS scripts, for example automatically pulling the page down; a one-line sketch follows.
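As a trivial sketch of the idea (the full, controllable pull-down script is developed in the Pinduoduo section below):

async def scroll_to_bottom(page):
    # one-shot version: jump straight to the current bottom of the page
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')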

Intercepting requests and responses

await page.setRequestInterception(True)
page.on('request', intercept_request)
page.on('response', intercept_response)

intercept_request and intercept_response are essentially two registered callbacks: the first fires just before the browser sends a request, the second when a response comes back.

For example, you can disable images, multimedia resources, and websocket requests:

async def intercept_request(req):
    """Request filtering"""
    if req.resourceType in ['image', 'media', 'eventsource', 'websocket']:
        await req.abort()
    else:
        await req.continue_()

Then, print every response as it arrives (only fetch and XHR responses are printed here):

async def intercept_response(res):
    resourceType = res.request.resourceType
    if resourceType in ['xhr', 'fetch']:
        resp = await res.text()
        print(resp)

pyppeteer distinguishes the following resourceTypes (this list follows the Puppeteer documentation): document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.

Pinduoduo search crawler

Automatically pulling the page down

The Pinduoduo search page is an infinite-scroll page. We want to keep pulling it down automatically, but also be able to make the program exit early: scrolling forever is pointless when we may not need that much data.

JS script

async () => {
    await new Promise((resolve, reject) => {
        // Maximum height to scroll to; null means unlimited
        const maxScrollHeight = null;
        // Maximum number of pull-downs; null means unlimited
        const maxScrollTimes = null;
        let currentScrollTimes = 0;

        // Record the last scrollHeight to determine whether the pull-down
        // succeeded, so we can end the pull-down early
        let scrollHeight = 0;

        // maxTries: sometimes a pull-down fails simply because of network speed
        let maxTries = 5;
        let tried = 0;

        const timer = setInterval(() => {
            // Pull-down failed, exit early
            // BUG: this can also trigger when the network is just slow,
            // which is why the maxTries variable exists
            if (document.body.scrollHeight === scrollHeight) {
                tried += 1;
                if (tried >= maxTries) {
                    console.log("reached the end, now finished!");
                    clearInterval(timer);
                    resolve();
                }
            }
            scrollHeight = document.body.scrollHeight;
            window.scrollTo(0, scrollHeight);
            window.scrollBy(0, -10);

            // Check whether maxScrollTimes is set
            if (maxScrollTimes) {
                if (currentScrollTimes >= maxScrollTimes) {
                    clearInterval(timer);
                    resolve();
                }
            }

            // Check whether maxScrollHeight is set
            if (maxScrollHeight) {
                if (scrollHeight >= maxScrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }

            currentScrollTimes += 1;
            // Reset the retry counter
            tried = 0;
        }, 1000);
    });
};

There are several important parameters:

  • interval: the pull-down interval, in milliseconds (the 1000 passed to setInterval)
  • maxScrollHeight: the maximum height the page is allowed to scroll to
  • maxScrollTimes: the maximum number of pull-downs (recommended, as it gives the best control over how much data you crawl)
  • maxTries: how many times to retry when a pull-down fails, for example because the network could not load new content within interval ms

Replace these with the values you need; a sketch of setting them from Python follows below. You can also open Chrome’s developer tools and run the script there first.
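Since page.evaluate just receives a string, one way to wire these parameters in from Python is to keep the script as a template and substitute values before injecting it. This is only a minimal sketch; the template name, placeholders, and helper below are assumptions for illustration, not part of the original code:

# hypothetical: the pull-down script from above, with two placeholders added
SCROLL_JS_TEMPLATE = """
async () => {
    await new Promise((resolve, reject) => {
        const maxScrollHeight = %(max_scroll_height)s;
        const maxScrollTimes = %(max_scroll_times)s;
        // ... the rest of the script, unchanged ...
    });
};
"""


def build_scroll_script(max_scroll_height=None, max_scroll_times=30):
    def to_js(v):
        # JS expects null where Python uses None
        return 'null' if v is None else str(v)
    return SCROLL_JS_TEMPLATE % {
        'max_scroll_height': to_js(max_scroll_height),
        'max_scroll_times': to_js(max_scroll_times),
    }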

The complete code

The complete program is only a little over 70 lines and fairly simple; adapt it to your actual needs.

import os
import time
import json
from urllib.parse import urlsplit
import asyncio
import pyppeteer
from scripts import scripts

BASE_DIR = os.path.dirname(__file__)


async def intercept_request(req):
    """Request filtering"""
    if req.resourceType in ['image', 'media', 'eventsource', 'websocket']:
        await req.abort()
    else:
        await req.continue_()


async def intercept_response(res):
    resourceType = res.request.resourceType
    if resourceType in ['xhr', 'fetch']:
        resp = await res.text()

        url = res.url
        tokens = urlsplit(url)

        folder = BASE_DIR + '/' + 'data/' + tokens.netloc + tokens.path + "/"
        if not os.path.exists(folder):
            os.makedirs(folder, exist_ok=True)
        filename = os.path.join(folder, str(int(time.time())) + '.json')
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(resp)


async def main():
    browser = await pyppeteer.launch({
        # 'headless': False,
        # 'devtools': True
        'executablePath': '/Users/changjiang/apps/Chromium.app/Contents/MacOS/Chromium',
        'args': [
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-gpu',
        ],
        'dumpio': True,
    })
    page = await browser.newPage()

    await page.setRequestInterception(True)
    page.on('request', intercept_request)
    page.on('response', intercept_response)

    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                            '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299')
    await page.setViewport({'width': 1080, 'height': 960})
    await page.goto('http://yangkeduo.com')
    await page.evaluate(""" () =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) } """)
    await page.evaluate("That section of your page automatically pulls down the JS script.")
    await browser.close()


if __name__ == '__main__':
    asyncio.run(main())


If, like me, you love computer science and fundamental logic, you are welcome to follow my WeChat official account.