Node.js crawler: crawling a waterfall-flow web page for HD wallpapers

Static web pages can be fetched in full with a single GET request. Dynamic web pages that request their data asynchronously must be rendered by a browser before they can be scraped. This article describes how to continuously crawl a waterfall-flow (infinite-scroll) web page.

When crawlers come up, many people think of Python first. A Node.js crawler is just as simple; compared with Python, the differences mostly come down to "async" versus "multi-threaded" performance trade-offs. I don't know Python well, so I won't comment on it further.

PhantomJS is a headless WebKit browser, a browser without a visible window; see phantomjs.org for details on how to install it. PhantomJS ships with a command-line tool that runs scripts via the phantomjs xxx.js command. The phantom npm package lets you drive PhantomJS from Node.js, which also means you can use PM2 for daemonizing and load balancing.
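As a quick orientation, here is a minimal lifecycle sketch with the phantom package (assuming it was installed with npm install phantom): create an instance, open a page, read a property, and always shut the spawned process down.

const phantom = require('phantom');

(async function() {
    const instance = await phantom.create();                 // spawns a phantomjs process
    const page = await instance.createPage();
    const status = await page.open('https://example.com');   // resolves to 'success' or 'fail'
    console.log(status, await page.property('title'));       // read a page property
    await instance.exit();                                   // kill the phantomjs process when done
}());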

The target

Crawl more than 200 anime wallpapers at 1920*1080 resolution; the target is Baidu's waterfall-flow image search page.

The approach

A waterfall-flow page decides whether to load more content based on the scroll position, so we use PhantomJS to scroll the page and collect more image links. When you first enter the detail page of a single image, you get a compressed image; this is a measure Baidu takes to speed up page loads. After a few seconds, the image's src is replaced with the link to the full-size image. Therefore, wait a few seconds after entering the detail page before reading the src; how long depends on your network speed.
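Since the right wait time depends on network speed, a fixed delay is fragile. A hedged alternative (a sketch only, assuming the phantom page object and the delay helper defined in the steps below, and the #currentImg selector used later in this article) is to poll until the src stops pointing at the compressed thumbnail:

// Poll the detail page until #currentImg's src is swapped for the full-size link
async function waitForFullImage(tries = 10) {
    let last = null
    for (let i = 0; i < tries; i++) {
        const src = await page.evaluate(function() {
            var img = document.querySelector('#currentImg')
            return img ? img.src : null
        })
        if (last && src && src !== last) return src   // src was swapped: full-size link
        last = src
        await delay(1)                                // wait a second and look again
    }
    return last                                       // fall back to whatever we have
}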

The steps

Get the links

Start by opening the web page with phantom

const phantom = require('phantom');

(async function() {
    const instance = await phantom.create();
    const page = await instance.createPage();
    const status = await page.open(url);
    await page.property('viewportSize', {
        width: 1920,
        height: 1080
    })
}())
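page.open resolves to a status string, so it is worth bailing out early on a failed load. A minimal sketch to add inside the async function above:

    // status is 'success' when the page loaded, 'fail' otherwise
    if (status !== 'success') {
        console.error('Failed to open', url)
        await instance.exit()
        return
    }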

Count the collected links; while there are fewer than 200, keep scrolling the page

// Add a delay function to wait for the page to load before scrolling
function delay(second) {
    return new Promise((resolve) => {
        setTimeout(resolve, second * 1000);
    });
}
const cheerio = require('cheerio')

let $
async function pageScroll(i) {
    await delay(5)
    await page.property('scrollPosition', {
        left: 0,
        top: 1000 * i
    })
    let content = await page.property('content')
    $ = cheerio.load(content)
    console.log($('.imgbox').length)
    if ($('.imgbox').length < 200) {
        await pageScroll(++i)
    }
}
await pageScroll(0)

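One caveat: if Baidu stops returning new results, the recursion above never reaches 200 and scrolls forever. A defensive variant (a sketch; the maxScrolls cap and the stall check are my additions, not part of the original) stops when the count stops growing:

async function pageScroll(i, lastCount = 0, maxScrolls = 60) {
    if (i >= maxScrolls) return                   // hard cap on scroll attempts
    await delay(5)
    await page.property('scrollPosition', { left: 0, top: 1000 * i })
    let content = await page.property('content')
    $ = cheerio.load(content)                     // reuses the outer `let $`
    const count = $('.imgbox').length
    if (count >= 200) return                      // collected enough links
    if (i > 0 && count === lastCount) return      // count stalled: likely no more results
    await pageScroll(i + 1, count)
}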

Extract the image links

let urlList = []
$('.imgbox').each(function() {
    urlList.push('https://image.baidu.com' + $(this).find('a').attr('href'))
})
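A small hedged refinement: boxes rendered mid-load can lack an href, and duplicates are cheap to drop with a Set.

let urlList = []
$('.imgbox').each(function() {
    const href = $(this).find('a').attr('href')
    if (href) {                                   // skip boxes without a link yet
        urlList.push('https://image.baidu.com' + href)
    }
})
urlList = [...new Set(urlList)]                   // dedupe, just in case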

Save the picture

Define the function that saves the picture

const request = require('request')
const fs = require('fs')

function save(url) {
    let ext = url.split('.').pop()
    request(url).pipe(fs.createWriteStream(`./image/${new Date().getTime()}.${ext}`));
}
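Two pitfalls worth guarding against: fs.createWriteStream fails if the ./image directory does not exist, and two downloads started in the same millisecond would collide on the timestamp filename. A hedged variant of save covering both:

const request = require('request')
const fs = require('fs')

// Make sure the output directory exists before any write
if (!fs.existsSync('./image')) {
    fs.mkdirSync('./image')
}

let counter = 0
function save(url) {
    let ext = url.split('.').pop()
    // timestamp + counter avoids collisions within the same millisecond
    let file = `./image/${Date.now()}_${counter++}.${ext}`
    request(url)
        .on('error', (err) => console.error('Download failed:', url, err.message))
        .pipe(fs.createWriteStream(file))
}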

When iterating over urlList, recursion is recommended: in a forEach loop the delay has no effect, because forEach does not wait for the promises its callback returns (an awaited for...of loop also works; see the sketch after this step).

async function imgSave(i) {
    let status = await page.open(urlList[i])
    await delay(1)
    let content = await page.property('content')
    $ = cheerio.load(content)
    let src = $('#currentImg').attr('src')
    if (src) save(src)                        // src can be missing on protected pages
    if (i < urlList.length - 1) {
        await imgSave(++i)
    }
}
await imgSave(0)
await imgSave(0)
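For reference, this is why the loop version fails: Array.prototype.forEach ignores the promise an async callback returns, so all iterations fire at once and the delay never gates them. A for...of loop inside an async function awaits each step, so this sketch is equivalent to the recursion above:

async function imgSaveAll() {
    for (const detailUrl of urlList) {
        await page.open(detailUrl)
        await delay(1)                        // wait for the full-size src swap
        let content = await page.property('content')
        $ = cheerio.load(content)
        let src = $('#currentImg').attr('src')
        if (src) save(src)
    }
}
await imgSaveAll()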

Finally, the results of the crawl are shown in the figure: the images come out in high resolution, though a few are protected by anti-crawling measures and fail to download.

The complete code

const phantom = require('phantom')
const cheerio = require('cheerio')
const request = require('request')
const fs = require('fs')

function delay(second) {
    return new Promise((resolve) => {
        setTimeout(resolve, second * 1000);
    });
}

let url = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%8A%A8%E6%BC%AB+%E5%A3%81%E7%BA%B8&oq=%E5%8A%A8%E6%BC%AB+%E5%A3%81%E7%BA%B8&rsp=-1'

function save(url) {
    let ext = url.split('.').pop()
    request(url).pipe(fs.createWriteStream(`./image/${new Date().getTime()}.${ext}`));
}

(async function() {
    let instance = await phantom.create();
    let page = await instance.createPage();
    let status = await page.open(url);
    await page.property('viewportSize', {
        width: 1920,
        height: 1080
    })
    let $
    async function pageScroll(i) {
        await delay(1)
        await page.property('scrollPosition', {
            left: 0,
            top: 1000 * i
        })
        let content = await page.property('content')
        $ = cheerio.load(content)
        if ($('.imgbox').length < 200) {
            await pageScroll(++i)
        }
    }
    await pageScroll(0)
    let urlList = []
    $('.imgbox').each(function() {
        urlList.push('https://image.baidu.com' + $(this).find('a').attr('href'))
    })
    async function imgSave(i) {
        let status = await page.open(urlList[i])
        await delay(1)
        let content = await page.property('content')
        $ = cheerio.load(content)
        let src = $('#currentImg').attr('src')
        if (src) save(src)
        if (i < urlList.length - 1) {
            await imgSave(++i)
        }
    }
    await imgSave(0)
    await instance.exit()
}());

My blog: www.bougieblog.cn. Feel free to drop by and chat.