It has been a year since I wrote the second tutorial, "Node.js crawler (II)", which explained how to use puppeteer to drive a headless browser and crawl dynamically loaded data. There was no update for so long because my goal in learning crawlers was to build an automated crawler system: last year I developed the Touch Fish Hot List, and after that I did not continue researching crawlers.
Developing the Touch Fish Hot List involved many functional modules: interface design and front-end development, highly customized back-end crawler functions and a login module, data table design, automated deployment, and so on. I learned a lot along the way, and this time I will take out the automatic crawler function and talk about how the timed crawl is implemented:
First, open the website you want to crawl and check whether the page is static or dynamic. If it is static, the job is easy: refer directly to the first tutorial on using Cheerio to crawl the DOM. If the page loads its data dynamically via ajax, there is another way besides the one in the second tutorial: press F12 to open the developer tools, look through the network requests, and find the address of the request that returns the data you want to crawl.
I take the Baidu hot list as an example. I will crawl the Baidu hot list headlines once every minute, combine the title of the top entry with its search volume, and print them. It is a dynamically loaded page, so use the method above to find the request address top.baidu.com/mobile_v2/b… and then try to crawl it:
    const request = require('request')

    // Request the JSON endpoint found in the developer tools
    request('http://top.baidu.com/mobile_v2/buzz/hotspot', (err, res) => {
      if (err) {
        console.log(err.code)
        return false
      }
      // The body is a JSON string; the headlines are under result.topwords
      let data = JSON.parse(res.body).result.topwords
      console.log(`${data[0].keyword} - ${data[0].searches} searches`)
    })
    // Run result -> 11 killed in Shiyan explosion
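The crawl above assumes the response is always valid JSON with at least one entry. As a minimal sketch of a more defensive version, the parsing step can be pulled into a small function that returns null instead of throwing. The name formatTopWord is hypothetical, and the payload shape (result.topwords with keyword and searches fields) is taken from the code above; the sample body is made up, not real Baidu data.

```javascript
// Hypothetical helper: turn the raw response body into a printable line.
// Assumes the payload shape used above:
//   { result: { topwords: [{ keyword, searches }, ...] } }
function formatTopWord(body) {
  let parsed;
  try {
    parsed = JSON.parse(body);
  } catch (e) {
    return null; // body was not valid JSON (for example an HTML error page)
  }
  const topwords = parsed && parsed.result && parsed.result.topwords;
  if (!Array.isArray(topwords) || topwords.length === 0) {
    return null; // the shape changed or the list is empty
  }
  const { keyword, searches } = topwords[0];
  return `${keyword} - ${searches} searches`;
}

// Usage with a made-up sample body (not real Baidu data):
const sample = JSON.stringify({
  result: { topwords: [{ keyword: 'example headline', searches: 12345 }] },
});
console.log(formatTopWord(sample));          // example headline - 12345 searches
console.log(formatTopWord('<html></html>')); // null
```

Inside the request callback, this would replace the JSON.parse line: call formatTopWord(res.body) and skip printing when it returns null.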
schedule.scheduleJob() from node-schedule is used to start a scheduled task. The first argument is the timer rule, that is, how often you want the job to run. The second argument is the function to execute.
Cron-style timers are recommended. For example, '* * * * * *' has six placeholders for second, minute, hour, day of month, month, and day of week, and an asterisk (*) matches every value.
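- Trigger at the 30th second of every minute: '30 * * * * *'
- Trigger at 1 minute 30 seconds past every hour: '30 1 * * * *'
- Trigger every day at 1:01:30 am: '30 1 1 * * *'
- Trigger at 1:01:30 am on the 1st of each month: '30 1 1 1 * *'
- Trigger at 1:01:30 am on January 1 of every year: '30 1 1 1 1 *' (the six fields have no year field, so a specific year such as 2016 cannot be written in the expression; for a one-off run, pass a Date object to schedule.scheduleJob() instead)
- Trigger at 1:01:30 am every Monday: '30 1 1 * * 1'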
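It is easy to miscount fields in expressions like these. As a rough sketch, a few lines of plain JavaScript can label each field and reject an expression that does not have exactly six. The describeCron helper is hypothetical and is not part of node-schedule; it only checks the field count, not the value ranges.

```javascript
// Field order for six-field expressions, as described above.
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

// Hypothetical helper: label each field of a six-field cron expression.
// It checks the number of fields only; it does not validate ranges.
function describeCron(expr) {
  const parts = expr.trim().split(/\s+/);
  if (parts.length !== FIELD_NAMES.length) {
    throw new Error(`expected 6 fields, got ${parts.length}: "${expr}"`);
  }
  const labeled = {};
  FIELD_NAMES.forEach((name, i) => {
    labeled[name] = parts[i];
  });
  return labeled;
}

const weekly = describeCron('30 1 1 * * 1');
console.log(weekly.second);    // 30
console.log(weekly.dayOfWeek); // 1

try {
  describeCron('30 1 1 1 1 * *'); // seven fields
} catch (e) {
  console.log(e.message); // expected 6 fields, got 7: "30 1 1 1 1 * *"
}
```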
Now I want to crawl Baidu's hot headlines every minute, so:
    const schedule = require('node-schedule')
    const request = require('request')

    // '0 * * * * *' fires at second 0 of every minute
    schedule.scheduleJob('0 * * * * *', () => {
      request('http://top.baidu.com/mobile_v2/buzz/hotspot', (err, res) => {
        if (err) {
          console.log(err.code)
          return false
        }
        // The body is a JSON string; the headlines are under result.topwords
        let data = JSON.parse(res.body).result.topwords
        console.log(`${data[0].keyword} - ${data[0].searches} searches`)
      })
    })
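For intuition about what the rule '0 * * * * *' asks for, here is a sketch that does the same "run at second 0 of every minute" using only built-in timers. It is not how node-schedule is implemented, and the names msUntilNextMinute and everyMinute are hypothetical; it only illustrates the idea of waiting until the next minute boundary and then re-arming.

```javascript
// Milliseconds from `now` until the next time the seconds field reads 0.
function msUntilNextMinute(now = new Date()) {
  const elapsed = now.getSeconds() * 1000 + now.getMilliseconds();
  return 60000 - elapsed;
}

// Hypothetical helper: run `task` at the start of every minute until stopped.
// Returns a function that cancels the pending timer.
function everyMinute(task) {
  let timer = null;
  const arm = () => {
    timer = setTimeout(() => {
      arm();  // re-arm first so a throwing task does not stop the loop
      task();
    }, msUntilNextMinute());
  };
  arm();
  return () => clearTimeout(timer);
}

// 45.5 seconds into a minute leaves 14.5 seconds to wait:
console.log(msUntilNextMinute(new Date(2021, 5, 13, 10, 0, 45, 500))); // 14500

// Usage (commented out so the snippet exits immediately):
// const stop = everyMinute(() => console.log('crawl now'));
// stop(); // call this later to cancel
```

For real projects the cron rule is easier to read and change, which is why the article sticks with node-schedule.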
The running results are as follows:
Done~