This post introduces a simple crawler framework. The focus is on simplicity, so let's jump straight into trying it.
Getting started

[Demo GIF of the crawler running; it may be slow to load]

Let's walk through it step by step.
```
~ $ npm install crawl-pet -g
```

Install crawl-pet globally.
```
~ $ cd /Volumes/M/download
```

Go to the directory where you want to create the new project.
```
download $ crawl-pet new
```

Create a project and fill in the parameters as prompted.
```
ctrl + c
```

If you need to customize the crawler rules, exit and edit the crawler.js file in the project directory:
```js
module.exports = {

    /*****************
     * Info part
     *****************/

    projectDir : __dirname,
    url        : "https://imgur.com/r/funny",
    outdir     : "/Volumes/M/download/imgur.com",
    saveMode   : "group",
    keepName   : true,
    limits     : 5,
    timeout    : 60000,
    limitWidth : 400,
    limitHeight: 400,
    proxy      : "http://127.0.0.1:1087",
    userAgent  : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    cookies    : null,
    fileTypes  : "png|gif|jpg|jpeg|svg|xml|mp3|mp4|pdf|torrent|zip|rar",
    sleep      : 1000,
    crawl_data : {},
    // crawl_js : "./parser.js",

    /*****************
     * Crawler part
     *****************/

    // init(queen) {},

    prep(queen) {
        let url = queen.head.url;
        let m = url.match(/^(https?:\/\/)?(([\w\-]+\.)?imgur\.com)\/*/i);
        if (m) {
            url = (!m[1] ? 'https://' : '') + url.replace(/\/page(\/\d+(\/hit\.json)?)?$|\/+$/i, '');
            if (!/\/(new|top|hot)$/i.test(url)) {
                url += '/new';
            }
            queen.head.url = url + '/page/0/hit.json';
            queen.save('api_url', url);
            queen.save('page_offset', 0);
        }
    },

    // start(queen) {},
    // filter(url) {},
    // filterDownload(url) {},
    // willLoad(request) {},

    loaded(body, links, files, crawler) {
        if (!/hit\.json/i.test(crawler.url)) {
            return;
        }
        try {
            let json = JSON.parse(body);
            let data = json.data;
            if (!data || data.length === 0) {
                return;
            }
            let add_down = 0;
            for (let pic of data) {
                if (crawler.appendDownload('https://i.imgur.com/' + pic.hash + pic.ext)) {
                    add_down += 1;
                }
            }
            if (add_down) {
                let api_url = crawler.read('api_url');
                let offset  = crawler.read('page_offset');
                let add = 5;
                while (add-- > 0) {
                    offset++;
                    crawler.appendPage(api_url + '/page/' + offset + '/hit.json');
                }
                crawler.save('page_offset', offset);
            }
        } catch (err) {
            // PASS
        }
    },

    // browser(crawler) {}
}
```
To clarify: two hook functions are overridden here, prep(queen) and loaded(body, links, files, crawler). See the project documentation for the others.
prep(queen) is a preprocessing function. It is called the first time a project is run, and again on the first run after a reset. Here it rewrites the start URL to fit the Imgur API, as sketched below.
Imgur's address structure is:

```
https://imgur.com/<section>/<sort method>/page/<page number>/hit.json
```
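To make the rewrite concrete, here is a small standalone sketch (not part of crawl-pet; the helper name toApiUrl and the console output are mine, but the regexes mirror the prep() in crawler.js above):

```js
// Standalone illustration of the URL rewrite performed by prep() above.
// toApiUrl is a made-up helper name used only for this example.
function toApiUrl(url) {
    const m = url.match(/^(https?:\/\/)?(([\w\-]+\.)?imgur\.com)\/*/i);
    if (!m) return null;
    // strip any trailing /page/... segment or trailing slashes, add the protocol if missing
    let base = (!m[1] ? 'https://' : '') + url.replace(/\/page(\/\d+(\/hit\.json)?)?$|\/+$/i, '');
    // default to the "new" sort if no sort method is present
    if (!/\/(new|top|hot)$/i.test(base)) base += '/new';
    return base + '/page/0/hit.json';
}

console.log(toApiUrl('https://imgur.com/r/funny'));
// → https://imgur.com/r/funny/new/page/0/hit.json
```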
loaded(body, links, files, crawler) is called whenever a page has finished loading. The pieces used in this example are:

- body: the text content of the loaded page
- crawler: the crawler object for the current page
- crawler.appendPage(url): adds a page URL to the crawl queue
- crawler.appendDownload(url): adds a file URL to the download queue
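As a rough sketch of how these fit together (my own illustration, not taken from the crawl-pet docs; the image regex and the follow-up URL are placeholders), a loaded() handler typically scans body and feeds the two queues:

```js
module.exports = {
    // ... config keys as in crawler.js above ...

    // Bare-bones loaded() skeleton; only appendDownload/appendPage come from crawl-pet.
    loaded(body, links, files, crawler) {
        // queue every image URL found in the raw page text (illustrative regex)
        for (let src of body.match(/https?:\/\/[^"' ]+\.(?:png|jpe?g|gif)/gi) || []) {
            crawler.appendDownload(src);
        }
        // queue a follow-up page to crawl (placeholder URL)
        crawler.appendPage('https://example.com/page/2');
    }
}
```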
In this example, because the requested page is JSON, we parse the body as JSON and add each image URL to the download queue with appendDownload. appendDownload returns false if the file has already been downloaded, so the counter only tracks genuinely new images. If any new images were added, the next five page URLs are generated and queued with appendPage.
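For reference, the handler assumes a response shaped roughly like this (only the data array and the hash/ext fields are taken from the code; the sample values and any other fields in Imgur's real payload are hypothetical):

```js
// Hypothetical, trimmed-down hit.json body as assumed by loaded() above.
const body = `{
    "data": [
        { "hash": "CstcePq", "ext": ".png" },
        { "hash": "aBcDeFg", "ext": ".jpg" }
    ]
}`;

// Each entry becomes a direct image URL: 'https://i.imgur.com/' + pic.hash + pic.ext
for (let pic of JSON.parse(body).data) {
    console.log('https://i.imgur.com/' + pic.hash + pic.ext);
}
// → https://i.imgur.com/CstcePq.png
// → https://i.imgur.com/aBcDeFg.jpg
```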
One more useful utility command: you can look up a file's original download link from its local file name.
```
$ crawl-pet -f local "CstcePq.png"
```
See the help for more commands:

```
~ $ crawl-pet -h
```
——————————————————————-
GitHub address: github.com/wl879/crawl…
There are some examples in the crawlers folder of the project.