What is a crawler?

Wikipedia explains it this way:

A crawler is an “automated web browsing” program, also known as a web bot or spider. Crawlers are widely used by Internet search engines and similar sites to obtain or update their content and indexes. They automatically fetch every page they can reach so that a search engine can process them further (index and rank the downloaded pages), which lets users retrieve the information they need more quickly.

The robots protocol

robots.txt is an ASCII-encoded text file stored in the root directory of a website. It tells search-engine crawlers (also called spiders or robots) which parts of the site they should not fetch and which parts they may fetch.

The robots.txt protocol is not a standard, only a convention, so it cannot guarantee a site's privacy.

To put it plainly, compliance is not something that can be enforced; it is a gentlemen's agreement that guards against gentlemen but not against villains. Still, ignoring it can lead to accusations of unfair competition; you can search for related cases ~

Here is a short list of common robots.txt rules, just to give a general impression; they are also helpful for understanding crawler logic (a small example file follows the list):

  • Allow all robots: User-agent: *
  • Allow only a specific robot: User-agent: name_spider
  • Block all robots: Disallow: /
  • Block robots from a specific directory: Disallow: /images/
  • ...
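For example, a minimal robots.txt combining rules like these might look as follows (the directory names and the name_spider crawler name are just placeholders):

User-agent: name_spider
Disallow: /

User-agent: *
Disallow: /images/
Disallow: /admin/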

Anti-crawler measures (anti-spider)


Most websites defend against crawlers from three angles:

  • The request headers (Headers) sent by the client
  • User behavior
  • The site's directory structure and how data is loaded

The first two are the ones you will run into most often, and most sites defend themselves from these angles. The third is used by sites that load data via Ajax, which makes crawling more difficult.

Anti-crawler based on Headers

Many sites inspect the request Headers, most commonly:

  • User-Agent
  • Referer

Counter-strategy: add Headers to the crawler's requests and copy a real browser's User-Agent into them, or set the Referer to the target site's domain.
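For example, a minimal sketch with superagent (the HTTP library used later in this article); the User-Agent and Referer values below are placeholders copied from a normal browser, not anything specific to the target site:

const superagent = require('superagent');

var targetUrl = 'https://cnodejs.org/';

// Send request headers that mimic a normal browser visit (placeholder values)
superagent.get(targetUrl)
	.set('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36')
	.set('Referer', 'https://cnodejs.org/')
	.end(function(err, res) {
		console.log(res.status);
	});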

Anti-crawler based on user behavior

  • Sites detect suspicious user behavior such as:
    • Many visits to the same page from the same IP address within a short period of time
    • The same account performing the same operation many times within a short period of time

Counter-strategies: 1. write a separate crawler to scrape the proxy IPs published openly on the Internet and switch to a new IP every few requests; 2. wait a random few seconds after each request before sending the next one.
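For example, a minimal sketch of the random-delay idea (rotating proxy IPs additionally requires a pool of proxies, which is omitted here):

const superagent = require('superagent');

// Wait a random 1-3 seconds before each request so the traffic looks less bot-like
function randomDelayGet(url, done) {
	var delay = 1000 + Math.floor(Math.random() * 2000);
	setTimeout(function() {
		superagent.get(url).end(done);
	}, delay);
}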

Anti-crawler for dynamic pages

Most of the cases above involve static pages, but on some sites the data we need is fetched through Ajax requests or generated by JavaScript.

Counter-strategy: find the Ajax requests (for example in the browser's Network panel), work out what their parameters and responses mean, then parse the response JSON to extract the data we need.
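For example, a minimal sketch of the idea, assuming the page loads its data from a hypothetical JSON endpoint discovered in the Network panel:

const superagent = require('superagent');

// 'https://example.com/api/topics' is a hypothetical endpoint found in the Network panel
superagent.get('https://example.com/api/topics')
	.end(function(err, res) {
		if (err) {
			return console.log('error:', err);
		}
		var data = JSON.parse(res.text);   // parse the JSON response body
		console.log(data);
	});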

Prerequisite knowledge


  • JavaScript and jQuery
  • Basic NodeJS
  • HTTP, network packet capture, and URL basics

It's a real boon for front-end engineers ~

Libraries that need to be installed


  • superagent
  • cheerio
  • eventproxy
  • async

superagent


superagent is a lightweight HTTP library. It is a very convenient client-side request module for NodeJS that makes it easy to send GET, POST, and other web requests.
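A minimal sketch of its chained API (the URLs, query parameter, and form fields below are just placeholders):

const superagent = require('superagent');

// GET with a query string (placeholder URL and parameter)
superagent.get('https://cnodejs.org/')
	.query({ tab: 'good' })
	.end(function(err, res) {
		console.log(res.status);
	});

// POST with a JSON body (placeholder URL and fields)
superagent.post('https://example.com/api/login')
	.send({ name: 'user', password: 'pass' })
	.end(function(err, res) {
		console.log(res.status);
	});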

cheerio


cheerio is a NodeJS version of jQuery. It is used to pick data out of a web page with CSS selectors, with an API almost identical to jQuery's.
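A minimal sketch: load an HTML string and query it with CSS selectors, jQuery-style:

const cheerio = require('cheerio');

var $ = cheerio.load('<ul><li class="item">foo</li><li class="item">bar</li></ul>');
$('.item').each(function(index, element) {
	console.log($(element).text());   // prints "foo" then "bar"
});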

eventproxy


The eventproxy module is used to control concurrency: it keeps track of whether a set of asynchronous operations has completed. Sometimes we need to send N HTTP requests at the same time and only process the data after all of them have returned; eventproxy automatically calls the handler you provide, passing in the collected data, once the required number of operations has finished. Very convenient.
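A minimal sketch of how it is used; the event name 'done' and the emitted values are arbitrary:

const eventproxy = require('eventproxy');

var ep = new eventproxy();
// The callback fires once 'done' has been emitted 3 times,
// receiving an array of everything passed to ep.emit()
ep.after('done', 3, function(results) {
	console.log(results);   // e.g. [ 'a', 'b', 'c' ]
});
['a', 'b', 'c'].forEach(function(item) {
	ep.emit('done', item);
});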

async


async is a flow-control toolkit that provides simple and powerful asynchronous helpers such as mapLimit(arr, limit, iterator, callback).

It also provides helpers that run tasks one after another (serially), such as mapSeries(arr, iterator, callback).
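A minimal sketch of mapLimit: run at most 2 simulated tasks at a time and collect all results, in order, in the final callback:

const async = require('async');

async.mapLimit(['a', 'b', 'c', 'd'], 2, function(item, callback) {
	// Simulate an asynchronous task
	setTimeout(function() {
		callback(null, item + ' done');
	}, 100);
}, function(err, results) {
	console.log(results);   // [ 'a done', 'b done', 'c done', 'd done' ]
});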

Crawler practice

All talk and no practice is just for show, so let's get started ~

Start by requiring the dependencies and defining the global variables ~

// Node's built-in module
const path = require('path')
const url = require('url');
const fs = require('fs')
// NPM installed dependency library
const superagent = require('superagent');
const cheerio = require('cheerio');
const eventproxy = require('eventproxy');
const async = require('async');
const mkdir = require('mkdirp')
// Set the crawler target URL
var targetUrl = 'https://cnodejs.org/';
// ---------- 1 ----------
// The simplest crawler
superagent.get(targetUrl)
	.end(function(err, res){
	  	console.log(res);
	})

Three lines of code ~ and yet it really is a crawler: it prints the page information to the terminal ~

Since we want to pull resources out of the page and use cheerio (jQuery for Node) to select content by class or ID, we first need to analyze the structure of the target page with the element inspector in Google Chrome. For example, suppose for now we just want the URL of every topic on the CNode home page.

Add cheerio to the program from step 1 to extract the URLs:

// ---------- 2 ----------
// Add cheerio to get the specified content of the page

superagent.get(targetUrl)
	.end(function(err, res){
		var $ = cheerio.load(res.text);
		$('#topic_list .topic_title').each(function(index, element){
			var href = $(element).attr('href');
			console.log(href);
		});
	});

Output:

These are all relative paths. What do we do? Don't worry, the url module has us covered:

// ---------- 3 ----------
var href = url.resolve(targetUrl, $(element).attr('href'));

Run the program again and check the output:

Getting the URLs is only the first step. Next we need to fetch the content those URLs point to; for example, grab the title and the first comment from each topic page and print them out.

Here we add the eventproxy module to elegantly trigger a callback after a specified number of asynchronous operations complete:

// ---------- 4 ----------
// Add eventproxy to trigger the callback once all topic pages have been fetched

var topicUrls = [];	

function getTopicUrls() {
	// code snippets in ----3----
};
getTopicUrls()
var ep = new eventproxy();
// The eventProxy module defines the callback function first
ep.after('crawled', topicUrls.length, function(topics) {
	topics = topics.map(function(topicPair) {
		var topicUrl = topicPair[0];
		var topicHtml = topicPair[1];
		var $ = cheerio.load(topicHtml);
		return ({
			title: $('.topic_full_title').text(),
			href: topicUrl,
			comment1: $('.reply_content .markdown-text').eq(0).text().trim()
		});
	});
	console.log('outcome');
	console.log(topics);
});

topicUrls.forEach(function(topicUrl) {
	superagent.get(topicUrl)
		.end(function(err, res){
			console.log('fetch --' + topicUrl + '--successfully');
      		// Each emit tells ep.after that one async operation finished; once the count is reached, the callback runs
			ep.emit('crawled', [topicUrl, res.text]);
		});
});

Output:

The outcome is empty. What the hell? We check the code again. Shit! We never controlled the asynchrony, so topicUrls is still empty when topicUrls.forEach() runs.

// ---------- 5 ----------
// Add Promise control
var topicUrls = [];	
function getTopicUrls() {
	return new Promise(function(resolve){
		// ... see the code in ----4----
	});
};
getTopicUrls().then(function(topicUrls){
	// ... see the code in ----4----
})

Output:

Surprise: the titles, URLs, and comments we expected do appear, but one thing is off. Look closely at the output log and you'll see a lot of empty objects:

Is the page inaccessible?

Copy the URL of one of the pages that produced no output into the browser and hit Enter: the page is accessible. That is because the browser sends just one request, while the crawler, thanks to Node's high concurrency, fires off a great many requests at once. If that exceeds what the server can handle, the server could crash, so servers generally have anti-crawler measures in place, and that is exactly what we have run into. How do we prove it? Print the response of every page the URLs point to (tip: since the terminal output is long, you can pipe it through the Linux command | less to page through the logs).

If you look closely at the output, you’ll soon see the following logs:

Those pages return 503 (Service Unavailable), which means the server has refused to serve us.

Let's improve the program: bring in async to limit concurrency and add a delay:

// ---------- 6 ----------
// Limit concurrency to 5 and add a random delay
// Print the title and the first comment of each topic

var topicUrls = [];	
function getTopicUrls() {
	return new Promise(function(resolve){
		superagent.get(targetUrl)
			.end(function(err, res){
				if (err) {
					return console.log('error:', err)
				}
				var $ = cheerio.load(res.text);
				$('#topic_list .topic_title').each(function(index, element){
					var href = url.resolve(targetUrl, $(element).attr('href'));
					topicUrls.push(href);
					resolve(topicUrls);
				});
			});
	});
};

getTopicUrls().then(function(topicUrls){
	var ep = new eventproxy();
	ep.after('crawled', topicUrls.length, function(topics) {
		topics = topics.map(function(topicPair) {
			var topicUrl = topicPair[0];
			var topicHtml = topicPair[1];
			var $ = cheerio.load(topicHtml);
			return ({
				title: $('.topic_full_title').text(),
				href: topicUrl,
				comment1: $('.reply_content .markdown-text').eq(0).text().trim()
			});
		});
		console.log('------------------------ outcomes -------------------------');
		console.log(topics);
		console.log('Total crawler results: ' + topics.length + ' items');
	});

	var curCount = 0;
	// Set the delay
    function concurrentGet(url, callback) {
    	var delay = parseInt((Math.random() * 30000000) % 1000, 10);
	    curCount++;
		setTimeout(function() {
		    console.log('Current concurrency:', curCount, ', now fetching:', url, ', delay: ' + delay + ' ms');
	    	superagent.get(url)
				.end(function(err, res){
					console.log('fetch --' + url + '--successfully');
					ep.emit('crawled', [url, res.text]);
				});
		    curCount--;
		    callback(null, url + ' callback content');
		}, delay);
    }

	// Use async to control asynchronous fetching
	// mapLimit(arr, limit, iterator, [callback])
	// asynchronous callback
	async.mapLimit(topicUrls, 5, function (topicUrl, callback) {
		concurrentGet(topicUrl, callback);
	});
})

Take a look at the output log:

Very satisfying for the OCD crowd, no ~

Wait, there's one more thing! With the foundation above, building an image crawler and saving the files is also very simple, for example downloading the author avatars from the topic pages we just crawled. The page-analysis steps are the same as before, so let's go straight to the code:

var dir = './images'
// Create a directory to store the images
mkdir(dir, function(err) {
	if (err) {
		console.log(err);
	}
});

// ---------- 7 ----------
// Limit concurrency to 5 and add a random delay
// Download the author avatars

var topicUrls = [];	
function getTopicUrls() {
	return new Promise(function(resolve){
		// ... see the code in ----6----
	});
};
getTopicUrls().then(function(topicUrls){
	var ep = new eventproxy();
	ep.after('crawled', topicUrls.length, function(topics) {
		var imgUrls = []
		topics = topics.map(function(topicPair) {
			// ... see the code in ----6----
			imgUrls.push($('.user_avatar img').attr('src'));
		});
      	// async.mapSeries executes the downloads one after another (serially)
		async.mapSeries(imgUrls, function (imgUrl, callback) {
			// Create a file stream
			const stream = fs.createWriteStream(dir + '/' + path.basename(imgUrl) + '.jpg');
			const res = superagent.get(imgUrl);
			// res.type('jpg')
			res.pipe(stream);
			console.log(imgUrl, '-- Save done ');
		    callback(null, 'callback content');
	    });
		console.log('------------------------ outcomes -------------------------');
		console.log('Total crawler results' + topics.length + '条');
	});
	var curCount = 0;
	// Number of concurrent requests
    function concurrentGet(url, callback) {
    	// ... see the code in ----6----
    }
	// Use async to control asynchronous fetching
	async.mapLimit(topicUrls, 5, function (topicUrl, callback) {
		concurrentGet(topicUrl, callback);
	});
})


Run the program, and you'll see a lot of new images under the images folder. And we're done ~