Crawlers are an important way to collect data today. Python is the most commonly used language for writing them and has a rich set of frameworks and libraries. While studying recently, I found that Node.js can also be used to write crawlers directly in JavaScript. It is simple and fast, and it can take advantage of Node's asynchronous, highly concurrent I/O. The following is what I learned in practice.
Basics
The url module
A crawler constantly has to parse the URLs it crawls, which is what Node's url module is for. The url module processes and parses URLs.
url.parse(): parses a URL string into its component parts
url.resolve(): resolves a target URL relative to a base URL
const url = require('url')
const myUrl = url.parse('https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash')
console.log(myUrl)
// {
//   protocol: 'https:',
//   slashes: true,
//   auth: 'user:pass',
//   host: 'sub.host.com:8080',
//   port: '8080',
//   hostname: 'sub.host.com',
//   hash: '#hash',
//   search: '?query=string',
//   query: 'query=string',
//   pathname: '/p/a/t/h',
//   path: '/p/a/t/h?query=string',
//   href: 'https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash'
// }
console.log(url.resolve('/one/two/three', 'four'))
// Resolves to '/one/two/four'
console.log(url.resolve('http://example.com/', '/one'))
// Resolves to 'http://example.com/one'
console.log(url.resolve('http://example.com/one', '/two'))
// Resolves to 'http://example.com/two'
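The Node.js documentation marks url.parse() and url.resolve() as legacy APIs and points to the WHATWG URL class for new code. As a minimal sketch, the same parsing and resolving could look like this with new URL(); the variable names parts and resolved are only illustrative.

```javascript
// WHATWG URL equivalents of url.parse() and url.resolve().
// URL is a global in current Node.js versions, so no require is needed.
const myUrl = new URL('https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash');

// Pick out the fields a crawler usually cares about.
const parts = {
  protocol: myUrl.protocol,
  username: myUrl.username,
  hostname: myUrl.hostname,
  port: myUrl.port,
  pathname: myUrl.pathname,
  search: myUrl.search,
  hash: myUrl.hash
};
console.log(parts);

// new URL(target, base) resolves target against base; base must be absolute.
const resolved = new URL('/two', 'http://example.com/one').href;
console.log(resolved);
```

Unlike url.resolve(), new URL() throws if the base is not an absolute URL, so relative-only inputs such as '/one/two/three' need a full base first.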
The http module
When a crawler sends a network request, it has to pick a module according to the URL's protocol: the http module for HTTP and the https module for HTTPS. Both expose a request method for sending requests.
Use http.request(options[, callback]) to make an HTTP request. http.request() returns an instance of the http.ClientRequest class.
A ClientRequest instance is a writable stream (it inherits from Stream) that represents an in-progress request. Its headers can still be changed with setHeader(name, value), getHeader(name), and removeHeader(name). The actual headers are sent along with the first data chunk, or when request.end() is called.
request.end() must always be called to signal that the request is finished, even if no data is written to the request body.
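To make the header methods concrete, here is a small sketch that creates a request, edits its headers while they are still mutable, and then discards the request without sending anything. The target 127.0.0.1:9 and the X-Crawler header are placeholders chosen for illustration, not values from this article.

```javascript
const http = require('http');

// Placeholder target: the request is destroyed below before anything is sent.
const req = http.request({ hostname: '127.0.0.1', port: 9, path: '/', method: 'GET' });
// Destroying an unsent request surfaces as an 'error' event, so handle it.
req.on('error', () => {});

// Headers stay editable until the first body chunk is written or end() is called.
req.setHeader('X-Crawler', 'demo');
const afterSet = req.getHeader('x-crawler'); // header names are case-insensitive
req.removeHeader('X-Crawler');
const afterRemove = req.getHeader('X-Crawler');
const sentYet = req.headersSent;

console.log(afterSet, afterRemove, sentYet);

// Throw the request away instead of calling end().
req.destroy();
```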
Send a POST request
const querystring = require('querystring')
const http = require('http')
const postData = querystring.stringify({
  'msg': 'Hello World!'
});
const options = {
  hostname: 'nodejs.cn',
  port: 80,
  path: '/upload',
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Content-Length': Buffer.byteLength(postData)
  }
};
const req = http.request(options, (res) => {
  console.log(`Status code: ${res.statusCode}`);
  console.log(`Response headers: ${JSON.stringify(res.headers)}`);
  res.setEncoding('utf8');
  res.on('data', (chunk) => {
    console.log(`Response body: ${chunk}`);
  });
  res.on('end', () => {
    console.log('No more data in response');
  });
});
req.on('error', (e) => {
  console.error(`Request encountered a problem: ${e.message}`);
});
// Write data to the request body.
req.write(postData);
req.end();
Encapsulation
Wrapping the common request logic in a single function makes the code easier to reuse and manage.
// Define the default request headers
const _header = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
  'Accept-Encoding': 'gzip, deflate, br' // Ask for compressed data by default
}
Adding User-Agent to the request headers makes the request look like it comes from a browser. Adding 'Accept-Encoding': 'gzip, deflate, br' asks the server for compressed data, which reduces traffic and response time. The data then has to be decompressed with the zlib module after it has been read.
zlib.gzip(buffer[, options], callback): compresses data
zlib.gunzip(buffer[, options], callback): decompresses data
// Check whether the response's Content-Encoding header (a comma-separated list) includes gzip
if(res.headers['content-encoding'] && res.headers['content-encoding'].split(',').map(s => s.trim()).includes('gzip')) {
  // Decompress the data and return it
  zlib.gunzip(result, (err, data) => {
    if(err) {
      reject(err)
    } else {
      resolve({
        buffer: data,
        headers: res.headers
      })
    }
  })
}
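The default header above advertises 'gzip, deflate, br', while the check only decompresses gzip, so a server that answers with deflate or Brotli would hand back bytes that are still compressed. One possible way to cover all three is sketched below. The helper name decodeBody is hypothetical, and the synchronous zlib calls are used only to keep the example short.

```javascript
const zlib = require('zlib');

// Hypothetical helper: decompress a response body according to a single
// Content-Encoding value. Unknown or missing encodings are returned untouched.
function decodeBody(buffer, contentEncoding) {
  const encoding = (contentEncoding || '').trim().toLowerCase();
  switch (encoding) {
    case 'gzip':
      return zlib.gunzipSync(buffer);
    case 'deflate':
      return zlib.inflateSync(buffer);
    case 'br':
      return zlib.brotliDecompressSync(buffer);
    default:
      return buffer;
  }
}

// Round trip: compress a sample payload, then decode it again.
const original = Buffer.from('hello crawler');
const roundTrip = decodeBody(zlib.gzipSync(original), 'gzip').toString();
console.log(roundTrip);
```

Inside a request callback the asynchronous variants (zlib.gunzip, zlib.inflate, zlib.brotliDecompress) would avoid blocking the event loop on large bodies.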
The wrapped function takes an options argument, which can be either a plain URL string or an object containing the request details.
// Check if options is a string
if(typeof options === 'string') {
  // Convert options to an object
  options = {
    url: options,
    method: 'GET',
    header: {}
  }
} else {
  // If it is an object, fill in the default properties
  options = options || {}
  options.method = options.method || 'GET'
  options.header = options.header || {}
}
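The same normalization can be pulled into a small pure function, which makes the two input shapes easy to compare. This is only a sketch: normalizeOptions is a hypothetical name, and unlike the snippet above it copies the object instead of modifying the caller's.

```javascript
// Hypothetical helper: turn a URL string or a partial options object
// into the { url, method, header } shape used by the wrapper.
function normalizeOptions(options) {
  if (typeof options === 'string') {
    return { url: options, method: 'GET', header: {} };
  }
  // Shallow copy so the caller's object is left unchanged.
  const normalized = Object.assign({}, options);
  normalized.method = normalized.method || 'GET';
  normalized.header = normalized.header || {};
  return normalized;
}

console.log(normalizeOptions('http://example.com/'));
console.log(normalizeOptions({ url: 'http://example.com/', method: 'POST' }));
```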
The wrapped function returns a Promise, so callers can send requests asynchronously and efficiently. Inside the Promise, the url module parses the requested URL, and the protocol field determines which protocol module to use.
// Parse the URL
var obj = url.parse(options.url)
// Pick the module and default port by protocol
let mode = null
let port = 0
switch(obj.protocol) {
  // HTTPS protocol
  case 'https:':
    mode = require('https')
    port = 443
    break
  // HTTP protocol (url.parse() keeps the trailing colon in protocol)
  case 'http:':
    mode = require('http')
    port = 80
    break
}
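With the switch above, an unsupported protocol leaves mode as null and the failure only shows up later as a TypeError on mode.request. A lookup table with an explicit error is one alternative, sketched here. The names TRANSPORTS and pickTransport are hypothetical, and the sketch parses with new URL() rather than url.parse().

```javascript
// Hypothetical lookup table: protocol (with trailing colon) -> module and default port.
const TRANSPORTS = {
  'http:': { mode: require('http'), port: 80 },
  'https:': { mode: require('https'), port: 443 }
};

// Return the module and default port for a URL, or fail loudly.
function pickTransport(targetUrl) {
  const { protocol } = new URL(targetUrl);
  const transport = TRANSPORTS[protocol];
  if (!transport) {
    throw new Error(`Unsupported protocol: ${protocol}`);
  }
  return transport;
}

console.log(pickTransport('https://movie.douban.com/').port);
console.log(pickTransport('http://example.com/').port);
```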
Whether the request made with http.request succeeded is determined by checking that the response's statusCode is 200. If it is not, check whether the response is a redirect and, if so, follow it.
if(res.statusCode != 200) {
  // Check whether it is a redirect
  if(res.statusCode == 302 || res.statusCode == 301) {
    // Resolve the URL to redirect to
    let location = url.resolve(options.url, res.headers['location']);
    // Update the options
    options.url = location;
    options.method = 'GET';
    // Re-send the request
    _request(options);
  } else {
    // Reject with the response
    reject(res);
  }
}
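The redirect branch above re-sends the request with no upper bound, so two pages that redirect to each other would loop forever. A hop counter is a common guard. The sketch below keeps only the decision logic. resolveRedirect, REDIRECT_CODES and the default limit of 5 are assumptions for illustration, and the wider set of status codes goes beyond the 301/302 handled in the article.

```javascript
// Status codes treated as redirects in this sketch (the article handles 301 and 302 only).
const REDIRECT_CODES = new Set([301, 302, 303, 307, 308]);

// Hypothetical helper: return the next URL to request, or null if the
// response is not a redirect. Throws once maxHops redirects were followed.
function resolveRedirect(currentUrl, statusCode, location, hops, maxHops = 5) {
  if (!REDIRECT_CODES.has(statusCode) || !location) {
    return null;
  }
  if (hops >= maxHops) {
    throw new Error('Too many redirects');
  }
  // Location may be relative, so resolve it against the current URL.
  return new URL(location, currentUrl).href;
}

console.log(resolveRedirect('http://example.com/a/b', 302, '/login', 0));
console.log(resolveRedirect('http://example.com/a/b', 200, undefined, 0));
```

A caller would pass hops + 1 on each recursive request and reject the Promise when the helper throws.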
The final code: fetch.js
const assert = require('assert')
const url = require('url')
const zlib = require('zlib')
const querystring = require('querystring')
// Define the default request headers
const _header = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
  'Accept-Encoding': 'gzip, deflate, br' // Ask for compressed data by default
}
module.exports = (options) => {
  // Process parameters
  if(typeof options === 'string') {
    options = {
      url: options,
      method: 'GET',
      header: {}
    }
  } else {
    options = options || {}
    options.method = options.method || 'GET'
    options.header = options.header || {}
  }
  // Add the default request headers
  for(let name in _header) {
    options.header[name] = options.header[name] || _header[name]
  }
  // Encode the POST data
  if(options.data) {
    options.postData = querystring.stringify(options.data)
    options.header['Content-Length'] = Buffer.byteLength(options.postData)
  }
  // Return the Promise object
  return new Promise((resolve, reject) => {
    _request(options)
    function _request(options) {
      // Parse the URL
      var obj = url.parse(options.url)
      // Pick the module and default port by protocol
      let mode = null
      let port = 0
      switch(obj.protocol) {
        case 'https:':
          mode = require('https')
          port = 443
          break
        // url.parse() keeps the trailing colon in protocol
        case 'http:':
          mode = require('http')
          port = 80
          break
      }
      // Build the request options
      let req_options = {
        hostname: obj.hostname,
        port: obj.port || port,
        path: obj.path,
        method: options.method,
        headers: options.header
      }
      // Send the request
      let req_result = mode.request(req_options, (res) => {
        // Check whether there is an error
        if(res.statusCode != 200) {
          // Check whether it is a redirect
          if(res.statusCode == 302 || res.statusCode == 301) {
            // Resolve the URL to redirect to and re-send the request
            let location = url.resolve(options.url, res.headers['location']);
            options.url = location;
            options.method = 'GET';
            _request(options);
          } else {
            // Reject with the response
            reject(res);
          }
        } else {
          // Collect the data chunks
          var data = []
          res.on('data', chunk => {
            data.push(chunk)
          })
          // Return the data
          res.on('end', () => {
            // Join the chunks
            var result = Buffer.concat(data)
            if(res.headers['content-length'] && res.headers['content-length'] != result.length) {
              reject('Incomplete data load')
            } else {
              // Check whether the data is gzip-compressed (Content-Encoding is a comma-separated list)
              if(res.headers['content-encoding'] && res.headers['content-encoding'].split(',').map(s => s.trim()).includes('gzip')) {
                zlib.gunzip(result, (err, data) => {
                  if(err) {
                    reject(err)
                  } else {
                    resolve({
                      buffer: data,
                      headers: res.headers
                    })
                  }
                })
              } else {
                // Return the data directly
                resolve({
                  buffer: result,
                  headers: res.headers
                })
              }
            }
          })
        }
      })
      // Reject on request errors
      req_result.on('error', e => reject(e));
      // If there is POST data, send it
      if(options.postData) {
        req_result.write(options.postData)
      }
      req_result.end();
    }
  })
}
In practice
Next, the wrapped function is used to crawl Douban movie data. The collected data is sorted by score and finally written to a TXT file.
The code
const fetch = require('../fetch')
const fs = require('fs')
// Collected Douban data
var data = []
// Fetch 100 pages of data
getData(100)
// Crawl the data page by page
// Parameter time: number of pages to crawl
async function getData(time) {
  var pageStart = 0
  var pageLimit = 20
  for(var i = 0; i < time; i++) {
    var res = await fetch({
      url: `https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=rank&page_limit=${pageLimit}&page_start=${pageStart}`
    })
    // Add the page's data to data
    var newData = JSON.parse(res.buffer.toString())
    data.push(...newData.subjects)
    pageStart += pageLimit
  }
  // Sort the data by score, highest first
  data.sort((a, b) => b.rate - a.rate)
  // Build the string to write to the file
  var output = data.reduce((str, item) => {
    return str + item.title + ':' + item.rate + '\n'
  }, '')
  // Save data to file
  fs.writeFile('./sort.txt', output, function(err) {
    if (err) {
      throw err;
    }
  });
}
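The paging and ranking steps can also be tried in isolation, without touching the network. The sketch below is an illustration under assumptions: buildPageUrl and formatRanking are hypothetical helpers, the sample subjects are made up, and rate is treated as a numeric string that is converted explicitly before sorting.

```javascript
// Hypothetical helper: build the URL for one page of results.
// URLSearchParams percent-encodes the tag, matching the encoded URL used above.
function buildPageUrl(pageStart, pageLimit) {
  const params = new URLSearchParams({
    type: 'movie',
    tag: '豆瓣高分',
    sort: 'rank',
    page_limit: String(pageLimit),
    page_start: String(pageStart)
  });
  return `https://movie.douban.com/j/search_subjects?${params}`;
}

// Hypothetical helper: sort by score (highest first) and render one "title:rate" line per item.
function formatRanking(subjects) {
  return subjects
    .slice() // copy, so the input array keeps its order
    .sort((a, b) => Number(b.rate) - Number(a.rate))
    .map(item => `${item.title}:${item.rate}\n`)
    .join('');
}

// Made-up sample data in the shape of the subjects array.
const sample = [
  { title: 'Movie A', rate: '9.1' },
  { title: 'Movie B', rate: '9.6' }
];
console.log(buildPageUrl(40, 20));
console.log(formatRanking(sample));
```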
Final sort.txt data
Yes, Minister 1984 Christmas Special: 9.8
Elizabeth: 9.6
Farewell My Concubine: 9.6
The Shawshank Redemption: 9.6
Prosecution Witness: 9.6
Mozart!: 9.5
Schindler's List: 9.5
Beautiful Life: 9.5
Teahouse: 9.4
The Killer Not too Cold: 9.4
Twelve Angry Men: 9.4
Back to Back, Face to face: 9.4
Prosecution Witness: 9.4
Sherlock Holmes II: 9.4
Twelve Angry Men: 9.4
Brilliant Life: 9.4
Forrest Gump: 9.4
Mozart: 9.4
Romeo and Juliet: 9.4
Evangelion Theatre: 9.4
Air/ From Your Heart: 9.4
Spirited Away: 9.3
The Furnace: 9.3
The Best Guy: The End: 9.3
Inception: 9.3
Silver Soul: The End: The House of Everything Forever: Notre Dame: 9.3
City Lights: 9.3
...
Project code
References
- url API
- http.request API