Crawlers are currently an important way to obtain data, and Python, with its rich ecosystem of frameworks and libraries, is the language most commonly used to write them. In my recent study I found that Node.js can also be used for crawlers, written directly in JavaScript. It is not only simple and fast, but can also take advantage of Node's asynchronous, highly concurrent nature. The following is my learning practice.

Basics

The url module

A crawler constantly needs to parse the URLs it is going to crawl, which is what Node's url module is for: processing and parsing URLs.

  • url.parse(): used to parse web addresses
  • url.resolve(): resolves a target URL relative to a base URL
const url = require('url')

const myUrl = url.parse('https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash');

console.log(myUrl)
// {
//   protocol: 'https:',
//   slashes: true,
//   auth: 'user:pass',
//   host: 'sub.host.com:8080',
//   port: '8080',
//   hostname: 'sub.host.com',
//   hash: '#hash',
//   search: '?query=string',
//   query: 'query=string',
//   pathname: '/p/a/t/h',
//   path: '/p/a/t/h?query=string',
//   href: 'https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash'
// }

console.log(url.resolve('/one/two/three', 'four'))
// Resolves to '/one/two/four'
console.log(url.resolve('http://example.com/', '/one'))
// Resolves to 'http://example.com/one'
console.log(url.resolve('http://example.com/one', '/two'))
// Resolves to 'http://example.com/two'

The HTTP module

When the crawler needs to send a network request, it has to pick a module according to the URL's protocol: the http module for HTTP URLs and the https module for HTTPS URLs. Requests are made with the module's request method.

Make a request with http.request(options[, callback]); http.request() returns an instance of the http.ClientRequest class.

A ClientRequest instance is a writable stream that represents an in-progress request. Its headers can still be changed with setHeader(name, value), getHeader(name) and removeHeader(name); the actual headers are sent with the first chunk of data, or when request.end() is called. request.end() must always be called to finish sending the request.
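For instance, a minimal sketch of adjusting headers on a ClientRequest before it is sent (example.com is only a placeholder host):

const http = require('http')

const req = http.request({ hostname: 'example.com', path: '/' }, (res) => {
  console.log(res.statusCode)
})
// Headers can still be changed before the first chunk of the body is written
req.setHeader('X-Custom-Header', 'demo')
console.log(req.getHeader('X-Custom-Header')) // 'demo'
req.removeHeader('X-Custom-Header')
// end() finishes the request and flushes the headers (and the empty body)
req.end()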

Send a POST request

const querystring = require('querystring')
const http = require('http')

const postData = querystring.stringify({
  'msg': 'Hello World!'
});

const options = {
  hostname: 'nodejs.cn',
  port: 80,
  path: '/upload',
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Content-Length': Buffer.byteLength(postData)
  }
};

const req = http.request(options, (res) => {
  console.log(`Status code: ${res.statusCode}`);
  console.log(`Response headers: ${JSON.stringify(res.headers)}`);
  res.setEncoding('utf8');
  res.on('data', (chunk) => {
    console.log(`Response body: ${chunk}`);
  });
  res.on('end', () => {
    console.log('No more data in response');
  });
});

req.on('error', (e) => {
  console.error(`Request encountered a problem: ${e.message}`);
});

// Write data to the request body
req.write(postData);
req.end();

Encapsulation

Encapsulation here means wrapping the common request logic into a single function, which makes the code easy to reuse and manage.

// Define the default request headers
const _header = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
  'Accept-Encoding': 'gzip, deflate, br' // Request compressed data by default
}

A User-Agent is added to the request headers to simulate a browser request. 'Accept-Encoding': 'gzip, deflate, br' is also added so that gzip-compressed data is requested, reducing traffic and response time. The trade-off is that after the data has been read, it must be decompressed with the zlib module.

  • zlib.gzip(buffer[, options], callback): compresses the data
  • zlib.gunzip(buffer[, options], callback): decompresses the data
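As a quick sanity check of the two signatures above, a minimal round-trip sketch (standalone, not part of the crawler):

const zlib = require('zlib')

const input = Buffer.from('Hello gzip')
// Compress, then decompress, and confirm the original bytes come back
zlib.gzip(input, (err, compressed) => {
  if (err) throw err
  zlib.gunzip(compressed, (err, output) => {
    if (err) throw err
    console.log(output.toString()) // 'Hello gzip'
  })
})

In the crawler itself, gunzip is applied to the response buffer: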
// Check whether the response headers indicate gzip compression
if (res.headers['content-encoding'] && res.headers['content-encoding'].split(', ').includes('gzip')) {
  // Decompress the data and return it
  zlib.gunzip(result, (err, data) => {
    if (err) {
      reject(err)
    } else {
      resolve({
        buffer: data,
        headers: res.headers
      })
    }
  })
}

The wrapper function takes an options argument, which can be either a plain URL string or an object containing the request details.

// Check whether options is a string
if (typeof options === 'string') {
  // Normalize options into an object
  options = {
    url: options,
    method: 'GET',
    header: {}
  }
} else {
  // If it is already an object, fill in the default properties
  options = options || {}
  options.method = options.method || 'GET'
  options.header = options.header || {}
}

The wrapper returns a Promise, taking advantage of JavaScript's asynchronous nature so requests can be sent efficiently. Inside the Promise, the url module parses the requested URL, and the protocol field of the result decides which module is used to send the request.

// Parse the URL
var obj = url.parse(options.url)

// Pick the module based on the protocol
let mode = null
let port = 0
switch (obj.protocol) {
  // HTTPS protocol
  case 'https:':
    mode = require('https')
    port = 443
    break
  // HTTP protocol
  case 'http:':
    mode = require('http')
    port = 80
    break
}

In the http.request callback, success is determined by checking whether the response statusCode is 200. If it is not, check whether the response is a redirect and, if so, re-issue the request against the new location.

if (res.statusCode != 200) {
  // Check whether it is a redirect
  if (res.statusCode == 302 || res.statusCode == 301) {
    // Resolve the URL to redirect to
    let location = url.resolve(options.url, res.headers['location']);
    // Update the options
    options.url = location;
    options.method = 'GET';
    // Re-issue the request
    _request(options);
  } else {
    // Otherwise reject with the response
    reject(res);
  }
}
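One thing the snippet above does not guard against is a redirect loop (or a very long redirect chain), which would keep calling _request forever. A small sketch of how that could be capped, using a hypothetical redirectCount field on options:

// Hypothetical guard: give up after too many redirects
const MAX_REDIRECTS = 5
if (res.statusCode == 302 || res.statusCode == 301) {
  options.redirectCount = (options.redirectCount || 0) + 1
  if (options.redirectCount > MAX_REDIRECTS) {
    reject(new Error('Too many redirects'))
  } else {
    options.url = url.resolve(options.url, res.headers['location'])
    options.method = 'GET'
    _request(options)
  }
}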

The final code: fetch.js

const assert = require('assert')
const url = require('url')
const zlib = require('zlib')
const querystring = require('querystring')

// Define the default request headers
const _header = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
  'Accept-Encoding': 'gzip, deflate, br' // Request compressed data by default
}

module.exports = (options) => {
  // Normalize the parameters
  if (typeof options === 'string') {
    options = {
      url: options,
      method: 'GET',
      header: {}
    }
  } else {
    options = options || {}
    options.method = options.method || 'GET'
    options.header = options.header || {}
  }

  // Merge in the default request headers
  for (let name in _header) {
    options.header[name] = options.header[name] || _header[name]
  }
  // Encode the POST data
  if (options.data) {
    options.postData = querystring.stringify(options.data)
    options.header['Content-Length'] = options.postData.length
  }

  // Return a Promise
  return new Promise((resolve, reject) => {
    _request(options)

    function _request(options) {
      // Parse the URL
      var obj = url.parse(options.url)

      // Pick the module based on the protocol
      let mode = null
      let port = 0
      switch (obj.protocol) {
        case 'https:':
          mode = require('https')
          port = 443
          break
        case 'http:':
          mode = require('http')
          port = 80
          break
      }
      // Build the request options
      let req_options = {
        hostname: obj.hostname,
        port: obj.port || port,
        path: obj.path,
        method: options.method,
        headers: options.header
      }
      // Send the request
      let req_result = mode.request(req_options, (res) => {
        // Check for a non-200 status code
        if (res.statusCode != 200) {
          // Check whether it is a redirect
          if (res.statusCode == 302 || res.statusCode == 301) {
            // Resolve the URL to redirect to and re-issue the request
            let location = url.resolve(options.url, res.headers['location']);
            options.url = location;
            options.method = 'GET';
            _request(options);
          } else {
            // Otherwise reject with the response
            reject(res);
          }
        } else {
          // Collect the data chunks
          var data = []
          res.on('data', chunk => {
            data.push(chunk)
          })
          // Return the data
          res.on('end', () => {
            // Process the data
            var result = Buffer.concat(data)
            if (res.headers['content-length'] && res.headers['content-length'] != result.length) {
              reject('Incomplete data load')
            } else {
              // Check whether the data is compressed
              if (res.headers['content-encoding'] && res.headers['content-encoding'].split(', ').includes('gzip')) {
                zlib.gunzip(result, (err, data) => {
                  if (err) {
                    reject(err)
                  } else {
                    resolve({
                      buffer: data,
                      headers: res.headers
                    })
                  }
                })
              } else {
                // Return the data as-is
                resolve({
                  buffer: result,
                  headers: res.headers
                })
              }
            }
          })
        }
      })
      // Reject on request errors
      req_result.on('error', e => reject(e));
      // If there is POST data, write it
      if (options.postData) {
        req_result.write(options.postData)
      }
      req_result.end();
    }
  })
}
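A quick usage sketch of the module (the URLs and the require path are placeholders, assuming fetch.js sits in the current directory). Since the Promise resolves to { buffer, headers }, the body arrives as a Buffer and usually needs toString():

const fetch = require('./fetch')

// GET request with just a URL string
fetch('http://example.com/')
  .then(res => console.log(res.buffer.toString()))
  .catch(err => console.error(err))

// POST request with an options object; the data field is form-encoded by the wrapper
fetch({
  url: 'http://example.com/upload',
  method: 'POST',
  data: { msg: 'Hello World!' }
}).then(res => console.log(res.headers))

Note that the wrapper sets Content-Length for POST data but leaves Content-Type to the caller via options.header.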

Putting it into practice

Next, the wrapped function is used to crawl Douban movie data; the collected entries are sorted by rating and finally written to a TXT file.

The crawler code

const fetch = require('../fetch')
const fs = require('fs')

// Store the crawled Douban movie data
var data = []
// Crawl 100 pages of data
getData(100)

// Crawl the data page by page
// Parameter time: the number of pages to crawl
async function getData(time) {
  var pageStart = 0
  var pageLimit = 20
  for (var i = 0; i < time; i++) {
    var res = await fetch({
      url: `https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=rank&page_limit=${pageLimit}&page_start=${pageStart}`
    })
    // Append this page's results to data
    var newData = JSON.parse(res.buffer.toString())
    data.push(...newData.subjects)
    pageStart += pageLimit
  }
  // Sort the data by rating
  data.sort((a, b) => b.rate - a.rate)
  // Build the string to write to the file
  var res = data.reduce((str, item) => {
    return str + item.title + ':' + item.rate + '\n'
  }, '')
  // Save the data to a file
  fs.writeFile('./sort.txt', res, function(err) {
    if (err) {
      throw err;
    }
  });
}
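The loop above awaits each page in turn. Since one of the article's selling points is Node's concurrency, an alternative sketch that fires all page requests at once with Promise.all (reusing the same URL and paging parameters) could look like this:

// Alternative sketch: fetch all pages concurrently instead of one by one
async function getDataConcurrently(time) {
  const pageLimit = 20
  const requests = []
  for (let i = 0; i < time; i++) {
    const pageStart = i * pageLimit
    requests.push(fetch({
      url: `https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=rank&page_limit=${pageLimit}&page_start=${pageStart}`
    }))
  }
  // Wait for every response, then collect all subjects into one array
  const responses = await Promise.all(requests)
  let all = []
  for (const res of responses) {
    all = all.concat(JSON.parse(res.buffer.toString()).subjects)
  }
  return all
}

In practice a real crawler would probably cap the number of simultaneous requests to avoid being rate-limited by the site.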

Final sort.txt data

Yes, minister 1984 Christmas Special: 9.8 Elizabeth: 9.6 Farewell My Concubine: 9.6 The Shawshank Redemption: 9.6 Prosecution Witness: 9.6 Mozart! 9.5 Schindler's List 9.5 Beautiful Life 9.5 Teahouse 9.4 The Killer Not too Cold 9.4 Twelve Angry Men 9.4 Back to Back, Face to face 9.4 Prosecution Witness 9.4 Sherlock Holmes II 9.4 Twelve Angry Men 9.4 Brilliant Life 9.4 Forrest Gump 9.4 Mozart: 9.4 Romeo and Juliet: 9.4 Evangelion Theatre: 9.4 Air/ From Your Heart: 9.4 Spirited Away: 9.3 The Furnace: 9.3 The Best Guy: The End 9.3 Inception: 9.3 Silver Soul: The End: The House of Everything Forever: Notre Dame: 9.3 City Lights: 9.3...

Project code

References

  • url API
  • http.request API