Background

  1. Every graduation season is also rental season: rooms are hard to find and prices stay high.
  2. Rental listings are flooded with agents and principal landlords.
  3. Once you have all the data, you can do whatever you want with it, such as data visualization.

Note:
- Crawling data may involve sensitive information, so the specific rental platform is not named here.
- This article is mainly about the crawler process.
- Thank you for understanding.

Implementation

  1. Step1 set up a request server
  2. Step2 find the corresponding interface of the rental platform & crawl data
  3. Step3 store data

Step1 set up a request server

Setting up a request server with Node.js is simple:

/** ./index.js */
const https = require('https') // Use HTTPS because the requested API is served over HTTPS
var stateNum = 0 // Counts the number of requests
const options = {
    hostname: '?????', // Domain of the API you want to request, not the domain of the page URL
    port: 443,
    path: '?????',
    method: 'GET',
    headers: {},
    params: {} // Not read by https.request(); query parameters belong in `path`
}

function getData(options, stateNum) {
    const req = https.request(options, res => {
      console.log(`Status code: ${res.statusCode}`)
      res.setEncoding('utf8');
      res.on('data', d => {
          console.log(d.length, 1234567);
          // TODO
      })
    })
    req.on('error', error => {
      console.error(error)
    })
    req.end()
}

getData(options, stateNum) // Send the request when the file is run

There is no need to start a local service; just use https.request().
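One detail worth keeping in mind: the 'data' event can fire several times for a single response, so the `d` in the handler is one chunk and not necessarily the whole body. The sketch below shows one way to gather the chunks before using them. `collectBody` is a hypothetical helper that does not appear in the original code, and a plain `EventEmitter` stands in for the real response so the example runs without a network call.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper (not from the article): gather every chunk of a
// response and hand the complete body to onDone once 'end' fires.
function collectBody(res, onDone) {
  let body = ''
  res.on('data', chunk => { body += chunk })
  res.on('end', () => onDone(body))
}

// Stand-in for a real response so the sketch runs without a network call.
const fakeRes = new EventEmitter()
collectBody(fakeRes, body => console.log(body.length))
fakeRes.emit('data', '{"list":')
fakeRes.emit('data', '[]}')
fakeRes.emit('end')
```

Inside `https.request`, the same helper could be called as `collectBody(res, body => { /* parse or store body */ })` after `res.setEncoding('utf8')`.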

Step2 request the interface to obtain data

node index.js

The request headers clearly include an Authorization field. After repeatedly probing the interface, a pattern emerged: the Authorization value in each request header is regenerated from the current timestamp combined with some form of encryption. After many attempts and checks, the platform's auth algorithm was finally worked out.

- The auth algorithm is not covered in detail here.
- Let's jump straight to storing the data.

Step3 store data

This article does not use a database. Spreading across too many technologies at the start may scare readers off, so the data is simply stored as TXT files.

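// ./index.js
// This goes in the TODO area mentioned above.
// It needs `const fs = require('fs')` at the top of the file,
// and the ./txtdata directory must already exist.
try {
  // writeFileSync takes no callback; it throws on failure
  fs.writeFileSync(`${__dirname}/txtdata/data${stateNum}.txt`, d, { flag: 'a+' })
} catch (err) {
  console.log('Write error' + err);
}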
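Writing into a directory that does not exist yet makes the synchronous write throw. A minimal sketch of a more defensive variant follows, assuming it is acceptable to create the directory on the fly. `saveChunk` is a hypothetical helper name, and the demo writes into a temporary directory instead of the article's `txtdata` folder.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper (not from the article): append one chunk to
// <dir>/data<stateNum>.txt, creating <dir> first if it is missing.
function saveChunk(dir, stateNum, chunk) {
  fs.mkdirSync(dir, { recursive: true }) // no error if dir already exists
  const file = path.join(dir, `data${stateNum}.txt`)
  fs.appendFileSync(file, chunk, 'utf8') // throws on failure, no callback
  return file
}

// Demo in a temporary directory so nothing in the project is touched.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'txtdata-'))
saveChunk(demoDir, 0, 'first ')
const demoFile = saveChunk(demoDir, 0, 'second')
console.log(fs.readFileSync(demoFile, 'utf8'))
```

In the crawler itself the call would look like `saveChunk(path.join(__dirname, 'txtdata'), stateNum, d)`, wrapped in try/catch if a write error should not stop the run.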

Partial screenshots of the captured data

Summary

- This article mainly shares the crawler process. When it came to writing about the auth algorithm, I was in a dilemma.
- About the auth, I consulted a friend who specializes in crawlers.
- His answer: decryption is difficult, and the auth adds some offset to obfuscate it.
- I then went back to thinking about the auth and, with some luck, dug into the website's source code.
- It uses MD5, where the offset is a string constant plus a timestamp.
- With that, the data could be crawled.
- Before finding the source code, I stuck to bold hypotheses and careful verification; by the time I turned to the source code, a day had already passed.
- Later, driven by confidence, I tried to crawl a recruitment website, but unfortunately it was capped at 10 groups of data, so I gave up.
- Comments and private messages are welcome.

Technical summary

  1. This article mainly uses three modules: the Node.js built-in modules https and fs, plus MD5 (md5.js has to be downloaded separately).

  2. The MD5 referenced in this article is github.com/blueimp/Jav…
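
As a rough illustration of the general idea described in the summary (an MD5 digest over a string constant and a timestamp), here is a minimal sketch. It is not the platform's algorithm: the constant, the concatenation order, and the header layout are all assumptions, `buildAuthorization` is a hypothetical name, and Node's built-in `crypto` module is swapped in for the md5.js library the article uses.

```javascript
const crypto = require('crypto')

// Hypothetical helper: MD5 hex digest of a string constant followed by a
// timestamp. The real constant, ordering, and header format are not given
// in the article, so everything here is an assumption.
function buildAuthorization(constant, timestamp) {
  return crypto
    .createHash('md5')
    .update(constant + String(timestamp))
    .digest('hex')
}

// A fresh value is produced for every request because the timestamp changes.
const headers = {
  Authorization: buildAuthorization('HYPOTHETICAL_CONSTANT', Date.now())
}
console.log(headers.Authorization.length) // an MD5 hex digest is 32 characters
```

Because the digest depends on the timestamp, it would have to be recomputed and placed into `options.headers` before each call to `https.request`.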