This is the 19th day of my participation in the Gengwen Challenge.
Introduction
Recently I suddenly wanted to try building a crawler. As a dyed-in-the-wool front-end developer, doing it in Python was naturally out of the question; thanks to Node.js, JavaScript can easily handle this kind of task outside the browser.
After poking around for a while, I can finally complete some simple crawler tasks, and I'd like to share the experience below. I won't claim that this article will give you a thorough understanding of crawlers or let you handle complex crawling jobs, but it is certainly enough to get you started, and getting started with minimal code is exactly what this post is about. If you have better ideas or experience once you're past the entry stage, you're welcome to share them; after all, I'm only a crawler novice myself.
Tool Introduction
Node.js: the JavaScript runtime everyone knows; no need to say more.
Express: a web application framework for Node.js; with it we can spin up a simple local server.
superagent: an HTTP request library for Node.js. Compared with Node's built-in http module it is more flexible and easier to use; in this project we use it to send the requests.
cheerio: a DOM manipulation library for Node.js, which can be thought of as jQuery on the Node side.
Initialize the project
After creating a project locally, we can execute the following command to install the dependencies:
// Install the dependencies
npm i express superagent cheerio
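If the folder has not been initialized as an npm project yet, you can generate a package.json first (a one-line sketch):

npm init -y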
Set up the Express server
Create app.js and set up a minimal Express configuration:
// Import express
const app = require('express')();

let data = []; // Define the data to be returned

// Define the route and the data it returns
app.get('/', async (req, res, next) => {
  res.send(data.length > 0 ? data : 'HelloWorld');
});

// Listen on port 3000
app.listen(3000, () => {
  console.log('app started at port http://localhost:3000...');
});
At this point, execute the startup command node app.js and access the corresponding address to see the response content:
Tips: every time you modify app.js you have to save it and re-run the startup command before the page reflects the change, which quickly becomes tedious. This repetitive process can be automated with a tool: nodemon. nodemon watches your JS files for changes and re-runs the node command automatically. It is also very simple to use: install nodemon globally, then start the app with nodemon app.js instead.
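Those two steps, as commands:

npm i -g nodemon
nodemon app.js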
Submit the network request
Here we take the Boss Zhipin page as an example: copy the URL and cookie of the page to be crawled, and send the request through superagent:
// Import superagent
const superagent = require('superagent');

function getHtml(url, cookie) {
  superagent
    .get(url)
    .set('Cookie', cookie)
    .end((err, res) => {
      if (err) {
        console.log(`page fetching failed: ${err}`);
      } else {
        // res.text can be seen as the raw page returned by the request
        data = cookData(res.text); // Assign the processed page data to data; cookData is defined later
      }
    });
}

let url =
  'https://www.zhipin.com/job_detail/?query=%E8%85%BE%E8%AE%AF&city=100010000&industry=&position=100999';
let cookie = 'xxxxxxx';

getHtml(url, cookie); // Execute the method
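By the way, superagent requests are also promise-compatible, so the same fetch can be written with async/await if you prefer. A minimal sketch, reusing the url, cookie and cookData from above (getHtmlAsync is just an illustrative name):

async function getHtmlAsync(url, cookie) {
  try {
    // await works because a superagent request is thenable
    const res = await superagent.get(url).set('Cookie', cookie);
    data = cookData(res.text);
  } catch (err) {
    console.log(`page fetching failed: ${err}`);
  }
}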
Parse the page data
Suppose we want to crawl the links of the secondary (detail) pages for front-end positions on Boss Zhipin. The overall approach: first inspect the original page and analyze its DOM structure, then use cheerio to select the corresponding DOM elements and extract the data we need.
In the figure below, we can clearly see that the primary-box is our target element.
// Import cheerio
const cheerio = require('cheerio');

function cookData(res) {
  let data = [];
  let $ = cheerio.load(res); // Parse the response page into a cheerio object
  // Collect the target data from each matching element
  $('div.primary-box').each((i, node) => {
    data[i] = node.attribs.href;
  });
  return data;
}
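If you are not sure a selector is right, you can try it on a tiny HTML snippet first. A minimal sketch; the markup below is only an illustrative assumption, not the real Boss Zhipin structure:

const cheerio = require('cheerio');

// Hypothetical markup for testing the selector
const html = `
  <div class="job-list">
    <div class="primary-box" href="/job_detail/aaa.html"></div>
    <div class="primary-box" href="/job_detail/bbb.html"></div>
  </div>`;

const $ = cheerio.load(html);
const links = [];
$('div.primary-box').each((i, node) => {
  links[i] = node.attribs.href; // equivalent to $(node).attr('href')
});
console.log(links); // [ '/job_detail/aaa.html', '/job_detail/bbb.html' ]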
At this point we are done, and the page should display the returned data:
Full code
const app = require('express')();
const cheerio = require('cheerio');
const superagent = require('superagent');

let data = [];
let url =
  'https://www.zhipin.com/job_detail/?query=%E8%85%BE%E8%AE%AF&city=100010000&industry=&position=100999';
let cookie = 'xxxxx';

function cookData(res) {
  let data = [];
  let $ = cheerio.load(res);
  $('div.primary-box').each((i, node) => {
    data[i] = node.attribs.href;
  });
  return data;
}

function getHtml(url, cookie) {
  superagent
    .get(url)
    .set('Cookie', cookie)
    .end((err, res) => {
      if (err) {
        console.log(`page fetching failed: ${err}`);
      } else {
        data = cookData(res.text);
      }
    });
}

getHtml(url, cookie);

app.get('/', async (req, res, next) => {
  res.send(data.length > 0 ? data : 'HelloWorld');
});

app.listen(3000, () => {
  console.log('app started at port http://localhost:3000...');
});
Epilogue
If you don't need to view the results in the front end, express isn't needed in this project at all; storing the processed data locally with the fs module is also a good choice. Of course, real-world crawler requirements are rarely this simple. The road ahead is long. Keep at it, everyone!
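A minimal sketch of that fs alternative, assuming we just dump the array into a JSON file (the file name is arbitrary):

const fs = require('fs');

// Persist the crawled data locally instead of serving it via express
fs.writeFile('./data.json', JSON.stringify(data, null, 2), (err) => {
  if (err) {
    console.log(`write failed: ${err}`);
  } else {
    console.log('data saved to data.json');
  }
});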