This is the 19th day of my participation in the Gengwen Challenge.
Introduction
Recently I suddenly wanted to try building a crawler. As a dyed-in-the-wool front-end developer, doing it in Python was naturally out of the question; thanks to Node.js, JavaScript can easily handle this kind of task outside the browser.
After poking around for a while, I can finally complete some simple crawler tasks, and I'd like to share the experience below. I won't claim that this article will give you a thorough understanding of crawlers or let you handle complex crawling jobs, but it is certainly enough to get you started, and getting started with minimal code is exactly what this post is about. If you have better ideas or experience once you're past the entry stage, you're welcome to share them; after all, I'm only a crawler novice myself.
Tool Introduction
Node.js: the JavaScript runtime everyone knows; no need to say more.
Express: a web application framework for Node.js; with it we can spin up a simple local server.
superagent: an HTTP request library for Node.js. Compared with Node's built-in http module it is more flexible and easier to use; in this project we use it to send the requests.
cheerio: a DOM manipulation library for Node.js, which can be thought of as jQuery on the Node side.
Initialize the project
After creating a project locally, we can execute the following command to install the dependencies:
// Install the dependencies
npm i express superagent cheerio
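If the folder has not been initialized as an npm project yet, you can generate a package.json first (a one-line sketch):

npm init -y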
Set up the Express server
Create app.js and set up a minimal Express configuration:
// Import express
const app = require('express')();

let data = []; // Define the data to be returned

// Define the route and the data it returns
app.get('/', async (req, res, next) => {
  res.send(data.length > 0 ? data : 'HelloWorld');
});

// Listen on port 3000
app.listen(3000, () => {
  console.log('app started at port http://localhost:3000...');
});
At this point, execute the startup command node app.js and access the corresponding address to see the response content:
Tips: every time you modify app.js you have to save it and re-run the startup command before the page reflects the change, which quickly becomes tedious. This repetitive process can be automated with a tool: nodemon. nodemon watches your JS files for changes and re-runs the node command automatically. It is also very simple to use: install nodemon globally, then start the app with nodemon app.js instead.
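Those two steps, as commands:

npm i -g nodemon
nodemon app.js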
Submit the network request
Here we take the Boss Zhipin page as an example: copy the URL and cookie of the page to be crawled, and send the request through superagent:
// Import superagent
const superagent = require('superagent');

function getHtml(url, cookie) {
  superagent
    .get(url)
    .set('Cookie', cookie)
    .end((err, res) => {
      if (err) {
        console.log(`page fetching failed: ${err}`);
      } else {
        // res.text can be seen as the raw page returned by the request
        data = cookData(res.text); // Assign the processed page data to data; cookData is defined later
      }
    });
}

let url =
  'https://www.zhipin.com/job_detail/?query=%E8%85%BE%E8%AE%AF&city=100010000&industry=&position=100999';
let cookie = 'xxxxxxx';

getHtml(url, cookie); // Execute the method
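By the way, superagent requests are also promise-compatible, so the same fetch can be written with async/await if you prefer. A minimal sketch, reusing the url, cookie and cookData from above (getHtmlAsync is just an illustrative name):

async function getHtmlAsync(url, cookie) {
  try {
    // await works because a superagent request is thenable
    const res = await superagent.get(url).set('Cookie', cookie);
    data = cookData(res.text);
  } catch (err) {
    console.log(`page fetching failed: ${err}`);
  }
}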
Parse the page data
Suppose we want to crawl the links of the secondary (detail) pages for front-end positions on Boss Zhipin. The overall approach: first inspect the original page and analyze its DOM structure, then use cheerio to select the corresponding DOM elements and extract the data we need.
In the figure below, we can clearly see that the primary-box is our target element.
// Import cheerio
const cheerio = require('cheerio');

function cookData(res) {
  let data = [];
  let $ = cheerio.load(res); // Parse the response page into a cheerio object
  // Collect the target data from each matching element
  $('div.primary-box').each((i, node) => {
    data[i] = node.attribs.href;
  });
  return data;
}
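If you are not sure a selector is right, you can try it on a tiny HTML snippet first. A minimal sketch; the markup below is only an illustrative assumption, not the real Boss Zhipin structure:

const cheerio = require('cheerio');

// Hypothetical markup for testing the selector
const html = `
  <div class="job-list">
    <div class="primary-box" href="/job_detail/aaa.html"></div>
    <div class="primary-box" href="/job_detail/bbb.html"></div>
  </div>`;

const $ = cheerio.load(html);
const links = [];
$('div.primary-box').each((i, node) => {
  links[i] = node.attribs.href; // equivalent to $(node).attr('href')
});
console.log(links); // [ '/job_detail/aaa.html', '/job_detail/bbb.html' ]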
At this point we are done, and the page should display the returned data:
Full code
const app = require('express')();
const cheerio = require('cheerio');
const superagent = require('superagent');

let data = [];
let url =
  'https://www.zhipin.com/job_detail/?query=%E8%85%BE%E8%AE%AF&city=100010000&industry=&position=100999';
let cookie = 'xxxxx';

function cookData(res) {
  let data = [];
  let $ = cheerio.load(res);
  $('div.primary-box').each((i, node) => {
    data[i] = node.attribs.href;
  });
  return data;
}

function getHtml(url, cookie) {
  superagent
    .get(url)
    .set('Cookie', cookie)
    .end((err, res) => {
      if (err) {
        console.log(`page fetching failed: ${err}`);
      } else {
        data = cookData(res.text);
      }
    });
}

getHtml(url, cookie);

app.get('/', async (req, res, next) => {
  res.send(data.length > 0 ? data : 'HelloWorld');
});

app.listen(3000, () => {
  console.log('app started at port http://localhost:3000...');
});
Epilogue
If you don't need to view the results in the front end, express isn't needed in this project at all; storing the processed data locally with the fs module is also a good choice. Of course, real-world crawler requirements are rarely this simple. The road ahead is long. Keep at it, everyone!
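A minimal sketch of that fs alternative, assuming we just dump the array into a JSON file (the file name is arbitrary):

const fs = require('fs');

// Persist the crawled data locally instead of serving it via express
fs.writeFile('./data.json', JSON.stringify(data, null, 2), (err) => {
  if (err) {
    console.log(`write failed: ${err}`);
  } else {
    console.log('data saved to data.json');
  }
});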