If you don't know Python and happen to be a Node.js novice, this article will help you pick up a new skill: writing a simple crawler from scratch in Node.js in 10 minutes. Installing Node itself is not covered step by step here; if you don't have it yet, search online for an installation guide. Start by setting up the Node environment:
1: Create a folder named WebSpider on drive D.
2: Open CMD as administrator, switch to drive D, and cd into the folder: cd WebSpider.
3: Create a FirstSpider folder.
(Screenshot: the directory after the FirstSpider folder is created.)
4: cd into the FirstSpider folder you just created.
5: Run npm init to initialize the project.
At this point you are asked for some project information. Fill it in as appropriate, or just press Enter to accept the defaults.
Once initialization finishes, a package.json file is generated; it contains the project's basic information.
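For reference, a freshly generated package.json looks roughly like this (the exact values depend on what you entered at the prompts):

{
  "name": "firstspider",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}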
6: Install the third-party packages (the program below will call these package modules directly).
Note: the http and fs modules are built into Node.js, so they do not need to be installed separately.
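For example, built-in modules can be required directly without any npm install:

var http = require('http')  // built into Node.js, nothing to install
var fs = require('fs')      // built into Node.js, nothing to install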
Here you install the cheerio and request packages. request is used to make HTTP requests, and cheerio analyzes and extracts data from the downloaded DOM; you can think of it as a server-side jQuery.
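To see why the jQuery comparison fits, here is a minimal cheerio sketch (the HTML string is made up for illustration):

var cheerio = require('cheerio')

// Load an HTML string the way jQuery works with a page's DOM
var $ = cheerio.load('<ul><li class="name">Alice</li><li class="name">Bob</li></ul>')

// Familiar jQuery-style selectors and traversal
$('.name').each(function (i, item) {
  console.log($(this).text())  // prints "Alice", then "Bob"
})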
In CMD, cd into the FirstSpider folder and run: npm install cheerio --save
After the cheerio package is installed, run: npm install request --save
npm (Node.js Package Manager) is Node's package manager; the purpose of --save is to record the installed package as a dependency of the project in the package.json file.
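After both installs, package.json should contain a dependencies section roughly like this (your version numbers may differ):

"dependencies": {
  "cheerio": "^1.0.0-rc.3",
  "request": "^2.88.0"
}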
If you want to keep the data and images you crawl organized, create the folders in advance. Inside the FirstSpider folder:
create a subfolder data (for the scraped news text content), create a subfolder image (for the scraped image resources), and create a file named first_spider.js.
The directory structure of the whole project should now look roughly like this:
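FirstSpider
├── data
├── image
├── node_modules
├── first_spider.js
└── package.json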
7: Now the main part: open first_spider.js and type in the code below line by line. (If you don't want to write it yourself, you can find similar sample code online to test with.)
var request = require('request')  // makes HTTP requests
var cheerio = require('cheerio')  // parses HTML with a jQuery-like API

// Crawl list pages 1 to 3
for (var i = 1; i < 4; i++) {
  request('http://www.souweixin.com/personal?t=41&p=' + i, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      // Load the list page and collect the link to each personal page
      var $ = cheerio.load(body)
      var links = []
      $('.boldBorder > a').each(function (i, item) {
        links.push($(this).attr('href'))
      })
      // Request each personal page and print the fields we want
      for (var j = 0; j < links.length; j++) {
        request('http://www.souweixin.com' + links[j], function (error, response, body) {
          if (!error && response.statusCode == 200) {
            var $ = cheerio.load(body)  // keep $ local so concurrent callbacks don't clobber each other
            console.log('weixin: ' + $('.bold').text() + ' name: ' + $('h1').text() + ' desc: ' + $('.f18').text())
          }
        })
      }
    }
  })
}
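The script above only prints to the console. If you want to make use of the data folder created earlier, a minimal sketch of appending each result to a file might look like this (the file name result.txt is just an illustration):

var fs = require('fs')

// Append one line per scraped profile to data/result.txt
function saveResult(line) {
  fs.appendFile('./data/result.txt', line + '\n', function (err) {
    if (err) console.error('write failed: ' + err)
  })
}

// Inside the inner request callback, call saveResult(...) instead of console.log.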
8: On the CMD command line, cd into the FirstSpider folder where you created the project and run: node first_spider.js
9: Note: if at this point you hit a bug complaining that port 80 is occupied, see this write-up on freeing port 80: www.jianshu.com/p/a7fc19b0c…
In short, find the processes occupying the port (on Windows, netstat -ano | findstr :80 shows the PID, and taskkill /PID <pid> /F ends the process) and turn them off one by one.