Suppose we want to extract data from a website. We can do this with a Node.js crawler. Building a crawler involves two main steps: fetching and storage.
Fetching
The most important part of a crawler is fetching the pages you want, ideally efficiently enough to crawl multiple pages concurrently. To extract the target content, we first need to analyze the page structure. There are two steps:
1. Use the Node.js request module to fetch the HTML of the target page.
2. Use the cheerio module to parse that HTML and extract the required data, as sketched below.
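As a minimal sketch of these two steps (the URL and selector here are placeholders, not the real target):

var request = require('request'),
    cheerio = require('cheerio');

request('http://example.com', function (err, res, body) {
    if (err || res.statusCode !== 200) return;
    var $ = cheerio.load(body);        // jQuery-like API over the fetched HTML
    console.log($('title').text());    // extract whatever the page structure calls for
});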
Storage
Once we have the data, we save it somewhere convenient: either serialized to a JSON file or written directly into a database.
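The JSON-file approach, which this article uses, can be as simple as the following sketch (the file name and data here are placeholders):

var fs = require('fs');

var data = [{ name: 'example', price: 10 }];
// Serialize the scraped data and write it to disk synchronously
fs.writeFileSync('data.json', JSON.stringify(data));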
Implementation
1. Initialize a project:
$ npm init
2. Install the dependency modules:
$ npm install express request cheerio --save
express is used to build the Node service;
request fetches the HTML behind a URL, much like an Ajax request;
cheerio parses the fetched HTML with a jQuery-like API.
3. Create a spider.js file in the root directory:
var express = require('express'),   // installed for building a service later; unused in this script
    app = express(),
    request = require('request'),
    cheerio = require('cheerio'),
    fs = require('fs');

var fetchData = [];

function fetchBrand() {
    request('http://waimai.baidu.com/waimai/shop/1434741117', function (err, res, body) {
        if (err || res.statusCode !== 200) {
            console.log(err);
            console.log('Crawl failed');
            return false;
        }
        // decodeEntities: false keeps non-ASCII text from being escaped into HTML entities (avoids garbled output)
        var $ = cheerio.load(body, { decodeEntities: false });
        var curBrands = $('.list-wrap');
        for (var i = 0; i < curBrands.length; i++) {
            var brand = {
                name: curBrands.eq(i).find('.list-status .title').text(),
                sub: []
            };
            fetchData.push(brand);
            var curSeries = curBrands.eq(i).find('.list-item');
            for (var j = 0; j < curSeries.length; j++) {
                var item = curSeries.eq(j);
                // The image URL is embedded in an inline style attribute, e.g. background-image: url(http://...)
                var style = item.find('.bg-img').attr('style');
                brand.sub.push({
                    imgpath: style.substring(style.indexOf('http'), style.indexOf(')')),
                    name: item.find('h3').text(),
                    recommend: item.find('.divider').prev().text(),
                    sale: item.find('.divider').next().text(),
                    stock: item.find('.stock-count').text(),
                    saPrice: item.find('span strong').text(),
                    orPrice: item.find('del strong').text()
                });
            }
        }
        var t = JSON.stringify(fetchData);
        fs.writeFileSync('baiduData.json', t);
        console.log('Crawl succeeded');
    });
}

fetchBrand();
4. Run spider.js:
$ node spider.js
The console will print "Crawl succeeded", and you will find a file named baiduData.json in the project folder.
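Since express was installed earlier but not yet used, a natural follow-up is to serve the scraped JSON over HTTP. This is only a sketch; the /data route and port 3000 are assumptions, not part of the original project:

var express = require('express'),
    fs = require('fs'),
    app = express();

// Hypothetical route: expose the scraped data as JSON
app.get('/data', function (req, res) {
    var json = fs.readFileSync('baiduData.json', 'utf8');
    res.type('json').send(json);
});

app.listen(3000, function () {
    console.log('Serving scraped data at http://localhost:3000/data');
});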