Preface
Today I'll walk you through a simple Node crawler. It's easy for front-end beginners to understand, and it's a small skill that comes with a real sense of accomplishment.
The idea behind a crawler can be summarized as: request a URL -> get the HTML back -> parse the HTML for the information you want.
This article will take you through crawling the Douban TOP250 movie list.
Tools
Cheerio is a fast, flexible and lean implementation of jQuery's core functionality, designed for use on the server side where you need to manipulate the DOM. You can simply think of it as a very convenient tool for parsing HTML. Before using it, you just need to install cheerio from the terminal:
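npm install cheerio
(The https and fs modules used later ship with Node itself, so they don't need installing.)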
Node crawler, step by step
1. Pick the URL of the target page and request its data over HTTPS
Douban TOP250 link address: https://movie.douban.com/top250
const https = require('https');

https.get('https://movie.douban.com/top250', function (res) {
  // The response arrives in chunks, so we concatenate them ourselves
  let html = '';
  // Append each chunk of data as it arrives
  res.on('data', function (chunk) {
    html += chunk;
  });
  // All chunks received: the page is complete
  res.on('end', function () {
    console.log(html);
  });
});
One thing to note in the code above: the response comes back in segments, so we have to concatenate the chunks into the full page ourselves.
res.on('data', function (chunk) {
  html += chunk;
});
Once the response ends, we can print the result and check that we received the complete page:
res.on('end', function () {
  console.log(html);
});
2. Use cheerio to parse out the content we need
const cheerio = require('cheerio');

res.on('end', function () {
  console.log(html);
  const $ = cheerio.load(html);
  let allFilms = [];
  $('li .item').each(function () {
    // Inside this loop, `this` points to the current movie's element
    // Grab the title under the current movie;
    // $('.title', this) is equivalent to this.querySelector('.title')
    const title = $('.title', this).text();
    const star = $('.rating_num', this).text();
    const pic = $('.pic img', this).attr('src');
    // console.log(title, star, pic);
    // With a database you could save there;
    // without one, we store the data in a JSON file via fs
    allFilms.push({
      title, star, pic
    });
  });
});
You can inspect the page source to see which tags the content lives under, and then use $ to select what you want. Here I grab each movie's title, rating, and poster image.
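For reference, each entry in the list looks roughly like the following (a simplified sketch reconstructed from the selectors above, not the exact Douban markup):

<li>
  <div class="item">
    <div class="pic">
      <a href="..."><img src="https://..." alt="..."></a>
    </div>
    <div class="info">
      <span class="title">The Shawshank Redemption</span>
      <span class="rating_num">9.7</span>
    </div>
  </div>
</li>

Every selector in the loop ($('li .item'), .title, .rating_num, .pic img) maps onto a class in this structure.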
Save the data
Now let's save the data. I'm writing it to a file called films.json, using Node's fs module:
const fs = require('fs');

fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
  if (!err) {
    console.log('File written successfully');
  }
});
Note that this file-writing code must go inside the res.on('end') callback, because allFilms is only complete once the whole page has been received and parsed.
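In context, the structure looks like this (abridged from the full source below):

res.on('end', function () {
  const $ = cheerio.load(html);
  let allFilms = [];
  // ... build allFilms with cheerio as shown above ...
  fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
    if (!err) {
      console.log('File written successfully');
    }
  });
});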
Download the pictures
The picture data we crawled is just a set of image URLs. What if we want to save the pictures locally? We simply request each image URL the same way we requested the page, and write each response to a local file.
function downloadImage(allFilms) {
  for (let i = 0; i < allFilms.length; i++) {
    const picUrl = allFilms[i].pic;
    // Request the image URL -> get its contents,
    // then fs.writeFile('./xx.png', contents)
    https.get(picUrl, function (res) {
      res.setEncoding('binary');
      let str = '';
      res.on('data', function (chunk) {
        str += chunk;
      });
      res.on('end', function () {
        fs.writeFile(`./images/${i}.png`, str, 'binary', function (err) {
          if (!err) {
            console.log(`Image ${i} downloaded successfully`);
          }
        });
      });
    });
  }
}
The steps for downloading an image are exactly the same as for crawling the page data. We save each image as a .png, wrap the logic in a downloadImage function, and call that function inside res.on('end'). That's it.
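Two small caveats. First, fs.writeFile won't create the images directory for you, so make sure it exists before the downloads start. Second, the 'binary' (latin1) string trick above works, but collecting raw Buffer chunks and joining them with Buffer.concat is the more idiomatic way to handle binary data in Node. A sketch with both changes, same behavior otherwise:

function downloadImage(allFilms) {
  // Create the output directory if it doesn't exist yet
  fs.mkdirSync('./images', { recursive: true });
  for (let i = 0; i < allFilms.length; i++) {
    https.get(allFilms[i].pic, function (res) {
      const chunks = [];
      // Keep the raw Buffer chunks instead of converting to a string
      res.on('data', function (chunk) {
        chunks.push(chunk);
      });
      res.on('end', function () {
        fs.writeFile(`./images/${i}.png`, Buffer.concat(chunks), function (err) {
          if (!err) {
            console.log(`Image ${i} downloaded successfully`);
          }
        });
      });
    });
  }
}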
The source code
// Request URL -> get HTML -> parse HTML
const https = require('https');
const cheerio = require('cheerio');
const fs = require('fs');

// Request the TOP250 page:
// the same GET request a browser makes when you enter the URL
https.get('https://movie.douban.com/top250', function (res) {
  // console.log(res);
  // The response arrives in chunks, so we concatenate them ourselves
  let html = '';
  // Append each chunk of data as it arrives
  res.on('data', function (chunk) {
    html += chunk;
  });
  // All chunks received: parse the page
  res.on('end', function () {
    console.log(html);
    const $ = cheerio.load(html);
    let allFilms = [];
    $('li .item').each(function () {
      // Inside this loop, `this` points to the current movie's element
      // Grab the title under the current movie;
      // $('.title', this) is equivalent to this.querySelector('.title')
      const title = $('.title', this).text();
      const star = $('.rating_num', this).text();
      const pic = $('.pic img', this).attr('src');
      // console.log(title, star, pic);
      // With a database you could save there;
      // without one, we store the data in a JSON file via fs
      allFilms.push({
        title, star, pic
      });
    });
    // Write the array to the JSON file
    fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
      if (!err) {
        console.log('File written successfully');
      }
    });
    // Download the images
    downloadImage(allFilms);
  });
});

function downloadImage(allFilms) {
  for (let i = 0; i < allFilms.length; i++) {
    const picUrl = allFilms[i].pic;
    // Request the image URL -> get its contents,
    // then fs.writeFile('./xx.png', contents)
    https.get(picUrl, function (res) {
      res.setEncoding('binary');
      let str = '';
      res.on('data', function (chunk) {
        str += chunk;
      });
      res.on('end', function () {
        fs.writeFile(`./images/${i}.png`, str, 'binary', function (err) {
          if (!err) {
            console.log(`Image ${i} downloaded successfully`);
          }
        });
      });
    });
  }
}
Conclusion
Crawlers aren't exclusive to Python: Node makes them convenient and simple too. For a front-end beginner, mastering a small skill like this feels great and really helps with your Node learning. Feel free to leave a comment to discuss.