Preface
Today I'll walk you through a simple Node crawler. It's easy for front-end beginners to understand, and it's a small skill that comes with a real sense of accomplishment.
The idea behind a crawler can be summarized as: request a URL -> get the HTML back -> parse the HTML for the information you want.
This article will take you through crawling the Douban TOP250 movie list.
Tools
Cheerio is a fast, flexible and lean implementation of jQuery's core functionality, designed for use on the server side where you need to manipulate the DOM. You can simply think of it as a very convenient tool for parsing HTML. Before using it, you just need to install cheerio from the terminal:
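npm install cheerio
(The https and fs modules used later ship with Node itself, so they don't need installing.)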
Node crawler, step by step
1. Pick the URL of the target page and request its data over HTTPS
Douban TOP250 link address: https://movie.douban.com/top250
const https = require('https');

https.get('https://movie.douban.com/top250', function (res) {
  // The response arrives in chunks, so we concatenate them ourselves
  let html = '';
  // Append each chunk of data as it arrives
  res.on('data', function (chunk) {
    html += chunk;
  });
  // All chunks received: the page is complete
  res.on('end', function () {
    console.log(html);
  });
});
One thing to note in the code above: the response comes back in segments, so we have to concatenate the chunks into the full page ourselves.
res.on('data', function (chunk) {
  html += chunk;
});
Once the response ends, we can print the result and check that we received the complete page:
res.on('end', function () {
  console.log(html);
});
2. Use cheerio to parse out the content we need
const cheerio = require('cheerio');

res.on('end', function () {
  console.log(html);
  const $ = cheerio.load(html);
  let allFilms = [];
  $('li .item').each(function () {
    // Inside this loop, `this` points to the current movie's element
    // Grab the title under the current movie;
    // $('.title', this) is equivalent to this.querySelector('.title')
    const title = $('.title', this).text();
    const star = $('.rating_num', this).text();
    const pic = $('.pic img', this).attr('src');
    // console.log(title, star, pic);
    // With a database you could save there;
    // without one, we store the data in a JSON file via fs
    allFilms.push({
      title, star, pic
    });
  });
});
You can inspect the page source to see which tags the content lives under, and then use $ to select what you want. Here I grab each movie's title, rating, and poster image.
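For reference, each entry in the list looks roughly like the following (a simplified sketch reconstructed from the selectors above, not the exact Douban markup):

<li>
  <div class="item">
    <div class="pic">
      <a href="..."><img src="https://..." alt="..."></a>
    </div>
    <div class="info">
      <span class="title">The Shawshank Redemption</span>
      <span class="rating_num">9.7</span>
    </div>
  </div>
</li>

Every selector in the loop ($('li .item'), .title, .rating_num, .pic img) maps onto a class in this structure.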
Save the data
Now let's save the data. I'm writing it to a file called films.json, using Node's fs module:
const fs = require('fs');

fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
  if (!err) {
    console.log('File written successfully');
  }
});
Note that this file-writing code must go inside the res.on('end') callback, because allFilms is only complete once the whole page has been received and parsed.
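In context, the structure looks like this (abridged from the full source below):

res.on('end', function () {
  const $ = cheerio.load(html);
  let allFilms = [];
  // ... build allFilms with cheerio as shown above ...
  fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
    if (!err) {
      console.log('File written successfully');
    }
  });
});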
Download the pictures
The picture data we crawled is just a set of image URLs. What if we want to save the pictures locally? We simply request each image URL the same way we requested the page, and write each response to a local file.
function downloadImage(allFilms) {
  for (let i = 0; i < allFilms.length; i++) {
    const picUrl = allFilms[i].pic;
    // Request the image URL -> get its contents,
    // then fs.writeFile('./xx.png', contents)
    https.get(picUrl, function (res) {
      res.setEncoding('binary');
      let str = '';
      res.on('data', function (chunk) {
        str += chunk;
      });
      res.on('end', function () {
        fs.writeFile(`./images/${i}.png`, str, 'binary', function (err) {
          if (!err) {
            console.log(`Image ${i} downloaded successfully`);
          }
        });
      });
    });
  }
}
The steps for downloading an image are exactly the same as for crawling the page data. We save each image as a .png, wrap the logic in a downloadImage function, and call that function inside res.on('end'). That's it.
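Two small caveats. First, fs.writeFile won't create the images directory for you, so make sure it exists before the downloads start. Second, the 'binary' (latin1) string trick above works, but collecting raw Buffer chunks and joining them with Buffer.concat is the more idiomatic way to handle binary data in Node. A sketch with both changes, same behavior otherwise:

function downloadImage(allFilms) {
  // Create the output directory if it doesn't exist yet
  fs.mkdirSync('./images', { recursive: true });
  for (let i = 0; i < allFilms.length; i++) {
    https.get(allFilms[i].pic, function (res) {
      const chunks = [];
      // Keep the raw Buffer chunks instead of converting to a string
      res.on('data', function (chunk) {
        chunks.push(chunk);
      });
      res.on('end', function () {
        fs.writeFile(`./images/${i}.png`, Buffer.concat(chunks), function (err) {
          if (!err) {
            console.log(`Image ${i} downloaded successfully`);
          }
        });
      });
    });
  }
}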
The source code
// Request URL -> get HTML -> parse HTML
const https = require('https');
const cheerio = require('cheerio');
const fs = require('fs');

// Request the TOP250 page:
// the same GET request a browser makes when you enter the URL
https.get('https://movie.douban.com/top250', function (res) {
  // console.log(res);
  // The response arrives in chunks, so we concatenate them ourselves
  let html = '';
  // Append each chunk of data as it arrives
  res.on('data', function (chunk) {
    html += chunk;
  });
  // All chunks received: parse the page
  res.on('end', function () {
    console.log(html);
    const $ = cheerio.load(html);
    let allFilms = [];
    $('li .item').each(function () {
      // Inside this loop, `this` points to the current movie's element
      // Grab the title under the current movie;
      // $('.title', this) is equivalent to this.querySelector('.title')
      const title = $('.title', this).text();
      const star = $('.rating_num', this).text();
      const pic = $('.pic img', this).attr('src');
      // console.log(title, star, pic);
      // With a database you could save there;
      // without one, we store the data in a JSON file via fs
      allFilms.push({
        title, star, pic
      });
    });
    // Write the array to the JSON file
    fs.writeFile('./films.json', JSON.stringify(allFilms), function (err) {
      if (!err) {
        console.log('File written successfully');
      }
    });
    // Download the images
    downloadImage(allFilms);
  });
});

function downloadImage(allFilms) {
  for (let i = 0; i < allFilms.length; i++) {
    const picUrl = allFilms[i].pic;
    // Request the image URL -> get its contents,
    // then fs.writeFile('./xx.png', contents)
    https.get(picUrl, function (res) {
      res.setEncoding('binary');
      let str = '';
      res.on('data', function (chunk) {
        str += chunk;
      });
      res.on('end', function () {
        fs.writeFile(`./images/${i}.png`, str, 'binary', function (err) {
          if (!err) {
            console.log(`Image ${i} downloaded successfully`);
          }
        });
      });
    });
  }
}
Conclusion
Crawlers aren't exclusive to Python: Node makes them convenient and simple too. For a front-end beginner, mastering a small skill like this feels great and really helps with your Node learning. Feel free to leave a comment to discuss.