Preface
Almost every operation in this project is commented, which makes it a good fit for anyone who wants a basic understanding of Node crawlers.
Introduction
A minimalist, 50-line Node crawler for GitHub Trending: a simple hands-on project built with Axios, Express, and Cheerio.
Getting started
First, make sure Node.js 10.0+ is installed on your machine, then:
1. Clone the project
git clone https://github.com/poozhu/Crawler-for-Github-Trending.git
cd Crawler-for-Github-Trending
npm i
node index.js
2. Alternatively, download the project archive and unzip it
cd Crawler-for-Github-Trending-master   // Go to the project folder
npm i
node index.js
Sample output
When the project starts, you'll see the console output:
Listening on port 3000!
Now open a browser and visit the local service:
http://localhost:3000/time-language     // "time" is the ranking period, "language" is the programming language

For example:
http://localhost:3000/                  // default: fetch today's trending for all languages
http://localhost:3000/daily             // today's list; other period values: weekly, monthly
http://localhost:3000/daily-javascript  // today's JavaScript list; any language can be used
After a short wait, you'll see the crawled data returned. An example:
[{"title": "lib-pku / libpku"."links": "https://github.com/lib-pku/libpku"."description": "Private collation of your curriculum materials"."language": "JavaScript"."stars": "14297"."forks": "4360"."info": "3121 stars this week." "
},
{
"title": "SqueezerIO / squeezer"."links": "https://github.com/SqueezerIO/squeezer"."description": "Squeezer Framework - Build serverless dApps"."language": "JavaScript"."stars": "3212"."forks": "80"."info": "2807 stars this week." "},... ]Copy the code
Walkthrough
First, require the dependencies the crawler needs:
const cheerio = require('cheerio')  // HTML parsing module; lets us query the page like jQuery
const axios = require('axios');     // HTTP requests
const express = require('express')  // Server module
Building the request URL
function getData(time, language) {
    // Build the request URL from whichever parameters the crawl function received
    let url = 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
}
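To make the concatenation concrete, here is a quick standalone sketch (buildUrl is a hypothetical helper for illustration, not part of the project) showing the URLs it produces:

// Hypothetical helper mirroring the concatenation above, for illustration only
function buildUrl(time, language) {
    return 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
}

console.log(buildUrl('daily'));                // https://github.com/trending?since=daily
console.log(buildUrl('weekly', 'javascript')); // https://github.com/trending/javascript?since=weekly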
Making a simple GET request
function getData(time, language) {
    let url = 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
    return axios.get(url) // Return the promise so callers can .then() on the result
        .then(function (response) {
            let html_string = response.data.toString();
            const $ = cheerio.load(html_string); // Hand the page HTML to cheerio
            // From here we can query the page's elements with $ just like jQuery
        })
        .catch(function (error) {
            console.log(error);
        });
}
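If cheerio is new to you, a tiny self-contained example (the markup here is made up for illustration) shows the jQuery-like API in action:

const cheerio = require('cheerio');

// Load an HTML string, then query it with familiar jQuery selectors
const $ = cheerio.load('<ul><li class="repo">lib-pku / libpku</li></ul>');
console.log($('.repo').text()); // "lib-pku / libpku"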
Analyzing the GitHub Trending page structure and writing the extraction code
Inspecting the page shows that the list lives under the .Box element, so based on that structure we can grab the data from the corresponding positions, jQuery-style:
$('.Box .Box-row').each(function () { // Iterate over the list rows, jQuery-style
    let obj = {};
    obj.title = $(this).find('h1').text().trim(); // Repository title
    obj.links = 'https://github.com/' + obj.title.replace(/\s/g, ''); // Build the repository link
    obj.description = $(this).find('p').text().trim(); // Description
    obj.language = $(this).find('>.f6 .repo-language-color').siblings().text().trim(); // Language
    obj.stars = $(this).find('>.f6 a').eq(0).text().trim(); // Star count
    obj.forks = $(this).find('>.f6 a').eq(1).text().trim(); // Fork count
    obj.info = $(this).find('>.f6 .float-sm-right').text().trim(); // Stars gained in the selected period
    obj.avatar = $(this).find('>.f6 img').eq(0).attr('src'); // First contributor's avatar
})
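The loop above fills one obj per row but doesn't show where the objects end up. Presumably the full project collects them into an array that the promise resolves with; a minimal sketch of how the pieces fit together, under that assumption:

// Sketch only: wiring the extraction loop into getData so the promise resolves with the list
function getData(time, language) {
    let url = 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
    return axios.get(url).then(function (response) {
        const $ = cheerio.load(response.data.toString());
        let list = [];
        $('.Box .Box-row').each(function () {
            let obj = {};
            obj.title = $(this).find('h1').text().trim();
            // ... the remaining fields from the loop above ...
            list.push(obj); // Collect each row's data
        });
        return list; // Becomes the value the routes receive in .then()
    });
}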
Route configuration
With the core function complete, simply configure the corresponding routes to trigger a crawl and return the data:
const app = express()

// Default route: fetch today's list for all languages
app.get('/', (req, res) => {
    let promise = getData('daily'); // Kick off the crawl
    promise.then(response => {
        res.json(response); // Return the data
    });
})

// Route with two parameters
app.get('/:time-:language', (req, res) => {
    const {
        time,     // The ranking period
        language  // The language to filter by
    } = req.params;
    let promise = getData(time, language); // Kick off the crawl
    promise.then(response => {
        res.json(response); // Return the data
    });
})

// Route with only the time parameter
app.get('/:time', (req, res) => {
    const {
        time // The ranking period
    } = req.params;
    let promise = getData(time); // Kick off the crawl
    promise.then(response => {
        res.json(response); // Return the data
    });
})

app.listen(3000, () => console.log('Listening on port 3000!')) // Listen on port 3000
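Once the server is running, you can exercise the routes from any HTTP client; a quick sketch (not part of the project) using the axios package already required above:

// Quick check of a route from another Node script
const axios = require('axios');

axios.get('http://localhost:3000/weekly-javascript')
    .then(res => console.log(res.data)); // Prints the crawled JSON array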
Conclusion
In this project, every request triggers a live crawl, so responses are slow. To serve this as real interface data, the crawl should instead run periodically and store its results in a database.
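A rough sketch of that idea (an in-memory cache on a timer instead of a database, purely illustrative):

// Sketch: refresh a cache every 10 minutes so requests don't trigger a live crawl
let cache = [];

function refresh() {
    getData('daily').then(data => { cache = data; });
}

refresh();                            // Warm the cache at startup
setInterval(refresh, 10 * 60 * 1000); // Refresh every 10 minutes

app.get('/cached', (req, res) => res.json(cache)); // Served instantly from memory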
Still, working through the project code teaches the most basic usage and concepts of the Node modules above and of crawlers in general. I hope it helps. 😀