Preface
Almost every operation in this project is commented, which makes it a good fit for anyone who wants a basic understanding of Node crawlers.
Introduction
A minimalist, 50-line Node crawler for GitHub Trending: a simple hands-on project built with Axios, Express, and Cheerio.
Getting started
First, make sure Node.js 10.0+ is installed on your machine, then:
1. Clone the project
git clone https://github.com/poozhu/Crawler-for-Github-Trending.git
cd Crawler-for-Github-Trending
npm i
node index.js
2. Alternatively, download the project archive and unzip it
cd Crawler-for-Github-Trending-master   // Go to the project folder
npm i
node index.js
Sample output
When the project starts, you'll see the console output:
Listening on port 3000!
Now open a browser and visit the local service:
http://localhost:3000/time-language     // "time" is the ranking period, "language" is the programming language

For example:
http://localhost:3000/                  // default: fetch today's trending for all languages
http://localhost:3000/daily             // today's list; other period values: weekly, monthly
http://localhost:3000/daily-javascript  // today's JavaScript list; any language can be used
After a short wait, you'll see the crawled data returned. An example:
[{"title": "lib-pku / libpku"."links": "https://github.com/lib-pku/libpku"."description": "Private collation of your curriculum materials"."language": "JavaScript"."stars": "14297"."forks": "4360"."info": "3121 stars this week." "
},
{
"title": "SqueezerIO / squeezer"."links": "https://github.com/SqueezerIO/squeezer"."description": "Squeezer Framework - Build serverless dApps"."language": "JavaScript"."stars": "3212"."forks": "80"."info": "2807 stars this week." "},... ]Copy the code
Walkthrough
First, require the dependencies the crawler needs:
const cheerio = require('cheerio')  // HTML parsing module; lets us query the page like jQuery
const axios = require('axios');     // HTTP requests
const express = require('express')  // Server module
Building the request URL
function getData(time, language) {
    // Build the request URL from whichever parameters the crawl function received
    let url = 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
}
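To make the concatenation concrete, here is a quick standalone sketch (buildUrl is a hypothetical helper for illustration, not part of the project) showing the URLs it produces:

// Hypothetical helper mirroring the concatenation above, for illustration only
function buildUrl(time, language) {
    return 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
}

console.log(buildUrl('daily'));                // https://github.com/trending?since=daily
console.log(buildUrl('weekly', 'javascript')); // https://github.com/trending/javascript?since=weekly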
Making a simple GET request
function getData(time, language) {
    let url = 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
    return axios.get(url) // Return the promise so callers can .then() on the result
        .then(function (response) {
            let html_string = response.data.toString();
            const $ = cheerio.load(html_string); // Hand the page HTML to cheerio
            // From here we can query the page's elements with $ just like jQuery
        })
        .catch(function (error) {
            console.log(error);
        });
}
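If cheerio is new to you, a tiny self-contained example (the markup here is made up for illustration) shows the jQuery-like API in action:

const cheerio = require('cheerio');

// Load an HTML string, then query it with familiar jQuery selectors
const $ = cheerio.load('<ul><li class="repo">lib-pku / libpku</li></ul>');
console.log($('.repo').text()); // "lib-pku / libpku"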
Analyzing the GitHub Trending page structure and writing the extraction code
Inspecting the page shows that the list lives under the .Box element, so based on that structure we can grab the data from the corresponding positions, jQuery-style:
$('.Box .Box-row').each(function () { // Iterate over the list rows, jQuery-style
    let obj = {};
    obj.title = $(this).find('h1').text().trim(); // Repository title
    obj.links = 'https://github.com/' + obj.title.replace(/\s/g, ''); // Build the repository link
    obj.description = $(this).find('p').text().trim(); // Description
    obj.language = $(this).find('>.f6 .repo-language-color').siblings().text().trim(); // Language
    obj.stars = $(this).find('>.f6 a').eq(0).text().trim(); // Star count
    obj.forks = $(this).find('>.f6 a').eq(1).text().trim(); // Fork count
    obj.info = $(this).find('>.f6 .float-sm-right').text().trim(); // Stars gained in the selected period
    obj.avatar = $(this).find('>.f6 img').eq(0).attr('src'); // First contributor's avatar
})
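The loop above fills one obj per row but doesn't show where the objects end up. Presumably the full project collects them into an array that the promise resolves with; a minimal sketch of how the pieces fit together, under that assumption:

// Sketch only: wiring the extraction loop into getData so the promise resolves with the list
function getData(time, language) {
    let url = 'https://github.com/trending' + (!!language ? '/' + language : '') + '?since=' + time;
    return axios.get(url).then(function (response) {
        const $ = cheerio.load(response.data.toString());
        let list = [];
        $('.Box .Box-row').each(function () {
            let obj = {};
            obj.title = $(this).find('h1').text().trim();
            // ... the remaining fields from the loop above ...
            list.push(obj); // Collect each row's data
        });
        return list; // Becomes the value the routes receive in .then()
    });
}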
Route configuration
With the core function complete, simply configure the corresponding routes to trigger a crawl and return the data:
const app = express()

// Default route: fetch today's list for all languages
app.get('/', (req, res) => {
    let promise = getData('daily'); // Kick off the crawl
    promise.then(response => {
        res.json(response); // Return the data
    });
})

// Route with two parameters
app.get('/:time-:language', (req, res) => {
    const {
        time,     // The ranking period
        language  // The language to filter by
    } = req.params;
    let promise = getData(time, language); // Kick off the crawl
    promise.then(response => {
        res.json(response); // Return the data
    });
})

// Route with only the time parameter
app.get('/:time', (req, res) => {
    const {
        time // The ranking period
    } = req.params;
    let promise = getData(time); // Kick off the crawl
    promise.then(response => {
        res.json(response); // Return the data
    });
})

app.listen(3000, () => console.log('Listening on port 3000!')) // Listen on port 3000
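Once the server is running, you can exercise the routes from any HTTP client; a quick sketch (not part of the project) using the axios package already required above:

// Quick check of a route from another Node script
const axios = require('axios');

axios.get('http://localhost:3000/weekly-javascript')
    .then(res => console.log(res.data)); // Prints the crawled JSON array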
Conclusion
In this project, every request triggers a live crawl, so responses are slow. To serve this as real interface data, the crawl should instead run periodically and store its results in a database.
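A rough sketch of that idea (an in-memory cache on a timer instead of a database, purely illustrative):

// Sketch: refresh a cache every 10 minutes so requests don't trigger a live crawl
let cache = [];

function refresh() {
    getData('daily').then(data => { cache = data; });
}

refresh();                            // Warm the cache at startup
setInterval(refresh, 10 * 60 * 1000); // Refresh every 10 minutes

app.get('/cached', (req, res) => res.json(cache)); // Served instantly from memory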
Still, working through the project code teaches the most basic usage and concepts of the Node modules above and of crawlers in general. I hope it helps. 😀