How the story began
Life was calm and carefree, though as usual with too little money and too many things to do (picture an appropriate emoji here). Then one day the story arrived: an old classmate, a lovely statistician, said she was tired of copying and pasting by hand and asked me to write a crawler to grab the content of some articles and turn it into paragraph text, so her team could do follow-up statistics and analysis. Of course I had to help, even though I had never written a crawler before… Haha. So after work I spent a few hours putting together this simple little demo with Node.js, which I'm fairly familiar with (the story continues at the end of the article, along with a photo from my old classmate! 😁).
Goal:
To crawl the NHANES site (wwwn.cdc.gov/nchs/nhanes…), then extract the first paragraph and the table information at the end of each article, as shown below:
(2018.08.31 Supplement)
Crawler scheme analysis:
Open the main page you want to crawl, right-click a document under the Doc File column, and choose “Inspect”. In the browser developer tools (press F12 on Windows to open them directly), you can see the document's URL and the DOM structure of the entire page under the Elements tab.
Using the same method, open one of the articles and right-click the part you want to crawl to view the DOM structure of the article, so you know where to retrieve the content from (a quick way to verify the selector is sketched below).
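To make that concrete, one row of the Doc File table looks roughly like the simplified, made-up markup in the sketch below, and once you know the structure you can test a selector with cheerio before writing the whole crawler (this assumes cheerio is installed, which is covered in the next step; the sample HTML is illustrative, not the real page source):

var cheerio = require('cheerio');

// A simplified, made-up stand-in for one row of the Doc File table,
// only meant to show how a selector picks out the link once the DOM structure is known.
var sampleRow =
    '<table id="GridView1"><tbody><tr>' +
    '<td>Example Lab Data</td>' +
    '<td><a href="/nchs/nhanes/2013-2014/EXAMPLE_DOC.htm">EXAMPLE Doc</a></td>' +
    '</tr></tbody></table>';

var $ = cheerio.load(sampleRow);
$('#GridView1 tbody tr').each(function () {
    // The second cell holds the Doc File link; its href is relative,
    // so the real crawler prefixes it with https://wwwn.cdc.gov
    console.log($(this).children().next().children('a').attr('href'));
});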
A brief idea:
Find the table on the list page, take the URL of each article from its Doc File link, fetch the article content from that URL, select the DOM nodes holding the content we want to crawl, then format it and save it to a TXT file.
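Before the real code, here is a bird's-eye sketch of that flow written with a small promise wrapper around https.get, just to show how the steps chain together. fetchPage and run are names I made up for the sketch; the actual callback-based code I used follows further down:

const https = require('https');
const cheerio = require('cheerio');

// Hypothetical helper: wrap https.get in a Promise that resolves with the page HTML.
function fetchPage(url) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            let html = '';
            res.setEncoding('utf-8');
            res.on('data', (chunk) => { html += chunk; });
            res.on('end', () => resolve(html));
        }).on('error', reject);
    });
}

// Hypothetical driver that walks through the steps listed above.
async function run(listUrl) {
    const $list = cheerio.load(await fetchPage(listUrl));                        // load the list page
    const urls = $list('#GridView1 tbody tr').map((i, tr) =>                     // collect the Doc File URLs
        'https://wwwn.cdc.gov' + $list(tr).children().next().children('a').attr('href')
    ).get();
    for (const url of urls) {
        const $ = cheerio.load(await fetchPage(url));                            // fetch each article
        const firstParagraph = $('#Component_Description').next().text().trim(); // pick the DOM we want
        console.log(firstParagraph.slice(0, 80));                                // formatting / saving comes later (savedContent)
    }
}

run('https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Laboratory&CycleBeginYear=2013')
    .catch(console.error);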
The Node packages used are as follows: the built-in http, https, and fs modules, plus cheerio for jQuery-style DOM selection.
Run npm init to initialize the project, then npm install cheerio request to install the dependencies (the final code ends up using the built-in https module rather than request), and create app.js to hold the code.
Project Contents:
Straight to the source code:
var http = require('http');
var https = require('https');
var fs = require('fs');
var cheerio = require('cheerio');

var urlHeader = 'https://wwwn.cdc.gov';
var urlFather = 'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Laboratory&CycleBeginYear=2013'; // the initial url
let count = 0;

// Fetch the list page and collect the URL of every article under Doc File
function findUrlList(x, callback) {
    https.get(x, function (res) {
        var html = '';
        res.setEncoding('utf-8');
        // Listen to the data event, one chunk at a time
        res.on('data', function (chunk) { html += chunk; });
        // Listen for the end event: the HTML of the entire page has been retrieved, so parse it
        res.on('end', function () {
            var $ = cheerio.load(html);
            var urlArr = [];
            $('#GridView1 tbody tr').each(function (index, item) {
                let url = urlHeader + $(this).children().next().children('a').attr('href');
                startRequest(url);
                urlArr.push(url);
            });
            console.log(urlArr.length);
            callback();
        });
    }).on('error', function (err) { console.log(err); });
}

// Use the https module to make a get request for a single article
function startRequest(x) {
    https.get(x, function (res) {
        var html = '';
        res.setEncoding('utf-8');
        // Listen to the data event, one chunk at a time
        res.on('data', function (chunk) { html += chunk; });
        // Listen for the end event: the HTML of the entire article has been retrieved, so extract the content
        res.on('end', function () {
            var $ = cheerio.load(html);
            var news_item = {
                title: $('div #PageHeader h2').text().trim(), // the title of the article
                url: 'Article Address: ' + x,
                firstParagraph: 'First paragraph: \n' + $('#Component_Description').next().text().trim(),
                codeBookAndFrequencies: 'Table information: \n' + $('#Codebook').children().text().trim()
            };
            savedContent($, news_item); // Store the content and title of each article
        });
    }).on('error', function (err) { console.error(err); });
}

// This function stores the article content locally
function savedContent($, news_item) {
    count++;
    let x = '[' + count + '] ' + '\n';
    x += news_item.url;
    x = x + '\n';
    x += news_item.firstParagraph;
    x = x + '\n';
    x += news_item.codeBookAndFrequencies;
    x = x + '\n';
    x += '------------------------------------------------------------ \n';
    x = x + '\n';
    x = x + '\n';
    // Append each article's text to the /data folder, naming the file after the article title
    fs.appendFile('./data/' + news_item.title + '.txt', x, 'utf-8', function (err) {
        if (err) { console.log(err); }
    });
}

// startRequest(url);
findUrlList(urlFather, () => { console.log('work done'); });
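One thing worth noting before running it: fs.appendFile will not create the data folder on its own, so make sure ./data exists in the project root first. A small guard you could add near the top of app.js (my suggestion, not part of the original script):

var fs = require('fs');

// fs.appendFile does not create missing directories, so create ./data up front if it is absent.
if (!fs.existsSync('./data')) {
    fs.mkdirSync('./data');
}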
Run node app.js from the command line to see the result!
Checking the crawler's TXT output:
The story continues
Yeah, it took me two short, interrupted evenings to build this for her, and then I realized it wasn't really what she wanted: what she needed was a much bigger, more versatile system that would interface with their R-language pipeline!! But there was no funding for that, ha ha, so that's where the story ends! Still, as a very inquisitive programmer, I'm quite satisfied to have picked up the basics of writing a crawler along the way. Thanks for reading, and I hope you got something out of it!
Beautiful classmate, please accept this; you won't be disappointed, ha ha ha ha 😄