Initialize & install dependencies
npm init --yes
npm i express superagent cheerio -S
SuperAgent is a lightweight HTTP request (Ajax) library that works both on the server (Node.js) and on the client (browser)
Cheerio is a fast, flexible, and lean implementation of core jQuery designed for the server (see the Cheerio Chinese documentation)
Build a simple server
// Create a server instance
const express = require('express')
const app = express()

app.get('/', (req, res) => {
  res.send('Crawler in action')
})

// Get the server information and print it
let server = app.listen(3000, () => {
  let host = server.address().address;
  let port = server.address().port;
  // %s is a format placeholder, an alternative to string concatenation
  console.log('Program is running http://%s:%s', host, port);
})
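The %s placeholders in console.log follow the same formatting rules as Node's util.format. The sketch below shows the substitution on its own. formatServerUrl is a hypothetical helper (not part of the original code), and the bracket handling assumes the server may report an IPv6 address such as :: when no host is passed to listen.

```javascript
const util = require('util')

// Hypothetical helper: build a printable URL from the values that
// server.address() reports. IPv6 addresses need brackets inside a URL.
function formatServerUrl(host, port) {
  const printableHost = host.includes(':') ? `[${host}]` : host
  return util.format('http://%s:%s', printableHost, port)
}

console.log(formatServerUrl('127.0.0.1', 3000)) // http://127.0.0.1:3000
console.log(formatServerUrl('::', 3000))        // http://[::]:3000
```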
Running server
nodemon index.js
Then open localhost:3000 in the browser
Analyze page content
Target page: Baidu News (news.baidu.com)
For example, open the Baidu News home page, open the browser console, and inspect the elements
With Cheerio, select along the path id > ul > li > a to reach the a tags and read their text
Fetch the page
Import the superagent module and call its get method, passing in the page address
const superagent = require('superagent')

superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log('Hot news fetching failed: ' + err);
    return // Stop here when the request failed
  }
  console.log(res);
})
After saving, nodemon restarts the server and the terminal prints the result. The output is too long for the terminal to hold, so the upper part scrolls out of view
All data returned for the page address is contained in res
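Because the full response object floods the terminal, it can help to print only a few fields. The sketch below uses a hypothetical summarizeResponse helper and a hand-made stand-in object. It relies only on the status and text properties that SuperAgent responses expose, and the preview length is an arbitrary choice.

```javascript
// Hypothetical helper: reduce a response to a short, readable summary
function summarizeResponse(res, previewLength = 80) {
  const body = res.text || ''
  return {
    status: res.status,
    length: body.length,
    preview: body.slice(0, previewLength)
  }
}

// Stand-in object with the same shape as the parts of res used above
const fakeRes = { status: 200, text: '<html><head><title>News</title></head></html>' }
console.log(summarizeResponse(fakeRes, 12))
```

Inside the request callback, console.log(summarizeResponse(res)) could take the place of console.log(res).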
Process the data
Now let's start processing the data
- Import the cheerio library
- Below the request, declare the function that processes res
- At the top, declare the variable that holds the result to be returned
- Inside the superagent.get() callback, call the processing function and assign its result to the pre-declared variable
Note that the route handler app.get('/', (req, res) => {}) is placed below the processing function
const express = require('express')
const superagent = require('superagent')
const cheerio = require('cheerio')
const app = express()

let hotNews = []

superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log('Hot news fetching failed: ' + err);
    return // Stop here when the request failed
  }
  // Call the function and assign the result directly to the external variable
  hotNews = getHotNews(res)
})

let getHotNews = res => {
  // Pass res.text (the full response body as a string) to cheerio.load to get $
  let $ = cheerio.load(res.text)
  // Pass a selector to $ to get the matching elements
  // $('#pane-news ul li a')
  console.log($('#pane-news ul li a'));
}

app.get('/', (req, res) => {
  res.send(hotNews)
})

// Get the server information and print it
let server = app.listen(3000, () => {
  let host = server.address().address;
  let port = server.address().port;
  // %s is a format placeholder, an alternative to string concatenation
  console.log('Program is running http://%s:%s', host, port);
})
$('#pane-news ul li a') returns an array-like Cheerio object containing all matching nodes
let getHotNews = res => {
  // Declare an empty array
  let hotNews = []
  // Pass res.text (the full response body as a string) to cheerio.load to get $
  let $ = cheerio.load(res.text)
  // Pass a selector to $ to get a collection of all matching elements
  // Iterate over the collection and store each element's text and href in a news object
  $('#pane-news ul li a').each((index, ele) => {
    let news = {
      title: $(ele).text(), // Get the headline
      href: $(ele).attr('href') // Get the news page link
    }
    hotNews.push(news) // Push the news object of each iteration into the array
  })
  // After the loop, return the result; the caller assigns it to the hotNews variable declared at the top
  return hotNews
}
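Links scraped this way are not guaranteed to be absolute. As an optional extra step (not part of the original code), a hypothetical normalizeNews helper could resolve each href against the page address with Node's built-in URL class and trim the title text:

```javascript
// Hypothetical post-processing step for the objects built by getHotNews
const BASE_URL = 'http://news.baidu.com/'

function normalizeNews(newsList, baseUrl = BASE_URL) {
  return newsList
    .filter(news => news.href) // skip anchors without an href attribute
    .map(news => ({
      title: news.title.trim(),
      // new URL(relative, base) resolves relative links; absolute ones pass through
      href: new URL(news.href, baseUrl).href
    }))
}

const sample = [
  { title: '  Headline A ', href: '/a?id=1' },
  { title: 'Headline B', href: 'http://example.com/b' },
  { title: 'No link', href: undefined }
]
console.log(normalizeNews(sample))
```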
Return the data
After the function call, send the result to the client
The value passed to res.send is now the array returned by getHotNews
app.get('/', (req, res) => {
  res.send(hotNews)
})
Of course, once the data is retrieved, it does not have to be sent straight to the client for display
Further processing can be done inside the superagent callback
superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log('Hot news fetching failed: ' + err);
    return // Stop here when the request failed
  }
  // Call the function and assign the result directly to the external variable
  hotNews = getHotNews(res)
  /* 1. Save the data to the database. 2. A route reads the data from the database so the page can display it in an ECharts chart */
})
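The comment above mentions saving the result to a database. No database is named in this article, so the sketch below uses a JSON file as a stand-in. saveNews, loadNews and the file location are all hypothetical, and a real project would swap in its own database client.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical storage location; a real database would replace this file
const STORE_PATH = path.join(os.tmpdir(), 'hot-news.json')

// Save the scraped list together with the time it was fetched
function saveNews(newsList, storePath = STORE_PATH) {
  const record = { fetchedAt: new Date().toISOString(), items: newsList }
  fs.writeFileSync(storePath, JSON.stringify(record, null, 2))
}

// Read the list back, e.g. inside a route that feeds a chart
function loadNews(storePath = STORE_PATH) {
  if (!fs.existsSync(storePath)) return []
  return JSON.parse(fs.readFileSync(storePath, 'utf8')).items
}

saveNews([{ title: 'Headline A', href: 'http://example.com/a' }])
console.log(loadNews())
```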
Fetching local news fails
I won't repeat the code here because it is the same as above. This part of the data cannot be fetched because it is loaded dynamically by the page running in the browser
Requesting news.baidu.com through superagent only retrieves the static content served at that address; it cannot trigger the script requests that load the dynamic content
The solution is to use a third-party library that simulates a browser visiting the Baidu News home page. Once the dynamic content has loaded in this simulated browser, the data is captured and returned to the front-end browser
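The difference between static and dynamic content can be illustrated without any network access. In the sketch below, staticHtml is a hypothetical stand-in for what a plain HTTP request receives, and innerHtmlOf is a deliberately naive helper written only for this illustration (it does not handle nested div elements).

```javascript
// Hypothetical stand-in for the HTML that a plain HTTP request receives
const staticHtml = `
<div id="pane-news"><ul><li><a href="/a">Static headline</a></li></ul></div>
<div id="local_news"></div>
<script>
  // In a real browser this script would run and fill #local_news later.
  // An HTTP client such as superagent never executes it.
</script>`

// Naive helper: return the markup between <div id="..."> and the next </div>
function innerHtmlOf(html, id) {
  const match = html.match(new RegExp(`<div id="${id}">([\\s\\S]*?)</div>`))
  return match ? match[1] : null
}

console.log(innerHtmlOf(staticHtml, 'pane-news'))  // markup is already present
console.log(innerHtmlOf(staticHtml, 'local_news')) // empty: filled only by the script
```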
Nightmare implements dynamic data fetching
segmentio/nightmare: A high-level browser automation library. (github.com)
Use the Nightmare automated testing tool
Electron lets you create desktop applications in pure JavaScript by calling Chrome's rich native interfaces. Think of it as a variant of Node.js focused on desktop applications rather than web servers; its browser-based application model makes all kinds of responsive interaction extremely convenient
Nightmare is an Electron-based framework for automated web testing and crawling. Like PhantomJS, it offers automated testing capabilities: it can simulate user behavior on a page and trigger asynchronous data loading. Like the Request library, it can also visit URLs directly to fetch data, and it can set a wait time for the page, so triggering scripts either manually or through simulated behavior is easy
Install dependencies
npm i nightmare -S
Usage
Import modules, get instances, and call methods to get data dynamically
const express = require('express')
const app = express()
const Nightmare = require('nightmare')
// Setting show: true displays the built-in automated browser window
const nightmare = Nightmare({ show: true })
const cheerio = require('cheerio')

let localNews = []

//---------------------------------------------------------------------------------
nightmare
  .goto('http://news.baidu.com') // The page to visit
  .wait('div#local_news') // Wait until this node has loaded
  .evaluate(() => document.querySelector('div#local_news').innerHTML) // Read the node's content inside the page
  .then(htmlStr => { // Receive the HTML string
    localNews = getLocalNews(htmlStr) // Call the processing function
  })
  .catch(err => {
    console.error(err)
  })
//----------------------------------------------------------------------------------

let getLocalNews = htmlStr => {
  let localNews = []
  let $ = cheerio.load(htmlStr) // No .text needed here because we already have a string
  $('ul#localnews-focus li a').each((index, ele) => {
    let news = {
      title: $(ele).text(),
      href: $(ele).attr('href')
    }
    localNews.push(news)
  })
  return localNews
}

app.get('/', (req, res) => {
  res.send(localNews)
})

// Get the server information and print it
let server = app.listen(3000, () => {
  let host = server.address().address;
  let port = server.address().port;
  // %s is a format placeholder, an alternative to string concatenation
  console.log('Program is running http://%s:%s', host, port);
})
Now open the link and you can see the dynamically loaded content