This project is for reference only, as a beginner-level example. It uses Puppeteer to scrape data from one of my blog systems.
Take a look at the demo
Because headless is disabled, a browser window pops up, which makes debugging easy.
Note: since this is my own personal blog and the server is hosted abroad, access from China is slow. If the demo times out, that is normal; in that case, try routing through a proxy (ideally in global mode).
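If you go the proxy route, Chromium also accepts one directly via a launch argument. A minimal sketch (the proxy address below is a placeholder, not part of this project):

```js
import puppeteer from "puppeteer";

(async () => {
  // --proxy-server is a standard Chromium flag;
  // the address is a placeholder for your own proxy
  var browser = await puppeteer.launch({
    headless: false,
    args: ["--proxy-server=http://127.0.0.1:7890"]
  });
  await browser.close();
})();
```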
The directory structure

- `/data`: where the captured data is stored (if you have the means, you can store it directly in a database instead)
- `/utils`: utility functions
- `index.js`: entry file (core code)
- `config.js`: configuration file (crawl target URL and number of pages to fetch)
Project start

- Install dependencies (inside the project directory)

```
npm install
```

- Normal start

```
npm start
```

- Debug start

```
npm run dev
```
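The post doesn't show the npm scripts behind `npm start` and `npm run dev`. Since the project is Babel-based (see the end of the post), a plausible package.json setup might look like this; the exact commands are an assumption:

```json
{
  "scripts": {
    "start": "babel-node index.js",
    "dev": "nodemon --exec babel-node index.js"
  }
}
```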
The core code
- First, import `puppeteer` and the `config.js` configuration file, and launch with `headless` disabled so we can watch the browser work.

Puppeteer ships with Chromium, and each page runs in its own sandboxed process. Even if a page is attacked by malware, closing it has no effect on the operating system or other pages.
```js
import puppeteer from "puppeteer";
import config from "./config";

var ALL_PAGES = config.allPages; // number of pages to fetch
var ORIGIN = config.origin; // crawl target website [String]

(async () => {
  // Start puppeteer
  var browser = await puppeteer.launch({
    headless: false,
    devtools: false,
    defaultViewport: {
      width: 1200,
      height: 1000
    }
  });
  // Initialize a page
  var page = await browser.newPage();
  // ...the data-crawling logic goes here
  await browser.close();
})();
```
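`config.js` itself isn't shown in the post; judging from the two fields read above, it presumably looks something like this (the values are inferred from the site analysis below, so treat this as a sketch):

```js
// config.js -- a sketch inferred from the fields used in index.js
export default {
  origin: "http://blog.fe-spark.cn", // crawl target (from the URL analysis below)
  allPages: 5 // the blog has five pages of posts
};
```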
Before writing the logic code, let's take a look at the target website and analyze it.
Open the home page and you'll see an area listing all blog posts.
Analysis:

- There are six articles on each page
- Each page has pagination at the bottom; clicking "next page" changes the address to `http://blog.fe-spark.cn/page/<page number>/`, and the same "all posts" area is still there (the sketch below shows this rule in code)
- To get the full text of a post, you have to click into it
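The pagination rule translates directly into code. A small sketch of the URL construction (`pageUrl` is a hypothetical helper; the same logic reappears inside `loadPage` below):

```js
// Page index 0 maps to the home page; every later page gets a /page/<n>/ suffix
function pageUrl(origin, i) {
  return `${origin}/${i === 0 ? "" : "page/" + (i + 1) + "/"}`;
}

console.log(pageUrl("http://blog.fe-spark.cn", 0)); // http://blog.fe-spark.cn/
console.log(pageUrl("http://blog.fe-spark.cn", 1)); // http://blog.fe-spark.cn/page/2/
```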
With these observations in hand, we can start coding:
- The first thing we need is a loop. Loop over what? The target site has five pages, so we loop over those five pages.
- The inner loop then walks through every blog post on each page, because only by visiting each post can we get its content.
```js
// Requires Node's fs and path modules at the top of index.js:
// import { writeFileSync } from "fs";
// import { resolve } from "path";

for (var i = 0; i <= ALL_PAGES - 1; i++) {
  // Suppose a loadPage method fetches all blog data for one page
  var a = await loadPage(i);
  // Store each page of data as a JSON file
  writeFileSync(
    resolve(__dirname, "./data/page_" + (i + 1) + ".json"),
    JSON.stringify(a),
    "utf-8"
  );
}
```
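To sanity-check the output, you can read one of the stored files back after a crawl; a quick snippet (assuming the loop above has already written `data/page_1.json`):

```js
import { readFileSync } from "fs";
import { resolve } from "path";

// Load the first stored page and print the title of each post in it
var saved = JSON.parse(
  readFileSync(resolve(__dirname, "./data/page_1.json"), "utf-8")
);
saved.forEach((post) => console.log(post.article_title));
```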
- Next, write the `loadPage` method
```js
async function loadPage(i) {
  // The goto method jumps to the specified page;
  // the first page has no "page/<n>/" suffix
  await page.goto(
    `${ORIGIN}/${i == 0 ? "" : "page/" + parseInt(i + 1) + "/"}`,
    {
      waitUntil: "networkidle0",
      timeout: 60000
    }
  );
  // Collect the link of every blog post on this page into an array
  var href = await page.$eval(".spark-posts", (dom) => {
    var _dom = Array.from(dom.querySelectorAll(".post-card>a"));
    return _dom.map((item) => {
      return item.getAttribute("href");
    });
  });
  // Once the page has loaded, grab the title, excerpt and other
  // required data of every blog post on this page
  var result = await page.evaluate(() => {
    var links = [];
    var parent = document.querySelector(".spark-posts");
    var list = parent.querySelectorAll(".post-card");
    if (list.length > 0) {
      Array.prototype.forEach.call(list, (item) => {
        var article_title = item.querySelector("a .title").innerText;
        var description = item.querySelector("a .excerpt").innerText;
        var mate = item.querySelector(".metadata .date").innerText;
        var tags = Array.prototype.map.call(
          item.querySelectorAll(".metadata>div a"),
          (item) => {
            return item.innerText;
          }
        );
        links.push({ article_title, description, mate, tags });
      });
    }
    return links;
  });
  // The code above does not fetch the post bodies, so next we visit
  // each post and extract its content
  for (var j = 0; j < href.length; j++) {
    await page.goto(`${ORIGIN}${href[j]}`, {
      waitUntil: "networkidle0",
      timeout: 60000
    });
    var content = await page.evaluate(() => {
      // Relies on the target page exposing jQuery's $;
      // otherwise use document.querySelector(".post-wrapper").innerText
      return $(".post-wrapper").text();
    });
    result[j].content = content;
  }
  // Finally, return the assembled result to the caller
  return result;
}
```
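Since the server responds slowly from some regions (see the note at the top), `page.goto` can easily hit its 60-second timeout. One way to make the crawl more robust is a small retry wrapper; `gotoWithRetry` below is a hypothetical helper, not part of the project:

```js
// Retry a navigation a few times before giving up
async function gotoWithRetry(page, url, options, retries = 3) {
  for (var attempt = 1; attempt <= retries; attempt++) {
    try {
      return await page.goto(url, options);
    } catch (err) {
      console.log(`goto failed (attempt ${attempt}/${retries}): ${err.message}`);
      if (attempt === retries) throw err;
    }
  }
}
```

Usage is a drop-in replacement for the direct `page.goto` calls in `loadPage`.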
The final result
Whitespace is not stripped from the captured data; if you're interested, you can clean it up yourself, as shown below.
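For example, a quick way to collapse the extra whitespace in each post body (a sketch that could run inside `loadPage`, just before `return result;`):

```js
// Collapse runs of whitespace in each post's content into single spaces
result.forEach((post) => {
  post.content = post.content.replace(/\s+/g, " ").trim();
});
```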
The code above uses modern JavaScript syntax (ES modules and async/await). The project is already configured with Babel, so it can be run directly.
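The Babel configuration itself isn't included in the post. A minimal `.babelrc` that handles this syntax would look something like the following (an assumption; the project's actual file may differ):

```json
{
  "presets": ["@babel/preset-env"]
}
```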
Download and run it to see the effect.
Click here to see the project
References

- Puppeteer: github.com/puppeteer/p…
- The crawler tool Puppeteer in practice: www.jianshu.com/p/a9a55c03f…