This project is for reference only and is meant as a beginner's example. It uses Puppeteer to crawl the data of one of my blog systems.

  • Look at the demo

Because headless mode is disabled, the browser window pops up, which makes debugging easier.

Note: since this is my personal blog and the server is hosted overseas, access from China can be slow. If running this demo times out, that is normal; in that case it is recommended to use a proxy (I won't go into the details of that here) and switch it to global mode before trying again.
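
If you would rather not turn on a global proxy, Puppeteer can route just this crawler's Chromium instance through a proxy by passing Chromium's standard --proxy-server switch at launch. A minimal sketch; the proxy address 127.0.0.1:7890 is a placeholder, not something this project ships with:

import puppeteer from "puppeteer";

(async () => {
    // Route only this Chromium instance through a proxy.
    // Replace the placeholder address below with your own proxy host and port.
    const browser = await puppeteer.launch({
        headless: false,
        args: ["--proxy-server=http://127.0.0.1:7890"]
    });
    const page = await browser.newPage();
    await page.goto("http://blog.fe-spark.cn/", { waitUntil: "networkidle0" });
    await browser.close();
})();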

The directory structure

  • /data: where the crawled data is stored (if you have a database available, you can write it there directly instead)
  • /utils: utility classes
  • index.js: entry file (core code)
  • config.js: configuration file (the crawl target address and the number of pages to crawl; see the sketch below)
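
Based on how index.js reads it (config.allPages and config.origin), config.js is most likely a small object export. A minimal sketch, with the concrete values taken from the analysis later in this article rather than from the actual file:

// config.js (sketch): the field names match how index.js consumes them
export default {
    origin: "http://blog.fe-spark.cn", // crawl target address
    allPages: 5 // number of list pages to crawl
};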

Project start

  • Install dependencies (go to the project directory)
    npm install
  • Normal start
    npm start
  • Debug launch
    npm run dev

The core code

  • First, import puppeteer and the config.js configuration file, and launch with headless disabled so that the process can be observed

Puppeteer is built on Chromium, and each page runs in its own sandboxed process. Even if a page is attacked by malicious code, closing it has no impact on the operating system or on other pages.

import puppeteer from "puppeteer";
import config from "./config";

var ALL_PAGES = config.allPages; // Number of pages to crawl
var ORIGIN = config.origin; // Crawl target website [String]

(async () => {
    // Start puppeteer
    var browser = await puppeteer.launch({
        headless: false,
        devtools: false,
        defaultViewport: {
            width: 1200,
            height: 1000
        }
    });

    // Initialize a page
    var page = await browser.newPage();

    // Write the logic code to crawl the data here

    await browser.close();
})();
  • Before writing the logic code, let's take a look at the target website we are going to crawl and analyze it

    Open the home page and you'll see an area listing all the blog posts

    Analysis:

    1. There are six articles on each page
    2. Each page has pagination at the bottom; when you click to the next page, the address becomes http://blog.fe-spark.cn/page/{page number}/ and the page still shows the same kind of post list
    3. To get the full content of a post, you have to click into that post's own page
  • With these analysis results in hand, let's start putting them into practice

    1. The first thing we need is a loop. But loop over what? From the target site we can see there are five pages, so we loop over those five pages
    2. The inner loop, naturally, is over every post on each page, because only by visiting each post can we get its content
// writeFileSync and resolve come from Node built-ins; add these imports at the top of index.js:
// import { writeFileSync } from "fs";
// import { resolve } from "path";

for (var i = 0; i <= ALL_PAGES - 1; i++) {
    // Suppose there is a loadPage method that fetches all blog data for one page
    var a = await loadPage(i);

    // Store each page's data as a JSON file
    writeFileSync(
        resolve(__dirname, "./data/page_" + (i + 1) + ".json"),
        JSON.stringify(a),
        "utf-8"
    );
}
  • Next, continue by writing the loadPage method
async function loadPage(i) {
    // The goto method navigates to the specified page
    await page.goto(
        // The first page has no 'page/{page number}/' suffix; later pages do
        `${ORIGIN}/${i == 0 ? "" : "page/" + parseInt(i + 1) + "/"}`,
        {
            waitUntil: "networkidle0",
            timeout: 60000
        }
    );

    // Get the jump link of each blog post and store them in an array
    var href = await page.$eval(".spark-posts", (dom) => {
        var _dom = Array.from(dom.querySelectorAll(".post-card>a"));
        return _dom.map((item) => {
            return item.getAttribute("href");
        });
    });

    // After the page has loaded, get the title, excerpt and other required data of every post on this page
    var result = await page.evaluate(() => {
        var links = [];
        var parent = document.querySelector(".spark-posts");
        var list = parent.querySelectorAll(".post-card");
        if (list.length > 0) {
            Array.prototype.forEach.call(list, (item) => {
                var article_title = item.querySelector("a .title").innerText;
                var description = item.querySelector("a .excerpt").innerText;
                var mate = item.querySelector(".metadata .date").innerText;
                var tags = Array.prototype.map.call(
                    item.querySelectorAll(".metadata>div a"),
                    (item) => {
                        return item.innerText;
                    }
                );
                links.push({ article_title, description, mate, tags });
            });
        }
        return links;
    });

    // The steps above do not fetch the body of each post; next, visit every post and extract its content
    for (var j = 0; j < href.length; j++) {
        await page.goto(`${ORIGIN}${href[j]}`, {
            waitUntil: "networkidle0",
            timeout: 60000
        });

        // Note: this assumes the target page itself loads jQuery, so $ exists in the page context
        var content = await page.evaluate(() => {
            return $(".post-wrapper").text();
        });
        result[j].content = content;
    }

    // Finally, return the assembled result
    return result;
}
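
The content extraction above relies on the target page exposing jQuery's $ in the page context. If the page you crawl does not ship jQuery, the same step can be done with plain DOM APIs; a minimal sketch, keeping the original .post-wrapper selector:

// Same step without relying on jQuery being present on the target page (sketch)
var content = await page.evaluate(() => {
    var wrapper = document.querySelector(".post-wrapper");
    // innerText returns the visible text; fall back to an empty string if the node is missing
    return wrapper ? wrapper.innerText : "";
});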

The final result

Whitespace has not been stripped from the crawled data; if you are interested, you can clean it up yourself (a small sketch follows)
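
A minimal sketch of one way to do that cleanup, applied to the post content before it is written to disk (the normalization rule here is my own assumption; adjust it to taste):

// Collapse runs of whitespace and newlines into single spaces, then trim the ends (sketch)
function cleanText(text) {
    return text.replace(/\s+/g, " ").trim();
}

// For example, where the post body is collected in loadPage:
// result[j].content = cleanText(content);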

The code above uses modern ES syntax (ES modules and async/await). The project is already configured with Babel, so it can be run directly

Download and run to see the effect

Click here to see the project

References

Puppeteer: github.com/puppeteer/p…

The crawler tool Puppeteer in practice: www.jianshu.com/p/a9a55c03f…

www.ecocn.org