In the previous post, I used screenshots to briefly walk through Puppeteer's features and basic usage. This time, let's talk about using Puppeteer to crawl page data and write it to Excel.

Puppeteer: crawling official account articles from Sogou's WeChat search and writing them to Excel

Background: Sogou search has a feature for querying WeChat official account articles by keyword, and our job was to crawl down the articles for a given set of keywords. Since the requirement came from the business side, and to make things easy for the operations colleagues to read, we also needed to write the data into Excel.

My feeling after using it: as long as you understand the API, it is not complicated. For me, the complex part lies in having to consider all kinds of edge cases. First of all, most websites have some anti-crawl protection, and we need a reasonable way to work around it. For example, my original plan was to read the href attribute of the a tag and open that link directly in the browser. After trying it, I found this doesn't work: Sogou evidently does some anti-crawl checking, and navigating this way reports an "IP anomaly". The problem was solved with Puppeteer's page.click method, which simulates a real user click inside the browser; in this case, that was enough to avoid triggering the check. The other thing is that since we are crawling WeChat articles, we also have to consider the states an article itself can be in: the article was deleted or migrated by the publisher, the article was reported, the article is a share of another article, and so on…
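To make the difference concrete, here is a minimal sketch of the two approaches (the selector matches the search-result markup used in the full script below, the query keyword is a made-up placeholder, and this is an illustration rather than a guaranteed-stable crawler, since Sogou's checks can change):

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://weixin.sogou.com/weixin?type=2&query=test", {
    waitUntil: "networkidle2"
  });

  // Approach 1 (blocked): read the href and navigate to it directly.
  // Sogou detects this kind of navigation and answers with an "IP anomaly" page.
  // const href = await page.$eval(".news-list h3 a", el => el.href);
  // await page.goto(href);

  // Approach 2 (worked for me): simulate a real click instead; the article
  // opens in a new tab and this particular check is not triggered.
  await page.click(".news-list h3 a");
  const target = await browser.waitForTarget(t =>
    t.url().includes("mp.weixin.qq.com")
  );
  const articlePage = await target.page();
  console.log(await articlePage.title());

  await browser.close();
})();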

In short, I feel that crawler scripts written with Puppeteer are strongly coupled to the business, so the code itself has little reference value. Still, the various odd problems I ran into while writing it may offer some ideas for solving the ones you encounter.

The complete code is as follows:

const cheerio = require("cheerio");
const puppeteer = require("puppeteer");
const company = require("./company.json");
const keywords = require("./keywords.json");
const { exportExcel } = require("./export");

// Obtain the url of the search result page
function getUrl(queryWords, currPage = 1) {
  return `https://weixin.sogou.com/weixin?type=2&s_from=input&query=${queryWords}&page=${currPage}`;
}
(async function() {
  const width = 1024;
  const height = 1600;
  // Create a browser instance
  const browser = await puppeteer.launch({
    // Turn off Chrome headless (no interface) mode
    headless: false,
    // As the name implies, set the default window size
    defaultViewport: { width: width, height: height }
  });
  async function sougouSpiders(keywords) {
    // Create a new page (browser Tab)
    const page = await browser.newPage();
    // Set the page waiting timeout to be longer to avoid errors caused by network reasons
    page.setDefaultNavigationTimeout(90000);
    // Set the width and height of the new Tab
    await page.setViewport({ width: width, height: height });
    let currSpiderPage = 1;
    // Open the url of a keyword in a new Tab page
    await page.goto(getUrl(keywords, currSpiderPage), {
      waitUntil: "networkidle2"
    });
    // Get the page content
    let firstPageHtml = await page.content();
    // Get the cheerio instance
    let $ = cheerio.load(firstPageHtml);
    const is404 = $(".main-left")
      .eq(0)
      .text()
      .includes("No relevant wechat official account article was found.");
    // Consider the possibility that a keyword query fails
    if (is404) {
      console.log("No search results for current keyword!");
      // If the keyword query fails, close the page
      await page.close();
    } else {
      // How many pages are retrieved for the current keyword based on the value of the paging component
      const totalPage =
        Array.from($(".p-fy a")).length === 0
          ? 1
          : Array.from($(".p-fy a")).length;
      // If the search result exceeds 10 pages, only 10 pages will be retrieved
      const reallyTotalPage = totalPage >= 10 ? 10 : totalPage;
      // Print the information on the console for troubleshooting
      console.log(
        `Current keyword: ${keywords}, total pages retrieved for this keyword: ${reallyTotalPage}`
      );

      const result = [];
      // The work above just gathers info about the search results; the actual crawling happens in the function below
      async function getData(currSpiderPage) {
        console.log("Climbing page:", currSpiderPage);
        await page.goto(getUrl(keywords, currSpiderPage), {
          waitUntil: "networkidle2"
        });

        let firstPageHtml = await page.content();
        let $ = cheerio.load(firstPageHtml);

        const itemLen = $(".news-box .news-list").find("li").length;
        for (let j = 0; j < itemLen; j++) {
          const currText = $(".news-box .news-list")
            .find("li")
            .eq(j)
            .find(".txt-box")
            .eq(0);
          const currLinkBtnId = currText
            .find("h3 a")
            .eq(0)
            .attr("id");
          console.log(
            `Crawling keyword [${keywords}]: page ${currSpiderPage}, item ${j +
              1} of ${itemLen} on this page, ${reallyTotalPage} pages in total, time: ${new Date()}`
          );
          // Some items have no image, so the img-box is hidden
          await page.addStyleTag({
            content:
              ".img-box{ border:10px solid red; display:block! important }"
          });
          // Get the link node
          const linkBtn = await page.$(`#sogou_vr_11002601_img_${j}`);
          // Some items cannot be clicked during the actual crawl
          if (linkBtn) {
            await page.click(`#sogou_vr_11002601_img_${j}`);
          } else {
            await page.click(`#sogou_vr_11002601_title_${j}`);
          }

          try {
            let currUrl = null;
            // listen for changes in the current page URL. If mp.weixin.qq.com appears, the article is opened normally
            const newTarget = await browser.waitForTarget(t => {
              currUrl = t.url();
              return t.url().includes("mp.weixin.qq.com");
            });
            // Get an instance of the newly opened article details page
            const newPage = await newTarget.page();
            try {
              const title = await newPage.waitForSelector("#activity-name");
              // If you can read the title of the article
              if (title) {
                // Read the page content parent container node
                await newPage.waitForSelector("#js_content");
                const newPageContent = await newPage.content();
                let jq = cheerio.load(newPageContent);
                // Walk through all the images first and append each image's URL
                // after it, so the image position survives in the extracted text
                jq("#js_content img").each((idx, curr) => {
                  jq(curr).after(
                    `<div>【Image: ${jq(curr).attr("src")}】</div>`
                  );
                });
                // After completing the above operations, read the text of the current article
                const text = jq("#js_content").text();
                // All the information about the current article is written into the array
                result.push({
                  title: $(".news-box .news-list li .txt-box h3 a")
                    .eq(j)
                    .text(),
                  url: currUrl,
                  account: currText
                    .find(".account")
                    .eq(0)
                    .text(),
                  publishTime: currText
                    .find(".s2")
                    .eq(0)
                    .text(),
                  content: text
                });
                // Close the page after the current article has been crawled
                await newPage.close();
              }
            } catch (e) {
              // If the above operation is not correct, it indicates that there is a problem with the current article.
              console.log("This article is dead.");
              // Even if this article is broken, it still gets an entry in result
              result.push({
                title: $("#" + currLinkBtnId).text(),
                url: `Keyword [${keywords}]: page ${currSpiderPage}, item ${j +
                  1}`,
                account: currText
                  .find(".account")
                  .eq(0)
                  .text(),
                publishTime: currText
                  .find(".s2")
                  .eq(0)
                  .text(),
                content: "This post has been deleted or migrated by the publisher, or has been reported."
              });
              // Complete the above operations and close the page
              await newPage.close();
            }
          } catch (err) {
            // If a timeout error occurs, close the current page and proceed to the next one
            if (err.toString && err.toString().includes("TimeoutError")) {
              const pagesLen = (await browser.pages()).length;
              (await browser.pages())[pagesLen - 1].close();
            }
            console.log("=====!!!!!! Out!!!! Wrong!!!!! !!!! = = = = = = = =", err);
            // If the link to this article is not opened, put it in result
            result.push({
              title: $("#" + currLinkBtnId).text(),
              url: `Keyword [${keywords}]: page ${currSpiderPage}, item ${j +
                1}`,
              account: currText
                .find(".account")
                .eq(0)
                .text(),
              publishTime: currText
                .find(".s2")
                .eq(0)
                .text(),
              content: "There is something wrong with the pa of this article. Search it manually."}); }}}for (let i = 0; i <= totalPage; i++) {
        if (currSpiderPage <= totalPage) {
          awaitgetData(currSpiderPage); currSpiderPage++; }}console.log('Key words: 【${keywords}], the crawl is complete and writing to Excel! `);
      / / write to Excel
      await exportExcel(keywords, result);
      // Close the page
      awaitpage.close(); }}const companyList = company.data;
  const keywordList = keywords.data;
  // loop
  for (let i = 0; i < companyList.length; i++) {
    for (let j = 0; j < keywordList.length; j++) {
      await sougouSpiders(
        `${companyList[i].companyName} ${keywordList[j].join("")}`); }}console.log("It's over."); }) ();Copy the code
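The two JSON inputs are not shown here. Judging from how the loop reads them (company.data[i].companyName and keywords.data[j].join("")), they presumably look something like this (hypothetical sample data):

// company.json
{
  "data": [
    { "companyName": "CompanyA" },
    { "companyName": "CompanyB" }
  ]
}

// keywords.json - each entry is an array of fragments joined into one query keyword
{
  "data": [
    ["keyword", "one"],
    ["keyword", "two"]
  ]
}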

The exportExcel method:

const fs = require("fs");
const xlsx = require("node-xlsx");

module.exports = {
  async exportExcel(fileName, data) {
    let dataArr = [];
    let title = ["Article title", "Article URL", "Author (official account)", "Publish date", "Content"];
    dataArr.push(title);
    data.forEach(curr => {
      dataArr.push([
        curr.title,
        curr.url,
        curr.account,
        curr.publishTime,
        curr.content
      ]);
    });
    const options = {
      ! "" cols": [{wch: 70 },
        { wch: 100 },
        { wch: 30 },
        { wch: 30 },
        { wch: 200}};/ / write XLSX
    var buffer = await xlsx.build(
      [
        {
          name: "sheet1".data: dataArr
        }
      ],
      options
    );
    fs.writeFile(`./dist/data1【${fileName}】.xlsx`, buffer, function(err) {
      if (err) throw err;
      console.log("Write successful!");
    });
  }
};