Preface: Recently I wanted to crawl some data with Python, and then it occurred to me: why not use Node.js instead? After all, I have not used Python for several years. That is how this article came about (it is purely a record of my own development experience). I have not seen much crawler development done with Node, so this article may be a little rough. I hope readers will forgive that, and feel free to point out any shortcomings or mistakes.
This project is developed in _VSCode_.
1. Create a project and run npm init to initialize it.
2. Install Puppeteer: npm i puppeteer
3. Create the file src/index.js in the project directory.
4. Require puppeteer at the top of index.js.
5. For practice I crawl the Yicai website (https://www.yicai.com) here. You can substitute whatever site you need to crawl.
Now for the main part
The project runs inside a self-executing (immediately invoked) async function:
    ;(async () => {})();
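If the async body throws, the bare self-executing function above only produces an unhandled rejection. A minimal sketch of one way to report failures, assuming a hypothetical helper named runMain (it is not part of Puppeteer or of the original code):

```javascript
// Hypothetical helper (not from the original article): runs an async entry
// point and reports a failure instead of leaving an unhandled rejection.
async function runMain(main) {
  try {
    await main();
    return 0; // conventional "success" exit code
  } catch (err) {
    console.error('crawler failed:', err.message);
    return 1; // conventional "failure" exit code
  }
}

// Usage: the body of the self-executing function goes inside the callback.
;(async () => {
  process.exitCode = await runMain(async () => {
    // ... launch the browser, open the page, extract the data ...
  });
})();
```

Setting process.exitCode rather than calling process.exit() lets pending output flush before Node exits.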
Basic Puppeteer usage is covered in many places online, so I won't go into the details here. I will post the code directly.
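    const puppeteer = require('puppeteer');

    const baseUrl = 'https://www.yicai.com';

    ;(async () => {
      const browser = await puppeteer.launch({
        headless: false, // launch with a visible browser window
        // slowMo: per-operation delay in ms (the value is missing in the source text)
        args: ['--no-sandbox'],
        dumpio: false,
        devtools: true, // open DevTools automatically (development mode)
      });
      const page = await browser.newPage();
      await page.goto(baseUrl, { waitUntil: 'networkidle2' });
      // pause for 2 s (page.waitFor(2000) in older Puppeteer versions; a plain timer works in any version)
      await new Promise((resolve) => setTimeout(resolve, 2000));
      // wait for the "load more" button node to load
      await page.waitForSelector('.u-btn');
    })();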
The site has no conventional pagination. Instead, a "load more" button loads another batch of data on each click.
So, once the "load more" button has appeared, I click it in a loop.
    for (let index = 0; index < 1; index++) {
      // index < 1 means only one extra page is loaded for now
      // pause for 2 s (page.waitFor(2000) in older Puppeteer versions)
      await new Promise((resolve) => setTimeout(resolve, 2000));
      await page.click('.u-btn');
    }
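The loop can be generalized to any number of clicks. The following is only a sketch: clickLoadMore is a hypothetical helper name, and treating a waitForSelector timeout as "nothing more to load" is an assumption about how the page behaves, not something the site guarantees.

```javascript
// Hypothetical helper (not from the original article). `page` is expected to
// be a Puppeteer Page; only page.waitForSelector and page.click are used.
async function clickLoadMore(page, selector, times, delayMs = 2000) {
  let clicks = 0;
  for (let index = 0; index < times; index++) {
    try {
      // Wait until the button is present before each click.
      await page.waitForSelector(selector, { timeout: delayMs * 5 });
    } catch (err) {
      // Assumption: if the button never shows up, there is nothing more to load.
      break;
    }
    await page.click(selector);
    // Give the newly requested items some time to render.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    clicks++;
  }
  return clicks; // number of clicks actually performed
}
```

Inside the self-executing function, a call such as await clickLoadMore(page, '.u-btn', 3) would then try to load up to three extra batches.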
The .u-btn selector used for the click is the class of the "load more" button, which can be read from the page's markup by inspecting the element.
Next, we can start extracting the data.
    const result = await page.evaluate(() => {
      let $ = window.$; // assumes the page exposes jQuery as window.$
      // var items = $('.m-con a')
      var items = $('#headList').children('a');
      var links = [];
      if (items.length >= 1) {
        items.each((index, item) => {
          let it = $(item);
          let articleTitle = it.find('h2').text();
          let articleIntroduction = it.find('p').text();
          let imageAddress = it.find('img').attr('src');
          let createdTime = it.find('span').text();
          let detailPage = it.attr('href');
          links.push({
            articleTitle,
            articleIntroduction,
            imageAddress,
            createdTime,
            detailPage,
          });
        });
      }
      return links;
    });
_var items = $('.m-con a')_ Choosing this selector requires analyzing the element structure of the page being crawled. See the analysis below for details.
My requirement was to grab the list data under the headlines tab. But when I logged the result there were 75 entries, while repeated checks of the page showed only 25, so I opened DevTools (F12) again. In the end I found that the tabs are toggled with display: none and display: block. On first load the page renders the content of three tabs at once, and each tab has its own ID.
So instead of selecting by class with $('.m-con a'), we use $('#headList').children('a') to get the a tags under headList and process that data.
Of course, this has to be decided case by case, depending on the element structure of each page.
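A mismatch like 75 versus 25 can be tracked down by grouping the matched links by their parent element. Here is a rough sketch; groupAnchorsByParent is a hypothetical helper, it uses only standard DOM APIs, and it assumes the links are direct children of the per-tab containers, as the $('#headList').children('a') selector suggests.

```javascript
// Hypothetical diagnostic (not from the original article). It is meant to run
// in the browser, e.g. page.evaluate(groupAnchorsByParent, '.m-con a').
// The `doc` parameter exists only so the function can be tried outside a browser.
function groupAnchorsByParent(anchorSelector, doc = document) {
  const groups = {};
  for (const anchor of doc.querySelectorAll(anchorSelector)) {
    const parent = anchor.parentElement;
    const key = parent.id || '(no id)';
    if (!groups[key]) {
      groups[key] = {
        count: 0,
        // 'none' here means the whole tab is hidden on the page.
        display: doc.defaultView.getComputedStyle(parent).display,
      };
    }
    groups[key].count++;
  }
  return groups; // plain object, so page.evaluate can return it to Node
}
```

Logging the result of await page.evaluate(groupAnchorsByParent, '.m-con a') shows how many links each container holds and which containers are hidden, which makes it easier to pick the right ID.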
OK, now that the script is written, we can execute it.
Run the command: node src/index.js
Okay, so that’s what we printed out.
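The detailPage and imageAddress values are raw attribute values, so they may be relative or protocol-relative URLs. As an optional post-processing step, they can be resolved against the site address. This is only a sketch: toAbsolute and normalizeLinks are hypothetical helper names, and whether the site actually uses relative links is an assumption.

```javascript
// Hypothetical post-processing helpers (not from the original article).
const baseUrl = 'https://www.yicai.com';

// Resolve a possibly relative URL against `base`; return null if it is
// missing or cannot be parsed.
function toAbsolute(value, base = baseUrl) {
  if (!value) return null; // the attribute was absent on the element
  try {
    return new URL(value, base).href;
  } catch (err) {
    return null;
  }
}

// Return a copy of the crawled items with absolute image and detail URLs.
function normalizeLinks(links, base = baseUrl) {
  return links.map((link) => ({
    ...link,
    imageAddress: toAbsolute(link.imageAddress, base),
    detailPage: toAbsolute(link.detailPage, base),
  }));
}
```

With these helpers, console.log(normalizeLinks(result)) prints the same records with directly usable links.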
Finally, I’ll post the final code for you
    const puppeteer = require('puppeteer');

    const baseUrl = 'https://www.yicai.com';

    ;(async () => {
      const browser = await puppeteer.launch({
        headless: false, // launch with a visible browser window
        // slowMo: per-operation delay in ms (the value is missing in the source text)
        args: ['--no-sandbox'],
        dumpio: false,
        devtools: true, // open DevTools automatically (development mode)
      });
      const page = await browser.newPage();
      await page.goto(baseUrl, { waitUntil: 'networkidle2' });
      // pause for 2 s (page.waitFor(2000) in older Puppeteer versions; a plain timer works in any version)
      await new Promise((resolve) => setTimeout(resolve, 2000));
      // wait for the "load more" button node to load
      await page.waitForSelector('.u-btn');

      // index < 1 means only one extra page is loaded for now
      for (let index = 0; index < 1; index++) {
        await new Promise((resolve) => setTimeout(resolve, 2000));
        await page.click('.u-btn');
      }

      const result = await page.evaluate(() => {
        let $ = window.$; // assumes the page exposes jQuery as window.$
        // var items = $('.m-con a')
        var items = $('#headList').children('a');
        var links = [];
        if (items.length >= 1) {
          items.each((index, item) => {
            let it = $(item);
            let articleTitle = it.find('h2').text();
            let articleIntroduction = it.find('p').text();
            let imageAddress = it.find('img').attr('src');
            let createdTime = it.find('span').text();
            let detailPage = it.attr('href');
            links.push({
              articleTitle,
              articleIntroduction,
              imageAddress,
              createdTime,
              detailPage,
            });
          });
        }
        return links;
      });

      console.log(result);
      await page.close();
      await browser.close(); // also close the browser so the Node process can exit
    })();