A friend in a group chat asked whether there was a way to save the articles behind a link once and for all. Clicking through shows that the link is actually a collection of articles, so the requirement is to save every article referenced in that document. The saved form could take many shapes: an image, a copy of the web page, and so on. Since the Puppeteer library is used here, the chosen format is PDF.
Breaking down the requirement
The whole task breaks down into two parts: get the links to all the articles in the document, and save the content behind each link as a PDF file.
For obtaining the links, there are two options: one is to use the Request module to request the URL and get the document; the other is to save the web page locally and read its content with the fs module. Once I have the document, which is the entire HTML source, I couldn't think of a good way to pick out all the article links. In a browser this would be easy: the DOM's querySelectorAll API and CSS selectors make it trivial to read the href attribute of every link. But this is Node, outside the DOM. I considered regular-expression matching and abandoned it; a bit of googling led me to cheerio. Cheerio is a fast, flexible and lean implementation of jQuery designed specifically for the server.
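As a rough sketch of the first approach (the URL below is a placeholder, and whether the target site allows direct requests is another matter), the request module and cheerio can be combined like this:
// Sketch of the request-based approach: fetch the HTML, then let cheerio parse it
const request = require('request')
const cheerio = require('cheerio')

request('https://example.com/article-collection', (err, res, body) => {
  if (err) throw err
  const $ = cheerio.load(body)
  // Print the href of every <a> element on the page
  $('a').each((index, el) => {
    console.log(index, $(el).attr('href'))
  })
})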
As far as I know, the standard practice for saving web content is to save it as a PDF file, and Puppeteer happens to meet this need. Puppeteer is a Node library maintained by the Chrome DevTools team that provides a high-level API for controlling the Chrome browser. Besides crawling web content and saving it as a PDF, it can also be used for server-side rendering and automated testing.
Implementing the requirement
Getting the links
Let's start with this part of the code.
const fs = require('fs')
const cheerio = require('cheerio')

// Collected links are stored here and reused by saveToPdf below
const hrefArr = []

const getHref = function () {
  const file = fs.readFileSync('./index.html').toString()
  const $ = cheerio.load(file)
  const hrefs = $('#sam').find('a')
  for (const e in hrefs) {
    // Some <a> tags have no href attribute, so guard against that
    if (hrefs[e].attribs && hrefs[e].attribs['href']) {
      hrefArr.push({
        index: e,
        href: hrefs[e].attribs['href']
      })
    }
  }
  fs.writeFileSync('hrefJson.json', JSON.stringify(hrefArr))
}
Since the rest of the code depends on the file having been read, the synchronous readFileSync method is used. If no encoding is specified, it returns a Buffer; you can either pass 'utf8' as the encoding or call toString on the result.
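For example, these two calls produce the same string:
// Passing an encoding returns a string directly; otherwise a Buffer comes back
const html1 = fs.readFileSync('./index.html', 'utf8')
const html2 = fs.readFileSync('./index.html').toString()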
With cheerio, two lines of code grab all the link elements, which are then processed one by one into the format used later. Some a tags turn out to have no href attribute at all, a case I only discovered later while debugging, hence the guard in the loop.
If you need to save all the links to a separate file as well, write them out with a writeFile call, as the last line of the function does.
Save as PDF
Again, let's look at the code first.
const puppeteer = require('puppeteer')

const saveToPdf = function () {
  (async () => {
    const browser = await puppeteer.launch({
      executablePath: './chrome-win/chrome.exe'
    });
    // Index of the link currently being processed
    let i = 0
    async function getPage() {
      const page = await browser.newPage();
      await page.goto(hrefArr[i]['href'], { waitUntil: 'domcontentloaded' });
      // Page title
      let pageTitle
      if (hrefArr[i]['href'].includes('weixin')) {
        // WeChat articles keep the title in a meta tag
        pageTitle = await page.$eval('meta[property="og:title"]', el => el.content)
      } else {
        pageTitle = await page.$eval('title', el => el.innerHTML)
      }
      let title = pageTitle.trim()
      // Remove whitespace and slashes
      let titlea = title.replace(/\s*/g, "").replace(/\//g, "")
      // Remove vertical bars
      let titleb = titlea.replace(/\|/g, "");
      await page.pdf({ path: `${i}${titleb}.pdf` });
      i++
      if (i < hrefArr.length) {
        getPage()
      } else {
        await browser.close();
      }
    }
    getPage()
  })()
}
We have to wait for Chrome to launch, and possibly for other asynchronous operations, so the outermost layer wraps the actual work in an async arrow function that is invoked immediately.
When Puppeteer is installed with npm, it downloads a copy of the Chrome browser by default. However, because the binary is hosted on an overseas server, the download often fails. There are of course workarounds, but I won't go into them here; if you run into trouble installing Puppeteer, read this article or just Google it.
As mentioned in the previous section, we need to save the content of more than one link as a PDF, so the variable i is used to track which link is currently being visited.
Getting the page title for each link did take a little time to figure out. There are mainly two kinds of links in the document: WeChat official account articles, and ordinary news sites like Sina and Caixin. A WeChat article does not expose its title in the title tag the way a Sina page does.
The $eval method of the Page class takes two arguments: a selector and a function executed in the browser context. It runs document.querySelector with that selector on the page and passes the matched element to the second argument, the function we supply. Taking the Sina article title as an example, 'title' is the selector we pass in, and what we want back is the tag's content.
pageTitle = await page.$eval('title', el => el.innerHTML)
When generating the file name, you also have to consider that the title becomes part of a file path, and therefore has to follow the Windows file-path rules. A web page title is under no such constraint, which creates a conflict. I only found this problem later while debugging; it hadn't occurred to me when first writing the code. In short, characters such as slashes and spaces have to be stripped from the title.
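As an illustration only (this helper is not part of the original code), a single regular expression can strip spaces plus the characters Windows forbids in file names:
// Hypothetical helper: remove whitespace and the characters \ / : * ? " < > |
// that Windows does not allow in file names
function sanitizeTitle (title) {
  return title.trim().replace(/[\\/:*?"<>|\s]/g, '')
}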
After each link's content is saved, i is incremented to move on to the next link; once every link has been saved, the opened browser is closed.
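Putting the two parts together, the entry point might look like this (the original article does not show how the functions are invoked, so this is just an assumption):
// Hypothetical entry point: collect the links first, then save each one as a PDF
getHref()
saveToPdf()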