Preface
Looking back, this past year went by in a blink. I vaguely remember Yang Jiang once said: "We were once so eager for the waves of fate, only to discover at last that the most beautiful scenery in life is the calm and composure of the heart."
Opening
Hi, good afternoon everyone. Today is Friday, the last day of 2021. So, to close out the year, let me share a little knowledge with you ~
What is it about? Just look at the title, it's obvious. Hahaha, no suspense here, I didn't keep you guessing.
Okay, let’s get straight to the point
Main Text
Before we get to the code, a quick word about puppeteer. I used it recently and it's genuinely good, and it's also the backbone of this article.
Oh, and the main language here is NodeJS. Now let's begin.
First we will create a new project:
```sh
# create a folder
mkdir onlinerun
# go into the folder
cd onlinerun
# initialize the project
npm init -y
# create the entry file
touch test.js
```
Then we’ll install some dependencies to use:
```sh
npm install puppeteer node-schedule request -d
# if the default registry is slow, pull puppeteer from the taobao mirror:
npm install puppeteer --registry=https://registry.npm.taobao.org
```
I personally suggest passing the registry per command like this, because in day-to-day development, company projects and personal projects are usually worked on side by side, and constantly switching the global registry between the company mirror and taobao is just too much trouble.
Heh heh, but it also comes down to personal preference ~
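If you'd rather not type `--registry` every time, a project-level `.npmrc` is one alternative (a minimal sketch, not part of the original setup); it pins the mirror for this one project only:

```ini
# onlinerun/.npmrc, scoped to this project only
registry=https://registry.npm.taobao.org
```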
Once puppeteer is installed, we're ready to start on the project. So what is puppeteer actually used for?
Here's a quick look at how puppeteer works; if you've used it before, feel free to skip this part.
```js
// a simple crawler
const puppeteer = require("puppeteer");
const log = console.log;

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://www.baidu.com/").catch(() => browser.close());
  // grab the href of every <a> on the page
  const urls = await page.$$eval("a", (els) => els.map((x) => x.href));
  const obj = [];
  urls.forEach((item) => {
    obj.push(item);
  });
  log(obj);
  await browser.close();
})();
```
The simple example above, which you can try yourself, is a crawler. If that all makes sense, let's move on to the monitoring this article is really about.
So what exactly is this monitoring, and why does it need puppeteer?
I don't know if you've run into this online, but when one of a project's resource files dies, especially a static file like an image, it looks like this:
And this is something you can't fully control; it just happens. If you let it slide and no user reports it, you simply won't know about it, and it will sit there broken in production. Hahaha.
Another scenario: the front-end frameworks in most current projects are either Vue or React, and some dependencies will print errors and warnings to the console. And of course, if a resource dies, the console exposes that too. Here's an example:
A warning you can arguably ignore; after all, it doesn't affect the main flow and the program still runs.
Well, if you're a bit more careful, you'll deal with it anyway. The OCD in me says a clean console is a comfortable console. Hahaha.
An error, however, is another matter and cannot be tolerated. It means your program has a real problem: the page may still render and the features may look fine at a glance, but a hidden bug is definitely in there.
Both of these problems are the kind I mentioned at the start: if you don't watch for them, you won't know about them unless, say, a user reports one. So how do you watch for them? There are so many pages in production; checking them by hand is not a joke. Well, this is exactly where puppeteer comes in.
Remember the crawler we tried out at the beginning? If you skipped it, my advice is to go back and read it again.
So, back to the question: how do we monitor so many online pages? First we use the crawler to fetch a handful of core pages. The change from before is to wrap the crawler code into a function and move the entry URLs into a config.
Then, for the URLs crawled from those core pages: if a URL is a secondary page, keep traversing it; if it's just a jump link, return it directly.
After doing this you end up with a collection of URLs, and these URLs will come in handy later.
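To make that concrete, here is a minimal sketch of the wrapped-up crawler under my own assumptions: the `CONF.pages` entry list, the `isSecondary` check, and the depth limit are all hypothetical names for illustration, not from the original project.

```js
const puppeteer = require("puppeteer");

// hypothetical config: the core pages to start crawling from
const CONF = { pages: ["https://xxx.com/home", "https://xxx.com/list"] };

// hypothetical check: treat same-origin links as secondary pages worth
// traversing; anything else is a plain jump link and is kept as-is
const isSecondary = (url, entry) => url.startsWith(new URL(entry).origin);

async function collectUrls(entry, depth = 1) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(entry).catch(() => browser.close());
  const links = await page.$$eval("a", (els) => els.map((x) => x.href));
  await browser.close();

  const found = new Set(links);
  if (depth > 0) {
    for (const link of links.filter((u) => isSecondary(u, entry))) {
      (await collectUrls(link, depth - 1)).forEach((u) => found.add(u));
    }
  }
  return [...found];
}

(async () => {
  const all = new Set();
  for (const entry of CONF.pages) {
    (await collectUrls(entry)).forEach((u) => all.add(u));
  }
  console.log([...all]); // the url collection we will monitor later
})();
```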
Speaking of storing the traversed URLs, there are three approaches I'd recommend:
- Store them in a database: MySQL, MongoDB, or similar. Put all the URLs into the database, then design an interface that returns them;
- Write them to a file: that is, write the URLs into a local file in the project, then read that file directly when monitoring later, with code like the following:

```js
const fs = require("fs");

// imgUrl here is the directory prefix for the output file
fs.appendFile(imgUrl + "urls.json", JSON.stringify(obj), "utf8", function (err) {
  if (err) {
    throw err;
  }
  console.log("success");
});
```
- Just crawl the freshest set each time, and monitor that directly.
Of these three methods, the one I recommend most is the first. In practice it splits into two steps. This is my plan; take it as a reference and adapt it to your situation: expose an interface. What is this interface for?
This interface is called by the monitor to get the URLs it should cycle through. When the interface is requested, the data returned is whatever already exists in the database, but at the same time the request also triggers the crawler.
Why trigger the crawler? Because the data you just fetched was crawled last time; triggering it now refreshes the database for the next call.
Some of you will ask: doesn't that make the data stale? And what about read/write consistency?
First, don't worry about timeliness: once the service is deployed it runs continuously, so there's never a long gap in the data. As for read/write consistency: each time the interface finishes serving the current data, it then updates the database. By that point you already hold the existing data and the database is free to update, so there's nothing to worry about.
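To make the plan concrete, here is a minimal sketch of what that interface could look like. Express and the `crawlUrls()` helper are my illustrative choices, not from the original project; the `/getonlineurls` route and port 2727 are reused from the monitoring code shown later.

```js
const express = require("express"); // illustrative choice of server framework

const app = express();
let storedUrls = []; // stands in for the database of crawled urls

// stub: in the real project this would run the crawler over the core
// pages and return the fresh url collection
async function crawlUrls() {
  return storedUrls;
}

app.get("/getonlineurls", (req, res) => {
  // spit out the existing data first...
  res.json({ data: storedUrls });
  // ...then trigger a fresh crawl so the next request gets updated data
  crawlUrls().then((urls) => {
    storedUrls = urls;
  });
});

app.listen(2727, () => console.log("url interface listening on :2727"));
```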
Okay, that's my take. If you have a better way, let's discuss. Now for the main event; look straight at the code:
```js
/*
 * IMT project
 */
const puppeteer = require("puppeteer");
var sendrequest = require("request");
const schedule = require("node-schedule");

let nowurls = [];
const log = console.log;

function delay(time) {
  return new Promise(function (resolve) {
    setTimeout(resolve, time);
  });
}

// open a page and watch its responses and console output
const asynconlineasserts = async (newurl) => {
  try {
    const cookiesarr = [
      {
        name: "uid",
        value: "xxxxxxxxxxx",
        domain: ".xxxxxxx.com",
        path: "/",
        expires: Date.now() + 3600 * 1000,
      },
      // ...
    ];
    puppeteer
      .launch({
        headless: true, // run without a visible browser window
        timeout: 60 * 1000,
        // devtools: true, // open the developer console
        slowMo: 200, // slow each step down by 200 ms
        ignoreHTTPSErrors: true,
        defaultViewport: null, // match the viewport to the browser window
        args: ["--start-maximized", "--no-sandbox"],
        ignoreDefaultArgs: ["--enable-automation"],
        // executablePath: "/usr/bin/google-chrome",
      })
      .then(async (browser) => {
        try {
          const page = await browser.newPage();
          await page.setDefaultNavigationTimeout(0);
          // disguise the headless user agent as regular Chrome
          const headlessUserAgent = await page.evaluate(
            () => navigator.userAgent
          );
          const chromeUserAgent = headlessUserAgent.replace(
            "HeadlessChrome",
            "Chrome"
          );
          await page.setUserAgent(chromeUserAgent);
          await page.setExtraHTTPHeaders({
            "accept-language": "zh-CN,zh;q=0.9",
          });
          const openoptions = {
            timeout: 0,
            waitUntil: [
              "load", // wait for the "load" event to fire
              "domcontentloaded", // wait for the "DOMContentLoaded" event to fire
              "networkidle0", // no network connections for 500 ms
              "networkidle2", // no more than 2 network connections for 500 ms
            ],
          };
          // set the login cookies
          await page.setCookie(...cookiesarr);
          await delay(5000);
          await page.reload();
          await delay(5000);
          // alarm on any 4xx response
          page.on("response", (response) => {
            const status = response.status().toString();
            if (status.startsWith("4")) {
              // log(response.url());
              // log(response.status());
              var differror = {
                method: "POST",
                url: "https://xxxxxxxxxxxxxx",
                headers: {
                  "Content-Type": "application/json",
                },
                body: JSON.stringify({
                  msgtype: "text",
                  text: {
                    content:
                      newurl +
                      " The url " +
                      response.url() +
                      " status: " +
                      response.status(),
                    mentioned_list: [""],
                    mentioned_mobile_list: [""],
                  },
                }),
              };
              sendrequest(differror, function (error, response) {
                if (error) throw new Error(error);
                log(response.body);
              });
            }
          });
          // alarm on console errors and warnings
          page.on("console", (msg) => {
            // log("msg._type: ", msg._type);
            if (msg._type === "error" || msg._type === "warning") {
              // log("errormsg: ", JSON.stringify(msg));
              var differror = {
                method: "POST",
                url: "https://xxxxxxxxxxx",
                headers: {
                  "Content-Type": "application/json",
                },
                body: JSON.stringify({
                  msgtype: "text",
                  text: {
                    content:
                      newurl +
                      " The url request has some error: " +
                      JSON.stringify(msg),
                    mentioned_list: [""],
                    mentioned_mobile_list: [""],
                  },
                }),
              };
              sendrequest(differror, function (error, response) {
                if (error) throw new Error(error);
                log(response.body);
              });
            }
          });
          await page.goto(newurl, openoptions).catch(() => browser.close());
        } catch (e) {
          log(e);
        }
        await delay(3000);
        await browser.close();
      });
  } catch (e) {
    log(e + "//////////");
  }
};

// pull the url list from the interface exposed by the crawler service
function geturls() {
  var options = {
    method: "GET",
    url: "http://10.xxx.xxx.xxx:2727/getonlineurls",
    headers: {},
  };
  sendrequest(options, function (error, response) {
    if (error) throw new Error(error);
    const obj = JSON.parse(response.body).data;
    nowurls = obj;
  });
}

let startnum = 0;
let rule = new schedule.RecurrenceRule();
let times = [];
for (let i = 1; i < 60; i++) {
  times.push(i);
}
rule.minute = times;

// check one url per minute; refill the list when we run out
let job = schedule.scheduleJob(rule, async () => {
  try {
    log("startnum: ", startnum);
    let oldlen = nowurls.length;
    log("oldlen: " + oldlen);
    if (oldlen > 0 && startnum < oldlen) {
      if (nowurls[startnum].includes("login")) {
        // skip login pages
        startnum++;
      } else {
        await asynconlineasserts(nowurls[startnum]);
        startnum++;
      }
    } else {
      startnum = 0;
      geturls();
    }
  } catch (e) {
    log(e + "---------");
  }
});
```
It's also worth explaining the two dependencies we installed alongside puppeteer. One job of request is to send HTTP requests; that mainly serves the first storage method I used for the crawler data, since it exposes an interface.
Its other job is internal alerting: sending alarm requests to Enterprise WeChat. Hahahaha.
The other dependency, node-schedule, is a task scheduler for the Node service; you can create scheduled tasks with it to fit your needs.
```js
const schedule = require("node-schedule");
const log = console.log;

let rule = new schedule.RecurrenceRule();
let times = [];
for (let i = 1; i < 60; i++) {
  times.push(i);
}
rule.minute = times;

// run the job once a minute
let job = schedule.scheduleJob(rule, async () => {
  try {
    log("Happy New Year");
  } catch (e) {
    log(e + "---------");
  }
});
```
This prints 'Happy New Year' once a minute (strictly speaking, at minutes 1 through 59 of each hour, since the loop starts at 1).
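As an aside, node-schedule also accepts cron expressions, so an every-minute job can be written more compactly (unlike the loop above, this also fires at minute 0):

```js
const schedule = require("node-schedule");

// "* * * * *" fires at every minute of every hour
schedule.scheduleJob("* * * * *", () => {
  console.log("Happy New Year");
});
```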
Okay, back to the main theme: the core of the code posted above is really these two pieces:
page.on("response", (response) => { const status = response.status().toString(); if (status.startsWith("4")) { log(response.url()); log(response.status()); }});Copy the code
This listens to the full response of every request the page makes; you can inspect whatever you need in the response and build your logic on top of it. The other piece is:
page.on("console", (msg) => { // log("msg._type: ", msg._type); if (msg._type === "error" || msg._type === "warning") { log("errormsg: ", JSON.stringify(msg)); }});Copy the code
This watches each page as it loads and listens to whatever lands in the console: errors, warnings, and so on. Again, design the handling logic around your own business scenario.
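Besides the console event, puppeteer also emits pageerror for uncaught exceptions inside the page, and requestfailed for requests that never complete at all (DNS errors, timeouts, aborts) and therefore never reach the response listener. Depending on your scenario, these may be worth listening to as well; a minimal sketch:

```js
// uncaught exceptions thrown inside the page itself
page.on("pageerror", (err) => {
  log("pageerror: ", err.message);
});

// requests that failed outright and never produced a response
page.on("requestfailed", (request) => {
  log("requestfailed: ", request.url(), request.failure().errorText);
});
```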
For documentation on this part, refer to the puppeteer manual.
As for the final alerting channel, do whatever fits your needs: you can choose a report, email, or an alert bot.
Personally, I think an alert bot or email is most appropriate, because this concerns production monitoring: problems should be reported immediately and fixed in time.
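If you go the email route, a minimal sketch with nodemailer might look like this. nodemailer is my choice for illustration, not something the original project uses, and the SMTP host, account, and recipients are all placeholders:

```js
const nodemailer = require("nodemailer"); // npm install nodemailer

// placeholder SMTP settings; fill in your own mail server and account
const transporter = nodemailer.createTransport({
  host: "smtp.example.com",
  port: 465,
  secure: true,
  auth: { user: "monitor@example.com", pass: "xxxxxx" },
});

// send one alarm mail describing the page and the problem found
function sendAlarmMail(pageUrl, detail) {
  return transporter.sendMail({
    from: '"online monitor" <monitor@example.com>',
    to: "frontend-team@example.com",
    subject: "online page alarm: " + pageUrl,
    text: detail,
  });
}
```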
One small puppeteer detail is the executablePath parameter. Its purpose is to point at the Chrome binary on your server. If you don't specify it, no problem: puppeteer falls back to its bundled Chromium.
But I don't really like using the bundled one, hahaha. Once that's sorted, you can deploy to your server.
Finally, execute:

```sh
node test.js
```
For keeping the node process running, check out forever when you have a moment.
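For reference, the usual forever workflow looks like this (standard forever subcommands; install globally first):

```sh
npm install -g forever
forever start test.js   # run the script in the background, restart on crash
forever list            # show managed processes
forever stop test.js    # stop it
```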
Final effect:
Closing
Well, that's all for this episode. Thank you all for reading.
Today is the last day of 2021. I wish you all a happy New Year.
Hey hey ~ bye 👋 ~