Write a crawler using pure client-side JavaScript

JavaScript crawlers sound crazy, right?

When people think of crawlers, most picture something written in a back-end language such as Python. In practice, though, client-side JavaScript has several advantages that a back-end crawler can't match:

It can easily be shared with anyone who has a browser on their computer
Because it runs on the client, it can largely sidestep the target site's anti-crawler mechanisms
It can have a polished UI, so even beginners with no development background can use it

How do you start this thing?

So how does a client-side JavaScript crawler actually run? The answer is simple. There are roughly three ways to run the JavaScript code:

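Save it as a browser bookmark whose URL starts with javascript:, then click the bookmark to run it
Paste it into the browser console and press Enter
Install it as a userscript through the browser extension nicknamed "Oil Monkey" (Greasemonkey or Tampermonkey)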
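
As a rough illustration of the first option, the snippet below turns a piece of source code into a javascript: URL that could be saved as a bookmark. The helper name toBookmarklet is made up for this sketch and is not part of any library.

```javascript
// Hypothetical helper: wrap a snippet of source code into a javascript: URL
// that can be pasted into a bookmark's address field.
function toBookmarklet(source) {
  // The function wrapper keeps variables out of the page's global scope;
  // void(...) stops the browser from navigating to the expression's value.
  const wrapped = `void((function(){${source}})());`;
  // Percent-encode characters that are not safe inside a URL.
  return 'javascript:' + encodeURIComponent(wrapped);
}

// Example: a bookmark that shows the current page title when clicked.
console.log(toBookmarklet('alert(document.title)'));
```

The second option needs no wrapper at all: the same source can simply be pasted into the console.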

Here we will focus on the third way: running the crawler as an Oil Monkey userscript.

You may have heard of the Oil Monkey extension (a Chinese nickname), known as Greasemonkey on Firefox and Tampermonkey on Chrome. You can easily find it in the browser's extension store. What it does can be summed up in one sentence:

Following rules you define, it runs an extra piece of JavaScript code on pages whose URLs match

Please refer to the following example:

    // ==UserScript==
    // @name Pxer
    // @include http://www.pixiv.net*
    // ==/UserScript==
    javascript:void((function() {
        document.documentElement.appendChild(
            document.createElement('script')
        ).src='http://pxer-app.pea3nut.org/jsonp.js?'+(+new Date);
    }) ());

The ==UserScript== block holds the rules of the Oil Monkey script. Here it tells Oil Monkey:

When the browser opens a page whose URL matches http://www.pixiv.net*, execute the script below
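
To make the matching rule concrete, here is a simplified sketch of how a wildcard pattern like the one above could be checked against a URL. It only handles the * wildcard. Real userscript managers support more syntax, and the function names here are invented for illustration.

```javascript
// Simplified sketch: turn an @include-style glob into a regular expression.
// Only "*" is treated as a wildcard; everything else is matched literally.
function includeToRegExp(pattern) {
  const escaped = pattern
    .split('*')
    // Escape characters that have a special meaning in regular expressions.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + '$');
}

// Hypothetical helper: does this URL fall under the given pattern?
function matchesInclude(pattern, url) {
  return includeToRegExp(pattern).test(url);
}

console.log(matchesInclude('http://www.pixiv.net*', 'http://www.pixiv.net/member.php?id=1')); // true
console.log(matchesInclude('http://www.pixiv.net*', 'https://www.pixiv.net/')); // false
```

The second call returns false because the pattern names http:// explicitly, so a page served over https would need its own rule.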

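The script itself loads a JS file into the page by appending a script tag, the same mechanism JSONP relies on. The +new Date suffix appends a timestamp to the URL, so every load requests a fresh copy instead of a cached one.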
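
The same idea can be written out more explicitly. The sketch below is not Pxer's actual code. The function name injectScript and its parameters are assumptions made for illustration, and the document object is passed in so the function can be tried outside a browser.

```javascript
// Hypothetical helper: append a <script> tag that loads `baseUrl`,
// adding a timestamp query string so the browser skips its cache.
function injectScript(doc, baseUrl, now = Date.now()) {
  const script = doc.createElement('script');
  script.src = `${baseUrl}?${now}`;
  // Appending the element to the page is what triggers the download.
  doc.documentElement.appendChild(script);
  return script;
}

// In a userscript this would be called with the real page document, e.g.:
// injectScript(document, 'http://pxer-app.pea3nut.org/jsonp.js');
```

Keeping the userscript itself this small means the real crawler code lives on a server and can be updated without users reinstalling anything.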

You can do a lot of things with that. (Nothing malicious, of course.)

The advantages really are considerable

Above is pixiv.net, an anime-style illustration sharing site. The bar labeled "Pxer 7" does not exist on the original site; it was created by JavaScript that the Oil Monkey script loaded automatically.

To visit the original page, click here. You may need to register an account at pixiv.net.

By loading code automatically through an Oil Monkey script, you can:

Use the full power of JavaScript to give users extra functionality without interfering with normal use of the original site

Combined with the advantages mentioned at the beginning:

Anyone can use it, even beginners with no development background
It is hard to block. The crawler is operated by a real user in a real browser session, so how effective do you think anti-crawler measures such as account bans, IP bans, and CAPTCHAs will be against it?
With HTML and CSS you can build an attractive UI that blends right into the original site, and it could hardly be simpler
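
As a minimal sketch of the UI point, the function below adds a small bar with a start button to the top of a page. The element id, the styles, and the name addToolbar are all invented for this example and have nothing to do with how Pxer builds its own interface.

```javascript
// Hypothetical helper: insert a simple control bar at the top of the page.
// `onStart` is called when the user clicks the button.
function addToolbar(doc, label, onStart) {
  const bar = doc.createElement('div');
  bar.id = 'demo-crawler-bar'; // made-up id, chosen to avoid clashing with the site
  bar.textContent = label;
  bar.style.cssText = 'padding:8px;background:#222;color:#fff;font:14px sans-serif;';

  const button = doc.createElement('button');
  button.textContent = 'Start';
  button.addEventListener('click', onStart);
  bar.appendChild(button);

  // Place the bar before everything else in <body> so it shows at the top.
  doc.body.insertBefore(bar, doc.body.firstChild);
  return bar;
}

// In a userscript: addToolbar(document, 'My crawler ', () => console.log('started'));
```

A user who has never opened a developer console only ever sees the button.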

Suddenly this sounds pretty impressive, doesn't it?

With all that said, let's look at a real implementation

Talk is cheap, so let's look at a real open source client-side JavaScript crawler: Pxer

GitHub: pea3nut/Pxer
Project website: Pxer official website (possibly the most useful batch download tool for pixiv at the moment)

Pxer is a pure client-side JavaScript crawler that runs directly on the browser side without any configuration.

Pxer's main feature is quickly grabbing images from pixiv.net (a site similar to Huaban). Rather than simply collecting img tags, it uses its own algorithms and Ajax requests to handle more complex tasks.
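
To give a feel for the Ajax side, here is a generic sketch of a crawl loop that requests a list of pages and pulls image URLs out of the returned HTML. It is not Pxer's algorithm. The function names, the crude regular expression, and the delay value are all assumptions made for illustration.

```javascript
// Hypothetical helper: extract src values of <img> tags with a crude regex.
// In a browser, DOMParser would be a more robust choice.
function extractImageUrls(html) {
  const urls = [];
  const pattern = /<img\b[^>]*\bsrc=["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Fetch each page in turn and collect the image URLs found on it.
// `fetchImpl` defaults to the browser's fetch; it is a parameter so the
// loop can be exercised with a stand-in.
async function crawlPages(pageUrls, fetchImpl = fetch, delayMs = 500) {
  const results = [];
  for (const url of pageUrls) {
    // credentials: 'include' sends the user's cookies, so the request is
    // made with the logged-in session the user already has.
    const response = await fetchImpl(url, { credentials: 'include' });
    if (!response.ok) continue; // skip pages that failed to load
    results.push(...extractImageUrls(await response.text()));
    // Pause between requests to avoid hammering the site.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}
```

Because the script runs inside a page on the target site, these requests are same-origin and reuse the user's existing login, which is what makes the client-side approach so convenient.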

For a detailed description of the project, see the project website and the GitHub repository.

Good JSDoc comments and detailed documentation are ready for you.
