Understand crawler HOOK technology

Preface

To help colleagues in other departments finish their repetitive work more easily, I have been studying the encryption measures that websites use against crawlers, but the examples from work are not suitable for showing here. Recently, in a crawler challenge, I ran into an encryption problem. It was not the first one I had met, but this time it made me wonder: when reverse engineering a website, is there a more convenient way to solve this kind of problem efficiently? After a quick search on Baidu I found HOOK technology, which matched my idea well and opened the door to a new world. Here I summarize some of my own small experience to share with everyone.

An appetizer

Let’s start with a question. Some websites love to call alert() while we are browsing. Does that drive you crazy? Suppose pop-ups like this keep appearing and ruin the user experience; think about how you could use your technical skills to get rid of this annoyance.

—– Here’s five minutes to think about it ——

If you don’t have any ideas, this article will help you a lot.

If we want to disable alert on a web page, we can do this:

alert = function(message){
   console.log(message)
}

After this code runs, the alert function on the page no longer shows a dialog; the message is printed with console.log instead, which is far less intrusive.

Question prompts

As we learned in the “appetizer” section above, it is actually quite convenient to rewrite a system built-in function. So can we write the previous example a little bit better? The answer is yes, consider this code:

let myalert = alert; // Back up the alert function
alert = function (message) { // Override alert
   console.log('Intercepted alert message:', message)
}

Readers may have questions:

  • Why back up the alert function?
  • Did we just change what gets printed to the console?

To answer the first question: before changing anything, it’s a good habit to make a backup. The backup also lets us call the original function later.

As for the second question: does replacing the dialog with console.log suggest anything to you? We have overridden alert, and in doing so we have actually intercepted its message! That is exactly what a HOOK is. With that, we have finally arrived at our topic, so let’s officially begin.

What is a HOOK

Let’s start with a more official definition:

Hook technology is also known as the hook function. Before the system calls a function, the hook program captures the message first, so the hook function gets control first. At that point the hook function can process (change) the function’s execution behavior, or forcibly end the message’s transmission. In short, it pulls out a piece of the system’s program flow and turns it into a snippet of code that we execute ourselves.

Official definitions rarely speak plain language, so let’s take an example.

You ordered a takeaway:

  • Normal process: place an order - the courier delivers it - receive the food
  • With HOOK technology: (hook does something, such as stopping you from ordering) - place an order (you could phone a friend for advice first) - (hook does something, such as cancelling the order) - (hook does something, such as refusing the delivery) - receive the food

This example shows that a hook can change otherwise fixed steps at specific moments in a normal process. Like a troublemaker, it can add some spice to your affairs.
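The takeaway example can be sketched in code. This is only an illustration under assumptions: `placeOrder`, `withHook`, and the `{ cancel }` / `{ args }` return convention are hypothetical names invented here, not part of any library. The sketch shows the hook getting control before the original function, then either changing its input or ending the call early.

```javascript
// Hypothetical "normal process": place an order and receive the food.
function placeOrder(dish) {
  return "delivered: " + dish;
}

// Wrap a function so that a hook gets control before the original runs.
// The hook may return { cancel: true } to stop the call,
// or { args: [...] } to change the arguments.
function withHook(original, hook) {
  return function (...args) {
    const decision = hook(args) || {};
    if (decision.cancel) {
      return undefined; // the hook ended the "message" early
    }
    return original.apply(this, decision.args || args);
  };
}

// A hook that cancels orders for "durian" and upgrades everything else.
const hookedOrder = withHook(placeOrder, function (args) {
  if (args[0] === "durian") return { cancel: true };
  return { args: [args[0] + " (large)"] };
});

console.log(hookedOrder("noodles")); // delivered: noodles (large)
console.log(hookedOrder("durian"));  // undefined
```

The original `placeOrder` is never modified; the hook only sits in front of it, which is the same relationship the backed-up `myalert` has with the overridden `alert`.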

The first HOOK

Now go back to the original alert and think about it. Isn’t this exactly what we have been describing? alert exists to show a pop-up; that is the normal process.

We can add a confirmation step before the pop-up to decide whether it should appear at all. That is a hook. Let’s implement and improve it:

let myalert = alert; // Back up the alert function
alert = function (message) { // Override alert
    console.log("Let's pop up, but there's something I can do first.");
    console.log('Intercepted alert message:', message);
    if (confirm('Do you want to show the pop-up?')) {
        myalert(message);
    } else {
        console.log('Pop-up cancelled, message interception succeeded!');
    }
}

Run the code in the console and you will see that we have successfully hooked the alert function.

Console operation

I found a good article on this topic; head over to “Chrome DevTools usage tips”.

One API it does not mention is debug/undebug, which you can explore on your own.

How to hook someone else’s web page?

Hooking our own page is simple: we just add the code. But someone else’s code has already finished running by the time the page opens, so is there nothing left to play with? Not so. Let’s think about it: can we use other tools to get there?

  • Chrome extension or Tampermonkey userscript injection, which can listen for the different stages of document loading and execute JS code at a specific moment.

  • Proxy injection, which modifies the response data and inserts the script at the first position inside the tag (typically <head>).

  • The Chrome DevTools Protocol, injecting through Page.addScriptToEvaluateOnNewDocument.

Either way, the code can be injected ahead of the page’s own scripts.
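As a rough sketch of the proxy-injection idea, the function below rewrites an HTML response so that a hook script becomes the first thing inside `<head>`. The function name `injectHook` and the sample page are assumptions made for illustration; a real proxy would call something like this on each HTML response body, and how it does so depends on the proxy tool.

```javascript
// Insert a <script> block as the first child of <head> so the hook runs
// before any of the page's own scripts.
function injectHook(html, hookSource) {
  const scriptTag = "<script>" + hookSource + "</script>";
  const headOpen = /<head(\s[^>]*)?>/i; // matches <head> but not <header>
  if (headOpen.test(html)) {
    // A replacer function keeps "$" in the hook source from being interpreted.
    return html.replace(headOpen, (match) => match + scriptTag);
  }
  // No <head> found: fall back to putting the script at the very top.
  return scriptTag + html;
}

const page = "<html><head><title>t</title></head><body></body></html>";
console.log(injectHook(page, "console.log('hooked')"));
```

Placing the script first matters because a hook only sees calls made after it is installed.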

But here’s the point! All of the above require installing tools. I would like to recommend a zero-install method that works right in the browser. You need Chrome (version 85 and up, I think). Open DevTools and go to the Sources panel:

  • On the left you can see the Page pane, which lists all the resources the page has loaded
  • Next to it you can find the Overrides pane.

These are the two panes we care about. First, in the Overrides pane, select a local folder (only required the first time) and tick Enable Local Overrides to turn on the local overrides feature.

Go back to the Page pane, right-click a file and choose Save for overrides. A small dot then appears on the file’s icon, which means the file is now overridden and can be edited in the DevTools editor. Add the hook function you need inside the head tag. After editing, press Ctrl+S to save, then refresh the page for the change to take effect (be careful not to close DevTools).

This approach is quite convenient. Can GET/POST requests in the Network panel be intercepted and rewritten the same way? The answer is yes! You just have to percent-encode the special characters in the request path. For example, when simulating an API response for a request that has query parameters, replace the question mark with %3f in the file name.

Detailed encoding reference: www.w3school.com.cn/tags/html_r…

Method | URL | File path
POST | coolaf.com:1010/tool/ajaxgp | \coolaf.com%3a1010\tool\ajaxgp, where ajaxgp is a file
GET | www.baidu.com/s?wd=2131 | \www.baidu.com\s%3fwd=2131, where s%3fwd=2131 is a file
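The mapping in the table can be sketched as a small helper. `toOverridePath` is a hypothetical name, and the sketch only encodes ":" and "?", the two characters that appear in the examples; other special characters would need the same treatment according to the encoding reference.

```javascript
// Percent-encode the characters that cannot appear in a file name.
// Only ":" and "?" are handled, matching the two examples in the table.
function toOverridePath(url) {
  return url.replace(/[:?]/g, (ch) => "%" + ch.charCodeAt(0).toString(16));
}

console.log(toOverridePath("coolaf.com:1010/tool/ajaxgp")); // coolaf.com%3a1010/tool/ajaxgp
console.log(toOverridePath("www.baidu.com/s?wd=2131"));     // www.baidu.com/s%3fwd=2131
```

Each "/" in the result corresponds to one directory level under the overrides folder, and the last segment is the file name.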

Write whatever response you want into the file, and don’t worry about the file extension (important). This way we can rewrite things from DevTools, and nobody can stop us from debugging.

Several common hook functions

Let’s start with the first one. How do I get the console to automatically report what elements the page has created?

—– Here’s five minutes to think about it ——

Ok, let’s see how to implement a hook function like this:

let create_element = document.createElement.bind(document);
document.createElement = function (_element) {
  console.log("Create DOM tag :", _element);
  return create_element(_element);
}

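Why call bind(document) on document.createElement? createElement relies on this being document; if we keep a bare reference and call it on its own, this is lost and the browser throws an “Illegal invocation” error.

So we need bind whenever we take a reference to a method.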
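The effect of a lost `this` can be reproduced without a browser. In the sketch below, `FakeDocument` is a hypothetical stand-in for `document`; the real `document.createElement` fails with a different message, but for the same reason.

```javascript
// A stand-in for document: its method depends on `this`.
class FakeDocument {
  constructor() {
    this.created = [];
  }
  createElement(tag) {
    this.created.push(tag); // fails if `this` is not a FakeDocument
    return { tagName: tag.toUpperCase() };
  }
}

const fakeDocument = new FakeDocument();

// Unbound reference: `this` is lost when the function is called on its own.
const unbound = fakeDocument.createElement;
let unboundError = null;
try {
  unbound("div");
} catch (err) {
  unboundError = err;
}
console.log(unboundError instanceof TypeError); // true

// Bound reference: `this` is fixed to fakeDocument, so the call works.
const bound = fakeDocument.createElement.bind(fakeDocument);
console.log(bound("div").tagName); // DIV
```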

Is this what you had in mind? It doesn’t matter if yours is different; feel free to share your own approach in the comments. All roads lead to Rome.

Would hooking the eval function be just as easy?

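let my_eval = eval; // Back up eval (calls through my_eval run in the global scope)
eval = function (message) {
  console.log("eval:", message);
  return my_eval(message); // Pass the result back to the caller
}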

With this basic idea you can hook most functions.
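The pattern shared by the alert, createElement, and eval hooks can be pulled into one helper. This is a minimal sketch, not a standard API: `hookMethod` and the `calculator` object are hypothetical names used only to show the pattern (back up, replace, delegate, and optionally restore).

```javascript
// Replace obj[name] with a wrapper that reports every call, then delegates.
// Returns a function that restores the original.
function hookMethod(obj, name, onCall) {
  const original = obj[name];
  obj[name] = function (...args) {
    onCall(name, args);
    return original.apply(this, args); // keep `this` and the return value
  };
  return function unhook() {
    obj[name] = original;
  };
}

// Usage on a hypothetical object instead of a real browser API.
const calculator = {
  base: 10,
  add(n) { return this.base + n; }
};
const calls = [];
const unhook = hookMethod(calculator, "add", (name, args) => calls.push([name, args]));

console.log(calculator.add(5)); // 15
console.log(calls.length);      // 1
unhook();
console.log(calculator.add(1)); // 11
console.log(calls.length);      // 1
```

Using `original.apply(this, args)` avoids the need for a separate bind step, because the wrapper is called with the same `this` as the method it replaced.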

Detect page variables

Let’s think more deeply about how to hook into what variables are declared on the page.

—– Here’s five minutes to think about it ——

My method takes several steps (if you have a better one, leave a comment):

  • The first step is to open a blank page (make sure the browser has no extensions installed), take all the initial properties under the window object, and collect their keys into an array. window records every global variable, so it is the natural entry point. Drawback: this only captures global variables, not variables inside closures (as the older generation says, never speak in absolutes; if you know another method, leave a message).
  • The second step is to compare these keys with the ones on the target page and see which variables are extra.
  • The third step is to hook those variables and watch how they change.

That’s the idea, so how do we implement it? Let’s take it step by step.

To rule out extensions and other interference, open a new blank page in the browser and record the page’s default variables, like this:

var originKey = []; // Array holding the default variables under window
for (let key in window) {
    originKey.push(key)
}

As you can see, we have collected these variables. If we take the keys of a target page and remove these defaults, what remains are the extra variables the page itself declared.

Collect the target page’s variables in the same way, compare the two arrays, and you have the answer. The blank-page keys are fixed, so we can collect that array once and reuse it afterwards. Our code then looks like this:

(() => {
    var originKey = ["parent", /* ...many system variables omitted... */ "dispatchEvent"];
    var moreKey = [];
    for (var key in window) {
        if (window.hasOwnProperty(key) && !originKey.includes(key)) moreKey.push(key);
    }
    console.log(moreKey);
})()

Note that this must be wrapped in an immediately invoked function. If the code ran at the top level, its own variables would be attached to window and pollute the result.
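The comparison between the blank-page keys and the target page can also be tried outside the browser. In this sketch, `findExtraKeys` is a hypothetical helper, and the two plain objects stand in for the window of a blank page and of a target page; their contents are made up for illustration.

```javascript
// Return the own keys that exist on `target` but not in the baseline list.
function findExtraKeys(baselineKeys, target) {
  const known = new Set(baselineKeys);
  return Object.keys(target).filter((key) => !known.has(key));
}

// Stand-ins for window on a blank page and on a target page.
const blankPage = { parent: null, dispatchEvent: null };
const targetPage = { parent: null, dispatchEvent: null, axios: {}, myConfig: 1 };

console.log(findExtraKeys(Object.keys(blankPage), targetPage)); // the keys "axios" and "myConfig"
```

A Set makes each lookup constant-time, which helps because a real window object has hundreds of default keys.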

So how do we detect a variable from the time it’s declared to the time it’s set?

Get to know the following two APIs first; go straight to the MDN documentation, which is very well written:

  • Object.defineProperty()
  • Proxy

Very simple code:

var temp = '';
Object.defineProperty(window, 'mytest', {
    set: function (value) {
        console.log('Found variable being assigned!');
        temp = value;
    },
    get: function () {
        console.log('Found variable being read!');
        return temp;
    }
})

Let’s see whether it hooks the variable’s behavior.

As you can see, it works perfectly. You might be wondering: why not use Proxy, isn’t it the more modern API? Because a Proxy cannot wrap a variable that is not defined yet; its target must be an object. So we have to use Object.defineProperty. Of course it has a downside: it cannot observe changes to the properties inside the value, so it does not really listen to everything. Is there a way around that? Of course!
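The difference between the two APIs can be shown side by side. This is a small sketch under assumptions: `holder` stands in for window, and `config` and `debug` are made-up names. It records what each mechanism is able to see.

```javascript
const log = [];
const holder = {}; // stands in for window
let stored;

// Accessor property: sees reads and writes of holder.config itself.
Object.defineProperty(holder, "config", {
  configurable: true,
  get() { log.push("get config"); return stored; },
  set(value) { log.push("set config"); stored = value; }
});

holder.config = { debug: false }; // recorded as "set config"
holder.config.debug = true;       // recorded only as "get config"; the write to .debug is invisible

// Proxy: sees writes to the properties of the object it wraps.
const proxied = new Proxy({ debug: false }, {
  set(obj, prop, value) {
    log.push("set " + String(prop));
    obj[prop] = value;
    return true; // a set trap reports success by returning true
  }
});
proxied.debug = true;             // recorded as "set debug"

console.log(log); // "set config", "get config", "set debug"
```

Object.defineProperty catches the moment a global is assigned, and Proxy catches what happens inside the assigned object, which is why combining them covers both cases.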

Look at my example, which hooks 'axios' and 'Vue':

(function () {
    var values = {};
    function hooks(varName, values) {
        Object.defineProperty(window, varName, {
            configurable: true,
            set: function (value) {
                console.log("[variable " + varName + "]: being assigned:", value);
                values[varName] = value;
                // Only objects and functions can be wrapped in a Proxy
                var isObject = (typeof value === 'object' && value !== null) || typeof value === 'function';
                if (isObject) {
                    values[varName] = new Proxy(value, {
                        set: function (obj, prop, newValue) {
                            console.log("[variable " + varName + "]: object proxy setting property", prop, "value:", newValue);
                            obj[prop] = newValue;
                            return true; // a set trap must return true on success
                        },
                        get: function (obj, prop) {
                            console.log("[variable " + varName + "]: object proxy reading property", prop, "value:", obj[prop]);
                            return obj[prop];
                        }
                    });
                }
            },
            get: function () {
                console.log("[variable " + varName + "]: being read directly");
                return values[varName];
            }
        });
    }
    hooks("axios", values);
    hooks("Vue", values);
})();

In practice, hooks are most often used when debugging specific APIs. For example, to trace cookie operations, hook the cookie.
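A cookie hook might be sketched as follows. `findDescriptor`, `hookAccessor`, and `FakeDocument` are hypothetical names; `FakeDocument` stands in for the browser's document so the sketch can run anywhere. In a browser the cookie accessor lives on the document's prototype chain, so the same helper should apply to `document`, but treat that as an assumption to verify on your target page.

```javascript
// Find the descriptor for `prop` on obj or anywhere on its prototype chain.
function findDescriptor(obj, prop) {
  for (let o = obj; o !== null; o = Object.getPrototypeOf(o)) {
    const desc = Object.getOwnPropertyDescriptor(o, prop);
    if (desc) return desc;
  }
  return undefined;
}

// Shadow an accessor property on obj itself, reporting every write.
function hookAccessor(obj, prop, onSet) {
  const desc = findDescriptor(obj, prop);
  if (!desc || !desc.set) throw new Error(prop + " is not a writable accessor");
  Object.defineProperty(obj, prop, {
    configurable: true,
    get() { return desc.get ? desc.get.call(this) : undefined; },
    set(value) {
      onSet(value);               // our hook runs first
      desc.set.call(this, value); // then the original setter
    }
  });
}

// FakeDocument stands in for the browser's document in this sketch.
class FakeDocument {
  constructor() { this.jar = []; }
  get cookie() { return this.jar.join("; "); }
  set cookie(v) { this.jar.push(v.split(";")[0]); }
}

const doc = new FakeDocument();
const seen = [];
hookAccessor(doc, "cookie", (value) => seen.push(value));
doc.cookie = "token=abc; path=/";
console.log(seen[0]);    // token=abc; path=/
console.log(doc.cookie); // token=abc
```

Inside `onSet` you could log a stack trace or place a `debugger` statement to find the code that writes a particular cookie.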

Monitor variables within some special closures

Here I can only take a roundabout route: use the method described earlier for hooking someone else’s web page, and add Object.defineProperty or Proxy directly inside their code. After all, you are now editing the code itself!

Why does it work in your browser, but not in Node or any other language?

The main reason is that many websites perform browser detection, for example checking whether the current environment supports canvas. For the common browser feature checks, refer to the article. So when hooking, if the script will eventually run in another language or runtime, pay special attention to whether the site checks the browser environment; you cannot simply copy the code and use it as-is.
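To get a feel for what such detection looks like, here is a hedged sketch of the kind of check a site might run. `missingBrowserFeatures` and the list of features are assumptions chosen for illustration; real sites use their own, often much longer, lists.

```javascript
// Report which browser-only features are missing from a global object.
function missingBrowserFeatures(globalObj) {
  const missing = [];
  if (typeof globalObj.window === "undefined") missing.push("window");
  if (typeof globalObj.document === "undefined") missing.push("document");
  if (typeof globalObj.navigator === "undefined" ||
      typeof globalObj.navigator.userAgent !== "string") {
    missing.push("navigator.userAgent");
  }
  const doc = globalObj.document;
  const canvas = doc && typeof doc.createElement === "function"
    ? doc.createElement("canvas")
    : null;
  if (!canvas || typeof canvas.getContext !== "function") missing.push("canvas");
  return missing;
}

// A bare environment (similar to a plain script runtime) fails every check.
const bareEnvironment = {};
console.log(missingBrowserFeatures(bareEnvironment).length); // 4

// A hypothetical object that imitates just enough of a browser passes.
const fakeBrowser = {
  window: {},
  navigator: { userAgent: "example" },
  document: { createElement: () => ({ getContext() { return {}; } }) }
};
console.log(missingBrowserFeatures(fakeBrowser).length); // 0
```

Running a check like this in the target runtime shows which objects would have to be supplied before the site's copied code behaves as it does in the browser.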