Preface
The text and images in this article come from the internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us and we will deal with them.
Author: Wang Ping
Web page code is getting more and more complex these days. Besides the use of Vue and other front-end frameworks to make development easier, a major driver is anti-crawling, so writing crawlers takes more and more effort. Attack and defense are locked together: they fight each other, yet they also push each other forward.
This article looks at several JS-based anti-crawler strategies and at how to get around them.
1. JS writes cookies
When we write a crawler to grab the data in a web page, it comes down to opening the page and looking at the source code. If the HTML already contains the data we want, it's easy: just use Requests to get the page source and parse it.
But wait a minute! What Requests got back was a chunk of JS, completely different from the source you see when you open the page in a browser! In this case, the browser usually runs that JS to generate one (or more) cookies and then makes a second request carrying the cookie(s). When the server receives the cookie, it considers your visit a legitimate one made through a browser.
In fact, you can watch this process in the browser (Chrome or Firefox). Press F12 to open the Network panel, tick "Preserve log" ("Persist Logs" in Firefox), then refresh the page so we can see the full history of network requests. Here's an example:
The solution is to study that JS, figure out how it generates the cookie, and then have the crawler generate the same cookie itself.
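The pattern, then, is: request once to get the JS, reproduce its cookie computation, and request again with the cookie. Below is a minimal sketch with Requests; the cookie name anti_spider and the compute_cookie() helper are hypothetical stand-ins for whatever the site's JS actually does.

```python
import hashlib
import requests

def compute_cookie(js_text: str) -> str:
    # Placeholder: in practice, port the site's JS algorithm to Python
    # (or run it with a JS engine). A dummy hash keeps the sketch runnable.
    return hashlib.md5(js_text.encode("utf-8")).hexdigest()

url = "https://example.com/page"              # hypothetical target
session = requests.Session()

first = session.get(url)                      # first response: JS only, no data
session.cookies.set("anti_spider", compute_cookie(first.text))  # hypothetical cookie name

second = session.get(url)                     # second request carries the cookie
print(second.text[:200])                      # should now be the real HTML
```

Using a Session keeps the generated cookie (and any server-set ones) attached to the follow-up request automatically.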
2. JS-encrypted Ajax request parameters
When we write a crawler to grab the data in a page and find that the data we want isn't in the page source at all, things get a bit trickier. That data is usually the result of an Ajax request. But don't panic: press F12 to open the Network panel and refresh the page to see which URLs were downloaded while the page loaded. The data we want is in the response of one of those requests; in Chrome's Network panel, such URLs are mostly of type XHR. By looking at their "Response", we can find the data we want.
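As a quick illustration, once the XHR URL has been spotted in the Network panel it can often be fetched directly with Requests; the URL and header below are hypothetical.

```python
import requests

xhr_url = "https://example.com/api/list?page=1"       # hypothetical XHR endpoint
headers = {"X-Requested-With": "XMLHttpRequest"}       # some sites check this header

data = requests.get(xhr_url, headers=headers).json()  # the Response tab showed JSON
print(data)
```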
However, things aren't always that simple. The URL often contains many parameters, one of which is a seemingly meaningless string. That string is very likely generated by an encryption algorithm in the JS, and the server uses the same algorithm to verify that the request really came from a browser. We can copy the URL into the address bar, change that parameter to some random characters, and see whether we still get the right result; that tells us whether it really is a critical encrypted parameter.
For such encrypted parameters, the way out is to debug the JS and find the corresponding encryption algorithm. The key is to set "XHR/fetch Breakpoints" in Chrome DevTools.
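Once the algorithm has been recovered, the crawler simply re-implements it. As a hedged sketch, suppose debugging revealed that a "sign" parameter is the MD5 of the other query parameters plus a fixed salt; the endpoint, parameter names, and salt below are all made up for illustration.

```python
import hashlib
import time
import requests

SALT = "salt_extracted_from_the_js"            # hypothetical constant found while debugging

def make_sign(keyword: str, page: int, ts: int) -> str:
    # Reproduce the (assumed) JS signing scheme: MD5 over the query string + salt.
    raw = f"keyword={keyword}&page={page}&ts={ts}&{SALT}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

ts = int(time.time() * 1000)
params = {
    "keyword": "python",
    "page": 1,
    "ts": ts,
    "sign": make_sign("python", 1, ts),        # mimic what the browser's JS sends
}
resp = requests.get("https://example.com/api/search", params=params)  # hypothetical API
print(resp.status_code, resp.text[:200])
```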
3. JS anti-debugging
We have been using Chrome's F12 tools to watch how a page loads and to debug its JS. Some sites counter this with an anti-debugging strategy: as soon as we open F12, the page pauses on a line of code containing "debugger", and no matter what we do we can't step out of it. It looks like this:
The countermeasure to this JS anti-debugging could be called "anti-anti-debugging". The strategy is to use the Call Stack to find the function that traps us in the infinite loop and redefine it.
Such a function usually has no purpose other than being a trap for us. We can redefine it in the Console, for example as an empty function, so that when it runs again it does nothing and no longer traps us. Set a breakpoint at the place where this function is called. Since we are already caught in the trap, refresh the page: the JS will stop at the breakpoint we set, before the trap function has run. At that point, redefine the function in the Console, then resume execution to skip the trap.
4. JS sends mouse click events
Some sites use anti-crawling tricks different from the ones above. You can open a page normally in the browser, but with Requests you're asked for a captcha or redirected to another page. You may be confused at first, but don't worry: a closer look at "Network" may reveal some clues. For example, the following Network stream contains this information:
Let's sort out the logic first. When a link is clicked, the JS responds to the click event: before opening the link, it requests cl.gif and sends the current information to the server, and only then opens the clicked link. When the server receives the request for the clicked link, it checks whether the corresponding information was already sent via cl.gif. If it was, the server treats the visit as a legitimate browser visit and returns the normal page content.
Because Requests doesn't respond to mouse events, it simply requests the link without going through the cl.gif step, so the server refuses to serve it.
Once you understand this process, it's not hard to come up with a strategy to get around this anti-crawling measure with almost no need to study the JS itself (though it's possible the JS modifies the link being clicked): simply request cl.gif before requesting the link, as sketched below. The key is to study the parameters sent to cl.gif and include them all.
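Here is a minimal sketch of that strategy with Requests; the host, article URL, and cl.gif parameter names are hypothetical and must be replaced with whatever the browser actually sends in the Network panel.

```python
import requests

session = requests.Session()
session.get("https://example.com/list")           # load the list page first

article_url = "https://example.com/article/123"   # the link we want to "click"
session.get(
    "https://example.com/cl.gif",
    params={"url": article_url, "t": "click"},    # hypothetical parameters; copy the real ones
)

article = session.get(article_url)                # now the server accepts the visit
print(article.status_code)
```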
In closing
Crawlers and websites are a pair of adversaries that feed off each other. When a crawler figures out the site's anti-crawling strategy, it can devise a counter-strategy in response; when the website learns the crawler's counter-strategy, it can come up with an "anti-anti-crawling" strategy of its own... As virtue rises a foot, the devil rises ten: their struggle will never end.