https://implicit-style-css_0.crawler-lab.com

This is what the page looks like:

The task, in this case, is to get the text of the content displayed on the page. Before we write our crawler code, we need to do a few things:

  • Identify the source of the target content, that is, find the request whose response contains it
  • Determine the location of the target content in the web page

It’s just basic observation and analysis.

To inspect the network requests, open the browser’s developer tools and switch to the Network panel. The page loads only two resources:

An HTML document and a JS file. Presumably the content we want is in the HTML document. Clicking on the request splits the developer tools into two columns, with the list of request records on the left and the details of the selected request on the right. Switch to the Response tab in the right pane to see the body of the server’s response:

It looks like what we’re looking for is in the response body. At first glance, we could simply fetch the text of the p tag under the div tag whose class is RDText. It’s not that simple, however. Careful readers may have noticed that the text in the response body is not the same as the text displayed on the page: the response body has fewer punctuation marks and words, and more span tags. For example, the page displays:

NightTeam was officially established on September 9, 2019. The team consists of several strong developers in the crawler field: Cui Qingcai, Zhou Ziqi, Chen Xiangan, Tang Yifei, Feng Wei, Cai Jin, Dai Huangjin, Zhang Yeqing and Wei Shidong.

The response body reads (excerpt, with the text between the span tags elided):

<p>NightTeam was officially established on September 9, 2019<span class="context_kw0"></span> ... <span class="context_kw1"></span> ... <span class="context_kw21"></span> ... <span class="context_kw2"></span> ...
</p>

In this sentence, span tags have taken the place of the comma and some of the words. Looking at the whole response, you can see that every one of these span tags has a class attribute.

Experienced readers will recognize this as an anti-crawler measure that exploits how the browser renders a page. For those who want the details, see the book Python3 Anti-Crawler Principles and Bypass in Practice.

Since it has to do with span and class, let’s take a look at what styles these classes set. The rule for the class context_kw0 looks like this:

.context_kw0::before {
    content: ",";
}

Now look at another span tag, the one with class context_kw21:

.context_kw21::before {
    content: "Name";
}

The replaced text shows up here! By now you have probably figured out what is going on.

The solution is simple: extract the content value that corresponds to each span tag’s class name and put it back into the text.
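As a rough illustration of the extraction half of that idea, the sketch below collects class-to-content pairs from CSS rule text. It assumes you already have the rules as text (for example, copied out of the developer tools) and that they use the simple `.name::before { content: "..."; }` form shown above; `parseContentRules` is a hypothetical helper name, not something from the site.

```javascript
// Hypothetical helper: map each class name to the content of its ::before rule.
// Assumes rules of the simple form  .name::before { content: "..."; }
function parseContentRules(cssText) {
  const rules = new Map();
  const pattern = /\.([\w-]+)::before\s*\{\s*content:\s*"((?:[^"\\]|\\.)*)"\s*;?\s*\}/g;
  for (const match of cssText.matchAll(pattern)) {
    rules.set(match[1], match[2]); // class name -> content value
  }
  return rules;
}

const css = `
.context_kw0::before {
    content: ",";
}
.context_kw21::before {
    content: "Name";
}
`;

const rules = parseContentRules(css);
console.log(rules.get('context_kw0'));  // prints: ,
console.log(rules.get('context_kw21')); // prints: Name
```

A regular expression is enough for rules this regular; anything more varied would call for a real CSS parser.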

The class names follow a pattern: context_kw + number. Could context_kw be a fixed prefix, with the number coming from a loop counter or an array index? To take a wild guess, suppose there is a dictionary like this:

{0: ",", 1: "The",  21: "Name"}

Combining context_kw with a dictionary key gives the class name, and the corresponding value is the content, which seems close enough. Crawler engineers know that the only way to do this on a web page is with JavaScript. For those who want the details, see the book Python3 Anti-Crawler Principles and Bypass in Practice.

Search it!

Open the global search of the browser’s developer tools, type context_kw and press Enter. Then look through the results for anything that seems useful, such as:

context_kw turns up in the JavaScript code, and the key fragment is '.context_kw' + i + _0xea12('0x2c'). The code is obfuscated, too! If you can’t read it, you can sign up for the author Wei Shidong’s “JavaScript Reverse Engineering” class. After taking it, you will be able to quickly spot the code that matters and understand its logic.

Let’s walk through the JavaScript code together. The first snippet, at line 977, reads as follows:

var _0xa12e = ['appendChild', 'fromCharCode', 'ifLSL', 'undefined', 'mPDrG', 'DWwdv', 'styleSheets', 'addRule', '::before', '.context_kw', '::before{content:\x20\x22', 'cssRules', 'pad', 'clamp', 'sigBytes', 'YEawH', 'yUSXm', 'PwMPi', 'pLCFG', 'ErKUI', 'OtZki', 'prototype', 'endWith', 'test', '8RHz0u9wbbrXYJjUcstWoRU1SmEIvQZQJtdHeU9/KpK/nBtFWIzLveG63e81APFLLiBBbevCCbRPdingQfzOAFPNPBw4UJCsqrDmVXFe6+LK2CSp26aUL4S+AgWjtrByjZqnYm9H3XEWW+gLx763OGfifuNUB8AgXB7/pnNTwoLjeKDrLKzomC+pXHMGYgQJegLVezvshTGgyVrDXfw4eGSVDa3c/FpDtban34QpS3I=', 'enc', 'Latin1', 'parse', 'window', 'location', 'href', '146385F634C9CB00', 'decrypt', 'ZeroPadding', 'toString', 'split', 'length', 'style', 'type', 'setAttribute', 'async', 'getElementsByTagName', 'NOyra', 'fgQCW', 'nCjZv', 'parentNode', 'insertBefore', 'head'];
(function (_0x4db306, _0x3b5c31) {
    var _0x24d797 = function (_0x1ebd20) {
        while (--_0x1ebd20) {
            _0x4db306['push'](_0x4db306['shift']());
        }
    };
    // ... (the excerpt stops here)

Read on and you’ll see the word CryptoJS, which tells you that the code performs some encryption or decryption.

The second snippet, at line 1133, reads as follows:

for (var i = 0x0; i < words[_0xea12('0x18')]; i++) {
    try {
        document[_0xea12('0x2a')][0x0][_0xea12('0x2b')]('.context_kw' + i + _0xea12('0x2c'), 'content:\x20\x22' + words[i] + '\x22');
    } catch (_0x527f83) {
        document['styleSheets'][0x0]['insertRule'](_0xea12('0x2d') + i + _0xea12('0x2e') + words[i] + '\x22}', document[_0xea12('0x2a')][0x0][_0xea12('0x2f')][_0xea12('0x18')]);
    }
}

Here the code loops over words and combines each element’s index with the element itself to build a CSS rule, which is very close to what we guessed. Now we need to find where words comes from.
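For readers who find the obfuscated loop hard to follow, here is one possible readable equivalent. It rests on assumptions: that the `_0xea12(...)` lookups decode to entries of the string table shown earlier (`'styleSheets'`, `'addRule'`, `'::before'`, `'cssRules'`, `'length'`), which matches the visible fragments but has not been verified call by call. `injectRules` and `fakeSheet` are hypothetical names, and `fakeSheet` is only a stand-in so that the sketch runs in Node.

```javascript
// Readable sketch of the loop. Assumption: the obfuscated lookups decode to
// 'addRule', '::before', 'cssRules' and 'length' from the string table.
function injectRules(sheet, words) {
  for (let i = 0; i < words.length; i++) {
    const selector = '.context_kw' + i + '::before';
    const declaration = 'content: "' + words[i] + '"';
    try {
      // Legacy, non-standard API that some browsers still support
      sheet.addRule(selector, declaration);
    } catch (err) {
      // Standard API: append the rule at the end of the sheet
      sheet.insertRule(selector + '{' + declaration + '}', sheet.cssRules.length);
    }
  }
}

// Minimal stand-in for a CSSStyleSheet (no addRule, so the catch branch runs).
const fakeSheet = {
  cssRules: [],
  insertRule(rule, index) {
    this.cssRules.splice(index, 0, rule);
    return index;
  },
};

injectRules(fakeSheet, [',', 'The']); // hypothetical word list
console.log(fakeSheet.cssRules);
```

Each word ends up as the content of a rule whose class name carries the word’s index, which is exactly the index-to-content mapping guessed at earlier.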

How do we find it?

Stuck again?

Search for the code that defines words:

var secWords = decrypted[_0xea12('0x16')](CryptoJS['enc']['Utf8'])[_0xea12('0x17')](',');
var words = new Array(secWords[_0xea12('0x18')]);

In this way, we finally find that the CSS content values come from an encrypted element of the array _0xa12e, after AES decryption and some further processing.
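To make the decrypt-then-split step concrete, here is a self-contained round trip using Node’s built-in crypto module in place of CryptoJS. Everything specific in it is an assumption for illustration: the key, the IV, the CBC mode and the word list are made up, and the real values have to be read out of the site’s script. The only details taken from the article are AES, the zero padding hinted at by 'ZeroPadding' in the string table, and the split on a comma.

```javascript
const crypto = require('crypto');

// Hypothetical 16-byte key and IV (AES-128). The real key, IV and cipher
// mode must come from the site's script; CBC is an assumption here.
const key = Buffer.from('0123456789ABCDEF', 'latin1');
const iv = Buffer.from('FEDCBA9876543210', 'latin1');

// Zero padding: fill the last block with 0x00 bytes up to a multiple of 16.
function zeroPad(buf) {
  const padded = Buffer.alloc(Math.ceil(buf.length / 16) * 16);
  buf.copy(padded);
  return padded;
}

// Only needed to produce sample ciphertext for the demo.
function encryptWords(words) {
  const cipher = crypto.createCipheriv('aes-128-cbc', key, iv);
  cipher.setAutoPadding(false); // we pad manually with zeros
  const plain = zeroPad(Buffer.from(words.join(','), 'utf8'));
  return Buffer.concat([cipher.update(plain), cipher.final()]).toString('base64');
}

// Decrypt, drop the zero padding, then split into the word list.
function decryptWords(base64Text) {
  const decipher = crypto.createDecipheriv('aes-128-cbc', key, iv);
  decipher.setAutoPadding(false);
  const plain = Buffer.concat([
    decipher.update(Buffer.from(base64Text, 'base64')),
    decipher.final(),
  ]);
  return plain.toString('utf8').replace(/\0+$/, '').split(',');
}

const sample = ['NightTeam', 'crawler', 'team']; // hypothetical words
console.log(decryptWords(encryptWords(sample)));
```

Because the list is split on a comma, this only works when no word contains that separator.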

Now that the logic is clear, we can start extracting the JS code we need.

Although this code is obfuscated, it is still relatively simple, so the extraction steps will not be demonstrated one by one. Here are the points that need to be rewritten after extracting the code.

The first is the try block shown below, which checks whether the current URL belongs to the original site. When debugging in Node, there are no window or document objects, so the code throws an exception if left unmodified. You need to comment out the code that uses these objects, for example the following if statement:

try {
    if (top[_0xea12('0x10')][_0xea12('0x11')][_0xea12('0x12')] != window[_0xea12('0x11')]['href']) {
        top['window'][_0xea12('0x11')]['href'] = window[_0xea12('0x11')][_0xea12('0x12')];
    }
    // ... (the rest of the try block is not shown)

The remaining pitfalls you will have to discover for yourself.
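An alternative to commenting code out, which is not what the article does, is to give Node minimal stand-ins for the browser objects before running the extracted code. The sketch below is one way that might look. The object shapes and the URL are hypothetical, and the de-obfuscated check assumes the lookups in the excerpt decode to 'window', 'location' and 'href'.

```javascript
// Hypothetical page address; use the URL the script expects to run on.
const pageUrl = 'https://example.com/';

// Minimal stand-ins for the browser globals the extracted code touches.
const fakeWindow = { location: { href: pageUrl } };
fakeWindow.window = fakeWindow;
fakeWindow.top = fakeWindow;

const collectedRules = [];
const fakeDocument = {
  styleSheets: [{
    cssRules: collectedRules,
    insertRule(rule, index) {
      collectedRules.splice(index, 0, rule);
      return index;
    },
  }],
};

globalThis.window = fakeWindow;
globalThis.top = fakeWindow;
globalThis.document = fakeDocument;

// The URL check from the excerpt, written without obfuscation (assumed
// decoding), now runs in Node without a ReferenceError.
if (top.window.location.href != window.location.href) {
  top.window.location.href = window.location.href;
}
console.log(window.location.href); // prints: https://example.com/
```

With stubs like these, rules that the script inserts land in `collectedRules`, where they can be read back directly.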

Once you’ve done that, you can retrieve all of the characters that were replaced. You just need to substitute them back into the HTML to restore the page text.
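A minimal sketch of that last substitution step follows. It assumes the words have already been recovered into an array and that each placeholder is an empty span whose only attribute is its class; `restoreHtml`, the sample HTML and the word list are hypothetical.

```javascript
// Replace every empty <span class="context_kwN"></span> with words[N].
// Spans whose index has no matching word are left untouched.
function restoreHtml(html, words) {
  return html.replace(/<span class="context_kw(\d+)">\s*<\/span>/g, (whole, index) => {
    const word = words[Number(index)];
    return word === undefined ? whole : word;
  });
}

const words = [',', 'The']; // hypothetical recovered list
const html = '<p>NightTeam<span class="context_kw0"></span> ' +
  '<span class="context_kw1"></span> team <span class="context_kw9"></span></p>';

console.log(restoreHtml(html, words));
// prints: <p>NightTeam, The team <span class="context_kw9"></span></p>
```

If the real markup carries extra attributes on the span tags, an HTML parser would be a sturdier choice than a regular expression.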

Anti-crawler principle

The example uses ::before, and the text below describes what it does:

In CSS, ::before creates a pseudo-element that is the first child of the selected element. It is often used to add cosmetic content to an element with the content property.

From: developer.mozilla.org/zh-CN/docs/…

For example, create a new HTML document and fill it with the following:

<q>Hello everyone, I am salted fish</q>, <q>I am a member of NightTeam</q>

Then style the q tag:

q::before {
  content: "«";
  color: blue;
}
q::after {
  content: "»";
  color: red;
}

The complete code is as follows (for those with no HTML background):

<style>
q::before {
  content: "«";
  color: blue;
}
q::after {
  content: "»";
  color: red;
}
</style>
<q>Hello everyone, I am salted fish</q>, <q>I am a member of NightTeam</q>

When we open the HTML document with a browser, we see something like this:

In the style block, we add ::before and ::after pseudo-elements to the q tag and set their content and color. As a result, the text wrapped in each q tag is preceded by a blue symbol and followed by a red one.

Easy to understand!

Summary

This article briefly introduced how implicit CSS styles are used in anti-crawler measures and worked through a simple example of dealing with them. Once you have tried it yourself, you should know how to crack this kind of anti-crawler measure the next time you meet it.

Of course, this example does not cover every way implicit CSS styles can be used against crawlers. If you are interested in this kind of anti-crawler measure, find a few more examples to try yourself, and feel free to discuss them with me in the comments.

References

NightTeam public account article “I hear this kind of anti-crawler measure stops you cold? A step-by-step guide to beating it in seconds!”

Wei Shidong’s new book “Python3 Anti-Crawler Principles and Bypass in Practice”

Copyright statement

Author: SFHFPC — Wei Shidong

Link: www.sfhfpc.com

Before the filing is complete, you can only access it through the IP address: http://121.36.22.204

Source: Algorithms and anti-crawlers

Copyright belongs to the author. Non-commercial reprints must credit the source; commercial reprints are prohibited.