Excerpted from the forthcoming book Python3 Anti-Crawler Principles and Bypass in Practice. This open excerpt covers Chapter 6, text obfuscation anti-crawlers, and this is Section 4 of Chapter 6. The remaining sections will be released gradually.

Font anti-crawler overview

Before CSS3, web developers could only use fonts that were already installed on the user’s computer. With CSS3, developers can use @font-face to specify a font for a web page without depending on the fonts installed on the user’s computer. Developers place the desired font file on a web server and reference it in CSS. When a user opens the web application in a browser, the browser downloads the font to the user’s computer.

When we studied browsers and page rendering, we learned that CSS only styles HTML; it does not change the content of the HTML document when the page is rendered. Because the font is loaded and mapped through CSS, even rendering tools such as Splash, Selenium, and Puppeteer cannot obtain the text that the user sees. A font anti-crawler exploits this by applying a custom font to the important data on a page, so that a crawler cannot obtain the correct data.

6.4.1 Example of font anti-crawler

Example 7: Font anti-crawler example.

Website: www.porters.vip/confusion/m…

Task: Crawl the user rating, the number of raters, and the box office data on the movie information page, as shown in Figure 6-32.

Figure 6-32 Example 7 page

Before writing any code, we need to locate the elements that hold the target data. While doing so, we find some strange symbols in the HTML. The HTML code is as follows:

<div class="movie-index"> 
   <p class="movie-index-title"> user rating </p> <div class="movie-index-content score normal-score"> 
       <span class="index-left info-num "> 
       <span class="stonefont"> ☒.☒ </span> 
       </span> 
   <div class="index-right"> 
   <div class="star-wrapper"> 
   <div class="star-on" style="width:90%;"></div> 
   </div> 
   <span class="score-num"><span class="stonefont"</span> </div> </div> </div>

The most important data on the page appears as strange characters: the HTML shows “☒.☒” where the page displays “9.7”, and “☒☒.☒☒” where the page displays “56.83”. Unlike the mapping anti-crawler in Section 6.3, every character here is replaced by the same “☒” symbol, so the characters cannot be told apart. This is odd: how can “☒” stand for so many different digits?

Note that what the Elements panel of Chrome Developer Tools displays is not necessarily the text in the page source. To find out what the “☒” symbol really is, we need to check the page source. The corresponding page source is as follows:

<div class="movie-index">
    <p class="movie-index-title"> user rating </p> <div class="movie-index-content score normal-score">
        <span class="index-left info-num ">
            <span class="stonefont"> &#xe624.&#xe9c7</span>
        </span>
        <div class="index-right">
          <div class="star-wrapper">
            <div class="star-on" style="width:90%;"></div>
          </div>
          <span class="score-num"><span class="stonefont"> &  
        </div>
    </div>
</div>

The page source contains no symbols but character references beginning with &#x, which looks very similar to the SVG mapping anti-crawler in Example 6. Comparing the numbers displayed on the page with the characters in the page source gives the mapping shown in Figure 6-33.

Figure 6-33 Mapping between characters and digits
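One way to see why the Elements panel can only show a placeholder is to look at the code points behind the references. The sketch below is an illustration, not part of the original example: the helper name entity_code_points is hypothetical. It extracts the hexadecimal references and checks whether each one falls in the Unicode Private Use Area (U+E000 to U+F8FF), a range with no standard glyphs, so only the site's custom font can draw these characters.

```python
import re

PUA_START, PUA_END = 0xE000, 0xF8FF  # Unicode Private Use Area in the BMP

def entity_code_points(source: str) -> list:
    """Return the code points of all hexadecimal character references in source."""
    # The trailing semicolon is optional because the example page omits it.
    return [int(h, 16) for h in re.findall(r'&#x([0-9a-fA-F]+);?', source)]

codes = entity_code_points('&#xe624.&#xe9c7')
for cp in codes:
    in_pua = PUA_START <= cp <= PUA_END
    print(hex(cp), 'private use' if in_pua else 'regular character')
```

Both references in the rating fall inside the private range, which is consistent with the “☒” placeholders seen earlier.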

Characters and digits correspond one to one, so we only need to look at more pages to collect the characters for the digits 0 to 9. But what if the font on the target site changes dynamically? Does the mapping change as well?

As we learned in Section 6.3, a manually maintained mapping cannot solve these problems. We must find the rule behind the mapping and implement it in Python. As a next step, we ask: is the character mapping loaded asynchronously and then rendered with JavaScript?

Figure 6-34 Request record

As shown in Figure 6-34, the network request records contain no asynchronous request, so this guess is not confirmed. Are there any clues in the CSS styles? The class attribute of the tags that enclose the symbols is stonefont:

<span class="stonefont"> &#xe624.&#xe9c7</span> 
<span class="stonefont"> &#xf593&#xe9c7&#xe9c7.&#xe624</span>
<span class="stonefont"> &#xea16&#xe339.&#xefd4&#xf19a</span>

But the CSS style only sets the font family:

.stonefont { 
 	font-family: stonefont; 
}

A custom font means that a font file is loaded. We can find the font file movie.woff in the network requests, download it, and inspect it with Baidu Font Editor.

Baidu Font Editor, FontEditor (see fontstore.baidu.com/static/edit…), can import and export font files in TTF, WOFF, EOT, and OTF formats, adjust glyph shapes and outlines, and preview fonts in real time, as shown in Figure 6-35.

Figure 6-35 Baidu Font Editor page

After opening the page, drag the movie.woff file to the gray area of Baidu Font Editor, as shown in Figure 6-36.

Figure 6-36 Preview font file movie.woff

The font file contains 12 glyphs: 2 blank glyphs and the 10 glyphs for the digits 0 to 9. It is a safe guess that the digits in the rating and box office figures come from here.

We therefore need some knowledge of font file formats. Once we understand the format and its rules, we can look for a more reasonable solution.

6.4.2 Font File WOFF

WOFF (Web Open Font Format) is a font format standard for web pages. It is essentially a container for SFNT-based fonts (such as TrueType), so it has the same font structure as TrueType, and we only need to understand TrueType fonts.
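The link between WOFF and SFNT is visible in the file header. As a minimal sketch, assuming the WOFF 1.0 header layout (a big-endian 4-byte signature 'wOFF', then the 4-byte SFNT flavor, the 4-byte total length, and the 2-byte table count), the hypothetical helper parse_woff_header below decodes the first 16 bytes. The sample bytes are synthetic and are not taken from movie.woff.

```python
import struct

def parse_woff_header(data: bytes) -> dict:
    """Decode the first 16 bytes of a WOFF 1.0 header (big-endian)."""
    signature, flavor, length, num_tables, _reserved = struct.unpack('>4sIIHH', data[:16])
    if signature != b'wOFF':
        raise ValueError('not a WOFF 1.0 file')
    return {
        'flavor': flavor,                      # SFNT version of the wrapped font
        'is_truetype': flavor == 0x00010000,   # 0x00010000 marks TrueType outlines
        'length': length,                      # total size of the WOFF file
        'num_tables': num_tables,              # number of font tables in the file
    }

# Synthetic header bytes for illustration only.
sample = struct.pack('>4sIIHH', b'wOFF', 0x00010000, 1234, 9, 0)
header = parse_woff_header(sample)
print(header)
```

With a real file, the same function could be applied to the first 16 bytes read from disk.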

TrueType is a computer outline font format developed by Apple and later licensed to Microsoft. Each glyph in a TrueType font is described by a series of points on a grid, and the point is the smallest unit of a glyph. The relationship between glyphs and points is shown in Figure 6-37.

Figure 6-37 Relationship between glyphs and points

A font file contains not only glyph data and point information but also the character-to-glyph mapping, the font header, naming information, and horizontal metrics, each stored in its own table. We can therefore think of a TrueType font file as a collection of tables. The commonly used tables and their functions are shown in Figure 6-38.

Figure 6-38 Common tables that form font files and their functions

How do we view the structure of these tables and the information they contain? The third-party Python library fontTools can convert font files such as WOFF into XML, which lets us inspect the structure and table contents of a font file. First install the fontTools library with the following command:

$ pip install fonttools

Once installed, you can use the library to convert file types, using Python code like this:

from fontTools.ttLib import TTFont 
font = TTFont('movie.woff') # Open the movie.woff file in the current directory
font.saveXML('movie.xml') # Save as movie.xml

After the code runs, an XML file named movie.xml is generated in the current directory. The character-to-glyph mapping table cmap in the file contains the following:

<cmap_format_4 platformID="0" platEncID="3" language="0"> 
   <map code="0x78" name="x"/> 
   <map code="0xe339" name="uniE339"/> 
   <map code="0xe624" name="uniE624"/> 
   <map code="0xe7df" name="uniE7DF"/> 
   <map code="0xe9c7" name="uniE9C7"/> 
   <map code="0xea16" name="uniEA16"/> 
   <map code="0xee76" name="uniEE76"/> 
   <map code="0xefd4" name="uniEFD4"/> 
   <map code="0xf19a" name="uniF19A"/> 
   <map code="0xf57b" name="uniF57B"/> 
   <map code="0xf593" name="uniF593"/> 
</cmap_format_4>

In each map tag, code is the character code and name is the glyph name, as shown in Figure 6-39.

Figure 6-39 Mapping between characters and glyphs
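The cmap listing can be turned into a Python dictionary with the standard library alone. The sketch below is an illustration: the helper name cmap_from_xml is hypothetical, and the XML is a shortened copy of the listing above rather than the full movie.xml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the cmap listing above.
CMAP_XML = """
<cmap_format_4 platformID="0" platEncID="3" language="0">
  <map code="0x78" name="x"/>
  <map code="0xe339" name="uniE339"/>
  <map code="0xe624" name="uniE624"/>
  <map code="0xe9c7" name="uniE9C7"/>
</cmap_format_4>
"""

def cmap_from_xml(xml_text: str) -> dict:
    """Map each character code (int) to its glyph name (str)."""
    root = ET.fromstring(xml_text.strip())
    return {int(m.get('code'), 16): m.get('name') for m in root.iter('map')}

cmap = cmap_from_xml(CMAP_XML)
print(cmap[0xE624])
```

When the font is opened with fontTools instead of being dumped to XML, TTFont.getBestCmap() returns a dictionary of the same shape, so the XML step is only needed for inspection.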

The character code 0xe339 in the XML corresponds to the character &#xe339 in the page source, so we have established the relationship between the character codes in the HTML and the glyphs in the movie.woff font file. Glyph data is stored in the glyf table, and the data for each glyph is independent. For example, the glyph data for uniE339 is as follows:

<TTGlyph name="uniE339" xMin="0" yMin="-12" xMax="510" yMax="719"> 
   <contour> 
     <pt x="410" y="534" on="1"/> 
     <pt x="398" y="586" on="0"/> 
     <pt x="377" y="609" on="1"/> 
     <pt x="341" y="646" on="0"/> 
     <pt x="289" y="646" on="1"/> 
     ... 
   </contour> 
   <contour> 
     <pt x="139" y="232" on="1"/> 
     <pt x="139" y="188" on="0"/> 
     <pt x="178" y="103" on="0"/> 
     ... 
   </contour> 
   <instructions/> 
</TTGlyph>

The TTGlyph tag records the glyph’s name and its minimum and maximum coordinates on the x and y axes (these coordinates also determine the width and height of the glyph). Each contour tag records one contour of the glyph as the coordinates of a series of points, and the contours together form the glyph shown in Figure 6-40.

Figure 6-40 Outline of uniE339
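To make the structure of a TTGlyph element concrete, the sketch below reads one with the standard library and derives the width and height from the bounding coordinates. It is an illustration under stated assumptions: the helper name parse_glyph is hypothetical, the XML keeps only the first three points of each contour from the listing above, and yMin is assumed to be -12.

```python
import xml.etree.ElementTree as ET

# Truncated copy of the uniE339 listing; yMin="-12" is an assumption.
GLYPH_XML = """
<TTGlyph name="uniE339" xMin="0" yMin="-12" xMax="510" yMax="719">
  <contour>
    <pt x="410" y="534" on="1"/>
    <pt x="398" y="586" on="0"/>
    <pt x="377" y="609" on="1"/>
  </contour>
  <contour>
    <pt x="139" y="232" on="1"/>
    <pt x="139" y="188" on="0"/>
    <pt x="178" y="103" on="0"/>
  </contour>
  <instructions/>
</TTGlyph>
"""

def parse_glyph(xml_text: str) -> dict:
    """Return the glyph name, its width and height, and its contours."""
    root = ET.fromstring(xml_text.strip())
    box = {k: int(root.get(k)) for k in ('xMin', 'yMin', 'xMax', 'yMax')}
    contours = [
        # Each point is (x, y, on_curve); on="0" marks an off-curve control point.
        [(int(pt.get('x')), int(pt.get('y')), pt.get('on') == '1')
         for pt in contour.iter('pt')]
        for contour in root.iter('contour')
    ]
    return {
        'name': root.get('name'),
        'width': box['xMax'] - box['xMin'],
        'height': box['yMax'] - box['yMin'],
        'contours': contours,
    }

info = parse_glyph(GLYPH_XML)
print(info['name'], info['width'], info['height'], len(info['contours']))
```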

We can adjust the positions of the points in Baidu Font Editor, save the font file, and convert the new font file to XML. The data for the glyph with the same name is now as follows:

<TTGlyph name="uniE339" xMin="115" yMin="6" xMax="430" yMax="495"> 
 <contour> 
   <pt x="400" y="352" on="1"/> 
   <pt x="356" y="406" on="0"/> 
   <pt x="342" y="421" on="1"/> 
   <pt x="318" y="446" on="0"/> 
   <pt x="283" y="446" on="1"/> 
   ... 
 </contour> 
 <instructions/> 
</TTGlyph>


We then compare the glyph data before and after the adjustment.

As shown in Figure 6-41, the glyph data changes once the points are moved: the values of xMin, xMax, yMin, and yMax and the x and y coordinates of the pt tags all differ from before.

Figure 6-41 Data comparison

The XML file records the coordinates of the glyphs. The text itself cannot be read from the glyph data; it has to come from somewhere else. Although the target site may use multiple font files, the glyph for a given character is the same in each of them. For example, suppose there are two font files, movie.woff and food.woff, which contain the following glyphs:

# movie.woff
# contains 10 digit glyphs: [0123456789]
<cmap_format_4 platformID="0" platEncID="3" language="0"> 
   <map code="0x78" name="x"/> 
   <map code="0xe339" name="uniE339"/> # digit 6
   <map code="0xe624" name="uniE624"/> # digit 9
   <map code="0xe7df" name="uniE7DF"/> # digit 2
   <map code="0xe9c7" name="uniE9C7"/> # digit 7
   <map code="0xea16" name="uniEA16"/> # digit 5
   <map code="0xee76" name="uniEE76"/> # digit 0
   <map code="0xefd4" name="uniEFD4"/> # digit 8
   <map code="0xf19a" name="uniF19A"/> # digit 3
   <map code="0xf57b" name="uniF57B"/> # digit 1
   <map code="0xf593" name="uniF593"/> # digit 4
</cmap_format_4> 

# food.woff
# contains 3 digit glyphs: [012]
<cmap_format_4 platformID="0" platEncID="3" language="0"> 
   <map code="0x78" name="x"/> 
   <map code="0xe556" name="uniE556"/> # digit 0
   <map code="0xe667" name="uniE667"/> # digit 1
   <map code="0xe778" name="uniE778"/> # digit 2
</cmap_format_4>


To recognize text automatically, we need to prepare reference glyphs, that is, manually prepare the mapping and the glyph data for the digits 0 to 9, for example:

# Pseudo-code that maps the digits 0 and 7 to glyph names. The value of the data key is the glyph data
font_mapping = [
   {'name': 'uniE9C7', 'words': '7', 'data': 'uniE9C7_contour_pt'},
   {'name': 'uniEE76', 'words': '0', 'data': 'uniEE76_contour_pt'},
]

When we meet another font file on the target site, we can compare the data of each target glyph with the reference glyphs; if the glyph data is very close, we assume that the two glyphs describe the same text. Glyph data consists of the TTGlyph tag, which records the glyph’s name and its start and end coordinates (the position of the glyph on the canvas), and the pt tags, which record point coordinates (the position of each point on the canvas). In the start and end coordinates, the difference on the x axis is the glyph width and the difference on the y axis is the glyph height.

As shown in Figure 6-42, two glyphs can differ greatly in start and end coordinates, width, and height and still describe the same text. Neither the position of a glyph on the canvas nor its width and height affects the text it describes.

Figure 6-42 Two glyphs that describe the same text
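The observation that position and size do not change the described text can be checked numerically. The sketch below is one possible illustration and not the method used later in this section: the helpers normalize and same_shape are hypothetical, the tolerance is an arbitrary choice, and requiring equal point counts is a simplification. It moves every outline to the origin and scales it to unit height before comparing points.

```python
def normalize(points):
    """Translate points to the origin and scale so the outline height is 1."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    height = (max(ys) - y_min) or 1  # avoid division by zero for flat outlines
    return [((x - x_min) / height, (y - y_min) / height) for x, y in points]

def same_shape(a, b, tolerance=0.02):
    """Hypothetical check: equal point counts and normalized points within tolerance."""
    if len(a) != len(b):
        return False
    return all(
        abs(ax - bx) <= tolerance and abs(ay - by) <= tolerance
        for (ax, ay), (bx, by) in zip(normalize(a), normalize(b))
    )

# First five points of the uniE339 listing shown earlier.
original = [(410, 534), (398, 586), (377, 609), (341, 646), (289, 646)]
# The same outline drawn at half size and moved elsewhere on the canvas.
moved = [(x * 0.5 + 100, y * 0.5 - 40) for x, y in original]
# An unrelated outline with the same number of points.
other = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5)]

print(same_shape(original, moved))
print(same_shape(original, other))
```

The first comparison succeeds even though every raw coordinate differs, while the unrelated outline is rejected.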

Can the number of points and their coordinate values be used as comparison criteria?

As shown in Figure 6-43, the two glyphs describe different characters. Although both glyphs are named uniE9C7, most of the x and y values of the pt tags differ greatly, so we can conclude that the two glyphs do not describe the same text. You might also think of using the number of points as an exclusion condition: if the numbers of points differ, the two glyphs do not describe the same text. Is that really so?

Figure 6-43 Comparison of glyphs that describe different characters

In Figure 6-44, the glyph for 7 on the left has 17 points, while the glyph for 7 on the right has 20 points. Figure 6-45 shows the corresponding glyphs.

Figure 6-44 Glyphs describing the same text

Figure 6-45 Shapes of the glyphs describing the same text

Although the numbers of points differ, the shapes barely change and users will not misread them, so the number of points cannot be used to rule out a match. The only condition we can rely on is therefore the strict one: glyphs whose start and end coordinates and point coordinates are exactly the same describe the same character.

6.4.3 Bypassing the font anti-crawler in practice

To decide whether two sets of glyph data describe the same character, we must take the font encoding characters from the HTML, find the corresponding glyph data in the font file, and compare it with the reference glyph data we have prepared. The steps are as follows.

(1) Prepare reference glyph description information.

(2) Visit the target webpage.

(3) Read the font encoding characters from the target web page.

(4) Download the WOFF file and open it with Python code.

(5) Find the outline information of the glyph in the WOFF file according to the font encoding characters.

(6) Compare the outline information of the glyphs with that of the reference glyphs.

(7) Obtain the comparison result.
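Step (3) can be sketched with a regular expression applied to the raw page source. This is an illustration rather than the code used in this section: the helper name stonefont_texts is hypothetical, and the HTML string is copied from the page source shown in Section 6.4.1 instead of being fetched from the site. Working on the raw source keeps the character references intact.

```python
import re

# Two spans copied from the page source shown earlier.
HTML = '''
<span class="stonefont"> &#xe624.&#xe9c7</span>
<span class="stonefont"> &#xea16&#xe339.&#xefd4&#xf19a</span>
'''

def stonefont_texts(html: str) -> list:
    """Return the raw inner text of every <span class="stonefont"> element."""
    pattern = r'<span class="stonefont">\s*(.*?)\s*</span>'
    return re.findall(pattern, html, flags=re.S)

texts = stonefont_texts(HTML)
print(texts)
```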

Let’s write the code for the first four steps: download the WOFF file, and map the text described by each glyph to human-readable text. Because glyph data is relatively large, we can hash it; the digest is short and, for practical purposes, unique, so it does not affect the comparison result. The following uses the digits 0 to 9 as an example:

base_font = {
 "font": [{"name": "uniEE76", "value": "0", "hex": "fc170db1563e66547e9100cf7784951f"},
 {"name": "uniF57B", "value": "1", "hex": "251357942c5160a003eec31c68a06f64"},
 {"name": "uniE7DF", "value": "2", "hex": "8a3ab2e9ca7db2b13ce198521010bde4"},
 {"name": "uniF19A", "value": "3", "hex": "712e4b5abd0ba2b09aff19be89e75146"},
 {"name": "uniF593", "value": "4", "hex": "e5764c45cf9de7f0a4ada6b0370b81a1"},
 {"name": "uniEA16", "value": "5", "hex": "c631abb5e408146eb1a17db4113f878f"},
 {"name": "uniE339", "value": "6", "hex": "0833d3b4f61f02258217421b4e4bde24"},
 {"name": "uniE9C7", "value": "7", "hex": "4aa5ac9a6741107dca4c5dd05176ec4c"},
 {"name": "uniEFD4", "value": "8", "hex": "c37e95c05e0dd147b47f3cb1e5ac60d7"},
 {"name": "uniE624", "value": "9", "hex": "704362b6e0feb6cd0b1303f10c000f95"}]}

In the dictionary, name is the glyph name, value is the text the glyph describes, and hex is the MD5 digest of the glyph data.
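The role of the hex field can be shown with a small sketch. It assumes nothing beyond the standard library: the helper glyph_digest and the index hex_to_value are hypothetical names, only two reference entries are repeated here, and the byte strings are made-up stand-ins for real glyph data. It shows that a digest is always 32 hexadecimal characters, that a one-byte change yields a different digest, and that indexing the reference data by digest turns the comparison into a dictionary lookup.

```python
import hashlib

# Two entries repeated from the reference data; the rest are omitted for brevity.
base_font = {
    "font": [
        {"name": "uniE9C7", "value": "7", "hex": "4aa5ac9a6741107dca4c5dd05176ec4c"},
        {"name": "uniE624", "value": "9", "hex": "704362b6e0feb6cd0b1303f10c000f95"},
    ]
}

# Index the reference entries by digest so a lookup replaces the inner loop.
hex_to_value = {item["hex"]: item["value"] for item in base_font["font"]}

def glyph_digest(glyph_bytes: bytes) -> str:
    """Return the 32-character hexadecimal MD5 digest of raw glyph bytes."""
    return hashlib.md5(glyph_bytes).hexdigest()

# Made-up bytes standing in for glyph data, only to show how digests behave.
a = glyph_digest(b"\x00\x01\x02\x03")
b = glyph_digest(b"\x00\x01\x02\x04")
print(len(a), a != b)
print(hex_to_value.get("704362b6e0feb6cd0b1303f10c000f95"))
```

Because any change to the glyph bytes changes the digest, this comparison only recognizes glyphs that are byte-for-byte identical to a reference glyph.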

Because the font file path seen in the network request records may change, we should find the font file path that is set in the CSS. The HTML code that includes the CSS is:

<link href="./css/movie.css" rel="stylesheet">


From this code we know that the CSS file path is www.porters.vip/confusion/c… The @font-face rule in that file sets the font:

@font-face { 
   font-family: stonefont; 
   src:url('../font/movie.woff') format('woff'); 
}
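A relative URL inside a stylesheet is resolved against the stylesheet's own URL, not against the page URL. As a small sketch of that resolution, using the page address that appears in the crawler code of this section together with the two relative paths shown above, urllib.parse.urljoin can do the work in two steps:

```python
from urllib.parse import urljoin

page_url = 'http://www.porters.vip/confusion/movie.html'

# The stylesheet href is relative to the page.
css_url = urljoin(page_url, './css/movie.css')
# The font src is relative to the stylesheet.
woff_url = urljoin(css_url, '../font/movie.woff')

print(css_url)
print(woff_url)
```

Joining against css_url instead of concatenating strings keeps working if the stylesheet later moves to another directory.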


The font file path is therefore www.porters.vip/confusion/f… The Python code is as follows:

import re
import requests
from parsel import Selector
from urllib import parse
from fontTools.ttLib import TTFont

url = 'http://www.porters.vip/confusion/movie.html'
resp = requests.get(url)
sel = Selector(resp.text)
# Extract the paths of all CSS files loaded by the page
css_path = sel.css('link[rel=stylesheet]::attr(href)').extract()
woffs = []
for c in css_path:
   # Build the absolute URL of the CSS file
   css_url = parse.urljoin(url, c)
   # Request the CSS file
   css_resp = requests.get(css_url)
   # Match the WOFF file path in the CSS file (the part after the leading "..")
   woff_path = re.findall(r"src:url\('\.\.(.*\.woff)'\) format\('woff'\);",
                          css_resp.text)
   if woff_path:
       # If a WOFF path was found, add it to the woffs list
       woffs += woff_path
woff_url = 'http://www.porters.vip/confusion' + woffs.pop()
woff = requests.get(woff_url)
filename = 'target.woff'
with open(filename, 'wb') as f:
   # Save the file locally
   f.write(woff.content)
# Use TTFont to open the WOFF file that was just downloaded
font = TTFont(filename)


Because TTFont can read the structure of a WOFF file directly, there is no need to save the WOFF file as XML here. Next, take the rating characters &#xe624.&#xe9c7 as an example. Add the reference glyph data base_font to the code above, and then add the following code:

import hashlib

web_code = '&#xe624.&#xe9c7'
# Convert the character references to glyph names, e.g. &#xe624 -> uniE624
woff_code = [i.upper().replace('&#X', 'uni') for i in web_code.split('.')]
result = []
for w in woff_code:
   # Look up the glyph with this name in the font file
   content = font['glyf'].glyphs.get(w).data
   # MD5 digest of the glyph data
   glyph = hashlib.md5(content).hexdigest()
   for b in base_font.get('font'):
       # Compare with the MD5 values of the reference glyphs; on a match, take the text the glyph describes
       if b.get('hex') == glyph:
           result.append(b.get('value'))
           break
# Print the mapping result
print(result)


The code above produces the following output:

['9', '7']

The output shows that the text described by the glyphs in the font file is mapped correctly.
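The result list drops the decimal point, so the displayed number still has to be rebuilt. The sketch below is one possible way to do that and is not part of the original code: the helper name decode is hypothetical, and for brevity the lookup is keyed by glyph name using the name and value pairs from base_font, which only works for this one font file. Across different font files the MD5 lookup shown above would take its place.

```python
import re

# Glyph name -> digit, taken from the name/value pairs in base_font.
# Keying by name is a shortcut that is valid for movie.woff only.
NAME_TO_DIGIT = {
    'uniEE76': '0', 'uniF57B': '1', 'uniE7DF': '2', 'uniF19A': '3', 'uniF593': '4',
    'uniEA16': '5', 'uniE339': '6', 'uniE9C7': '7', 'uniEFD4': '8', 'uniE624': '9',
}

def decode(web_code: str) -> str:
    """Replace each hexadecimal character reference with its digit; keep other characters."""
    def repl(match):
        name = 'uni' + match.group(1).upper()
        return NAME_TO_DIGIT.get(name, '?')  # '?' marks an unknown glyph
    return re.sub(r'&#x([0-9a-fA-F]+);?', repl, web_code)

print(decode('&#xe624.&#xe9c7'))
print(decode('&#xea16&#xe339.&#xefd4&#xf19a'))
```

The two calls reproduce the rating 9.7 and the box office figure 56.83 that the page displays.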

6.4.4 Summary

Font anti-crawlers can cause crawler engineers a lot of trouble. Although crawler engineers have found a way around them, the solution depends on strict conditions: if developers change the font files frequently, or prepare several font files and switch between them at random, crawler engineers face a real headache. That said, such measures are not easy for developers to implement either.

Book launch offers

Python3 Anti-Crawler Principles and Bypass in Practice is finally about to be released! To thank you for your anticipation and support of Wei Shidong and the book, there will be many book giveaways and limited-time discounts during the launch.


Reprint instructions

This article is excerpted from the book “Python3 Anti-Crawler Principles and Bypass in Practice”. Friends and colleagues are welcome to reprint it.

Please include the copyright information when reprinting 😊.