Python crawler tutorial 63-100: Python font anti-crawling

Background

Anti-crawling is a big category, and quite a few websites are involved. The sites that crawler writers currently pick on most often include Maoyan Movies (Cat's Eye), Autohome, Dianping, 58 Tongcheng, Tianyancha, and plenty more. With thousands of skilled engineers out there, there is always a variety of anti-crawling techniques. For crawler writers the answer is simple: just get on with it. It's 996 anyway.

Since this is a series of articles, Maoyan Movies (Cat's Eye) is once again the site we "learn" from. Why? Because it is a typical case.

Maoyan Movies

Open Maoyan Pro in Google Chrome and follow the usual routine: open the developer tools and inspect the DOM nodes.

piaofang.maoyan.com/?ver=normal

Note that every digit in the DOM structure is displayed as a square.

A primer on font-based anti-crawling

Font-based anti-crawling is a common technique. The website uses a custom font file, so the page displays normally in the browser, but the data a crawler downloads is either garbled or replaced by other characters. Custom font files are a CSS3 feature; front-end developers will know it as the @font-face rule.
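
To see what the crawler actually receives, here is a minimal sketch that uses only the standard library. The helper name describe is made up for illustration, and the two entities are taken from the glyph names that appear later in this article; the exact markup on the live page may differ.

```python
import html
import unicodedata

def describe(entity_text):
    """Unescape HTML entities and report each resulting code point."""
    result = []
    for ch in html.unescape(entity_text):
        # Category "Co" means "private use": no standard font defines a
        # glyph for it, so without the site's custom font it is drawn
        # as a square.
        result.append((hex(ord(ch)), unicodedata.category(ch)))
    return result

# Two obfuscated digits as they might appear in the page source
print(describe("&#xe481;&#xe0aa;"))
```

Both code points fall in the Unicode private use area, which is why the text looks like garbage unless the matching font file is available.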

Collecting the key material for cracking it

Find the font-family property on one of these nodes and check its value: it is set to a font named cs, which is obviously a custom font. Then search the page source for cs.

The screenshot above shows that the font is declared in WOFF format.

Web Open Font Format (WOFF) is a font format standard for use in web pages. It was developed in 2009, and WOFF 1.0 became a W3C Recommendation in 2012 through the WebFonts Working Group. The format uses compression to reduce file size, contains no encryption, and is DRM-free.

Decoding operation

import base64

font_face = "d09GRgABAAAAAAggAAsAAAAAC7gAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABHU1VCAAABCAAAADMAAABCsP6z7U9TLzIAAAE8AAAARAAAAFZW7laVY21hcAA AAYAAAAC8AAACTA/VLRxnbHlmAAACPAAAA5EAAAQ0l9+jTWhlYWQAAAXQAAAALwAAADYUwblKaGhlYQAABgAAAAAcAAAAJAeKAzlobXR4AAAGHAAAABIAAAA wGhwAAGxvY2EAAAYwAAAAGgAAABoF2gTmbWF4cAAABkwAAAAfAAAAIAEZADxuYW1lAAAGbAAAAVcAAAKFkAhoC3Bvc3QAAAfEAAAAXAAAAI/gSKzLeJxjYGR gYOBikGPQYWB0cfMJYeBgYGGAAJAMY05meiJQDMoDyrGAaQ4gZoOIAgCKIwNPAHicY2Bk0mWcwMDKwMHUyXSGgYGhH0IzvmYwYuRgYGBiYGVmwAoC0lxTGBw YKr7LMev812GIYdZhuAIUZgTJAQDZjgsneJzFkj0OgzAMhV8KpT906NiJE3ThUIgrsLL0BD1Fxk5dOAC3iEgkJEYWRvoSs1SCtXX0RbId+Vl2AOwBROROYkC 9oeDtxagK8QjnEI/xoH/DlZEjKpMb3Vnbuto1fTkUo56yeeaL7cyaKVZcOz6TUOlE9R0O7DOlqu8w2aj0A1P/k/62S7ifi5eSaoEtmlzg/GC04HfcWYEzhW0 Fv1tXC5wzXCNw4uhLwf+RoRC81qgF7gNTJiD+ANtoRPR4nEWTz2/aZhzG39dUOCWEkGHjQlrAmNgGkuDY2ARwDMWBNj8ZCRBCWhqiltJsbbOo6dI22lr2Q2q n/QHdZdIOu1Q79N5J03raOrU59A+o1Otum9RLRPbaIZkPr/S+0vs+n+f7PAYQgMO/gQgIgAGQkEjCR/AAfdBcDrGXwAWAS6ZJhwW34owGE0oCLTG4z+jTksv TtwaHnP60L0tjtyr5UPPeg2z9k0hL3b2dvMSiJzDznQPsL2ADAwDQMi1DaUgiGZIbskC9+ycsXGw2a++eleB+Vyg9O0Bnvx7dO/wXA9gbwIAYIvNBSUS6Gpy Ccc6KW5kgK8cVSfRBknBAJsixHIyzTNBKEpRbVL7rV4VImnNYceiJjSZW73+5Mb2jpu8WK3HFBttLk+lqOHKv+Isqj2iyVxnuO2WNeL0PN29+M/d958lPlfF YBabnVxuLhXB05f957CAeO3LBDDkgLpuTkOBOLdDmZyaH+f4kJvhUZyUoegTq6A7ycAr7Hfh7DhQTEedcNEnjGjpwk4ThBdF/a5tRsrWqHtWJ5Ty82n3PBaa ZxqNk/vONKa3vZT638bTK+m1wq/ybm3p0ff3iijJZP+b6gLhCAIyQdDyhWQysYyUNGhpWHPGiBOGHLtdvG+aTbKpIhufUzDysn959vUtHCV3gReqjvnLZ7/P EYnJAmD03eW1mtmBr3diujC2IVIanx85QAz1f/6BuvAHRE18cksMTlKjIPWElgdKhfBBpGxkZgXGdwQuKVuHCqjdkcyRXM4o0bas5k6lySpyQxYnMhcftK3u n/5jLVfc43rYA01NCRssN1mMT3jO19Tn34KXC5a+26uC4H7CLGAJgFCGxJoDhk+zN1WgF6oiJ4aYgYXIiuqAV/mAnQ/FIIELZBwJr0spe6mru1pN5/bOKItu 7T7k8q5SKd8uYO06NUP7kuWVlYrzT0u9M/fhiv7EkjJe7r0Yr0frCzEoVWE56SqCUx9C/YvTSzNW0jaJF+wThlkQjk6DVQrgptFGOds8/3XqxvZnLd96ezxa EXFxgaL11/mxwJBgOSGS4/EUJfs1vfnzj9nybd1/JXd7T1Gah8XM8E/A39Gz3MZcnXCTBPVwqnczkoMcCXKgL0DTfa4DRM0QiKk6ORbOKeLztxe30WafT7hi 
+VryuFuql+8sR/kFoDDY7s4vltUhWvZlpcYvLs7VXz+/swPV0SsqB/wAGjODCAAAAeJxjYGRgYADixSuWzY3nt/nKwM3CAAI3LlqdRND/37AwMJ0HcjkYmEC iAGAmDGEAeJxjYGRgYNb5r8MQw8IAAkCSkQEV8AAAM2IBzXicY2EAghQGBiYd4jAAN4wCNQAAAAAAAAAMADAATACUAK4A4AEaAVwBoAHmAhoAAHicY2BkYGD gYTBgYGYAASYg5gJCBob/YD4DAA6DAVYAeJxlkbtuwkAURMc88gApQomUJoq0TdIQzEOpUDokKCNR0BuzBiO/tF6QSJcPyHflE9Klyyekz2CuG8cr7547M3d 9JQO4xjccnJ57vid2cMHqxDWc40G4Tv1JuEF+Fm6ijRfhM+oz4Ra6eBVu4wZvvMFpXLIa40PYQQefwjVc4Uu4Tv1HuEH+FW7i1mkKn6Hj3Am3sHC6wm08Ou8 tpSZGe1av1PKggjSxPd8zJtSGTuinyVGa6/Uu8kxZludCmzxMEzV0B6U004k25W35fj2yNlCBSWM1paujKFWZSbfat+7G2mzc7weiu34aczzFNYGBhgfLfcV 6iQP3ACkSaj349AxXSN9IT0j16JepOb01doiKbNWt1ovippz6sVYYwsXgX2rGVFIkq7Pl2PNrI6qW6eOshj0xaSq9mpNEZIWs8LZUfOouNkVXxp/d5woqebe YIf4D2J1ywQB4nG2KOxKAIBBDN/hBEe8ioKAlKt7Fxs4Zj++4tKZ5k7yQoBxF/9EQKFCiQg2JBi0UOmj0hEfe15nG2TCHGD8ewSTuwYe8u+zHdWdv8y/Z5Jh uW5jRT0QvGVQXkQ=="

print(len(font_face))

b = base64.b64decode(font_face)
with open('font.ttf', 'wb') as f:
    f.write(b)
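
As a quick sanity check, the container format can be read from the first four bytes of the decoded data. The sketch below is an illustration only; sniff_font_format is a hypothetical helper, and the short base64 string is just the first 12 characters of the font_face value above.

```python
import base64

def sniff_font_format(data: bytes) -> str:
    """Guess the container format from the first four bytes of a font file."""
    signatures = {
        b"wOFF": "WOFF",
        b"wOF2": "WOFF2",
        b"\x00\x01\x00\x00": "TrueType",
        b"OTTO": "OpenType (CFF)",
    }
    return signatures.get(data[:4], "unknown")

# First 12 characters of the font_face string above
head = base64.b64decode("d09GRgABAAAA")
print(sniff_font_format(head))  # WOFF
```

So the payload is a WOFF file, even though the code above saves it under a .ttf name.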

There are three ways to process the TTF file. The first is to open it directly in FontCreator. The second is to manipulate it with fontTools, a third-party Python library. The third is to use Baidu fontstore at fontstore.baidu.com/static/edit…

FontCreator is the easiest of these to get hold of.

You can search for it on Baidu yourself, or download it directly from my Baidu netdisk:

Link: pan.baidu.com/s/1ZyWwk37h… Extraction code: KK2H

After installation you can use the trial version directly, or "harmonize" it in the usual way. Opening the font file in it gives the following mapping from glyph name to digit:

'uniE481': '7',
'uniE0AA': '4', 
'uniF71E': '9', 
'uniE767': '1', 
'uniE031': '5', 
'uniE4BD': '2',
'uniF2AA': '3',
'uniE2E3': '6', 
'uniE3C9': '8', 
'uniEA65': '0'
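
With the mapping written down, decoding a piece of page text is a plain string replacement. This is a minimal sketch for this one font file only; the helper names and the sample string are made up, and it assumes the page writes each digit as a lowercase hexadecimal entity such as &#xe481;.

```python
# Mapping read off in the font editor (glyph name -> digit), as listed above
origin_dict = {
    'uniE481': '7', 'uniE0AA': '4', 'uniF71E': '9', 'uniE767': '1',
    'uniE031': '5', 'uniE4BD': '2', 'uniF2AA': '3', 'uniE2E3': '6',
    'uniE3C9': '8', 'uniEA65': '0',
}

def glyph_name_to_entity(name):
    """'uniE481' -> '&#xe481;' (the form assumed to appear in the page source)."""
    return "&#x" + name[3:].lower() + ";"

def decode_digits(text, mapping):
    """Replace every known entity in text with the digit it stands for."""
    for name, digit in mapping.items():
        text = text.replace(glyph_name_to_entity(name), digit)
    return text

sample = "&#xf2aa;&#xe2e3;&#xf71e;"  # hypothetical snippet of page source
print(decode_digits(sample, origin_dict))  # 369
```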

With this mapping, the figure 369 million decodes correctly.

Writing code to decode the font

Some pages use several sets of fonts, which raises the cost of getting past the anti-crawling measure; those cases are left for you to research yourself.

Using fontTools, you can retrieve the glyph object for each character, which you can simply think of as holding the shape information of that character. The glyph's code serves as the ID of this object, with a one-to-one correspondence. On sites like Maoyan Movies (Cat's Eye), the code assigned to each character changes from one font file to the next, but the shape of the character, that is, the glyph object, stays the same.

Parsing font files using fontTools

Install fonttools

pip install fonttools

Reference: darknode.in/font/font-t…

Basic usage

from fontTools.ttLib import TTFont

font = TTFont('font.ttf')
font.saveXML('01.xml')

Open the XML file

Note that the id here is only the glyph's sequence number in the file; do not mistake it for the digit the glyph represents. Keep this in mind when writing the code.

Summary of Maoyan's font anti-crawling

In practice you will find that on Maoyan Movies (Cat's Eye) the character codes change on every refresh, but the glyph objects, that is, the points that make up each character's shape, stay the same.

base_font.ttf
online_font.ttf
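
The idea of matching by shape can be sketched without any font library. In the toy example below every name and coordinate is made up: each font is reduced to a dict from glyph name to a tuple of outline points, and how those points are read out of a real font file is left aside.

```python
def remap_by_shape(base_outlines, online_outlines, base_digits):
    """Build {online glyph name: digit} by matching identical outlines.

    base_outlines / online_outlines: {glyph name: outline}, where an outline
    is any hashable description of the shape (here a tuple of points).
    base_digits: {base glyph name: digit}, written by hand once.
    """
    shape_to_digit = {
        outline: base_digits[name] for name, outline in base_outlines.items()
    }
    return {
        name: shape_to_digit[outline]
        for name, outline in online_outlines.items()
        if outline in shape_to_digit
    }

# Toy data: the names differ between the two fonts, the shapes do not.
base = {"uniE481": ((0, 0), (10, 700)), "uniE0AA": ((5, 5), (300, 20))}
online = {"uniF001": ((5, 5), (300, 20)), "uniF002": ((0, 0), (10, 700))}
print(remap_by_shape(base, online, {"uniE481": "7", "uniE0AA": "4"}))
```

Indexing the base font by shape first means each online glyph needs one dictionary lookup instead of a comparison against every base glyph.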

Get font files for the first time

from fontTools.ttLib import TTFont

# The font file has already been downloaded locally
base_font = TTFont('font.ttf')  # open the local font file

# Get all glyph names, skipping the first two entries, which are not digits (see the figure above)
base_uni_list = base_font.getGlyphOrder()[2:]

# Hand-written mapping from glyph name to digit for the first font file
origin_dict = {'uniE481': '7', 'uniE0AA': '4', 'uniF71E': '9', 'uniE767': '1', 'uniE031': '5', 'uniE4BD': '2', 'uniF2AA': '3', 'uniE2E3': '6', 'uniE3C9': '8', 'uniEA65': '0'}

Get online fonts

Fetch the font that the site serves after a refresh

import base64
import re

# Get the base64 payload of the font embedded in the page.
# `response` is the page HTML as a string (obtained with requests, see below).
online_ttf_base64 = re.findall(r"base64,(.*)\) format", response)[0]
online_base64_info = base64.b64decode(online_ttf_base64)
with open('online_font.ttf', 'wb') as f:
    f.write(online_base64_info)
online_font = TTFont('online_font.ttf')  # font file fetched dynamically from the site

online_uni_list = online_font.getGlyphOrder()[2:]

for uni2 in online_uni_list:
    obj2 = online_font['glyf'][uni2]  # glyph object for uni2 in online_font
    for uni1 in base_uni_list:
        obj1 = base_font['glyf'][uni1]  # glyph object for uni1 in base_font
        if obj1 == obj2:  # the two glyphs have the same shape
            dd = "&#x" + uni2[3:].lower() + ';'  # entity form used in the page source
            if dd in response:  # replace the entity with the digit from origin_dict
                response = response.replace(dd, origin_dict[uni1])
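
The base64 extraction step in the code above can be tried in isolation. The sketch below is a slightly stricter variant, not the code used in this article: extract_font_bytes is a hypothetical helper, the pattern only accepts base64 characters so it cannot run past the end of the payload, and sample_css is an invented @font-face rule whose layout is assumed to resemble the real page.

```python
import base64
import re

def extract_font_bytes(page_source):
    """Return the decoded bytes of the first base64 font in the page, or None."""
    match = re.search(r"base64,([A-Za-z0-9+/=]+)\)\s*format", page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))

# Invented example; the payload is only the first few bytes of a WOFF header.
sample_css = (
    "@font-face{font-family:'cs';"
    "src:url(data:application/font-woff;charset=utf-8;base64,d09GRgABAAAA) "
    "format('woff');}"
)
print(extract_font_bytes(sample_css)[:4])  # b'wOFF'
```

Returning None when nothing matches avoids the IndexError that indexing an empty findall result with [0] would raise on a page without an embedded font.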

The requests module is used to obtain response

import chardet
import requests

url = 'https://piaofang.maoyan.com/?ver=normal'
headers = {
    'User-Agent': 'Browser UA',  # replace with your browser's User-Agent string
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
response = requests.get(url=url, headers=headers).content  # get bytes
charset = chardet.detect(response).get('encoding')  # detect the encoding
response = response.decode(charset, "ignore")  # decode to get a string


Running Result Display