Background
Anti-crawling is a broad field, and quite a few websites are involved. The sites most often targeted by crawler developers at the moment include Maoyan Movies (Cat's Eye), Autohome, Dianping, 58.com (58 Tongcheng), Tianyancha, and plenty of others. With so many skilled engineers around, there is always a new variety of anti-crawling technique. For a crawler developer the answer is simple: just get on with it. It is 996 anyway.
Since this is a series of articles, it is inevitable that Maoyan Movies is picked as the "learning" target once again. Why? Because it is typical.
Maoyan film and TV
Open Maoyan Pro and go through the usual routine: Google Chrome, developer tools, inspect the DOM nodes.
piaofang.maoyan.com/?ver=normal
Note that every position in the DOM where a number should appear shows a square box instead.
Font-based anti-crawling primer
Font-based anti-crawling is a common technique. The website uses a customized font file, so the page displays normally in the browser, but the data a crawler pulls down is either garbled or turned into other characters. Custom font files are a CSS3 feature; front-end developers will know it as the @font-face rule.
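To make the effect concrete, here is a minimal sketch of what a crawler sees. It assumes the page writes each digit as an HTML numeric entity in the Unicode Private Use Area; the entity values are borrowed from the glyph names that appear later in this article, and the fragment itself is hypothetical.

```python
import html
import unicodedata


def describe_scraped_text(fragment):
    """Unescape HTML entities and report each character's Unicode category."""
    text = html.unescape(fragment)
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# Hypothetical fragment: two obfuscated digits followed by a plain full stop.
sample = "&#xe481;&#xe0aa;."
for code_point, category in describe_scraped_text(sample):
    # "Co" means "private use": the character has no meaning without the site's font.
    print(code_point, category)
```

Characters in category Co have no standard glyph, which is why they render as boxes in the developer tools and as garbage in scraped text.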
Collecting the key material for cracking it
Find the font-family property and look at its value: it is set to a font named cs, which is clearly a custom font. Search the page source for cs.
The screenshot above shows a font in WOFF format.
Web Open Font Format (WOFF) is a font format standard for use in web pages. It was developed in 2009 and became a W3C Recommendation in December 2012, through the WebFonts Working Group of the World Wide Web Consortium. The format uses compression to reduce file size, contains no encryption, and is DRM-free.
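A quick way to confirm that an embedded payload really is WOFF is to read its 12-byte header prefix. The sketch below is an illustration rather than part of the original workflow; `read_woff_header` is a hypothetical helper, and the input is just the first 16 characters of the base64 string used in the decoding step of this article.

```python
import base64
import struct


def read_woff_header(data):
    """Return (signature, flavor, total_length) from the first 12 bytes of a WOFF file."""
    # Big-endian: 4-byte signature, then two unsigned 32-bit integers.
    signature, flavor, length = struct.unpack(">4sII", data[:12])
    return signature, flavor, length


# First 16 base64 characters of the embedded font (decodes to exactly 12 bytes).
prefix = "d09GRgABAAAAAAgg"
signature, flavor, length = read_woff_header(base64.b64decode(prefix))
print(signature, hex(flavor), length)
```

A signature of `wOFF` identifies the container, and a flavor of 0x00010000 indicates TrueType outlines, which is why the file can later be read through its glyf table.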
Decoding operation
import base64
font_face = "d09GRgABAAAAAAggAAsAAAAAC7gAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABHU1VCAAABCAAAADMAAABCsP6z7U9TLzIAAAE8AAAARAAAAFZW7laVY21hcAA AAYAAAAC8AAACTA/VLRxnbHlmAAACPAAAA5EAAAQ0l9+jTWhlYWQAAAXQAAAALwAAADYUwblKaGhlYQAABgAAAAAcAAAAJAeKAzlobXR4AAAGHAAAABIAAAA wGhwAAGxvY2EAAAYwAAAAGgAAABoF2gTmbWF4cAAABkwAAAAfAAAAIAEZADxuYW1lAAAGbAAAAVcAAAKFkAhoC3Bvc3QAAAfEAAAAXAAAAI/gSKzLeJxjYGR gYOBikGPQYWB0cfMJYeBgYGGAAJAMY05meiJQDMoDyrGAaQ4gZoOIAgCKIwNPAHicY2Bk0mWcwMDKwMHUyXSGgYGhH0IzvmYwYuRgYGBiYGVmwAoC0lxTGBw YKr7LMev812GIYdZhuAIUZgTJAQDZjgsneJzFkj0OgzAMhV8KpT906NiJE3ThUIgrsLL0BD1Fxk5dOAC3iEgkJEYWRvoSs1SCtXX0RbId+Vl2AOwBROROYkC 9oeDtxagK8QjnEI/xoH/DlZEjKpMb3Vnbuto1fTkUo56yeeaL7cyaKVZcOz6TUOlE9R0O7DOlqu8w2aj0A1P/k/62S7ifi5eSaoEtmlzg/GC04HfcWYEzhW0 Fv1tXC5wzXCNw4uhLwf+RoRC81qgF7gNTJiD+ANtoRPR4nEWTz2/aZhzG39dUOCWEkGHjQlrAmNgGkuDY2ARwDMWBNj8ZCRBCWhqiltJsbbOo6dI22lr2Q2q n/QHdZdIOu1Q79N5J03raOrU59A+o1Otum9RLRPbaIZkPr/S+0vs+n+f7PAYQgMO/gQgIgAGQkEjCR/AAfdBcDrGXwAWAS6ZJhwW34owGE0oCLTG4z+jTksv TtwaHnP60L0tjtyr5UPPeg2z9k0hL3b2dvMSiJzDznQPsL2ADAwDQMi1DaUgiGZIbskC9+ycsXGw2a++eleB+Vyg9O0Bnvx7dO/wXA9gbwIAYIvNBSUS6Gpy Ccc6KW5kgK8cVSfRBknBAJsixHIyzTNBKEpRbVL7rV4VImnNYceiJjSZW73+5Mb2jpu8WK3HFBttLk+lqOHKv+Isqj2iyVxnuO2WNeL0PN29+M/d958lPlfF YBabnVxuLhXB05f957CAeO3LBDDkgLpuTkOBOLdDmZyaH+f4kJvhUZyUoegTq6A7ycAr7Hfh7DhQTEedcNEnjGjpwk4ThBdF/a5tRsrWqHtWJ5Ty82n3PBaa ZxqNk/vONKa3vZT638bTK+m1wq/ybm3p0ff3iijJZP+b6gLhCAIyQdDyhWQysYyUNGhpWHPGiBOGHLtdvG+aTbKpIhufUzDysn959vUtHCV3gReqjvnLZ7/P EYnJAmD03eW1mtmBr3diujC2IVIanx85QAz1f/6BuvAHRE18cksMTlKjIPWElgdKhfBBpGxkZgXGdwQuKVuHCqjdkcyRXM4o0bas5k6lySpyQxYnMhcftK3u n/5jLVfc43rYA01NCRssN1mMT3jO19Tn34KXC5a+26uC4H7CLGAJgFCGxJoDhk+zN1WgF6oiJ4aYgYXIiuqAV/mAnQ/FIIELZBwJr0spe6mru1pN5/bOKItu 7T7k8q5SKd8uYO06NUP7kuWVlYrzT0u9M/fhiv7EkjJe7r0Yr0frCzEoVWE56SqCUx9C/YvTSzNW0jaJF+wThlkQjk6DVQrgptFGOds8/3XqxvZnLd96ezxa EXFxgaL11/mxwJBgOSGS4/EUJfs1vfnzj9nybd1/JXd7T1Gah8XM8E/A39Gz3MZcnXCTBPVwqnczkoMcCXKgL0DTfa4DRM0QiKk6ORbOKeLztxe30WafT7hi 
+VryuFuql+8sR/kFoDDY7s4vltUhWvZlpcYvLs7VXz+/swPV0SsqB/wAGjODCAAAAeJxjYGRgYADixSuWzY3nt/nKwM3CAAI3LlqdRND/37AwMJ0HcjkYmEC iAGAmDGEAeJxjYGRgYNb5r8MQw8IAAkCSkQEV8AAAM2IBzXicY2EAghQGBiYd4jAAN4wCNQAAAAAAAAAMADAATACUAK4A4AEaAVwBoAHmAhoAAHicY2BkYGD gYTBgYGYAASYg5gJCBob/YD4DAA6DAVYAeJxlkbtuwkAURMc88gApQomUJoq0TdIQzEOpUDokKCNR0BuzBiO/tF6QSJcPyHflE9Klyyekz2CuG8cr7547M3d 9JQO4xjccnJ57vid2cMHqxDWc40G4Tv1JuEF+Fm6ijRfhM+oz4Ra6eBVu4wZvvMFpXLIa40PYQQefwjVc4Uu4Tv1HuEH+FW7i1mkKn6Hj3Am3sHC6wm08Ou8 tpSZGe1av1PKggjSxPd8zJtSGTuinyVGa6/Uu8kxZludCmzxMEzV0B6U004k25W35fj2yNlCBSWM1paujKFWZSbfat+7G2mzc7weiu34aczzFNYGBhgfLfcV 6iQP3ACkSaj349AxXSN9IT0j16JepOb01doiKbNWt1ovippz6sVYYwsXgX2rGVFIkq7Pl2PNrI6qW6eOshj0xaSq9mpNEZIWs8LZUfOouNkVXxp/d5woqebe YIf4D2J1ywQB4nG2KOxKAIBBDN/hBEe8ioKAlKt7Fxs4Zj++4tKZ5k7yQoBxF/9EQKFCiQg2JBi0UOmj0hEfe15nG2TCHGD8ewSTuwYe8u+zHdWdv8y/Z5Jh uW5jRT0QvGVQXkQ=="
print(len(font_face))
b = base64.b64decode(font_face)
# The payload is WOFF; the .ttf name is kept because later code opens 'font.ttf'
with open('font.ttf', 'wb') as f:
    f.write(b)
There are three ways to process TTF files. The first is to open the file directly in FontCreator. The second is to manipulate it with fontTools, a third-party Python library. The third is to use Baidu's online font editor at fontstore.baidu.com/static/edit…
FontCreator is fairly easy to find.
You can search for it on Baidu yourself, or download it directly from my Baidu disk.
Link: pan.baidu.com/s/1ZyWwk37h… Extraction code: KK2H
After installation, you can use the trial version directly, or "harmonize" it in the usual way.
'uniE481': '7',
'uniE0AA': '4',
'uniF71E': '9',
'uniE767': '1',
'uniE031': '5',
'uniE4BD': '2',
'uniF2AA': '3',
'uniE2E3': '6',
'uniE3C9': '8',
'uniEA65': '0'
Checking this mapping against the page, the figure 369 million comes out correctly.
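The mapping above can be applied to scraped text with a small substitution function. This is a minimal sketch: it assumes digits appear in the page source as lowercase `&#xXXXX;` entities (the form used by the matching code later in this article), and `decode_entities` is a hypothetical helper name.

```python
import re

# Mapping read off the glyphs in FontCreator (valid only for this one font file).
GLYPH_TO_DIGIT = {
    'uniE481': '7', 'uniE0AA': '4', 'uniF71E': '9', 'uniE767': '1', 'uniE031': '5',
    'uniE4BD': '2', 'uniF2AA': '3', 'uniE2E3': '6', 'uniE3C9': '8', 'uniEA65': '0',
}

ENTITY = re.compile(r"&#x([0-9a-fA-F]{4});")


def decode_entities(text, mapping=GLYPH_TO_DIGIT):
    """Replace each known &#xXXXX; entity with its digit; leave unknown ones untouched."""
    def substitute(match):
        glyph_name = "uni" + match.group(1).upper()
        return mapping.get(glyph_name, match.group(0))
    return ENTITY.sub(substitute, text)


print(decode_entities("&#xf2aa;&#xe2e3;&#xf71e;"))
```

Unknown entities are left as they are, so a stale mapping shows up as leftover entities instead of silently wrong digits.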
Start writing code to decode the font
Some pages nest several sets of fonts, which raises the cost of getting around the protection; those cases are left for you to study on your own.
With fontTools you can retrieve the glyph object for each character, which you can simply think of as holding the character's shape information. The character's code serves as the ID of that object, in a one-to-one correspondence. On sites like Maoyan Movies that rotate several sets of fonts, the code assigned to each character changes, but the character's shape, and therefore the object, stays the same.
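Because the code is what ties a glyph to the text in the page, it helps to be able to convert between the glyph name and the forms that code takes elsewhere. The two helpers below are hypothetical names introduced for illustration; they assume glyph names follow the `uniXXXX` pattern seen in this font and that the page source uses lowercase hexadecimal entities.

```python
def glyph_name_to_entity(name):
    """'uniE481' -> '&#xe481;' (the form the character takes in the page source)."""
    if not name.startswith("uni"):
        raise ValueError(f"not a uniXXXX glyph name: {name!r}")
    return "&#x" + name[3:].lower() + ";"


def glyph_name_to_char(name):
    """'uniE481' -> the single character U+E481."""
    if not name.startswith("uni"):
        raise ValueError(f"not a uniXXXX glyph name: {name!r}")
    return chr(int(name[3:], 16))


print(glyph_name_to_entity("uniE481"))
print(hex(ord(glyph_name_to_char("uniE481"))))
```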
Parsing font files using fontTools
Install fonttools
pip install fonttools
darknode.in/font/font-t…
Basic usage
from fontTools.ttLib import TTFont
font = TTFont('font.ttf')
font.saveXML('01.xml')
Open the XML file
Note that the id shown in the XML is only the glyph's sequence number; it is not the digit that the glyph displays.
Keep this in mind when writing the code, so that the two are never mixed up.
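The difference between a glyph's id and the digit it draws is easy to see by listing the GlyphOrder entries. This is a minimal sketch using only the standard library; the XML snippet is a hypothetical, heavily trimmed stand-in for 01.xml, and the order of names in the real file may differ.

```python
import xml.etree.ElementTree as ET


def read_glyph_order(xml_text):
    """Return [(index, glyph_name), ...] from the GlyphOrder table of an XML font dump."""
    root = ET.fromstring(xml_text)
    return [(int(node.get("id")), node.get("name"))
            for node in root.iter("GlyphID")]


# Hypothetical sample: two non-digit glyphs first, then a digit glyph.
sample = """<ttFont>
  <GlyphOrder>
    <GlyphID id="0" name="glyph00000"/>
    <GlyphID id="1" name="x"/>
    <GlyphID id="2" name="uniE481"/>
  </GlyphOrder>
</ttFont>"""

for index, name in read_glyph_order(sample):
    print(index, name)
```

In this sample uniE481 has id 2, while the mapping built earlier says it draws the digit 7. For the real file, the same function could be fed the contents of 01.xml.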
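Summary of Maoyan's font anti-crawling
In practice you will find that on Maoyan Movies the character codes change with every refresh, but the glyph objects, that is, the outline points of each character, stay the same. The approach is therefore to keep a base font whose mapping is known and to match the font currently served online against it by glyph shape.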
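The matching idea can be sketched without any font files by letting plain lists of points stand in for glyph outlines. Everything here is illustrative: `remap_digits` is a hypothetical helper, the coordinates are made up, and in real use the outlines would come from the glyph objects of the two fonts.

```python
def remap_digits(base_shapes, base_digits, online_shapes):
    """Map each online glyph name to a digit by finding the base glyph with the same outline."""
    # Index the known font by shape so each online glyph needs only one lookup.
    shape_to_digit = {tuple(points): base_digits[name]
                      for name, points in base_shapes.items()}
    return {name: shape_to_digit[tuple(points)]
            for name, points in online_shapes.items()
            if tuple(points) in shape_to_digit}


# Toy outlines (hypothetical coordinates, not taken from the real fonts).
base_shapes = {"uniE481": [(0, 0), (10, 0), (5, 20)], "uniE0AA": [(0, 0), (8, 8)]}
base_digits = {"uniE481": "7", "uniE0AA": "4"}
online_shapes = {"uniF000": [(0, 0), (8, 8)], "uniF111": [(0, 0), (10, 0), (5, 20)]}

print(remap_digits(base_shapes, base_digits, online_shapes))
```

Indexing by shape avoids comparing every online glyph with every base glyph, though with only ten digits either approach is fast enough.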
base_font.ttf
online_font.ttf
Get the font file for the first time
Process the font that has already been downloaded locally
base_font = TTFont('font.ttf')  # Open the local font file
base_uni_list = base_font.getGlyphOrder()[2:]  # Get all glyph names; skip the first two, which are not digits (see the glyph list above)
# Record which digit each glyph name in the first font file stands for
origin_dict = {'uniE481': '7', 'uniE0AA': '4', 'uniF71E': '9', 'uniE767': '1', 'uniE031': '5', 'uniE4BD': '2', 'uniF2AA': '3', 'uniE2E3': '6', 'uniE3C9': '8', 'uniEA65': '0'}
Get online fonts
Fetch the font that the site serves after a refresh
import re

# Get the base64 encoding of the font file
online_ttf_base64 = re.findall(r"base64,(.*)\) format", response)[0]
online_base64_info = base64.b64decode(online_ttf_base64)
with open('online_font.ttf', 'wb') as f:
    f.write(online_base64_info)
online_font = TTFont('online_font.ttf')  # The font file fetched dynamically from the page
online_uni_list = online_font.getGlyphOrder()[2:]
for uni2 in online_uni_list:
    obj2 = online_font['glyf'][uni2]  # Glyph object for uni2 in online_font
    for uni1 in base_uni_list:
        obj1 = base_font['glyf'][uni1]  # Glyph object for uni1 in base_font
        if obj1 == obj2:  # The two glyphs have the same shape
            dd = "&#x" + uni2[3:].lower() + ';'  # The entity as it appears in the page source
            if dd in response:  # Replace the entity with the digit recorded in origin_dict
                response = response.replace(dd, origin_dict[uni1])
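The base64 extraction step above can be isolated and checked on its own. The sketch below is a variation, not the article's exact code: it swaps in a non-greedy pattern so that a second base64 font later in the page would not be swallowed, and it returns None instead of raising when nothing matches. The style block is hypothetical, and its payload is only the first 12 bytes of a WOFF header.

```python
import base64
import re

# Non-greedy variant of the article's pattern (an assumption about the page layout).
FONT_PATTERN = re.compile(r"base64,(.*?)\)\s*format")


def extract_font_bytes(page_source):
    """Return the decoded bytes of the first embedded base64 font, or None if absent."""
    match = FONT_PATTERN.search(page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


# Hypothetical style block for illustration only.
sample = ('@font-face{font-family:"cs";src:url(data:application/font-woff;'
          'charset=utf-8;base64,d09GRgABAAAAAAgg) format("woff");}')
print(extract_font_bytes(sample)[:4])
```

Returning None lets the caller decide what to do when the site changes its markup, instead of failing with an IndexError on `[0]`.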
The requests module is used to obtain response. This step has to run before the font-matching code, which reads response.
import chardet
import requests

url = 'https://piaofang.maoyan.com/?ver=normal'
headers = {
    'User-Agent': 'Browser UA',  # Replace with a real browser User-Agent string
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
response = requests.get(url=url, headers=headers).content  # Get the raw bytes
charset = chardet.detect(response).get('encoding')  # Detect the encoding
response = response.decode(charset, "ignore")  # Decode to get a string
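One fragile spot in the decoding step is that encoding detection may come back empty, in which case passing None to decode raises a TypeError. A minimal defensive sketch follows; `decode_body` is a hypothetical helper, and the UTF-8 fallback is an assumption about the site, not something confirmed by the article.

```python
def decode_body(raw, detected_encoding, fallback="utf-8"):
    """Decode response bytes, falling back when detection yields nothing usable."""
    encoding = detected_encoding or fallback
    try:
        return raw.decode(encoding, "ignore")
    except LookupError:
        # The detector returned a codec name that Python does not know.
        return raw.decode(fallback, "ignore")


# Simulate a response whose encoding could not be detected.
print(decode_body("票房".encode("utf-8"), None))
```

In the flow above it would be called with the raw bytes and the value of charset.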