Font anti-crawling

Font anti-crawling, also called custom-font anti-crawling, renders the text on a web page with a custom font file. The text in the page source is then no longer the readable characters but the corresponding font codes, so the content cannot be collected by copying it or by simple scraping.

Many websites now seem to have adopted this anti-crawling mechanism. We will explain it using the real case of Cat’s Eye (Maoyan).

The following is the display on the Cat’s Eye page:

Let’s inspect the elements:

What the hell is this? It’s all gibberish.

Those familiar with CSS will know that it has an @font-face rule that lets web developers specify online fonts for their pages. It was originally designed to remove the dependence on fonts installed on the user’s computer, and now it has a new use: anti-crawling.

There are thousands of Chinese characters in common use. Putting all of them into a custom font would make the font file very large and inevitably slow down page loading. Websites therefore usually protect only selected key content, as shown in the picture above.

The characters look garbled because they are Unicode code points that only the custom font knows how to draw, which you can see by viewing the page source.
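To see why the characters look garbled, it can help to decode one character reference by hand. The sketch below assumes a reference such as `&#xea0b;` (the example discussed in this article) and checks whether it falls in Unicode's Private Use Area, where no standard font defines a glyph.

```python
import html

# Example character reference as it appears in the raw page source.
raw = '&#xea0b;'

# html.unescape turns the reference into the actual character.
char = html.unescape(raw)
code_point = ord(char)

# U+E000..U+F8FF is the Private Use Area of the Basic Multilingual Plane.
# Unicode assigns no meaning to these code points, so only the site's
# custom font knows which shape to draw for them.
in_private_use = 0xE000 <= code_point <= 0xF8FF

print(hex(code_point), in_private_use)
```

Without the site's font, a browser or terminal has nothing meaningful to draw for such a character, which is why copied text comes out as gibberish.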

Search for stonefont to find the @font-face definition:

The .woff file here is the font file. Download it and open it with fontstore.baidu.com/static/edit… The page displays the following:

Does this look like what we saw in the page source? It is indeed the same. Remove the leading &#x and the trailing semicolon, and the remaining four hexadecimal digits are the code shown after the uni prefix in the font file. So &#xea0b; corresponds to the digit “9”.

Now that we know the principle, let’s see how to implement it.
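The conversion from a character reference in the page source to a glyph name in the font file can be sketched as a small helper. The function name `entity_to_glyph_name` is made up for illustration, and the sketch assumes the references always carry exactly four hexadecimal digits, as in the example above.

```python
import re

# Matches numeric character references such as "&#xea0b;".
ENTITY_RE = re.compile(r'&#x([0-9a-fA-F]{4});')


def entity_to_glyph_name(entity: str) -> str:
    """Convert a reference like '&#xea0b;' to the glyph name 'uniEA0B'."""
    match = ENTITY_RE.fullmatch(entity)
    if match is None:
        raise ValueError(f'not a 4-digit hex character reference: {entity!r}')
    # Glyph names in the font file use upper-case hex digits after "uni".
    return 'uni' + match.group(1).upper()


print(entity_to_glyph_name('&#xea0b;'))
```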

To process font files, we need the fontTools library.

Convert font files to XML files.

from fontTools.ttLib import TTFont

font = TTFont('bb70be69aaed960fa6ec3549342b87d82084.woff')
font.saveXML('bb70be69aaed960fa6ec3549342b87d82084.xml')

Open the XML file

All the codes are listed at the beginning. The id here is just a sequence number; do not take it as the corresponding real value. In fact, nowhere in the font file does it say what the actual value of EA0B is.
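The listing at the top of the XML file can also be read programmatically. The sketch below parses a trimmed, hypothetical excerpt shaped like the GlyphOrder section of a fontTools XML dump; the glyph names and ids in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt in the shape of a fontTools XML dump.
SAMPLE_XML = """
<ttFont>
  <GlyphOrder>
    <GlyphID id="0" name="glyph00000"/>
    <GlyphID id="1" name="x"/>
    <GlyphID id="2" name="uniEA0B"/>
    <GlyphID id="3" name="uniF30D"/>
  </GlyphOrder>
</ttFont>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Map each glyph name to its id. The id is only a position in the file,
# not the digit that the glyph draws.
glyph_ids = {
    node.get('name'): int(node.get('id'))
    for node in root.find('GlyphOrder').iter('GlyphID')
}

print(glyph_ids)
```

In this sample uniEA0B has id 2, which has no relation to the digit the glyph actually draws.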

See the following

Here is the glyph information for each character. When the computer displays a character, it does not need to know what the character is; it only needs to know which pixels are black and which are white.

Cat’s Eye’s font file is loaded dynamically and changes on every refresh. Although the font defines only the 10 digits from 0 to 9, their codes and order change. That is, “EA0B” stands for “9” in this font file, but not in other files.

But one thing does not change: the shape of each character, that is, the points defined in the figure above.
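The point data of a glyph can be pulled out of the XML dump with the standard library. This is a minimal sketch on a hypothetical excerpt shaped like the glyf section of a fontTools dump; the coordinates are invented, and `glyph_outline` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt shaped like the glyf table of the XML dump.
SAMPLE_XML = """
<ttFont>
  <glyf>
    <TTGlyph name="uniEA0B" xMin="0" yMin="0" xMax="100" yMax="200">
      <contour>
        <pt x="0" y="0" on="1"/>
        <pt x="100" y="0" on="1"/>
        <pt x="50" y="200" on="0"/>
      </contour>
      <instructions/>
    </TTGlyph>
  </glyf>
</ttFont>
"""


def glyph_outline(root, glyph_name):
    """Return the contours of one glyph as nested tuples of (x, y, on_curve)."""
    for glyph in root.iter('TTGlyph'):
        if glyph.get('name') == glyph_name:
            return tuple(
                tuple(
                    (int(pt.get('x')), int(pt.get('y')), int(pt.get('on')))
                    for pt in contour.iter('pt')
                )
                for contour in glyph.iter('contour')
            )
    raise KeyError(glyph_name)


root = ET.fromstring(SAMPLE_XML.strip())
outline = glyph_outline(root, 'uniEA0B')
print(outline)
```

Tuples are used so that an outline can be compared with `==` or used as a dictionary key.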

We first download an arbitrary font file and name it base.woff, then use the FontStore website to check which code corresponds to which actual value, and manually build and save a dictionary. When the crawler runs, it downloads the page’s font file, uses the codes in the page source to look up the glyph in that font file, and loops over the glyphs in base.woff to compare them. If two glyphs are identical, they are the same character. Once the glyph is found in base.woff, we have its code there, which we have already mapped manually to the real value, and so we get the actual value we want.

The premise here is that the glyphs defined in every font file are identical (Cat’s Eye currently does this, and may change its strategy in the future). If the site gets more sophisticated and adds a little random deformation to the glyphs in each font, this method will no longer work and you will have to fall back on the ultimate weapon: OCR.
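The matching procedure described above can be sketched without any font file by treating each glyph outline as a tuple of points. All coordinates and the glyph names of the "new" font are made up; the two dictionary entries for uniF30D and uniEA94 are taken from the manually built dictionary in this article. This is an illustration of the idea, not the author's code.

```python
# Outlines as tuples of (x, y) points. All coordinates are made up.
base_outlines = {
    'uniF30D': ((0, 0), (60, 0), (60, 100), (0, 100)),
    'uniEA94': ((0, 0), (60, 0), (30, 100)),
}
# Two entries copied from the manually built dictionary.
base_digits = {'uniF30D': '0', 'uniEA94': '9'}

# A freshly downloaded font: same shapes, different (hypothetical) codes.
new_outlines = {
    'uniE123': ((0, 0), (60, 0), (30, 100)),
    'uniF456': ((0, 0), (60, 0), (60, 100), (0, 100)),
}


def build_digit_map(new_outlines, base_outlines, base_digits):
    """Map each glyph name in the new font to the digit with the same outline."""
    # Invert the base font once: outline -> digit.
    shape_to_digit = {
        outline: base_digits[name] for name, outline in base_outlines.items()
    }
    return {
        name: shape_to_digit[outline]
        for name, outline in new_outlines.items()
        if outline in shape_to_digit
    }


digit_map = build_digit_map(new_outlines, base_outlines, base_digits)
print(digit_map)
```

Inverting the base font into an outline-to-digit dictionary is a design choice of this sketch; it replaces the loop over every base glyph with a single lookup per new glyph.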
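If the deformation were only a small jitter on each point, one thing that could be tried before OCR is a comparison with a tolerance instead of strict equality. This is an assumption of mine and not something the article tested; it only helps when the number of points stays the same, and the tolerance value and coordinates below are arbitrary.

```python
def outlines_match(a, b, tolerance=2):
    """Return True if both outlines have the same number of points and every
    pair of corresponding points differs by at most `tolerance` on each axis."""
    if len(a) != len(b):
        return False
    return all(
        abs(ax - bx) <= tolerance and abs(ay - by) <= tolerance
        for (ax, ay), (bx, by) in zip(a, b)
    )


# Made-up outlines: a base glyph, a slightly jittered copy, and another shape.
base_nine = ((0, 0), (60, 0), (30, 100))
jittered_nine = ((1, 0), (59, 1), (30, 98))
other_glyph = ((0, 0), (60, 0), (60, 100), (0, 100))

print(outlines_match(base_nine, jittered_nine))
print(outlines_match(base_nine, other_glyph))
```

If the site changes the number of points or redraws the glyphs entirely, a point-by-point comparison like this fails as well, and OCR remains the fallback.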

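import os
import time

import requests
from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont

# host, regex_woff, regex_text, regex_font and downloads() are used below
# but their definitions are not shown in this excerpt.
basefont = TTFont('base.woff')
# Manually built mapping from glyph names in base.woff to the digits they draw.
fontdict = {'uniF30D': '0', 'uniE6A2': '8', 'uniEA94': '9', 'uniE9B1': '2', 'uniF620': '6',
            'uniEA56': '3', 'uniEF24': '1', 'uniF53E': '4', 'uniF170': '5', 'uniEE37': '7'}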


def get_moviescore(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/68.0.3440.106 Safari/537.36'}
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    ddlist = soup.find_all('dd')
    for dd in ddlist:
        a = dd.find('a')
        if a is not None:
            link = host + a['href']
            time.sleep(5)
            dhtml = requests.get(link, headers=headers).text
            msg = {}

            dsoup = BeautifulSoup(dhtml, 'lxml')
            msg['name'] = dsoup.find(class_='name').text
            ell = dsoup.find_all('li', {'class': 'ellipsis'})
            msg['type'] = ell[0].text
            msg['country'] = ell[1].text.split('/')[0].strip()
            msg['length'] = ell[1].text.split('/')[1].strip()
            msg['release-time'] = ell[2].text[:10]

            # Download the font file
            woff = regex_woff.search(dhtml).group()
            wofflink = 'http:' + woff
            localname = 'font\\' + os.path.basename(wofflink)
            if not os.path.exists(localname):
                downloads(wofflink, localname)
            font = TTFont(localname)

            # BeautifulSoup does not display the custom-font characters properly,
            # so extract them from the raw HTML text with re instead
            ms = regex_text.findall(dhtml)
            if len(ms) < 3:
                msg['score'] = '0'
                msg['score-num'] = '0'
                msg['box-office'] = '0'
            else:
                msg['score'] = get_fontnumber(font, ms[0])
                msg['score-num'] = get_fontnumber(font, ms[1])
                msg['box-office'] = get_fontnumber(font, ms[2]) + dsoup.find('span', class_='unit').text
            print(msg)


def get_fontnumber(newfont, text):
    ms = regex_font.findall(text)
    for m in ms:
        text = text.replace(f'&#x{m};', get_num(newfont, f'uni{m.upper()}'))
    return text


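def get_num(newfont, name):
    # Look up the glyph in the freshly downloaded font, then find the glyph
    # in base.woff with the identical outline and return its known digit.
    uni = newfont['glyf'][name]
    for k, v in fontdict.items():
        if uni == basefont['glyf'][k]:
            return v
    # No identical glyph: the site has probably changed its glyph shapes.
    raise ValueError(f'no glyph in base.woff matches {name}')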
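The code above relies on several names whose definitions are not shown: `host`, `regex_woff`, `regex_text`, `regex_font` and `downloads`. The following is one possible set of definitions, written only from how the code uses them. The patterns, the site root, the `stonefont` class name on the span, and the sample HTML are all assumptions, and `downloads` uses the standard library's urllib rather than requests to keep the sketch dependency-free.

```python
import os
import re
import urllib.request

# Assumed site root; the original value is not shown in the article.
host = 'https://maoyan.com'

# Protocol-relative URL of the .woff file inside the @font-face rule.
regex_woff = re.compile(r'''//[^\s'")]+?\.woff''')
# Content of the spans rendered with the custom font (class name assumed).
regex_text = re.compile(r'<span class="stonefont">(.*?)</span>')
# The four hex digits of each character reference, e.g. "ea0b" in "&#xea0b;".
regex_font = re.compile(r'&#x([0-9a-fA-F]{4});')


def downloads(url, localname):
    """Save the file at `url` to `localname`, creating the folder if needed."""
    folder = os.path.dirname(localname)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(localname, 'wb') as f:
        f.write(resp.read())


# Hypothetical page fragment used only to exercise the patterns.
sample = (
    "<style>@font-face{font-family: stonefont;"
    "src:url('//static.example.com/fonts/abc123.woff') format('woff');}</style>"
    '<span class="stonefont">&#xea0b;.&#xf30d;</span>'
)

woff = regex_woff.search(sample).group()
spans = regex_text.findall(sample)
codes = regex_font.findall(spans[0])
print(woff, spans, codes)
```

Because `regex_font` captures only the hex digits, each match can be passed directly to `f'uni{m.upper()}'` as `get_fontnumber` does.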

You can also follow my personal official account and reply “Cat eye” in the background to get the source code and the base font I used.