
Font-based anti-crawler techniques are usually used to protect key data in pages, such as movie box office figures, merchant phone numbers on takeout platforms, vehicle quotations on car portals, or product attributes and prices on e-commerce platforms.

For an introduction to font anti-crawler techniques, how they are implemented, and how they work, you can refer to the book “Python3 Anti-crawler Principle and Bypass Actual Combat” or look the topic up with a search engine; this article does not repeat that material.

The question this article addresses is how to make a program accurately recognize text that has been replaced with a custom font.

This article uses the site aHR0cHM6Ly9tYW95YW4uY29tL2ZpbG1zLzEyMTgwMjk= (a Base64-encoded URL) as its example. The specific target is as follows:

Obviously, the user rating, the total box office, and similar figures are the key data, and they are exactly what a crawler wants. Although 9.4 and 14.05 are readable to the human eye, their digits show up as unrecognizable characters in the browser developer tools. The HTML that corresponds to 9.4 there is:

<span class="stonefont"> . </span>

The web source code, however, is a different story:

<span class="stonefont"> &#xe761; . 

Experienced readers will recognize this at a glance as a font-based anti-crawler measure. Readers who are new to the topic can consult “Python3 Anti-crawler Principle and Bypass Actual Combat”.

The usual approach to cracking this kind of font anti-crawler measure is:

Obtain the TTF or WOFF font file referenced in the relevant CSS, then use Python's fontTools module to build a mapping from the font's glyph codes to the real characters.
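
As a rough sketch of this static-mapping idea, the code below reads a font's character map with fontTools and decodes obfuscated text through a hand-made glyph table. It assumes the fontTools package is installed; the file path passed to `load_cmap`, the function names, and the digit assignments in `GLYPH_TO_DIGIT` are illustrative assumptions, not values taken from the site.

```python
def load_cmap(font_path):
    """Return {code point: glyph name} for a font file (requires fontTools)."""
    from fontTools.ttLib import TTFont  # third-party: pip install fonttools
    return TTFont(font_path).getBestCmap()


def decode_text(text, cmap, glyph_to_digit):
    """Replace each obfuscated character with the digit its glyph draws."""
    decoded = []
    for ch in text:
        glyph_name = cmap.get(ord(ch))
        # Characters without a known glyph (such as ".") are kept unchanged.
        decoded.append(glyph_to_digit.get(glyph_name, ch))
    return "".join(decoded)


# Hand-written stand-ins for load_cmap("font.woff") and for a table
# built by inspecting the glyphs in a font editor (assumed values).
EXAMPLE_CMAP = {0xE761: "uniE761", 0xF5DE: "uniF5DE"}
GLYPH_TO_DIGIT = {"uniE761": "9", "uniF5DE": "6"}

print(decode_text("\ue761.\uf5de", EXAMPLE_CMAP, GLYPH_TO_DIGIT))
```

A table like this only stays valid while the site keeps serving the same font file.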

However, when you analyze the example used in this article, you will find that the font served with the page changes dynamically on every request, so no fixed mapping can be established. This is very similar to one of the anti-crawler ideas described in “Python3 Anti-crawler Principle and Bypass Actual Combat”.

Fortunately, the font anti-crawler measure encountered here is not exactly the same as the one in the book. It leaves out some of those techniques, so it can still be handled.

Next, we will use the k-nearest neighbor algorithm, one of the simplest machine learning methods, to crack this real-time dynamic font anti-crawler measure. The steps are:

  • Download the font file used by the page to your local machine
  • View the font file in a font editor
  • Observe how the font file changes between requests and record the pattern
  • Derive the rule behind the changes

Take the example site used in this article. First, find the font file loaded by the page (usually in WOFF format) in the Network panel of the browser developer tools and download it. Then open it in a font editor (such as Baidu Font Editor), as shown below:

Next, use the fontTools library to convert the WOFF file to XML and examine how the coordinates change. For example, in one font file the glyph for the number 6 is named uniF5DE. Its coordinate values are as follows:

In another font file, the coordinates of the number 6 are as follows:

After many tests, we found that although the glyph objects for the same number differ from one font file to another, the differences are small: each coordinate shifts only slightly. We can therefore treat two glyphs as the same number whenever the differences between their coordinate values stay within a certain range.
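
The "within a certain range" rule can be sketched as a point-by-point comparison. The function name `same_digit`, the tolerance of 10 font units, and all coordinates below are assumptions chosen for illustration; real values would come from the font files.

```python
def same_digit(coords_a, coords_b, tolerance=10):
    """Return True if two outlines match point by point within tolerance."""
    if len(coords_a) != len(coords_b):
        return False  # a different number of points means a different shape
    return all(
        abs(ax - bx) <= tolerance and abs(ay - by) <= tolerance
        for (ax, ay), (bx, by) in zip(coords_a, coords_b)
    )


# Invented outlines: the same digit from two font files, plus another digit.
six_in_font_a = [(100, 50), (220, 310), (90, 600)]
six_in_font_b = [(104, 47), (216, 315), (93, 596)]
nine_in_font_b = [(300, 50), (20, 310), (400, 600)]

print(same_digit(six_in_font_a, six_in_font_b))   # small offsets only
print(same_digit(six_in_font_a, nine_in_font_b))  # large offsets
```

A fixed threshold like this is brittle when glyphs gain or lose points, which is one reason to move to a distance-based classifier.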

Next, the KNN algorithm, one of the simplest machine learning methods, can classify the coordinates after simple training.

So what is the KNN algorithm?

Simply put, the k-nearest neighbor algorithm classifies a sample by measuring the distance between its feature values and those of known samples.

Working principle:

We start with a sample data set, also called the training set, in which every item carries a label; that is, we know which class each item belongs to. When new, unlabeled data is input, each of its features is compared with the corresponding features of the items in the sample set, and the algorithm extracts the class labels of the most similar items (the nearest neighbors). In general, only the k most similar items in the sample set are considered, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k items is chosen as the class of the new data.
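
The working principle above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's own implementation: the function name `knn_classify`, the tiny training set, and its feature values are invented, and all feature vectors are assumed to have the same length (`math.dist` needs Python 3.8 or later).

```python
import math
from collections import Counter


def knn_classify(sample, training_set, k=3):
    """Label `sample` by majority vote among its k nearest training items."""
    # Euclidean distance from the new sample to every labeled item.
    distances = [
        (math.dist(sample, features), label)
        for features, label in training_set
    ]
    distances.sort(key=lambda pair: pair[0])
    # Count the labels of the k closest items and return the most common.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented training data: (feature vector, label).
TRAINING_SET = [
    ([10, 20, 30, 40], "6"),
    ([12, 19, 31, 42], "6"),
    ([11, 22, 29, 41], "6"),
    ([80, 90, 10, 5], "9"),
    ([82, 88, 12, 6], "9"),
    ([79, 91, 9, 4], "9"),
]

print(knn_classify([11, 21, 30, 40], TRAINING_SET))
print(knn_classify([81, 89, 11, 5], TRAINING_SET))
```

With glyph data, each feature vector would hold the flattened outline coordinates of one glyph and each label the digit it draws.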

Based on the introduction above, we can first build a sample data set, as follows:

That is, the coordinates of the relevant numbers are extracted from the XML file. The KNN algorithm then works on the coordinates of an input glyph and assigns it a label:

def classifyPerson(font):
    # The full code will be available at the end of this article
    pass
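
One possible way to turn the XML dump into feature vectors for the sample data set is sketched below with the standard library only. The XML fragment imitates the layout fontTools writes (a TTGlyph element containing contour and pt elements), but its coordinates are invented, and the helper names `glyph_vectors` and `pad` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the XML dump layout; coordinates are invented.
SAMPLE_XML = """
<ttFont>
  <glyf>
    <TTGlyph name="uniF5DE" xMin="0" yMin="0" xMax="30" yMax="40">
      <contour>
        <pt x="0" y="0" on="1"/>
        <pt x="30" y="0" on="1"/>
        <pt x="30" y="40" on="0"/>
      </contour>
      <instructions/>
    </TTGlyph>
  </glyf>
</ttFont>
"""


def glyph_vectors(xml_text):
    """Return {glyph name: [x1, y1, x2, y2, ...]} for every TTGlyph."""
    root = ET.fromstring(xml_text.strip())
    vectors = {}
    for glyph in root.iter("TTGlyph"):
        coords = []
        for pt in glyph.iter("pt"):
            coords.extend([int(pt.get("x")), int(pt.get("y"))])
        vectors[glyph.get("name")] = coords
    return vectors


def pad(vector, length):
    """Zero-pad or truncate so every sample has the same length."""
    return (vector + [0] * length)[:length]


print(glyph_vectors(SAMPLE_XML))
```

Each padded vector, paired with the digit identified by eye in the font editor, would form one labeled row of the training set.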

Once the KNN algorithm has labeled the glyphs, the obfuscated characters in the page source can be substituted with the replace method:

fonts = {}
for i in base_list:
    # The full code will be available at the end of this article
    pass
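
The substitution step might look like the sketch below. It uses a regular expression instead of chained replace calls so that every hexadecimal character reference is handled in one pass; the function name `replace_entities`, the `CODE_TO_DIGIT` table, and the sample HTML string are assumptions made for illustration.

```python
import re


def replace_entities(html, code_to_digit):
    """Swap references such as &#xe761; for the digits they stand for."""
    def substitute(match):
        code = match.group(1).lower()
        # Leave references untouched when no label is known for them.
        return code_to_digit.get(code, match.group(0))

    return re.sub(r"&#x([0-9a-fA-F]+);", substitute, html)


# Assumed labels, as a classifier might produce for one font file.
CODE_TO_DIGIT = {"e761": "9", "f5de": "6"}
SAMPLE_HTML = '<span class="stonefont">&#xe761;.&#xf5de;</span>'

print(replace_entities(SAMPLE_HTML, CODE_TO_DIGIT))
```

Because the font changes with every request, the table has to be rebuilt from the font file that accompanies each page.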

Several target URLs were then selected at random and their output was verified to be correct. The data for this case is as follows:

{"Movie Title": "Young You", "User rating": "9.4", "Number of raters": "95.2万", "Cumulative box office": "732 million"}

At this point, the font anti-crawler problem is solved.

For more ideas and research on font anti-crawler measures, please refer to “Python3 Anti-crawler Principle and Bypass Actual Combat”. Some important points are not covered in this article, so it is worth turning to the book to fill in the gaps.

For the complete code, follow the WeChat official account “Crawler Engineer's Home” and reply 20191029 to get the link to the code repository.

References for this article:

WeChat official account “Crawler Engineer's Home”, article “Based on k-Nearest Neighbor Algorithm, Crack CSS Dynamic Obfuscation Fonts”

Wei Shidong's new book “Python3 Anti-crawler Principle and Bypass Actual Combat”