This article was first published on my personal blog and simultaneously in my Juejin column. For non-commercial reprints, please credit the source; for commercial reprints, please read the legal statement in the original post.

The Web is an open platform, and that openness has underpinned its development from its birth in the early 1990s to today, nearly 30 years later. But every strength has its downside: that same openness, together with search engines and the easy-to-learn HTML and CSS, has made the Web the most popular and mature information medium on the Internet. Yet now that the Web also hosts commercial software, there is little to protect the copyright of the content published on it, because compared with client software, the content of a web page can be fetched at very low cost and with a very low technical threshold by a scraping program. That is the subject this series explores: the web crawler.

Many people believe that the Web should always stay true to the spirit of openness, and that the information presented on a page should be shared unreservedly with the whole Web. In my opinion, however, the Web today is no longer the mere "hypertext" information carrier that once competed with PDF; it has become an ideology of lightweight client software. Now that commercial software has moved onto the platform, the Web has to face the problem of intellectual property protection. Just imagine: if high-quality original content is not protected and plagiarism and piracy run rampant online, the healthy development of the Web ecosystem suffers, and there is little incentive left to produce more high-quality original content.

Unauthorized crawling is one of the main villains harming the ecosystem of original web content, so to protect web content we must first consider how to counter crawlers.

From the perspective of crawler attack and defense

The simplest crawler is a plain HTTP request, which almost every server-side and client-side programming language supports. As long as you issue an HTTP GET request to the URL of the target page, you get the complete HTML document that the browser would receive when loading the page. This is what we call a "synchronous page".
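As an illustration, here is a minimal sketch of such a crawler in Node.js; the target URL is just a placeholder:

// A minimal sketch of the simplest possible crawler: one HTTP GET request
// that pulls down the server-rendered ("synchronous") HTML of a page.
const https = require('https');

https.get('https://example.com/', (res) => {
    let html = '';
    res.on('data', (chunk) => { html += chunk; });
    res.on('end', () => {
        // `html` now holds the same document a browser would receive
        // before running any JavaScript on the page.
        console.log(html.length, 'bytes of HTML fetched');
    });
});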

On the defensive side, the server can inspect the User-Agent field in the HTTP request header to decide whether the client is a legitimate browser or a scripted crawler, and then decide whether to return the real page content.
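A naive sketch of such a check, written here as Express middleware purely for illustration (the regex is a stand-in, not a real browser allow-list):

// Reject requests whose User-Agent does not look like a mainstream browser.
const express = require('express');
const app = express();

app.use((req, res, next) => {
    const ua = req.headers['user-agent'] || '';
    if (!/Mozilla\/5\.0/.test(ua)) {
        return res.status(403).send('Forbidden');
    }
    next();
});

app.get('/', (req, res) => res.send('<html>real page content</html>'));
app.listen(3000);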

Of course, this is only the most basic line of defense. As the attacker, the crawler can simply forge the User-Agent field; if it wants to, the Referer header, cookies, and any other field of the HTTP GET request can be forged just as easily.
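For example, the Node.js request from before can claim to be a regular desktop Chrome with a couple of extra lines (all header values here are made up):

// Forging browser-like request headers is trivial for the attacker.
const https = require('https');

https.get('https://example.com/', {
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0 Safari/537.36',
        'Referer': 'https://example.com/some-page',
        'Cookie': 'sessionid=forged-value'
    }
}, (res) => {
    res.resume(); // body handling omitted for brevity
    console.log('status:', res.statusCode);
});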

At this point the server can use browser HTTP-header fingerprinting: based on the vendor and version the client claims in its User-Agent, it verifies whether each field of the HTTP header matches what that browser would actually send; if not, the request is treated as a crawler. A typical case was PhantomJS 1.x: because its underlying network stack was the Qt framework's network library, its HTTP headers carried obvious Qt characteristics that the server could recognize and block directly.
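A hedged sketch of what such a fingerprint check might look like; the specific fields are illustrative, and real implementations compare many more header characteristics:

// If the client claims to be Chrome, require headers a real Chrome always sends.
function headersConsistentWithChrome(headers) {
    const ua = headers['user-agent'] || '';
    if (!/Chrome\//.test(ua)) return true; // only fingerprint Chrome claims here
    return Boolean(headers['accept-language']) && Boolean(headers['accept-encoding']);
}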

In addition, there is a more extreme server-side crawler detection mechanism: in the HTTP response of every request for a page, the server plants a cookie token. It then checks, on some Ajax interface the page calls asynchronously, whether the request carries the token back; sending the token back indicates a legitimate browser visit. Otherwise, it means the user who was handed that token fetched the page HTML but never issued the Ajax request that the page's JS would have fired after executing, which very likely makes it a crawler.

Conversely, if you hit an interface directly without carrying a token, it means you never requested the HTML page and went straight to an endpoint that should only be reached by the page's Ajax calls, which clearly marks you as a suspicious crawler. Amazon, the well-known e-commerce site, uses this kind of defense.
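A minimal sketch of this cookie-token trick, again using Express and an in-memory token store purely for illustration:

const crypto = require('crypto');
const express = require('express');
const app = express();

const issuedTokens = new Set();

// Serving the page plants a one-time token as a cookie.
app.get('/page', (req, res) => {
    const token = crypto.randomBytes(16).toString('hex');
    issuedTokens.add(token);
    res.cookie('crawl_token', token);
    res.send('<html><script>fetch("/api/data")</script></html>');
});

// The Ajax endpoint only answers requests that bring the token back.
app.get('/api/data', (req, res) => {
    const match = (req.headers.cookie || '').match(/crawl_token=([a-f0-9]+)/);
    if (!match || !issuedTokens.has(match[1])) {
        return res.status(403).send('Forbidden');
    }
    res.json({ content: 'real data' });
});

app.listen(3000);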

These are some of the tricks that can be played with server-side crawler checks.

Detection based on client JS runtime

Modern browsers give JavaScript powerful capabilities, so we can move all of a page's core content into asynchronous Ajax requests made by JS and render the fetched data on the client, which obviously raises the bar for crawlers. This shifts the battle between crawlers and anti-crawler measures from the server to the JS runtime inside the client browser. Next, let's talk about crawling techniques that involve the client-side JS runtime.
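For instance, a page whose core content only appears after an Ajax call boils down to something like this (the endpoint and element id are placeholders):

// The HTML shipped to a plain HTTP crawler is an empty shell; the real
// content only shows up after this client-side request runs.
fetch('/api/article-content')
    .then((res) => res.json())
    .then((data) => {
        document.getElementById('content').innerHTML = data.html;
    });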

The various server-side checks just discussed already pose a real technical barrier for ordinary HTTP crawlers written in Python or Java. After all, a web application is a black box to an unauthorized crawler: many things have to be probed bit by bit, and a crawler that took considerable manpower and resources to develop can be broken by the defending site simply adjusting a few strategies, forcing the attacker to spend just as much time reworking the crawling logic.

This is where the headless browser comes in. What is that technology? Essentially, it lets a program drive a browser to visit web pages, so the crawler can implement complex crawling logic by calling the APIs the browser exposes to the controlling program.

This is not actually a new technology of recent years: there have long been PhantomJS based on the WebKit kernel, SlimerJS based on the Firefox kernel, and even trifleJS based on the IE kernel. If you are interested, here and here are two collections of headless browsers.

These headless browser programs work by modifying and wrapping some open-source browser kernel C++ code into a simple, GUI-free page rendering program. Their common problem, however, is that because their code is forked from the trunk of some old version of an official kernel such as WebKit, they cannot keep up with the latest CSS properties and JS syntax and suffer from compatibility issues, so they do not run as reliably as a real, released GUI browser.

Among them, the most mature and most widely used is PhantomJS. I have written a blog post about crawling with it before and will not repeat it here. PhantomJS has plenty of problems: it is a single-process model, lacks the necessary sandbox protection, and its browser kernel is relatively insecure. On top of that, the project author has announced that he has stopped maintaining it.

Google opened up the headless mode API in the Chrome 59 release, along with an open-source headless Chromium driver library that can be called from Node.js. I also contributed a list of deployment dependencies for the CentOS environment to that library.

Headless Chrome is a one-of-a-kind heavyweight among headless browsers: since it is the Chrome browser itself, it supports all the new CSS rendering features and the latest JS runtime syntax.
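As an illustration, a headless-Chrome crawler that waits for Ajax-rendered content can look roughly like this; Puppeteer is used here only as an example of a Node.js driver library, and the URL is a placeholder:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    // Wait until network activity settles, so Ajax-rendered content is in the DOM.
    await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
    const html = await page.content();
    console.log(html.length, 'bytes of fully rendered HTML');
    await browser.close();
})();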

With tools like this, the crawler as attacker can bypass almost all server-side verification logic, but these browsers still have some telltale flaws in their client-side JS runtime, for example:

Plugin object-based checks

if(navigator.plugins.length === 0) {
    console.log('It may be Chrome headless');
}

Language-based checks

if(navigator.languages === '') {
    console.log('Chrome headless detected');
}

WebGL-based checks

var canvas = document.createElement('canvas');
var gl = canvas.getContext('webgl');

// The WEBGL_debug_renderer_info extension exposes the real vendor/renderer strings.
var debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

// Headless Chrome renders off-screen with Mesa and gives itself away here.
if(vendor == 'Brian Paul' && renderer == 'Mesa OffScreen') {
    console.log('Chrome headless detected');
}

Check based on the browser hairline feature

if(!Modernizr['hairline']) {
    console.log('It may be Chrome headless');
}

Check based on an img element with an invalid src attribute

var body = document.getElementsByTagName('body')[0];
var image = document.createElement('img');
// A URL that will never resolve, so onerror always fires.
image.src = 'http://iloveponeydotcom32188.jg';
image.setAttribute('id', 'fakeimage');
body.appendChild(image);
image.onerror = function(){
    // A broken image reports 0x0 dimensions in headless Chrome,
    // unlike the broken-image placeholder in a regular browser.
    if(image.width == 0 && image.height == 0) {
        console.log('Chrome headless detected');
    }
};

Based on these browser features alone, you can already kill most headless browser programs on the market. At this level, the bar for web crawling is genuinely raised: a developer writing a crawler would have to modify the browser kernel's C++ code and recompile a browser, and the features above are by no means small changes to the kernel. If you have ever tried compiling a Blink or Gecko kernel, you know how hard that is for a script kid.

Going further, based on the browser brand and version reported in the UserAgent field, we can check the attributes and methods of the JS runtime's native objects, the DOM, and the BOM, and see whether their characteristics match what a browser of that version should exhibit.
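One hedged example of such a consistency check, run in the page's JS: a client whose UserAgent claims to be desktop Chrome should also expose the window.chrome object.

// Only a sketch: real fingerprinting compares many more properties.
function matchesClaimedChrome() {
    var claimsChrome = /Chrome\//.test(navigator.userAgent);
    if (!claimsChrome) return true;            // only check Chrome claims here
    return typeof window.chrome === 'object';  // missing in some headless setups
}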

This approach is known as browser fingerprint checking, and it relies on large websites collecting information about the various browser APIs. As an attacker writing a crawler, you can in turn pre-inject some JS logic into the headless browser's runtime to fake those browser features.
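For example (a sketch that again assumes a Puppeteer-style driver), the crawler can spoof the navigator features checked above before any page script runs:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    // Injected before any page JS executes, so the checks above see "normal" values.
    await page.evaluateOnNewDocument(() => {
        Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
        Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
    });
    await page.goto('https://example.com/');
    await browser.close();
})();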

Also, while researching browser-side robot detection via JS APIs, we found an interesting trick: you can disguise a pre-injected JS function as a Native Function. Have a look at the following code:

var fakeAlert = (function(){}).bind(null);
console.log(window.alert.toString()); // function alert() { [native code] }
console.log(fakeAlert.toString()); // function () { [native code] }

A crawler attacker may pre-inject some JS, wrapping a proxy function as a hook around a native API and then overriding the native API with this fake version. If the defender's check merely looks for [native code] in the result of toString, it will be bypassed. So the check has to be more rigorous: a method forged with bind(null) has no function name in its toString output, so you also need to check whether the function name after toString is empty.
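A slightly stricter check on the defender's side might therefore look like this (illustrative sketch):

// Require both the "[native code]" marker and the expected function name.
function looksNative(fn, expectedName) {
    var source = Function.prototype.toString.call(fn).trim();
    return source.indexOf('[native code]') !== -1 &&
           source.indexOf('function ' + expectedName + '(') === 0;
}

console.log(looksNative(window.alert, 'alert'));              // true in a real browser
console.log(looksNative((function(){}).bind(null), 'alert')); // false: bound fn has no name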

What is this trick good for? To expand on it: anti-scraping defenders have a robot-detect method that actively throws up an alert while the page's JS is running; the copy can be something relevant to the business logic. A normal user clicking OK inevitably introduces a delay of a second or more, because alert blocks JS execution in the browser (in V8 it actually pauses the isolate context in a way similar to suspending a process). So the crawler, as attacker, can choose to pre-inject a piece of JS before any of the page's JS runs and hook all pop-up methods such as alert, prompt, and confirm with forged versions. But if the defender checks, right before its popup code, whether the alert it is about to call is still native, that path is blocked.
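Putting the two points together, a crawler's pre-injected hook that both swallows the popup and survives a naive toString check might look roughly like this (illustrative sketch):

(function () {
    var nativeAlert = window.alert;
    var fakeAlert = function alert() {
        // Swallow the popup so the crawler is never blocked waiting for user input.
        console.log('alert intercepted');
    };
    // Delegate toString to the real alert so it still reports "[native code]".
    fakeAlert.toString = nativeAlert.toString.bind(nativeAlert);
    window.alert = fakeAlert;
})();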

The silver bullet against crawlers

At present, the most reliable anti-scraping and robot-checking method is CAPTCHA technology. But a CAPTCHA does not have to force the user to type a string of letters and digits; there are various behavioral verification techniques based on mouse movements, touch screens, and more. One of the most mature is Google's reCAPTCHA, which uses machine learning to distinguish users from crawlers.

Building on the techniques above for distinguishing users from crawlers, the ultimate move of a site's defenders is to block the IP address, or to impose a high-intensity CAPTCHA policy on visitors from that IP. As a result, attackers have to buy IP proxy pools to scrape content, since a single IP is easily blocked and then useless for crawling. The contest between scraping and anti-scraping is thereby pushed up to the level of the economic cost of an IP proxy pool.

The robots protocol

Besides all this, there is also a "white hat" route in the field of web scraping, called the robots protocol. You can find it at /robots.txt under the root of a website; for example, take a look at GitHub's robots protocol, where Allow and Disallow declare the crawling authorization for each UA.

However, this is only a gentlemen's agreement. Although it carries some legal weight, it can only restrain the spiders of commercial search engines; it cannot restrict the "wild crawlers".

In closing

The scraping of web content and the countermeasures against it are destined to be a cat-and-mouse game. You can never completely block crawlers with any single technique; all you can do is raise the attacker's cost of scraping and build a more accurate picture of unauthorized crawling behavior.

The CAPTCHA attack and defense mentioned in this article is actually a rather involved technical topic in its own right, so I'll leave it as a cliffhanger here. If you're interested, stay tuned for a detailed treatment in a follow-up article.

In addition, friends interested in scraping are welcome to check out my open-source project Webster. The project uses Node.js together with Chrome's headless mode to implement a highly available crawling framework: by letting Chrome render the page, it can capture all content rendered asynchronously by JS and Ajax; combined with Redis, it implements a task queue so the crawler scales easily both horizontally and vertically. It is easy to deploy, and I provide an official Docker image with the basic runtime for Webster. If you want a sneak peek, you can also try the Webster demo Docker image.