Opening

With the popularity of Python and big data, a large number of engineers have flocked to the field. Crawler technology, easy to learn and quick to show results, has become one of the first skills everyone chases. Crawler development has entered a boom period, which multiplies the pressure on servers. To keep their services running normally, or to reduce pressure and cost, enterprises have to use a variety of technical means to stop crawler engineers from draining server resources without restraint. These means are collectively called "anti-crawler".

"Anti-crawler technology" is a general term for Internet techniques that restrict crawlers. Bypassing anti-crawler measures is a problem every crawler engineer has to face, and it is also the topic interviewers care about most when hiring mid-level and senior crawler engineers.

The problem

However, in day-to-day conversations the author has found that most junior crawler engineers can only parrot the technical articles other people have written online. Beyond knowing to forge the User-Agent in the request headers, they have no idea about:

  • Why would you do that?
  • What’s the good of that?
  • Can I do it any other way?
  • How does it work?
  • How does the server recognize my crawler?
  • How should I get around it?

They know nothing. If you don't understand how it works or why it is done, you will still be at a loss the moment your target site tweaks its anti-crawler strategy.

Yes, that blank look of confusion.

The author's hope

This is exactly the kind of knowledge I want to share: through this article, readers can use a bit of spare time to learn a simple anti-crawler principle and its implementation, and then get familiar with how to bypass it. Take the User-Agent anti-crawler measure as an example: understand its principle, implement the check yourself, and then bypass it yourself. Perhaps this small case can open a door in your thinking and unclog a few mental pipes along the way.

The main text

The above was talk; what follows is practice. A great man once expressed an idea along these lines:

It doesn't matter whether the cat is black or white; as long as it catches mice, it is a good cat.

What is the User-Agent?

User-Agent (UA for short) is a special string in the request headers that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on. Some websites send different pages to different operating systems and browsers by inspecting the UA, which is why some pages may not display properly in a particular browser; the UA can also be disguised to bypass detection. The flow of a browser making a request to the server is shown in the following diagram:

Take Firefox and Google Chrome as examples; their UA strings look like this:

Firefox's User-Agent:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0

Chrome's User-Agent:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36

What role does the User-Agent play in network requests?

In network requests, the User-Agent serves as a kind of identity. The server can use the User-Agent in the request headers to decide whether the requester is a browser, a client program, or some other terminal (of course, an empty User-Agent is also allowed, because it is not a required header).
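
To see this from the server's side, here is a minimal sketch of a toy server that simply echoes back the User-Agent it receives, using only Python's standard library (the port 8000 and the handler name are arbitrary choices for this illustration):

from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoUAHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the User-Agent header sent by the client (it may be absent)
        ua = self.headers.get("User-Agent", "")
        body = ("Your User-Agent is: " + ua + "\n").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve on localhost only; stop with Ctrl+C
    HTTPServer(("127.0.0.1", 8000), EchoUAHandler).serve_forever()

Requesting http://127.0.0.1:8000 from a browser, from Python, and from Curl will show three clearly different identifiers.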

Why do anti-crawler measures target the User-Agent?

From the introduction above, you can see that the User-Agent is the identity of the terminal. This means the server knows exactly whether the request was sent from Firefox or Internet Explorer, or whether it was sent from an application written in a language such as Python.

Rendering web pages, dynamic effects, and images is handled by the browser. The browser is a relatively closed program: because it has to guarantee that the data renders correctly, users cannot obtain content data from it on a large scale, automatically.

Crawlers are different: they are built to fetch content from the web and turn it into data. This is a completely different approach; you can also think of crawling as writing code to capture content data on a large scale, automatically.

Back to the point: why was the User-Agent chosen?

Because every programming language has a default identifier, and when a network request is made this identifier is sent to the server as the User-Agent value in the request headers, without your even noticing. For example, when Python initiates a network request through code, the User-Agent value contains "Python". Likewise, languages such as Java and PHP have their own default identifiers.
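
As a quick check, the Requests library exposes the default identifier it attaches to outgoing requests; a minimal sketch (assuming Requests is installed):

import requests

# The default identifier Requests sends as the User-Agent,
# something like "python-requests/2.x.y"
print(requests.utils.default_user_agent())

# The same value is pre-filled in every Session's default headers
print(requests.Session().headers["User-Agent"])

Both lines print an identifier that clearly reveals the request came from Python rather than from a browser.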

Anti-crawler blacklist strategy

Now that we know this characteristic of programming languages, we can combine it with actual needs, and an anti-crawler idea emerges: the blacklist strategy. As long as a request's identity appears on the blacklist, it is treated as a crawler, and such requests can be left unprocessed or answered with an error message.

Why is a blacklist policy used instead of a whitelist policy?

In real life there are many browsers (Firefox, Google Chrome, 360 Browser, Maxthon, Opera, TheWorld, QQ Browser, and so on).

In addition, many services are not only open to browsers. Sometimes they serve applications through APIs, such as a back-end API that supplies data to an Android app: the app itself only handles the interface and layout, and the data is fetched from the back-end API. In such requests the User-Agent may become something like Android.

That’s why you can’t use a whitelist policy.

Blacklisting is simple: when you want to block requests coming from Python code or from Java code, just add their identifiers to the blacklist.
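
To make the idea concrete, here is a minimal sketch of such a blacklist check in Python; the BLACKLIST tuple and the is_blacklisted helper are hypothetical names used only for illustration (the actual check in this article is done by Nginx below):

# Hypothetical blacklist of identifiers to reject
BLACKLIST = ("python", "curl", "java", "php")

def is_blacklisted(user_agent):
    # Treat the request as a crawler if its User-Agent contains
    # any blacklisted identifier (case-insensitive)
    ua = (user_agent or "").lower()
    return any(word in ua for word in BLACKLIST)

print(is_blacklisted("python-requests/2.21.0"))  # True  -> reject, e.g. with 403
print(is_blacklisted("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"))  # False -> allow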

Viewing the User-Agent in the request headers through the Nginx logs

Nginx is a lightweight web server / reverse proxy server and email (IMAP/POP3) proxy server. It is known for its small memory footprint and strong concurrency; in fact, among web servers of the same type, Nginx's concurrency performance is indeed among the best. Enterprises that use Nginx include Baidu, JD, Sina, NetEase, Tencent, Taobao, and so on.

Nginx installation and startup

Nginx can be installed with the system's own package manager (yum on CentOS, apt-get on Debian, brew on macOS), for example:

sudo apt-get install nginx

Then start Nginx by running the following command in the terminal:

sudo systemctl start nginx

Note: installation and startup commands differ slightly across systems and versions; please look up the details for your own environment.

Nginx logs

Nginx provides logging for users: it records the status and other information of every request handled by the server, including the User-Agent. The default Nginx log directory is:

/var/log/nginx/

Run the following command in the terminal:

cd /var/log/nginx && ls

This enters the log directory and lists the files in it; you can see that there are two main files: access.log and error.log.

They record successful requests and error messages, respectively. We will use the Nginx access log to view the information about each request.
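
In the default Nginx log format the User-Agent is the last double-quoted field on each line, so a small sketch like the following can pull it out of access.log (assuming the default log format and path, and that the file is readable by the current user):

import re

# The User-Agent is the last field wrapped in double quotes
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

with open("/var/log/nginx/access.log") as log_file:
    for line in log_file:
        match = UA_PATTERN.search(line.strip())
        if match:
            print(match.group(1))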

Several ways to initiate a request

The browser

After Nginx starts, it listens on port 80 by default, so all you need in order to access it is the IP address or domain name. Assuming the IP address is 127.0.0.1, you can enter the following in the address bar:

http://127.0.0.1

When you press Enter, the browser makes a request to the server, just as it does when you normally browse the web.

Python code

Here we use the Requests library to initiate the network request. Create a new local file named gets.py with the following code:

import requests
# Make a request to the target and print the returned HTTP status code
resp = requests.get("http://127.0.0.1")
print(resp.status_code)

Postman

Postman is a powerful tool for debugging web pages and sending HTTP requests. It can simulate a browser, access a specified URL, and display the returned content, as shown in the following figure:

Curl

Curl is a command-line transfer tool driven by URL syntax. It supports not only accessing URLs but also uploading and downloading files, so it can be called a comprehensive transfer tool. It, too, can simulate a browser and access a specified URL, as shown below:

Nginx log results

All four of the methods above were used to make requests to the server, so let's see what information Nginx recorded in its log. Run the following command in the terminal:

sudo cat access.log

This displays the log file, and you can see the records of those requests:

# request record
127.0.0.1 - - [04/Nov/2018:22:19:07 +0800] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
127.0.0.1 - - [04/Nov/2018:22:19:07 +0800] "GET /favicon.ico HTTP/1.1" 404 200 "http://127.0.0.1/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
127.0.0.1 - - [04/Nov/2018:22:20:36 +0800] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
127.0.0.1 - - [04/Nov/2018:22:27:14 +0800] "GET /z_stat.php?id=1256772952&web_id=1256772952 HTTP/1.1" 404 144 "http://appstore.deepin.org/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Deepin-Appstore/4.0.9 Safari/538.1"
127.0.0.1 - - [04/Nov/2018:22:42:10 +0800] "GET / HTTP/1.1" 200 396 "-" "PostmanRuntime/7.3.0"
127.0.0.1 - - [04/Nov/2018:22:42:51 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/7.60.0"

Implement anti-crawler

The goal is to filter out requests from Python and Curl, allow only requests from Firefox and Postman to pass, and return a 403 error for the filtered requests.

The anti-crawler process, as shown in the figure above, is equivalent to building a firewall between the server and the resources: requests on the blacklist are discarded as garbage.

Configure Nginx rules

Nginx provides a configuration file and corresponding rules that let us filter out requests that should not be allowed through, which is what we use for this anti-crawler check. The main Nginx configuration file is usually /etc/nginx/nginx.conf; open it to see where the site configuration files live, then edit the site configuration with the system's editor (the author's system ships with Nano; others may ship with Vim). On the author's system the site configuration lives in /etc/nginx/sites-enabled. Find the location-level configuration in that file and add the following content:

if ($http_user_agent ~* (Python|Curl)) {
    return 403;
}

This checks (case-insensitively) whether the request's User-Agent contains Python or Curl; if it does, 403 is returned. Save the configuration and reload Nginx with the following command:

sudo nginx -s reload

Anti-crawler effect test

Repeat the requests from the earlier steps: initiate them with the browser, the Python code, the Postman tool, and Curl. As you can see from the returned results, there is now a difference (a short verification sketch follows the list):

  • The browser returns the normal page, unaffected;
  • The Python code now gets status code 403 instead of 200;
  • Postman, as before, returns the correct content;
  • Curl, like Python, can no longer access the resource, because its requests are filtered out.
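
Here is a minimal sketch for checking this from code, sending the same request with several different User-Agent values; the server address and the example UA strings are simply the ones used in this article, and only identities not on the blacklist should receive 200:

import requests

url = "http://127.0.0.1"

test_agents = {
    "requests default": None,  # let Requests send its own identifier
    "curl-like": "curl/7.60.0",
    "firefox-like": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0",
}

for name, ua in test_agents.items():
    headers = {"User-Agent": ua} if ua else None
    status = requests.get(url, headers=headers, timeout=5).status_code
    # Expect 403 for the first two identities and 200 for the browser-like one
    print(name, "->", status)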

Tip: You can continue to modify the Nginx configuration and test it; you will find the result is always the same: as long as a request's identity is on the blacklist, it is filtered out and receives a 403 error.

Tip: This is why you usually need to fake a browser identity in the request headers when writing crawler code.

Bypassing the User-Agent anti-crawler measure

Through the study above, we understand the principle of the User-Agent anti-crawler measure and have implemented it with Nginx; next, let's learn how to bypass it.

Bypassing the anti-crawler check with Python

The Requests library allows users to customize request headers, so we can fool the Nginx server by changing the User-Agent value in the request headers to a browser's identifier. Change the earlier Python code to:

import requests
# Fake request headers to fool the server
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:9527.0) Gecko/20100101 Firefox/9527.0"}
resp = requests.get("http://127.0.0.1", headers=headers)
print(resp.status_code)

In the code we use a Firefox request header, and for easier observation we change the browser version number to 9527 to distinguish it from a real browser (this does not affect the result of the request). Run the file and see what is returned:

200

It is 200 rather than 403, which means this type of anti-crawler check has been bypassed (see, this is what the articles online mean when they say you need to modify the request headers to bypass anti-crawler measures; now you can see how it actually works).

Exercise: Test again using Postman

A single test may not be conclusive; you can also use Postman to test again. Remember how to do that?

  • Add the identity to be filtered (Postman) to the Nginx configuration file
  • Reload the configuration file to make it take effect
  • Make a request through Postman to see if it is filtered
  • Make the request again with Postman, this time using a browser's identity, and see whether it is filtered

Tip: This exercise is easier to understand if you do it yourself, and doing it will deepen your impression of the process.

Conclusion

Let's review the flow of the whole article:

We started from the anti-crawler phenomenon, then studied the principle of the User-Agent anti-crawler strategy and implemented it ourselves with Nginx. Finally, we verified our ideas with the Python code and Postman examples. Now the reasoning is clear, and when the target changes its strategy, it is also clear which methods can be used to bypass it.

Consider: in the examples I only wrote a crawler in Python to demonstrate, but what about a crawler written in Java? What about a PHP crawler? What about requests from Android?

You can test them in turn, and the results are sure to teach you something.

If you are a crawler enthusiast or a junior crawler engineer and want to improve your skills, we can learn and exchange ideas together; scan the QR code to follow!

You can obtain the following anti-crawler report document (PDF) by replying "Anti-crawler Report" on the WeChat official account.

It will make you look more professional

Screenshots of the report:

The structure of the report is as follows: