Preface

At present, for the open services of some large websites, a considerable share of the traffic, roughly 20%, comes from crawlers. Crawlers add server load and bandwidth overhead, and can also lead to internal data leakage, among other problems.

Anti-crawler measures have therefore become something server-side teams need to pay attention to. The author's team ran into crawlers maliciously scraping free resources, so we began to study anti-crawler rate-limiting schemes and corresponding crawler alarm schemes.

Main Text

This anti-crawler scheme consists of two parts: blocking and alarming.

Crawler restriction scheme

For restricting crawlers, the following recognition methods are currently available.

1. Restrict by User-Agent

Whether it is a client app, a browser, or a crawler that makes the HTTP request, a User-Agent field is attached in the headers, which roughly declares the caller's identity, for example Taobao/7.7.1 (iPad; iOS 12.1; Scale/2.00) or rest-client/2.1.0 (darwin18.6.0 x86_64) ruby/2.6.3p62. So you can filter on the headers of each request and return an error if the User-Agent is not that of a normal device (see the sketch after this item's pros and cons).

Advantages: simple, with a very low error rate; it can be configured directly in Nginx.

Disadvantages: very easy to bypass, since the User-Agent field can be captured and forged.
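
As a minimal sketch of the server-side variant of this check (assuming a Servlet-based service; the allow-list pattern and class name below are illustrative, not the exact rules we use):

```java
import java.io.IOException;
import java.util.regex.Pattern;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical User-Agent allow-list filter: only UAs matching the pattern
// (e.g. our own app builds) are let through.
// Servlet 4.0+: init() and destroy() have default no-op implementations.
public class UserAgentFilter implements Filter {

    // Illustrative pattern only; real rules would be maintained elsewhere.
    private static final Pattern ALLOWED_UA =
            Pattern.compile("^(Taobao|rest-client)/.*");

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String ua = request.getHeader("User-Agent");
        if (ua == null || !ALLOWED_UA.matcher(ua).matches()) {
            // Not a recognized client: return an error instead of serving data.
            ((HttpServletResponse) resp).sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }
        chain.doFilter(req, resp);
    }
}
```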

2. IP blocking

The access frequency of each IP is monitored at the Nginx gateway layer or in the service itself. If an IP's access frequency exceeds a certain threshold, that IP address is temporarily blocked (see the sketch after this item's pros and cons).

Advantages: simple.

Disadvantages: a slightly more advanced crawler will rotate through a pool of public proxy IP addresses, which this cannot prevent. The error rate is also high: because public IP addresses are blocked, whole areas behind a shared egress IP may become unable to access the service. The threshold is difficult to tune.
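
A minimal sketch of this counting-and-banning idea with Jedis (key names, the per-minute threshold and the ban duration are illustrative assumptions, not values from the article):

```java
import redis.clients.jedis.Jedis;

// Rough per-IP frequency check with a temporary ban, for illustration only.
public class IpBanChecker {

    private static final int THRESHOLD_PER_MINUTE = 100; // assumed threshold
    private static final int BAN_SECONDS = 10 * 60;       // assumed ban duration

    private final Jedis jedis;

    public IpBanChecker(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true if the request from this IP should be rejected. */
    public boolean isBanned(String ip) {
        if (jedis.exists("ban:" + ip)) {
            return true; // still inside the temporary ban window
        }
        String counterKey = "cnt:" + ip;
        long count = jedis.incr(counterKey);
        if (count == 1) {
            jedis.expire(counterKey, 60); // one-minute counting window
        }
        if (count > THRESHOLD_PER_MINUTE) {
            jedis.setex("ban:" + ip, BAN_SECONDS, "1"); // temporarily block this IP
            return true;
        }
        return false;
    }
}
```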

3. User ban

Block access by user ID or mobile phone number.

Advantages: simple, and it has less impact than banning public IP addresses.

Disadvantages: It is difficult to define the monitoring dimension based on user ID. Thresholds are difficult to control.

4. Verification code restrictions

Verification code (CAPTCHA) restriction is common on the web: if a user accesses the service too frequently, the next service invocation redirects the user to a page where a verification code must be entered manually.

Advantages: the error rate is very low, and it contains scripted crawlers very effectively.

Disadvantages: the client needs to be modified to cooperate, and the technical scheme is more complicated.

The scheme we adopted

Based on the existing anti-crawler techniques above and our own business situation, we decided on a multi-dimensional anti-crawler scheme combining UA + IP + user ID, together with shortening the token expiration time.

This check can be introduced in the filter of each request or directly at the gateway layer. Because blacklists for multiple dimensions need to be maintained, they obviously have to be persisted in a database, and a web back office is introduced for maintaining the blacklists manually. However, the database cannot be queried on every request, so the blacklist content is synchronized from the database into the server's local cache every five seconds (or at a longer interval), and each request is then checked against that local cache to decide whether it should be rejected.
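
A minimal sketch of this sync-and-check structure (the BlacklistDao interface, its loadAll() method, and the key prefixes are hypothetical):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: periodically pull the blacklist from the database into a
// local in-memory snapshot, and check each request against the snapshot.
public class BlacklistCache {

    private final AtomicReference<Set<String>> snapshot =
            new AtomicReference<>(Collections.emptySet());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public BlacklistCache(BlacklistDao dao) {
        // Refresh the local cache every 5 seconds instead of querying the DB per request.
        scheduler.scheduleAtFixedRate(
                () -> snapshot.set(dao.loadAll()), 0, 5, TimeUnit.SECONDS);
    }

    /** True if any of the request's dimensions (UA / IP / user ID) is blacklisted. */
    public boolean isBlocked(String ua, String ip, String userId) {
        Set<String> blacklist = snapshot.get();
        return blacklist.contains("ua:" + ua)
                || blacklist.contains("ip:" + ip)
                || blacklist.contains("uid:" + userId);
    }

    public interface BlacklistDao {
        Set<String> loadAll(); // hypothetical DAO returning prefixed blacklist entries
    }
}
```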

Alarm Technical Solution

The restriction scheme above acts directly on the server side. Besides direct restriction, there should also be an appropriate alarm scheme: if a threshold is exceeded, an alarm is generated. The possible implementations of the threshold check are described below.

Threshold limiting within a sliding time window

Limiting each dimension to a threshold within a specified time window needs a concrete implementation scheme; there are the following three options.

  1. Directly set the threshold time window as the Redis key expiration time (fixed window)

Take the IP rate-limit alarm as an example. This method is not a standard rate limiter, because simply using the threshold time window as the Redis key expiration time is inaccurate. For example, suppose there is 1 request at 0:01, 8 requests at 0:59, and 9 requests at 1:01. That is 17 requests between 0:59 and 1:01, yet an alarm set at 10 requests per minute is never triggered, because the counter resets at the window boundary (see the sketch after the metrics below).

Time consumption: low

Space consumption: low, one counter key per monitored value in each dimension

Accuracy: Average
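
A minimal sketch of this fixed-window counter with Jedis (key layout and parameters are illustrative):

```java
import redis.clients.jedis.Jedis;

// Fixed-window counter: the key lives exactly as long as the alarm window,
// so requests straddling a window boundary are counted in separate windows.
public class FixedWindowAlarm {

    private final Jedis jedis;
    private final int windowSeconds;
    private final int threshold;

    public FixedWindowAlarm(Jedis jedis, int windowSeconds, int threshold) {
        this.jedis = jedis;
        this.windowSeconds = windowSeconds;
        this.threshold = threshold;
    }

    /** Records one request for the IP and returns true if the alarm threshold is reached. */
    public boolean recordAndCheck(String ip) {
        String key = "alarm:ip:" + ip;         // illustrative key layout
        long count = jedis.incr(key);
        if (count == 1) {
            jedis.expire(key, windowSeconds);  // window starts at the first request
        }
        return count >= threshold;
    }
}
```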

  2. Use Redis with one key per request (millisecond timestamp)

Set a Redis key of the form ip_{timestamp} for every request, with the expiration time equal to the threshold time window. To check whether the threshold is exceeded, scan all keys with that IP prefix; if the number of matching keys exceeds the threshold, an alarm is generated (see the sketch after the metrics below).

Space consumption: high, roughly one key per request within the window for each monitored value

Accuracy: high, similar to sliding window scheme
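
A rough sketch with Jedis 3.x (the key layout is illustrative; the SCAN over the keyspace is what makes this approach comparatively expensive):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

// One short-lived key per request; the number of live keys with the IP prefix
// approximates the number of requests in the sliding window.
public class PerRequestKeyAlarm {

    private final Jedis jedis;
    private final long windowMillis;
    private final int threshold;

    public PerRequestKeyAlarm(Jedis jedis, long windowMillis, int threshold) {
        this.jedis = jedis;
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    public boolean recordAndCheck(String ip) {
        // Note: two requests in the same millisecond would overwrite each other;
        // a random suffix could be appended to avoid that.
        String key = "req:" + ip + ":" + System.currentTimeMillis();
        jedis.psetex(key, windowMillis, "1"); // each request key lives exactly one window

        // Count live keys for this IP via SCAN (expensive on a large keyspace).
        long count = 0;
        String cursor = ScanParams.SCAN_POINTER_START;
        ScanParams params = new ScanParams().match("req:" + ip + ":*").count(100);
        do {
            ScanResult<String> page = jedis.scan(cursor, params);
            count += page.getResult().size();
            cursor = page.getCursor();
        } while (!"0".equals(cursor));

        return count >= threshold;
    }
}
```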

  3. Redis funnel rate limiting (redis-cell)

Redis has a corresponding funnel (token bucket) rate-limiting extension module, redis-cell, which you need to install yourself; its documentation is easy to find. The command looks like cl.throttle IP_127.0.0.1 15 30 60, meaning: for this key, the funnel capacity is 15 and 30 tokens are replenished every 60 seconds (see the call sketch after the metrics below).

Time consumption: medium

Space consumption: low, one funnel key per monitored value in each dimension

Accuracy: high
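
A rough sketch of invoking this command from Java through Jedis's generic command API, assuming Jedis 3.x and the redis-cell module loaded into Redis (per the redis-cell documentation, the reply is an array of integers whose first element is 0 when the action is allowed and 1 when it is throttled):

```java
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.commands.ProtocolCommand;
import redis.clients.jedis.util.SafeEncoder;

// Sketch of calling redis-cell's CL.THROTTLE through Jedis's generic command API.
public class RedisCellLimiter {

    private static final ProtocolCommand CL_THROTTLE =
            () -> SafeEncoder.encode("CL.THROTTLE");

    private final Jedis jedis;

    public RedisCellLimiter(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true if the request identified by key should be allowed. */
    public boolean allow(String key) {
        // CL.THROTTLE <key> <max_burst> <count> <period>
        // e.g. capacity 15, 30 tokens replenished every 60 seconds (values from the article).
        @SuppressWarnings("unchecked")
        List<Long> reply = (List<Long>) jedis.sendCommand(CL_THROTTLE, key, "15", "30", "60");
        // reply.get(0) == 0 means the action is allowed, 1 means it is rate limited.
        return reply.get(0) == 0;
    }
}
```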

Conclusion

The crawler countermeasures above cover two aspects, restriction/blocking and threshold alarms, and both still have rough edges. On the restriction side, although each request only gains some in-memory logic in the filter, if the blacklist of one dimension (for example IP) holds too much data, it still puts considerable pressure on the server. On the alarm side, limiting access frequency within a given time window is also not perfect. If you have a more suitable solution, please come forward and discuss it together.

In addition, this anti-crawler scheme does not cover a corresponding rate-limiting (throttling) scheme, but one can be built by combining the restriction and alarm schemes above.