
Preface

As a developer who has worked on SEO for a long time, I know that a website's original content is its capital, yet it is hard to prevent other crawlers from scraping it all within a few hours. The contest with crawlers has become a protracted tug of war. Over that long game I have distilled a practical anti-crawler approach. It cannot stop 100% of crawlers, but it blocks most of them, or at least raises their collection cost.

Crawler identification

There are two main identifiers for recognizing crawlers:

1. User-Agent, the user's request header. A normal browser visit carries browser information, for example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36

When a crawler makes a request without deliberately forging the header, it carries an identifier of the language it uses. For example, the default Python requests header is:

python-requests/2.25.1

For PHP, if the 'user_agent' line in the php.ini configuration file is enabled, the header will be its configured value ('PHP' by default); if it is not enabled, no User-Agent header is sent at all.

Forging the User-Agent is the simplest trick and costs nothing, so unless the value is blatant enough to intercept outright, we generally do not restrict access based on it.

2. IP. Every visit carries the user's IP, so we can judge whether an IP belongs to a crawler by its number of visits within a short period. Of course, an IP can also be forged with proxy IPs.
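The User-Agent check described above can be sketched in a few lines of PHP. The marker list below is illustrative only, not exhaustive, and as noted the header is trivially forged, so this is no more than a cheap first filter:

```php
<?php
// A minimal sketch of User-Agent based identification. The marker
// list is illustrative; a forged header passes straight through.
function looksLikeCrawler(string $userAgent): bool
{
    if ($userAgent === '') {
        return true; // real browsers always send a User-Agent
    }
    $markers = ['python-requests', 'curl', 'scrapy', 'go-http-client', 'php'];
    $ua = strtolower($userAgent);
    foreach ($markers as $marker) {
        if (strpos($ua, $marker) !== false) {
            return true;
        }
    }
    return false;
}
```

In practice this would be called with the value of the User-Agent request header before any heavier IP-based check.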

Website Crawler Warning

The approach mainly uses Kafka's high throughput and stability to record access statistics without affecting the service itself. A Redis sorted set can then be used to compute the IP addresses with the highest traffic within a given period, which serves as an early warning. Take a PHP website as an example.

1. At the entry point of the website, check whether the user's IP is in the Redis set of banned IPs:

$ip = Tools::getClientIp();
$res = $redis->get('ip:f:' . $ip);
if ($res) {
    return false;
}

Tools::getClientIp()

public static function getClientIp()
{
    $ip = '';
    if (getenv('HTTP_CLIENT_IP')) {
        $ip = getenv('HTTP_CLIENT_IP');
    } else if (getenv('HTTP_X_FORWARDED_FOR')) {
        // X-Forwarded-For may hold a chain of IPs; take the first one
        $ip = getenv('HTTP_X_FORWARDED_FOR');
        $ips = explode(',', $ip);
        $ip = trim($ips[0]);
    } else if (getenv('REMOTE_ADDR')) {
        $ip = getenv('REMOTE_ADDR');
    }
    return $ip;
}

2. Push valid IP addresses to the Kafka message queue. Kafka is not mandatory; other message queues such as RabbitMQ would also work. But because this sits at the website entrance, where concurrency is relatively high, and Kafka has greater advantages in stability and in handling heavy concurrency, it is the choice here.

3. In the Kafka consumer program, push the received user IPs into a Redis sorted set keyed by the hour. This step makes it possible to find, for each time period, the most active IPs; through manual observation or a scheduled script, abnormal IPs can then be banned or an email alert sent.

// set the key by hour
$key = 'list:' . date('YmdH');
$redis->zincrby($key, 1, $ip);
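To see what this hourly sorted set ends up holding, here is a plain-PHP simulation with made-up request data; the arrays stand in for ZINCRBY and a ZREVRANGE ... WITHSCORES read-out, and the IPs are invented for illustration:

```php
<?php
// Simulate the hourly counter: each request increments its IP's
// score, then the heaviest IPs are read out, highest first.
$counts = [];
$requests = ['1.1.1.1', '2.2.2.2', '1.1.1.1', '3.3.3.3', '1.1.1.1', '2.2.2.2'];
foreach ($requests as $ip) {
    $counts[$ip] = ($counts[$ip] ?? 0) + 1; // the ZINCRBY step
}
arsort($counts);                             // sort by score, descending
$topTwo = array_slice($counts, 0, 2, true);  // the ZREVRANGE step
```

A scheduled script would run the equivalent of the last two lines against each hour's Redis key and flag the top entries.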

Next, use the Redis funnel to add limits on the access rate of abnormal IP addresses.

Anti-crawler rate limiting with the Redis funnel

Since Redis 4.0, a rate-limiting module has been available: redis-cell. It uses the funnel (leaky bucket) algorithm and provides an atomic rate-limiting command. The module has to be installed separately; installation guides are easy to find online, so the details are skipped here. Let's look at the command's parameters.

> cl.throttle 127.0.0.1 14 30 60 1
1) (integer) 0   # 0 means allowed, 1 means rejected
2) (integer) 15  # capacity of the funnel
3) (integer) 14  # remaining space of the funnel (left_quota)
4) (integer) -1  # if rejected, how long until retry, in seconds
5) (integer) 2   # how long until the funnel is completely empty, in seconds

This command means: allow the IP 127.0.0.1 at most 30 actions per 60s; the funnel has an initial capacity of 15 (the argument counts from 0, so 14 means 15), and each action takes 1 unit of space by default (the last argument is optional). If a request is rejected, take the fourth value of the returned array as the retry delay, either sleeping for that long or retrying via an asynchronous scheduled task.
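The algorithm behind CL.THROTTLE can be illustrated in plain PHP. This is only a sketch of the funnel (leaky bucket) idea, not a replacement for redis-cell, which performs the same bookkeeping atomically on the server; the class name and rates below are made up for the example:

```php
<?php
// Leaky-bucket sketch: the bucket holds `capacity` units and leaks
// `leakRate` units per second; a request is rejected when adding
// one more unit would overflow the bucket.
class Funnel
{
    private float $capacity;
    private float $leakRate;    // units leaked per second
    private float $water = 0;
    private float $lastTime;

    public function __construct(float $capacity, float $leakRate, float $now)
    {
        $this->capacity = $capacity;
        $this->leakRate = $leakRate;
        $this->lastTime = $now;
    }

    public function allow(float $now): bool
    {
        // leak water in proportion to the time elapsed
        $this->water = max(0, $this->water - ($now - $this->lastTime) * $this->leakRate);
        $this->lastTime = $now;
        if ($this->water + 1 <= $this->capacity) {
            $this->water += 1;
            return true;
        }
        return false;
    }
}
```

With a capacity of 15 and a leak rate of 0.5 units per second (30 per 60s), this mirrors the cl.throttle 127.0.0.1 14 30 60 example above: a burst fills the bucket after 15 requests, and further requests are rejected until enough has leaked out.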

$res = $redis->executeRaw(['cl.throttle', $ip, 14, 30, 60]);
if ($res[0]) {
    // rejected: ban the IP for the suggested retry period
    $redis->setex('ip:f:' . $ip, $res[3], 1);
}

Conclusion

1. The sorted sets of per-period IP access counts accumulate every day; write the highest-traffic entries to the database and delete the old sets periodically.

2. IP funnel rate limiting is a risky operation, so keep observing the access logs to find the crawlers' patterns and determine reasonable rate-limiting parameters.