Most of the time, we can access many pages or request many interfaces without logging in, because the site itself needs to do SEO and does not put a login restriction on every page.

However, crawling without logging in has two main disadvantages.

  • Pages behind a login wall cannot be crawled. For example, a forum may require login to view resources, or a blog may require login to read the full text; such pages can only be viewed and crawled with a logged-in account.

  • Although some pages and interfaces can be requested directly, once the requests become frequent, access is easily restricted or the IP is banned outright. These problems are much less likely to occur after logging in, so logged-in sessions are less likely to trigger anti-crawling measures.

Let’s run a simple experiment for the second case, taking Weibo as an example. First find an Ajax interface, such as https://m.weibo.cn/api/container/getIndex?uid=1638782947&luicode=20000174&type=uid&value=1638782947&containerid=1005051638782947, the information interface of the Sina Finance official Weibo account. If we open it directly in a browser, the returned data is in JSON format, as shown in the figure below. It contains some information about the Sina Finance official Weibo, and we can extract that information simply by parsing the JSON.
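As an aside, the interface above can be requested and parsed with Requests. Below is a minimal sketch: the `'100505'` containerid prefix is inferred from the URL quoted above, the helper names are my own, and the response layout may change over time, so treat this as illustrative only.

```python
import requests

BASE_URL = 'https://m.weibo.cn/api/container/getIndex'

def build_params(uid):
    """Build the query parameters used by the getIndex interface.

    The parameter names come from the URL quoted above; note that the
    containerid is simply '100505' followed by the uid.
    """
    return {
        'uid': uid,
        'luicode': '20000174',
        'type': 'uid',
        'value': uid,
        'containerid': '100505' + uid,
    }

def fetch_user_index(uid):
    """Request the interface anonymously and return the parsed JSON dict.

    Without login this interface is rate-limited, so callers should be
    prepared for failures and back off between requests.
    """
    response = requests.get(BASE_URL, params=build_params(uid), timeout=5)
    response.raise_for_status()
    return response.json()
```

Calling `fetch_user_index('1638782947')` anonymously works only until the frequency detection kicks in, which is exactly the behavior this experiment demonstrates.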

However, without login this interface has request-frequency detection. If it is visited too often within a short period, for example by opening the link and refreshing it constantly, we will see a prompt saying the requests are too frequent, as shown in the figure below.

If we then open another browser window, visit https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/, log in to a Weibo account, and open the same link again, the page shows the interface's results normally, while the non-logged-in page still says the requests are too frequent, as shown in the figure below.

In the figure, one request was made to the interface without logging in and the other after logging in, and the interface link is exactly the same in both cases. Without login it cannot be accessed normally, while after login the results display normally.

Therefore, using logged-in accounts can reduce the probability of being banned.

We can log in before crawling, and the probability of being banned will then be much smaller, though the risk cannot be ruled out entirely. If we keep making frequent requests with the same account, the account itself may be banned for requesting too often.

If we need to crawl on a large scale, we need many accounts and should pick one at random for each request. This lowers the access frequency of any single account, so the probability of being banned drops greatly.
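Picking a random account per request can be sketched in a few lines; the account names and cookie values here are made up for illustration.

```python
import random

# Hypothetical pool: each account name maps to that account's Cookies.
account_cookies = {
    'account_a': {'SUB': 'cookie-a'},
    'account_b': {'SUB': 'cookie-b'},
    'account_c': {'SUB': 'cookie-c'},
}

def pick_account():
    """Choose a random account and its Cookies for the next request,
    so that no single account carries all the request traffic."""
    username = random.choice(list(account_cookies))
    return username, account_cookies[username]
```

The Cookies pool built in this section is essentially a robust, persistent version of this idea, with expiry detection and regeneration added on top.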

So how do we maintain the login information of multiple accounts? This is where the Cookies pool comes in. Let's look at how to build one.

I. Objectives of this section

We take Sina Weibo as an example to build a Cookies pool. The pool stores many Sina Weibo accounts and their post-login Cookies, and it regularly checks each cookie's validity. If a cookie becomes invalid, the pool deletes it and simulates login to generate a new one. The pool also needs one very important interface: an interface that returns random Cookies. Once the pool is running, a crawler only needs to request this interface to obtain a random cookie and use it to crawl.

Thus, the Cookies pool needs to have several core functions such as automatic generation of Cookies, periodic detection of Cookies, and provision of random Cookies.

II. Preparation

You definitely need some Weibo accounts before you start. You also need the Redis database installed and running, and Python's redis-py, Requests, Selenium, and Flask libraries installed. In addition, you need to install Chrome and configure ChromeDriver.

III. Cookies pool architecture

The Cookies pool has an architecture similar to that of the proxy pool, consisting of four core modules, as shown in the figure below.

The Cookies pool is divided into four basic modules: a storage module, a generation module, a detection module, and an interface module. Their functions are as follows.

  • The storage module is responsible for storing the user name and password of each account, together with the Cookies corresponding to each account. It also needs to provide methods for convenient access operations.

  • The generation module is responsible for producing new Cookies. It fetches account user names and passwords one by one from the storage module and simulates login to the target site. If the login succeeds, it hands the resulting Cookies to the storage module for saving.

  • The detection module periodically checks the Cookies in the database. We set a detection link for each site (different sites use different links), and the module requests that link with each account's Cookies in turn. If the returned status is valid, the Cookies have not expired; otherwise they are marked invalid and removed, and we then wait for the generation module to regenerate them.

  • The interface module uses an API to provide an interface for external services. Since multiple Cookies may be available, the interface returns one at random, so that every cookie has a chance of being picked. The more Cookies there are, the lower the probability of any single one being picked, which reduces the risk of it being banned.

The basic idea for designing a cookie pool is similar to the proxy pool described earlier. Next we design the overall architecture and then implement the Cookies pool in code.

IV. Implementation of the Cookies pool

First, let's understand how each module is implemented.

1. Storage module

In fact, we only need to store account information and Cookies. An account consists of a user name and password, so we can save the mapping from user name to password. Cookies can be stored as JSON strings. Later, when generating Cookies, we need to know which accounts already have Cookies and which do not, so we also need to save the user name each set of Cookies belongs to; this is a mapping from user name to Cookies. We therefore have two sets of mappings, and Redis's Hash is a natural fit: we create two hashes with the structure shown below.

The key of each Hash is the account's user name, and the value is the corresponding password or Cookies. Note also that since the Cookies pool needs to be extensible, the accounts and Cookies stored may not come only from Weibo in this example; other sites can also connect to this pool. So we give the Hash names a two-level structure: for example, the Hash holding accounts can be named accounts:weibo, and the one holding Cookies cookies:weibo. To extend to Zhihu's Cookies pool, we would use accounts:zhihu and cookies:zhihu, which is convenient.

Next we create a storage module class that provides some basic Hash operations as follows:

import random

import redis

from cookiespool.config import *  # provides REDIS_HOST, REDIS_PORT, REDIS_PASSWORD

class RedisClient(object):
    def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD):
        """Initialize the Redis connection. :param host: address :param port: port :param password: password"""
        self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
        self.type = type
        self.website = website

    def name(self):
        """Get the Hash name. :return: Hash name"""
        return '{type}:{website}'.format(type=self.type, website=self.website)

    def set(self, username, value):
        """Set a key-value pair. :param username: user name :param value: password or Cookies :return: set result"""
        return self.db.hset(self.name(), username, value)

    def get(self, username):
        """Get a value by user name. :param username: user name :return: password or Cookies"""
        return self.db.hget(self.name(), username)

    def delete(self, username):
        """Delete a key-value pair. :param username: user name :return: delete result"""
        return self.db.hdel(self.name(), username)

    def count(self):
        """Get the number of entries. :return: entry count"""
        return self.db.hlen(self.name())

    def random(self):
        """Get a random value, used to fetch random Cookies. :return: random Cookies"""
        return random.choice(self.db.hvals(self.name()))

    def usernames(self):
        """Get all account user names. :return: all user names"""
        return self.db.hkeys(self.name())

    def all(self):
        """Get all key-value pairs. :return: mapping of user names to passwords or Cookies"""
        return self.db.hgetall(self.name())

Here we create a RedisClient class whose __init__() method takes two key arguments, type and website, representing the storage type and the site name; these are the two fields used to build the Hash name. For the Hash storing accounts, type is accounts and website is weibo; for the Hash storing Cookies, type is cookies and website is weibo.

There are also several fields that represent Redis connection information, which is obtained during initialization and then the StrictRedis object is initialized to establish a Redis connection.

The name() method concatenates type and website to form the Hash name. The set(), get(), and delete() methods set, get, and delete a key-value pair of the Hash, and the count() method returns the length of the Hash.

The more important method is random(), which randomly selects one value from the Hash and returns it. Every call returns random Cookies, so this method can be hooked up to the interface module to serve random Cookies over the API.

2. Generation module

The generation module obtains each account's information, simulates login, and then generates and saves Cookies. First we read both hashes to see which accounts exist in the accounts Hash but not in the Cookies Hash, i.e. which accounts have not had Cookies generated yet; then we traverse those remaining accounts and generate their Cookies.

The main logic here is to find out which accounts do not have corresponding Cookies, and then obtain Cookies one by one, with the code as follows:

for username in accounts_usernames:
    if username not in cookies_usernames:
        password = self.accounts_db.get(username)
        print('Generating Cookies', 'account', username, 'password', password)
        result = self.new_cookies(username, password)
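The comparison between the two hashes boils down to a set difference. With plain dictionaries standing in for the Redis storage (the helper name here is illustrative, not from the project), the idea is:

```python
def accounts_missing_cookies(accounts, cookies):
    """Return the user names that have a stored password but no Cookies yet.

    Both arguments mimic the Redis hashes: mappings from user name to
    password (accounts) or to a Cookies JSON string (cookies).
    """
    return [username for username in accounts if username not in cookies]

# Example: two stored accounts, only one of which already has Cookies.
accounts = {'user1': 'pass1', 'user2': 'pass2'}
cookies = {'user1': '{"SUB": "..."}'}
missing = accounts_missing_cookies(accounts, cookies)
```

Here `missing` contains only `'user2'`, the account the generation module would log in with next.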

Because we are targeting Sina Weibo, and we already cracked Weibo's four-grid captcha in an earlier chapter, we can reuse that code directly here. We just need to add a method that obtains Cookies and returns different results for different situations. The logic is as follows:

def get_cookies(self):
    return self.browser.get_cookies()

def main(self):
    self.open()
    if self.password_error():
        return {
            'status': 2,
            'content': 'Wrong username or password'
        }
    # If no verification code is required, login has already succeeded
    if self.login_successfully():
        cookies = self.get_cookies()
        return {
            'status': 1,
            'content': cookies
        }
    # Get the captcha image, recognize it and drag accordingly
    image = self.get_image('captcha.png')
    numbers = self.detect_image(image)
    self.move(numbers)
    if self.login_successfully():
        cookies = self.get_cookies()
        return {
            'status': 1,
            'content': cookies
        }
    else:
        return {
            'status': 3,
            'content': 'Login failed'
        }

The result returned here is a dictionary carrying a status code, status, so the generation module can process each outcome differently. If the status code is 1, Cookies were obtained successfully and we only need to save them to the database. If the status code is 2, the user name or password is wrong, and we should delete the account information from the database. If the status code is 3, login failed for some other reason: we cannot tell whether the credentials are wrong or the Cookies simply could not be obtained, so we just print a prompt and move on to the next account. A similar implementation looks like this:

result = self.new_cookies(username, password)
# Obtained successfully
if result.get('status') == 1:
    cookies = self.process_cookies(result.get('content'))
    print('Cookies obtained successfully', cookies)
    if self.cookies_db.set(username, json.dumps(cookies)):
        print('Saved Cookies successfully')
# Wrong user name or password, delete the account
elif result.get('status') == 2:
    print(result.get('content'))
    if self.accounts_db.delete(username):
        print('Account deleted successfully')
else:
    print(result.get('content'))

To extend to other sites, simply implement the new_cookies() method and return the simulated-login result in the same form, e.g. 1 for success and 2 for a wrong user name or password.

When this code runs, every account that has not yet generated Cookies is traversed once, and login is simulated to generate new Cookies for it.

3. Detection module

We can now generate Cookies with the generation module, but Cookies still expire: for example, they may become invalid after too long a time, or be invalidated by overly frequent use so that web pages can no longer be requested normally. If we encounter Cookies like these, we certainly should not keep them in the database.

Therefore, we also need a scheduled detection module that traverses all Cookies in the pool and sets a corresponding detection link for each site. We request that link with each account's Cookies in turn. If the request succeeds, or the status code is valid, the Cookies are still good; if the request fails, or normal data cannot be retrieved (for example, we are redirected straight back to the login page or to a verification page), the Cookies are invalid and we remove them from the database.

Once Cookies are removed, the generation module described above will find that the Cookies Hash has fewer entries than the accounts Hash, treat that account as having no Cookies, log in again with it, and thereby refresh the account's Cookies.

So all the detection module needs to do is detect failed Cookies and remove them from the database.

To achieve general extensibility, we first define a detector’s parent class and declare some common components, which are implemented as follows:

from cookiespool.db import RedisClient  # the storage class defined above

class ValidTester(object):
    def __init__(self, website='default'):
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)

    def test(self, username, cookies):
        raise NotImplementedError

    def run(self):
        cookies_groups = self.cookies_db.all()
        for username, cookies in cookies_groups.items():
            self.test(username, cookies)

Here we define a parent class ValidTester. Its __init__() takes the site name website and creates two storage-module objects, cookies_db and accounts_db, connecting to the Cookies Hash and the accounts Hash respectively. The run() method is the entry point: it iterates over all the Cookies and calls the test() method on each. test() itself is not implemented here, which means a subclass must override it; each subclass is responsible for checking a different website. For example, we can define a WeiboValidTester that implements its own test() method to check whether Weibo Cookies are valid and handle the result accordingly. So we add a subclass that inherits ValidTester and overrides test(), implemented as follows:

import json

import requests
from requests.exceptions import ConnectionError

class WeiboValidTester(ValidTester):
    def __init__(self, website='weibo'):
        ValidTester.__init__(self, website)

    def test(self, username, cookies):
        print('Testing Cookies', 'username', username)
        try:
            cookies = json.loads(cookies)
        except TypeError:
            print('Cookies are not valid JSON', username)
            self.cookies_db.delete(username)
            print('Deleted Cookies', username)
            return
        try:
            test_url = TEST_URL_MAP[self.website]
            response = requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False)
            if response.status_code == 200:
                print('Cookies valid', username)
                print('Partial test result', response.text[0:50])
            else:
                print(response.status_code, response.headers)
                print('Cookies invalid', username)
                self.cookies_db.delete(username)
                print('Deleted Cookies', username)
        except ConnectionError as e:
            print('Exception occurred', e.args)

The test() method first converts the Cookies string to a dictionary to check its format; if the format is wrong, the Cookies are deleted. If the format is correct, the method requests the detection URL with the Cookies. Here it checks Weibo, and the URL can be an Ajax interface. For configurability, the test URLs are defined as a dictionary, as shown below:

TEST_URL_MAP = {
    'weibo': 'https://m.weibo.cn/'
}

If we want to extend to other sites, we can add their entries to this dictionary. For Weibo, we request the target site with the Cookies, forbid redirects, and set a timeout, then check the status code of the Response. If a 200 status code comes back directly, the Cookies are valid; otherwise we were probably redirected to the login page, meaning the Cookies have expired. If the Cookies have failed, we simply remove them from the Cookies Hash.
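The decision rule just described, where redirects are disabled so that only a direct 200 counts as valid, can be isolated into a tiny helper. This is a sketch of the rule only; real sites may also require checking the response body.

```python
def cookies_valid_by_status(status_code):
    """Judge Cookies validity from the status code of a request made
    with allow_redirects=False: an expired session usually answers
    with a 3xx redirect to the login page, so only 200 passes."""
    return status_code == 200
```

With `allow_redirects=True` the redirect would be followed and the login page itself would return 200, hiding the expiry; that is why the detection request disables redirects.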

4. Interface module

The generation and detection modules, run on a schedule, keep the Cookies detected and updated in real time. But the Cookies ultimately have to be used by crawlers, and one Cookies pool may serve many crawlers, so we also need to define a Web interface from which a crawler can fetch random Cookies. We use Flask to build it; the code is as follows:

import json

from flask import Flask, g

app = Flask(__name__)

# Configuration dictionary for the generation module
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}

@app.route('/')
def index():
    return '<h2>Welcome to Cookie Pool System</h2>'

def get_conn():
    for website in GENERATOR_MAP:
        if not hasattr(g, website + '_cookies'):
            setattr(g, website + '_cookies', eval('RedisClient("cookies", "' + website + '")'))
    return g

@app.route('/<website>/random')
def random(website):
    """Get random Cookies; access address such as /weibo/random :return: random Cookies"""
    g = get_conn()
    cookies = getattr(g, website + '_cookies').random()
    return cookies

We also want the configuration to be generic so that different sites can connect, so the first segment of the interface path is the site name and the second is the action. For example, /weibo/random fetches random Weibo Cookies, and /zhihu/random fetches random Zhihu Cookies.
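This routing scheme can be tried out without Redis or deployment by using Flask's test client against a stand-in pool. The stored cookie string below is made up, and the route name differs from the real module to avoid shadowing the random library.

```python
import random

from flask import Flask

app = Flask(__name__)

# Stand-in for the Redis-backed storage: site name -> cookie JSON strings.
FAKE_POOL = {
    'weibo': ['{"SUB": "fake-value"}'],
}

@app.route('/<website>/random')
def random_cookies(website):
    """Return one random cookie string for the given site."""
    return random.choice(FAKE_POOL[website])

client = app.test_client()
body = client.get('/weibo/random').get_data(as_text=True)
```

The variable path segment `<website>` is what lets a single route serve /weibo/random, /zhihu/random, and any other configured site.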

5. Scheduling module

Finally, we add a scheduling module to make these modules work together. Its main job is to drive the other modules on a schedule, with each module running in a separate process, as shown below:

import time
from multiprocessing import Process

from cookiespool.api import app
from cookiespool.config import *
from cookiespool.generator import *
from cookiespool.tester import *

class Scheduler(object):
    @staticmethod
    def valid_cookie(cycle=CYCLE):
        while True:
            print('Cookies detection process is running')
            try:
                for website, cls in TESTER_MAP.items():
                    tester = eval(cls + '(website="' + website + '")')
                    tester.run()
                    print('Cookies detection completed')
                    del tester
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def generate_cookie(cycle=CYCLE):
        while True:
            print('Cookies generation process is running')
            try:
                for website, cls in GENERATOR_MAP.items():
                    generator = eval(cls + '(website="' + website + '")')
                    generator.run()
                    print('Cookies generation completed')
                    generator.close()
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def api():
        print('API interface is running')
        app.run(host=API_HOST, port=API_PORT)

    def run(self):
        if API_PROCESS:
            api_process = Process(target=Scheduler.api)
            api_process.start()

        if GENERATOR_PROCESS:
            generate_process = Process(target=Scheduler.generate_cookie)
            generate_process.start()

        if VALID_PROCESS:
            valid_process = Process(target=Scheduler.valid_cookie)
            valid_process.start()

Two important configurations are used here: the dictionaries mapping site names to the generation-module and test-module class names, as shown below:

# Generation module classes; to extend to other sites, configure here
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}

# Test module classes; to extend to other sites, configure here
TESTER_MAP = {
    'weibo': 'WeiboValidTester'
}

This configuration enables dynamic extension: the key is the site name and the value is the class name. To configure another site, just add an entry to the dictionary; for example, to extend the generation module to Zhihu, configure:

GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator',
    'zhihu': 'ZhihuCookiesGenerator',
}

The Scheduler iterates over these dictionaries, dynamically creates an object of each class with eval(), and runs each module by calling its run() entry method. The modules' processes are created with the Process class from multiprocessing and started by calling start().
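eval() on class-name strings works, but mapping site names directly to class objects avoids evaluating strings as code. A sketch of the alternative, using a dummy generator class in place of the real one:

```python
class DemoCookiesGenerator(object):
    """Dummy stand-in for a real generator such as WeiboCookiesGenerator."""
    def __init__(self, website='weibo'):
        self.website = website

    def run(self):
        return 'generated for ' + self.website

# Key: site name, value: the class object itself (not its name as a string).
GENERATOR_CLASSES = {
    'weibo': DemoCookiesGenerator,
}

results = []
for website, cls in GENERATOR_CLASSES.items():
    generator = cls(website=website)  # instantiate without eval()
    results.append(generator.run())
```

The trade-off is that the eval()-based version lets the configuration file stay free of imports, while the class-object version is safer and easier for tools to analyze.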

In addition, each module has a switch that can be freely turned on or off in the configuration file, as follows:

# Generation module switch
GENERATOR_PROCESS = True
# Detection module switch
VALID_PROCESS = False
# Interface module switch
API_PROCESS = True

If the value is set to True, the module is enabled. If the value is set to False, the module is disabled.

At this point, our Cookies pool is complete. Next we turn on all the modules at once by starting the scheduler, and the console output looks something like this:

API interface is running
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Cookies generation process is running
Cookies detection process is running
Generating Cookies account 14747223314 password ASDF1129
Testing Cookies username 14747219309
Cookies valid 14747219309
Testing Cookies username 14740626332
Cookies valid 14740626332
Testing Cookies username 14740691419
Cookies valid 14740691419
Testing Cookies username 14740618009
Cookies valid 14740618009
Testing Cookies username 14740636046
Cookies valid 14740636046
Testing Cookies username 14747222472
Cookies valid 14747222472
Cookies detection completed
Verification code positions 420 580 384 544
Successfully matched drag order [1, 4, 2, 3]
Successfully obtained Cookies {'SUHB': '08J77UIj4w5n_T', 'SCF': 'AimcUCUVvHjswSBmTswKh0g4kNj4K7_U9k57YzxbqFt4SFBhXq3Lx4YSNO9VuBV841BMHFIaH4ipnfqZnK7W6Qs.', 'SSOLoginState': '1501439488', '_T_WM': '99b7d656220aeb9207b5db97743adc02', 'M_WEIBOCN_PARAMS': 'uicode%3D20000174', 'SUB': '_2A250elZQDeRhGeBM6VAR8ifEzTuIHXVXhXoYrDV6PUJbkdBeLXTxkW17ZoYhhJ92N_RGCjmHpfv9TB8OJQ..'}
Saved Cookies successfully

The above is the console output of the running program. We can see that each module starts normally: the detection module tests Cookies one by one, and the generation module obtains Cookies for accounts that do not yet have them. The modules run in parallel without interfering with each other.

We can access the interface to get random Cookies, as shown in the figure below.

The crawler only needs to request this interface to obtain random Cookies.
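Since the pool stores each account's Cookies as a JSON string, the crawler must decode the interface's response before passing it to Requests. A short sketch, with a fabricated response body:

```python
import json

def cookies_from_pool(text):
    """Convert the JSON string returned by /weibo/random into a dict
    suitable for requests.get(url, cookies=...)."""
    return json.loads(text)

# Example with a made-up response body from the pool's API:
raw = '{"SUB": "xxx", "SSOLoginState": "1501439488"}'
cookie_dict = cookies_from_pool(raw)
```

The resulting dict can then be attached to a request, e.g. `requests.get('https://m.weibo.cn/', cookies=cookie_dict)`.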

V. Code for this section

This section of code address is: https://github.com/Python3WebSpider/CookiesPool.
