Most of the time we can access some pages or request some interfaces without logging in, because the site itself needs to do SEO and does not put a login restriction on every page.
However, crawling without logging in has two main disadvantages: some pages and interfaces can only be viewed after logging in, and requests made without login are far more likely to trigger frequency limits.
Let’s do a simple experiment with the second case, using Weibo as an example. First find an Ajax interface, for example the information interface of the Sina Finance official Weibo account: https://m.weibo.cn/api/container/getIndex?uid=1638782947&luicode=20000174&type=uid&value=1638782947&containerid=1005051638782947. If it is accessed directly in a browser, the data returned is in JSON format, as shown in the figure below, and contains some information about the Sina Finance official Weibo account; parsing the JSON directly extracts this information.
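As a minimal sketch of this first step (the response field names such as data and userInfo are assumptions about the JSON structure and may differ):

import requests

url = ('https://m.weibo.cn/api/container/getIndex?uid=1638782947&luicode=20000174'
       '&type=uid&value=1638782947&containerid=1005051638782947')
response = requests.get(url)
data = response.json()
# Print a small part of the profile information contained in the JSON
print(data.get('data', {}).get('userInfo', {}).get('screen_name'))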
However, this interface performs request-frequency detection when you are not logged in. If you visit it too often within a short period, for example by opening the link and refreshing it constantly, you will see a prompt that requests are too frequent, as shown in the figure below.
If we open another browser window, go to https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/, log in with a Weibo account, and then open the same link again, the page shows the interface result normally, while the window that is not logged in still shows that requests are too frequent, as shown in the figure below.
In the figure, the left side is the result of requesting the interface while logged in to an account, and the right side is the result of requesting it without logging in. The interface link is exactly the same in both cases: the non-login state cannot access it normally, while the login state displays the result normally.
Therefore, logging in to an account can reduce the probability of being banned.
We can try to log in before crawling; the probability of being banned is then much smaller, but the risk cannot be ruled out completely. If you keep making frequent requests with the same account, that account itself may be blocked for requesting too frequently.
If we need to do large-scale crawling, we need many accounts and should pick one at random for each request. This lowers the access frequency of any single account, and the probability of being blocked is greatly reduced.
So how do we maintain the login information of multiple accounts? This is where the Cookies pool comes in. Let’s look at how to build one.
I. Objectives of this section
We take Sina Weibo as an example to walk through building a Cookies pool. The pool stores many Sina Weibo accounts and the Cookies obtained after logging in with them, and it also needs to check the validity of each set of Cookies regularly: if some Cookies are invalid, it deletes them and simulates login again to generate new ones. At the same time, the Cookies pool needs a very important interface, namely an interface for obtaining random Cookies. Once the pool is running, a crawler only needs to request this interface to obtain random Cookies and use them to crawl.
Thus, the Cookies pool needs to have several core functions such as automatic generation of Cookies, periodic detection of Cookies, and provision of random Cookies.
II. Preparation
Before you start, you definitely need some Weibo accounts. You also need the Redis database installed and running, and Python’s redis-py, requests, Selenium, and Flask libraries installed. In addition, you need to install Chrome and configure ChromeDriver.
III. Cookies pool architecture
The Cookies pool has an architecture similar to the proxy pool: it consists of four core modules, as shown in the figure below.
The Cookies pool architecture is divided into four basic modules: the storage module, the generation module, the detection module, and the interface module. Their functions are as follows: the storage module stores the accounts and the Cookies generated for them; the generation module simulates login with the accounts to generate Cookies; the detection module periodically tests whether the stored Cookies are still valid; and the interface module exposes a Web API that returns random Cookies.
The basic idea behind the Cookies pool is similar to that of the proxy pool described earlier. With the overall architecture settled, we next implement the Cookies pool in code.
IV. Implementation of the Cookies pool
First, let’s look at how each module is implemented.
1. Storage module
In fact, all we need to store is the account information and the Cookies. An account consists of a username and a password, so we can store the mapping from username to password. Cookies can be stored as JSON strings, but later we need to generate Cookies for each account, and when doing so we must know which accounts already have Cookies and which do not. So we also need to store the username that each set of Cookies belongs to, which is in fact a mapping from username to Cookies. With these two sets of mappings, Redis’s Hash comes to mind naturally, so we create two Hashes with the structure shown below.
The key of each Hash is the username, and the value is the corresponding password or Cookies. In addition, note that the Cookies pool needs to be extensible: the accounts and Cookies stored may not only belong to Weibo as in this example, and other sites can also connect to this pool. So the Hash names are given a two-level structure: for example, the Hash that stores accounts can be named accounts:weibo and the Hash that stores Cookies cookies:weibo. To extend the pool to Zhihu, we can simply use accounts:zhihu and cookies:zhihu, which is convenient.
Next we create a storage module class that provides some basic Hash operations as follows:
import random
import redis
from cookiespool.config import *


class RedisClient(object):
    def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD):
        """
        Initialize the Redis connection
        :param type: type of the Hash, e.g. accounts or cookies
        :param website: site name, e.g. weibo
        :param host: Redis address
        :param port: Redis port
        :param password: Redis password
        """
        self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
        self.type = type
        self.website = website

    def name(self):
        """
        Get the Hash name
        :return: Hash name
        """
        return "{type}:{website}".format(type=self.type, website=self.website)

    def set(self, username, value):
        """
        Set a key-value pair
        :param username: username
        :param value: password or Cookies
        :return: set result
        """
        return self.db.hset(self.name(), username, value)

    def get(self, username):
        """
        Get the value for a username
        :param username: username
        :return: password or Cookies
        """
        return self.db.hget(self.name(), username)

    def delete(self, username):
        """
        Delete a key-value pair
        :param username: username
        :return: delete result
        """
        return self.db.hdel(self.name(), username)

    def count(self):
        """
        Get the number of entries in the Hash
        :return: number of entries
        """
        return self.db.hlen(self.name())

    def random(self):
        """
        Get a random value, used to obtain random Cookies
        :return: random Cookies
        """
        return random.choice(self.db.hvals(self.name()))

    def usernames(self):
        """
        Get all usernames (keys) in the Hash
        :return: list of usernames
        """
        return self.db.hkeys(self.name())

    def all(self):
        """
        Get all key-value pairs
        :return: dict mapping username to value
        """
        return self.db.hgetall(self.name())
Here we create a RedisClient class. Its __init__() method takes two key arguments, type and website, representing the type and the site name; these are the two fields used to concatenate the Hash name. For the Hash that stores accounts, type is accounts and website is weibo; for the Hash that stores Cookies, type is cookies and website is weibo.
The remaining arguments are the Redis connection information; they are received during initialization and used to create a StrictRedis object, which establishes the Redis connection.
The name() method concatenates type and website to form the Hash name. The set(), get(), and delete() methods set, get, and delete a key-value pair of the Hash respectively, and the count() method returns the length of the Hash.
The more important method is random(), which randomly selects a set of Cookies from the Hash and returns it. Every call to random() yields random Cookies, so this method can be wired up to the interface module to serve random Cookies through the API.
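As a quick usage sketch of the class above (assuming a locally running Redis instance; the account shown is just the example data that appears in the console output later in this section):

# Hash accounts:weibo - save an account as username -> password
accounts_db = RedisClient('accounts', 'weibo')
accounts_db.set('14747223314', 'asdf1129')

# Hash cookies:weibo - once the generation module has stored Cookies,
# fetch a random set of them
cookies_db = RedisClient('cookies', 'weibo')
print(cookies_db.random())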
2. Generation module
The generation module is responsible for obtaining the information of each account, simulating login, and then generating and saving the Cookies. We first read both Hashes to find the accounts that appear in the accounts Hash but have no entry in the Cookies Hash, that is, accounts whose Cookies have not been generated yet, and then traverse these remaining accounts to generate their Cookies.
The main logic here is to find out which accounts do not yet have corresponding Cookies and then obtain Cookies for them one by one, with the code as follows:
for username in accounts_usernames:
    if not username in cookies_usernames:
        password = self.accounts_db.get(username)
        print('Generating Cookies', 'username', username, 'password', password)
        result = self.new_cookies(username, password)
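For context, a minimal sketch of where the two username lists above come from, assuming accounts_db and cookies_db are RedisClient instances from the storage module (the enclosing method name run() is an assumption):

def run(self):
    # usernames() is the Hash-keys helper defined in the storage module
    accounts_usernames = self.accounts_db.usernames()
    cookies_usernames = self.cookies_db.usernames()
    for username in accounts_usernames:
        if not username in cookies_usernames:
            password = self.accounts_db.get(username)
            print('Generating Cookies', 'username', username, 'password', password)
            result = self.new_cookies(username, password)
            # handle result according to its status code, as shown further below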
Because we are targeting Sina Weibo here, and we have already cracked Sina Weibo’s four-grid pattern captcha in an earlier section, we can reuse that code directly. We only need to add a method that gets the Cookies and returns different results for different situations. The logic is as follows:
def get_cookies(self):
    return self.browser.get_cookies()

def main(self):
    self.open()
    if self.password_error():
        return {
            'status': 2,
            'content': 'username or password error'
        }
    # Login succeeded directly without needing the captcha
    if self.login_successfully():
        cookies = self.get_cookies()
        return {
            'status': 1,
            'content': cookies
        }
    # Otherwise get the captcha image and crack it
    image = self.get_image('captcha.png')
    numbers = self.detect_image(image)
    self.move(numbers)
    if self.login_successfully():
        cookies = self.get_cookies()
        return {
            'status': 1,
            'content': cookies
        }
    else:
        return {
            'status': 3,
            'content': 'login failed'
        }
The result returned here is a dictionary that carries a status code status, so the generation module can handle each case differently. If the status code is 1, the Cookies were obtained successfully and we only need to save them to the database. If the status code is 2, the username or password is wrong, and we should delete the account information from the database. If the status code is 3, login failed for some other reason; we cannot tell whether the credentials are wrong or the Cookies simply could not be obtained, so we just print a message and move on to the next account. A similar implementation looks like this:
if result.get('status') == 1:
    cookies = self.process_cookies(result.get('content'))
    print('Successfully obtained Cookies', cookies)
    if self.cookies_db.set(username, json.dumps(cookies)):
        print('Successfully saved Cookies')
elif result.get('status') == 2:
    print(result.get('content'))
    if self.accounts_db.delete(username):
        print('Successfully deleted the account')
else:
    print(result.get('content'))
If you want to extend to other sites, you only need to implement the new_cookies() method for that site and return the simulated-login result in the same format, for example status 1 for success and status 2 for a wrong username or password, as in the sketch below.
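A minimal skeleton of such an extension might look like the following; the parent class name CookiesGenerator and the perform_login() helper are assumptions used only for illustration, while the return format follows the status convention described above:

class ZhihuCookiesGenerator(CookiesGenerator):
    def __init__(self, website='zhihu'):
        CookiesGenerator.__init__(self, website)

    def new_cookies(self, username, password):
        """Simulate login for one Zhihu account and return a result dict."""
        # perform_login() is a hypothetical site-specific helper that drives the
        # browser through Zhihu's login flow and returns the Cookies on success.
        cookies = self.perform_login(username, password)
        if cookies:
            return {'status': 1, 'content': cookies}
        # Further checks could distinguish wrong credentials (status 2)
        # from other login failures (status 3).
        return {'status': 3, 'content': 'login failed'}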
After the code runs, it traverses the accounts that have not yet generated Cookies and simulates login for each of them to generate new Cookies.
3. Detection module
We can now generate Cookies with the generation module, but Cookies will still fail sooner or later: for example, they may expire after too long a time, or an account that is used too frequently may be restricted, so that its Cookies can no longer be used to request pages normally. If we encounter such Cookies, we certainly cannot keep them in the database.
Therefore, we also need a periodic detection module. It traverses all Cookies in the pool and, for each site, sets a corresponding detection link. We request this link with each set of Cookies in turn: if the request succeeds and the status code is valid, the Cookies are valid; if the request fails or normal data cannot be retrieved, for example the response redirects straight to the login page or to a verification page, the Cookies are invalid and we remove them from the database.
After the Cookies are removed, the generation module described above will notice that the Cookies Hash has fewer entries than the accounts Hash, treat those accounts as having no Cookies, log in with them again, and thereby refresh their Cookies.
All the detection module needs to do is find the Cookies that have failed and remove them from the database.
To achieve general extensibility, we first define a detector’s parent class and declare some common components, which are implemented as follows:
class ValidTester(object):
    def __init__(self, website='default'):
        self.website = website
        self.cookies_db = RedisClient('cookies', self.website)
        self.accounts_db = RedisClient('accounts', self.website)

    def test(self, username, cookies):
        raise NotImplementedError

    def run(self):
        cookies_groups = self.cookies_db.all()
        for username, cookies in cookies_groups.items():
            self.test(username, cookies)
Here we define a parent class called ValidTester. It specifies the site name as website in __init__() and creates two storage module objects, cookies_db and accounts_db, which connect to the Cookies Hash and the accounts Hash respectively. The run() method is the entry point: it iterates over all the Cookies and calls the test() method on each of them. The test() method is not implemented here, which means we need to write subclasses that override it, with each subclass responsible for testing a different website. For example, we can define a WeiboValidTester that implements its own test() method to check whether Weibo Cookies are valid and handle the result accordingly. So we add a subclass that inherits from ValidTester and overrides test(). The implementation is as follows:
import json
import requests
from requests.exceptions import ConnectionError


class WeiboValidTester(ValidTester):
    def __init__(self, website='weibo'):
        ValidTester.__init__(self, website)

    def test(self, username, cookies):
        print('Testing Cookies', 'username', username)
        try:
            cookies = json.loads(cookies)
        except TypeError:
            print('Cookies are not valid', username)
            self.cookies_db.delete(username)
            print('Deleted Cookies', username)
            return
        try:
            test_url = TEST_URL_MAP[self.website]
            response = requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False)
            if response.status_code == 200:
                print('Cookies are valid', username)
                print('Partial test result', response.text[0:50])
            else:
                print(response.status_code, response.headers)
                print('Cookies have expired', username)
                self.cookies_db.delete(username)
                print('Deleted Cookies', username)
        except ConnectionError as e:
            print('Exception occurred', e.args)
The test() method first converts the Cookies to a dictionary to check their format. If the format is incorrect, they are deleted directly; if the format is correct, the Cookies are used to request the test URL. Here test() checks Weibo, and the test URL can be an Ajax interface. For configurability, we define the test URLs as a dictionary, as shown below:
TEST_URL_MAP = {
'weibo': 'https://m.weibo.cn/'
}
If we want to extend to other sites, we can simply add them to this dictionary. For Weibo, we use the Cookies to request the target site, disable redirects, and set a timeout, and then check the status code of the Response. If a 200 status code is returned, the Cookies are valid; otherwise the request may have been redirected to the login page, which means the Cookies are invalid. If the Cookies have failed, we remove them from the Cookies Hash.
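For example, to also test Zhihu Cookies, the dictionary could be extended like this (the Zhihu test URL here is only an illustrative assumption):

TEST_URL_MAP = {
    'weibo': 'https://m.weibo.cn/',
    'zhihu': 'https://www.zhihu.com/'
}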
4. Interface module
The generation module and the detection module, run periodically, can keep the Cookies tested and updated in real time. However, the Cookies ultimately still have to be used by crawlers, and one Cookies pool may serve many crawlers, so we also need to define a Web interface that crawlers can visit to obtain random Cookies. We use Flask to build this interface; the code is as follows:
import json
from flask import Flask, g

app = Flask(__name__)

# Map of site names to generator class names
GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}


@app.route('/')
def index():
    return '<h2>Welcome to Cookie Pool System</h2>'


def get_conn():
    for website in GENERATOR_MAP:
        if not hasattr(g, website):
            setattr(g, website + '_cookies', eval('RedisClient' + '("cookies", "' + website + '")'))
    return g


@app.route('/<website>/random')
def random(website):
    """
    Get random Cookies, accessed at an address such as /weibo/random
    :return: random Cookies
    """
    g = get_conn()
    cookies = getattr(g, website + '_cookies').random()
    return cookies
We still need a universal configuration so that different sites can be connected, so the first field of the interface path is defined as the site name and the second field as the retrieval method. For example, /weibo/random obtains random Weibo Cookies and /zhihu/random obtains random Zhihu Cookies.
5. Scheduling module
Finally, we add a scheduling module to make all of these modules work together. Its main job is to drive the other modules to run periodically, with each module running in a separate process, as shown below:
import time
from multiprocessing import Process
from cookiespool.api import app
from cookiespool.config import *
from cookiespool.generator import *
from cookiespool.tester import *


class Scheduler(object):
    @staticmethod
    def valid_cookie(cycle=CYCLE):
        while True:
            print('Cookies detection process starts running')
            try:
                for website, cls in TESTER_MAP.items():
                    tester = eval(cls + '(website="' + website + '")')
                    tester.run()
                    print('Cookies detection complete')
                    del tester
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def generate_cookie(cycle=CYCLE):
        while True:
            print('Cookies generation process starts running')
            try:
                for website, cls in GENERATOR_MAP.items():
                    generator = eval(cls + '(website="' + website + '")')
                    generator.run()
                    print('Cookies generation complete')
                    generator.close()
                    time.sleep(cycle)
            except Exception as e:
                print(e.args)

    @staticmethod
    def api():
        print('API interface starts running')
        app.run(host=API_HOST, port=API_PORT)

    def run(self):
        if API_PROCESS:
            api_process = Process(target=Scheduler.api)
            api_process.start()
        if GENERATOR_PROCESS:
            generate_process = Process(target=Scheduler.generate_cookie)
            generate_process.start()
        if VALID_PROCESS:
            valid_process = Process(target=Scheduler.valid_cookie)
            valid_process.start()

Two important configurations are used here: the dictionaries that map site names to the generation module classes and to the detection module classes, as shown below:

GENERATOR_MAP = {
    'weibo': 'WeiboCookiesGenerator'
}

TESTER_MAP = {
    'weibo': 'WeiboValidTester'
}
This configuration exists for dynamic extension: the key is the site name and the value is the class name. If another site needs to be connected, it can simply be added to the dictionary. For example, to extend the generation module to Zhihu, the configuration becomes:
GENERATOR_MAP = {
'weibo': 'WeiboCookiesGenerator',
'zhihu': 'ZhihuCookiesGenerator',
}
The Scheduler traverses these dictionaries, dynamically creates an object of each class with eval(), and calls the object’s entry method run() to run each module. The modules run in separate processes created with the Process class from multiprocessing; each process is launched by calling its start() method.
In addition, each module has a switch that we can freely turn on or off in the configuration file, as follows:
# Generation module switch
GENERATOR_PROCESS = True
# Detection module switch
VALID_PROCESS = False
# Interface module switch
API_PROCESS = True
If the value is set to True, the module is enabled. If the value is set to False, the module is disabled.
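With the switches configured, starting the whole pool only needs a small entry script that creates a Scheduler and calls its run() method. A minimal sketch, assuming the Scheduler class lives in a module such as cookiespool.scheduler:

from cookiespool.scheduler import Scheduler


def main():
    s = Scheduler()
    s.run()


if __name__ == '__main__':
    main()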
At this point, our Cookies pool is complete. Next we start all the modules together by running the scheduler; the console output looks similar to the following:
API interface starts running
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Cookies generation process starts running
Cookies detection process starts running
Testing Cookies username 14747219309
Generating Cookies username 14747223314 password asdf1129
Cookies are valid 14747219309
Testing Cookies username 14740691419
Testing Cookies username 14740626332
Cookies are valid 14740626332
Cookies are valid 14740691419
Cookies are valid 14740636046
Testing Cookies username 14740618009
Cookies are valid 14740618009
Testing Cookies username 14740636046
Drag order [1, 4, 2, 3]
Testing Cookies username 14747222472
Cookies are valid 14747222472
Cookies detection complete
Captcha position 420 580 384 544
Matched successfully
Successfully obtained Cookies {'SUHB': '08J77UIj4w5n_T', 'SCF': 'AimcUCUVvHjswSBmTswKh0g4kNj4K7_U9k57YzxbqFt4SFBhXq3Lx4YSNO9VuBV841BMHFIaH4ipnfqZnK7W6Qs.', 'SSOLoginState': '1501439488', '_T_WM': '99b7d656220aeb9207b5db97743adc02', 'M_WEIBOCN_PARAMS': 'uicode%3D20000174', 'SUB': '_2A250elZQDeRhGeBM6VAR8ifEzTuIHXVXhXoYrDV6PUJbkdBeLXTxkW17ZoYhhJ92N_RGCjmHpfv9TB8OJQ..'}
Successfully saved Cookies
The above is the console output of the running program. We can see that each module starts normally: the detection module tests the Cookies one by one, and the generation module obtains Cookies for the accounts that do not have them yet. The modules run in parallel without interfering with each other.
We can access the interface to get random Cookies, as shown in the figure below.
The crawler only needs to request this interface to obtain random Cookies.
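A minimal sketch of such a crawler request, assuming the interface module runs locally on port 5000 and returns the Cookies exactly as the JSON string stored in Redis:

import json
import requests

# Assumed address of the interface module; adjust to API_HOST/API_PORT as configured
POOL_URL = 'http://127.0.0.1:5000/weibo/random'


def get_random_cookies():
    """Fetch a random set of Cookies from the pool and parse the JSON string."""
    response = requests.get(POOL_URL)
    return json.loads(response.text)


cookies = get_random_cookies()
# Use the Cookies to request a Weibo page that benefits from a logged-in state
response = requests.get('https://m.weibo.cn/', cookies=cookies)
print(response.status_code)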
V. Code for this section
The code for this section is available at https://github.com/Python3WebSpider/CookiesPool.
The original article was published on June 24, 2018
Author: Cui Qingcai