
The following article comes from Tencent Cloud; the author is Xiaoke.




At my company I worked on distributed deep-web crawlers and built a stable proxy pool service that supplies working proxies to thousands of crawlers, making sure each crawler gets an effective proxy IP for its target website so that it runs fast and stably. Of course, what was built at the company cannot be open-sourced. But my hands were itching in my spare time, so I decided to use a few free resources to build a simple proxy pool service.

1. The problem

Where do the proxy IPs come from? When I was teaching myself to write crawlers and had no proxy IPs, I scraped free proxy sites such as Xici and Kuaidaili; a few of those proxies were individually usable. Of course, if you have access to a better paid proxy interface, you can plug that in instead. Collecting free proxies is simple: fetch the page -> extract with regex/XPath -> save.
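
To make that fetch -> extract -> save flow concrete, here is a minimal sketch. The URL and the XPath are placeholders for whichever free-proxy site you target, not the exact sources used by this project.

import requests
from lxml import etree

def fetch_free_proxies(url="http://free-proxy-site.example.com/list"):
    # fetch the listing page
    html = requests.get(url, timeout=10).text
    tree = etree.HTML(html)
    proxies = []
    # assume IP and port sit in the first two cells of each table row
    for row in tree.xpath("//table//tr")[1:]:
        cells = row.xpath("./td/text()")
        if len(cells) >= 2:
            proxies.append("{}:{}".format(cells[0].strip(), cells[1].strip()))
    return proxies   # each entry looks like "1.2.3.4:8080", ready to be saved to the DB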

How do we guarantee proxy quality?

You can be sure that most free proxy IPs are unusable, otherwise why would anyone sell paid IPs (although in fact many paid proxies are unstable too, and plenty of them do not work either). So the collected proxy IPs cannot be used directly. Instead, write a detection program that uses each proxy to visit a stable website and checks whether it works. Since checking proxies is slow, this process should be multithreaded or asynchronous.
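
As a sketch of that idea, the checker below validates proxies concurrently with a thread pool. The test URL, timeout, and worker count are illustrative choices, not the project's exact settings.

import requests
from concurrent.futures import ThreadPoolExecutor

def check_proxy(proxy, test_url="https://www.baidu.com", timeout=5):
    # a proxy counts as usable if it can fetch a stable page before the timeout
    try:
        resp = requests.get(test_url,
                            proxies={"http": "http://" + proxy, "https": "http://" + proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def filter_usable(proxies):
    # check many proxies at once, since each check mostly waits on the network
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(check_proxy, proxies))
    return [p for p, ok in zip(proxies, results) if ok]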

How will the collected proxies be stored?

Here I have to recommend SSDB, a high-performance NoSQL database that supports multiple data structures and works well as a substitute for Redis. It supports queues, hashes, sets, and key-value pairs, handles terabyte-scale data, and makes a good intermediate store for distributed crawlers.
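
SSDB understands a Redis-compatible protocol for many common commands, so as a rough sketch you can talk to it with redis-py. The host, port (SSDB listens on 8888 by default), and the hash name below are assumptions for illustration, not the project's exact settings.

import redis

# assumed local SSDB instance; SSDB's default port is 8888
client = redis.Redis(host="127.0.0.1", port=8888)

def save_proxy(proxy):
    # storing proxies as hash fields collapses duplicates automatically
    client.hset("useful_proxy", proxy, 0)

def get_all_proxies():
    return [p.decode() for p in client.hkeys("useful_proxy")]

def delete_proxy(proxy):
    client.hdel("useful_proxy", proxy)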

How can crawlers use these proxies more easily?

The answer is to turn it into a service. Python has plenty of web frameworks; just pick one and write an API for the crawlers to call. This has many advantages: when a crawler finds that a proxy no longer works, it can actively delete that proxy IP through the API, and when it finds the pool is running low, it can actively trigger a refresh. This is more reliable than relying on the detection program alone.

2. Proxy pool design

The proxy pool consists of four parts:

ProxyGetter:

The proxy acquisition interface. There are currently five free proxy sources; each call grabs the latest proxies from those five sites and puts them into the DB. Additional proxy sources can be plugged in;

DB:

Stores the proxy IP addresses; only SSDB is supported for now. As for why SSDB was chosen, you can refer to this article; personally I think SSDB is a good alternative to Redis. If you have never used SSDB, it is easy to install, and you can refer to this article.

Schedule:

A scheduled task that periodically checks the availability of the proxies in the DB and deletes the ones that are unusable. It also calls the ProxyGetter to pull the latest proxies into the DB.

ProxyApi:

The external interface of the proxy pool. Since the pool's functionality is fairly simple for now, I spent two hours looking at Flask and happily decided to build it with Flask. It provides get/delete/refresh endpoints that crawlers can call directly.

3. Code modules

Python’s high-level data structures, dynamic typing, and dynamic binding make it ideal for rapid application development and as a glue language for connecting existing software components. Writing the proxy IP pool in Python is therefore straightforward; the code is divided into six modules:

Api:

API-related code. The API is currently implemented with Flask, and the code is very simple. Client requests are handled by Flask, which calls the corresponding implementation in the ProxyManager: get/delete/refresh/get_all;
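
A rough, self-contained sketch of what this layer looks like (the real ProxyManager lives in the Manager module; the in-memory stub below only stands in for it so the example runs on its own):

from flask import Flask, request

class StubProxyManager(object):
    # stand-in for the project's ProxyManager, kept in memory for the sketch
    def __init__(self):
        self.pool = {"127.0.0.1:8080"}     # placeholder data
    def get(self):
        return next(iter(self.pool), "")
    def get_all(self):
        return list(self.pool)
    def delete(self, proxy):
        self.pool.discard(proxy)

app = Flask(__name__)
manager = StubProxyManager()

@app.route("/get/")
def get():
    return manager.get()

@app.route("/get_all/")
def get_all():
    return "\n".join(manager.get_all())

@app.route("/delete/")
def delete():
    manager.delete(request.args.get("proxy", ""))
    return "ok"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)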

DB:

Database-related code; the database currently used is SSDB. The code follows the factory pattern so that other database types can be added easily in the future;
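
A rough illustration of that factory idea, with made-up class names: the concrete client is picked from a config value, so supporting a new backend only means adding a class and a branch.

class SsdbClient(object):
    # minimal placeholder client; real code would open a connection here
    def __init__(self, host, port):
        self.host, self.port = host, port

class DbClientFactory(object):
    @staticmethod
    def create(db_type, host, port):
        if db_type.upper() == "SSDB":
            return SsdbClient(host, port)
        raise ValueError("unsupported db type: {}".format(db_type))

# the type/host/port would normally come from config.ini
db = DbClientFactory.create("SSDB", "127.0.0.1", 8888)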

Manager:

The implementation classes behind the get/delete/refresh/get_all interfaces. For now the proxy pool is only responsible for managing proxies; more features may be added later, such as binding proxies to specific crawlers or binding proxies to accounts;

ProxyGetter:

Code for fetching proxies. It currently scrapes the free proxies offered by five sites: Kuaidaili, 66 Daili, Youdaili, Xici Daili, and Guobanjia. In testing, these five sites yield only sixty or seventy usable proxies a day, updated daily; of course, you can also add your own proxy sources;

Schedule:

Code for scheduled tasks. At the moment it only implements the periodic refresh and the re-validation of usable proxies, using multiple processes;
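
A minimal sketch of that loop using the standard library; the 20-minute interval matches the default mentioned in the Usage section, and the refresh body is left as comments because the real work lives in ProxyGetter and the checker.

import time
from multiprocessing import Process

def refresh_pool():
    # 1) pull new proxies from the free sources (ProxyGetter) into the DB
    # 2) re-check the proxies already in the DB and delete the dead ones
    pass

def run_schedule(interval_seconds=20 * 60):
    while True:
        worker = Process(target=refresh_pool)   # each refresh round runs in its own process
        worker.start()
        worker.join()
        time.sleep(interval_seconds)

if __name__ == "__main__":
    run_schedule()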

Util:

Holds some shared helper classes and functions, including GetConfig, the class that reads config.ini; ConfigParse, a subclass of ConfigParser overridden to be case-sensitive; Singleton, a singleton implementation; LazyProperty, lazy evaluation of class attributes; and so on;
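
LazyProperty, for example, is commonly implemented as a descriptor that computes a value once and caches it on the instance. This is a generic sketch of that pattern, not the project's exact code.

class LazyProperty(object):
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            return self
        value = self.func(instance)
        # cache the result on the instance so later reads bypass the descriptor
        setattr(instance, self.func.__name__, value)
        return value

class Config(object):
    @LazyProperty
    def db_host(self):
        print("reading config.ini ...")   # runs only on the first access
        return "127.0.0.1"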

Other files:

The config.ini configuration file, which holds the database configuration and the proxy source configuration. You can add a new proxy fetch method in GetFreeProxy and register it in config.ini to use it;
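
As an illustration of extending the sources (the site URL, the method name, and the regex here are hypothetical; the corresponding config entry would then reference the new method name):

import re
import requests

class GetFreeProxy(object):
    @staticmethod
    def freeProxyCustom():
        # scrape a hypothetical free-proxy page and yield proxies as "ip:port"
        html = requests.get("http://free-proxy-site.example.com/list", timeout=10).text
        for ip, port in re.findall(r"(\d+\.\d+\.\d+\.\d+)\D+?(\d+)", html):
            yield "{}:{}".format(ip, port)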

4. Installation

Download the code:

git clone git@github.com:jhao104/proxy_pool.git

or download the zip directly from https://github.com/jhao104/proxy_pool

Install dependencies:

pip install -r requirements.txt

Launch:

Configure your SSDB in config.ini, then:
Start the scheduled task: >>> python proxyrefreshschedule.py
Start the API: >>> python proxyapi.py

5. Usage

After the scheduled task starts, it uses the proxy-fetch methods to pull all proxies into the database and verify them; by default this runs every 20 minutes. About a minute or two after the scheduled task starts, you can see the refreshed, usable proxies in SSDB:



After starting proxyapi.py, you can fetch proxies through the API in a browser. Here are some screenshots from the browser:

The index page:



The get page:



The get_all page:



If you want to use it in the crawler code, you can wrap this API into a function and use it directly, for example:

import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5000/get/").content

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5000/delete/?proxy={}".format(proxy))

# your spider code
def spider():
    # ....
    requests.get('https://www.example.com', proxies={"http": "http://{}".format(get_proxy())})
    # ....

6. Final notes

Time was short, so the functionality and code are fairly simple; I will improve it when I have time. If you like it, give it a star on GitHub. Thank you!