An asynchronous crawler proxy pool built on Python asyncio, designed to take full advantage of Python's asynchronous capabilities.
Runtime environment
The project uses Sanic, an asynchronous web framework, so Python 3.5+ is required. Sanic does not support Windows, so Windows users (such as me 😄) can consider using Ubuntu on Windows (WSL).
How to use
Install Redis
The project uses Redis as its database. Redis is an open-source (BSD-licensed), in-memory data structure store that can be used as a database, cache, and message broker. Make sure Redis is installed correctly in your runtime environment; for installation instructions, refer to the guide on the official website.
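A quick way to confirm Redis is up (assuming a default local installation listening on port 6379):

```
$ redis-cli ping
PONG
```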
Download the project source code
```
$ git clone https://github.com/chenjiandongx/async-proxy-pool.git
```
Install dependencies
Using requirements.txt
```
$ pip install -r requirements.txt
```
Using pipenv (Pipfile)
```
$ pipenv install
```
The configuration file
The configuration file is config.py, which holds all the configuration items used by the project. As shown below, users can change them as needed, or simply keep the defaults.
```python
#!/usr/bin/env python
# coding=utf-8

# Request timeout (seconds)
REQUEST_TIMEOUT = 15
# Request delay (seconds)
REQUEST_DELAY = 0

# Redis host
REDIS_HOST = "localhost"
# Redis port
REDIS_PORT = 6379
# Redis password
REDIS_PASSWORD = None
# Redis set key
REDIS_KEY = "proxies:ranking"
# Redis maximum connections
REDIS_MAX_CONNECTION = 20

# Maximum score
MAX_SCORE = 10
# Minimum score
MIN_SCORE = 0
# Initial score
INIT_SCORE = 9

# Sanic web host
SANIC_HOST = "localhost"
# Sanic web port
SANIC_PORT = 3289
# Enable Sanic access logging
SANIC_ACCESS_LOG = True

# Number of proxies validated per batch
VALIDATOR_BATCH_COUNT = 256
# Website used to validate proxies (change to the site you plan to crawl)
VALIDATOR_BASE_URL = "https://httpbin.org/"
# Validator run cycle (minutes)
VALIDATOR_RUN_CYCLE = 15
# Crawler run cycle (minutes)
CRAWLER_RUN_CYCLE = 30

# Request headers
HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
}
```
Run the project
Run the client, which starts the crawler and the validator
```
# Optionally point the validator at the site you plan to crawl:
# set/export VALIDATOR_BASE_URL="https://example.com"
$ python client.py
2018-05-16 23:41:39,234 - Crawler working...
2018-05-16 23:41:40,509 - Crawler http://202.83.123.33:3128
2018-05-16 23:41:40,509 - Crawler http://123.53.118.122:61234
2018-05-16 23:41:40,510 - Crawler http://212.237.63.84:8888
2018-05-16 23:41:40,510 - Crawler http://36.73.102.245:8080
2018-05-16 23:41:40,511 - Crawler http://78.137.90.253:8080
2018-05-16 23:41:40,512 - Crawler http://5.45.70.39:1490
2018-05-16 23:41:40,512 - Crawler http://117.102.97.162:8080
2018-05-16 23:41:40,513 - Crawler http://109.185.149.65:8080
2018-05-16 23:41:40,513 - Crawler http://189.39.143.172:20183
2018-05-16 23:41:40,514 - Crawler http://186.225.112.62:20183
2018-05-16 23:41:40,514 - Crawler http://189.126.66.154:20183
...
2018-05-16 23:41:55,866 - Validator working...
2018-05-16 23:41:56,951 - Validator * https://114.113.126.82:80
2018-05-16 23:41:56,953 - Validator x https://114.199.125.242:80
2018-05-16 23:41:56,955 - Validator * https://114.228.75.17:6666
2018-05-16 23:41:56,957 - Validator * https://115.227.3.86:9000
2018-05-16 23:41:56,960 - Validator x https://115.229.88.191:9000
2018-05-16 23:41:56,964 - Validator * https://115.229.89.100:9000
2018-05-16 23:41:56,966 - Validator * https://103.18.180.194:8080
2018-05-16 23:41:56,967 - Validator * https://115.229.90.207:9000
2018-05-16 23:41:56,968 - Validator x https://103.216.144.17:8080
2018-05-16 23:41:56,969 - Validator * https://117.65.43.29:31588
2018-05-16 23:41:56,971 - Validator * https://103.248.232.135:8080
2018-05-16 23:41:56,972 - Validator x https://117.94.69.166:61234
2018-05-16 23:41:56,975 - Validator * https://103.26.56.109:8080
...
```
Run the server, which starts the web service
```
$ python server.py
[2018-05-16 23:36:22 +0800] [108] [INFO] Goin' Fast @ http://localhost:3289
[2018-05-16 23:36:22 +0800] [108] [INFO] Starting worker [108]
```
The overall architecture
The main modules of the project are the crawl module, storage module, validation module, scheduling module, and interface module.
- Crawl module
Responsible for crawling proxy sites and storing the resulting proxies in the database; each new proxy is initialized with a weight of INIT_SCORE.
- Storage module
Encapsulates the Redis operations used by the project and provides a Redis connection pool.
- Validation module
Verifies whether a proxy IP is usable. If the proxy works, its weight is increased by 1, up to a maximum of MAX_SCORE; if not, its weight is decreased by 1, and once the weight reaches 0 the proxy is removed from the database (a hedged sketch of this weighting scheme follows the list).
- Scheduling module
Responsible for scheduling the crawler and validator runs.
- Interface module
Uses Sanic to provide the web API.
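For illustration, here is a minimal sketch of how such a weighting scheme can be kept in a Redis sorted set. It is not the project's actual storage module; it assumes redis-py 3.0 or later, and function names such as add_proxy are purely illustrative.

```python
import redis

MAX_SCORE, MIN_SCORE, INIT_SCORE = 10, 0, 9
REDIS_KEY = "proxies:ranking"

r = redis.StrictRedis(host="localhost", port=6379)

def add_proxy(proxy: str) -> None:
    # New proxies enter the pool with the initial weight
    if r.zscore(REDIS_KEY, proxy) is None:
        r.zadd(REDIS_KEY, {proxy: INIT_SCORE})

def increase(proxy: str) -> None:
    # Validation passed: weight +1, capped at MAX_SCORE
    score = r.zincrby(REDIS_KEY, 1, proxy)
    if score > MAX_SCORE:
        r.zadd(REDIS_KEY, {proxy: MAX_SCORE})

def decrease(proxy: str) -> None:
    # Validation failed: weight -1; drop the proxy once it falls to MIN_SCORE
    score = r.zincrby(REDIS_KEY, -1, proxy)
    if score <= MIN_SCORE:
        r.zrem(REDIS_KEY, proxy)
```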
/
The welcome page
```
$ http http://localhost:3289/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 42
Content-Type: application/json
Keep-Alive: 5

{
    "Welcome": "This is a proxy pool system."
}
```
/pop
Returns a random proxy, selected in up to three attempts (a hedged sketch follows the list):
- First, try to return a proxy with weight MAX_SCORE, i.e. one that passed the most recent validation.
- Next, try to return a proxy chosen at random from those with weights between MAX_SCORE - 3 and MAX_SCORE.
- Finally, try to return a proxy chosen at random from those with weights between 0 and MAX_SCORE.
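A minimal sketch of this selection logic, reusing the Redis sorted-set layout from the storage sketch above and assuming redis-py 3.0 or later; it is not the project's actual code.

```python
import random

import redis

MAX_SCORE, REDIS_KEY = 10, "proxies:ranking"
r = redis.StrictRedis(host="localhost", port=6379)

def pop_proxy():
    # 1. proxies that passed the most recent validation
    candidates = r.zrangebyscore(REDIS_KEY, MAX_SCORE, MAX_SCORE)
    if not candidates:
        # 2. fall back to weights between MAX_SCORE - 3 and MAX_SCORE
        candidates = r.zrangebyscore(REDIS_KEY, MAX_SCORE - 3, MAX_SCORE)
    if not candidates:
        # 3. finally, anything with a weight between 0 and MAX_SCORE
        candidates = r.zrangebyscore(REDIS_KEY, 0, MAX_SCORE)
    return random.choice(candidates).decode() if candidates else None
```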
```
$ http http://localhost:3289/pop
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 38
Content-Type: application/json
Keep-Alive: 5

{
    "http": "http://46.48.105.235:8080"
}
```
/get/<count:int>
Returns the specified number of proxies, sorted by weight.
```
$ http http://localhost:3289/get/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 393
Content-Type: application/json
Keep-Alive: 5

[
    {"http": "http://94.177.214.215:3128"},
    {"http": "http://94.139.242.70:53281"},
    {"http": "http://94.130.92.40:3128"},
    {"http": "http://82.78.28.139:8080"},
    {"http": "http://82.222.153.227:9090"},
    {"http": "http://80.211.228.238:8888"},
    {"http": "http://80.211.180.224:3128"},
    {"http": "http://79.101.98.2:53281"},
    {"http": "http://66.96.233.182:8080"},
    {"http": "http://61.228.45.165:8080"}
]
```
/count
Returns the total number of proxies in the pool.
```
$ http http://localhost:3289/count
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5

{
    "count": "698"
}
```
/count/<score:int>
Returns the total number of proxies with the specified weight.
```
$ http http://localhost:3289/count/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5

{
    "count": "143"
}
```
/clear/<score:int>
Deletes proxies whose weight is less than or equal to score.
```
$ http http://localhost:3289/clear/0
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 22
Content-Type: application/json
Keep-Alive: 5

{
    "Clear": "Successful"
}
```
Extending the proxy crawl sites
Add your own crawler method to crawler.py.
```python
class Crawler:
    @staticmethod
    def run():
        ...

    # Add your own crawl method
    @staticmethod
    @collect_funcs  # decorator that registers the method so it gets run
    def crawl_xxx():
        # crawl logic
        ...
```
Using a different web framework
This project uses Sanic, but developers can choose another web framework according to their needs. The web module is completely independent, so replacing the framework will not affect normal operation of the project. The following steps are required (a hedged sketch follows the list):
- Change the framework in webapi.py.
- Change app startup details in server.py.
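To illustrate how independent the web layer is, here is a minimal sketch of the /pop route rewritten with aiohttp's built-in web server; get_random_proxy is a hypothetical stand-in for whatever the storage module actually exposes, and the host/port just mirror the defaults from config.py.

```python
from aiohttp import web

# Hypothetical helper standing in for the project's storage module
def get_random_proxy() -> dict:
    return {"http": "http://127.0.0.1:8080"}

routes = web.RouteTableDef()

@routes.get("/pop")
async def pop(request: web.Request) -> web.Response:
    # Return one proxy as JSON, same shape as the Sanic endpoint
    return web.json_response(get_random_proxy())

app = web.Application()
app.add_routes(routes)

if __name__ == "__main__":
    web.run_app(app, host="localhost", port=3289)
```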
Sanic performance test
Use wrk for server stress testing. Benchmark for 30 seconds, using 12 threads and 400 concurrent HTTP connections.
Test http://127.0.0.1:3289/pop
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/pop
Running 30s test @ http://127.0.0.1:3289/pop
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   350.37ms  118.99ms 660.41ms   60.94%
    Req/Sec     98.18     35.94   277.00     79.43%
  33694 requests in 30.10s, 4.77MB read
  Socket errors: connect 0, read 340, write 0, timeout 0
Requests/sec:   1119.44
Transfer/sec:    162.23KB
```
Test http://127.0.0.1:3289/get/10
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/get/10
Running 30s test @ http://127.0.0.1:3289/get/10
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   254.90ms   95.43ms 615.14ms   63.51%
    Req/Sec    144.84     61.52   320.00     66.58%
  46538 requests in 30.10s, 22.37MB read
  Socket errors: connect 0, read 28, write 0, timeout 0
Requests/sec:   1546.20
Transfer/sec:    761.02KB
```
Performance is quite good. Now test http://127.0.0.1:3289/, which involves no Redis operations:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   127.86ms   41.71ms 260.69ms   55.22%
    Req/Sec    258.56     92.25   520.00     68.90%
  92766 requests in 30.10s, 13.45MB read
Requests/sec:   3081.87
Transfer/sec:    457.47KB
```
⭐️ Requests/sec: 3081.87
Turn off Sanic logging and test http://127.0.0.1:3289/ again:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.63ms   12.66ms  96.28ms   26.07%
    Req/Sec     0.96k   137.29     2.21k    73.29%
  342764 requests in 30.10s, 49.69MB read
Requests/sec:  11387.89
Transfer/sec:      1.65MB
```
⭐️ Requests/sec: 11387.89
Real-world proxy performance test
test_proxy.py is used to test actual proxy performance.
Run the code
```
$ cd test
$ python test_proxy.py

# Settable environment variables (defaults as defined in test_proxy.py)
TEST_COUNT = os.environ.get("TEST_COUNT") or 1000
TEST_WEBSITE = os.environ.get("TEST_WEBSITE") or "https://httpbin.org/"
TEST_PROXIES = os.environ.get("TEST_PROXIES") or "http://localhost:3289/get/20"
```
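For example, assuming a POSIX shell (the site and count below are purely illustrative), the defaults can be overridden inline:

```
$ TEST_WEBSITE="https://taobao.com/" TEST_COUNT=500 python test_proxy.py
```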
The measured results
httpbin.org/
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://httpbin.org/
Requests:      1000
Successes:     1000
Failures:      0
Success rate:  1.0
```
taobao.com
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://taobao.com/
Requests:      1000
Successes:     984
Failures:      16
Success rate:  0.984
```
baidu.com
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://baidu.com
Requests:      1000
Successes:     975
Failures:      25
Success rate:  0.975
```
zhihu.com
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://zhihu.com
Requests:      1000
Successes:     1000
Failures:      0
Success rate:  1.0
```
As you can see, the proxies perform well and the success rate is very high. 😉
Practical Application Examples
```python
import random

import requests

# Make sure the Sanic service is already running

# Fetch multiple proxies and pick one at random
try:
    proxies = requests.get("http://localhost:3289/get/20").json()
    req = requests.get("https://example.com", proxies=random.choice(proxies))
except:
    raise

# Fetch a single proxy
try:
    proxy = requests.get("http://localhost:3289/pop").json()
    req = requests.get("https://example.com", proxies=proxy)
except:
    raise
```
A pitfall of aiohttp
The whole project is based on the aiohttp asynchronous network library, and proxies are covered in aiohttp's documentation:
aiohttp supports HTTP/HTTPS proxies
In practice, however, HTTPS proxies are not supported at all; this is stated plainly in its source code:
Only http proxies are supported
My feelings about this are mixed 😲, but it doesn't matter much: HTTP proxies alone already work well, as the test data above shows.
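For reference, a minimal sketch of using a proxy from the pool with aiohttp, assuming the service is running locally on port 3289 (the target URL is only an example; note that the proxy is passed via aiohttp's proxy argument and must be an HTTP proxy):

```python
import asyncio

import aiohttp

async def fetch():
    async with aiohttp.ClientSession() as session:
        # Grab one proxy from the pool (the /pop endpoint returns {"http": "..."})
        async with session.get("http://localhost:3289/pop") as resp:
            proxy = (await resp.json())["http"]
        # aiohttp only accepts HTTP proxies
        async with session.get("http://example.com", proxy=proxy) as resp:
            return resp.status

print(asyncio.get_event_loop().run_until_complete(fetch()))
```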
Reference projects
✨ 🍰 ✨
- ProxyPool
- proxy_pool
License
MIT © chenjiandongx