An asynchronous crawler proxy pool built on Python asyncio, designed to take full advantage of Python's asynchronous capabilities.
Runtime environment
The project uses Sanic, an asynchronous web framework, so Python 3.5+ is required. Sanic does not support Windows, so Windows users (such as me 😄) can consider using Ubuntu on Windows (WSL).
How to use
Install Redis
The project uses Redis as its database. Redis is an open-source (BSD-licensed), in-memory data structure store that can be used as a database, cache, and message broker. Make sure Redis is installed correctly in your runtime environment; for installation instructions, refer to the guide on the official website.
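A quick way to confirm Redis is up (assuming a default local installation listening on port 6379):

```
$ redis-cli ping
PONG
```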
Download the project source code
```
$ git clone https://github.com/chenjiandongx/async-proxy-pool.git
```
Install dependencies
Using requirements.txt
```
$ pip install -r requirements.txt
```
Using pipenv (Pipfile)
```
$ pipenv install
```
The configuration file
The configuration file is config.py, which holds all the configuration items used by the project. As shown below, users can change them as needed, or simply keep the defaults.
```python
#!/usr/bin/env python
# coding=utf-8

# Request timeout (seconds)
REQUEST_TIMEOUT = 15
# Request delay (seconds)
REQUEST_DELAY = 0

# Redis host
REDIS_HOST = "localhost"
# Redis port
REDIS_PORT = 6379
# Redis password
REDIS_PASSWORD = None
# Redis set key
REDIS_KEY = "proxies:ranking"
# Redis maximum connections
REDIS_MAX_CONNECTION = 20

# Maximum score
MAX_SCORE = 10
# Minimum score
MIN_SCORE = 0
# Initial score
INIT_SCORE = 9

# Sanic web host
SANIC_HOST = "localhost"
# Sanic web port
SANIC_PORT = 3289
# Enable Sanic access logging
SANIC_ACCESS_LOG = True

# Number of proxies validated per batch
VALIDATOR_BATCH_COUNT = 256
# Website used to validate proxies (change to the site you plan to crawl)
VALIDATOR_BASE_URL = "https://httpbin.org/"
# Validator run cycle (minutes)
VALIDATOR_RUN_CYCLE = 15
# Crawler run cycle (minutes)
CRAWLER_RUN_CYCLE = 30

# Request headers
HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
}
```
Run the project
Run the client, which starts the crawler and the validator
```
# Optionally point the validator at the site you plan to crawl:
# set/export VALIDATOR_BASE_URL="https://example.com"
$ python client.py
2018-05-16 23:41:39,234 - Crawler working...
2018-05-16 23:41:40,509 - Crawler http://202.83.123.33:3128
2018-05-16 23:41:40,509 - Crawler http://123.53.118.122:61234
2018-05-16 23:41:40,510 - Crawler http://212.237.63.84:8888
2018-05-16 23:41:40,510 - Crawler http://36.73.102.245:8080
2018-05-16 23:41:40,511 - Crawler http://78.137.90.253:8080
2018-05-16 23:41:40,512 - Crawler http://5.45.70.39:1490
2018-05-16 23:41:40,512 - Crawler http://117.102.97.162:8080
2018-05-16 23:41:40,513 - Crawler http://109.185.149.65:8080
2018-05-16 23:41:40,513 - Crawler http://189.39.143.172:20183
2018-05-16 23:41:40,514 - Crawler http://186.225.112.62:20183
2018-05-16 23:41:40,514 - Crawler http://189.126.66.154:20183
...
2018-05-16 23:41:55,866 - Validator working...
2018-05-16 23:41:56,951 - Validator * https://114.113.126.82:80
2018-05-16 23:41:56,953 - Validator x https://114.199.125.242:80
2018-05-16 23:41:56,955 - Validator * https://114.228.75.17:6666
2018-05-16 23:41:56,957 - Validator * https://115.227.3.86:9000
2018-05-16 23:41:56,960 - Validator x https://115.229.88.191:9000
2018-05-16 23:41:56,964 - Validator * https://115.229.89.100:9000
2018-05-16 23:41:56,966 - Validator * https://103.18.180.194:8080
2018-05-16 23:41:56,967 - Validator * https://115.229.90.207:9000
2018-05-16 23:41:56,968 - Validator x https://103.216.144.17:8080
2018-05-16 23:41:56,969 - Validator * https://117.65.43.29:31588
2018-05-16 23:41:56,971 - Validator * https://103.248.232.135:8080
2018-05-16 23:41:56,972 - Validator x https://117.94.69.166:61234
2018-05-16 23:41:56,975 - Validator * https://103.26.56.109:8080
...
```
Run the server, which starts the web service
```
$ python server.py
[2018-05-16 23:36:22 +0800] [108] [INFO] Goin' Fast @ http://localhost:3289
[2018-05-16 23:36:22 +0800] [108] [INFO] Starting worker [108]
```
The overall architecture
The main modules of the project are the crawl module, storage module, validation module, scheduling module, and interface module.
- Crawl module
Responsible for crawling proxy sites and storing the resulting proxies in the database; each new proxy is initialized with a weight of INIT_SCORE.
- Storage module
Encapsulates the Redis operations used by the project and provides a Redis connection pool.
- Validation module
Verifies whether a proxy IP is usable. If the proxy works, its weight is increased by 1, up to a maximum of MAX_SCORE; if not, its weight is decreased by 1, and once the weight reaches 0 the proxy is removed from the database (a hedged sketch of this weighting scheme follows the list).
- Scheduling module
Responsible for scheduling the crawler and validator runs.
- Interface module
Uses Sanic to provide the web API.
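For illustration, here is a minimal sketch of how such a weighting scheme can be kept in a Redis sorted set. It is not the project's actual storage module; it assumes redis-py 3.0 or later, and function names such as add_proxy are purely illustrative.

```python
import redis

MAX_SCORE, MIN_SCORE, INIT_SCORE = 10, 0, 9
REDIS_KEY = "proxies:ranking"

r = redis.StrictRedis(host="localhost", port=6379)

def add_proxy(proxy: str) -> None:
    # New proxies enter the pool with the initial weight
    if r.zscore(REDIS_KEY, proxy) is None:
        r.zadd(REDIS_KEY, {proxy: INIT_SCORE})

def increase(proxy: str) -> None:
    # Validation passed: weight +1, capped at MAX_SCORE
    score = r.zincrby(REDIS_KEY, 1, proxy)
    if score > MAX_SCORE:
        r.zadd(REDIS_KEY, {proxy: MAX_SCORE})

def decrease(proxy: str) -> None:
    # Validation failed: weight -1; drop the proxy once it falls to MIN_SCORE
    score = r.zincrby(REDIS_KEY, -1, proxy)
    if score <= MIN_SCORE:
        r.zrem(REDIS_KEY, proxy)
```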
/
The welcome page
```
$ http http://localhost:3289/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 42
Content-Type: application/json
Keep-Alive: 5

{
    "Welcome": "This is a proxy pool system."
}
```
/pop
Returns a random proxy, selected in up to three attempts (a hedged sketch follows the list):
- First, try to return a proxy with weight MAX_SCORE, i.e. one that passed the most recent validation.
- Next, try to return a proxy chosen at random from those with weights between MAX_SCORE - 3 and MAX_SCORE.
- Finally, try to return a proxy chosen at random from those with weights between 0 and MAX_SCORE.
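A minimal sketch of this selection logic, reusing the Redis sorted-set layout from the storage sketch above and assuming redis-py 3.0 or later; it is not the project's actual code.

```python
import random

import redis

MAX_SCORE, REDIS_KEY = 10, "proxies:ranking"
r = redis.StrictRedis(host="localhost", port=6379)

def pop_proxy():
    # 1. proxies that passed the most recent validation
    candidates = r.zrangebyscore(REDIS_KEY, MAX_SCORE, MAX_SCORE)
    if not candidates:
        # 2. fall back to weights between MAX_SCORE - 3 and MAX_SCORE
        candidates = r.zrangebyscore(REDIS_KEY, MAX_SCORE - 3, MAX_SCORE)
    if not candidates:
        # 3. finally, anything with a weight between 0 and MAX_SCORE
        candidates = r.zrangebyscore(REDIS_KEY, 0, MAX_SCORE)
    return random.choice(candidates).decode() if candidates else None
```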
```
$ http http://localhost:3289/pop
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 38
Content-Type: application/json
Keep-Alive: 5

{
    "http": "http://46.48.105.235:8080"
}
```
/get/<count:int>
Returns the specified number of proxies, sorted by weight.
```
$ http http://localhost:3289/get/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 393
Content-Type: application/json
Keep-Alive: 5

[
    {"http": "http://94.177.214.215:3128"},
    {"http": "http://94.139.242.70:53281"},
    {"http": "http://94.130.92.40:3128"},
    {"http": "http://82.78.28.139:8080"},
    {"http": "http://82.222.153.227:9090"},
    {"http": "http://80.211.228.238:8888"},
    {"http": "http://80.211.180.224:3128"},
    {"http": "http://79.101.98.2:53281"},
    {"http": "http://66.96.233.182:8080"},
    {"http": "http://61.228.45.165:8080"}
]
```
/count
Returns the total number of proxies in the pool.
```
$ http http://localhost:3289/count
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5

{
    "count": "698"
}
```
/count/<score:int>
Returns the total number of proxies with the specified weight.
```
$ http http://localhost:3289/count/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5

{
    "count": "143"
}
```
/clear/<score:int>
Deletes proxies whose weight is less than or equal to score.
```
$ http http://localhost:3289/clear/0
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 22
Content-Type: application/json
Keep-Alive: 5

{
    "Clear": "Successful"
}
```
Extending the proxy crawl sites
Add your own crawler method to crawler.py.
```python
class Crawler:
    @staticmethod
    def run():
        ...

    # Add your own crawl method
    @staticmethod
    @collect_funcs  # decorator that registers the method so it gets run
    def crawl_xxx():
        # crawl logic
        ...
```
Using a different web framework
This project uses Sanic, but developers can choose another web framework according to their needs. The web module is completely independent, so replacing the framework will not affect normal operation of the project. The following steps are required (a hedged sketch follows the list):
- Change the framework in webapi.py.
- Change app startup details in server.py.
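To illustrate how independent the web layer is, here is a minimal sketch of the /pop route rewritten with aiohttp's built-in web server; get_random_proxy is a hypothetical stand-in for whatever the storage module actually exposes, and the host/port just mirror the defaults from config.py.

```python
from aiohttp import web

# Hypothetical helper standing in for the project's storage module
def get_random_proxy() -> dict:
    return {"http": "http://127.0.0.1:8080"}

routes = web.RouteTableDef()

@routes.get("/pop")
async def pop(request: web.Request) -> web.Response:
    # Return one proxy as JSON, same shape as the Sanic endpoint
    return web.json_response(get_random_proxy())

app = web.Application()
app.add_routes(routes)

if __name__ == "__main__":
    web.run_app(app, host="localhost", port=3289)
```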
Sanic performance test
Use wrk for server stress testing. Benchmark for 30 seconds, using 12 threads and 400 concurrent HTTP connections.
Test http://127.0.0.1:3289/pop
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/pop
Running 30s test @ http://127.0.0.1:3289/pop
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   350.37ms  118.99ms 660.41ms   60.94%
    Req/Sec     98.18     35.94   277.00     79.43%
  33694 requests in 30.10s, 4.77MB read
  Socket errors: connect 0, read 340, write 0, timeout 0
Requests/sec:   1119.44
Transfer/sec:    162.23KB
```
Test http://127.0.0.1:3289/get/10
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/get/10
Running 30s test @ http://127.0.0.1:3289/get/10
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   254.90ms   95.43ms 615.14ms   63.51%
    Req/Sec    144.84     61.52   320.00     66.58%
  46538 requests in 30.10s, 22.37MB read
  Socket errors: connect 0, read 28, write 0, timeout 0
Requests/sec:   1546.20
Transfer/sec:    761.02KB
```
Performance is quite good. Now test http://127.0.0.1:3289/, which involves no Redis operations:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   127.86ms   41.71ms 260.69ms   55.22%
    Req/Sec    258.56     92.25   520.00     68.90%
  92766 requests in 30.10s, 13.45MB read
Requests/sec:   3081.87
Transfer/sec:    457.47KB
```
⭐️ Requests/sec: 3081.87
Turn off Sanic logging and test http://127.0.0.1:3289/ again:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.63ms   12.66ms  96.28ms   26.07%
    Req/Sec     0.96k   137.29     2.21k    73.29%
  342764 requests in 30.10s, 49.69MB read
Requests/sec:  11387.89
Transfer/sec:      1.65MB
```
⭐️ Requests/sec: 11387.89
Real-world proxy performance test
test_proxy.py is used to test actual proxy performance.
Run the code
```
$ cd test
$ python test_proxy.py

# Settable environment variables (defaults as defined in test_proxy.py)
TEST_COUNT = os.environ.get("TEST_COUNT") or 1000
TEST_WEBSITE = os.environ.get("TEST_WEBSITE") or "https://httpbin.org/"
TEST_PROXIES = os.environ.get("TEST_PROXIES") or "http://localhost:3289/get/20"
```
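For example, assuming a POSIX shell (the site and count below are purely illustrative), the defaults can be overridden inline:

```
$ TEST_WEBSITE="https://taobao.com/" TEST_COUNT=500 python test_proxy.py
```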
The measured results
httpbin.org/
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://httpbin.org/
Requests:      1000
Successes:     1000
Failures:      0
Success rate:  1.0
```
taobao.com
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://taobao.com/
Requests:      1000
Successes:     984
Failures:      16
Success rate:  0.984
```
baidu.com
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://baidu.com
Requests:      1000
Successes:     975
Failures:      25
Success rate:  0.975
```
zhihu.com
```
Test proxies:  http://localhost:3289/get/20
Test site:     https://zhihu.com
Requests:      1000
Successes:     1000
Failures:      0
Success rate:  1.0
```
As you can see, the proxies perform well and the success rate is very high. 😉
Practical Application Examples
```python
import random

import requests

# Make sure the Sanic service is already running

# Fetch multiple proxies and pick one at random
try:
    proxies = requests.get("http://localhost:3289/get/20").json()
    req = requests.get("https://example.com", proxies=random.choice(proxies))
except:
    raise

# Fetch a single proxy
try:
    proxy = requests.get("http://localhost:3289/pop").json()
    req = requests.get("https://example.com", proxies=proxy)
except:
    raise
```
A pitfall of aiohttp
The whole project is based on the aiohttp asynchronous network library, and proxies are covered in aiohttp's documentation:
aiohttp supports HTTP/HTTPS proxies
In practice, however, HTTPS proxies are not supported at all; this is stated plainly in its source code:
Only http proxies are supported
My feelings about this are mixed 😲, but it doesn't matter much: HTTP proxies alone already work well, as the test data above shows.
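For reference, a minimal sketch of using a proxy from the pool with aiohttp, assuming the service is running locally on port 3289 (the target URL is only an example; note that the proxy is passed via aiohttp's proxy argument and must be an HTTP proxy):

```python
import asyncio

import aiohttp

async def fetch():
    async with aiohttp.ClientSession() as session:
        # Grab one proxy from the pool (the /pop endpoint returns {"http": "..."})
        async with session.get("http://localhost:3289/pop") as resp:
            proxy = (await resp.json())["http"]
        # aiohttp only accepts HTTP proxies
        async with session.get("http://example.com", proxy=proxy) as resp:
            return resp.status

print(asyncio.get_event_loop().run_until_complete(fetch()))
```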
Reference projects
✨ 🍰 ✨
- ProxyPool
- proxy_pool
License
MIT © chenjiandongx