A web crawler is an I/O-intensive application: most of its time is spent waiting for responses, so CPU utilization is low and crawling is slow. To solve these problems, we use an asynchronous crawler.

Serial: if we want to crawl a website, we usually finish one page before moving on to the next, so more than 90% of the CPU time is spent waiting for pages to respond.

Asynchronous: we can launch multiple requests at the same time. After launching one request, we do not wait for its response before launching the second, the third, and so on; we then process the responses one by one as they arrive, which is much more efficient.

For example, first we build a Flask server and deliberately slow it down:

from flask import Flask
import time

app = Flask(__name__)

@app.route('/')
def hello_world():
    time.sleep(3)  # deliberately slow each response down by 3 seconds
    return 'Hello World!'

if __name__ == '__main__':
    app.run(threaded=True)
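For comparison, a plain serial version (a minimal sketch using the requests library against the local server above) blocks for roughly 3 seconds per request, so five requests take about 15 seconds in total:

import time
import requests

# Serial baseline: each request must finish before the next one starts.
start = time.time()
for i in range(5):
    res = requests.get("http://127.0.0.1:5000/")
    print(res.status_code, res.text)
end = time.time()
print('Cost time:', end - start)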

First, we use the async/await syntax introduced in Python 3.5 together with aiohttp:

import asyncio
import time
import aiohttp

start = time.time()

async def get(url):
    # One request: open a session, send the GET, and await the response body.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as res:
            print(res.status)
            text = await res.text()
            return text

async def hello():
    url = "http://127.0.0.1:5000/"
    print('Waiting for', url)
    res = await get(url)
    print('Result:', res)

# Launch five coroutines on one event loop; they all wait for the
# server concurrently instead of one after another.
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(hello()) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Cost time:',end-start)
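On Python 3.7 and later, the event-loop boilerplate above can be replaced with asyncio.run() and asyncio.gather(); a minimal sketch reusing the hello() coroutine from above:

# Equivalent to the loop/ensure_future code above, but using the
# higher-level APIs available since Python 3.7.
async def main():
    await asyncio.gather(*(hello() for _ in range(5)))

asyncio.run(main())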

The third-party library gevent can also implement asynchronous network requests (it is mainly used on Python 2):

from gevent import monkey
monkey.patch_all()

import gevent
import requests
import time

def get(url):
    print("Get from:", url)
    r = requests.Session()
    res = r.get(url)
    print(res.status_code, url, res.text)

def synchronous_times(url):
    # Five requests, one after another.
    start = time.time()
    for i in range(5):
        get(url)
    end = time.time()
    print("Synchronous cost time:", end - start)

def asynchronous_times(url):
    # Five requests as concurrent greenlets.
    start = time.time()
    gevent.joinall([gevent.spawn(get, url) for i in range(5)])
    end = time.time()
    print("Asynchronous cost time:", end - start)

synchronous_times("http://127.0.0.1:5000/")
asynchronous_times("http://127.0.0.1:5000/")
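In a real crawler you usually want to cap how many requests run at once. A minimal sketch of bounded concurrency with gevent.pool.Pool, reusing the get() function above against the same local server:

from gevent.pool import Pool

# At most 2 greenlets run get() at the same time; join() waits for all of them.
pool = Pool(2)
for i in range(5):
    pool.spawn(get, "http://127.0.0.1:5000/")
pool.join()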

The above uses aiohttp and gevent to achieve asynchronous network requests.