If you have heard of asynchronous crawlers, you have more or less heard of the aiohttp library. It lets you make asynchronous HTTP requests using Python's native async/await.

With aiohttp, we can write crawlers using a Requests-like API while reaching concurrency comparable to Scrapy.

In the official aiohttp documentation, you can see a code example like the one below:
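
The original post shows that snippet as a screenshot, so here is a minimal sketch in the spirit of the aiohttp quickstart (the exact example in the docs may differ in its details):

import asyncio
import aiohttp


async def main():
    # one session, one request: fetch a page and print its status and body
    async with aiohttp.ClientSession() as session:
        async with session.get('http://python.org') as resp:
            print(resp.status)
            print(await resp.text())


loop = asyncio.get_event_loop()
loop.run_until_complete(main())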

Let’s modify it a little bit now to see how efficient it is to write a crawler like this.

The modified code is as follows:

import asyncio
import aiohttp

template = 'http://exercise.kingname.info/exercise_middleware_ip/{page}'


async def get(session, page):
    # fetch one page and print its body
    url = template.format(page=page)
    resp = await session.get(url)
    print(await resp.text(encoding='utf-8'))


async def main():
    async with aiohttp.ClientSession() as session:
        # each request is awaited before the next one starts,
        # so the pages are fetched one after another
        for page in range(100):
            await get(session, page)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())

This code visits my crawler practice station 100 times to get 100 pages of content.

You can check out the video below to see how it works:

![](kingname-1257411235.cos.ap-chengdu.myqcloud.com/slow.2019-1… 22_51_37.gif)

At this point, it is barely faster than a single-threaded crawler written with Requests, yet it takes noticeably more code.

So how do you unleash aiohttp's power correctly?

Let’s change the code now:

import asyncio
import aiohttp

template = 'http://exercise.kingname.info/exercise_middleware_ip/{page}'


async def get(session, queue):
    while True:
        try:
            # take the next page number, or exit once the queue is empty
            page = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        url = template.format(page=page)
        resp = await session.get(url)
        print(await resp.text(encoding='utf-8'))

async def main():
    async with aiohttp.ClientSession() as session:
        queue = asyncio.Queue()
        for page in range(1000):
            queue.put_nowait(page)
        # start 100 worker coroutines that all share the same queue
        tasks = []
        for _ in range(100):
            task = get(session, queue)
            tasks.append(task)
        await asyncio.wait(tasks)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())

In the modified code, I told the crawler to fetch 1,000 pages of content. Let's take a look at the video below:

![](kingname-1257411235.cos.ap-chengdu.myqcloud.com/fast.2019-1… 22_49_49.gif)

As you can see, the speed is now comparable to Scrapy. Keep in mind that this crawler runs in a single process with a single thread; it reaches this speed purely by working asynchronously.

So why is the modified code so much faster?

The key lies in this piece of code:

tasks = []
for _ in range(100):
    task = get(session, queue)
    tasks.append(task)
await asyncio.wait(tasks)

In the slow version, only one coroutine is ever running. In the fast version, we create 100 coroutines and hand them to asyncio.wait for unified scheduling; asyncio.wait returns once all of the coroutines have finished.

We put 1,000 page numbers into an asynchronous queue created with asyncio.Queue(). Each coroutine loops with while True, pulling a page from the queue, building the URL, and requesting it; once the queue is empty, the coroutine exits.
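
To see that queue pattern in isolation, here is a stripped-down sketch with asyncio.sleep standing in for the network request (not part of the original post; the names drain and demo are made up for illustration):

import asyncio


async def drain(queue):
    # pull items until the queue is empty, then exit
    while True:
        try:
            item = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        await asyncio.sleep(0.1)  # stand-in for waiting on network I/O
        print('got', item)


async def demo():
    queue = asyncio.Queue()
    for i in range(10):
        queue.put_nowait(i)
    # three workers drain the same queue concurrently
    await asyncio.gather(*(drain(queue) for _ in range(3)))


asyncio.run(demo())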

When the program runs, Python automatically schedules the 100 coroutines. While one coroutine is waiting for network I/O to return, it switches to a second coroutine and starts its request; while that one waits, it switches to a third, and so on. The program makes full use of the time spent waiting on network I/O, which greatly improves the running speed.
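
As a side note, passing bare coroutines to asyncio.wait has been deprecated since Python 3.8 and is no longer accepted in the newest releases, so on a recent interpreter you may prefer asyncio.run together with asyncio.gather. A rough equivalent of the fast crawler written that way, offered as a sketch rather than the original author's code:

import asyncio
import aiohttp

template = 'http://exercise.kingname.info/exercise_middleware_ip/{page}'


async def get(session, queue):
    while True:
        try:
            page = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        async with session.get(template.format(page=page)) as resp:
            print(await resp.text(encoding='utf-8'))


async def main():
    async with aiohttp.ClientSession() as session:
        queue = asyncio.Queue()
        for page in range(1000):
            queue.put_nowait(page)
        # gather accepts coroutines directly and waits for all of them
        await asyncio.gather(*(get(session, queue) for _ in range(100)))


asyncio.run(main())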

Finally, I would like to thank Xiaohe, the intern, for his speed-up scheme.