Introduction

Asyncio is a commonly used asynchronous processing module in Python that implements concurrent I/O operations on a single thread. The asyncio module itself will be introduced in later articles; this article describes aiohttp, an HTTP framework built on top of asyncio that lets us make HTTP requests asynchronously and can therefore greatly improve the efficiency of our programs. Here we show a simple application of aiohttp in a crawler. In the original project we used the Python crawler framework Scrapy to crawl the books on Dangdang's bestseller list; in this article the author builds the crawler in two ways, compares the efficiency of the synchronous crawler with that of the asynchronous crawler (implemented with aiohttp), and shows the advantages of aiohttp for crawling.
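Before the comparison, here is a minimal sketch (not part of the original project) of what a single asynchronous request with aiohttp looks like; fetch_page and the URL are only placeholders for illustration:

import asyncio
import aiohttp


async def fetch_page(url):
    # Open a session, request the page and return its text
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


loop = asyncio.get_event_loop()
html = loop.run_until_complete(fetch_page('http://bang.dangdang.com/books/bestsellers'))
print(html[:200])  # print the first 200 characters of the page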

Synchronous crawler

First, let's look at how such a crawler is usually implemented, that is, synchronously. The complete Python code is as follows:



'''
Crawl Dangdang bestseller book information the synchronous way
'''

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# table is used to store the book information
table = []


# Handle one web page
def download(url):
    html = requests.get(url).text

    # Parse the text into HTML using BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Get the bestseller entries on the page
    book_list = soup.find('ul', class_="bang_list clearfix bang_list_mode")('li')

    for book in book_list:
        info = book.find_all('div')

        # Get the rank, name, number of comments, author and publisher of each bestseller
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text.split('条')[0]  # split the review count out of text like "xxx条评论"
        author = info[4].text
        date_and_publisher = info[5].text.split()
        publisher = date_and_publisher[1] if len(date_and_publisher) >= 2 else ''

        # Add the above information for each bestseller to table
        table.append([rank, name, comments, author, publisher])


# All pages
urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i for i in range(1, 26)]

# Measure the elapsed time of the crawler
print('#' * 50)
t1 = time.time()  # start time

for url in urls:
    download(url)

# Convert table to a pandas DataFrame and save it as a CSV file
df = pd.DataFrame(table, columns=['rank', 'name', 'comments', 'author', 'publisher'])
df.to_csv('E://douban/dangdang.csv', index=False)

t2 = time.time()  # end time
print('Using the general method, total time: %s' % (t2 - t1))
print('#' * 50)

The following output is displayed:



##################################################
Using the general method, total time: 23.522345542907715
##################################################

The program took about 23.5 seconds to crawl the information of 500 books, which is still reasonably efficient, and saved the results to a CSV file in the output directory.



Asynchronous crawler


Next, let's look at the efficiency of an asynchronous crawler built with aiohttp. The complete source code is as follows:



'''
Crawl Dangdang bestseller book information the asynchronous way
'''

import time
import aiohttp
import asyncio
import pandas as pd
from bs4 import BeautifulSoup

# table is used to store the book information
table = []


# Get a web page (its text)
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text(encoding='gb18030')


# Parse a web page
async def parser(html):
    # Parse the text into HTML using BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Get the bestseller entries on the page
    book_list = soup.find('ul', class_="bang_list clearfix bang_list_mode")('li')

    for book in book_list:
        info = book.find_all('div')

        # Get the rank, name, number of comments, author and publisher of each bestseller
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text.split('条')[0]  # split the review count out of text like "xxx条评论"
        author = info[4].text
        date_and_publisher = info[5].text.split()
        publisher = date_and_publisher[1] if len(date_and_publisher) >= 2 else ''

        # Add the above information for each bestseller to table
        table.append([rank, name, comments, author, publisher])


# Handle one web page
async def download(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        await parser(html)


# All pages
urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i for i in range(1, 26)]

# Measure the elapsed time of the crawler
print('#' * 50)
t1 = time.time()  # start time

# Use the asyncio module for asynchronous I/O processing
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(download(url)) for url in urls]
tasks = asyncio.gather(*tasks)
loop.run_until_complete(tasks)

# Convert table to a pandas DataFrame and save it as a CSV file
df = pd.DataFrame(table, columns=['rank', 'name', 'comments', 'author', 'publisher'])
df.to_csv('E://douban/dangdang.csv', index=False)

t2 = time.time()  # end time
print('Using aiohttp, total time: %s' % (t2 - t1))
print('#' * 50)

We can see that this crawler follows basically the same idea and processing steps as the original, general-method crawler, except that the aiohttp module is used to handle the HTTP requests, the page-handling functions are turned into coroutines, and asyncio is used for concurrent processing, which undoubtedly improves the efficiency of the crawler. It runs as follows:



##################################################
Using aiohttp, total time: 2.405137538909912
##################################################

Only 2.4 seconds, amazing! The resulting file contains the same book information as before.
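One design note on the asynchronous version: the code above opens a new aiohttp.ClientSession for every URL. The aiohttp documentation recommends reusing a single session across many requests, so a variant along the following lines (a sketch that reuses the fetch and parser coroutines defined above) may be slightly faster:

# Sketch: share a single ClientSession across all pages (reuses fetch and parser from above)
async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        # Download all pages concurrently, then parse them
        pages = await asyncio.gather(*[fetch(session, url) for url in urls])
        for html in pages:
            await parser(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(crawl(urls))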



Conclusion

To sum up, there is a big difference in efficiency between a crawler built the synchronous way and one built the asynchronous way. Asynchronous crawlers are therefore worth considering in practice, and asynchronous modules such as asyncio and aiohttp deserve to be used more often. Note that aiohttp only supports Python 3.5.3 and later.


The original article was published on November 29, 2018

Author: jclian

This article is from the Python Chinese Community, a cloud community partner. For more information, follow the Python Chinese Community.