The author, Jclian, has been doing Python development for more than a year. He is a Python enthusiast who likes algorithms and sharing, and hopes to make more like-minded friends and go further in learning Python together!
Introduction
Asyncio is a commonly used asynchronous processing module in Python that implements concurrent I/O operations in a single thread. The asyncio module itself will be covered in later articles. This article describes an HTTP framework built on asyncio, aiohttp, which lets us make HTTP requests asynchronously and can greatly improve the efficiency of our programs. Here we introduce a simple application of aiohttp in crawlers. In an earlier project, we used Python's crawler framework Scrapy to crawl the books on Dangdang's bestseller list. In this article, the author implements that crawler in two ways, compares the efficiency of the synchronous crawler and the asynchronous crawler (implemented with aiohttp), and shows the advantages of aiohttp for crawling.
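Before turning to the crawler, here is a minimal, illustrative sketch of what an aiohttp request looks like; the URL is only a placeholder and this snippet is not part of the original project:

import asyncio
import aiohttp

async def main():
    # Open a session, send one GET request and read the response body as text
    async with aiohttp.ClientSession() as session:
        async with session.get('http://example.com') as response:
            print(response.status)
            print(await response.text())

# Create an event loop and run the coroutine until it completes
loop = asyncio.get_event_loop()
loop.run_until_complete(main())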
Synchronous crawler
First, let's look at the usual way a crawler is written, that is, the synchronous approach. The complete Python code is as follows:
# Synchronously crawl the information of the bestselling books.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# The table is used to store book information
table = []

# Handle one web page
def download(url):
    html = requests.get(url).text
    # Parse the text into HTML using BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Get the bestseller information on the page
    book_list = soup.find('ul', class_="bang_list clearfix bang_list_mode")('li')
    for book in book_list:
        info = book.find_all('div')
        # Get the rank, name, number of comments, author and publisher of each bestseller
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text.split('条')[0]
        author = info[4].text
        date_and_publisher = info[5].text.split()
        publisher = date_and_publisher[1] if len(date_and_publisher) >= 2 else ' '
        # Add the above information for each bestseller to the table
        table.append([rank, name, comments, author, publisher])

# All pages
urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i for i in range(1, 26)]

# Count the elapsed time of the crawler
print('#' * 50)
t1 = time.time()  # Start time

for url in urls:
    download(url)

# Convert table to a pandas DataFrame and save it as a CSV file
df = pd.DataFrame(table, columns=['rank', 'name', 'comments', 'author', 'publisher'])
df.to_csv('E://douban/dangdang.csv', index=False)
t2 = time.time()  # End time
print('Using the general method, total time: %s' % (t2 - t1))
print('#' * 50)
The following output is displayed:
##################################################
Using the general method, total time: 23.522345542907715
##################################################
The program ran for 23.5 seconds and crawled the information of 500 books, which is reasonably efficient. We can then open the target directory to check the generated file.
Asynchronous crawler
Next, let's look at the efficiency of an asynchronous crawler built with aiohttp. The complete source code is as follows:
# Asynchronously crawl the information of the bestselling books.
import time
import aiohttp
import asyncio
import pandas as pd
from bs4 import BeautifulSoup

# The table is used to store book information
table = []

# Get a web page (its text content)
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text(encoding='gb18030')

# Parse a web page
async def parser(html):
    # Parse the text into HTML using BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Get the bestseller information on the page
    book_list = soup.find('ul', class_="bang_list clearfix bang_list_mode")('li')
    for book in book_list:
        info = book.find_all('div')
        # Get the rank, name, number of comments, author and publisher of each bestseller
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text.split('条')[0]
        author = info[4].text
        date_and_publisher = info[5].text.split()
        publisher = date_and_publisher[1] if len(date_and_publisher) >= 2 else ' '
        # Add the above information for each bestseller to the table
        table.append([rank, name, comments, author, publisher])

# Handle one web page
async def download(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        await parser(html)

# All pages
urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i for i in range(1, 26)]

# Count the elapsed time of the crawler
print('#' * 50)
t1 = time.time()  # Start time

# Use the asyncio module for asynchronous IO processing
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(download(url)) for url in urls]
tasks = asyncio.gather(*tasks)
loop.run_until_complete(tasks)

# Convert table to a pandas DataFrame and save it as a CSV file
df = pd.DataFrame(table, columns=['rank', 'name', 'comments', 'author', 'publisher'])
df.to_csv('E://douban/dangdang.csv', index=False)
t2 = time.time()  # End time
print('Using aiohttp, total time: %s' % (t2 - t1))
print('#' * 50)
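As a side note, on Python 3.7 and newer the event-loop boilerplate above (get_event_loop, ensure_future, run_until_complete) can be written more compactly with asyncio.run. A minimal sketch, assuming the download coroutine and urls list from the script above:

# A minimal sketch: the same concurrent downloads driven by asyncio.run (Python 3.7+)
async def main():
    # Schedule every page download concurrently and wait until all of them finish
    await asyncio.gather(*(download(url) for url in urls))

asyncio.run(main())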
We can see that this crawler follows basically the same idea and processing steps as the synchronous crawler above, except that aiohttp is used to handle the HTTP requests, the page-handling functions are turned into coroutines, and asyncio is used for concurrent processing, which undoubtedly improves the efficiency of the crawler. It runs as follows:
##################################################
Using aiohttp, total time: 2.405137538909912
##################################################
2.4 seconds, amazing! You can open the output file to check its contents.
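One possible refinement, not part of the original script: download above opens a new aiohttp.ClientSession for every URL, while the aiohttp documentation generally recommends reusing a single session for many requests so that connections can be pooled. A rough sketch of that variation, reusing the fetch and parser coroutines defined above (handle and crawl are hypothetical helper names):

# Sketch: share one ClientSession across all pages instead of opening one per URL
async def handle(session, url):
    html = await fetch(session, url)   # fetch() as defined in the asynchronous script
    await parser(html)                 # parser() as defined in the asynchronous script

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(handle(session, url) for url in urls))

loop = asyncio.get_event_loop()
loop.run_until_complete(crawl(urls))

With only 25 pages the difference is probably small, but a shared session avoids repeatedly setting up and tearing down connections.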
Conclusion
To sum up, there is a big difference in efficiency between a crawler written synchronously and one written asynchronously, so asynchronous crawlers are well worth considering in real crawling work, and asynchronous modules such as asyncio and aiohttp deserve more use. Note that aiohttp only supports Python 3.5.3 and later. Of course, this article only gives an example of an asynchronous crawler and does not explain in detail how asynchrony works behind the scenes; asynchronous ideas are widely applied both in everyday life and in building websites. That is the end of this article; everyone is welcome to exchange ideas~
The Python Chinese community is a decentralized global technology community with the vision of becoming the home of 200,000 Python developers worldwide. It currently covers all major media and collaboration platforms for Chinese developers and has established wide-ranging connections with well-known companies and technology communities such as Alibaba, Tencent, Baidu, Microsoft, Amazon, Open Source China and CSDN. It has tens of thousands of registered members from more than 10 countries and regions, and nearly 200,000 developers from government departments, research institutions, financial institutions and well-known companies at home and abroad, represented by the Ministry of Public Security, the Ministry of Industry and Information Technology, Tsinghua University, Peking University, Beijing University of Posts and Telecommunications, the People's Bank of China, the Chinese Academy of Sciences, CICC, Huawei, BAT, Google and Microsoft, follow the platform.
Further reading
Do you really know Python strings?
Seven ways to concatenate strings in Python
How to deploy and monitor distributed crawler projects easily and efficiently
Douyin little sister video crawler
Email: [email protected]
You can join the club for free.