This article was originally published in the WeChat public account "Geek Monkey". Follow it to be the first to see new original posts.
When writing a crawler, we usually care a great deal about its efficiency. Several factors affect crawling efficiency, such as whether multithreading is used, how I/O is handled, and whether requests execute synchronously. Among them, I/O operations and synchronous execution have the biggest impact.
As you know, Requests is an excellent HTTP library that makes it very easy to issue HTTP requests. However, every network request it performs is synchronous. When the crawler process gets a CPU time slice but is busy with an I/O operation (such as downloading an image), the CPU sits idle for the whole duration of that I/O, and its computing power is wasted.
If the CPU could make use of that waiting time, crawler efficiency would improve. To achieve that, the program's synchronous I/O operations need to become asynchronous. This article introduces aiohttp, a powerful library for asynchronous I/O operations.
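For contrast, here is a minimal sketch of the synchronous style described above; the URL list is only a placeholder, and each requests.get() call blocks until its download finishes:

import requests  # assumes the requests package is installed

urls = ['http://httpbin.org/get'] * 3  # placeholder list of pages to crawl

for url in urls:
    # the CPU sits idle while each request is in flight
    response = requests.get(url)
    print(response.status_code)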
1 Introduction to aiohttp
Speaking of aiohttp, asyncio has to be mentioned. asyncio is a standard library introduced in Python 3.4. It provides concurrency within a single thread, using coroutines to perform I/O operations. asyncio's programming model revolves around an event loop: we obtain a reference to an EventLoop from the asyncio module, then submit the coroutines that need to run to that loop, and asynchronous I/O falls out of this.
Here is an example that uses asyncio to implement an asynchronous function hello():
import asyncio

@asyncio.coroutine  # decorator, equivalent to hello = asyncio.coroutine(hello)
def hello():
    print("Hello world!")
    # call asyncio.sleep(1) asynchronously
    r = yield from asyncio.sleep(1)
    print("Hello again!")

# get a reference to the EventLoop
loop = asyncio.get_event_loop()
# execute the coroutine
loop.run_until_complete(hello())
loop.close()
aiohttp is an HTTP framework built on top of asyncio. It describes itself as the "Async HTTP Client/Server Framework", that is, a client/server framework for asynchronous HTTP. As the name suggests, aiohttp has both a server side and a client side and specializes in handling HTTP requests asynchronously.
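The rest of this article focuses on the client side, but to show that the server side exists too, here is a minimal sketch of an aiohttp web server (assuming aiohttp 3.x; the route and response text are just examples):

from aiohttp import web

# a handler that answers every GET / with plain text
async def handle(request):
    return web.Response(text="Hello from aiohttp!")

app = web.Application()
app.add_routes([web.get('/', handle)])

if __name__ == '__main__':
    web.run_app(app)  # serves on http://0.0.0.0:8080 by default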
2 aiohttp installation
aiohttp can be installed with pip by running the install command in a terminal:
pip install aiohttp
3 async/await syntax
We covered asynchronous I/O above, but declaring asynchronous functions that way is tedious and requires the yield from syntax. Python 3.5 introduced the async/await keywords to make writing asynchronous code more intuitive and friendly.
Add the keyword async before a function's def to mark that function as asynchronous; it is equivalent to the older @asyncio.coroutine decorator. For example:
async def hello():
    print("Hello World!")
In addition, use await instead of yield from to indicate that this part of the operation is asynchronous.
async def hello():
    print("Hello World!")
    r = await asyncio.sleep(1)
    print("Hello again!")
Finally, get a reference to the EventLoop as before and use it to run the coroutine. The final code looks like this:
import asyncio

async def hello():
    print("Hello world!")
    r = await asyncio.sleep(1)
    print("Hello again!")

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    tasks = [hello(), ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
The running results are as follows:
Hello world!
(pauses for one second)
Hello again!
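The real payoff comes from running several coroutines at once. A sketch using the same syntax (the names passed in are made up) shows that the sleeps overlap instead of queueing:

import asyncio

async def hello(name):
    print("Hello %s!" % name)
    await asyncio.sleep(1)  # all three sleeps run concurrently
    print("Goodbye %s!" % name)

loop = asyncio.get_event_loop()
# three coroutines on one loop: total runtime is about 1 second, not 3
tasks = [hello("foo"), hello("bar"), hello("baz")]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()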
4 Basic usage of aiohttp
Let's use aiohttp to make a GET request to the httpbin.org site. Because aiohttp handles HTTP requests asynchronously (the "aio" stands for asynchronous I/O), our code must also follow Python's asynchronous function syntax, which means using async/await.
Making an HTTP request with aiohttp can be broken into the following steps:
1) Define an asynchronous function with async.
2) Use aiohttp.ClientSession to get a session object.
3) Use that session object to request web pages via GET, POST, PUT, and so on.
4) Finally, get an EventLoop reference and run the asynchronous function.
import asyncio
import aiohttp

# define the asynchronous function main()
async def main():
    # get a session object
    async with aiohttp.ClientSession() as session:
        # request httpbin with GET
        async with session.get('http://httpbin.org/get') as response:
            print(response.status)
            print(await response.text())

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
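Since speeding up a crawler is the whole point, here is a sketch of fetching several pages concurrently with a single session via asyncio.gather; the URL list is just a placeholder:

import asyncio
import aiohttp

urls = ['http://httpbin.org/get'] * 3  # placeholder list of pages to crawl

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # one task per URL; the requests overlap instead of queueing
        tasks = [fetch(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
        print(len(pages), 'pages downloaded')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())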
aiohttp also supports custom headers, timeouts, proxies, and cookies.
import asyncio
import aiohttp

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
}
data = {
    'data': 'person data',
}

# define the asynchronous function main()
async def main():
    # get a session object
    async with aiohttp.ClientSession() as session:
        # request httpbin with POST
        async with session.post(url=url, headers=headers, data=data) as response:
            print(response.status)
            print(await response.text())

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
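The example above only exercises headers and form data. Here is a sketch of the timeout, proxy, and cookie settings as well, assuming a recent aiohttp 3.x (the proxy address and cookie values are placeholders):

import asyncio
import aiohttp

async def main():
    timeout = aiohttp.ClientTimeout(total=10)  # cap the whole request at 10 seconds
    cookies = {'token': 'demo'}                # sent with every request in this session
    async with aiohttp.ClientSession(timeout=timeout, cookies=cookies) as session:
        # route the request through a local proxy (placeholder address)
        async with session.get('http://httpbin.org/get',
                               proxy='http://127.0.0.1:8888') as response:
            print(response.status)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())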
For more on aiohttp, read the documentation on the official website. To be honest, aiohttp is used in much the same way as Requests; if you already know the Requests library, you will pick up aiohttp quickly.
This article was first published in my WeChat public account; the original title is roughly "Improve crawler efficiency? aiohttp". You are welcome to reprint it at any time; please contact the account to be added to the whitelist, and respect the author's original work. I share original Python pieces every week in the WeChat account "Geek Monkey", covering web crawlers, data analysis, web development, and more.