This is the 40th day of my first Text Challenge 2022. Using coroutines in a Python crawler can greatly improve collection efficiency against a target site, so in this article we revisit the concept step by step and apply it to a crawler case.
Definition of coroutines
With the groundwork laid in the previous two articles, defining a coroutine is now very simple: add the async keyword in front of a function and it becomes a coroutine function. You can verify the type directly with isinstance.
from collections.abc import Coroutine

async def func():
    print("I'm a coroutine function.")

if __name__ == '__main__':
    # Creating the coroutine object does not run the function body,
    # i.e. nothing is printed yet
    coroutine = func()
    # Type check
    print(isinstance(coroutine, Coroutine))
Running the code produces the following output:
True
sys:1: RuntimeWarning: coroutine 'func' was never awaited
The type check confirms that a function declared with the async keyword produces a coroutine object. Ignore the warning for now: it appears because the coroutine was created but never submitted to an event loop and awaited.
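To make the warning disappear, the coroutine just needs to be handed to an event loop and awaited. A minimal sketch (asyncio.run requires Python 3.7+):

import asyncio

async def func():
    print("I'm a coroutine function.")

# Submitting the coroutine to an event loop runs it and
# avoids the "never awaited" warning
asyncio.run(func())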
Using coroutines
The target site is banan.huiben.61read.com, a picture-book site affiliated with China Children's Press and Publication Group. It hosts a large number of children's picture-book animations, carries no advertisements, and serves the animations as plain MP4 files, which makes them easy to download.
import asyncio
import requests

# coroutine function
async def get_html():
    res = requests.get("http://banan.huiben.61read.com/Video/List/1d4a3be3-0a72-4260-979b-743d9db8ad85")
    if res is not None:
        return res.status_code
    else:
        return None

# Declare the coroutine object
coroutine = get_html()
# Event loop object
loop = asyncio.get_event_loop()
# Convert the coroutine to a task
task = loop.create_task(coroutine)
# task = asyncio.ensure_future(coroutine)  # this also converts a coroutine to a task
# Put the task into the event loop and run it to completion
loop.run_until_complete(task)
# Output the result
print("Result output:", task.result())
On Python 3.7 and later, the code above can also be written with the asyncio.run() method, which runs a top-level entry coroutine:
import asyncio
import requests

# coroutine function
async def get_html():
    res = requests.get("http://banan.huiben.61read.com/Video/List/1d4a3be3-0a72-4260-979b-743d9db8ad85")
    if res is not None:
        print(res.status_code)
    else:
        return None

async def main():
    await get_html()

asyncio.run(main())
Next, building on the code above, let's implement the download of two MP4 videos.
# http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
# http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4
import asyncio
import time
import requests

async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None

async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        # The ./mp4 directory must exist before writing
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)

async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4
    await get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4")
    await get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4")
    print("Code runtime:", time.perf_counter() - start_time)

if __name__ == '__main__':
    asyncio.run(main())
In my test it took about 44 seconds to download the two videos; your timing will vary with machine and network speed.
Next, let's use the asyncio.create_task() function to schedule multiple coroutines as tasks and try to optimize the execution time.
import asyncio
import time
import requests

async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None

async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)

async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4
    task1 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4"))
    task2 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"))
    await task1
    await task2
    print("Code runtime:", time.perf_counter() - start_time)

if __name__ == '__main__':
    asyncio.run(main())
This version ran in 27 seconds in my test, a visible improvement. (One caveat worth knowing: requests is a blocking library, so the event loop cannot switch to another task while a request is actually in flight; for fully concurrent downloads a truly asynchronous HTTP client is needed, as sketched below.)
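As an aside, here is a minimal sketch of the same two downloads using aiohttp, a genuinely asynchronous HTTP client. This is not the approach used in this article and is only an illustration; it assumes aiohttp is installed separately (pip install aiohttp):

import asyncio
import aiohttp

async def download(session, url, path):
    # session.get() yields to the event loop while waiting on the
    # network, so the two downloads genuinely overlap
    async with session.get(url) as resp:
        data = await resp.read()
    # The ./mp4 directory must exist before writing
    with open(path, "wb") as f:
        f.write(data)

async def main():
    urls = ["http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4",
            "http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"]
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download(session, url, f"./mp4/{i}.mp4")
                               for i, url in enumerate(urls)))

asyncio.run(main())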
Before formally examining the code above, let's learn the concept of awaitables. An object that can be used in an await expression is an awaitable object. There are three main types of awaitables: coroutines, tasks, and futures.
In Python, a distinction must be made between a coroutine function and a coroutine object; the latter is what calling the former returns.
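A small sketch illustrating all three awaitable types (the coroutine name nested is made up for illustration):

import asyncio

async def nested():
    return 42

async def main():
    # 1. A coroutine object is awaitable
    print(await nested())
    # 2. A Task wraps a coroutine and is awaitable
    task = asyncio.create_task(nested())
    print(await task)
    # 3. A Future is the low-level awaitable that Task builds on
    fut = asyncio.get_running_loop().create_future()
    fut.set_result("done")
    print(await fut)

asyncio.run(main())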
asyncio.create_task(coro, *, name=None) creates a Task object and schedules its execution; the first parameter is a coroutine object and the second is an optional task name. On Python versions before 3.7, use the asyncio.ensure_future() function instead.
The prototype of asyncio.gather(), the function for running tasks concurrently, is shown below:
asyncio.gather(*aws, loop=None, return_exceptions=False) -> awaitable
This runs the awaitable objects in the aws sequence concurrently. If an awaitable in aws is a coroutine, it is automatically scheduled as a task.
The return_exceptions parameter works as follows:
If return_exceptions is False (the default), the first raised exception is immediately propagated to the task that awaits gather(); the other awaitables in the aws sequence are not cancelled and continue to run.
If return_exceptions is True, exceptions are treated like successful results and aggregated into the result list.
If gather() itself is cancelled, all submitted (unfinished) awaitables are also cancelled.
The prototype of the simple waiting function asyncio.wait() is as follows:
asyncio.wait(aws, *, loop=None, timeout=None, return_when=ALL_COMPLETED) -> coroutine
This runs the awaitable objects specified by aws concurrently and blocks until the condition specified by return_when is met.
If an awaitable in aws is a coroutine, it is automatically scheduled as a task; passing coroutine objects directly to wait() is deprecated.
The function returns two sets of Tasks/Futures, conventionally unpacked as (done, pending).
return_when specifies when the function should return. It must be one of the following constants:
FIRST_COMPLETED: return when any awaitable finishes or is cancelled;
FIRST_EXCEPTION: return when any awaitable finishes by raising an exception (equivalent to ALL_COMPLETED if no exception is raised);
ALL_COMPLETED: return when all awaitables have finished or been cancelled.
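A minimal sketch of FIRST_COMPLETED (the task names and delays are illustrative):

import asyncio

async def job(name, delay):
    await asyncio.sleep(delay)
    return name

async def main():
    tasks = [asyncio.create_task(job("fast", 0.1)),
             asyncio.create_task(job("slow", 1.0))]
    # Returns as soon as the first task finishes; the other stays pending
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    print([t.result() for t in done])  # ['fast']
    for t in pending:
        t.cancel()

asyncio.run(main())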
A method similar to wait() is wait_for(), whose prototype looks like this:
asyncio.wait_for(aw, timeout, *, loop=None) -> coroutine
This waits for the awaitable aw to complete, timing out after the specified number of seconds.
A coroutine can be passed directly; if a timeout occurs, the task is cancelled and asyncio.TimeoutError is raised.
wait() differs from wait_for() in that wait() does not cancel the awaitables when a timeout occurs.
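A minimal sketch of the timeout behavior (the coroutine name slow is made up for illustration):

import asyncio

async def slow():
    await asyncio.sleep(10)

async def main():
    try:
        # The inner task is cancelled when the timeout expires
        await asyncio.wait_for(slow(), timeout=1.0)
    except asyncio.TimeoutError:
        print("timed out")

asyncio.run(main())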
Binding a callback function
The principle of asynchronous I/O is to suspend the program at the point of I/O and resume it once the I/O completes. When writing a crawler you often depend on the return value of that I/O, and that is where callbacks come in. The first approach is the synchronous style: simply assign the result of the await expression to a variable to obtain the "callback" value.
import asyncio
import time
import requests

async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None

async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)
        return (url, "success")
    else:
        return None

async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4
    task1 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4"))
    task2 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"))
    # Synchronous callback style: just read the awaited return values
    ret1 = await task1
    ret2 = await task2
    print(ret1, ret2)
    print("Code runtime:", time.perf_counter() - start_time)

if __name__ == '__main__':
    asyncio.run(main())
The second approach implements the callback through asyncio itself. The method used is add_done_callback(), which attaches a callback to be run when the Task object completes; the corresponding removal method is remove_done_callback().
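Note that add_done_callback() passes exactly one argument, the finished Task, to the callback. If extra arguments are needed, the standard approach is functools.partial. A minimal sketch (the "video1" label is a made-up example):

import asyncio
import functools

async def work():
    return "ok"

def callback(label, future):
    # functools.partial pre-binds `label`; asyncio appends the task itself
    print(label, future.result())

async def main():
    task = asyncio.create_task(work())
    task.add_done_callback(functools.partial(callback, "video1"))
    await task

asyncio.run(main())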
import asyncio
import time
import requests

async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None

async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)
        return (url, "success")
    else:
        return None

def callback(future):
    print("The callback function returns:", future.result())

async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4
    task1 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4"))
    task1.add_done_callback(callback)
    task2 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"))
    task2.add_done_callback(callback)
    # Wait for both tasks; each callback fires when its task completes
    await task1
    await task2
    print("Code runtime:", time.perf_counter() - start_time)

if __name__ == '__main__':
    asyncio.run(main())
The crawler case for this lesson
The full code for this case can be downloaded from CodeChina. The main steps are as follows.
Step 1: Get the addresses of all the list pages. The links all sit on a single page, so collection is straightforward: parse the home page directly.
Step 2: Get the video download addresses. While exploring the site, I noticed that the address of a video thumbnail maps to the address of the playable video by a fixed rule, as shown below:
# video thumbnail address
http://static.61read.com/flipbooks/huiben/chudiandetouyuzei/cover.jpg
# video address
http://static.61read.com/flipbooks/huiben/chudiandetouyuzei/web/1.mp4
Removing cover.jpg from the thumbnail address and substituting web/1.mp4 yields the video address, which makes getting the videos dramatically easier, as the sketch below shows.
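In code, that pattern is a single string replacement (the full program below does exactly this, also handling cover.png):

thumb = "http://static.61read.com/flipbooks/huiben/chudiandetouyuzei/cover.jpg"
video = thumb.replace("cover.jpg", "web/1.mp4")
# -> http://static.61read.com/flipbooks/huiben/chudiandetouyuzei/web/1.mp4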
Step 3: Write the code to download the videos.
import asyncio
import time
import requests
from bs4 import BeautifulSoup
import lxml  # ensures the lxml parser used by BeautifulSoup is available

BASE_URL = "http://banan.huiben.61read.com"

async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None

async def get_video(name, url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{name}.mp4', "wb") as f:
            f.write(res.content)
        return (name, url, "success")
    else:
        return None

async def get_list_url():
    """Get the list page addresses."""
    res = await requests_get("http://banan.huiben.61read.com/")
    soup = BeautifulSoup(res.text, "lxml")
    all_a = []
    for ul in soup.find_all(attrs={'class': 'inline'}):
        all_a.extend(BASE_URL + _['href'] for _ in ul.find_all('a'))
    return all_a

async def get_mp4_url(url):
    """Get the MP4 addresses on one list page."""
    res = await requests_get(url)
    soup = BeautifulSoup(res.text, "lxml")
    mp4s = []
    for div_tag in soup.find_all(attrs={'class': 'item_list'}):
        # Get the image thumbnail address
        src = div_tag.a.img['src']
        # Replace the thumbnail address with the MP4 video address
        src = src.replace('cover.jpg', 'web/1.mp4').replace('cover.png', 'web/1.mp4')
        name = div_tag.div.a.text.strip()
        mp4s.append((src, name))
    return mp4s

async def main():
    # Task that fetches the list page addresses
    task_list_url = asyncio.create_task(get_list_url())
    all_a = await task_list_url
    # Create the task list
    tasks = [asyncio.ensure_future(get_mp4_url(url)) for url in all_a]
    # Add a callback function (optional)
    # ret = map(lambda x: x.add_done_callback(callback), tasks)
    # Run the tasks concurrently
    dones, pendings = await asyncio.wait(tasks)
    all_mp4 = []
    for task in dones:
        all_mp4.extend(task.result())
    # All MP4 addresses collected
    total = len(all_mp4)
    print("Found a total of", total, "videos")
    print("_" * 100)
    print("Ready to download videos")
    # Download 10 at a time
    total_page = total // 10 if total % 10 == 0 else total // 10 + 1
    for page in range(0, total_page):
        print("Downloading videos in batch {}".format(page + 1))
        start_page = 0 if page == 0 else page * 10
        end_page = (page + 1) * 10
        print("Download addresses")
        print(all_mp4[start_page:end_page])
        mp4_download_tasks = [asyncio.ensure_future(get_video(name, url))
                              for url, name in all_mp4[start_page:end_page]]
        mp4_dones, mp4_pendings = await asyncio.wait(mp4_download_tasks)
        for task in mp4_dones:
            print(task.result())

if __name__ == '__main__':
    asyncio.run(main())
Closing words
For the complete code, check out the comments section at the top.
Today is day 243/365 of continuous writing. Looking forward to your follows, likes, comments, and favorites.