A Lightning-Fast Introduction to Python Coroutines
Original author: Captain Big Head, NLP data scientist at French Dog
Preface
Network IO, whether synchronous or asynchronous, blocking or non-blocking, is a perennial topic of discussion. For an IO-intensive task such as web crawling, using synchronous network requests means that a single blocked request prolongs the entire run, greatly reducing the crawler's speed. This article introduces coroutines, one of Python's ways of implementing asynchrony, and illustrates their use and benefits through a few examples.
Basic knowledge
- Synchronous versus asynchronous
I mentioned synchronous and asynchronous above, so what is the difference between the two? Take the crawler scenario as an example: suppose the crawler needs to open ten links. The IO is the loading of those ten pages, while the CPU is responsible for clicking the links. Clicking is very fast; loading a page is slow. A synchronous crawler clicks one URL, waits for the complete response, and only then clicks the next URL. An asynchronous crawler clicks one URL and, without waiting for the response, immediately clicks the next, collecting the responses as they arrive. By comparison, the asynchronous approach is more efficient.
- IO-intensive versus computation-intensive
IO-intensive tasks are those dominated by disk IO or network IO with little computation, such as web page requests and file reads and writes. Computation-intensive tasks are those dominated by CPU work, such as the matrix computations in graphics rendering (which nowadays are usually done on the GPU).
- Why not use multithreading
In general, the traditional way to handle concurrent events is multithreading. However, multithreading has several disadvantages. The first is the resource cost of creating and switching between threads. The second is the existence of Python's GIL (Global Interpreter Lock), which prevents more than one thread from executing Python bytecode at the same time.
- What is a coroutine
The concept of a coroutine is relatively easy to understand. It allows execution of task A to be interrupted so that task B can run, then resumes A at the appropriate time, achieving an effect similar to multithreading within a single thread (see the sketch after this list). Coroutines have the following advantages:
- Theoretically, the number of coroutines is unbounded, and because everything runs in a single thread there is no switching between threads, so they are relatively efficient.
- There is no need for a "lock" mechanism, since all coroutines live in one thread.
- Debugging is easier, because the code executes sequentially.
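To make the switching idea concrete before the detailed examples below, here is a minimal sketch of my own (not from the original article): two generator-based coroutines interleaved by a toy round-robin loop, all in one thread. The task names and step counts are invented for illustration.

def task(name, steps):
    # Each yield suspends this task and hands control back
    # to the scheduler loop below.
    for i in range(1, steps + 1):
        print("[%s] step %d" % (name, i))
        yield

# A toy round-robin "scheduler": advance each task in turn until all finish.
tasks = [task("A", 3), task("B", 3)]
while tasks:
    for t in list(tasks):
        try:
            next(t)
        except StopIteration:
            tasks.remove(t)

Task A is suspended at each yield, task B runs, and A later resumes exactly where it left off, which is the switching behavior described above.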
The above explains the background for using coroutines, how they differ from traditional multithreading, and what a coroutine is. Next, some code shows in detail how coroutines work.
Coroutine implementation in Python
Someone on StackOverflow asked for the fastest way to send 10,000 HTTP requests with Python, and the top answer (originally written for Python 2; shown here updated for Python 3) was this:
from urllib.parse import urlparse
from threading import Thread
import http.client
import sys
from queue import Queue

concurrent = 200

def doWork():
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    try:
        url = urlparse(ourl)
        conn = http.client.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        res = conn.getresponse()
        return res.status, ourl
    except Exception:
        return "error", ourl

def doSomethingWithResult(status, url):
    print(status, url)

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
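As an aside, the same thread-pool pattern can be written more compactly in modern Python with concurrent.futures. This is a sketch of my own, not part of the quoted answer; it assumes the requests library is installed and reuses the urllist.txt file from above.

from concurrent.futures import ThreadPoolExecutor
import requests

def get_status(url):
    # HEAD request, as in the quoted answer.
    try:
        return requests.head(url, timeout=10).status_code, url
    except requests.RequestException:
        return "error", url

with open('urllist.txt') as f:
    urls = [line.strip() for line in f]

# map() dispatches the URLs across 200 worker threads and
# yields the results in input order.
with ThreadPoolExecutor(max_workers=200) as pool:
    for status, url in pool.map(get_status, urls):
        print(status, url)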
The main idea of this approach is to have multiple threads fetch URLs from a queue, so that the time one thread spends waiting on blocked IO is used to run other threads, improving efficiency. As explained above, because of the GIL, Python's multithreading does not execute tasks simultaneously, only alternately. Could we complete the same IO task with a coroutine, using only one thread?
Let's look at the basics of how Python implements coroutines. In Python, coroutines can be built on top of generators. Here is an example:
def consumer():
    print("[Consumer] Init Consumer ......")
    r = "init ok"
    while True:
        # The consumer receives a message from the producer via yield.
        n = yield r
        print("[Consumer] consume n = %s, r = %s" % (n, r))
        r = "consume %s OK" % n

def produce(c):
    print("[Producer] Init Producer ......")
    # Prime the generator: run it to the first yield and collect r.
    r = c.send(None)
    print("[Producer] Start Consumer, return %s" % r)
    n = 0
    while n < 5:
        n += 1
        print("[Producer] While, Producing %s ......" % n)
        # Switch to the consumer; resume here when it yields a reply.
        r = c.send(n)
        print("[Producer] Consumer return: %s" % r)
    c.close()
    print("[Producer] Close Producer ......")

produce(consumer())
The above example is adapted from Liao Xuefeng's Python tutorial, and running it produces the following output:
[Producer] Init Producer ......
[Consumer] Init Consumer ......
[Producer] Start Consumer, return init ok
[Producer] While, Producing 1 ......
[Consumer] consume n = 1, r = init ok
[Producer] Consumer return: consume 1 OK
[Producer] While, Producing 2 ......
[Consumer] consume n = 2, r = consume 1 OK
[Producer] Consumer return: consume 2 OK
[Producer] While, Producing 3 ......
[Consumer] consume n = 3, r = consume 2 OK
[Producer] Consumer return: consume 3 OK
[Producer] While, Producing 4 ......
[Consumer] consume n = 4, r = consume 3 OK
[Producer] Consumer return: consume 4 OK
[Producer] While, Producing 5 ......
[Consumer] consume n = 5, r = consume 4 OK
[Producer] Consumer return: consume 5 OK
[Producer] Close Producer ......
As you can see, the producer and consumer switch back and forth between their tasks. In the traditional producer-consumer pattern, one thread writes messages and another retrieves them, with a locking mechanism controlling the queue and the waiting, and deadlocks are possible. With coroutines, once the producer has produced a message, it jumps directly to the consumer via send; when the consumer finishes handling the message, control switches back to the producer via yield to continue producing. This is very efficient.
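To make the yield/send handshake in the example explicit, here is a minimal sketch of my own (the echo generator and its messages are invented for illustration):

def echo():
    reply = None
    while True:
        # yield hands `reply` back to the caller, then pauses until
        # the caller sends the next message in.
        msg = yield reply
        reply = "got %s" % msg

g = echo()
print(g.send(None))   # prime it: runs to the first yield, prints None
print(g.send("hi"))   # resumes at yield with msg = "hi", prints "got hi"
g.close()             # raises GeneratorExit inside the generator to end it

Note that the first send must be send(None): it starts the generator and runs it to the first yield, exactly what c.send(None) does in produce above.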
For the network-request IO scenario, we can run an experiment comparing the speed of synchronous and asynchronous IO:
import requests
import time

def consumer():
    r = ''
    while True:
        n = yield r
        if not n:
            return
        print('[CONSUMER] Consuming %s...' % n)
        r = requests.get(n).status_code

def produce(c):
    c.send(None)
    for i in range(73000, 73100):
        request_url = "http://www.jb51.net/article/%d.htm" % i
        print('[PRODUCER] Producing %s...' % request_url)
        r = c.send(request_url)
        print('[PRODUCER] Consumer return: %s' % r)
    c.close()

# Coroutine version
async_start = time.time()
produce(consumer())
print(time.time() - async_start)

# Plain synchronous version
sync_start = time.time()
for i in range(73000, 73100):
    url = "http://www.jb51.net/article/%d.htm" % i
    response = requests.get(url)
print(time.time() - sync_start)
In the final output of one run, the coroutine version completed the requests in 9.8 seconds, while the synchronous version took 28.6 seconds. The asynchronous approach clearly improves efficiency.
Afterword
The text above briefly describes the basic features of coroutines and how to implement them in Python to achieve asynchronous IO. There is much more that could be said about network IO; here is a suggested learning route:
- If you are interested, look at Python 3's native support for asynchronous IO (the asyncio module and async/await syntax) and libraries such as aiohttp, which greatly reduce coding difficulty; see the sketch after this list.
- Web frameworks that support asynchronous non-blocking IO, such as Tornado, are also worth particular attention; the crawler web service behind our Q&A system is built on this library.
- If you want to go deeper, I suggest reading the chapters on sockets in the Linux/Unix System Programming Manual, which are particularly helpful for understanding the underlying IO mechanisms (such as epoll and select).
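As a pointer for the first item above, here is a minimal sketch of the same kind of batch request using asyncio and aiohttp. It is my own illustration, not from the original article; it assumes aiohttp is installed, and it reuses the jb51.net URLs from the experiment above.

import asyncio
import aiohttp

async def fetch_status(session, url):
    # The await point suspends this coroutine while the response is
    # in flight, letting the event loop run the other requests.
    async with session.head(url) as resp:
        return resp.status, url

async def main():
    async with aiohttp.ClientSession() as session:
        urls = ["http://www.jb51.net/article/%d.htm" % i
                for i in range(73000, 73100)]
        tasks = [fetch_status(session, u) for u in urls]
        for status, url in await asyncio.gather(*tasks):
            print(status, url)

asyncio.run(main())

Unlike the requests-based coroutine experiment above, here the event loop really does overlap the network waits, so all 100 requests are in flight concurrently on a single thread.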
This article was originally published on the official technology blog of French Dog, the French Dog technology nest.