Preface
A web crawler is an I/O-intensive program: page requests and file reads block execution, so much of the run time is spent waiting. Python offers several approaches to concurrent programming that can improve the efficiency of I/O-intensive programs to a certain extent. Before getting started, it helps to understand the following concepts.
Basic concepts
Concurrency: several tasks make progress over the same period of time. On a single-core CPU, multiple tasks run concurrently: because there is only one core, the CPU divides time into short slices, and each task runs only during its own slice. If a task does not finish within its slice, the CPU switches to the next task. Because each slice is very short and switches are frequent, the tasks appear to run “simultaneously.”
Parallelism: several tasks run at the same instant. On a multi-core CPU, truly “simultaneous” execution is possible: while one core executes one process, other cores can execute other processes, and the processes do not compete with each other for CPU time.
Synchronous: tasks do not run independently but one after another, in order. The next task can start only after the previous one has completed.
Asynchronous: each task can run independently, without waiting for the others.
For a crawler, asynchronous operation is like opening a web page and, instead of waiting for it to finish loading, going straight on to open another. Synchronous operation is like opening a web page and waiting for it to load completely before opening the next one.
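The difference can be sketched with asyncio, using asyncio.sleep as a stand-in for the time a page takes to load. The function names, page names, and delay are made up for illustration, and no real request is sent.

```python
import asyncio
import time

async def load_page(url, delay=0.2):
    # asyncio.sleep stands in for waiting on a (hypothetical) page load
    await asyncio.sleep(delay)
    return f'{url} loaded'

async def crawl_sync(urls):
    # Synchronous style: wait for each page before opening the next
    return [await load_page(url) for url in urls]

async def crawl_async(urls):
    # Asynchronous style: open every page, then wait for all of them together
    return await asyncio.gather(*(load_page(url) for url in urls))

def timed(coro):
    # Run a coroutine to completion and report how long it took
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

urls = ['page-1', 'page-2', 'page-3']  # placeholder names, not real URLs
sync_result, sync_time = timed(crawl_sync(urls))
async_result, async_time = timed(crawl_async(urls))
print(f'one at a time: {sync_time:.2f}s, all at once: {async_time:.2f}s')
```

Both versions return the same pages; only the waiting overlaps in the second one, so its total time should be close to a single delay rather than the sum of all three.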
There are three ways to speed up a crawler: multithreading, multiprocessing, and coroutines. So what are processes, threads, and coroutines?
Process: a unit of a program that can run independently. A process contains one or more threads.
Thread: the smallest unit that the operating system schedules, and the smallest unit of execution within a process.
Coroutine: a unit of execution even smaller than a thread, often described as a lightweight thread. Threads are scheduled by the operating system, whereas coroutines are scheduled in user space. Their advantage over threads is a lower switching cost.
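As a small sketch of how the three relate, the code below records the process ID and thread ID seen by a few threads and by a few asyncio coroutines. The helper names are hypothetical, and the Barrier is only there to keep all three threads alive at once so that their IDs cannot be reused.

```python
import asyncio
import os
import threading

def in_thread(seen, barrier):
    # Wait until every thread has started, then record (process id, thread id)
    barrier.wait()
    seen.append((os.getpid(), threading.get_ident()))

async def in_coroutine(seen):
    await asyncio.sleep(0)  # hand control back to the event loop once
    seen.append((os.getpid(), threading.get_ident()))

async def run_coroutines(seen, count):
    await asyncio.gather(*(in_coroutine(seen) for _ in range(count)))

thread_seen = []
barrier = threading.Barrier(3)
threads = [
    threading.Thread(target=in_thread, args=(thread_seen, barrier))
    for _ in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

coroutine_seen = []
asyncio.run(run_coroutines(coroutine_seen, 3))

print('threads   :', thread_seen)     # same process id, three thread ids
print('coroutines:', coroutine_seen)  # same process id, one thread id
```

The threads all belong to one process but each has its own ID, while the coroutines all run inside a single thread and are switched by the event loop rather than by the operating system.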
GIL
In Python multithreading, each thread executes as follows:
Acquire the GIL -> execute the thread's code -> release the GIL
To execute, a thread must hold the GIL (Global Interpreter Lock), which can be thought of as a license, and there is only one GIL per Python process. Because a thread cannot run without the license, only one thread in a Python process executes at any given moment, even on a multi-core machine.
For I/O-intensive tasks (page requests and the like), this is not a big problem. For CPU-intensive tasks, however, the GIL means that multithreading may perform worse overall than a single thread.
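A rough way to see this on a standard CPython build is to time the same work with and without threads. The sketch below assumes time.sleep as a stand-in for an I/O wait and a plain counting loop as the CPU-intensive work. The function names and sizes are made up for illustration, and the timings will vary by machine and interpreter build.

```python
import threading
import time

def io_task(seconds):
    # time.sleep releases the GIL, much like waiting on a network response
    time.sleep(seconds)

def cpu_task(n):
    # A pure-Python loop needs the GIL for the whole computation
    total = 0
    for i in range(n):
        total += i
    return total

def run_in_threads(func, arg, count):
    # Start `count` threads on the same function and time the whole batch
    threads = [threading.Thread(target=func, args=(arg,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

def run_sequentially(func, arg, count):
    # Call the function `count` times in the current thread and time it
    start = time.perf_counter()
    for _ in range(count):
        func(arg)
    return time.perf_counter() - start

io_threads = run_in_threads(io_task, 0.2, 4)
io_serial = run_sequentially(io_task, 0.2, 4)
cpu_threads = run_in_threads(cpu_task, 500_000, 4)
cpu_serial = run_sequentially(cpu_task, 500_000, 4)
print(f'I/O-bound: threads {io_threads:.2f}s vs sequential {io_serial:.2f}s')
print(f'CPU-bound: threads {cpu_threads:.2f}s vs sequential {cpu_serial:.2f}s')
```

On a typical run, the threaded I/O batch should finish in roughly the time of one sleep, while the threaded CPU batch is usually no faster than the sequential one.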
Multithreading
Multithreading suits I/O-intensive programs, such as:
- Database requests
- Page requests
- Reading and writing files
Because the GIL allows only one thread to execute at a time, the interpreter has to switch between threads frequently so that every thread gets the chance to complete its task.
Python provides multithreading through the threading module. Each Thread object we create represents one thread, and each thread can handle a different task.
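In CPython, how often the running thread is asked to give up the GIL is governed by a switch interval, which can be read with sys.getswitchinterval(). A minimal sketch (the value printed depends on how the interpreter is configured):

```python
import sys

# Seconds after which CPython asks the running thread to release the GIL
# so that another waiting thread can take its turn
interval = sys.getswitchinterval()
print(f'switch interval: {interval * 1000:.1f} ms')
```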
There are two ways to create a Thread object:
- Pass a callback function directly as an argument when constructing a Thread object.
- Inherit from threading.Thread to create a new subclass and override its run() method. After instantiating the subclass, call the start() method to start the new thread.
Creating a Thread object
threading.Thread(group=None, target=None, name=None, args=(), kwargs=None, *, daemon=None)
- group: reserved for future extension; it should be left as None.
- target: the callable object that the run() method invokes. The default is None, meaning that nothing is called.
- name: the thread name. By default, a unique name of the form "Thread-N" is constructed, where N is a small decimal number.
- args: the positional arguments for the target call, given as a tuple. The default is ().
- kwargs: the keyword arguments for the target call, given as a dictionary. The default is None, which is treated as an empty dictionary.
- daemon: whether the thread is a daemon thread. By default, the main thread (MainThread) waits for the other threads to finish before the program terminates; daemon threads are not waited for. The default is None, meaning the value is inherited from the creating thread.
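As a small illustration of how these parameters fit together, the sketch below passes each of them explicitly. The fetch function and its arguments are hypothetical stand-ins for a page request; the function only records what it was called with.

```python
import threading

results = []

def fetch(url, retries=1, timeout=5):
    # Hypothetical stand-in for a page request: it only records its arguments
    results.append((threading.current_thread().name, url, retries, timeout))

worker = threading.Thread(
    target=fetch,
    name='fetch-worker',
    args=('page-1',),        # positional arguments, passed as a tuple
    kwargs={'timeout': 10},  # keyword arguments, passed as a dictionary
    daemon=True,             # do not keep the program alive for this thread
)
worker.start()
worker.join()  # wait for the thread so that results is filled in

print(results)        # [('fetch-worker', 'page-1', 1, 10)]
print(worker.daemon)  # True
```

Note the trailing comma in args=('page-1',): without it, the parentheses would not create a tuple.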
import threading
import time


def block(second):
    print(threading.current_thread().name, 'Thread running')
    # Sleep for `second` seconds
    time.sleep(second)
    print(threading.current_thread().name, 'Thread terminated')


print(threading.current_thread().name, 'Thread running')
for i in [1, 3]:
    # Create the thread object, specifying the callback function block,
    # the thread name, and the positional argument i
    thread = threading.Thread(target=block, name=f'thread test {i}', args=(i,))
    # Start the thread
    thread.start()
print(threading.current_thread().name, 'Thread terminated')
threading.current_thread().name returns the name of the current thread. The logic of the code above is as follows: define a function block that prints information about the current thread, loop twice to create a thread object and start it each time, and finally print the main thread's termination message. Note the order in which the messages appear: the main thread prints its termination message before thread test 1 or thread test 3 has finished.
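If the main thread should instead wait for the workers before carrying on, Thread.join() can be called on each of them. The sketch below is a variation on the example, not part of the original code: it uses shorter sleeps to keep the run quick and collects the messages in a list (events, a name chosen here for illustration) so that the order is easy to inspect.

```python
import threading
import time

events = []

def block(second):
    time.sleep(second)
    events.append(f'{threading.current_thread().name} terminated')

threads = [
    threading.Thread(target=block, name=f'thread test {i}', args=(i,))
    for i in (0.1, 0.3)  # shorter sleeps than in the example, to keep it quick
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # block the calling thread until this worker has finished

# Reached only after both workers are done
events.append(f'{threading.current_thread().name} terminated')
print(events)
```

With join() in place, the main thread's message is always the last one recorded.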
Custom class inheriting from Thread
Now modify the example above to implement multithreading with a custom class that inherits from Thread.
import threading
import time


class TestThread(threading.Thread):
    def __init__(self, name=None, second=0):
        # Let the base class set up the thread and its name
        threading.Thread.__init__(self, name=name)
        self.second = second

    def run(self):
        # run() is the code executed in the new thread after start()
        print(threading.current_thread().name, 'Thread running')
        time.sleep(self.second)
        print(threading.current_thread().name, 'Thread terminated')


print(threading.current_thread().name, 'Thread running')
for i in [1, 3]:
    thread = TestThread(name=f'thread test {i}', second=i)
    # Start the thread
    thread.start()
print(threading.current_thread().name, 'Thread terminated')
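One practical reason to subclass Thread is that the object can carry its own data, including a result. The sketch below is an illustration under assumptions rather than part of the original example: FetchThread, its url parameter, and the result attribute are hypothetical names, and time.sleep stands in for a real page request.

```python
import threading
import time

class FetchThread(threading.Thread):
    # Hypothetical subclass: simulates a page request and stores the outcome
    def __init__(self, url, second=0):
        super().__init__(name=f'fetch {url}')
        self.url = url
        self.second = second
        self.result = None  # filled in by run()

    def run(self):
        time.sleep(self.second)  # stands in for waiting on the response
        self.result = f'content of {self.url}'

threads = [FetchThread(url, second=0.1) for url in ('page-1', 'page-2')]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # read the results only after the thread has finished

print([thread.result for thread in threads])
```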
This is the first part of a simple tutorial series that will continue until you have mastered concurrent crawlers in Python.
If you are new to Python or want to get started with it, you can search for “A new vision of Python” on WeChat to contact the author and learn together. We all start out as beginners: sometimes a simple problem keeps you stuck for a long time, yet a small hint from someone else makes everything clear. I sincerely hope that we can all make progress together.