Python crawlers slow? Let's take a look at concurrent programming


Preface

A web crawler is an I/O-intensive program: page requests and file reads block execution and consume a lot of time. Python offers several approaches to concurrent programming that can improve the efficiency of I/O-intensive programs to a certain extent. Before getting started, it helps to understand the following concepts.

Basic concepts

Concurrency: multiple tasks make progress over the same period of time. On a single-core CPU, tasks run concurrently: because there is only one core, the CPU divides a period of time into several time slices, and each task runs only during its own slice. If a task does not finish within its slice, the CPU switches to the next task. Because each slice is very short and switching is frequent, the tasks appear to run "simultaneously."

Parallelism: multiple tasks run at the same instant. On a multi-core CPU, truly "simultaneous" execution is possible: while one core executes one process, other cores can execute other processes, and the processes do not compete with each other for the same core's time.

Synchronous: tasks do not run independently but one after another in a fixed order. The next task can run only after the previous one has completed.

Asynchronous: each task can run independently, without waiting for the others to complete.

In a crawler, asynchronous operation is like opening a web page and, without waiting for it to finish loading, going on to open another one. Synchronous operation is like opening a web page and waiting for it to load completely before opening the next.
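The difference can be sketched without touching the network. In the minimal example below, time.sleep stands in for the wait on a page request, and the page names and the fetch function are hypothetical. It uses threads (covered later in this article) to overlap the waits; it is an illustration under those assumptions, not a real crawler.

```python
import threading
import time

def fetch(url, delay, results):
    # Stand-in for a page request: sleeping simulates waiting on the network.
    time.sleep(delay)
    results.append(url)

# Hypothetical page names, used only as labels.
urls = ['page-1', 'page-2', 'page-3']
delay = 0.2

# Synchronous: each "page" must finish loading before the next one starts.
sync_results = []
start = time.perf_counter()
for url in urls:
    fetch(url, delay, sync_results)
sync_elapsed = time.perf_counter() - start

# Overlapped: every "page" is requested without waiting for the earlier ones.
async_results = []
start = time.perf_counter()
threads = [
    threading.Thread(target=fetch, args=(url, delay, async_results))
    for url in urls
]
for t in threads:
    t.start()
for t in threads:
    t.join()
async_elapsed = time.perf_counter() - start

print(f'one at a time: {sync_elapsed:.2f}s, overlapped: {async_elapsed:.2f}s')
```

The synchronous loop takes roughly the sum of the three waits, while the overlapped version takes roughly as long as the longest single wait.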

There are three ways to speed up a crawler: multithreading, multiprocessing, and coroutines. So what are processes, threads, and coroutines?

Process: a process is a program unit that can run independently. It contains one or more threads.

Thread: the smallest unit that the operating system schedules, and the smallest unit of execution within a process.

Coroutine: a coroutine is a unit of execution lighter than a thread, often described as a lightweight thread. Threads are scheduled by the operating system, whereas coroutines are scheduled in user space. Their advantage over threads is a lower switching cost.
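As a rough illustration of user-space scheduling, the sketch below runs two coroutines in a single thread with the standard asyncio module. The worker function, its names, and the delays are made up for the example; asyncio.sleep stands in for an I/O wait.

```python
import asyncio

async def worker(name, delay, log):
    log.append(f'{name} start')
    # await hands control back to the event loop, which can then run another coroutine.
    await asyncio.sleep(delay)
    log.append(f'{name} end')

async def main():
    log = []
    # Both coroutines run in one thread; the event loop switches between them.
    await asyncio.gather(worker('a', 0.2, log), worker('b', 0.1, log))
    return log

log = asyncio.run(main())
print(log)
```

Coroutine b finishes before a even though it was started second, because a gives up control while it waits instead of blocking the thread.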

GIL

In Python multithreading, each thread executes as follows:

Get the GIL >>> execute code for the corresponding thread >>> release the GIL

To execute, a thread must first acquire the GIL (Global Interpreter Lock), which can be thought of as a permit. There is only one GIL per Python process. Because a thread cannot run without the permit, only one thread in a Python process executes at any given time, even on a multi-core machine.

For I/O-intensive tasks (page requests and so on) this is not a big problem, because a thread releases the GIL while it waits for I/O. For CPU-intensive tasks, the GIL can make multithreading slower overall than a single thread.
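One way to observe this is to time the same CPU-bound work in one thread and in two. The sketch below assumes a standard CPython build with the GIL; count_down and N are illustrative choices, and the actual timings depend on the machine and the Python version, so no particular result is promised.

```python
import threading
import time

def count_down(n):
    # Pure-Python loop: CPU-bound, so the thread holds the GIL while it runs.
    while n > 0:
        n -= 1

N = 2_000_000

# Run the work twice, one call after the other, in the main thread.
start = time.perf_counter()
count_down(N)
count_down(N)
single = time.perf_counter() - start

# Run the same two pieces of work in two threads.
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f'single thread: {single:.2f}s, two threads: {threaded:.2f}s')
```

With the GIL in place, the two-thread version typically shows little or no speedup, since only one of the threads can execute Python bytecode at a time.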

Multithreading

Multithreading is suited to I/O-intensive programs, for example:

Database requests
Page requests
Reading and writing files

Because the GIL allows only one thread to execute at a time, the interpreter has to switch between threads frequently so that every thread gets a chance to complete its task.

In Python, multithreading is done with the threading module. Each Thread object we create represents one thread, and each thread can handle a different task.

There are two ways to create Thread objects.

Pass the callback function directly as an argument when creating a Thread object.
Create a subclass that inherits from threading.Thread and overrides the run() method; after instantiating it, call the start() method to start the new thread.

Creating a Thread object

threading.Thread(group=None, target=None, name=None, args=(), kwargs=None, *, daemon=None)

group: reserved for future extension; it should be left as None.
target: the callable object to be invoked by the run() method. The default is None, meaning nothing is called.
name: the thread name. By default, a unique name of the form "Thread-N" is constructed, where N is a small decimal number.
args: the tuple (or list) of positional arguments for the target call. The default is ().
kwargs: a dictionary of keyword arguments for the target call. The default is None, which is treated as an empty dictionary.
daemon: whether the thread is a daemon thread. The default is None, which means the value is inherited from the creating thread. Threads created from the main thread are therefore non-daemon by default, and the program waits for them to finish before exiting.
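To make the parameters concrete, here is a small sketch that passes each of them by keyword. The greet function and the argument values are invented for illustration; join(), which waits for a thread to finish, is used only so the script ends tidily.

```python
import threading

def greet(greeting, name='world'):
    print(threading.current_thread().name, f'{greeting}, {name}')

# args supplies positional arguments, kwargs supplies keyword arguments.
thread = threading.Thread(
    target=greet,
    name='greeter',
    args=('hello',),
    kwargs={'name': 'crawler'},
    daemon=False,
)
print(thread.name, thread.daemon)

thread.start()
# Wait for the thread to finish before continuing.
thread.join()
```

Inside the thread, the call is equivalent to greet('hello', name='crawler').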

import threading
import time


def block(second):
    print(threading.current_thread().name, 'Thread running')
    # Sleep for `second` seconds
    time.sleep(second)
    print(threading.current_thread().name, 'Thread terminated')


print(threading.current_thread().name, 'Thread running')

for i in [1, 3]:
    # Create the thread object, specifying the callback function block,
    # the thread name, and the positional argument i
    thread = threading.Thread(target=block, name=f'thread test {i}', args=(i,))
    # Start the thread
    thread.start()

print(threading.current_thread().name, 'Thread terminated')

threading.current_thread().name returns the name of the current thread. The logic of the code above is as follows: define a function block that prints the current thread's name, sleeps, and prints again; loop twice, each time creating a Thread object and starting it; finally, print the message that the main thread has finished. Note the order in which the messages appear: the main thread prints its final message before thread test 1 and thread test 3 have finished, because start() returns immediately instead of waiting for the thread.
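If the main thread should wait for the workers instead, the standard Thread.join() method can be used. The sketch below is a variation on the example above and not part of the original code: the shorter delays and the finished list are added only to make the ordering easy to see.

```python
import threading
import time

def block(second, finished):
    time.sleep(second)
    # Record which thread has just finished.
    finished.append(threading.current_thread().name)

finished = []
threads = []
for i in [0.1, 0.3]:
    thread = threading.Thread(target=block, name=f'thread test {i}', args=(i, finished))
    thread.start()
    threads.append(thread)

# join() blocks the calling thread until the given thread has finished.
for thread in threads:
    thread.join()

finished.append(threading.current_thread().name)
print(finished)
```

With join(), the main thread's entry is recorded last, after both worker threads are done.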

Custom classes that inherit from Thread

Now let's modify the example above to implement multithreading with a custom class that inherits from Thread.

import threading
import time

class TestThread(threading.Thread):
    def __init__(self, name=None, second=0):
        # Let the base class set up the thread and its name
        super().__init__(name=name)
        self.second = second

    def run(self):
        # run() is executed in the new thread once start() is called
        print(threading.current_thread().name, 'Thread running')
        time.sleep(self.second)
        print(threading.current_thread().name, 'Thread terminated')


print(threading.current_thread().name, 'Thread running')

for i in [1, 3]:
    thread = TestThread(name=f'thread test {i}', second=i)
    # Start the thread
    thread.start()

print(threading.current_thread().name, 'Thread terminated')
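One practical reason to subclass is that the object can carry state, such as the result of its work. The following is a minimal sketch under that assumption: FetchThread, its attributes, and the page names are hypothetical, and time.sleep again stands in for a real request.

```python
import threading
import time

class FetchThread(threading.Thread):
    """Hypothetical example class: simulates a page request and keeps the result."""

    def __init__(self, url, delay=0.1):
        super().__init__(name=f'fetch {url}')
        self.url = url
        self.delay = delay
        self.result = None

    def run(self):
        # time.sleep stands in for the network wait of a real request.
        time.sleep(self.delay)
        self.result = f'content of {self.url}'

threads = [FetchThread(url) for url in ['page-1', 'page-2']]
for thread in threads:
    thread.start()
for thread in threads:
    # Wait for the thread so that its result has been set.
    thread.join()

results = [thread.result for thread in threads]
print(results)
```

Reading result only after join() avoids seeing the initial None while the thread is still running.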

This is the first part of a simple tutorial series that will continue until you have mastered concurrent crawlers in Python.

If you are new to Python or want to get started with it, you can search WeChat for "A new vision of Python" to contact the author and learn together. We all started as beginners: sometimes a simple problem keeps you stuck for a long time, yet a small hint from someone else makes everything clear. I sincerely hope we can make progress together.
