Hello, I’m Yue Chuang. Today, I’m going to talk to you about multiprocessing and multithreading. My official account: AI Yue Chuang, blog address: www.aiyc.top/

1. Global interpreter lock

Global Interpreter Lock (abbreviated GIL)

A mechanism used by computer programming language interpreters to synchronize threads so that only one thread is executing at any one time. Even on multi-core processors, interpreters using GIL allow only one thread to execute at a time. Common interpreters that use GIL include CPython and Ruby MRI.

If you don’t understand the definition above, no problem. The plain-language explanation is: even if your computer has multiple cores and your code has multiple threads, the GIL means only one thread can run at any given moment; threads never truly execute simultaneously.

Let’s use a picture to explain:


For example, suppose you have two threads (Py thread1 and Py thread2):

  1. When thread one (Py thread1) starts executing, it first requests the lock (the GIL) from the interpreter;
  2. when the interpreter receives the request, it asks the OS for a native system thread;
  3. the OS schedules that thread onto a CPU and executes it (assume a quad-core CPU);
  4. meanwhile our other thread, Py thread2, is also trying to run;
  5. thread 2 gets stuck at the Python interpreter when it requests the GIL, because thread 1 already holds it (a thread must hold the GIL to execute);
  6. for thread 2 to run, it must wait until thread 1 finishes and releases the GIL (step 5 in the picture); only then can thread 2 take the lock;
  7. once thread 2 holds the lock, it runs exactly as thread 1 did.

① Create > ② Acquire the GIL > ③ Request a native thread (OS) > ④ Execute on the CPU (any other thread is stuck outside the Python interpreter)
This lock is the legacy of an early attempt to solve Python’s thread-safety problem once and for all.

2. Multithreaded testing

To make it more intuitive, I’ll write the code for each approach separately and compare them:

Single-threaded, running bare (this is just the main thread):

import time

def start():
	for i in range(1000000):
		i += i
	return

# Do not use any threads (bare)
def main():
	start_time = time.time()
	for i in range(10):
		start()
	print(time.time() - start_time)

if __name__ == '__main__':
	main()

Output:

6.553307056427002

Note: every computer performs differently, so the numbers you get will differ (interpret them relative to one another).
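As a side note, here is a minimal sketch (my own addition, not from the original) that uses the standard timeit module to get a more stable measurement of the same workload:

import timeit

def start():
	for i in range(1000000):
		i += i

# run the 10-call workload three times and keep the best result,
# which reduces noise from whatever else the machine is doing
print(min(timeit.repeat(start, number=10, repeat=3)))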


Next, let’s write a multithreaded version.

We first create a dictionary (thread_name_time) to hold each of our threads, so that we can wait on (join) them later.

import threading, time

def start():
	for i in range(1000000):
		i += i
	return

def main():
	start_time = time.time()
	thread_name_time = {}  # dictionary holding each thread, keyed by index

	for i in range(10):
		thread = threading.Thread(target=start)  # target = the function to run in the thread, without parentheses
		thread.start()  # the previous line only creates the thread; this line actually starts it
		thread_name_time[i] = thread  # store the thread in our dictionary; using i as the key makes joining convenient

	for i in range(10):
		thread_name_time[i].join()
		# join() waits until that thread finishes executing

	print(time.time() - start_time)

if __name__ == '__main__':
	main()

Output:

6.2037984102630615

# 6.553307056427002  bare, no threads
# 6.2037984102630615 single-threaded sequential execution
# 6.429047107696533  multithreaded concurrent execution

As we can see, there is not much difference in speed.

Multithreaded concurrent execution is not even as fast as single-threaded sequential execution.

The gain is not worth the cost.

The GIL is the reason for this.

This workload is computation-intensive, so multithreading doesn’t help.

Addition, subtraction, multiplication, division, image processing and the like are all done on the CPU. Because Python has the GIL, only one thread can be running at any moment, which is why one thread and many threads make little difference here.

Most web crawlers, however, are IO-intensive, not CPU-intensive.

3. IO intensive vs. CPU intensive [I: Input, O: Output]



BIOS: B: Basic, I: Input, O: Output, S: System

That is, the program that runs as soon as your computer is switched on.

1. CPU intensive

In the figure above, two threads are running. Even if both want to execute at the same time, only one of them is on the CPU at any given moment.

As you can see from the figure, these two threads have to switch context frequently.

Ps: green means the thread is executing, red means it is blocked.

Clearly, the context switches themselves consume resources (time, in milliseconds): the threads keep releasing and re-acquiring the GIL just to swap places. For CPU-bound work this is a huge waste.

2. IO intensive

Now suppose there is a server program (a socket), and we start a new program (our web crawler, at the bottom of the figure) that begins fetching a target page. Two threads run at the same time; thread 2’s request succeeds straight away, so its bar (Thread 2 in the figure) is green all the way through.

Thread 1 sends its datagram (over UDP here) and then waits for the data to come back (the HTML, CSS, and so on). It blocks for a while, and during that time thread 2 keeps running without stopping and without context switching. For IO-bound work this is a huge benefit.

In IO-intensive work, the waiting time dominates, and overlapping those waits saves a great deal of time.
The thing to notice is that multithreading pays off for IO-intensive work; that is the distinction to make.

Waiting on resources is common. Sometimes when the browser makes a GET request, its icon spins in circles: that is time spent waiting for the resource. From the datagram being sent to being ready to receive, there is nothing for us to compute; we simply wait. At that point, just let another thread execute.

In other words: while the first thread’s page request is still spinning, let another thread carry on crawling. That avoids wasting any time. (Use all your time.)

Note: requesting a resource requires almost no CPU computation, whereas our first example, the for loop doing arithmetic, is pure CPU work.
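To see the flip side, here is a minimal sketch (my own addition) that simulates IO waits with time.sleep, which releases the GIL just as real network waits do. With threads, the ten 1-second waits overlap and finish in about 1 second instead of 10:

import threading
import time

def fake_request():
	time.sleep(1)  # stands in for waiting on the network; releases the GIL

start_time = time.time()
threads = [threading.Thread(target=fake_request) for _ in range(10)]
for t in threads:
	t.start()
for t in threads:
	t.join()
print(time.time() - start_time)  # roughly 1 second, not 10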


3. Avoid GIL

[Img – G9ah2mZV-1583392394099] (assets/1571888520939.png) [Img – G9ah2mZV-1583392394099]

As mentioned above, because of the GIL, no matter how many threads we start, only one is executing at any moment. So how do we get around the GIL?

Simple: don’t use threads. (The GIL is unavoidable, so choosing not to use threads is equivalent to it not existing.) At this point you think: fine, but if not threads, then what?

Good question!

Here we go: processes. How about that? Don’t worry! Hear me out.

For example, if you have 3 CPU cores (you may well have more, but let’s take 3 as the example), we start 3 processes, one per core.

Ps: processes really can run simultaneously.

Take a look at the picture below:

Task manager


Each entry in Task Manager is a process.

What are the disadvantages of multiprocessing compared with multithreading?

Processes are more expensive to create and destroy.

You may be able to run a great many threads, but the number of processes that can truly run in parallel depends on how many CPU cores you have.

Processes cannot see each other’s data; to pass data between them you need a structure such as a stack or a queue.

Each process is independent of the others.

Just as Google Chrome has nothing to do with PyCharm, and Chrome’s data is certainly invisible to PyCharm, processes are independent of one another.

If you want one process to fetch data and another to consume it, they cannot call each other directly; you have to define a structure of your own to pass it through. >>> Increased programming complexity.
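Pulling the section together, here is a minimal sketch (my own addition, reusing the start() compute task from section 2) that runs the CPU-bound work in separate processes, which the GIL does not serialize:

import multiprocessing
import time

def start():
	for i in range(1000000):
		i += i

if __name__ == '__main__':
	start_time = time.time()
	processes = [multiprocessing.Process(target=start) for _ in range(10)]
	for p in processes:
		p.start()
	for p in processes:
		p.join()  # each process has its own interpreter, and its own GIL
	print(time.time() - start_time)  # faster than the threaded version on a multi-core machine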


4. Multithreading and multiprocessing

So much for the basics; let’s get to the main topic.

4.1 Multithreading and non-daemon threads

# !/usr/bin/python3
# -*- coding: utf-8 -*-
# @author: AI Yue Chuang  @datetime: 2019/10/25 9:50  @function: Development_tool: PyCharm
# code is far away from bugs with the god animal protecting
# I love animals. They taste delicious.

import threading, time

def start():
	time.sleep(1)
	print(threading.current_thread().name)       # the current thread's name
	print(threading.current_thread().is_alive()) # the current thread's state
	print(threading.current_thread().ident)      # the current thread's id

print('start')
# target = the function to run; name = a name for the thread.
# If you don't give it a name, it gets one of its own; its id is ident.
thread = threading.Thread(target=start, name='my first thread')

# Writing the Thread(...) line is only a declaration; nothing runs until start()
thread.start()
print('stop')

Output:

"C:\Program Files\Python37\python.exe" C:/daima/pycharm_daima/Crawler Master class/knowledge/multi-threaded/multi-threaded and non-daemon thread.py
start
stop
my first thread
True
2968

Process finished with exit code 0

If the function takes arguments, we pass them through the thread’s args parameter. Example code:

import threading, time

def start(num):
	time.sleep(num)
	print(threading.current_thread().name)
	print(threading.current_thread().is_alive())
	print(threading.current_thread().ident)

print('start')
thread = threading.Thread(target=start, name='my first thread', args=(1,))

thread.start()
print('stop')

Analysis:

Take a closer look at our output:

start
stop
my first thread
True
2968

We find that these lines of code did not execute in the normal top-to-bottom order.

Instead we got start, then stop, and only afterwards the three prints from our function.

With one thread the program would just run straight through. Here, the main thread’s code runs before the code inside the created thread.

When our code reaches thread.start(), it does not stop before print('stop'). It launches the thread (which will run the body of the start() function) and immediately moves on; it does not get stuck at thread.start(), and the new thread does not end together with the main thread.

So the main thread goes on to execute print('stop'), and once it has printed stop, the main thread is done.

Threads that are not destroyed when the main thread ends are called non-daemon threads

  1. The main thread skips past the created thread and continues executing;
  2. then it waits until the created thread finishes running;
  3. then the program ends.
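If you instead want the main thread to wait at a specific point, call join(). A minimal sketch (my own addition) based on the example above:

import threading, time

def start():
	time.sleep(1)
	print(threading.current_thread().name)

print('start')
thread = threading.Thread(target=start, name='my first thread')
thread.start()
thread.join()   # block here until the created thread finishes
print('stop')   # now 'stop' prints after the thread's name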

So much for non-daemon threads. Next, daemon threads.

4.2 Daemon Thread

To turn it into a daemon thread, you add one line before thread.start():

thread.setDaemon(True)

It must be set before the thread is started.

import threading, time

def start(num):
	time.sleep(num)
	print(threading.current_thread().name)   # the current thread's name
	print(threading.current_thread().is_alive())
	print(threading.current_thread().ident)

print('start')
thread = threading.Thread(target=start, name='my first thread', args=(1,))
thread.setDaemon(True)
thread.start()
print('stop')

So let’s see what happens when we run it

start
stop

As you can see, the program runs straight through: start, stop, and it ends at print('stop'). That is, it ends when the main thread ends, no matter what is left inside the thread (including its time.sleep()). As soon as the main thread finishes, the daemon thread is destroyed along with it.

Day to day we mostly start non-daemon threads; daemon threads are used less often.

A daemon thread ends together with the main thread. Set setDaemon to True to get one.
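Note that setDaemon(True) is the older spelling; current Python prefers the daemon attribute or constructor argument. A minimal sketch (my own addition):

import threading, time

def start():
	time.sleep(1)
	print('this line never prints')

thread = threading.Thread(target=start, daemon=True)  # same effect as thread.setDaemon(True)
thread.start()
print('stop')  # the main thread ends here, and the daemon thread dies with it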

Student question: Task Manager shows far more than five or six entries. If every entry is a process, how can so many be open at once?

A: A single CPU can execute more than one process. For example, one CPU may have many processes on it (say six of them) running concurrently; the computer just switches between them very quickly. Let me explain a little, computer-architecture style.

Take a single-core machine as the example: there is only one CPU, six processes on it, and only one process actually running at any instant. But execution is very fast, so when one program’s turn is done, the CPU does a context switch and moves on to the next. (Because this happens so fast, it feels as though they run concurrently.)

In fact, at any instant only one process is executing on a given CPU core, and within that process only one thread. (Of course, that was the situation five or six years ago; now there are at least two cores.)

That means we have a second CPU core.

The second core also has many processes on it, and the two cores are independent of each other.

At a given moment, one process runs on the first core and another runs on the second core, side by side. (It’s like having two computers switched on.)

But on any one core there is only one process at a time; however fast the switching, essentially only one process is executing. With two cores, two processes execute at once (four cores, four processes).

Python’s problem is this: with two cores you can have two processes executing (and four cores mean four processes executing). But because Python has the GIL, within a single process only one thread can run at a time, even if you have four cores. At any moment it is one thread, in one process, on one CPU; the rest cannot run, so Python threads cannot take advantage of multiple cores.

Languages like C, Java, and Go have no such limitation.

5. Lock

Now for the tricky part. Suppose we have two threads: one adds 1 to a number a million times, the other subtracts 1 a million times. The plan was that adding a million and subtracting a million would leave zero. But it will not be zero, and if we run it a few times we will see a different result each time. The multithreaded code is as follows:

import threading
import time

number = 0

def addNumber(i):
	time.sleep(i)
	global number
	for i in range(1000000):
		number += 1
	print("Add", number)

def downNumber(i):
	time.sleep(i)
	global number
	for i in range(1000000):
		number -= 1
	print("Cut", number)

print("start")
thread = threading.Thread(target=addNumber, args=(2,))   # declare the first thread
thread2 = threading.Thread(target=downNumber, args=(2,)) # declare the second thread
thread.start()
thread2.start()
thread.join()
thread2.join()
# join blocks here until both threads have finished
print("Outside", number)
print("stop")

Even run single-threaded, sequentially, this can produce two different results: 1000000 or -1000000; whichever function runs first determines the output. Why? Both functions operate on the global variable number: if the addition runs first, number reaches 1000000, and the subtraction then brings it back to 0. Run them the other way round and the signs flip.

import threading
import time

number = 0

def addNumber():
	global number
	for i in range(1000000):
		number += 1
	print("Add", number)
	return number

def downNumber():
	global number
	for i in range(1000000):
		number -= 1
	print("Cut", number)
	return number

sum_num = downNumber() + addNumber()
print("Result", sum_num)

# output:
# Cut -1000000
# Add 0
# Result -1000000

# Change the following line, leaving everything else unchanged:
# sum_num = addNumber() + downNumber()
#
# output:
# Add 1000000
# Cut 0
# Result 1000000

From the multithreaded code above we can see the result: two threads operate on the same number, and the resulting value is scrambled. Why is it scrambled?

Look at what number += 1 really is: number = number + 1. Python performs it in two steps: compute the right-hand side, then assign it to the left.

Let’s first look at the correct process:

# our number = 0
# Step one, run the addition:
a = number + 1  # equivalent to 0 + 1 = 1
# the right-hand side runs first, then the result is assigned to a

number = a  # then number is assigned the result, a

# with the addition done, we move on to run the subtraction:
b = number - 1  # equivalent to 1 - 1 = 0
# then it is assigned to number

# so number ends up equal to 0
number = b

That is the correct process. But under multithreading?

number = 0  # the initial value is 0
a = number + 1  # equivalent to 0 + 1 = 1
# Pay attention here!!
# After the step above, the result is not assigned to number immediately;
# the threads switch, and the subtraction starts:
b = number - 1  # equivalent to 0 - 1 = -1
# then both of these results get assigned:
number = b  # b = -1
number = a  # a = 1

# final result:
number = 1

This is why our results go haywire: there are two computations and two assignments, but under multithreading they are not executed in order. This is what we call thread-unsafe.

Because execution is so fast, the two threads interleave and we end up with a wrong result. That is the thread-safety problem.

To get the value of number we expect and avoid the error, we need to Lock it.

import threading
import time

lock = threading.Lock()  # create a simple mutual-exclusion lock
number = 0

def addNumber():
	global number
	for i in range(1000000):
		lock.acquire()  # take the lock first
		number += 1
		# Holding the lock forces the compute-and-assign pair to complete
		# before any switch, so we never switch away with a computed value
		# that has not yet been assigned.
		# This prevents the thread-safety problem.
		lock.release()  # then release it

def downNumber():
	global number
	for i in range(1000000):
		lock.acquire()
		number -= 1
		lock.release()

print("start")
thread = threading.Thread(target=addNumber)   # declare the first thread
thread2 = threading.Thread(target=downNumber) # declare the second thread
thread.start()
thread2.start()
thread.join()
thread2.join()
# join blocks here until both threads have finished
print("Outside", number)
print("stop")

# output:
# start
# Outside 0
# stop

The region between lock.acquire() and lock.release() forces the computation and the assignment to be performed together: both operations must finish before any switch. A computed value can no longer be left unassigned while the next thread barges in. This prevents the thread-safety problem.

Our first thread acquires the lock with lock.acquire(); the other thread then blocks at its own lock.acquire() until the first calls lock.release(), at which point it obtains the lock and executes, and so on in turn.

**Deadlock:** a deadlock is when one thread holds a lock and does not release it, while the next thread waits for that release. To put it bluntly, they wait for each other: each holds something the other needs, and nothing is ever released. (You wait for my confession, I wait for yours.)
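A classic way to produce a deadlock is two threads taking two locks in opposite order. A minimal sketch (my own addition; it hangs by design, so don’t paste it into code you care about):

import threading, time

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker1():
	with lock_a:
		time.sleep(0.1)
		with lock_b:   # waits for worker2 to release lock_b...
			pass

def worker2():
	with lock_b:
		time.sleep(0.1)
		with lock_a:   # ...while worker2 waits for worker1 to release lock_a
			pass

t1 = threading.Thread(target=worker1)
t2 = threading.Thread(target=worker2)
t1.start()
t2.start()  # the two threads now wait on each other forever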

6. RLock

RLock is a reentrant lock: one lock can be nested inside another. With the regular Lock above, a thread may acquire the lock only once; if it tries to acquire it a second time while still holding it, it blocks forever.

When is a reentrant lock used? When a function that holds the lock needs to call another function that acquires the same lock, as in the class example below.
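Before the full example, a minimal demonstration (my own addition) of the re-acquire behavior itself:

import threading

rlock = threading.RLock()
rlock.acquire()
rlock.acquire()   # fine: an RLock may be re-acquired by the thread that already holds it
rlock.release()
rlock.release()   # it must be released as many times as it was acquired
print('done')     # with a plain threading.Lock(), the second acquire() above would hang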

import threading

class Test:
	rlock = threading.RLock()

	def __init__(self):
		self.number = 0

	def execute(self, n):
		# If you call acquire() and forget lock.release(), you get a deadlock;
		# the with statement solves this: it releases the lock automatically.
		with Test.rlock:
			self.number += n

	def add(self):
		with Test.rlock:   # the same thread re-acquires rlock inside execute()
			self.execute(1)

	def down(self):
		with Test.rlock:
			self.execute(-1)

def add(test):
	for i in range(1000000):
		test.add()

def down(test):
	for i in range(1000000):
		test.down()

if __name__ == '__main__':
	test = Test()  # instantiate
	t1 = threading.Thread(target=add, args=(test,))
	t2 = threading.Thread(target=down, args=(test,))
	t1.start()
	t2.start()
	t1.join()
	t2.join()
	print(test.number)

We can also see that this reentrant lock costs time: the more locks you take, the more resources you use and the slower the program runs. Large projects rarely use this many locks, because locking slows the whole program down. So think carefully about whether you really need them.

7. Multiple processes

Multithreading is used mostly for IO-intensive work, which in our case means crawlers. CPU-intensive work doesn’t need multithreading at all.

The usual strategy is multiprocess plus multithread, the best combination. We need this library:

import multiprocessing
import multiprocessing
import time

def start(i):
	time.sleep(3)
	print(i)
	# Information about the current process:
	print(multiprocessing.current_process().name)       # name of the current process
	print(multiprocessing.current_process().pid)        # PID of the current process
	print(multiprocessing.current_process().is_alive()) # whether the process is alive
	# Sometimes a process gets stuck and we have to stop it ourselves,
	# which is why being able to check on it is useful.

if __name__ == '__main__':
	print('start')
	p = multiprocessing.Process(target=start, args=(1,), name='p1')
	p.start()
	print('stop')

PID stands for Process Identifier.

The PID is each process’s identity. The system automatically assigns a unique PID as soon as a program starts running; after the process terminates, the PID is reclaimed by the system and may be assigned to a newly started program.

The PID column in Task Manager shows each process’s ID; in other words, the PID is each process’s identity.
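A minimal sketch (my own addition) showing that the parent and the child each get their own PID from the OS:

import os
from multiprocessing import Process

def child():
	print('child PID:', os.getpid())

if __name__ == '__main__':
	print('parent PID:', os.getpid())
	p = Process(target=child)
	p.start()
	p.join()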



8. Process communication

By default, Python processes cannot communicate with each other: each process runs in its own independent memory space. So you need another data structure to pass data between them.

Here’s an example:

When one process fetches data and hands it to another process to use, the two processes need to communicate.

Queue: like queuing up, first in, first out. The data you put in first comes out first.

Stack: a data structure used mainly in C and C++, mostly for storing user-defined data. It is last in, first out: what goes in first sits at the bottom, what goes in last sits on top.

from multiprocessing import Process, Queue
import os
# Process: process
# Queue: queue

def write(q):
	print("Process to write: {}".format(os.getpid()))
	for i in range(10):
		print("Put {} to queue...".format(i))
		q.put(i)  # put the number into our queue

def read(q):
	print("Process to read: {}".format(os.getpid()))
	while True:
		# Why while True? There may be no data in the queue at any given
		# moment, so we keep looping; of course, you could also specify
		# a fixed number of loops instead.
		value = q.get()  # take data from the queue (blocks if the queue is empty)
		print("Get {} from queue.".format(value))

# One process grabs URLs and puts them in the queue; the other takes them out and parses them.
if __name__ == '__main__':
	# The parent process creates a Queue and passes it to each child process:
	q = Queue()
	pw = Process(target=write, args=(q,))
	pr = Process(target=read, args=(q,))
	# start child process pw (write):
	pw.start()
	# start child process pr (read):
	pr.start()
	# wait for pw to finish:
	pw.join()
	# pr loops forever, so it has to be terminated by force:
	pr.terminate()

Here’s a practical example:

from multiprocessing import Process, Queue
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'}

def spider_url(queue):
	session = requests.Session()
	session.headers = headers
	html = session.get('https://www.baidu.com')
	xml = etree.HTML(html.text)
	url = xml.xpath('//div[@class="f-tag"]')
	queue.put(url)

def parse_url(queue):
	while True:
		value = queue.get()
		title = value[0]

if __name__ == '__main__':
	queue = Queue()
	spider_process = Process(target=spider_url, args=(queue,))
	parse_process = Process(target=parse_url, args=(queue,))
	spider_process.start()
	parse_process.start()
	spider_process.join()
	# parse_url loops forever, so terminate it once the spider is done:
	parse_process.terminate()

9. Process pool and thread pool

Why do we need process pools and thread pools? Recall the earlier point: context switching consumes resources, and creating and destroying threads or processes consumes even more. A pool saves those costs: instead of creating and destroying workers over and over, we simply take one from the pool and use it.

9.1 Process pool

Method 1 (many tasks):

from multiprocessing import Pool

def function_square(data):
	result = data * data
	return result

if __name__ == '__main__':
	inputs = [i for i in range(100)]
	# inputs = (i for i in range(100))
	# inputs = list(range(100))
	pool = Pool(processes=4)
	# If you don't specify the number of processes, one is created
	# per CPU core automatically.
	# map assigns tasks to the process pool:
	# pool.map(function, iterable)
	pool_outputs = pool.map(function_square, inputs)
	# pool_outputs = pool.map(function_square, (2, 3, 4, 5))
	pool.close()
	pool.join()
	print("Pool :", pool_outputs)

Method 2 (a single task):

from multiprocessing import Pool

def function_square(data):
	result = data * data
	return result

if __name__ == '__main__':
	pool = Pool(processes=4)
	# If you don't specify the number of processes, one is created
	# per CPU core automatically.
	# apply submits a single task to the process pool:
	pool_outputs = pool.apply(function_square, args=(10,))
	pool.close()
	pool.join()
	print("Pool :", pool_outputs)

from multiprocessing import Pool imports the process pool. A Pool provides a fixed number of processes; when a new request is submitted and the pool is not full, a new process is created to execute it. If the pool is full, the request waits until a process is free.

# So: first declare the process pool;
# then use its map method, which works like the ordinary map.
# map:
# pool = Pool()
# pool.map(main, [i*10 for i in range(10)])
# First argument: the function; each element of the list is passed to it
#   in turn, each call running in a process taken from the pool.
# Second argument: the list of inputs, here 0 to 90 in steps of 10.
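For reference, the standard library’s concurrent.futures module offers the same pattern with a newer interface. A minimal sketch (my own addition) doing the same squaring job:

from concurrent.futures import ProcessPoolExecutor

def function_square(data):
	return data * data

if __name__ == '__main__':
	with ProcessPoolExecutor(max_workers=4) as executor:
		# map works like Pool.map; the with block replaces close()/join()
		outputs = list(executor.map(function_square, range(100)))
	print(outputs)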

9.2 Practical example (Maoyan TOP100 + re + multiprocessing)

# !/usr/bin/python3
# -*- coding: utf-8 -*-
# @author: AI Yue Chuang  @datetime: 2020/2/12 15:23  @function: Development_tool: PyCharm
# code is far away from bugs with the god animal protecting
# I love animals. They taste delicious.
# https://maoyan.com/board/4?offset=0
# https://maoyan.com/board/4?offset=10
# https://maoyan.com/board/4?offset=20
# https://maoyan.com/board/4?offset=30
import requests, re, json
from requests.exceptions import RequestException
from multiprocessing import Pool  # introduce a process pool

headers = {
	'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}
session = requests.Session()
session.headers = headers

def get_one_page(url):
	try:
		response = session.get(url)
		if response.status_code == 200:
			return response.text
		return None
	except RequestException:
		return None

def parse_one_page(html):
	# The pattern starts and ends with the <dd> tag!
	pattern = re.compile(
		r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name.*?>'
		r'<a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
		r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
	items = re.findall(pattern, html)
	# yield turns this method into a generator,
	# returning each result as a dict of key-value pairs
	for item in items:
		yield {
			'index': item[0],
			'image': item[1],
			'title': item[2],
			'actor': item[3].strip()[17:],
			'time': item[4][5:],
			'score': item[5] + item[6]
		}

def write_to_file(content):
	with open('result.txt', 'a', encoding='utf-8') as f:
		# json.dumps turns the dict into a string;
		# ensure_ascii=False keeps non-ASCII characters readable
		f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
	url = f'https://maoyan.com/board/4?offset={offset}'
	html = get_one_page(url)
	for item in parse_one_page(html):
		print(item)
		write_to_file(item)

# Version 1.0 (sequential):
# if __name__ == '__main__':
# 	for i in range(10):  # i.e. range(0, 100, 10)
# 		main(i * 10)

# Version 2.0 (process pool):
if __name__ == '__main__':
	pool = Pool()
	pool.map(main, [i * 10 for i in range(10)])

9.3 Thread pool

I looked at a lot of packages, and this one is decent: pip install threadpool

# project = 'Code', file_name = 'thread pool', author = 'AI'
# time = '2020/3/3 0:05', product_name = PyCharm
# code is far away from bugs with the god animal protecting
# I love animals. They taste delicious.

import time
import threadpool

# A time-consuming function we want to run with multiple threads
def get_html(url):
	time.sleep(3)
	print(url)

# Single-threaded, these 100 calls would take 300s;
# with a 10-thread pool: about 30s.
urls = [i for i in range(100)]
pool = threadpool.ThreadPool(10)  # create the thread pool

# Build the task requests for the pool
requests = threadpool.makeRequests(get_html, urls)

# Start the tasks
for req in requests:
	pool.putRequest(req)
pool.wait()
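If you would rather not install a third-party package, concurrent.futures.ThreadPoolExecutor from the standard library does the same job. A minimal sketch (my own addition):

import time
from concurrent.futures import ThreadPoolExecutor

def get_html(url):
	time.sleep(3)  # simulated network wait
	print(url)

urls = list(range(100))
with ThreadPoolExecutor(max_workers=10) as pool:
	# map schedules all 100 tasks onto the 10 worker threads;
	# the with block waits for them all to finish
	pool.map(get_html, urls)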

Homework

Make any crawler you’ve ever written multithreaded or multi-process.