Before publishing this article on Zhihu, we tested how much multi-threading improves running efficiency for computation-intensive and IO-intensive tasks. This article is likewise divided into three parts: computation-intensive tasks, file reading and writing, and network requests, showing how to improve running efficiency with a thread pool and a process pool.

First, import the libraries and define the functions for the three tasks.

```python
import requests
from bs4 import BeautifulSoup
import time
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import Pool

# Sum the integers from 0 up to 50 million
def cal(a=None):  # the a parameter is unused; it only keeps the signature consistent with pool.map below
    s = 0
    for i in range(50000000):
        s = s + i

# Write to a file 5,000,000 times
def file(a=None):  # the a parameter is unused, for the same reason
    with open('try.txt', 'w') as f:
        for i in range(5000000):
            f.write('abcd\n')

# Scrape the first 10 pages of the Douban Top 250
def gettitle(a):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(a * 25)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    lis = soup.find('ol', class_='grid_view').find_all('li')
    for li in lis:
        title = li.find('span', class_="title").text
        print(title)
```

Next, define the serial, multithreaded, and multiprocess runners. Each one takes one of the three task functions above, runs it 10 times, and returns the total running time.

```python
# Plain loop: run the task 10 times serially
def nothing(func):
    t = time.time()
    for i in range(10):
        func(i)
    duration = time.time() - t
    return duration

# Run the task 10 times on a pool of 4 threads
def thread(func):
    t = time.time()
    pool = ThreadPool(4)
    pool.map(func, range(10))
    pool.close()
    pool.join()
    duration = time.time() - t
    return duration

# Run the task 10 times on a pool of 4 processes
def process(func):
    t = time.time()
    pool = Pool(4)
    pool.map(func, range(10))
    pool.close()
    pool.join()
    duration = time.time() - t
    return duration
```
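The same comparison can also be written with the standard library's `concurrent.futures` module. This is a minimal sketch under that API, not the code used for the measurements above; `busy` is a stand-in CPU-bound task:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # Stand-in CPU-bound task: sum the integers below n
    return sum(range(n))

def timed_map(executor_cls, func, args, workers=4):
    # Run func over args on the given executor; return (results, seconds)
    t = time.time()
    with executor_cls(max_workers=workers) as ex:
        results = list(ex.map(func, args))
    return results, time.time() - t

if __name__ == '__main__':
    results, secs = timed_map(ThreadPoolExecutor, busy, [10 ** 6] * 10)
    print('threads:   %.2fs' % secs)
    results, secs = timed_map(ProcessPoolExecutor, busy, [10 ** 6] * 10)
    print('processes: %.2fs' % secs)
```

`executor_cls` is the only thing that changes between the thread and process versions, which makes this form convenient for exactly the kind of side-by-side timing done in this article.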

Next, define a function that measures running time: it runs a given (runner, task) combination 5 times and returns the mean duration along with the individual durations.

```python
def get_duration(curr, func):
    l = []
    for _ in range(5):
        l.append(curr(func))
    mean_duration = '%.2f' % np.mean(l)
    all_duration = ['%.2f' % i for i in l]
    return mean_duration, all_duration
```

Run the code below to measure the times.

```python
if __name__ == '__main__':
    # CPU-intensive task comparison
    print(get_duration(nothing, cal))
    print(get_duration(thread, cal))
    print(get_duration(process, cal))

    # File read/write task comparison
    print(get_duration(nothing, file))
    print(get_duration(thread, file))
    print(get_duration(process, file))

    # Network request task comparison
    print(get_duration(nothing, gettitle))
    print(get_duration(thread, gettitle))
    print(get_duration(process, gettitle))
```

The results are as follows:

```
------ CPU-intensive task running time ------
Serial
('39.98', ['39.57', '39.36', '40.53', '40.09', '40.35'])
Multithreading
('38.31', ['39.07', '37.96', '38.07', '38.31', '38.13'])
Multiprocessing
('27.43', ['27.58', '27.11', '27.82', '27.53', '27.11'])

------ File read/write task running time ------
Serial
('54.11', ['53.54', '53.96', '54.46', '53.54', '55.03'])
Multithreading
('53.86', ['55.44', '54.12', '52.48', '53.17', '54.08'])
Multiprocessing
('34.98', ['35.14', '34.35', '35.27', '35.20', '34.94'])

------ Network request task running time ------
Serial
('4.77', ['4.74', '4.70', '4.77', '4.91', '4.72'])
Multithreading
('1.96', ['1.88', '2.09', '1.91', '2.04', '1.91'])
Multiprocessing
('3.79', ['3.55', '3.70', '3.50', '3.92', '4.30'])
```
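For reference, the speedups implied by the mean times above can be computed directly from those numbers:

```python
# Speedup of each parallel runner relative to the serial baseline,
# taken from the mean running times reported above
means = {
    'cpu':  {'serial': 39.98, 'thread': 38.31, 'process': 27.43},
    'file': {'serial': 54.11, 'thread': 53.86, 'process': 34.98},
    'net':  {'serial': 4.77,  'thread': 1.96,  'process': 3.79},
}

for task, t in means.items():
    print('%-4s  thread x%.2f  process x%.2f'
          % (task, t['serial'] / t['thread'], t['serial'] / t['process']))
```

This makes the pattern in the analysis below easy to see at a glance: processes win on the CPU and file tasks, threads win on the network task.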

Analysis:

  • First, the CPU-intensive computation. Multi-threading does not improve performance, while multi-processing does, because multiple processes can take advantage of the computer's multiple cores and bring more resources to bear on the calculation.

  • While the CPU-intensive computation runs, you can watch the task manager: with serial execution and multithreading, CPU utilization stays below half, while with multiple processes all cores are fully used.

  • Note: the thread pool and process pool here open only 4 threads (processes), because on my machine 4 processes are enough to saturate the CPU. Opening more processes gives no further advantage on computation-intensive tasks; it only adds process creation and switching overhead.

  • Second, the file read/write task. The results above show that only multiple processes improve performance.

  • In fact, multithreading can sometimes also improve the efficiency of file read/write tasks, namely when opening, reading, and writing the files is slow. In the results above, the file is probably too small and each write too short, so most of the time goes to the frequent write calls themselves, i.e. to CPU work, which is why multithreading cannot improve efficiency here.

  • If the reads and writes go to a database instead, the wait times are longer and multithreading can show a greater advantage.

  • To better demonstrate the advantage of multithreading for file reading and writing, I ran the following test.

Change the file function to write much more data per call:

```python
# Write to the file 500 times, with a large write each time
def file(a=None):  # the a parameter is unused; kept for consistency with pool.map
    for i in range(500):
        with open('try.txt', 'w') as f:
            f.write('abcd' * 100000 + '\n')
```

The result is as follows:

```
Serial
('55.15', ['49.96', '55.75', '45.11', '52.16', '72.75'])
Multithreading
('26.57', ['26.67', '23.89', '25.48', '32.84', '23.94'])
Multiprocessing
('25.72', ['24.10', '25.82', '24.13', '28.03', '26.50'])
```

As you can see, in this case multithreading improves efficiency to roughly the same extent as multiprocessing.

  • Third, the network request task. The results above show that multithreading improves this task most clearly, and multiprocessing also helps somewhat.

  • The main cost of a network request is waiting for the page's response, so if you can wait for several pages' responses at the same time, running efficiency improves greatly; multithreading is a perfect fit here.

  • Finally, note that creating processes is expensive, so the code above only shows a benefit from multiple processes because the task functions run long enough to amortize that cost. This is also why multiprocessing helps relatively little on the network request task; creating threads is much cheaper.

  • Conclusion: CPU-intensive tasks are generally sped up with multiple processes, while network request tasks are generally sped up with multiple threads. For file reading and writing, it depends on whether the time is dominated by CPU work or by waiting.
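As a practical note on the pool sizes used above: instead of hard-coding 4 workers, the process pool can be sized to the machine's core count. This is a minimal sketch; `square` is a stand-in worker, not one of the article's task functions:

```python
import os
from multiprocessing import Pool

def square(n):
    # Stand-in CPU-bound worker
    return n * n

if __name__ == '__main__':
    # os.cpu_count() may return None; fall back to 4 in that case
    workers = os.cpu_count() or 4
    with Pool(workers) as pool:
        print(pool.map(square, range(10)))
```

For CPU-bound work, more workers than cores adds creation and switching overhead without speeding anything up, which matches the observation above about 4 processes saturating the CPU.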

Welcome to visit my Zhihu column.

Column home: Programming in Python

Table of contents: table of contents

Version description: Software and package version description