Python is a great programming language for processing data and automating repetitive tasks. Before training a machine learning model, we often need to preprocess data in bulk, such as resizing hundreds of thousands of images, and Python handles this well: you can almost always find a Python library that makes the data processing easy.

However, while Python is easy to learn and use, it is not the fastest language. By default, a Python program runs as a single process using a single CPU. But most computers sold in the last few years have at least four CPU cores. That means 75% or more of your computer’s computing resources can sit idle while you wait for a Python script to finish processing data!

Today I’m going to show you how to use the full processing power of your computer by running Python functions in parallel. Thanks to Python’s concurrent.futures module, we can turn an ordinary data-processing script into one that processes data in parallel, roughly four times faster, by changing just three lines of code.


The normal Python way of handling data

Let’s say we have a folder full of image data and want to create a thumbnail for each image in Python.

Here is a short script that uses Python’s built-in glob function to get a list of all JPEG images in a folder, and then saves a 128-pixel thumbnail for each image using the Pillow image processing library:
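
A minimal sketch of what such a script might look like (the *.jpg pattern and the make_image_thumbnail helper name are illustrative choices, not the article’s exact code):

import glob
import os
from PIL import Image


def make_image_thumbnail(filename):
    # The thumbnail will be saved as "<original_filename>_thumbnail.<ext>"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save a thumbnail no larger than 128x128 pixels
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename


# Loop through every JPEG file in the folder and make a thumbnail for each one
for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")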

This script follows a simple pattern that you’ll often see in data processing scripts:

Start by getting a list of the files (or other data) you want to work with

Write a helper function that can process a single piece of data from that list

Use a for loop to call the helper function on each piece of data, one item at a time.

Let’s test this script with a folder of 1,000 JPEGs to see how long it takes to run:
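
One simple way to measure this is with the Unix time command (thumbnails.py here is a placeholder for whatever your script is called); the “real” line of its output is the wall-clock time we care about:

time python3 thumbnails.py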

It took 8.9 seconds to run the program, but how hard did the computer really work?

Let’s run the program again to see what the activity monitor looks like as the program runs:

75% of the computer’s processing resources are idle! What’s going on here?

The reason for this is that my computer has four CPUs, but Python is only using one of them. So the program goes all out on one CPU while the other three sit idle. I needed a way to divide the workload into four separate chunks that could be processed in parallel. Fortunately, there is an easy way to do this in Python!

Try creating multiple processes

Here’s a way to process data in parallel:

1. Divide the list of JPEG files into four smaller chunks.

2. Run four separate instances of the Python interpreter.

3. Have each Python instance process one of the four chunks of data.

4. Combine the results from the four chunks to get the final list of results.

Four copies of Python running on four separate CPUs should be able to do about four times as much work as one CPU, right?

Best of all, Python has done the hard work for us. All we have to do is tell it which function we want to run and how many instances to use, and it will do the rest. We only need to change three lines of code.

First, we need to import the concurrent.futures library, which is built into Python:

import concurrent.futures

Next, we need to tell Python to start four additional Python instances. We do this by asking Python to create a Process Pool:

with concurrent.futures.ProcessPoolExecutor() as executor:

By default, it creates one Python process for each CPU on your computer, so if you have four CPUs, four Python processes will start.
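
If you would rather control the number of worker processes yourself, ProcessPoolExecutor also accepts a max_workers argument, for example:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor: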

The final step is to have the Process Pool use those four processes to run our helper function on the list of data. To do that, we replace our existing for loop with a call to executor.map().
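
A sketch of what that replacement might look like, reusing the hypothetical make_image_thumbnail helper and *.jpg pattern from the earlier script (this loop sits inside the with block from the previous step):

image_files = glob.glob("*.jpg")

for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")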

We call executor.map() with the helper function and the list of data to process. It does all the dirty work for us: splitting the list up, sending the pieces of work to each child process, running the child processes, and collecting the results. Well done!

It also returns the result of each function call. The executor.map() function returns results in the same order as the input data, so I used Python’s zip() function as a shortcut to pair each original filename with its matching result in a single step.

Here is the code after these three changes:
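
A sketch of the full script with the three changes applied (still using the illustrative make_image_thumbnail helper and *.jpg pattern) might look like this:

import glob
import os
import concurrent.futures
from PIL import Image


def make_image_thumbnail(filename):
    # The thumbnail will be saved as "<original_filename>_thumbnail.<ext>"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save a thumbnail no larger than 128x128 pixels
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename


# The __main__ guard lets the child processes import this module safely
if __name__ == "__main__":
    # Create a Process Pool with one worker per CPU
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Get the list of files to process
        image_files = glob.glob("*.jpg")

        # Process the files, splitting the work across the process pool
        for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
            print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")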

Let’s run this script to see if it completes data processing faster:

The script processed the data in 2.2 seconds, about four times faster than the original version! The data was processed faster because we used four CPUs instead of one.

But if you take a closer look, the “user” time is still almost 9 seconds. How can the script finish in 2.2 seconds but somehow still use 9 seconds of processing time? That hardly seems possible!

It is possible because “user” time is the total CPU time summed across all CPUs. The program used the same total of about 9 seconds of CPU time as before, but it spread that work across four CPUs, so the actual elapsed data-processing time was only 2.2 seconds!

Note: spinning up extra Python processes and handing data to the child processes takes time, so this approach does not always guarantee a significant speedup. If you are dealing with very large data sets, it is helpful to read up on how to slice a data set into smaller pieces first.

Does this always speed up my data processing scripts?

If you have a list of data and each item can be processed independently of the others, a Process Pool like the one shown here is a good way to speed things up. Here are some examples where parallel processing is appropriate (a small sketch of the same pattern applied to CSV files follows this list):

Pulling statistics from a series of separate web server logs.

Parsing data from a bunch of XML, CSV, and JSON files.

Preprocessing a large number of images to build a machine learning data set.
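
As a concrete illustration of the pattern, here is a hypothetical sketch that counts the data rows in a batch of CSV files in parallel (the count_rows helper and *.csv pattern are invented for this example):

import concurrent.futures
import glob


def count_rows(filename):
    # Count the data rows in one CSV file, skipping the header line
    with open(filename) as f:
        return sum(1 for _ in f) - 1


if __name__ == "__main__":
    csv_files = glob.glob("*.csv")

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for filename, row_count in zip(csv_files, executor.map(count_rows, csv_files)):
            print(f"{filename}: {row_count} rows")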

But also keep in mind that a Process Pool is not a panacea. Using a Process Pool requires passing data back and forth between separate Python processes. If the data you are working with cannot be efficiently passed between processes, this approach will not help. In short, you must be working with data types that Python knows how to serialize (pickle) and hand to another process.

This approach also cannot help when the data has to be processed in a particular order. If you need the result of the previous step before you can start the next one (for example, if each image’s processing depended on the output of the previous image), the work cannot be split into independent chunks and run in parallel.

What about the GIL?

You probably know that Python has something called the Global Interpreter Lock, or GIL. The GIL ensures that only one thread can be executing Python instructions at any given time, even if your program uses multiple threads. In other words, multithreaded Python code does not truly run in parallel, so it cannot take full advantage of a multi-core CPU.

But a Process Pool solves this problem! Because we are running separate Python interpreter instances, each process gets its own GIL. The result is Python code that really does run in parallel!
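
If you want to see the difference for yourself, here is a small, self-contained experiment (count_down is just a CPU-bound busy loop invented for this demo); on a multi-core machine, the process pool version should finish noticeably faster than the thread pool version:

import concurrent.futures
import time


def count_down(n):
    # A CPU-bound busy loop; threads cannot run this in parallel because of the GIL
    while n > 0:
        n -= 1


def run_with(executor_class, label):
    start = time.time()
    with executor_class(max_workers=4) as executor:
        # Run four CPU-bound tasks at the same time
        list(executor.map(count_down, [10_000_000] * 4))
    print(f"{label}: {time.time() - start:.2f} seconds")


if __name__ == "__main__":
    run_with(concurrent.futures.ThreadPoolExecutor, "Threads (serialized by the GIL)")
    run_with(concurrent.futures.ProcessPoolExecutor, "Processes (truly parallel)")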

Don’t be afraid of parallel processing!

With the concurrent.futures library, Python lets you put every CPU in your computer to work with a simple change to your script. Don’t be afraid to try this approach: once you’ve mastered it, it is as simple as writing a for loop, but it can make your data processing scripts fly.