Python is a great programming language for processing data and automating repetitive tasks. Before we can train a machine learning model, we usually need to preprocess the data, and Python is great for that kind of work, such as resizing hundreds of thousands of images. You can almost always find a Python library that makes data processing easy.

However, while Python is easy to learn and easy to use, it is not the fastest language. By default, a Python program runs as a single process on a single CPU. But most computers sold in the last few years have at least a quad-core processor, which means four CPUs. That means 75% or more of your computer’s computing power is sitting idle while you wait for your Python script to finish processing the data!

Today I’m going to show you how to take advantage of your computer’s full processing power by running Python functions in parallel. Thanks to Python’s concurrent.futures module, we can turn a normal data processing script into one that processes data in parallel by changing just three lines of code, for a roughly fourfold speedup.


The normal way to process data in Python

For example, let’s say we have a folder full of image files and we want to use Python to create a thumbnail for each image.

Here’s a short script that uses Python’s built-in glob module to get a list of all the JPEG files in the folder, and then uses the Pillow image processing library to save a 128-pixel thumbnail of each image:
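A minimal sketch of such a script might look like this (the folder path, the *.jpg pattern, and the thumbnail naming scheme are assumptions, not the article’s exact code):

import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # Name the thumbnail after the original file, e.g. photo.jpg -> photo_thumbnail.jpg
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create a 128x128 thumbnail of the image and save it
    image = Image.open(filename)
    image.thumbnail((128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename

# Get a list of all the JPEG files in the folder we want to process
image_files = glob.glob("*.jpg")

# Process each file, one at a time
for image_file in image_files:
    thumbnail_file = make_image_thumbnail(image_file)
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")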

This script follows a simple pattern that you’ll often see in data processing scripts:

Start by getting a list of the files (or other data) you want to work with

Write a helper function that can process one piece of that data

Use a for loop to call the helper function on each piece of data, one at a time.

Let’s test this script with a folder containing 1000 JPEG images to see how long it takes to run:
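Assuming the script is saved as thumbnails_1.py (the filename here is just a placeholder), we can time it from the terminal with the Unix time command, which reports both wall-clock (“real”) time and total CPU (“user”) time:

time python3 thumbnails_1.py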

Running the program took 8.9 seconds, but how hard is the computer really working?

Let’s run the program again to see what the activity monitor looks like when the program is running:

75% of your computer’s processing resources are idle! What’s going on here?

The reason for this problem is that my computer has four CPUs, but Python is only using one of them. The program is pushing one CPU as hard as it can while the other three sit idle. I need a way to split the work into four separate chunks that can be processed in parallel. Fortunately, Python makes that easy to do!

Try creating multiple processes

Here’s one way we can process the data in parallel:

1. Split the list of JPEG files into four smaller chunks.
2. Run four separate instances of the Python interpreter.
3. Have each Python instance process one of the four chunks.
4. Combine the results from the four chunks to get the final list of results.

Four Python copies running on four separate CPUs should be able to handle about four times more work than a single CPU, right?

Best of all, Python has already done the hard part for us. We just tell it which function we want to run and how many instances we want to use, and it will do the rest. We only need to change 3 lines of code for the whole process.

First, we need to import the concurrent.futures library, which is built into Python:

import concurrent.futures

Next, we need to tell Python to start four additional Python instances. We do this by asking Python to create a Process Pool:

with concurrent.futures.ProcessPoolExecutor() as executor:

By default, it will create one Python process for each CPU on your machine, so if you have four CPUs, it will start four Python processes.
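If you ever want to control that number yourself, ProcessPoolExecutor also accepts a max_workers argument, for example:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor: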

The final step is to have the Process Pool run our helper function on the list of data across those four processes. To do this, we replace the existing for loop:
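Roughly, and reusing the names from the sketch above, the replacement looks like this:

with concurrent.futures.ProcessPoolExecutor() as executor:
    # executor.map() runs make_image_thumbnail() on every item in image_files,
    # spreading the work across the pool's worker processes
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
        print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")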

The executor.map() function is called with the helper function and the list of data to process. It does all the messy work for me, including splitting the list into sublists, sending the sublists to each child process, running the child processes, and merging the results. Well done!

It also hands back the result of each function call. executor.map() returns the results in the same order as the input data, so I used Python’s zip() function as a shortcut to pair each original filename with its matching result in a single step.

Here is the program code after these three steps:
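Putting those three changes together with the earlier sketch, the whole script would look roughly like this (file paths and names are still assumptions):

import concurrent.futures
import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # Name the thumbnail after the original file, e.g. photo.jpg -> photo_thumbnail.jpg
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create a 128x128 thumbnail of the image and save it
    image = Image.open(filename)
    image.thumbnail((128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename

# The __main__ guard is needed so the child processes can import this module safely
if __name__ == "__main__":
    # Get a list of all the JPEG files in the folder we want to process
    image_files = glob.glob("*.jpg")

    # Create a Process Pool (one worker per CPU by default) and let executor.map()
    # split the list across the workers and collect the results in order
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
            print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")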

Let’s run this script to see if it does data processing faster:
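Again assuming the parallel version is saved as thumbnails_2.py:

time python3 thumbnails_2.py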

The script processed the data in 2.2 seconds! 4 times faster than the original version! The data was processed faster because we used four CPUs instead of one.

But if you look closely, the “user” time is almost 9 seconds. How can the program take only 2.2 seconds to run, yet somehow still use 9 seconds of time?

That’s because the “user” time is the sum of the time spent on all CPUs. We spent the same 9 seconds of total CPU time on the work, but we spread it across four CPUs, so the actual elapsed time was only 2.2 seconds!
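In rough numbers: 2.2 seconds of wall-clock time × 4 CPUs ≈ 8.8 seconds of CPU time, which lines up with the roughly 9 seconds of “user” time reported.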

Note: spawning extra Python processes and distributing data to the child processes takes time, so this approach does not always guarantee a big speedup. If you are dealing with very large data sets, it is worth reading up on how to split a data set into smaller chunks first.

Does this always help speed up my data-processing scripts?

Using a Process Pool like this is a good way to speed things up when you have a list of data items that can each be processed independently. Here are some examples where parallel processing might be useful:

Fetch statistics from a series of separate web server logs.

Parsing data from a bunch of XML, CSV, and JSON files.

Preprocessing large amounts of image data to build a machine learning data set.

But keep in mind that Process Pools are not a panacea. Using a Process Pool requires passing data back and forth between separate Python processes. This approach won’t help if the data you are trying to process can’t be passed between processes efficiently. In short, you have to be working with data of a type that Python knows how to serialize (that is, “pickle”).
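As a rough rule of thumb, anything you pass into executor.map() or return from your helper function has to survive Python’s pickle serialization. A quick (purely illustrative) way to check is something like this, where is_picklable is just a throwaway helper:

import pickle

def is_picklable(obj):
    # Data sent to or returned from a Process Pool must be picklable,
    # because Python serializes it to move it between processes.
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(is_picklable([1, 2, 3]))        # True: plain lists, strings, and numbers are fine
print(is_picklable(lambda x: x + 1))  # False: lambdas can't be pickled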

This approach also won’t work if your data needs to be processed in a particular order: if each step depends on the result of the previous one, the work can’t be split up this way.

What about the GIL problem?

You probably know that Python has something called the Global Interpreter Lock, or GIL. The GIL ensures that only one thread is executing Python code at any given moment, even in a multithreaded program. In other words, multithreaded Python code does not truly run in parallel, so it can’t take full advantage of a multicore CPU.

But a Process Pool sidesteps this problem! Because we are running separate Python interpreter instances, each instance has its own GIL, so our Python code really does run in parallel!
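If you want to see this for yourself, here is a small illustrative experiment (not from the article) comparing a thread pool and a process pool on a CPU-bound function. The exact timings depend on your machine, but the process pool version should finish noticeably faster:

import concurrent.futures
import time

def count_down(n):
    # A CPU-bound task: pure Python arithmetic with no I/O
    while n > 0:
        n -= 1

if __name__ == "__main__":
    work = [10_000_000] * 4

    # Threads share a single GIL, so CPU-bound work is effectively serialized
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor() as executor:
        list(executor.map(count_down, work))
    print(f"Threads:   {time.time() - start:.2f} seconds")

    # Separate processes each have their own GIL, so the same work runs in parallel
    start = time.time()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        list(executor.map(count_down, work))
    print(f"Processes: {time.time() - start:.2f} seconds")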

Don’t be afraid of parallel processing!

With the concurrent.futures library, a few small changes to a script let Python put all of your computer’s CPUs to work right away. Don’t be afraid to try it: once you get the hang of it, it’s as simple as writing a for loop, but it will make your data processing scripts fly.