Numpy extends Python with support for large multidimensional arrays and matrix operations, and has been a great help to the Python community. It lets data scientists, machine learning practitioners, and statisticians process large amounts of matrix data simply and efficiently. So can Numpy be made even faster? This article introduces the CuPy library and shows how to use it to speed up Numpy operations.

From Towardsdatascience, by George Seif, Compiled by The Heart of Machines, with participation by Du Wei and Zhang Qian.

On its own, Numpy is already a huge speed improvement over plain Python. When you find Python code running slowly, especially code with many for-loops, you can usually move the data processing into Numpy and let its vectorized operations run at full speed.
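As a quick illustration of that point (my own example, not from the original post): summing the squares of a million numbers with an explicit Python loop versus a single vectorized Numpy expression.

```python
import time

import numpy as np

n = 1_000_000

# Plain Python: an explicit loop over a list
data = list(range(n))
s = time.time()
loop_total = sum(x * x for x in data)
loop_time = time.time() - s

# Numpy: one vectorized expression over the whole array
arr = np.arange(n, dtype=np.int64)
s = time.time()
vec_total = int((arr * arr).sum())
vec_time = time.time() - s

print(loop_total == vec_total)  # True: same result, usually far less time
```

On a typical machine the vectorized version is one to two orders of magnitude faster; that is exactly the gap CuPy then widens further by moving the work to the GPU.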

There is one catch, however: Numpy's acceleration runs only on the CPU. Since consumer CPUs typically have eight cores or fewer, the amount of parallel processing, and therefore the achievable speedup, is limited.

This led to the creation of a new acceleration tool, the CuPy library.

What is CuPy?

CuPy is a library that implements Numpy arrays on Nvidia GPUs, building on the CUDA GPU libraries. Because it implements the Numpy array interface, it can use the GPU's many CUDA cores for much better parallel acceleration.

The CuPy interface is a mirror image of Numpy, and in most cases, it can be used directly in place of Numpy. Users can achieve GPU acceleration simply by replacing Numpy code with compatible CuPy code.

CuPy supports most of Numpy’s array operations, including indexing, broadcasting, array mathematics, and various matrix transformations.
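Because the interface is mirrored, the same code can run on either backend. The sketch below is my own illustration, not from the original post: it falls back to Numpy when CuPy is unavailable, which also makes it runnable on machines without a GPU.

```python
import numpy as np

try:
    import cupy as xp  # runs on the GPU if CuPy and a CUDA device are available
except ImportError:
    xp = np            # the exact same code runs on the CPU with Numpy

x = xp.arange(12).reshape(3, 4)  # array creation
col = x[:, 1]                    # indexing
scaled = x * 2.0                 # broadcasting a scalar
total = float(x.sum())           # array math
flat = x.T.ravel()               # matrix transformation

print(total)  # 66.0
```

This `xp` alias pattern is a common way to write backend-agnostic array code.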

Users who run into a special case that is not supported can also write custom Python code that takes advantage of CUDA and GPU acceleration. All it takes is a small snippet of CUDA C, and CuPy then handles the GPU compilation automatically, much like using Cython.
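Here is a minimal sketch of what such a custom kernel looks like, using CuPy's `ElementwiseKernel` with a one-line CUDA C body (the `squared_diff` example from CuPy's documentation). The Numpy branch is a CPU fallback I have added so the sketch also runs on machines without a GPU:

```python
import numpy as np

try:
    import cupy as cp

    # Compile a tiny CUDA C snippet into a GPU kernel
    squared_diff = cp.ElementwiseKernel(
        'float32 x, float32 y',   # input arguments
        'float32 z',              # output argument
        'z = (x - y) * (x - y)',  # CUDA C body, applied element-wise
        'squared_diff')           # kernel name
    x = cp.arange(5, dtype=cp.float32)
    y = cp.ones(5, dtype=cp.float32)
    result = cp.asnumpy(squared_diff(x, y))
except ImportError:
    # CPU fallback: the equivalent computation in plain Numpy
    x = np.arange(5, dtype=np.float32)
    y = np.ones(5, dtype=np.float32)
    result = (x - y) * (x - y)

print(result)  # [1. 0. 1. 4. 9.]
```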

Before getting started with CuPy, users can install the library via pip:

pip install cupy

Run on GPU using CuPy

For the benchmarks, the PC configuration is as follows:

  • i7-8700K CPU

  • 1080 Ti GPU

  • 32 GB of DDR4 3000MHz RAM

  • CUDA 9.0

After CuPy is installed, users can import CuPy just like Numpy:
import numpy as np
import cupy as cp
import time
In the code that follows, switching between Numpy and CuPy is as simple as replacing Numpy's np with CuPy's cp. The code below creates a 3D array of one billion 1s for both Numpy and CuPy. To measure the speed of creating the arrays, you can use Python's native time library:
### Numpy and CPU
s = time.time()
x_cpu = np.ones((1000, 1000, 1000))
e = time.time()
print(e - s)

### CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000, 1000, 1000))
e = time.time()
print(e - s)
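If you repeat this comparison often, the time.time() boilerplate can be factored into a small helper (my own addition, not from the original post):

```python
import time

import numpy as np

def bench(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    s = time.time()
    out = fn(*args)
    return out, time.time() - s

# CPU example; with CuPy installed, passing cp.ones instead of np.ones
# would run the same benchmark on the GPU.
x_cpu, cpu_time = bench(np.ones, (100, 100, 100))
print(cpu_time)
```

One caveat when timing CuPy this way: GPU kernels launch asynchronously, so for precise GPU timings you would also call cp.cuda.Device().synchronize() before reading the clock.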
It’s easy!


Incredibly, CuPy is already much faster, even though this is just creating an array. Numpy took 1.68 seconds to create the array of one billion 1s, while CuPy took just 0.16 seconds: a 10.5x speedup.

But CuPy can do more than that.

Take some array math, for example. This time, multiply the entire array by 5 and check the speed of Numpy and CuPy again.

### Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
print(e - s)

### CuPy and GPU
s = time.time()
x_gpu *= 5
e = time.time()
print(e - s)
Sure enough, CuPy beat Numpy again. Numpy clocked in at 0.507 seconds, while CuPy took just 0.000710 seconds: a 714.1x speedup.


Now go a step further with the arrays and perform the following three operations:

  1. Multiply the array by 5

  2. Multiply the array by itself

  3. Add the array to itself

### Numpy and CPU
s = time.time()
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu
e = time.time()
print(e - s)

### CuPy and GPU
s = time.time()
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu
e = time.time()
print(e - s)
The results: Numpy performed the entire process in 1.49 seconds on the CPU, while CuPy performed it in 0.0922 seconds on the GPU, a 16.16x speedup.

With the array size up to 10 million data points, the computing speed is greatly improved

Using CuPy, you can achieve multifold acceleration of Numpy and matrix operations on GPUs. It's worth noting that the speedup you can achieve depends heavily on the size of the arrays you are working with. The following table shows the speedup differences for different array sizes (data points):

Once you hit 10 million data points, the speedup grows dramatically; beyond 100 million, it becomes very significant. Below about 10 million data points, Numpy actually runs faster. In addition, the more memory a GPU has, the more data it can process, so users should check whether their GPU's memory is large enough to hold the data CuPy needs to process.
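A quick back-of-the-envelope check for the benchmark above (my own arithmetic, not from the original post): np.ones defaults to float64, so the 1000 x 1000 x 1000 array alone needs roughly 7.5 GB, already close to the 1080 Ti's 11 GB of memory.

```python
import numpy as np

shape = (1000, 1000, 1000)
n_elements = np.prod(shape)  # one billion elements

# np.ones defaults to float64, i.e. 8 bytes per element
bytes_needed = n_elements * np.dtype(np.float64).itemsize
print(bytes_needed / 1024**3)  # ~7.45 GiB
```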




Original link:
Towardsdatascience.com/heres-how-t…