Incremental learning algorithms can learn the nodes and parameters of a network at the same time, but as the model structure grows, the computational cost rises steadily. There are two ways to reduce this cost: (1) study model partitioning methods that divide a large model into several smaller sub-models; (2) improve the computing capacity of the hardware (GPU or CPU). The Jetson TX2 can use CUDA for GPU parallel computing, and PyCUDA, a parallel computing library for Python, makes GPU acceleration convenient. In this article, PyCUDA is used to implement parallel acceleration and is compared with NumPy.
## Parallel computing with PyCUDA
Please refer to the PyCUDA website for installation and basic usage tutorials.
### A simple example
```python
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
import time
from pycuda.compiler import SourceModule

mod = SourceModule('''
__global__ void Text_GPU(float *A, float *B, float *K, size_t N){
    int bid = blockIdx.x;                // one block per row
    int tid = threadIdx.x;               // one thread per column
    __shared__ float s_data[20];         // per-block scratch buffer
    s_data[tid] = A[bid * N + tid] - B[bid * N + tid];
    __syncthreads();
    if (tid == 0){                       // thread 0 reduces the block
        float sum_d = 0.0f;
        for (int i = 0; i < N; i++){
            sum_d += s_data[i] * s_data[i];
        }
        K[bid] = exp(-sum_d);
    }
}
''')

multiply_them = mod.get_function("Text_GPU")

tic = time.time()
A = np.random.random((1000, 20)).astype(np.float32)
B = np.random.random((1000, 20)).astype(np.float32)
K = np.zeros((1000,), dtype=np.float32)
N = np.int32(20)
multiply_them(
    drv.In(A), drv.In(B), drv.InOut(K), N,
    block=(20, 1, 1), grid=(1000, 1))
toc = time.time()
print("time cost is:" + str(toc - tic))
```
Time cost: 0.00536298751831
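As a sanity check on the GPU result, the same per-row quantity can be computed in plain NumPy. This sketch assumes the kernel computes `K[i] = exp(-Σ_j (A[i,j] − B[i,j])²)`, which matches the NumPy comparisons later in the article:

```python
import numpy as np

A = np.random.random((1000, 20)).astype(np.float32)
B = np.random.random((1000, 20)).astype(np.float32)

# CPU reference for the kernel's output; after the kernel call above,
# np.allclose(K, K_ref, atol=1e-5) should hold if the kernel is correct.
K_ref = np.exp(-np.sum((A - B) ** 2, axis=1))
print(K_ref.shape)  # (1000,)
```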
## Notes
### Grid and block
Blocks communicate with each other through global memory; threads in the same block communicate through shared memory; and each individual thread additionally has its own local memory.
### SourceModule
```python
mod = SourceModule(''' __global__ void Text_GPU(.....) { ... } ''')
```
This string is a C/C++ CUDA kernel function, which defines the main code of the GPU parallel computation. For example, a kernel that adds two vectors:
```python
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] + b[i];
}
""")
```
### `__shared__` variables
Declares shared memory visible to all threads within the same block.
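In the kernel above, the `__shared__` array acts as a per-block scratch buffer: every thread writes one difference, then (after the barrier) thread 0 reduces it. A CPU emulation of that pattern for a single block, under the assumption that the kernel computes a squared-distance reduction:

```python
import numpy as np

N = 20
a = np.random.random(N).astype(np.float32)   # one row of A
b = np.random.random(N).astype(np.float32)   # one row of B

s_data = np.empty(N, dtype=np.float32)       # plays the role of __shared__
for tid in range(N):                         # each "thread" writes one slot
    s_data[tid] = a[tid] - b[tid]
# (on the GPU, __syncthreads() would go here)
sum_d = float(np.sum(s_data ** 2))           # thread 0's reduction
k = np.exp(-sum_d)
print(0.0 < k <= 1.0)  # True
```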
### `__syncthreads()`
A barrier synchronization function: every thread in the block must finish all the code before the call before any thread is allowed to execute the code after it.
### `blockIdx.x` and `threadIdx.x`
`blockIdx.x` gives the index of the current block, and `threadIdx.x` gives the index of the current thread within its block.
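In the example kernel, `blockIdx.x` selects a row and `threadIdx.x` a column, so the flat index `bid * N + tid` addresses element `(bid, tid)` of the 2-D input. A small NumPy sketch of that mapping (shapes chosen to match the example):

```python
import numpy as np

N = 20                       # threads per block = columns per row
A = np.random.random((1000, N)).astype(np.float32)
flat = A.ravel()             # the 1-D (row-major) layout the kernel sees

bid, tid = 7, 3              # an arbitrary (block, thread) pair
print(flat[bid * N + tid] == A[bid, tid])  # True
```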
## Comparison with NumPy (no GPU acceleration)
### NumPy mode 1 (without a for loop)
```python
import numpy as np
import time

A = np.random.random((1000, 20)).astype(np.float32)
B = np.random.random((1000, 20)).astype(np.float32)

tic = time.time()
dk = A - B
dd = [np.sum(a ** 2) for a in dk]   # squared distance of each row
K1 = np.exp(-np.array(dd))
toc = time.time()
print("time cost is:" + str(toc - tic))
```
The time cost is: 0.0174951553345
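The list comprehension above still performs one small reduction per row in Python. The same result can be obtained with a single vectorized call, which is usually faster still; a minimal sketch:

```python
import numpy as np

A = np.random.random((1000, 20)).astype(np.float32)
B = np.random.random((1000, 20)).astype(np.float32)

dk = A - B
# Per-row version (as in Mode 1) versus one fully vectorized reduction.
K1_loop = np.exp(-np.array([np.sum(a ** 2) for a in dk]))
K1_vec = np.exp(-np.sum(dk ** 2, axis=1))
print(np.allclose(K1_loop, K1_vec))  # True
```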
### NumPy mode 2 (with a for loop)
```python
import numpy as np
import time

A = np.random.random((1000, 20)).astype(np.float32)
B = np.random.random((1000, 20)).astype(np.float32)

def Guassion_kernel(x, u):
    d = x - u
    dd = [np.sum(a ** 2) for a in d]
    return np.exp(-sum(dd))

tic = time.time()
Phi_x = []
for j in range(1000):
    Phi_x.append(Guassion_kernel(A[j], B[j]))
toc = time.time()
print("time cost is:" + str(toc - tic))
print(Phi_x)
```
The time cost is: 0.0264999866486
## Comparison
The table below shows the time cost of each of the three methods. GPU acceleration gives the lowest computation time, while the CPU version with the for loop is the most expensive. This is only a preliminary comparison: in practice, the GPU is not always faster than the CPU. When the data dimension is small, the extra setup that GPU acceleration requires can make the GPU computation take longer than the CPU computation.
| types | GPU | CPU without for loop | CPU with for loop |
| --- | --- | --- | --- |
| time cost (s) | 0.00536298751831 | 0.0174951553345 | 0.0264999866486 |
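One caveat on the numbers above: a single `time.time()` measurement includes one-off setup cost and timer noise. A more robust sketch (illustrative only; the exact figures will differ per machine) averages several runs with `time.perf_counter`:

```python
import time
import numpy as np

def time_avg(fn, repeats=10):
    """Run fn several times and return the mean wall-clock time per run."""
    tic = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - tic) / repeats

A = np.random.random((1000, 20)).astype(np.float32)
B = np.random.random((1000, 20)).astype(np.float32)
t = time_avg(lambda: np.exp(-np.sum((A - B) ** 2, axis=1)))
print("mean time per run: %.6f s" % t)
```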
Finally, the method is applied in the incremental algorithm, where the number of nodes grows slowly. The per-iteration cost of the three approaches is shown in the figure below. At the beginning of the run, when there are few nodes, the GPU's computation time is higher than the CPU's; later the GPU and CPU costs converge, and the GPU does not yet show a great advantage, mainly because the number of nodes is still small (just over 100). With a further increase in nodes, the effectiveness of the GPU should become more obvious.
## Conclusion
This article introduced PyCUDA and used it to implement GPU parallel computing from Python.