Author: Lawliet
Preface:
A few months ago, I wrote my own autoencoder based on Simoncelli's 2016 paper for research purposes. At first, I wanted to use a popular deep learning framework such as TensorFlow, Caffe2, or MXNet for my experiments. However, after a few weeks of investigating all of these frameworks, I found a real headache: extensibility. I'm not saying these frameworks are poorly designed, but they don't really let users develop third-party operators. It's like writing a plug-in: you are given a function without any arguments, and the only way to change its behavior is to modify the source code, which is a huge project because of the poorly organized documentation. (This seems to be a common problem with open-source software.) So, since the uncommon operator GDN is not included in any of these frameworks, designing a new framework seemed to be the only solution.
GDN
This operator is the core nonlinear function in the theory, and its expression is as follows (the formula doesn't really matter; if you don't like the damn notation, you can skip this section):
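Roughly, omitting the layer superscripts and writing it per pixel (m, n), it looks like this (see the Simoncelli paper for the exact form):

u_i(m,n) = w_i(m,n) / ( β_i + Σ_j γ_ij · w_j(m,n)² )^(1/2)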
Superscripts (k) and (k+1) represent the layer index, w and u are the multi-channel input and output images, and the subscript i is the channel index. β and γ are the parameters I want to train: assuming we have N channels, γ is an N x N matrix and β is an N x 1 vector. At first glance, this function looks very similar to batch normalization (BN) or local response normalization (LRN), which are well supported by cuDNN and every deep learning framework. But trust me, don't let your eyes fool you; it is very different. (Note that the big division is element-wise.)
The forward pass doesn't consume much computing power; the backward pass is what eats most of my GPU's power. Now let's look at the backward pass. I need to calculate three gradients: ∇β, ∇γ, and ∇u.
I know how people feel when they see this for the first time, because I wanted to kill myself the first time I saw this monster. But if I could draw a picture of all this shit, you’d feel more comfortable.
First of all, it is easy to notice that each input channel can be viewed as a vector of length m x n. Second, the term (β + Σ γ·w²) raised to the power -3/2 appears in all of these gradients, which means we can compute those terms once and cache them for later use. We call the cached "(β + Σ γ·w²)^(-1/2)" matrix D. Finally, δ is the back-propagated error term.
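To make that concrete, here is a sketch of the chain-rule gradients under the forward formula above (the exact grouping in the original figures may differ):

∂u_i/∂γ_ij = -(1/2) · w_i · w_j² · (β_i + Σ_k γ_ik·w_k²)^(-3/2)
∂u_i/∂β_i  = -(1/2) · w_i · (β_i + Σ_k γ_ik·w_k²)^(-3/2)
∇γ_ij = Σ over all pixels (m,n) of δ_i(m,n) · ∂u_i(m,n)/∂γ_ij

This is exactly where the shared (…)^(-3/2) factor, and hence the cached matrix D, comes from.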
Fig 1. Computation of ∇γ
With some simplification, it's a little clearer, right? I know it still needs some explanation. On the right-hand side of the equation, each rectangle is a stacked vector built from the matrices we mentioned above. D is the denominator of the GDN formula; remember the "(β + Σ γ·w²)^(1/2)" we just mentioned?
Unlike some advanced algorithms, this computation is intuitive to most people, and we can easily write a CPU program to handle it. With a little knowledge of CUDA, anyone can port that CPU code to the GPU. However, depending on how you organize the kernel launch, the speed can be quite different.
1. Less than naive algorithm.
I call this approach "less than naive" because it's the first one I ever used. Even with small images as input, it used up almost all of my GPU memory and delivered the slowest performance. Instead of exploiting any memory reuse, I simply replicated all of these little rectangles vertically and horizontally to build a larger matrix, as shown in the figure below, launched many one-dimensionally organized kernels, and then summed the results.
Fig 2. Less than naive Algo.
The only advantage of this algorithm is that there is no need to compute an index in each CUDA thread, because the thread ID is already the unique corresponding memory index. All you need to do is some multiplication, and then use cuBLAS to take the dot product of each little colored rectangle with a ones vector (a vector filled with 1s). But as you can see, the rectangles are not as small as I've drawn them here; each is the same size as the image. Each vector in this figure has size N x N x imageSize x batchSize, so obviously we waste (N-1) x N x imageSize x batchSize x 4 bytes, not to mention the time wasted accessing all this redundant global memory.
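A minimal sketch of this first approach, assuming the big replicated buffers already exist (the kernel and helper names here are placeholders): an element-wise multiply whose thread ID is the memory index, followed by a cuBLAS dot product with a ones vector for each rectangle.

#include <cublas_v2.h>

// Element-wise product over the replicated buffers: out[i] = a[i] * b[i] * c[i].
// The global thread ID is used directly as the memory index -- no index math.
__global__ void elementwiseProduct(const float* a, const float* b,
                                   const float* c, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = a[idx] * b[idx] * c[idx];
}

// Sum one "little rectangle" of length len by dotting it with a ones vector.
float sumRectangle(cublasHandle_t handle, const float* rect,
                   const float* ones, int len)
{
    float s = 0.0f;
    cublasSdot(handle, len, rect, 1, ones, 1, &s);
    return s;
}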
2. Naive algorithm.
With the first algorithm, I could train on fewer than 4 images of size 128 x 128 per iteration, and each iteration took almost 2 seconds. (My GPU is a GTX 1080.) This reality forced me to improve the algorithm; otherwise, I would have had to wait nearly 2 months for my results.
Because the number of threads I need to launch is definitely much larger than the number of CUDA cores in my GPU, the CUDA driver will serialize these tasks no matter what method I use. So I decided not to copy all that memory. Instead, I launch a one-dimensionally organized kernel of N x imageSize threads, and do that N times (N is the total number of channels).
Fig 3. Without memory replication
As you can see, the improvement is obvious: we no longer need to copy piles of data, and global memory access on a GPU is very expensive. The memory access pattern is also simple, because once you have the thread ID, you can get the memory index with a single mod operation (memory index = thread ID % imageSize). However, since the kernel is still organized one-dimensionally and we use a for loop to launch all of these kernels, we may not benefit from the GPU's smarter scheduling, although I had already gotten a first taste of blood. With this small change, two months of training shrinks to nearly two weeks.
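Roughly, this second algorithm looks like the sketch below, assuming the layout just described: one 1D launch of N x imageSize threads per output row of γ, with the pixel index recovered by a single mod (the kernel name is a placeholder).

// Computes, for one fixed row r of the gamma gradient, the products
// delta_r * w_r * D_r * w_j for every channel j and every pixel.
// No data is replicated; each thread finds its pixel with one mod.
__global__ void gammaRowProducts(const float* delta, const float* W,
                                 const float* D, float* result,
                                 int r, int imgSize, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. N*imgSize-1
    if (tid >= N * imgSize) return;
    int pixel = tid % imgSize;                         // memory index = thread ID % imageSize
    result[tid] = delta[r * imgSize + pixel]
                * W[r * imgSize + pixel]
                * D[r * imgSize + pixel]
                * W[tid];                              // channel j = tid / imgSize
}

// Host side: loop over the N rows of gamma, one launch per row.
// for (int r = 0; r < N; ++r)
//     gammaRowProducts<<<(N*imgSize + 255)/256, 256>>>(delta, W, D, rowBuf, r, imgSize, N);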
3. A smarter kernel organization.
So far, I hadn't considered the power of shared memory, because designing a good kernel pattern is usually boring and a headache for me. Obviously, a one-dimensional kernel pattern is the easiest code to write, but better performance deserves a more careful design. To my surprise, the algorithm in this section achieved three times the speed of the second one.
Going back to Figure 1, you can see that the first rows δ0, w0, and D0 of the first three right-hand matrices are shared by the entire first row of γ. Thus, we can compute one row of γ per block: each block launches imageSize threads, and each thread loops over all the channels with a for loop.
Fig 5. Computation in one block
So from Figure 5, it is very intuitive to place δ0, w0, and D0 in shared memory. Thread i then reads one pixel from each of the N channels (0 to N-1) of W and multiplies it by the shared δ0, w0, and D0. The pseudocode is as follows:
blockId = blockIdx.x;
threadId = threadIdx.x;

// Each block handles one row of gamma; stage its delta, W and D rows in shared memory.
shareDelta <- delta[blockId];
shareW     <- W[blockId];
shareD     <- D[blockId];
__syncthreads();

// Loop over all N channels, reusing the shared vectors N times.
for (i = 0; i < N; i++) {
    result[threadId + i*imgSize] = shareDelta[threadId] * shareW[threadId]
                                 * shareD[threadId] * W[threadId + i*imgSize];
}
This algorithm chooses row-wise computation instead of column-wise computation, because by computing one row of γ within a block we can share three vectors: δ0, w0, and D0. If we computed a column instead, as in the earlier algorithms, we could share only one vector, w0. (See Figure 1 again.)
In this snippet there is no if… else… block, which is very important in parallel computing. All threads run in parallel, and the ideal situation is for all of them to finish their work at the same time; but if there is an if… else… block, the branch makes the threads do different tasks, so they finish at different times, and the overall computation time is determined by the slowest thread.
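As a toy illustration of that point (not tied to the GDN kernel itself), compare a divergent kernel with a branch-free one doing the same work:

// Divergent: odd and even threads in the same warp take different paths,
// so the warp executes both paths one after the other.
__global__ void divergent(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Branch-free: every thread executes the same instruction stream; the select
// is typically compiled to predication rather than a jump.
__global__ void branchFree(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float isEven = (i % 2 == 0) ? 1.0f : 0.0f;
    x[i] = isEven * (x[i] * 2.0f) + (1.0f - isEven) * (x[i] + 1.0f);
}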
Index-free computation is another advantage. With the one-dimensional pattern we had to compute the memory index from the thread ID, but here we don't need to convert blockId and threadId into a one-dimensional memory index to access the data.
Finally, because my data is stored in column-major order, all the elements of a vector like δ0 are stored contiguously, so the reads benefit from the global memory coalescing mechanism. Memory coalescing is another important concept in CUDA.
On the hardware side, threads are executed in groups of 32 called warps. When one thread accesses data, such as A1 in the figure above, the data bus transfers not only A1 but the whole segment A1 through A32 into cache, which speeds up the accesses of the other 31 threads in the warp. So when I load global data into shared memory, only one memory transaction is issued per warp's worth of consecutive elements, and the rest is served from cache, which is far faster, thanks to spatial and temporal locality.
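For intuition, here is what coalesced versus strided access looks like in code (a generic illustration, not part of the GDN kernels):

// Coalesced: thread t reads element t, so one warp touches one contiguous
// 128-byte segment and a single memory transaction serves all 32 threads.
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread t reads element t * stride, so the same warp scatters its
// reads over many segments and many more transactions are issued.
__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}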
4. A small improvement
Today I realized that I don't really need shared memory here; I can cache the values in const variables (effectively registers) instead. For the vectors δ0, w0, and D0, each thread in a block needs to access only one element of each, so before the for loop we can load that element into a const local variable. Another sweetener is that, because each thread accesses only its own element, no thread synchronization is required.
The code is as follows:
blockId = blockIdx.x;
threadId = threadIdx.x;

// Each thread caches its own delta, W and D values once; the const locals
// live in registers, and no synchronization is needed.
const float constDelta = delta[blockId * imgSize + threadId];
const float constW     = W[blockId * imgSize + threadId];
const float constD     = D[blockId * imgSize + threadId];

for (i = 0; i < N; i++) {
    result[threadId + i*imgSize] = constDelta * constW * constD
                                 * W[threadId + i*imgSize];
}
As you can see from the code above, constDelta, constW, and constD can be reused N times, and they live in registers, whose bandwidth is even higher than that of shared memory.
Reduce Operation
None of the algorithms I've talked about is complete, because what they actually produce is the raw, un-reduced gradient of γ, which looks like this:
I need to sum each vector on the left to get a single element. The first option is the cuBLAS API cublasSsbmv, which performs a matrix-vector multiplication: view the block on the left as a matrix, multiply it by a ones vector, and you get one row of the γ gradient; repeat N times to get the final result. But I noticed there is another API, cublasSgemmBatched, which performs batched matrix multiplication (and matrix-vector multiplication as a special case). So I ran an experiment to test which is faster:
A for loop of N matrix-vector multiplications vs. a single batched matrix-vector multiplication.
It turns out that the for loop is much faster, though I don't know exactly why; maybe it's because my N is small here.
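For reference, a minimal sketch of the for-loop variant, assuming the row buffers produced by the kernels above (I use the general matrix-vector routine cublasSgemv here rather than the cublasSsbmv mentioned above, and the helper and buffer names are placeholders):

#include <cublas_v2.h>

// Reduce one row buffer (N x imageSize, channel-major) into one row of the
// gamma gradient by multiplying it with a ones vector:
//   gradRow = rowBuf^T * ones, viewing rowBuf as an imgSize x N column-major matrix.
void reduceGammaRow(cublasHandle_t handle,
                    const float* rowBuf,   // device, imgSize * N floats
                    const float* ones,     // device, imgSize floats, all 1.0f
                    float* gradRow,        // device, N floats
                    int imgSize, int N)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T,
                imgSize, N,                // matrix is imgSize x N, lda = imgSize
                &alpha, rowBuf, imgSize,
                ones, 1,
                &beta, gradRow, 1);
}

// The for-loop variant is then just N of these calls, one per row of gamma.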
I won't show how to compute ∇β and ∇u, because they are similar to ∇γ. I'm sure there are further refinements or better designs than mine. CUDA optimization is often difficult for people who don't have a deep understanding of how the GPU is organized. Programmers used to CPUs have always benefited from modern operating systems and powerful compilers; the GPU, however, is quite different and more complex when it comes to writing efficient code, even though it is far more convenient than the old days of doing computation through graphics shaders. The ecosystem will still take a few years to mature.
Link to original article:
medium.com/@Lawliet032…