Preface
For neural networks, we usually talk about accuracy, where an improvement of a few tenths of a percent counts as progress. When it comes to speed, however, deep neural networks may not be as fast as traditional algorithms.
So when do we need to speed up? Suppose you face one of the following scenarios:
- You run the model on a phone
- You need real-time processing, for example a high-speed camera capturing motion
- You run the model on embedded devices
Speed may seem easy to get with a desktop-grade graphics card, but on these devices an algorithm is useless if it cannot run fast enough on the hardware that is actually available.
Leaving hardware-level optimization aside, speeding up a neural network comes down to two things:
- Network Design
- Size of input data
Let us set the input data size aside; the design of the network matters more. It can be broken down into the size of the model weights, the size of the intermediate variables produced while the network runs, and the execution speed of the various operations used in the network. All of these affect speed; in general, running time grows with the number of model parameters and the amount of computation in the model.
Speed and accuracy are usually a trade-off, and it is generally hard to have both. Take YOLO, which is very popular in industry, and Mask-RCNN, which is well known in academia: one pursues speed and the other pursues accuracy (of course, speed only matters as long as accuracy stays within an acceptable range).
The computation
Those who are familiar with ACM will know the concepts of time complexity and space complexity. Time complexity measures how long an algorithm runs, while space complexity measures how much memory it occupies. Neural networks are similar. The most straightforward way to judge how fast a network is is to count how many floating-point operations it performs (although real speed also depends on memory bandwidth).
Closely related to this concept is FLOPS (floating-point operations per second). Most desktop graphics cards today are rated in TFLOPS; 1 TFLOPS means 1,000,000,000,000 floating-point operations per second. Note that FLOPs with a lowercase s, as used in the rest of this article, is a count of operations, not a rate.
Matrix multiplication
In neural networks, the most common operation is matrix multiplication:
As shown in the image below with input 4×4, the convolution kernel is 3×3 and the output is 2×2:
The above operations are decomposed into:
The calculation process of a scalar in the result can be expressed by the formula:
y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + ... + w[n-1]*x[n-1]
Here w and x are both vectors: w is the weight and x is the input. The final result y is a scalar. Each step of the form w[i]*x[i] added to the running sum is called a multiply-accumulate operation (MACC). The above calculation contains n MACCs in total.
In short, the dot product of two n-dimensional vectors requires n MACCs (strictly there are only n-1 additions, since the first product has nothing to be added to, but we count it as n). These n MACCs consist of 2n-1 FLOPs (n multiplications and n-1 additions), which we approximate as 2n FLOPs.
In other words, the dot product of two n-dimensional vectors requires about 2n FLOPs.
Of course, on much hardware (such as graphics cards) a MACC is executed as a single operation rather than a separate multiplication and addition, because the hardware is heavily optimized for it. Later on we measure convolution operations in units of MACCs.
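The counting rule above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `dot_with_counts` is made up for illustration, and the FLOP count uses the 2n approximation from the text rather than the exact 2n-1.

```python
def dot_with_counts(w, x):
    """Dot product of two equal-length vectors, counting the work done."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    y = 0.0
    maccs = 0
    for wi, xi in zip(w, x):
        y += wi * xi  # one multiply-accumulate (MACC)
        maccs += 1
    flops = 2 * maccs  # approximation used in the text (exactly 2n - 1)
    return y, maccs, flops


y, maccs, flops = dot_with_counts([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(y, maccs, flops)  # 32.0 3 6
```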
The fully connected layer
The fully connected layer is the most common layer besides the convolution layer. In a fully connected layer, the number of inputs is I and the number of outputs is O. Every input is connected to every output, and the weights W are stored in a matrix of size I x O.
y = matmul(x,W) + b
(from:Leonardoaraujosantos. Gitbooks. IO/artificial -…)
In the formula above (combined with the figure above), the input dimension I is 3, so x is a 3-dimensional vector, and the output y is a 2-dimensional vector. The weight matrix W therefore has 3 x 2 entries, plus the bias b.
We want to know how many MACCs a fully connected layer executes. Let us first look at the matmul operation in the layer, which is a matrix operation.
A matrix operation is essentially a collection of multiplications and additions. The dimension of the input is I, the dimension of the output is O, and the dimension of W in between is I x O (3×2 in the figure above). So it is easy: the layer performs I x O MACCs, which is the same as the number of elements in the weight matrix.
The bias b is not included in this count. It is negligible and is normally left out of the calculation.
In some formulations of the fully connected layer the bias is moved into the matrix instead of being added as a vector after the matrix operation, which gives a matrix operation of size (I + 1) x O. This only simplifies the notation and does not meaningfully change the amount of computation.
For example, a fully connected layer with 100 inputs and 200 outputs executes 100 x 200 = 20,000 MACCs.
In general, a layer that maps I inputs to O outputs performs I x O MACCs, or (2I - 1) x O FLOPs.
The fully connected layer operates on vectors and is usually placed after the convolution layers. In code, the output of the convolution therefore has to be flattened first. Flatten should be familiar: it reshapes a tensor of shape (N, C, H, W) into shape (N, I), with I = C x H x W, so that the fully connected operation can be applied.
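As a small sketch of the counts above, the helpers below compute MACCs and FLOPs for a fully connected layer. The function names are hypothetical, and the 512 x 7 x 7 feature map feeding 4096 outputs is an illustrative assumption, not a figure from this article.

```python
def fc_maccs(num_inputs, num_outputs):
    """MACCs of a fully connected layer: one per weight, bias ignored."""
    return num_inputs * num_outputs


def fc_flops(num_inputs, num_outputs):
    """FLOPs of the same layer: I multiplications and I - 1 additions per output."""
    return (2 * num_inputs - 1) * num_outputs


def flattened_size(channels, height, width):
    """Length of the vector obtained by flattening one (C, H, W) feature map."""
    return channels * height * width


print(fc_maccs(100, 200))  # 20000
print(fc_flops(100, 200))  # 39800

# Hypothetical example: a 512 x 7 x 7 feature map flattened into a 4096-unit layer.
print(fc_maccs(flattened_size(512, 7, 7), 4096))  # 102760448
```

The last line shows why a fully connected layer placed directly after a large feature map can dominate the cost of a network.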
The activation function
Usually we add a nonlinear activation function, such as ReLU or Sigmoid, after a convolution layer or fully connected layer. Here we measure the computation in FLOPs rather than MACCs, because an activation function does not involve a dot product.
For ReLU:
y = max(x, 0)
Here x is the input, which is the output of another layer. If the previous layer passes a vector of n elements to the ReLU layer, ReLU is applied to each of the n elements, which costs n FLOPs.
For Sigmoid:
y = 1 / (1 + exp(-x))
The above formula contains an addition, a subtraction (the negation), a division and an exponentiation. We count each of them as a single FLOP (just as we would a multiplication or a square root), so one sigmoid costs four FLOPs. For an input of n elements that is 4 x n FLOPs.
But usually we only care about the big matrix operations, and we don’t care about that kind of computation.
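To see why activations are usually ignored, the sketch below puts the activation cost next to the cost of the layer that feeds it. The helper names are invented for illustration, and the 100-input, 200-output layer is reused from the fully connected example.

```python
import math


def relu(x):
    """ReLU on a single value: one comparison, counted as one FLOP."""
    return max(x, 0.0)


def sigmoid(x):
    """Sigmoid on a single value: negate, exp, add, divide = four FLOPs."""
    return 1.0 / (1.0 + math.exp(-x))


def relu_flops(n):
    return n


def sigmoid_flops(n):
    return 4 * n


# A fully connected layer with 100 inputs and 200 outputs, followed by sigmoid.
layer_flops = (2 * 100 - 1) * 200
activation_flops = sigmoid_flops(200)
print(layer_flops, activation_flops)  # 39800 800
```

Under this counting convention the activation accounts for only about 2% of the layer's FLOPs.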
Convolution layer
The main object processed by a convolution layer is not the vector discussed so far but a three-dimensional tensor of shape (C, H, W), where C is the number of channels and H and W are the height and width of the feature map.
For a convolution layer with kernel size K (only square kernels are considered here, which is also what is normally used), the required number of MACCs is:
K x K x Cin x Hout x Wout x Cout
How it comes about:
- The output feature map has Hout x Wout pixels, and the calculation is carried out for each of them
- For each pixel, the weights are applied to a K x K window of the input feature map
- The input feature map has Cin channels
- The convolution is repeated for each of the Cout output channels
We ignore the bias here: it is normally counted when computing the number of parameters, but ignored when computing FLOPs.
For example, if the input is a 256 x 256 image with three channels, the kernel size is 3 and the number of convolution kernels (output channels) is 128, then, assuming the output keeps the 256 x 256 size, the total amount of computation is:
256 x 256 x 3 x 3 x 3 x 128 = 226,492,416
That is about 226 million MACCs, which is a lot of computation.
The stride used above is 1, that is, the kernel moves across the feature map one pixel at a time. If the stride of the above convolution layer is 2, it is equivalent to convolving an image of half the size: the 256×256 above becomes 128×128, and the number of MACCs drops to a quarter.
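The formula and the stride remark can be reproduced with a short sketch. The function name `conv_maccs` is hypothetical, and the output sizes are passed in directly, which assumes the padding is chosen so that stride 1 keeps 256 x 256 and stride 2 gives 128 x 128.

```python
def conv_maccs(kernel_size, c_in, c_out, h_out, w_out):
    """MACCs of a regular convolution with a square kernel, bias ignored."""
    return kernel_size * kernel_size * c_in * h_out * w_out * c_out


# 3 x 3 convolution, 3 input channels, 128 output channels.
stride_1 = conv_maccs(3, 3, 128, 256, 256)
stride_2 = conv_maccs(3, 3, 128, 128, 128)

print(stride_1)  # 226492416
print(stride_2)  # 56623104
print(stride_1 // stride_2)  # 4
```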
Depthwise separable convolution
Depthwise separable convolution is the basic building block of many efficient networks, such as MobileNet and Xception. Both use depthwise convolution. The structure is not complex and can be divided into two parts:
(from machinethink.net/blog/mobile…).
Note that "depthwise separable convolution" in the following paragraphs corresponds to Depthwise Separable Convolution, which consists of a depthwise convolution and a pointwise convolution (also known as a 1×1 convolution).
A depthwise convolution is similar to a regular convolution, except that it no longer merges the three channels (RGB) into one. A regular convolution kernel convolves each of the three channels and then sums the results into a single channel. In a depthwise convolution, three input channels directly produce three output channels, each with its own independent set of weights. Each kernel corresponds to exactly one channel (one channel in, one channel out).
There is also a concept called the depthwise channel multiplier. If the multiplier is greater than 1, say 5, then each input channel produces five output channels. This parameter is a hyperparameter for scaling the model.
The number of operations performed is:
K x K x C x Hout x Wout
Notice that, compared with the regular convolution above, the multiplication by the number of output channels is gone, so the amount of computation is greatly reduced.
For example, if a 3×3 depthwise convolution is applied to a 112 x 112 feature map with 64 channels, the MACCs we need are:
3 x 3 x 64 x 112 x 112 = 7,225,344
For the pointwise convolution operation, the required operation is:
Cin x Hout x Wout x Cout
Here K, the kernel size, is 1.
Similarly, if we have data of dimension 112x112x64 and use a pointwise convolution to project it to 128 channels, producing data of dimension 112x112x128, the MACCs we need are:
64 x 112 x 112 x 128 = 102,760,448
It can be seen that the pointwise convolution requires far more computation than the depthwise convolution.
Let us put these two operations together and compare them with a regular 3×3 convolution:
3 x 3 depthwise           : 7,225,344
1 x 1 pointwise           : 102,760,448
depthwise separable       : 109,985,792 MACCs
regular 3 x 3 convolution : 924,844,032 MACCs
It can be seen that the amount of computation drops by a factor of more than 8 (about 8.4).
This comparison is not entirely fair, because a regular 3×3 convolution can learn more complete information. But for the same amount of computation as one traditional 3×3 convolution we can afford more than eight depthwise separable convolutions, and by that measure the gap becomes clear.
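The comparison above can be reproduced with three small helpers. This is a sketch with invented function names; it uses the same 112 x 112 feature map, 64 input channels and 128 output channels as the text.

```python
def depthwise_maccs(kernel_size, channels, h_out, w_out):
    """One K x K filter per channel, no mixing across channels."""
    return kernel_size * kernel_size * channels * h_out * w_out


def pointwise_maccs(c_in, c_out, h_out, w_out):
    """1 x 1 convolution that mixes c_in channels into c_out channels."""
    return c_in * h_out * w_out * c_out


def regular_conv_maccs(kernel_size, c_in, c_out, h_out, w_out):
    """Regular K x K convolution, for comparison."""
    return kernel_size * kernel_size * c_in * h_out * w_out * c_out


dw = depthwise_maccs(3, 64, 112, 112)
pw = pointwise_maccs(64, 128, 112, 112)
separable = dw + pw
regular = regular_conv_maccs(3, 64, 128, 112, 112)

print(dw)         # 7225344
print(pw)         # 102760448
print(separable)  # 109985792
print(regular)    # 924844032
print(round(regular / separable, 1))  # 8.4
```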
For the calculation of the number of parameters in the model, see this article: Deep Learning: How to calculate the model and the video memory footprint of the intermediate variables.
Let us sort it out. The MACCs required for a depthwise separable convolution are:
(K x K x Cin x Hout x Wout) + (Cin x Hout x Wout x Cout)
Simplified as:
Cin x Hout x Wout x (K x K + Cout)
If we compare that with the regular 3×3 convolution, the only difference is that the final + Cout here is x Cout in the regular convolution. This small difference makes a huge difference in performance.
The speedup of a depthwise separable convolution over a regular convolution can be taken as roughly K x K (that is, the larger the kernel, the larger the speedup). In the 3×3 example above, the measured factor of 8.4 is indeed close to 3×3 = 9.
In fact, the exact ratio is: K x K x Cout / (K x K + Cout)
In addition, note that, as with a regular convolution, the stride can be greater than 1. In that case the first part, the depthwise convolution, reduces the size of the output feature map, while the second part, the pointwise convolution, preserves the spatial size of its input.
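The ratio above depends only on K and Cout, which makes it easy to tabulate. The sketch below uses a hypothetical helper `separable_speedup` and a few arbitrarily chosen channel counts to show how the ratio approaches K x K = 9 as Cout grows.

```python
def separable_speedup(kernel_size, c_out):
    """Ratio of regular-convolution MACCs to depthwise-separable MACCs."""
    k2 = kernel_size * kernel_size
    return (k2 * c_out) / (k2 + c_out)


for c_out in (16, 128, 1024):
    print(c_out, round(separable_speedup(3, c_out), 2))
# 16 5.76
# 128 8.41
# 1024 8.92
```

With few output channels the depthwise part is no longer negligible, so the gain is noticeably below 9.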
The depthwise separable convolution introduced above is the classic structure in MobileNet V1. In MobileNet V2 this structure is changed slightly: it gains an expansion part and a contraction part:
- The first part is a 1×1 convolution, which adds more channels to the input feature map (this can be thought of as the expansion_layer).
- The second part is the 3×3 depthwise convolution already described.
- The third part is another 1×1 convolution, which reduces the number of channels of its input feature map (this is called the projection_layer, also known as the bottleneck convolution).
Let us look at the amount of computation for the structure above:
Cexp = (Cin × expansion_factor)
expansion_layer = Cin × Hin × Win × Cexp
depthwise_layer = K × K × Cexp × Hout × Wout
projection_layer = Cexp × Hout × Wout × Cout
Cexp in the above formula is the number of channels after the expansion layer. Neither the expansion layer nor the bottleneck layer changes the H and W of the feature map, but the depthwise layer does when its stride is greater than 1, so Hin, Win and Hout, Wout are sometimes different.
Simplifying the above formula gives:
Cin x Hin x Win x Cexp + (K x K + Cout) x Cexp x Hout x Wout
When stride = 1, the above expression simplifies to: (K x K + Cout + Cin) x Cexp x Hout x Wout
To compare with MobileNet V1, we again use a 112x112x64 input, with an expansion_factor of 6 and a 3×3 depthwise convolution with stride 1. In this case the amount of computation of V2 is:
(3 × 3 + 128 + 64) × (64 × 6) × 112 × 112 = 968,196,096
We find that the amount of computation is much greater than in V1, and even greater than that of the regular 3×3 convolution. The reason is simple: we set the expansion factor to 6, so the block computes with 64 x 6 = 384 channels. It therefore learns more parameters than the earlier 64 -> 128 mapping, and the amount of computation grows accordingly.
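The three terms of the V2 block can be added up in a short sketch. The function name and its `stride` parameter are illustrative assumptions; in particular, the output size is taken to be the input size divided by the stride, which presumes suitable padding.

```python
def mobilenet_v2_block_maccs(kernel_size, c_in, c_out, h_in, w_in,
                             expansion_factor, stride=1):
    """MACCs of expansion (1x1) + depthwise (KxK) + projection (1x1)."""
    c_exp = c_in * expansion_factor
    # Assumption: padding is chosen so the depthwise layer divides H and W by the stride.
    h_out, w_out = h_in // stride, w_in // stride
    expansion = c_in * h_in * w_in * c_exp
    depthwise = kernel_size * kernel_size * c_exp * h_out * w_out
    projection = c_exp * h_out * w_out * c_out
    return expansion + depthwise + projection


total = mobilenet_v2_block_maccs(3, 64, 128, 112, 112, expansion_factor=6)
print(total)  # 968196096
```

Most of the cost sits in the two 1×1 convolutions, because both operate on the expanded 384 channels.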
Batch normalization (BatchNorm)
Apart from convolution, batch normalization is arguably an essential operation in modern neural networks. It is usually placed after a convolution layer or fully connected layer and before the activation function. For the output y of the previous layer, batch normalization does the following:
z = gamma * (y - mean) / sqrt(variance + epsilon) + beta
The output y of the previous layer is first normalized: the mean is subtracted and the result is divided by the standard deviation (epsilon, 0.001 here, avoids division by zero). The normalized value is then multiplied by gamma and shifted by beta. These two parameters are learnable.
That is, each channel needs 4 parameters (mean, variance, gamma and beta), so for C channels batch normalization holds C x 4 parameters.
That may look like a lot of extra computation, but in practice batch normalization can be merged into the preceding convolution or fully connected layer, which speeds things up even further. We will not discuss it here.
In conclusion, we generally leave batch normalization out when discussing the amount of computation in a model, because at inference time it does not have to run as a separate operation.
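For readers who want to see why the merge is possible, here is a minimal pure-Python sketch for a fully connected layer. It is not taken from the article: the function names are hypothetical, `weights[o]` is assumed to hold the weights feeding output o, and the statistics are assumed to be fixed, as they are at inference time.

```python
import math


def linear(x, weights, bias):
    """y[o] = sum_i weights[o][i] * x[i] + bias[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batchnorm(y, gamma, beta, mean, variance, epsilon=0.001):
    """z = gamma * (y - mean) / sqrt(variance + epsilon) + beta, per output."""
    return [g * (v - m) / math.sqrt(var + epsilon) + b
            for v, g, b, m, var in zip(y, gamma, beta, mean, variance)]


def fold_batchnorm(weights, bias, gamma, beta, mean, variance, epsilon=0.001):
    """Return new weights and bias that include the batch normalization."""
    folded_w, folded_b = [], []
    for o, row in enumerate(weights):
        scale = gamma[o] / math.sqrt(variance[o] + epsilon)
        folded_w.append([w * scale for w in row])
        folded_b.append((bias[o] - mean[o]) * scale + beta[o])
    return folded_w, folded_b


# Made-up numbers: 3 inputs, 2 outputs.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, variance = [0.3, -0.4], [2.0, 0.5]
x = [1.0, 2.0, 3.0]

two_steps = batchnorm(linear(x, W, b), gamma, beta, mean, variance)
W_folded, b_folded = fold_batchnorm(W, b, gamma, beta, mean, variance)
one_step = linear(x, W_folded, b_folded)
print(all(math.isclose(p, q) for p, q in zip(two_steps, one_step)))  # True
```

After folding, the layer costs exactly the same I x O MACCs as before, which is why batch normalization adds nothing to the inference-time count.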
Other layers
Besides the basic layers above (convolution, fully connected, special convolutions, batch normalization), pooling layers also contribute some computation, but compared with convolution and fully connected layers it is negligible. Moreover, in newer network designs pooling layers can be replaced by convolutions, so we generally do not focus on these layers.
The next step
This article has briefly discussed the amount of computation in a model. How fast a network runs is related to its amount of computation, but not only to that: the size of the network, the precision of its parameters, the optimization of intermediate variables, mixed precision and so on all play a part in speed. For reasons of space, these will be discussed in the next article.
This article is from OLDPAN blog, welcome to visit: OLDPAN blog