
Follow my official account [Jizhi Vision] for more notes like this one.

Hi, I’m Jizhi Vision. In this post I explain in detail how direct convolution is computed and how the Img2Col algorithm accelerates convolution.

1. Direct convolution calculation

Direct convolution is, as the name suggests, very direct: it is the way most people intuitively picture convolution when first learning about convolutional neural networks.

Direct convolution follows the defining computation of the convolution layer. The weight matrix of the convolution kernel slides across the zero-padded input feature map; at each position, a sub-matrix of the input with the same size as the weight matrix is selected and an element-wise multiply-accumulate (a dot product) is performed between the two.

Concretely, consider a single-channel 5 x 5 input feature map X and two convolution kernels. Kernel 1 has weight matrix W1 and bias b1 whose elements are all 0; kernel 2 has weight matrix W2 and bias b2 whose elements are all 1. Both weight matrices are 3 x 3, the stride is 2, and the padding (pad) is 1.

The convolution first zero-pads the input feature matrix X. Since pad is 1, the matrix is extended outward by one ring of zeros, giving the padded matrix Xp.

The computation proper starts with W1 at the initial position X1 in the Xp matrix: the weight elements and the corresponding elements of the input feature sub-matrix at X1 are multiplied element-wise and summed, and the bias is added, giving -2 as the first element of the output feature matrix Y1. The calculation is: 0 x 1 + 0 x 0 + 0 x 1 + 0 x 0 + 1 x 0 + 1 x 0 + 1 x (-1) + 0 x 0 + 1 x (-1) + 0 = -2. After the first element of Y1, the horizontal slide begins: W1 moves 2 elements to the right so that it overlaps the input feature sub-matrix X2 in Xp, and the dot product plus bias gives -3, the second element of Y1. Continuing in this way yields all elements of the first row of Y1. For the second row, W1 slides vertically down 2 elements from the position of X1, giving 1 as the first element of the second row of Y1; the stride thus governs both the horizontal and the vertical movement of the kernel over the input feature map. After 9 dot products between the weights and the input feature sub-matrices, all elements of the output feature map Y1 are obtained. Applying the same procedure with kernel 2 gives the output feature map Y2.
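The sliding-window procedure above can be sketched in NumPy. The article's figure with the concrete values of X and W1 is not reproduced here, so the values below are illustrative stand-ins with the same shapes (5 x 5 input, 3 x 3 kernel, stride 2, pad 1):

```python
import numpy as np

def direct_conv2d(X, W, b, stride=2, pad=1):
    """Direct convolution: slide W over the zero-padded input X and
    take a dot product (multiply-accumulate) at each position."""
    Xp = np.pad(X, pad)                      # extend X by one ring of zeros
    kh, kw = W.shape
    oh = (Xp.shape[0] - kh) // stride + 1    # output height
    ow = (Xp.shape[1] - kw) // stride + 1    # output width
    Y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = Xp[i*stride:i*stride+kh, j*stride:j*stride+kw]
            Y[i, j] = np.sum(patch * W) + b  # dot product plus bias
    return Y

# Illustrative stand-in values (not the article's figure).
X  = np.arange(25, dtype=float).reshape(5, 5)
W1 = np.array([[ 1., 0.,  1.],
               [ 0., 0.,  0.],
               [-1., 0., -1.]])
Y1 = direct_conv2d(X, W1, b=0.0)
print(Y1.shape)   # (3, 3): nine sliding positions, as in the text
```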

This completes the direct convolution of the input feature map with the weight matrices. In practice, each convolution kernel has multiple channels: the weight matrix in each channel of a kernel is convolved with the input feature map of the same channel, and once all channels of the kernel have been processed, the per-channel results are summed to give the kernel's final output feature map.
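The multi-channel case described above can be sketched as follows; the function name and shapes are assumptions for illustration, not code from the article:

```python
import numpy as np

def direct_conv2d_multichannel(X, W, b, stride=2, pad=1):
    """Direct convolution for one multi-channel kernel.
    X: (C, H, W) input feature map; W: (C, kh, kw) kernel weights.
    Each channel is convolved with its own weight plane, and the
    per-channel results are summed into a single output map."""
    Xp = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))  # pad H and W only
    C, kh, kw = W.shape
    oh = (Xp.shape[1] - kh) // stride + 1
    ow = (Xp.shape[2] - kw) // stride + 1
    Y = np.zeros((oh, ow))
    for c in range(C):                                # sum over channels
        for i in range(oh):
            for j in range(ow):
                patch = Xp[c, i*stride:i*stride+kh, j*stride:j*stride+kw]
                Y[i, j] += np.sum(patch * W[c])
    return Y + b                                      # bias added once
```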

2. Img2Col convolution acceleration algorithm

Img2Col implements convolution as matrix multiplication and is widely used on CPUs, GPUs, and other general-purpose programmable compute chips.

For the convolution of an input feature map with the kernel weight matrices, the dot products between the input feature sub-matrices and the weight matrices are converted into the row-by-column multiply-accumulate operations of a matrix multiplication, so the large amount of computation in the convolution layer is mapped onto matrix operations, which are inherently parallel. The convolution can therefore be carried out efficiently as long as matrix multiplication is implemented efficiently on the processor. Both CPUs and GPUs provide specialized Basic Linear Algebra Subprograms (BLAS) libraries for exactly this purpose: efficient vector and matrix operations.

Img2Col's expansion works as follows: each input feature sub-matrix is unrolled into a row (or a column), producing a new input feature matrix with one row per input feature sub-matrix. At the same time, each kernel's weight matrix is unrolled into a column (or a row), and the weight matrices of multiple kernels are laid out as multiple columns.

Through Img2Col, the input feature sub-matrix X1 in Xp is unrolled into the first row of the new matrix XI2c, and the input feature sub-matrix X2 into the second row. Since there are 9 input feature sub-matrices, unrolling them all in this way produces the final 9-row XI2c matrix. Similarly, the weight matrices W1 and W2 of the two convolution kernels are unrolled column-wise into the matrix WI2c, and the biases are unrolled into the matrix bI2c. Next comes the matrix multiplication: the first row of XI2c times the first column of WI2c, plus the bias, gives the first element of YI2c, and computing the remaining rows and columns in turn yields the whole output feature matrix YI2c.
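A minimal sketch of this unrolling, under the same setup (5 x 5 input, two 3 x 3 kernels, stride 2, pad 1); the concrete matrix values are illustrative stand-ins since the article's figures are not reproduced:

```python
import numpy as np

def im2col(X, kh, kw, stride=2, pad=1):
    """Unroll each kh x kw sub-matrix of the padded input into one row."""
    Xp = np.pad(X, pad)
    oh = (Xp.shape[0] - kh) // stride + 1
    ow = (Xp.shape[1] - kw) // stride + 1
    rows = [Xp[i*stride:i*stride+kh, j*stride:j*stride+kw].ravel()
            for i in range(oh) for j in range(ow)]
    return np.stack(rows), (oh, ow)          # (oh*ow, kh*kw) matrix

# Illustrative input and kernels (stand-ins for the article's figures).
X  = np.arange(25, dtype=float).reshape(5, 5)
W1 = np.random.randn(3, 3)
W2 = np.random.randn(3, 3)

XI2c, out_shape = im2col(X, 3, 3)            # 9 rows of 9 elements each
WI2c = np.stack([W1.ravel(), W2.ravel()], axis=1)  # each kernel -> one column
bI2c = np.array([0.0, 1.0])                  # biases b1 = 0, b2 = 1

YI2c = XI2c @ WI2c + bI2c                    # one matmul does all 9 dot products
Y1 = YI2c[:, 0].reshape(out_shape)           # output feature map of kernel 1
```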

The role of Img2Col is to carry out the convolution via matrix multiplication, which means the sub-matrices needed during the computation are laid out contiguously in memory; the data can then be fetched directly in the format in which it will be used, reducing the number of memory accesses and hence the total computation time. In direct convolution, by contrast, the input feature sub-matrices are stored at overlapping, non-contiguous addresses, so a single computation may require several memory accesses. Since each extra memory access adds data-transfer time and further slows the convolution, Img2Col both facilitates convolution acceleration and provides the necessary foundation for turning convolution into matrix multiplication.

If the kernel's bias is a constant, the bias addition can also be optimized away: the kernel's weight matrix is augmented with the bias value, and a matching coefficient column is appended to the input feature matrix, so the matrix multiplication and the bias accumulation are done in a single matrix multiply.

A coefficient matrix I is appended to the XI2c matrix, where I is a 9 x 1 matrix whose elements are all 1, and a bias matrix b is appended to the WI2c matrix, where b = [0 1]. The calculation formula then becomes YI2c = [XI2c I] x [WI2c; b], i.e. I is stacked as an extra column of XI2c and b as an extra row of WI2c.
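This augmented-matrix trick can be checked numerically; the XI2c and WI2c values below are random stand-ins for the article's matrices:

```python
import numpy as np

# Random stand-ins with the shapes from the example:
# 9 unrolled sub-matrices of 9 elements, two kernels, biases [0, 1].
XI2c = np.random.randn(9, 9)
WI2c = np.random.randn(9, 2)
b = np.array([0.0, 1.0])

I = np.ones((9, 1))                      # coefficient column of ones
X_aug = np.hstack([XI2c, I])             # [XI2c I]  -> shape (9, 10)
W_aug = np.vstack([WI2c, b[None, :]])    # [WI2c; b] -> shape (10, 2)

Y_aug = X_aug @ W_aug                    # one matmul, bias folded in
Y_ref = XI2c @ WI2c + b                  # multiply, then add bias
print(np.allclose(Y_aug, Y_ref))         # True
```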

By increasing the matrix dimensions, the entire computation uses nothing but matrix multiplication, which saves computing resources on the hardware and simplifies the computation steps.

Convolution on modern processors can also be implemented in other ways, such as the Fast Fourier Transform (FFT) and the Winograd algorithm, which will be covered later. All of these methods equivalently transform the complex convolution operation into simpler operations in another space, thereby reducing computational complexity. It is worth mentioning that the Winograd algorithm is used in the convolution routines of the cuDNN library provided by NVIDIA. In a convolutional neural network, the convolution layers generally account for the vast majority of the network's parameters and computation, so accelerating them sensibly can greatly improve the computation speed of the whole network and the execution efficiency of the system.

