
Please follow my official account [Jizhi Vision] for more notes like this.

Hi, I'm Jizhi Vision. This article discusses why convolution acceleration prefers the NHWC data layout.

The data layouts I have come across so far (mainly for convolution) include NCHW (PyTorch, Caffe), NHWC (TensorFlow, and also the preferred layout for TVM on GPU and for the Cambricon MLU core), CHW (TensorRT: when dynamic batch is not used, N sits outside the layout, leaving only three dimensions), and NC1HWC0 (the five-dimensional layout of the Huawei Ascend AI Core, where C0 is 32 for INT8 and 16 for FP16). The reason there are so many layouts is roughly twofold: different training frameworks, such as PyTorch and TensorFlow (most people's alchemy furnaces), made different choices, and at inference time you also have to consider which layout the hardware / inference engine prefers.

Here I will mainly explain why NHWC is better suited than NCHW (on GPU) for the img2col + GEMM and Winograd convolution acceleration algorithms; this is my personal understanding.

1. img2col + GEMM and Winograd algorithm principles

The detailed principle of img2col + GEMM can be found in my article "[Model Inference] Understanding the img2col Convolution Acceleration Algorithm". In short, the feature map is unrolled, spliced and stretched along the "trace" of the convolution kernel so that the convolution becomes a GEMM. Winograd is also quite common, and cuDNN uses it as well. I will only give a rough overview here and write about it in more detail later. The essence of Winograd's 3x3 conv2d acceleration is to reduce the number of multiply-add operations (mainly the multiplications). It still starts with img2col, then tiles the computation and pre-transforms the constant convolution kernel, and in the end fewer multiply-adds are needed, i.e. the convolution is accelerated.
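To make the img2col + GEMM idea a bit more concrete, here is a minimal NumPy sketch of my own (a single image, stride 1, no padding; the function name im2col and the shapes are just illustrative, not code from the article mentioned above). The window data is unrolled into columns, and the whole convolution then becomes a single matrix multiplication:

import numpy as np

def im2col(x, kh, kw):
    # x: one feature map of shape (C, H, W); returns columns of shape (C*kh*kw, out_h*out_w)
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + kh, j:j + kw]             # one convolution window
            cols[:, i * out_w + j] = patch.reshape(-1)   # unrolled into one column
    return cols

x = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)  # C=3, H=W=4, like the example below
w = np.ones((2, 3, 3, 3), dtype=np.float32)                  # 2 output channels, 3x3 kernel, C=3
out = (w.reshape(2, -1) @ im2col(x, 3, 3)).reshape(2, 2, 2)  # convolution done as one GEMM

The point is that once the data is in column form, the heavy lifting is a plain GEMM, which GPUs and BLAS libraries handle very efficiently.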

2. Why is NHWC better

In my opinion, NHWC improves the step of gathering feature map data compared with NCHW, because in both img2col + GEMM and Winograd the feature map and the kernel are unrolled right at the start.

At the start of a convolution you have to decide how to read the convolution window data out of the feature map. In NCHW the actual storage order is "RRRGGGBBB": all pixels of the same channel are stored together, plane by plane. In NHWC the actual storage order is "RGBRGBRGB": the values of all channels at the same pixel location are stored next to each other. (I wanted to draw a picture, but I'm too lazy, haha, so the diagrams and the small code check after them will have to do.)

# Feature map, NCHW layout, C = 3
# channel c=0      channel c=1     channel c=2
a0 b0 c0 d0    a1 b1 c1 d1   a2 b2 c2 d2
e0 f0 g0 h0    e1 f1 g1 h1   e2 f2 g2 h2
i0 j0 k0 l0    i1 j1 k1 l1   i2 j2 k2 l2
m0 n0 o0 p0    m1 n1 o1 p1   m2 n2 o2 p2
# kernel
A0 B0 C0        A1 B1 C1      A2 B2 C2
D0 E0 F0        D1 E1 F1      D2 E2 F2
G0 H0 I0        G1 H1 I1      G2 H2 I2
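And here is the small code check instead of a drawing (my own NumPy sketch, using the 4x4, three-channel feature map above) showing the two flattened storage orders:

import numpy as np

H, W, C = 4, 4, 3
hwc = np.arange(H * W * C).reshape(H, W, C)   # logical pixels, each holding C channel values

nchw_flat = hwc.transpose(2, 0, 1).ravel()    # NCHW: one whole channel plane after another
nhwc_flat = hwc.ravel()                       # NHWC: the C values of each pixel stay together

print(nchw_flat[:8])   # all from channel 0          -> the "RRRGGGBBB" order
print(nhwc_flat[:9])   # 3 pixels x 3 channels each  -> the "RGBRGBRGB" order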

With an NCHW layout and row-major storage, the feature map sits in memory as a0b0c0d0e0f0g0h0i0j0k0l0m0n0o0p0 a1b1c1d1e1f1g1h1i1j1k1l1m1n1o1p1 a2b2c2d2e2f2g2h2i2j2k2l2m2n2o2p2. For a 3x3 convolution kernel, one convolution window then needs C * kernel_h data-fetch operations on the feature map: a0b0c0, e0f0g0, i0j0k0, a1b1c1, e1f1g1, i1j1k1, a2b2c2, e2f2g2 and i2j2k2. With C = 3 channels and kernel_h = 3, that is 9 fetches.
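A quick way to see this is to compute the flat offsets of the window elements (my own sketch; nchw_offset is just an illustrative helper for H = W = 4, C = 3):

H, W, C, kh, kw = 4, 4, 3, 3, 3

def nchw_offset(c, h, w):
    return (c * H + h) * W + w    # row-major NCHW address of one element

# offsets of the 3x3 window at (h=0, w=0): contiguous only within one kernel row of one channel
runs = [[nchw_offset(c, h, w) for w in range(kw)] for c in range(C) for h in range(kh)]
print(len(runs))   # 9 separate contiguous runs -> 9 fetches (C * kernel_h)
print(runs[:3])    # [[0, 1, 2], [4, 5, 6], [8, 9, 10]]  (channel 0, rows a, e, i)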

Now let's look at NHWC:

# Feature map, NHWC layout, C = 3 (the three channels drawn as stacked planes)
      a2 b2 c2 d2
   a1 b1 c1 d1 h2
a0 b0 c0 d0 h1 l2
e0 f0 g0 h0 l1 p2
i0 j0 k0 l0 p1
m0 n0 o0 p0

Can you picture it? Intuitively, the front plane is the first channel (c=0), the plane behind it is the second channel (c=1), and the back plane is the third channel (c=2); it is a bit abstract. The kernel stays the same, as follows:

# kernel
A0 B0 C0        A1 B1 C1      A2 B2 C2
D0 E0 F0        D1 E1 F1      D2 E2 F2
G0 H0 I0        G1 H1 I1      G2 H2 I2

With an NHWC layout, the feature map sits in memory as a0a1a2b0b1b2c0c1c2d0d1d2e0e1e2f0f1f2g0g1g2h0h1h2i0i1i2j0j1j2k0k1k2l0l1l2m0m1m2n0n1n2o0o1o2p0p1p2. For the same 3x3 convolution, one convolution window now needs only three data-fetch operations on the feature map: a0a1a2b0b1b2c0c1c2, e0e1e2f0f1f2g0g1g2 and i0i1i2j0j1j2k0k1k2. So for a single convolution window, NHWC saves 6 fetch operations compared with NCHW.
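The same offset check for NHWC (again my own sketch, H = W = 4, C = 3; nhwc_offset is an illustrative helper) shows that each kernel row now covers all channels in one contiguous run:

H, W, C, kh, kw = 4, 4, 3, 3, 3

def nhwc_offset(h, w, c):
    return (h * W + w) * C + c    # row-major NHWC address of one element

# offsets of the 3x3 window at (h=0, w=0): one contiguous run of kw * C elements per kernel row
runs = [[nhwc_offset(h, w, c) for w in range(kw) for c in range(C)] for h in range(kh)]
print(len(runs))   # 3 contiguous runs -> 3 fetches (kernel_h)
print(runs[0])     # [0, 1, 2, 3, 4, 5, 6, 7, 8]  (pixels a, b, c with all their channels)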

To sum up: for one convolution window, NHWC needs kernel_h fetches while NCHW needs kernel_h * C fetches. NHWC therefore gives better memory access for convolution acceleration, and the advantage grows as the number of channels C increases.

That's all for this discussion of how the NHWC layout helps accelerate convolution. I hope it is of some help to your studies.


[Official account article] Why convolution acceleration prefers the NHWC layout