Notes on "Faster Neural Networks Straight from JPEG"

Introduction

JPEG is a common image storage format: rather than raw pixels, it stores the quantized DCT coefficients of each 8×8 block, most of whose retained energy is in the low frequencies. Current CNNs, however, convolve over RGB tensors, so the JPEG must first be decoded (inverse DCT) back to RGB. Operating directly on the DCT coefficients skips this decoding step and can speed up inference. This paper pursues that acceleration while preserving accuracy.
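As a minimal sketch of the transform involved (pure NumPy, not the paper's code), the orthonormal 8×8 DCT-II that JPEG applies to each block can be written as:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix; JPEG uses n = 8.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2.0 / n)

def block_dct(block):
    # 2-D DCT of one 8x8 pixel block: D @ block @ D.T
    D = dct_matrix(block.shape[0])
    return D @ block @ D.T
```

A constant block concentrates all its energy in the DC coefficient, which is what makes the subsequent quantization of high frequencies cheap.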

Method

To feed a tensor of DCT coefficients into a CNN, the author's main concern is how to align the feature maps.

Fig. 1

The three YCbCr channels obtained from the RGB transform are usually not the same size: because of chroma subsampling, the luma channel Y is larger, as shown in Fig. 1(a). Two different transforms, T1 and T2, are therefore designed for the luma and chroma channels respectively. The author tries three ideas: UpSampling, DownSampling, and Late-Concat. Downsampling Y to the chroma resolution discards information and reduces accuracy, so it is not pursued further. UpSampling, shown in Fig. 2(b), upsamples the chroma channels to the same size as Y before concatenating. Late-Concat instead passes Y through convolution blocks until its spatial size matches the chroma channels, then concatenates. The difference between the two approaches is whether alignment is done by learned convolutions or by upsampling.
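The shape bookkeeping of the two surviving strategies can be sketched as follows. All shapes are assumptions for illustration (a 224×224 input with 4:2:0 subsampling: Y yields a 28×28 grid of 64 DCT coefficients, Cb/Cr yield 14×14), and the average pooling merely stands in for Late-Concat's learned convolution blocks:

```python
import numpy as np

# Assumed shapes for a 224x224 JPEG with 4:2:0 chroma subsampling:
# Y gives 28x28 blocks of 64 DCT coefficients, Cb/Cr give 14x14.
y  = np.zeros((28, 28, 64))
cb = np.zeros((14, 14, 64))
cr = np.zeros((14, 14, 64))

def upsample2x(t):
    # Nearest-neighbour 2x upsampling of the spatial grid.
    return t.repeat(2, axis=0).repeat(2, axis=1)

def pool2x(t):
    # 2x average pooling, standing in for the learned conv blocks
    # that Late-Concat runs Y through before concatenation.
    h, w, c = t.shape
    return t.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# UpSampling: bring chroma up to Y's grid, then concatenate.
early = np.concatenate([y, upsample2x(cb), upsample2x(cr)], axis=-1)

# Late-Concat: process Y down to the chroma grid, then concatenate.
late = np.concatenate([pool2x(y), cb, cr], axis=-1)
```

Either way the network ends up with a single 192-channel tensor; the strategies differ in which spatial resolution that tensor lives at.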

Fig. 2 Schematic of the feature-map alignment methods

Results

To demonstrate the effectiveness of the DCT transform, the author feeds 8×8 and 4×4 DCT coefficients directly into network training. Fig. 3 shows that the corresponding DCT-Frozen and DCT-Frozen2 variants achieve accuracy comparable to the other methods, and Fig. 4 shows that the DCT approach strikes a good balance between computation speed and accuracy.
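One way to read the frozen-DCT variants is as a first convolution layer whose 64 filters are fixed to the 8×8 DCT basis images and applied with stride 8; this is exactly the block DCT. A NumPy sketch of this reading (an illustration, not the authors' implementation):

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2.0 / n)

D = dct_matrix(8)
# 64 frozen 8x8 filters: outer products of DCT basis rows,
# one filter per (u, v) frequency pair.
filters = np.einsum('ui,vj->uvij', D, D).reshape(64, 8, 8)

# Sliding them with stride 8 over one channel equals the block DCT.
img = np.arange(32.0 * 32.0).reshape(32, 32)
blocks = img.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3)   # (4, 4, 8, 8)
coeffs = np.einsum('fij,xyij->xyf', filters, blocks)     # (4, 4, 64)
```

Because the filters are frozen, this layer costs no gradient computation and its output matches what a JPEG decoder already has in hand.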

Fig. 3 Accuracy comparison of the different methods

Fig. 4 FLOPs comparison

Comparison

Learning in the Frequency Domain [2] improves on this paper in three ways. First, the YCbCr components are transformed directly to the same size, so the whole tensor can be processed uniformly, avoiding the early-stage channel-alignment complexity of this paper's method. Second, borrowing the idea of SENet [3], redundant frequency channels are removed by a gating mechanism, further simplifying the network. Finally, the DCT approach is extended to instance segmentation, demonstrating the generality of the method. In terms of classification accuracy alone, however, it improves little over this paper.
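The gating idea borrowed from SENet can be sketched as a soft squeeze-and-excitation gate over frequency channels. This is a simplification: [2] learns a harder channel-selection gate end-to-end, and all shapes and weights below are illustrative placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x  = rng.standard_normal((14, 14, 192))       # frequency-channel tensor
w1 = rng.standard_normal((192, 12)) * 0.1     # squeeze: 192 -> 12
w2 = rng.standard_normal((12, 192)) * 0.1     # excite: 12 -> 192

s = x.mean(axis=(0, 1))                       # global average pool per channel
gate = sigmoid(np.maximum(s @ w1, 0.0) @ w2)  # SE bottleneck: ReLU, sigmoid
y = x * gate                                  # reweight / suppress channels
```

Channels whose gate value stays near zero contribute little and can be pruned, which is what lets [2] discard redundant frequencies.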

Moreover, the model that Learning in the Frequency Domain [2] uses for comparison is the 8×8 variant, not the 4×4 DCT model that gave this paper's best experimental results.

Conclusion

Finally, [2] points out another advantage of using the DCT: by choosing the DCT kernel size appropriately, a larger RGB input can be accommodated. This paper, however, processes the JPEG-encoded result directly and never touches RGB.

Although frequency-domain learning yields a slight accuracy improvement, it holds little accuracy advantage over conventional RGB convolution; its main contribution is faster computation.

References

[1] Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., & Yosinski, J. (2018). Faster neural networks straight from JPEG. Advances in Neural Information Processing Systems, 31, 3933-3944.

[2] Xu, K., Qin, M., Sun, F., Wang, Y., Chen, Y. K., & Ren, F. (2020). Learning in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1740-1749).

[3] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132-7141).