Recently, CVPR 2022 officially announced its list of accepted papers. Twelve papers from ByteDance's Intelligent Creation team were accepted, including one Oral presentation.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), first held in 1983, is a leading conference in the field of computer vision and pattern recognition. Every year it attracts submissions from universities, research institutes, and technology companies, and many important computer vision results are published at CVPR.

Below, we share the core contributions of these CVPR 2022 papers and take a look at some of the most cutting-edge research in computer vision.

Dressing in the Wild by Watching Dance Videos

The paper was co-authored by ByteDance and Sun Yat-sen University.

This paper tackles virtual try-on for people in complex poses in real-world scenes and proposes wFlow, a self-supervised model trained on videos that combines 2D and 3D appearance flow. It brings significant improvements on challenging loose garments and complex poses, and supports both full-body and partial garment transfer. The authors also construct Dance50k, a new large-scale video dataset covering diverse clothing types and complex human poses, to promote research on virtual try-on and other human-centric image generation.

Existing virtual try-on methods lack awareness of the underlying 3D structure of the human body and are trained on datasets with limited pose and garment diversity, so they are restricted to simple poses and tight clothing, which greatly limits their applicability in real scenes. This paper introduces Dance50k, a new real-world video dataset, and combines 2D pixel flow with 3D vertex flow into a more general appearance-flow prediction module named wFlow, which handles the deformation of loose garments while adapting to complex poses. Through cross-frame self-supervised training on Dance50k and online cycle optimization for hard examples, experiments show that wFlow generalizes better to real-world images than existing pixel-only or vertex-only appearance-flow methods, outperforming other SOTA approaches and providing a more general solution for virtual try-on.
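The full wFlow architecture is not reproduced here, but the core operation behind any appearance-flow method, warping a garment image with a predicted dense flow field, can be sketched in a few lines of PyTorch. The function name and tensor layout below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def warp_with_appearance_flow(garment, flow):
    """Warp a garment image with a dense 2D appearance flow (illustrative sketch).

    garment: (N, C, H, W) source garment image
    flow:    (N, 2, H, W) per-pixel offsets in normalized [-1, 1] coordinates,
             e.g. predicted by a flow network (pixel flow, or a vertex flow
             rasterized to the image plane).
    """
    n, _, h, w = garment.shape
    # Base sampling grid in normalized coordinates, ordered (x, y) for grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1).to(garment)
    # Offset the grid by the predicted flow and sample the source image.
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(garment, grid, align_corners=True)

# Toy usage: zero flow returns (approximately) the input image.
g = torch.rand(1, 3, 64, 48)
out = warp_with_appearance_flow(g, torch.zeros(1, 2, 64, 48))
```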

Arxiv: arxiv.org/abs/2203.15…

Code: awesome-wflow.github.io/

GCFSR: A Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors

The paper was jointly written by ByteDance and the Institute of Advanced Technology of the Chinese Academy of Sciences.

Face super-resolution usually relies on facial priors to recover realistic details and preserve identity information. With the help of GAN priors, recent methods can achieve impressive results, but they either design complex modules to modify a fixed GAN prior or employ complex training strategies to fine-tune the generator. We propose GCFSR, a face super-resolution framework with controllable detail generation that can reconstruct images with faithful identity information without any additional priors.

GCFSR is an encoder-generator architecture. To handle face super-resolution at multiple upscaling factors, we design two modules: style modulation and feature modulation. Style modulation generates realistic facial details, while feature modulation dynamically fuses multi-scale encoded features with the generated features according to the conditional upscaling factor. The architecture is simple and elegant and can be trained from scratch in an end-to-end manner.

For small upscaling factors (≤8), GCFSR produces surprisingly good results with only a GAN loss. After adding L1 and perceptual losses, GCFSR achieves SOTA results on large-factor super-resolution tasks (×16, ×32, ×64). At test time, the strength of the generated details can be adjusted through feature modulation, and a range of effects can be produced by varying the conditional upscaling factor.
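The idea of feature modulation conditioned on the upscaling factor can be illustrated with a hypothetical gated fusion in PyTorch. The module name, gating form, and shapes below are assumptions for illustration and are not taken from the official GCFSR code.

```python
import torch
import torch.nn as nn

class ConditionalFeatureFusion(nn.Module):
    """Hypothetical sketch of magnification-conditioned feature modulation:
    fuse encoder features with generated features using a gate predicted
    from the upscaling factor (not the authors' exact module)."""

    def __init__(self, channels):
        super().__init__()
        self.to_gate = nn.Sequential(
            nn.Linear(1, channels), nn.ReLU(), nn.Linear(channels, channels))

    def forward(self, enc_feat, gen_feat, scale_factor):
        # scale_factor: (N, 1) tensor holding the conditional magnification.
        gate = torch.sigmoid(self.to_gate(scale_factor))[:, :, None, None]
        # Small factors keep more encoded detail; large factors rely more on
        # generated detail. Rescaling the gate at test time changes the
        # strength of the generated details.
        return gate * enc_feat + (1.0 - gate) * gen_feat

fuse = ConditionalFeatureFusion(64)
enc = torch.rand(2, 64, 32, 32)
gen = torch.rand(2, 64, 32, 32)
out = fuse(enc, gen, torch.tensor([[8.0], [32.0]]))
```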

Arxiv: arxiv.org/abs/2203.07…

Code: github.com/hejingwenhe…

SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing

StyleGAN has been very successful in image generation and editing. In practice, however, an important limitation is that its latent space is decomposed by image scale (e.g., 64×64, 128×128), which makes StyleGAN good at global style control but weak at local editing. This paper proposes a new GAN whose latent space is decoupled across different semantic regions.

To achieve this, the paper works from two angles: the generator's inductive bias and the supervision signal. For the first, the low-level generation modules of the StyleGAN generator are decomposed into separate local generators, each producing a local feature map and a pseudo-depth map for a particular region. The pseudo-depth maps are then used to composite the local outputs in a z-buffer-like manner, yielding a global semantic mask and a fused feature map from which the image is rendered. For supervision, the paper proposes a dual-branch discriminator that models the image and its semantic labels jointly, ensuring that each local generator corresponds to a meaningful local part.
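As a rough illustration of the z-buffer-style composition described above, the snippet below fuses per-part feature maps with a softmax over their pseudo-depth maps. This is a simplified sketch with assumed shapes and a made-up sharpness parameter, not the official SemanticStyleGAN implementation.

```python
import torch

def compose_parts(feats, depths, sharpness=10.0):
    """Soft z-buffer composition of K local generators' outputs (sketch).

    feats:  (N, K, C, H, W) per-part feature maps
    depths: (N, K, H, W)    per-part pseudo-depth maps (larger = closer)
    Returns the fused feature map and the soft semantic masks.
    """
    # Softmax over the part dimension acts as a differentiable z-buffer:
    # the part with the largest pseudo-depth dominates each pixel.
    masks = torch.softmax(sharpness * depths, dim=1)        # (N, K, H, W)
    fused = (masks.unsqueeze(2) * feats).sum(dim=1)         # (N, C, H, W)
    return fused, masks

feats = torch.rand(1, 5, 64, 32, 32)
depths = torch.rand(1, 5, 32, 32)
fused, masks = compose_parts(feats, depths)  # masks double as a semantic layout
```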

The resulting model builds an independent latent space for each semantic part, enabling local style manipulation. At the same time, as an upstream model similar to StyleGAN, it still supports latent-space image editing while guaranteeing local control.

Arxiv: arxiv.org/abs/2112.02…

Code: semanticstylegan.github.io

3D-Aware Image Synthesis via Learning Structural and Textural Representations

The paper was co-authored by ByteDance, the Chinese University of Hong Kong and Zhejiang University.

In recent years, generative models have developed rapidly in the image domain, and the quality and resolution of generated images have improved greatly. However, most algorithms still focus on generating two-dimensional images. Making generative models aware of three-dimensional information is an important step toward bringing them closer to the real world. Some attempts take the generative adversarial networks (GANs) common in 2D image generation and replace the generator with a neural radiance field (NeRF), which renders an image pixel by pixel from 3D spatial coordinates. However, the implicit function in NeRF has a very local receptive field, making it difficult for the generator to perceive the global structure of the object. At the same time, NeRF relies on volume rendering, which increases generation cost and optimization difficulty.

To solve these two problems, we propose a new 3D-aware generator, VolumeGAN, that explicitly learns structural and textural representations of objects. Specifically, the generator first learns a feature volume representing the underlying structure of the object, converts this feature volume into a feature field, and then integrates the field into a 2D feature map, which a neural renderer finally synthesizes into an image. This design enables independent control over the structure and appearance of the generated object. Extensive experiments on multiple datasets show that our method achieves better image quality and more accurate 3D controllability than previous methods.
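The pipeline of querying a learned feature volume at ray samples and integrating it into a 2D feature map can be sketched as follows. This is a minimal, assumption-laden illustration (unit step size, toy shapes, random sample points), not the official VolumeGAN code.

```python
import torch
import torch.nn.functional as F

def render_feature_map(volume, density, points):
    """Query a feature volume at ray sample points and integrate along each
    ray with simplified volume rendering (illustrative sketch).

    volume:  (1, C, D, H, W)   learned 3D feature volume
    density: (1, 1, D, H, W)   learned density volume
    points:  (1, Hr, Wr, S, 3) sample coordinates in [-1, 1] per ray
    """
    _, c, _, _, _ = volume.shape
    _, hr, wr, s, _ = points.shape
    grid = points.reshape(1, hr, wr * s, 1, 3)               # 5D grid for 3D sampling
    feat = F.grid_sample(volume, grid, align_corners=True)   # (1, C, Hr, Wr*S, 1)
    sig = F.grid_sample(density, grid, align_corners=True)
    feat = feat.reshape(1, c, hr, wr, s)
    sigma = F.relu(sig.reshape(1, 1, hr, wr, s))
    # Alpha compositing along the sample dimension (unit step size assumed).
    alpha = 1.0 - torch.exp(-sigma)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans
    return (weights * feat).sum(dim=-1)                      # (1, C, Hr, Wr)

vol = torch.rand(1, 32, 16, 16, 16)
den = torch.rand(1, 1, 16, 16, 16)
pts = torch.rand(1, 8, 8, 12, 3) * 2 - 1
fmap = render_feature_map(vol, den, pts)   # would be fed to a 2D neural renderer
```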

Arxiv: arxiv.org/abs/2112.10…

Code: github.com/genforce/vo…

Demo: www.youtube.com/watch?v=p85…

XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation

Because there are so many Chinese characters, the traditional manual font design process is time-consuming and labor-intensive. Few-shot font generation aims to generate a complete font from only one or a few reference characters. However, the style of a Chinese font lies not only in simple shape and texture but also in the structural layout between strokes. To capture the style of Chinese characters, a model must deeply understand the complex relationships between basic strokes; otherwise the quality of the generated font cannot be guaranteed.

To address this, we propose a few-shot font generation algorithm based on a self-supervised cross-modality pre-trained model, organized in two stages:

(1) Pre-training stage: a BERT-based cross-modality feature extractor (glyph images plus stroke-sequence information) is pre-trained with a reconstruction loss and a stroke-prediction loss, ensuring that the extracted font features fully capture the relationships between strokes without losing information.

(2) Font generation stage: the pre-trained extractor extracts features from the source character and the reference character separately; the features are then decoupled and recombined, and the model finally generates the source character rendered in the reference font (a minimal sketch of this step follows below).

In addition, a stroke loss tailored to Chinese characters is introduced in the font generation stage, further improving generation quality.
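As a rough sketch of the decouple-and-recombine step, assume the pre-trained extractor outputs a flat feature vector whose two halves are treated as content and style. Everything below (the split, the toy decoder, the stand-in extractor) is a hypothetical illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FontRecombiner(nn.Module):
    """Illustrative sketch (not XMP-Font's architecture): split a pre-trained
    extractor's features into content and style halves, then recombine the
    source character's content with the reference character's style."""

    def __init__(self, extractor, feat_dim, img_size=64):
        super().__init__()
        self.extractor = extractor           # pre-trained feature extractor (frozen in practice)
        self.half = feat_dim // 2
        self.img_size = img_size
        self.decoder = nn.Sequential(        # toy decoder, just for the sketch
            nn.Linear(feat_dim, img_size * img_size), nn.Sigmoid())

    def forward(self, source_img, reference_img):
        f_src = self.extractor(source_img)   # (N, feat_dim)
        f_ref = self.extractor(reference_img)
        content = f_src[:, :self.half]       # assumed "content" half
        style = f_ref[:, self.half:]         # assumed "style" half
        fused = torch.cat([content, style], dim=1)
        out = self.decoder(fused)
        return out.view(-1, 1, self.img_size, self.img_size)

# Toy usage with a stand-in extractor.
extractor = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))
model = FontRecombiner(extractor, feat_dim=128)
gen = model(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```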

Quantitative metrics and user-study results show that the proposed XMP-Font outperforms other SOTA methods.

Shunted Self-Attention via Multi-Scale Token Aggregation (Oral Presentation)

The paper was written by ByteDance in collaboration with the National University of Singapore and South China University of Technology.

This paper proposes a new multi-scale self-attention mechanism: within the attention computation of each layer, different tokens are given different receptive fields, so that correlations between semantics at different scales can be learned.

Unlike existing multi-scale designs, the multi-scale information here coexists in parallel across the input tokens of the same block rather than being fused by passing tokens between different blocks. The advantage of the method is therefore particularly pronounced on datasets containing objects of very different sizes, such as COCO: compared with Swin Transformer, it improves detection performance by 3-4% mAP with similar model memory and computation.
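The mechanism can be illustrated with a two-scale variant in PyTorch: half of the attention heads use keys/values from all tokens, the other half from tokens aggregated by a stride-2 convolution. This is a simplified sketch rather than the official Shunted Transformer code; all module names and the two-scale split are assumptions.

```python
import torch
import torch.nn as nn

class TwoScaleSelfAttention(nn.Module):
    """Simplified two-scale sketch of shunted-style attention (illustrative):
    half of the heads attend to keys/values built from all tokens, the other
    half to keys/values aggregated at stride 2, so heads within one layer
    see different receptive fields."""

    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0 and heads % 2 == 0
        self.dim, self.heads, self.dh = dim, heads, dim // heads
        self.q = nn.Linear(dim, dim)
        self.kv_fine = nn.Linear(dim, dim)      # K and V for the fine-scale heads
        self.kv_coarse = nn.Linear(dim, dim)    # K and V for the coarse-scale heads
        self.pool = nn.Conv2d(dim, dim, kernel_size=2, stride=2)  # token aggregation
        self.proj = nn.Linear(dim, dim)

    def _mhsa(self, q, k, v, heads):
        # q: (N, Lq, heads*dh); k, v: (N, Lk, heads*dh)
        split = lambda t: t.view(t.size(0), t.size(1), heads, self.dh).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.dh ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2)
        return out.reshape(out.size(0), out.size(1), heads * self.dh)

    def forward(self, x, h, w):
        # x: (N, h*w, dim) tokens on an h x w grid (h and w even in this sketch)
        n, half = x.size(0), self.dim // 2
        q = self.q(x)
        # Fine branch: every spatial position contributes a key/value.
        kf = self.kv_fine(x)
        out_fine = self._mhsa(q[..., :half], kf[..., :half], kf[..., half:], self.heads // 2)
        # Coarse branch: keys/values come from tokens merged at stride 2.
        grid = x.transpose(1, 2).reshape(n, self.dim, h, w)
        pooled = self.pool(grid).flatten(2).transpose(1, 2)   # (N, h*w/4, dim)
        kc = self.kv_coarse(pooled)
        out_coarse = self._mhsa(q[..., half:], kc[..., :half], kc[..., half:], self.heads // 2)
        return self.proj(torch.cat([out_fine, out_coarse], dim=-1))

attn = TwoScaleSelfAttention(dim=64, heads=4)
y = attn(torch.rand(2, 16 * 16, 64), 16, 16)    # -> (2, 256, 64)
```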

Arxiv: arxiv.org/pdf/2111.15…

Code: github.com/oliverrensu…

End-to-end Compressed Video Representation Learning for Generic Event Boundary Detection

The paper was jointly written by ByteDance, the University of Chinese Academy of Sciences and the Institute of Software of the Chinese Academy of Sciences.

This paper proposes an end-to-end generic event boundary detection (GEBD) solution that operates in the compressed domain.

Traditional video processing algorithms need to decode the video and then train and run inference on the decoded RGB frames. However, video decoding itself requires considerable computing resources, and there is a great deal of redundant information between adjacent frames. In addition, the motion vectors and residuals in the video coding format carry the video's motion information, which can further help in understanding the video.

Based on these two considerations, we want to use the intermediate information available in the compressed domain to extract fast, high-quality features for non-key frames. We therefore propose a Spatial Channel Compressed Encoder (SCCP) module. For key frames, features are extracted with a conventional backbone after full decoding. For non-key frames, high-quality features are extracted by lightweight networks from the motion vectors, residuals, and the corresponding key-frame features. A Temporal Contrastive module then enables end-to-end training and inference. Experimental results show that our method is 4.5× faster than the traditional full-decoding approach.
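A plausible, hypothetical sketch of propagating key-frame features to a non-key frame using motion vectors and residuals is given below. It is not the paper's SCCP module; the network, shapes, and interpretation of the motion vectors are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonKeyFrameFeatureNet(nn.Module):
    """Hypothetical sketch of compressed-domain feature propagation: warp the
    key frame's features with the (downsampled) motion vectors, then refine
    them with a light conv stack that also sees the coding residual."""

    def __init__(self, feat_ch, res_ch=3):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + res_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1))

    def forward(self, key_feat, motion_vec, residual):
        # key_feat:   (N, C, H, W) features of the reference key frame
        # motion_vec: (N, 2, H, W) motion vectors in pixels at feature resolution
        # residual:   (N, 3, H, W) coding residual resized to feature resolution
        n, _, h, w = key_feat.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(key_feat)      # (2, H, W)
        coords = base.unsqueeze(0) + motion_vec                       # absolute positions
        # Normalize to [-1, 1] for grid_sample.
        grid_x = coords[:, 0] / max(w - 1, 1) * 2 - 1
        grid_y = coords[:, 1] / max(h - 1, 1) * 2 - 1
        grid = torch.stack((grid_x, grid_y), dim=-1)                  # (N, H, W, 2)
        warped = F.grid_sample(key_feat, grid, align_corners=True)
        return warped + self.refine(torch.cat([warped, residual], dim=1))

net = NonKeyFrameFeatureNet(feat_ch=64)
feat = net(torch.rand(1, 64, 32, 32), torch.zeros(1, 2, 32, 32), torch.rand(1, 3, 32, 32))
```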

Arxiv: arxiv.org/abs/2203.15…

Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning

The paper is a collaboration between ByteDance, the National University of Singapore, the Institute of Automation of the Chinese Academy of Sciences, and the University of Oxford.

This paper studies class incremental learning, whose ultimate goal is to obtain, through phase-by-phase learning, a model that matches the performance of joint training. The biggest challenge is that after learning the classes of a given phase, the model's performance on classes from earlier phases drops significantly, a phenomenon known as forgetting.

A class incremental learning process with multiple phases can be divided into two parts: the initial phase (the first learning phase) and the later phases (all learning phases after the first). Previous work tends to regularize the model in the later phases to reduce forgetting, without any special treatment of the initial phase. In this paper, however, the authors find that the initial phase is also critical to class incremental learning.

Through visualization, the authors find that the biggest difference between the representations of a model trained only on the initial phase and those of an oracle model (jointly trained on all classes) is this: the initial-phase model's representations concentrate in a long, narrow region of representation space (i.e., a lower-dimensional subspace), whereas the oracle model's representations spread evenly in all directions (i.e., a relatively higher-dimensional subspace). This is shown in Figures (a) and (b).

Based on this finding, the authors propose a novel regularizer: Class-wise Decorrelation (CwD). It is applied only during initial-phase training and encourages the representations learned in the initial phase to spread more evenly in all directions of the feature space, making them more similar to those of the oracle model, as shown in Figure (c).
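A minimal sketch of such a class-wise decorrelation penalty is shown below: for each class present in a batch, it penalizes the off-diagonal entries of the correlation matrix of that class's features. The exact normalization may differ from the paper's formulation.

```python
import torch

def classwise_decorrelation(features, labels, eps=1e-5):
    """Class-wise decorrelation (CwD-style) penalty, minimal sketch.

    features: (B, d) batch of representations
    labels:   (B,)   class labels
    """
    loss, count = 0.0, 0
    for c in labels.unique():
        z = features[labels == c]                   # (n_c, d)
        if z.size(0) < 2:
            continue
        z = z - z.mean(dim=0, keepdim=True)
        z = z / (z.std(dim=0, keepdim=True) + eps)  # standardize each dimension
        corr = z.t() @ z / (z.size(0) - 1)          # (d, d) correlation matrix
        off_diag = corr - torch.diag(torch.diag(corr))
        loss = loss + off_diag.pow(2).mean()        # push correlations toward zero
        count += 1
    return loss / max(count, 1)

feats = torch.randn(32, 128)
labels = torch.randint(0, 4, (32,))
reg = classwise_decorrelation(feats, labels)        # add lambda * reg to the CE loss
```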

The authors find that the CwD regularizer consistently improves previous state-of-the-art class incremental learning methods by 1% to 3%. They hope this work helps the research community better appreciate the significance of the initial phase in class incremental learning and draws more attention to improving it.

Arxiv: arxiv.org/abs/2112.04…

Code: github.com/Yujun-Shi/C…

DINE: Domain Adaptation from Single and Multiple Black-box Predictors

The work was done by ByteDance in collaboration with the Institute of Automation, Chinese Academy of Sciences and the National University of Singapore.

In this paper, the authors propose an efficient unsupervised domain adaptation approach that requires only a pre-trained black-box source-domain model. Unlike previous settings based on source-domain data or a white-box source model (whose parameters are visible), in black-box domain adaptation only the predictions of the source model are accessible. The authors propose DINE, the first distill-then-finetune framework for this problem. In the distillation stage, an adaptive label-smoothing strategy applied to only the top-K predictions of the source model yields effective pseudo-labels for per-sample knowledge distillation.

In addition, a sample-mixing strategy regularizes random interpolations between samples, and mutual-information maximization regularizes the global prediction distribution. To better fit the target-domain data, the distilled model is then fine-tuned solely by maximizing mutual information. DINE can use a single source model or multiple ones, protects the data privacy of the source domain, does not require the same network architecture across domains, and can adapt simply and effectively to the target domain's computing resources. Experimental results on single-source, multi-source, and partial-set domain adaptation confirm that DINE achieves highly competitive performance even compared with methods that rely on source-domain data.
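Two of the ingredients described above, building soft distillation targets from only the source model's top-K predictions and the mutual-information objective, can be sketched as follows. The smoothing rule here is a simplified assumption and may differ from the paper's adaptive scheme.

```python
import torch
import torch.nn.functional as F

def topk_smoothed_targets(source_probs, k=3):
    """Simplified sketch: keep the top-k probabilities from the black-box
    model and spread the remaining mass uniformly over the other classes."""
    n, num_classes = source_probs.shape
    topv, topi = source_probs.topk(k, dim=1)
    rest = (1.0 - topv.sum(dim=1, keepdim=True)).clamp(min=0.0)
    targets = rest.expand(-1, num_classes) / (num_classes - k)
    targets = targets.clone()
    targets.scatter_(1, topi, topv)                 # restore the exact top-k values
    return targets

def information_maximization(logits):
    """Mutual-information objective: low entropy per sample, high entropy of
    the mean prediction (minimize the returned quantity)."""
    probs = logits.softmax(dim=1)
    ent_per_sample = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    mean_p = probs.mean(dim=0)
    ent_marginal = -(mean_p * mean_p.clamp_min(1e-8).log()).sum()
    return ent_per_sample - ent_marginal

src_probs = torch.softmax(torch.randn(8, 10), dim=1)   # black-box predictions
pseudo = topk_smoothed_targets(src_probs, k=3)
target_logits = torch.randn(8, 10, requires_grad=True) # target model outputs
loss = F.kl_div(target_logits.log_softmax(dim=1), pseudo, reduction="batchmean") \
       + information_maximization(target_logits)
```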

Arxiv: arxiv.org/abs/2104.01…

Code: github.com/tim-learn/D…

NightLab: A Dual-level Architecture with Hardness Detection for Night Segmentation

The paper was a collaboration between ByteDance and the University of California, Merced.

Semantic segmentation of night scenes is an important and challenging problem for many vision applications such as autonomous driving, yet research on it remains limited. Because of low exposure at night, captured images lose a great deal of information, leaving many dark and blurry regions. In addition, since night images depend on other light sources, exposure also differs significantly between images. Compared with daytime data, night scene segmentation therefore presents many unexplored challenges: a model that performs well on daytime data performs poorly at night. This drove us to explore the main factors affecting night scene segmentation and to develop effective models.

To address these problems, this paper proposes NightLab, a night segmentation method that integrates multiple deep learning modules and offers stronger night-time perception and analysis. It consists of models at two granularity levels, the whole image and the region level, and each level combines a light-adaptation module with a segmentation module. Given a night image, the image-level model produces an initial segmentation, while a detection model in NightLab proposes regions of the image that are hard to recognize. The crops corresponding to these hard regions are then analyzed by the region-level model, which focuses on them to improve the segmentation. All modules in NightLab are trained end to end. Extensive experiments show that NightLab achieves SOTA on the NightCity and BDD100K public datasets.
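A schematic of the dual-level inference flow, with stand-in models and a hypothetical hard-region box format, might look like the following. It illustrates the pipeline only and is not NightLab's implementation.

```python
import torch
import torch.nn.functional as F

def dual_level_segment(image, image_model, region_model, hard_boxes):
    """Illustrative dual-level pipeline: run the image-level model on the full
    frame, then refine the logits inside detected hard regions with a
    region-level model (not NightLab's exact implementation).

    image:      (1, 3, H, W)
    hard_boxes: list of (y0, y1, x0, x1) boxes from a hardness detector
    """
    logits = image_model(image)                          # (1, K, H, W) initial prediction
    for y0, y1, x0, x1 in hard_boxes:
        crop = image[:, :, y0:y1, x0:x1]
        crop = F.interpolate(crop, size=(256, 256), mode="bilinear", align_corners=False)
        region_logits = region_model(crop)               # (1, K, 256, 256)
        region_logits = F.interpolate(
            region_logits, size=(y1 - y0, x1 - x0), mode="bilinear", align_corners=False)
        logits[:, :, y0:y1, x0:x1] = region_logits       # overwrite the hard region
    return logits.argmax(dim=1)

# Toy usage with stand-in models that predict K=19 classes.
toy = lambda x: torch.randn(x.size(0), 19, x.size(2), x.size(3))
pred = dual_level_segment(torch.rand(1, 3, 512, 1024), toy, toy, [(100, 228, 300, 556)])
```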

About the Intelligent Creation Team

The Intelligent Creation team is the center of ByteDance's audio and video innovation technology and business, covering computer vision, graphics, speech, shooting and editing, special effects, and client and server engineering, and it realizes a closed loop within the department from cutting-edge algorithms to engineering systems to products. The team aims to provide the industry's most cutting-edge content understanding, content creation, interactive experience, and consumption capabilities and industry solutions, in a variety of forms, to all of the company's business lines and to external partners.

At present, the Intelligent Creation team offers its technical capabilities and services to enterprises through ByteDance's Volcano Engine.