Four papers from the Tao technical content interaction algorithm team accepted at the ACM International Conference on Multimedia (ACM MM 2021). Congratulations!

The ACM International Conference on Multimedia (ACM MM) is recognized as a top international conference in multimedia and computer vision within computer science, and is recommended by the China Computer Federation (CCF) as a Class A international academic conference. ACM MM covers images, video, audio, human-computer interaction, social media, and related topics. ACM MM 2021 received a total of 1,942 submissions and accepted 542 papers (an acceptance rate of about 27.9%).

The Tao technical content interaction algorithm team focuses on machine learning, vision algorithms, NLP, on-device intelligence, and related fields. Backed by billions of videos across the Tao ecosystem and supporting businesses across Taobao, including live streaming and content browsing, the team has rich business scenarios and technical directions, constantly exploring new technology. Team members come from well-known universities at home and abroad. In the past two years, the team has won four championships in CVPR competitions, published more than ten papers in top computer vision venues (CVPR, TPAMI, TIP, etc.), and won a second prize of the National Science and Technology Progress Award for its technical achievements.

At ACM MM 2021, four of the team's papers were accepted, and the related technical innovations have been applied in Tao business scenarios. The papers' innovations and practical applications are introduced in detail below.

NO.1

Title

Understanding Chinese Video and Language via Contrastive Multimodal Pre-training

Authors

Chenyi Lei, Shixian Luo, Yong Liu, Wanggui He, Jiamang Wang, Guoxin Wang, Haihong Tang, Chunyan Miao, Houqiang Li

Innovations & industry impact

Pre-training models have achieved great success in natural language processing, vision, and even multimodal fields. This paper focuses on joint video-text pre-training strategies in the multimodal domain, especially for Chinese video and text. Video-text pre-training faces the following challenges. First, unlike static images, video carries dynamic spatio-temporal relationships, so directly porting image-text pre-training methods to the video-text field is not enough to capture this complex relational information. Second, the video-text alignment task conflicts with the mask-based reconstruction tasks in the pre-training model. Third, the lack of large-scale, high-quality Chinese video-text datasets limits the development of pre-training models in the Chinese domain.

Therefore, this paper proposes VICTOR, a multimodal pre-training model built on reconstruction and contrastive learning tasks, and establishes a high-quality Chinese video-text dataset of tens of millions of samples. With a Transformer backbone, VICTOR trains the model with seven tasks based on reconstruction and contrastive learning. The reconstruction tasks are masked language modeling, masked sentence generation, masked frame-order modeling, and masked sentence-order modeling, which together capture the sequential and interactive information of video and text. The contrastive tasks are dual video-text alignment, intra-video masked frame contrastive learning, and inter-video masked frame contrastive learning, which strengthen spatio-temporal information fusion within video while avoiding the uncertain multimodal fusion caused by a simple video-text alignment task alone. The VICTOR model has hundreds of millions of parameters, is pre-trained on the constructed Tao video-text dataset of tens of millions of samples, and achieves SOTA performance on several downstream tasks (such as video-text matching, video recommendation, and title generation). VICTOR effectively advances pre-training for Chinese video and text, and can be widely applied in many video-related businesses (such as video recommendation and video classification).
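The dual video-text alignment task described above can be illustrated with a minimal symmetric InfoNCE-style contrastive loss. This is a sketch under assumptions, not the paper's implementation; all function and variable names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def video_text_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched video-text pairs in the batch are
    positives, all other pairings serve as in-batch negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature           # (B, B) cosine similarity matrix
    targets = torch.arange(v.size(0))        # diagonal entries = matched pairs
    loss_v2t = F.cross_entropy(logits, targets)    # video -> text direction
    loss_t2v = F.cross_entropy(logits.T, targets)  # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```

Minimizing this loss pulls each video embedding toward its own title while pushing it away from the other titles in the batch, which is the general mechanism alignment-style contrastive tasks rely on.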

Application in real Tao business scenarios

We applied VICTOR's pre-trained video features to many areas, such as content retrieval, recommendation, classification, and live streaming, comparing against the strong baseline of each application scenario. Specifically:

  1. Content recommendation – browse-content recommendation. With overall efficiency stable, the share of new content within 3 days increased by 22.81%, cold-start UCTR +4.29%, PCTR +4.72%;
  2. Content retrieval – Taobao experience. In cross-modal retrieval, the no-result rate decreased from 3.23% to 0.95% while relevance evaluation was maintained;
  3. Content classification – browse-content classification. The relative accuracy of image-text classification improved by 3.94% (60.97% -> 63.37%), and that of video classification improved by 7.33% (51.99% -> 55.80%);
  4. Object detection and matching – live-stream cover review. Detection accuracy across all categories improved by 4.83% (89% -> 93.3%), and detection accuracy on hard beauty-makeup cases improved by 8.05% (75.8% -> 81.9%).

Paper link

Arxiv.org/abs/2104.09…

NO.2

Title

Pre-training Graph Transformer with Multimodal Side Information for Recommendation

Authors

Yong Liu, Susen Yang, Chenyi Lei, Guoxin Wang, Haihong Tang, Juyong Zhang, Aixin Sun, Chunyan Miao

Innovations & industry impact

In personalized recommendation, and short-video recommendation in particular, multimodal information plays an important role. Effectively using an item's multimodal information, such as text and visual features, can improve recommendation performance and alleviate the cold-start problem. Existing recommendation models that fuse multimodal information all perform end-to-end modal fusion for a specific task, which consumes resources and limits model generalization. In addition, items in recommendation exhibit various correlations (such as tag-based semantic correlation and behavior-based user-interest correlation). To save resources, improve model reuse, and capture correlations between items, this paper proposes PMGT, a graph pre-training framework based on multimodal information fusion, which guides the fusion of items' multimodal information while capturing item correlations; the pre-trained item features can then be applied to a variety of downstream tasks. This avoids the resource waste and time cost of re-fusing modal information for every specific task. PMGT first builds an item multimodal graph from item metadata: the graph's nodes are items, the edges reflect relationships between items (for example, an edge is created between items interacted with by the same user), and each node's features are composed of the item's multimodal features. For each node in the graph, an efficient parallel sampling method, MCNSampling, samples a number of related nodes from the graph to form a node sequence, and a diversity-promoting Transformer framework aggregates node features to alleviate redundancy in modal fusion.
Finally, a graph-structure reconstruction task and a node-feature reconstruction task guide both the fusion of related nodes and the fusion of each node's own multimodal information. PMGT is pre-trained and evaluated on the public Amazon and MovieLens datasets, achieving SOTA performance compared with the latest graph pre-training models. By using graphs to guide the fusion of items' multimodal information and effectively capturing correlations between items, PMGT's pre-training is not limited to items themselves; it strengthens the expressiveness of the pre-trained item features, which can be applied to a variety of downstream tasks and domains.
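The graph-construction and neighbor-sampling steps above can be sketched in a few lines. This is a simplified illustration, not PMGT's MCNSampling algorithm: it builds co-interaction edges and then samples a weighted context sequence for a node; all names are hypothetical:

```python
import random
from collections import defaultdict
from itertools import combinations

def build_item_graph(user_histories):
    """Build an item-item graph: items interacted with by the same user are
    connected, with edge weight = co-occurrence count across users."""
    adj = defaultdict(lambda: defaultdict(int))
    for items in user_histories.values():
        for a, b in combinations(set(items), 2):
            adj[a][b] += 1
            adj[b][a] += 1
    return adj

def sample_context(adj, node, length, seed=None):
    """Sample a sequence of correlated neighbors for `node`, weighting the
    draw by edge co-occurrence counts (stronger ties sampled more often)."""
    rng = random.Random(seed)
    seq = []
    while len(seq) < length and adj[node]:
        nbrs, weights = zip(*adj[node].items())
        seq.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return seq
```

In PMGT the sampled sequence, together with each node's multimodal features, is what the Transformer consumes; this sketch only shows how such a context sequence could be produced from raw interaction logs.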

Application in real Tao business scenarios

In short-video recommendation across the Tao ecosystem, we built a video multimodal graph with 4 million nodes and 400 million edges based on short videos' tag information, and applied the PMGT pre-trained features directly to the recall stage of short-video recommendation; the share of new content within 7 days increased by 7%. Going forward, the pre-trained features can be applied to the ranking stage and even other business scenarios (such as video classification), and PMGT can serve as a base framework for fine-tuning task-specific models to further improve results.

Paper link

Arxiv.org/abs/2010.12…

NO.3

Title

Shape Controllable Virtual Try-on for Underwear Models (SC-VTON)

Authors

Xin Gao, Zhenjiang Liu, Zunlei Feng, Chengji Shen, Kairi Ou, Haihong Tang, Mingli Song

Innovations & industry impact

We propose SC-VTON, a shape-controllable virtual try-on network. For the try-on task for underwear models, we use a graph attention network (GAT) that integrates model and clothing information to generate warped clothing images. We further add control points to SC-VTON to achieve shape control of the garment. Moreover, by adding a splitting network and a synthesis network, we can optimize the model with paired clothing-model data and generalize the method to the conventional 2D virtual try-on task. Our method achieves precise shape control of clothing, and compared with other approaches it generates texture-realistic, high-resolution images that can be deployed in practical applications. This is the first time in the industry that a graph attention network has been applied to the virtual try-on task while achieving precise, controllable clothing deformation.
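To make the GAT component concrete, below is a minimal single-head graph attention layer in the style of Velickovic et al., where each garment vertex aggregates features from its mesh neighbors with learned attention weights. This is a generic sketch of the building block, not SC-VTON's actual architecture; the class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention layer: every vertex attends over its
    adjacent vertices and aggregates their transformed features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # feature transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention scorer

    def forward(self, x, adj):
        # x: (N, in_dim) vertex features; adj: (N, N) 0/1 adjacency matrix
        h = self.W(x)                                    # (N, out_dim)
        N = h.size(0)
        # Concatenate every (source, target) feature pair for scoring.
        pairs = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                           h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))      # (N, N) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))       # keep only edges
        attn = torch.softmax(e, dim=-1)                  # normalize per vertex
        return attn @ h                                  # weighted aggregation
```

Stacking layers like this over a garment mesh lets the network propagate the model's pose and shape information along the mesh topology, which is the intuition behind using graph attention for clothing deformation.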

Application in real Tao business scenarios

Clothing is the most important category in the Tao ecosystem; virtual try-on, as a novel interactive display, brings users an innovative experience and gives merchants a new way to showcase their brands. Since New Year's Day 2020, the "virtual fitting room" product has been online in public-domain scenarios of the Taobao app such as Pailitao photo search, scan, and cloud themes: it provides models of different body sizes for users to choose from and supports real-time online try-on of hundreds of thousands of garments. The virtual fitting room reached 200K-300K PV and 100K UV, with an average stay of 2 minutes on the second-hop page and an average of 12 try-ons per user. In addition, the operations team used the virtual try-on product to launch a "try on 500 luxury items a day" marketing campaign on Weibo, which generated 240 million exposures and 151,000 discussions. The product has been recognized by merchants and online users.

Paper link

Arxiv.org/pdf/2107.13…

NO.4

Title

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-grained 3D Visual Grounding

Authors

Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, Si Liu

Innovations & industry impact

This paper proposes a Transformer-based model that extracts multimodal context among objects in 3D scenes, in order to model more discriminative features for localizing the referred object.

Each layer of the model consists of two modules:

  1. Entity-aware attention module. This module matches entity information in the language with visual entity features and extracts the entity features that conform to the language description.
  2. Relation-aware attention module. This module matches relation information in the language with pairwise relation features between visual entities to enhance the entity features that conform to the relation description.

The model achieves state-of-the-art results on two fine-grained 3D visual grounding benchmark datasets.
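The language-to-vision matching performed by these modules can be sketched as a standard cross-attention block in which object features query the language tokens, so objects matching the described entity get enhanced. This is a generic illustration under assumptions, not the paper's exact module; names are hypothetical:

```python
import torch
import torch.nn as nn

class EntityAwareAttention(nn.Module):
    """Cross-attention sketch: 3D object features attend over language token
    features, enhancing objects that match the described entity."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats, lang_feats):
        # obj_feats: (B, num_objects, dim); lang_feats: (B, num_tokens, dim)
        attended, _ = self.attn(query=obj_feats,
                                key=lang_feats, value=lang_feats)
        # Residual connection + normalization, as in a standard
        # Transformer decoder layer.
        return self.norm(obj_feats + attended)
```

A relation-aware counterpart would follow the same pattern but attend with pairwise object-object relation features as queries instead of individual object features.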

Application in real Tao business scenarios

The fine-grained 3D visual grounding task has no direct application in Tao businesses at present, but it has a wide range of potential application scenarios, such as structured video information extraction, intelligent robot control, and human-computer interaction. The model proposed in this paper can help intelligent robots better understand the correspondence between a human user's instructions and visual information, enabling accurate localization of objects in real 3D scenes and providing a technical basis for complex downstream tasks.

Paper link

Colalab.org/media/paper…

Conclusion

The content interaction algorithm team behind the papers above belongs to the Taobao content team, which is responsible for algorithm R&D for Taobao live streaming, short video, image-text content, and UGC review businesses. It applies cutting-edge AI techniques to content understanding and cognition, representation learning, smart editing, and content generation, building Alibaba's content algorithm platform.

At present, the team is working on large-scale multimodal pre-training models; structured and digital representation of multimedia content; construction of knowledge graphs that fuse content with industry operating knowledge; representation of users' content-consumption interests; cognitive content recommendation; and creative generation and interaction (smart reading, smart collections, virtual try-on, 3D studio, virtual anchors, etc.). Through a deep understanding of users' interests across Taobao and real-time perception of them, the team aims to build a complete category and attribute tag system in the content domain, ranging from fine-grained attributes of objects, scenes, characters, and sound styles to coarse-grained content types and filming techniques, characterizing content at multiple levels of granularity to learn content cognition and general content representation. The goal is to improve the efficiency and experience of multimedia content search and recommendation matching, and to make Taobao the first place consumers go to make purchase decisions. We warmly welcome students interested in these topics and directions to join us.