From June 19 to 25, CVPR 2021 (the IEEE/CVF Conference on Computer Vision and Pattern Recognition), one of the world's top computer vision conferences, was held online, yet participation was as enthusiastic as ever.

This year, the Alibaba Cloud multimedia AI team (MMAI, jointly formed by Alibaba Cloud Video Cloud and the DAMO Academy vision team) competed in ActivityNet, the large-scale open challenge on human behavior understanding; AVA-Kinetics, currently the largest spatio-temporal action localization challenge; HACS, the very large scale action detection challenge; and Epic-Kitchens, the egocentric (first-person) human behavior understanding challenge. Across six tracks the team took five championships and one runner-up, winning ActivityNet and HACS for the second consecutive year!

A strong record across top challenges

ActivityNet, the large-scale temporal action detection challenge, was launched in 2016. Sponsored by KAUST, Google, DeepMind, and others, it has now been held six times.

The challenge is one of the most influential in the field: it targets temporal action detection and tests an AI algorithm's ability to understand long videos. Past participants have come from many well-known organizations at home and abroad, including Microsoft, Baidu, Huawei, SenseTime, Peking University, Columbia University, and others.

This year, the Alibaba Cloud MMAI team won the challenge with an average mAP of 44.67%.

Figure 1. ActivityNet challenge certificate

AVA-Kinetics, a spatio-temporal action localization challenge run by Google, DeepMind, and Berkeley since 2018, aims to identify atomic actions in video along both the spatial and temporal dimensions.

Due to its difficulty and practical relevance, it has attracted many top international universities and research institutions over the years, such as DeepMind, FAIR, SenseTime-CUHK, and Tsinghua University.

This year, the Alibaba Cloud MMAI team took first place with an mAP of 40.67%!

Figure 2. AVA-Kinetics Challenge Award Certificates

HACS, a very large scale temporal action detection challenge hosted by MIT, started in 2019 and is currently the largest action detection challenge of its kind. It consists of two tracks: fully supervised and weakly supervised temporal action detection.

With more than twice the data of ActivityNet, it is very challenging. Past participants include Microsoft, Samsung, Baidu, Shanghai Jiao Tong University, SenseTime, Xi'an Jiaotong University, and others.

This year, the Alibaba Cloud MMAI team entered both tracks and won both championships, with average mAPs of 44.67% and 22.45% respectively!

Figure 3. Certificates for the two HACS challenge tracks

Epic-Kitchens, an egocentric (first-person) human action understanding challenge hosted by the University of Bristol, has been held three times since 2019. It focuses on understanding the interaction between human actions and target objects from a first-person viewpoint.

Over the years, participating teams have included Baidu, FAIR, NTU, NUS, INRIA-Facebook, Samsung (SAIC-Cambridge), and others.

This year, the Alibaba Cloud MMAI team entered the temporal action detection and action recognition tracks, taking the championship and the runner-up spot with an average mAP of 16.11% and an accuracy of 48.5% respectively!

Figure 4. Epic-Kitchens challenge certificate

Key technical explorations behind the four major difficulties

Behavior understanding poses four main difficulties:

First, action durations are widely distributed, ranging from 0.5 seconds to 400 seconds. For a 200-second test video sampled at 15 frames per second, the algorithm must localize actions accurately across 3,000 frames.

Second, video backgrounds are complex and usually contain many irregular, non-target activities, which greatly increases the difficulty of action detection.

Third, intra-class variation is large: the visual appearance of the same action changes markedly with the individual, the viewpoint, and the environment.

Finally, human action detection also has to cope with mutual occlusion between people, insufficient video resolution, and varying illumination and viewing angles.

The team's strong results in these challenges were mainly supported by its technology framework EMC2, which explores the following core technologies:

(1) Strengthened training and optimization of the backbone network

The backbone network is one of the core elements of behavior understanding.

In this challenge, the Alibaba Cloud MMAI team focused on two aspects: in-depth study of the video Transformer (ViViT), and exploiting the complementarity of heterogeneous Transformer and CNN models.

Training ViViT, the primary backbone, involves both pre-training and fine-tuning. During fine-tuning, the MMAI team analyzed the effect of variables such as input size and data augmentation to find the best configuration for the task at hand.

In addition, to exploit the complementarity of Transformers and CNNs, SlowFast and CSN were also used. Through ensemble learning, classification accuracies of 48.5%, 93.6%, and 96.1% were achieved on Epic-Kitchens, ActivityNet, and HACS respectively, a significant improvement over last year's winning results.
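To make the heterogeneous fusion above more concrete, here is a minimal late-fusion sketch in PyTorch that averages weighted class probabilities from a video Transformer and CNN backbones. The `models` and `weights` dictionaries are illustrative assumptions; the team's actual fusion scheme and weights are not detailed here.

```python
# Late-fusion ensemble sketch: weighted average of class probabilities from
# heterogeneous backbones (e.g. a ViViT-style Transformer plus SlowFast / CSN CNNs).
# The `models` and `weights` dicts are hypothetical; real fusion weights are tuned.
import torch
import torch.nn.functional as F

def ensemble_predict(clip, models, weights):
    """clip: (B, C, T, H, W) video tensor; returns fused (B, num_classes) probabilities."""
    fused = None
    for name, model in models.items():
        model.eval()
        with torch.no_grad():
            probs = weights[name] * F.softmax(model(clip), dim=-1)
        fused = probs if fused is None else fused + probs
    return fused

# toy usage (assumes all backbones accept the same clip layout):
# fused = ensemble_predict(clip, {"vivit": vivit, "slowfast": slowfast, "csn": csn},
#                          {"vivit": 0.5, "slowfast": 0.3, "csn": 0.2})
```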



Figure 5. Structure and performance of ViViT

(2) Spatio-temporal entity relation modeling in video understanding

For spatio-temporal action detection, modeling the relationships between people, between people and objects, and between people and the scene is particularly important for correct action recognition, especially for interactive actions.

In this challenge, Alibaba Cloud MMAI therefore focused on modeling and analyzing these relationships.

Specifically, people and objects in the video are first localized and their feature representations extracted. To model different types of relations at a finer granularity, these features are enhanced with global spatial features of the video, and Transformer-based relation learning modules are applied at different temporal and spatial positions; weight sharing across positions gives the learned associations a degree of positional invariance.

To further model long-range temporal correlations, a two-stage temporal feature bank, maintained both online and offline, was built so that feature information from clips before and after the current one can be incorporated into relation learning.

Finally, the relation-enhanced person features were used for action recognition, and a decoupled learning strategy was adopted to handle hard and few-sample categories under the long-tailed distribution of action classes.
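As a rough illustration of this kind of relation modeling, the sketch below uses cross-attention to let detected person features attend to object, scene, and feature-bank entries. The shapes, feature dimension, and module names are assumptions for illustration, not the EMC2 implementation.

```python
# Illustrative cross-attention relation block: detected person (actor) features
# attend to context features (objects, scene, and long-term feature-bank entries).
# All shapes and the 512-d feature size are assumptions for this sketch.
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, persons, context):
        # persons: (B, P, D) actor RoI features; context: (B, C, D) object/scene
        # features, optionally concatenated with entries from a temporal feature bank
        out, _ = self.attn(query=persons, key=context, value=context)
        return self.norm(persons + out)  # residual update of the actor features

# toy usage: the refined person features feed the action classification head
persons = torch.randn(2, 3, 512)   # 3 detected persons per clip
context = torch.randn(2, 10, 512)  # objects + scene + feature-bank entries
refined = RelationBlock()(persons, context)
```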

Figure 6. Relation modeling network

(3) Long-video understanding based on action proposal relation encoding

Across the various action understanding tasks, long video duration is one of the main challenges under limited computational budgets, and learning temporal relationships is an important way to address long-term video understanding.

In EMC2, a module based on action proposal relation encoding is designed to improve the algorithm's long-range perception.

Specifically, a base action detection network is used to produce dense action proposals, each of which can roughly be regarded as the time interval of a specific action instance.

Then, a self-attention mechanism encodes these proposals along the temporal dimension, so that each proposal can perceive global information and predict more accurate action boundaries. With this technique, EMC2 won the temporal action detection tracks of ActivityNet and other challenges.
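The sketch below illustrates the general idea of proposal relation encoding: features pooled from dense action proposals are treated as a sequence and refined with a Transformer encoder so that every proposal can attend to all others. The feature dimension, layer count, and output head are illustrative assumptions rather than the team's exact design.

```python
# Sketch of self-attention over dense action proposals so each proposal can see
# global video context before refining its boundaries. Dimensions are assumed.
import torch
import torch.nn as nn

class ProposalRelationEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, 3)  # e.g. refined (start offset, end offset, score)

    def forward(self, proposal_feats):
        # proposal_feats: (B, N, D) features pooled inside each proposal's interval
        related = self.encoder(proposal_feats)  # every proposal attends to all others
        return self.head(related)               # per-proposal refinement outputs

# toy usage with 100 dense proposals per video
outputs = ProposalRelationEncoder()(torch.randn(1, 100, 256))
```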

Figure 7. Relation encoding between action proposals

(4) Network initialization based on self-supervised learning

Initialization is an important step in deep network training and one of the main components of EMC2.

The Alibaba Cloud MMAI team designed a self-supervised initialization method, MOSI, which trains video models from static images.

MOSI consists of two main components: pseudo-motion generation and static mask design.

First, pseudo video clips are generated by sliding a window over a static image at a specified direction and speed. An appropriate mask then retains the motion pattern only in a local region, giving the network the ability to perceive local motion. During training, the model is optimized to predict the speed and direction of the input pseudo video.

A model trained this way acquires the ability to perceive motion in video. To comply with the rule that no additional data may be used, MOSI was trained only on a limited number of frames from the challenge videos, and still delivered significant performance gains, ensuring high-quality model training for each challenge.
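A minimal sketch of the pseudo-motion generation step is shown below, assuming a single static image and a constant per-frame displacement. The crop size, clip length, and the omission of the masking step are simplifying assumptions, not the published MOSI implementation; the chosen (dx, dy) would serve as the self-supervised speed/direction target.

```python
# Minimal sketch of MOSI-style pseudo-motion generation: a fixed-size window
# slides over one static image with constant velocity (dx, dy) pixels per frame,
# producing a clip whose motion label the network must predict.
import torch

def pseudo_clip(image, dx, dy, crop=112, frames=8):
    """image: (C, H, W) -> pseudo video (frames, C, crop, crop) with uniform motion."""
    _, H, W = image.shape
    # start offsets chosen so the window stays inside the image for every frame
    x0, y0 = max(0, -dx * (frames - 1)), max(0, -dy * (frames - 1))
    assert x0 + max(dx, 0) * (frames - 1) + crop <= W, "image too small for this motion"
    assert y0 + max(dy, 0) * (frames - 1) + crop <= H, "image too small for this motion"
    clip = [image[:, y0 + dy * t: y0 + dy * t + crop,
                     x0 + dx * t: x0 + dx * t + crop] for t in range(frames)]
    return torch.stack(clip)  # self-supervised target: the chosen (dx, dy) speed/direction

# toy usage: an 8-frame clip drifting 4 pixels to the right per frame
clip = pseudo_clip(torch.randn(3, 256, 256), dx=4, dy=0)
```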

Figure 8. MOSI training process and semantic analysis

“Video behavior analysis has always been considered a very challenging task, mainly because of the diversity of its content.

Although many advanced techniques have been proposed in fundamental computer vision, our innovations in this competition mainly include: 1) an in-depth exploration of self-supervised learning and of heterogeneous Transformer+CNN fusion; and 2) continued research on methods for modeling the relationships between different entities in video. These explorations confirm the importance of current advanced techniques, such as self-supervised learning, for video content analysis.

In addition, our success illustrates the importance of entity relation modeling for understanding video content, which has not yet received enough attention in the industry,” concluded Jin Rong, a senior researcher at Alibaba.

Building multimedia AI cloud products on video understanding technology

Building on the EMC2 technology base, the Alibaba Cloud MMAI team has carried out in-depth research on video understanding and actively industrialized it, launching the Multimedia AI product line: the Retina Video Cloud Multimedia AI Experience Center (click 👉 Multimedia AI Cloud Product Experience Center to try it out).

The products implement core functions such as video search, video review, video structuring, and video production, and handle millions of hours of video data. They provide core capabilities for customer scenarios such as video search, video recommendation, video review, copyright protection, video cataloging, interactive video, and assisted video production, greatly improving customers' working efficiency and traffic.

Figure 9. Multimedia AI products

At present, the multimedia AI cloud products have been deployed in the media, pan-entertainment, short-video, sports, and e-commerce industries:

1) In the media industry, the products support the production workflows of top customers such as CCTV and People's Daily, greatly improving production efficiency and reducing labor costs. In news production, for example, cataloging efficiency increased by 70% and search efficiency by 50%;

2) In the pan-entertainment and short-video industries, the products mainly support businesses inside and outside the group, such as Youku, Weibo, and Qutoutiao, covering scenarios including video structuring, image/video review, video fingerprint search, copyright tracing, video de-duplication, cover generation, and highlight generation. They help protect video copyright and improve the efficiency of traffic distribution, handling hundreds of millions of calls per day;

3) In the sports industry, the products supported the 21st FIFA World Cup by linking multi-modal information such as vision, motion, audio, and speech to enable cross-modal analysis of live football broadcasts, improving editing efficiency by an order of magnitude over traditional workflows;

4) In the e-commerce industry, the products support Taobao, Xianyu, and other businesses with structuring of new videos and video/image review, helping customers quickly generate short videos and improve distribution efficiency.

Figure 10. Multimedia AI label recognition in the sports and film/TV industries



Figure 11. Multimedia AI label recognition in the media and e-commerce industries

Supported by EMC2, the Retina Video Cloud Multimedia AI Experience Center has the following advantages:

1) Multi-modal learning: massive multi-modal data such as video, audio, and text is used for cross-media understanding, integrating knowledge from different domains into the understanding/production system;

2) Lightweight customization: users can register the entities they want to recognize on their own; the algorithm supports "plug and play" for new entity labels, and new categories approach the accuracy of known categories with only a small amount of data;

3) High performance: a self-developed high-performance audio/video codec library, deep learning inference engine, and GPU preprocessing library, optimized for the IO- and compute-intensive characteristics of video workloads, deliver nearly 10x performance improvements across scenarios;

4) Strong versatility: the multimedia AI cloud products have already been applied across the media, pan-entertainment, short-video, sports, and e-commerce industries.

“Video makes content easier to understand, accept, and spread. Over the past few years we have seen industries and scenarios of all kinds accelerating their shift to video, and society's demand for video output keeps growing. How to produce video that meets user requirements efficiently and with high quality has become a core problem. It involves many details, such as hot-topic discovery, content understanding and profiling of large amounts of video material, multi-modal retrieval, and template construction based on users and scenarios, all of which depend heavily on visual AI technology. The MMAI team keeps improving its visual AI capabilities in combination with industry scenarios, and on this basis has polished and built an enterprise-grade multimedia AI cloud product, so that video can be produced efficiently and with high quality, effectively advancing the shift to video across industries and scenarios,” commented Bi Xuan, head of Alibaba Cloud Video Cloud.

At CVPR 2021, MMAI defeated many strong domestic and international competitors and won multiple championships across several academic challenges, a strong validation of its technology. Its Multimedia AI cloud product already serves top customers in multiple industries and will continue to create application value across them.

👇 Click to try the Multimedia AI Cloud Product Experience Center: Go

Open source code: https://github.com/alibaba-mm…

References:

[1] Huang Z, Zhang S, Jiang J, et al. Self-supervised motion learning from static images. CVPR 2021: 1276-1285.
[2] Arnab A, Dehghani M, Heigold G, et al. ViViT: A Video Vision Transformer. arXiv preprint arXiv:2103.15691, 2021.
[3] Feichtenhofer C, Fan H, Malik J, et al. SlowFast networks for video recognition. ICCV 2019: 6202-6211.
[4] Tran D, Wang H, Torresani L, et al. Video classification with channel-separated convolutional networks. ICCV 2019: 5552-5561.
[5] Lin T, Liu X, Li X, et al. BMN: Boundary-Matching Network for temporal action proposal generation. ICCV 2019: 3889-3898.
[6] Huang Z, Wang X, et al. A Stronger Baseline for Ego-Centric Action Detection. arXiv preprint arXiv:2106.06942, 2021.
[7] Huang Z, Qing Z, Wang X, et al. Towards training stronger video vision transformers for EPIC-KITCHENS-100 action recognition. arXiv preprint arXiv:2106.05058, 2021.
[8] Wang X, Qing Z, et al. Proposal Relation Network for Temporal Action Detection. arXiv preprint arXiv:2106.11812, 2021.
[9] Qing Z, et al. Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling. arXiv preprint arXiv:2106.11811, 2021.
[10] Qing Z, Huang Z, Wang X, et al. Exploring Stronger Feature for Temporal Action Localization. 2021.

"Video Cloud Technology" is your go-to public account for audio and video technology. Every week we push practical technical articles from the front line of Alibaba Cloud, where you can exchange ideas with first-class engineers in the audio and video field. Reply [technology] in the public account backstage to join the Alibaba Cloud Video Cloud technology exchange group, discuss audio and video technology with the authors, and get the latest industry information.