CVPR 2021 (the IEEE/CVF Conference on Computer Vision and Pattern Recognition), one of the top international conferences in computer vision, was held online from June 19 to 25. Despite the virtual format, it remained hugely popular, and the enthusiasm of participants was as hot as the summer.
This year, the Alibaba Cloud Multimedia AI team (MMAI, formed by Alibaba Cloud Video Cloud and the DAMO Academy vision team) took part in six tracks across ActivityNet, AVA-Kinetics, HACS and EPIC-Kitchens (the egocentric human action understanding challenge), the largest temporal action localization challenges currently held, winning five championships and one runner-up place and topping ActivityNet and HACS for the second year in a row.
A strong record in top-tier challenges
ActivityNet was launched in 2016 by KAUST, Google, DeepMind and others, and has now been held successfully for six years.
As one of the most influential challenges in the field, it focuses on temporal action detection to test an AI algorithm's ability to understand long videos. Participants come from many well-known organizations at home and abroad, including Microsoft, Baidu, Shanghai Jiao Tong University, Huawei, SenseTime, Peking University and Columbia University.
This year, the Alibaba Cloud MMAI team won the challenge with an average mAP of 44.67%!
Figure 1. ActivityNet challenge certificate
The spatio-temporal action localization challenge (AVA-Kinetics) was launched in 2018 by Google, DeepMind and Berkeley, and aims to recognize atomic actions occurring in videos along both the spatial and temporal dimensions.
Because of its difficulty and practical value, it has attracted many top international universities and research institutions, such as DeepMind, FAIR, SenseTime-CUHK and Tsinghua University.
This year, the Alibaba Cloud MMAI team beat its rivals with an mAP of 40.67% and took first place!
Figure 2. AVA-Kinetics challenge award certificate
HACS, a large-scale action detection challenge sponsored by MIT, started in 2019 and is currently the largest challenge for temporal action detection. It consists of two tracks: fully supervised action detection and weakly supervised action detection.
With more than twice as much data as ActivityNet, it is highly challenging. Previous participants include Microsoft, Samsung, Baidu, Shanghai Jiao Tong University, SenseTime and Xi'an Jiaotong University.
This year, the Alibaba Cloud MMAI team entered both tracks at the same time and topped them with average mAPs of 44.67% and 22.45%, respectively!
Figure 3. Certificates for the two tracks of the HACS Challenge
EPIC-Kitchens, the egocentric (first-person) human action understanding challenge hosted by the University of Bristol, started in 2019 and has been held three times so far. It is dedicated to understanding the interaction between human actions and target objects from a first-person viewpoint.
Past entrants include Baidu, FAIR, NTU, NUS, Inria-Facebook and Samsung (SAIC-Cambridge).
This year, the Alibaba Cloud MMAI team entered the temporal action detection and action recognition tracks, finishing champion and runner-up with an average mAP of 16.11% and an accuracy of 48.5%, respectively.
Figure 4: Certificates of the EPIC-Kitchens Challenge
Key technology exploration for four challenges
Action understanding poses four main difficulties:
First, action durations vary widely, from 0.5 seconds to 400 seconds. Taking a 200-second test video as an example, with 15 frames sampled per second, the algorithm must localize actions accurately within 3,000 frames.
Second, video backgrounds are complex, and many irregular, non-target actions are often embedded in the video, which greatly increases the difficulty of detection.
Third, intra-class variance is large: the visual appearance of the same action changes markedly with the individual, viewpoint and environment.
Finally, detecting human actions also has to cope with mutual occlusion between people, insufficient video resolution, and varied interference from illumination and viewpoint.
In these challenges, the team's outstanding results were mainly supported by its technical framework EMC2, which explores the following core technologies:
(1) Optimized training of the backbone network
The backbone network is one of the core elements of action understanding.
In this challenge, the Alibaba Cloud MMAI team mainly explored two directions: an in-depth study of the video Transformer (ViViT), and the complementarity between Transformer and CNN heterogeneous models.
As the main backbone, ViViT is trained in two stages: pre-training and fine-tuning. During fine-tuning, the MMAI team thoroughly analyzed the influence of variables such as input size and data augmentation to find the optimal configuration for the task at hand.
In addition, given the complementarity between Transformer and CNN structures, SlowFast, CSN and other architectures were also used. Through ensemble learning, the team achieved classification accuracies of 48.5%, 93.6% and 96.1% on EPIC-Kitchens, ActivityNet and HACS respectively, a clear improvement over last year's winning results.
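A minimal sketch of the ensemble step described above: several backbones (e.g. ViViT, SlowFast, CSN) score the same clips, and their class probabilities are fused by a weighted average. The model names, fusion weights and `ensemble_predictions` helper are illustrative placeholders, not the team's actual pipeline.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_predictions(per_model_logits, weights=None):
    """Weighted average of per-model class probabilities.

    per_model_logits: list of arrays, each (num_clips, num_classes),
                      e.g. from a Transformer branch and CNN branches.
    weights:          optional per-model fusion weights (default: uniform).
    """
    probs = [softmax(l) for l in per_model_logits]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(axis=-1), fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vivit_logits = rng.normal(size=(4, 10))      # Transformer branch
    slowfast_logits = rng.normal(size=(4, 10))   # CNN branch 1
    csn_logits = rng.normal(size=(4, 10))        # CNN branch 2
    labels, fused_probs = ensemble_predictions(
        [vivit_logits, slowfast_logits, csn_logits], weights=[0.5, 0.3, 0.2]
    )
    print(labels, fused_probs.shape)
```

In practice the fusion weights would be tuned on a validation split rather than fixed by hand.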
Figure 5 ViViT structure and performance
(2) Spatio-temporal entity relationship modeling for video understanding
For spatio-temporal action detection, learning the human-human, human-object and human-scene relationships in a video through relationship modeling is essential for correct action recognition, especially for interactive actions.
Alibaba Cloud MMAI therefore focused on modeling and analyzing these relationships in this challenge.
Specifically, the people and objects in the video are first localized and their feature representations extracted separately. To model different types of relationships at a finer granularity, these features are enhanced with the global video feature in the spatial domain, and Transformer-based relationship learning modules are applied at different spatial and temporal locations; meanwhile, weight sharing across locations gives the relationship learning over the relevant regions a degree of position invariance.
To further model long-range temporal associations, a two-stage temporal feature bank, maintained both online and offline, is constructed to fuse feature information from before and after the current clip into the relationship learning.
Finally, the human features obtained after relationship learning are used for the action recognition task, and a decoupled learning strategy handles the hard and few-shot categories under the long-tailed distribution of action classes.
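To make the idea concrete, here is a hedged PyTorch sketch of Transformer-style relation modeling between detected actors and their context (other people, objects, the scene, and features drawn from a temporal feature bank). It assumes pre-extracted RoI features; the `ActorContextRelation` class, its dimensions and heads are illustrative and not the team's exact EMC2 module.

```python
import torch
import torch.nn as nn


class ActorContextRelation(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 80):
        super().__init__()
        # Actor features act as queries; person/object/scene features as keys/values.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, actor_feats, context_feats):
        # actor_feats:   (B, N_actor, dim) RoI features of detected people
        # context_feats: (B, N_ctx, dim)   people/object/scene features, possibly
        #                including entries from a temporal feature bank of
        #                neighbouring clips for long-range context
        rel, _ = self.attn(actor_feats, context_feats, context_feats)
        x = self.norm1(actor_feats + rel)        # residual relation update
        x = self.norm2(x + self.ffn(x))
        return self.classifier(x)                # per-actor action logits


if __name__ == "__main__":
    model = ActorContextRelation()
    logits = model(torch.randn(2, 3, 256), torch.randn(2, 20, 256))
    print(logits.shape)  # (2, 3, 80)
```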
Figure 6. Relationship modeling network
(3) Long-video understanding based on action proposal relation encoding
Among the tasks related to action understanding, the long duration of videos is one of the main challenges under limited computational budgets, and temporal relationship learning is an important way of addressing it.
In EMC2, a module based on action proposal relation encoding is designed to improve the algorithm's long-range temporal perception.
Specifically, a base detection network is used to produce dense action proposals, where each proposal can be roughly viewed as the time interval in which a particular action instance occurs.
Then, based on the self-attention mechanism, the temporal relationships among these proposals are encoded along the time dimension, so that each proposal perceives global information and can predict a more accurate action location. With this technique, EMC2 won the temporal action detection championships in ActivityNet and other challenges.
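A minimal sketch of that proposal relation encoding, assuming each proposal already has a pooled feature from the base detection network: a small Transformer encoder lets proposals exchange information before their confidences and boundaries are refined. The `ProposalRelationEncoder` name, head counts and output heads are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class ProposalRelationEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.score_head = nn.Linear(dim, 1)    # refined confidence per proposal
        self.offset_head = nn.Linear(dim, 2)   # start/end offsets per proposal

    def forward(self, proposal_feats):
        # proposal_feats: (B, N, dim) features pooled over each proposal's
        #                 time interval from the base detection network
        encoded = self.encoder(proposal_feats)       # proposals attend to each other
        scores = self.score_head(encoded).sigmoid()  # (B, N, 1)
        offsets = self.offset_head(encoded)          # (B, N, 2)
        return scores, offsets


if __name__ == "__main__":
    enc = ProposalRelationEncoder()
    s, o = enc(torch.randn(1, 100, 256))
    print(s.shape, o.shape)  # (1, 100, 1) (1, 100, 2)
```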
Figure 7. Relation encoding between action proposals
(4) Network initialization training based on self-supervised learning
Initialization is an important step in training deep networks and one of the main components of EMC2.
The Alibaba Cloud MMAI team designed a self-supervised initialization method, MoSI, which trains video models from static images.
MoSI consists of two main components: pseudo-motion generation and static mask design.
First, pseudo video clips are generated by sliding a window over a static image in a specified direction and at a specified speed. Then, an appropriate mask is designed so that only the motion pattern of a local region is retained, giving the network the ability to perceive local motion. Finally, during training, the model is optimized to predict the speed and direction of the input pseudo video.
A model trained this way acquires the ability to perceive motion in video. In the challenges, given the rule against using extra data, MoSI was trained only on the limited set of challenge video frames, yet still brought a clear performance gain and ensured the quality of model training across all the challenges.
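Here is a minimal sketch of the pseudo-motion generation idea behind MoSI: a crop window slides across a single static image at a chosen direction and speed to produce a pseudo clip, whose (direction, speed) pair serves as the self-supervised label. The mask design and the full label space are simplified, and the `pseudo_clip` helper is an illustration rather than the paper's implementation.

```python
import numpy as np


def pseudo_clip(image: np.ndarray, direction: str, speed: int,
                num_frames: int = 8, crop: int = 112) -> np.ndarray:
    """Generate a (num_frames, crop, crop, C) pseudo video clip from one image."""
    h, w = image.shape[:2]
    y, x = (h - crop) // 2, (w - crop) // 2            # start from the centre crop
    dy, dx = {"up": (-speed, 0), "down": (speed, 0),
              "left": (0, -speed), "right": (0, speed)}[direction]
    frames = []
    for t in range(num_frames):
        yy = int(np.clip(y + t * dy, 0, h - crop))
        xx = int(np.clip(x + t * dx, 0, w - crop))
        frames.append(image[yy:yy + crop, xx:xx + crop])
    return np.stack(frames)


if __name__ == "__main__":
    img = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
    clip = pseudo_clip(img, direction="right", speed=6)
    # A video model is then trained to predict ("right", 6) from `clip`.
    print(clip.shape)  # (8, 112, 112, 3)
```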
Figure 8. MoSI training process and semantic analysis
"Video behavior analysis has always been regarded as a very challenging task because of the diversity of video content.
Although many advanced techniques have been proposed in fundamental machine vision, our innovations in this competition are mainly: 1) an in-depth exploration of self-supervised learning and Transformer+CNN heterogeneous fusion; and 2) continued research on modeling the relationships between different entities in videos. These explorations confirm the importance of advanced techniques such as self-supervised learning for video content analysis.
In addition, our success illustrates the importance of entity-relationship modeling for understanding video content, which has not yet received enough attention in the industry," concluded Jin Rong, a senior researcher at Alibaba.
Building multimedia AI cloud products on video understanding technology
Built on the EMC2 technical base, the Alibaba Cloud MMAI team not only conducts in-depth research on video understanding but also actively industrializes it, launching a multimedia AI product: the **Retina Video Cloud Multimedia AI Experience Center** (click 👉 Multimedia AI Cloud Product Experience Center to try it).
The product implements core functions such as video search, review, structuring and production, and handles millions of hours of video data. It provides core capabilities for customer scenarios such as video search, video recommendation, video review, copyright protection, video cataloguing, interactive video and assisted video production, greatly improving customers' working efficiency and traffic.
Figure 9. Multimedia AI product
At present, the multimedia AI cloud product has been deployed in the media, pan-entertainment, short-video, sports and e-commerce industries:
1) In the media industry, it mainly supports the production workflows of leading customers such as CCTV and People's Daily, greatly improving production efficiency and reducing labor costs. In news production scenarios, for example, it improves cataloguing efficiency by 70% and search efficiency by 50%;
2) In the pan-entertainment and short-video industries, it mainly supports businesses such as Youku (within the group), Weibo and Qutoutiao, covering scenarios such as video structuring, image/video review, video fingerprint search, copyright tracing, video de-duplication, cover generation and highlight generation, helping protect video copyright and improve the efficiency of traffic distribution, with hundreds of millions of calls per day;
3) In the sports industry, it supported the 21st FIFA World Cup, fusing multi-modal information such as vision, motion, audio and speech to achieve cross-modal analysis of live football streams, an order of magnitude more efficient than traditional editing;
4) In the e-commerce industry, it supports businesses such as Taobao and Xianyu, structuring newly added videos, reviewing videos/images, and helping customers quickly produce short videos and improve distribution efficiency.
Figure 10. Multimedia AI label recognition in the sports and film industries
Figure 11. Multimedia AI label recognition in the media and e-commerce industries
Supported by EMC2, the Retina Video Cloud Multimedia AI Experience Center has the following advantages:
1) Multi-modal learning: massive multi-modal data such as video, audio and text are used for cross-media understanding, integrating knowledge from different domains into the understanding/production system;
2) Lightweight customization: users can register the entities to be recognized by themselves; the algorithm supports "plug and play" for newly added entity labels, and with a small amount of data the performance on new categories approaches that of known categories;
3) High performance: a self-developed high-performance audio/video codec library, deep learning inference engine and GPU pre-processing library, optimized for the IO- and compute-intensive characteristics of video scenarios, deliver close to 10x performance improvements across different scenarios;
4) Strong versatility: the multimedia AI cloud product has already been applied in the media, pan-entertainment, short-video, sports and e-commerce industries.
"Video makes content easier to understand, accept and spread. Over the past few years we have seen all kinds of industries and scenarios accelerating the shift toward video content, and society's demand for video production keeps growing. How to produce video that meets user requirements efficiently and with high quality has become a core problem. It involves many details, such as discovering hot topics, understanding large amounts of video material, building content profiles, multi-modal retrieval, and constructing templates based on users and scenarios, all of which rely heavily on visual AI technology. The MMAI team keeps improving its visual AI capabilities in combination with industry scenarios, and on this basis we polish and build business-grade multimedia AI cloud products, so that video can be produced efficiently and with high quality and the shift to video content can be effectively advanced across industries and scenarios," commented Bi Xuan, head of Alibaba Cloud Video Cloud.
At CVPR 2021, MMAI defeated many strong domestic and international competitors across multiple academic challenges and took home several championships, a strong validation of its technology. Its cloud product, Multimedia AI, already serves leading customers in many industries and will continue to create value across them.
👇 Click to try the Multimedia AI Cloud Product Experience Center: retina.aliyun.com
Open source address: github.com/alibaba-mma…
References:
[1] Huang Z, Zhang S, Jiang J, et al. Self-supervised motion learning from static images. CVPR 2021: 1276-1285.
[2] Arnab A, Dehghani M, Heigold G, et al. ViViT: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
[3] Feichtenhofer C, Fan H, Malik J, et al. SlowFast networks for video recognition. ICCV 2019: 6202-6211.
[4] Tran D, Wang H, Torresani L, et al. Video classification with channel-separated convolutional networks. ICCV 2019: 5552-5561.
[5] Lin T, Liu X, Li X, et al. BMN: Boundary-matching network for temporal action proposal generation. ICCV 2019: 3889-3898.
[6] Feng Y, Jiang J, Huang Z, et al. Relation modeling in spatio-temporal action localization. arXiv preprint arXiv:2106.08061, 2021.
[7] Huang Z, Wang X, et al. A stronger baseline for egocentric action detection. arXiv preprint arXiv:2106.06942, 2021.
[8] Huang Z, Qing Z, Wang X, et al. Towards training stronger video vision transformers for EPIC-Kitchens-100 action recognition. arXiv preprint arXiv:2106.05058, 2021.
[9] Wang X, Qing Z, et al. Proposal relation network for temporal action detection. arXiv preprint arXiv:2106.11812, 2021.
[10] Qing Z, et al. Weakly-supervised temporal action localization through local-global background modeling. arXiv preprint, 2021.
[11] Qing Z, Huang Z, Wang X, et al. Exploring stronger feature for temporal action localization. arXiv preprint, 2021.