
With the rapid growth of short video and the increasing demand for security management, video-related technologies such as smart video tagging, intelligent coaching, intelligent editing, intelligent security monitoring, text-to-video retrieval, video highlight extraction, and smart video covers are gradually becoming part of everyday life.

Consider video-related businesses: short-video sites want to quickly tag each new upload and push it to the right users. Editors want an easy way to extract highlight clips from competition footage. Coaches want to systematically analyze athletes' movements and compile technical statistics. Security authorities want to review video content accurately, for example identifying violations in real time; news editors want to retrieve relevant video clips by text query as source material; and advertising or recommendation platforms want to generate more attractive covers for videos to raise conversion rates. All of these tasks pose a great challenge to traditional manual workflows.

Video understanding applies AI so that machines can understand video content. It now has broad application and research value in short video, recommendation, search, advertising, security management, and other fields; tasks such as action localization and recognition, video tagging, text-to-video retrieval, and video content analysis can all be handled by video understanding technology.

PaddleVideo is an industry-grade open source video development kit built on the PaddlePaddle deep learning platform and independently developed by Baidu. It contains numerous model algorithms and industry cases in the video field. The main upgrades in this open source release are:

  • Releases 10 industry-grade application cases in the video field, covering the sports, Internet, medical, media, and security industries.

  • Open-sources, for the first time, 5 championship/top-conference/industry-grade algorithms, covering technical directions such as video-text learning, video segmentation, depth estimation, video-text retrieval, and action recognition/video classification.

  • Provides rich documentation and tutorials, as well as live courses and a user exchange group where you can discuss with senior Baidu R&D engineers.

One, ten video application scenarios and tools in detail

In the sports industry, PaddleVideo open-sources a general sports action recognition framework for scenes such as football, basketball, table tennis, and figure skating. For Internet and media scenes, it open-sources solutions such as knowledge-enhanced large-scale multimodal classification and tagging, intelligent editing, and video splitting. It also open-sources detection and recognition cases for security, education, medical, and other scenarios. Baidu Intelligent Cloud, combined with PaddlePaddle deep learning technology, has also built a series of carefully polished industry-grade solutions for multi-scene action recognition, intelligent video analysis and production, medical analysis, and more.

1. Football scene:

Open-sources the FootballAction intelligent highlight-clipping solution

FootballAction combines the PP-TSM action recognition model, the BMN temporal action localization model, and the AttentionLSTM sequence model, so it can not only accurately identify the type of action but also accurately locate its start and end times. It currently recognizes eight action categories: background, goal, corner kick, free kick, yellow card, red card, substitution, and throw-in, with an accuracy rate above 90%.
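To make the two-stage design concrete, here is a minimal, runnable sketch of the "propose, then classify" control flow. The helper functions are hypothetical stand-ins for the BMN and PP-TSM/AttentionLSTM models in PaddleVideo, not their actual APIs, and the returned values are dummy data.

```python
# A minimal sketch of the FootballAction two-stage pipeline described above.
# bmn_propose and classify_segment are hypothetical stand-ins for the real
# PaddleVideo models; only the overall control flow is illustrated.
from typing import List, Tuple

ACTIONS = ["background", "goal", "corner kick", "free kick",
           "yellow card", "red card", "substitution", "throw-in"]

def bmn_propose(features) -> List[Tuple[float, float]]:
    """Stand-in for BMN: returns (start_s, end_s) candidate action segments."""
    return [(12.0, 18.5), (104.2, 109.0)]  # dummy proposals

def classify_segment(features, start: float, end: float) -> Tuple[str, float]:
    """Stand-in for PP-TSM + AttentionLSTM: labels one proposed segment."""
    return "goal", 0.93  # dummy (label, confidence)

def clip_highlights(video_features, score_thresh: float = 0.5):
    """Propose segments, classify each, keep confident non-background ones."""
    highlights = []
    for start, end in bmn_propose(video_features):
        label, conf = classify_segment(video_features, start, end)
        if label != "background" and conf >= score_thresh:
            highlights.append({"label": label, "start": start,
                               "end": end, "conf": conf})
    return highlights

print(clip_highlights(video_features=None))
```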

2. Basketball scene:

Open-sources the BasketballAction intelligent highlight-clipping solution

The overall framework of BasketballAction is similar to FootballAction. It covers seven action categories, including background, three-point goal, two-point goal, dunk, free throw, and jump ball, with an accuracy rate above 90%.

3. Table tennis scene:

Open-sources an action classification model trained on large-scale data

At Baidu Create 2021, PaddleVideo and Peking University jointly released a table tennis action recognition model, built on a standard training dataset derived from more than 500 GB of match video, with labels covering eight major categories such as serve, pull, and short swing. Detection of rally start and end points reaches over 97% accuracy, and action recognition reaches over 80%.

4. Figure skating action recognition

A pose estimation algorithm extracts skeleton keypoint data, which is then fed into the ST-GCN spatio-temporal graph convolutional network for action classification, supporting 30 action classes. A figure skating action recognition competition held jointly with CCF attracted more than 3,800 participants from 300 universities and 200 enterprises; the champion solution's accuracy was 12 points higher than the baseline, and the top-3 solutions have been open-sourced.
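For intuition on how skeleton data flows through such a model, below is a simplified single graph-convolution step in the spirit of ST-GCN, written with PaddlePaddle. The tensor layout (batch, channels, frames, joints), the 17-joint skeleton, and the identity adjacency are illustrative assumptions; the real PaddleVideo ST-GCN uses partitioned adjacency matrices and stacked spatio-temporal blocks.

```python
# A simplified spatial-graph-convolution step in the spirit of ST-GCN.
# Shapes and the toy adjacency are assumptions for illustration only.
import paddle
import paddle.nn as nn

N, C, T, V = 2, 2, 64, 17   # batch, channels (x,y per keypoint), frames, joints
x = paddle.randn([N, C, T, V])           # pose-estimation keypoint sequences
A = paddle.eye(V)                        # toy adjacency: self-links only

conv = nn.Conv2D(C, 64, kernel_size=1)   # 1x1 conv mixes channels per (frame, joint)
feat = conv(x)                           # (N, 64, T, V)
feat = paddle.einsum("nctv,vw->nctw", feat, A)  # aggregate along skeleton edges

pooled = feat.mean(axis=[2, 3])          # global pooling over time and joints
logits = nn.Linear(64, 30)(pooled)       # 30 figure-skating action classes
print(logits.shape)                      # [2, 30]
```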

5. Knowledge-enhanced large-scale/multimodal video classification and tagging

In the direction of video content analysis, PaddlePaddle has open-sourced VideoTag and MultimodalVideoTag. VideoTag supports 3,000 practical tags derived from industrial practice and generalizes well, making it well suited to large-scale short-video classification scenarios in China; its tag accuracy reaches 89%.

Built on real short-video business data, the MultimodalVideoTag model fuses text, video image, and audio modalities for multi-label video classification. Compared with pure video image features, it significantly improves performance on high-level semantic tags. The model provides 25 first-level tags and 200+ second-level tags, with an accuracy rate above 85%.
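As a rough illustration of the fusion idea, the sketch below late-fuses three modality embeddings into multi-label tag scores. The encoders are plain linear stand-ins and all dimensions are assumptions; the actual MultimodalVideoTag model uses dedicated text, video-frame, and audio backbones.

```python
# A minimal late-fusion sketch of multimodal tagging, in PaddlePaddle.
# Feature dimensions and the linear "encoders" are illustrative assumptions.
import paddle
import paddle.nn as nn

class LateFusionTagger(nn.Layer):
    def __init__(self, dims=(768, 2048, 128), hidden=256, num_tags=200):
        super().__init__()
        # one projection per modality: text, video image, audio
        self.proj = nn.LayerList([nn.Linear(d, hidden) for d in dims])
        self.head = nn.Linear(hidden * 3, num_tags)

    def forward(self, text_feat, video_feat, audio_feat):
        fused = paddle.concat(
            [p(f) for p, f in zip(self.proj, (text_feat, video_feat, audio_feat))],
            axis=-1)
        return paddle.nn.functional.sigmoid(self.head(fused))  # multi-label scores

model = LateFusionTagger()
scores = model(paddle.randn([4, 768]), paddle.randn([4, 2048]),
               paddle.randn([4, 128]))
print(scores.shape)  # [4, 200] per-tag probabilities
```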

6. Intelligent production of video content

In intelligent video production, the main goal is to help content creators re-edit videos. PaddlePaddle has open-sourced a video quality analysis model based on PP-TSM, which supports two production applications: news video splitting and intelligent video covers. Split news segments are an important source of material for editors in the radio and television industry, while smart covers play an important role in click-through rates and recommendation performance for pan-Internet businesses such as live streaming and entertainment.

7. Open-sources an interactive video annotation tool

PaddlePaddle has open-sourced an interactive VOS tool based on MA-Net, which uses a small amount of manual supervision to achieve better segmentation results. A whole video can be annotated by labeling only a few frames, and segmentation quality can be improved through repeated interaction with the video until it is satisfactory.
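The interaction loop itself is easy to picture in code. The following schematic uses hypothetical stand-in functions for the MA-Net tool; only the "annotate a few frames, propagate, refine" flow is shown.

```python
# A schematic of the interactive annotation loop described above.
# segment_video and user_refines are hypothetical stand-ins for MA-Net.
def segment_video(video, scribbles, memory):
    """Stand-in for MA-Net: propagate user scribbles to every frame."""
    masks = [f"mask_for_frame_{i}" for i in range(len(video))]
    memory.append(scribbles)          # MA-Net keeps past interactions in memory
    return masks

def user_refines(masks):
    """Stand-in for the user marking the worst frame with a few strokes."""
    return {"frame": 0, "strokes": ["stroke"]}

video, memory = list(range(100)), []
scribbles = {"frame": 0, "strokes": ["stroke"]}   # initial coarse annotation
for round_idx in range(3):                        # until the user is satisfied
    masks = segment_video(video, scribbles, memory)
    scribbles = user_refines(masks)
print(f"{len(memory)} interaction rounds stored")
```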

8. 87 kinds of general behavior recognition with a single spatio-temporal action detection model

PaddlePaddle implements a scheme for recognizing a wide range of human behaviors based on a spatio-temporal action detection model, using multi-frame temporal information to overcome the poor performance of traditional single-frame detection. Covering data processing, model training, model testing, and model inference, it recognizes the 80 actions in the AVA dataset plus 7 self-developed abnormal behaviors (swinging objects, fighting, kicking, chasing, arguing, running fast, and falling). The model performs far better than a pure object detection scheme.
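A small sketch of why multi-frame clips help: the person detector runs on a single keyframe, but the action head scores each detected box against a whole clip of frames. Both functions below are hypothetical stand-ins, not the PaddleVideo implementation.

```python
# Illustrates clip-based spatio-temporal action detection with stand-ins.
import numpy as np

def detect_people(frame):
    """Stand-in person detector: returns boxes as (x1, y1, x2, y2)."""
    return [(50, 40, 120, 200)]

def classify_actions(clip, box):
    """Stand-in action head: scores 87 behaviors from the temporal context."""
    scores = np.random.rand(87)
    return scores / scores.sum()

frames = [np.zeros((360, 640, 3), dtype=np.uint8) for _ in range(32)]
keyframe = frames[16]                       # detect people on the centre frame
for box in detect_people(keyframe):
    scores = classify_actions(np.stack(frames), box)  # classify from the clip
    print("top action id:", int(scores.argmax()))
```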

9. Drone detection

UAV detection in restricted areas faces the following challenges:

(1) UAV targets are small and hard to observe.

(2) UAVs move at variable speeds.

(3) UAVs fly in complex environments and may be occluded by buildings and trees.

To address these challenges, PaddlePaddle has developed and open-sourced a UAV detection model that handles detection in many complex environments.

10. Classification and identification of medical images

Based on public 3D-MRI brain imaging databases, the Second Affiliated Hospital of Zhejiang University School of Medicine and Baidu Research have open-sourced a project for classifying 3D-MRI brain images for Parkinson's disease, covering the NEUROCON, Tao Wu, PPMI, and OASIS-1 datasets, with a total of 378 Parkinson's disease (PD) and normal control (Con) cases. The project provides 2D and 3D baseline models, 4 classification models, and 3D-MRI pre-trained models. PP-TSN and PP-TSM achieve over 91% accuracy and over 97.5% AUC, while TimeSformer achieves the highest accuracy at over 92.3%.

Two, five championship and top-conference algorithms open-sourced

Baidu Research open-sources its championship and top-conference algorithms for the first time

1. CVPR 2020 paper:

Multimodal pre-training model ActBERT is open-sourced for the first time

ActBERT is a multimodal pre-training model combining video, image, and text. It uses a novel entangled encoding module for multimodal feature learning from the three sources, enhancing the interaction between the two visual inputs and language. Guided by global action information, the entangled encoding module injects visual information into the language model and integrates linguistic information into the visual model, dynamically selecting appropriate context to facilitate target prediction. Put simply, the entangled encoder uses action information to catalyze the interaction between local regions and text. ActBERT outperforms other methods on five downstream tasks, including text-video retrieval, video captioning, and video question answering. The following table shows ActBERT's performance on the MSR-VTT text-video retrieval dataset.
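As a purely conceptual toy (not ActBERT's actual architecture), the snippet below shows one way a global action feature can be injected into both streams before language and vision cross-attend to each other; all feature sizes are made up.

```python
# A conceptual toy of action-guided cross-modal interaction in PaddlePaddle.
# NOT ActBERT's implementation: the "injection" is a simple broadcast addition.
import paddle
import paddle.nn.functional as F

D = 64
words = paddle.randn([10, D])     # linguistic token features
regions = paddle.randn([6, D])    # local visual-region features
action = paddle.randn([1, D])     # one global action feature

def attend(query, keys, values):
    """Plain scaled dot-product attention."""
    attn = F.softmax(query @ keys.T / D ** 0.5, axis=-1)
    return attn @ values

lang_q = words + action                       # action-conditioned word queries
vis_q = regions + action                      # action-conditioned region queries
lang_out = attend(lang_q, regions, regions)   # visual info flows into language
vis_out = attend(vis_q, words, words)         # linguistic info flows into vision
print(lang_out.shape, vis_out.shape)          # [10, 64] [6, 64]
```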

2. CVPR 2021 paper:

Text-video retrieval model T2VLAD is open-sourced for the first time

With the popularity of Internet video, especially short video, text-video retrieval has recently gained wide attention in academia and industry. Especially once multimodal video information is introduced, finely aligning local video features with natural language features becomes a major difficulty. T2VLAD uses an efficient global-local alignment method to automatically learn semantic centers shared by text and video, and matches clustered local features center by center, avoiding heavy computation and enabling fine-grained understanding of local information in both language and video.

In addition, T2VLAD directly maps multimodal video information (audio, motion, scene, speech, OCR, face, etc.) into the same space and clusters it with the same set of semantic fusion centers, computing local similarities between video and text features at each center. This partly solves the difficulty of comprehensively exploiting multimodal information. T2VLAD achieves excellent performance on all three standard text-video retrieval datasets.
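To make the shared-center idea concrete, here is a small NetVLAD-style aggregation sketch in numpy: local features from each modality are softly assigned to the same set of centers, and similarity is computed center by center. The dimensions and the dot-product assignment are illustrative assumptions, not T2VLAD's exact formulation.

```python
# Shared-center soft assignment and per-center aggregation, in numpy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vlad_aggregate(local_feats, centers):
    """Soft-assign local features to centers and aggregate residuals per center."""
    assign = softmax(local_feats @ centers.T)            # (n_local, K)
    resid = local_feats[:, None, :] - centers[None]      # (n_local, K, D)
    vlad = (assign[..., None] * resid).sum(axis=0)       # (K, D)
    return vlad / (np.linalg.norm(vlad, axis=-1, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 64))                       # K shared semantic centers
video_locals = rng.normal(size=(20, 64))                 # per-segment video features
text_locals = rng.normal(size=(12, 64))                  # per-word text features

v = vlad_aggregate(video_locals, centers)
t = vlad_aggregate(text_locals, centers)
similarity = (v * t).sum(axis=-1).mean()                 # center-wise local similarity
print(f"text-video similarity: {similarity:.3f}")
```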

3. CVPR 2020 video segmentation model MA-Net is open-sourced for the first time

Video object segmentation (VOS) is a fundamental task in computer vision with many important applications, such as video editing, scene understanding, and autonomous driving. In interactive video object segmentation, the user gives simple annotations of the target object in one frame (for example, a few simple strokes on the object), and the algorithm produces a segmentation of the target throughout the video; the user can interact with the video repeatedly to keep improving segmentation quality until satisfied.

Since interactive video segmentation requires multiple rounds of interaction between the user and the video, both the timeliness and the accuracy of the algorithm matter. MA-Net uses a unified framework for interaction and propagation to generate segmentation results, which ensures timeliness. In addition, MA-Net stores and updates information from multiple rounds of user interaction in a memory module, which improves segmentation accuracy. The following table shows the model's performance on the DAVIS2017 dataset.

4. ECCV 2020 Spotlight video segmentation model CFBI is open-sourced for the first time; solutions based on CFBI won first place in two tracks of the CVPR 2021 Video Object Segmentation Challenge

In video object segmentation, the semi-supervised setting has attracted much attention in recent years. Given the target annotation in the first frame or several reference frames of a video, a semi-supervised method must accurately track and segment the target object's mask throughout the whole video. Previous methods focus on extracting robust features of the given foreground object, which is very difficult in complex scenes with occlusion, scale change, or similar objects in the background. Motivated by this, we rethink the importance of background features and propose a video object segmentation method based on collaborative foreground-background integration (CFBI).

CFBI extracts the foreground and background features of the target in a dual manner and implicitly learns to increase the contrast between them, improving segmentation accuracy. Building on CFBI, we further introduce multi-scale matching and atrous matching strategies for video objects, yielding a more robust and efficient framework, CFBI+.
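A toy sketch of the foreground-background contrast: each pixel embedding of the current frame is matched against both foreground and background reference embeddings, and whichever matches better wins. This is purely illustrative of the idea, with made-up dimensions, not the CFBI implementation.

```python
# Foreground vs. background nearest-neighbour matching, in numpy.
import numpy as np

rng = np.random.default_rng(0)
fg_ref = rng.normal(size=(50, 32))    # reference-frame foreground embeddings
bg_ref = rng.normal(size=(200, 32))   # reference-frame background embeddings
pix = rng.normal(size=(1000, 32))     # current-frame pixel embeddings

def best_match(q, ref):
    """Cosine similarity of each query to its nearest reference embedding."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    rn = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    return (qn @ rn.T).max(axis=1)

fg_score, bg_score = best_match(pix, fg_ref), best_match(pix, bg_ref)
mask = fg_score > bg_score            # pixel is foreground if it matches FG better
print(f"{mask.mean():.1%} of pixels assigned to the foreground")
```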

CFBI-series methods hold the record for single-model accuracy in video object segmentation. Notably, in the CVPR 2020 Video Object Segmentation Challenge, Baidu Research's single model outperformed the Megvii-Tsinghua team's ensemble of three strong models. In the recently concluded CVPR 2021 Video Object Segmentation Challenge, solutions based on CFBI took first place in two tracks. The following table shows CFBI's performance on the DAVIS 2017 dataset.

5. ICCV 2021 unsupervised monocular depth estimation model ADDS is open-sourced for the first time

ADDS is an unsupervised monocular depth estimation model built on day and night images. It exploits the complementary nature of day and night image data to mitigate the impact of large domain gaps and illumination changes on depth estimation accuracy, achieving state-of-the-art all-day depth estimation on the challenging Oxford RobotCar dataset. The following table shows the test performance of the ADDS model on both day and night datasets.

The complete open source project code is available at the GitHub address below; remember to Star it to show your support:

Github.com/PaddlePaddl…

---------- END ----------

Baidu Geek Talk

Baidu's official technology public account is now online!

Technical insights · Industry news · Online salons · Industry conferences

Recruitment info · Internal referrals · Technical books · Baidu swag