At the recently concluded CVPR 2021, the top conference in computer vision and pattern recognition, the results of various international challenges were announced.

Alibaba's Tao Technology multimedia algorithm & video content understanding algorithm team took home, in one sweep:

🎉 3 international championships 🎉

🎉 1 international runner-up 🎉

🎉 1 international third place 🎉

The technical fields cover image caption generation, large-scale instance-level object recognition, multimodal video emotion understanding, and video human-object interaction.

As a leading multimedia-algorithm team in the industry, the Tao Technology team focuses on building a video content perception and understanding algorithm platform featuring "end-cloud integration and cross-modal understanding". Its work covers AR live streaming, 3D digital venues, intelligent content production, content moderation, retrieval, high-level semantic understanding, and other technical areas; it supports Taobao content businesses such as Taobao Live, Browse, and Diantao, and provides content capabilities to the whole Alibaba Group through a self-developed content platform.

Below are the details of the three championship competitions and our solutions.

🏆 Champion 🏆 VizWiz Image Captioning

▐ Topic

Workshop: CVPR 2021 VizWiz Grand Challenge Workshop; Track: Image Captioning

▐ Contestants

Hong Li, Hong Ji, Yong Liang, Yu Qi, Shao Lin, Ding Ren

▐ Technical Field

Image description generation

▐ Competition Introduction

The VizWiz Grand Challenge has been held since 2018 and aims to use computer vision technology to help blind and visually impaired people "see" the world.

The input of the task is an image taken by a blind person, and the output is a description of that image.

Unlike other image captioning datasets, the competition's images were taken by visually impaired people and are often of poor quality, which makes the task more difficult.

▐ Results

We won first place with a CIDEr-D score of 94.06, far higher than the second-place score of 71.98.

Our score also surpassed last year's winner, IBM, whose CIDEr-D score was 81.04.

▐ Difficulties & Solutions

There are two difficulties in this task:

  1. Poor image quality: the images cover a variety of indoor and outdoor scenes, and because the photographers are visually impaired, shots are often out of focus, blurred, incompletely framed, or occluded;
  2. Many descriptions require understanding the text, objects, colors, and other information in the image, which calls for detail-level capabilities such as OCR and object detection.

We address these difficulties as follows:

  1. Based on the characteristics of VizWiz images, we use grid features extracted by a Swin Transformer in place of object features, so that every image region is fully represented (see the sketch after this list).
  2. Considering that OCR and object information can guide caption generation, we extract OCR text and object-detection category information as supplementary features.
  3. Not all images contain OCR information, so we combine multiple complementary models: a vision-only model handles data without OCR, while a vision + text (OCR + object category) multimodal model handles data with rich OCR information.
  4. For the candidates generated by the different models, and given that the final metric is CIDEr, we merge the results through a fusion of self-CIDEr and OCR-maximization strategies.
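To make items 1 and 2 above concrete, here is a minimal PyTorch sketch of the general idea: flatten a backbone feature map into grid tokens, append embeddings of OCR/object-category tokens, and decode a caption over that joint memory. The backbone, dimensions, vocabulary, and token counts are placeholder assumptions, not the team's actual configuration.

```python
import torch
import torch.nn as nn

class GridCaptioner(nn.Module):
    """Caption decoder over grid features plus OCR/object-category tokens.

    Stand-in for the described setup: any backbone that yields a
    [B, C, H, W] feature map (e.g. the last Swin stage) works here.
    """

    def __init__(self, feat_dim=1024, d_model=512, vocab_size=10000):
        super().__init__()
        self.grid_proj = nn.Linear(feat_dim, d_model)         # grid tokens
        self.extra_embed = nn.Embedding(vocab_size, d_model)  # OCR / category tokens
        self.word_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feat_map, extra_ids, caption_ids):
        # feat_map: [B, C, H, W] -> [B, H*W, d_model] grid tokens
        grid = self.grid_proj(feat_map.flatten(2).transpose(1, 2))
        extra = self.extra_embed(extra_ids)            # [B, n_extra, d_model]
        memory = torch.cat([grid, extra], dim=1)       # visual + text memory
        tgt = self.word_embed(caption_ids)
        L = caption_ids.size(1)                        # causal mask for decoding
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

# Toy usage with random tensors standing in for real features and tokens.
model = GridCaptioner()
logits = model(torch.randn(2, 1024, 7, 7),             # backbone feature map
               torch.randint(0, 10000, (2, 32)),       # OCR + category token ids
               torch.randint(0, 10000, (2, 16)))       # shifted caption tokens
print(logits.shape)                                    # torch.Size([2, 16, 10000])
```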

▐ Application Scenarios

As a task combining vision and NLP, image captioning can be applied to automatically generating content titles for Internet products, and can also help blind and visually impaired users better perceive the world.

▐ Related Links

  1. Workshop: vizwiz.org/workshops/2…
  2. Challenge: eval.ai/web/challen…

🏆 Champion 🏆 Herbarium 2021 - Half-Earth Challenge

▐ Topic

Workshop: The Eighth Workshop on Fine-Grained Visual Categorization; Task: Fine-Grained Plant Species Identification

▐ Contestants

First year, LAN 枻, Liu Xiao, there are neighbors, warm rain, Ji Yu, hedge leisurely

▐ Technical Field

Large-scale instance level object recognition

▐ Competition Introduction

Herbarium 2021 is a competition in the CVPR 2021 FGVC8 Workshop. It addresses instance-level fine-grained recognition, and the workshop is in its eighth consecutive year.

The Herbarium 2021 dataset contains about 2.5 million plant specimen images covering roughly 65,000 species from the Americas, Oceania, and other regions spanning half the Earth, collected from several large botanical institutions. It is used to train plant recognition algorithms, assist botanists in identifying plants, and help discover and protect new species.

The dataset has a long-tailed distribution: the smallest category has only 3 samples. At the same time, different plants can look very similar while different specimens of the same plant can differ greatly, which makes instance-level recognition very challenging.

▐ Results

We won first place in this competition with an F1 score of 0.757, well ahead of the 0.735 and 0.689 of the second and third places.

▐ Difficulties & Solutions

This task has two main difficulties:

  1. There are many plant species with very fine-grained categories. Different plants can look very similar, while different specimens of the same plant can look quite different, so categories are easily confused and hard to distinguish.
  2. The sample distribution is unbalanced with a long tail; the smallest category has only 3 samples, so improving accuracy on long-tail categories is crucial.

We address these difficulties as follows:

We cast instance-level plant recognition in natural scenes as a large-scale fine-grained feature representation problem, and propose self-attention pooling to strengthen local feature representation. An imbalance-aware sampler and an adaptive per-class loss are introduced to handle the skewed category distribution. In addition, mixed-precision, multi-machine multi-GPU training enables fast iteration at a scale of nearly three million images (two of these components are sketched below).
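As an illustration of two of these components, here is a minimal sketch, assuming a simple single-head formulation of self-attention pooling and inverse-frequency sampling. The team's exact designs are not published in this post, so treat the shapes and weighting scheme as assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

class SelfAttentionPooling(nn.Module):
    """Pool a [B, N, D] set of local features into one [B, D] vector
    with learned attention weights, instead of plain average pooling."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one attention logit per local feature

    def forward(self, x):                            # x: [B, N, D]
        attn = torch.softmax(self.score(x), dim=1)   # [B, N, 1]
        return (attn * x).sum(dim=1)                 # weighted sum -> [B, D]

# Imbalance-aware sampling: draw each image with probability ~ 1 / class
# frequency, so 3-sample tail classes are seen as often as head classes.
labels = torch.randint(0, 100, (10_000,))            # toy labels; real set has ~65k classes
class_count = torch.bincount(labels).float()
weights = 1.0 / class_count[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```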

We also implemented efficient online hard-example mining across tens of thousands of classes, which greatly improves the generalization of the features in complex scenes (a generic sketch follows). In the end, we won the championship with a 2.2% lead over the runner-up.
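The hard-example-mining idea can be illustrated with a generic batch-level sketch: back-propagate only through the highest-loss fraction of samples in each batch. The keep ratio and scale here are illustrative assumptions, not the team's settings.

```python
import torch
import torch.nn.functional as F

def ohem_loss(logits, targets, keep_ratio=0.25):
    """Online hard-example mining: keep only the hardest
    `keep_ratio` fraction of per-sample losses in the batch."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")  # [B]
    k = max(1, int(keep_ratio * per_sample.numel()))
    hard, _ = per_sample.topk(k)        # largest losses = hardest samples
    return hard.mean()

# Toy usage: 64 samples, 100 classes.
logits = torch.randn(64, 100, requires_grad=True)
targets = torch.randint(0, 100, (64,))
loss = ohem_loss(logits, targets)
loss.backward()
```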

▐ Application Scenarios

Instance-level fine-grained recognition distinguishes subtle visual differences between objects to achieve precise recognition, and is widely used in product recognition, animal and plant recognition, pedestrian recognition, landmark recognition, and other fields.

▐ Related Links

  1. Workshop: sites.google.com/view/fgvc8/…
  2. Challenge: sites.google.com/view/fgvc8/…
  3. Kaggle leaderboard: www.kaggle.com/c/herbarium…

🏆 Champion 🏆 ActivityNet Home Action Genome Challenge

▐ Topic

Workshop: International Challenge on Activity Recognition; Task: Home Action Genome Challenge

▐ Contestants

Shao Lin, Liao Yue (Beihang), Yong Liang, Ye Ying, Li You, Liu Cai (Beihang)

▐ Technical Field

Video human-object interaction detection

▐ Competition Introduction

Held for the first time this year at the CVPR 2021 ActivityNet Workshop, the Home Action Genome Challenge, organized by Stanford professor Fei-Fei Li's research group, provides a large-scale multi-view video dataset and asks participants to detect human-object interaction relationships in videos through multimodal video analysis.

▐ Results

We ranked first in the competition with an accuracy of 76.5%, far ahead of the 68.4% and 65.7% of the second and third places.

(Figure: Home Action Genome Challenge award certificate)

▐ Difficulties & Solutions

There are three difficulties in this task:

  1. The dataset's everyday home scenes are complex, making detection of people and objects difficult
  2. Human-object relations include both action relations and spatial relations, which depend on different visual cues
  3. Each person-object pair can have multiple relationships, and a prediction is only counted as correct if all of them are correct

We address these difficulties as follows:

Stronger detection models: we adopt SOTA detection models with Swin Transformer and ResNeSt backbones, and improve detection accuracy through various data-augmentation training strategies and multi-scale fusion at inference time (sketched below).
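A minimal sketch of multi-scale fusion at inference time, assuming a hypothetical `detector(img) -> (boxes, scores)` callable; the real pipeline, scales, and NMS threshold are not specified in this post.

```python
import torch
from torchvision.ops import nms

def multi_scale_detect(detector, image, scales=(0.75, 1.0, 1.25), iou_thr=0.6):
    """Run a detector at several input scales, map boxes back to the
    original resolution, and fuse everything with NMS.

    `detector(img) -> (boxes [N,4] xyxy, scores [N])` is an assumed
    interface; substitute your own model's inference call.
    """
    all_boxes, all_scores = [], []
    h, w = image.shape[-2:]
    for s in scales:
        scaled = torch.nn.functional.interpolate(
            image[None], size=(int(h * s), int(w * s)),
            mode="bilinear", align_corners=False)[0]
        boxes, scores = detector(scaled)
        all_boxes.append(boxes / s)      # back to original coordinates
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)   # fuse duplicates across scales
    return boxes[keep], scores[keep]

# Toy detector returning one fixed box, just to exercise the function.
dummy = lambda img: (torch.tensor([[10., 10., 50., 50.]]), torch.tensor([0.9]))
boxes, scores = multi_scale_detect(dummy, torch.rand(3, 224, 224))
```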

Visual features for relation recognition: first, Swin Transformer is integrated into a two-stage relation detection network for end-to-end training; we then improve the one-stage relation detection network, directly extracting the pair <person, object> and producing the triplet <person, object, relationship> through a cascade structure. Strategically, we use visual features to decide action relations and spatial positions as input to decide spatial relations (see the sketch below).
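A minimal sketch of the two-branch idea: action relations are classified from pooled visual features of the pair, spatial relations from box geometry alone. The feature dimensions and relation counts are placeholders, and the detection and cascade stages are omitted.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Classify a <person, object> pair: action relations from pooled
    visual features, spatial relations from box geometry only."""

    def __init__(self, vis_dim=1024, n_action=20, n_spatial=6):
        super().__init__()
        self.action_fc = nn.Sequential(             # visual branch
            nn.Linear(vis_dim * 2, 512), nn.ReLU(),
            nn.Linear(512, n_action))
        self.spatial_fc = nn.Sequential(            # geometry branch
            nn.Linear(8, 64), nn.ReLU(),            # two boxes -> 8 coords
            nn.Linear(64, n_spatial))

    def forward(self, person_feat, object_feat, person_box, object_box):
        vis = torch.cat([person_feat, object_feat], dim=-1)
        geo = torch.cat([person_box, object_box], dim=-1)   # normalized xyxy
        return self.action_fc(vis), self.spatial_fc(geo)

head = RelationHead()
a_logits, s_logits = head(torch.randn(4, 1024), torch.randn(4, 1024),
                          torch.rand(4, 4), torch.rand(4, 4))
```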

Generation strategies based on statistical bias: when generating the final human-object interaction triplets, we combine several strategies that fuse the co-occurrence probability of <person, object, relationship> with statistical-bias weighting (illustrated below).
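A minimal sketch of the statistical-bias idea, assuming a P(relation | object) prior counted on the training set and a simple linear blend; the actual fusion strategy and weights are not disclosed in this post.

```python
import torch

def apply_cooccurrence_prior(rel_logits, obj_ids, prior, alpha=0.5):
    """Blend model scores with a training-set co-occurrence prior.

    rel_logits: [B, R] relation scores for each <person, object> pair
    obj_ids:    [B] detected object category per pair
    prior:      [n_obj, R] P(relation | object) counted on training data
    alpha:      weight of the statistical-bias term (tuning assumption)
    """
    probs = rel_logits.softmax(dim=-1)
    return (1 - alpha) * probs + alpha * prior[obj_ids]

# Toy usage: 3 pairs, 6 relation classes, 10 object categories.
prior = torch.rand(10, 6)
prior = prior / prior.sum(dim=1, keepdim=True)   # normalize to a distribution
scores = apply_cooccurrence_prior(torch.randn(3, 6), torch.tensor([1, 4, 7]), prior)
```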

▐ Application Scenarios

Video human-object interaction detection extracts dynamic <person, object, relationship> structured information from video, and can be applied to video information structuring, human-computer interaction, and other scenarios in the future.

▐ Related Links

  1. Challenge: homeactiongenome.org/results.htm…
  2. Workshop: activity-net.org/challenges/…

In addition to the three championships above, we also won second place in the Hotel-ID 2021 - Hotel Recognition Challenge and third place in the Evoked Expressions from Videos (EEV) Challenge, putting the team at the forefront of the multimedia-algorithm field.

The Tao Technology multimedia algorithm team said: "As video makes up an ever-higher share of media, video information brings an information-overload problem for both individuals and platforms. Multidimensional structured representation of video content will be one of the hot research directions in the vision field. In the future, we will integrate text, speech, visual, and other multimodal information to better understand video content, so that users can see more of the content they like, spend less time selecting information, and enjoy a better viewing experience."

About us

We are responsible for Alibaba's core e-commerce recommendation algorithms, including information-flow recommendation for core scenarios such as the Taobao/Tmall home page and the shopping link. We are committed to delivering billions of accurate, personalized recommendations to hundreds of millions of users every day, creating the ultimate shopping experience. The team has worked for years across many areas of artificial intelligence, including large-scale online deep learning, deep reinforcement learning, graph embedding learning, edge computing, intelligent interaction, natural language understanding, causal inference, and commercialization mechanisms. Welcome to join us and explore the infinite possibilities of artificial intelligence in e-commerce. Resumes are welcome at: [email protected]