
  • 🎉 Statement: as one of the bloggers with the most practical content in the AI field, ❤️ I strive to live up to your time and to you ❤️
    • 🍊 Dear readers, happy Lantern Festival
    • 📆 Last updated: February 12, 2022
    • 🍊 This article shares a fun vision-language understanding and generation task that is expected to be popular in 2022
    • 🍊 The road of AI is long and hard; thanks to the countless giants who came before us for their dedication

    📕 One-click image captioning and visual question answering — the official Demo is ready to play with


    The paper's basic information is as follows:

    • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
    • Address: arxiv.org/pdf/2201.12…
    • Code address: github.com/salesforce/…
    • Demo address: huggingface.co/Spaces/akha…

    📕 Official Demo in action


    Vision-language understanding and generation: the three-step operation is as follows

    1. Upload an image you are interested in
    2. Click the Submit button below
    3. Wait a few seconds; the corresponding image caption is generated on the right (a scripted equivalent is sketched below)
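    If you prefer to run the same captioning step locally instead of through the Demo page, here is a minimal sketch. It assumes the later Hugging Face transformers port of BLIP and the checkpoint name Salesforce/blip-image-captioning-base, plus a local file example.jpg — none of these come from the original post.

```python
# Minimal captioning sketch (assumed transformers BLIP port; not the official Demo code).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")    # the image you would upload in the Demo
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)   # decode a short caption
print(processor.decode(out[0], skip_special_tokens=True))
```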

    Advanced usage: intelligent Q&A

    Question: if I upload a photo of myself, will it directly guess what I'm thinking?

    Some official examples from the paper: red is the question, green is the answer. Artificial intelligence is simply the GOAT (YYDS)
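    The question-answering mode can be sketched the same way. The snippet below assumes the transformers BLIP VQA port and the checkpoint name Salesforce/blip-vqa-base, which are not mentioned in the original post; the question string is just an example.

```python
# Minimal VQA sketch (assumed transformers BLIP VQA port; not the official Demo code).
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("example.jpg").convert("RGB")
question = "What is the person in the picture doing?"     # the "red" question
inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))  # the "green" answer
```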

    📕 Network Structure


    Mixture of encoder-decoder (MED)

    The proposed framework

    The researchers use a vision Transformer (ViT) as the image encoder: it divides the input image into patches, encodes them into a sequence of embeddings, and uses an additional [CLS] token to represent the global image feature. Compared with methods that use a pre-trained object detector for visual feature extraction, using ViT is more computation-friendly and has been adopted by many recent methods.
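    To make the patch-plus-[CLS] idea concrete, here is a toy sketch of how a ViT-style image encoder turns an image into a sequence of embeddings; the sizes and module names are illustrative assumptions, not taken from the official code.

```python
# Toy sketch of ViT-style patch embedding plus an extra [CLS] token (illustrative only).
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                            # one 224x224 RGB image
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # 16x16 patches -> 768-d embeddings

patches = patch_embed(img).flatten(2).transpose(1, 2)        # (1, 196, 768): 14x14 patch tokens
cls_token = nn.Parameter(torch.zeros(1, 1, 768))             # learnable [CLS] = global image feature
tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1)
print(tokens.shape)                                          # (1, 197, 768) sequence fed to the encoder
```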

    In order to pre-train a unified model with both understanding and generation capabilities, the researchers propose a multi-task model, the multimodal mixture of encoder-decoder (MED), which can operate as any of the following three modules (a toy sketch of the three modes follows this list):

    • Unimodal encoder
    • Image-grounded text encoder
    • Image-grounded text decoder
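    In the sketch below, one text stream passes through self-attention (bidirectional or causal) and, for the image-grounded modules, cross-attention to the ViT features. All tensor shapes and module names here are illustrative assumptions, not the official implementation.

```python
# Toy sketch of the three MED modes (illustrative; not the official BLIP code).
import torch
import torch.nn as nn

hidden = 768
image_embeds = torch.randn(1, 197, hidden)   # ViT output: [CLS] + patch embeddings
text_embeds = torch.randn(1, 12, hidden)     # token embeddings of a caption

self_attn = nn.MultiheadAttention(hidden, num_heads=12, batch_first=True)
cross_attn = nn.MultiheadAttention(hidden, num_heads=12, batch_first=True)

# 1) Unimodal text encoder: bidirectional self-attention only (BERT-like).
uni_out, _ = self_attn(text_embeds, text_embeds, text_embeds)

# 2) Image-grounded text encoder: bidirectional self-attention + cross-attention to the image.
enc_out, _ = self_attn(text_embeds, text_embeds, text_embeds)
enc_out, _ = cross_attn(enc_out, image_embeds, image_embeds)

# 3) Image-grounded text decoder: causal self-attention + cross-attention to the image.
L = text_embeds.size(1)
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # block future tokens
dec_out, _ = self_attn(text_embeds, text_embeds, text_embeds, attn_mask=causal_mask)
dec_out, _ = cross_attn(dec_out, image_embeds, image_embeds)
```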

    Pre-training objective

    During pre-training, the researchers jointly optimize three objectives: two understanding-based objectives and one generation-based objective. Each image-text pair requires only one forward pass through the computationally heavier visual Transformer and three forward passes through the text Transformer, in which different functionalities are activated to compute the following three losses:

    • Image-Text Contrastive loss (ITC) activates the unimodal encoder. It aims to align the feature spaces of the visual and text Transformers by encouraging positive image-text pairs (rather than negative pairs) to have similar representations (a simplified sketch of this loss follows the list);

    • Image-Text Matching loss (ITM) activates the image-grounded text encoder. It aims to learn image-text multimodal representations that capture the fine-grained alignment between vision and language;

    • Language Modeling loss (LM) activates the image-grounded text decoder, which aims to generate textual descriptions given an image.
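    As a concrete but simplified illustration of the ITC objective, the sketch below computes an in-batch image-to-text and text-to-image contrastive loss over projected [CLS] features. It omits BLIP's momentum encoder and soft labels, and all names and sizes are illustrative assumptions.

```python
# Simplified in-batch image-text contrastive (ITC) loss; omits momentum encoder and soft labels.
import torch
import torch.nn.functional as F

batch = 8
image_feat = F.normalize(torch.randn(batch, 256), dim=-1)  # projected image [CLS] features
text_feat = F.normalize(torch.randn(batch, 256), dim=-1)   # projected text [CLS] features
temperature = 0.07

logits = image_feat @ text_feat.t() / temperature  # similarity of every image-text pair in the batch
targets = torch.arange(batch)                      # matched (positive) pairs lie on the diagonal
loss_i2t = F.cross_entropy(logits, targets)        # image-to-text direction
loss_t2i = F.cross_entropy(logits.t(), targets)    # text-to-image direction
itc_loss = (loss_i2t + loss_t2i) / 2
print(itc_loss.item())
```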

    In order to achieve efficient pre-training while benefiting from multi-task learning, the text encoder and decoder share all parameters except the self-attention (SA) layers. Specifically, the encoder uses bidirectional self-attention to build representations of the current input tokens, while the decoder uses causal self-attention to predict the next token.

    In addition, the embedding layer, the cross-attention (CA) layer, and the FFN play similar roles in the encoding and decoding tasks, so sharing these layers improves training efficiency while benefiting from multi-task learning.
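    To show what this sharing scheme looks like in code, the toy layer below keeps separate self-attention modules for the encoder and decoder paths while reusing one cross-attention module and one FFN for both. It is an illustrative sketch under these assumptions, not the paper's implementation.

```python
# Toy sketch: encoder/decoder paths share the CA and FFN modules but keep separate SA modules.
import torch.nn as nn

class SharedEncDecLayer(nn.Module):
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        # separate self-attention (bidirectional vs. causal masking is applied at call time)
        self.enc_self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.dec_self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # cross-attention and FFN are created once and referenced by both paths
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

layer = SharedEncDecLayer()
shared = sum(p.numel() for m in (layer.cross_attn, layer.ffn) for p in m.parameters())
print(f"parameters shared between the encoder and decoder paths: {shared / 1e6:.1f}M")
```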

    • 🍊 Speaking of encoders, decoders, and layer sharing, an earlier network also improved its results in a similar way; if you are interested, you can refer to my blog posts below, and I hope they bring you some inspiration
    • 🍊 [Deep Learning Starter Project] Give a portrait a new painting style [❤️ CVPR 2020 style transfer NICE-GAN ❤️]
    • 🍊 NICE-GAN environment setup — an effective tutorial for model training

    📕 Experiments: benchmark comparison on datasets


    The experimental results

    The researchers implemented the model in PyTorch and pre-trained it on two 16-GPU nodes. The image Transformer is initialized from a ViT pre-trained on ImageNet, and the text Transformer is initialized from BERT_base.
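    A hedged sketch of that initialization is shown below, loading an ImageNet-pretrained ViT via timm and BERT-base via transformers; the checkpoint names are common public ones used here as assumptions, not quoted from the paper.

```python
# Sketch: initialize the image encoder from an ImageNet-pretrained ViT and the text
# Transformer from BERT-base (checkpoint names are assumptions, not from the paper).
import timm
from transformers import BertModel

vit = timm.create_model("vit_base_patch16_224", pretrained=True)  # image encoder backbone
bert = BertModel.from_pretrained("bert-base-uncased")             # text Transformer backbone

print(f"ViT parameters:  {sum(p.numel() for p in vit.parameters()) / 1e6:.1f}M")
print(f"BERT parameters: {sum(p.numel() for p in bert.parameters()) / 1e6:.1f}M")
```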

    • Mainstream datasets: COCO, Flickr
    • Super-resolution reconstruction datasets: DIV2K & Flickr2K

    The work in this paper is quite interesting. If you are interested, you can download the original paper for a detailed read; the links are as follows

    • Address: arxiv.org/pdf/2201.12…
    • Code address: github.com/salesforce/…
    • Demo address: huggingface.co/Spaces/akha…

    📙 May your future be bright, and may you reach for the stars


  • 🎉 As one of the bloggers with the most practical content in the AI field, ❤️ I strive to live up to the time and to you ❤️
  • ❤️ Every day that has passed, I believe you have been working hard; I wish you a promising future
    • 🍊 Deep learning model training and inference — recommended reading order for the basic environment setup blogs [basic installation — carefully organized for you]

    • 🍊 Likes 👍, favorites ⭐, and comments 📝 are the blogger's greatest motivation to keep writing and updating high-quality posts!