This is Day 22 of my participation in the 2022 First Challenge. For details, see: First Challenge 2022.
🎉 Statement: as one of the bloggers in the AI field, ❤️ may I live up to the time and to you, my readers ❤️
- 🍊 Happy Lantern Festival, readers!
- 📆 Last updated: February 12, 2022
- 🍊 This article shares a fun vision-language understanding and generation task that is expected to be popular in 2022
- 🍊 The road of AI is long and hard; thanks to the countless giants who came before us for their dedication
📕 One-click image caption generation and visual question answering: the official demo is ready to play with
The basic information of the paper is as follows:
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- Address: arxiv.org/pdf/2201.12…
- Code address: github.com/salesforce/…
- Demo address: huggingface.co/Spaces/akha…
📕 Official Demo effect
Vision-language understanding and generation: using the demo is a three-step process
- Upload an image of your choice
- Click the Submit button below
- Wait a few seconds; the corresponding image caption is generated on the right (a local-code equivalent is sketched below)
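If you would rather try captioning outside the web demo, here is a minimal sketch. It assumes the Hugging Face `transformers` port of BLIP and its public `Salesforce/blip-image-captioning-base` checkpoint rather than the official demo code itself, so treat it as one possible way to reproduce the effect:

```python
# Minimal image-captioning sketch (assumes the `transformers` BLIP port,
# not the official Salesforce/BLIP repository that the demo runs).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; this COCO URL is just a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```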
Guided play: intelligent Q&A
Question: if I upload a photo of myself, will it directly guess what I am thinking?
Some official examples from the paper: red is the question, green is the answer. AI, YYDS (the greatest)
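For the visual question answering mode, a similar hedged sketch works; it again assumes the `transformers` port, here with the public `Salesforce/blip-vqa-base` checkpoint:

```python
# Minimal visual question answering sketch (assumed `transformers` BLIP port).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

question = "How many cats are in the picture?"
inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))  # e.g. "2"
```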
📕 Network Structure
Multimodal mixture of encoders and decoders
The proposed framework
The researchers use a visual Transformer (ViT) as the image encoder: the input image is split into patches, the patches are encoded into a sequence of embeddings, and an additional [CLS] token represents the global image feature. Compared with using a pre-trained object detector for visual feature extraction, ViT is more computation-friendly and has been adopted by many recent methods.
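To make the patch-plus-[CLS] idea concrete, here is a toy PyTorch sketch of how an image becomes a sequence of embeddings; the sizes are common ViT-Base settings assumed for illustration, not the exact BLIP configuration:

```python
# Toy sketch: split an image into patches, project them, and prepend a [CLS] token.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)   # a batch with one RGB image
patch_size, dim = 16, 768           # assumed ViT-Base-like settings

# 14 x 14 = 196 non-overlapping 16x16 patches, each projected to a `dim`-d embedding.
patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 196, 768)

# A learnable [CLS] token whose output will represent the global image feature.
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1)  # (1, 197, 768)

print(tokens.shape)  # this sequence is what the Transformer encoder layers consume
```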
To pre-train a unified model with both understanding and generation capabilities, the researchers propose a multi-task model, MED (multimodal mixture of encoder-decoder), which can operate as any of the following three components:
- Unimodal encoder
- Image-grounded text encoder
- Image-grounded text decoder
Pre-training objective
During pre-training, the researchers jointly optimize three objectives: two understanding-based objectives and one generation-based objective. Each image-text pair requires only one forward pass through the computation-heavier visual Transformer, and three forward passes through the text Transformer, in which different functionalities are activated to compute the following three losses (a simplified sketch combining them follows the list):
- Image-Text Contrastive loss (ITC) activates the unimodal encoder. It aims to align the feature spaces of the visual and text Transformers by encouraging positive image-text pairs (rather than negative pairs) to have similar representations;
- Image-Text Matching loss (ITM) activates the image-grounded text encoder. It aims to learn image-text multimodal representations that capture the fine-grained alignment between vision and language;
- Language Modeling loss (LM) activates the image-grounded text decoder, which aims to generate textual descriptions given an image.
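As a rough illustration of how the three losses are combined, here is a simplified PyTorch sketch; all tensors are random placeholders, and details of the real BLIP training (such as momentum distillation and hard-negative mining) are deliberately omitted:

```python
# Schematic sketch of the three pre-training losses (not the authors' code).
import torch
import torch.nn.functional as F

B, dim = 8, 256
img_feat = F.normalize(torch.randn(B, dim), dim=-1)   # projected image [CLS] features
txt_feat = F.normalize(torch.randn(B, dim), dim=-1)   # projected text [CLS] features
temperature = 0.07

# ITC: matched image-text pairs (the diagonal) should have the highest similarity.
sim = img_feat @ txt_feat.t() / temperature
targets = torch.arange(B)
itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

# ITM: binary matched / not-matched classification on the multimodal representation.
itm_logits = torch.randn(B, 2)                 # stand-in for the image-grounded encoder head
itm_labels = torch.ones(B, dtype=torch.long)   # 1 = matched pair, 0 = negative pair
itm = F.cross_entropy(itm_logits, itm_labels)

# LM: next-token cross-entropy from the image-grounded text decoder.
vocab, T = 30522, 12
lm_logits = torch.randn(B, T, vocab)           # stand-in for decoder outputs
lm_labels = torch.randint(0, vocab, (B, T))    # caption tokens (shifted by one)
lm = F.cross_entropy(lm_logits.reshape(-1, vocab), lm_labels.reshape(-1))

loss = itc + itm + lm
print(loss.item())
```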
To achieve efficient pre-training while exploiting multi-task learning, the text encoder and decoder share all parameters except the self-attention (SA) layers. Specifically, the encoder uses bidirectional self-attention to build representations of the current input tokens, while the decoder uses causal self-attention to predict the next token.
In addition, the embedding layers, cross-attention (CA) layers, and FFNs play similar roles in the encoding and decoding tasks, so sharing these layers improves training efficiency and lets the model benefit from multi-task learning (a toy sketch of this mode switching follows).
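The toy layer below (a heavily simplified sketch, not the authors' code) shows the idea: separate self-attention modules for the bidirectional and causal cases, shared cross-attention and FFN, and image features that are only attended to in the image-grounded modes:

```python
# Toy MED-style layer: which modules run depends on the mode being activated.
import torch
import torch.nn as nn

dim, heads = 768, 12
# In BLIP, the encoder and decoder keep separate self-attention layers...
sa_bidir  = nn.MultiheadAttention(dim, heads, batch_first=True)  # bidirectional SA (encoder)
sa_causal = nn.MultiheadAttention(dim, heads, batch_first=True)  # causal SA (decoder)
# ...while cross-attention and the FFN are shared between them.
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

def med_layer(text, image_feats=None, causal=False):
    if causal:
        T = text.size(1)  # mask out future positions for left-to-right decoding
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x, _ = sa_causal(text, text, text, attn_mask=mask)
    else:
        x, _ = sa_bidir(text, text, text)
    if image_feats is not None:          # image-grounded modes attend to patch features
        x, _ = cross_attn(x, image_feats, image_feats)
    return ffn(x)

text = torch.randn(1, 5, dim)
img  = torch.randn(1, 197, dim)
print(med_layer(text).shape)                    # unimodal text encoder  -> ITC
print(med_layer(text, img).shape)               # image-grounded encoder -> ITM
print(med_layer(text, img, causal=True).shape)  # image-grounded decoder -> LM
```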
- 🍊 Speaking of encoders, decoders, and layer sharing: an earlier network also used this trick to good effect. If you are interested, you can check out my related blog posts below; I hope they bring you some inspiration
- 🍊 [Introduction to Deep Learning Projects] Change the painting style of your photos [❤️ CVPR 2020 style transfer NICE-GAN ❤️]
- 🍊 NICE-GAN environment setup: an effective tutorial for model training
📕 Experiments: benchmark comparison across datasets
The experimental results
The researchers implemented the model in PyTorch and pre-trained it on two 16-GPU nodes. The image Transformer is initialized from a ViT pre-trained on ImageNet, and the text Transformer is initialized from BERT_base (a loading sketch is given below).
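For reference, such initializations are easy to reproduce with public checkpoints; the sketch below assumes the `timm` ViT-Base and Hugging Face `bert-base-uncased` weights as stand-ins for the exact checkpoints used in the paper:

```python
# Hedged sketch: load an ImageNet-pretrained ViT and BERT_base as starting points.
import timm
from transformers import BertModel

image_encoder = timm.create_model("vit_base_patch16_224", pretrained=True)  # assumed checkpoint
text_backbone = BertModel.from_pretrained("bert-base-uncased")              # assumed checkpoint

print(f"{sum(p.numel() for p in image_encoder.parameters()) / 1e6:.1f}M image-encoder params")
print(f"{sum(p.numel() for p in text_backbone.parameters()) / 1e6:.1f}M text-backbone params")
```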
- Mainstream datasets: COCO, Flickr
- Super-resolution reconstruction datasets: DIV2K & Flickr2K
The work in this paper is quite interesting; if you are interested, you can download the original paper for a closer read. The portal addresses are as follows:
- Address: arxiv.org/pdf/2201.12…
- Code address: github.com/salesforce/…
- Demo address: huggingface.co/Spaces/akha…
📙 May you have a bright future and be able to reach the stars
🎉 As one of the bloggers with the most practical content in the AI field, ❤️ may I live up to the time and to you ❤️. For every day that has passed, I believe you have worked hard, and I wish you a promising future.
- 🍊 Deep learning model training and inference: recommended reading order of my basic environment setup posts [basic installation, carefully organized for you]
- 🍊 Likes 👍, favorites ⭐, and comments 📝 are the biggest motivation for this blogger to keep writing and updating high-quality posts!