Paper: arxiv.org/abs/2108.09…
Code: not open source yet
1. Innovations
A new token-aware cascade contrastive learning (TACO) algorithm is proposed, with two innovations:
- A token-aware contrastive loss that takes the syntactic classes of words into account
- A cascade sampling method that generates a small set of hard negatives for estimating the loss of the multimodal fusion layer
The method sets new SOTA results on three common text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet
2. Conclusion
The authors propose a contrastive learning algorithm, TACO, for learning video-text alignment. It aims to solve two problems in current contrastive learning:
- Lack of fine-grained alignment
- Inefficient sampling for the multimodal fusion layer
Experiments show that TACO is a strong replacement for the traditional contrastive learning pipeline
3. Implementation method
Traditional video-text alignment uses contrastive learning to pre-train on a large number of noisy video-text pairs, then applies the model zero-shot or fine-tunes it for various downstream tasks. TACO improves this pipeline in two ways:
- A token-aware contrastive loss computed by taking the syntactic classes of words into account
The authors argue that, for a given text, nouns and verbs are more likely to correspond to the visual content than function words
- A cascade sampling method that finds a small set of hard negatives for training the multimodal fusion layer
Given a batch of K video-text pairs, ideally each pair would use the remaining K-1 videos (or texts) as negatives when computing the contrastive loss after multimodal fusion:
{video1:text1} is treated as the positive pair, while {video2:text1}, {video3:text1}… {video_K:text1} are treated as negatives,
as are {video1:text2}, {video1:text3}… {video1:text_K}. Running the fusion layer over all of these is far too much computation, and the traditional workaround is random sampling. Instead of random sampling, the authors propose a cascade sampling method: it leverages the video-text alignment scores already computed by the sentence-level and token-level losses (L1 and L2) before the multimodal fusion layer, and thus helps learn the multimodal fusion layer more efficiently without any additional overhead.
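A tiny Python sketch of the pair combinatorics above (names such as video1/text1 mirror the notation; purely illustrative):

```python
K = 8  # batch size

# every cross pairing (video_i, text_j) in the batch
pairs = [(f"video{i}", f"text{j}") for i in range(1, K + 1) for j in range(1, K + 1)]
positives = [(v, t) for v, t in pairs if v.removeprefix("video") == t.removeprefix("text")]
negatives = [(v, t) for v, t in pairs if v.removeprefix("video") != t.removeprefix("text")]

print(len(positives))  # K           = 8 positive pairs
print(len(negatives))  # K * (K - 1) = 56 pairs the fusion layer would otherwise have to score
```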
3.1 Video-text alignment model
As shown in the model figure of the paper, the method can be divided into three modules:
3.1.1 Video encoding module
The video encoding module is a self-attention layer parameterized by $\theta_v$. The authors first extract high-dimensional features with a 2D or 3D CNN, then project them through a linear layer to dimension $d$, matching the self-attention layer. The resulting video features are denoted $x = \{x^1, x^2, \ldots, x^m\} \in \mathbb{R}^{m \times d}$.
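A minimal PyTorch sketch of this module, assuming pre-extracted CNN features of dimension `feat_dim` and a single `nn.TransformerEncoderLayer` standing in for the self-attention layer (the layer count and names are assumptions):

```python
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Self-attention video encoder f_theta_v over pre-extracted 2D/3D CNN features."""
    def __init__(self, feat_dim=1024, d=768, n_heads=12):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d)  # project CNN features to dimension d
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    def forward(self, cnn_feats):           # cnn_feats: (B, m, feat_dim)
        x = self.proj(cnn_feats)            # (B, m, d)
        return self.encoder(x)              # (B, m, d) = {x^1, ..., x^m}
```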
3.1.2 Language encoding module
The input text is tokenized with a tokenizer and encoded with BERT to extract text features, giving a sequence of $n$ text features $y = \{y^1, y^2, \ldots, y^n\} \in \mathbb{R}^{n \times d}$; training updates the language encoder parameters $\theta_t$ to fit the target text.
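A minimal sketch with Hugging Face `transformers`, assuming the `bert-base-uncased` checkpoint (the specific checkpoint and example sentence are assumptions):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")   # parameters theta_t, updated during training

inputs = tokenizer("a gymnast performs on the balance beam", return_tensors="pt")
y = bert(**inputs).last_hidden_state                    # (1, n, d): {y^1, ..., y^n}, y^1 is [CLS]
```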
3.1.3 Multimodal fusion module
The multimodal fusion module consists of a self-attention layer with learnable parameters $\theta_m$. It takes the video features $x \in \mathbb{R}^{m \times d}$ and text features $y \in \mathbb{R}^{n \times d}$ of the two modalities as input and outputs features $z = \{z^1, z^2, \ldots, z^{m+n}\} \in \mathbb{R}^{(m+n) \times d}$. To distinguish video tokens from language tokens, the authors use a token-type embedding layer to learn an embedding for each modality.
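A minimal PyTorch sketch of the fusion module, assuming a single self-attention layer and a two-entry token-type embedding (the layer count and class name are assumptions):

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Self-attention fusion f_theta_m over concatenated video and text tokens."""
    def __init__(self, d=768, n_heads=12):
        super().__init__()
        self.token_type = nn.Embedding(2, d)  # 0 = video token, 1 = language token
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    def forward(self, x, y):                  # x: (B, m, d), y: (B, n, d)
        B, m, _ = x.shape
        n = y.size(1)
        types = torch.cat([torch.zeros(B, m, dtype=torch.long),
                           torch.ones(B, n, dtype=torch.long)], dim=1).to(x.device)
        z = torch.cat([x, y], dim=1) + self.token_type(types)
        return self.encoder(z)                # (B, m+n, d) = {z^1, ..., z^(m+n)}
```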
3.2 Token-aware cascade contrastive loss
Given a batch of $K$ video-text pairs $\{(v_i, t_i)\}_{i=1}^{K}$, the encoders $f_{\theta_v}$ and $f_{\theta_t}$ first produce video features $X = \{x_1, x_2, \ldots, x_K\} \in \mathbb{R}^{K \times m \times d}$ and text features $Y = \{y_1, y_2, \ldots, y_K\} \in \mathbb{R}^{K \times n \times d}$. Let $\overline{x_i} \in \mathbb{R}^{1 \times d}$ be the average over all tokens of video clip $v_i$, and let $\overline{y_i} \in \mathbb{R}^{1 \times d}$ be the first [CLS] token of text $t_i$. Based on $\overline{x_i}$ and $\overline{y_i}$, the sentence-level contrastive loss is:
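A sketch of this loss, assuming the standard bidirectional NCE form with a temperature $\tau$ (the exact form and symbols here are assumptions):

$$
\mathcal{L}_{\text{sent}} = -\frac{1}{K}\sum_{i=1}^{K}\left[\log\frac{\exp(\overline{x_i}\,\overline{y_i}^{\top}/\tau)}{\sum_{j=1}^{K}\exp(\overline{x_i}\,\overline{y_j}^{\top}/\tau)} + \log\frac{\exp(\overline{x_i}\,\overline{y_i}^{\top}/\tau)}{\sum_{j=1}^{K}\exp(\overline{x_j}\,\overline{y_i}^{\top}/\tau)}\right]
$$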
Because this loss uses only the first [CLS] token of the text and the average of the video tokens, it may not push a particular verb or noun token towards the corresponding video frames, so the authors also introduce a token-level contrastive loss:
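A sketch of a plausible form, assuming each selected token $y_i^j$ is matched to its best-aligned video token via max-pooling and contrasted against the other videos in the batch, with IDF weights $w_j$ and temperature $\tau$ (this formulation is an assumption, not copied from the paper):

$$
\mathcal{L}_{\text{token}} = -\frac{1}{K}\sum_{i=1}^{K}\sum_{j\in T_i} w_j \log\frac{\exp\!\big(\max_k \langle y_i^{j}, x_i^{k}\rangle/\tau\big)}{\sum_{i'=1}^{K}\exp\!\big(\max_k \langle y_i^{j}, x_{i'}^{k}\rangle/\tau\big)}
$$

where $T_i$ denotes the set of tokens selected from text $t_i$.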
The equation above requires deciding which tokens should be included; the authors use noun and verb tokens as the targets.
For “people” and “gymnasts”, the latter contains more information, so the authors use the inverse document frequency (IDF) to assign different weights to the words.
For tokens belonging to the same word, the author assigns the same weight.
After computing the token-aware contrastive losses, the authors feed the features of the two modalities into the multimodal fusion layer. As in previous work, they take the feature corresponding to [CLS] among the $(m+n)$ outputs, treat it as the joint representation of the two modalities, and compute a contrastive loss on it:
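A hedged sketch of this loss, assuming a scoring head $g(\cdot)$ on the fused [CLS] feature $z^{\text{[CLS]}}_{ij}$ of video $v_i$ paired with text $t_j$, normalized over the positive pair and its sampled hard negatives $\mathcal{N}_i$ ($g$, $z^{\text{[CLS]}}_{ij}$, and $\mathcal{N}_i$ are notation introduced here for illustration):

$$
\mathcal{L}_{\text{fuse}} = -\frac{1}{K}\sum_{i=1}^{K}\log\frac{\exp\!\big(g(z^{\text{[CLS]}}_{ii})\big)}{\exp\!\big(g(z^{\text{[CLS]}}_{ii})\big)+\sum_{(v_j, t_k)\in\mathcal{N}_i}\exp\!\big(g(z^{\text{[CLS]}}_{jk})\big)}
$$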
To reduce the computational cost, the authors use a cascade sampling strategy to find hard negatives. It works as follows: for each text-video pair, they compute the global (sentence-level) similarity and aggregate the token-level similarities over all tokens of interest, then add the two similarities to obtain an alignment score for that pair. For each text, the top six most-aligned negative videos are selected, and vice versa for each video; the resulting hard-negative pairs are then fed into the multimodal fusion layer.
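A minimal PyTorch sketch of this cascade sampling step (the helper name, the mean/[CLS] pooling, the max-over-video-tokens matching, and k=6 are illustrative assumptions, not the authors' released code):

```python
import torch

def cascade_sample_hard_negatives(x, y, token_mask, k=6):
    """Hypothetical helper: rank in-batch negatives by sentence- plus token-level
    similarity and keep only the hardest ones for the fusion layer.

    x: video features (K, m, d); y: text features (K, n, d)
    token_mask: (K, n) float mask marking the noun/verb tokens of each text
    """
    x_mean = x.mean(dim=1)                                  # (K, d) averaged video tokens
    y_cls = y[:, 0, :]                                      # (K, d) [CLS] token of each text

    # global (sentence-level) similarity: rows index texts, columns index videos
    global_sim = y_cls @ x_mean.t()                         # (K, K)

    # token-level similarity: each noun/verb token matched to its best video token
    tok = torch.einsum('ind,jmd->ijnm', y, x).max(dim=-1).values          # (K, K, n)
    tok_sim = (tok * token_mask.unsqueeze(1)).sum(-1) \
              / token_mask.sum(-1, keepdim=True).clamp(min=1.0)           # (K, K)

    # alignment score used to rank negatives (positives sit on the diagonal)
    align = global_sim + tok_sim
    align.fill_diagonal_(float('-inf'))

    hard_videos_per_text = align.topk(k, dim=1).indices     # (K, k) hardest videos per text
    hard_texts_per_video = align.topk(k, dim=0).indices.t() # (K, k) hardest texts per video
    return hard_videos_per_text, hard_texts_per_video
```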
4. Training objective
The training objective is to minimize the combination of the three contrastive losses above, finding the optimal parameters $\theta = \{\theta_v, \theta_t, \theta_m\}$.
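Written out (with equal weights on the three terms, which is an assumption; the paper may weight them differently):

$$
\theta^{*} = \arg\min_{\theta_v,\,\theta_t,\,\theta_m}\big(\mathcal{L}_{\text{sent}} + \mathcal{L}_{\text{token}} + \mathcal{L}_{\text{fuse}}\big)
$$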
The following table shows the text-video retrieval performance of the proposed method on YouCook2 and MSR-VTT with different video features. As can be seen, S3D features pre-trained on HowTo100M outperform the other features by a large margin.