In this issue, we present three papers on self-supervised learning, two of which were co-authored by Yann LeCun, the father of convolutional networks.
For large-scale computer vision training tasks, the performance of self-supervised learning (SSL) is becoming harder and harder to distinguish from that of supervised methods.
Self-supervised learning uses pretext tasks to mine supervisory signals from large-scale unlabeled data: the pretext task constructs its own supervision to train the network, so that the network learns representations that are valuable for downstream tasks.
This article shares three papers on self-supervised learning to help deepen readers' understanding of the field.
Barlow Twins: SSL based on redundancy reduction
Title:
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Authors:
Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stephane Deny
A very useful approach for self-supervised learning is to learn embedding vectors that are invariant to distortions of the input sample.
However, this approach has a recurring failure mode: the trivial constant solution (collapse). Most current methods try to sidestep it through careful implementation details.
In this paper, the authors propose an objective function that avoids collapse by driving the cross-correlation matrix between the outputs of two identical networks (fed with distorted versions of the same samples) as close as possible to the identity matrix.
This makes the embedding vectors of distorted versions of a sample similar while minimizing the redundancy among the components of these vectors. The method is called Barlow Twins.
Schematic diagram of the Barlow Twins method
Thanks to a very high-dimensional output vector, Barlow Twins requires neither large batches nor any asymmetry between the twin networks (e.g., a predictor network or gradient stopping).
Barlow Twins loss function:
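It consists of an invariance term on the diagonal of the cross-correlation matrix C and a redundancy-reduction term on its off-diagonal entries:

$$\mathcal{L}_{BT} = \sum_i \left(1 - C_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2$$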
where λ is a positive constant that trades off the importance of the first and second terms of the loss, and C is the cross-correlation matrix computed along the batch dimension between the outputs of the two identical networks:
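$$C_{ij} = \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2}\,\sqrt{\sum_b \left(z^B_{b,j}\right)^2}}$$

where z^A and z^B are the outputs of the two networks for the two distorted views, mean-centered along the batch dimension.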
Here b indexes the samples in the batch, and i and j index the dimensions of the network outputs. C is a square matrix whose size equals the output dimension, with values between -1 (perfect anti-correlation) and 1 (perfect correlation).
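As an illustration, a minimal PyTorch sketch of this loss could look as follows (the function name and the default λ are illustrative choices; the official implementation differs in details such as distributed batch statistics):

```python
import torch

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Redundancy-reduction loss over two batches of embeddings.

    z_a, z_b: (batch, dim) embeddings of two distorted views of the
    same batch of images; lambd weighs the off-diagonal term.
    """
    # Normalize each output dimension along the batch (zero mean, unit std).
    z_a = (z_a - z_a.mean(dim=0)) / z_a.std(dim=0)
    z_b = (z_b - z_b.mean(dim=0)) / z_b.std(dim=0)

    batch_size = z_a.size(0)
    # Cross-correlation matrix computed along the batch dimension.
    c = (z_a.T @ z_b) / batch_size  # (dim, dim), entries in [-1, 1]

    # Pull the diagonal toward 1 (invariance) and the off-diagonal
    # entries toward 0 (redundancy reduction).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag
```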
On ImageNet, Barlow Twins outperforms all previous approaches for semi-supervised classification in the low-data regime; for ImageNet classification with a linear classifier head, it is on par with the state of the art. The same holds for transfer tasks of classification and object detection.
Semi-supervised learning with 1% and 10% of the ImageNet training examples; best results in bold
Experiments show that Barlow Twins performs slightly better than the other methods with 1% of the data and on par with them with 10%.
Read the full paper: Barlow Twins: Self-Supervised Learning via Redundancy Reduction
VICReg: Variance-invariance-covariance regularization
Title:
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
Authors:
Adrien Bardes, Jean Ponce, Yann LeCun
Self-supervised methods for image representation learning generally maximize the agreement between embedding vectors of different views of the same image. A trivial solution appears when the encoder outputs a constant vector.
Collapse is typically avoided through implicit biases in the learning architecture, which lack a clear justification or interpretation.
In this paper, the authors introduce VICReg (Variance-Invariance-Covariance Regularization), which adds a simple regularization term on the variance of each embedding dimension and thereby explicitly avoids the collapse problem.
VICReg combines this variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results comparable to the state of the art on several downstream tasks.
In addition, experiments show that incorporating the new variance term into other methods helps stabilize training and improves performance.
VICReg principle diagram
Given a batch of images I, X and X′ are two different views, which are encoded into representations Y and Y′. The representations are then fed to an expander that produces the embedding vectors Z and Z′.
The distance between two embeddings of the same image is minimized, the variance of each embedding variable over a batch is kept above a threshold, and the covariance between pairs of embedding variables over a batch is pushed toward zero, decorrelating the variables from each other.
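A minimal PyTorch sketch of these three terms might look as follows (the function name and loss weights are illustrative, and the official implementation adds details such as gathering embeddings across GPUs):

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Variance-invariance-covariance objective over two batches of embeddings.

    z_a, z_b: (batch, dim) embeddings of two views of the same batch.
    """
    n, d = z_a.shape

    # Invariance: minimize the distance between embeddings of the same image.
    sim_loss = F.mse_loss(z_a, z_b)

    # Variance: keep the std of each embedding dimension above a threshold of 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: push off-diagonal covariance entries toward zero,
    # decorrelating the embedding dimensions.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)

    def off_diagonal(c):
        return c - torch.diag(torch.diagonal(c))

    cov_loss = (off_diagonal(cov_a).pow(2).sum() / d
                + off_diagonal(cov_b).pow(2).sum() / d)

    return sim_w * sim_loss + var_w * var_loss + cov_w * cov_loss
```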
Although the two branches neither need the same architecture nor need to share weights, in most experiments they are Siamese networks with shared weights: the encoder is a ResNet-50 backbone with an output dimension of 2048, and the expander consists of three fully connected layers of size 8192.
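As a sketch, that backbone-plus-expander pair could be assembled as follows in PyTorch (the BatchNorm/ReLU placement inside the expander is an assumption on top of the description above):

```python
import torch.nn as nn
import torchvision

# Encoder: ResNet-50 backbone with its classification head removed,
# giving a 2048-dimensional representation.
backbone = torchvision.models.resnet50(weights=None)
backbone.fc = nn.Identity()

# Expander: three fully connected layers of size 8192.
expander = nn.Sequential(
    nn.Linear(2048, 8192), nn.BatchNorm1d(8192), nn.ReLU(inplace=True),
    nn.Linear(8192, 8192), nn.BatchNorm1d(8192), nn.ReLU(inplace=True),
    nn.Linear(8192, 8192),
)
```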
Comparison of the performance of different methods on ImageNet; the three best-performing self-supervised methods are highlighted
The representations obtained by a ResNet-50 backbone pre-trained with VICReg are evaluated on:
1. Linear classification on top of frozen representations on ImageNet;
2. Semi-supervised classification, fine-tuning the representations with 1% and 10% of the ImageNet samples.
The table reports top-1 and top-5 accuracy (in %).
Read the full paper: VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
iBOT: Image BERT pre-training with an online tokenizer
Title:
iBOT: Image BERT Pre-Training with Online Tokenizer
Authors:
Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong
The success of Transformer models in NLP is mainly attributed to masked language modeling (MLM), in which text is first tokenized into semantically meaningful pieces.
In this paper, the authors study masked image modeling (MIM) and propose a self-supervised framework, iBOT.
iBOT performs masked prediction with an online tokenizer. Specifically, the authors apply self-distillation on masked patch tokens, with the teacher network acting as the online tokenizer, together with self-distillation on the class token to acquire visual semantics.
The online tokenizer is jointly learned with the MIM objective, eliminating the multi-stage training pipeline in which the tokenizer must be pre-trained beforehand.
Overview of the iBOT framework: masked image modeling with an online tokenizer
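Conceptually, the two self-distillation terms could be sketched as below (the encoder/head interfaces and names are hypothetical, not iBOT's actual API, and in the paper the class-token term is computed across two augmented views):

```python
import torch
import torch.nn.functional as F

def ibot_style_losses(student, teacher, head_s, head_t, images, mask,
                      temp_s=0.1, temp_t=0.04):
    """Masked-prediction self-distillation, conceptually.

    student/teacher: ViT-style encoders returning (class_token, patch_tokens);
    head_s/head_t: projection heads mapping tokens to logits over a discrete
    "visual vocabulary"; mask: (batch, num_patches) boolean mask.
    In practice the teacher is an exponential moving average of the student.
    """
    # The teacher acts as the online tokenizer and sees the unmasked image.
    with torch.no_grad():
        t_cls, t_patch = teacher(images)
        t_patch = F.softmax(head_t(t_patch) / temp_t, dim=-1)
        t_cls = F.softmax(head_t(t_cls) / temp_t, dim=-1)

    # The student sees the image with some patches masked out.
    s_cls, s_patch = student(images, mask=mask)
    s_patch = F.log_softmax(head_s(s_patch) / temp_s, dim=-1)
    s_cls = F.log_softmax(head_s(s_cls) / temp_s, dim=-1)

    # MIM term: match the teacher's token distributions at masked positions.
    mim_loss = -(t_patch * s_patch).sum(dim=-1)[mask].mean()
    # Class-token term: DINO-style self-distillation for global semantics.
    cls_loss = -(t_cls * s_cls).sum(dim=-1).mean()
    return mim_loss, cls_loss
```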
iBOT performs well, achieving state-of-the-art results on downstream tasks including classification, object detection, instance segmentation, and semantic segmentation.
Table 2: fine-tuning on ImageNet-1K; Table 3: fine-tuning on ImageNet-1K with pre-training on ImageNet-22K
The experimental results show that iBOT achieves 82.3% linear-probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K.
Read the full paper: iBOT: Image BERT Pre-Training with Online Tokenizer
DocArray: Data structures for unstructured data
One of the challenges facing self-supervised learning is representation learning from large amounts of unlabeled data.
With the rapid development of Internet technology, the amount of unstructured data has grown at an unprecedented rate, and its forms now cover not only text and images but also audio, video, and even 3D meshes.
DocArray can greatly simplify the processing and use of unstructured data.
DocArray is an extensible data structure ideal for deep learning tasks. It is primarily used for the transfer of nested and unstructured data, including text, images, audio, video, 3D meshes, and more.
Compared to other data structures:
✅ = full support, ✔ = partial support, ❌ = no support
With DocArray, deep learning engineers can efficiently process, embed, search, recommend, store, and transfer this data through a Pythonic API.
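For example, a few lines based on the DocArray 0.x API (file names here are placeholders) give a feel for how mixed, nested data is handled:

```python
from docarray import Document, DocumentArray

# A DocumentArray can hold text and image documents side by side.
docs = DocumentArray(
    [
        Document(text='hello, world'),
        Document(uri='photo.jpg'),  # placeholder path to a local image
    ]
)

docs[1].load_uri_to_image_tensor()  # load the image file into a tensor
docs.summary()                      # print an overview of the array

# Persist and reload, e.g. to transfer between processes or machines.
docs.save_binary('docs.bin')
reloaded = DocumentArray.load_binary('docs.bin')
```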
That's all for this issue of paper sharing. What other papers, tutorials, or tool recommendations would you like to see? Leave us a message in the backend of our official account, and we will plan each week's content based on your feedback.
Reference links:
Jina GitHub
DocArray
Finetuner
Join the Slack