Here’s a look at DINO, the latest hot model from Facebook

An in-depth reflection on DINO, the self-supervised learning work released by Facebook

This is an original article; reprinting without authorization is not allowed. If you need to reprint it, please contact me for permission.

I haven’t updated the blog in a long time; I’ve been so busy with work that I haven’t had time to think. Today I’m writing down my thoughts on Facebook’s latest paper, DINO. It is yet another Transformer-family paper, but this one is more impressive. As always, I won’t translate the paper line by line; I’ll only write down my own thinking and summary, including some criticism that may well be wrong, just to get the discussion going. Polite exchange and discussion are welcome.

Link to the corresponding code:

Github.com/facebookres…

I wanted to write about this paper because of its title: Emerging Properties in Self-Supervised Vision Transformers. Normally I wouldn’t be drawn to a paper just because it has Transformer in the title, but self-supervised (unsupervised) + Transformer definitely interests me. Personally, I believe unsupervised learning, and the self-supervised flavor in particular, will be a major trend, and Transformer is (at least for now) the feasible way to get there. Research in this direction may make it possible to build far more impressive AI systems in the future, maybe even general AI, and that is not as simple as chasing leaderboard numbers. So let’s talk about it today.

Review of some recent unsupervised work

As always, my blog output is small. To keep the quality up, I weave a lot of related material into each post, connecting the dots into a line and the line into a surface, so that every reader comes away with a sense of the big picture. Speaking of unsupervised learning, I have to mention several recent works:

  • Momentum Contrast for Unsupervised Visual Representation Learning, 2019 (accepted in 2020)
  • Improved Baselines with Momentum Contrastive Learning, 2020
  • An Empirical Study of Training Self-Supervised Vision Transformers, 2021

For those of you who are familiar with them, I’m talking about the MoCo series; all three papers involve Kaiming He. Let’s look at their timeline. The first paper was published against the backdrop of the Transformer boom in NLP, where applying Transformers to unsupervised representation learning produced BERT and GPT models that swept across all NLP tasks. The second is a response to SimCLR, which had become the industry’s new SOTA, and the last is of course this year’s hot topic: using Transformer for unsupervised tasks. For an overview of Transformer, see what I wrote earlier:

zhuanlan.zhihu.com/p/342512339

(Whenever I look back at my own articles, I always feel they are written too simply… In the future I will make some videos that cover Transformer in detail!)

First, let’s answer these questions:

  • What problem does the MoCo series solve?
  • How was it solved?
  • What’s the effect?

I believe these are also the reader’s concerns. Before going any further, let’s come back to the point of this article: we’re here to talk about DINO, so how does MoCo relate to DINO? Don’t rush; first let’s answer these three questions.

In fact, MoCo uses self-supervised learning to learn representations for tasks such as classification, and the learned backbone can then be transferred to tasks such as detection and segmentation. Experiments show that after fine-tuning it outperforms its supervised counterpart, nicely bridging the gap between supervised and unsupervised learning.

MoCo v1

We won’t go into every implementation detail here. The pseudo-code in the paper already lays out the specifics, so I’ll just walk through it without a deep discussion:
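A minimal PyTorch rendering of that pseudo-code looks roughly like the sketch below. The names (`encoder_q`, `encoder_k`, `queue`, `m`, `t`) are my own shorthand, not the official facebookresearch/moco code:

```python
# A minimal sketch of one MoCo training step, following the paper's Algorithm 1.
import torch
import torch.nn.functional as F

def moco_step(encoder_q, encoder_k, queue, x_q, x_k, optimizer, m=0.999, t=0.07):
    """Contrastive loss over 1 positive key and K queued negative keys."""
    q = F.normalize(encoder_q(x_q), dim=1)                # queries: N x C
    with torch.no_grad():                                  # no gradient to the keys
        k = F.normalize(encoder_k(x_k), dim=1)             # keys:    N x C

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # positive logits: N x 1
    l_neg = torch.einsum("nc,ck->nk", q, queue)            # negative logits: N x K
    logits = torch.cat([l_pos, l_neg], dim=1) / t
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits, labels)                 # the positive sits at index 0

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # gradient update: query encoder only

    with torch.no_grad():                                  # momentum update: key encoder
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)

    # enqueue the new keys at the front and drop the oldest ones (queue is C x K)
    queue = torch.cat([k.T, queue], dim=1)[:, :queue.size(1)]
    return loss.item(), queue
```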

The pseudo-code maintains a queue that stores the keys produced by the key encoder, alongside a set of query vectors from the query encoder. A contrastive loss between the queries and the keys drives the update, and after each step the newest batch of keys is enqueued while the oldest batch is dequeued. Through these operations the keys in the queue stay fresh and distinctive, new samples keep flowing in, and the dictionary ends up covering a wide variety of features, each different from the others; that is how the unsupervised task is realized.

One advantage of this is that the queue length is controllable, which means I can enlarge the dictionary to improve accuracy and performance. It’s a bit like doing KNN with back-propagation: you can control the clustering effect by controlling the size of K.

As for the loss itself, how the momentum update makes it all work, and how the corresponding formulas are derived, you can read the original paper carefully; the link is in the references.
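For reference, the two formulas at the heart of MoCo are the InfoNCE contrastive loss (one positive key $k_+$ against $K$ queued negatives, with temperature $\tau$) and the momentum update of the key encoder’s parameters from the query encoder’s parameters:

```latex
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},
\qquad
\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q
```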

MoCo v2

There isn’t much to say about the second paper; it just adds a few tricks to make the results better. Let’s see which tricks. In a nutshell, it borrows some of the design from SimCLR, applies it to MoCo, and then surpasses SimCLR.

The tricks borrowed from SimCLR include:

  • A larger batch;
  • The final FC layer becomes an MLP projection head (that works too??), as sketched after this list;
  • Better data augmentation.
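For a concrete picture of the MLP-head trick, here is a small sketch of swapping a ResNet-50’s final FC for a 2-layer projection head; the variable names are mine, not the official builder code:

```python
# Replace the single FC projection with a 2-layer MLP head, MoCo v2 style.
import torch.nn as nn
import torchvision

encoder = torchvision.models.resnet50()          # randomly initialized backbone
dim_mlp = encoder.fc.in_features                 # 2048 for ResNet-50
encoder.fc = nn.Sequential(
    nn.Linear(dim_mlp, dim_mlp),                 # hidden layer
    nn.ReLU(inplace=True),
    nn.Linear(dim_mlp, 128),                     # 128-d embedding, as in MoCo
)
```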

Well, this one feels a bit thin.

MoCo v3

This recent article applies unsupervised learning to Transformer, or rather takes Transformer and drops it into unsupervised tasks. So what else has changed? Is it still the same MoCo recipe? Even without reading the paper, I figured it couldn’t be exactly the same. Why? Because Transformer is a natural queue, and the tokens are your keys! Seen that way, this article is a very logical next step.

The paper devotes some ink to a problem they encountered when using ViT as the backbone for unsupervised learning: training becomes unstable as it progresses. They didn’t know the root cause, so they tried to isolate it by controlling variables, and they found some interesting results.

The experiments show that as the batch size or learning rate increases, dips start to appear in the kNN accuracy curve, the dips get deeper, and they recur periodically. With the LAMB optimizer, the kNN accuracy curve stays smooth as the LR increases, but its middle section declines.

The article also mentions a trick for making training more stable:

we explore freezing the patch projection layer during training. In other words, we use a fixed random patch projection layer to embed the patches, which is not learned. This can be easily done by applying a stop-gradient operation right after this layer.

In other words, the patch projection layer is frozen: the patches are embedded with a fixed random projection whose parameters do not participate in learning. It is fairly easy to implement; you can look at the MoCo v3 code to see how it is done.
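As a rough sketch of what that freezing looks like in practice (assuming a timm-style ViT whose patch embedding lives at `patch_embed.proj`; the official MoCo v3 code wires this up differently):

```python
# Freeze the patch projection so it stays a fixed random projection; freezing the
# parameters has the same effect as a stop-gradient right after this layer.
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=False)
for p in vit.patch_embed.proj.parameters():
    p.requires_grad = False
```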

As you can see in the figure, using random projections does seem to solve the problem.

In the end, MoCo v3 also outperformed all previous unsupervised architectures:

That’s the MoCo story so far, but it’s not quite the same thing as the DINO I want to cover, because the problem DINO sets out to solve is:

Beyond classification, can unsupervised learning + Transformer solve the segmentation problem? In other words, can it figure out how objects relate to each other without any supervisory information?

DINO may have enabled a more hierarchical visual understanding

To be honest, once I understood the concrete implementation of ViT, I suddenly felt very warmly toward Transformer. It is like meeting a teenager who, at first glance, is so obviously bright that you’re sure they will produce great things, so you want to invest in them now. Transformer gave me exactly that feeling, and after understanding DINO more deeply I feel I’ve uncovered even more of Transformer’s potential, so I’m writing this up to share it with you.

Over the years, whether with CNNs or with Transformers, we have mostly worked on supervised tasks: telling the model what it is looking at. That depends entirely on how you define and label the data, which is very different from NLP. For example, with word embeddings I never tell the model any task information, yet it learns that “Andy Lau” is a person’s name, and it can even tell you other nouns related to that noun. And that was a long time ago; nowadays OpenAI’s models can not only understand words, they can tell you how the words relate to each other, and they can even turn your natural-language database query into an SQL statement. Isn’t that impressive? Going further, hooking a GPT-3-style model up to the image domain gives you DALL-E: tell it “a green clock” and it generates a green clock. It is clearly learning higher-order features.

Is it possible that Computer Vision can do this?

The DINO I’m going to talk about today actually does this. So it pulls off two things: segmentation, and segmentation learned without supervision. Of course, there are other ways to achieve unsupervised semantic segmentation, which is not that difficult, but doing it with a Transformer, DINO should be the first.

This is what DINO looks like.

Note that what you see is an attention map. DINO has no semantic ground truth and no category label telling it this is a monkey, yet it learns to concentrate all of its attention on the object.

When I say this, you might think: that’s it? Dude, don’t be naive; this is a game changer! You throw it a picture, with no ground truth at all, and it automatically learns these attention maps! Think about it: isn’t this similar to embeddings in NLP? Think again: what do we need most right now? Raw data. Countless images and videos are generated on the Internet every day. If Transformer really is this capable, what kind of AI do you think that will produce? A very large version of ResNet-101?

No, no, no. It’s not that simple. This is why I am optimistic about this direction and why I wrote this analysis. Of course, we cannot do fully unsupervised learning at present, but self-supervised learning is enough!

Get ready for Google AI or FAIR to come out with something bigger that could be a game changer.

What else can it do? With the self-supervised approach, the model DINO learns can be used for classification directly. To put it simply, throw in the pictures you want to classify, and DINO automatically digs out the distinguishing features of each picture and groups them for you. This is the AI of the future.

Before digging in, I might have dismissed this as grandstanding, but what DINO reveals is not just classification performance: the network also learns very clear heat maps and attention. That is enough to show that DINO can do more than classification, including segmentation and detection, and that future CV tasks can be expected to go the unsupervised route.

Here is a simple explanation of how DINO works.

In fact, the principle of DINO is very simple. Since it is unsupervised learning, no labels are needed during training. To achieve this, it uses two networks: a student (learning) network and a teacher network. The input is the same picture under two different augmentations, and the two networks share the same architecture; the only difference is their parameters. Gradients are back-propagated only through the student network, and the teacher network’s parameters are updated as a moving average of the student’s parameters as the student learns.
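Here is a minimal sketch of that student/teacher update, assuming both networks end in a K-dimensional projection head. All the names (`student`, `teacher`, `center`, the temperatures `tps`/`tpt`, the momentum `m`) are mine, and the teacher-output centering update from the paper is omitted for brevity; the official code lives in facebookresearch/dino:

```python
# A minimal sketch of the DINO student/teacher update (not the official code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, m=0.996):
    """The teacher is an exponential moving average of the student's parameters."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

def dino_loss(student_out, teacher_out, center, tps=0.1, tpt=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution and the
    student distribution; no gradient flows back to the teacher."""
    t = F.softmax((teacher_out - center) / tpt, dim=-1).detach()
    s = F.log_softmax(student_out / tps, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def train_step(student, teacher, x1, x2, center, optimizer):
    # x1, x2 are two augmented views of the same images; the teacher's output on
    # one view supervises the student's output on the other view
    with torch.no_grad():
        t1, t2 = teacher(x1), teacher(x2)
    s1, s2 = student(x1), student(x2)
    loss = 0.5 * (dino_loss(s1, t2, center) + dino_loss(s2, t1, center))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)
    return loss.item()
```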

Let’s take a look at the flow chart of DINO’s network:

The flow chart is easy to understand, suitable for all ages, and pretty much matches what I described above.

Now let’s take a look at DINO’s results:

Looking at the Linear column alone, it compares all the methods under linear probing, i.e. a linear classifier trained on top of the frozen features. DINO has the highest accuracy there; built on the DeiT architecture, it is far above RN50, the traditional CNN.

If you look at the k-NN column, where no classifier is trained at all, the accuracy is somewhat lower than the linear probe but not by much, especially with the Transformer architecture.
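For a rough idea of what that k-NN evaluation on frozen features looks like, here is a sketch with my own placeholder names, using scikit-learn’s plain k-NN rather than the paper’s exact weighted k-NN protocol:

```python
# k-NN classification on frozen backbone features (a stand-in for the paper's protocol).
import torch
import torch.nn.functional as F
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    backbone.eval()
    feats, labels = [], []
    for images, targets in loader:
        f = backbone(images.to(device))
        feats.append(F.normalize(f, dim=1).cpu())   # L2-normalized features
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def knn_eval(backbone, train_loader, val_loader, k=20):
    x_tr, y_tr = extract_features(backbone, train_loader)
    x_va, y_va = extract_features(backbone, val_loader)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine").fit(x_tr, y_tr)
    return knn.score(x_va, y_va)                    # top-1 accuracy
```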

Conclusion

DINO is really a “crossing the river by feeling for the stones” kind of paper, and it contains a lot of detailed studies that I won’t go through one by one; those who are interested should read the paper. There is no doubt that DINO leads the way into a new space, ushering in a new combination of unsupervised learning and Transformer. As I judged before, with the huge amount of data Transformer training requires, supervised learning alone is not enough; if we can get self-supervised Transformer training to work at scale, it will undoubtedly open a new chapter in CV.

Reference

  1. Momentum Contrast for Unsupervised Visual Representation Learning
  2. Improved Baselines with Momentum Contrastive Learning
  3. An Empirical Study of Training Self-Supervised Vision Transformers
  4. Tourbillon and the MoCo trilogy
  5. Emerging Properties in Self-Supervised Vision Transformers