• Original article from the WeChat public account “Machine Learning Alchemy”
  • Author: Brother Alchemist
  • Contact: WeChat CYX645016617

This article mainly introduces a relatively new and popular unsupervised-learning paper from 2020:

  • Bootstrap Your Own Latent: A New Approach to Self-supervised Learning
  • Links to papers: arxiv.org/pdf/2006.07…

0 Review

BYOL stands for Bootstrap Your Own Latent. This unsupervised framework is elegant, simple, and it works, so it has received a lot of praise. The last framework that struck me as similarly simple and beautiful was YOLO.

1 Mathematical Symbols

This structure has two networks: one is the online network, the other is the target network.

  • Online network: its parameters are denoted $\theta$, and it comprises an encoder $f_{\theta}$, a projector $g_{\theta}$, and a predictor $q_{\theta}$.
  • Target network: its parameters are denoted $\xi$; it likewise has an encoder $f_{\xi}$ and a projector $g_{\xi}$, but no predictor.

We update the online network by gradient descent, and then update the target network as a moving average of the online parameters:


$$\xi \leftarrow \tau\xi + (1-\tau)\theta$$
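As a concrete illustration, here is a minimal PyTorch sketch of this moving-average update (function and argument names are mine, not the paper's code):

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied parameter-wise
    for xi, theta in zip(target_net.parameters(), online_net.parameters()):
        xi.mul_(tau).add_((1.0 - tau) * theta)
```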

Now we have an image dataset $D$, from which we sample an image $x \in D$. We also define two distributions of image augmentations, $\mathcal{T}$ and $\mathcal{T}'$. Sampling an augmentation from each, $t \sim \mathcal{T}$ and $t' \sim \mathcal{T}'$, and applying them to the same image gives two views: $v = t(x)$ and $v' = t'(x)$.
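In code this is just two independently sampled random augmentations applied to the same image (a minimal sketch; `aug_t` and `aug_t_prime` stand in for $\mathcal{T}$ and $\mathcal{T}'$ and are spelled out in the sketch under section 3.1):

```python
def two_views(x, aug_t, aug_t_prime):
    # Each call draws fresh random augmentations t ~ T and t' ~ T',
    # giving two views v = t(x) and v' = t'(x) of the same image.
    return aug_t(x), aug_t_prime(x)
```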

2 Loss Function

We now feed $v$ through the online network: the encoder gives $y = f_{\theta}(v)$, the projector gives $z = g_{\theta}(y)$, and the predictor gives $q_{\theta}(z)$. The target network processes $v'$ in the same way, except that there is no final predictor, so it ends with $z'$.

Both $z'$ and $q_{\theta}(z)$ are then L2-normalized. The point of the L2-normalization is to remove the absolute magnitude of these two latent vectors while retaining their direction, in preparation for the subsequent vector dot product.

That is, $\bar{q}_{\theta}(z) = \frac{q_{\theta}(z)}{\|q_{\theta}(z)\|_2}$ and $\bar{z}' = \frac{z'}{\|z'\|_2}$. The loss function is then not difficult: it is the squared distance between the two normalized vectors, $L_{\theta,\xi} = \|\bar{q}_{\theta}(z) - \bar{z}'\|_2^2$, which expands to $2 - 2\cos\alpha$, where $\alpha$ is the angle between $q_{\theta}(z)$ and $z'$ (i.e., 2 minus twice their cosine similarity).
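A minimal PyTorch sketch of this loss (the names are mine, not the official code):

```python
import torch.nn.functional as F

def byol_loss(q_z, z_prime):
    # Normalize both vectors, then L = ||q_bar - z_bar'||^2 = 2 - 2*cos(alpha)
    q_norm = F.normalize(q_z, dim=-1)
    z_norm = F.normalize(z_prime, dim=-1)
    return 2 - 2 * (q_norm * z_norm).sum(dim=-1)
```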

Above, we obtained the loss $L_{\theta,\xi}$. Next, we compute the symmetric loss $\widetilde{L}_{\theta,\xi}$ by swapping the views: $v'$ goes through the online network and $v$ through the target network. The sum of the two is then minimized by SGD:


$$L^{BYOL}_{\theta,\xi} = L_{\theta,\xi} + \widetilde{L}_{\theta,\xi}$$

It should be noted that in this optimization step, only the online network is updated by the gradient; the parameters of the target network are not touched by backpropagation (a stop-gradient is applied to that branch) and change only through the moving average, so the target network gradually absorbs what the online network learns.
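Putting the pieces together, one training step might look like the following sketch (it reuses `two_views`, `byol_loss`, and `ema_update` from the sketches above; `f_online`, `g_online`, `q_online`, `f_target`, `g_target`, `online_net`, and `target_net` are assumed module names, not the paper's code):

```python
import torch

def byol_train_step(x, optimizer, tau):
    v, v_prime = two_views(x, aug_t, aug_t_prime)

    # Online branch: encoder -> projector -> predictor (both views)
    q1 = q_online(g_online(f_online(v)))
    q2 = q_online(g_online(f_online(v_prime)))

    # Target branch: no predictor, and no gradient flows back (stop-gradient)
    with torch.no_grad():
        z1 = g_target(f_target(v_prime))
        z2 = g_target(f_target(v))

    # Symmetrized loss: L + L~
    loss = (byol_loss(q1, z1) + byol_loss(q2, z2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Only afterwards does the target follow via the moving average
    ema_update(target_net, online_net, tau)
    return loss.item()
```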

Thus, the entire training process of BYOL can be condensed into the following two lines:
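(As written in the paper, with $\eta$ the learning rate:)

$$\theta \leftarrow \mathrm{optimizer}(\theta, \nabla_{\theta} L^{BYOL}_{\theta,\xi}, \eta)$$

$$\xi \leftarrow \tau\xi + (1-\tau)\theta$$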

3 Details

3.1 Image Enhancement
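BYOL uses the same family of augmentations as SimCLR (random cropping, horizontal flips, color jitter, grayscale conversion, Gaussian blur), plus solarization. A sketch with torchvision; the probabilities and magnitudes below are assumptions for illustration, not the paper's exact settings:

```python
from torchvision import transforms

def byol_augmentation(blur_p=1.0, solarize_p=0.0):
    # T and T' differ mainly in the blur / solarization probabilities
    return transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=blur_p),
        transforms.RandomSolarize(threshold=128, p=solarize_p),
        transforms.ToTensor(),
    ])

aug_t = byol_augmentation(blur_p=1.0, solarize_p=0.0)        # samples from T
aug_t_prime = byol_augmentation(blur_p=0.1, solarize_p=0.2)  # samples from T'
```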

3.2 Structure

The encoders $f_{\theta}$ and $f_{\xi}$ use ResNet-50 with post-activation. When I first saw "post-activation" I looked it up, and it really just refers to the order of convolution and activation: if the ReLU is placed after the conv, it is post-activation; if the ReLU is placed before the conv, it is pre-activation.
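In code the difference is just the ordering inside the block (a schematic sketch, not the actual ResNet-50 definition):

```python
import torch.nn as nn

# Post-activation (original ResNet, used here): conv -> BN -> ReLU
post_act = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                         nn.BatchNorm2d(64),
                         nn.ReLU())

# Pre-activation (ResNet-v2): BN -> ReLU -> conv
pre_act = nn.Sequential(nn.BatchNorm2d(64),
                        nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1))
```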

After the encoder, an image yields 2048 features; these then pass through an MLP that expands them to 4096 features and finally outputs 256 features. In the SimCLR model, the MLP output is followed by a BN layer and a ReLU activation, but in BYOL there is no BN layer on the MLP output.
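A sketch of that MLP with the layer sizes from the text (a minimal module, not the official implementation):

```python
import torch.nn as nn

def byol_mlp(in_dim=2048, hidden_dim=4096, out_dim=256):
    # Linear -> BN -> ReLU -> Linear; the 256-d output is NOT batch-normalized
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

g_online = byol_mlp()            # projector g_theta
q_online = byol_mlp(in_dim=256)  # predictor q_theta takes the 256-d projection
```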

3.3 Optimizer

The LARS optimizer was used to train for 1000 epochs, including 10 warm-up epochs, with a cosine learning-rate decay schedule. The base learning rate is set to 0.2.
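A rough sketch of this setup; note that LARS is not in core PyTorch, so plain SGD stands in here, and the momentum and weight-decay values are assumptions:

```python
import math
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# SGD as a stand-in for LARS; online_net is the network being trained
optimizer = SGD(online_net.parameters(), lr=0.2, momentum=0.9, weight_decay=1.5e-6)

def warmup_cosine(epoch, warmup=10, total=1000):
    # Linear warm-up for the first 10 epochs, then cosine decay to 0
    if epoch < warmup:
        return epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)
```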

The target decay rate $\tau$ starts from $\tau_{base} = 0.996$ and is annealed towards 1 during training:


$$\tau = 1-(1-\tau_{base})\cdot\frac{\cos(\pi k / K)+1}{2}$$

Here $k$ is the current training step and $K$ is the maximum number of training steps.
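The same schedule as a small function (a direct transcription of the formula above):

```python
import math

def tau_schedule(k, K, tau_base=0.996):
    # Starts at tau_base when k = 0 and increases to 1 when k = K
    return 1 - (1 - tau_base) * (math.cos(math.pi * k / K) + 1) / 2
```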

3.4 Training Cost

The batch size is 4096, distributed across 512 TPU v3 cores; training the encoder takes about 8 hours.

4 Model Evaluation

For evaluation on ImageNet: first train the encoder with the unsupervised method, then fine-tune a standard ResNet-50 with supervision:

ImageNet semi-supervised training:

The effect of this approach on other classification datasets:

If you find these notes useful, you can follow the author's WeChat public account "Machine Learning Alchemy".