Kloud Strife has collected the most interesting deep learning papers of the year on his blog, covering architectures/models, generative models, reinforcement learning, SGD & optimization, and theory. Some are well known, others less so. Pick whichever you need; if you want to download them all in one batch, see the instructions at the end of this post.


Architecture/Model


There has been much less ConvNet architecture research this year, as things have largely stabilized, but a few papers are certainly pushing it forward. First among them is Andrew Brock's mad SMASH, which, ICLR reviewers' comments notwithstanding, gives 1,000-GPU neural architecture search a run for its money.


SMASH: One-Shot Model Architecture Search through HyperNetworks

Download address:

https://arxiv.org/pdf/1708.05344.pdf


DenseNets (with its 2017 update) is an impressive yet deceptively simple idea. TL;DR: "in computer vision, eyes + fur = cat, so connect everything to everything (including layers)."


Densely Connected Convolutional Networks

Download address:

https://arxiv.org/pdf/1608.06993.pdf
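
To make the "connect everything" idea concrete, here is a minimal PyTorch-style sketch of a dense block (my own illustration, not code from the paper): each layer consumes the concatenation of all previous feature maps, and the growth rate is how many feature maps each layer adds.

```python
# Minimal sketch of a DenseNet-style block (assumes PyTorch is available).
# The point is only the connectivity pattern: each layer sees the
# concatenation of all earlier feature maps.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # every layer consumes the concatenation of all earlier outputs
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```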


A very underrated concept in CNNs is the scattering transform on wavelet filter bank coefficients (conv + max-pooling + ReLU meets wavelet theory). Somehow, amazingly, it explains why the first few layers of a ConvNet look like Gabor filters, and why you probably don't need to train them at all. In Stephane Mallat's words, "I was amazed it worked at all!" See below.


Scaling the Scattering Transform

Download address:

https://arxiv.org/pdf/1703.08961.pdf
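
For reference, the scattering coefficients the paper builds on look roughly as follows (standard Mallat-style notation; details such as the averaging scale differ in the paper). The complex modulus plays the role of the ReLU and the low-pass filter φ_J the role of pooling:

```latex
S_0 x = x \star \phi_J, \qquad
S_1 x(\lambda_1) = \left| x \star \psi_{\lambda_1} \right| \star \phi_J, \qquad
S_2 x(\lambda_1, \lambda_2) = \Big| \left| x \star \psi_{\lambda_1} \right| \star \psi_{\lambda_2} \Big| \star \phi_J
```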


Tensorized LSTMs are the new SotA for character-level language modelling on Wikipedia text, at around 1.0-1.1 bits per character (for reference, LayerNorm LSTMs sit at roughly 1.3 BPC). Given its sheer novelty, I would almost rather call this paper "the road to the revival of HyperNetworks".


Tensorized LSTMs for Sequence Learning

Download address:

https://arxiv.org/pdf/1711.01577.pdf


And finally, the capsule papers. 'Nuff said.


Dynamic Routing Between Capsules

Download address:

https://arxiv.org/pdf/1710.09829.pdf
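
A rough sketch of the routing-by-agreement loop from the paper, written as PyTorch-style code under simplified shapes (no batch dimension; `u_hat` holds the prediction vectors û_{j|i}, shape [num_in, num_out, dim_out]); variable names are mine:

```python
# Sketch of dynamic routing between capsules (simplified shapes, PyTorch).
import torch

def squash(s, dim=-1, eps=1e-8):
    # non-linearity that shrinks short vectors towards 0 and caps long ones near length 1
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    num_in, num_out, _ = u_hat.shape
    b = torch.zeros(num_in, num_out)              # routing logits
    for _ in range(num_iters):
        c = torch.softmax(b, dim=1)               # coupling coefficients per input capsule
        s = (c.unsqueeze(-1) * u_hat).sum(0)      # weighted sum over input capsules
        v = squash(s)                             # output capsule vectors [num_out, dim_out]
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)  # agreement update
    return v
```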


Matrix Capsules with EM Routing

Download address:

https://openreview.net/pdf?id=HJWLfGWRb


Generative Models


I am deliberately saving Nvidia's paper on the progressive growing of GANs for later; it is scarily good.


Starting with the autoregressive family: VQ-VAE, Aaron van den Oord's latest paper, looks obvious in hindsight, but getting discrete latent codes to train at all (the stop-gradient, straight-through trick) is no small feat. I am sure a flurry of follow-up iterations will appear, including versions wrapped in a fully ELBO'd variational layer à la PixelVAE.


Neural Discrete Representation Learning

Download address:

https://arxiv.org/pdf/1711.00937.pdf
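
A minimal sketch of the quantization step, assuming PyTorch and a flat codebook of shape [K, dim]; the variable names are mine, but the structure (nearest-code lookup, codebook loss, commitment loss, straight-through gradient) follows the paper:

```python
# Sketch of VQ-VAE vector quantization with the straight-through trick.
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    # nearest codebook entry for each encoder output (z_e: [batch, dim])
    distances = torch.cdist(z_e, codebook)           # [batch, K]
    indices = distances.argmin(dim=1)
    z_q = codebook[indices]                          # [batch, dim]

    # codebook loss pulls embeddings toward encoder outputs,
    # commitment loss keeps the encoder close to its chosen code
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())

    # straight-through: gradients skip the non-differentiable argmin
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, codebook_loss + commitment_loss
```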


Another surprise came from Parallel WaveNet. While everyone expected something along the lines of Tom LePaine's fast-generation work, DeepMind instead gave us teacher-student distillation, interpreting a high-dimensional isotropic Gaussian/logistic latent space as white noise that an inverse autoregressive flow reshapes into audio. Very, very neat.


Parallel WaveNet

Download address:

https://arxiv.org/pdf/1711.10433.pdf
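
The key training signal, as I read the paper, is probability density distillation: the student, an inverse autoregressive flow mapping noise z to audio x, minimizes its KL divergence to the autoregressive teacher, which splits into a cross-entropy term minus the student's own entropy (the paper adds further auxiliary losses on top):

```latex
x = f_{\text{IAF}}(z), \quad z \sim \text{Logistic}(0, I), \qquad
\mathcal{L}_{\text{distill}}
  = D_{\mathrm{KL}}\!\left(P_{\text{student}} \,\middle\|\, P_{\text{teacher}}\right)
  = H\!\left(P_{\text{student}}, P_{\text{teacher}}\right) - H\!\left(P_{\text{student}}\right)
```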


The number-one paper that no one saw coming: Nvidia sets the standard. Rather than further Wassersteinizing GAN theory (Justin Solomon's beautiful work notwithstanding), the disjoint-support problem is attacked by a multi-resolution approximation of the data distribution. It still takes some finesse to stabilize the gradients, but the empirical results speak for themselves.


Progressive Growing of GANs

Download address:

https://arxiv.org/pdf/1710.10196.pdf
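
The mechanics of the growing are simple to sketch. During a transition phase a new higher-resolution block is faded in, blending its output with a plainly upsampled copy of the old output. The helper names below (`new_block`, `to_rgb_old`, `to_rgb_new`) are hypothetical stand-ins, and the sketch assumes PyTorch:

```python
# Sketch of the fade-in used when a new resolution block is added.
# `alpha` ramps linearly from 0 to 1 during the transition phase.
import torch.nn.functional as F

def grow_step(x_low, new_block, to_rgb_old, to_rgb_new, alpha):
    # path 1: previous resolution, simply upsampled to the new size
    old = F.interpolate(to_rgb_old(x_low), scale_factor=2, mode='nearest')
    # path 2: the newly added higher-resolution block
    new = to_rgb_new(new_block(F.interpolate(x_low, scale_factor=2, mode='nearest')))
    # blend the two; alpha=0 means "old network only", alpha=1 "new block only"
    return (1.0 - alpha) * old + alpha * new
```

The same blending is mirrored on the discriminator side, so both networks grow in lockstep as alpha goes from 0 to 1.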


Earlier this year, the French school of Peyré and Genevay defined minimum Kantorovich estimators; the Google team around Bousquet then used them to write the definitive framework for VAE-GAN hybrids. The WAE paper is probably among the top ICLR 2018 submissions.


The VeGAN Cookbook

Download address:

https://arxiv.org/pdf/1705.07642.pdf


Wasserstein Autoencoders

Download address:

https://arxiv.org/pdf/1711.01558.pdf
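
The WAE objective, roughly as stated in the paper: reconstruct under a cost c through the decoder G, while penalizing the divergence between the aggregate posterior Q_Z and the prior P_Z (via a GAN-based or MMD penalty), instead of the per-sample KL used by the VAE:

```latex
D_{\mathrm{WAE}}(P_X, P_G) \;=\; \inf_{Q(Z \mid X)} \;
\mathbb{E}_{P_X}\, \mathbb{E}_{Q(Z \mid X)}\!\left[\, c\big(X, G(Z)\big) \right]
\;+\; \lambda \, \mathcal{D}_Z\!\left(Q_Z, P_Z\right)
```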


When it comes to variational inference, nobody borrows ideas from reinforcement learning and GANs to push VI forward better than Dustin Tran.


Hierarchical Implicit Models

Download address:

https://arxiv.org/pdf/1702.08896.pdf


Reinforcement learning


The year was dominated by soft / max-entropy Q-learning, and it turns out we had been doing it wrong all these years! Schulman proved the equivalence between the two main families of RL algorithms, policy gradients and (soft) Q-learning. A landmark paper. 'Nuff said.


Equivalence Between Policy Gradients and Soft Q-Learning

Download address:

https://arxiv.org/pdf/1704.06440.pdf
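
For context, these are the entropy-regularized quantities the equivalence is built on (notation simplified): with temperature τ, the optimal policy is a Boltzmann distribution over soft Q-values, and the soft value function is the corresponding log-partition (soft-max). Schulman et al. then show the soft Q-learning update matches the entropy-regularized policy gradient, roughly up to a value-function baseline term.

```latex
\pi^*(a \mid s) \;=\; \exp\!\Big( \tfrac{Q^*_{\text{soft}}(s,a) - V^*_{\text{soft}}(s)}{\tau} \Big),
\qquad
V^*_{\text{soft}}(s) \;=\; \tau \log \sum_{a} \exp\!\Big( \tfrac{Q^*_{\text{soft}}(s,a)}{\tau} \Big)
```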


Did he prove it through very careful math, reworking partition functions along paths? Nobody knows, except Ofir:


Bridging the Gap Between Value and Policy RL

Download address:

https://arxiv.org/pdf/1702.08892.pdf


In another underrated paper, Gergely quietly outshines everyone by connecting these RL schemes to convex optimization theory. IMHO one of this year's best RL papers, though not a well-known one.


A Unified View of Entropy-Regularized MDPs

Download address:

https://arxiv.org/pdf/1705.07798.pdf


David Silver's Predictron, rejected at ICLR 2017, somehow flew under the radar; Theo's paper offers something of a dual point of view, and comes with beautiful, intuitive Sokoban experiments:


Imagination-Augmented Agents

Download address:

https://arxiv.org/pdf/1707.06203.pdf


Marc Bellemare published another transformational paper: doing away with all the DQN stabilization add-ons and simply learning the return distribution (beating SotA in the process). Beautiful. Many extensions are possible, including links to the Wasserstein distance.


A Distributional Perspective on Reinforcement Learning

Download address:

https://arxiv.org/pdf/1707.06887.pdf


Distributional Reinforcement Learning with Quantile Regression

Download address:

https://arxiv.org/pdf/1710.10044.pdf
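
The common core of both papers is the distributional Bellman equation over the random return Z, which the quantile-regression follow-up fits at fixed quantile fractions τ using the quantile (pinball) loss; in practice QR-DQN uses a Huber-smoothed version of ρ_τ:

```latex
Z(s,a) \;\overset{D}{=}\; R(s,a) + \gamma\, Z(S', A'),
\qquad
\rho_\tau(u) \;=\; u\,\big(\tau - \mathbb{1}\{u < 0\}\big)
```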


A simple, yet very effective, double whammy idea.


Noisy Networks for Exploration

Download address:

https://arxiv.org/pdf/1706.10295.pdf
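
The idea fits in a few lines: replace the linear layers with versions whose weights are perturbed by learned, parametric noise, so exploration is driven by the network itself rather than by epsilon-greedy. A PyTorch-style sketch with independent Gaussian noise (the paper also has a cheaper factorised-noise variant, and its exact initialisation scheme differs from the illustrative values used here):

```python
# Sketch of a NoisyNet-style linear layer: learned means and scales,
# with fresh noise sampled on every forward pass.
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        # perturb weights and biases with zero-mean Gaussian noise;
        # the sigmas are learned, so the agent tunes its own exploration
        eps_w = torch.randn_like(self.mu_w)
        eps_b = torch.randn_like(self.mu_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return x @ weight.t() + bias
```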


Of course, this list would be incomplete without AlphaGo Zero. The idea of aligning the policy network with the MCTS output, that is, using MCTS as a policy improvement operator (and as a way of smoothing out NN approximation errors rather than propagating them), is legendary.


Mastering the Game of Go Without Human Knowledge

Download address:

https://deepmind.com/documents/119/agz_unformatted_nature.pdf
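
Two formulas carry most of the weight (notation as in the paper, up to minor details): action selection inside MCTS uses a PUCT rule driven by the network prior P and visit counts N, and the network outputs (p, v) are then trained to match the search probabilities π and the game outcome z:

```latex
a_t = \arg\max_a \Big[ Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \Big],
\qquad
\ell = (z - v)^2 \;-\; \pi^{\top} \log p \;+\; c\,\lVert \theta \rVert^2
```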


SGD & optimization


2017 was the year our understanding matured of why SGD works as well as it does in the non-convex case, and of why it is so hard to beat from a generalization-error perspective.


The award for this year's most technical paper goes to Chaudhari, who connects almost everything, from SGD and gradient flows to PDEs. A masterpiece that follows up on and completes "Entropy-SGD":


Deep Relaxation: PDEs for Optimizing Deep Networks

Download address:

https://arxiv.org/pdf/1704.04932.pdf


For Bayesians, the pick is Mandt & Hoffman's SGD-as-variational-inference connection, a line of work that has been building for years.


SGD as Approximate Bayesian Inference

Download address:

https://arxiv.org/pdf/1704.04289.pdf


Like the previous paper, the next one relies on the continuous-time relaxation of SGD as a stochastic differential equation (by the CLT, the gradient noise is treated as Gaussian). It explains the effect of batch size and yields a very nice chi-square formula.


Batch Size Matters: A Diffusion Approximation Framework

Download address:



Another Ornstein-Uhlenbeck inspired paper, from Yoshua Bengio’s lab, produced similar results:


Three Factors Influencing Minima in SGD

Download address:

https://arxiv.org/pdf/1711.04623.pdf
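
The shared picture behind these last two papers (and Mandt & Hoffman's), written in the usual notation and up to each paper's own scaling conventions: SGD with learning rate η and batch size B is approximated by an SDE whose noise scale is governed by the ratio η/B, where C is the covariance of the per-sample gradient noise. This ratio is what ties batch size, learning rate and the width of the minima found together.

```latex
d\theta_t \;=\; -\nabla L(\theta_t)\, dt \;+\; \sqrt{\tfrac{\eta}{B}}\; C(\theta_t)^{1/2}\, dW_t
```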


Finally, another Chaudhari paper, on the SGD-SDE-VI trinity:


SGD Performs Variational Inference, Converges to Limit Cycles

Download address:

https://arxiv.org/pdf/1710.11029.pdf


Theory


I firmly believe that the answers to why deep learning works will come from harmonic / second-order analysis and from the intersection of information theory and entropy-based measures. Naftali Tishby's ideas, controversial as they have become after a recent ICLR 2018 submission challenged them, still bring us closer to understanding deep learning.


Opening the Black Box of Deep Networks via Information

Download address:

https://arxiv.org/pdf/1703.00810.pdf


On the Information Bottleneck Theory of Deep Learning

Download address:

https://openreview.net/pdf?id=ry_WPG-A-
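
Both papers revolve around the information bottleneck Lagrangian: compress the representation T of the input X while keeping it informative about the label Y, with β trading off the two terms. The debate, roughly, is about whether SGD-trained networks actually go through the fitting-then-compression phases that this objective suggests.

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta\, I(T; Y)
```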


In the same vein, a beautiful paper from ICLR 2017 takes a variational approach to the information bottleneck.


Deep Variational Information Bottleneck

Download address:

https://arxiv.org/pdf/1612.00410.pdf


This year brought a gazillion generative models and log-likelihood factorization methods, most of which can be grouped under convex duality.


A Lagrangian Perspective on Latent Variable Modelling

Download address:

https://openreview.net/pdf?id=ryZERzWCZ


These last two papers demonstrate amazing technical prowess and show that the mathematical arms race in deep learning is alive and well. They combine complex analysis, random matrix theory, free probability and graph morphisms to derive exact laws for the eigenvalues of the Hessian of neural network loss functions, whose shape had previously only been known empirically, as in the papers of Sagun et al. Required reading.


Geometry of NN Loss Surfaces via RMT

Download address:

http://proceedings.mlr.press/v70/pennington17a/pennington17a.pdf


Nonlinear RMT for Deep Learning

Download address:

http://papers.nips.cc/paper/6857-nonlinear-random-matrix-theory-for-deep-learning.pdf

How to get the full paper pack

Follow the public account [Pegasus Club] and reply with [17] to get the download instructions.