Kloud Strife has collected the most interesting deep learning papers of the year on his blog, covering architectures/models, generative models, reinforcement learning, SGD & optimization, and theory. Some are well known, others much less so. Pick and choose according to your own needs.
Architecture/Model
There were far fewer new ConvNet architectures than in previous years, and things have largely stabilized. A few papers are nonetheless clearly pushing the research forward. First among them is Andrew Brock's audacious SMASH, which, ICLR reviewers' comments notwithstanding, gives neural architecture search on 1,000 GPUs a run for its money.
SMASH: One-Shot Model Architecture Search through HyperNetworks
Download address:
https://arxiv.org/pdf/1708.05344.pdf
DenseNets (updated in 2017) are an impressive and deceptively simple idea. TL;DR: in computer vision, eyes + fur = cat, so connect everything (including the layers).
Densely connected convolutional networks
Download address:
https://arxiv.org/pdf/1608.06993.pdf
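To make the connect-everything idea concrete, here is a minimal sketch of a dense block in PyTorch (my own toy sizes and layer choices, not the paper's exact configuration): every layer consumes the concatenation of all previous feature maps.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Minimal dense block: every layer sees all earlier feature maps (illustrative sizes)."""
    def __init__(self, in_channels=16, growth_rate=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer consumes the concatenation of ALL previous outputs.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

block = TinyDenseBlock()
print(block(torch.randn(1, 16, 32, 32)).shape)  # -> torch.Size([1, 52, 32, 32])
```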
A very underrated idea in the CNN world is the scattering transform, a cascade of wavelet filter-bank coefficients with modulus nonlinearities (the wavelet-theory counterpart of conv + max-pool + ReLU). Somewhat amazingly, it sheds light on why the first few layers of a ConvNet look like Gabor filters, and why you arguably do not need to train them at all. In Stephane Mallat's words, "I was amazed that it worked!" See below.
Scaling the Scattering Transform
Download address:
https://arxiv.org/pdf/1703.08961.pdf
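As a toy illustration (nothing to do with the paper's actual code), here is a first-order scattering-style transform in numpy: convolve with a hand-built Gabor wavelet, take the complex modulus, then average and subsample, i.e. the fixed-filter analogue of conv + ReLU + pooling.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(size=15, sigma=3.0, freq=0.25, theta=0.0):
    """Hand-built complex Gabor filter (a stand-in for one atom of a wavelet filter bank)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.exp(1j * 2 * np.pi * freq * rot)

def first_order_scattering(img, thetas=(0, np.pi / 4, np.pi / 2), pool=4):
    """|img * psi_theta| followed by local averaging: the 'conv + ReLU + pool' analogue."""
    coeffs = []
    for theta in thetas:
        u = np.abs(fftconvolve(img, gabor(theta=theta), mode='same'))  # modulus nonlinearity
        # Crude low-pass + subsampling in place of the Gaussian averaging filter phi.
        pooled = u[:u.shape[0] // pool * pool, :u.shape[1] // pool * pool]
        pooled = pooled.reshape(-1, pool, pooled.shape[1] // pool, pool).mean(axis=(1, 3))
        coeffs.append(pooled)
    return np.stack(coeffs)

img = np.random.rand(64, 64)
print(first_order_scattering(img).shape)  # (3, 16, 16)
```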
Tensorized LSTMs are the new SOTA on Wikipedia character-level language modelling, pushing English text down to about 1.0-1.1 BPC (for reference, LayerNorm LSTMs sit around 1.3 BPC). Because of its novelty, I would rather point to this paper than to the HyperNetworks revival route.
Tensorized LSTMs for sequence learning
Download address:
https://arxiv.org/pdf/1711.01577.pdf
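For reference, BPC is just the average character-level negative log-likelihood expressed in bits; a trivial conversion helper (the numbers below are purely illustrative):

```python
import math

def bits_per_character(mean_nll_nats: float) -> float:
    """Convert average negative log-likelihood per character (in nats) to bits per character."""
    return mean_nll_nats / math.log(2)

# e.g. a model averaging 0.76 nats/char scores roughly 1.1 BPC
print(round(bits_per_character(0.76), 2))  # 1.1
```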
Finally, capsules. 'Nuff said.
Dynamic Routing Between Capsules
Download address:
https://arxiv.org/pdf/1710.09829.pdf
Matrix capsules with EM routing
Download address:
https://openreview.net/pdf?id=HJWLfGWRb
Generative models
I am deliberately leaving out this year's deluge of GAN papers, with one exception: Nvidia's progressive growing of GANs, whose results are almost alarmingly good (more on it below).
Starting with the autoregressive models: VQ-VAE, Aaron van den Oord's latest, looks deceptively simple, but getting gradients to flow backwards through the discretization bottleneck with a stop-gradient loss is no small feat. I am sure a flurry of follow-up iterations, including PixelVAE-style stacks wrapped in an ELBO'ed Bayesian layer, will come into play.
Neural Discrete Representation Learning
Download address:
https://arxiv.org/pdf/1711.00937.pdf
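A minimal sketch of the trick in question, as I understand it (my own paraphrase, not the authors' code): quantize encoder outputs to their nearest codebook entry, pass gradients straight through with a stop-gradient, and add codebook/commitment terms to the loss.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: (batch, dim) encoder outputs; codebook: (K, dim) embedding vectors."""
    # Nearest codebook entry for each encoder output.
    distances = torch.cdist(z_e, codebook)          # (batch, K)
    indices = distances.argmin(dim=1)
    z_q = codebook[indices]                         # (batch, dim)

    # Straight-through estimator: forward uses z_q, backward copies gradients onto z_e.
    z_q_st = z_e + (z_q - z_e).detach()

    # Codebook loss pulls embeddings toward encoder outputs; commitment loss does the reverse.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())
    return z_q_st, codebook_loss + commitment_loss

z_e = torch.randn(8, 64, requires_grad=True)
codebook = torch.randn(512, 64, requires_grad=True)
z_q, aux_loss = vector_quantize(z_e, codebook)
print(z_q.shape, aux_loss.item())
```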
Another surprise came from Parallel WaveNet. While everyone was expecting something in line with Tom LePaine's work, DeepMind instead gave us teacher-student distillation, interpreting high-dimensional isotropic Gaussian/logistic latent spaces as a noise-shaping process pushed through an inverse autoregressive flow. Very, very neat.
Parallel WaveNet
Download address:
https://arxiv.org/pdf/1711.10433.pdf
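To see why an inverse autoregressive flow lets the student sample in parallel, here is a purely illustrative numpy toy (nothing from the real model): each output x_t is an affine reshaping of noise z_t whose scale and shift depend only on earlier noise z_{<t}, so the whole waveform comes out in one vectorized pass.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 16
z = rng.standard_normal(T)          # isotropic Gaussian (or logistic) noise

# Toy "networks": scale/shift at step t depend only on noise strictly before t
# (here, a causal cumulative sum stands in for the student's convolutional stack).
prefix = np.concatenate(([0.0], np.cumsum(z)[:-1]))
scale = np.exp(0.1 * np.tanh(prefix))   # s_t(z_<t), kept positive
shift = 0.5 * np.tanh(prefix)           # mu_t(z_<t)

# All time steps at once: x_t = z_t * s_t + mu_t.
x = z * scale + shift
print(x.shape)
```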
The number-one paper nobody expected: Nvidia setting the new benchmark. It does away with all the Wassersteinizing of GAN theory (Justin Solomon's masterwork) and keeps only a KL-type loss; the disjoint-support problem is handled instead by a multi-resolution approximation of the data distribution. It still takes some finesse to stabilize gradients, but the empirical results speak for themselves.
Progressive growing of GANs
Download address:
https://arxiv.org/pdf/1710.10196.pdf
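Much of that finesse is the fade-in of each newly grown resolution: the fresh block's output is blended with an upsampled copy of the previous stage, with the blend weight alpha ramped from 0 to 1 during training. A schematic sketch (hypothetical helper, not Nvidia's code):

```python
import torch
import torch.nn.functional as F

def fade_in(old_rgb, new_rgb, alpha):
    """Blend the previous-resolution output (upsampled) with the newly grown block's output."""
    old_up = F.interpolate(old_rgb, scale_factor=2, mode='nearest')
    return (1.0 - alpha) * old_up + alpha * new_rgb

old = torch.randn(1, 3, 16, 16)   # generator output at the previous resolution
new = torch.randn(1, 3, 32, 32)   # output of the newly grown 32x32 block
for alpha in (0.0, 0.5, 1.0):     # alpha is ramped up over the course of training
    print(alpha, fade_in(old, new, alpha).shape)
```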
Earlier in the year, the French school led by Peyre and Genevay defined Minimum Kantorovich Estimators. Building on that, the Google team around Bousquet wrote a definitive framework for VAE-GAN hybrids (the VeGAN cookbook below), culminating in the Wasserstein Autoencoders paper, probably one of the top ICLR 2018 papers.
The VeGAN cookbook
Download address:
https://arxiv.org/pdf/1705.07642.pdf
Wasserstein Autoencoders
Download address:
https://arxiv.org/pdf/1711.01558.pdf
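In a nutshell, the WAE objective is reconstruction plus a penalty on the divergence between the aggregated posterior and the prior over a whole batch of latents, rather than a per-sample KL as in a VAE. A small sketch of the MMD variant with an inverse multiquadratic kernel (simplified, with my own names and constants):

```python
import torch

def imq_kernel(a, b, c=2.0):
    """Inverse multiquadratic kernel k(x, y) = c / (c + ||x - y||^2)."""
    return c / (c + torch.cdist(a, b) ** 2)

def mmd_penalty(z_q, z_prior):
    """MMD^2 between encoded latents and prior samples (diagonal terms excluded)."""
    n = z_q.shape[0]
    k_qq = imq_kernel(z_q, z_q)
    k_pp = imq_kernel(z_prior, z_prior)
    k_qp = imq_kernel(z_q, z_prior)
    off_diag = 1.0 - torch.eye(n, device=z_q.device)
    within = ((k_qq * off_diag).sum() + (k_pp * off_diag).sum()) / (n * (n - 1))
    return within - 2.0 * k_qp.mean()

def wae_loss(x, x_recon, z_q, lam=10.0):
    z_prior = torch.randn_like(z_q)                 # samples from the prior
    recon = ((x - x_recon) ** 2).mean()             # reconstruction term
    return recon + lam * mmd_penalty(z_q, z_prior)  # batch-level divergence penalty

x = torch.randn(32, 784); x_recon = torch.randn(32, 784); z = torch.randn(32, 8)
print(wae_loss(x, x_recon, z).item())
```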
When it comes to variational inference, nobody has done more than Dustin Tran to push VI forward by borrowing ideas from reinforcement learning (policies) and GANs.
Hierarchical Implicit Models
Download address:
https://arxiv.org/pdf/1702.08896.pdf
Reinforcement learning
The year was dominated by soft / max-entropy Q-learning. We had been doing it wrong all these years!
Schulman proved the equivalence between the two main families of RL algorithms. A landmark paper. 'Nuff said.
Equivalence between Policy Gradients and Soft Q-learning
Download address:
https://arxiv.org/pdf/1704.06440.pdf
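The identity underpinning the equivalence: with entropy regularization at temperature tau, the optimal policy, soft value and soft Q-function are tied together by pi(a|s) = exp((Q(s,a) - V(s)) / tau) with V(s) = tau * logsumexp(Q(s,.) / tau). A quick numerical sanity check on toy Q-values:

```python
import numpy as np
from scipy.special import logsumexp

tau = 0.5
q = np.array([1.0, 2.0, 0.5])                 # toy soft Q-values for one state

v = tau * logsumexp(q / tau)                  # soft value function
pi = np.exp((q - v) / tau)                    # Boltzmann / softmax policy

print(np.allclose(pi.sum(), 1.0))             # True: a proper probability distribution
print(np.allclose(pi, np.exp(q / tau) / np.exp(q / tau).sum()))  # True: identical to softmax(Q/tau)
```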
Doing the very careful math on partition functions to prove the equivalence along entire trajectories (path consistency)? Nobody but Ofir:
Bridging the gap between value and policy RL
Download address:
https://arxiv.org/pdf/1702.08892.pdf
In another underrated paper, Gergely quietly outclasses everyone by drawing out the links between the RL literature and convex optimization theory. IMHO the best, yet least well-known, RL paper of the year.
A unified view of entropy-regularized MDPs
Download address:
https://arxiv.org/pdf/1705.07798.pdf
While David Silver's Predictron somehow flew under the radar after its ICLR 2017 rejection, Theo's paper offers something of a dual view of it and kicks off with beautiful, intuitive Sokoban experiments:
Imagination-Augmented Agents
Download address:
https://arxiv.org/pdf/1707.06203.pdf
Marc Bellemare published another transformational paper, doing away with all the DQN stabilization add-ons and simply learning the return distribution (beating SotA in the process). Beautiful. Many possible extensions, including the link with the Wasserstein distance.
A distributional perspective on RL
Download address:
https://arxiv.org/pdf/1707.06887.pdf
Distributional RL with Quantile Regression
Download address:
https://arxiv.org/pdf/1710.10044.pdf
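The quantile-regression variant boils down to a quantile Huber loss between predicted return quantiles and sampled Bellman targets; a compact sketch (shapes, names and the target construction are illustrative only):

```python
import torch
import torch.nn.functional as F

def quantile_huber_loss(pred_quantiles, targets, kappa=1.0):
    """pred_quantiles: (batch, N) predicted return quantiles; targets: (batch, N') target samples."""
    n = pred_quantiles.shape[1]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n        # quantile midpoints
    # Pairwise TD errors: target_j - predicted_quantile_i.
    u = targets.unsqueeze(1) - pred_quantiles.unsqueeze(2)          # (batch, N, N')
    huber = F.huber_loss(pred_quantiles.unsqueeze(2).expand_as(u),
                         targets.unsqueeze(1).expand_as(u),
                         delta=kappa, reduction='none')
    # Asymmetric weighting |tau - 1{u < 0}| makes each output track its own quantile.
    weight = torch.abs(taus.view(1, -1, 1) - (u.detach() < 0).float())
    return (weight * huber / kappa).sum(dim=1).mean()

pred = torch.randn(8, 51, requires_grad=True)   # e.g. 51 quantiles per state-action
target = torch.randn(8, 51)                     # sampled Bellman targets
print(quantile_huber_loss(pred, target).item())
```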
A simple, yet very effective, double whammy idea.
Noisy Networks for Exploration
Download address:
https://arxiv.org/pdf/1706.10295.pdf
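The gist is to replace epsilon-greedy dithering with learned, parameterized noise on the weights themselves. A minimal noisy linear layer with factorized Gaussian noise (a simplified sketch; initialization details differ from the paper):

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), with factorized noise."""
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-1, 1) / math.sqrt(in_features))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0 / math.sqrt(in_features)))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma0 / math.sqrt(in_features)))

    @staticmethod
    def _f(x):
        # Signed square root used to shape the factorized noise.
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Factorized Gaussian noise: one vector per input dim, one per output dim.
        eps_in = self._f(torch.randn(self.in_features))
        eps_out = self._f(torch.randn(self.out_features))
        w = self.w_mu + self.w_sigma * torch.outer(eps_out, eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return x @ w.t() + b

layer = NoisyLinear(4, 2)
print(layer(torch.randn(3, 4)).shape)  # torch.Size([3, 2])
```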
Of course, this list would be incomplete without AlphaGo Zero. The idea of aligning the policy network before and after MCTS, i.e. using MCTS as a policy improvement operator (and as a way of smoothing out NN approximation errors rather than propagating them), is legendary.
Mastering the game of Go without human knowledge
Download address:
https://deepmind.com/documents/119/agz_unformatted_nature.pdf
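For reference, the part of MCTS that acts as a policy improvement operator is the PUCT selection rule, trading off the network prior against visit counts and backed-up values; a toy numpy sketch (numbers and the exact exploration constant are illustrative):

```python
import numpy as np

def puct_select(prior, visit_count, total_value, c_puct=1.5):
    """Pick the child maximizing Q(s, a) + U(s, a), in the spirit of AlphaGo Zero's tree search."""
    q = np.where(visit_count > 0, total_value / np.maximum(visit_count, 1), 0.0)
    u = c_puct * prior * np.sqrt(visit_count.sum() + 1) / (1.0 + visit_count)
    return int(np.argmax(q + u))

prior = np.array([0.5, 0.3, 0.2])        # policy network output at this node
visits = np.array([10, 2, 0])            # visit counts N(s, a)
values = np.array([4.0, 1.5, 0.0])       # summed backed-up values W(s, a)
print(puct_select(prior, visits, values))
```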
SGD & optimization
2017 was a year of maturing understanding of why SGD works as well as it does in the non-convex case, and why it is so hard to beat from a generalization-error perspective.
The prize for this year's most technical paper goes to Chaudhari. It connects almost everything, from SGD and gradient flows to PDEs. A masterpiece that follows up on and completes "Entropy-SGD":
Deep Relaxation: PDEs for optimizing deep networks
Download address:
https://arxiv.org/pdf/1704.04932.pdf
The Bayesian take on this is Mandt & Hoffman's SGD-as-approximate-VI connection; as you know, I have been keen on this one for years.
SGD as approximate Bayesian inference
Download address:
https://arxiv.org/pdf/1704.04289.pdf
The previous papers rely on a continuous-time relaxation of SGD as a stochastic differential equation (by the CLT, the gradient noise is treated as Gaussian). The next one explains the effect of batch size within this framework and derives a very nice chi-squared formula.
Batch size matters, a diffusion approximation framework
Download address:
Another Ornstein-Uhlenbeck inspired paper, from Yoshua Bengio’s lab, produced similar results:
Three factors influencing minima in SGD
Download address:
https://arxiv.org/pdf/1711.04623.pdf
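Both papers revolve around the same quantity: in the SDE view, what matters is the noise scale, roughly learning rate over batch size (times the gradient covariance), so the width of the stationary distribution around a minimum is set by that ratio. A purely illustrative 1-D simulation of the effect:

```python
import numpy as np

def sgd_on_quadratic(lr, batch_size, steps=20000, grad_noise_std=1.0, seed=0):
    """SGD on f(x) = x^2 / 2 with per-sample gradient noise; returns the stationary spread."""
    rng = np.random.default_rng(seed)
    x, trace = 1.0, []
    for _ in range(steps):
        # Mini-batch gradient = true gradient + noise whose std shrinks as 1/sqrt(batch).
        g = x + grad_noise_std * rng.standard_normal() / np.sqrt(batch_size)
        x -= lr * g
        trace.append(x)
    return np.std(trace[steps // 2:])

# Same lr/batch ratio => similar spread; bigger ratio => wider stationary distribution.
print(sgd_on_quadratic(lr=0.01, batch_size=10))
print(sgd_on_quadratic(lr=0.04, batch_size=40))
print(sgd_on_quadratic(lr=0.04, batch_size=10))
```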
Finally, another Chaudhari paper on the SGD-SDE-VI trinity:
SGD performs VI, converges to limit cycles
Download address:
https://arxiv.org/pdf/1710.11029.pdf
Theory
I firmly believe that the answer to why deep learning works will come from harmonic/second-order analysis and from the intersection of information theory and entropy-based measures. Naftali Tishby's ideas, although controversial since a recent ICLR 2018 submission challenged them, still bring us closer to understanding deep learning.
Opening the black box of deep networks via information
Download address:
https://arxiv.org/pdf/1703.00810.pdf
On the information bottleneck theory of deep learning
Download address:
https://openreview.net/pdf?id=ry_WPG-A-
In a similar vein, a beautiful paper from ICLR 2017 takes a variational approach to the information bottleneck.
Deep variational information bottleneck
Download address:
https://arxiv.org/pdf/1612.00410.pdf
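Concretely, the variational bound turns the IB objective into a familiar-looking loss: task cross-entropy plus beta times the KL between the stochastic encoding and a standard Gaussian prior. A compact sketch (sizes and beta are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVIB(nn.Module):
    """Minimal deep variational information bottleneck head (illustrative sizes)."""
    def __init__(self, in_dim=784, bottleneck=32, num_classes=10):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * bottleneck)   # outputs mean and log-variance
        self.decoder = nn.Linear(bottleneck, num_classes)

    def forward(self, x, y, beta=1e-3):
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
        ce = F.cross_entropy(self.decoder(z), y)                  # E[-log q(y|z)]
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()  # KL to N(0, I)
        return ce + beta * kl

model = TinyVIB()
x, y = torch.randn(16, 784), torch.randint(0, 10, (16,))
print(model(x, y).item())
```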
This year there have been a few billion generative models and about 1.2 billion ways of factorizing log-likelihoods, most of which can be grouped under convex duality.
A Lagrangian perspective on latent variable modelling
Download address:
https://openreview.net/pdf?id=ryZERzWCZ
This last paper is a display of sheer technical prowess, and proof that the mathematical arms race in deep learning is alive and well! It combines complex analysis, random matrix theory, free probability and graph morphisms to derive an exact law for the Hessian eigenvalues of neural network loss functions, whose shape was previously known only empirically, as in the paper by Sagun et al. Required reading.
Geometry of NN loss surfaces via RMT
Download address:
http://proceedings.mlr.press/v70/pennington17a/pennington17a.pdf
Nonlinear RMT for deep learning
Download address:
http://papers.nips.cc/paper/6857-nonlinear-random-matrix-theory-for-deep-learning.pdf