Selected from GitHub, by the Bayesian Methods Research Group, compiled by Heart of the Machine.

In the Deep|Bayes summer course, the instructors discuss how to combine Bayesian methods with deep learning to achieve better results in machine learning applications. Recent research shows that using Bayesian methods can bring many benefits. Students will learn methods and techniques that are important for understanding current machine learning research. They will also see the connection between Bayesian methods and reinforcement learning, and learn modern stochastic optimization methods and regularization techniques for neural networks. The lectures are also followed by practice sessions set up by the instructors.

  • Project address: github.com/bayesgroup/…

  • Video address: www.youtube.com/playlist?li…

  • PPT address: drive.google.com/drive/folde…


Instructors

Most of the lecturers and teaching assistants are members of the Bayesian Methods Research Group or researchers from the world’s top research centers. Many lecturers have presented papers at top international machine learning conferences such as NIPS, ICML, ICCV, CVPR, ICLR, and AISTATS. The Bayesian Methods Research Group has developed a range of university courses, including Bayesian methods, deep learning, optimization, and probabilistic graphical models, and has extensive teaching experience.


Students

This summer course is open to:

  • Undergraduate students (preferably with at least two years of undergraduate study) and master’s students with a strong mathematical background and sufficient knowledge of machine learning, including deep learning.

  • Researchers and industry experts in the field of machine learning or related fields who want to expand their knowledge and skills.


Necessary background for this course

  1. A solid foundation in machine learning and familiarity with deep learning.

  2. Math: Proficiency in linear algebra and probability theory (important).

  3. Programming: Python, PyTorch and NumPy.

  4. The Deep|Bayes 2018 summer course is taught in English, so students should be comfortable with technical English.


What can I learn at Deep|Bayes?

  • Why is the Bayesian approach so useful (in machine learning and everyday life)? What exactly is randomness?

  • Latent variable models. How do you train models to recognize patterns that were unknown before training?

  • Scalable probabilistic models. Why is it useful to turn a probabilistic inference problem into an optimization problem?

  • The link between reinforcement learning and Bayesian methods. How do you train a stochastic computation graph?

  • Automatic tuning of dropout rates. Can neural networks overfit? (They will.)

  • Stochastic optimization. How can you optimize a function faster than computing its value at a single point?

The goal of this course is to show that using Bayesian methods in deep learning can extend its applications and improve performance. Although machine learning includes many different problem settings, probabilistic inference in Bayesian networks lets us tackle them in a similar way. Sound interesting?


Main contents of the course

The course covers all aspects of Bayesian learning, from the most basic Bayes principle to the more difficult topics of variational inference and Markov chain Monte Carlo methods. Below is a list of topics for the entire course; Heart of the Machine will briefly introduce some of the course content.


Day 1:

  • Introduction to Bayesian methods

  • Bayesian inference

  • Latent variable models and the EM algorithm

  • The EM algorithm


Day 2:

  • Introduction to stochastic optimization

  • Scalable Bayesian methods

  • Variational autoencoder

  • Discrete latent variables


Day 3:

  • Advanced methods for variational inference

  • Reinforcement learning from the perspective of variational inference

  • Reinforcement learning

  • Distributed reinforcement learning


Day 4:

  • Generative models

  • Adversarial learning

  • Extensions of the reparameterization trick


Day 5:

  • Gaussian process

  • Bayesian optimization

  • Deep Gaussian processes

  • Markov chain Monte Carlo method

  • Stochastic Markov chain Monte Carlo methods


Day 6:

  • Bayesian Neural Networks and variational Dropout

  • Sparse variational Dropout and variance networks

  • Information bottleneck

The full course takes six days, and each day covers a large amount of material, so Heart of the Machine only briefly introduces the most basic Bayesian methods and latent variable models. Bayesian methods are the core idea of the whole course, while latent variable models are the basis of many advanced methods such as generative models.


Introduction to Bayesian methods

We first introduce Bayes’ theorem using the example of “blind men touching an elephant”, and then briefly describe the difference between the frequentist and Bayesian schools.

1. Bayes’ theorem:

First of all, the basic form of Bayes’ theorem is

posterior = likelihood × prior / evidence,

which is formalized as

P(x | y) = P(y | x) P(x) / P(y).

Now we come to the question of the blind men and the elephant.

A group of “blind men” were touching an elephant, trying to guess what it was, but none of them guessed correctly. In an uncertain world, this is what it looks like when we use probability theory to understand the world.

To keep things simple, let’s change the question: a group of “blind men” are touching an elephant, already know that it is an elephant, and want to guess its weight based on what they feel.

How does the Bayesian approach solve this problem?

We assume that the blind men communicate their observations to each other and share some common knowledge, namely an initial guess about the elephant’s weight.

Then they can proceed as follows:

The first person touches the tail and observes its length y1, then updates his guess of the elephant’s weight;

The second person takes the first person’s guess as a prior, touches the belly and observes its area y2, and then guesses the elephant’s weight again;

The third person does the same: starting from the second person’s guess, he makes his own observation and updates the guess again…

In this process, the initial common-sense guess about the elephant’s weight is the prior P(x), the first person’s observation gives the likelihood P(y1 | x), the probability of the observation itself is the evidence P(y1), and combining them yields the posterior P(x | y1), i.e. the probability (distribution) that the elephant weighs x given the observation y1: P(x | y1) = P(y1 | x) P(x) / P(y1).

On this basis, the second person obtains P(x | y1, y2).

The third person obtains P(x | y1, y2, y3), and so on…

As the observations accumulate, the estimate of the elephant’s weight becomes more and more certain (the peak of the posterior becomes sharper).
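Below is a minimal sketch of this sequential updating with a Gaussian prior on the weight and Gaussian observation noise. All numbers (prior mean, noise level, observed values) are invented for illustration and are not from the course.

```python
# Sequential Bayesian updating of a Gaussian belief about the elephant's weight.
import numpy as np

# Prior common sense: the elephant weighs about 4000 kg, give or take 1000 kg.
prior_mean, prior_var = 4000.0, 1000.0 ** 2

# Each person's touch yields a noisy weight estimate (hypothetical values).
observations = [4800.0, 4300.0, 4550.0]   # y1, y2, y3
noise_var = 500.0 ** 2                    # assumed observation noise

mean, var = prior_mean, prior_var
for i, y in enumerate(observations, start=1):
    # Conjugate Gaussian update: posterior is proportional to likelihood * prior.
    var_new = 1.0 / (1.0 / var + 1.0 / noise_var)
    mean = var_new * (mean / var + y / noise_var)
    var = var_new
    print(f"after y{i}: mean = {mean:.0f} kg, std = {np.sqrt(var):.0f} kg")

# The posterior standard deviation shrinks with every observation:
# the distribution's peak gets sharper, exactly as described above.
```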

Of course, the instructor explains these concepts step by step in the course, including the relationship between conditional, joint, and marginal distributions, and introduces the product rule and the sum rule. All the concepts involved in the above example are connected together to help students understand them more thoroughly.

2. The connection and difference between the frequentist and Bayesian schools:

The frequentist school does not assume any prior knowledge and does not refer to past experience; it makes probabilistic inferences only from the available data. The Bayesian school assumes the existence of prior knowledge (the initial guess of the elephant’s weight) and then uses the observed samples to gradually revise that prior and approach the true value. In fact, as the amount of data tends to infinity, the frequentist and Bayesian approaches give the same result; in other words, the frequentist method can be seen as a limiting case of the Bayesian method.
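A small sketch of that last claim, using a coin with an unknown heads probability and a hypothetical Beta(2, 2) prior: as the number of observations grows, the Bayesian posterior mean approaches the frequentist estimate (the raw frequency).

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.7
alpha0, beta0 = 2.0, 2.0            # assumed Beta prior over the heads probability

for n in [10, 100, 10_000]:
    flips = rng.random(n) < true_p
    heads = flips.sum()
    freq_estimate = heads / n                                  # frequentist MLE
    bayes_estimate = (alpha0 + heads) / (alpha0 + beta0 + n)   # posterior mean
    print(f"n = {n:6d}: frequentist = {freq_estimate:.3f}, "
          f"Bayesian = {bayes_estimate:.3f}")
# With more data the two estimates coincide, matching the statement above.
```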

The above is the general outline of the basic theory of Bayesian methods; the course then covers the differences between generative and discriminative models, the Bayesian training process, and the advantages of Bayesian methods.


Latent variable models

In this chapter, Dmitry Vetrov focuses on latent variable models. Latent variable models are the basis of many more complex methods. For example, in a generative model such as the variational autoencoder, we want to compress an image into a set of latent variables that represent high-level semantic information about the image, such as the tilt angle, color, and position of the image subject.

In this section, following Dmitry Vetrov’s introduction, we discuss the intuition behind latent variable models, KL divergence, mixture distributions, and the variational lower bound.

As mentioned before, the biggest advantage of a VAE is that the short code vector in the middle represents semantic features of the image; but because we cannot know exactly which features of the image it captures, we call this short vector a latent variable. Intuitively, generating an image from scratch, pixel by pixel, is very difficult, because there are so many possibilities to consider. It is much easier to first decide which features you want to generate and then generate the image from that blueprint.

A VAE does exactly this: it first learns how to compress an image into a set of latent variables, and then learns how to generate an image from those latent variables. Once the model is trained, given any set of latent variables it will try to generate a plausible image. That is the intuition behind latent variable models.

KL divergence is commonly used as a measure of the distance between two distributions and often appears in the loss functions of generative models. Intuitively, the more the distributions Q(z) and P(z) overlap, the smaller the KL divergence and the closer the two distributions are to each other.

For discrete variables, the KL divergence measures the amount of extra information needed to send a message containing symbols drawn from distribution P when we use a code designed to minimize the length of messages drawn from distribution Q. KL divergence has many useful properties, the most important of which is that it is non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.
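A minimal sketch of the overlap intuition, using two discretized Gaussians on an invented grid: as Q is moved toward P, the two distributions overlap more and the KL divergence shrinks.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(P || Q) for two normalized probability vectors."""
    mask = p > 0                      # the convention 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

z = np.linspace(-10, 10, 2001)

def gaussian_pmf(mean, std):
    # Discretize a Gaussian on the grid and normalize it to sum to 1.
    pdf = np.exp(-0.5 * ((z - mean) / std) ** 2)
    return pdf / pdf.sum()

p = gaussian_pmf(0.0, 1.0)
for q_mean in [4.0, 2.0, 0.5, 0.0]:
    q = gaussian_pmf(q_mean, 1.0)
    print(f"Q centered at {q_mean:3.1f}: KL(P || Q) = {kl_divergence(p, q):.3f}")
# The closer Q's mean is to P's, the more they overlap and the smaller the KL.
```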

Dmitry Vetrov then showed a case of latent variable modeling. If we have some samples drawn from an unknown Gaussian distribution, we can infer the mean and variance of that distribution using maximum likelihood or point estimation.

Now suppose we have a set of samples drawn from several different Gaussian distributions and need to estimate the parameters of each of them. The problem seems intractable, but it becomes much easier if we know which sample was drawn from which Gaussian.

But if we don’t know which Gaussian each sample came from, we have to use a latent variable model. The main idea is to first estimate which Gaussian each sample belongs to, i.e. to map the samples to latent variables, and then model the three Gaussian distributions (their means and variances) based on those latent variables.

Following this idea, we can build a Gaussian mixture model: we encode the data with a latent variable z and then carry out the modeling conditioned on that latent variable. When the latent variable z is unknown, maximizing the probability of the observed samples x leads to the variational lower bound that we maximize instead, which is also the central expression of the variational autoencoder.
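Below is a minimal sketch (not the course’s code) of fitting a one-dimensional mixture of three Gaussians with the EM algorithm mentioned in the Day 1 topics; the synthetic data and all settings are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three Gaussians with different means and spreads.
x = np.concatenate([rng.normal(-4, 1.0, 300),
                    rng.normal(0, 0.5, 300),
                    rng.normal(5, 1.5, 300)])

K = 3
means = rng.choice(x, K)                 # crude initialization
variances = np.full(K, x.var())
weights = np.full(K, 1.0 / K)

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibilities = posterior over the latent component z.
    resp = weights * gaussian_pdf(x[:, None], means, variances)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate each Gaussian from its weighted samples.
    nk = resp.sum(axis=0)
    means = (resp * x[:, None]).sum(axis=0) / nk
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    weights = nk / len(x)

print("estimated means:", np.round(np.sort(means), 2))
```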

The variational lower bound (ELBO) being maximized serves as the optimization objective, or loss function, of the whole variational autoencoder. In the mixture example above, maximizing this variational lower bound amounts to finding, for every sample, the Gaussian distribution it is most likely to belong to.
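To make the role of the ELBO as a loss function concrete, here is a minimal PyTorch sketch of a variational autoencoder trained with the negative ELBO: a reconstruction term plus KL(q(z|x) || p(z)) with a standard normal prior. The architecture and sizes are hypothetical, not taken from the course materials.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, z_dim)       # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden, z_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term: Bernoulli log-likelihood of the pixels.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage on a random batch (a stand-in for real image data):
model = TinyVAE()
x = torch.rand(32, 784)
logits, mu, logvar = model(x)
loss = negative_elbo(x, logits, mu, logvar)
loss.backward()
```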

The whole course introduces a lot of theory, especially the various theories of the Bayesian school. If you are confident in your mathematics, it is worth going through this series of tutorials in detail.