Lilian Weng
I recently listened to Prof. Naftali Tishby's talk "Information Theory in Deep Learning" and found it very interesting. In the talk, he explained how information theory can be used to study the growth and transformation of deep neural networks (DNNs). Using the Information Bottleneck (IB) method, he opened up a new way of analyzing DNNs, a setting where traditional learning theory fails because the number of parameters grows exponentially. Another keen observation is that DNN training consists of two distinct phases: first, the network learns to fully represent the input data and to minimize the generalization error; then it learns to forget irrelevant details by compressing the representation of the input.
Basic concepts
Markov chain
A Markov process is a stochastic process with the "memoryless" property (also known as the Markov property). A Markov chain is a Markov process over multiple discrete states: the conditional probability of the process's future state depends only on the current state, not on the past states.
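Stated as a formula (a standard form of the Markov property, using generic states $X_1, X_2, \dots$ rather than anything specific to the talk):

$$P(X_{n+1} \mid X_n, X_{n-1}, \dots, X_1) = P(X_{n+1} \mid X_n)$$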
KL divergence
KL divergence measures how much one probability distribution $P$ diverges from a second, expected probability distribution $Q$. It is asymmetric.
$$D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

$D_{KL}$ reaches its minimum value of zero when $p(x) = q(x)$ everywhere.
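A minimal sketch of this formula for discrete distributions, using NumPy (the example distributions `p` and `q` below are made up for illustration, not taken from the article):

```python
import numpy as np

def kl_divergence(p, q):
    """Compute D_KL(P || Q) for two discrete distributions given as arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Only sum over outcomes where p(x) > 0; the 0 * log 0 terms are defined as 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.9, 0.05, 0.05])
q = np.array([0.3, 0.35, 0.35])

print(kl_divergence(p, q))  # > 0
print(kl_divergence(q, p))  # a different value: KL divergence is asymmetric
print(kl_divergence(p, p))  # 0.0: the minimum is reached when p == q everywhere
```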
Mutual information
Mutual information measures the degree of interdependence between two variables. It quantifies the "amount of information" one random variable carries about another. Mutual information is symmetric.
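In terms of entropy $H$ and the KL divergence defined above, mutual information can be written as (a standard identity):

$$I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = H(X) - H(X \mid Y)$$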
Data Processing Inequality (DPI)
For any Markov chain $X \to Y \to Z$, we have $I(X; Y) \geq I(X; Z)$.
A deep neural network can be viewed as a Markov chain, so as we move through the layers of a DNN, the mutual information between a layer and the input can only decrease.
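A toy numerical check of the DPI (my own illustration, not from the talk): build a small Markov chain $X \to Y \to Z$ from made-up transition matrices and verify that $I(X; Y) \geq I(X; Z)$.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats, computed from a 2-D joint probability table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

p_x = np.array([0.3, 0.7])                    # P(X)
p_y_given_x = np.array([[0.9, 0.1],           # P(Y | X): rows indexed by x
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],           # P(Z | Y): rows indexed by y
                        [0.4, 0.6]])

joint_xy = p_x[:, None] * p_y_given_x          # P(X, Y)
# Z depends on X only through Y, so X -> Y -> Z is a Markov chain.
joint_xz = joint_xy @ p_z_given_y              # P(X, Z)

print(mutual_information(joint_xy))  # I(X;Y)
print(mutual_information(joint_xz))  # I(X;Z) <= I(X;Y), as the DPI predicts
```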
Reparametrization invariance
For two invertible functions $\phi$ and $\psi$, the mutual information is unchanged: $I(X; Y) = I(\phi(X); \psi(Y))$.
For example, if we shuffle the weights in one layer of a DNN, it does not affect the mutual information between that layer and another layer.
Deep neural networks as Markov chains
The training data consists of observations sampled from the joint distribution of $X$ and $Y$. The input variable $X$ and the weights of the hidden layers are high-dimensional random variables. The ground-truth target $Y$ and the prediction $\hat{Y}$ are random variables of smaller dimension in the classification setting.
Figure 1: The structure of a deep neural network, consisting of the target label $Y$, the input layer $X$, the hidden layers $h_1, \dots, h_m$, and the prediction $\hat{Y}$.
If we label the hidden layers of the DNN as $h_1, h_2, \dots, h_m$ as shown in Figure 1, we can view each layer as one state of a Markov chain: $h_i \to h_{i+1}$. According to the DPI, we have:

$$H(X) \geq I(X; h_1) \geq I(X; h_2) \geq \dots \geq I(X; h_m) \geq I(X; \hat{Y})$$

$$I(X; Y) \geq I(h_1; Y) \geq I(h_2; Y) \geq \dots \geq I(h_m; Y) \geq I(\hat{Y}; Y)$$
A DNN is designed to learn how to describe $X$ in order to predict $Y$, and eventually to compress $X$ so that it holds only the information relevant to $Y$. Tishby describes this process as "the successive refinement of relevant information".
Information plane theorem
A DNN implements the internal representations of $X$ as a set of hidden layers $\{T_i\}$. According to the information plane theorem, each layer is characterized by its encoder and decoder information: the encoder is a representation of the input data $X$, while the decoder translates the information in the current layer into the target output $Y$.
Specifically, in the information plane plot (see the sketch after this list for how these quantities can be estimated):

- X-axis: the sample complexity of $T_i$, determined by the encoder mutual information $I(X; T_i)$. Sample complexity refers to how many samples are needed to achieve a certain accuracy and generalization.
- Y-axis: the accuracy (generalization error), determined by the decoder mutual information $I(T_i; Y)$.
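A rough sketch of how these two quantities are often estimated in experiments of this kind: discretize (bin) the activations of a hidden layer and compute mutual information from the empirical joint distributions. This is my own simplified illustration with made-up toy data (`inputs`, `labels`, `activations` are placeholders), not the exact procedure behind Figure 2.

```python
import numpy as np
from collections import Counter

def discrete_mutual_information(a, b):
    """Estimate I(A;B) in nats from paired samples of two discrete variables."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum(c / n * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in joint.items())

def binned(activations, n_bins=30):
    """Map each activation vector to a discrete id (a tuple of bin indices)."""
    edges = np.linspace(activations.min(), activations.max(), n_bins)
    return [tuple(row) for row in np.digitize(activations, edges)]

# Toy placeholders: 1000 samples, 12-dim input, binary label, 5-unit hidden layer.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 12))
labels = (inputs.sum(axis=1) > 0).astype(int)
activations = np.tanh(inputs @ rng.normal(size=(12, 5)))   # stand-in for a layer T

x_ids = [tuple(row) for row in np.round(inputs, 1)]   # treat each (rounded) input as one symbol
t_ids = binned(activations)

encoder_info = discrete_mutual_information(x_ids, t_ids)   # ~ I(X; T), x-axis
decoder_info = discrete_mutual_information(t_ids, labels)  # ~ I(T; Y), y-axis
print(encoder_info, decoder_info)
```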
Figure 2: The encoder vs. decoder mutual information of DNN hidden layers over 50 experiments. Different layers are color-coded, with green being the layer right next to the input and orange the layer furthest from the input. There are three snapshots: the initial epoch, 400 epochs, and 9000 epochs.
Each dot in Figure 2 marks the encoder/decoder mutual information of one hidden layer of one network simulation (no regularization is used: no weight decay, no dropout, etc.). The dots move up over time as expected, because knowledge about the true labels increases (accuracy increases). In the early epochs, the hidden layers learn a lot about the input, but later they begin to compress and forget some of the input information. Tishby argues that "the most important part of learning is actually forgetting".
Figure 3: The aggregated view of Figure 2. Compression happens after the generalization error becomes very small.
Two stages of optimization
Tracking the mean and standard deviation of each layer's weight gradients over time also reveals two optimization phases in the training process.
Figure 4: The norms of the mean and standard deviation of each layer's weight gradients, as a function of training epochs. Different layers are color-coded.
In the early epochs, the mean values are three orders of magnitude larger than the standard deviations. After enough epochs, the error saturates and the standard deviations become much noisier. The further a layer is from the output, the noisier it gets, because the noise gets amplified and accumulated through backpropagation (and not because of the layer's width).
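A minimal PyTorch-style sketch of how one might record these per-layer gradient statistics during training (my own illustration; the model, data, and hyperparameters are placeholders, not those used in the original experiments):

```python
import torch
import torch.nn as nn

# Placeholder model and data; the original experiments use a different setup.
model = nn.Sequential(nn.Linear(12, 10), nn.Tanh(), nn.Linear(10, 8), nn.Tanh(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

X = torch.randn(1024, 12)
y = (X.sum(dim=1, keepdim=True) > 0).float()

history = {name: [] for name, p in model.named_parameters() if p.dim() > 1}

for epoch in range(200):
    # Collect the weight gradients of every mini-batch within this epoch.
    per_batch_grads = {name: [] for name in history}
    for batch in torch.split(torch.randperm(len(X)), 64):
        optimizer.zero_grad()
        loss = loss_fn(model(X[batch]), y[batch])
        loss.backward()
        for name, p in model.named_parameters():
            if name in history:
                per_batch_grads[name].append(p.grad.detach().flatten().clone())
        optimizer.step()
    # For each layer: norm of the mean gradient vs. norm of its std across batches.
    for name, grads in per_batch_grads.items():
        stacked = torch.stack(grads)                     # (num_batches, num_weights)
        history[name].append((stacked.mean(dim=0).norm().item(),
                              stacked.std(dim=0).norm().item()))

# history[layer] now holds (||mean grad||, ||std of grad||) per epoch, as in Figure 4.
```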
Learning theory
The “old” generalization bounds
The generalization bounds defined by classical learning theory are:

$$\epsilon^2 < \frac{\log |H_\epsilon| + \log (1/\delta)}{2m}$$

- $\epsilon$: the difference between the training error and the generalization error. The generalization error measures how accurately an algorithm predicts previously unseen data.
- $H_\epsilon$: the hypothesis space; typically we assume its size is $|H_\epsilon| \sim (1/\epsilon)^d$.
- $\delta$: the confidence.
- $m$: the number of training samples.
- $d$: the VC dimension of the hypothesis.
This definition states that the difference between the training error and the generalization error is bounded by a function of the hypothesis-space size and the dataset size. The bigger the hypothesis space, the bigger the generalization error.
However, this does not work for deep learning. The larger a network is, the more parameters it needs to learn. With this generalization bound, larger networks (larger $d$) would have worse bounds. This is contrary to the intuition that larger networks achieve better performance thanks to their higher expressivity.
The “new” input compression bound
In response to this counterintuitive observation, Tishby et al. proposed a new input compression bound for DNNs.
Let us first define $T_\epsilon$ as an $\epsilon$-partition of the input variable $X$. This partition compresses the input into small cells with respect to the homogeneity of the labels; together, the cells cover the whole input space. If the prediction outputs binary values, we can replace the cardinality of the hypothesis space, $|H_\epsilon|$, with $2^{|T_\epsilon|}$.
When $X$ is large, the size of $X$ is approximately $2^{H(X)}$, and each cell in the $\epsilon$-partition has size $2^{H(X \mid T_\epsilon)}$. Therefore we have $|T_\epsilon| \sim 2^{H(X)} / 2^{H(X \mid T_\epsilon)} = 2^{I(X; T_\epsilon)}$. The input compression bound then becomes:

$$\epsilon^2 < \frac{2^{I(X; T_\epsilon)} + \log (1/\delta)}{2m}$$
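A back-of-the-envelope illustration of why this matters (my own toy numbers, purely illustrative): evaluate both bounds and note that the classic bound grows with the parameter count $d$, while the compression bound depends only on how much information about the input the representation keeps, $I(X; T_\epsilon)$.

```python
import math

def classic_bound(log_hypothesis_size, delta, m):
    """Upper bound on epsilon^2 from the classic result: (log|H| + log(1/delta)) / (2m)."""
    return (log_hypothesis_size + math.log(1 / delta)) / (2 * m)

def compression_bound(mutual_info_bits, delta, m):
    """Upper bound on epsilon^2 from the compression bound: (2^I(X;T) + log(1/delta)) / (2m)."""
    return (2 ** mutual_info_bits + math.log(1 / delta)) / (2 * m)

delta, m = 0.05, 50_000

# Classic bound with |H| ~ (1/eps)^d: log|H| = d * log(1/eps); try eps = 0.1 and a large d.
d = 1_000_000                                       # parameter count of a large network
print(classic_bound(d * math.log(10), delta, m))    # huge -> vacuous bound

# Compression bound for a layer that keeps only ~12 bits of information about the input.
print(compression_bound(12, delta, m))              # much smaller -> meaningful bound
```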
Figure 5: The black line is the best achievable IB limit. The red line corresponds to the upper bound on the out-of-sample IB distortion when training on a finite sample set. $\Delta C$ is the complexity gap and $\Delta G$ is the generalization gap.
Size of network and training data
The benefits of more hidden layers
Having more layers gives us computational benefits and speeds up the training process for better generalization.
Figure 6: The optimization time is much shorter (fewer epochs) with more hidden layers.
Compression through stochastic relaxation: according to the diffusion equation, the relaxation time of layer $k$ is proportional to the exponent of that layer's compression amount $\Delta S_k$: $\Delta t_k \sim \exp(\Delta S_k)$. We can compute the layer compression as $\Delta S_k = I(X; T_k) - I(X; T_{k-1})$. Because $\Delta t_k \sim \exp(\Delta S_k)$, we expect an exponential reduction in training epochs when using more hidden layers (larger $k$).
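A toy calculation of this claim (my own illustration with made-up numbers): if a fixed total compression $\Delta S$ is split evenly across $K$ layers, the total relaxation time scales like $K \cdot \exp(\Delta S / K)$ instead of $\exp(\Delta S)$, so adding layers cuts the expected training time dramatically.

```python
import math

def total_relaxation_time(total_compression_nats, num_layers):
    """Sum of per-layer relaxation times when compression is split evenly across layers.

    Each layer's time is ~ exp(delta_S_k), with delta_S_k = total / num_layers
    (up to an unknown constant factor, which cancels in relative comparisons).
    """
    per_layer = total_compression_nats / num_layers
    return num_layers * math.exp(per_layer)

total_compression = 20.0   # made-up total compression to be achieved, in nats
for k in (1, 2, 4, 8):
    print(k, total_relaxation_time(total_compression, k))
# The total time drops sharply as the compression is spread over more layers.
```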
The benefits of more training samples
Fitting more training data requires the hidden layers to capture more information. As the amount of training data increases, the decoder mutual information $I(T_i; Y)$ (recall that this is directly related to the generalization error) is pushed up, closer to the theoretical IB bound. Tishby emphasized that, unlike in standard learning theory, it is the mutual information, not the layer size or the VC dimension, that determines generalization.
Figure 7: Training data of different sizes is color-coded. The information planes of multiple converged networks are plotted here. More training data leads to better generalization.
Original article: lilianweng.github.io/lil-log/201…