Starting from Slerp
Among the ICLR 2017 submissions, there was an interesting but rejected entry titled “Sampling Generative Networks” by Tom White. The article loosely describes a number of sampling and visualization techniques that are useful in latent spaces. One important point is that in GAN latent spaces, interpolating along the “arc” between two sample points makes more sense than the usual linear interpolation. This is done by extending Slerp (Spherical Linear Interpolation) to high-dimensional space:
The idea is not hard to understand. Take this picture from the Wikipedia article on Slerp:
What we want is the point P in the diagram. Based on the Wiki diagram, I added three points A, B and O, where O is the origin, so P can be written as the linear combination

$$P = \alpha A + \beta B$$

where $\theta$ is the angle between $A$ and $B$, and $P$ sits at angle $t\theta$ from $A$ for the interpolation fraction $t \in [0, 1]$. To find $\alpha$, first introduce the vector perpendicular to $B$ in the plane of $A$ and $B$: that's the blue arrow. Projecting both sides onto the blue vector kills the $B$ term, so $\alpha$ is the ratio of the projection of $P$ to the projection of $A$, i.e. the orange segments on either side of the blue arrow over the red one. This ratio is exactly $\frac{\sin((1-t)\theta)}{\sin\theta}$, which is $\alpha$. Obviously the same can be done with the vector perpendicular to $A$, giving $\beta = \frac{\sin(t\theta)}{\sin\theta}$. Plugging both in, the Slerp formula is obtained:

$$\mathrm{Slerp}(A, B; t) = \frac{\sin((1-t)\theta)}{\sin\theta} A + \frac{\sin(t\theta)}{\sin\theta} B$$

And notice that even though I did the derivation with $A$ and $B$ on the same sphere, Slerp also interpolates naturally when their lengths differ: the length of the interpolated vector stays between the two and increases or decreases monotonically (nonlinearly). This is also discussed in the ICLR Review.
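A minimal numpy sketch of this formula (the function name and the linear fallback for nearly parallel inputs are my own choices):

import numpy

def slerp(a, b, t):
    # angle between a and b, clipped for numerical safety
    cos_theta = numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
    theta = numpy.arccos(numpy.clip(cos_theta, -1.0, 1.0))
    if numpy.isclose(theta, 0):
        # nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * a + t * b
    return (numpy.sin((1 - t) * theta) * a + numpy.sin(t * theta) * b) / numpy.sin(theta)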
What are the advantages of using Slerp over pure linear interpolation? The author explains it as follows:
“Frequently linear interpolation is used, which is easily understood and implemented. But this is often inappropriate as the latent spaces of most generative models are high dimensional (> 50 dimensions) with a Gaussian or uniform prior. In such a space, linear interpolation traverses locations that are extremely unlikely given the prior. As a concrete example, consider a 100 dimensional space with the Gaussian prior µ=0, σ=1. Here all random vectors will generally have a length very close to 10 (standard deviation < 1). However, linearly interpolating between any two will usually result in a “tent-pole” effect as the magnitude of the vector decreases from roughly 10 to 7 at the midpoint, which is over 4 standard deviations away from the expected length.”
In other words, linear interpolation in a high-dimensional (>50) space passes through some very unlikely locations: as if the data were distributed on the tent cloth, but linear interpolation goes through the tent pole.
To understand this phenomenon in more detail, we need to start from the priors commonly used in GANs. The most common are Uniform and Gaussian (it feels like Gaussian is in the majority right now). With either prior, for an N-dimensional sample $x$, the Euclidean distance to the center is:

$$d = \sqrt{\sum_{i=1}^{N} x_i^2}$$
In general, the dimension of GAN's sampling space is fairly high. We can then regard $d^2$ as the sum of $N$ independent, identically distributed random variables $x_i^2$, so by the Central Limit Theorem, $d^2$ is approximately normally distributed (exactly, it follows a Chi-square distribution). Consider the very common case of a 100-dimensional standard normal prior: each $x_i^2$ follows a Chi-square distribution with $k=1$, which has mean 1 and variance 2, so the squared distance from a sample to the origin is approximately $N(100, 200)$, with standard deviation $\sqrt{200} \approx 14.14$. If we consider $14.14/100$ small enough, then $d$ itself can also be approximated as Gaussian (exactly, it follows a Chi distribution), with mean 10 and standard deviation about 0.707, which is what the author said in the original text.
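These Chi-distribution values can be checked directly with scipy (a quick sketch of my own, not from the paper):

from scipy.stats import chi

# the distance from a 100-dim standard Gaussian sample to the origin
# follows a Chi distribution with df=100
print(chi.mean(df=100))  # ~9.97, close to 10
print(chi.std(df=100))   # ~0.71, close to 0.707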
A Uniform prior is similar but messier, because the uniform distribution is not isotropic: in high dimensions, a hypercube is like a ball with spikes around it, one spike per quadrant pointing at a corner. I don't know how to derive this analytically, but it's easy to estimate with a program. Taking 100,000 samples from a 100-dimensional prior uniform on [-1, 1] in each dimension, the mean distance from sample to center is about 5.77, with standard deviation about 0.258. So in either case, samples in high-dimensional space share one characteristic: they are far from the center and concentrated near the mean distance. Linear interpolation is therefore like interpolating along a tent pole, passing through regions where the probability of real samples appearing is extremely low.
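The estimate takes only a few lines (my own sketch):

import numpy

# 100,000 samples, uniform on [-1, 1] in each of 100 dimensions
samples = numpy.random.uniform(-1, 1, size=(100000, 100))
d = numpy.linalg.norm(samples, axis=1)
print(d.mean(), d.std())  # roughly 5.77 and 0.258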
Why, under a 100-dimensional Gaussian prior, does the distance from the linear interpolation path to the center dip from 10 to about 7? I can't derive it rigorously, but qualitatively, two random samples have a very strong tendency to be orthogonal: pointing in the same (or opposite) direction would require being close (or far apart) in every dimension at once, and that probability is very low. And if two length-10 samples are orthogonal, the midpoint of the segment between them has length $10/\sqrt{2} \approx 7.07$. Quantitatively, we can write a program to simulate it:
import numpy
from matplotlib import pyplot

def dist_o2l(p1, p2):
    # distance from the origin to the line defined by (p1, p2)
    p12 = p2 - p1
    u12 = p12 / numpy.linalg.norm(p12)
    l_pp = numpy.dot(-p1, u12)
    pp = l_pp * u12 + p1  # foot of the perpendicular from the origin
    return numpy.linalg.norm(pp)

dim = 100
N = 100000

rvs = []
dists2l = []
for i in range(N):
    u = numpy.random.randn(dim)
    v = numpy.random.randn(dim)
    rvs.extend([u, v])
    dists2l.append(dist_o2l(u, v))

dists = [numpy.linalg.norm(x) for x in rvs]

print('Distances to samples, mean: {}, std: {}'.format(numpy.mean(dists), numpy.std(dists)))
print('Distances to lines, mean: {}, std: {}'.format(numpy.mean(dists2l), numpy.std(dists2l)))

fig, (ax0, ax1) = pyplot.subplots(ncols=2, figsize=(11, 5))
ax0.hist(dists, 100, density=True, color='g')
ax1.hist(dists2l, 100, density=True, color='b')
pyplot.show()
The results are as follows:
The left plot is the distribution of distances from samples to the center of latent space; the right plot is the distribution of distances from the origin to the line through two randomly drawn samples, i.e. the closest that a linear interpolation path gets to the center. As you can see, random sampling plus linear interpolation really does routinely pass through regions (distance 5-7.5 from the origin) where samples almost never appear. Nevertheless, Sampling Generative Networks was rejected, with the comment: “Neither the reviewers nor I was convinced that slerp makes more sense than linear interpolation”. It feels like Tom White was wronged here: it's certainly not a dramatic improvement, but it's a valid one. Why didn't the reviewers feel it was much better than linear interpolation? The reason may be:
Linearity based on ReLU networks
When CNNs hit the headlines in 2012, it should be said that the ReLU family of activation functions played a crucial role in making deep networks trainable. Whether it's LeakyReLU, ELU, Swish, etc., the part above zero is (nearly) linear. So although nonlinear activations are the basis of neural networks as universal approximators, ReLU-based networks are in fact quite linear. For common discriminative networks, Ian Goodfellow argues that this linearity, combined with the strong expressive power of distributed representations, is what makes networks vulnerable to adversarial examples (see his paper for details). Based on this insight, the Fast Gradient Sign Method was invented to generate adversarial examples quickly.
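For reference, a minimal PyTorch sketch of the Fast Gradient Sign Method (model and loss_fn are assumed placeholders, not from the original post):

import torch

def fgsm(model, loss_fn, x, y, eps=0.01):
    # FGSM: step along the sign of the input gradient; because the
    # network is nearly linear, a small step in this direction is
    # often enough to flip the prediction
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()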
So how linear is a ReLU-based CNN? Let's first take a look at the generative network, taking DCGAN as an example. The schematic is as follows:
In terms of structure, DCGAN is even more linear than a common discriminative network: there is no max pooling, and the least linear part is just the Tanh on the final output image. In particular, for the step from latent space to the first group of feature maps, the common implementation treats the 100-dimensional noise as a 100-channel, 1×1 feature map and up-samples it directly with a bias-free transposed convolution. That step is a purely linear transformation! Qualitatively, if the entire subsequent portion of the network were sufficiently linear, then scaling a latent sample should scale all output dimensions at the same time, resulting in almost the same image at a different contrast.
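The exact homogeneity of that first layer is easy to verify (a sketch using the layer shape from PyTorch's official DCGAN example):

import torch
from torch import nn

nz, ngf = 100, 64
# latent vector as a 100-channel 1x1 feature map, up-sampled by a
# bias-free transposed convolution: a purely linear map
layer = nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False)

z = torch.randn(1, nz, 1, 1)
print(torch.allclose(layer(2 * z), 2 * layer(z), atol=1e-5))  # True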
Training a GAN generator makes it possible to check this conclusion. Thanks go to He Zhiyuan, whose article "GAN Learning Guide: From Principles to Making a Demo" provides a high-quality, easy-to-download anime avatar dataset for GAN training. Based on this data and PyTorch's official DCGAN example, it was easy to train a model. With the trained model, pairs of random samples were drawn and both Slerp and linear interpolation were performed. The results are as follows:
Rows 1, 3 and 5 are linear interpolation; rows 2, 4 and 6 are Slerp. If you look carefully, the middle portion of the linear interpolation results is slightly lighter than Slerp's, and the other differences are very small. No wonder the reviewers found Tom White's conclusion unconvincing: without a side-by-side comparison, linear interpolation looks perfectly fine. Moreover, because samples in high-dimensional space are all far from the center, the "identity" of faces along a linear interpolation path is about as stable as along Slerp. Still, with the above analysis in mind, looking back at the linear interpolation results in the original DCGAN paper, the middle images really do look a little washed out…
Since a direct comparison between Slerp and linear interpolation is not very convincing, we can run a more brutal experiment: start a sample at the origin and walk in a random direction until it is 20 away from the origin. The result is as follows:
As you can see, as the latent sample moves away from the origin, the image content stays basically the same while the contrast increases until it saturates. At least to the human eye, the network looks pretty linear. What does it look like numerically? Removing the Tanh layer, picking some random samples and random positions in the output image, and plotting those values against the latent sample's distance from the center gives roughly the following:
As you can see, the linearity is obvious only at a great distance from the center, which is consistent with the plots in Goodfellow's paper. Around 10, where most samples concentrate, the linearity is only moderate, and some outputs are not even monotonic.
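A minimal sketch of the walk-along-a-ray experiment, assuming the trained netG from the DCGAN example above (netG and its 1×1×100 input shape are assumptions carried over from that code):

import numpy
import torch

nz, n_steps = 100, 23
u = numpy.random.randn(nz)
u /= numpy.linalg.norm(u)  # random unit direction

with torch.no_grad():
    for t in numpy.linspace(0, 20, n_steps):
        z = torch.FloatTensor((t * u).reshape(1, nz, 1, 1))
        img = netG(z)  # netG: trained generator, assumed loaded as above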
So what if the latent samples were generated only on a hypersphere? Or with the distance to the origin uniformly distributed? Give it a try; here are the results:
1) Generate latent samples on a hypersphere at distance 10 from the origin
To the naked eye it still looks linear, and the output curves look similar to those from direct Gaussian sampling.
2) The distance from the latent samples to the origin is uniformly distributed between 0 and 10
The curves are clearly different from the previous two cases, and the quality of the generated images drops a little. Judging from the generated images, however, the linearity is still strong: at least for the part within distance 10 of the center, the conclusion that the images keep the same "identity" still basically holds.
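For reference, the two sampling schemes above can be generated like this (my own sketch):

import numpy

nz = 100
u = numpy.random.randn(nz)
u /= numpy.linalg.norm(u)  # random direction on the unit hypersphere

# 1) fixed distance 10 from the origin
z_sphere = 10.0 * u

# 2) distance uniformly distributed on [0, 10]
z_uniform = numpy.random.uniform(0, 10) * u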
This is an interesting phenomenon. With Gaussian sampling as well as in cases 1) and 2), this very rough linearity is enough, to the human eye, for latent samples along one direction to produce images with an apparently indistinguishable "identity". Based on this, a rough corollary: the distinguishable variation in GAN's latent space happens only along the hypersphere, not along the radius.
Walk on Great Circle
After all this analysis, I arrived at a conclusion that doesn't seem to help much. Returning to the original question: linear interpolation passes through low-probability regions (even though that has little visible impact), and Slerp is not visually much better than linear interpolation. So is there a more elegant way to roam latent space? My answer: the Great Circle. Compared with Slerp between two points, a Great Circle path usually covers about 3 times more distance, so it is not fundamentally different from Slerp, but it feels better. Moreover, a walk along a Great Circle starts and ends at the same point, which also feels nicer.
Generating a Great Circle path is not much harder than Slerp: 1) sample a hypersphere radius R according to the prior used (for a Gaussian prior this is a Chi distribution, or the Gaussian approximation discussed above); 2) sample a random unit vector U and a random unit vector V perpendicular to U, and take the plane spanned by U and V as the plane of the Great Circle; 3) U and V act as the two axes of a coordinate system, so any point on the Great Circle can be represented by its projections on U and V; multiplying by R finally gives the samples of the walk along the Great Circle. The code is as follows:
from __future__ import print_function
import argparse
import os

import numpy
from scipy.stats import chi
import torch.utils.data
from torch.autograd import Variable
from PIL import Image

from networks import NetG

parser = argparse.ArgumentParser()
parser.add_argument('--nz', type=int, default=100, help='size of the latent z vector')
parser.add_argument('--niter', type=int, default=10, help='how many paths')
parser.add_argument('--n_steps', type=int, default=23, help='steps to walk')
parser.add_argument('--ngf', type=int, default=64)
parser.add_argument('--ngpu', type=int, default=1, help='number of GPUs to use')
parser.add_argument('--netG', default='netG_epoch_49.pth', help="trained params for G")
opt = parser.parse_args()

output_dir = 'gcircle-walk'
os.system('mkdir -p {}'.format(output_dir))
print(opt)

ngpu = int(opt.ngpu)
nz = int(opt.nz)
ngf = int(opt.ngf)
nc = 3

netG = NetG(ngf, nz, nc, ngpu)
netG.load_state_dict(torch.load(opt.netG, map_location=lambda storage, loc: storage))
netG.eval()
print(netG)

for j in range(opt.niter):
    # step 1: sample the hypersphere radius (Chi distribution for a Gaussian prior)
    r = chi.rvs(df=100)

    # step 2: sample unit vector u, then make v a unit vector orthogonal to u;
    # u and v span the plane of the Great Circle
    u = numpy.random.normal(0, 1, nz)
    w = numpy.random.normal(0, 1, nz)
    u /= numpy.linalg.norm(u)
    w /= numpy.linalg.norm(w)
    v = w - numpy.dot(u, w) * u
    v /= numpy.linalg.norm(v)

    ndimgs = []
    for i in range(opt.n_steps):
        t = float(i) / float(opt.n_steps)
        # step 3: points on the Great Circle are combinations of u and v, scaled by r
        z = numpy.cos(t * 2 * numpy.pi) * u + numpy.sin(t * 2 * numpy.pi) * v
        z *= r

        noise_t = z.reshape((1, nz, 1, 1))
        noise_t = torch.FloatTensor(noise_t)
        noisev = Variable(noise_t)

        fake = netG(noisev)
        timg = fake[0]
        timg = timg.data
        timg.add_(1).div_(2)  # map from [-1, 1] back to [0, 1]

        ndimg = timg.mul(255).clamp(0, 255).byte().permute(1, 2, 0).numpy()
        ndimgs.append(ndimg)

    print('exporting {} ... '.format(j))
    ndimg = numpy.hstack(ndimgs)
    im = Image.fromarray(ndimg)
    filename = os.sep.join([output_dir, 'gc-{:0>6d}.png'.format(j)])
    im.save(filename)
The results are as follows:
It doesn’t feel very useful, but in case anyone wants to try it, here’s the code: great-circle-interp
=========== Update ===========
I also tried this with the pretrained ResNet-18 provided by PyTorch: I picked 6 random images and scaled the input intensity from 0 to 10 times the original. The results are as follows:
At high input intensity you see the expected strong linearity. Qualitatively, the larger the input intensity, the further each neuron's value is from zero, and the lower the chance of crossing a ReLU's zero point along the way, so the response becomes more and more linear. From the plot, ResNet-18's nonlinear response within the reasonable input range (0~2) is much more pronounced than the GAN generator's. My guess is that a classification network makes no assumptions about its input and output, unlike GAN's latent space, which has a simple, highly symmetric prior that most likely acts as a regularizer. Qualitatively, this may also explain why networks like PPGN produce highly diverse images of much higher quality, and conditional GANs in the broad sense can be understood the same way. It might not be such a boring question, and I'd like to know the answer.
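For anyone who wants to reproduce the intensity sweep, a minimal sketch using torchvision's pretrained ResNet-18 (the image path and the choice of logits to track are placeholders of mine):

import numpy
import torch
from torchvision import models, transforms
from PIL import Image

net = models.resnet18(pretrained=True)
net.eval()

to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
x = to_tensor(Image.open('some_image.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    for s in numpy.linspace(0, 10, 21):
        out = net(s * x)              # scale the input intensity
        print(s, out[0, :3].numpy())  # track a few logits vs. intensity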