Author | Ryan Dahl
Compiled by | AI100
tinyclouds.org/residency/
Last year, after some success with TensorFlow, I applied to the first Google Brain Residency Program and was accepted. Twenty-four people, each with a different machine learning background, were invited to participate.
The twenty-four of us would spend a year in Google's deep learning research lab in Mountain View, working daily with Google scientists and engineers on cutting-edge TensorFlow research. Just thinking about it made me excited.
Now that the year-long program is over, it has indeed been fruitful. I want to sum up and share what I learned, in the hope that it helps others working in machine learning.
Let's start with my original goal: to enhance footage from old movies and TV shows.
Imagine what a grainy '90s TV show or a black-and-white movie from the '60s would look like remastered in 4K with vivid color.
And it seems perfectly feasible: we can easily degrade 4K video into grainy, low-resolution, even black-and-white footage, and then train a supervised model to reverse that degradation. The available training data is effectively endless. Ahem, that's a pretty compelling idea!
With that goal in mind, I moved from Brooklyn, New York, back to the Bay Area (the last time was for Node.js) to pursue this deep learning work. Within a few days, my daily routine consisted of discussions with deep learning experts at Google and browsing code in Google's vast software repository.
I'm going to go into a fair amount of technical detail about what I worked on. If you'd rather skip it, jump straight to the summary.
Pixel Recursive Super Resolution
The zoom-and-enhance technique the FBI uses on CSI is notoriously impossible: you cannot recover detail that was never captured. What you can do, however, is upscale a photo and fill in plausible detail based on the surrounding pixels. Being able to increase image resolution convincingly would be the first step toward my goal.
In the literature this problem is called super-resolution, and people have been trying to solve it for a long time.
We knew from prior work that a ConvNet alone would not fully solve the problem: it simply minimizes the per-pixel distance (an L2 loss) between its output and the high-resolution target for a given low-resolution input. A loss like that learns to output the average of all possible outcomes, so the resulting image looks blurry. What we wanted was a model that, for a given low-resolution input, picks out one particular plausible high-resolution image from all possible enhancements. If we "enhance" a blurry photo of a tree, we expect it to invent details for the branches and leaves, even if their positions are not the actual positions of the branches and leaves on the real tree.
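To make that concrete, here is a minimal sketch (in present-day TensorFlow/Keras style, not the code we actually used) of the naive baseline: bilinearly upsample the input and refine it with a ConvNet trained under a per-pixel L2 loss. Because the loss is the mean squared error against the target, the optimum is the average of all plausible high-resolution images, which is exactly the blur described above.

```python
import tensorflow as tf

# Hypothetical naive super-resolution baseline: 8x8 -> 32x32 with a plain
# ConvNet and a per-pixel L2 loss. Minimizing E[(y_hat - y)^2] drives the
# output toward the average of all plausible high-resolution images,
# which is why such models produce blurry results.
def naive_superres_model():
    inp = tf.keras.Input(shape=(8, 8, 3))
    x = tf.keras.layers.UpSampling2D(size=4, interpolation="bilinear")(inp)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    out = tf.keras.layers.Conv2D(3, 3, padding="same")(x)
    return tf.keras.Model(inp, out)

model = naive_superres_model()
model.compile(optimizer="adam", loss="mse")  # "mse" is the per-pixel L2 loss
```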
Some kind of conditional GAN (generative adversarial network) looked promising, but it was difficult to build, and after several failed attempts we switched to another generative model that also looked promising: PixelCNN. (After we had started, SRGAN was released for the super-resolution problem, and it produces excellent results.)
PixelCNN is an oddly counterintuitive model. It recasts image generation as choosing a sequence of pixels, one at a time. Gated recurrent networks such as LSTMs (long short-term memory networks) are very successful at sequence generation, usually of words or characters. PixelCNN cleverly constructs a convolutional neural network (CNN) that generates each pixel from a probability distribution conditioned on the pixels generated before it. It is a hybrid of an RNN and a CNN.
Schematic drawing by Van den Oord et al
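The core trick that makes this work as a convolutional network is a masked convolution: the kernel is zeroed out so that each output pixel depends only on pixels above it and to its left in raster order. The sketch below is my own simplified illustration of that idea (following the pattern used in common PixelCNN tutorials, not van den Oord et al.'s code):

```python
import numpy as np
import tensorflow as tf

class MaskedConv2D(tf.keras.layers.Layer):
    """Convolution masked so pixel (i, j) only sees pixels earlier in raster
    order. Mask type 'A' also hides the center pixel (used for the first
    layer); type 'B' allows it (used for the later layers)."""

    def __init__(self, filters, kernel_size, mask_type="B"):
        super().__init__()
        self.conv = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")
        self.mask_type = mask_type

    def build(self, input_shape):
        self.conv.build(input_shape)
        kh, kw, _, _ = self.conv.kernel.shape
        ch, cw = kh // 2, kw // 2
        mask = np.ones((kh, kw, 1, 1), dtype=np.float32)
        mask[ch, cw + (1 if self.mask_type == "B" else 0):] = 0.0  # right of center
        mask[ch + 1:] = 0.0                                        # rows below
        self.mask = tf.constant(mask)

    def call(self, x):
        # Re-apply the mask on every call so masked weights stay zero
        # even after gradient updates.
        self.conv.kernel.assign(self.conv.kernel * self.mask)
        return self.conv(x)
```

Stacking such layers (one type-'A' layer followed by several type-'B' layers) gives a network whose output at each position can be read as a distribution over that pixel, conditioned only on the pixels that come before it.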
Surprisingly, the images generated by PixelCNN look very natural. Unlike adversarial networks, which must delicately balance two objectives, this model has a single objective, so it is more robust to hyperparameter changes. In other words, it is easier to optimize.
In my first attempt at the super-resolution problem, I was too ambitious and chose ImageNet to train PixelCNN. (ImageNet is a much harder dataset than CIFAR-10, CelebA, or LSUN, which are used in many generative modeling studies.) It quickly became clear that generating images pixel by pixel is extremely slow: producing an image larger than 64×64 takes hours! But when I restricted the output to small sizes and used small datasets of faces or bedrooms, the results started to get exciting.
Figure 1: High-resolution images generated by the pixel recursive super-resolution model trained on a dataset of celebrity faces. Left: the 8×8 low-resolution input from the test set. Middle: the 32×32 high-resolution output of the PixelCNN model. Right: the original 32×32 image. Our model first lays down the facial features and then synthesizes plausible hair and skin details.
With the effectively unlimited computing resources available at Google, scaling up training was another goal of the project, because even with these small datasets, training would take weeks on a single GPU.
Asynchronous stochastic gradient descent (async SGD) is usually the most attractive distributed training method. In this approach, N machines each train a copy of the same model independently, sharing weights through a separate "parameter server": at each step a machine makes remote procedure calls (RPCs) to fetch the latest weights and to send back its gradient update. If the data pipeline is good enough, you can increase the number of training steps per second almost linearly by adding machines, because the machines do not depend on each other. However, as machines are added, the weights each machine computed its gradient against become increasingly stale, because other machines have updated the parameters in the meantime. For classification networks this is not much of a problem, and scaling training to dozens of machines is not hard. PixelCNN, on the other hand, turned out to be extremely sensitive to stale gradients and gained almost nothing from adding machines under async SGD.
The alternative is synchronous stochastic gradient descent (sync SGD). In this approach, the machines synchronize at every step and their gradients are averaged. It is mathematically equivalent to plain SGD with a larger batch size: adding machines increases the effective batch size. But because each machine can then use a smaller, faster batch, sync SGD can also increase the number of steps per second.
Sync SGD has its own problems, though. First, it requires many machines to synchronize frequently, which inevitably increases idle time. Second, adding machines does not by itself increase the number of training steps per second unless each machine's batch size is also reduced. In the end, the simplest setup was sync SGD with 8 GPUs in a single machine — and even then training took days.
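Conceptually, the synchronous setup is just gradient averaging before a single update. Here is a minimal sketch (modern eager-style TensorFlow rather than the graph-mode code I actually wrote), where each "machine" is simulated by a sub-batch:

```python
import tensorflow as tf

# Minimal sketch of one synchronous SGD step: gradients from N workers
# (here simulated as N sub-batches) are averaged before a single update,
# so the result is mathematically the same as one large-batch step.
def sync_sgd_step(model, optimizer, loss_fn, sub_batches):
    grads_per_worker = []
    for x, y in sub_batches:  # one sub-batch per "machine"
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads_per_worker.append(tape.gradient(loss, model.trainable_variables))
    # Average gradients variable-by-variable across workers.
    avg_grads = [tf.add_n(list(g)) / len(grads_per_worker)
                 for g in zip(*grads_per_worker)]
    optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))
```

Averaging N sub-batch gradients this way is the same as one step on the combined batch, which is why adding replicas grows the effective batch size rather than the steps per second.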
The other way to take advantage of large-scale computation is to run a larger hyperparameter search. How do you decide what batch size to use? Try them all. I tried hundreds of configurations before finding the one used in the paper.
How to evaluate the results quantitatively was another challenge. How could we prove that our images were better than those of a baseline model? The typical measure of super-resolution quality is peak signal-to-noise ratio (PSNR), a per-pixel distance between the enhanced image and the original. Although the faces produced by our model are clearly better in quality, on average they score worse on this pixel-wise comparison than the blurry baseline outputs. We tried to use PixelCNN's own likelihood to show that our samples had higher probability than the baseline's, but that failed too. Finally, we crowdsourced the task to human raters, asking them which images looked more realistic. That worked.
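For reference, PSNR is just a log-scaled per-pixel mean squared error, which is why a soft, averaged prediction can out-score a sharp but slightly "wrong" sample. A quick sketch:

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio between an enhanced image and the original.
    Higher is 'better', but a blurry image close to the per-pixel mean can
    out-score a sharp, plausible sample -- hence the human raters."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```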
For detailed results, see the paper: Pixel Recursive Super Resolution
https://arxiv.org/abs/1702.00783
PixColor: An attempt at colorization
PixColor output in two-color mode
Sergio Guadarrama, the creator of Slim, had been experimenting with image colorization. He told me about an experiment: take a 224×224×3 image, separate the grayscale and color information into different channels, downsample the two color channels to an ultra-low 28×28×2 resolution, then enlarge the color channels again with bilinear interpolation and recombine them with the full-resolution grayscale. The resulting high-resolution image is almost indistinguishable from the original.
Figure 3: All you need is a little color. Top row: original color images. Middle row: ground-truth chroma downsampled to 28×28. Bottom row: the middle row bilinearly upsampled and recombined with the original grayscale image.
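Here is a back-of-the-envelope reconstruction of that experiment (my own sketch, assuming scikit-image is available; this is not Sergio's code):

```python
import numpy as np
from skimage import color, transform

def downsampled_color_demo(rgb):
    """rgb: a 224x224x3 float image in [0, 1].
    Keep the full-resolution lightness channel, throw away almost all of the
    color information (downsample the two color channels to 28x28), then
    upsample the color bilinearly and recombine. The result looks nearly
    identical to the original."""
    lab = color.rgb2lab(rgb)
    lightness, chroma = lab[..., :1], lab[..., 1:]
    tiny = transform.resize(chroma, (28, 28, 2), order=1)      # bilinear down
    restored = transform.resize(tiny, (224, 224, 2), order=1)  # bilinear up
    return color.lab2rgb(np.concatenate([lightness, restored], axis=-1))
```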
This shows that we can simplify the colorization problem by predicting only low-resolution color. I had been ready to give up on PixelCNN altogether, since it clearly cannot handle large images, but it works fine for generating 28×28×2 outputs. We simplified the problem further by quantizing the color values into 32 bins instead of 256.
Sergio built a "refinement" network that cleans up the low-resolution color output and pushes colors that bleed across object boundaries back into place; it is a feedforward image-to-image convolutional network trained with just an L2 loss. We also used a pre-trained ResNet as the conditioning network, which removed the need for the additional loss term we had used in the super-resolution project.
Using these methods, we got the best results on ImageNet, whether measured by crowdsourced evaluation or by color histogram intersection. It turns out that a properly trained PixelCNN models image statistics very well, without any mode collapse.
Figure 7: Marginal statistics of the color channels in Lab color space. Left: histograms of each method (blue) against the histogram of the ImageNet test set (black). Right: histogram intersection of the color channels.
Since the model defines a probability distribution over possible colorizations for each grayscale input, we can sample it multiple times to get different colorizations of the same input. The figure below illustrates the diversity of the distribution in terms of structural similarity (SSIM):
Figure 8: To show that our model generates diverse samples, we compare two outputs for the same input using multi-scale SSIM. The histogram shows the SSIM distances over the ImageNet test set, with image pairs shown at several SSIM intervals. An SSIM of 1 means the two images are identical.
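The check itself is simple: sample the model twice on the same grayscale input and measure how different the two colorizations are. A sketch, assuming a hypothetical stochastic `sample_fn` and images large enough for TensorFlow's default multi-scale SSIM settings:

```python
import tensorflow as tf

def sample_diversity(sample_fn, gray_batch):
    """sample_fn: hypothetical stochastic sampler mapping grayscale -> color.
    Returns multi-scale SSIM between two independent samples; a value of 1.0
    would mean the two colorizations are identical."""
    a = sample_fn(gray_batch)
    b = sample_fn(gray_batch)
    # Note: the default 5-scale MS-SSIM needs reasonably large images
    # (roughly 176 px or more per side).
    return tf.image.ssim_multiscale(a, b, max_val=1.0)
```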
The model is far from perfect. ImageNet, despite its size, does not represent all images, and the model does not handle non-ImageNet images nearly as well. We found that real black-and-white photographs (as opposed to color photographs converted to grayscale) have different statistics and contain many objects that do not appear in color photos. For example, there are not many color photos of Model T cars, and perhaps none in ImageNet. A larger dataset and better data augmentation might ease these problems.
To get a sense of the image quality, take a look at these:
- A small set of very difficult images handled by our model: http://tinyclouds.org/residency/step1326412_t100/index.html
- Random images from the ImageNet test set handled by our model: http://tinyclouds.org/residency/rld_28px3_t100_500_center_crop_224/
For comparison, here are the results of other algorithms on the same ImageNet test images:
- Let There Be Color!: http://tinyclouds.org/residency/ltbc_500_center_crop_224/index.html
- Colorful Image Colorization: http://tinyclouds.org/residency/cic_500_center_crop_224/index.html
- Learning Representations for Automatic Colorization: http://tinyclouds.org/residency/lrac_500_center_crop_224/index.html
Finally, the full details are in our paper: PixColor: Pixel Recursive Colorization
https://arxiv.org/abs/1705.07208
Failed and unreported experiments
During the year, I briefly fell in love with a number of side projects, although they all failed… I’ll briefly describe a few of them:
Prime factorization of large numbers
Factoring large numbers into primes is famously hard, yet even now people are still discovering new things about prime factorization. Given enough examples, could a deep neural network find something new? Mohammad and I tried two approaches. He modified the seq2seq model from Google's neural machine translation work to take a large semiprime, as a sequence of digits, as input and predict its prime factors as output. I used a simpler model that took a fixed-length integer as input and passed it through several fully connected layers to classify the input as prime or composite. Both methods learned only the most obvious rule (if the last digit is 0, it's not prime!), so we abandoned the idea.
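For what it's worth, the simpler of the two setups is only a few lines. This is an illustrative sketch (the digit encoding and hyperparameters are made up, not what we actually ran):

```python
import numpy as np
import tensorflow as tf

N_DIGITS = 20  # fixed input length; shorter numbers are zero-padded

def digits_of(n):
    """Encode an integer as a fixed-length vector of its decimal digits."""
    return np.array([int(d) for d in str(n).zfill(N_DIGITS)], dtype=np.float32)

# A few fully connected layers classifying prime vs. composite.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_DIGITS,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(prime)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# In practice a model like this latches onto trivial rules
# (e.g. a number ending in 0 is composite) and learns little else.
```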
Adversarial Dreaming
Inspired by Michael Gygli's project, I wanted to see whether a discriminator could act as its own generator. To that end, I built a simple binary-classification convolutional network that judges whether its input is real or fake. To generate an image, you start from noise and update the input by gradient ascent (sometimes called deep dreaming) so that the network's "real" score is maximized. The model is trained by alternately generating "fake" examples this way and then, like a typical GAN discriminator, updating the weights to distinguish real from fake.
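A sketch of that training loop (my reconstruction of the idea, with illustrative names and step counts):

```python
import tensorflow as tf

def dream_fakes(disc, noise, steps=50, lr=0.1):
    """Generate 'fake' images by gradient ascent on the input (deep-dream
    style), pushing the discriminator's 'real' score up."""
    x = tf.Variable(noise)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            realness = tf.reduce_mean(disc(x, training=False))
        x.assign_add(lr * tape.gradient(realness, x))
    return tf.clip_by_value(x, 0.0, 1.0)

def train_step(disc, opt, real_batch):
    """Update the discriminator to separate real images from dreamed fakes."""
    fakes = dream_fakes(disc, tf.random.uniform(tf.shape(real_batch)))
    bce = tf.keras.losses.BinaryCrossentropy()
    with tf.GradientTape() as tape:
        real_pred = disc(real_batch, training=True)
        fake_pred = disc(fakes, training=True)
        loss = bce(tf.ones_like(real_pred), real_pred) + \
               bce(tf.zeros_like(fake_pred), fake_pred)
    grads = tape.gradient(loss, disc.trainable_variables)
    opt.apply_gradients(zip(grads, disc.trainable_variables))
```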
My thinking was that, with fewer architectural decisions to make, this network might be easier to train than a typical GAN. And it does in fact work on MNIST: each column in the figure below shows a different noise image being progressively pushed toward an MNIST digit.
But I couldn't get it to work on CIFAR-10, and its practical value was extremely limited. That's a shame; I'm sure "Adversarial Dreaming" would make a cool paper topic.
Using PixelCNN to train a generator
PixelCNN takes far too long to generate samples, which frustrated me. So I wanted to see whether I could use a pre-trained PixelCNN to train a feedforward, image-to-image convolutional network (8×8 to 32×32, on the LSUN bedrooms dataset). The training setup was to run the feedforward network's output through the PixelCNN autoregressively and update the feedforward network's weights to maximize the probability the PixelCNN assigns to it. It produced very strange images, full of line artifacts.
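The setup, roughly, is sketched below; `pixelcnn_log_prob` is a placeholder standing in for the expensive autoregressive evaluation under the frozen PixelCNN, not a real API:

```python
import tensorflow as tf

def generator_step(generator, pixelcnn_log_prob, opt, low_res_batch):
    """Update only the feedforward generator so that its 32x32 outputs get
    high probability under a frozen, pre-trained PixelCNN conditioned on
    the 8x8 inputs."""
    with tf.GradientTape() as tape:
        fake_high_res = generator(low_res_batch, training=True)
        nll = -tf.reduce_mean(pixelcnn_log_prob(fake_high_res, low_res_batch))
    grads = tape.gradient(nll, generator.trainable_variables)  # PixelCNN stays frozen
    opt.apply_gradients(zip(grads, generator.trainable_variables))
```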
Exploring modifications to asynchronous SGD
As mentioned above, many models do not work well with asynchronous SGD. A recent paper on DCASGD suggests a possible solution to the stale-gradient problem: compensate each gradient using the difference, in weight space, between the weights a machine read before taking its step and the weights at the time it applies its update. This could greatly reduce training time for everyone. Unfortunately, I was unable to reproduce their results in TensorFlow, nor could I get several related ideas of my own to work. There is probably a bug somewhere. (If you want my implementation, contact me internally.)
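As I understand the DCASGD paper, the compensation is a first-order correction applied on the parameter server; here is a sketch of the update (my paraphrase, not a verified re-implementation):

```python
import numpy as np

def dcasgd_update(w_now, w_stale, grad, lr=0.01, lam=0.04):
    """w_now: current server weights; w_stale: the weights the worker read
    before computing `grad`. The extra term approximately compensates for
    staleness, using grad*grad as a cheap curvature estimate."""
    compensated = grad + lam * grad * grad * (w_now - w_stale)
    return w_now - lr * compensated
```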
Ideas and summary
Having gone through all that technical detail, let me step back and share some broader takeaways from the experience.
As a software engineer, I don't have much machine learning experience. But based on my year of deep learning research, I'd like to share my general view of the field and how it relates to the broader software world.
I firmly believe that machine learning will transform all industries and ultimately improve everyone’s lives, and many industries will benefit from the intelligent predictions machine learning can provide.
For me, the original goal of this project was that, in the near future, everyone could watch old Charlie Chaplin movies in 4K.
However, I found these models genuinely difficult to build, train, and debug. Most of that difficulty was no doubt my own inexperience, but that in itself shows how much experience it takes to train these models effectively.
My work focused on the easiest branch of machine learning: supervised learning. But even with perfect labels, model development is hard. It seems that the higher the dimensionality of the prediction, the longer the model takes to build (more time spent programming, debugging, and training).
Therefore, I recommend that all of you start by simplifying and limiting your predictions as much as possible.
To take an example from our colorization experiments, we started out trying to have the model predict the entire RGB image rather than just the color channels. We assumed the network could easily pass the grayscale intensities through to the output, since we were using skip connections. Yet predicting only the color channels still improved performance.
If I describe how well things "work" in a subjective, gut-feeling way: image classification works robustly. Generative models rarely work and are poorly understood. GANs can produce great images, but they are nearly impossible to build; my experience is that any small change to the architecture can break them entirely. I hear reinforcement learning is even harder. I won't comment on recurrent neural networks, for lack of experience.
However, stochastic gradient descent is so robust that even serious mathematical errors may not break it; they often only hurt performance slightly.
Because training a model often takes days, the modify-run cycle is very slow.
The culture of testing has not caught up yet. We need better evaluations while training: the many components of a network have to keep certain means and variances, without swinging wildly or drifting out of range. Machine learning bugs are especially prone to turning into Heisenbugs.
The benefits of parallelization are limited. More machines make large-scale hyperparameter searches easier, but ideally we would design models that work without careful tuning. (In fact, I suspect researchers with limited capacity for hyperparameter searches are forced to design better, more stable models.)
Unfortunately, asynchronous SGD is not very useful for many models; what they usually need are more accurate, less stale gradients. That is why DCASGD is an important research direction.
From a software maintenance perspective, there is little consensus on how to organize machine learning projects.
It's like websites before Rails: a pile of random PHP scripts mixing business logic and markup. In TensorFlow projects, the data pipeline, the math, and the hyperparameter/configuration management end up jumbled together. I don't think we have found the right structure/organization yet. (Or rather, we have not rediscovered it, the way DHH rediscovered and popularized MVC.) My project's structure evolved constantly, but I wouldn't call it elegant.
Frameworks continue to evolve rapidly. I started with Caffe, and I have to applaud what TensorFlow gets right. Projects such as PyTorch and Chainer now delight users with dynamic computation graphs. The long modify-run cycle is a major obstacle to developing better models, and I suspect the frameworks that prioritize fast startup and fast evaluation will win in the end. Despite useful tools like TensorBoard and IPython, it is still difficult to inspect what a model is doing during training.
The signal-to-noise ratio of papers is low, and there is plenty of room for improvement. People often don't openly admit their models' failures, because academic conferences value accuracy over transparency. (I hope conferences will accept blog-post submissions and require open-source implementations.) Distill deserves credit for its efforts in this direction.
This is an exciting time for machine learning. There is a lot of work to be done at every level, from theory to frameworks, and a lot of room for improvement. It is almost as exciting as the birth of the Internet. Join the technological revolution!
Is this your machine learning system?
Yeah! Pour the data into this big pile of linear algebra, and gather the answers at the other end.
What if the answer is wrong?
Just stir the pile of linear algebra until the results start to look right.