This article was originally written by AI Frontier :t.cn/RHxwjPU


Translated from Ed Tyantov; edited by Debra and Emily

“Deep learning has made tremendous advances in text, speech, and computer vision over the past year, with some promising results.

This article, translated from a post by data scientist Ed Tyantov, summarizes and evaluates the latest advances in deep learning over the past year (and a little beyond) and highlights the most important developments that could affect our future.”


Text

Google Neural Machine Translation

About a year ago, Google announced a new model for Google Translate and described the structure of the network, a recurrent neural network (RNN), in detail.

  • Machine learning translation and Google Translate algorithms
  • Fundamentals of machine translation engines
  • blog.statsbot.co

Main result: Google’s neural machine translation closed 55-85% of the accuracy gap between machine and human translation (as rated by people on a 6-point scale). However, without Google’s huge dataset, this model may not be able to reproduce these results.

Chatbots create language? No! It’s just a rumor.

You may have heard the ridiculous news that Facebook shut down its chatbot, supposedly because it got out of control and invented its own language. The chatbot in question was built by Facebook for negotiation: it negotiates in text with another agent to reach an agreement on how to split a set of items (books, hats, etc.) between the two of them. Each agent has its own goal in the negotiation, and the other agent does not know it.

For training, the researchers collected a dataset of human negotiations and trained a supervised recurrent network on it. They then used reinforcement learning to train an agent by having it talk to itself, under the constraint that its language stay similar to human language.

The bot learned one of the tactics that humans really use in negotiations: showing false interest in certain aspects of the deal, only to give them up later and benefit on the things it actually wants. It was the first attempt to create such an interactive bot, and it was quite successful.

The full story is described in this article, and the bot’s code is also open source.

Of course, media claims that the bots invented a language are completely unfounded. During training (in negotiations with a copy of the same agent), Facebook turned off the constraint that the text stay similar to human language, and the algorithm modified the language of interaction. Nothing unusual.

Over the past year, recurrent neural networks have been actively developed and applied to many tasks and applications. RNN architectures have become much more complex, but in some areas similar results have been achieved with simple feedforward networks such as DSSM. For example, Google reached the same quality for the “Smart Reply” feature of its mail service as it previously had with an LSTM. Yandex has also launched a new search engine based on such networks.


Speech

WaveNet: a generative model for raw audio

DeepMind reported on a model for generating audio in this article. In short, the researchers built WaveNet, an autoregressive, fully convolutional model, on top of earlier approaches to image generation (PixelRNN and PixelCNN).

The network was trained end-to-end: text as input, audio as output. The gap with human speech was reduced by 50 percent, which is not bad.


However, the main disadvantage of WaveNet is its low throughput: because the model is autoregressive, sound is generated sequentially, and it takes about 1-2 minutes to create one second of audio.
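To see why autoregression makes generation slow, here is a minimal sketch in which every new sample is computed from a window of previously generated samples. The predictor is a hypothetical stand-in (a fixed weighted sum plus noise), not WaveNet's actual convolutional stack, and the 16 kHz sample rate is an assumption made for illustration. The point is only that step t cannot start before the earlier steps have finished.

```python
import random

def predict_next(history, weights, rng):
    """Hypothetical stand-in for the network: weighted sum of recent samples plus noise."""
    recent = history[-len(weights):]
    value = sum(w * s for w, s in zip(weights, recent))
    return value + rng.gauss(0.0, 0.01)

def generate(num_samples, weights, seed=0):
    """Generate audio one sample at a time; each step depends on earlier outputs."""
    rng = random.Random(seed)
    samples = [0.0] * len(weights)  # silent context to start from
    for _ in range(num_samples):
        samples.append(predict_next(samples, weights, rng))
    return samples[len(weights):]

# At an assumed 16 kHz, one second of audio needs 16,000 sequential model calls.
audio = generate(num_samples=16000, weights=[0.2, 0.3, 0.4])
print(len(audio))
```

With a real network in place of `predict_next`, each of those calls is a full forward pass, which is where the long generation time comes from.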

Case study:

Storage.googleapis.com/deepmind-me…

If you remove the network’s dependence on the input text and leave only the dependence on previously generated phonemes, the network generates phonemes that resemble human language but are meaningless.

Example of generating sound

The same model applies not only to speech but also, for example, to composing music. Imagine audio generated by a model trained on a dataset of piano playing (again without any dependence on input data).

If you’re interested in this model, you can read the full DeepMind study.


Lip reading

Lip reading is another area where deep learning has surpassed human ability.

Google DeepMind, in collaboration with Oxford University, described their lip-reading model in an article called Lip Reading Sentences in the Wild. Trained on a dataset collected from TV shows, the model outperformed a professional lip reader from the BBC.

The dataset contains 100,000 sentences with audio and video. Model: an LSTM processes the audio, and a CNN + LSTM processes the video. The two state vectors are fed to a final LSTM, which generates the result (characters).

During training, the model used different types of input data: audio, video, and audio + video. In other words, it is an “omnichannel” model.


Generating video of Obama’s lips moving in sync with audio

The University of Washington did serious work on generating the lip movements of President Obama. He was chosen because of the huge number of recordings of his speeches available online (17 hours of high-definition video).

The network alone produced too many artifacts, so the authors of the article added several workarounds (or tricks) to improve the texture and timing.

The results are striking. In the near future, we’ll have to think twice before trusting a video.


Computer vision

OCR: Google Maps and Street View

In their article, the Google Brain team describes how they introduced a new OCR (Optical Character Recognition) engine into Google Maps to recognize street signs and store signs.

While developing the technology, the company compiled a new dataset, FSNS (French Street Name Signs), which contains many complex cases.

To recognize each sign, the network uses up to four photos of it. Features are extracted with a CNN, weighted with the help of spatial attention (taking pixel coordinates into account), and the result is fed to an LSTM.

Figure 1: Each image is processed by a feature extractor, and the results are concatenated into a single feature map, denoted “f”. A fixed-size vector u_t is created as a spatially weighted combination of the features and fed into the RNN.
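The spatially weighted combination in the caption can be sketched as a softmax-weighted sum over locations of the feature map. This is an illustration only: in the real model the attention scores would come from the network (and take pixel coordinates into account), whereas here they are supplied by hand, and all numbers are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(feature_map, scores):
    """Collapse a feature map into one fixed-size vector.

    feature_map: list of feature vectors, one per spatial location.
    scores: one unnormalized attention score per location.
    """
    alphas = softmax(scores)
    dim = len(feature_map[0])
    return [sum(a * f[d] for a, f in zip(alphas, feature_map)) for d in range(dim)]

# Three spatial locations with 2-dimensional features (made-up numbers).
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_t = attend(f, scores=[0.0, 0.0, 10.0])  # attention focused on the last location
print(u_t)
```

Whatever the size of the feature map, the output has the dimensionality of a single feature vector, which is what lets the RNN consume it at every step.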

The same approach is applied to recognizing store names on signboards (there can be a lot of “noise” data, and the network itself must focus on the right places). The algorithm was applied to 80 billion photos.


Visual reasoning

Deep learning can also perform a task called visual reasoning, in which a neural network is asked to answer a question about a photo. For example: “Is there a rubber thing in the picture that is the same size as the yellow metal cylinder?” The question is truly nontrivial, and until recently the problem was solved with an accuracy of only 68.5 percent.

The breakthrough again came from the DeepMind team: on the CLEVR dataset, they achieved 95.5% accuracy, which is better than humans.

Also, the network architecture is very interesting:

  1. A pre-trained LSTM is applied to the text question to obtain the question embedding.
  2. A CNN (only four layers) is applied to the image to obtain feature maps (features that characterize the image).
  3. Next, pairwise combinations of coordinate-wise slices of the feature maps are formed (yellow, blue, and red in the figure), and the coordinates and the text embedding are appended to each of them.
  4. All of these triples are passed through another network and summed.
  5. The resulting representation is passed through another feedforward network, which produces the answer through a softmax.
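Steps 3 to 5 can be sketched as follows: a small network g is applied to every pair of objects together with the question embedding, the results are summed, and a second network f turns the sum into answer logits. This is a toy sketch under stated assumptions: the weights are random rather than trained, the "objects" are hand-written vectors standing in for CNN feature slices with coordinates, and `make_weights`, `dense`, and `relation_network` are names invented here.

```python
import itertools
import random

def make_weights(n_in, n_out, rng):
    """Random weights standing in for a trained layer (rows = output units)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def dense(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(values):
    return [max(0.0, v) for v in values]

def relation_network(objects, question, g_weights, f_weights):
    """Answer logits: f(sum over object pairs of g(o_i, o_j, question))."""
    pooled = [0.0] * len(g_weights)
    for o_i, o_j in itertools.product(objects, repeat=2):
        pair = relu(dense(g_weights, o_i + o_j + question))
        pooled = [p + h for p, h in zip(pooled, pair)]
    return dense(f_weights, pooled)

rng = random.Random(0)
objects = [[0.1, 0.9, 0.0], [0.5, 0.2, 1.0], [0.7, 0.7, 0.5]]  # feature + coordinates
question = [1.0, 0.0]                                            # question embedding
g_weights = make_weights(n_in=3 + 3 + 2, n_out=4, rng=rng)
f_weights = make_weights(n_in=4, n_out=2, rng=rng)
logits = relation_network(objects, question, g_weights, f_weights)
print(logits)  # a softmax over these logits would give the answer distribution
```

Because the pair outputs are summed, the result does not depend on the order in which the objects are listed, which is one reason this design suits questions about relations between objects.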


Pix2Code

In addition, Uizard has created an interesting neural network application that generates layout code from a screenshot made by an interface designer.

Figure 2: Example of a native iOS GUI with DSL programming

This is a very practical neural network application that can make software development much easier. The network was 77 percent accurate, the authors said. However, the project is still at the research stage and is not yet used in practice.


SketchRNN: teaching a machine to draw

You may have seen Google’s Quick, Draw!, where the goal is to sketch various objects in 20 seconds. As Google explains in its blog post and article, the company collected this dataset in order to teach a neural network to draw.

The dataset contains 70,000 sketches and is now publicly available. A sketch is not a picture but a detailed vector representation of the drawing (at which point the user pressed the “pencil”, where they released it after drawing a line, and so on).

The researchers trained a sequence-to-sequence variational autoencoder (VAE), using RNNs as the encoding/decoding mechanism.

Figure 2: RNN framework schematic diagram

Finally, as befits an autoencoder, the model produces a latent vector that characterizes the original picture.

Since the decoder can reconstruct a drawing from this vector, you can get new sketches by changing the vector.

You can even create a “catpig” by performing vector arithmetic:


GANs

Generative Adversarial Networks (GANs) are one of the hottest topics in deep learning. GANs are mostly used to work with images, so I’ll explain the concept using images.

  • Generative adversarial Networks (GAN) : Engines and applications
  • How can generative adversarial nets be used to improve our lives
  • blog.statsbot.co

The idea behind GANs is a competition between two networks: a generator and a discriminator. The generator network creates an image, and the discriminator network tries to determine whether the image is real or generated.

The schematic diagram is as follows:

During training, the generator produces an image from a random vector (noise) and feeds it to the input of the discriminator, which decides whether the input is real or fake. The discriminator is also given real images from the dataset.

Training such an architecture is difficult because it is hard to find the balance point between the two networks. Most of the time, the discriminator wins and training stalls. The advantage of the system, however, is that it can solve problems for which a loss function is hard to define (such as improving photo quality): we leave that to the discriminator.
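The competition between the two networks can be made concrete with the losses each side minimizes. The sketch below uses the common binary cross-entropy formulation with the non-saturating generator loss; that particular choice is an assumption on my part, since the article does not spell out a loss. The discriminator outputs are made-up probabilities that an image is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, generated ones 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants fakes scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Discriminator outputs (probability "real") for a batch; numbers are made up.
balanced = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
d_winning = discriminator_loss(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
print(balanced, d_winning)
print(generator_loss([0.5, 0.5]), generator_loss([0.01, 0.02]))
```

When the discriminator is winning, its own loss is close to zero while the generator's loss is large, which is the imbalance the paragraph above describes.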

A classic example of GAN training results is images of bedrooms or of people.

Previously, we looked at autoencoding (Sketch-RNN), which encodes the original data into a latent representation. The same thing happens with the generator.

The idea of generating an image from a vector is shown clearly in a project on face images. You can change the vector and see how the face changes.

The same arithmetic works in the latent space: “a man with glasses” minus “man,” plus “woman” equals “woman with glasses.”
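The arithmetic in the latent (hidden) space can be illustrated with toy vectors. The two-dimensional codes below are invented for the example, with one axis loosely standing for gender and the other for glasses; a real generator's latent codes have many more dimensions and no such clean axes.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(vector, named_vectors):
    """Name of the stored vector closest to `vector` (squared Euclidean distance)."""
    def dist(name):
        return sum((x - y) ** 2 for x, y in zip(vector, named_vectors[name]))
    return min(named_vectors, key=dist)

# Toy 2-D latent codes: axis 0 ~ "gender", axis 1 ~ "glasses" (invented for illustration).
latent = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "man with glasses": [1.0, 1.0],
    "woman with glasses": [-1.0, 1.0],
}
result = add(sub(latent["man with glasses"], latent["man"]), latent["woman"])
print(nearest(result, latent))  # woman with glasses
```

In practice the resulting vector would be passed to the generator (or decoder) to produce the image rather than looked up by name.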


Changing the age of a face with GANs

If a controlled parameter is included in the latent vector during training, you can change it at generation time and thereby control the image that is produced. This approach is called conditional GAN.

The authors of Face Aging With Conditional Generative Adversarial Networks did exactly that. After training the model on the IMDB dataset of actors with known ages, the researchers were able to change the apparent age of a person’s face.
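One simple way to feed a control parameter to a generator is to append it to the noise vector as a one-hot code. The sketch below shows only that input construction; the three age groups and the function names are hypothetical, and the paper may encode the condition differently.

```python
import random

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def generator_input(noise, age_group, num_groups):
    """Concatenate the noise vector with the condition the generator should obey."""
    return noise + one_hot(age_group, num_groups)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]   # the same noise is reused below
young = generator_input(z, age_group=0, num_groups=3)
old = generator_input(z, age_group=2, num_groups=3)
print(young[4:], old[4:])
```

Keeping the noise part fixed and changing only the condition is what lets the same face be rendered at different ages.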


Professional photos

Google has also found another interesting use for GANs: selecting and improving photos. They trained a GAN on a dataset of professional photos: the generator tries to improve bad photos (professional shots degraded with special filters), and the discriminator tries to tell “improved” photos from truly professional ones.

The trained algorithm went through Google Street View panoramas in search of the best compositions and produced some photos of professional and semi-professional quality (according to photographers’ ratings).


Synthesize images from text descriptions

Another impressive use of GAN is to generate images from text descriptions.

The authors of the study suggest feeding the text embedding not only into the input of the generator (conditional GAN) but also into the discriminator, so that it verifies that the text corresponds to the image. To make sure the discriminator learned to perform this function, they added to the training data pairs of real images with incorrect text.


Pix2pix

One of the eye-catching articles of 2016 is “Image-to-Image Translation with Conditional Adversarial Networks” by Berkeley AI Research (BAIR). The researchers solved the problem of image-to-image generation, for example creating a map from a satellite image or a realistic texture of an object from its sketch.

This is another example of a successful application of conditional GANs. In this case, the condition is the entire image. UNet, which is popular in image segmentation, is used as the architecture of the generator, and a new PatchGAN classifier is used as the discriminator to combat blurry images (the image is cut into N patches, and a real/fake prediction is made for each patch separately).
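The per-patch decision can be sketched by cutting an image into blocks and scoring each block on its own. This is a simplification: as far as I understand, the actual PatchGAN is a convolutional network whose outputs each cover an overlapping region, while the code below uses non-overlapping blocks and a hypothetical brightness-based scorer in place of the discriminator network.

```python
def split_into_patches(image, patch):
    """Cut a 2-D image (list of rows) into non-overlapping patch x patch blocks."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches

def patch_decisions(image, patch, score_patch, threshold=0.5):
    """Score every patch separately; True means the patch is judged real."""
    return [score_patch(p) >= threshold for p in split_into_patches(image, patch)]

def mean_brightness(p):
    """Hypothetical stand-in for the discriminator network."""
    return sum(sum(row) for row in p) / (len(p) * len(p[0]))

image = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.7],
    [0.9, 0.6, 0.8, 0.9],
]
print(patch_decisions(image, patch=2, score_patch=mean_brightness))
# [True, False, True, True]
```

Judging local patches rather than the whole image pushes the generator to get fine detail right everywhere, which is why it helps against blur.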

The researchers posted an online demonstration of the network, which generated a lot of interest among users.

Source code address:

Github.com/phillipi/pi…


CycleGAN

To apply Pix2Pix, you need a dataset with corresponding pairs of images from different domains. In many cases, gathering such a dataset is not a problem. However, if you want to do something more complex, such as “transfiguring” objects or transferring style, matching pairs of images cannot be found even in principle.

Therefore, the authors of Pix2Pix decided to develop their idea further and came up with CycleGAN (“Unpaired Image-to-Image Translation”), a model for translating between different domains of images without paired examples.

The idea is to train two generator-discriminator pairs to transfer an image from one domain to the other and back, while requiring cycle consistency: after applying the two generators in sequence, we should get an image similar to the original, as measured by an L1 loss. The cycle loss is needed so that the generator does not simply start turning images from one domain into images of the other domain that are completely unrelated to the original image.
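The cycle-consistency requirement can be written as an L1 distance between an image and its round trip through both generators. In this sketch the generators are toy functions on flat pixel lists, stand-ins for trained networks, and all names are invented for the example.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, g, f):
    """L1 error after a round trip: X -> Y -> X and Y -> X -> Y."""
    forward = l1_distance(f(g(x)), x)
    backward = l1_distance(g(f(y)), y)
    return forward + backward

# Toy "generators" on flat pixel lists; stand-ins for trained networks.
def g(image):          # domain X -> domain Y
    return [2.0 * p for p in image]

def f_good(image):     # exact inverse of g
    return [p / 2.0 for p in image]

def f_bad(image):      # ignores its input, so the original cannot be recovered
    return [0.5 for _ in image]

x = [0.1, 0.4, 0.8]
y = [0.2, 0.6, 1.0]
print(cycle_consistency_loss(x, y, g, f_good))
print(cycle_consistency_loss(x, y, g, f_bad))
```

A generator that throws away the content of its input, like `f_bad`, is penalized by this loss even if its outputs look like plausible images of the target domain.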

This approach makes it possible to learn a mapping such as horses -> zebras.

Such transformations are unstable, however, and often produce unsuccessful results:

Source code link:

Github.com/junyanz/Cyc…


Molecular development in oncology

Machine learning is now making its way into medicine, too. In addition to recognizing ultrasound and MRI images and making diagnoses, it can be used to discover new drugs against cancer.

In short, with the help of an Adversarial Autoencoder (AAE), you can learn a latent representation of molecules and then use it to search for new ones. As a result, 69 molecules were found, some of which are anticancer agents and others of which have great medical potential.


Adversarial attacks

Adversarial attacks are also being actively explored. So, what is an adversarial attack? Standard networks trained, for example, on ImageNet are completely unstable when special noise is added to the image being classified. In the example below, the noisy image looks practically unchanged to the human eye, but the model predicts a completely different class.


Such attacks can be built with, for example, the Fast Gradient Sign Method (FGSM): with access to the parameters of the model, you take one or several gradient steps toward the desired class and change the original image accordingly.
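A single signed-gradient step toward a desired class can be shown on a tiny model. The sketch below uses logistic regression instead of an ImageNet network so that the gradient with respect to the input can be written by hand; the weights, the input, and the step size `eps` are all made up for the illustration.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression 'model'."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_targeted(w, b, x, target, eps):
    """One signed-gradient step that lowers the loss for the target class."""
    p = predict(w, b, x)
    grad = [(p - target) * wi for wi in w]   # d(cross-entropy)/dx for this model
    return [xi - eps * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, -3.0, 1.0], 0.0       # made-up model parameters
x = [0.1, 0.4, 0.2]
x_adv = fgsm_targeted(w, b, x, target=1, eps=0.2)
print(round(predict(w, b, x), 3), round(predict(w, b, x_adv), 3))
```

Each input value moves by at most `eps`, yet the predicted class flips from 0 to 1. With a deep network the gradient would come from backpropagation instead of a closed-form expression.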

One of the tasks on Kaggle is related to this: participants are encouraged to create attacks and defenses, which are then run against each other.

But why study these attacks at all? First, if we want to protect our products, we can add noise to a captcha to prevent spammers from recognizing it automatically. Second, algorithms are becoming more and more involved in our lives, for example in face recognition systems and self-driving cars. In these cases, attackers could exploit the weaknesses of the algorithms, with unimaginable consequences.

For example, special glasses can trick a face recognition system into recognizing you as someone else. So we need to take possible attacks into account when training models.

Manipulating signs in the following way also prevents them from being recognized correctly.

  • A set of articles from the organizers of the competition.
  • Existing libraries for attacks: CleverHans and FoolBox.


Reinforcement learning

Reinforcement learning (RL) is also one of the most interesting and actively developing approaches in machine learning.

The essence of the approach is that an agent learns successful behavior through experience in an environment that gives it rewards, just as people learn lessons from life.
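As a concrete instance of learning from rewards, here is a minimal sketch of tabular Q-learning in a five-cell corridor where only reaching the right end is rewarded. Q-learning is my choice of example algorithm; the article does not name one, and the environment and hyperparameters are invented for illustration.

```python
import random

N_STATES, GOAL = 5, 4          # corridor of 5 cells; reward only at the right end
LEFT, RIGHT = 0, 1

def step(state, action):
    nxt = max(0, state - 1) if action == LEFT else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice([LEFT, RIGHT])          # explore at random
            nxt, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q = q_learning()
policy = ["right" if q[s][RIGHT] > q[s][LEFT] else "left" for s in range(GOAL)]
print(policy)
```

The agent is never told that moving right is correct; it only sees the reward at the end, and the value estimates propagate that information back to earlier cells.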

Currently, RL is widely used in games, robotics, and system management (traffic, for example).

AlphaGo is, of course, well known for its victories over professional players. Its researchers trained it using RL: the bot played against itself to improve its strategies.

Reinforcement learning with unsupervised auxiliary tasks

In previous years, DeepMind used DQN to learn to play arcade games better than humans. Algorithms are now being trained to play more complex games, such as Doom.

Because an agent needs many hours of training on modern GPUs to gain experience interacting with the environment, researchers pay much of their attention to speeding up learning.

DeepMind reported in a blog post that introducing additional losses (auxiliary tasks), such as predicting frame changes (pixel control), helps the agent better understand the consequences of its actions, which greatly speeds up learning.
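In its simplest form, adding auxiliary tasks means optimizing the main RL loss plus a weighted sum of the extra losses. The sketch below shows only that combination; the loss values, the weights, and the task name `reward_prediction` are illustrative assumptions, not figures from the blog post.

```python
def total_loss(rl_loss, auxiliary_losses, weights):
    """Main RL loss plus a weighted sum of auxiliary-task losses."""
    return rl_loss + sum(weights[name] * value for name, value in auxiliary_losses.items())

aux = {"pixel_control": 0.8, "reward_prediction": 0.3}       # made-up loss values
weights = {"pixel_control": 0.05, "reward_prediction": 1.0}  # hypothetical weights
print(total_loss(rl_loss=1.2, auxiliary_losses=aux, weights=weights))
```

Because all terms share the same network, gradients from the auxiliary tasks shape the representation even when the main reward is rare.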


Learning robots

OpenAI has been actively researching the training of agents in virtual environments, which is safer than conducting experiments in the real world.

In one of the studies, the team showed that one-shot learning is possible: a person demonstrates a task once in virtual reality, and that single demonstration is enough for the algorithm to learn it and reproduce it in real conditions.

Although these tasks are a piece of cake for humans 🙂

Learning from human preferences

Both OpenAI and DeepMind are working on this topic. The idea is that an agent has a task, and the algorithm shows a human two possible solutions so that the human can indicate which one is better. The process is repeated iteratively, and from this human feedback (binary labels) the algorithm learns how to solve the problem.
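Turning pairwise "this one is better" labels into a reward signal can be sketched with a Bradley-Terry style model, where the probability that one option is preferred is a sigmoid of the difference in rewards. Treat this as an assumption about the general technique rather than a description of either lab's implementation; here each option gets a single scalar reward instead of a neural network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rewards(num_options, comparisons, lr=0.5, epochs=200):
    """Fit one scalar reward per option from (winner, loser) pairs.

    Model assumption: P(winner preferred) = sigmoid(r[winner] - r[loser]).
    """
    r = [0.0] * num_options
    for _ in range(epochs):
        for winner, loser in comparisons:
            p = sigmoid(r[winner] - r[loser])
            # Gradient ascent on log P(winner preferred)
            r[winner] += lr * (1.0 - p)
            r[loser] -= lr * (1.0 - p)
    return r

# A human compared three candidate behaviours: 2 beat 1, 1 beat 0, 2 beat 0.
rewards = fit_rewards(3, comparisons=[(2, 1), (1, 0), (2, 0)])
print(rewards)
```

The fitted rewards rank the behaviours in the order the human preferred them, and an RL agent can then be trained against that learned reward.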


As always, humans must be careful and think about what they are teaching the machine. In one case, the evaluator decided that the algorithm really was trying to grasp the object, when in fact it was only simulating the action.

Moving in a complex environment

DeepMind has another study in this area. To teach a robot complex behaviors (walking, jumping, etc.), especially human-like ones, you normally have to put a lot of work into choosing a loss function that encourages the desired behavior. It would be preferable, however, for the algorithm to learn complex behaviors by itself from simple rewards.

DeepMind researchers managed to do just that: they trained agents (body emulators) to perform complex behaviors by building a complex environment full of obstacles and giving a simple reward for progress in movement.


Other

Cooling the data center

In July 2016, Google reported that it was using DeepMind machine learning to reduce energy costs in its data centers.

Using information from thousands of sensors in the data center, Google developers trained an ensemble of neural networks to predict PUE (Power Usage Effectiveness) and manage the data center more efficiently. This is an impressive and significant example of practical ML.
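For reference, PUE is the ratio of the total energy used by a facility to the energy used by its IT equipment alone, so the ideal value is 1.0 and lower is better. The sketch below computes it from two meter readings; the readings are made up.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Made-up readings: cooling and other overhead add 20% on top of the IT load.
print(pue(total_facility_kwh=1200.0, it_equipment_kwh=1000.0))  # 1.2
```

Cooling is a large part of the overhead in the numerator, which is why better cooling control shows up directly as a lower PUE.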

One model for all tasks

As is well known, trained models transfer poorly from one task to another, so a specific model has to be trained for each task. However, Google Brain, in its article “One Model To Learn Them All,” took a small step towards a universal model.

The researchers trained a single model that performs eight tasks from different domains (text, speech, and images), such as translation between languages, text parsing, and image and sound recognition.

To achieve this, they built a complex network architecture with different blocks to process different input data and generate results. The blocks for the encoder and decoder fall into three types: convolution, attention, and gated mixture of experts (MoE).
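Of the three block types, the gated mixture of experts is perhaps the least familiar, so here is a minimal sketch of the idea: a gate scores the experts, only the top-scoring few are evaluated, and their outputs are mixed with softmax weights. In a real model the gate scores are produced by a learned gating network from the input; here they are passed in by hand, and the "experts" are toy functions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, experts, gate_scores, top_k=2):
    """Combine the outputs of the top_k highest-scoring experts."""
    chosen = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:top_k]
    gates = softmax([gate_scores[i] for i in chosen])
    outputs = [experts[i](x) for i in chosen]
    return [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(len(x))]

# Three toy "experts"; in a real model each would be a small network.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
y = mixture_of_experts([1.0, 2.0], experts, gate_scores=[3.0, 3.0, -5.0], top_k=2)
print(y)  # [2.0, 3.5]
```

Because only `top_k` experts run for each input, the model can hold many experts' worth of parameters while keeping the cost per example low.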

Main results:

  • Almost perfect models were obtained (the authors did not fine-tune the hyperparameters).
  • Knowledge is transferred between domains: on tasks with large amounts of data, performance is almost the same, and on small tasks (such as parsing) it is better.
  • Blocks needed for different tasks do not interfere with each other and sometimes even help each other.

Training on ImageNet in one hour

According to their article, Facebook’s engineers were able to train a ResNet-50 model on ImageNet in one hour. This did, however, require a cluster of 256 GPUs (Tesla P100).

They used Gloo and Caffe2 for distributed training. To make the process efficient, the training strategy had to be adapted to a huge batch (8192 elements): gradient averaging, a warm-up phase, a special learning rate, and so on.
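The warm-up phase and special learning rate can be sketched as a schedule in which the learning rate grows in proportion to the batch size and is ramped up gradually at the start. The constants below (a base rate of 0.1 for a batch of 256 and five warm-up epochs) are what I recall from the Facebook paper, so treat them as assumptions; the sketch also works per epoch and leaves out any later decay steps.

```python
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear scaling rule with a gradual warm-up.

    The target rate grows in proportion to the batch size; during the first
    warmup_epochs it ramps linearly from base_lr up to that target.
    """
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

for epoch in (0, 1, 5, 10):
    print(epoch, learning_rate(epoch, batch_size=8192))
```

Starting at the small-batch rate and ramping up avoids the instability that a very large learning rate tends to cause in the first epochs, when the weights are still changing quickly.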

As a result, efficiency reached 90% when scaling from 8 to 256 GPUs. Now researchers at Facebook can run experiments more quickly than other companies can.


News

Self-driving cars

The self-driving car sector is developing rapidly, and many companies are actively testing their cars. Notable recent events include Intel’s acquisition of MobilEye, the scandals around Uber and Google over technology allegedly stolen by former employees, and the first fatality involving a self-driving car.

Another thing worth noting is that Google Waymo is launching a beta program. Google is a pioneer in this area, and its technology is presumed to be very good because its cars have driven more than 3 million miles.

In addition, self-driving cars are already on the road in every US state.

Healthcare

As I said, ML is beginning to be introduced into medicine. For example, Google works with medical centers to help diagnose diseases.

DeepMind has even opened a healthcare department.

This year, as part of the Data Science Bowl, a competition with a $1 million prize fund was held to predict lung cancer within a year on the basis of detailed images.

Investments

Currently, ML is receiving heavy investment, just as big data did before it.

China is investing $150 billion in AI in order to become the global leader in the industry.

For comparison, Baidu Research employs 1,300 people working on AI research, while Facebook’s FAIR employs 80. At the 2017 Conference on Knowledge Discovery and Data Mining (KDD), Alibaba employees presented their parameter server Kunpeng, which runs on 100 billion samples with trillions of parameters as a routine task.


Everyone may draw their own conclusions from what deep learning achieved in 2017, but it’s never too late to learn machine learning. One way or another, over time all developers will need to use machine learning, and it will become a common skill, just like the ability to work with databases today.
