In recent years, advances in deep neural networks have been rapid, and networks trained on large amounts of labeled data can now learn mappings from inputs to outputs with impressive accuracy, whether those inputs and outputs are images, sentences, label predictions, and so on.
What these models still lack is the ability to generalize to conditions that differ from those seen during training. When do you need this ability? Whenever you apply a model to the real world rather than to your carefully constructed dataset. The real world is messy and contains countless novel scenarios, many of which your model never encountered during training and therefore handles poorly. The ability to transfer knowledge to new conditions is commonly known as transfer learning, and it is the subject of this article.
-
What is transfer learning?
-
Why is transfer learning the next frontier?
-
Applications of transfer learning
Learning from simulation
Adapting to new domains
Transferring knowledge across languages
-
Transfer learning methods
Using pre-trained CNN features
Learning domain-invariant representations
Making representations more similar
Confusing domains
-
Related research areas
Semi-supervised learning
Using available data more effectively
Improving model generalization
Increasing model robustness
Multi-task learning
Zero-shot learning
-
Conclusion
In the classic supervised learning scenario, if we want to train a model for some task in some domain A, we assume that we have labeled data for that same task and domain. As shown in Figure 1, the training and test data of model A come from the same task and domain A. The concepts of “task” and “domain” will be explained in more detail later. For now, let us assume that we have a target task that our model needs to perform, for example identifying objects in an image, and a domain, the source of the data, such as photos of San Francisco coffee shops.
Figure 1: Classic supervised learning setup in ML
We can now train a model A on this dataset and expect it to perform well on unseen data from the same task and domain. When we are instead given data for some other task or domain, we again need labeled data for that task or domain to train a new model B if we want it to perform well there.
When there is not enough labeled data to train a model for a task or domain, the traditional supervised learning paradigm breaks down.
If we want to train a model to detect pedestrians in nighttime photographs, we could reuse a model trained in a similar domain, such as on daytime photographs. In practice, however, doing so often makes the model perform poorly or even break down entirely, because the model has inherited the biases of its training data and does not know how to generalize to the new domain.
If we want to train a model to perform a new task, such as detecting cyclists, we cannot even reuse an existing pedestrian-detection model, because the labels differ between the tasks.
With transfer learning, however, we are able to deal with these situations by using labeled data that already exists for a related task or domain. We try to store the knowledge gained from solving the source task in the source domain and apply it to the task we care about. See Figure 2.
Figure 2: Setup for transfer learning
In practice, we try to transfer as much knowledge as possible from the source setting to the target task or domain. Depending on the data, this knowledge can take many forms: it can concern how objects are composed, so that the model can more easily recognize new objects; it can also be the general words people use to express opinions, and so on.
Andrew Ng said in his NIPS 2016 tutorial that transfer learning will be the next driving force to achieve success in the business application of ML after supervised learning.
Figure 3: Ng giving a talk on transfer learning at NIPS 2016
He sketched a chart on the whiteboard, which I have reproduced as faithfully as possible in Figure 4. According to Ng, transfer learning will become a key driver of machine learning success in industrial applications.
Figure 4: The driving forces behind ML’s success in industrial applications, via Andrew Ng
Undoubtedly, the successful use of ML in industry to date has been driven largely by supervised learning. Aided by advances in deep learning, more powerful computing, and large labeled datasets, supervised learning has returned AI to the attention and interest of many, with start-ups being funded and acquired. In particular, applications of machine learning have become part of our daily lives in recent years. If we ignore the naysayers and those predicting the next AI winter, and instead trust Andrew Ng’s prescience, these successes will continue.
However, transfer learning has been around for decades and has hardly been used in industry, so why does Ng predict explosive growth? Moreover, transfer learning currently receives comparatively little attention next to other areas of machine learning, such as unsupervised learning and reinforcement learning, which have become increasingly popular recently: according to Yann LeCun, unsupervised learning is a key ingredient of general-purpose AI, and with the rise of generative adversarial networks it has regained the attention of many researchers.
On the other hand, the main push for reinforcement learning has come from Google’s DeepMind, which used it to make dramatic improvements in game-playing AI, most notably with the success of AlphaGo. Reinforcement learning has also had real-world successes, such as helping Google’s data centers reduce cooling costs by 40 percent. Both fields, while promising, remain the subject of cutting-edge research papers, because they still face many challenges that stand in the way of large-scale commercial application in the foreseeable future.
Figure 5: In Yann LeCun’s famous AI cake metaphor, there is clearly no place for transfer learning.
What makes transfer learning different? Below, we explain (this is only the author’s own interpretation) what drives Ng’s prediction and outline why it is time to focus on transfer learning.
The current use of machine learning in industry has two characteristics:
For one thing, over the past few years we have been able to train more and more accurate models. We have now reached a point where, for many tasks, state-of-the-art models are so good that they are no longer a barrier for users. How good? The latest residual networks surpass human-level performance on ImageNet recognition tasks; Google’s Smart Reply automatically handles 10% of mobile responses; speech recognition error rates keep falling and now rival those of human transcribers; machines can identify skin cancer as accurately as dermatologists; Google’s NMT system is used in production for more than 10 language pairs; Baidu’s Deep Voice can generate human speech in real time. The list goes on and on. In short, this maturity has allowed these models to be deployed at scale to millions of users and to be widely adopted.
On the other hand, the success of these models depends heavily on data, and specifically on large amounts of labeled data. For some tasks and domains, such data is available because it has been painstakingly collected over many years. In a few cases the data is public, as with ImageNet, but large labeled datasets, such as many speech or MT datasets, are usually proprietary or very expensive, because it is precisely this data that gives companies their competitive advantage.
When machine learning models are applied to the real world, however, they are confronted with countless conditions they have never encountered before and do not know how to deal with. Each customer or user has their own preferences, and possesses or generates data that differs from the training data; a model may be asked to perform tasks that are related to, but not the same as, the task it was trained on. Our current state-of-the-art models, despite showing human-level or even superhuman performance on the tasks and domains they were trained on, suffer significant performance degradation or even break down completely under such conditions.
Transfer learning helps us handle these novel scenarios, and it is necessary for industrial-scale machine learning that goes beyond tasks and domains rich in labeled data. So far, we have only applied our models to the tasks and domains where data is most plentiful. To also serve the long tail of the distribution, we must learn to transfer the knowledge we have acquired to new tasks and domains.
Learning from simulation
One of the most exciting applications of transfer learning, and one we will see more of in the future, is learning from simulation. For many machine learning applications that rely on hardware interaction, collecting data and training models in the real world is expensive, time-consuming, or simply dangerous. It is therefore advisable to gather data in some other, less risky way.
Simulation is the preferred tool for this and has been used to build many advanced ML systems in the real world. Learning from a simulation and then applying what is learned to the real world is an instance of transfer learning, because the feature space is the same between the source and target domains (both usually depend on pixels). But the marginal probability distributions of simulation and reality differ, that is, objects in the simulation and in the real world look different, although this difference diminishes as simulations become more realistic. The conditional probability distributions may also differ, because a simulation cannot replicate all interactions in the real world: a physics engine, for example, cannot fully mimic the complex interactions of real objects.
Figure 6: Google’s driverless car (source: Google Research Blog)
Learning from simulation makes data collection easier, because objects can easily be bounded and analyzed, and it enables fast training, because learning can run in parallel across many instances. It is therefore a prerequisite for large machine learning projects that must interact with the real world, such as driverless cars. According to Zhaoyin Jia, Google’s head of driverless car technology, “simulation is crucial if you really want to make a driverless car.” Udacity has open-sourced the simulator for its self-driving car nanodegree, and OpenAI’s Universe platform makes it possible to train self-driving systems with GTA V and other games.
Figure 7: Udacity’s self-driving car simulator (source: TechCrunch)
Another area that requires learning from simulation is robotics: training models on a real robot is too slow, and robots are very expensive. Training on a simulated robot and then transferring the knowledge to the real one mitigates this problem, and such research has attracted renewed interest in recent years. Figure 8 shows an example of a manipulation task in the real world and in a simulated environment.
Figure 8: Robot and simulation images (Rusu et al., 2016)
Finally, another direction where simulation is indispensable is general-purpose AI. Training an agent directly in the real world to achieve general-purpose AI is too costly, and unnecessary complexity impedes initial learning. Learning may instead succeed more readily in a simulated environment such as CommAI-env, shown in Figure 9.
Figure 9: FAIR’s CommAI-env (Mikolov et al., 2015)
Adapting to new domains
Learning from simulation is a special example of domain adaptation. Other examples of domain adaptation are summarized below.
Domain adaptation is a common requirement for visual tasks, because the readily available labeled data often differs from the data we actually want to use, whether we are identifying bicycles, as in Figure 10, or other objects in the wild. Even if the training and test data look the same, the training data may still contain biases imperceptible to humans, which the model exploits and thereby overfits.
Figure 10: Different visual domains (Sun et al., 2016)
Another common domain adaptation scenario is adapting to different text genres: standard NLP tools such as part-of-speech taggers and parsers are typically trained on news data, such as the Wall Street Journal. The challenge is that models trained on news data struggle to cope with other kinds of text, such as messages on social media.
Figure 11: Different text categories/genres
Even within the same domain, such as product reviews, people use different words and phrases to express the same opinion. A model trained on one type of review should therefore handle both general and domain-specific words without being confused by the shift in domain.
Figure 12: Different topics
Finally, the challenges above involve only general text or image categories, but the problem is magnified when we need to deal with domains tied to individuals or groups of users, as in automatic speech recognition (ASR). Voice is expected to be the next big platform, with 50% of all searches predicted to be performed by voice by 2020. Most ASR systems have traditionally been evaluated on the Switchboard dataset, which contains only 500 speakers. As a result, these systems understand speakers with standard accents well, but struggle with non-standard accents, speech impairments, and children’s voices. We need ASR systems that can adapt to individual users and minority groups to ensure that everyone’s voice is understood.
Figure 13: Voice application
Transferring knowledge across languages
Finally, in my opinion, learning from one language and applying that knowledge to another is another important application of transfer learning. Reliable cross-lingual adaptation would allow us to take the vast amounts of annotated English data and apply them to other languages, especially low-resource ones. While this is still out of reach, recent advances such as zero-shot translation show that the field is making rapid progress.
The research on transfer learning has a long history, and the four scenarios mentioned above have corresponding technologies to deal with them. The rise of deep learning has given rise to a host of transfer learning approaches, some of which are described below.
Using pre-trained CNN features
To motivate the most commonly used transfer learning method, we must understand why large convolutional neural networks have been so successful on ImageNet.
-
Understanding convolutional neural networks
While many of the details of how the model works remain a mystery, we now know that a lower convolutional layer captures low-level image features, such as edges (see Figure 14), while a higher convolutional layer captures increasingly complex details, such as body parts, faces, and other component features.
Figure 14: Sample filtering for AlexNet learning
The final fully connected layers are often taken to capture the information relevant to solving the task at hand; for example, AlexNet’s fully connected layers indicate which features are relevant for classifying an image into one of 1,000 object categories.
However, while knowing that a cat has whiskers, claws, fur, etc., is necessary to recognize an animal like a cat, it does not help us identify new objects or solve other common visual tasks such as scene recognition, fine-grained recognition, attribute detection, and image retrieval.
What does help, however, is capturing general information about how images are composed and what combinations of edges and shapes they contain. As described above, this information is contained in one of the final convolutional layers or the early fully connected layers of a large convolutional neural network trained on ImageNet.
For a new task, then, we can simply take the off-the-shelf features of a state-of-the-art CNN pre-trained on ImageNet and train a new model on these extracted features. In practice, we either keep the pre-trained parameters fixed or fine-tune them with a small learning rate, to ensure the network does not forget what it previously learned. This simple approach achieves impressive results on a range of visual tasks, as well as tasks that rely on visual input, such as image captioning. When processing images, models trained on ImageNet appear to capture details of how animals and other objects are structured and composed. The ImageNet task thus seems to be a good proxy for general computer vision problems, because the knowledge it requires is relevant to many other tasks.
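As a sketch of this recipe, the toy example below freezes a “backbone” and trains only a small logistic-regression head on the extracted features. Since real ImageNet weights cannot be shipped here, the frozen extractor is just a fixed random projection with a ReLU, a stand-in for the pre-trained convolutional layers; all names and data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained extractor: in a real system this would
# be the convolutional layers of an ImageNet model; here it is just a fixed
# random projection plus a ReLU, purely for illustration.
W_frozen = rng.normal(size=(64, 16))

def extract_features(x):
    """The frozen 'backbone': its weights are never updated."""
    return np.maximum(x @ W_frozen, 0.0)

def train_head(X, y, lr=0.05, steps=300):
    """Train only a logistic-regression head on top of the frozen features."""
    feats = extract_features(X)
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(feats @ w + b, -30, 30)   # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        grad = (p - y) / len(y)               # dLoss/dlogits for log loss
        w -= lr * feats.T @ grad              # only the head is updated
        b -= lr * grad.sum()
    return w, b

# Toy two-class data: two Gaussian blobs in 64 dimensions.
X = np.vstack([rng.normal(-1.0, 1.0, (50, 64)), rng.normal(1.0, 1.0, (50, 64))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = train_head(X, y)
accuracy = ((extract_features(X) @ w + b > 0) == (y == 1)).mean()
```

In practice one would replace `extract_features` with a real pre-trained network, and optionally unfreeze it later with a small learning rate.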
-
Learning the underlying structure of images
A similar assumption motivates generative models: when training generative models, we assume that the ability to generate realistic images requires an understanding of the underlying structure of images, which in turn can be applied to many other tasks. This assumption itself rests on the premise that all images lie on a low-dimensional manifold, that is, that there is some underlying structure of images a model can extract. Recent advances in generating realistic images with generative adversarial networks suggest that such a structure may indeed exist, as demonstrated by the model’s ability to produce realistic transitions between points in the space of bedroom images, shown in Figure 15.
Figure 15: Walking along the manifold of bedroom images
-
Are pre-trained features useful for tasks other than visual ones?
Off-the-shelf CNN features have achieved unrivalled results on visual tasks, but it is questionable whether this success can be replicated in other fields with other types of data, such as language. Currently, no off-the-shelf features achieve results in NLP as amazing as those in vision. Why is that? Do such features exist at all? And if not, why does vision lend itself to this form of transfer more than language?
The output of low-level tasks such as part-of-speech tagging or chunking can be treated as off-the-shelf features, but these do not capture finer-grained rules of language use beyond grammar, and do not help all tasks. As we have seen, the existence of generalizable off-the-shelf features seems intertwined with the existence of a task that can be seen as a prototype for many tasks in the field. In vision, object recognition occupies this position. In NLP, the closest analogue is probably language modeling: to predict the next word or sentence given a sequence of words, a model needs knowledge of linguistic structure, of which words are likely to be related and likely to follow one another, of long-term dependencies, and so on. Although state-of-the-art language models are getting closer to human level, their features are used only in limited ways. At the same time, advances in language modeling have brought gains on other tasks: pre-training a model with a language modeling objective improves its performance. In addition, word embeddings pre-trained on large unlabeled corpora with objectives that approximate language modeling have become pervasive. While they are not as effective as off-the-shelf features in vision, they are still useful and can be seen as a simple form of transfer of general domain knowledge derived from a large unlabeled corpus.
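To illustrate why even simple pre-trained embeddings transfer, the sketch below uses tiny hand-made vectors (hypothetical values, not real word2vec or GloVe embeddings) and shows that nearest neighbors under cosine similarity group words by meaning, knowledge a downstream model gets for free:

```python
import numpy as np

# Toy stand-ins for pre-trained embeddings; a real system would load
# word2vec or GloVe vectors trained on a large unlabeled corpus.
embeddings = {
    "good":     np.array([ 0.9, 0.1, 0.0]),
    "great":    np.array([ 0.8, 0.2, 0.1]),
    "bad":      np.array([-0.9, 0.1, 0.0]),
    "terrible": np.array([-0.8, 0.2, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word):
    """Most similar word under the (toy) pre-trained embedding space."""
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))
```

A sentiment model that has only ever seen "good" in its labeled data can still handle "great", because the embedding space already places the two words close together.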
Although there is no universal proxy task in natural language processing, auxiliary tasks can act as local proxies. Whether through multi-task objectives or synthetic task objectives, other relevant knowledge can be injected into the model.
The use of pre-trained features is currently the most direct and common way to perform transfer learning. However, it is not the only way.
Learning domain-invariant representations
Pre-trained features are in practice mainly used for adaptation to the third scenario, where we want to adapt to a new task. For the other scenarios, another way to transfer knowledge with deep learning is to learn representations that do not change with the domain. This approach is conceptually very similar to using pre-trained CNN features: both encode general knowledge about the domain. However, creating domain-invariant representations is far cheaper and more feasible for non-visual tasks than creating representations useful for all tasks: ImageNet took years and thousands of hours of work, whereas domain-invariant representations typically require only unlabeled data from each domain. Such representations are often learned with stacked denoising autoencoders and have proven effective in both natural language processing and computer vision.
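A minimal sketch of the idea: the tiny denoising autoencoder below corrupts its input with masking noise and learns to reconstruct the clean version, forcing the hidden layer to capture structure shared across the data. It is a single linear layer in plain NumPy, not a full stacked architecture, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data lying near a 3-dimensional subspace of an 8-dimensional space.
Z = rng.normal(size=(200, 3))
B = rng.normal(size=(3, 8))
X = Z @ B + 0.1 * rng.normal(size=(200, 8))

W_enc = 0.1 * rng.normal(size=(8, 3))   # encoder weights
W_dec = 0.1 * rng.normal(size=(3, 8))   # decoder weights

def dae_step(X, lr=0.05, drop=0.3):
    """One training step: corrupt the input, reconstruct the *clean* input,
    and update both weight matrices by gradient descent on the MSE."""
    global W_enc, W_dec
    mask = rng.random(X.shape) > drop        # masking noise: zero out ~30%
    Xc = X * mask
    H = Xc @ W_enc                           # hidden representation
    E = H @ W_dec - X                        # reconstruction error vs. clean X
    n, d = X.shape
    grad_dec = 2.0 / (n * d) * H.T @ E       # dMSE/dW_dec
    grad_enc = 2.0 / (n * d) * Xc.T @ (E @ W_dec.T)  # dMSE/dW_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    return np.mean(E ** 2)

losses = [dae_step(X) for _ in range(500)]
```

Because the corruption differs on every step, the hidden layer cannot memorize individual inputs and must model the shared low-dimensional structure instead, which is exactly what makes such representations useful across domains.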
Making representations more similar
To improve the transfer of learned representations from the source to the target domain, we want the representations of the two domains to be as similar as possible, so that the model does not rely on domain-specific characteristics that may hinder transfer, but only on the commonalities between the domains.
Rather than merely letting an autoencoder learn some representation, we can actively force the representations of the two domains to be more similar. We can apply this directly to our data representations as a pre-processing step and then use the new representations for training, or we can force the representations of the domains inside our model to be more similar.
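As an illustration of the pre-processing variant, the sketch below aligns second-order statistics in the spirit of correlation alignment: source features are whitened and then re-colored with the target covariance, so both domains share the same feature statistics before training. The mean shift at the end is an extra convenience added here; treat the whole function as an illustrative sketch under these assumptions, not a faithful reimplementation of any particular paper.

```python
import numpy as np

def align_covariance(source, target, eps=1e-5):
    """Whiten the source features, then re-color them with the target
    covariance, so cov(aligned) ~= cov(target). Inputs are
    (n_samples, n_features) arrays; eps regularizes the covariances."""
    def cov_power(X, power):
        c = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
        vals, vecs = np.linalg.eigh(c)           # symmetric eigendecomposition
        vals = np.maximum(vals, eps)
        return vecs @ np.diag(vals ** power) @ vecs.T
    centered = source - source.mean(axis=0)
    aligned = centered @ cov_power(source, -0.5) @ cov_power(target, 0.5)
    return aligned + target.mean(axis=0)         # extra: also match the mean
```

A classifier trained on the aligned source features then sees inputs whose statistics match the target domain, even though no target labels were used.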
Confusing domains
Another method that has become increasingly common for ensuring that the representations of the two domains are similar is to add an objective to an existing model that encourages it to confuse the two domains. This domain confusion loss is a regular classification loss in which the model tries to predict the domain of an input example. The difference from a regular loss, however, is that the gradient flowing from this loss into the rest of the network is reversed, as shown in Figure 16.
Figure 16: Confusion domain with gradient inversion layer (Ganin and Lempitsky, 2015)
Instead of minimizing the error of the domain classification loss, the gradient reversal layer causes the model to maximize it. In practice, this means the model learns representations that minimize the loss of the original objective while not allowing the two domains to be distinguished, which benefits knowledge transfer. While a model trained only with the regular objective, as shown in Figure 17, can clearly separate the domains based on its learned representations, a model whose objective includes a domain confusion term cannot.
Figure 17: Domain classifier scores for conventional and domain obfuscation models
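The mechanism can be sketched in a few lines. The toy function below backpropagates through a logistic domain classifier and returns two gradients: the ordinary one for the classifier's own weights, and the one flowing toward the feature extractor, multiplied by -λ as the gradient reversal layer prescribes. This is an illustrative NumPy sketch of the idea in Ganin and Lempitsky (2015), not their actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_confusion_grads(feats, domains, w, lambd=1.0):
    """One backward pass through a logistic domain classifier sitting on
    top of a gradient reversal layer.

    Returns the ordinary gradient for the classifier weights `w`, and the
    gradient with respect to `feats` multiplied by -lambd: the reversal
    that pushes the feature extractor to *confuse* the two domains.
    """
    p = sigmoid(feats @ w)               # predicted P(domain = 1)
    err = (p - domains) / len(domains)   # dLoss/dlogits for log loss
    grad_w = feats.T @ err               # classifier descends its loss
    grad_feats = np.outer(err, w)        # dLoss/dfeats ...
    return grad_w, -lambd * grad_feats   # ... reversed before the extractor
```

The feature extractor is updated with the reversed gradient, so it ascends the domain loss while the classifier descends it, driving the representations of the two domains together.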
Although this article is about transfer learning, transfer learning is not currently the only way for machine learning to leverage limited data, apply learned knowledge to new tasks, and generalize to new environments. Below, we introduce some other research directions that are related or complementary to the goals of transfer learning.
Semi-supervised learning
Transfer learning seeks to make the most of unlabeled data in the target task or domain. This is also the maxim of semi-supervised learning, which follows the classic machine learning setup but assumes only a limited number of labeled samples for training. Many of the lessons and insights from semi-supervised learning apply to transfer learning as well.
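One classic semi-supervised technique whose spirit carries over is self-training: fit a model on the labeled data, pseudo-label the unlabeled points it is confident about, and refit. The sketch below uses a nearest-centroid classifier with a distance threshold as a crude confidence measure; classes are assumed to be labeled 0..k-1, and all numbers are illustrative.

```python
import numpy as np

def predict(X, centroids):
    """Nearest-centroid prediction plus the distance used as (inverse) confidence."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

def self_train(X_lab, y_lab, X_unl, rounds=3, radius=1.5):
    """Self-training: repeatedly pseudo-label the confident unlabeled points
    (those within `radius` of a centroid) and refit the centroids with them.
    Assumes integer class labels 0..k-1."""
    X, y = X_lab, y_lab
    classes = np.unique(y_lab)
    for _ in range(rounds):
        centroids = np.array([X[y == c].mean(axis=0) for c in classes])
        pred, dist = predict(X_unl, centroids)
        confident = dist < radius            # crude confidence threshold
        if not confident.any():
            break
        X = np.vstack([X_lab, X_unl[confident]])
        y = np.concatenate([y_lab, pred[confident]])
    return centroids
```

Unlabeled points near a class pull its centroid toward them, while far-away outliers are never pseudo-labeled, so the unlabeled data refines the decision boundary without any extra annotation.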
Using available data more effectively
Another direction related to transfer and semi-supervised learning is to enable models to work better with limited data.
This can be done in several ways: unsupervised or semi-supervised learning can extract information from unlabeled data, reducing the reliance on labeled samples; we can allow the model to access other features inherent in the data while using regularization to curb overfitting; finally, we can make use of data that has so far been neglected.
Improving model generalization
Improving the generalization ability of models is another route. To achieve this, we must first better understand the behavior and complexity of large neural networks and investigate why and how they generalize.
Increasing model robustness
Even as we improve a model’s generalization ability, it may generalize well to similar instances yet still fail badly on unexpected or atypical inputs. A key goal is therefore to make our models more robust. Thanks to recent progress in adversarial learning, research in this direction is growing, and a number of studies have proposed ways to make models perform better against worst-case or adversarial samples in different settings.
Multi-task learning
In transfer learning, we mainly care about our target task. In multi-task learning, by contrast, the goal is to perform well on all tasks. Although multi-task learning methods are not directly applicable to the transfer learning setup, the ideas that make multi-task learning work can still benefit transfer learning.
Continuous learning
While multi-task learning allows us to retain knowledge across many tasks without incurring a performance penalty on our source tasks, this is only possible if all tasks are available at training time. For each new task, we usually need to retrain the model on all tasks again.
Zero-shot learning
Finally, if we push transfer learning to the limit and learn from only a few, one, or even zero samples, we arrive at few-shot, one-shot, and zero-shot learning, respectively. Getting models to perform one-shot and zero-shot learning is among the hardest problems in machine learning, yet it comes naturally to us as humans.
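The zero-shot idea can be made concrete with a tiny attribute-based sketch: each class is described by a vector of semantic attributes, and an unseen class can be recognized by matching the attributes predicted for an image against the class descriptions. Here the attribute predictor is skipped and its output is passed in directly; all classes, attributes, and numbers are hypothetical.

```python
import numpy as np

# Each class is described by semantic attributes (hypothetical values):
#                       stripes  mane  domestic
class_attributes = {
    "tiger": np.array([1.0, 0.0, 0.0]),
    "horse": np.array([0.0, 1.0, 1.0]),
    "zebra": np.array([1.0, 1.0, 0.0]),   # unseen class: no training images
}

def zero_shot_classify(predicted_attributes):
    """Pick the class whose attribute description best matches the attributes
    predicted from the image (Euclidean distance). The attribute predictor
    itself would be trained on *seen* classes only."""
    return min(class_attributes,
               key=lambda c: np.linalg.norm(class_attributes[c] - predicted_attributes))
```

Because "zebra" is described purely through attributes the predictor learned from other animals (stripes from tigers, manes from horses), it can be recognized without a single labeled zebra image.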
In summary, transfer learning offers many exciting research directions, and in particular many applications that require models to transfer knowledge to new tasks and adapt to new domains. I hope this article has given you an overview of transfer learning and piqued your interest.
http://sebastianruder.com/transfer-learning/index.html#transferlearningmethods