Efficient transfer learning in NLP
Evaluate new machine learning approaches
For a wide range of practical applications, machine learning models must be measured against most of the following requirements:
- Fast training
- Fast prediction
- Requires little or no hyperparameter tuning
- Works well when little training data is available (~100 examples)
- Suitable for a wide range of tasks and domains
- Scales well with additional labeled training data
Let’s see how the pre-trained word and document embedding approach measures up against these requirements:
- Fast training: training a lightweight model on top of pre-trained embeddings takes only seconds, although computing the embeddings themselves has a cost that depends on the complexity of the underlying model.
- Fast prediction: prediction is likewise cheap, with cost dominated by the underlying embedding model.
- Requires little or no hyperparameter tuning: cross-validating regularization parameters and embedding types is helpful, and is cheap enough to pose no practical problem.
- Works well when little training data is available (~100 examples): fitting a logistic regression on top of pre-trained word embeddings requires learning only hundreds to thousands of parameters rather than millions, and the model’s simplicity means very little data is needed to get reasonable results (see the sketch after this list).
- Suitable for a wide range of tasks and domains: pre-trained word and document embeddings typically perform “well enough,” but the right choice of embedding depends on the domain and the target task.
- Scales well with additional labeled training data: here the approach falls short; it plateaus quickly and does not benefit much from additional training data. Learning a linear model helps produce better results with less data, but it also means the model is far less capable of learning complex relationships between inputs and outputs.
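To make the recipe concrete, here is a minimal sketch of the embedding-plus-lightweight-model approach: average pre-trained word vectors per document, then fit a cross-validated logistic regression on top. It assumes spaCy with its medium English model (en_core_web_md) as the vector source and uses a few toy examples; any other source of pre-trained word or document vectors would slot in the same way.

```python
# A minimal sketch (vector source and toy data are assumptions) of the
# "pre-trained embeddings + lightweight model" recipe.
import numpy as np
import spacy
from sklearn.linear_model import LogisticRegressionCV

nlp = spacy.load("en_core_web_md")  # ships with pre-trained word vectors

def embed(texts):
    # doc.vector is the average of the document's word vectors,
    # i.e. a fixed-size pre-trained representation per document
    return np.vstack([nlp(text).vector for text in texts])

# Toy data standing in for the ~100 labeled examples discussed above
train_texts = ["great service, would buy again", "fast and friendly support",
               "terrible, never again", "cold food and rude staff"]
train_labels = [1, 1, 0, 0]

# The only "tuning" the recipe needs: cross-validated regularization strength
clf = LogisticRegressionCV(cv=2, max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["really enjoyed it"])))
```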
In short, pre-trained embeddings are computationally inexpensive and perform well when little training data is available, but the use of static representations limits the benefit gained from additional training data. Getting good performance also requires finding the right pre-trained embedding for a given task, and it is difficult to predict in advance whether a given embedding will generalize well to a new target task; this has to be verified one experiment at a time.
Transfer learning solutions in computer vision
Fortunately, research in computer vision offers a viable alternative. In that field, pre-trained feature representations have largely been displaced by methods that “fine-tune” pre-trained models rather than learning only the final classification layer: all of the source model’s weights are modified, instead of simply reinitializing and learning the weights of the final classification layer. This additional model flexibility begins to pay off as more training data becomes available. The idea is several years old; Yosinski, Clune, Bengio et al. explored the transferability of convolutional neural network (CNN) parameters in 2014, but only recently has the process become common practice. Fine-tuning CNNs is now routine: Stanford’s computer vision course (CS231n) teaches it as part of the curriculum, and Mahajan et al.’s 2018 paper (“Exploring the Limits of Weakly Supervised Pretraining”) shows that when model performance is critical, fine-tuning should always be used in place of pre-trained features.
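As an illustration (not code from any of the cited papers), the sketch below contrasts the two regimes using a torchvision ResNet as the pre-trained source model; the target task, class count, and learning rates are hypothetical.

```python
# Illustrative only: a torchvision ResNet stands in for "a pre-trained CNN";
# num_classes and the learning rates are hypothetical choices.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical target task

# Option 1 - pre-trained features: freeze every source weight and learn
# only a freshly initialized final classification layer.
feature_model = models.resnet18(pretrained=True)
for param in feature_model.parameters():
    param.requires_grad = False
feature_model.fc = nn.Linear(feature_model.fc.in_features, num_classes)

# Option 2 - fine-tuning: replace the final layer but keep every weight
# trainable, typically with a smaller learning rate for the source layers.
finetuned = models.resnet18(pretrained=True)
finetuned.fc = nn.Linear(finetuned.fc.in_features, num_classes)
optimizer = torch.optim.SGD(
    [
        {"params": (p for n, p in finetuned.named_parameters()
                    if not n.startswith("fc.")), "lr": 1e-4},
        {"params": finetuned.fc.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```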
Validating model fine-tuning for natural language processing
So why has natural language processing lagged behind? In the article “NLP’s ImageNet moment has arrived,” Sebastian Ruder attributes this to the field’s lack of established datasets and source tasks for learning generalizable base models; until recently, NLP simply had no ImageNet-like dataset. However, recent papers such as Howard and Ruder’s “Universal Language Model Fine-tuning for Text Classification” and Radford et al.’s “Improving Language Understanding by Generative Pre-Training” have demonstrated that model fine-tuning is finally showing promise for natural language. Although the source datasets used in these papers vary, the field seems to be standardizing on the “language modeling” objective as the preferred way to train transferable base models. Simply put, language modeling is the task of predicting the next word in a sequence. Given the partial sentence “I thought I would arrive on time, but ended up ____ by 5 minutes,” it is fairly obvious to the reader that the next word will be “late” or a synonym of it. Solving this task well requires not only an understanding of linguistic structure (nouns follow adjectives, verbs have subjects and objects, etc.) but also the ability to make decisions from broad contextual cues (“late” fills the blank in the example because the surrounding text signals that the speaker is talking about time). Moreover, language modeling has the desirable property of requiring no labeled training data, and raw text is abundant for every conceivable domain. These two features make language modeling an ideal choice for learning generalizable base models.
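To ground the idea, here is a toy sketch of the language-modeling objective: every position in a raw, unlabeled sentence supplies a training target (the next word). The vocabulary, model size, and GRU-based architecture are all illustrative; the papers above use far larger models.

```python
# Toy illustration of the language-modeling objective (vocabulary, sizes,
# and architecture are illustrative, not those of the cited papers).
import torch
import torch.nn as nn

vocab = ["i", "thought", "would", "arrive", "on", "time", "but", "ended",
         "up", "late", "by", "5", "minutes"]
stoi = {w: i for i, w in enumerate(vocab)}

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)  # logits over the next word at each position

model = TinyLM(len(vocab))
sentence = torch.tensor([[stoi[w] for w in
    "i thought i would arrive on time but ended up late by 5 minutes".split()]])
inputs, targets = sentence[:, :-1], sentence[:, 1:]  # predict each next word
logits = model(inputs)                               # (batch, positions, vocab)
loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), targets)
loss.backward()  # raw, unlabeled text is the only supervision required
```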
The “Transformer” model
When recurrent models don’t need to be recurrent
Does model fine-tuning meet these criteria?
In light of recent progress, let’s revisit how model fine-tuning measures up against the earlier requirements:
- Fast training: although more computationally expensive than pre-computed feature representations, OpenAI’s Transformer model can be fine-tuned on a few hundred examples in about five minutes on GPU hardware.
- Fast prediction: prediction is also more expensive, with throughput limited to single-digit documents per second; prediction speed needs to improve before widespread practical use.
- Requires little or no hyperparameter tuning: default hyperparameters work well across tasks, although basic cross-validation to find the ideal regularization parameter can be beneficial.
- Works well when little training data is available (~100 examples): model fine-tuning performs as well as the pre-trained embedding approach even with as few as 100 examples.
- Suitable for a wide range of tasks and domains: domain/task mismatch appears to be less of a problem than with pre-trained feature representations, and the language modeling objective appears to learn features useful for both semantic and syntactic tasks.
- Scales well with labeled training data: given enough training data, a far more flexible model can solve tasks that pre-trained feature representations cannot. As more training data becomes available, the gap between pre-trained features and model fine-tuning widens considerably; in fact, fine-tuning often appears preferable even to training from scratch, as shown by the latest results in OpenAI’s paper “Improving Language Understanding by Generative Pre-Training.”
Although it has limitations, model fine-tuning for NLP tasks holds great promise and has already shown clear advantages over the current best practice of using pre-trained text and document embeddings. As Sebastian Ruder concludes:
“The time is ripe to put transfer learning into practice in NLP. Given the impressive empirical results achieved by ELMo, ULMFiT, and OpenAI, it seems only a matter of time until pre-trained word embeddings are phased out, which could open up many new applications in NLP settings where labeled data is limited. The king is dead, long live the king!”
Quantitative benchmarking of NLP model fine-tuning
Our early benchmarking confirms that fine-tuning models on top of pre-trained representations provides a general benefit. Below is example output from a recent benchmark run produced with Enso, our transfer learning benchmarking tool.
Finetune: a scikit-learn-style library for model fine-tuning
In light of these recent developments, Indico has open-sourced a wrapper around OpenAI’s Transformer fine-tuning work. By packaging Radford et al.’s research into an easy-to-use scikit-learn-style library, we hope to make it more widely accessible and to show how a few short lines of code can improve results on your own tasks.
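Below is a minimal sketch of the scikit-learn-style workflow described above. The `Classifier` class and its `fit`/`predict` methods follow the library’s published examples, but exact names and defaults may vary between versions, and the texts and labels shown are toy placeholders.

```python
# Minimal sketch of a scikit-learn-style fine-tuning workflow
# (class/method names per the finetune library's published examples;
# the texts and labels below are toy placeholders).
from finetune import Classifier

train_texts = ["the food was wonderful, will return",
               "cold, bland, and overpriced"]
train_labels = ["positive", "negative"]

model = Classifier()                   # wraps the pre-trained Transformer LM
model.fit(train_texts, train_labels)   # fine-tunes all weights on the target task
predictions = model.predict(["really enjoyed the meal"])
print(predictions)
```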