Author: Lao Song's Tea Book Club
zhuanlan.zhihu.com/p/69389583
Preface
Recently, I have been focusing almost entirely on BERT, mainly exploring its performance on classification and reading comprehension. I have stepped into quite a few pitfalls along the way, and I want to write them all down here to help people use BERT more effectively.
A couple of caveats
The length of the text
The first thing to note is that as text length increases, the GPU memory required grows roughly linearly and the running time grows nearly linearly as well, so there is a tradeoff to be made, and the impact of text length differs from task to task.
For classification, once the text length passes a certain point the model's performance barely changes, so increasing the length further is pointless.
The curse of 512
When you set your text length to more than 512, the following error occurs:
```
RuntimeError: Creating MTGP constants failed.
```
In pytorch-pretrained-bert, the PyTorch implementation of BERT, the maximum text length is 512 because of the position embeddings. This means that if your text is longer, you need to truncate it or read it in segments.
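As a minimal sketch of guarding against the 512 limit (assuming the pytorch-pretrained-bert tokenizer and a hypothetical `text` string; not the exact code from any particular repository):

```python
from pytorch_pretrained_bert import BertTokenizer

MAX_LEN = 512  # hard limit imposed by BERT's position embeddings

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
tokens = tokenizer.tokenize(text)  # `text` is your raw input string

# Reserve two positions for [CLS] and [SEP], then truncate if needed
if len(tokens) > MAX_LEN - 2:
    tokens = tokens[:MAX_LEN - 2]
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```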
Don’t run the entire data set at first
In the early stages of writing and testing code, the data sets are often large and take a long time to load, so you have to wait for loading to finish just to find out whether the model even runs, and this usually takes several rounds of trial and error. If you run the full data set every time, the inefficiency is mind-boggling for large data sets, especially with BERT, whose tokenization is slow.
Therefore, I strongly recommend splitting off a demo-level subset first. I usually take about 1,000 examples each for train, dev, and test, make sure everything runs, and only then run on the full data set.
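A minimal sketch of carving out such a demo subset (assuming a plain Python list of examples; the 1,000/1,000/1,000 split sizes follow the text above and are just illustrative):

```python
import random

def make_demo_subset(examples, n_train=1000, n_dev=1000, n_test=1000, seed=42):
    """Sample a small train/dev/test split for quick smoke tests."""
    random.seed(seed)
    sample = random.sample(examples, n_train + n_dev + n_test)
    return (sample[:n_train],
            sample[n_train:n_train + n_dev],
            sample[n_train + n_dev:])
```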
How to fine-tune BERT for text classification tasks [2]
How to truncate text
Since BERT supports a maximum length of 512 tokens, how to truncate the text is also a critical issue. [2] discusses three approaches:
- Head-only: keep the first 510 tokens (two slots are reserved for [CLS] and [SEP])
- Tail-only: keep the last 510 tokens
- Head + tail: keep the first 128 tokens and the last 382 tokens
Head + tail works best in the authors' tests on the IMDB and Sogou data sets, so in practice it is worth trying all three ideas and perhaps tweaking them a little.
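A minimal sketch of the three truncation schemes (assuming `tokens` is the already-tokenized text, and following the 128 + 382 split from [2]):

```python
def truncate_tokens(tokens, method='head_tail', max_tokens=510):
    """Truncate a token list to fit BERT's 512 limit ([CLS]/[SEP] not included here)."""
    if len(tokens) <= max_tokens:
        return tokens
    if method == 'head':      # head-only: keep the first 510 tokens
        return tokens[:max_tokens]
    if method == 'tail':      # tail-only: keep the last 510 tokens
        return tokens[-max_tokens:]
    # head + tail: first 128 tokens plus last 382 tokens
    return tokens[:128] + tokens[-382:]
```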
Multi-segment strategy
Another approach is to divide the text into multiple segments, each no longer than 512 tokens, so that the entire text is covered. But does this really help for classification tasks?
Personally, I think the benefit is small. Just as when we read an article, we usually know its topic and category well before reaching the end; only in rare cases is it ambiguous, and the experiments do show that this strategy brings no improvement.
In [2], the text is first divided into segments and each segment is encoded separately. Three strategies are used to fuse the segment information:
- Multi-segment + mean: average the representations of the segments
- Multi-segment + max: take an element-wise max over the segments
- Multi-segment + self-att: add an attention layer to fuse the segments
The experiments show no improvement, and sometimes even a drop. I ran the multi-segment strategy on the CNews data set myself and saw only a marginal gain of 0.09 percentage points, which is essentially no improvement. I will test more data sets later; the results can be found in my repository: Bert-TextClassification.
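For reference, a rough sketch of the multi-segment idea with mean or max pooling (the splitting and the `encode_segment` helper are hypothetical placeholders for however you obtain a per-segment BERT vector; this is not the code from [2] or from my repository):

```python
import torch

def split_into_segments(tokens, seg_len=510):
    """Split a long token list into consecutive segments of at most seg_len tokens."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

def encode_long_text(tokens, encode_segment, pooling='mean'):
    """encode_segment(segment) -> 1D tensor, e.g. BERT's [CLS] vector for that segment."""
    seg_vecs = torch.stack([encode_segment(seg) for seg in split_into_segments(tokens)])
    if pooling == 'mean':   # multi-segment + mean
        return seg_vecs.mean(dim=0)
    if pooling == 'max':    # multi-segment + max
        return seg_vecs.max(dim=0).values
    raise ValueError('self-attention fusion needs an extra attention layer on seg_vecs')
```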
Catastrophic Forgetting
As a representative of transfer learning in NLP, does BERT forget important old knowledge while learning new knowledge, i.e., does BERT suffer from catastrophic forgetting?
[2] finds that a low learning rate, such as 2e-5, is the key to BERT overcoming catastrophic forgetting; the default learning rate in pytorch-pretrained-bert is 5e-5, which is consistent with this view.
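A minimal sketch of setting such a low fine-tuning learning rate (using plain torch.optim.AdamW here for simplicity and assuming `model` is your BERT classifier; pytorch-pretrained-bert also ships its own BertAdam optimizer with warmup, and the exact choice is up to you):

```python
import torch

# A small learning rate (around 2e-5) helps BERT avoid catastrophic forgetting
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```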
Whether pre-training is required
Fine-tuning alone is already powerful, but whether further pre-training can improve results again, and how to do it, is still worth exploring.
The first question: can further pre-training improve results? Most likely yes. How much depends on the data set itself, and the experiments in [2] show improvements of varying degrees on most data sets.
The second question: how to do this pre-training, or rather, what data to pre-train on. There are three main strategies:
- Pre-train on the task's own data set. The experiments in [2] show that this approach is very likely to improve results.
- Pre-train on data from the same domain. In general this works better than strategy 1 and the data is easier to obtain, but it can be noisy if the data comes from different sources.
- Pre-train on cross-domain data. This strategy brings less improvement than the two above, because BERT itself has already been pre-trained on high-quality, large-scale cross-domain data.
Overall, strategy 2 works best, as long as the data quality is maintained.
After all, more data is not necessarily better.
PyTorch multi-GPU parallelism
Let's start with how PyTorch handles multiple GPUs internally; understanding this will help with tuning.
Like most deep learning frameworks, PyTorch uses data parallelism to train a model on multiple GPUs, but it has its quirks.
Specifically, PyTorch first loads the model onto the primary GPU (typically device_id=0) and then copies it to every other GPU. A batch of data is split by the number of GPUs and each slice is fed to its GPU, where the forward pass runs independently. During back-propagation, PyTorch gathers the model outputs from all GPUs onto GPU 0, so GPU 0 uses noticeably more memory and the loss and gradient computation is concentrated there. Once the gradients are computed, they are copied to the other GPUs and the backward update proceeds.
Actual experiments show that for classification tasks the memory gap between GPUs is not obvious, but for language modeling, with its large output layer, GPU 0 may run out of memory.
How many GPUs do we need?
For small data sets, 1 or 2 GPUs are usually enough. With multiple GPUs you have to factor in communication time, and inter-GPU transfer is relatively slow, so for small data sets too many GPUs actually make things slower. For large data sets you have to judge for yourself: try a few GPU counts and pick the one that saves the most time while still leaving enough GPU memory.
Using multiple GPUs in PyTorch
With multiple GPUs, the model-related calls change, mainly in the following places:
```python
# Model definition
if n_gpu > 1:
    model = nn.DataParallel(model, device_ids=[0, 1, 2])

# Loss calculation
if n_gpu > 1:
    loss = loss.mean()  # mean() to average across GPUs

# Saving the model
model_to_save = model.module if hasattr(model, 'module') else model
torch.save(model_to_save.state_dict(), output_model_file)
```
As you can see, the most important change is in the model definition, and it is worth understanding why.
Load balancing with multiple GPUs
With multiple GPUs on a single machine, GPU 0 uses more memory than the others, because the model outputs and the related gradient tensors are all gathered onto GPU 0, as described above. This is PyTorch's load-balancing problem.
This load-balancing problem is not serious for ordinary tasks such as classification, but it is critical for tasks with a large output layer such as language modeling, which reinforces what I said before: training language models is not something ordinary labs and companies can easily afford.
Personally, this problem has not bothered me much, so I have not dug into it deeply; after all, the gradient-accumulation trick already goes a long way.
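A minimal sketch of the gradient-accumulation trick mentioned above (assuming a generic training loop; `model`, `optimizer`, `train_dataloader`, `n_gpu`, and `accumulation_steps` are placeholders for your own setup):

```python
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    loss = model(**batch)                     # however your model returns the loss
    if n_gpu > 1:
        loss = loss.mean()                    # average per-GPU losses under DataParallel
    (loss / accumulation_steps).backward()    # scale so accumulated gradients average out
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```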
There are two solutions to PyTorch's load-balancing problem:
- The first is to use distributed data parallelism (see the sketch after this list).
- The other is to roll your own; see [1]. Since I have not explored this in depth, I will not say more.
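For the first option, a rough sketch of switching to DistributedDataParallel (one process per GPU, typically launched with `python -m torch.distributed.launch`; `local_rank` is provided by the launcher and the details here are only indicative, not a complete training script):

```python
import torch
import torch.distributed as dist

# One process per GPU; local_rank is passed in by the launcher (e.g. via --local_rank)
dist.init_process_group(backend='nccl')
torch.cuda.set_device(local_rank)

model = model.cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```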
In closing
This article is really more of a set of notes, a summary of the pitfalls I have hit recently. I believe most people will run into these problems to some degree, so I am sharing them for your reference. I don't want to make it any longer, so I'll leave it at that.
If you think it helps, just give it a thumbs up and go. After all, writing is not easy.
References
[1] Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed Setups
[2] How to Fine-Tune BERT for Text Classification?