On July 8, CLUE, an authoritative evaluation benchmark for Chinese language understanding, released the latest results of its Chinese few-shot learning leaderboard. The Alibaba Cloud PAI (Platform of Artificial Intelligence) team and the DAMO Academy Intelligent Dialogue and Service Technology team ranked first overall in both the large-model track and the unlimited-parameter track, and also placed first in the final defense.
Source | Alibaba Tech official account
1. Overview
On July 8, CLUE, an authoritative evaluation benchmark for Chinese language understanding, released the latest results of its Chinese few-shot learning leaderboard. The Alibaba Cloud PAI team, together with the Intelligent Dialogue and Service Technology team of DAMO Academy, took first place overall in both the large-model track and the unlimited-parameter track, as well as first place in the final defense.
Since its inception, CLUE has released a number of NLP evaluation benchmarks, including classification, reading comprehension, and natural language inference leaderboards, which have had a profound influence in both academia and industry. FewCLUE is CLUE's latest Chinese few-shot learning benchmark, designed to evaluate whether machine learning models can master specific natural language processing tasks with very few samples. Based on this evaluation, researchers can more accurately measure the generalization ability and accuracy of trained models. For example, in an intelligent customer service scenario, user intent recognition can reach 90% accuracy with only a few dozen manually labeled samples.
Although large-scale pre-trained models achieve strong results on a wide range of tasks, they still require a large amount of annotated data for specific tasks. Because collecting and annotating training data is expensive, few-shot learning has become a key problem to tackle: it uses far less data than classical deep learning algorithms while approaching or even exceeding their accuracy. For this competition, the Alibaba Cloud PAI team and DAMO Academy proposed a joint "large model + few-shot" solution: on top of large-scale general pre-training, they combined knowledge-enhanced pre-training with Fuzzy-PET few-shot learning, achieving excellent results in one go and even surpassing human accuracy on one of the few-shot learning tasks.
2. Problem Analysis and Modeling Approach
The overall characteristics of the data set are as follows:
- Few-shot: the training and test sets contain only 16 shots per class, testing the robustness of the algorithm in few-shot settings
- Generalization: the tasks differ markedly in their characteristics, requiring the model to generalize well
- Unlabeled data: most tasks provide a large amount of unlabeled data, making continued pre-training and self-training worth trying
Based on our reading of the tasks, we designed a three-stage modeling approach:
- Pre-training from scratch on general-domain data: with the acceleration strategies and pre-training toolkit provided by PAI-RapidFormer, we pre-trained Chinese models with roughly 300 million and 1.5 billion parameters from scratch. The pre-training process used the knowledge-enhanced pre-training algorithm (see 3.2 for details).
- Multi-task continued pre-training: the goal is to further improve performance on the sentence-pair matching tasks (OCNLI, BUSTM, CSL). We convert the classification tasks into a textual entailment format and use textual entailment data for continued pre-training, e.g. [CLS]I like the movie[SEP]This indicates positive user sentiment[EOS] (a minimal sketch of this conversion follows the list).
- Few-shot fine-tuning for each task: we chose PET (Pattern-Exploiting Training) as the core downstream fine-tuning method and developed the Fuzzy-PET algorithm, which reduces the fluctuation caused by the manual choice of PET label words and improves task performance. We also apply self-training, a semi-supervised method, using unlabeled data during the downstream fine-tuning stage (see 3.3 for details).
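To make the entailment-style conversion in the second stage concrete, here is a minimal sketch of how a classification example could be recast into the [CLS]...[SEP]...[EOS] entailment format described above. The label descriptions and function name are illustrative assumptions, not the team's actual code.

```python
# Minimal sketch: recasting a sentiment-classification example as a
# text-entailment pair. Label descriptions are made up for illustration.
LABEL_DESCRIPTIONS = {
    "positive": "This indicates positive user sentiment",
    "negative": "This indicates negative user sentiment",
}

def to_entailment_examples(sentence: str, gold_label: str):
    """Pair the sentence with every label description; the pair built from the
    gold label is 'entailment', the others are 'not_entailment'."""
    examples = []
    for label, description in LABEL_DESCRIPTIONS.items():
        text = f"[CLS]{sentence}[SEP]{description}[EOS]"
        target = "entailment" if label == gold_label else "not_entailment"
        examples.append((text, target))
    return examples

print(to_entailment_examples("I like the movie", "positive"))
```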
3. Core Technologies
3.1 PyTorch large-model training acceleration
Since launching the PAI-EasyTransfer framework for NLP and transfer learning in 2020, the PAI team has developed EasyTexMiner, a PyTorch version of EasyTransfer. The models used in this competition were pre-trained with EasyTexMiner's high-performance distributed training, which organically integrates the strengths of Microsoft's DeepSpeed and NVIDIA's Megatron. The overall block diagram is shown below:
EasyTexMiner’s distributed training incorporates the following core technologies:
1) Activation checkpointing
Several checkpoints are placed at intermediate points in the network, and all intermediate activations other than the checkpoints are discarded during the forward pass. When gradients need to be computed, the required intermediate results are recomputed from the nearest checkpoint. This saves GPU memory while avoiding recomputation from the very beginning.
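As a minimal sketch (not the EasyTexMiner implementation itself), PyTorch's built-in utility can apply this idea to a deep stack of layers:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Split 24 layers into 4 segments; only the activations at segment boundaries
# (the "checkpoints") are kept, and the rest are recomputed during backward.
layers = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(24)])
x = torch.randn(8, 1024, requires_grad=True)

out = checkpoint_sequential(layers, segments=4, input=x)
out.sum().backward()   # intermediate activations are recomputed here
```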
2) Gradient Accumulation
Take batch_size=16 as an example: we compute the average gradient over 16 samples each time and accumulate it in a buffer. After four such accumulations, we divide the total gradient by 4 and perform a parameter update, which is equivalent to training with batch_size=64. Gradient accumulation is an effective way to enlarge the effective batch size; combined with the LAMB optimizer, the larger per-step batch improves convergence speed.
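The following is a minimal PyTorch sketch of this strategy; the model, the synthetic data, and the AdamW optimizer are placeholders (a LAMB implementation from a library such as DeepSpeed could be swapped in):

```python
import torch

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
accumulation_steps = 4            # 4 micro-batches of 16 = effective batch of 64

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(16, 128)                          # micro-batch of 16 samples
    y = torch.randint(0, 2, (16,))
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so gradients average
    loss.backward()                                   # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                              # one update per 4 micro-batches
        optimizer.zero_grad()
```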
3) Mixed Precision Training
The main advantages of mixed precision training are as follows:
- Reduced GPU memory footprint: since FP16 uses only half the memory of FP32, it naturally halves the memory consumed during training.
- Faster training and inference: FP16 saves both memory and training time. The core idea, shown in the figure below, is to keep an FP32 master copy of the weights to avoid rounding errors when parameters are updated during backpropagation, while loss scaling mitigates overflow and underflow.
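A minimal sketch of this setup with PyTorch's native AMP, assuming a toy model and synthetic data (the team's actual training loop is more involved):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # FP32 master weights
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(10):
    x = torch.randn(32, 1024, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()        # forward runs in FP16 where safe
    scaler.scale(loss).backward()            # loss scaling avoids FP16 underflow
    scaler.step(optimizer)                   # unscales grads, skips step on overflow
    scaler.update()                          # adjusts the loss scale dynamically
```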
4) Just-in-time (JIT) compilation
When PyTorch executes a series of element-wise tensor operations, the underlying kernels read and write memory repeatedly while doing only a small amount of computation, so most of the time is spent on memory access rather than arithmetic. For example, an element-wise multiply-add kernel over N elements performs N additions but 2N reads and N writes. Kernels that do little computation but a lot of memory access are called memory-bound. To avoid such repeated reads and writes and to reduce kernel-launch overhead, kernel fusion can be used. The core idea of fusing memory-bound kernels is to exploit locality of access: multiple element-wise kernels are automatically merged into one, so intermediate results are never written back to memory, improving memory-access efficiency. At the same time, because several kernels are merged into one, the kernel-launch overhead is incurred only once.
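As a small illustration (whether fusion actually occurs depends on the device and the active fuser backend), TorchScript can fuse a chain of element-wise operations such as a bias-add followed by a tanh-based GeLU approximation into a single kernel:

```python
import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # A chain of element-wise ops that the JIT fuser can merge into one kernel,
    # so intermediate results never round-trip through GPU memory.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.79788456 * (y + 0.044715 * y * y * y)))

out = fused_bias_gelu(torch.randn(1024, 1024), torch.randn(1024))
```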
5) 3D parallelism
The 3D parallel strategy combines data parallelism, model (tensor) parallelism, and pipeline parallelism to train models at the billion-parameter scale and beyond quickly. The technique was pioneered by the DeepSpeed team and significantly speeds up the training of large models.
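As a simple illustration of how the three degrees of parallelism partition a cluster (the numbers below are made up and not the team's actual configuration):

```python
# Every GPU belongs to one tensor(model)-parallel group, one pipeline stage,
# and one data-parallel replica; the three degrees must multiply to the world size.
world_size = 64            # total number of GPUs
tensor_parallel = 4        # each layer's weights split across 4 GPUs
pipeline_parallel = 4      # the layer stack split into 4 pipeline stages

assert world_size % (tensor_parallel * pipeline_parallel) == 0
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
print(f"data-parallel replicas: {data_parallel}")   # 4 full copies of the model
```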
6) CPU Offload
Part of the computation, such as the optimizer updates, is offloaded from the GPU to the CPU, and the intermediate variables involved are stored in host memory. This trades time for space, reducing GPU memory usage so that larger models can fit.
7) ZeRO memory optimizer
ZeRO (the Zero Redundancy Optimizer) is a memory-optimization technique for large-scale distributed deep learning. ZeRO has three main optimization stages (a rough stage-1-style sketch follows the list):
- Optimizer state partitioning (Pos): 4x memory reduction, with the same communication volume as plain data parallelism;
- Adding gradient partitioning (Pos+g): 8x memory reduction, still with the same communication volume as data parallelism;
- Adding parameter partitioning (Pos+g+p): memory reduction scales linearly with the data-parallel degree.
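The team used the DeepSpeed/Megatron stack; purely as an analogous sketch of stage-1-style optimizer-state sharding, PyTorch's ZeroRedundancyOptimizer partitions optimizer states across data-parallel ranks (assuming the usual torchrun environment variables are set):

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def train(local_rank: int):
    dist.init_process_group(backend="nccl")            # assumes torchrun env vars
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                device_ids=[local_rank])
    # Each rank keeps and updates only its shard of the AdamW states (stage-1 idea).
    optimizer = ZeroRedundancyOptimizer(model.parameters(),
                                        optimizer_class=torch.optim.AdamW,
                                        lr=1e-4)
    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```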
Throughput performance evaluation
This work used Alibaba Cloud's latest EFLOPS AI cluster, with NVIDIA A100 GPUs and 100 Gbps Mellanox CX6-DX network cards, combined with the topology-aware high-performance collective communication library ACCL and the EFLOPS cluster's multi-rail network, achieving congestion-free communication and greatly accelerating model training. The results are shown below:
Scalability measurement
To evaluate scalability under model parallelism, we used a model larger than BERT-Large that cannot fit on a single card, configured with num-layers=24, hidden-size=2048, and num-attention-heads=32, for a total of about 1.2 billion parameters. Throughput was measured on 8/16/32/64 cards; as shown in the figure below, throughput increased almost linearly with the number of cards.
3.2 KGBERT: the knowledge-enhanced pre-training algorithm
On top of the general pre-trained model, we further introduce knowledge-enhanced pre-training to improve its effectiveness.
Data and knowledge: large-scale, high-quality, and diverse data and knowledge were acquired through collaboration with the DAMO Academy NLP data team.
- Large scale: 500 million entries of Chinese knowledge-graph knowledge and 200 million sentence-SPO pairs obtained through distant supervision;
- High quality: because the raw corpus is huge and noisy, with a great deal of redundancy, hundreds of millions of high-quality sentence-SPO pairs were selected with the DSGAN knowledge-denoising algorithm for model training;
- Diversity: besides the general domain, the FewCLUE dataset covers vertical industries such as e-commerce, tourism, education, and finance, where data and knowledge are scarce. We therefore built a knowledge production system that automatically extracts triples from documents and web pages in various vertical industries, greatly enriching the available knowledge.
Model and pre-training tasks
To use knowledge efficiently, we designed multi-granularity semantic-understanding pre-training tasks based on a "sentence - positive SPO - negative SPO" aligned corpus:
- Mention Detection: strengthens the model's understanding of core entity mentions;
- Sentence-SPO joint mask: large-scale text and its corresponding SPO knowledge are fed into the pre-training model together for joint masked pre-training, promoting information sharing between structured knowledge and unstructured text and improving the model's semantic understanding;
- SPO Margin Magnify: a contrastive-learning pre-training task that pushes apart the semantics of sentence-related SPOs and sentence-irrelevant SPOs, giving the model stronger semantic discrimination ability.
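The exact loss is not spelled out in this article; the following is a hedged sketch of a margin-based objective in the spirit of SPO Margin Magnify, where the embeddings and margin value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def spo_margin_loss(sent_emb: torch.Tensor,
                    pos_spo_emb: torch.Tensor,
                    neg_spo_emb: torch.Tensor,
                    margin: float = 0.3) -> torch.Tensor:
    """Pull a sentence toward its related SPO and push it away from an
    unrelated SPO by at least `margin` (hinge on the cosine-similarity gap)."""
    pos_sim = F.cosine_similarity(sent_emb, pos_spo_emb, dim=-1)
    neg_sim = F.cosine_similarity(sent_emb, neg_spo_emb, dim=-1)
    return F.relu(margin - (pos_sim - neg_sim)).mean()

loss = spo_margin_loss(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768))
```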
Technical innovation: a knowledge screening and fusion mechanism
1) Motivation
NLP tasks are usually modeled on the natural-language input alone, which exploits only the local, literal information of the current text. This differs greatly from how humans understand language: we draw on what we have already learned to aid comprehension, and without such external knowledge, for example when reading about an unfamiliar field, it is hard to fully grasp the semantics. Current common NLP practice, however, uses only the input text and no external knowledge, which limits the depth of understanding.
In practice, knowledge is vast and messy, so targeted knowledge sampling is needed to reduce the amount of irrelevant knowledge introduced and maximize the benefit of the knowledge that is used.
2) Method
We designed a novel gated mechanism: the sentence is encoded first, the relevant subgraph information is then aggregated with a GCN, and a gate controls how much of this information flows in. In the pre-training stage, an objective that maximizes knowledge gain is used so that the model learns valuable information more effectively.
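A minimal sketch of the gating idea, with the GCN aggregation omitted and all dimensions and module names assumed for illustration (not the published architecture):

```python
import torch
import torch.nn as nn

class GatedKnowledgeFusion(nn.Module):
    """Combine a sentence vector with a knowledge vector (e.g. aggregated from
    the relevant subgraph by a GCN) through a sigmoid gate that controls how
    much external knowledge flows into the final representation."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, sent_vec: torch.Tensor, kg_vec: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([sent_vec, kg_vec], dim=-1)))
        return sent_vec + g * kg_vec      # gate decides how much knowledge enters

fusion = GatedKnowledgeFusion()
out = fusion(torch.randn(4, 768), torch.randn(4, 768))
```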
3) Results
Knowledge screening based on the gated mechanism effectively captures high-gain triples for fusion, improving accuracy by 2% on attribute-recognition tasks in the government-affairs and finance domains. This knowledge-screening mechanism has also been validated on public academic datasets, where it achieves state-of-the-art results; the related work was published at SIGIR 2021.
3.3 Few-shot learning algorithm
On top of the knowledge-enhanced pre-trained language model, the PAI team and DAMO Academy jointly developed Fuzzy-PET, a self-developed multi-task few-shot learning algorithm. Because the FewCLUE benchmark contains a range of different task types, a model that learns cross-task transferable knowledge before being fine-tuned on a specific few-shot task starts from better initial parameters. Building on the PAI team's accumulated work on meta-learning, we introduced the unlabeled data of multiple FewCLUE tasks into the continued pre-training stage of the knowledge-enhanced language model; during this stage, the model automatically learns background knowledge of these tasks from task-related data, which benefits subsequent few-shot learning on each specific task. The related meta-learning algorithms were published at EMNLP 2020 and ACL 2021.
In the few-shot learning stage for each specific task, we improve on Pattern-Exploiting Training (PET) by introducing a Fuzzy Verbalizer Mapping mechanism. For example, in classic PET for FewCLUE's OCNLI task, we construct a template such as: "Actually, I don't think you know much about ball games." "You don't understand basketball." The relationship between the two sentences is [MASK].
For the masked-language-model output at the [MASK] position (the verbalizer), if the predicted word is "related" we map it to the category label "entailment"; if it is "unrelated" we map it to "neutral"; and if it is "opposite" we map it to "contradiction". Through this manual mapping from verbalizer words to category labels, PET turns text classification into a masked-language-modeling problem. In the Fuzzy Verbalizer Mapping mechanism, we instead assume that several verbalizer words can map to the same category label, which further improves the model's generalization during few-shot learning. Continuing the example above, we designed three groups of label words: (related, unrelated, opposite), (implied, neutral, contradictory), and (contained, neutral, reverse). During training, each sample is paired with all groups of label words; during inference, the predicted probabilities of all candidate words belonging to each category are summed, and the category with the highest total probability is chosen. In the example above, if the summed probability of "related", "implied", and "contained" is greater than that of "unrelated", "neutral", and "neutral", and also greater than that of "opposite", "contradictory", and "reverse", the prediction is "entailment".
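A minimal sketch of the inference-time scoring under Fuzzy Verbalizer Mapping, where the vocabulary ids of the label words are made-up placeholders:

```python
import torch

VERBALIZERS = {                       # category -> vocab ids of its label words
    "entailment":    [101, 202, 303], # e.g. "related", "implied", "contained"
    "neutral":       [104, 205, 306], # e.g. "unrelated", "neutral", "neutral"
    "contradiction": [107, 208, 309], # e.g. "opposite", "contradictory", "reverse"
}

def fuzzy_verbalizer_predict(mask_logits: torch.Tensor) -> str:
    """mask_logits: [vocab_size] MLM logits at the [MASK] position."""
    probs = torch.softmax(mask_logits, dim=-1)
    # Sum the probabilities of every candidate label word per category,
    # then pick the category with the highest total probability.
    scores = {label: probs[ids].sum().item() for label, ids in VERBALIZERS.items()}
    return max(scores, key=scores.get)

print(fuzzy_verbalizer_predict(torch.randn(21128)))  # 21128 ≈ Chinese BERT vocab size
```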
This mechanism has a positive effect on prediction accuracy across several FewCLUE tasks and, to some extent, reduces the fluctuation caused by the manual choice of different label words. In addition, we introduce unlabeled data for self-training in the few-shot learning stage: the current model labels the unlabeled data, and the model is then iteratively refined on these pseudo-labels.
4. Business & Products
It is worth mentioning that, built on the PAI machine learning platform, this technology has already been deployed in real business scenarios with good results. It strengthens the KBQA capability of DAMO Academy's cloud intelligent service, enabling fast cold start and accurate question answering, and has been applied in multiple business scenarios across government affairs, finance, and general-purpose domains. In practical projects, accurate Q&A with a fast cold start can be achieved with only a small number of samples (20). These techniques are also expected to give machine learning algorithms on Alibaba Cloud the ability to learn from small samples, greatly improving downstream task performance with little annotated data. This means Alibaba Cloud's models can be deployed at low cost and high speed, enabling efficient and agile business.
Based on PAI, Alibaba Cloud aims to build end-to-end large-scale AI capabilities, from underlying chips to distributed systems and up to large-scale algorithms and data, forming systematic AI engineering capabilities to serve all industries. Currently, the PAI platform supports accelerated training with up to 100 billion features and 100 trillion samples, provides more than 200 built-in mature algorithms, and offers over 50 high-quality pre-trained deep learning models for computer vision, audio/video, text, and other domains, comprehensively improving the efficiency of enterprise AI engineering. On top of these platform capabilities, PAI also provides mature industry solutions and has become the preferred choice of many enterprises, with commercial deployments in scenarios such as intelligent recommendation, user growth, super-resolution, and autonomous driving.
This article is original content from Alibaba Cloud and may not be reproduced without permission.