On July 8, CLUE, the authoritative Chinese language understanding evaluation benchmark, released the latest results of its few-shot learning leaderboard. The PAI team from Alibaba Cloud's computing platform and the Intelligent Dialogue and Service Technology team from DAMO Academy won first place in overall score on both the large-model and non-parametric-model tracks, as well as first place in the final defense.
Authors | Tongrun, Guiyu, Xiongxi    Source | Alibaba Tech official account
Overview
On July 8, the authoritative Chinese language understanding evaluation benchmark CLUE released the latest results of its few-shot learning leaderboard. The PAI team from Alibaba Cloud's computing platform and the Intelligent Dialogue and Service Technology team from DAMO Academy won first place on both the large-model and non-parametric-model tracks, as well as first place in the final defense.
Since its founding, CLUE has released a number of NLP evaluation benchmarks, including a classification leaderboard, a reading comprehension leaderboard, and a natural language inference leaderboard, all of which have had a profound impact in academia and industry. FewCLUE is a new few-shot learning benchmark launched by CLUE to evaluate whether machine learning models can master a specific natural language processing task with very few samples. Based on this evaluation, researchers can more accurately measure the generalization ability and accuracy of the trained models. For example, in an intelligent customer service scenario, user intent recognition can reach 90% accuracy with only dozens of manually labeled samples.
As is well known, although large-scale pre-trained models have achieved impressive results on a wide range of tasks, they still need a large amount of labeled data for specific tasks. Because collecting and labeling training data is expensive, few-shot learning techniques are needed: they require far less data than classical deep learning algorithms while achieving accuracy close to, or even beyond, the classical approaches. This time, the Alibaba Cloud PAI team and DAMO Academy proposed a joint large-model + few-shot solution: on top of large-scale general pre-training, they combined knowledge-based pre-training with Fuzzy-PET few-shot learning and achieved excellent results in one go, even surpassing human accuracy on one of the few-shot learning tasks.
Competition Analysis & Modeling Approach
The overall characteristics of the competition data set are as follows:
- Small sample: both the training and test sets contain only 16 shots per category, testing the robustness of algorithms in few-shot scenarios
- Generalization: the tasks differ markedly in their characteristics, so the model needs good generalization ability
- Unlabeled data: most tasks provide a fair amount of unlabeled data, so Continued Pretrain and Self-Training are worth trying
Based on our interpretation of the competition tasks, we designed a three-stage modeling approach:
- Pre-training from scratch on general-domain data: with the help of the various acceleration strategies and pre-training toolkits provided by PAI-RapidFormer, we pre-trained Chinese models with roughly 300 million and 1.5 billion parameters from scratch; the pre-training process used the knowledge-infused pre-training algorithm (see 3.2 for details).
- Continued pre-training for multiple tasks: the goal is to further improve performance on the two-sentence matching tasks (OCNLI, BUSTM, CSL). We convert the classification task into a textual entailment task and use the entailment-formatted data for Continued Pretrain, e.g. [CLS]I like the movie[SEP]This indicates positive user sentiment[EOS] (see the sketch after this list).
- Few-shot fine-tuning for each task: we selected PET (Pattern-Exploiting Training) as the core method for downstream fine-tuning and developed the Fuzzy-PET algorithm to reduce the fluctuation caused by the manual choice of label words in PET and to improve performance on the tasks. We also applied self-training, a semi-supervised method, to leverage unlabeled data in the downstream fine-tuning stage (see 3.3 for details).
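As a minimal illustration of the entailment-style conversion mentioned in the second stage, the sketch below shows how a classification example might be rewritten into the [CLS]...[SEP]...[EOS] format; the helper name and exact preprocessing are illustrative assumptions, not the team's actual code.

```python
# Minimal sketch: converting a classification example into an
# entailment-style input for continued pre-training (illustrative only).
def to_entailment_example(sentence: str, hypothesis: str) -> str:
    """Concatenate the original sentence and a label-describing hypothesis
    using the special-token layout shown in the bullet above."""
    return f"[CLS]{sentence}[SEP]{hypothesis}[EOS]"

# Example usage, mirroring the sample in the bullet above.
print(to_entailment_example("I like the movie",
                            "This indicates positive user sentiment"))
# -> [CLS]I like the movie[SEP]This indicates positive user sentiment[EOS]
```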
Three core technologies
1. PyTorch large-model training acceleration
After launching the EasyTransfer framework for NLP and transfer learning in 2020, the PAI team developed a PyTorch version of EasyTransfer named EasyTexMiner. The models used in the competition were trained with EasyTexMiner's high-performance distributed pre-training. EasyTexMiner's distributed training organically combines Microsoft's DeepSpeed and NVIDIA's Megatron.
EasyTexMiner’s distributed training incorporates the following core technologies:
1) Activation Checkpoint
A number of checkpoints are set in the middle of the neural network, and all intermediate results other than the checkpoints are discarded. During back-propagation, when an intermediate result is needed, it is recomputed starting from the nearest checkpoint. This saves GPU memory while avoiding the cost of recomputing everything from the beginning.
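As a concrete illustration, the following hedged sketch uses PyTorch's built-in torch.utils.checkpoint on a toy model; this is not necessarily the exact mechanism inside EasyTexMiner.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy model: activations inside each block are discarded during the forward
# pass and recomputed from the checkpoint boundary during backward.
class CheckpointedMLP(nn.Module):
    def __init__(self, hidden=1024, layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
             for _ in range(layers)])

    def forward(self, x):
        for block in self.blocks:
            # Only the block's input is kept; its internals are recomputed on backward.
            x = checkpoint(block, x)
        return x

x = torch.randn(8, 1024, requires_grad=True)
CheckpointedMLP()(x).sum().backward()
```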
2) Gradient Accumulation
Taking batch_size=16 as an example: we compute the gradient of 16 samples at a time and accumulate it in a buffer; after 4 such micro-batches we divide the accumulated gradient by 4 and perform one parameter update. The effect is equivalent to batch_size=64. This is an effective way to enlarge the batch size: with this strategy the per-step batch size can be greatly expanded, and convergence can be further accelerated when combined with the LAMB optimizer.
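The following minimal PyTorch sketch (dummy model and data, AdamW instead of LAMB for simplicity) illustrates the accumulation loop described above.

```python
import torch
import torch.nn as nn

# Minimal sketch of gradient accumulation: micro-batches of 16, accumulated
# 4 times, approximate an effective batch size of 64 (model/data are dummies).
model = nn.Linear(32, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(16):
    inputs = torch.randn(16, 32)             # micro-batch of 16 samples
    labels = torch.randint(0, 2, (16,))
    loss = loss_fn(model(inputs), labels) / accumulation_steps  # scale so summed grads average out
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # one update per 4 micro-batches
        optimizer.zero_grad()
```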
3) Mixed Precision Training
The benefits of using mixed precision training mainly include the following two points:
- Reduced memory footprint: since FP16 occupies half the memory of FP32, it naturally cuts the memory used during training roughly in half.
- Besides saving memory, FP16 also shortens model training time by speeding up training and inference computation. The principle is illustrated in the figure below: when parameters are updated during back-propagation, an FP32 master copy is maintained to avoid rounding errors, and loss scaling is used to mitigate underflow/overflow errors (see the sketch below).
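A minimal sketch of this pattern with PyTorch's native AMP utilities is shown below; the model and data are toys, not the competition setup, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

# Minimal mixed-precision sketch with torch.cuda.amp: eligible ops run in
# FP16 under autocast while weights stay in FP32, and GradScaler applies
# loss scaling to avoid underflow (dummy model/data; requires a CUDA device).
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()      # handles loss scaling

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run eligible ops in FP16
        loss = model(x).float().pow(2).mean()
    scaler.scale(loss).backward()         # scale the loss before backward
    scaler.step(optimizer)                # unscale grads, skip step on overflow
    scaler.update()                       # adjust the scale factor
```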
4) JIT compilation on the fly
When PyTorch executes a chain of element-wise tensor operations, the underlying kernels perform repeated memory reads and writes but relatively little computation; most of the time is spent on memory access rather than arithmetic. For example, a multiply/add kernel for a tensor with N elements requires N additions but 2N reads and N writes. We call such kernels, with little computation and heavy memory traffic, memory-bound. To avoid repeated reads and writes and to reduce kernel-launch overhead, kernel fusion can be applied. The core idea of kernel fusion for memory-bound operators is to exploit memory-access locality and automatically merge multiple element-wise kernels into a single kernel, avoiding writing intermediate results back to memory and thereby improving memory-bandwidth utilization. At the same time, because several kernels are merged into one, the kernel-launch cost is paid only once.
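The following hedged TorchScript sketch illustrates the idea on a bias-add plus GELU chain of element-wise ops; whether these particular ops are what EasyTexMiner fuses is an assumption.

```python
import torch

# Element-wise kernel fusion sketch: TorchScript's JIT can fuse the chain of
# pointwise ops below into fewer CUDA kernels, cutting the read/write traffic
# described above (a CUDA device is needed for fusion to take effect).
@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias                        # element-wise add
    # tanh approximation of GELU, a chain of pointwise multiplies/adds
    return y * 0.5 * (1.0 + torch.tanh(0.79788456 * (y + 0.044715 * y * y * y)))

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = fused_bias_gelu(x, bias)          # first calls trigger JIT compilation/fusion
```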
5) 3D parallelism
The 3D parallel strategy refers to the combined use of three strategies, data parallelism, model parallelism, and pipeline parallelism, in order to rapidly train models at the 10-billion to 100-billion parameter scale. The technique, originally developed by the DeepSpeed team, greatly speeds up the training of large models.
6) CPU Offload
Part of the backward computation is performed on the CPU rather than the GPU, and the intermediate variables involved are kept in host memory. This saves GPU memory by trading time for space, so that a larger model can fit on the same hardware.
7) ZeRO memory optimizer
ZeRO (the Zero Redundancy Optimizer) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO has three main optimization stages:
- Optimizer state partitioning (Pos): 4x memory reduction, with the same communication volume as data parallelism;
- Adding gradient partitioning (Pos+g): 8x memory reduction, still with the same communication volume as data parallelism;
- Adding parameter partitioning (Pos+g+p): memory reduction is linear with the data-parallelism degree (a configuration sketch combining ZeRO with the CPU offload described above follows).
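For illustration, a hedged DeepSpeed configuration sketch that enables ZeRO stage-2 partitioning together with CPU offload of optimizer states might look as follows; the values are placeholders, not the team's actual settings.

```python
# Hedged sketch of a DeepSpeed config enabling ZeRO partitioning plus CPU
# offload of optimizer states (illustrative values only).
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},                  # mixed precision, see 3) above
    "zero_optimization": {
        "stage": 2,                             # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"}  # trade time for GPU memory
    }
}

# Typical use (assuming DeepSpeed is installed and `model` is defined):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```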
Throughput performance evaluation
This work used the latest Alibaba Cloud EFLOPS AI cluster system, with NVIDIA A100 GPUs and 100 Gbps Mellanox CX6-DX network cards, combined with the topology-aware high-performance distributed communication library ACCL and the multi-rail network capability of the EFLOPS cluster. Congestion-free communication greatly accelerates model training, as shown in the figure below:
Scalability measurement
For the scalability evaluation under model parallelism, we used a model slightly larger than BERT-Large that cannot fit on a single card: num-layers=24, hidden-size=2048, num-attention-heads=32, for a total of about 1.2B parameters. We measured throughput on 8/16/32/64 cards; as the figure below shows, throughput increases almost linearly with the number of cards.
2. Knowledge-infused pre-training algorithm KGBERT
On top of the general pre-trained model, we sought to improve its effectiveness by incorporating knowledge into pre-training. Data and knowledge: through cooperation with DAMO Academy's NLP data team, we obtained large-scale, high-quality, and diverse data and knowledge.
- Large-scale: 500 million items of Chinese knowledge-graph knowledge and 200 million Sentence-SPO pairs obtained through distant supervision;
- High quality: to deal with a raw corpus that is large, complex, redundant, and noisy, hundreds of millions of high-quality Sentence-SPO pairs were selected for model training via the DSGAN knowledge-denoising algorithm;
- Diversity: the FewCLUE datasets cover not only the general domain but also vertical industries such as e-commerce, tourism, education, and finance, where data and knowledge are scarce. We therefore built a knowledge production system that automatically extracts triples from documents and web pages of various vertical industries, greatly enriching the available knowledge.
Models and pre-training tasks
To use the knowledge efficiently, we designed multi-granularity semantic understanding pre-training tasks based on aligned "Sentence - positive SPO - negative SPO" corpora:
- Mention Detection: enhances the model's understanding of core entity mentions;
- Sentence-SPO Joint Mask: large-scale text data and its corresponding SPO knowledge are fed into the pre-training model simultaneously for joint training, promoting information sharing between structured knowledge and unstructured text and improving the model's semantic understanding;
- SPO Margin Magnify: a contrastive-learning pre-training task is designed to widen the semantic gap between relevant and irrelevant SPO, giving the model stronger semantic discrimination ability (see the sketch below).
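As an illustration of the contrastive idea (the team's exact loss is not given in the text), a margin-based objective over sentence and SPO embeddings could look like this hedged sketch; the function name, similarity measure, and margin value are assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged sketch in the spirit of "SPO Margin Magnify": score the relevant SPO
# higher than the irrelevant SPO by at least a margin.
def spo_margin_loss(sent_emb, pos_spo_emb, neg_spo_emb, margin=0.5):
    pos_score = F.cosine_similarity(sent_emb, pos_spo_emb, dim=-1)
    neg_score = F.cosine_similarity(sent_emb, neg_spo_emb, dim=-1)
    # Penalize cases where the irrelevant SPO is not at least `margin` lower.
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()

sent = torch.randn(8, 256)                     # dummy sentence embeddings
pos, neg = torch.randn(8, 256), torch.randn(8, 256)
loss = spo_margin_loss(sent, pos, neg)
```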
Technological innovation: the knowledge screening and fusion mechanism
1) Motivation
In NLP tasks, models are usually built only on the natural language of the current input, which captures only local, literal information. This is quite different from how humans understand language: we draw on previously learned knowledge to aid understanding. Humans use such external knowledge to enhance comprehension; without it, for example in an unfamiliar domain, we can hardly grasp the semantics fully. However, common NLP practice relies only on the input information without external knowledge, so its level of understanding is comparatively low.
In reality, knowledge is vast and complex, so targeted screening of knowledge is needed to reduce the introduction of irrelevant knowledge and maximize the benefit of the knowledge that is used.
2) Method
We design a novel target-oriented mechanism. First, the target sentence is encoded and sub-graph information is aggregated through a GCN. Then a gate, conditioned on the target, controls the inflow of knowledge. In the pre-training stage, an objective that maximizes knowledge gain is designed so that the model learns to absorb the most valuable information (a sketch of the gating idea follows).
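A hedged sketch of such a target-conditioned gate is shown below; the module name and dimensions are illustrative, and the GCN encoder that produces the sub-graph representation is abstracted away.

```python
import torch
import torch.nn as nn

# A gate computed from the sentence (target) representation controls how much
# GCN-aggregated sub-graph information flows into the fused representation.
class KnowledgeGate(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.gate = nn.Linear(hidden * 2, hidden)

    def forward(self, sent_repr, graph_repr):
        # sent_repr: sentence encoding; graph_repr: GCN-aggregated sub-graph encoding
        g = torch.sigmoid(self.gate(torch.cat([sent_repr, graph_repr], dim=-1)))
        return sent_repr + g * graph_repr   # the gate decides how much knowledge flows in

fused = KnowledgeGate()(torch.randn(4, 768), torch.randn(4, 768))
```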
3) Results
Knowledge screening based on the gating mechanism effectively captures high-gain triples for fusion, yielding a 2% accuracy improvement on government-affairs and finance attribute-recognition tasks. The knowledge screening mechanism has also been validated on academic open datasets, where it achieves SOTA results; the related work has been published at SIGIR 2021.
3. Few-shot learning algorithm
Based on the knowledge-infused pre-trained language model, the computing-platform PAI team and the DAMO Academy team jointly developed Fuzzy-PET, a self-developed multi-task few-shot learning algorithm. Because the FewCLUE leaderboard contains a collection of tasks of different types, a model that learns transferable knowledge across tasks before being fine-tuned on a specific task with few samples will start from better initial parameter settings. Building on the PAI team's accumulated work on meta-learning algorithms, we introduced unlabeled data from several FewCLUE tasks into the continued pre-training stage of the knowledge-infused language model. During this process, the model automatically learns background knowledge about these tasks from task-related data, which is more conducive to few-shot learning on specific tasks. The related meta-learning algorithms have been published at EMNLP 2020 and ACL 2021.
In the few-shot learning stage for each specific task, we improved the Pattern-Exploiting Training (PET) algorithm and introduced Fuzzy Verbalizer Mapping. For example, in the classic PET algorithm, for FewCLUE's OCNLI task we construct templates of the form: "'Actually, I don't think you understand the game.' and 'You don't understand basketball.' The relationship between the two is [MASK]."
For the model's output at the masked language token (i.e., the verbalizer): if the predicted word is "related", we map it to the category label "entailment"; if it is "unrelated", we map it to "neutral"; if it is "opposite", we map it to "contradiction". Through this manual mapping from verbalizers to category labels, PET models the text classification task as masked language modeling. In the Fuzzy Verbalizer Mapping mechanism, we instead assume that multiple verbalizers may map to each category label, which further improves the model's generalization during few-shot learning. Continuing the example above, we design three groups of label words: (related, unrelated, opposite), (implied, neutral, contradictory), and (contained, neutral, reverse). During training, all groups of label words are used for each sample; during inference, the predicted probabilities of all candidate words belonging to each category are summed, and the category with the highest total probability is chosen. In the example above, if the summed probability of "related", "implied", and "contained" is greater than that of "unrelated", "neutral", and "neutral", and also greater than that of "opposite", "contradictory", and "reverse", then the prediction is "entailment". A sketch of this inference-time scoring is given below.
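The inference-time scoring can be sketched as follows; the token ids, vocabulary size, and function name are made up for illustration of the summation described above, not the team's implementation.

```python
import torch

# Hedged sketch of Fuzzy Verbalizer Mapping at inference time: sum the
# predicted probabilities of all candidate label words per class and pick
# the class with the highest total.
def fuzzy_verbalizer_predict(mask_token_probs, verbalizer):
    """mask_token_probs: vocab-size probability vector at the [MASK] position.
    verbalizer: dict mapping each class label to a list of candidate token ids."""
    scores = {label: mask_token_probs[token_ids].sum().item()
              for label, token_ids in verbalizer.items()}
    return max(scores, key=scores.get), scores

vocab_probs = torch.softmax(torch.randn(21128), dim=-1)   # dummy MLM distribution
verbalizer = {"entailment": [101, 205],                    # e.g. "related", "implied"
              "neutral": [330, 412],                       # e.g. "unrelated", "neutral"
              "contradiction": [518, 609]}                 # e.g. "opposite", "contradictory"
pred, scores = fuzzy_verbalizer_predict(vocab_probs, verbalizer)
```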
This mechanism improves prediction accuracy across multiple FewCLUE tasks and, to some extent, reduces the fluctuation caused by manually selecting different label words. In addition, we introduce unlabeled data in the few-shot learning stage for self-training, i.e., using the current model to pseudo-label the unlabeled data and iteratively optimize the model.
Business & Products
It is worth mentioning that, based on the machine learning platform PAI, this technology has already been applied in real business scenarios with good results. It enhances the KBQA capability of DAMO Academy's Cloud Honey intelligent dialogue product, giving it fast cold start and accurate question answering, and it has been deployed in multiple business scenarios across government affairs, finance, and general domains. In actual projects, fast cold start and accurate question answering can be achieved with as few as 20 labeled samples. At the same time, these techniques are expected to give machine learning algorithms on Alibaba Cloud few-shot learning capability, so that downstream task performance can be greatly improved with only a small amount of data annotation. This means Alibaba Cloud models can be deployed quickly and at low cost, efficiently and flexibly empowering enterprise business.
Based on PAI, Alibaba Cloud aims to build large-scale end-to-end AI capabilities, from underlying chips to distributed systems and up to large-scale algorithms and data, establishing engineering-grade AI capabilities that serve all industries. Currently, the PAI platform supports accelerated training with hundreds of billions of features and trillions of samples, with more than 200 mature built-in algorithms and over 50 high-quality deep learning pre-trained models in AI fields such as computer vision, audio/video, and text, comprehensively improving the efficiency of enterprise AI engineering. On top of these platform capabilities, PAI also provides mature industry solutions; it has become the service of choice for many enterprises and has reached commercial maturity in scenarios such as intelligent recommendation, user growth, on-device super-resolution, and autonomous driving.
This article is original content from Alibaba Cloud and may not be reproduced without permission.