At the WAVE SUMMIT 2021 Deep Learning Developer Summit, which concluded on May 20, Baidu Wenxin ERNIE open-sourced four pre-training models. This article gives a detailed technical interpretation of the four models. Since 2019, NLP pre-training models have made continuous breakthroughs in technological innovation and industrial application, but several pain points still trouble developers:

  • Only single-granularity semantic modeling is considered; without the introduction of multi-granularity knowledge, semantic understanding ability is limited.
  • The modeling-length bottleneck of the Transformer structure makes it impossible to process very long text.
  • Models focus on a single modality, such as language, and lack the ability to jointly model the multiple modalities (language, visual, and auditory information) found in real industrial application scenarios.

At the WAVE SUMMIT 2021 Deep Learning Developer Summit held on May 20, Baidu Wenxin ERNIE released its latest four open-source pre-training models, built on the PaddlePaddle core framework: the multi-granularity language knowledge enhancement model ERNIE-Gram, the long-text understanding model ERNIE-Doc, the cross-modal understanding model ERNIE-ViL, and the language-vision unified model ERNIE-UNIMO.

Targeting the existing pain points of pre-training models, the four open-source Wenxin ERNIE models make breakthroughs in the three areas of text semantic understanding, long-text modeling, and cross-modal understanding. They also have broad application scenarios and prospects, and will further promote the intelligent upgrading of industry.

ERNIE open-source repository: github.com/PaddlePaddl…

**1. Multi-granularity Language Knowledge Enhancement Model: ERNIE-Gram**

Since the birth of the ERNIE model, Baidu researchers have enhanced its semantic modeling ability by introducing knowledge into pre-training. The newly released ERNIE-Gram improves model effectiveness by explicitly introducing linguistic-granularity knowledge. Specifically, ERNIE-Gram proposes an explicit n-gram masked language model to learn n-gram-granularity language information. Compared with the contiguous n-gram masked language model, it greatly reduces the semantic learning space (from V^n to V_ngram, where V is the vocabulary size and n is the n-gram length), which significantly improves the convergence speed of pre-training.

▲ Figure 1-1 Contiguous n-gram masked language model vs. explicit n-gram masked language model

In addition, on the basis of explicit n-gram semantic granularity modeling, ERNIE-Gram proposes multi-level n-gram language granularity learning. Using a dual-stream mechanism, it simultaneously learns the fine-grained semantic knowledge inside n-gram language units and the coarse-grained semantic knowledge of the n-gram units themselves, realizing multi-level language-granularity knowledge learning.

▲ Figure 1-2 N-gram multi-level language granularity mask learning

Without adding any computational complexity, ERNIE-Gram significantly outperforms mainstream open-source pre-training models on several typical Chinese tasks, such as natural language inference, short-text similarity, and reading comprehension. The ERNIE-Gram English pre-training model likewise outperforms mainstream models on general language understanding and reading comprehension tasks.

The ERNIE-Gram method was accepted by NAACL 2021. Paper: arxiv.org/abs/2010.12…

**2. Long-Text Understanding Model: ERNIE-Doc**

The Transformer is the basic network structure that ERNIE pre-training models rely on, but because its computation and memory costs grow quadratically with sequence length, it struggles to model long content such as chapters and books. Inspired by the way humans skim a text first and then read it intensively, ERNIE-Doc pioneers a retrospective modeling technique that breaks through the Transformer's text-length bottleneck and realizes bidirectional modeling of text of arbitrary length.

By feeding the long text into the model twice, ERNIE-Doc learns and stores the semantic information of the whole document in the skimming phase, and then, in the intensive-reading phase, explicitly fuses that whole-document semantics into each text fragment, realizing bidirectional modeling and avoiding the context-fragmentation problem.

In addition, the recurrence memory structure in traditional long-text models (e.g., Transformer-XL) limits the effective modeling length. ERNIE-Doc improves it to a same-layer recurrence, so that the memory retains higher-level semantic information and the model gains the ability to handle extremely long text.

▲ Figure 2-1 Retrospective modeling and memory enhancement mechanism in ERNIE-Doc

ERNIE-Doc also learns the sequential relationships among paragraphs at the document level through a segment-reordering task, allowing it to better model the overall information of a document.

▲ Figure 2-2 Document segment reordering learning

ERNIE-Doc significantly improves long-text modeling capability and can solve many application challenges that traditional models cannot handle. In search engines, for example, ERNIE-Doc can understand a web page as a whole and return more systematic results to the user. In intelligent writing, ERNIE-Doc can be used to generate longer, semantically richer articles.

The ERNIE-Doc model achieves the best results on 13 typical Chinese and English long-text tasks, including reading comprehension, information extraction, text classification, and language modeling.

The ERNIE-Doc method was accepted by ACL 2021. Paper: arxiv.org/abs/2012.15…

**3. Cross-modal Understanding Model with Scene Graph Knowledge Fusion: ERNIE-ViL**

Cross-modal information processing requires AI models to deeply understand and integrate information from modalities such as language, vision, and audio. Current pre-training-based cross-modal semantic understanding techniques learn joint cross-modal representations from aligned corpora and fuse semantic-alignment signals into the joint representation to improve cross-modal understanding. ERNIE-ViL proposes a knowledge-enhanced vision-language pre-training model: scene graph knowledge containing fine-grained semantic information is integrated into the pre-training process, and three pre-training tasks are constructed, namely object prediction, attribute prediction, and relation prediction. These tasks make the model attend to fine-grained semantic knowledge during pre-training, learn better cross-modal semantic alignment, and obtain better cross-modal semantic representations.

▲ Figure 3-1 Knowledge-enhanced cross-modal pre-training ERNIE-ViL framework

For the first time, ERNIE-ViL integrates scene graph knowledge into the pre-training of a cross-modal model, providing a new direction for research on cross-modal semantic understanding. The model achieves leading results on five typical cross-modal tasks, including visual question answering, visual commonsense reasoning, referring expression comprehension, and cross-modal text and image retrieval. ERNIE-ViL has gradually been deployed in real industrial application scenarios such as video search.

The ERNIE-ViL method was accepted by AAAI 2021. Paper: arxiv.org/abs/2006.16…

**4. Language and Vision Unified Model: ERNIE-UNIMO**

Big data is one of the key foundations of deep learning's success. Current pre-training methods are usually carried out separately on data of different modalities, making it difficult to support language and image tasks at the same time. Can deep-learning-based AI systems, like humans, learn simultaneously from heterogeneous data, both single-modal and multi-modal? If so, it would further widen the boundary of the large-scale data that deep learning can exploit, and thereby improve the general perceptual and cognitive abilities of AI systems.

To this end, the language-vision unified model ERNIE-UNIMO proposes a unified-modal learning method that trains simultaneously on single-modal text, single-modal images, and multi-modal image-text pairs, learning a unified semantic representation of text and images that can serve multiple single-modal and cross-modal downstream tasks at once. The core module of this method is a Transformer network. During training, texts, images, and image-text pairs are randomly mixed together: an image is converted into an object sequence, a text into a token sequence, and an image-text pair into the concatenation of the object sequence and the token sequence. Unified-modal learning processes the three types of data uniformly: it performs self-supervised learning based on masked prediction on the object or token sequences, and cross-modal contrastive learning on image-text pairs, thereby achieving unified representation learning of images and texts. Furthermore, this joint learning lets textual knowledge and visual knowledge enhance each other, effectively improving both textual and visual semantic representation.

This method surpasses mainstream text pre-training models and multi-modal pre-training models across four types of scenarios, covering language understanding and generation and multi-modal understanding and generation, on 13 tasks in total, and tops the authoritative visual question answering leaderboard VQA and the text reasoning leaderboard aNLI. It verifies for the first time that language knowledge and visual knowledge can enhance each other through non-parallel single-modal text and image data.

This work was also accepted by ACL 2021. Paper: arxiv.org/abs/2012.15…

**5. Solving NLP's Technical Challenges and Helping Industry Upgrade Intelligently**

With these four newly open-sourced pre-training models, ERNIE continues to push forward the innovation and application of NLP model technology.

Language and knowledge technologies are regarded as the core of artificial intelligence's cognitive ability. Since 2019, drawing on its deep accumulation in natural language processing, Baidu has made a series of world-class breakthroughs and released the Wenxin ERNIE semantic understanding platform, which is widely used in finance, telecommunications, education, the Internet, and other industries, helping drive industrial intelligent upgrading.

Known as "the pearl on the crown of artificial intelligence," NLP has always been at the forefront of AI research and practice. Based on its leading semantic understanding technology, the Baidu Wenxin platform helps enterprises cross the thresholds of technology, tooling, computing power, and talent on the NLP track. Open to developers and enterprises, it comprehensively accelerates the use of NLP technology to upgrade the whole industry intelligently, lending intelligent "wings" to large-scale AI production. With the mission of "understanding language, possessing intelligence, and changing the world," Baidu Natural Language Processing (NLP) is committed to developing core NLP technologies and building leading technology platforms and innovative products, serving global users and making the complex world simpler.