The common natural language processing (NLP) pipeline is as follows:

  1. Text preprocessing
    • Data cleaning, word segmentation (tokenization), etc.
  2. Feature engineering
    • Traditional ML features (e.g., the TF-IDF feature, which mainly captures keywords)
    • Syntactic and semantic information (word vectors, sentence vectors, etc.)
  3. Model building
    • Traditional models: LinearSVM, LogisticRegression
    • Deep models: TextCNN, TextRNN, TextRCNN, Bert

Since models take tensors as input, the first task on an NLP dataset is to convert text into vectors, which is the focus of this article. The main approaches are:

  1. After proper data cleaning and word segmentation, one-hot encode each word. This is simple, but it loses a lot of information and there are some tricky details to handle along the way (see the sketch after this list)
  2. Encode the text directly using word frequency, TF-IDF, and similar statistics. These features may not help much when words are fairly evenly distributed
  3. Use pre-training: exploiting the regularities of language, map each word into a latent space that represents its semantics, and optionally use context to fine-tune the semantic vector (solving the polysemy problem)
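As a concrete illustration of approach 1, here is a minimal one-hot encoding sketch in Python (the toy vocabulary and sentence are invented for illustration, not taken from the original text):

```python
# Toy vocabulary built after cleaning and word segmentation (illustrative only).
vocab = ["i", "love", "natural", "language", "processing"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector of length len(vocab) for a known word."""
    vec = [0] * len(vocab)
    vec[word_to_idx[word]] = 1
    return vec

sentence = ["i", "love", "language"]
encoded = [one_hot(w) for w in sentence]   # three vectors of length 5
print(encoded)
```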

Traditional features

Encode the text with traditional features such as word frequency or TF-IDF and then feed it into a LinearSVC, as sketched after the list below. The disadvantages of these features are:

  • Performance depends heavily on the dataset; if the data itself is scattered, results are poor

  • Not general-purpose

  • Computation is slow compared with using frozen word embeddings
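A minimal sketch of this traditional baseline, assuming scikit-learn; the tiny texts and labels below are placeholders, not data from the original article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; in practice this would be the cleaned, segmented dataset.
texts = ["cheap watches buy now", "meeting moved to friday",
         "win a free prize today", "notes from the friday meeting"]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF features fed into a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["free watches at the meeting"]))
```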

Word Embedding

Word embedding is a pre-training technique.

In CV tasks, hierarchical CNN structures are commonly used, and pre-training is the standard way to assist them. In such a hierarchy, neurons at different levels learn different types of image features, forming a bottom-up structure: neurons closer to the bottom learn more universal, task-independent features (edges, corners, arcs, etc.), while neurons at higher levels learn features tied to the specific task.

Therefore, a large dataset such as ImageNet can be used to pre-train the network, and the parameters of its lower, task-independent feature-extraction layers can be used to initialize the network for a specific task:

  • For small datasets: since network parameters typically number in the hundreds of millions, it is hard to train a complex network with little data, so pre-training + fine-tuning is more appropriate
  • For large datasets: pre-training still provides a good initialization for the parameters

NLP language model

Language models include grammar-based language models and statistical language models; we usually mean the latter. A statistical language model treats a sequence of words as a random event and uses a probability to describe how likely it is to belong to a given language. Given a vocabulary set V, for a sequence of words over V, the model assigns a probability that measures how confident we are that the sequence forms natural, grammatical, meaningful text. A common statistical language model is the N-gram model.
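For reference (this formula is standard and is not spelled out in the original text), an N-gram model factorizes the probability of a word sequence using the chain rule with an (N-1)-word history:

$$
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
$$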

So it is natural to think that pre-training could also be used for semantic extraction in NLP. Probably the earliest such work is the Neural Network Language Model (NNLM) from 2003. After deep learning took off in 2013, word embeddings, a by-product of language-model training consisting of a matrix of word vectors, became popular. The most popular tool in 2013 for learning word embeddings with a language model was Word2Vec.

The network structure of Word2Vec is similar to that of NNLM; the difference lies in the training method:

  • CBOW: remove a word from a sentence and use the word's context to predict the missing word
  • Skip-gram: the opposite of CBOW; given a word as input, predict its context

Try Word2Vec using the Gensim library:
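A minimal sketch with Gensim (this assumes gensim >= 4.0, where the dimensionality argument is called vector_size; the toy corpus is invented for illustration):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences.
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["word2vec", "learns", "word", "vectors", "from", "context"],
    ["cbow", "predicts", "a", "word", "from", "its", "context"],
]

# sg=0 trains CBOW (predict the word from its context); sg=1 trains Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

vec = model.wv["word"]                        # embedding vector for "word"
print(model.wv.most_similar("word", topn=3))  # nearest neighbours in the space
```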

The problem with word embedding is that, after training, a word's vector is the same no matter what its context is, so two different contexts (senses) of the same word are mapped to the same point in the embedding space.

ELMO

Deep contextualized word representation (Embedding from Language Models, ELMO) provides a solution to the polysemy problem. An ordinary word embedding is fixed once trained and does not change with context; that is, the embedding is static.

The essence of ELMO is to first train word embeddings with a language model and then adjust each word's embedding according to the semantics of its context.

ELMO adopted a two-stage training process:

  1. Upstream: pre-train a language model; this mainly captures word meaning
  2. Downstream: extract the word embeddings of the relevant words from the layers of the pre-trained network and add them to the downstream task as new features; these layers mainly capture syntactic and semantic information

Disadvantages (compared to GPT and Bert):

  1. LSTM's feature-extraction ability is far weaker than Transformer's
  2. Fusing features by concatenation is a weak form of fusion

Transformer is a deep network built by stacking Self Attention layers. The attention mechanism means scanning the whole input to find the regions that deserve special attention and suppressing useless information. Attention is usually attached to an Encoder-Decoder framework. A plain Encoder-Decoder is generally used when input and output lengths differ: the Encoder encodes the input into a fixed-length vector (the intermediate semantic representation), and the Decoder decodes from it. Its disadvantages are:

  • The fixed-length intermediate semantic representation is limited and may lose information
  • The Decoder decodes from the sentence's global intermediate representation and cannot attend to local information in the sentence

Introducing the Attention model assigns a weight to each word and makes it possible to focus on local information. Attention essentially computes a weighted sum over the Values of the elements in Source (Source is a sentence, the elements are its words, and their Values are summed with weights).
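One standard way to write this weighted sum (the generic form is given below, not reproduced from the original text; the similarity scores are usually normalized with a softmax before summing):

$$
\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} \mathrm{Similarity}(Query, Key_i) \cdot Value_i
$$

where $L_x$ is the length of Source.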

Self Attention is also called intra-attention. In the usual Attention mechanism, Source and Target are different; in Self Attention, attention occurs between the elements within a single sentence (think of it as the special case Source = Target). Self Attention can capture syntactic and semantic features between words in the same sentence. It relates any two words directly, unlike the sequential computation of an RNN, where the chance of effectively capturing dependencies between distant words is smaller.
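A minimal numpy sketch of single-head self-attention (it omits the learned Query/Key/Value projections that a real Transformer layer would add; everything here is illustrative):

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over one sentence.

    X: (seq_len, d) matrix of word vectors; Query = Key = Value = X,
    i.e. the Source = Target special case described above.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # word-to-word similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ X                                # weighted sum of values

sentence_vectors = np.random.randn(6, 8)              # 6 words, 8-dim embeddings
print(self_attention(sentence_vectors).shape)         # (6, 8)
```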

GPT

Generative Pre-Training also uses a two-stage process:

  1. Pre-train with a language model
  2. Solve downstream tasks by fine-tuning

Differences from ELMO:

  • Feature extraction uses a Transformer
  • Only the preceding context (context-before) is used for pre-training (ELMO uses a bi-LSTM, so it can extract context in both directions)

Disadvantages:

  • The language model is unidirectional, so only context-before information can be extracted

Bert

Bert uses exactly the same two-stage model as GPT (the differences are that its language model is bidirectional and its pre-training data is larger), so in effect:

  • Replace ELMO’s feature extractor with a Transformer and you get Bert
  • Switch GPT’s language model to a bidirectional one and you get Bert

The question then is: how can a Transformer be trained bidirectionally?

  • Masked LM: use [MASK] or other words to substitute for the words that need to be predicted (see the sketch after this list)
    • Other words are needed because the [MASK] token will not actually appear in later use, so the model cannot be trained on that token alone
  • Next Sentence Prediction: Masked LM works at word granularity, which is too fine-grained for sentence-relationship prediction in NLP, so sentence order in the corpus is also manipulated (some next sentences are replaced) and the model predicts whether a pair is consecutive
  • So Bert’s pre-training is a multi-task process
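A quick way to see the Masked LM objective in action, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is mentioned in the original text):

```python
from transformers import pipeline

# Fill-mask pipeline: Bert predicts the token hidden behind [MASK].
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The man went to the [MASK] to buy milk."):
    print(pred["token_str"], round(pred["score"], 3))
```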

Bert’s input is the sum of three kinds of embeddings (a minimal sketch follows the list):

  • Token Embedding: the meaning of the word
  • Segment Embedding: because of the sentence-relationship prediction task, which sentence a token belongs to must also be part of the input
  • Position Embedding: position information, since word order matters in NLP
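A minimal PyTorch sketch of how the three embeddings are combined by summation (the sizes below are the usual bert-base values, used only for illustration; the class name is made up):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Token + segment + position embeddings, summed into one input tensor."""

    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)         # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden)  # learned position vectors

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

emb = BertInputEmbedding()
tokens = torch.randint(0, 30522, (1, 10))    # one sequence of 10 token ids
segments = torch.zeros(1, 10, dtype=torch.long)
print(emb(tokens, segments).shape)           # torch.Size([1, 10, 768])
```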

Reference

From Word Embedding to Bert model: History of pre-training techniques in Natural Language Processing