This article is part of my NLP study notes. I am a complete beginner, not a professional, so corrections and guidance are welcome.

Introduction to NLP

Simple word-vector models, such as one-hot encoding and the window-based co-occurrence matrix, have already been introduced.

Continuing the notes: there is no shortage of material online, but as a beginner it is easy to be overloaded, especially for someone like me whose math is weak. Then I read peghoty's blog, which introduces the topic systematically; it is really impressive, walking through the papers and even the source code.

There are two common ways to make computers understand our language:

1. Rule-based language models, studied at the very beginning.

2. Statistics-based language models, which came later; with the development of deep learning, these have grown into neural network language models.

The following covers statistics-based language models (a sentence with high probability is considered more plausible, one with low probability less so).

Unigram model:

Take an English sentence as an example: “The cat jumped over the puddle.”

The purpose of the model is to assign the sentence a probability value P. For a sentence of n words, the unigram model expresses this probability as:

P(w1, w2, …, wn) = P(w1) P(w2) … P(wn)

This can be understood as: a document is generated word by word, each word has its own probability of occurring, and the probability of the document is the product of the probabilities of all its words.
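
To make this concrete, here is a minimal sketch of a unigram model (the toy corpus and function names are my own, not from any particular library): estimate each word's probability from counts, then score a sentence as the product of those probabilities.

```python
from collections import Counter

# Toy training corpus; in practice this would be a large collection of text.
corpus = "the cat jumped over the puddle . the dog sat on the mat .".split()

counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word):
    # P(w) = count(w) / total number of tokens (0 for unseen words)
    return counts[word] / total

def sentence_prob(sentence):
    # Unigram assumption: P(w1, ..., wn) = product of the individual P(wi)
    p = 1.0
    for w in sentence.split():
        p *= unigram_prob(w)
    return p

print(sentence_prob("the cat sat on the mat ."))
```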

Note: this is not the commonly used n-gram model.

N-gram model:

If we have a sequence of m words (i.e., a sentence), we want to compute the probability P(w1, w2, …, wm). By the chain rule,

P(w1, w2, …, wm) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wm | w1, …, wm−1)
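
Applied to the example sentence from the unigram section, this is just the standard identity written out:

P(The, cat, jumped, over, the, puddle) = P(The) P(cat | The) P(jumped | The, cat) P(over | The, cat, jumped) P(the | The, cat, jumped, over) P(puddle | The, cat, jumped, over, the)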

A quick aside on conditional probability: P(B | A) is the probability of B given that A has occurred. Intuitively, you move from the full sample space into a subspace (a slice of it) and compute B's proportion within that subspace.

This is related to Bayes’ formula. (Honestly, my math is weak and it makes my head spin.) The way to understand Bayes’ formula here is that the language model P(·) acts as a kind of prior knowledge that guides the final choice of result.

This looks simple, but it is hard to do in practice: computing the conditional probabilities directly means conditioning each word on every word that precedes it, which requires an enormous amount of data.

The n-gram model is based on the Markov assumption: the probability of each word depends only on the few words immediately before it. For example, a second-order Markov assumption considers only the previous two words, and the corresponding language model is the trigram model. A language model that adopts the Markov assumption is also called a Markov model. For the more rigorous treatment, see peghoty’s post:

Applying this assumption means the current word depends only on a limited number of preceding words, so we no longer have to trace the conditioning all the way back to the first word, which drastically shortens each term in the equation.
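
Written out, this is the standard n-gram approximation: each conditional probability is truncated to look back only n−1 words,

P(wi | w1, …, wi−1) ≈ P(wi | wi−n+1, …, wi−1)

so for a bigram model (n = 2) the sentence probability becomes P(w1) P(w2 | w1) P(w3 | w2) … P(wm | wm−1).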

Setting n = 1, 2, 3, … gives the corresponding models: n = 1 is called the 1-gram or unigram model; n = 2 the 2-gram or bigram model; n = 3 the 3-gram or trigram model; and in general we speak of the n-gram model.

How large should n be? In practice it usually does not exceed 3. Besides computational complexity, model quality has to be considered: more parameters give finer distinctions but each count becomes less reliable, so a compromise is needed.

Smoothing is not expanded on here; it exists to deal with data sparsity (my understanding: words or n-grams that never occur in the training set appear in the test set, so the language model assigns them probability zero).
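
As one common example (just for reference, since smoothing is not covered in this post), add-one (Laplace) smoothing pretends every bigram was seen one extra time, so nothing gets probability zero; here |V| is the vocabulary size:

P(wi | wi−1) = (count(wi−1, wi) + 1) / (count(wi−1) + |V|)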

That wraps up the n-gram theory. For an easy-to-follow explanation with worked examples, I recommend the post on the n-gram model in natural language processing on the baimafujinji CSDN blog (see the references).

A common application: search-engine input suggestions, which can use a bigram language model to predict the next word (see the sketch below).

Standard procedure: count the frequency of each n-gram and store it. When you need the probability of a sentence, look up the corresponding probabilities and multiply them together.

This is maximum likelihood estimation: using the observed sample results to infer the parameter values that are most likely (with maximum probability) to have produced those results.
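
Here is a minimal sketch of this procedure for a bigram model (toy data and names of my own choosing): count unigram and bigram frequencies, score a sentence with the maximum-likelihood estimates P(wi | wi−1) = count(wi−1, wi) / count(wi−1), and reuse the same counts to suggest the next word as in the search-box example.

```python
from collections import Counter

# Toy training data; a real model would be trained on a large corpus.
tokens = "the cat sat on the mat . the cat jumped over the puddle .".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    # Bigram chain: P(w1) * P(w2 | w1) * ... * P(wm | wm-1)
    words = sentence.split()
    p = unigrams[words[0]] / sum(unigrams.values())
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

def suggest_next(prev, k=3):
    # Search-box style suggestion: rank candidate next words by P(word | prev).
    ranked = sorted(unigrams, key=lambda w: bigram_prob(prev, w), reverse=True)
    return ranked[:k]

print(sentence_prob("the cat sat on the mat ."))
print(suggest_next("the"))
```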

Advantages: (1) parameters are easy to train with maximum likelihood estimation; (2) it fully uses the information in the previous n−1 words; (3) it is highly interpretable, intuitive, and easy to understand.

Disadvantages: (1) it lacks long-range dependence and can only condition on the previous n−1 words; (2) the parameter space grows exponentially with n; (3) with sparse data, OOV problems are unavoidable; (4) it is based purely on counted frequencies, so its generalization ability is poor.

Neural probabilistic language models

Model based on feedforward neural network

The classic work on neural language models is Bengio et al.’s 2003 paper “A Neural Probabilistic Language Model”.

Bengio used a three-layer neural network to build the language model; structurally it is still an n-gram-style model, since it conditions on a fixed window of previous words.

At the bottom of the figure, wt−n+1, …, wt−2, wt−1 are the previous n−1 words, from which we want to predict the next word wt. C(w) denotes the word vector of word w; the whole model shares one set of word vectors, stored in a matrix C of size |V| × m, where |V| is the vocabulary size (the total number of distinct words in the corpus) and m is the dimensionality of the word vectors. Mapping a word w to C(w) just takes the corresponding row out of the matrix.

The first layer (input layer) of the network concatenates the n−1 vectors C(wt−n+1), …, C(wt−2), C(wt−1) end to end into a single (n−1)m-dimensional vector, denoted x. The second layer (hidden layer) is computed as d + Hx, just as in an ordinary neural network, where d is a bias term; tanh is then applied as the activation function. The third layer (output layer) has |V| nodes; each node yi gives the unnormalized log-probability that the next word is word i. Finally, the softmax function normalizes the output y into probabilities. The formula for y is:

y = b + Wx + U\tanh(d+Hx)

In this formula, U (a |V| × h matrix) holds the parameters from the hidden layer to the output layer, and most of the model’s computation lies in the matrix multiplication between U and the hidden layer.
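
To make the shapes concrete, here is a minimal NumPy sketch of the forward pass y = b + Wx + U\tanh(d+Hx); the dimensions, the random initialization, and the function name are placeholder assumptions of mine, not the settings from the paper.

```python
import numpy as np

# Placeholder sizes: vocabulary, embedding dim, hidden units, n-gram order.
V, m, h, n = 1000, 50, 100, 4
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))              # word-vector matrix, one row per word
H = rng.normal(size=(h, (n - 1) * m))    # input -> hidden weights
d = np.zeros(h)                          # hidden bias
U = rng.normal(size=(V, h))              # hidden -> output weights
W = rng.normal(size=(V, (n - 1) * m))    # direct input -> output weights
b = np.zeros(V)                          # output bias

def forward(context_ids):
    # context_ids: indices of the previous n-1 words
    x = C[context_ids].reshape(-1)            # concatenate the n-1 word vectors
    y = b + W @ x + U @ np.tanh(d + H @ x)    # unnormalized log-probabilities
    p = np.exp(y - y.max())                   # softmax (shifted for stability)
    return p / p.sum()

probs = forward([3, 17, 42])   # probability distribution over the next word
print(probs.shape)             # (1000,)
```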

I can’t explain this part well myself, but peghoty discusses in detail why the tanh (hyperbolic tangent) activation is used and gives some advice on how to choose the parameters.

Advantages: compared with n-gram models, similarity between words can be captured through the word vectors.

Smoothing is built in, so the complicated smoothing algorithms required by n-gram models are unnecessary.

Disadvantages: it still uses only a limited amount of preceding context.

Model based on recurrent neural network

To address the limitation of a fixed-length context, Mikolov published “Recurrent Neural Network Based Language Model” at INTERSPEECH 2010.

I don’t fully understand it either, so I won’t paste the formula here.

Each time a new word arrives, it is combined with the previous hidden state to compute the next hidden state. The hidden layer is reused at every step and always carries the most recent state; the output at each step is produced from the hidden state through an ordinary feedforward layer.

Compared with the simple feedforward network, it can genuinely use all of the preceding context to predict the next word, rather than only a window of the previous n words. It does not rely on the Markov assumption.

(Good in theory, but very difficult to optimize)
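
A minimal NumPy sketch of that recurrence (the parameter names are mine, not Mikolov’s notation): at each step the current word’s vector and the previous hidden state produce the new hidden state, and the next-word distribution is computed from it.

```python
import numpy as np

# Placeholder sizes: vocabulary, embedding dim, hidden units.
V, m, h = 1000, 50, 100
rng = np.random.default_rng(0)

E = rng.normal(size=(V, m))      # word embeddings
W_xh = rng.normal(size=(h, m))   # current word -> hidden
W_hh = rng.normal(size=(h, h))   # previous hidden -> hidden
W_hy = rng.normal(size=(V, h))   # hidden -> output

def step(word_id, h_prev):
    # The new hidden state mixes the current word with the previous hidden state.
    h_t = np.tanh(W_xh @ E[word_id] + W_hh @ h_prev)
    y = W_hy @ h_t                       # unnormalized scores over the vocabulary
    p = np.exp(y - y.max())
    return h_t, p / p.sum()              # (new state, next-word distribution)

h_t = np.zeros(h)
for w in [3, 17, 42]:                    # feed a toy word-id sequence step by step
    h_t, next_word_probs = step(w, h_t)
print(next_word_probs.shape)             # (1000,)
```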

A brief summary of neural network language models (NNLM)

Advantages: (1) they capture longer-range dependencies, giving stronger constraints; (2) they avoid the zero-probability problems caused by data sparsity and OOV words (we usually have a fixed vocabulary, typically preloaded, used for training the model; words in the test data that are not in this vocabulary are called out-of-vocabulary, or OOV, words).

Disadvantages: long training time; the neural network is a black box with poor interpretability (less intuitive than the count-based statistical models).

If it still feels unclear, here is the simplified version: a neural network language model predicts the probability of the next word given a sequence of preceding words.

Evaluation metric

With so many models, how do we evaluate which one works better?

Encyclopedia introduction:

In information theory, perplexity measures how well a probability distribution or model predicts a sample, and it can be used to compare two probability distributions or models. A model with lower perplexity is better at predicting the sample.

A language model that assigns high probability to the sentences in the test set is better: the test-set sentences are all normal sentences, so the higher the probability the trained model gives them, the better the model. For a test sequence of N words, the perplexity is:

PP(W) = P(w1 w2 … wN)^(−1/N)

Conclusion: the higher the probability of the sentences, the lower the perplexity, and the better the language model.
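
A minimal sketch of computing perplexity from per-word probabilities (word_probs is a hypothetical list of the probabilities the model assigns to each word of the test set; logs are used to avoid underflow):

```python
import math

def perplexity(word_probs):
    # PP = P(w1 ... wN) ** (-1/N), computed in log space to avoid underflow
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Higher per-word probabilities -> lower perplexity -> better model.
print(perplexity([0.05, 0.10, 0.20, 0.05]))   # toy numbers
print(perplexity([0.30, 0.40, 0.50, 0.20]))   # higher probabilities, lower perplexity
```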

That’s it for this part; CBOW and skip-gram will come next.

References:

www.cnblogs.com/peghoty/p/3…

licstar.net/archives/32…

zhuanlan.zhihu.com/p/28894219

zhuanlan.zhihu.com/p/34219483

blog.csdn.net/baimafujinj…

zhuanlan.zhihu.com/p/44107044