
People are often willing to spend some time on a new field or an emerging topic, and getting started is usually easy. After a period of learning, though, many gradually fade away and give up for various reasons: no practical use, no progress, or hitting a bottleneck. I have been working with machine learning for nearly two years, and getting to where I am today took several stages: passion at the beginning, pain and despair along the way, and persistence today. So I hope to share not only the techniques of machine learning, but also the learning process itself.

Language model

The main application area of recurrent neural networks is sequential problems, and natural language processing (NLP) can be regarded as their best practice. So when we talk about natural language processing, we cannot avoid talking about language models. Language models have been described in detail before, so here is a brief review.


$$P(w_1, w_2, \cdots, w_T)$$

Writing or reading an article is related to time: we read from the first word to the T-th word, left to right or top to bottom. It looks like a matter of space, but it is actually a matter of time, a sequential problem.

We just need to learn a model that assigns a high probability to correct sentences. Some words often appear together, that is, they have a high probability of co-occurring, and this probability can be obtained by counting in a corpus. Take the two words "machine" and "learning" as an example:


$$\frac{C(w_1, w_2)}{C(w_1)}$$

  • $C(w_1)$: the number of times "machine" appears in the corpus
  • $C(w_1, w_2)$: the number of times "machine" and "learning" appear together in the corpus

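As a concrete illustration, here is a minimal sketch (with a made-up toy corpus; the sentences and the `bigram_prob` helper are just for illustration) of estimating $P(w_2|w_1) = C(w_1, w_2)/C(w_1)$ by counting:

```python
from collections import Counter

# Toy corpus; in practice this would be a large collection of tokenized text.
corpus = [
    "machine learning is fun",
    "machine learning needs data",
    "machine translation is hard",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)                 # C(w1)
    bigrams.update(zip(words, words[1:]))  # C(w1, w2)

def bigram_prob(w1, w2):
    """Estimate P(w2 | w1) = C(w1, w2) / C(w1) by maximum likelihood."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("machine", "learning"))  # 2/3: "machine" occurs 3 times, "machine learning" twice
```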

$$P(w_1)P(w_2|w_1)$$

This makes sense mathematically, but it is hard to compute. The main reason is that the probability of a whole sentence requires the following joint probability, and you can imagine from the formula how much computation, and how many counts, that requires.


$$P(w_1)P(w_2|w_1) \cdots P(w_n|w_1, w_2, \cdots, w_{n-1})$$

Recurrent neural network

For this reason, people introduced the Markov chain, the dominant model in natural language processing before recurrent neural networks emerged. The Markov assumption is relatively simple: the word at the current time t depends only on the word at time t-1 and has nothing to do with earlier words. This actually sounds reasonable; in most cases the current state is closely related to the previous state. For example, the current position of a taxi is closely related to its previous position.
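Written out, the (first-order) Markov assumption replaces the full chain-rule factorization above with a product of bigram conditionals:

$$P(w_1, w_2, \cdots, w_T) \approx P(w_1) \prod_{t=2}^{T} P(w_t|w_{t-1})$$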

A language model can be imagined as predicting the probability of the next word given a sequence of words. A Markov chain predicts the next word based only on the previous word (or the previous k words); the recurrent neural network, introduced below, solves this limited-context problem.


$$P(w_1)P(w_2|w_1)P(w_3|w_1, w_2)$$


$$k = 2, \quad P(w_3|w_1, w_2) = \frac{C(w_1, w_2, w_3)}{C(w_1, w_2)}$$
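As a sketch (reusing the same hypothetical toy-corpus counting as above), the k = 2 estimate is just a ratio of trigram counts to bigram counts:

```python
from collections import Counter

corpus = [
    "machine learning is fun",
    "machine learning is useful",
    "machine learning needs data",
]

bigrams = Counter()
trigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))              # C(w1, w2)
    trigrams.update(zip(words, words[1:], words[2:]))  # C(w1, w2, w3)

def trigram_prob(w1, w2, w3):
    """Estimate P(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(trigram_prob("machine", "learning", "is"))  # 2/3
```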

Now consider the most basic implementation of this task with a fully connected or convolutional neural network. Each input in the sequence is isolated in time and corresponds to one output of the network; everything happens at time t, and the network does not use information from earlier time steps when processing the input.

First look at a simple neural network composed of an input layer, a hidden layer, and an output layer. The picture looks simple, but the "recurrent" in recurrent neural network refers to a loop: information from the current time step is passed along to the next time step. You can imagine it as a memory unit: every time an input is processed, the network updates this memory in addition to producing an output, and every output depends not only on the input but also on the memory unit. In other words, a memory unit is added to the neural network.

A lot of people like to think of a recurrent neural network as unfolded (unrolled) over time; the unfolded view exists only to help you understand it.

x is a vector representing the value of the input layer, and the subscript distinguishes x at different points in time. s is a vector representing the value of the hidden layer, which can be understood as the state the recurrent neural network uses to pass information along the sequence.

U is the weight matrix from the input layer to the hidden layer; o is a vector representing the value of the output layer; and V is the weight matrix from the hidden layer to the output layer.

So, now let's see what W is. The hidden-layer value s of the recurrent neural network depends not only on the current input x, but also on the previous hidden-layer value s. The weight matrix W is the weight applied to the previous hidden-layer value when it is fed back into the network.
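Putting U, V, and W together, here is a minimal sketch of a single recurrent cell in NumPy. The layer sizes and the tanh/softmax choices are assumptions for illustration, not a definitive implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim, output_dim = 4, 8, 4  # assumed sizes for illustration
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output

def rnn_step(x_t, s_prev):
    """One time step: the new state depends on the current input and the previous state."""
    s_t = np.tanh(U @ x_t + W @ s_prev)          # update the "memory unit"
    logits = V @ s_t
    o_t = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return o_t, s_t

s = np.zeros(hidden_dim)
for x in np.eye(input_dim):  # a dummy one-hot input sequence
    o, s = rnn_step(x, s)
print(o.shape, s.shape)      # (4,) (8,)
```

Note that the same U, W, and V are reused at every time step; this parameter sharing is exactly the point made at the end of this post.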

Okay, I'm running a little bit late today, so I'll stop there. Tomorrow I'll continue with recurrent neural networks: the design of the loss function, and the problems in recurrent neural networks that led to LSTM and GRU being proposed.

  • There will be logic elements with memory that can remember the state of the system

Fewer parameters can be used to express the probabilistic relationships over time, because in this model we reuse the same parameters at every time step.