
Imagine that you are watching movies on YouTube. Suppose, to start, that each user's rating reflects how much they actually liked a movie, so a high rating means you would want to watch more movies like it. As it turns out, it is not that simple: people's interest in different types of movies can vary considerably over time. In fact, psychologists even have names for some of these effects.

This article is based on Section 8.1, Sequence Models, of Dive into Deep Learning; I have reorganized the material and integrated my own understanding.

  • Timeliness: our ratings are influenced by the public mood. After the Academy Awards, ratings for the corresponding movies go up, even though the movies themselves are unchanged. The effect lasts for several months, until the award is forgotten; some studies have shown that it can boost ratings by more than half a point.

  • Our sense of quality adapts to what we have recently watched. If a film is released amid the fierce competition of the winter or summer holidays, audiences who have just seen many excellent works will find it ordinary; in other periods, the same film may stand out among mediocre releases.

  • There is seasonality. The box office of some films rises during the summer or winter vacation, while a New Year film released outside its season will not perform as well as expected, because demand for many products is seasonal.

  • In some cases, a movie becomes unpopular because of inappropriate behavior by its director or actors during production.

Now consider stock prices instead. Let us represent the price with $x_t$; that is, at time step $t \in \mathbb{Z}^+$ we observe the price $x_t$. Note that for the sequences in this article, $t$ is usually discrete, varying over the integers or a subset of them. Suppose a trader who hopes to do well in the stock market on day $t$ predicts $x_t$ in the following way:
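Following the original d2l section, this prediction takes the form

$$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)$$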

The regression model

To achieve this, our trader could use a regression model. But there is an important problem: the number of inputs is not constant. The history $x_{t-1}, \ldots, x_1$ grows with $t$, so the amount of data to condition on keeps increasing. Here is one way to understand the data: even at the same time step $t$, the observed value need not be fixed. Of course, sampled data cannot actually be re-collected from the past, but suppose we could return to time $t$: because $x_t$ is a random variable, repeatedly observing it at time $t$ would not yield the same value. However, $x_t$ does not exist in isolation; it may be related to the values at earlier time steps. For example, although today's commodity price fluctuates randomly, it rises or falls relative to the price at the previous time step, and it may also be affected by the price trend over a preceding span of time. So we model each point with the conditional distribution $P(x_t \mid x_{t-1}, \ldots, x_1)$.

Autoregressive model

The first strategy: there is no need to condition on the entire, possibly very long history $x_{t-1}, \ldots, x_1$ when predicting the current $x_t$. Instead, we choose a time span $\tau$ and use only the observations $x_{t-1}, \ldots, x_{t-\tau}$ to predict $x_t$. This avoids an ever-growing number of arguments: the number of inputs is fixed, at least for $t > \tau$, which allows us to train a deep network as described above. Such models are called autoregressive models, since they perform regression on the sequence itself. A concrete instance is shown below.
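For instance, with $\tau = 2$ the full conditional is replaced by a fixed two-argument one,

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid x_{t-1}, x_{t-2}),$$

so every prediction uses exactly two inputs, no matter how large $t$ becomes.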

Latent autoregressive model

The second strategy, shown in Figure 8.1.2, is to keep a summary $h_t$ of the past observations in a state unit: at every step the model predicts $\hat{x}_t$ via $\hat{x}_t = P(x_t \mid h_t)$ and updates the state via $h_t = g(h_{t-1}, x_{t-1})$. Because $h_t$ is never observed but is used internally to preserve state, these models are also known as latent autoregressive models. A minimal sketch of one update step follows.
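Here is a minimal sketch of a single latent autoregressive update, assuming a scalar series and a hypothetical hidden size of 8; purely for illustration, the update function g is taken to be a linear layer followed by a tanh:

import torch
from torch import nn

hidden_dim = 8                                     # hypothetical state size
g = nn.Linear(hidden_dim + 1, hidden_dim)          # update: h_t = tanh(g([h_{t-1}, x_{t-1}]))
out = nn.Linear(hidden_dim, 1)                     # prediction head: x_hat is computed from h_t alone

h = torch.zeros(1, hidden_dim)                     # initial state h_0
x_prev = torch.zeros(1, 1)                         # previous observation x_{t-1}
h = torch.tanh(g(torch.cat([h, x_prev], dim=1)))   # update the latent state
x_hat = out(h)                                     # predict x_t from the state

In practice, g and the prediction head would be trained jointly, just as the regression network is trained later in this article.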

Both strategies raise an obvious question: how do we generate training data? The usual approach is to use historical observations to predict the next observation, given all observations up to the present. Obviously, we do not expect time to stand still, and each data point is a random variable rather than a static quantity like the image or house-price datasets we met before. However, a common assumption is that while the exact values of $x_t$ may change, at least the dynamics of the sequence itself (that is, its patterns of change, such as trends and seasonality) do not. This is reasonable, because genuinely new dynamics are, by definition, new and therefore impossible to predict from the data we have so far. Statisticians call dynamics that do not change stationary. Even though the data are dynamic random variables, they exhibit regularities along the timeline that a model can learn. Whatever we do, we will obtain an estimate of the entire sequence via


$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}, \ldots, x_1)$$
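As a quick sanity check, expanding this factorization for $T = 3$ gives

$$P(x_1, x_2, x_3) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_2, x_1),$$

where the first factor conditions on nothing, since there are no observations before $x_1$.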

Markov chain model

Recall the approximation made by the autoregressive model: it uses only $x_{t-1}, \ldots, x_{t-\tau}$ rather than the full history $x_{t-1}, \ldots, x_1$ to estimate $x_t$. Whenever this approximation is accurate, i.e., whenever a time window of size $\tau$ sliding across the data suffices, we say the sequence satisfies a Markov condition. If $\tau = 1$, we have a first-order Markov model, and $P(x)$ is given by the following formula:


$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}) \quad \text{where } P(x_1 \mid x_0) = P(x_1)$$

Such models are especially convenient when $x_t$ takes only discrete values, because under the first-order assumption only the most recent value feeds into the next one, and quantities along the chain can be computed exactly. For example, we can efficiently compute $P(x_{t+1} \mid x_{t-1})$ by marginalizing over $x_t$:


$$\begin{aligned} P(x_{t+1} \mid x_{t-1}) &= \frac{\sum_{x_t} P(x_{t+1}, x_t, x_{t-1})}{P(x_{t-1})} \\ &= \frac{\sum_{x_t} P(x_{t+1} \mid x_t, x_{t-1})\, P(x_t, x_{t-1})}{P(x_{t-1})} \\ &= \sum_{x_t} P(x_{t+1} \mid x_t)\, P(x_t \mid x_{t-1}) \end{aligned}$$
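To make the marginalization concrete, here is a small numeric sketch with a hypothetical two-state first-order chain; the last line of the derivation above then reduces to a matrix product:

import torch

# Hypothetical transition matrix: A[i, j] = P(x_{t+1} = j | x_t = i); rows sum to 1
A = torch.tensor([[0.9, 0.1],
                  [0.4, 0.6]])

# Summing over x_t is exactly a matrix product:
# P(x_{t+1} = j | x_{t-1} = i) = sum over k of A[i, k] * A[k, j]
two_step = A @ A
print(two_step)  # tensor([[0.8500, 0.1500], [0.6000, 0.4000]])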

Code implementation

%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l
# Generate a noisy sine wave: 1000 time steps of sin(0.01 t) plus Gaussian noise
T = 1000
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))

# Turn the series into (feature, label) pairs: each example uses the
# previous tau observations to predict the next value
tau = 4
features = torch.zeros((T - tau, tau))
for i in range(tau):
    features[:, i] = x[i:T - tau + i]
labels = x[tau:].reshape((-1, 1))

# Train on the first 600 examples only
batch_size, n_train = 16, 600
train_iter = d2l.load_array((features[:n_train], labels[:n_train]), batch_size, is_train=True)
def init_weights(m):
    # Xavier initialization for all linear layers
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

def get_net():
    # A simple MLP: 4 lagged inputs -> 10 hidden units -> 1 prediction
    net = nn.Sequential(nn.Linear(4, 10), nn.ReLU(), nn.Linear(10, 1))
    net.apply(init_weights)
    return net

loss = nn.MSELoss()
def train(net, train_iter, loss, epochs, lr):
    trainer = torch.optim.Adam(net.parameters(), lr)
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        # Report the loss over the full training set after each epoch
        print(f'epoch {epoch + 1}, loss:{d2l.evaluate_loss(net, train_iter, loss):f}')

net = get_net()
train(net, train_iter, loss, 5, 0.01)
epoch 1, loss:0.064925
epoch 2, loss:0.052238
epoch 3, loss:0.054105
epoch 4, loss:0.052671
epoch 5, loss:0.049820

Prediction

# One-step-ahead prediction: every input window consists of true observations
onestep_preds = net(features)
d2l.plot([time, time[tau:]], [x.detach().numpy(), onestep_preds.detach().numpy()], 'time', 'x',
         legend=['data', '1-step preds'], xlim=[1, 1000], figsize=(6, 3))

The training loss is small, so we might expect the model to perform well. The plot above checks its ability to predict just one step into the future: a one-step-ahead prediction, where every input window still consists of true observations. To look further ahead, beyond the first n_train + tau observed points, the model must instead consume its own earlier predictions as inputs, which gives the multi-step predictions computed below:

multistep_preds = torch.zeros(T)
# Seed with the observed data up to the end of the training region
multistep_preds[:n_train + tau] = x[:n_train + tau]
# Beyond that, each prediction is fed back in as an input for the next one
for i in range(n_train + tau, T):
    multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))
d2l.plot([time, time[tau:], time[n_train + tau:]], [
    x.detach().numpy(),
    onestep_preds.detach().numpy(), multistep_preds[n_train + tau:].detach().numpy()],
    'time', 'x', legend=['data', '1-step preds', 'multistep preds'],
    xlim=[1, 1000], figsize=(6, 3))