
This article was published in the Cloud+ Community column by Dr. Jinchao Zhang.

The Google Tensor2Tensor system is a powerful deep learning system that works well on many tasks. In particular, for machine translation, a single model can outperform the ensembles of models reported by previous approaches. The model structure, training, and optimization techniques of this system can be applied to a company's product line and translated directly into productivity. This article takes a thorough look at the Tensor2Tensor system, from the model to the code, in the hope of providing some useful information.

Chapter 1: Overview

Tensor2Tensor (T2T) is a deep learning system based on TensorFlow, developed by the Google Brain team and released on GitHub. The system was originally intended to model sequence-to-sequence (Seq2Seq) problems entirely with the attention mechanism, corresponding to the paper “Attention Is All You Need”; this work has the memorable name “Transformer.” As the system expanded, it came to support more and more functions, including image classification, language modeling, sentiment analysis, speech recognition, text summarization, and machine translation. T2T performs well on many tasks, its models converge quickly, and the engineering quality of the TensorFlow implementation is also very good. It is a system worth using and learning from.

If you want to use the T2T system quickly from the perspective of engineering application, you only need a preliminary understanding of the model. Read the Walkthrough document and you will soon be able to train models and decode data. This is what the system aims to achieve, namely lowering the threshold for using deep learning models. The data processing, models, hyperparameters, and computing devices are all highly encapsulated; to run the system, you only need to give the data path, specify the model and hyperparameter set to use, and describe the computing devices.

If you want to get into the implementation details of the system, do secondary development on top of it, or implement some research ideas, you need to spend a certain amount of time and energy studying the model and the code. T2T is a complex system. I recently studied the model and the code implementation in a comprehensive way, and at the same time stripped out and refactored the code involved in the sequence-to-sequence functionality, which cost a lot of time. Since my own research is in natural language processing, this article focuses on the Transformer model. On the one hand, I write this article to summarize and record what I gained in the process; on the other hand, I share my understanding of T2T, hoping to provide some useful information to readers.

Chapter 2: Sequence-to-sequence tasks and the Transformer model

2.1 Sequence-to-sequence tasks and the Encoder-Decoder framework

Sequence-to-sequence (Seq2Seq) is a common class of tasks in natural language processing. It mainly covers text generation tasks such as machine translation, text summarization, lyrics/story generation, and dialogue bots. The most representative task is machine translation, which maps a sequence in one language to a sequence in another. For example, in a Chinese-English machine translation task, the model converts a Chinese sentence (a word sequence) into an English sentence (a word sequence).

At present, the Encoder-Decoder framework is the mainstream model for solving sequence-to-sequence problems. The Encoder compresses the source sequence, and the Decoder generates the target sequence based on that compressed representation. The advantage of this structure is that it enables end-to-end modeling between the two sequences: all parameters of the model are trained under a single objective function, and the model performs well. Figure 1 shows the structure of the Encoder-Decoder model, depicting machine translation as a bottom-up process.

The Encoder and Decoder can use neural networks with different structures, such as RNNs and CNNs. An RNN compresses the sequence step by step over time. When using RNNs, a bidirectional structure is generally adopted: one RNN compresses the elements of the sequence from left to right, another from right to left, and the two representations together serve as the distributed representation of the sequence. When using CNNs, a multi-layer structure is generally adopted to move from local representations of the sequence to a global representation. Using an RNN to model a sentence can be regarded as a time-series view, while using a CNN can be regarded as a structured view. Sequence-to-sequence models based on RNNs include RNNSearch, GNMT, etc., while those based on CNNs include ConvS2S, etc.

2.2 Neural network models and long-distance dependencies in language

Transformer is a new way of modeling sequences. The sequence-to-sequence model still adopts the classical Encoder-Decoder structure described above; the difference is that it no longer uses an RNN or CNN as the sequence modeling mechanism, but uses self-attention instead. The theoretical advantage of this mechanism is that it is easier to capture long-distance dependencies. The notion of “long-distance dependency” can be understood as follows: 1) a word is a symbol that can express a variety of semantic information (the ambiguity problem); 2) the semantics of a word are determined by its context; 3) some words require only a small window of context to determine their meaning (short-distance dependence), while others require a large window (long-distance dependence).

For example, consider these two sentences, which in the original Chinese both use the same word, 杜鹃: “There are a lot of 杜鹃 on the mountain; when spring comes, they bloom all over the mountain, which is very beautiful.” “There are a lot of 杜鹃 on the mountain; when spring comes, their calls fill the mountains, which is very graceful.” In the first sentence 杜鹃 refers to a flower (azalea), and in the second to a bird (cuckoo). In a machine translation task, it is hard to translate this word correctly without looking at words far away from it. This is a deliberately obvious example, where you can clearly see the distance dependencies between words. Of course, the vast majority of word meanings can be determined within a relatively small context window, and examples like the one above account for a small proportion of language. We expect the model to learn both short-distance and long-distance dependencies well.

So why can self-attention in Transformer, in theory, capture such distance-dependent knowledge better? Let us take an intuitive look at the interaction distance between any two words under the three sequence modeling approaches: RNN, CNN, and self-attention. Figure 2 shows sequence modeling with a bidirectional RNN. Since the elements of the sequence are processed one after another, the interaction distance between two words can be thought of as their relative distance: the interaction distance between w1 and wn is n-1. An RNN with a gating mechanism can, in theory, selectively store and forget historical information and performs better than a plain RNN, but this capacity is fixed once the number of gate parameters is fixed. As sentences grow longer and relative distances increase, there is an obvious theoretical ceiling.

Figure 3 shows sequence modeling with a multi-layer CNN. A CNN unit in the first layer covers a small semantic context, a unit in the second layer covers a larger one, and so on: the deeper the CNN unit, the larger the semantic context it covers. A word first interacts with nearby words in the bottom CNN units, and only interacts with more distant words in higher-level units. The multi-layer CNN structure therefore reflects a local-to-global feature extraction process. The interaction distance between two words grows with their relative distance; words that are far apart can only meet at a high-level CNN node, and a lot of information may be lost along the way.

Figure 4 shows sequence modeling based on the self-attention mechanism. Note that every word in the sentence layer is fully connected to the nodes in the first self-attention layer, and the nodes of the first and second self-attention layers are also fully connected. In this modeling approach, the interaction distance between any two words is 1, regardless of their relative distance. The semantic representation of each word is thus determined with respect to all the words in the sentence, and stacking multiple self-attention layers makes this global interaction more complex and able to capture more information.

In conclusion, when modeling sequences, the self-attention mechanism has a better theoretical basis for capturing long-distance dependencies.

2.3 Formal expression of self-attention mechanism

The previous section introduced the benefits of the self-attention mechanism; this section introduces its mathematical formulation. Let us start with the attention mechanism in general. Attention can be thought of as a query mechanism that uses a query to retrieve information from a memory region. The query is represented as key_q, and the memory is a set of key-value pairs with M entries, the i-th entry being <key_m[i], value_m[i]>. The correlation between the query and key_m[i] determines the weight of value_m[i] in the query result. Note that key_q, key_m, and value_m are all vectors.

The attention computation can be summarized in three steps: 1) compute the correlation between the query and each key_m in memory; 2) normalize all correlation scores into probabilities with the softmax function; 3) take the weighted average of all value_m in memory according to these probabilities to obtain the final query result. The computation can be formalized as:
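In the notation above (with f denoting a relatedness function, discussed next), one standard way to write the three steps is:

$$
e_i = f(\mathrm{key}_q,\, \mathrm{key}_m[i]),\qquad
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{M}\exp(e_j)},\qquad
\mathrm{Attention}(\mathrm{key}_q,\, \mathrm{memory}) = \sum_{i=1}^{M}\alpha_i\,\mathrm{value}_m[i]
$$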

Two commonly used relatedness functions are additive and dot-product (multiplicative). The additive form passes the query and key through a feed-forward layer and then a linear transformation to produce a real value; the multiplicative form is simply the dot product of the two vectors, which also yields a real number.
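One common parameterization of the two forms (v, W, U, and b are learned parameters; these are generic textbook forms, not copied from the T2T code):

$$
f_{\mathrm{add}}(q, k) = v^{\top}\tanh(Wq + Uk + b),\qquad
f_{\mathrm{dot}}(q, k) = q^{\top}k
$$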

In the Encoder-Decoder framework, the attention mechanism is generally used to connect the Encoder and the Decoder: the Decoder state serves as the query (key_q), and the distributed representation of the source sentence serves as the memory, from which the relevant source-language information is retrieved to generate the next target-language word. In this mechanism, key_m and value_m in memory are identical. In the self-attention mechanism, each word takes its own embedding as the query and queries the memory formed by the embeddings of all the words; the query result becomes the new representation of that word. If the sentence length is n, there are n such queries, and they can all be carried out in parallel. If the relatedness function is multiplicative, the whole query process is a matrix multiplication, which can be formalized as:
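In matrix form (the scaled variant actually used in Transformer is introduced in Section 2.4.1):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}\right)V
$$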

In self-attention, Q = K = V: each is the matrix formed by the word vectors of all the words in the sentence.
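A minimal NumPy sketch of this computation (illustrative only; names are my own, not taken from the T2T codebase):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Plain (unscaled) self-attention: X is an [n, d] matrix of word vectors.

    Q = K = V = X, so every word queries the memory formed by all words
    and returns a new [n, d] representation.
    """
    Q = K = V = X
    scores = Q @ K.T            # [n, n] pairwise relatedness (dot products)
    weights = softmax(scores)   # each row sums to 1
    return weights @ V          # weighted average of all word vectors

# Example: a toy "sentence" of 5 words with 8-dimensional embeddings.
X = np.random.randn(5, 8)
print(self_attention(X).shape)  # (5, 8)
```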

To sum up, self-attention is a sequence modeling method in which, when building the distributed representation of a sentence, every word interacts directly with every other word.

2.4 “Attention is All You Need”

“Attention Is All You Need” describes a sequence-to-sequence model based on self-attention called “Transformer.” The model pushed the BLEU score of the WMT2014 English-German translation task to a new high and came close to the best previously reported results on the English-French task, and this is the performance of a single Transformer model. The best previously reported results were based on ensembles, which require training multiple models and combining them at the end. The model has also been applied to English constituency parsing, where its performance is basically close to the best previously reported models. It also converges quickly: on the English-French training set of 36 million sentence pairs, it needs only 8 GPUs and 3.5 days to converge.

In fact, such good performance is not due to the self-attention mechanism alone. Many effective strategies are used in the Transformer model to give it stronger data-fitting ability and faster convergence. The entire Transformer model is a complete solution, not just an improvement to the sequence modeling mechanism. Let us go through these strategies one by one.

2.4.1 Variants of the self-attention mechanism

First, the self-attention mechanism in Transformer. The basic form of self-attention was described above, but the version in Transformer is a new variant, in two respects: one is the addition of a scaling factor, the other is the introduction of multi-head attention.

The scaling factor appears in the attention formula as a denominator, the square root of the vector dimension, so as to prevent a large dimensionality from producing overly large dot products, which would push the softmax into its saturation region and cause vanishingly small gradients. The formula for self-attention in Transformer is as follows:
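For reference, as given in the paper, with d_k the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$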

The multi-head mechanism introduces multiple groups of parameter matrices that linearly transform Q, K, and V separately; self-attention is computed for each group, and all the results are then concatenated as the final self-attention output. This description may not be easy to follow, but a look at the formula and the schematic diagram makes it clear, as follows:
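The formulas from the paper (W_i^Q, W_i^K, W_i^V, and W^O are learned projection matrices):

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
$$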

In this way, the model has multiple sets of independent attention parameters, which can theoretically enhance the power of the model.

2.4.2 Positional Encoding

The way the self-attention mechanism models sequences is neither the sequential view of RNNs nor the structured view of CNNs, but a bag-of-words view. More precisely, the mechanism sees the sequence as flat, because even words that are far apart have an interaction distance of 1 under self-attention. In this modeling approach, the relative distance between words is lost: for example, the three sentences “the cow ate the grass,” “the grass ate the cow,” and “ate the cow the grass” would produce identical representations for each word.

To solve this problem, Transformer maps the position of each word in the sentence to a vector and adds it to the word's embedding. This idea is not new; the CNN model has the same difficulty in modeling relative position (ordering information), and Facebook proposed a positional encoding method for it. A direct approach is to model the absolute position in the embedding, i.e., map the index i of word w_i to a vector and add it to the word's embedding. The disadvantage is that it can only model sequences of bounded length. The Transformer paper proposes a very novel way of modeling ordering information, which uses the periodicity of trigonometric functions to model the relative positions between words. Concretely, the absolute position is used as a variable inside trigonometric functions, with the following formula:
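The formulas from the paper, where pos is the position, i the dimension index, and d_model the embedding size:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$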

The design of the formula is quite a priori, especially the denominator, which is not easy to explain. In my personal view, on the one hand trigonometric functions have good periodicity, i.e., the function value repeats after a certain distance, which can be used to model relative distance; on the other hand, the value of a trigonometric function lies in [-1, 1], which is a well-behaved range for values added to embedding elements.

2.4.3 Multi-layer structure

The multi-layer structure in Transformer is very powerful and uses several techniques that had already been proven to work, including residual connections and layer normalization, in addition to the self-attention and feed-forward sub-layers. Figure 6 shows the structure of one Encoder layer and one Decoder layer of Transformer.

In Figure 6, the Nx on the left denotes one Encoder layer, which contains two sub-layers: the first is the multi-head self-attention layer, and the second is a feed-forward layer. There is a residual connection between the input and output of each sub-layer, which in theory lets gradients propagate back well, and layer normalization speeds up model convergence. The computation of the self-attention sub-layer has been covered at length and will not be repeated here. The feed-forward sub-layer consists of two linear transformations and one ReLU activation, with the following formula:
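As given in the paper, with two linear transformations and a ReLU in between:

$$
\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2
$$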

The accompanying page describes this calculation as a variant of attention.

In Figure 6, the right side is the structure of one Decoder layer, which has three sub-layers. The first is a self-attention layer that models the target-side sentence generated so far; during training, a mask matrix is needed to control this self-attention computation so that the prediction at each position attends only to the preceding words. The specific implementation will be explained later in the code walkthrough. The second sub-layer is the attention between the Encoder and Decoder, which looks up relevant semantic information in the source language; this computation is the same as attention in other sequence-to-sequence models, and Transformer uses the dot-product form. The third sub-layer is the feed-forward layer, exactly the same as the one in the Encoder. Each sub-layer again has a residual connection and layer normalization to speed up convergence.

This multi-layer, multi-sub-layer design gives the model enough capacity while remaining trainable, achieves very strong results, and is worth learning from.

2.4.4 Optimization method and regularization strategies

The model is trained with Adam, and the paper proposes a learning-rate schedule with a warm-up phase, given by the formula:
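The formula from the paper:

$$
lrate = d_{\text{model}}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)
$$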

The formula is a priori and looks complicated, but the logic is clear. You set a warmup_steps hyperparameter beforehand. While the training step step_num is smaller than this value, the second term inside the min determines the learning rate, which is a linear function of step_num with positive slope. Once step_num exceeds warmup_steps, the first term determines the learning rate, and the formula becomes a negative-exponent power function. Overall, the learning rate first increases and then decreases, which helps the model converge quickly.
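A small sketch of the resulting schedule (my own illustrative code; the d_model and warmup_steps values are just examples):

```python
def noam_lr(step_num, d_model=512, warmup_steps=4000):
    """Learning rate from the Transformer paper: linear warm-up, then
    decay proportional to the inverse square root of the step number."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# Rises linearly for the first 4000 steps, then decays as 1/sqrt(step).
for step in (1, 1000, 4000, 16000, 64000):
    print(step, round(noam_lr(step), 6))
```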

Two important regularization methods are also used in the model. One is dropout, applied at the output of each sub-layer and inside the attention computation. The other is label smoothing: during training, the target for the cross-entropy loss is no longer a standard one-hot vector; instead, a small non-zero value is placed at every zero position. This makes the model more robust and improves its BLEU score. To some extent, this idea also mitigates the exposure-bias problem between training and decoding.
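A sketch of what label smoothing does to the target distribution (a common formulation in which the removed probability mass is spread over the other classes; the T2T implementation may differ in details):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Replace a one-hot target with a smoothed distribution: the correct
    class gets 1 - epsilon, and the remaining mass is spread uniformly
    over the other classes."""
    vocab_size = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (vocab_size - 1)

y = np.eye(5)[2]            # one-hot target for class 2 in a 5-word vocabulary
print(smooth_labels(y))     # approximately [0.025 0.025 0.9 0.025 0.025]
```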

2.4.5 Summary of this chapter

The strong performance of Transformer comes not only from self-attention but from the combination of the strategies above. The researchers who designed the system have a very deep understanding of, and keen instincts for, deep learning models and optimization algorithms, which is worth learning from. The code implementation of Transformer is of high quality, but parts of it are inconvenient to read and understand. The following chapters explain the code implementation of Transformer in detail, lay out the overall structure, and point out the difficult modules.

Chapter 3: In-depth analysis of the Tensor2Tensor implementation

The Tensor2Tensor system has some aspects that are difficult to use and understand and that take time to think through and digest. Based on my own understanding, I write down some of the points that cost me time.

3.1 Usage

Tensor2Tensor is easy to use. For a problem the system supports, you only need to provide the following information: data, problem, model, hyperparameter set, and devices. The implementation adopts the factory pattern from design patterns: given a problem name, the corresponding processing class is returned; given a hyperparameter-set name, the corresponding hyperparameter object is returned. A key file for this is utils/registry.py. At system startup, all problems and hyperparameter sets are registered in the registry and stored in _MODELS, _HPARAMS, and _RANGED_HPARAMS for later lookup.
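For example, adding a new hyperparameter set only requires a registration decorator. A minimal sketch, assuming the decorator API of utils/registry.py and the existing transformer_base hparams function; the set name and the override below are hypothetical:

```python
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_my_experiment():
    """Registered under its own function name, so the trainer can find it
    in the registry's internal tables when given
    --hparams_set=transformer_my_experiment."""
    hparams = transformer.transformer_base()  # start from an existing set
    hparams.num_hidden_layers = 4             # override a value for the experiment
    return hparams
```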

This section mainly introduces the usage and implementation of the sequence-to-sequence functionality. The operation of the system is divided into three stages: data processing, training, and decoding, with three corresponding entry points: t2t-datagen, t2t-trainer, and t2t-decoder.

The data processing stage includes the following steps:

1. (Download) Read the training and development data. If you want to use your own data, you can specify it in the problem definition.

2. (Read) Construct the vocabulary. You can use your own pre-constructed vocabulary; the system also provides a way to build a BPE vocabulary. One implementation detail to note: when the system extracts the BPE vocabulary, a parameter makes it skip the full data by default, and it iterates several times to obtain a vocabulary whose size is closest to the preset target. With large data volumes this iterative process can be very slow.

3. An EOS_ID is appended to each sentence, each pair of parallel sentences is turned into a dict object ({'inputs': value, 'targets': value}), and all objects are serialized and written to files for later training and evaluation.

The model training process is mainly managed through TensorFlow's high-level APIs. You only need to specify the data, problem name, model name, hyperparameter-set name, and device information to run it. A key file is utils/trainer_lib.py, where the Experiment, Estimator, Monitor, and so on are built to control the training flow. Users mainly set parameters of the training process, such as the maximum number of training steps, the evaluation frequency, and the evaluation metrics. Hyperparameters can use an existing set in the system or be passed in as a string. For simple tasks you rarely need to adjust hyperparameters, because the hyperparameter sets in the system have mostly been experimentally validated. Note that if batch_size is too large, GPU memory may be insufficient and the program will fail. The continuous_train_and_eval mode is generally used so that training and evaluation alternate and the model's performance can be monitored at any time.

The decoding process can work on a whole file or on a Dataset; the system also provides a server mode for online serving. There is nothing particularly noteworthy here.

3.2 In depth

3.2.1 Characteristics of the Tensor2Tensor implementation

The following characteristics of the Tensor2Tensor implementation can cause problems when you try to understand it:

1. The system supports multiple tasks, which leads to a complex code structure. To accommodate the overall architecture, the implementation uses all sorts of encapsulation, inheritance, and polymorphism. You may only want to use one feature and understand the code behind it, but you have to wade through a lot of irrelevant code.

2. The system is built on TensorFlow's high-level APIs, which manage training and prediction. The use of Experiment, Monitor, Estimator, and Dataset objects hides most of the control flow; for users focused on applications this may be a good thing: set the parameters and it runs. However, for developers who want to know more, this part of TensorFlow is thinly documented and what documentation exists is not clear, so much of the time you have to read the source code to know whether an experiment runs as you expect. This approach also makes it less convenient to find bugs and debug.

3. Some call chains are deep, presumably for the sake of overall structure and extensibility. The result is that method A, which does a little bit of work, has to call method B, which in turn calls method C; each method contains only a few lines of code, and some are even no-ops.

4. Multi-level inheritance and polymorphism also reduce readability. To trace a method of a class you need to look at the parent of its parent class, and so on. Parent and child methods call back and forth, and methods with the same name are overridden, so it takes some time to determine which class a given method name actually belongs to.

5. Understanding the code requires connecting it back to the model. You need a solid grasp of the model logic, but while reading the code two kinds of problems arise: first, the code implements the functionality in the paper but not the paper's original formulas verbatim; formulas may be transformed to avoid overflow or for efficiency. Second, some code is simply inconsistent with the presentation in the paper.

3.2.2 General logic module

In general, the T2T code is logically divided into three major modules:

1. **Problem definition and data management.** This module defines problems and processes data; for example, a translation problem defines the methods for extracting the vocabulary and constructing training samples.
2. **Model definition and computation-graph construction.** This module defines model properties and the structure of the computation graph.
3. **Experiment flow control and parallelization.** This module controls the experiment flow, sets up the available computing devices, and provides methods for running models in parallel.

We will not trace through the code line by line; instead, we will go over some of the problems you may run into when reading the Tensor2Tensor code, showing where the important functionality lives and the logic of its implementation.

1. **Factory pattern.** The system uses the factory pattern to manage problems, models, hyperparameters, modalities, and other modules. The key file registry.py is at the core of the system's overall management and scheduling. If you want to add new problems, models, hyperparameter sets, or modalities to the system, you must register them by adding a decorator before the class; otherwise the system cannot find the newly added module.
2. **Problem class.** The Problem class in data_generators/problem.py is the base class for all problems. As mentioned earlier, the class hierarchy makes the code hard to read; for example, a translation problem's inheritance chain looks like Problem >> Text2TextProblem >> TranslateEndeWmtBpe32k >> TranslateEndeWmt32k, which causes some reading difficulty. In general, a sequence-to-sequence problem should provide: data file information; the vocabulary file name, type, and size; a method to construct the vocabulary; methods to serialize the training and development data; a method that reads the data files and builds the input stream (input_fn) for the model (Estimator); and a method that sets the problem's evaluation metrics. In short, problem attribute definition, construction of training and evaluation samples, and data processing and reading are all provided by the classes and methods of the Problem hierarchy.
3. **Vocabulary objects (TextEncoder).** The system offers a variety of TextEncoder objects supporting character, subword/BPE, and token levels. TextEncoder's main job is to build a vocabulary and map symbols to ids. T2T includes a way to build a BPE vocabulary but no word-piece vocabulary, which reflects a difference between the T2T and GNMT research teams, who have been alternately raising the state of the art on machine translation tasks. The build_to_target_size() method of SubwordTextEncoder builds the BPE vocabulary; instead of controlling the vocabulary size through the number of merge iterations as Sennrich does, it uses binary search to find the minimum token count that brings the vocabulary closest to the preset size.
4. **T2TModel class.** The T2TModel class in utils/t2t_model.py is the base class for model functionality; it inherits from Layer, and the Transformer class inherits from it. The model's computation graph is defined in T2TModel: given the features, the model computes logits and loss, then computes gradients from the loss and calls the optimizer for back-propagation and parameter updates. The ultimate purpose of building the graph is to construct a tf.estimator.EstimatorSpec object; all of the model's graph computation can be understood as being expressed in this object. T2TModel can return three kinds of EstimatorSpec, for training, evaluation, and decoding. Training supports data parallelism: the computation graph is run on multiple data shards at the same time and the resulting losses are averaged, a synchronous parallel training method. Decoding methods are also provided in T2TModel.
5. **Transformer class.** The Transformer class in models/transformer.py inherits from T2TModel and provides the methods its parent needs when building the graph. The encode method uses the Encoder to compress the source side into a representation, and the decode method uses the Decoder to generate the target side. transformer.py also contains multiple hyperparameter sets to choose from, and the feed-forward sub-layer of the model is implemented in this file as well (transformer_ffn_layer).
6. **Data parallelism classes.** The main job of devices.py and expert_utils.py is to list the names of the available devices based on the parallelism parameters given by the user, and then provide a method that calls those devices and executes functions on them in parallel.
7. **Experiment flow control.** Experiment flow control uses TensorFlow's high-level API objects: Experiment, Estimator, and Dataset. They can be understood as follows: a) Experiment is a running experiment, controlling the experimental process and feeding data to the model; b) Estimator is the concrete model object, covering training, evaluation, and decoding; c) Dataset provides the data streams that read the data files during the experiment.
8. **Experiment object.** The concept of an Experiment is easier to grasp from the parameters required to initialize it, shown in the figure below. The Experiment object needs the various step parameters for iteration, an Estimator object, and two input-stream functions. When it runs, it feeds the data to the Estimator and controls the training and iteration process.

9. **Estimator object.** It can be understood as the model object: training, evaluation, and decoding are all performed through the Estimator. Its most important parameter is model_fn, the entry point of the function that performs training, evaluation, and decoding; the three entries correspond to three EstimatorSpec objects, as shown in Figures 9 and 10.

As can be seen from Figure 10, the EstimatorSpec used for training must describe the relationship between features and (loss, train_op) in the computation graph; the EstimatorSpec used for evaluation must describe the relationship between features and (loss, eval_metrics_ops); and the EstimatorSpec used for prediction (decoding) must describe the relationship between features and predictions.

10. **Dataset object.** This object is the data stream that reads files and constructs the training and evaluation inputs. Training and evaluation correspond to two different input streams, as shown in Figure 11.

11. **Implementation of positional encoding.** There are formula transformations and inconsistencies between the paper and the code that may cause confusion, so they are pointed out here. The formula for the argument of the trigonometric functions in the positional encoding is as follows:

In the code, the formula is transformed to avoid the risk of numerical overflow. The transformation is as follows:
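My reconstruction of the transformation referred to here: the power of 10000 is rewritten as an exponential of a logarithm, so the code computes inverse timescales with exp and log instead of raising 10000 to a power directly:

$$
\frac{1}{10000^{2i/d_{\text{model}}}} = \exp\!\left(-\frac{2i}{d_{\text{model}}}\cdot \ln 10000\right)
$$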

It should also be pointed out that the paper's description of alternating sine and cosine according to the parity of the dimension index is not what the code does. Instead, the sine function is applied to the first half of the dimensions and the cosine function to the second half, without regard to parity.
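A NumPy sketch of this detail, modeled loosely on T2T's get_timing_signal_1d (simplified; the function and argument names here are my own):

```python
import numpy as np

def positional_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal position signal, T2T style: sin on the first half of the
    channels and cos on the second half (rather than alternating by parity)."""
    position = np.arange(length, dtype=np.float64)           # [length]
    num_timescales = channels // 2
    # Geometric progression of timescales, computed via exp/log to avoid
    # raising 10000 to large powers directly.
    log_timescale_increment = (np.log(max_timescale / min_timescale)
                               / max(num_timescales - 1, 1))
    inv_timescales = min_timescale * np.exp(
        -np.arange(num_timescales, dtype=np.float64) * log_timescale_increment)
    scaled_time = position[:, None] * inv_timescales[None, :]  # [length, channels/2]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

signal = positional_signal(length=50, channels=512)
print(signal.shape)  # (50, 512)
```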

12. **Using the number of tokens as the batch size.** Compared with counting sentences, counting tokens keeps the GPU memory footprint of each batch more even, so memory usage no longer swings up and down with the training data, exhausting memory and crashing the program.

13. **How the masks are implemented.** Since training proceeds batch by batch, the sentence length of a batch is that of its longest sentence, and the other sentences are padded to that length. If the padding positions were left untouched they would introduce noise, so a mask is used to keep them from contributing to the computation. In the attention mechanism the mask is very simple: a large negative number is added to the padding positions before the softmax, forcing their probabilities to be 0 afterwards. For example, softmax([1, 1, 1]) is about [0.33, 0.33, 0.33], while softmax([1, 1, -1e9]) is about [0.5, 0.5, 0], which effectively masks the third element. When modeling the target sequence, each attention step must see only the preceding words, so a mask is needed there as well; it takes the form of an upper-triangular matrix whose non-zero elements are extremely negative.
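A NumPy sketch of both masks (illustrative only; T2T builds the corresponding bias tensors as TensorFlow ops):

```python
import numpy as np

NEG_INF = -1e9

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 1) Padding mask: force attention weights on padded positions to ~0.
scores = np.array([1.0, 1.0, 1.0])
print(softmax(scores))                                  # ~[0.33 0.33 0.33]
print(softmax(scores + np.array([0.0, 0.0, NEG_INF])))  # ~[0.5 0.5 0.0]

# 2) Target-side (causal) mask: an upper-triangular matrix of large negative
# values added to the [n, n] score matrix, so each position only attends to
# itself and the positions before it during training.
n = 4
causal_bias = np.triu(np.full((n, n), NEG_INF), k=1)
print(causal_bias)
```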

14. **Batch-based decoding.** When decoding a whole file, sentences are decoded in parallel in batches. The trick is to sort the sentences by length first, form batches starting from the longest sentences, translate them, and then restore the original order. This also surfaces out-of-memory errors early: if there is enough GPU memory for the batch of longest sentences, the remaining batches will not be a problem.

Conclusion

This article is an in-depth interpretation of Google's Tensor2Tensor system. The system involves many aspects, and I still need to study it further.


This article has been authorized by the author for publication by the Tencent Cloud+ Community. Original link: https://cloud.tencent.com/developer/article/1116709?fromSource=waitui

