Preface
Recently I have been looking into PaddlePaddle, so I decided to go through the source code of Baidu's ERNIE. I had not looked at ERNIE 2.0 or ERNIE-tiny before; my overall impression is that the code is very similar to BERT, though I don't know how it will change in later updates. I plan to write up a summary like the one below, so anyone who happens to be studying Paddle or ERNIE is welcome to join the discussion.
Original post, 2019.05.16:
The BERT model has been out for a while. I have read the paper and some blog posts about it (see "Interpreting BERT, the NLP killer model" [1]), but I had not looked carefully at how the source code implements it. I finally took the time to read it, and I am writing my notes down here to discuss with you.
Note that this source-code reading series assumes some prior knowledge of NLP, such as the attention mechanism and the Transformer architecture, as well as Python and TensorFlow fundamentals. The principles behind BERT are not the focus of this article.
For background material, see this collection of BERT resources: a summary of BERT-related papers, articles and code resources [2].
Today we look at the most important part of BERT's implementation, the BertModel class. The code is located in
- the modeling.py module [3]
Besides the commentary outside the code blocks, there are also comments inside the code blocks.
Please do point out any interpretation that is incorrect.
1. Configuration class (BertConfig)
This part of the code mainly defines some default parameters of the BERT model, in addition to some file handling functions.
import copy
import json
import math

import six
import tensorflow as tf  # TensorFlow 1.x API (tf.gfile, tf.get_variable, tf.layers)


class BertConfig(object):
  """Configuration class for BERT models."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
Parameter meanings:
- vocab_size: vocabulary size
- hidden_size: number of neurons in each hidden layer
- num_hidden_layers: number of hidden layers in the Transformer encoder
- num_attention_heads: number of heads in multi-head attention
- intermediate_size: number of neurons in the encoder's "intermediate" layer (i.e. the feed-forward layer)
- hidden_act: activation function of the hidden layers
- hidden_dropout_prob: dropout rate of the hidden layers
- attention_probs_dropout_prob: dropout rate of the attention probabilities
- max_position_embeddings: maximum number of positions that can be encoded
- type_vocab_size: vocabulary size of token_type_ids
- initializer_range: stddev of the truncated_normal_initializer used to initialize the weights
A note on type_vocab_size: token_type_ids distinguish Segment A from Segment B in the Next Sentence Prediction task. The constructor default above is 16, but the downloadable bert_config.json file also specifies this value, and there it should be 2. Refer to this issue [4].
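To make the from_dict / to_json_string round trip concrete, here is a minimal sketch in plain Python. MiniConfig is a hypothetical stand-in that keeps only three of the fields above, and the dictionary values are illustrative rather than copied from any particular bert_config.json.

```python
import copy
import json


class MiniConfig(object):
    """Hypothetical stand-in for BertConfig with only three fields."""

    def __init__(self, vocab_size, hidden_size=768, type_vocab_size=16):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.type_vocab_size = type_vocab_size

    @classmethod
    def from_dict(cls, json_object):
        # Start from the constructor defaults, then overwrite every key
        # that the dictionary provides.
        config = cls(vocab_size=None)
        for key, value in json_object.items():
            config.__dict__[key] = value
        return config

    def to_json_string(self):
        return json.dumps(copy.deepcopy(self.__dict__),
                          indent=2, sort_keys=True) + "\n"


# Illustrative values; keys missing from the dict keep their defaults.
loaded = MiniConfig.from_dict({"vocab_size": 30522, "type_vocab_size": 2})
round_trip = MiniConfig.from_dict(json.loads(loaded.to_json_string()))
```

The point to notice is that a value present in the JSON (type_vocab_size here) silently overrides the constructor default, which is why the default of 16 rarely matters in practice.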
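2. Word embedding lookup (embedding_lookup)
Given the input word ids, this function returns the embedded tensor together with the embedding table. The lookup can be done either with a one-hot matrix multiplication or with tf.gather().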
def embedding_lookup(input_ids,  # word ids: [batch_size, seq_length]
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  # This function assumes the input has shape [batch_size, seq_length, num_inputs].
  # If the input is 2D [batch_size, seq_length], expand it to
  # [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])  # [batch_size*seq_length*num_inputs]
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:  # look up rows by index
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  # output: [batch_size*seq_length*num_inputs, embedding_size]
  # reshape to: [batch_size, seq_length, num_inputs*embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
Parameter meanings:
- input_ids: word ids, shape [batch_size, seq_length]
- vocab_size: size of the embedding vocabulary
- embedding_size: embedding dimension
- initializer_range: initialization range of the embedding table
- word_embedding_name: name of the embedding table
- use_one_hot_embeddings: whether to use the one-hot lookup; otherwise tf.gather() is used
- return: the embedded tensor of shape [batch_size, seq_length, embedding_size], plus the embedding table
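The two lookup paths give the same result; they only differ in how the rows are selected. The following is a small sketch in plain Python (lists instead of tensors, with made-up table values and hypothetical helper names) that mirrors the tf.gather() path and the one-hot matmul path side by side.

```python
def gather_rows(table, ids):
    """Index-based lookup, mirroring the tf.gather() path."""
    return [list(table[i]) for i in ids]


def one_hot_matmul(table, ids):
    """One-hot lookup: build a [len(ids), vocab_size] matrix and multiply."""
    vocab_size = len(table)
    width = len(table[0])
    one_hot = [[1.0 if col == i else 0.0 for col in range(vocab_size)]
               for i in ids]
    return [
        [sum(row[v] * table[v][d] for v in range(vocab_size))
         for d in range(width)]
        for row in one_hot
    ]


# Toy embedding table: vocab_size=4, embedding_size=2 (made-up values).
table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
flat_ids = [2, 0, 3]

by_index = gather_rows(table, flat_ids)
by_one_hot = one_hot_matmul(table, flat_ids)
```

Both by_index and by_one_hot hold rows 2, 0 and 3 of the table. The one-hot version does far more arithmetic, so it only makes sense when the vocabulary is small or when matrix multiplication is much cheaper than gathering on the target hardware.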
3. Embedding post-processing (embedding_postprocessor)
We know that the input of the BERT model has three parts: token embedding, segment embedding, and position embedding. In the previous section we only obtained the token embedding. This code adds the other two, applies layer normalization and dropout, and outputs the final embedding. Notice that in the Transformer paper the position embedding is a fixed value generated by sin/cos functions. In the implementation here, it is randomly initialized like an ordinary word embedding and is trainable. The author's reason for this choice may be that BERT's training data is much larger than the Transformer's, so the model can learn the position information by itself.
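For contrast with the learned table used in BERT, the fixed sin/cos encoding from the Transformer paper can be sketched as follows. This is a plain-Python illustration of the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name is hypothetical and the width is assumed to be even.

```python
import math


def sinusoidal_position_encoding(seq_length, width):
    """Fixed encoding: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(seq_length):
        row = []
        for i in range(0, width, 2):  # i plays the role of 2i in the formula
            angle = pos / (10000.0 ** (i / float(width)))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


fixed = sinusoidal_position_encoding(seq_length=3, width=4)
```

Nothing in this table is trained, whereas BERT's position_embeddings variable is created with a random initializer and updated by backpropagation like any other weight.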
def embedding_postprocessor(input_tensor,  # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,  # usually 2
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            # maximum position; must be >= max_seq_len
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]  # embedding_size

  output = input_tensor

  # Segment (token type) embedding
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # The token-type vocabulary is small, so a one-hot lookup is used here.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Position embedding
  if use_position_embeddings:
    # Make sure seq_length does not exceed max_position_embeddings.
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # The full table has shape [max_position_embeddings, width], but the
      # actual input sequence is usually shorter than max_position_embeddings,
      # so to speed up training only the first seq_length rows are sliced out.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # The word embedding tensor is [batch_size, seq_length, width], while
      # the position embedding is always [seq_length, width], so the two
      # cannot be added directly. The position embedding is reshaped to
      # [1, seq_length, width] so that the addition works by broadcasting.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output
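The addition of the three embeddings can be checked by hand on a tiny example. The sketch below uses plain Python lists and made-up numbers; add_embeddings is a hypothetical helper, and layer normalization and dropout are deliberately left out so that the sums stay easy to verify.

```python
def add_embeddings(token_emb, token_type_ids, type_table, position_table):
    """token_emb: [batch][seq][width]; returns token + segment + position."""
    output = []
    for sample, type_ids in zip(token_emb, token_type_ids):
        rows = []
        for pos, (vec, type_id) in enumerate(zip(sample, type_ids)):
            # position_table[pos] is shared by every sample in the batch,
            # which is exactly what the [1, seq_length, width] broadcast does.
            rows.append([t + s + p for t, s, p in
                         zip(vec, type_table[type_id], position_table[pos])])
        output.append(rows)
    return output


# batch_size=2, seq_length=2, width=2; all values are made up.
token_emb = [[[1.0, 1.0], [2.0, 2.0]],
             [[3.0, 3.0], [4.0, 4.0]]]
token_type_ids = [[0, 1], [0, 0]]
type_table = [[0.0, 0.0], [10.0, 10.0]]       # [type_vocab_size, width]
position_table = [[0.5, 0.5], [0.25, 0.25],   # [max_position, width];
                  [9.0, 9.0]]                 # the last row is never used

summed = add_embeddings(token_emb, token_type_ids, type_table, position_table)
```

Only the first seq_length rows of position_table are read, which corresponds to the tf.slice call in the function above.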
4. Construct attention_mask
The purpose of this part of the code is to construct the attention_mask that restricts which positions can be attended to. Because every sample is padded to a common length, self-attention must not attend to the padding positions. The inputs are from_tensor, of shape [batch_size, from_seq_length, ...] (in practice the padded input_ids), and to_mask, a mask vector of shape [batch_size, to_seq_length]. The output mask has shape [batch_size, from_seq_length, to_seq_length].
def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  # `to_mask` = [batch_size, 1, to_seq_length]
  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Broadcasting repeats the mask row for every `from` position:
  # `mask` = [batch_size, from_seq_length, to_seq_length]
  mask = broadcast_ones * to_mask
  return mask
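A small worked example makes the shape of the result easier to see. This is a plain-Python sketch with a hypothetical helper name; it reproduces the effect of broadcast_ones * to_mask by repeating each sample's mask row once per from position.

```python
def create_attention_mask(from_seq_length, to_mask):
    """to_mask: [batch][to_seq] -> [batch][from_seq][to_seq] of floats."""
    return [
        [[float(m) for m in sample_mask] for _ in range(from_seq_length)]
        for sample_mask in to_mask
    ]


input_mask = [[1, 1, 1], [1, 1, 0]]  # the second sample ends with padding
mask = create_attention_mask(from_seq_length=3, to_mask=input_mask)
```

For the second sample every row of the mask is [1.0, 1.0, 0.0], including the row that belongs to the padding position itself. Only attending to padding is blocked; what the padding position attends to is not restricted.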
5. Attention Layer
This part of the code implements multi-head attention, following the paper "Attention is All You Need". In terms of key-query-value attention, from_tensor provides the query and to_tensor provides the key and value; when the two are the same tensor, this is self-attention. For a more detailed introduction to attention, see "Understanding the principle and model of the Attention mechanism" [5].
def attention_layer(from_tensor,  # [batch_size, from_seq_length, from_width]
                    to_tensor,  # [batch_size, to_seq_length, to_width]
                    attention_mask=None,  # [batch_size, from_seq_length, to_seq_length]
                    num_attention_heads=1,  # number of attention heads
                    size_per_head=512,  # size of each head
                    query_act=None,  # activation for the query transform
                    key_act=None,  # activation for the key transform
                    value_act=None,  # activation for the value transform
                    attention_probs_dropout_prob=0.0,  # dropout of the attention layer
                    initializer_range=0.02,
                    # Whether to return a 2D tensor.
                    # If True, the output shape is
                    #   [batch_size*from_seq_length, num_attention_heads*size_per_head]
                    # If False, the output shape is
                    #   [batch_size, from_seq_length, num_attention_heads*size_per_head]
                    do_return_2d_tensor=False,
                    # If the input is 3D, batch is its first dimension. A 3D
                    # input may have been flattened to 2D, though, in which
                    # case batch_size and the sequence lengths must be given.
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    # [batch_size, num_attention_heads, seq_length, width]
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced in the comments below:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  # Flatten from_tensor and to_tensor into 2D matrices.
  from_tensor_2d = reshape_to_matrix(from_tensor)  # [B*F, hidden_size]
  to_tensor_2d = reshape_to_matrix(to_tensor)  # [B*T, hidden_size]

  # Feed from_tensor through a fully connected layer to get query_layer.
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # Feed to_tensor through a fully connected layer to get key_layer.
  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # Feed to_tensor through a fully connected layer to get value_layer.
  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # query_layer: [B*F, N*H] ==> [B, F, N, H] ==> [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # key_layer: [B*T, N*H] ==> [B, T, N, H] ==> [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Dot product of query and key, then scaling, as in the paper's formula.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # If an element of attention_mask is 1, then (1 - 1) * -10000 gives an
    # adder of 0. If it is 0, then (1 - 0) * -10000 gives an adder of -10000.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # The attention scores are generally not very large, so adding -10000 to
    # the positions where the mask is 0 is effectively minus infinity.
    attention_scores += adder

  # After softmax, minus infinity becomes 0, so positions where the mask is 0
  # receive no attention.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Applying dropout to attention_probs may look a little odd, but that is
  # what the original Transformer paper does.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
To sum up, the main flow of the attention layer is as follows:
- Validate the shapes of the input tensors and extract batch_size, from_seq_length, and to_seq_length;
- If the input is a 3D tensor, reshape it into a 2D matrix;
- Use from_tensor as the query and to_tensor as the key and value; passing them through fully connected layers gives query_layer, key_layer, and value_layer;
- Convert the tensors above to multi-head form with transpose_for_scores;
- Compute attention_scores and attention_probs according to the formula in the paper (pay attention to the attention_mask trick);
- Multiply the resulting attention_probs by value and return either a 2D or a 3D tensor.
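The steps above can be sketched for a single head in plain Python. This is only an illustration under simplifying assumptions: there is one sample and one head, the dense projections and dropout are skipped, the value vectors have the same size as the query vectors, and the function names and numbers are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(query, key, value, to_mask):
    """Single-head attention.

    query: [F][H], key and value: [T][H], to_mask: [T] with 1 = real token.
    Returns (context [F][H], attention_probs [F][T]).
    """
    size_per_head = len(query[0])
    context, all_probs = [], []
    for q in query:
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(size_per_head)
                  for k in key]
        # The adder trick: add 0 for real tokens and -10000 for padding.
        scores = [s + (1.0 - m) * -10000.0 for s, m in zip(scores, to_mask)]
        probs = softmax(scores)
        all_probs.append(probs)
        # Weighted sum of the value vectors.
        context.append([sum(p * v[d] for p, v in zip(probs, value))
                        for d in range(size_per_head)])
    return context, all_probs


# One query and three keys; the third position is padding (made-up numbers).
query = [[1.0, 0.0]]
key = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
value = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
context, probs = masked_attention(query, key, value, to_mask=[1, 1, 0])
```

Without the mask, the third key would have the largest score and dominate the result. With the adder applied, its probability underflows to zero and the large values in the third value vector never reach the context.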
6. Transformer
The following code is the core of the famous Transformer and can be thought of as an implementation of "Attention is All You Need". Please refer to the original paper [6] and the original code [7].
def transformer_model(input_tensor,  # [batch_size, seq_length, hidden_size]
                      attention_mask=None,  # [batch_size, seq_length, seq_length]
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,  # activation of the feed-forward layer
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):

  # Note: the final output has hidden_size dimensions, and there are
  # num_attention_heads heads, each of size size_per_head, so
  # hidden_size = num_attention_heads * size_per_head.
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The encoder uses residual connections, so the shapes must match.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # To avoid reshaping back and forth between 2D and 3D during training,
  # the 3D tensor is kept as a 2D matrix throughout.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        # multi-head attention
        attention_heads = []
        with tf.variable_scope("self"):
          # self-attention
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # If there are several heads, concatenate them.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Linear projection back to hidden_size,
        # then dropout + residual + norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # feed-forward
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Project the feed-forward output back to hidden_size with a linear
      # transformation, then dropout + residual + norm.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
This code is best read alongside the Transformer architecture figure from the original paper. Note that BERT uses only the encoder, so the decoder does not appear anywhere here.
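The "residual + norm" pattern that wraps both sub-layers of each encoder layer can be sketched in plain Python. This is a simplified illustration rather than the code from modeling.py: it works on a single vector, the layer norm has no learned scale or bias, the epsilon value is an assumption, dropout and the dense projections are omitted, and the two sub-layer functions are hypothetical stand-ins.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize one vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def residual_block(vec, sublayer):
    """layer_norm(sublayer(x) + x), used twice per encoder layer."""
    return layer_norm([s + x for s, x in zip(sublayer(vec), vec)])


def encoder_layer(vec, attention_fn, feed_forward_fn):
    attention_output = residual_block(vec, attention_fn)
    return residual_block(attention_output, feed_forward_fn)


def fake_attention(vec):
    """Stand-in for the self-attention sub-layer."""
    return [2.0 * x for x in vec]


def fake_feed_forward(vec):
    """Stand-in for the feed-forward sub-layer (a bare ReLU)."""
    return [max(0.0, x) for x in vec]


hidden = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):  # plays the role of num_hidden_layers
    hidden = encoder_layer(hidden, fake_attention, fake_feed_forward)
```

Because every sub-layer output is added to its own input before normalization, the input and output widths must be equal, which is the reason for the input_width != hidden_size check in transformer_model.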
7. Entry point (__init__)
This is the constructor of the BertModel class. With the building blocks from the previous sections, we can now assemble the BERT model.
def __init__(self,
             config,  # BertConfig instance
             is_training,
             input_ids,  # [batch_size, seq_length]
             input_mask=None,  # [batch_size, seq_length]
             token_type_ids=None,  # [batch_size, seq_length]
             use_one_hot_embeddings=False,  # one-hot lookup; otherwise tf.gather()
             scope=None):
  config = copy.deepcopy(config)
  if not is_training:
    config.hidden_dropout_prob = 0.0
    config.attention_probs_dropout_prob = 0.0

  input_shape = get_shape_list(input_ids, expected_rank=2)
  batch_size = input_shape[0]
  seq_length = input_shape[1]

  # Without an input mask, every position is treated as valid (all ones).
  if input_mask is None:
    input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

  if token_type_ids is None:
    token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

  with tf.variable_scope(scope, default_name="bert"):
    with tf.variable_scope("embeddings"):
      # word embedding
      (self.embedding_output, self.embedding_table) = embedding_lookup(
          input_ids=input_ids,
          vocab_size=config.vocab_size,
          embedding_size=config.hidden_size,
          initializer_range=config.initializer_range,
          word_embedding_name="word_embeddings",
          use_one_hot_embeddings=use_one_hot_embeddings)

      # Add the position embedding and segment embedding,
      # then layer norm + dropout.
      self.embedding_output = embedding_postprocessor(
          input_tensor=self.embedding_output,
          use_token_type=True,
          token_type_ids=token_type_ids,
          token_type_vocab_size=config.type_vocab_size,
          token_type_embedding_name="token_type_embeddings",
          use_position_embeddings=True,
          position_embedding_name="position_embeddings",
          initializer_range=config.initializer_range,
          max_position_embeddings=config.max_position_embeddings,
          dropout_prob=config.hidden_dropout_prob)

    with tf.variable_scope("encoder"):
      # input_ids are the padded word ids, e.g. [25, 120, 34, 0, 0]
      # input_mask marks the valid tokens, e.g. [1, 1, 1, 0, 0]
      attention_mask = create_attention_mask_from_input_mask(
          input_ids, input_mask)

      # Stack of transformer modules.
      # `sequence_output` shape = [batch_size, seq_length, hidden_size].
      self.all_encoder_layers = transformer_model(
          input_tensor=self.embedding_output,
          attention_mask=attention_mask,
          hidden_size=config.hidden_size,
          num_hidden_layers=config.num_hidden_layers,
          num_attention_heads=config.num_attention_heads,
          intermediate_size=config.intermediate_size,
          intermediate_act_fn=get_activation(config.hidden_act),
          hidden_dropout_prob=config.hidden_dropout_prob,
          attention_probs_dropout_prob=config.attention_probs_dropout_prob,
          initializer_range=config.initializer_range,
          do_return_all_layers=True)

    # `self.sequence_output` shape is [batch_size, seq_length, hidden_size]
    self.sequence_output = self.all_encoder_layers[-1]

    # The pooler converts [batch_size, seq_length, hidden_size]
    # to [batch_size, hidden_size].
    with tf.variable_scope("pooler"):
      # Take the tensor at the first position of the last layer, i.e. [CLS].
      # This is important for classification tasks.
      # sequence_output[:, 0:1, :] has shape [batch_size, 1, hidden_size];
      # squeezing axis 1 gives [batch_size, hidden_size].
      first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
      # Then add a fully connected layer; the output is still
      # [batch_size, hidden_size].
      self.pooled_output = tf.layers.dense(
          first_token_tensor,
          config.hidden_size,
          activation=tf.tanh,
          kernel_initializer=create_initializer(config.initializer_range))
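The pooler step is small enough to write out by hand. The sketch below is a plain-Python illustration with a hypothetical function name and made-up weights: it takes the first token of each sample, applies one dense layer, and squashes the result with tanh.

```python
import math


def pool_first_token(sequence_output, weights, bias):
    """sequence_output: [batch][seq][hidden] -> [batch][hidden].

    weights holds one list of input weights per output unit.
    """
    pooled = []
    for sample in sequence_output:
        first_token = sample[0]  # the [CLS] position
        pooled.append([
            math.tanh(sum(x * w for x, w in zip(first_token, unit)) + b)
            for unit, b in zip(weights, bias)
        ])
    return pooled


# batch_size=2, seq_length=2, hidden_size=2; identity weights for clarity.
sequence_output = [[[0.0, 0.0], [9.0, 9.0]],
                   [[1.0, -1.0], [7.0, 7.0]]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pooled = pool_first_token(sequence_output, weights, bias)
```

The second token of each sample (9.0 and 7.0 here) has no direct effect on the pooled output; in the real model it influences the result only through self-attention in the encoder layers.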
Summary
With this closer look at the source code, we can be more confident when using BertModel. Here is a simple example of using the model:
# Assume the input has already been tokenized into word ids. shape=[2, 3]
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
# Segment embedding: in the first sample, the first two words belong to
# sentence 1 and the last word belongs to sentence 2.
# In the second sample, the first word belongs to sentence 1, the second word
# to sentence 2, and the third element 0 is padding.
# (The original code looks like this, but the 2 does not seem necessary.)
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

# Create the config. hidden_size must be a multiple of num_attention_heads.
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=8, intermediate_size=1024)

# Create the model.
model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)
# The first token of the last layer, i.e. the [CLS] vector, after the pooler.
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
The main flow of building the BERT model is as follows:
- Embed the input sequence (the sum of three embeddings); everything after that is the content of "Attention is all you need".
- Put simply, the embedding is fed into the Transformer to get the output.
- In more detail: embedding -> N * [multi-head attention -> Add (residual) & Norm -> feed-forward -> Add (residual) & Norm]
- See, quite simple, isn't it?
- There are a few other helper functions in the source code. They are not hard to understand, so I won't go through them here.
That's all.
References for this article
[1] Interpreting BERT, the NLP killer model: blog.csdn.net/Kaiyuan_sjt…
[2] A summary of BERT-related papers, articles and code resources: www.52nlp.cn/bert-paper-…
[3] modeling.py module: github.com/google-rese…
[4] Refer to this issue: github.com/google-rese…
[5] Understanding the principle and model of the Attention mechanism: blog.csdn.net/Kaiyuan_sjt…
[6] Original paper: arxiv.org/abs/1706.03…
[7] Original code: github.com/tensorflow/…