Preface
The use of BERT can be divided into two steps: "pre-training" and "fine-tuning". Pre-training is a one-time procedure that is independent of the downstream task, but it is expensive (about four days on 4 to 16 Cloud TPUs), so most practitioners have no need to do it from scratch. Google has released a variety of pre-trained models to choose from, which only need fine-tuning for specific tasks. Today we continue reading BERT's pre-training source code, following the structure of the original paper. BERT pre-training consists of two sub-tasks: "Masked LM" and "Next Sentence Prediction". The source files involved are:
- tokenization.py [1]
- create_pretraining_data.py [2]
- run_pretraining.py [3]
Besides the explanations outside the code blocks, I have also added comments inside the code itself.
1. Tokenization (tokenization.py)
tokenization.py preprocesses the raw text corpus. It contains two tokenizers: BasicTokenizer and WordpieceTokenizer.
1.1 BasicTokenizer
It performs basic tokenization by splitting on whitespace and punctuation; the result is a list of words (for Chinese, a list of individual characters).
class BasicTokenizer(object):
  """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

  def __init__(self, do_lower_case=True):
    self.do_lower_case = do_lower_case

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)
    # Put spaces around CJK characters so that Chinese is split per character.
    text = self._tokenize_chinese_chars(text)
    orig_tokens = whitespace_tokenize(text)
    split_tokens = []
    for token in orig_tokens:
      if self.do_lower_case:
        token = token.lower()
        token = self._run_strip_accents(token)
      split_tokens.extend(self._run_split_on_punc(token))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

  def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
      cat = unicodedata.category(char)
      # "Mn" = nonspacing combining marks, see
      # https://www.fileformat.info/info/unicode/category/Mn/list.htm
      if cat == "Mn":
        continue
      output.append(char)
    return "".join(output)

  def _run_split_on_punc(self, text):
    """Splits punctuation on a piece of text."""
    chars = list(text)
    i = 0
    start_new_word = True
    output = []
    while i < len(chars):
      char = chars[i]
      if _is_punctuation(char):
        output.append([char])
        start_new_word = True
      else:
        if start_new_word:
          output.append([])
        start_new_word = False
        output[-1].append(char)
      i += 1
    return ["".join(x) for x in output]

  def _tokenize_chinese_chars(self, text):
    """Adds whitespace around any CJK character."""
    output = []
    for char in text:
      cp = ord(char)
      if self._is_chinese_char(cp):
        output.append(" ")
        output.append(char)
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)

  def _is_chinese_char(self, cp):
    """Checks whether `cp` is the codepoint of a CJK character."""
    # CJK Unicode block ranges, see
    # https://www.cnblogs.com/straybirds/p/6392306.html
    if ((cp >= 0x4E00 and cp <= 0x9FFF) or
        (cp >= 0x3400 and cp <= 0x4DBF) or
        (cp >= 0x20000 and cp <= 0x2A6DF) or
        (cp >= 0x2A700 and cp <= 0x2B73F) or
        (cp >= 0x2B740 and cp <= 0x2B81F) or
        (cp >= 0x2B820 and cp <= 0x2CEAF) or
        (cp >= 0xF900 and cp <= 0xFAFF) or
        (cp >= 0x2F800 and cp <= 0x2FA1F)):
      return True
    return False

  def _clean_text(self, text):
    """Removes invalid characters and cleans up whitespace."""
    output = []
    for char in text:
      cp = ord(char)
      if cp == 0 or cp == 0xfffd or _is_control(char):
        continue
      if _is_whitespace(char):
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)
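To get a feel for the behaviour, here is a minimal usage sketch. It assumes the BERT repo's tokenization.py is importable as `tokenization`; the example sentence and the shown output are only illustrative.

```python
# Minimal sketch: BasicTokenizer splits on whitespace/punctuation, lower-cases,
# and splits CJK text into single characters (assumes tokenization.py from the
# BERT repo is on the path).
import tokenization

basic = tokenization.BasicTokenizer(do_lower_case=True)
print(basic.tokenize("Hello, World! 你好"))
# Roughly: ['hello', ',', 'world', '!', '你', '好']
```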
1.2 WordpieceTokenizer
WordpieceTokenizer further splits the output of BasicTokenizer into finer-grained subwords. The purpose of this step is to reduce the impact of out-of-vocabulary words on the model. It has no effect on Chinese, because Chinese text has already been split into single characters by BasicTokenizer.
class WordpieceTokenizer(object):
  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word

  def tokenize(self, text):
    """Greedy longest-match-first (maximum forward matching) tokenization.

    For example:
      input  = "unaffable"
      output = ["un", "##aff", "##able"]
    """
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "# #" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
Let's trace how the code runs on an example. Suppose the input token is "unaffable". Initially start=0 and end=len(chars)=9, so the first candidate substring is "unaffable"; it is not in the WordPiece vocabulary, so end is decremented ("unaffabl", "unaffab", ...) until the candidate "un" is found in the vocabulary, and "un" is appended to the result.
Then start moves to 2 and end is reset to 9. Since start > 0, the candidates are prefixed with "##" ("##affable", "##affabl", ...), and the search stops when "##aff" is found in the vocabulary. The process repeats until the whole token is consumed, yielding ["un", "##aff", "##able"]. Note: "##" marks a subword that is not at the beginning of a word, which makes the WordPiece segmentation reversible: we can recover the original word.
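For clarity, here is a toy re-implementation of just this greedy longest-match loop, using a tiny hypothetical three-entry vocabulary. It only illustrates the matching order and is not a replacement for WordpieceTokenizer.

```python
# Toy greedy longest-match-first matcher (hypothetical vocabulary).
vocab = {"un", "##aff", "##able"}

def greedy_wordpiece(token, vocab, unk_token="[UNK]"):
    sub_tokens, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces get the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1                   # shrink the candidate from the right
        if cur is None:                # nothing matched -> unknown word
            return [unk_token]
        sub_tokens.append(cur)
        start = end                    # continue after the matched piece
    return sub_tokens

print(greedy_wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```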
1.3 FullTokenizer
FullTokenizer is the main tokenization interface for BERT; it combines the two tokenizers above.
class FullTokenizer(object):
  """Runs end-to-end tokenization: BasicTokenizer followed by WordpieceTokenizer."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    # First run BasicTokenizer, then refine each token with WordpieceTokenizer.
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
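An end-to-end sketch of how FullTokenizer ties the two together is shown below; "vocab.txt" is a placeholder for the vocabulary file shipped with a pre-trained BERT model, and the printed tokens depend on that vocabulary.

```python
# Hypothetical usage of FullTokenizer.
import tokenization

tok = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
tokens = tok.tokenize("Hello world, this is unaffable.")
ids = tok.convert_tokens_to_ids(tokens)
print(tokens)  # e.g. ['hello', 'world', ',', 'this', 'is', 'un', '##aff', '##able', '.']
print(ids)     # the corresponding row indices in vocab.txt
```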
2. Training data generation (create_pretraining_data.py)
The purpose of this file is to convert the raw input corpus into the TFRecord format required for model pre-training.
2.1 Parameter Settings
flags.DEFINE_string("input_file", None,
"Input raw text file (or comma-separated list of files).")
flags.DEFINE_string("output_file", None,
"Output TF example file (or comma-separated list of files).")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_bool( "do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer("max_seq_length".128."Maximum sequence length.")
flags.DEFINE_integer("max_predictions_per_seq".20."Maximum number of masked LM predictions per sequence.")
flags.DEFINE_integer("random_seed".12345."Random seed for data generation.")
flags.DEFINE_integer( "dupe_factor".10."Number of times to duplicate the input data (with different masks).")
flags.DEFINE_float("masked_lm_prob".0.15."Masked LM probability.")
flags.DEFINE_float("short_seq_prob".0.1."Probability of creating sequences which are shorter than the maximum length.")
A few of these parameters deserve explanation:
- dupe_factor: the duplication factor. The same sentence can be masked at different positions each time it is duplicated, which makes fuller use of the data. For example, for the sentence "Hello world, this is bert.", the first pass might produce "Hello [MASK], this is bert." and the second pass "Hello world, this is [MASK].".
- max_predictions_per_seq: the maximum number of [MASK] tokens in one sequence.
- masked_lm_prob: the fraction of tokens that get masked (see the arithmetic sketch after this list).
- short_seq_prob: the probability of generating a sample shorter than max_seq_length. Since the sequence length seen during fine-tuning is variable (at most max_seq_length), shorter samples also need to be constructed during pre-training so that the model does not overfit to full-length sequences.
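As a sanity check on how the defaults interact, the number of masked positions per sequence is computed later (in create_masked_lm_predictions) roughly like this; for a full 128-token sequence, the 15% rate stays just under the cap of 20.

```python
# Rough arithmetic with the default flags (assumes a full-length sequence).
max_seq_length = 128
masked_lm_prob = 0.15
max_predictions_per_seq = 20

num_to_predict = min(max_predictions_per_seq,
                     max(1, int(round(max_seq_length * masked_lm_prob))))
print(num_to_predict)  # 19, just under max_predictions_per_seq
```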
2.2 Main entry point
Let's first look at the overall flow of constructing the data:
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Reading from input files ***")
for input_file in input_files:
tf.logging.info(" %s", input_file)
rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
rng)
output_files = FLAGS.output_file.split(",")
tf.logging.info("*** Writing to output files ***")
for output_file in output_files:
tf.logging.info(" %s", output_file)
write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
FLAGS.max_predictions_per_seq, output_files)
- Construct a tokenizer to segment the input corpus (the tokenizer was described in section 1).
- Call create_training_instances to build the training instances.
- Call write_instance_to_example_files to save the data in TFRecord format.
We will examine each of these functions below.
2.3 Construction of training samples
First, a class representing a single training sample is defined:
class TrainingInstance(object):
def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
is_random_next):
self.tokens = tokens
self.segment_ids = segment_ids
self.is_random_next = is_random_next
self.masked_lm_positions = masked_lm_positions
self.masked_lm_labels = masked_lm_labels
def __str__(self):
s = ""
s += "tokens: %s\n" % ("".join(
[tokenization.printable_text(x) for x in self.tokens]))
s += "segment_ids: %s\n" % ("".join([str(x) for x in self.segment_ids]))
s += "is_random_next: %s\n" % self.is_random_next
s += "masked_lm_positions: %s\n" % ("".join(
[str(x) for x in self.masked_lm_positions]))
s += "masked_lm_labels: %s\n" % ("".join(
[tokenization.printable_text(x) for x in self.masked_lm_labels]))
s += "\n"
return s
def __repr__(self):
return self.__str__()
The code that constructs training samples is shown below. In the source package, Google provides an example input file ("sample_text.txt"). The input file format is as follows (a tiny illustration appears after the list):
- One sentence per line. These should be actual sentences, not whole paragraphs or arbitrary spans of text, because sentence boundaries are needed for the next sentence prediction task.
- Documents are separated by a blank line.
- We assume that sentences within the same document are related, while sentences from different documents are not.
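A made-up illustration of this format (not taken from sample_text.txt):

```
This is the first sentence of document one.
Here is the second sentence of document one.

Document two starts after the blank line above.
It also has a second sentence.
```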
def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Creates `TrainingInstance`s from raw text."""
  # all_documents is a list of documents; each document is a list of
  # sentences; each sentence is a list of tokens.
  all_documents = [[]]
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline())
        if not line:
          break
        line = line.strip()
        # An empty line marks a document boundary.
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents.
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  # Duplicate the data dupe_factor times, each time with different masks.
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances
The above function calls create_instances_from_document to extract multiple training samples from a single document.
def create_instances_from_document(
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document."""
  document = all_documents[document_index]
  # Reserve room for the [CLS], [SEP], [SEP] tokens.
  max_num_tokens = max_seq_length - 3
  # With probability short_seq_prob, use a shorter target length
  # (between 2 and max_num_tokens).
  target_seq_length = max_num_tokens
  if rng.random() < short_seq_prob:
    target_seq_length = rng.randint(2, max_num_tokens)

  instances = []
  current_chunk = []
  current_length = 0
  i = 0
  while i < len(document):
    segment = document[i]
    current_chunk.append(segment)
    current_length += len(segment)
    # Keep adding sentences to current_chunk until the document ends or the
    # target length is reached.
    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        # `a_end` is how many segments go into sentence A; pick the split
        # point at random.
        a_end = 1
        if len(current_chunk) >= 2:
          a_end = rng.randint(1, len(current_chunk) - 1)
        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        tokens_b = []
        # Whether sentence B is a random (negative) next sentence.
        is_random_next = False
        if len(current_chunk) == 1 or rng.random() < 0.5:
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)
          # Pick a random document; it could theoretically be the current one,
          # so retry up to 10 times. Duplicates are still possible in theory,
          # but we ignore that rare case.
          for _ in range(10):
            random_document_index = rng.randint(0, len(all_documents) - 1)
            if random_document_index != document_index:
              break
          random_document = all_documents[random_document_index]
          random_start = rng.randint(0, len(random_document) - 1)
          for j in range(random_start, len(random_document)):
            tokens_b.extend(random_document[j])
            if len(tokens_b) >= target_b_length:
              break
          # The segments not used for the random next sentence are "put back"
          # so the data is not wasted.
          num_unused_segments = len(current_chunk) - a_end
          i -= num_unused_segments
        # Actual (positive) next sentence.
        else:
          is_random_next = False
          for j in range(a_end, len(current_chunk)):
            tokens_b.extend(current_chunk[j])
        # If the pair is too long, randomly drop tokens from either end.
        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

        # Assemble [CLS] A [SEP] B [SEP] and the matching segment_ids.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)
        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

        # Call create_masked_lm_predictions to randomly mask some tokens.
        (tokens, masked_lm_positions,
         masked_lm_labels) = create_masked_lm_predictions(
             tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
        instance = TrainingInstance(
            tokens=tokens,
            segment_ids=segment_ids,
            is_random_next=is_random_next,
            masked_lm_positions=masked_lm_positions,
            masked_lm_labels=masked_lm_labels)
        instances.append(instance)
      current_chunk = []
      current_length = 0
    i += 1
  return instances
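The helper truncate_seq_pair is called above but not shown in this excerpt. Roughly (paraphrasing the function in create_pretraining_data.py, not quoted verbatim), it keeps removing one token at a time from the longer of the two sequences, randomly from the front or the back, until the pair fits into max_num_tokens:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
  """Truncates a pair of token lists until their total length fits."""
  while True:
    total_length = len(tokens_a) + len(tokens_b)
    if total_length <= max_num_tokens:
      break
    # Always trim the longer sequence.
    trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
    assert len(trunc_tokens) >= 1
    # Trim from the front or the back at random to avoid a positional bias.
    if rng.random() < 0.5:
      del trunc_tokens[0]
    else:
      trunc_tokens.pop()
```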
The code above is a bit long; I have added comments at the key places. Let's walk through how it executes with a concrete example, using the corpus in "sample_text.txt" (only part of it is shown here). The excerpt contains two documents: the first has 6 sentences and the second has 4. Let's look at how create_instances_from_document processes a document, taking the first one as an example.
- First, the algorithm maintains a chunk and keeps appending segments (sentences) from the document until the document is exhausted or the number of tokens in the chunk reaches the target length. This minimizes padding and makes training more efficient.
- Once the chunk is built (say it contains the first three sentences), the algorithm randomly picks a split point, e.g. 2, and then builds the next sentence prediction sample: (1) for a positive sample, the first two sentences become sentence A and the last one becomes sentence B; (2) for a negative sample, the first two sentences become sentence A and sentence B is an unrelated sentence randomly drawn from another document.
- Once sentences A and B are obtained, fill in tokens and segment_ids, adding the special [CLS] and [SEP] tokens (see the small sketch after this list).
- Randomly mask the tokens (described in the next section).
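As a concrete illustration of the third step, here is a tiny standalone sketch (the sentences are made up) of how tokens and segment_ids line up once A and B are chosen:

```python
# Made-up sentences A and B; shows the [CLS] A [SEP] B [SEP] layout and the
# matching segment ids (0 for A and its separators, 1 for B and its separator).
tokens_a = ["hello", "world"]
tokens_b = ["this", "is", "bert"]

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

print(tokens)       # ['[CLS]', 'hello', 'world', '[SEP]', 'this', 'is', 'bert', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]
```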
2.4 Random masking
Randomly masking tokens is one of BERT's innovations. The mask is needed because, with bidirectional attention, each token could otherwise "see itself" through the context, making prediction trivial. The strategy in the paper is therefore to replace 15% of the tokens in the input sequence with the [MASK] marker and predict those masked tokens from the context. However, to keep the model from overfitting to the [MASK] marker itself (which never appears during fine-tuning), the 15% of selected tokens are treated as follows:
- With 80% probability, replace with [MASK]: Hello world, this is bert. -> Hello world, this is [MASK].
- With 10% probability, replace with a random word: Hello world, this is bert. -> Hello world, this is Python.
- With 10% probability, keep the original word: Hello world, this is bert. -> Hello world, this is bert.
def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective."""
  cand_indexes = []
  # [CLS] and [SEP] must not be masked.
  for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
cand_indexes.append(i)
rng.shuffle(cand_indexes)
output_tokens = list(tokens)
num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))
masked_lms = []
covered_indexes = set()
for index in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
if index in covered_indexes:
continue
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
        masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

  # Sort by position so the masked positions and labels stay in order.
  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
return (output_tokens, masked_lm_positions, masked_lm_labels)
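A quick standalone check (not part of the original file) that the nested coin flips above really give roughly an 80/10/10 split:

```python
# Simulate the branch in create_masked_lm_predictions: 80% [MASK],
# 10% keep the original token, 10% replace with a random word.
import random

rng = random.Random(12345)
counts = {"mask": 0, "keep": 0, "random": 0}
for _ in range(100000):
    if rng.random() < 0.8:
        counts["mask"] += 1
    elif rng.random() < 0.5:
        counts["keep"] += 1
    else:
        counts["random"] += 1
print(counts)  # roughly {'mask': 80000, 'keep': 10000, 'random': 10000}
```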
2.5 Save TFRecord data
Finally, the data processed in the steps above is saved as a TFRecord file. The overall logic is straightforward; the code is as follows:
def write_instance_to_example_files(instances, tokenizer, max_seq_length,
max_predictions_per_seq, output_files):
writers = []
for output_file in output_files:
writers.append(tf.python_io.TFRecordWriter(output_file))
writer_index = 0
total_written = 0
  for (inst_index, instance) in enumerate(instances):
    input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
    input_mask = [1] * len(input_ids)
segment_ids = list(instance.segment_ids)
assert len(input_ids) <= max_seq_length
# padding
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
masked_lm_positions = list(instance.masked_lm_positions)
masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
masked_lm_weights = [1.0] * len(masked_lm_ids)
while len(masked_lm_positions) < max_predictions_per_seq:
masked_lm_positions.append(0)
masked_lm_ids.append(0)
masked_lm_weights.append(0.0)
next_sentence_label = 1 if instance.is_random_next else 0
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(input_ids)
features["input_mask"] = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
features["next_sentence_labels"] = create_int_feature([next_sentence_label] Tf.train.Example(features=tf.train.Features(feature=features)) # writers[writer_index].write(tf_example.SerializeToString()) writer_index = (writer_index +1) % len(writers)
total_written += 1# before printing20A sampleif inst_index < 20:
tf.logging.info("*** Example ***")
tf.logging.info("tokens: %s" % "".join(
[tokenization.printable_text(x) for x in instance.tokens]))
for feature_name in features.keys():
feature = features[feature_name]
values = []
if feature.int64_list.value:
values = feature.int64_list.value
elif feature.float_list.value:
values = feature.float_list.value
tf.logging.info(
"%s: %s" % (feature_name, "".join([str(x) for x in values])))
for writer in writers:
writer.close()
tf.logging.info("Wrote %d total instances", total_written)
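The helpers create_int_feature and create_float_feature are used above but not shown. Roughly (paraphrasing create_pretraining_data.py, not quoted verbatim), they just wrap Python lists in the corresponding tf.train.Feature types:

```python
import tensorflow as tf  # TensorFlow 1.x, as used throughout the file

def create_int_feature(values):
  # Wrap a list of integers as an int64 feature.
  return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

def create_float_feature(values):
  # Wrap a list of floats as a float feature.
  return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
```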
2.6 Test Code
python create_pretraining_data.py \
--input_file=./sample_text_zh.txt \
--output_file=/tmp/tf_examples.tfrecord \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
Since the vocabulary file I downloaded earlier is Chinese, I grabbed a few news articles from the web for testing; the logged output shows the generated training examples.
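If you want to sanity-check the generated file yourself, a minimal sketch (TensorFlow 1.x API, with the path taken from the command above) is:

```python
# Read back the first generated example and print a couple of its features.
import tensorflow as tf

record_path = "/tmp/tf_examples.tfrecord"
for record in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example()
    example.ParseFromString(record)
    print(example.features.feature["input_ids"].int64_list.value[:10])
    print(example.features.feature["next_sentence_labels"].int64_list.value)
    break
```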
3. Summary
This article mainly covered BERT's built-in tokenization components and the pre-training data generation process, which form the preparation stage of the whole project. I did not expect there to be so much code, so the pre-training itself (run_pretraining.py) is not covered here; please see the next article.
That's all for this article.
References for this article
[1] tokenization.py: github.com/google-rese…
[2] create_pretraining_data.py: github.com/google-rese…
[3] run_pretraining.py: github.com/google-rese…