Source | Google Research GitHub  Compiled by | Wuming, Natalie  Edited by | Natalie
Recently, a Google AI NLP paper has attracted a great deal of attention and discussion in the community and is considered a major breakthrough in the field of NLP. Thang Luong, a research scientist at Google Brain, said on Twitter that the study opens a new era of NLP. The paper introduces a new language representation model, BERT (Bidirectional Encoder Representations from Transformers). BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance across a large number of sentence-level and token-level tasks, outperforming many systems that use task-specific architectures and setting new state-of-the-art results on 11 NLP tasks. Google has now officially open-sourced it, which means all NLP practitioners can try out this powerful pre-training model and incorporate it into their own work. For more quality content, follow the WeChat official account "AI Front" (ID: AI-Front).
The open-source repository:
Github.com/google-rese…
Link to original paper:
Arxiv.org/abs/1810.04…
The open source project highlights are as follows:
- Standalone TensorFlow code with a simple API and no dependencies.
- Links to the pre-trained BERT-Base and BERT-Large checkpoints from the paper.
- Push-button reproduction of the MultiNLI and SQuAD v1.1 results from the paper.
- Pre-training data generation and training code.
- A Colab notebook for running BERT on a free Cloud TPU.
Several frequently asked questions:
- We plan to release a multilingual model soon (a large shared WordPiece vocabulary trained on 60 languages, with special handling for Chinese).
- Existing PyTorch (or other framework) implementations are not compatible with our checkpoints (that would not be possible without using our code). We hope someone will create an op-for-op re-implementation of modeling.py so that a PyTorch model compatible with our checkpoints can be built, especially since we plan to release more checkpoints in the future (for example, multilingual models).
- We haven't run this model on SQuAD 2.0 yet; we'd like to leave that as an exercise for the reader 🙂
- You don't have to train on a Cloud TPU, but training a BERT-Large model on a GPU can run into serious out-of-memory problems. Running BERT-Base on a GPU usually works fine (you may need to reduce the batch size compared to what we used in the paper, but if you also adjust the learning rate, the end result should be similar). We are still working out the best way to run BERT-Large on a GPU.
The following content is compiled from the README of the BERT open-source project (slightly condensed).
What is BERT?
BERT is a method for pre-training language representations: we train a general-purpose "language understanding" model on a large text corpus such as Wikipedia, and then use that model for downstream NLP tasks such as question answering. BERT outperforms previous approaches because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
Unsupervised means that BERT is trained using only a plain-text corpus, which matters because an enormous amount of plain-text data is publicly available on the web.
Pre-trained representations can be either context-free or contextual, and contextual representations can be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" representation for each word in the vocabulary, so "bank" has the same representation in "bank deposit" and "river bank". Contextual models instead generate a representation of each word based on the other words in the sentence.
BERT builds on recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit, all of which are unidirectional or shallowly bidirectional. That is, each word is contextualized using only the words to its left (or right). For example, in the sentence "I made a bank deposit", the unidirectional representation of "bank" is based only on "I made a", not on "deposit". Some previous work combined representations from separate left-context and right-context models in a "shallow" way, whereas BERT represents "bank" using both its left and right context, starting from the very bottom of a deep neural network, so it is deeply bidirectional.
BERT uses a simple approach: we mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
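As an illustration only (this is not the repo's implementation, which also sometimes keeps the original word or substitutes a random one), here is a minimal Python sketch of the masking step: pick roughly 15% of the tokens, replace them with [MASK], and keep the originals as prediction targets.
import random

def mask_tokens(tokens, mask_prob=0.15, seed=12345):
    # Choose ~15% of positions at random and replace them with [MASK].
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    num_to_mask = max(1, int(round(len(tokens) * mask_prob)))
    for i in rng.sample(range(len(tokens)), num_to_mask):
        targets[i] = masked[i]   # the model is trained to predict these
        masked[i] = "[MASK]"
    return masked, targets

tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, targets = mask_tokens(tokens)
print(masked)   # e.g. ['the', 'man', '[MASK]', ..., 'milk', '.']
print(targets)  # {position: original token}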
To learn relationships between sentences, we also train on a simple task: given two sentences A and B, is B the actual sentence that follows A, or just a random sentence from the corpus?
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence
Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
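A minimal sketch (again illustrative, not the repo's data-generation code) of how such sentence pairs can be constructed: with probability 0.5 use the true next sentence, otherwise sample a random sentence from the corpus.
import random

def make_nsp_example(doc_sentences, index, corpus_sentences, rng):
    # Sentence A comes from a document; sentence B is either the true
    # next sentence or a random sentence drawn from the whole corpus.
    sent_a = doc_sentences[index]
    if index + 1 < len(doc_sentences) and rng.random() < 0.5:
        return sent_a, doc_sentences[index + 1], "IsNextSentence"
    return sent_a, rng.choice(corpus_sentences), "NotNextSentence"

doc = ["the man went to the store .", "he bought a gallon of milk ."]
corpus = ["penguins are flightless .", "the sky is blue ."]
rng = random.Random(0)
print(make_nsp_example(doc, 0, corpus, rng))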
We then train a large model (a 12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1 million update steps), and that is BERT.
Using BERT requires two phases: pre-training and fine-tuning.
Pre-training is fairly expensive (it takes 4 days on 4 to 16 Cloud TPUs), but it is a one-time procedure for each language (the current models are English-only; more language models will be released in the near future). We are releasing a number of models pre-trained at Google, and most NLP researchers will not need to pre-train their own model from scratch.
Fine-tuning is inexpensive. All of the results in the paper can be reproduced in at most 1 hour on a single Cloud TPU, or in a few hours on a GPU.
Pre-trained models
We released the BERT-Base and BERT-Large models from the paper. Uncased means the text is lowercased before WordPiece tokenization, for example, "John Smith" becomes "john smith"; the Uncased model also strips out accent markers. Cased means the true case and accent markers are preserved. In general, the Uncased model works better unless your task requires case information (for example, named entity recognition or part-of-speech tagging).
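As a rough sketch of what the Uncased preprocessing means in practice (this mirrors, but is not copied from, the repo's tokenization.py), lowercasing plus accent stripping can be expressed as:
import unicodedata

def uncase(text):
    # Lowercase, then decompose characters and drop combining accent marks.
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(uncase("John Smith"))  # "john smith"
print(uncase("Müller"))      # "muller"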
These models are distributed under the Apache 2.0 license.
Model link:
- BERT-Base, Uncased: storage.googleapis.com/bert_models…
- BERT-Large, Uncased: storage.googleapis.com/bert_models…
- BERT-Base, Cased: storage.googleapis.com/bert_models…
- BERT-Large, Cased: not yet available; it needs to be regenerated.
Each .zip file contains three items (the sketch after the list shows a quick way to inspect them):
- A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (actually 3 files).
- A vocabulary file (vocab.txt) that maps WordPiece tokens to word IDs.
- A configuration file (bert_config.json) that specifies the model's hyperparameters.
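As a quick sanity check, the sketch below (the paths are hypothetical) inspects the config and vocabulary after unzipping; the checkpoint itself is loaded by the run_*.py scripts via init_checkpoint rather than opened directly.
import json

bert_dir = "/path/to/uncased_L-12_H-768_A-12"  # hypothetical unzip location

with open(bert_dir + "/bert_config.json") as f:
    config = json.load(f)
print("model hyperparameters:", config)

with open(bert_dir + "/vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
print("vocabulary size:", len(vocab))

# bert_model.ckpt is a TensorFlow checkpoint prefix (three files on disk);
# pass it as --init_checkpoint to the training/fine-tuning scripts.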
Fine-tuning with BERT
The fine-tuning examples use BERT-Base and should run on a GPU with at least 12GB of RAM using the given hyperparameters.
Fine-tuning on Cloud TPU
Most of the examples below assume that you will be running training/evaluation on your local computer using a GPU like Titan X or GTX 1080.
However, if you have access to a Cloud TPU, simply add the following flags to run_classifier.py or run_squad.py:
--use_tpu=True \
--tpu_name=$TPU_NAME
On Cloud TPU, the pre-trained model and output directory need to be on Google Cloud Storage. For example, if you have a bucket named some_bucket, you can use the following flag:
--output_dir=gs://some_bucket/my_output_dir/
The unzipped pre-trained model files can also be found in the Google Cloud Storage folder gs://bert_models/2018_10_18. For example:
export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12
Sentence (and sentence-pair) classification tasks
Before running this example, you must download the GLUE data (gluebenchmark.com/tasks) using this script (gist.github.com/W4ngatang/6…) and unzip it into a directory (the directory variable can be set to $GLUE_DIR). Next, download the BERT-Base checkpoint and unzip it into another directory (the directory variable can be set to $BERT_BASE_DIR).
This example fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 samples, so fine-tuning takes only a few minutes on most GPUs.
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue
python run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/tmp/mrpc_output/
You should see something like this:
***** Eval results *****
eval_accuracy = 0.845588
eval_loss = 0.505248
global_step = 343
loss = 0.505248
This corresponds to a dev set accuracy of 84.55%. Dev set accuracy on MRPC varies widely, even when starting from the same pre-training checkpoint. If you re-run it a few times (making sure to point to different output_dir values), you should see results between 84% and 88%.
run_classifier.py also includes processors for several other datasets, so it should be straightforward to follow these examples and use BERT for any single-sentence or sentence-pair classification task.
SQuAD
The Stanford Question Answering Dataset (SQuAD) is a very popular question-answering benchmark. BERT (at the time of release) achieved the best results on SQuAD with almost no task-specific network architecture modifications or data augmentation. It does, however, require fairly complex data pre-processing and post-processing to handle the variable-length nature of SQuAD context paragraphs and the character-level answer annotations used in SQuAD training. run_squad.py implements and documents this processing.
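To make those two issues concrete, here is a minimal, simplified sketch (whitespace tokenization only, not what run_squad.py actually does): mapping a character-level answer annotation to token indices, and splitting a long paragraph into overlapping windows with a document stride.
def char_span_to_tokens(context, answer_start, answer_text):
    # Whitespace-tokenize while recording each token's character offset.
    tokens, offsets, pos = [], [], 0
    for tok in context.split():
        start = context.index(tok, pos)
        tokens.append(tok)
        offsets.append(start)
        pos = start + len(tok)
    answer_end = answer_start + len(answer_text)
    start_tok = max(i for i, off in enumerate(offsets) if off <= answer_start)
    end_tok = max(i for i, off in enumerate(offsets) if off < answer_end)
    return tokens, start_tok, end_tok

def sliding_windows(tokens, max_len=384, stride=128):
    # Paragraphs longer than max_len are split into overlapping chunks.
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            return windows
        start += stride

context = "The man went to the store and bought a gallon of milk ."
tokens, s, e = char_span_to_tokens(context, context.index("a gallon"), "a gallon of milk")
print(tokens[s:e + 1])  # ['a', 'gallon', 'of', 'milk']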
To train on SQuAD, you first need to download the dataset. The SQuAD website (rajpurkar.github.io/SQuAD-explo…) no longer seems to link to the v1.1 datasets, but the necessary files can be found here:
- train-v1.1.json (rajpurkar.github.io/SQuAD-explo…)
- dev-v1.1.json (rajpurkar.github.io/SQuAD-explo…)
- evaluate-v1.1.py (github.com/allenai/bi-…)
Download these to a directory (the variable can be set to $SQUAD_DIR).
Due to memory limits, it is currently not possible to reproduce the best SQuAD results on a 12GB-16GB GPU. However, you can train the BERT-Base model on a GPU with these hyperparameters:
python run_squad.py \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--do_train=True \
--train_file=$SQUAD_DIR/train-v1.1.json \
--do_predict=True \
--predict_file=$SQUAD_DIR/dev-v1.1.json \
--train_batch_size=12 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--max_seq_length=384 \
--doc_stride=128 \
--output_dir=/tmp/squad_base/
The dev set predictions will be saved to a file called predictions.json in the output_dir directory:
Running python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json should generate the following output:
{"f1": 88.41249612335034."exact_match": 81.2488174077578}
You should see a result similar to the 88.5% F1 reported in the paper.
If you have access to a Cloud TPU, you can train BERT-Large models. The following hyperparameters (slightly different from those in the paper) yield about 90.5%-91.0% F1 (trained only on SQuAD):
python run_squad.py \
--vocab_file=$BERT_LARGE_DIR/vocab.txt \
--bert_config_file=$BERT_LARGE_DIR/bert_config.json \
--init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
--do_train=True \
--train_file=$SQUAD_DIR/train-v1.1.json \
--do_predict=True \
--predict_file=$SQUAD_DIR/dev-v1.1.json \
--train_batch_size=48 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--max_seq_length=384 \
--doc_stride=128 \
--output_dir=gs://some_bucket/squad_large/ \
--use_tpu=True \
--tpu_name=$TPU_NAME
For example, one random run with these parameters produced the following dev scores:
{"f1": 90.87081895814865."exact_match": 84.38978240302744}
Using BERT to extract fixed feature vectors
In some cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be better to use pre-trained contextual embeddings, which are fixed contextual representations of each input token generated by the hidden layers of the pre-trained model.
For example, we might use the extract_features.py script like this:
# Sentence A and Sentence B are separated by the ||| delimiter.
# For single sentence inputs, don't use the delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt
python extract_features.py \
--input_file=/tmp/input.txt \
--output_file=/tmp/output.jsonl \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
This creates a JSON file containing the BERT activations from every Transformer layer specified by layers (-1 is the final hidden layer of the Transformer, and so on).
Note that this script produces very large output files (by default, about 15KB per input token).
If you need to align original words with tokenized words, see the tokenization section below.
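For reference, a minimal sketch of reading the output.jsonl produced above. The field names here ("features", "token", "layers", "index", "values") are assumptions based on the script's output format; check one line of your own output before relying on them.
import json

with open("/tmp/output.jsonl") as f:
    example = json.loads(f.readline())  # one JSON object per input line

for feature in example["features"]:
    token = feature["token"]
    # Pick the activations of the final hidden layer (index -1).
    last_layer = next(l["values"] for l in feature["layers"] if l["index"] == -1)
    print(token, len(last_layer))  # each vector has hidden_size entries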
Tokenization
For sentence (or sentence-pair) tasks, tokenization is very simple: just follow the example code in run_classifier.py and extract_features.py. The basic flow for sentence-level tasks is (a minimal sketch follows the list):
1. Instantiate an instance of tokenizer = tokenization.FullTokenizer;
2. Tokenize the raw text with tokens = tokenizer.tokenize(raw_text);
3. Truncate to the maximum sequence length (you can use up to 512, but for memory and speed reasons shorter is better);
4. Add the [CLS] and [SEP] tokens in the right places.
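Here is a minimal sketch of those four steps for a sentence pair; the vocab path is hypothetical, and the truncation is simplified (run_classifier.py truncates the longer of the two sequences):
import tokenization  # ships with the BERT repository

vocab_file = "/path/to/uncased_L-12_H-768_A-12/vocab.txt"  # hypothetical path
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

tokens_a = tokenizer.tokenize("Who was Jim Henson ?")
tokens_b = tokenizer.tokenize("Jim Henson was a puppeteer")

max_seq_length = 128
# Reserve room for [CLS] and two [SEP] tokens.
tokens_a = tokens_a[: max_seq_length - len(tokens_b) - 3]

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)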
Word-level and span-level tasks, such as SQuAD and NER, are a bit more complicated because you need to align the input text with the output text. SQuAD is a particularly complex example because the input labels are character-based, and SQuAD paragraphs are often longer than our maximum sequence length. See the code in run_squad.py to see how we handle this.
Before describing our general recipe for word-level tasks, it helps to understand exactly what our tokenizer does. It has three main steps:
1. Text normalization: convert all whitespace characters to spaces and (for the Uncased model) lowercase the input and strip out accent markers. For example, "John Johanson's," becomes "john johanson's,".
2. Punctuation splitting: split off all punctuation characters on both sides (that is, add spaces around all punctuation characters). Punctuation characters are anything with a P* Unicode class or any non-letter/number/space ASCII character. For example, "john johanson's," becomes "john johanson ' s ,".
3. WordPiece tokenization: apply whitespace tokenization to the output of the previous step, then apply WordPiece tokenization to each token separately. For example, "john johanson ' s ," becomes "john johan ##son ' s ,".
The advantage of this scheme is that it is "compatible" with most existing English tokenizers. For example, imagine you have a part-of-speech tagging task that looks like this:
Input:  John Johanson 's  house
Labels: NNP  NNP      POS NN
The tokenized output looks like this:
Tokens: john johan ##son ' s house
If you have a pre-tokenized representation with word-level annotations, you can tokenize each input word individually and align the original words with the tokenized words:
### Input
orig_tokens = ["John", "Johanson", "'s", "house"]
labels = ["NNP", "NNP", "POS", "NN"]
### Output
bert_tokens = []
# Token map will be an int -> int mapping between the `orig_tokens` index and
# the `bert_tokens` index.
orig_to_tok_map = []
tokenizer = tokenization.FullTokenizer(
vocab_file=vocab_file, do_lower_case=True)
bert_tokens.append("[CLS]")
for orig_token in orig_tokens:
  orig_to_tok_map.append(len(bert_tokens))
  bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")
# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
orig_to_tok_map can now be used to project labels onto the tokenized representation.
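For example, a minimal sketch (not from the repo) of projecting the word-level labels above onto the WordPiece tokens, using a placeholder label "X" for special tokens and non-initial sub-tokens:
# Continues the snippet above: bert_tokens, labels, orig_to_tok_map.
bert_labels = ["X"] * len(bert_tokens)
for label, tok_index in zip(labels, orig_to_tok_map):
    bert_labels[tok_index] = label
# bert_labels == ["X", "NNP", "NNP", "X", "POS", "X", "NN", "X"]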
There are some common English tokenization schemes that will cause slight mismatches with how BERT was pre-trained. For example, if your input tokenization splits contractions such as "do n't", a mismatch will occur. If possible, you should pre-process your data to convert it back to raw-looking text; if that is not possible, the mismatch is likely not a big deal.
Pre-training with BERT
We are releasing code to run "masked LM" and "next sentence prediction" pre-training on an arbitrary text corpus. Note that this is not the exact code used for the paper (the original code was written in C++ and had some additional complexity), but it generates the pre-training data described in the paper.
The input is a plain-text file with one sentence per line and documents separated by blank lines. The output is a set of tf.train.Example records serialized into the TFRecord file format.
This script keeps all of the examples for the entire input file in memory, so for large data files you should shard the input and call the script multiple times.
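As an illustration of that input format (the sentence splitter here is deliberately naive and only an assumption; a real pipeline should use a proper sentence segmenter), a minimal sketch that writes one sentence per line with blank lines between documents:
import re

def write_pretraining_corpus(documents, path):
    # documents: a list of plain-text strings, one per document.
    with open(path, "w", encoding="utf-8") as out:
        for doc in documents:
            for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
                if sent:
                    out.write(sent + "\n")
            out.write("\n")  # a blank line separates documents

write_pretraining_corpus(
    ["The man went to the store. He bought a gallon of milk.",
     "Penguins are flightless. They live in the Southern Hemisphere."],
    "/tmp/my_corpus.txt")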
max_predictions_per_seq is the maximum number of masked-LM predictions per sequence. You should set it to around max_seq_length * masked_lm_prob (for example, 128 * 0.15 ≈ 19.2, hence 20 in the command below).
python create_pretraining_data.py \
--input_file=./sample_text.txt \
--output_file=/tmp/tf_examples.tfrecord \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
If you are pre-training from scratch, do not include init_checkpoint. The model configuration, including vocabulary size, is specified in bert_config_file. This demo code only pre-trains for a few steps (20), but in practice you will probably want to set num_train_steps to 10,000 or more. The max_seq_length and max_predictions_per_seq parameters passed to run_pretraining.py must match those passed to create_pretraining_data.py.
python run_pretraining.py \
--input_file=/tmp/tf_examples.tfrecord \
--output_dir=/tmp/pretraining_output \
--do_train=True \
--do_eval=True \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=10 \
--learning_rate=2e-5
This produces the following output:
***** Eval results *****
global_step = 20
loss = 0.0979674
masked_lm_accuracy = 0.985479
masked_lm_loss = 0.0979328
next_sentence_accuracy = 1.0
next_sentence_loss = 3.45724e-05
Note that because the sample_text.txt file is very small, this example will overfit in a few steps and produce unrealistically high accuracy.
Pre-training data
We will not be able to release the pre-processed datasets used in the paper. For Wikipedia, we recommend downloading the latest dump (dumps.wikimedia.org/enwiki/late…), extracting the text with WikiExtractor.py, and then applying the necessary clean-up to convert it into plain text.
Sadly, the researchers who collected BookCorpus no longer make it available for public download. The Guttenberg Dataset (web.eecs.umich.edu/~lahiri/gut…) is a collection of older books that can serve as an alternative.
Common Crawl (commoncrawl.org) is another very large collection of text, but you will likely have to do substantial pre-processing and clean-up to extract a usable corpus for BERT pre-training.
Using BERT in Colab
If you want to use BERT with Colab, you can get started with the notebook "BERT FineTuning with Cloud TPU" (colab.sandbox.google.com/github/tens…). At the time of writing (October 31, 2018), Colab users can access a Cloud TPU for free. Note: each user can use one, availability is limited, it requires a Google Cloud Platform account with storage space, and it may not be available in the future.
Original English text:
Github.com/google-rese…