By Chris McCormick and Nick Ryan

Original link: tinyurl.com/y74pvgyc

Introduction

History

2018 was a breakthrough year for NLP. Transfer learning models, in particular Allen AI's ELMo, OpenAI's Open-GPT, and Google's BERT, allowed researchers to push the state of the art on a wide range of tasks. These models also come as pre-trained checkpoints that can be fine-tuned easily, with very little data and computation, and still produce state-of-the-art results. However, for many developers new to NLP, and even for many experienced practitioners, the theory and practical application of these powerful models are still not easy to grasp.

What is BERT?

BERT (Bidirectional Encoder Representations from Transformers), released at the end of 2018, is the model we will use in this tutorial to give readers a better grasp of how to apply transfer learning models in NLP. BERT is a method of pre-training language representations, and NLP practitioners can download and use the resulting models for free. You can either use these models to extract high-quality language features from text data, or fine-tune them on a specific task (classification, entity recognition, question answering, etc.) with your own data to produce state-of-the-art predictions.

This article explains how to modify and fine-tune BERT to create a powerful NLP model.

The advantages of fine-tuning

In this tutorial, we will use BERT to train a text classifier. Specifically, we will take the pre-trained BERT model, add an untrained layer of neurons on the end, and then train the new model for our classification task. Why do this rather than train a task-specific deep learning model (a CNN, BiLSTM, etc.)?

  1. Faster development

    First, the pre-trained BERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model: it is as if we had already trained the bottom layers of the network extensively and only need to gently tune them while using their output as features for our classification task. In fact, the authors recommend only 2-4 epochs of training when fine-tuning BERT on a specific NLP task (compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch).

  2. Less data

    Also, and perhaps just as important, the pre-training approach allows us to fine-tune on a much smaller dataset than we would need if we were building a model from scratch. A major drawback of NLP models built from scratch is that we usually need a prohibitively large dataset to train the network to reasonable accuracy, which means a lot of time and energy has to go into creating that dataset. By fine-tuning BERT, we can get away with training a model to good performance on a much smaller amount of data.

  3. Better results

    Finally, this simple fine-tuning procedure (typically adding one fully-connected layer on top of BERT and training for a few epochs) has been shown to achieve state-of-the-art results with minimal task-specific adjustments across a wide range of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes hard-to-understand network architectures for a specific task, simple fine-tuning of BERT may be a better (or at least an equally good) choice.

A shift in NLP

This shift towards transfer learning parallels what happened in computer vision a few years ago. Creating a good deep learning network for computer vision tasks can require millions of parameters and is very expensive to train. Researchers found that deep networks learn hierarchical feature representations (simple features, such as edges, at the lowest layers, with progressively more complex features at higher layers). Rather than training a new network from scratch each time, the lower layers of a trained network, with their generalized image features, could be copied and transferred to another network built for a different task. It soon became common practice to download a pre-trained deep network and quickly retrain it for a new task, or add additional layers on top, which is much better than training an expensive network from scratch. For many, the introduction of deep pre-trained language models (ELMo, BERT, ULMFiT, Open-GPT, etc.) in 2018 signaled the same shift towards transfer learning in NLP that computer vision had already gone through.

Let’s get started!

1. Setup

1.1. Use Colab GPU for training

Google Colab offers free GPUs and TPUs! Since we will be training a large neural network, it is best to use a hardware accelerator (in this tutorial we use a GPU), otherwise training will take a very long time.

You can add a GPU via the menu:

Edit -> Notebook Settings -> Hardware accelerator -> (GPU)

Then run the following code to confirm that the GPU has been detected:

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
-----
device(type='cuda')

1.2. Install the Hugging Face library

Next, let's install the Transformers library, which will give us a PyTorch interface to BERT (the library also includes interfaces to other pre-trained language models, such as OpenAI's GPT and GPT-2). We chose the PyTorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but give little insight into how things work) and raw TensorFlow code (which contains lots of detail that tends to sidetrack us into lessons about TensorFlow, when the point here is BERT!).

At this point, Hugging Face's library seems to be the most widely accepted and most powerful interface for working with BERT. In addition to supporting a variety of pre-trained models, the library includes pre-built modifications of these models suited to specific tasks. For example, in this tutorial we will use BertForSequenceClassification for text classification.

The library also provides task-specific classes for token classification, question answering, next sentence prediction, and so on. Using these pre-built classes simplifies the process of modifying BERT for your purposes. Install the Transformers library:

!pip install transformers

The code in this tutorial is really a simplified version of the Hugging Face example script run_glue.py.

run_glue.py is a useful utility that lets you pick which GLUE task to run and which pre-trained model to use. It also supports running on a CPU, a single GPU, or multiple GPUs. It even supports 16-bit precision if you want to speed things up further.

Unfortunately, all of that configurability makes the code quite hard to read. This tutorial simplifies it greatly and adds plenty of comments so you can see what is happening and why.

2. Load the CoLA dataset

We will use The Corpus of Linguistic Acceptability (CoLA) dataset for single-sentence classification. It is a set of sentences labeled as grammatically correct or incorrect. It was first released in May 2018 and is one of the datasets included in the GLUE benchmark.

2.1. Download & Decompress

We use wget to download the dataset. First, install wget:

!pip install wget

Download the dataset:

import wget
import os

print('Downloading dataset... ')

# Download link for dataset
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the dataset if it isn't already on the local disk
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

Once unzipped, you can see these files in the file system window on the left of Colab:

# Unzip the dataset if it hasn't been unzipped yet
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip

2.2. Parsing

From the names of the extracted files you can tell which are tokenized versions and which are the raw data.

We use the raw, untokenized version of the data, because to apply pre-trained BERT we must use the tokenizer that ships with the model. This is because (1) the model has a specific, fixed vocabulary, and (2) BERT's tokenizer has a particular way of handling out-of-vocabulary words.
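As a quick illustration of point (2) (this uses the tokenizer we load in section 3.1 below), BERT's WordPiece tokenizer breaks a word that is not in its vocabulary into known subword pieces rather than mapping it to a generic unknown token:

# Illustration only - 'tokenizer' is the BertTokenizer loaded in section 3.1
print(tokenizer.tokenize("embeddings"))
# For 'bert-base-uncased' this prints something like:
# ['em', '##bed', '##ding', '##s']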

Use pandas to parse the in_domain_train.tsv file and preview the data:

import pandas as pd

# Load the dataset into a pandas dataframe
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Print the number of records in the dataset
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Sample 10 rows for a preview
df.sample(10)
sentence_source label label_notes sentence
1406 r-67 1 NaN A plan to negotiate an honorable end to the wa…
7315 sks13 0 * I said.
8277 ad03 0 * What Julie did of Lloyd was become fond.
621 bc01 1 NaN The ball lies completely in the box.
6646 m_02 1 NaN Very heavy, this parcel!
361 bc01 0 ?? Which problem do you wonder whether John said …
7193 sks13 0 * Will put, this girl in the red coat will put a…
4199 ks08 1 NaN The papers removed from the safe have not been…
5251 b_82 1 NaN He continued writing poems.
3617 ks08 1 NaN It was last night that the policeman met sever…

In the table above, we mostly care about the sentence and label fields. A label of 0 means "grammatically unacceptable" and 1 means "grammatically acceptable".

Here are five examples of grammatically unacceptable situations, which show how much more difficult this task is than sentiment analysis:

df.loc[df.label == 0].sample(5)[['sentence', 'label']]
sentence label
4867 They investigated. 0
200 The more he reads, the more books I wonder to … 0
4593 Any zebras can’t fly. 0
3226 Cities destroy easily. 0
7337 The time elapsed the day. 0

We load the sentence and label fields into numpy arrays:

# Construct sentences and Labels lists
sentences = df.sentence.values
labels = df.label.values

3. Tokenization & input formatting

In this section, we will convert the data set into a format that can be trained by BERT.

3.1. The BERT tokenizer

To feed text into BERT, it must first be split into tokens, and the tokens must then be mapped to their indices in the tokenizer's vocabulary.

Import the BERT tokenizer, using the "uncased" (lower-case) pre-trained model:

from transformers import BertTokenizer

# Load the BERT tokenizer
print('Loading BERT tokenizer... ')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Let’s try typing a sentence:

# Print the original sentence
print('Original: ', sentences[0])

# Print the sentence split into tokens
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Map each token to its vocabulary index
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))
-----
Original:  Our friends won't buy this analysis, let alone the next one we propose.
Tokenized:  ['our', 'friends', 'won', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.']
Token IDs:  [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]

For the actual training, we will use the tokenizer.encode function, which combines the tokenize and convert_tokens_to_ids steps above into a single call.
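As a quick sanity check (a small sketch, not part of the original pipeline), you can confirm that encode really is just those two steps combined, as long as special tokens are turned off:

# Sanity check: encode == tokenize + convert_tokens_to_ids (when special tokens are off)
manual_ids  = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0]))
encoded_ids = tokenizer.encode(sentences[0], add_special_tokens=False)
assert manual_ids == encoded_ids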

Before we do that, let’s talk about BERT’s formatting requirements.

3.2. Formatting requirements

BERT requires us to:

  1. Add special symbols at the beginning and end of sentences
  2. Pad or truncate all sentences to a single fixed length
  3. Use the “attention mask” to distinguish filled tokens from unfilled tokens.

Special symbols

[SEP]

At the end of each sentence, a special [SEP] symbol needs to be added.

In two-sentence tasks (for example, can the answer to the question in sentence A be found in sentence B?), this token serves as the separator between the two sentences.

I am not yet sure why this token is required for single-sentence input, but it is, so we add it anyway.

[CLS]

In the classification task, we need to insert the [CLS] symbol at the beginning of each sentence.

This token has special significance. BERT consists of 12 stacked Transformer layers. Each layer takes a list of token embeddings as input and produces the same number of embeddings as output (with their values changed, of course).

On the output of the final transformer layer, only the first embedding (the one corresponding to the [CLS] token) is fed into the classifier.

"The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks." (From the BERT paper)

You might think of applying some pooling strategy over the final layer's embeddings, but it isn't necessary. Because BERT is trained to use only [CLS] for classification, the model is motivated to encode everything needed for the classification step into that single 768-dimensional embedding vector. In effect, it has already done the pooling for us.
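If you want to inspect that 768-dimensional [CLS] vector yourself, here is a minimal sketch. It uses the plain BertModel encoder (not the classification model we fine-tune below) and the tuple-style outputs of the Transformers version used in this tutorial:

from transformers import BertModel

# Sketch only: look at the [CLS] embedding produced by the plain BERT encoder
bert = BertModel.from_pretrained('bert-base-uncased')

ids = torch.tensor([tokenizer.encode(sentences[0], add_special_tokens=True)])

with torch.no_grad():
    last_hidden_state = bert(ids)[0]         # shape: (1, sequence_length, 768)

cls_embedding = last_hidden_state[:, 0, :]   # the [CLS] token sits at position 0
print(cls_embedding.shape)                   # torch.Size([1, 768])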

Sentence Length & Attention Mask

Our dataset obviously contains sentences of widely varying lengths, so how does BERT handle this?

BERT has two constraints:

  1. All sentences must be padded or truncated to a single, fixed length; the maximum sentence length is 512 tokens.

  2. Padding is done with a special [PAD] token, which is at index 0 in BERT's vocabulary.

The "attention mask" is simply an array of 0s and 1s indicating which tokens are padding and which are not. The mask tells BERT's "self-attention" mechanism not to incorporate these padding tokens into its interpretation of the sentence.
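To make the mask concrete, here is a tiny hand-built example (a sketch; the token IDs are illustrative, only the [PAD] ID of 0 matters) of a 5-token sequence padded out to a fixed length of 8:

# Sketch: 5 real tokens padded to a fixed length of 8
input_ids      = [101, 2023, 2003, 2307, 102, 0, 0, 0]   # 0 is BERT's [PAD] token
attention_mask = [  1,    1,    1,    1,   1, 0, 0, 0]   # 1 = real token, 0 = padding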

The maximum sentence length setting affects training and evaluation speed. For example, on a Tesla K80:

MAX_LEN = 128   # training takes 5:28 per epoch
MAX_LEN = 64    # training takes 2:27 per epoch

3.3. Tokenizing the dataset

The Transformers library provides an encode function that handles most of the parsing and data preprocessing for us.

Before we can encode the text, we need to determine the MAX_LEN parameter. The following code calculates the maximum sentence length in the dataset:

max_len = 0
for sent in sentences:

    # split text and add '[CLS]' and '[SEP]' symbols
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Based on this, we set MAX_LEN to 64 to leave a little headroom. Now let's perform the real tokenization.

The function tokenizer.encode_plus contains the following steps:

  1. Divide the sentence into tokens.
  2. Add the special [CLS] and [SEP] tokens at the start and end.
  3. Map the tokens to their vocabulary IDs.
  4. Padding or truncating the list to a fixed length.
  5. Create attention masks to distinguish between filled and non-filled tokens.
# Tokenize the full dataset and store the results in lists
input_ids = []
attention_masks = []

for sent in sentences:
    encoded_dict = tokenizer.encode_plus(
                        sent,                           # Input text
                        add_special_tokens = True,      # Add '[CLS]' and '[SEP]'
                        max_length = 64,                # Pad & truncate length
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Return attention masks
                        return_tensors = 'pt',          # Return data as PyTorch tensors
                   )
    
    # Add the encoded sentence to the list
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (which simply distinguishes padding from non-padding)
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 0, in its original form and encoded as IDs
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

3.4. Split the training set and verification set

Using 90% of the dataset as a training set and the remaining 10% as a validation set:

from torch.utils.data import TensorDataset, random_split

# merge input data into TensorDataset objects
dataset = TensorDataset(input_ids, attention_masks, labels)

# Calculate the size of training set and validation set
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Randomly split training set and test set according to data size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

We use the DataLoader class to read the data set, which is more memory efficient during training than a regular for loop:

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# In fine-tune training, BERT recommends small batch sizes of 16 or 32
batch_size = 32

# Create DataLoaders for the training and validation sets, reading the training samples in random order
train_dataloader = DataLoader(
            train_dataset,  # Training sample
            sampler = RandomSampler(train_dataset), # Random small batches
            batch_size = batch_size # Train in small batches
        )

# For validation the order doesn't matter, so we read the batches sequentially
validation_dataloader = DataLoader(
            val_dataset, # Validation sample
            sampler = SequentialSampler(val_dataset), # Select small batches in sequence
            batch_size = batch_size 
        )

4. Train the classification model

Now that the input data for the model is ready, it’s time to start fine-tuning.

4.1. BertForSequenceClassification

For this task, we first need to modify the pre-trained BERT model to produce classification outputs, and then continue training the model on our dataset until the whole model, end-to-end, is well suited to the task.

Fortunately, HuggingFace's PyTorch implementation includes a set of interfaces designed for different NLP tasks. All of these interfaces are built on top of the trained BERT model, each with different top layers and output types suited to its specific NLP task.

Here is a list of classes currently available for fine tuning:

  • BertModel
  • BertForPreTraining
  • BertForMaskedLM
  • BertForNextSentencePrediction
  • BertForSequenceClassification (the one we will use)
  • BertForTokenClassification
  • BertForQuestionAnswering

The documentation for these classes is here.

We will use BertForSequenceClassification. This is the normal BERT model with a single linear layer added on top for classification. As we feed input data in, the entire pre-trained BERT model and the additional untrained classification layer are trained together on our specific task.
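To build some intuition for how small that addition is, here is an illustrative sketch of what such a classification head boils down to (this is not HuggingFace's actual implementation, just the idea):

import torch.nn as nn

# Illustrative sketch of a BERT classification head: dropout + one linear layer
class TinyClassificationHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2, dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_cls_output):
        # pooled_cls_output: a (batch_size, 768) vector derived from the [CLS] token
        return self.classifier(self.dropout(pooled_cls_output))  # (batch_size, 2) logits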

Okay, now we load BERT! Several different pre-trained models are available; "bert-base-uncased" is the version with only lower-case letters ("uncased") and is the smaller of the "base" and "large" variants.

The documentation for the from_pretrained interface is here, and the additional parameters are described here.

from transformers import BertForSequenceClassification, AdamW, BertConfig

# Load BertForSequenceClassification: the pre-trained BERT model with a single
# linear classification layer on top
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # Use the 12-layer, lower-cased pre-trained model
    num_labels = 2,       # Number of output labels -- 2 for binary classification
                          # You can increase this for multi-class tasks
    output_attentions = False,    # Whether the model returns attention weights
    output_hidden_states = False, # Whether the model returns all hidden states
)

# Run the model on the GPU
model.cuda()

Out of curiosity, we can view all model parameters by parameter name.

The parameter name and its shape are printed below:

  1. Embedding layer
  2. The first of the 12 transformer layers
  3. Output layer
# Get all of the model's parameters as a list of (name, tensor) tuples
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The output:

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (768,)
bert.encoder.layer.0.attention.output.LayerNorm.weight        (768,)
bert.encoder.layer.0.attention.output.LayerNorm.bias          (768,)
bert.encoder.layer.0.intermediate.dense.weight           (3072, 768)
bert.encoder.layer.0.intermediate.dense.bias                 (3072,)
bert.encoder.layer.0.output.dense.weight                  (768, 3072)
bert.encoder.layer.0.output.dense.bias                        (768,)
bert.encoder.layer.0.output.LayerNorm.weight                  (768,)
bert.encoder.layer.0.output.LayerNorm.bias                    (768,)

==== Output Layer ====

bert.pooler.dense.weight                                  (768, 768)
bert.pooler.dense.bias                                        (768,)
classifier.weight                                           (2, 768)
classifier.bias                                                 (2,)

4.2. Optimizer & Learning rate scheduler

With the model loaded, the next step is to tune the hyperparameters.

For fine-tuning, BERT's authors recommend choosing from the following hyperparameter values (from Appendix A.3 of the BERT paper):

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4

Our options are as follows:

  • Batch size: 32 (set when building DataLoaders)
  • Learning rate: 2e-5
  • Epochs: 4 (we will see that this value is a bit large for this task)

The argument epsilon = 1e-8 is a very small value used to prevent division by zero in the implementation.
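To see where epsilon fits, here is a rough sketch of a single Adam-style parameter update (simplified, with the bias-corrected moment estimates taken as given; this is not the library's actual code):

def adam_step(param, m_hat, v_hat, lr=2e-5, eps=1e-8):
    # m_hat / v_hat: bias-corrected running averages of the gradient and the
    # squared gradient; eps keeps the denominator away from zero when v_hat ~ 0
    return param - lr * m_hat / (v_hat ** 0.5 + eps)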

You can find the creation of the AdamW optimizer in run_glue.py:

# Note: AdamW is the optimizer class imported from the transformers library
# (I believe the 'W' stands for 'weight decay fix')
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,   # args.learning_rate - default is 5e-5
                  eps = 1e-8   # args.adam_epsilon - default is 1e-8
                )
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. BERT's authors recommend between 2 and 4;
# more than that tends to overfit
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, 
                                            num_training_steps = total_steps)
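With num_warmup_steps = 0, this schedule simply decays the learning rate linearly from its initial value down to zero over the course of training. A quick sketch of that curve (illustrative only, computed by hand rather than queried from the scheduler):

# Sketch: the learning rate under a linear schedule with no warmup
initial_lr = 2e-5
lrs = [initial_lr * (total_steps - step) / total_steps for step in range(total_steps)]

print('First step lr: %.2e' % lrs[0])    # 2.00e-05
print('Last step lr:  %.2e' % lrs[-1])   # close to zero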

4.3. Training cycle

Below is our training loop. There is a lot of code, but fundamentally each pass through the loop has a training phase and an evaluation phase.

Training:

  • Retrieve input sample and label data
  • Load this data into the GPU
  • Clear the gradient calculation of the last iteration
    • Gradients in PyTorch are cumulative (useful in RNN) and in this case need to be cleared manually before each iteration
  • Forward pass (feed the input data through the network)
  • Backward pass (backpropagation)
  • Tell the optimizer to update the parameters
  • Track variables for monitoring progress

Evaluation:

  • Retrieve input sample and label data
  • Load this data into the GPU
  • Forward pass (compute predictions)
  • Calculate loss and monitor the entire assessment process

Define a helper function to calculate accuracy:

import numpy as np

# Calculate the accuracy of our predictions against the labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

A helper function to format elapsed times as hh:mm:ss:

import time
import datetime

def format_time(elapsed):
    ''' Takes a time in seconds and returns a string hh:mm:ss '''
    # Round to the nearest second
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

All training code:

import random
import numpy as np

# The following training code is based on the 'run_glue.py' script:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set a random seed value to make the output reproducible
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store statistics such as loss, accuracy, and elapsed time for training and validation
training_stats = []

# Count the total training time
total_t0 = time.time()

for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training... ')

    # Count the training time of single epoch
    t0 = time.time()

    # Reset the total training loss for this epoch
    total_train_loss = 0

    # Put the model into training mode. This call doesn't itself do any training;
    # it just makes the dropout and batchnorm layers behave as they should during
    # training rather than testing.
    # (See https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # Small batch iteration of training set
    for step, batch in enumerate(train_dataloader):

        # Print a progress update every 40 batches
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(' Batch {:>5,} of {:>5,}. Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Prepare the input data and copy it to the GPU
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear previously calculated gradients before the backward pass, because PyTorch accumulates gradients
        model.zero_grad()        

        # forward propagation
        # Documentation see also:
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # The model returns different outputs depending on its arguments; here it returns the loss and the logits (the model's raw predictions)
        loss, logits = model(b_input_ids, 
                             token_type_ids=None, 
                             attention_mask=b_input_mask, 
                             labels=b_labels)

        # accumulative loss
        total_train_loss += loss.item()

        # Backpropagation
        loss.backward()

        # Clip gradients to a norm of 1.0 to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # Update learning rate
        scheduler.step()

    # Mean training error
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Training duration of single epoch
    training_time = format_time(time.time() - t0)

    print("")
    print(" Average training loss: {0:.2f}".format(avg_train_loss))
    print(" Training epcoh took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After each training epoch, measure our performance on the validation set

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Load the input data onto the GPU
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # No need to update parameters and calculate gradients for evaluation
        with torch.no_grad():        
            (loss, logits) = model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)
            
        # accumulative loss
        total_eval_loss += loss.item()

        # Load the predicted results and labels into the CPU for calculation
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculation accuracy
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the accuracy for this validation run
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print(" Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Collect the loss of this epoch
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # Count the duration of this evaluation
    validation_time = format_time(time.time() - t0)
    
    print(" Validation Loss: {0:.2f}".format(avg_val_loss))
    print(" Validation took: {:}".format(validation_time))

    # Record all statistics for this epoch
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

Let’s take a look at the overall training overview:

import pandas as pd

# Display floats with two decimal places
pd.set_option('precision', 2)

# Load the training statistics into a DataFrame
df_stats = pd.DataFrame(data=training_stats)

# Use the epoch number as the row index
df_stats = df_stats.set_index('epoch')

# Display the table
df_stats
epoch Training Loss Valid. Loss Valid. Accur. Training Time Validation Time
1 0.50 0.45 0.80 0:00:51 0:00:02
2 0.32 0.46 0.81 0:00:51 0:00:02
3 0.22 0.49 0.82 0:00:51 0:00:02
4 0.16 0.55 0.82 0:00:51 0:00:02

Notice that while the training loss goes down with each epoch, the validation loss rises! This suggests we are training the model too long, in other words, that it is overfitting.

During evaluation, the validation loss is a more precise measure than accuracy, because accuracy doesn't care about the exact output values: it only checks which side of the decision threshold each prediction falls on. The loss, in contrast, also penalizes predictions that are correct but made with low confidence.
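Here is a tiny, self-contained illustration (a sketch with made-up probabilities, not part of the tutorial's pipeline) of why the loss is the more informative signal:

import numpy as np

# Two predictions for a sample whose true class is 1. Both are "correct"
# (class 1 gets the highest probability), so both contribute the same accuracy,
# but the cross-entropy loss rewards the more confident prediction.
confident = np.array([0.05, 0.95])   # predicted probabilities for classes 0 and 1
hesitant  = np.array([0.45, 0.55])

true_class = 1
print('Confident loss: %.2f' % -np.log(confident[true_class]))  # ~0.05
print('Hesitant loss:  %.2f' % -np.log(hesitant[true_class]))   # ~0.60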

Let's plot the training and validation loss for each epoch and compare them:

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Drawing Style Settings
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12.6)

# Plot the learning curve
plt.plot(df_stats['Training Loss'], 'b-o', label="Training")
plt.plot(df_stats['Valid. Loss'], 'g-o', label="Validation")

# Label the plot.
plt.title("Training & Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.xticks([1, 2, 3, 4])

plt.show()

5. Test performance on the test set

Next we load the test set and evaluate the model with the Matthews correlation coefficient (MCC), because this is the metric the wider NLP community uses to measure performance on CoLA. With this metric, +1 is the best score and -1 is the worst, which lets us compare our performance directly against the best models for this task.

5.1. Data preparation

We process the test set the same way we processed the training set:

import pandas as pd

# Load the dataset into a pandas dataframe
df = pd.read_csv("./cola_public/raw/out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of test sentences
print('Number of test sentences: {:,}\n'.format(df.shape[0]))

# Create sentence and label lists
sentences = df.sentence.values
labels = df.label.values

# Tokenize, pad, and truncate, exactly as we did for the training data
input_ids = []
attention_masks = []
for sent in sentences:
    encoded_dict = tokenizer.encode_plus(
                        sent,                      
                        add_special_tokens = True, 
                        max_length = 64,           
                        pad_to_max_length = True,
                        return_attention_mask = True,   
                        return_tensors = 'pt',     
                   )
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

batch_size = 32  

# Prepare the test set DataLoader
prediction_data = TensorDataset(input_ids, attention_masks, labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

5.2. Evaluating performance on the test set

With the test set prepared, we can use our fine-tuned model to generate predictions for it:

# Make predictions on the test set

print('Predicting labels for {:,} test sentences... '.format(len(input_ids)))
# Still in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# prediction
for batch in prediction_dataloader:
  # Move this batch to the GPU
  batch = tuple(t.to(device) for t in batch)
  b_input_ids, b_input_mask, b_labels = batch
  
  # No need to calculate the gradient
  with torch.no_grad():
      # Forward propagation to obtain the predicted results
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  # Move the results to the CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store forecast results and labels
  predictions.append(logits)
  true_labels.append(label_ids)

print(' DONE.')

We use the Matthews correlation coefficient (MCC) to assess performance on the test set because the class distribution is imbalanced:

print('Positive samples: %d of %d (%.2f%%)' % (df.label.sum(), len(df.label), (df.label.sum() / len(df.label) * 100.0)))
-----
Positive samples: 354 of 516 (68.60%)  
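With roughly 68.6% of the samples labeled positive, plain accuracy would look respectable even for a degenerate classifier that always predicts the majority class, while its MCC would be 0. A small sketch with made-up predictions (only to illustrate the point):

from sklearn.metrics import matthews_corrcoef

# A degenerate "classifier" that always predicts the majority class (label 1)
true_labels_toy = [1] * 354 + [0] * 162   # same class balance as our test set
always_positive = [1] * 516

accuracy = sum(p == t for p, t in zip(always_positive, true_labels_toy)) / len(true_labels_toy)
print('Accuracy: %.3f' % accuracy)   # 0.686 - looks decent, but is meaningless here
print('MCC: %.3f' % matthews_corrcoef(true_labels_toy, always_positive))  # 0.0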

The final score will be based on the full test set, but we also compute the score for each individual batch to get a sense of how much it varies from batch to batch.

from sklearn.metrics import matthews_corrcoef

matthews_set = []

# Calculate the MCC of each batch
print('Calculating Matthews Corr. Coef. for each batch... ')

# For each input batch...
for i in range(len(true_labels)):
  pred_labels_i = np.argmax(predictions[i], axis=1).flatten()
  
  # Calculate the MCC for this batch
  matthews = matthews_corrcoef(true_labels[i], pred_labels_i)                
  matthews_set.append(matthews)

# Create a bar chart showing the MCC score for each batch of test samples
ax = sns.barplot(x=list(range(len(matthews_set))), y=matthews_set, ci=None)

plt.title('MCC Score per Batch')
plt.ylabel('MCC Score (-1 to +1)')
plt.xlabel('Batch #')

plt.show()

We combined the results of all batches to calculate the final MCC score:

# Merge all batch predictions
flat_predictions = np.concatenate(predictions, axis=0)

# Take the maximum value of each sample as the predicted value
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# merge all labels
flat_true_labels = np.concatenate(true_labels, axis=0)

# calculation of MCC
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)

print('Total MCC: %.3f' % mcc)

-----
Total MCC: 0.498

Cool! In about half an hour, and without doing any hyperparameter tuning (learning rate, epochs, batch size, Adam properties, etc.), we obtained a good score.

The library's documentation lists the expected accuracy for this benchmark here. You can also check out the official leaderboard here.

Conclusion

This tutorial showed how, starting from a pre-trained BERT model, you can quickly and efficiently create a high-quality NLP model with relatively little data and training time.

Appendix

A.1. Store & load fine-tuning model

The following code (taken from run_glue.py) saves the model and tokenizer to disk:

import os

# Directory to save the model to
output_dir = './model_save/'

# Create the output directory if it doesn't already exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save the trained model, its configuration and the tokenizer using 'save_pretrained()'.
# They can then be reloaded with 'from_pretrained()'
model_to_save = model.module if hasattr(model, 'module') else model  # Allow for distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))


To copy the model from the Colab notebook to your Google Drive:

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Copy the model files to a directory on Google Drive
!cp -r ./model_save/ "./drive/Shared drives/AI/BERT Fine-Tuning/"

The following code loads the model from disk

# Load the fine-tuned model and its tokenizer. ('model_class' and 'tokenizer_class' here
# stand for the classes used above, e.g. BertForSequenceClassification and BertTokenizer.)
model = model_class.from_pretrained(output_dir)
tokenizer = tokenizer_class.from_pretrained(output_dir)

# Move the model to the GPU/CPU it will run on
model.to(device)

A.2. Weight decay

The HuggingFace example includes the following code for setting up weight decay, but since the default decay rate in the script is 0, I've moved this code to the appendix.

This snippet essentially tells the optimizer not to apply weight decay to the bias parameters. (Weight decay is a form of regularization applied after the gradients are computed.)

# code source:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L102

# Don't apply weight decay to any parameters whose names include these strings
# (here the BERT model has no `gamma` or `beta` parameters, only `bias` terms)
no_decay = ['bias', 'LayerNorm.weight']

# Separate the 'weight' parameters from the 'bias' parameters:
# - the 'weight' parameters get a non-zero 'weight_decay_rate' (0.1 below)
# - the 'bias' parameters get a 'weight_decay_rate' of 0.0
optimizer_grouped_parameters = [
    # Filter for all parameters which *don't* include 'bias', 'gamma', 'beta'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.1},
    
    # Filter for parameters which *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

# Note: 'optimizer_grouped_parameters' only contains the parameter values, not their names
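For completeness, here is how these parameter groups would actually be used (a sketch following run_glue.py; note that param_optimizer, the list of named parameters, must be defined before the groups above are built):

# Sketch: the named-parameter list the groups above are built from...
param_optimizer = list(model.named_parameters())

# ...and the groups then take the place of model.parameters() in the AdamW call
optimizer = AdamW(optimizer_grouped_parameters,
                  lr = 2e-5,
                  eps = 1e-8)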

The code above has been verified to run on Google Colab; the notebook is available here: colab.research.google.com/drive/1sfAy…