
Highly Reusable BERT Text Classification Code

Highly Reusable BERT Text Classification Code (1): Data Reading

Source code walkthrough

The models in the source code are stored separately in the model folder. Let's first look at module.py, which contains a simple fully connected neural network used as the classifier.

The classifier's network structure is very simple and consists of only two layers:

  • Dropout layer
  • Linear layer
# module.py
import torch.nn as nn

# classifier
class IntentClassifier(nn.Module):
    def __init__(self, input_dim, num_labels, dropout_rate=0.):
        super(IntentClassifier, self).__init__()
        self.dropout = nn.Dropout(dropout_rate)
        self.linear = nn.Linear(input_dim, num_labels)

    def forward(self, x):
        x = self.dropout(x)
        return self.linear(x)
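As a quick sanity check (not part of the repo), here is a minimal sketch of how this classifier maps a BERT-sized vector to label logits. The sizes below are illustrative assumptions, and IntentClassifier is the class defined above:

import torch

# hypothetical sizes: BERT-base hidden size 768, 5 intent labels
clf = IntentClassifier(input_dim=768, num_labels=5, dropout_rate=0.1)

pooled = torch.randn(8, 768)  # a fake batch of 8 pooled [CLS] vectors
logits = clf(pooled)          # shape: (8, 5), one score per label
print(logits.shape)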

Now let's focus on the BERT model code:

import torch
import torch.nn as nn
from transformers import BertPreTrainedModel, BertModel, BertConfig
from torchcrf import CRF
from .module import IntentClassifier


class ClsBERT(BertPreTrainedModel):
    def __init__(self, config, args, label_lst):
        super(ClsBERT, self).__init__(config)
        self.args = args
        self.num_labels = len(label_lst)
        self.bert = BertModel(config=config)  # Load pretrained bert

        self.classifier = IntentClassifier(config.hidden_size, self.num_labels, args.dropout_rate)


    def forward(self, input_ids, attention_mask, token_type_ids, label_ids):
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)  # sequence_output, pooled_output, (hidden_states), (attentions)
        sequence_output = outputs[0]
        pooled_output = outputs[1]  # [CLS]

        logits = self.classifier(pooled_output)

        outputs = ((logits),) + outputs[2:]  # add hidden states and attention if they are here

        # 1. Intent Softmax
        if label_ids is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))

            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

The key part is the forward method:

        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)  # sequence_output, pooled_output, (hidden_states), (attentions)
        sequence_output = outputs[0] # sequence_output = outputs.last_hidden_state
        pooled_output = outputs[1]   # [CLS] / pooled_output = outputs.pooler_output

self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids) returns (at least) two values: sequence_output, the last-layer hidden states for every token, with shape (batch_size, sequence_length, hidden_size); and pooled_output, the further-processed last-layer hidden state of the [CLS] token, with shape (batch_size, hidden_size).

  • The sequence_output vector can be obtained via outputs[0] or outputs.last_hidden_state.
  • The pooled_output vector can be obtained via outputs[1] or outputs.pooler_output.

Generally, for classification tasks, either the pooled [CLS] representation or an average-pooled version of the last layer is fed into the linear classifier. The code can use outputs.pooler_output directly as the linear layer's input; alternatively, outputs.last_hidden_state.mean(dim=1) (mean pooling over the sequence dimension) can be used. Which one works better depends on the task, so it is worth testing both yourself, as sketched below.
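A minimal sketch of the two pooling options, assuming a Hugging Face BertModel; the checkpoint name and the input sentence are illustrative choices, and the classifier call in the last comment refers to a linear classifier like the one above:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["a toy example"], return_tensors="pt")
outputs = bert(**batch)

# Option 1: the pooled [CLS] vector, shape (batch_size, hidden_size)
cls_input = outputs.pooler_output

# Option 2: mean pooling over all tokens of the last layer, same shape
mean_input = outputs.last_hidden_state.mean(dim=1)

# either tensor can be fed to the linear classifier, e.g. logits = classifier(mean_input)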

Improving the BERT output

We know that the BERT-base model consists of 12 Transformer layers. What if we want to take out the vectors of a specific layer, or concatenate the vectors of several layers?

Let's look at the return values of BertModel (which subclasses BertPreTrainedModel), as described in its docstring:

Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:

last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the model.

pooler_output: torch.FloatTensor of shape (batch_size, hidden_size) Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during Bert pretraining. This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

hidden_states: (optional, returned when config.output_hidden_states=True), list of torch.FloatTensor (one for the output of each layer + the output of the embeddings) of shape (batch_size, sequence_length, hidden_size): Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions: (optional, returned when config.output_attentions=True), list of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length): Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Among these, hidden_states is a list containing the output of every layer (plus the embedding output), each with shape (batch_size, sequence_length, hidden_size). To get this list, output_hidden_states=True must be set, either in the config when BERT is initialized or as an argument to the forward call; only then is hidden_states returned.
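A minimal sketch of inspecting hidden_states, again assuming a Hugging Face BertModel (the checkpoint name and input sentence are illustrative):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["another toy example"], return_tensors="pt")
outputs = bert(**batch, output_hidden_states=True)

# 13 tensors for bert-base: 1 embedding output + 12 Transformer layer outputs
print(len(outputs.hidden_states))        # 13
print(outputs.hidden_states[-1].shape)   # (batch_size, sequence_length, hidden_size)

# the last element equals last_hidden_state
print(torch.allclose(outputs.hidden_states[-1], outputs.last_hidden_state))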

Let's rewrite the code and extract the output vector of BERT's third-to-last Transformer layer:

class ClsBERT(BertPreTrainedModel):
    def __init__(self, config, args, label_lst):
        super(ClsBERT, self).__init__(config)
        self.args = args
        self.num_labels = len(label_lst)
        self.bert = BertModel(config=config)  # Load pretrained bert

        self.classifier = IntentClassifier(config.hidden_size, self.num_labels, args.dropout_rate)


    def forward(self, input_ids, attention_mask, token_type_ids, label_ids):
        # modify: add output_hidden_states=True
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            output_hidden_states=True)
        # mean-pool the third-to-last layer over the sequence dimension
        # -> (batch_size, hidden_size)
        pooled_output = outputs.hidden_states[-3].mean(dim=1)

        logits = self.classifier(pooled_output)

        outputs = ((logits),) + outputs[2:]  # add hidden states and attention if they are here

        # 1. Intent Softmax
        if label_ids is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))

            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

What if we want to concatenate the vectors of several layers? Initialize an empty tensor with torch.empty(...) on self.device, use a loop and torch.cat to concatenate the mean-pooled vector of each layer, and change the input size of the linear layer to match the hidden size of the concatenated pooled_output (i.e. hidden_size times the number of concatenated layers):

class ClsBERT(BertPreTrainedModel):
    def __init__(self, config, args, label_lst):
        super(ClsBERT, self).__init__(config)
        self.args = args
        self.num_labels = len(label_lst)
        self.bert = BertModel(config=config)  # Load pretrained bert
        self.concatnum = 3  # how many of the last hidden layers to mean-pool and concatenate

        # the linear layer input must match the concatenated hidden size
        self.classifier = IntentClassifier(config.hidden_size * self.concatnum, self.num_labels, args.dropout_rate)


    def forward(self, input_ids, attention_mask, token_type_ids, label_ids):
        # modify start: add output_hidden_states=True
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            output_hidden_states=True)
        # start from an empty float tensor of shape (batch_size, 0), then
        # concatenate the mean-pooled output of each of the last `concatnum` layers
        # -> (batch_size, hidden_size * concatnum)
        pooled_output = torch.empty(input_ids.size(0), 0, dtype=torch.float).to(self.device)
        for layer in outputs.hidden_states[-self.concatnum:]:
            pooled_output = torch.cat((pooled_output, layer.mean(dim=1)), dim=1)
        # modify end
        
        logits = self.classifier(pooled_output)

        outputs = ((logits),) + outputs[2:]  # add hidden states and attention if they are here

        # 1. Intent Softmax
        if label_ids is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))

            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
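To see why the linear layer's input size has to change, here is a small standalone sketch (not part of the repo) that reproduces the concatenation logic on dummy tensors; the batch size, sequence length, and layer count below are illustrative assumptions:

import torch

batch_size, seq_len, hidden_size = 8, 32, 768
concatnum = 3  # concatenate the last 3 layers

# fake hidden states: 1 embedding output + 12 layer outputs, as BERT-base would return
hidden_states = [torch.randn(batch_size, seq_len, hidden_size) for _ in range(13)]

pooled_output = torch.empty(batch_size, 0)
for layer in hidden_states[-concatnum:]:
    pooled_output = torch.cat((pooled_output, layer.mean(dim=1)), dim=1)

# the classifier must therefore take hidden_size * concatnum features as input
print(pooled_output.shape)  # torch.Size([8, 2304]) == (batch_size, hidden_size * concatnum)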

This concludes the walkthrough of the model code, along with the improvements to the form of the BERT output. The optimizer, learning rate, and loss function will be covered together with the model training code in the next article. I am new to NLP and my knowledge is limited, so if there are mistakes or places that could be improved, please point them out!