Since Google released BERT in 2018, pre-trained models have led the way in NLP. BERT, an epoch-making deep model in the history of NLP, is undoubtedly powerful, and a BERT pre-trained model is enough for most practical scenarios. In this series the author presents a highly reusable set of text classification code built on a BERT pre-trained model. The code will be explained in detail over three articles; this one covers the data reading part.
Source code download address: download link (extraction code: 2021)
The code uses the IFLYTEK long-text classification dataset from the Chinese Language Understanding Evaluation benchmark (CLUE). The CLUE paper was accepted at the International Conference on Computational Linguistics (COLING 2020).
Overview
Data reading is implemented in the data_loader.py file. This article goes through each class and function in that file and clarifies how they relate to one another. The classes and functions in the file are:
- InputExample: a sample object; one is created per sample, and its internal methods can be overridden according to the task.
- InputFeatures: a feature object; one is created per feature, and its internal methods can be overridden according to the task.
- iflytekProcessor: a processor object that reads the file data and returns InputExample objects. The class name is not fixed and the reading logic can be adapted to the specific task; here it is named iflytekProcessor after the IFLYTEK dataset.
- convert_examples_to_features: converts InputExample objects into InputFeatures objects.
- load_and_cache_examples: loads the InputFeatures produced by convert_examples_to_features and caches them, so they do not have to be regenerated on every run.
Logic: iflytekProcessor reads the raw file and produces InputExample objects; convert_examples_to_features turns them into InputFeatures; load_and_cache_examples caches the features and wraps them in a TensorDataset.
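For reference, the snippets in this article assume the following module-level setup in data_loader.py (a sketch; the exact imports in the source file may differ):

# Module-level imports assumed by the snippets below (a sketch; the actual
# data_loader.py may organize these differently).
import copy
import json
import logging
import os

import pandas as pd
import torch
from torch.utils.data import TensorDataset

logger = logging.getLogger(__name__)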
InputExample
This class is simple; its initializer defines the attributes of an input sample:
- guid: the unique ID of the sample.
- words: the text of the example.
- label: the label of the example.
It usually needs no modification; for tasks such as text matching, a second sentence can be added (for example self.words_pair = words_pair), depending on the task. A sketch of such an extension follows the class definition below.
class InputExample(object):
    """A single training/test example for simple sequence classification.

    Args:
        guid: Unique id for the example.
        words: list. The words of the sequence.
        label: (Optional) string. The label of the example.
    """

    def __init__(self, guid, words, label=None):
        self.guid = guid
        self.words = words
        self.label = label

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
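For a sentence-pair task such as text matching, the class can be extended with a second text field. A minimal sketch (the attribute name words_pair is an assumption, not part of the original code):

# Sketch: an InputExample variant for sentence-pair tasks such as text matching.
# The attribute name `words_pair` is hypothetical and not taken from the original code.
class PairInputExample(InputExample):
    def __init__(self, guid, words, words_pair, label=None):
        super().__init__(guid, words, label=label)
        self.words_pair = words_pair  # second sentence of the pair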
InputFeatures
This class describes BERT's input format: input_ids, attention_mask, and token_type_ids, plus label_id.
If a BERT model is used it does not need to be modified; for other pre-trained models, adapt it to their input format.
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, attention_mask, token_type_ids, label_id):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.label_id = label_id

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
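As an illustration, a feature for a short sentence with max_seq_len = 8 might look like the following (a hypothetical toy example; the concrete ids depend on the tokenizer's vocabulary):

# Hypothetical toy example of an InputFeatures instance with max_seq_len = 8
# (the concrete ids depend on the tokenizer's vocabulary).
feature = InputFeatures(
    input_ids=[101, 2769, 4263, 872, 102, 0, 0, 0],   # [CLS] tokens [SEP] + padding
    attention_mask=[1, 1, 1, 1, 1, 0, 0, 0],          # 1 for real tokens, 0 for padding
    token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0],          # single-sentence input: all zeros
    label_id=3,
)
print(feature)  # __repr__ prints the JSON serialization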
iflytekProcessor
This class reads the data file; it is written for the specific task and can be adapted as needed.
class iflytekProcessor(object):
    """Processor for the IFLYTEK data set."""

    def __init__(self, args):
        self.args = args
        self.labels = get_labels(args)  # get_labels is defined elsewhere in the project
        self.input_text_file = 'data.csv'

    @classmethod
    def _read_file(cls, input_file, quotechar=None):
        """Reads a comma separated value file."""
        df = pd.read_csv(input_file)
        return df

    def _create_examples(self, datas, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for i, rows in datas.iterrows():
            try:
                guid = "%s-%s" % (set_type, i)
                # 1. input_text
                words = rows["text"]
                # 2. label
                label = rows["labels"]
            except Exception:
                print(rows)
                continue  # skip malformed rows
            examples.append(InputExample(guid=guid, words=words, label=label))
        return examples

    def get_examples(self, mode):
        """
        Args:
            mode: train, dev, test
        """
        data_path = os.path.join(self.args.data_dir, self.args.task, mode)
        logger.info("LOOKING AT {}".format(data_path))
        return self._create_examples(datas=self._read_file(os.path.join(data_path, self.input_text_file)),
                                     set_type=mode)
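From the code above, the processor expects a directory layout of {data_dir}/{task}/{mode}/data.csv, where data.csv contains at least a text column and a labels column. A minimal sketch of preparing such a file (the path data/iflytek/train and the example rows are made up for illustration):

# Sketch of the directory layout and CSV columns the processor expects:
# {data_dir}/{task}/{mode}/data.csv with "text" and "labels" columns.
# The path and example rows below are made up for illustration.
import os
import pandas as pd

sample = pd.DataFrame({
    "text": ["这是一款关于理财的应用", "一款好玩的休闲益智小游戏"],
    "labels": [12, 22],
})
os.makedirs("data/iflytek/train", exist_ok=True)
sample.to_csv("data/iflytek/train/data.csv", index=False)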
convert_examples_to_features
Each example is encoded with BERT's tokenizer and converted into input_ids, attention_mask, and token_type_ids, padded or truncated to max_seq_len, to produce the features.
def convert_examples_to_features(examples, max_seq_len, tokenizer,
                                 cls_token_segment_id=0,
                                 pad_token_segment_id=0,
                                 sequence_a_segment_id=0,
                                 mask_padding_with_zero=True):
    # Setting based on the current model type
    cls_token = tokenizer.cls_token
    sep_token = tokenizer.sep_token
    unk_token = tokenizer.unk_token
    pad_token_id = tokenizer.pad_token_id

    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        # Tokenize word by word (for NER)
        tokens = []
        for word in example.words:
            word_tokens = tokenizer.tokenize(word)
            if not word_tokens:
                word_tokens = [unk_token]  # For handling the bad-encoded word
            tokens.extend(word_tokens)

        # Account for [CLS] and [SEP]
        special_tokens_count = 2
        if len(tokens) > max_seq_len - special_tokens_count:
            tokens = tokens[:(max_seq_len - special_tokens_count)]

        # Add [SEP] token
        tokens += [sep_token]
        token_type_ids = [sequence_a_segment_id] * len(tokens)

        # Add [CLS] token
        tokens = [cls_token] + tokens
        token_type_ids = [cls_token_segment_id] + token_type_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_len - len(input_ids)
        input_ids = input_ids + ([pad_token_id] * padding_length)
        attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
        token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

        assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
        assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(len(attention_mask), max_seq_len)
        assert len(token_type_ids) == max_seq_len, "Error with token type length {} vs {}".format(len(token_type_ids), max_seq_len)

        label_id = int(example.label)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % example.guid)
            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("attention_mask: %s" % " ".join([str(x) for x in attention_mask]))
            logger.info("token_type_ids: %s" % " ".join([str(x) for x in token_type_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
            InputFeatures(input_ids=input_ids,
                          attention_mask=attention_mask,
                          token_type_ids=token_type_ids,
                          label_id=label_id,
                          ))

    return features
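A quick way to sanity-check the conversion is to run it on a single hand-built example. The following sketch assumes the transformers library and a Chinese BERT checkpoint; the checkpoint name bert-base-chinese is an assumption, not specified in the article:

# Sketch: exercising convert_examples_to_features on one hand-built example.
# "bert-base-chinese" is an assumed checkpoint, not prescribed by the article.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
example = InputExample(guid="demo-0", words="这是一款关于理财的应用", label=12)
features = convert_examples_to_features([example], max_seq_len=32, tokenizer=tokenizer)
print(features[0].input_ids)        # padded to length 32
print(features[0].attention_mask)   # 1s for real tokens, 0s for padding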
load_and_cache_examples
As its name suggests, this function both loads and caches the data (processors is a module-level dict that maps a task name to its processor class, e.g. mapping "iflytek" to iflytekProcessor):
1. Build the cache file name cached_features_file from the parameters, in the form cached_{mode}_{task}_{model name}_{max_seq_len}, e.g. cached_train_iflytek_bert-base-chinese_512.
2. Check whether the cache file already exists under the data directory. If it does, load the features directly from it; otherwise:
- first, use the processor to obtain the examples;
- then, use convert_examples_to_features to obtain the features and save them to the cache file.
3. Convert the features into tensors and build the dataset with TensorDataset.
def load_and_cache_examples(args, tokenizer, mode):
    processor = processors[args.task](args)

    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        'cached_{}_{}_{}_{}'.format(
            mode,
            args.task,
            list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.max_seq_len
        )
    )

    print(cached_features_file)
    if os.path.exists(cached_features_file):
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        # Load data features from dataset file
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if mode == "train":
            examples = processor.get_examples("train")
        elif mode == "dev":
            examples = processor.get_examples("dev")
        elif mode == "test":
            examples = processor.get_examples("test")
        else:
            raise Exception("For mode, only train, dev, test is available")

        features = convert_examples_to_features(examples,
                                                args.max_seq_len,
                                                tokenizer,
                                                )
        logger.info("Saving features into cached file %s", cached_features_file)
        torch.save(features, cached_features_file)

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor(
        [f.input_ids for f in features],
        dtype=torch.long
    )
    all_attention_mask = torch.tensor(
        [f.attention_mask for f in features],
        dtype=torch.long
    )
    all_token_type_ids = torch.tensor(
        [f.token_type_ids for f in features],
        dtype=torch.long
    )
    all_label_ids = torch.tensor(
        [f.label_id for f in features],
        dtype=torch.long
    )

    dataset = TensorDataset(all_input_ids, all_attention_mask,
                            all_token_type_ids, all_label_ids)
    return dataset
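A typical way to consume the returned TensorDataset in training is to wrap it in a DataLoader. The following sketch is illustrative; the batch size and sampler are choices of this example, and args must carry the fields used above (data_dir, task, model_name_or_path, max_seq_len):

# Sketch: feeding the dataset into a PyTorch DataLoader for training.
# Batch size and sampler are illustrative, not prescribed by the article.
from torch.utils.data import DataLoader, RandomSampler

train_dataset = load_and_cache_examples(args, tokenizer, mode="train")
train_loader = DataLoader(train_dataset,
                          sampler=RandomSampler(train_dataset),
                          batch_size=32)

for batch in train_loader:
    input_ids, attention_mask, token_type_ids, label_ids = batch
    # the model forward pass would go here
    break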
Sample output: when the features are built, the first five converted examples are logged with their guid, tokens, input_ids, attention_mask, token_type_ids, and label, which makes it easy to verify that the encoding is correct.
Preview: model construction and training will be covered in the follow-up articles, to be continued... I am new to NLP and my knowledge is shallow; if there are mistakes or shortcomings, criticism is welcome!