BERT usually uses subword tokenization. If you only want to split words on whitespace and do not want to use subwords, you can write something like this:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer

Train the tokenizer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
trainer = WordLevelTrainer(
    vocab_size=30000,
    show_progress=True,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# input_file.txt is the training corpus, one sentence per line.
files = ["data/pretrain/input_file.txt"]
tokenizer.train(files, trainer=trainer)
tokenizer.save("data/pretrain/vocab.json", pretty=True)
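The IDs 1 and 2 in the TemplateProcessing above correspond to [CLS] and [SEP] because the trainer assigns IDs to the special tokens in the order they are listed. A minimal sanity check you can run after training (optional):

# The trainer assigns IDs to special tokens in list order,
# so we expect [UNK]=0, [CLS]=1, [SEP]=2, [PAD]=3, [MASK]=4.
for tok in ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]:
    print(tok, tokenizer.token_to_id(tok))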

Load the tokenizer

tokenizer = Tokenizer.from_file("data/pretrain/vocab.json")
output = tokenizer.encode("55 46 47 48 158 159 46 19 160 157111111")
print(output.ids)
print(tokenizer.decode([1, 74, 37, 40, 16, 151, 156, 37, 7, 1473, 0, 2], skip_special_tokens=False))
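To see the post-processor template in action, you can also encode a sentence pair; [CLS] and [SEP] are inserted automatically and the second segment gets type ID 1. A minimal sketch (the two strings are made-up examples in the same style as above):

pair = tokenizer.encode("55 46 47", "48 158 159")
print(pair.tokens)    # expected: ['[CLS]', '55', '46', '47', '[SEP]', '48', '158', '159', '[SEP]']
print(pair.type_ids)  # 0s for the first segment, 1s for the second, per the $B:1 template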

How can the JSON generated above be used with Transformers?

I cobbled this method together and I am not sure whether there is a more elegant way; I could not find a corresponding word-level tokenizer in Transformers, so I had to rewrite one myself.

First, extract the vocabulary from the JSON and, imitating BERT's vocabulary format, write it out as a vocab.txt file with one token per line:

import json
from collections import OrderedDict
tokenizer_json = json.load(open('vocab.json', 'r', encoding='utf-8'))
vocab = OrderedDict(tokenizer_json['model']['vocab'])
out = open('vocab.txt', 'w', encoding='utf-8')
for k, _ in vocab.items():
    out.write(k + '\n')
out.close()
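This assumes that iterating over the JSON vocabulary yields tokens in ID order. To make sure each line number matches its token ID (which is what BertTokenizer expects), you can sort by ID explicitly; a minimal variant of the same script:

import json

tokenizer_json = json.load(open('vocab.json', 'r', encoding='utf-8'))
vocab = tokenizer_json['model']['vocab']  # token -> id mapping
with open('vocab.txt', 'w', encoding='utf-8') as out:
    # Write tokens sorted by ID so that line number == token ID.
    for token, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
        out.write(token + '\n')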

The resulting vocab.txt looks like this:

[CLS]
[SEP]
[PAD]
[MASK]
12
29
19
23
16
...

Then subclass BertTokenizer and override the _tokenize function:

from transformers import BertTokenizer


class MyTokenizer(BertTokenizer):
    def __init__(self, vocab_file, **kwargs):
        super().__init__(vocab_file=vocab_file, **kwargs,)

    def _tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
            split_tokens.append(token)
        return split_tokens

Load the vocab.txt above

tokenizer = MyTokenizer.from_pretrained('vocab.txt')
sentence = '55 46 47 48 158 159 46 19 160 157111111'
encoded_input = tokenizer.tokenize(sentence)
print(encoded_input)
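From here it behaves like a normal Transformers tokenizer: calling it directly returns input_ids and attention_mask, and anything missing from vocab.txt maps to [UNK]. A minimal sketch (the padding and truncation settings are just illustrative):

encoded = tokenizer(sentence, padding='max_length', max_length=16, truncation=True)
print(encoded['input_ids'])
print(encoded['attention_mask'])
# Tokens not present in vocab.txt fall back to the [UNK] id.
print(tokenizer.convert_tokens_to_ids(['55', 'not-in-vocab']))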