Fine-tune BERT for Chinese

NLP problems turn out to be like image problems: performance in a vertical domain can be improved by fine-tuning. The BERT model itself is so computationally demanding that training it from scratch is unthinkable for most developers. To better fit a vertical-domain corpus while saving resources and avoiding training from scratch, we are motivated to fine-tune.

BERT's own documentation describes fine-tuning in some detail, but it can be difficult for engineers unfamiliar with the official benchmark datasets to get started. With the open-source Bert as Service code, using the word vectors that BERT produces as a byproduct of classification or reading comprehension has become a more practical direction.

This document therefore walks through an example of the process of fine-tuning on a vertical-domain corpus to obtain a fine-tuned model. For the principles behind BERT, or for Bert as Service, please refer to the official documentation.

Dependencies

python == 3.6
tensorflow >= 1.11.0
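
As a quick sanity check of the environment, here is a minimal sketch that verifies the two requirements above:

import sys
import tensorflow as tf

# Check Python 3.6 and TensorFlow >= 1.11.0, as listed above.
assert sys.version_info[:2] == (3, 6), "Python 3.6 expected, got %s" % sys.version.split()[0]

tf_version = tuple(int(x) for x in tf.__version__.split(".")[:2])
assert tf_version >= (1, 11), "TensorFlow >= 1.11.0 expected, got %s" % tf.__version__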

Pre-trained model

  • Download BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Data preparation

  • train.tsv: training set
  • dev.tsv: validation set

The data format

The first column is the label and the second column is the text, separated by a TAB. Because the model processes text at the character level, word segmentation is not required.

fashion	Wear a fashion shirt with it and take 10 years off your age! Live younger! Too beautiful! ...
houseliving	95㎡ simple American small three-bedroom, beautiful and chic, carefree life! The owner's guest...
game	Season-end ranking with them, 7.20 strongest LOL ranked heroes recommended! Guys...

Example data location: data

The data format depends on the business scenario, and you can adjust the data import method in the code later.
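
As a concrete illustration, below is a minimal sketch of how such TSV files could be produced from labeled (label, text) pairs. The file names match those above, but the write_tsv helper and the sample rows are purely illustrative:

import os

def write_tsv(rows, path):
    # One "label<TAB>text" line per example; strip tabs and newlines from the
    # text so every example stays on a single line.
    with open(path, "w", encoding="utf-8") as f:
        for label, text in rows:
            clean = text.replace("\t", " ").replace("\n", " ")
            f.write("%s\t%s\n" % (label, clean))

# Illustrative samples; replace with your own corpus.
samples = [
    ("fashion", "Wear a fashion shirt and take 10 years off your age!"),
    ("houseliving", "95㎡ simple American small three-bedroom, beautiful and chic."),
    ("game", "7.20 strongest LOL ranked heroes recommended!"),
]

os.makedirs("data", exist_ok=True)
write_tsv(samples, os.path.join("data", "train.tsv"))
write_tsv(samples, os.path.join("data", "dev.tsv"))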

Steps

git clone https://github.com/google-research/bert.git
cd bert

BERT fine-tuning has two main application scenarios: classification and reading comprehension. Since samples are easier to obtain for classification, the following uses classification as an example to fine-tune the model:

Modify run_classifier.py

Custom DataProcessor

class DemoProcessor(DataProcessor):
    """Processor for Demo data set."""

    def __init__(self):
        self.labels = set()
    
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        # return list(self.labels)
        return ["fashion", "houseliving", "game"]  # Customize according to your labels


    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization.convert_to_unicode(line[1])
            label = tokenization.convert_to_unicode(line[0])
            self.labels.add(label)
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

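Before wiring the new processor into training, it can be worth sanity-checking it against the example data. A minimal sketch, assuming it is run from the bert/ directory after DemoProcessor has been added to run_classifier.py and that the TSV files sit in a local data/ directory:

from run_classifier import DemoProcessor

processor = DemoProcessor()
train_examples = processor.get_train_examples("data")

print("labels:", processor.get_labels())
print("num train examples:", len(train_examples))
print("first example:", train_examples[0].label, train_examples[0].text_a[:30])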

Add DemoProcessor

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "demo": DemoProcessor,
  }

Start the training

export BERT_Chinese_DIR=/path/to/bert/chinese_L-12_H-768_A-12
export Demo_DIR=/path/to/DemoDate

python run_classifier.py \
  --task_name=demo \
  --do_train=true \
  --do_eval=true \
  --data_dir=$Demo_DIR \
  --vocab_file=$BERT_Chinese_DIR/vocab.txt \
  --bert_config_file=$BERT_Chinese_DIR/bert_config.json \
  --init_checkpoint=$BERT_Chinese_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/Demo_output/

If all goes well, the output will look like this:

***** Eval results *****
  eval_accuracy = xx
  eval_loss = xx
  global_step = xx
  loss = xx

Finally, the fine-tuned model is saved in the folder pointed to by output_dir.
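
To apply the model to new data, run_classifier.py can also be run with --do_predict=true (the DemoProcessor above already implements get_test_examples for a test.tsv); it then writes per-class probabilities to test_results.tsv in output_dir. Below is a minimal sketch for mapping those probabilities back to label names, assuming the probability columns follow the order returned by get_labels():

# Map test_results.tsv rows (tab-separated class probabilities) back to labels.
# Assumes the column order matches DemoProcessor.get_labels().
labels = ["fashion", "houseliving", "game"]

with open("/tmp/Demo_output/test_results.tsv", encoding="utf-8") as f:
    for i, line in enumerate(f):
        probs = [float(p) for p in line.strip().split("\t")]
        best = max(range(len(labels)), key=lambda j: probs[j])
        print("example %d -> %s (p=%.3f)" % (i, labels[best], probs[best]))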

Conclusion

Fine-tuning after BERT pre-training is a very efficient way to save time and improve a model's performance on a vertical-domain corpus. The fine-tuning process itself is not that difficult; the bigger challenges lie in data preparation and pipeline design. From a business perspective, the emphasis should be on validating the fine-tuned model and applying it to the business scenario. If both the metrics and the business scenario are clear, give it a try.

  • Github address: github.com/kuhung/bert…

References

  • Github.com/NLPScott/be…
  • www.jianshu.com/p/aa2eff7ec…