At a recent internal technology-sharing meeting, I found that people were interested in the practical problems of using Bert. The questions mainly focused on Bert's high machine-resource cost: how can you run Bert cheaply, in both money and time? I hope this article helps you avoid some detours while putting Bert into practice.

The whole article is divided into four parts.

The first part, a speed read of the Bert code, points out the easily overlooked but important parts of the code, so that you can get familiar with it and get it running quickly.

The second part summarizes the pitfalls I ran into while deploying Bert as a service.

The third part lists reference materials (also full of practical value).

The fourth part summarizes performance and effect, and gives the lowest-cost path for practicing Bert.

  • First, Bert code speed reading
The code in this part comes from the Google Research GitHub repository: google-research/bert. I will talk about the overlooked but important points in the code, to help you put Bert into practice in a short time. The essential files to master are run_classifier.py, run_squad.py, and modeling.py.

1. Bert code structure

2. The create_model core code in run_classifier.py and run_squad.py.

3. The attention_layer code in modeling.py, including a schematic and a code walkthrough.

Attention Layer schematic

The pre-training stage places high demands on machines and data volume. Fortunately, the authors provide pre-trained models for the major languages English and Chinese, which can be downloaded directly. We therefore focus on how to build and use the fine-tuning models to reach our goals, in particular the two fine-tuning models given in the source code. After reading them, you may be amazed at how simple the fine-tuning models are. With run_classifier.py, fine-tuning on my full sample of 500,000 examples took about half an hour.

As you can see from the code, the fine-tuning models in run_squad.py and run_classifier.py are just a simple fully connected layer on top of the pre-trained model. If you want to target other tasks, such as named entity recognition, you can likewise add a fully connected layer on top of the pre-trained model.
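To make this concrete, here is an abridged sketch of the classification head, modeled on create_model in the repo's run_classifier.py (variable names follow that file; treat it as an illustration rather than an exact copy):

```python
import tensorflow as tf  # TF 1.x, as used by the Bert repo
import modeling          # modeling.py from google-research/bert

def create_model(bert_config, is_training, input_ids, input_mask,
                 segment_ids, labels, num_labels, use_one_hot_embeddings):
  # The pre-trained encoder: all of Bert lives in this call.
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  # [batch_size, hidden_size] vector for the [CLS] token.
  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value

  # The entire fine-tuning "model": one fully connected layer.
  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  if is_training:
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

  logits = tf.nn.bias_add(
      tf.matmul(output_layer, output_weights, transpose_b=True), output_bias)
  log_probs = tf.nn.log_softmax(logits, axis=-1)
  one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
  per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
  loss = tf.reduce_mean(per_example_loss)
  return (loss, per_example_loss, logits)
```

A named entity recognition head would look almost the same, except that it applies the projection to every token's output (model.get_sequence_output()) instead of only the pooled [CLS] vector.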

  • Second, some of the pitfalls I ran into
1. TensorFlow serving-deployment pitfalls

Deploying TensorFlow as a service involves several sets of interfaces that are mutually incompatible for historical version reasons, which is very unfriendly to users. I give one workable interface approach for reference: serving the TensorFlow model externally. The result of running the fine-tuning model is in checkpoint format, which needs to be converted to the .pb (SavedModel) format.
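A minimal export sketch, assuming a TF 1.x Estimator-based fine-tuned classifier; the feature names and sequence length below are assumptions and must match your own input_fn:

```python
import tensorflow as tf

MAX_SEQ_LENGTH = 128  # assumption: must match the value used during fine-tuning

def serving_input_receiver_fn():
  # Placeholders become the request fields that TensorFlow Serving exposes.
  features = {
      "input_ids": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="input_ids"),
      "input_mask": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="input_mask"),
      "segment_ids": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="segment_ids"),
      "label_ids": tf.placeholder(tf.int32, [None], name="label_ids"),
  }
  return tf.estimator.export.ServingInputReceiver(features, features)

# `estimator` is the fine-tuned tf.estimator.Estimator built in run_classifier.py.
# This writes a timestamped SavedModel directory (saved_model.pb + variables/),
# which is what TensorFlow Serving loads.
estimator.export_savedmodel("./exported_model", serving_input_receiver_fn)
```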

2. Changing the TPUEstimator to an ordinary (GPU) Estimator

The source code in the official repository uses the TPUEstimator interface; switching to the ordinary Estimator interface makes it runnable on GPU.

https://www.tensorflow.org/guide/estimators


For example, the following changes to the TPUEstimator usage in run_classifier.py can be applied directly to the code.

(1) Modifying the estimator definition in the main() function

Definition in source code
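As a sketch based on the public run_classifier.py, the estimator is originally constructed like this (model_fn, run_config, and FLAGS are defined earlier in main()):

```python
# Original: TPU-oriented estimator in main() of run_classifier.py
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)
```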

Modified definition
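A possible GPU-friendly replacement is sketched below; passing the batch size through params is my assumption, based on the fact that the input_fn in run_classifier.py reads params["batch_size"]:

```python
# Modified: plain Estimator that runs on CPU/GPU (run_config can be a
# plain tf.estimator.RunConfig pointing at the same model_dir).
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_config,
    params={"batch_size": FLAGS.train_batch_size})  # input_fn reads params["batch_size"]
```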

(2) Modifications inside model_fn(): the TRAIN branch is shown as an example; the EVAL branch can be adapted in the same way.

The source code
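Sketched from the public run_classifier.py, the TRAIN branch of model_fn originally returns a TPUEstimatorSpec:

```python
# Original: TRAIN branch inside model_fn of run_classifier.py
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
    mode=mode,
    loss=total_loss,
    train_op=train_op,
    scaffold_fn=scaffold_fn)
```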

The modified code
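A corresponding sketch of the replacement, dropping the TPU-only scaffold_fn and returning a plain EstimatorSpec:

```python
# Modified: plain EstimatorSpec so the ordinary Estimator can consume it
output_spec = tf.estimator.EstimatorSpec(
    mode=mode,
    loss=total_loss,
    train_op=train_op)
```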

3. Out-of-memory problems

The repository's README.md has a section on out-of-memory problems; if you hit them, read that section first. Adjust max_seq_length and train_batch_size to fit the amount of memory your GPU has. The GPU machine I used has 28GB of memory, which is basically enough to run the fine-tuning model (train_batch_size=64, max_seq_length=128).

  • Third, reference materials
1. Bert as Service
hanxiao/bert-as-service



Graph optimization method

bert-as-service is a relatively complete servitized deployment of the Bert pre-trained model, which can act as a basic NLP service. The source code has two highlights. First, it provides a graph optimization method that improves efficiency and reduces GPU memory consumption: freezing converts tf.Variable nodes into tf.constant, pruning removes nodes that are only needed during training, and quantization lowers numeric precision, for example changing int64 to int32. Second, it uses ZeroMQ to implement asynchronous concurrent requests, giving a software architecture for deploying Bert as a service.
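For reference, a minimal client-side usage sketch of bert-as-service; it assumes you have installed the bert-serving-server and bert-serving-client packages and started the server against a downloaded pre-trained model (the model path and worker count are placeholders):

```python
# Server side (run separately in a shell), e.g.:
#   bert-serving-start -model_dir /path/to/chinese_L-12_H-768_A-12 -num_worker 2
from bert_serving.client import BertClient

bc = BertClient()                       # defaults to localhost, ports 5555/5556
vectors = bc.encode(["How do I fine-tune Bert cheaply?",
                     "What hardware does Bert need?"])
print(vectors.shape)                    # (2, 768) for a base-size model
```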

2. For Bert's underlying principles, see my other blog post, Octopus Maruko-chan: A Detailed Interpretation of the NLP Breakthrough Bert Model.

  • Fourth, performance, effect, and the lowest-cost path
As for Bert's effect, I have not done a quantitative analysis, but from my own evaluation its generalization ability on public data sets is clearly better than QANet and other question-answering models that use word-vector pre-training.

As for Bert's performance, I stress-tested the service. For my application scenario I set max_seq_length=30, and the average latency was about 400 ms, which meets the QPS requirements of typical applications. If the product requires faster computation, switch to distributed GPU computing.

The lowest-cost path to practicing Bert:

Step 1: Find a machine that meets the GPU memory requirement (around 28GB in my case, though it varies slightly).

Step 2: Set up a fine-tuning task for which you can obtain a data set, such as classification, question answering, or entity labeling.

Step 3: Modify the fine-tuning code, run it, and verify the effect.