This article shows how to use Keras-Bert to implement a multi-class text classification task, with BERT being fine-tuned.
The project structure
(Figure: project structure of the Keras-bert text multi-class classification project)
The required third-party Python modules are as follows:
pandas==0.23.4
Keras==2.3.1
keras_bert==0.83.0
numpy==1.16.4
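If any of these are missing, they can be installed with pip (the PyPI distribution of keras_bert is keras-bert). A minimal sketch for checking the installed versions against the list above:

import pkg_resources

# Print the installed version of each required distribution.
for name in ("pandas", "Keras", "keras-bert", "numpy"):
    print(name, pkg_resources.get_distribution(name).version)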
The datasets
The multi-class datasets used in this article are the Sougou small classification dataset and the THUCNews dataset, introduced below:
- Sougou small classification dataset
It has five categories: sports, health, military, education, and automobile. The data are split into a training set and a test set, with 800 training samples and 100 test samples per category.
- THUCNews data set
It has ten categories: sports, finance, real estate, home furnishing, education, technology, fashion, politics, games, and entertainment. The data are split into a training set of 5000 samples per category (5000 * 10) and a test set of 1000 samples per category (1000 * 10).
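The scripts below read each dataset as CSV files (for THUCNews, data/cnews/cnews_train.csv and data/cnews/cnews_test.csv) with a label column followed by a text column; the name of the second column is not fixed by the code, so "content" here is only an assumption. A minimal sketch of producing such a file with pandas from made-up rows:

import pandas as pd

# Hypothetical example rows; the real files hold the full datasets described above.
rows = [
    {"label": "sports", "content": "The Lakers beat the Mavericks 138-115 on Christmas Day..."},
    {"label": "education", "content": "The revision of the curriculum standards will be completed next year..."},
]
pd.DataFrame(rows, columns=["label", "content"]).to_csv("data/cnews/cnews_train.csv", index=False)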
Model training
The complete code for the model training script model_train.py is as follows:
# -*- coding: utf-8 -*-
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam

# Suggested length <= 510
maxlen = 300
BATCH_SIZE = 8
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'

token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)


class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')   # the remaining characters are treated as [UNK]
        return R


tokenizer = OurTokenizer(token_dict)


def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])


class DataGenerator:
    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []


# Build the classification model
def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))

    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)    # take the vector corresponding to [CLS] for classification
    p = Dense(num_labels, activation='softmax')(cls_layer)     # multi-class output

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),   # use a small enough learning rate
        metrics=['accuracy']
    )
    # model.summary()

    return model


if __name__ == '__main__':

    # Data processing: read the training set and test set
    print("begin data processing...")
    train_df = pd.read_csv("data/cnews/cnews_train.csv").fillna(value="")
    test_df = pd.read_csv("data/cnews/cnews_test.csv").fillna(value="")

    labels = train_df["label"].unique()
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    # Convert each label into a one-hot vector
    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        train_data.append((content, label_id))

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            if _ == label:
                label_id[j] = 1
        test_data.append((content, label_id))

    print("finish data processing!")

    # Model training
    model = create_cls_model(len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)

    print("begin model training...")
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=3,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )
    print("finish model training!")

    # Save the model
    model.save('cls_cnews.h5')
    print("Model saved!")

    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("Model evaluation result:", result)
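During data processing the training script also writes label.json, a mapping from class index to label name that the evaluation and prediction scripts reuse. A quick way to inspect it after a run (the exact contents depend on the order in which labels appear in the training CSV):

import json

# Expect something like {"0": "sports", "1": "entertainment", ...} for THUCNews.
with open("label.json", "r", encoding="utf-8") as f:
    print(json.load(f))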
The model structure is as follows:
____________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
====================================================================================================
input_1 (InputLayer)            (None, None)         0
____________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0
____________________________________________________________________________________________________
model_2 (Model)                 (None, None, 768)    101677056   input_1[0][0]
                                                                 input_2[0][0]
____________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 768)          0           model_2[1][0]
____________________________________________________________________________________________________
dense_1 (Dense)                 (None, 10)           7690        lambda_1[0][0]
====================================================================================================
Total params: 101,684,746
Trainable params: 101,684,746
Non-trainable params: 0
In the model above, we take the output vector corresponding to [CLS] and feed it into a fully connected layer with a softmax activation. This completes the multi-class classification model, and it is simple and convenient to build.
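As a quick illustration of what that Lambda layer does, here is a toy NumPy check (random numbers stand in for real BERT outputs; only the shapes matter):

import numpy as np

# BERT's sequence output has shape (batch, seq_len, hidden).
# Taking x[:, 0] keeps the vector at position 0, i.e. the [CLS] token,
# so the classifier sees one 768-dimensional vector per sample.
sequence_output = np.random.rand(2, 5, 768)   # toy batch: 2 texts, 5 tokens each
cls_vectors = sequence_output[:, 0]
print(cls_vectors.shape)                      # (2, 768)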
Model evaluation
The complete code for the model evaluation script model_evaluate.py is as follows:
# -*- coding: utf-8 -*-
# Model evaluation script
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import classification_report

from model_train import token_dict, OurTokenizer

maxlen = 300

# Load the trained model
model = load_model("cls_cnews.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())


# Predict a single text
def predict_single_text(text):
    # Tokenize with BERT and pad to maxlen
    text = text[:maxlen]
    x1, x2 = tokenizer.encode(first=text)
    X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2

    # Run the model and return the predicted label
    predicted = model.predict([[X1], [X2]])
    y = np.argmax(predicted[0])
    return label_dict[str(y)]


# Evaluate the model on the test set
def evaluate():
    test_df = pd.read_csv("data/cnews/cnews_test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i+1))
        true_y, content = test_df.iloc[i, :]
        pred_y = predict_single_text(content)
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)

    return classification_report(true_y_list, pred_y_list, digits=4)
output_data = evaluate()
print("model evaluate result:\n")
print(output_data)
Run the code above to evaluate the two datasets, and the results are as follows:
- Sougou data set
Model parameters: BATCH_SIZE = 8, maxlen = 256, epochs = 10
Evaluation Results:
              precision    recall  f1-score   support

      sports     0.9802    1.0000    0.9900        99
      health     0.9495    0.9495    0.9495        99
    military     1.0000    1.0000    1.0000        99
   education     0.9307    0.9495    0.9400        99
  automobile     0.9895    0.9495    0.9691        99

    accuracy                         0.9697       495
   macro avg     0.9700    0.9697    0.9697       495
weighted avg     0.9700    0.9697    0.9697       495
- THUCNews data set
Model parameters: BATCH_SIZE = 8, maxlen = 300, epochs = 3
Evaluation Results:
                 precision    recall  f1-score   support

         sports     0.9970    0.9990    0.9980      1000
  entertainment     0.9890    0.9890    0.9890      1000
home furnishing     0.9949    0.7820    0.8757      1000
    real estate     0.8006    0.8710    0.8343      1000
      education     0.9753    0.9480    0.9615      1000
        fashion     0.9708    0.9980    0.9842      1000
       politics     0.9318    0.9560    0.9437      1000
          games     0.9851    0.9950    0.9900      1000
     technology     0.9689    0.9970    0.9828      1000
        finance     0.9377    0.9930    0.9645      1000

       accuracy                         0.9528     10000
      macro avg     0.9551    0.9528    0.9524     10000
   weighted avg     0.9551    0.9528    0.9524     10000
Model prediction
The complete code for the model prediction script model_predictor.py is as follows:
# -*- coding: utf-8 -*-
# @Time : 2020/12/23 15:28
# @Author : Jclian91
# @File : model_predictor.py
# @Place : Yangpu, Shanghai
import time
import json
import numpy as np

from model_train import token_dict, OurTokenizer
from keras.models import load_model
from keras_bert import get_custom_objects

maxlen = 256

# Load the trained model
model = load_model("cls_sougou.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())

s_time = time.time()

# Text to predict
text = "When it comes to hard-core off-road SUVs, what models come to mind? The Toyota Prado, known as the 'overlord'; the Pajero, " \
       "nicknamed the 'mountain cat'; the Mercedes G-Class; or the 'Prince of the Desert'? In short, as countries around the world " \
       "pay more and more attention to environmental protection, those big-engine off-road SUVs will gradually disappear in the near " \
       "future, so rather than miss them, why not get your hands on one of the hard-core SUVs you want while you are still young? " \
       "Today I want to talk about the world's top 10 hard-core SUVs; off-road fans might want to think about which one is their " \
       "favorite. Without further ado, let's get started."

# Tokenize with BERT and pad to maxlen
text = text[:maxlen]
x1, x2 = tokenizer.encode(first=text)
X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2

# Run the model and output the predicted label
predicted = model.predict([[X1], [X2]])
y = np.argmax(predicted[0])
print("Original: %s" % text)
print("Prediction label: %s" % label_dict[str(y)])
e_time = time.time()
print("cost time:", e_time-s_time)
We now use the trained models to make predictions on new samples; a small helper for labeling several texts at once is sketched after the examples.
- Sougou data set
Original text: When it comes to hard-core off-road SUVs, what models come to mind? The Toyota Prado, known as the "overlord"; the Pajero, nicknamed the "mountain cat"; the Mercedes G-Class; or the "Prince of the Desert"? In short, as countries around the world pay more and more attention to environmental protection, those large-displacement off-road SUVs will gradually disappear from our sight in the near future, so rather than miss them, why not get your hands on a hard-core off-road SUV you can admire while you are still young? Today I would like to talk about the world's top 10 hard-core SUVs; after reading, you might want to think about which one is your favorite. Let's get started.
Prediction label: Car

Original text: According to the Eielson Air Force Base website, the U.S. Air Force recently staged an "elephant walk" off the coast of Alaska: 30 fighter jets and 2 tankers took off from Eielson Air Force Base and completed the exercise.

Original text: During the 13th Five-Year Plan period (2016-2020), unified textbooks for compulsory education in China have covered all grades, and unified textbooks for three subjects in ordinary senior high schools already cover 20 provinces; full coverage of all provinces is expected by 2022 and of all grades by 2025. The revision of the compulsory education curriculum plan and curriculum standards will be completed next year, Tian Huisheng, director of the Ministry of Education's textbook bureau, said at a press conference yesterday.
Prediction label: Education
- THUCNews data set
Original text: On December 26, Beijing time, the 2020-21 NBA Christmas games tipped off as scheduled. The Los Angeles Lakers faced the Dallas Mavericks at home. Going all out, the Lakers beat the Mavericks 138-115, picking up their first win of the season and handing their opponent a second straight loss.
Prediction label: Sports

Original text: In the past two years, phone screens have been upgraded constantly and high refresh rates have become a trend. However good the performance, it cannot be shown smoothly without a good screen; the screen largely determines how a phone feels to use and is the window of a phone's hardware. It is foreseeable that future high-end phones, while pushing performance, will also have to raise the bar for the screen.

Original text: The Sheshan area of Songjiang has gone too long without a new luxury housing launch, leaving many high-end buyers in the region waiting. Fortunately, the recently launched Sheshan Original Villa project has satisfied this demand at once. Whether in community planning, product type, decoration and configuration, or design, it holds its own against the surrounding villas costing tens of millions; even more surprisingly, these villas cost only the price of an apartment.
Prediction label: Real estate
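To label several new texts in one go, the prediction logic above can be wrapped in a small helper. This is only a sketch: it assumes model, tokenizer, label_dict and maxlen have already been loaded exactly as at the top of model_predictor.py.

import numpy as np

# Assumes model, tokenizer, label_dict and maxlen are loaded as in model_predictor.py.
def predict_texts(texts):
    results = []
    for text in texts:
        text = text[:maxlen]
        x1, x2 = tokenizer.encode(first=text)
        X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
        X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2
        y = np.argmax(model.predict([[X1], [X2]])[0])
        results.append(label_dict[str(y)])
    return results

print(predict_texts(["First sample news text...", "Second sample news text..."]))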
Conclusion
The code for this project is available on Github: github.com/percent4/ke… .
In a later article, we will see how to use Keras-Bert for multi-label text classification.