The main reference of this article is:
Blog.csdn.net/asialee_bir…
Basic version of CNN
def get_model():
    K.clear_session()
    model = Sequential()
    model.add(Embedding(len(vocab) + 1, 300, input_length=50))  # map each word index to a 300-dim word vector with the Embedding layer
    model.add(Conv1D(256, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(128, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(64, 3, padding='same'))
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(BatchNormalization())  # (batch) normalization layer
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
Simple version of TextCNN
def get_model():
    K.clear_session()
    main_input = Input(shape=(50,), dtype='float64')
    # Word embedding (intended for pre-trained word vectors; see the sketch after this block for actually loading them)
    embedder = Embedding(len(vocab) + 1, 300, input_length=50, trainable=False)
    embed = embedder(main_input)
    # Convolution branches with word-window sizes 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=48)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=47)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=46)(cnn3)
    # Concatenate the output vectors of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.2)(flat)
    main_output = Dense(3, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
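The comment above mentions pre-trained word vectors, but as written the Embedding layer is only frozen with random weights. A minimal sketch of how pre-trained vectors could be plugged in, assuming a hypothetical word_vectors mapping (for example loaded from a gensim KeyedVectors file) from token to 300-dimensional vector:

import numpy as np
from keras.layers import Embedding

# Hypothetical: word_vectors maps each token to a 300-dim vector (e.g. from gensim KeyedVectors).
embedding_dim = 300
embedding_matrix = np.zeros((len(vocab) + 1, embedding_dim))  # row 0 stays all-zero for padding
for word, index in vocab.items():  # vocab = tokenizer.word_index
    if word in word_vectors:
        embedding_matrix[index] = word_vectors[word]

# Pass the matrix as the initial weights and keep the layer frozen.
embedder = Embedding(len(vocab) + 1, embedding_dim, input_length=50,
                     weights=[embedding_matrix], trainable=False)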
Appendix
All the source code
Import packages
import os
import random
from joblib import load, dump
from sklearn.model_selection import train_test_split
import pandas as pd
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dropout, BatchNormalization, Dense, Input, concatenate
from keras import backend as K
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
Build a text iterator
def get_text_label_iterator(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            line_split = line.strip().split('\t')
            if len(line_split) != 2:
                print(line)
                continue
            yield line_split[0], line_split[1]

it = get_text_label_iterator(r"data/keras_bert_train.txt")
next(it)
('Japan and America fight for the title or meet with death. The women’s World Cup final and the Copa America quarterfinals will no doubt be the focus of attention for football fans and punters around the world on Sunday. Can Japan, the biggest surprise at the Women’s World Cup, pull off an Asian miracle? Can the United States, the dominant team in women’s soccer, pull off another triple crown? Brazil and Paraguay have narrow rivals. Who will win? Much will be revealed in the wee hours of Monday morning. Japan and America are fighting for the crown. This women’s World Cup is about subversion and counter-subversion. Host favourites Germany were beaten by Japan in extra time in the quarter-finals, while fellow favourites Sweden were thrashed 3-1 by Japan in the semi-finals. The United States maintained the dignity of the women’s soccer powerhouse, beating Brazil 5-3 in a penalty shootout in the quarterfinals and beating France 3-1 in the semifinals. The U.S. and Japan came into the tournament in strikingly similar fashion, winning the first two sets of the group, losing the final round, drawing in 90 minutes in the quarterfinals, and beating each other 3-1 in the semifinals. The final, whether Japan or the United States wins, will make new history in the Women’s World Cup. When two men meet, they will die. There were plenty of surprises at this Copa America. The narrow path between Brazil and Paraguay seems more legendary. The two teams were drawn in Group B, but both of them had drawn in the first two rounds of the group stage. Brazil came from behind to beat Ecuador 4-2 in the second half to top the group, while Paraguay drew 3-3 with Venezuela to finish third, edging out third-placed Costa Rica on goal difference for a place in the last eight. Brazil had to draw Paraguay in the group stage in the last minute. Will their luck repeat in the knockout rounds? Paraguay seemed to lack luck in their previous three group games. Could that be compensated for this time? In the other Copa America quarterfinal, Chile topped Group C with 2 wins and 1 draw. Venezuela, the least favored team in Group B, clinched a place in the group with Brazil and Paraguay in the first two rounds. They are unbeaten in the group with three games, one win and two draws, and scored the same four goals as Chile, but conceded one more than Chile. But since they were able to keep a clean sheet against the mighty Brazil, it would be no surprise to see another success.',
 'lottery')
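The iterator assumes each line of data/keras_bert_train.txt holds one sample in the form text<TAB>label. A minimal, purely illustrative sketch of writing a file in that format (the file name and contents here are hypothetical):

# Hypothetical example of the expected input format: one sample per line, "text\tlabel".
sample_lines = [
    "Japan and America fight for the title ...\tlottery",
    "New game console announced ...\tGame",
]
with open("data/sample_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")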
Build the vocabulary (vocab)
def get_segment_iterator(data_path):
    data_iter = get_text_label_iterator(data_path)
    for text, label in data_iter:
        yield list(jieba.cut(text)), label

it = get_segment_iterator(r"data/keras_bert_train.txt")
# next(it)

def get_only_segment_iterator(data_path):
    segment_iter = get_segment_iterator(data_path)
    for segment, label in tqdm(segment_iter):
        yield segment

# tokenizer = Tokenizer()
# fit_on_texts numbers every word in the input texts by frequency: the more frequent the word, the smaller its index
# tokenizer.fit_on_texts(get_only_segment_iterator(r"data/keras_bert_train.txt"))
# dump(tokenizer, r"data/keras_textcnn_tokenizer.bin")
tokenizer = load(r"data/keras_textcnn_tokenizer.bin")
vocab = tokenizer.word_index  # word -> index mapping
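As a small illustration (the words here are made up for the example), fit_on_texts assigns smaller indices to more frequent tokens, indices start at 1, and index 0 is reserved for padding, which is why the Embedding layers above use len(vocab) + 1 rows:

from keras.preprocessing.text import Tokenizer

demo_tokenizer = Tokenizer()
demo_tokenizer.fit_on_texts([["我", "喜欢", "足球"], ["我", "喜欢", "游戏"]])
print(demo_tokenizer.word_index)
# e.g. {'我': 1, '喜欢': 2, '足球': 3, '游戏': 4}
print(demo_tokenizer.texts_to_sequences([["我", "喜欢", "篮球"]]))
# e.g. [[1, 2]] -- unseen words are dropped unless oov_token is set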
Get the number of samples
def get_sample_count(data_path):
    data_iter = get_text_label_iterator(data_path)
    count = 0
    for text, label in tqdm(data_iter):
        count += 1
    return count

train_sample_count = get_sample_count(r"data/keras_bert_train.txt")
dev_sample_count = get_sample_count(r"data/keras_bert_dev.txt")
Build the label table
def read_category(data_path):
    """Read the category directories; the order is fixed."""
    categories = os.listdir(data_path)
    cat_to_id = dict(zip(categories, range(len(categories))))
    return categories, cat_to_id
categories, cat_to_id = read_category("000_text_classifier_tensorflow_textcnn/THUCNews")
cat_to_id
{'lottery': 0,
 'Home': 1,
 'Game': 2,
 'Stock': 3,
 'Technology': 4,
 'Society': 5,
 'Finance and Economics': 6,
 'Fashion': 7,
 'Constellation': 8,
 'Sports': 9,
 'Real Estate': 10,
 'Entertainment': 11,
 'Current Politics': 12,
 'Education': 13}
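For turning predicted class indices back into category names later on, the inverse mapping can be built directly from cat_to_id; a minimal sketch:

# Inverse mapping: class index -> category name.
id_to_cat = {index: cat for cat, index in cat_to_id.items()}
print(id_to_cat[0])  # e.g. 'lottery'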
Build the input data iterator
def get_data_iterator(data_path):
    while True:
        segment_iter = get_segment_iterator(data_path)
        for segment, label in segment_iter:
            word_ids = tokenizer.texts_to_sequences([segment])
            # Truncate sequences longer than 50 and zero-pad shorter ones at the front
            # (see the small pad_sequences example below)
            padded_seqs = pad_sequences(word_ids, maxlen=50)[0]
            yield padded_seqs, cat_to_id[label]

it = get_data_iterator(r"data/keras_bert_train.txt")
next(it)
Building prefix dict from the default dictionary …
Loading model from cache /tmp/jieba.cache
Loading model cost 1.039 seconds.
Prefix dict has been built succesfully.
(array([    69,   2160,     57,   3010,     55,    828,     68,   1028,
           456,   3712,   2130,      1,     36, 116604,    361,   7019,
           377,     26,      8,     76,    539,      1,    346,   7323,
         89885,   7019,     73,      7,     55,     84,      3,     33,
          3199,     69,    579,   1366,      2,   1526,     26,     89,
           456,   5741,   8256,      1,   6163,   7253,  10831,     14,
         77404,      3], dtype=int32),
 0)
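A quick illustration of the pad_sequences behaviour relied on above: by default both padding and truncating happen at the front ('pre'):

from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2, 3]], maxlen=5))
# [[0 0 1 2 3]]  -- zero-padded at the front
print(pad_sequences([[1, 2, 3, 4, 5, 6, 7]], maxlen=5))
# [[3 4 5 6 7]]  -- truncated at the front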
def get_batch_data_iterator(data_path, batch_size=64, shuffle=True):
    data_iter = get_data_iterator(data_path)
    while True:
        data_list = []
        for _ in range(batch_size):
            data = next(data_iter)
            data_list.append(data)
        if shuffle:
            random.shuffle(data_list)
        pad_sequences_list = []
        label_index_list = []
        for data in data_list:
            padded_seq, label_index = data  # each item is (padded sequence, label index)
            pad_sequences_list.append(padded_seq.tolist())
            label_index_list.append(label_index)
        yield np.array(pad_sequences_list), np.array(label_index_list)

it = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size=1)
next(it)
(array([[    69,   2160,     57,   3010,     55,    828,     68,   1028,
            456,   3712,   2130,      1,     36, 116604,    361,   7019,
            377,     26,      8,     76,    539,      1,    346,   7323,
          89885,   7019,     73,      7,     55,     84,      3,     33,
           3199,     69,    579,   1366,      2,   1526,     26,     89,
            456,   5741,   8256,      1,   6163,   7253,  10831,     14,
          77404,      3]]),
 array([0]))
it = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size=2)
next(it)
(array([[     5,   5013,  14313,    601,  15377,  23499,     13,    493,
           1541,    247,      5,  35557,  21529,  15377,      5,   1764,
             11,   2774,  15377,      5,    279,   1764,    430,      5,
           4742,  36921,  24090,   6387,  23499,     13,   5013,   8319,
           6387,      5,   2370,   1764,   6387,      5,  16122,   1764,
           6387,      5,  14313,   3707,   6387,      5,     11,   2774,
            247,   6387],
         [   69,   2160,     57,   3010,     55,    828,     68,   1028,
            456,   3712,   2130,      1,     36, 116604,    361,   7019,
            377,     26,      8,     76,    539,      1,    346,   7323,
          89885,   7019,     73,      7,     55,     84,      3,     33,
           3199,     69,    579,   1366,      2,   1526,     26,     89,
            456,   5741,   8256,      1,   6163,   7253,  10831,     14,
          77404,      3]]),
 array([0, 0]))
Define basic CNN
def get_model():
    K.clear_session()
    model = Sequential()
    model.add(Embedding(len(vocab) + 1, 300, input_length=50))  # map each word index to a 300-dim word vector with the Embedding layer
    model.add(Conv1D(256, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(128, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(64, 3, padding='same'))
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(BatchNormalization())  # (batch) normalization layer
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(3, activation='softmax'))  # 3 output classes; for the full THUCNews label set this would be len(cat_to_id)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
early_stopping = EarlyStopping(monitor='val_acc', patience=3)  # early stopping to prevent overfitting
plateau = ReduceLROnPlateau(monitor="val_acc", verbose=1, mode='max', factor=0.5, patience=2)  # reduce the learning rate when the monitored metric stops improving
# checkpoint = ModelCheckpoint('trained_model/keras_bert_THUCNews.hdf5', monitor='val_acc', verbose=2, save_best_only=True, mode='max', save_weights_only=True)

def get_step(sample_count, batch_size):
    step = sample_count // batch_size
    if sample_count % batch_size != 0:
        step += 1
    return step
batch_size = 8
train_step = get_step(train_sample_count, batch_size)
dev_step = get_step(dev_sample_count, batch_size)

train_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size)
dev_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_dev.txt", batch_size)

model = get_model()

# Model training
model.fit(
    train_dataset_iterator,
    steps_per_epoch=train_step,
    epochs=10,
    validation_data=dev_dataset_iterator,
    validation_steps=dev_step,
    callbacks=[early_stopping, plateau],
    verbose=1
)
Model: “sequential”
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 50, 300) 454574700
conv1d (Conv1D) (None, 50, 256) 384256
max_pooling1d (MaxPooling1D) (None, 17, 256) 0
conv1d_1 (Conv1D) (None, 17, 128) 163968
max_pooling1d_1 (MaxPooling1 (None, 6, 128) 0
conv1d_2 (Conv1D) (None, 6, 64) 24640
flatten (Flatten) (None, 384) 0
dropout (Dropout) (None, 384) 0
batch_normalization (BatchNo (None, 384) 1536
dense (Dense) (None, 256) 98560
dropout_1 (Dropout) (None, 256) 0
dense_1 (Dense) (None, 3) 771
=================================================================
Total params: 455248431
Trainable params: 455247663
Non-trainable params: 768
None
Epoch 1/10
1/83608 [..............................] - ETA: 3:28 - loss: 1.1427 - accuracy: 0.3750
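Once training finishes, predictions come out as per-class probability vectors; a minimal sketch (reusing the dev iterator and the categories list defined above) of decoding them back to category names:

# Take one dev batch and decode the predicted class indices into category names.
batch_x, batch_y = next(dev_dataset_iterator)
probs = model.predict(batch_x)            # shape: (batch_size, number of output classes)
pred_ids = np.argmax(probs, axis=-1)
print([categories[i] for i in pred_ids])  # predicted categories
print([categories[i] for i in batch_y])   # true categories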
Define a simple version of TextCNN
def get_model():
    K.clear_session()
    main_input = Input(shape=(50,), dtype='float64')
    # Word embedding (using pre-trained word vectors)
    embedder = Embedding(len(vocab) + 1, 300, input_length=50, trainable=False)
    embed = embedder(main_input)
    # Convolution branches with word-window sizes 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=48)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=47)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=46)(cnn3)
    # Concatenate the output vectors of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.2)(flat)
    main_output = Dense(3, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
batch_size = 8
train_step = get_step(train_sample_count, batch_size)
dev_step = get_step(dev_sample_count, batch_size)

train_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size)
dev_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_dev.txt", batch_size)

model = get_model()

# Model training
model.fit(
    train_dataset_iterator,
    steps_per_epoch=train_step,
    epochs=10,
    validation_data=dev_dataset_iterator,
    validation_steps=dev_step,
    callbacks=[early_stopping, plateau],
    verbose=1
)
Model: “functional_1”
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 50)] 0
embedding (Embedding) (None, 50, 300) 454574700 input_1[0][0]
conv1d (Conv1D) (None, 50, 256) 230656 embedding[0][0]
conv1d_1 (Conv1D) (None, 50, 256) 307456 embedding[0][0]
conv1d_2 (Conv1D) (None, 50, 256) 384256 embedding[0][0]
max_pooling1d (MaxPooling1D) (None, 1, 256) 0 conv1d[0][0]
max_pooling1d_1 (MaxPooling1D) (None, 1, 256) 0 conv1d_1[0][0]
max_pooling1d_2 (MaxPooling1D) (None, 1, 256) 0 conv1d_2[0][0]
concatenate (Concatenate) (None, 1, 768) 0 max_pooling1d[0][0]
max_pooling1d_1[0][0]
max_pooling1d_2[0][0]
flatten (Flatten) (None, 768) 0 concatenate[0][0]
dropout (Dropout) (None, 768) 0 flatten[0][0]
dense (Dense) (None, 3) 2307 dropout[0][0]
==================================================================================================
Total params: 455499375
Trainable params: 924675
Non-trainable params: 454574700
None
Epoch 1/10
238/83608 [..............................] - ETA: 2:31:07 - loss: 0.0308 - accuracy: 0.9979
![file](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/25ba63f289764f8e9f45a308f80c8fc0~tplv-k3u1fbpfcp-zoom-1.image)
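The callbacks cell above contains a commented-out ModelCheckpoint; a minimal sketch of wiring it in so the best weights (by the same 'val_acc' metric monitored elsewhere in this code) are saved during training:

from keras.callbacks import ModelCheckpoint

# Save the best weights (by validation accuracy) while training.
checkpoint = ModelCheckpoint('trained_model/keras_bert_THUCNews.hdf5',
                             monitor='val_acc', verbose=2,
                             save_best_only=True, mode='max',
                             save_weights_only=True)

# e.g. model.fit(..., callbacks=[early_stopping, plateau, checkpoint], ...)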