The main reference of this article is:
Blog.csdn.net/asialee_bir…
Basic version of CNN
def get_model():
    K.clear_session()
    model = Sequential()
    model.add(Embedding(len(vocab) + 1, 300, input_length=50))  # map each word index to a 300-dim word vector with the Embedding layer
    model.add(Conv1D(256, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(128, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(64, 3, padding='same'))
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(BatchNormalization())  # (batch) normalization layer
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
Simple version of TextCNN
def get_model():
    K.clear_session()
    main_input = Input(shape=(50,), dtype='float64')
    # Word embedding (intended for pre-trained word vectors; see the sketch after this block for actually loading them)
    embedder = Embedding(len(vocab) + 1, 300, input_length=50, trainable=False)
    embed = embedder(main_input)
    # Convolution branches with word-window sizes 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=48)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=47)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=46)(cnn3)
    # Concatenate the output vectors of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.2)(flat)
    main_output = Dense(3, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
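The comment above mentions pre-trained word vectors, but as written the Embedding layer is only frozen with random weights. A minimal sketch of how pre-trained vectors could be plugged in, assuming a hypothetical word_vectors mapping (for example loaded from a gensim KeyedVectors file) from token to 300-dimensional vector:

import numpy as np
from keras.layers import Embedding

# Hypothetical: word_vectors maps each token to a 300-dim vector (e.g. from gensim KeyedVectors).
embedding_dim = 300
embedding_matrix = np.zeros((len(vocab) + 1, embedding_dim))  # row 0 stays all-zero for padding
for word, index in vocab.items():  # vocab = tokenizer.word_index
    if word in word_vectors:
        embedding_matrix[index] = word_vectors[word]

# Pass the matrix as the initial weights and keep the layer frozen.
embedder = Embedding(len(vocab) + 1, embedding_dim, input_length=50,
                     weights=[embedding_matrix], trainable=False)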
Appendix
All the source code
Import packages
import os
import random
from joblib import load, dump
from sklearn.model_selection import train_test_split
import pandas as pd
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dropout, BatchNormalization, Dense, Input, concatenate
from keras import backend as K
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
Build a text iterator
def get_text_label_iterator(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            line_split = line.strip().split('\t')
            if len(line_split) != 2:
                print(line)
                continue
            yield line_split[0], line_split[1]

it = get_text_label_iterator(r"data/keras_bert_train.txt")
next(it)
('Japan and America fight for the title or meet with death. The women’s World Cup final and the Copa America quarterfinals will no doubt be the focus of attention for football fans and punters around the world on Sunday. Can Japan, the biggest surprise at the Women’s World Cup, pull off an Asian miracle? Can the United States, the dominant team in women’s soccer, pull off another triple crown? Brazil and Paraguay have narrow rivals. Who will win? Much will be revealed in the wee hours of Monday morning. Japan and America are fighting for the crown. This women’s World Cup is about subversion and counter-subversion. Host favourites Germany were beaten by Japan in extra time in the quarter-finals, while fellow favourites Sweden were thrashed 3-1 by Japan in the semi-finals. The United States maintained the dignity of the women’s soccer powerhouse, beating Brazil 5-3 in a penalty shootout in the quarterfinals and beating France 3-1 in the semifinals. The U.S. and Japan came into the tournament in strikingly similar fashion, winning the first two sets of the group, losing the final round, drawing in 90 minutes in the quarterfinals, and beating each other 3-1 in the semifinals. The final, whether Japan or the United States wins, will make new history in the Women’s World Cup. When two men meet, they will die. There were plenty of surprises at this Copa America. The narrow path between Brazil and Paraguay seems more legendary. The two teams were drawn in Group B, but both of them had drawn in the first two rounds of the group stage. Brazil came from behind to beat Ecuador 4-2 in the second half to top the group, while Paraguay drew 3-3 with Venezuela to finish third, edging out third-placed Costa Rica on goal difference for a place in the last eight. Brazil had to draw Paraguay in the group stage in the last minute. Will their luck repeat in the knockout rounds? Paraguay seemed to lack luck in their previous three group games. Could that be compensated for this time? In the other Copa America quarterfinal, Chile topped Group C with 2 wins and 1 draw. Venezuela, the least favored team in Group B, clinched a place in the group with Brazil and Paraguay in the first two rounds. They are unbeaten in the group with three games, one win and two draws, and scored the same four goals as Chile, but conceded one more than Chile. But since they were able to keep a clean sheet against the mighty Brazil, it would be no surprise to see another success.',
 'lottery')
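The iterator assumes each line of data/keras_bert_train.txt holds one sample in the form text<TAB>label. A minimal, purely illustrative sketch of writing a file in that format (the file name and contents here are hypothetical):

# Hypothetical example of the expected input format: one sample per line, "text\tlabel".
sample_lines = [
    "Japan and America fight for the title ...\tlottery",
    "New game console announced ...\tGame",
]
with open("data/sample_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")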
Build the vocabulary (vocab)
def get_segment_iterator(data_path):
    data_iter = get_text_label_iterator(data_path)
    for text, label in data_iter:
        yield list(jieba.cut(text)), label

it = get_segment_iterator(r"data/keras_bert_train.txt")
# next(it)

def get_only_segment_iterator(data_path):
    segment_iter = get_segment_iterator(data_path)
    for segment, label in tqdm(segment_iter):
        yield segment

# tokenizer = Tokenizer()
# fit_on_texts numbers every word in the input texts by frequency: the more frequent the word, the smaller its index
# tokenizer.fit_on_texts(get_only_segment_iterator(r"data/keras_bert_train.txt"))
# dump(tokenizer, r"data/keras_textcnn_tokenizer.bin")
tokenizer = load(r"data/keras_textcnn_tokenizer.bin")
vocab = tokenizer.word_index  # word -> index mapping
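As a small illustration (the words here are made up for the example), fit_on_texts assigns smaller indices to more frequent tokens, indices start at 1, and index 0 is reserved for padding, which is why the Embedding layers above use len(vocab) + 1 rows:

from keras.preprocessing.text import Tokenizer

demo_tokenizer = Tokenizer()
demo_tokenizer.fit_on_texts([["我", "喜欢", "足球"], ["我", "喜欢", "游戏"]])
print(demo_tokenizer.word_index)
# e.g. {'我': 1, '喜欢': 2, '足球': 3, '游戏': 4}
print(demo_tokenizer.texts_to_sequences([["我", "喜欢", "篮球"]]))
# e.g. [[1, 2]] -- unseen words are dropped unless oov_token is set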
Get the number of samples
def get_sample_count(data_path):
    data_iter = get_text_label_iterator(data_path)
    count = 0
    for text, label in tqdm(data_iter):
        count += 1
    return count

train_sample_count = get_sample_count(r"data/keras_bert_train.txt")
dev_sample_count = get_sample_count(r"data/keras_bert_dev.txt")
Build the label table
def read_category(data_path):
    """Read the category directories; the order is fixed."""
    categories = os.listdir(data_path)
    cat_to_id = dict(zip(categories, range(len(categories))))
    return categories, cat_to_id
categories, cat_to_id = read_category("000_text_classifier_tensorflow_textcnn/THUCNews")
cat_to_id
{'lottery': 0,
 'Home': 1,
 'Game': 2,
 'Stock': 3,
 'Technology': 4,
 'Society': 5,
 'Finance and Economics': 6,
 'Fashion': 7,
 'Constellation': 8,
 'Sports': 9,
 'Real Estate': 10,
 'Entertainment': 11,
 'Current Politics': 12,
 'Education': 13}
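For turning predicted class indices back into category names later on, the inverse mapping can be built directly from cat_to_id; a minimal sketch:

# Inverse mapping: class index -> category name.
id_to_cat = {index: cat for cat, index in cat_to_id.items()}
print(id_to_cat[0])  # e.g. 'lottery'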
Build the input data iterator
def get_data_iterator(data_path):
    while True:
        segment_iter = get_segment_iterator(data_path)
        for segment, label in segment_iter:
            word_ids = tokenizer.texts_to_sequences([segment])
            # Truncate sequences longer than 50 and zero-pad shorter ones at the front
            # (see the small pad_sequences example below)
            padded_seqs = pad_sequences(word_ids, maxlen=50)[0]
            yield padded_seqs, cat_to_id[label]

it = get_data_iterator(r"data/keras_bert_train.txt")
next(it)
Building prefix dict from the default dictionary …
Loading model from cache /tmp/jieba.cache
Loading model cost 1.039 seconds.
Prefix dict has been built succesfully.
(array([    69,   2160,     57,   3010,     55,    828,     68,   1028,
           456,   3712,   2130,      1,     36, 116604,    361,   7019,
           377,     26,      8,     76,    539,      1,    346,   7323,
         89885,   7019,     73,      7,     55,     84,      3,     33,
          3199,     69,    579,   1366,      2,   1526,     26,     89,
           456,   5741,   8256,      1,   6163,   7253,  10831,     14,
         77404,      3], dtype=int32),
 0)
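A quick illustration of the pad_sequences behaviour relied on above: by default both padding and truncating happen at the front ('pre'):

from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2, 3]], maxlen=5))
# [[0 0 1 2 3]]  -- zero-padded at the front
print(pad_sequences([[1, 2, 3, 4, 5, 6, 7]], maxlen=5))
# [[3 4 5 6 7]]  -- truncated at the front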
def get_batch_data_iterator(data_path, batch_size=64, shuffle=True):
    data_iter = get_data_iterator(data_path)
    while True:
        data_list = []
        for _ in range(batch_size):
            data = next(data_iter)
            data_list.append(data)
        if shuffle:
            random.shuffle(data_list)
        pad_sequences_list = []
        label_index_list = []
        for data in data_list:
            padded_seq, label_index = data  # each item is (padded sequence, label index)
            pad_sequences_list.append(padded_seq.tolist())
            label_index_list.append(label_index)
        yield np.array(pad_sequences_list), np.array(label_index_list)

it = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size=1)
next(it)
(array([[    69,   2160,     57,   3010,     55,    828,     68,   1028,
            456,   3712,   2130,      1,     36, 116604,    361,   7019,
            377,     26,      8,     76,    539,      1,    346,   7323,
          89885,   7019,     73,      7,     55,     84,      3,     33,
           3199,     69,    579,   1366,      2,   1526,     26,     89,
            456,   5741,   8256,      1,   6163,   7253,  10831,     14,
          77404,      3]]),
 array([0]))
it = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size=2)
next(it)
(array([[     5,   5013,  14313,    601,  15377,  23499,     13,    493,
           1541,    247,      5,  35557,  21529,  15377,      5,   1764,
             11,   2774,  15377,      5,    279,   1764,    430,      5,
           4742,  36921,  24090,   6387,  23499,     13,   5013,   8319,
           6387,      5,   2370,   1764,   6387,      5,  16122,   1764,
           6387,      5,  14313,   3707,   6387,      5,     11,   2774,
            247,   6387],
         [   69,   2160,     57,   3010,     55,    828,     68,   1028,
            456,   3712,   2130,      1,     36, 116604,    361,   7019,
            377,     26,      8,     76,    539,      1,    346,   7323,
          89885,   7019,     73,      7,     55,     84,      3,     33,
           3199,     69,    579,   1366,      2,   1526,     26,     89,
            456,   5741,   8256,      1,   6163,   7253,  10831,     14,
          77404,      3]]),
 array([0, 0]))
Define basic CNN
def get_model():
    K.clear_session()
    model = Sequential()
    model.add(Embedding(len(vocab) + 1, 300, input_length=50))  # map each word index to a 300-dim word vector with the Embedding layer
    model.add(Conv1D(256, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(128, 5, padding='same'))
    model.add(MaxPooling1D(3, 3, padding='same'))
    model.add(Conv1D(64, 3, padding='same'))
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(BatchNormalization())  # (batch) normalization layer
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(3, activation='softmax'))  # 3 output classes; for the full THUCNews label set this would be len(cat_to_id)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
early_stopping = EarlyStopping(monitor='val_acc', patience=3)  # early stopping to prevent overfitting
plateau = ReduceLROnPlateau(monitor="val_acc", verbose=1, mode='max', factor=0.5, patience=2)  # reduce the learning rate when the monitored metric stops improving
# checkpoint = ModelCheckpoint('trained_model/keras_bert_THUCNews.hdf5', monitor='val_acc', verbose=2, save_best_only=True, mode='max', save_weights_only=True)

def get_step(sample_count, batch_size):
    step = sample_count // batch_size
    if sample_count % batch_size != 0:
        step += 1
    return step
batch_size = 8
train_step = get_step(train_sample_count, batch_size)
dev_step = get_step(dev_sample_count, batch_size)

train_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size)
dev_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_dev.txt", batch_size)

model = get_model()

# Model training
model.fit(
    train_dataset_iterator,
    steps_per_epoch=train_step,
    epochs=10,
    validation_data=dev_dataset_iterator,
    validation_steps=dev_step,
    callbacks=[early_stopping, plateau],
    verbose=1
)
Model: “sequential”
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 50, 300) 454574700
conv1d (Conv1D) (None, 50, 256) 384256
max_pooling1d (MaxPooling1D) (None, 17, 256) 0
conv1d_1 (Conv1D) (None, 17, 128) 163968
max_pooling1d_1 (MaxPooling1 (None, 6, 128) 0
conv1d_2 (Conv1D) (None, 6, 64) 24640
flatten (Flatten) (None, 384) 0
dropout (Dropout) (None, 384) 0
batch_normalization (BatchNo (None, 384) 1536
dense (Dense) (None, 256) 98560
dropout_1 (Dropout) (None, 256) 0
dense_1 (Dense) (None, 3) 771
=================================================================
Total params: 455248431
Trainable params: 455247663
Non-trainable params: 768
None
Epoch 1/10
1/83608 [..............................] - ETA: 3:28 - loss: 1.1427 - accuracy: 0.3750
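Once training finishes, predictions come out as per-class probability vectors; a minimal sketch (reusing the dev iterator and the categories list defined above) of decoding them back to category names:

# Take one dev batch and decode the predicted class indices into category names.
batch_x, batch_y = next(dev_dataset_iterator)
probs = model.predict(batch_x)            # shape: (batch_size, number of output classes)
pred_ids = np.argmax(probs, axis=-1)
print([categories[i] for i in pred_ids])  # predicted categories
print([categories[i] for i in batch_y])   # true categories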
Define a simple version of TextCNN
def get_model():
    K.clear_session()
    main_input = Input(shape=(50,), dtype='float64')
    # Word embedding (using pre-trained word vectors)
    embedder = Embedding(len(vocab) + 1, 300, input_length=50, trainable=False)
    embed = embedder(main_input)
    # Convolution branches with word-window sizes 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=48)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=47)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=46)(cnn3)
    # Concatenate the output vectors of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.2)(flat)
    main_output = Dense(3, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
batch_size = 8
train_step = get_step(train_sample_count, batch_size)
dev_step = get_step(dev_sample_count, batch_size)

train_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_train.txt", batch_size)
dev_dataset_iterator = get_batch_data_iterator(r"data/keras_bert_dev.txt", batch_size)

model = get_model()

# Model training
model.fit(
    train_dataset_iterator,
    steps_per_epoch=train_step,
    epochs=10,
    validation_data=dev_dataset_iterator,
    validation_steps=dev_step,
    callbacks=[early_stopping, plateau],
    verbose=1
)
Model: “functional_1”
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 50)] 0
embedding (Embedding) (None, 50, 300) 454574700 input_1[0][0]
conv1d (Conv1D) (None, 50, 256) 230656 embedding[0][0]
conv1d_1 (Conv1D) (None, 50, 256) 307456 embedding[0][0]
conv1d_2 (Conv1D) (None, 50, 256) 384256 embedding[0][0]
max_pooling1d (MaxPooling1D) (None, 1, 256) 0 conv1d[0][0]
max_pooling1d_1 (MaxPooling1D) (None, 1, 256) 0 conv1d_1[0][0]
max_pooling1d_2 (MaxPooling1D) (None, 1, 256) 0 conv1d_2[0][0]
concatenate (Concatenate) (None, 1, 768) 0 max_pooling1d[0][0]
max_pooling1d_1[0][0]
max_pooling1d_2[0][0]
flatten (Flatten) (None, 768) 0 concatenate[0][0]
dropout (Dropout) (None, 768) 0 flatten[0][0]
dense (Dense) (None, 3) 2307 dropout[0][0]
==================================================================================================
Total params: 455499375
Trainable params: 924675
Non-trainable params: 454574700
None
Epoch 1/10
238/83608 [..............................] - ETA: 2:31:07 - loss: 0.0308 - accuracy: 0.9979
![file](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/25ba63f289764f8e9f45a308f80c8fc0~tplv-k3u1fbpfcp-zoom-1.image)
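The callbacks cell above contains a commented-out ModelCheckpoint; a minimal sketch of wiring it in so the best weights (by the same 'val_acc' metric monitored elsewhere in this code) are saved during training:

from keras.callbacks import ModelCheckpoint

# Save the best weights (by validation accuracy) while training.
checkpoint = ModelCheckpoint('trained_model/keras_bert_THUCNews.hdf5',
                             monitor='val_acc', verbose=2,
                             save_best_only=True, mode='max',
                             save_weights_only=True)

# e.g. model.fit(..., callbacks=[early_stopping, plateau, checkpoint], ...)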