Preface
Data Source:
2019 CCF Internet News Sentiment Analysis Competition
www.datafountain.cn/competition…
Data set attachment:
Link: pan.baidu.com/s/1ePKyHyE8…
Extraction code: 2021
Specific tools:
- Deep learning framework: TensorFlow 2.4.0
- Natural language processing library: gensim 3.8.3
- Word segmentation tool: jieba 0.42.1
If anything in the code is unclear, the Python beginner tutorial below works well as a quick reference manual:
www.runoob.com/python3/pyt…
The specific process
1. Data preprocessing
1.1 Reading Data
Here the data set is read with read_csv. Because train_data and train_label are stored separately, they are merged with pd.merge(). A toy illustration of these calls follows the list.
- pd.merge(x, y, how="left", on="id"): left join of y onto x on the id column
- notnull(): returns a Boolean Series that is True where the value is not null
- fillna(value): finds null values and replaces them with value
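A minimal toy illustration of how the left merge and the null checks behave (the small frames below are made up for this example):
import pandas as pd
# Toy data made up for this example: id 3 has no matching label, id 2 has no content
demo_data = pd.DataFrame({'id': [1, 2, 3], 'content': ['a', None, 'c']})
demo_label = pd.DataFrame({'id': [1, 2], 'label': [0, 1]})
demo = pd.merge(demo_data, demo_label, how='left', on='id')  # the left join keeps every row of demo_data
print(demo.label.notnull().tolist())  # [True, True, False] - id 3 received a NaN label
demo = demo[demo.label.notnull() & demo.content.notnull()]   # only the row with id 1 survives
demo['content'] = demo['content'].fillna('')                 # any remaining NaN content becomes ''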
import pandas as pd
import numpy as np
# Read the training data
train_data = pd.read_csv('Train_DataSet.csv')
train_label = pd.read_csv('Train_DataSet_Label.csv')
test = pd.read_csv('Test_DataSet.csv')
# Merge the training data with the training labels
train = pd.merge(train_data, train_label, how='left', on='id')
# Use a Boolean mask to drop rows of train whose label or content is null
train = train[(train.label.notnull()) & (train.content.notnull())]
# Replace na with an empty string
train['title'] = train['title'].fillna(' ')
train['content'] = train['content'].fillna(' ')
test['title'] = test['title'].fillna(' ')
test['content'] = test['content'].fillna(' ')
1.2 Filter invalid characters and HTML tags
The text_filter(text) function can be kept in your utility library for future use. A two-line demo of the two calls follows the list.
- re.sub(pattern, replacement, string): replaces every match of pattern in string with replacement
- str.strip(): removes whitespace at the beginning and end of the string
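A two-line demo of the argument order of re.sub and of strip (toy strings only):
import re
print(re.sub(r'\d+', '', 'abc123def'))  # 'abcdef' - every match of the pattern is replaced by ''
print('  hello  '.strip())              # 'hello' - leading and trailing whitespace removed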
import re
# text filter function
def text_filter(text):
# re.sub(pattern, replacement, string)
text = re.sub(r"[a-zA-Z0-9!=\?%\[\],\(\)><:</#.\-_]", "", text)
text = text.replace('images', '')
text = text.replace('\xa0', '')  # remove the non-breaking space (&nbsp;)
# Remove HTML tags
cleanr = re.compile('<.*?>')
text = re.sub(cleanr, ' ', text)
# Remove other punctuation characters (Chinese and English)
r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\\\，。=？、：“”‘’￥……（）《》【】]"
text = re.sub(r1, ' ', text)
# Remove whitespace at the beginning and end of the string
text = text.strip()
return text
# Text cleanup function
def clean_text(data):
# title text
data['title'] = data['title'].apply(lambda x: text_filter(x))
# Body text
data['content'] = data['content'].apply(lambda x: text_filter(x))
return data
# run clean_text
train = clean_text(train)
test = clean_text(test)
1.3 Word segmentation and stop words
- str.maketrans(x, y, z): takes three arguments; every character in the third argument z is mapped to None, i.e. deleted from the translated string, even when the same character also appears in x (see the short example after this list)
- string.punctuation: all ASCII punctuation characters
- [token for token in tokens if token not in stop_words]: list comprehension that drops stop words
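A small sketch of how the translation table removes punctuation, using made-up tokens:
import string
table = str.maketrans('', '', string.punctuation)
print('hello, world!!!'.translate(table))    # 'hello world' - every ASCII punctuation character is deleted
tokens = ['hello', ',', 'world', '!']
print([t.translate(table) for t in tokens])  # ['hello', '', 'world', ''] - punctuation-only tokens become ''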
import jieba
import string
# load stop_words
stop_words = pd.read_table('stop.txt', header=None)[0].tolist()
# Create a translation table that will be used later to remove English punctuation
table = str.maketrans("", "", string.punctuation)
def cut_text(sentence):
tokens = list(jieba.cut(sentence))
# Remove stop words
tokens = [token for token in tokens if token not in stop_words]
# Remove English punctuation
tokens = [token.translate(table) for token in tokens]
return tokens
# Call the word segmentation function to segment the title and text of the training set and test set
train_title = [cut_text(sent) for sent in train.title.values]
train_content = [cut_text(sent) for sent in train.content.values]
test_title = [cut_text(sent) for sent in test.title.values]
test_content = [cut_text(sent) for sent in test.content.values]
# Concatenate all tokenized texts to prepare for training word vectors
all_doc = train_title + train_content + test_title + test_content
1.4 Use gensim to train word vectors
This code can be reused directly to train your own word vectors; in testing, the larger the corpus, the better. The vocabulary size after word segmentation is about 29,244.
import gensim
import time
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
# Callback for saving the model and printing the loss after each epoch
def __init__(self, save_path):
self.save_path = save_path # model storage path
self.epoch = 0 # rounds
self.pre_loss = 0 # Previous round losses
self.best_loss = 999999999.9 # Optimal loss
self.since = time.time() # Duration of a run
def on_epoch_end(self, model):
self.epoch += 1
cum_loss = model.get_latest_training_loss()  # the returned value is cumulative from the first epoch
epoch_loss = cum_loss - self.pre_loss # epoch-loss = current loss - previous round loss
time_taken = time.time() - self.since # Duration
print("Epoch %d, loss: %.2f, time: %dmin %ds" %
(self.epoch, epoch_loss, time_taken//60, time_taken%60)) Print the result of a round in minutes
Record best_loss and early_stop by best_loss
if self.best_loss > epoch_loss:
self.best_loss = epoch_loss
print("Better model. Best loss: %.2f" % self.best_loss) # Print the best loss
model.save(self.save_path) # Save the model
print("Model %s save done!" % self.save_path)
self.pre_loss = cum_loss
self.since = time.time()
# The following code loads the trained word vector
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
Create the Word2Vec trainer and use build_vocab to load the tokens into the vocabulary.
model_word2vec = gensim.models.Word2Vec(min_count=1,
window=5,
size=256,
workers=4,
batch_words=1000)
since = time.time()
model_word2vec.build_vocab(all_doc, progress_per=2000)
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
- Train the word vector and save
since = time.time()
model_word2vec.train(all_doc, total_examples=model_word2vec.corpus_count,
epochs=20, compute_loss=True, report_delay=60*10,
callbacks=[EpochSaver('./final_word2vec_model')])
time_elapsed = time.time() - since
print('Time to train: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
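A quick sanity check after training, assuming the model was saved to ./final_word2vec_model as above (gensim 3.x API):
model_check = gensim.models.Word2Vec.load('./final_word2vec_model')
print(len(model_check.wv.vocab))                        # vocabulary size
some_word = list(model_check.wv.vocab)[0]               # any word from the vocabulary
print(model_check.wv[some_word].shape)                  # (256,) - a single word vector
print(model_check.wv.most_similar(some_word, topn=3))   # its three nearest neighbours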
1.5 Encoding with the tf.keras Tokenizer
Tokenizer is TensorFlow's tokenizer class; it encapsulates the word-to-index dictionary and related utilities.
Refer to the blog: dengbocong.blog.csdn.net/article/det…
# Build and fit the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_title + test_title)
# tokenizer.fit_on_texts(train_content + test_content)
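To see what fit_on_texts builds, here is a toy run on made-up token lists (the Tokenizer accepts lists of tokens as well as raw strings):
demo_tok = Tokenizer()
demo_tok.fit_on_texts([['i', 'love', 'nlp'], ['nlp', 'is', 'fun']])   # already-tokenized texts
print(demo_tok.word_index)                                  # {'nlp': 1, 'i': 2, 'love': 3, 'is': 4, 'fun': 5}
print(demo_tok.texts_to_sequences([['i', 'love', 'nlp']]))  # [[2, 3, 1]]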
1.6 Constructing the embedding matrix
from tqdm import tqdm
# Build the word vector matrix using the newly trained word2vec model
vocab_size = len(tokenizer.word_index)  # vocabulary size
error_count = 0
embedding_matrix = np.zeros((vocab_size + 1, 256))
for word, i in tqdm(tokenizer.word_index.items()):
if word in model_word2vec.wv:
embedding_matrix[i] = model_word2vec.wv[word]
else:
error_count += 1
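A small check of the matrix built above, reusing the variables from this section:
print(embedding_matrix.shape)  # (vocab_size + 1, 256); row 0 stays all zeros and acts as the padding row
print('words missing from the word2vec vocabulary:', error_count)
print('embedding coverage: %.2f%%' % (100 * (1 - error_count / vocab_size)))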
1.7 Padding
- Padding pads short sequences with zeros and truncates long ones so that every sequence has a fixed length
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequence = tokenizer.texts_to_sequences(train_title)
traintitle = pad_sequences(sequence, maxlen=30)
sequence = tokenizer.texts_to_sequences(test_title)
testtitle = pad_sequences(sequence, maxlen=30)
# sequence = tokenizer.texts_to_sequences(train_content)
# traincontent = pad_sequences(sequence, maxlen=512)
# sequence = tokenizer.texts_to_sequences(test_content)
# testcontent = pad_sequences(sequence, maxlen=512)
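A toy example of what pad_sequences does with maxlen (made-up id sequences):
print(pad_sequences([[1, 2], [3, 4, 5, 6]], maxlen=3))
# [[0 1 2]   <- the short sequence is padded with zeros on the left (padding='pre' by default)
#  [4 5 6]]  <- the long sequence is truncated from the front (truncating='pre' by default)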
2. Construct the model
- zhuanlan.zhihu.com/p/95293440: notes on the accuracy metrics in Keras
2.1 BiLSTM
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers
model = Sequential([
layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
output_dim=256,
input_length=30,
weights=[embedding_matrix]),
layers.Bidirectional(LSTM(32, return_sequences = True)),
layers.GlobalMaxPool1D(),
layers.Dense(20, activation="relu"),
layers.Dropout(0.05),
layers.Dense(3, activation="softmax"),
])
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['categorical_accuracy'])
model.summary()
2.2 TextCNN
The Attention layer below is adapted from code written by others.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Input, Model,backend as K
from tensorflow.keras.layers import Embedding, Dense, Attention, Bidirectional, LSTM
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer
class Attention(Layer):
def __init__(self, step_dim,
W_regularizer=None, b_regularizer=None,
W_constraint=None, b_constraint=None,
bias=True, **kwargs):
"""
Keras Layer that implements an Attention mechanism for temporal data.
Supports Masking. Follows the work of Raffel et al.
[https://arxiv.org/abs/1512.08756]
# Input shape
3D tensor with shape: `(samples, steps, features)`.
# Output shape
2D tensor with shape: `(samples, features)`.
:param kwargs:
Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
The dimensions are inferred based on the output shape of the RNN.
Example:
# 1
model.add(LSTM(64, return_sequences=True))
model.add(Attention())
# next add a Dense layer (for classification/regression) or whatever...
# 2
hidden = LSTM(64, return_sequences=True)(words)
sentence = Attention()(hidden)
# next add a Dense layer (for classification/regression) or whatever...
"""
self.supports_masking = True
self.init = initializers.get('glorot_uniform')
self.W_regularizer = regularizers.get(W_regularizer)
self.b_regularizer = regularizers.get(b_regularizer)
self.W_constraint = constraints.get(W_constraint)
self.b_constraint = constraints.get(b_constraint)
self.bias = bias
self.step_dim = step_dim
self.features_dim = 0
super(Attention, self).__init__(**kwargs)
def build(self, input_shape):
assert len(input_shape) == 3
self.W = self.add_weight(shape=(input_shape[-1],),
initializer=self.init,
name='{}_W'.format(self.name),
regularizer=self.W_regularizer,
constraint=self.W_constraint)
self.features_dim = input_shape[-1]
if self.bias:
self.b = self.add_weight(shape=(input_shape[1],),
initializer='zero',
name='{}_b'.format(self.name),
regularizer=self.b_regularizer,
constraint=self.b_constraint)
else:
self.b = None
self.built = True
def compute_mask(self, input, input_mask=None):
# do not pass the mask to the next layers
return None
def call(self, x, mask=None):
features_dim = self.features_dim
step_dim = self.step_dim
e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
if self.bias:
e += self.b
e = K.tanh(e)
a = K.exp(e)
# apply mask after the exp. will be re-normalized next
if mask is not None:
# cast the mask to floatX to avoid float64 upcasting in theano
a *= K.cast(mask, K.floatx())
# in some cases especially in the early stages of training the sum may be almost zero
# A workaround is to add a very small positive number ε to the sum.
a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
a = K.expand_dims(a)
c = K.sum(a * x, axis=1)
return c
def compute_output_shape(self, input_shape):
return input_shape[0], self.features_dim
TextCNN
# from keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout
class TextCNN(object):
def __init__(self, maxlen, max_features, embedding_dims,
class_num=1,
last_activation='sigmoid'):
self.maxlen = maxlen
self.max_features = max_features
self.embedding_dims = embedding_dims
self.class_num = class_num
self.last_activation = last_activation
def get_model(self):
input = Input((self.maxlen,))
# Embedding part can try multichannel as same as origin paper
embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen,
weights=[embedding_matrix])(input)
convs = []
for kernel_size in [3, 4, 5]:
c = Conv1D(128, kernel_size, activation='relu')(embedding)
c = GlobalMaxPooling1D()(c)
convs.append(c)
x = Concatenate()(convs)
output = Dense(self.class_num, activation=self.last_activation)(x)
model = Model(inputs=input, outputs=output)
return model
model = TextCNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
embedding_dims=256, class_num=3, last_activation='softmax').get_model()
# metric_F1score is defined in section 3.1 below
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy', metric_F1score])
model.summary()
2.3 Attention-BiLSTM
Attention-BiLSTM
class TextAttBiRNN(object):
def __init__(self, maxlen, max_features, embedding_dims,
class_num=1,
last_activation='sigmoid'):
self.maxlen = maxlen
self.max_features = max_features
self.embedding_dims = embedding_dims
self.class_num = class_num
self.last_activation = last_activation
def get_model(self):
input = Input((self.maxlen,))
embedding = Embedding(self.max_features, self.embedding_dims,
input_length=self.maxlen, weights=[embedding_matrix])(input)
x = Bidirectional(LSTM(128,return_sequences=True))(embedding) # LSTM or GRU
x = Attention(self.maxlen)(x)
output = Dense(self.class_num, activation=self.last_activation)(x)
model = Model(inputs=input, outputs=output)
return model
pass
model = TextAttBiRNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
embedding_dims=256, class_num=3, last_activation='softmax').get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()
3. Model training
3.1 Evaluation Criteria
import tensorflow as tf
# F1 value indicator
def metric_F1score(y_true, y_pred):
TP=tf.reduce_sum(y_true*tf.round(y_pred))
TN=tf.reduce_sum((1-y_true)*(1-tf.round(y_pred)))
FP=tf.reduce_sum((1-y_true)*tf.round(y_pred))
FN=tf.reduce_sum(y_true*(1-tf.round(y_pred)))
precision=TP/(TP+FP)
recall=TP/(TP+FN)
F1score=2*precision*recall/(precision+recall)
return F1score
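A quick check of the metric on a tiny made-up batch; with categorical_crossentropy, y_true is one-hot and y_pred is the softmax output:
y_true_demo = tf.constant([[1., 0., 0.], [0., 1., 0.]])
y_pred_demo = tf.constant([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])
print(float(metric_F1score(y_true_demo, y_pred_demo)))  # 1.0 - both rounded predictions match the labels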
3.2 Training set segmentation
- Input: traintitle, the padded title sequences produced above
- Output: label, taken from the original CSV
- Split ratio: training set : validation set == 4 : 1
import tensorflow as tf
from sklearn.model_selection import train_test_split
label = train['label'].astype(int)
train_X, val_X, train_Y, val_Y = train_test_split(traintitle, label, shuffle=True, test_size=0.2,random_state=42)
# to_categorical one-hot encodes the labels, needed because the loss is categorical_crossentropy
# with sparse_categorical_crossentropy the labels would not need to be converted
train_Y = tf.keras.utils.to_categorical(train_Y)
3.3 Model training
- Adjust the other parameters as you like
# Model training
history = model.fit(train_X, train_Y, batch_size=128, epochs=10, validation_split=0.1, validation_freq=1)
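If you want early stopping or checkpointing, Keras callbacks can also be passed to fit; a minimal sketch (the monitored value and patience are just example choices):
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
]
history = model.fit(train_X, train_Y, batch_size=128, epochs=10, validation_split=0.1, callbacks=callbacks)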
3.4 Validate model performance
from sklearn.metrics import f1_score
pred_val = model.predict(val_X)
print(f1_score(val_Y, np.argmax(pred_val, axis=1), average='macro'))
3.5 Visualization of loss and accuracy
import matplotlib.pyplot as plt
# Plot the loss and accuracy curves
def show_loss_acc_img(history):
# loss
plt.plot(history.history['loss'], label="$Loss$")
plt.plot(history.history['val_loss'], label='$val_loss$')
plt.title('Loss')
plt.xlabel('epoch')
plt.ylabel('num')
plt.legend()
plt.show()
# accuracy
plt.plot(history.history['categorical_accuracy'], label="categorical_accuracy")
plt.plot(history.history['val_categorical_accuracy'], label='val_categorical_accuracy')
plt.title('Accuracy')
plt.xlabel('epoch')
plt.ylabel('num')
plt.legend()
plt.show()
pass
show_loss_acc_img(history)
3.6 Predict the sentiment polarity of the test set
# Predict test set polarity
pred_val = model.predict(testtitle)
# Save the forecast file
submission = pd.DataFrame(test.id.values,columns=["id"])
submission["label"] = np.argmax(pred_val, axis=1)
submission.to_csv("submission.csv",index=False)
Practical code you can use straight away
1. Use regex to remove HTML and other symbols from text
import re
# text filter function
def text_filter(text):
# re.sub(pattern, replacement, string)
text = re.sub(r"[a-zA-Z0-9!=\?%\[\],\(\)><:</#.\-_]", "", text)
text = text.replace('images', '')
text = text.replace('\xa0', '')  # remove the non-breaking space (&nbsp;)
# Remove HTML tags
cleanr = re.compile('<.*?>')
text = re.sub(cleanr, ' ', text)
# Remove other punctuation characters (Chinese and English)
r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\\\，。=？、：“”‘’￥……（）《》【】]"
text = re.sub(r1, ' ', text)
# Remove whitespace at the beginning and end of the string
text = text.strip()
return text
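A quick usage example on a made-up HTML snippet:
sample = '<p>Breaking News 2021!!!</p> 这 是 一条 新闻'
print(text_filter(sample))  # the tag, English words, digits and punctuation are stripped; the Chinese text remains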
2. Use gensim to train your own word vectors
Reference blog:
[1] www.jianshu.com/p/5f04e97d1… Plotting word vectors with t-SNE dimensionality reduction
[2] www.cnblogs.com/johnnyzen/p… gensim.models.Word2Vec parameter description
import gensim
import time
from sklearn.manifold import TSNE
from matplotlib.font_manager import *
import matplotlib.pyplot as plt
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
# Callback for saving the model and printing the loss after each epoch
def __init__(self, save_path):
self.save_path = save_path # model storage path
self.epoch = 0 # rounds
self.pre_loss = 0 # Previous round losses
self.best_loss = 999999999.9 # Optimal loss
self.since = time.time() # Duration of a run
def on_epoch_end(self, model):
self.epoch += 1
cum_loss = model.get_latest_training_loss()  # the returned value is cumulative from the first epoch
epoch_loss = cum_loss - self.pre_loss # epoch-loss = current loss - previous round loss
time_taken = time.time() - self.since # Duration
print("Epoch %d, loss: %.2f, time: %dmin %ds" %
(self.epoch, epoch_loss, time_taken // 60, time_taken % 60)) Print the result of a round in minutes
Record best_loss and early_stop by best_loss
if self.best_loss > epoch_loss:
self.best_loss = epoch_loss
print("Better model. Best loss: %.2f" % self.best_loss) # Print the best loss
model.save(self.save_path) # Save the model
print("Model %s save done!" % self.save_path)
self.pre_loss = cum_loss
self.since = time.time()
pass
# Load a previously trained word vector model
def load_model_word2vec(save_path):
model_word2vec = gensim.models.Word2Vec.load(save_path)
# The following code loads the trained word vector
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
return model_word2vec
def print_since_time(since):
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
def show_word2vec_2D(model_word2vec, random_word):
# model_word2vec.wv[random_word]: random_word must be a list of word strings
X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(model_word2vec.wv[random_word])
# Fix the issue where the minus sign '-' is displayed as a box
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(14, 8))
myfont = FontProperties(fname=r'C:\Windows\Fonts\simsun.ttc')  # load a Chinese font
plt.scatter(X_tsne[:, 0], X_tsne[:, 1]) # create scatter chart
for i in range(len(X_tsne)):
x = X_tsne[i][0]
y = X_tsne[i][1]
plt.text(x, y, random_word[i], fontproperties=myfont, size=16) # output coordinate label
plt.show()
pass
if __name__=="__main__":
# input_doc: a list of tokenized documents (each document is a list of tokens)
input_doc = [['`', 'advertising', 'contact', 'micro', 'signal', 'Flower capital region', 'rent', 'full', 'a', '有望',
'make sure', 'degree', 'information', 'The Times', 'journalists', 'Cui Xiaoyuan', 'recently', 'release',
'hereinafter referred to as', 'know', 'this year', 'Flower capital region', 'the public', 'primary', 'plan',
'recruit', 'class', 'the public', 'middle school', 'plan', 'recruit', 'class', '; ', 'Private primary School',
'Flower capital region', 'class', 'private', 'middle school', 'plan', 'recruit', 'class', 'compared with',
'years', 'admissions', 'rules', 'this year', 'admissions', 'scale', 'the general', 'change', 'no', 'big',
'plan', 'recruit', 'know', 'Flower capital region', 'admissions', 'time', 'arrangement', 'month', 'day', '~',
'month', 'day', 'Flower capital region', 'the public', 'primary', 'online', 'sign up', '; ', 'Education Bureau',
'~', 'integral', 'entrance', 'online', 'sign up', '; ', 'month', 'day', '~', 'month', 'day',
'Flower capital region', 'Private primary School', 'online', 'sign up', '; ', '~', 'Flower capital region',
'community', 'supporting', 'the owner', 'not', 'Guangzhou', 'hukou', 'age', 'child', 'sign up', 'security',
'区内', 'clear', 'the future', 'ten years', 'Tenant', 'child', 'entrance', '方面', 'proposed', 'a', 'Guangzhou',
'hukou', 'with', 'Policy', 'take care', 'born', 'Guangzhou', 'no', 'their own', 'property', 'home', 'with',
'in urban and rural areas', 'Self-built', 'rent', 'home', 'location', 'the only', 'Place of Residence', 'home',
'rent', 'contract', 'registration', 'the record', 'row', 'full', 'a', '截止', 'date', 'application', 'entrance',
'inner', 'Year month day', 'more than', 'application', 'when', 'rent', 'contract', 'effective', 'state',
'Tenant', 'age', 'child', 'Flower capital region', 'Education Bureau', 'make sure', 'degree', 'supply', 'years',
'当中', '已经', 'zengcheng', 'flowers', 'hometown', '张', 'bed', 'conditions', 'set up', 'professional',
'Mental illness', 'hospital', 'source', 'flowers', 'morning', 'District Health Bureau', 'guangzhou', 'flowers',
'release', 'today', 'flowers', 'flowers', 'job', 'recruitment', 'group', 'add', 'when', 'note', 'move on',
'job', 'small make up', 'wages', 'Thumb', 'link', 'point', 'a', 'A penny', '求', 'exceptional', 'remember', 'and']]
# ----------------------- Train the model -----------------------
model_word2vec = gensim.models.Word2Vec(min_count=1,
window=5,
size=256,
workers=4,
batch_words=1000)
since = time.time() # Start the timer
model_word2vec.build_vocab(input_doc, progress_per=2000)  # progress_per: how often to report progress while scanning the corpus
print_since_time(since) # Time is over, printing time
since = time.time()
model_word2vec.train(input_doc,
total_examples=model_word2vec.corpus_count,
epochs=20,
compute_loss=True,
report_delay=60 * 10,
callbacks=[EpochSaver('./final_word2vec_model')]) # model_word2vec model storage
print_since_time(since) # Time is over, printing time
# Plot the word vectors in 2D
show_word2vec_2D(model_word2vec, input_doc[0])
# model_word2vec = load_model_word2vec('./final_word2vec_model')
# print(model_word2vec)
# # Compute the similarity between two words
# y2 = model_word2vec.wv.similarity(u"rent", u"rent")
# print(y2)
# # Print the words most similar to a given word
# for i in model_word2vec.wv.most_similar(u"build"):
#     print(i[0], i[1])
The word vector visualization produced with t-SNE dimensionality reduction.