Preface
Data Source:
2019 CCF Internet News Sentiment Analysis Competition
www.datafountain.cn/competition…
Data set attachment:
Link: pan.baidu.com/s/1ePKyHyE8…
Extraction code: 2021
Specific tools:
- Deep learning framework: TensorFlow 2.4.0
- Natural language processing library: gensim 3.8.3
- Word segmentation tool: jieba 0.42.1
If anything in the code is unclear, the Python beginner tutorial below works well as a quick reference manual:
www.runoob.com/python3/pyt…
The specific process
1. Data preprocessing
1.1 Reading Data
Here the data set is read with read_csv. Because train_data and train_label are stored separately, they are merged with pd.merge(). A toy illustration of these calls follows the list.
- pd.merge(x, y, how="left", on="id"): left join of y onto x on the id column
- notnull(): returns a Boolean Series that is True where the value is not null
- fillna(value): finds null values and replaces them with value
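A minimal toy illustration of how the left merge and the null checks behave (the small frames below are made up for this example):
import pandas as pd
# Toy data made up for this example: id 3 has no matching label, id 2 has no content
demo_data = pd.DataFrame({'id': [1, 2, 3], 'content': ['a', None, 'c']})
demo_label = pd.DataFrame({'id': [1, 2], 'label': [0, 1]})
demo = pd.merge(demo_data, demo_label, how='left', on='id')  # the left join keeps every row of demo_data
print(demo.label.notnull().tolist())  # [True, True, False] - id 3 received a NaN label
demo = demo[demo.label.notnull() & demo.content.notnull()]   # only the row with id 1 survives
demo['content'] = demo['content'].fillna('')                 # any remaining NaN content becomes ''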
import pandas as pd
import numpy as np
# Read the training data
train_data = pd.read_csv('Train_DataSet.csv')
train_label = pd.read_csv('Train_DataSet_Label.csv')
test = pd.read_csv('Test_DataSet.csv')
# Merge the training data with the training labels
train = pd.merge(train_data, train_label, how='left', on='id')
# Use a Boolean mask to drop rows of train whose label or content is null
train = train[(train.label.notnull()) & (train.content.notnull())]
# Replace na with an empty string
train['title'] = train['title'].fillna(' ')
train['content'] = train['content'].fillna(' ')
test['title'] = test['title'].fillna(' ')
test['content'] = test['content'].fillna(' ')
1.2 Filter invalid characters and HTML tags
The text_filter(text) function can be kept in your utility library for future use. A two-line demo of the two calls follows the list.
- re.sub(pattern, replacement, string): replaces every match of pattern in string with replacement
- str.strip(): removes whitespace at the beginning and end of the string
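A two-line demo of the argument order of re.sub and of strip (toy strings only):
import re
print(re.sub(r'\d+', '', 'abc123def'))  # 'abcdef' - every match of the pattern is replaced by ''
print('  hello  '.strip())              # 'hello' - leading and trailing whitespace removed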
import re
# text filter function
def text_filter(text):
# re.sub(pattern, replacement, string)
text = re.sub(r"[a-zA-Z0-9!=\?%\[\],\(\)><:</#.\-_]", "", text)
text = text.replace('images', '')
text = text.replace('\xa0', '')  # remove the non-breaking space (&nbsp;)
# Remove HTML tags
cleanr = re.compile('<.*?>')
text = re.sub(cleanr, ' ', text)
# Remove other punctuation characters (Chinese and English)
r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\\\，。=？、：“”‘’￥……（）《》【】]"
text = re.sub(r1, ' ', text)
# Remove whitespace at the beginning and end of the string
text = text.strip()
return text
# Text cleanup function
def clean_text(data):
# title text
data['title'] = data['title'].apply(lambda x: text_filter(x))
# Body text
data['content'] = data['content'].apply(lambda x: text_filter(x))
return data
# run clean_text
train = clean_text(train)
test = clean_text(test)
1.3 Word segmentation and stop words
- str.maketrans(x, y, z): takes three arguments; every character in the third argument z is mapped to None, i.e. deleted from the translated string, even when the same character also appears in x (see the short example after this list)
- string.punctuation: all ASCII punctuation characters
- [token for token in tokens if token not in stop_words]: list comprehension that drops stop words
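A small sketch of how the translation table removes punctuation, using made-up tokens:
import string
table = str.maketrans('', '', string.punctuation)
print('hello, world!!!'.translate(table))    # 'hello world' - every ASCII punctuation character is deleted
tokens = ['hello', ',', 'world', '!']
print([t.translate(table) for t in tokens])  # ['hello', '', 'world', ''] - punctuation-only tokens become ''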
import jieba
import string
# load stop_words
stop_words = pd.read_table('stop.txt', header=None)[0].tolist()
# Create a translation table that will be used later to remove English punctuation
table = str.maketrans("", "", string.punctuation)
def cut_text(sentence):
tokens = list(jieba.cut(sentence))
# Remove stop words
tokens = [token for token in tokens if token not in stop_words]
# Remove English punctuation
tokens = [token.translate(table) for token in tokens]
return tokens
# Call the word segmentation function to segment the title and text of the training set and test set
train_title = [cut_text(sent) for sent in train.title.values]
train_content = [cut_text(sent) for sent in train.content.values]
test_title = [cut_text(sent) for sent in test.title.values]
test_content = [cut_text(sent) for sent in test.content.values]
# Concatenate all tokenized texts to prepare for training word vectors
all_doc = train_title + train_content + test_title + test_content
1.4 Use gensim to train word vectors
This code can be reused directly to train your own word vectors; in testing, the larger the corpus, the better. The vocabulary size after word segmentation is about 29,244.
import gensim
import time
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
# Callback for saving the model and printing the loss after each epoch
def __init__(self, save_path):
self.save_path = save_path # model storage path
self.epoch = 0 # rounds
self.pre_loss = 0 # Previous round losses
self.best_loss = 999999999.9 # Optimal loss
self.since = time.time() # Duration of a run
def on_epoch_end(self, model):
self.epoch += 1
cum_loss = model.get_latest_training_loss()  # the returned value is cumulative from the first epoch
epoch_loss = cum_loss - self.pre_loss # epoch-loss = current loss - previous round loss
time_taken = time.time() - self.since # Duration
print("Epoch %d, loss: %.2f, time: %dmin %ds" %
(self.epoch, epoch_loss, time_taken//60, time_taken%60)) Print the result of a round in minutes
Record best_loss and early_stop by best_loss
if self.best_loss > epoch_loss:
self.best_loss = epoch_loss
print("Better model. Best loss: %.2f" % self.best_loss) # Print the best loss
model.save(self.save_path) # Save the model
print("Model %s save done!" % self.save_path)
self.pre_loss = cum_loss
self.since = time.time()
# The following code loads the trained word vector
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
Create the Word2Vec trainer and use build_vocab to load the tokens into the vocabulary.
model_word2vec = gensim.models.Word2Vec(min_count=1,
window=5,
size=256,
workers=4,
batch_words=1000)
since = time.time()
model_word2vec.build_vocab(all_doc, progress_per=2000)
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
- Train the word vector and save
since = time.time()
model_word2vec.train(all_doc, total_examples=model_word2vec.corpus_count,
epochs=20, compute_loss=True, report_delay=60*10,
callbacks=[EpochSaver('./final_word2vec_model')])
time_elapsed = time.time() - since
print('Time to train: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
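A quick sanity check after training, assuming the model was saved to ./final_word2vec_model as above (gensim 3.x API):
model_check = gensim.models.Word2Vec.load('./final_word2vec_model')
print(len(model_check.wv.vocab))                        # vocabulary size
some_word = list(model_check.wv.vocab)[0]               # any word from the vocabulary
print(model_check.wv[some_word].shape)                  # (256,) - a single word vector
print(model_check.wv.most_similar(some_word, topn=3))   # its three nearest neighbours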
1.5 Encoding with the tf.keras Tokenizer
Tokenizer is TensorFlow's tokenizer class; it encapsulates the word-to-index dictionary and related utilities.
Refer to the blog: dengbocong.blog.csdn.net/article/det…
# Build and fit the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_title + test_title)
# tokenizer.fit_on_texts(train_content + test_content)
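To see what fit_on_texts builds, here is a toy run on made-up token lists (the Tokenizer accepts lists of tokens as well as raw strings):
demo_tok = Tokenizer()
demo_tok.fit_on_texts([['i', 'love', 'nlp'], ['nlp', 'is', 'fun']])   # already-tokenized texts
print(demo_tok.word_index)                                  # {'nlp': 1, 'i': 2, 'love': 3, 'is': 4, 'fun': 5}
print(demo_tok.texts_to_sequences([['i', 'love', 'nlp']]))  # [[2, 3, 1]]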
1.6 Constructing the embedding matrix
from tqdm import tqdm
# Build the word vector matrix using the newly trained word2vec model
vocab_size = len(tokenizer.word_index)  # vocabulary size
error_count = 0
embedding_matrix = np.zeros((vocab_size + 1, 256))
for word, i in tqdm(tokenizer.word_index.items()):
if word in model_word2vec.wv:
embedding_matrix[i] = model_word2vec.wv[word]
else:
error_count += 1
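A small check of the matrix built above, reusing the variables from this section:
print(embedding_matrix.shape)  # (vocab_size + 1, 256); row 0 stays all zeros and acts as the padding row
print('words missing from the word2vec vocabulary:', error_count)
print('embedding coverage: %.2f%%' % (100 * (1 - error_count / vocab_size)))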
1.7 Padding
- Padding pads short sequences with zeros and truncates long ones so that every sequence has a fixed length
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequence = tokenizer.texts_to_sequences(train_title)
traintitle = pad_sequences(sequence, maxlen=30)
sequence = tokenizer.texts_to_sequences(test_title)
testtitle = pad_sequences(sequence, maxlen=30)
# sequence = tokenizer.texts_to_sequences(train_content)
# traincontent = pad_sequences(sequence, maxlen=512)
# sequence = tokenizer.texts_to_sequences(test_content)
# testcontent = pad_sequences(sequence, maxlen=512)
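A toy example of what pad_sequences does with maxlen (made-up id sequences):
print(pad_sequences([[1, 2], [3, 4, 5, 6]], maxlen=3))
# [[0 1 2]   <- the short sequence is padded with zeros on the left (padding='pre' by default)
#  [4 5 6]]  <- the long sequence is truncated from the front (truncating='pre' by default)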
2. Construct the model
- zhuanlan.zhihu.com/p/95293440: notes on the accuracy metrics in Keras
2.1 BiLSTM
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers
model = Sequential([
layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
output_dim=256,
input_length=30,
weights=[embedding_matrix]),
layers.Bidirectional(LSTM(32, return_sequences = True)),
layers.GlobalMaxPool1D(),
layers.Dense(20, activation="relu"),
layers.Dropout(0.05),
layers.Dense(3, activation="softmax"),
])
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['categorical_accuracy'])
model.summary()
2.2 TextCNN
The Attention layer below is adapted from code written by others.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Input, Model,backend as K
from tensorflow.keras.layers import Embedding, Dense, Attention, Bidirectional, LSTM
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer
class Attention(Layer):
def __init__(self, step_dim,
W_regularizer=None, b_regularizer=None,
W_constraint=None, b_constraint=None,
bias=True, **kwargs):
"""
Keras Layer that implements an Attention mechanism for temporal data.
Supports Masking. Follows the work of Raffel et al.
[https://arxiv.org/abs/1512.08756]
# Input shape
3D tensor with shape: `(samples, steps, features)`.
# Output shape
2D tensor with shape: `(samples, features)`.
:param kwargs:
Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
The dimensions are inferred based on the output shape of the RNN.
Example:
# 1
model.add(LSTM(64, return_sequences=True))
model.add(Attention())
# next add a Dense layer (for classification/regression) or whatever...
# 2
hidden = LSTM(64, return_sequences=True)(words)
sentence = Attention()(hidden)
# next add a Dense layer (for classification/regression) or whatever...
"""
self.supports_masking = True
self.init = initializers.get('glorot_uniform')
self.W_regularizer = regularizers.get(W_regularizer)
self.b_regularizer = regularizers.get(b_regularizer)
self.W_constraint = constraints.get(W_constraint)
self.b_constraint = constraints.get(b_constraint)
self.bias = bias
self.step_dim = step_dim
self.features_dim = 0
super(Attention, self).__init__(**kwargs)
def build(self, input_shape):
assert len(input_shape) == 3
self.W = self.add_weight(shape=(input_shape[-1],),
initializer=self.init,
name='{}_W'.format(self.name),
regularizer=self.W_regularizer,
constraint=self.W_constraint)
self.features_dim = input_shape[-1]
if self.bias:
self.b = self.add_weight(shape=(input_shape[1],),
initializer='zero',
name='{}_b'.format(self.name),
regularizer=self.b_regularizer,
constraint=self.b_constraint)
else:
self.b = None
self.built = True
def compute_mask(self, input, input_mask=None):
# do not pass the mask to the next layers
return None
def call(self, x, mask=None):
features_dim = self.features_dim
step_dim = self.step_dim
e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
if self.bias:
e += self.b
e = K.tanh(e)
a = K.exp(e)
# apply mask after the exp. will be re-normalized next
if mask is not None:
# cast the mask to floatX to avoid float64 upcasting in theano
a *= K.cast(mask, K.floatx())
# in some cases especially in the early stages of training the sum may be almost zero
# A workaround is to add a very small positive number ε to the sum.
a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
a = K.expand_dims(a)
c = K.sum(a * x, axis=1)
return c
def compute_output_shape(self, input_shape):
return input_shape[0], self.features_dim
TextCNN
# from keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout
class TextCNN(object):
def __init__(self, maxlen, max_features, embedding_dims,
class_num=1,
last_activation='sigmoid'):
self.maxlen = maxlen
self.max_features = max_features
self.embedding_dims = embedding_dims
self.class_num = class_num
self.last_activation = last_activation
def get_model(self):
input = Input((self.maxlen,))
# Embedding part can try multichannel as same as origin paper
embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen,
weights=[embedding_matrix])(input)
convs = []
for kernel_size in [3, 4, 5]:
c = Conv1D(128, kernel_size, activation='relu')(embedding)
c = GlobalMaxPooling1D()(c)
convs.append(c)
x = Concatenate()(convs)
output = Dense(self.class_num, activation=self.last_activation)(x)
model = Model(inputs=input, outputs=output)
return model
model = TextCNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
embedding_dims=256, class_num=3, last_activation='softmax').get_model()
# metric_F1score is defined in section 3.1 below
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy', metric_F1score])
model.summary()
2.3 Attention-BiLSTM
Attention-BiLSTM
class TextAttBiRNN(object):
def __init__(self, maxlen, max_features, embedding_dims,
class_num=1,
last_activation='sigmoid'):
self.maxlen = maxlen
self.max_features = max_features
self.embedding_dims = embedding_dims
self.class_num = class_num
self.last_activation = last_activation
def get_model(self):
input = Input((self.maxlen,))
embedding = Embedding(self.max_features, self.embedding_dims,
input_length=self.maxlen, weights=[embedding_matrix])(input)
x = Bidirectional(LSTM(128,return_sequences=True))(embedding) # LSTM or GRU
x = Attention(self.maxlen)(x)
output = Dense(self.class_num, activation=self.last_activation)(x)
model = Model(inputs=input, outputs=output)
return model
pass
model = TextAttBiRNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
embedding_dims=256, class_num=3, last_activation='softmax').get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()
3. Model training
3.1 Evaluation Criteria
import tensorflow as tf
# F1 value indicator
def metric_F1score(y_true, y_pred):
TP=tf.reduce_sum(y_true*tf.round(y_pred))
TN=tf.reduce_sum((1-y_true)*(1-tf.round(y_pred)))
FP=tf.reduce_sum((1-y_true)*tf.round(y_pred))
FN=tf.reduce_sum(y_true*(1-tf.round(y_pred)))
precision=TP/(TP+FP)
recall=TP/(TP+FN)
F1score=2*precision*recall/(precision+recall)
return F1score
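A quick check of the metric on a tiny made-up batch; with categorical_crossentropy, y_true is one-hot and y_pred is the softmax output:
y_true_demo = tf.constant([[1., 0., 0.], [0., 1., 0.]])
y_pred_demo = tf.constant([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])
print(float(metric_F1score(y_true_demo, y_pred_demo)))  # 1.0 - both rounded predictions match the labels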
3.2 Training set segmentation
- Input: traintitle, the padded title sequences produced above
- Output: label, taken from the original CSV
- Split ratio: training set : validation set == 4 : 1
import tensorflow as tf
from sklearn.model_selection import train_test_split
label = train['label'].astype(int)
train_X, val_X, train_Y, val_Y = train_test_split(traintitle, label, shuffle=True, test_size=0.2,random_state=42)
# to_categorical one-hot encodes the labels, needed because the loss is categorical_crossentropy
# with sparse_categorical_crossentropy the labels would not need to be converted
train_Y = tf.keras.utils.to_categorical(train_Y)
3.3 Model training
- Adjust the other parameters as you like
# Model training
history = model.fit(train_X, train_Y, batch_size=128, epochs=10, validation_split=0.1, validation_freq=1)
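If you want early stopping or checkpointing, Keras callbacks can also be passed to fit; a minimal sketch (the monitored value and patience are just example choices):
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
]
history = model.fit(train_X, train_Y, batch_size=128, epochs=10, validation_split=0.1, callbacks=callbacks)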
3.4 Validate model performance
from sklearn.metrics import f1_score
pred_val = model.predict(val_X)
print(f1_score(val_Y, np.argmax(pred_val, axis=1), average='macro'))
3.5 Visualization of loss and accuracy
import matplotlib.pyplot as plt
# Plot the loss and accuracy curves
def show_loss_acc_img(history):
# loss
plt.plot(history.history['loss'], label="$Loss$")
plt.plot(history.history['val_loss'], label='$val_loss$')
plt.title('Loss')
plt.xlabel('epoch')
plt.ylabel('num')
plt.legend()
plt.show()
# accuracy
plt.plot(history.history['categorical_accuracy'], label="categorical_accuracy")
plt.plot(history.history['val_categorical_accuracy'], label='val_categorical_accuracy')
plt.title('Accuracy')
plt.xlabel('epoch')
plt.ylabel('num')
plt.legend()
plt.show()
pass
show_loss_acc_img(history)
3.6 Predict the sentiment polarity of the test set
# Predict test set polarity
pred_val = model.predict(testtitle)
# Save the forecast file
submission = pd.DataFrame(test.id.values,columns=["id"])
submission["label"] = np.argmax(pred_val, axis=1)
submission.to_csv("submission.csv",index=False)
Practical code you can use straight away
1. Use regex to remove HTML and other symbols from text
import re
# text filter function
def text_filter(text):
# re.sub(pattern, replacement, string)
text = re.sub(r"[a-zA-Z0-9!=\?%\[\],\(\)><:</#.\-_]", "", text)
text = text.replace('images', '')
text = text.replace('\xa0', '')  # remove the non-breaking space (&nbsp;)
# Remove HTML tags
cleanr = re.compile('<.*?>')
text = re.sub(cleanr, ' ', text)
# Remove other punctuation characters (Chinese and English)
r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\\\，。=？、：“”‘’￥……（）《》【】]"
text = re.sub(r1, ' ', text)
# Remove whitespace at the beginning and end of the string
text = text.strip()
return text
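A quick usage example on a made-up HTML snippet:
sample = '<p>Breaking News 2021!!!</p> 这 是 一条 新闻'
print(text_filter(sample))  # the tag, English words, digits and punctuation are stripped; the Chinese text remains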
2. Use gensim to train your own word vectors
Reference blog:
[1] www.jianshu.com/p/5f04e97d1… Plotting word vectors with t-SNE dimensionality reduction
[2] www.cnblogs.com/johnnyzen/p… gensim.models.Word2Vec parameter description
import gensim
import time
from sklearn.manifold import TSNE
from matplotlib.font_manager import *
import matplotlib.pyplot as plt
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
# Callback for saving the model and printing the loss after each epoch
def __init__(self, save_path):
self.save_path = save_path # model storage path
self.epoch = 0 # rounds
self.pre_loss = 0 # Previous round losses
self.best_loss = 999999999.9 # Optimal loss
self.since = time.time() # Duration of a run
def on_epoch_end(self, model):
self.epoch += 1
cum_loss = model.get_latest_training_loss()  # the returned value is cumulative from the first epoch
epoch_loss = cum_loss - self.pre_loss # epoch-loss = current loss - previous round loss
time_taken = time.time() - self.since # Duration
print("Epoch %d, loss: %.2f, time: %dmin %ds" %
(self.epoch, epoch_loss, time_taken // 60, time_taken % 60)) Print the result of a round in minutes
Record best_loss and early_stop by best_loss
if self.best_loss > epoch_loss:
self.best_loss = epoch_loss
print("Better model. Best loss: %.2f" % self.best_loss) # Print the best loss
model.save(self.save_path) # Save the model
print("Model %s save done!" % self.save_path)
self.pre_loss = cum_loss
self.since = time.time()
pass
# Load a previously trained word vector model
def load_model_word2vec(save_path):
model_word2vec = gensim.models.Word2Vec.load(save_path)
# The following code loads the trained word vector
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
return model_word2vec
def print_since_time(since):
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
def show_word2vec_2D(model_word2vec, random_word):
# model_word2vec.wv[random_word]: random_word must be a list of word strings
X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(model_word2vec.wv[random_word])
# Fix the issue where the minus sign '-' is displayed as a box
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(14, 8))
myfont = FontProperties(fname=r'C:\Windows\Fonts\simsun.ttc')  # load a Chinese font
plt.scatter(X_tsne[:, 0], X_tsne[:, 1]) # create scatter chart
for i in range(len(X_tsne)):
x = X_tsne[i][0]
y = X_tsne[i][1]
plt.text(x, y, random_word[i], fontproperties=myfont, size=16) # output coordinate label
plt.show()
pass
if __name__=="__main__":
# input_doc: a list of tokenized documents (each document is a list of tokens)
input_doc = [['`', 'advertising', 'contact', 'micro', 'signal', 'Flower capital region', 'rent', 'full', 'a', '有望',
'make sure', 'degree', 'information', 'The Times', 'journalists', 'Cui Xiaoyuan', 'recently', 'release',
'hereinafter referred to as', 'know', 'this year', 'Flower capital region', 'the public', 'primary', 'plan',
'recruit', 'class', 'the public', 'middle school', 'plan', 'recruit', 'class', '; ', 'Private primary School',
'Flower capital region', 'class', 'private', 'middle school', 'plan', 'recruit', 'class', 'compared with',
'years', 'admissions', 'rules', 'this year', 'admissions', 'scale', 'the general', 'change', 'no', 'big',
'plan', 'recruit', 'know', 'Flower capital region', 'admissions', 'time', 'arrangement', 'month', 'day', '~',
'month', 'day', 'Flower capital region', 'the public', 'primary', 'online', 'sign up', '; ', 'Education Bureau',
'~', 'integral', 'entrance', 'online', 'sign up', '; ', 'month', 'day', '~', 'month', 'day',
'Flower capital region', 'Private primary School', 'online', 'sign up', '; ', '~', 'Flower capital region',
'community', 'supporting', 'the owner', 'not', 'Guangzhou', 'hukou', 'age', 'child', 'sign up', 'security',
'区内', 'clear', 'the future', 'ten years', 'Tenant', 'child', 'entrance', '方面', 'proposed', 'a', 'Guangzhou',
'hukou', 'with', 'Policy', 'take care', 'born', 'Guangzhou', 'no', 'their own', 'property', 'home', 'with',
'in urban and rural areas', 'Self-built', 'rent', 'home', 'location', 'the only', 'Place of Residence', 'home',
'rent', 'contract', 'registration', 'the record', 'row', 'full', 'a', '截止', 'date', 'application', 'entrance',
'inner', 'Year month day', 'more than', 'application', 'when', 'rent', 'contract', 'effective', 'state',
'Tenant', 'age', 'child', 'Flower capital region', 'Education Bureau', 'make sure', 'degree', 'supply', 'years',
'当中', '已经', 'zengcheng', 'flowers', 'hometown', '张', 'bed', 'conditions', 'set up', 'professional',
'Mental illness', 'hospital', 'source', 'flowers', 'morning', 'District Health Bureau', 'guangzhou', 'flowers',
'release', 'today', 'flowers', 'flowers', 'job', 'recruitment', 'group', 'add', 'when', 'note', 'move on',
'job', 'small make up', 'wages', 'Thumb', 'link', 'point', 'a', 'A penny', '求', 'exceptional', 'remember', 'and']]
# ----------------------- Train the model -----------------------
model_word2vec = gensim.models.Word2Vec(min_count=1,
window=5,
size=256,
workers=4,
batch_words=1000)
since = time.time() # Start the timer
model_word2vec.build_vocab(input_doc, progress_per=2000)  # progress_per: how often to report progress while scanning the corpus
print_since_time(since) # Time is over, printing time
since = time.time()
model_word2vec.train(input_doc,
total_examples=model_word2vec.corpus_count,
epochs=20,
compute_loss=True,
report_delay=60 * 10,
callbacks=[EpochSaver('./final_word2vec_model')]) # model_word2vec model storage
print_since_time(since) # Time is over, printing time
# Plot the word vectors in 2D
show_word2vec_2D(model_word2vec, input_doc[0])
# model_word2vec = load_model_word2vec('./final_word2vec_model')
# print(model_word2vec)
# # Compute the similarity between two words
# y2 = model_word2vec.wv.similarity(u"rent", u"rent")
# print(y2)
# # Print the words most similar to a given word
# for i in model_word2vec.wv.most_similar(u"build"):
#     print(i[0], i[1])
The word vector visualization produced with t-SNE dimensionality reduction.