• 📍 From the column: Natural Language Processing (NLP) Example Tutorials
• 📍 Recommended column: "100 Cases of Deep Learning"

LSTM is an advanced variant of RNN. If an RNN can at best understand a sentence, then an LSTM can at best understand a paragraph. Details follow:

LSTM, short for Long Short-Term Memory network, is a special kind of RNN that can learn long-term dependencies. It was proposed by Hochreiter & Schmidhuber (1997) and subsequently improved and popularized by many researchers. LSTM works very well on many problems and is now widely used.

All recurrent neural networks take the form of a chain of repeating neural network modules. In an ordinary RNN the repeating module has a very simple structure, such as a single tanh layer.

LSTM avoids the long-term dependency problem: it can remember information over long spans. Internally, the LSTM cell has a more complex structure. Through its gates it selects and adjusts the information being passed along, retaining what needs to be remembered long-term and forgetting what is unimportant.
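To make the gating concrete, below is a minimal NumPy sketch of a single LSTM time step. The gate layout and weight shapes are illustrative assumptions (Keras, for example, stacks the gates in a different order), not the exact parameterization of any library.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the four gates stacked together:
    # forget (f), input (i), candidate (g), output (o).
    z = x_t @ W + h_prev @ U + b      # shape: (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g          # forget old memory, admit new memory
    h_t = o * np.tanh(c_t)            # expose a filtered view of the memory
    return h_t, c_t

# Toy usage: input size 3, hidden size 2
rng = np.random.default_rng(0)
h, c = np.zeros(2), np.zeros(2)
h, c = lstm_step(rng.normal(size=3), h, c,
                 rng.normal(size=(3, 8)), rng.normal(size=(2, 8)),
                 rng.normal(size=8))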

The basic flow of the neural network program is as follows:

1. Import and analyze the data
2. Preprocess the data (shuffle, segment words, remove stop words)
3. Convert the text to vectors with Word2Vec
4. Split into training and test sets
5. Build and train the LSTM model
6. Predict sentiment on new text



Preliminary work

Import data

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager
from itertools import accumulate
# Support Chinese
plt.rcParams['font.sans-serif'] = ['SimHei']  # used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # is used to display the minus sign normally

df = pd.read_csv('data_single.csv')

df.head()

evaluation label
0 I’ve been using it for a while, and it feels good. Yeah positive
1 The TV was so good that it was the second one in the house. The first day to place an order, the next day to the local, but the logistics person said the car is broken, has been urging… positive
2 The TV is much bigger than imagined, the picture is very clear, the system is very intelligent, and more functions are still being explored positive
3 good positive
4 I’ve been using it for so many days, and it feels good. Sharp’s brand is more reliable. I hope it will be more durable in the future. Now is the time to consider the quality. positive

Data analysis

df.groupby('label')["evaluation"].count()
label
negative    2375
positive    1908
Name: evaluation, dtype: int64
df.label.value_counts().plot(kind='pie', autopct='%0.05f%%', colors=['lightblue', 'lightgreen'], explode=(0.01, 0.01))

df['length'] = df['evaluation'].apply(lambda x: len(x))
df.head()

evaluation label length
0 I’ve been using it for a while, and it feels good. Yeah positive 15
1 The TV was so good that it was the second one in the house. The first day to place an order, the next day to the local, but the logistics person said the car is broken, has been urging… positive 97
2 The TV is much bigger than imagined, the picture is very clear, the system is very intelligent, and more functions are still being explored positive 33
3 good positive 2
4 I’ve been using it for so many days, and it feels good. Sharp’s brand is more reliable. I hope it will be more durable in the future. Now is the time to consider the quality. positive 46
len_df      = df.groupby('length').count()
sent_length = len_df.index.tolist()
sent_freq   = len_df['evaluation'].tolist()

# Draw a bar chart of sentence length versus frequency
plt.bar(sent_length, sent_freq)
plt.title("Statistical chart of Sentence Length and Frequency")
plt.xlabel("Sentence Length")
plt.ylabel("Frequency of sentence length")
plt.show()

# Compute the cumulative distribution function (CDF) of sentence length
sent_pentage_list = [(count / sum(sent_freq)) for count in accumulate(sent_freq)]

# Find the sentence length at the given quantile
quantile = 0.9
for length, per in zip(sent_length, sent_pentage_list):
    if per >= quantile:  # first length whose cumulative share reaches the quantile
        index = length
        break
print("\nSentence length at quantile %s: %d." % (quantile, index))

# Draw the cumulative distribution function of sentence length
plt.plot(sent_length, sent_pentage_list)
plt.hlines(quantile, 0, index, colors="g", linestyles="--")
plt.vlines(index, 0, quantile, colors="g", linestyles="--")
plt.text(0, quantile, str(quantile))
plt.text(index, 0, str(index))
plt.title("Cumulative Distribution Function of Sentence Length")
plt.xlabel("Sentence Length")
plt.ylabel("Cumulative frequency of sentence length")
plt.show()
Sentence length at quantile 0.9: 172.

Data preprocessing

Shuffle the data

Shuffle the positive and negative samples together:

df = df.sample(frac=1)
df.head()

evaluation label length
1595 The TV is good! I didn’t expect it to arrive so soon! Praise a! Sun sheet: fold up, turn left, turn right positive 35
2305 Don’t make an appointment at all ah, 1198 directly bought, non-traditional brand, feeling rough and uncomfortable, bite the teeth to place an order to try, receiving goods tested the screen is ok… negative 134
3831 The screen is good, but the sound is so-so. You need a stereo to watch the movie negative 20
2119 The price is slightly higher than last year, but the picture is good and narwhal service is in place negative 23
3687 This TV has no intelligent voice experience, only some simple operation, should not corrupt hundreds of cheap, buy millet is good, switch machine is not sensitive… negative 166

Word segmentation

import jieba

df['words'] = df["evaluation"].apply(jieba.lcut)
df.head()
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.434 seconds.
Prefix dict has been built successfully.

evaluation label length words
1595 The TV is good! I didn’t expect it to arrive so soon! Praise a! Sun sheet: fold up, turn left, turn right positive 35 [TV, pretty, good,!!, unexpectedly, soon, received,!!, great, a…
2305 Don’t make an appointment at all ah, 1198 directly bought, non-traditional brand, feeling rough and uncomfortable, bite the teeth to place an order to try, receiving goods tested the screen is ok… negative 134 [not at all booking ah direct buy non-traditional brand…
3831 The screen is good, but the sound is so-so. You need a stereo to watch the movie negative 20 [Screen, good, but sound, general, ah, watch, also, have to match, sound]
2119 The price is slightly higher than last year, but the picture is good and narwhal service is in place negative 23 The price is slightly higher than last year but the picture effect is good narwhal
3687 This TV has no intelligent voice experience, only some simple operation, should not corrupt hundreds of cheap, buy millet is good, switch machine is not sensitive… negative 166 [Well, TV, no intelligence, voice, experience, just some, simple, operation…

Remove stop words

with open("hit_stopwords.txt", "r", encoding='utf-8') as f:
    stopwords = f.readlines()

stopwords_list = []
for each in stopwords:
    stopwords_list.append(each.strip('\n'))

# Add custom stop words
stopwords_list += ["...", "To go", "Also", "The west", "A", "Month", "Year", ".", "All"]

def remove_stopwords(ls):  # Remove stop words from a token list
    return [word for word in ls if word not in stopwords_list]

df['Data after removing stop words'] = df["words"].apply(remove_stopwords)
df.head()

evaluation label length words Data after removing stop words
1595 The TV is good! I didn’t expect it to arrive so soon! Praise a! Sun sheet: fold up, turn left, turn right positive 35 [TV, pretty, good,!!, unexpectedly, soon, received,!!, great, a… [TV, pretty, good, didn’t expect, soon, copy, great,, sun sheet, pack up…
2305 Don’t make an appointment at all ah, 1198 directly bought, non-traditional brand, feeling rough and uncomfortable, bite the teeth to place an order to try, receiving goods tested the screen is ok… negative 134 [not at all booking ah direct buy non-traditional brand… [At all, no use, reservation, 1198, direct, buy, non-traditional, brand, mood, rough, upset,…
3831 The screen is good, but the sound is so-so. You need a stereo to watch the movie negative 20 [Screen, good, but sound, general, ah, watch, also, have to match, sound] [Screen, good, sound, watch, match, sound]
2119 The price is slightly higher than last year, but the picture is good and narwhal service is in place negative 23 The price is slightly higher than last year but the picture effect is good narwhal [Price, slightly, last year, high, picture, effect, good, narwhal, whale, service in place]
3687 This TV has no intelligent voice experience, only some simple operation, should not corrupt hundreds of cheap, buy millet is good, switch machine is not sensitive… negative 166 [Well, TV, no intelligence, voice, experience, just some, simple, operation… TV, nothing, intelligent, voice, experience, simple, operation, shouldn’t, greedy, hundreds of dollars, cheap, buy…

Word2vec processing

Word2Vec is a model for generating word vectors, i.e. a tool that converts words into vector form. After the conversion, processing text content reduces to vector operations in a vector space, and similarity in that space can stand in for the semantic similarity of texts (a quick sanity check of this appears after the training code below).

from gensim.models.word2vec  import Word2Vec
import numpy as np

x = df['Data after removing stop words']

# Train the Word2Vec shallow neural network model
w2v = Word2Vec(vector_size=300,  # dimension of the word vectors (default 100)
               min_count=10)     # words with frequency below min_count are discarded (default 5)
w2v.build_vocab(x)
w2v.train(x,
          total_examples=w2v.corpus_count,
          epochs=20)
# Save the Word2Vec model and word vectors
w2v.save('w2v_model.pkl')

# Convert a text (list of tokens) to a single vector by accumulating its word vectors
def average_vec(text):
    vec = np.zeros(300).reshape((1, 300))
    for word in text:
        try:
            vec += w2v.wv[word].reshape((1, 300))
        except KeyError:  # skip words not in the Word2Vec vocabulary
            continue
    return vec

# Stack the text vectors into one ndarray
x_vec = np.concatenate([average_vec(z) for z in x])
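As a quick sanity check of the similarity claim above, you can query the trained model directly. This is a minimal sketch; the probe word is a hypothetical token, so substitute one that actually survived the min_count cutoff in your corpus.

# Minimal sketch: inspect the trained vectors (probe word is a placeholder)
probe = 'TV'
if probe in w2v.wv:
    print(w2v.wv.most_similar(probe, topn=5))  # nearest neighbours in vector space
else:
    print('probe word not in vocabulary')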
# One-hot encode the labels (get_dummies sorts the columns alphabetically: negative, positive)
y = pd.get_dummies(df['label']).values
y[:5]
array([[1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1]], dtype=uint8)

Split into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_vec, y, test_size=0.2)
from keras.models          import Sequential
from keras.layers          import Dense,LSTM,Bidirectional,Embedding

# Define model
model = Sequential()
model.add(Embedding(100000, 100))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 100)         10000000
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400
_________________________________________________________________
dense (Dense)                (None, 2)                 202
=================================================================
Total params: 10,080,602
Trainable params: 10,080,602
Non-trainable params: 0
_________________________________________________________________
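As a sanity check, these parameter counts follow directly from the layer sizes: the embedding holds 100000 × 100 = 10,000,000 weights; the LSTM holds 4 × ((100 + 100) × 100 + 100) = 80,400 (four gates, each with input weights, recurrent weights, and a bias); and the dense layer holds 100 × 2 + 2 = 202, for a total of 10,080,602.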
epochs = 100
batch_size = 64

history = model.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size,
                    validation_split=0.2)
Epoch 1/100
43/43 [==============================] - 39s 868ms/step - loss: 0.6140 - accuracy: 0.7051 - val_loss: 0.5903 - val_accuracy: 0.6910
Epoch 2/100
43/43 [==============================] - 37s 860ms/step - loss: 0.5555 - accuracy: 0.7288 - val_loss: 0.5735 - val_accuracy: 0.7172
Epoch 3/100
43/43 [==============================] - 37s 858ms/step - loss: 0.5421 - accuracy: 0.736 - val_loss: 0.5713 - val_accuracy: 0.7099
......
43/43 [==============================] - 36s 848ms/step - loss: 0.1602 - accuracy: 0.9409 - val_loss: 0.5864 - val_accuracy: 0.8338
Epoch 99/100
43/43 [==============================] - 36s 849ms/step - loss: 0.1757 - accuracy: 0.9292 - val_loss: 0.5823 - val_accuracy: 0.8338
Epoch 100/100
43/43 [==============================] - 36s 847ms/step - loss: 0.1410 - accuracy: 0.9467 - val_loss: 0.6176 - val_accuracy: 0.8222
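The gap between the final training accuracy (about 0.95) and validation accuracy (about 0.82) hints at overfitting, which is easier to see plotted. Below is a minimal sketch using the history object returned by model.fit and the matplotlib setup from earlier:

# Minimal sketch: plot the accuracy curves recorded by model.fit
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.title("Training vs. Validation Accuracy")
plt.xlabel("Epoch")
plt.legend()
plt.show()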

Sentiment prediction

# Load the saved Word2Vec model once, then compute word vectors for new input
w2v = Word2Vec.load('w2v_model.pkl')

def average_vec(words):
    vec = np.zeros(300).reshape((1, 300))
    for word in words:
        try:
            vec += w2v.wv[word].reshape((1, 300))
        except KeyError:  # skip out-of-vocabulary words
            continue
    return vec

# Predict the sentiment of a review
def model_predict(string):
    # Segment the review into words
    words = jieba.lcut(str(string))
    words_vec = average_vec(words)
    result = np.argmax(model.predict(words_vec))
    # Return a positive or negative verdict
    # (index 1 corresponds to the 'positive' one-hot column)
    if int(result) == 1:
        return "Positive"
    else:
        return "Negative"

# Test with 10 rows of data
comment_sentiment = []
for index, row in df.iloc[:10].iterrows():
    print(row["evaluation"], end="|")
    result = model_predict(row['Data after removing stop words'])
    comment_sentiment.append(result)
    print(result)

# Merge the sentiment results with the original data into a new DataFrame
merged = pd.concat([df, pd.Series(comment_sentiment, name='User sentiment')], axis=1)
# Save to file
merged.to_csv('comment_sentiment.csv', encoding="utf-8-sig")
print('done.')

Our final predictions are as follows:

The TV is good! I didn't expect it to arrive so soon! Praise a! Sun sheet: fold up, turn left, turn right | Positive
Don't make an appointment at all ah, 1198 directly bought, non-traditional brand, feeling rough and uncomfortable, bit the teeth to place an order to try; on receiving the goods, tested the screen, nothing wrong with the fit, four screws. The first three boots were very slow, the longest almost 5 minutes; hurriedly called customer service, it may have been a system update, then it was fine; standby boot is fast, satisfied with sound and picture. That's a relief. Connecting a mobile hard disk, it can even play the original disk, strong ah! | Positive
The screen is good, but the sound is so-so; you need a stereo to watch movies | Negative
The price is slightly higher than last year, but the picture effect is good, and the narwhal service experience is in place | Negative
This TV has no intelligent voice experience, only some simple operations; should not have been greedy for a few hundred yuan of savings, buying millet would have been better; switching the machine on and off is not responsive, the border is not metal, the base is the simplest one I have ever seen, the screen background is dim and hard to see clearly, and I don't know what the designer was thinking. Got up in the middle of the night to grab it at 12 o'clock. Suggest you consider other brands; the advantage is it is cheap with high configuration, but I still want to buy a good TV | Positive
The price is somewhat higher than before; delivery speed and service are commendable; the white color is nice; the machine operation interface and setup are fine; watching movies can be reluctantly accepted | Positive
The TV content is rich and the sharpness is good. The voice search function is not perfect, with a reaction time of 1-2 seconds. I hope subsequent updates can enhance it further. | Positive
The TV is very good, and the price is also affordable! | Positive
Fast shipping with free delivery; had long since chosen it; membership gives a year (you need to upgrade the system to the latest version to get it yo); fast startup, nothing like 5 minutes to turn on; the analog signal interface is a bit weak, but who uses this TV to watch analog signals; usb3.0 is also great; 16GB of ram is enough for a TV; the screen is perfect, with a little light leakage around the edges (more obvious on pure black at a slight tilt, I do not know if that judgment is professional); in short, good value! | Negative
The TV was broken before; just exchanged the goods today! I hope there is no problem! | Negative
done.