I have been working with deep learning for over a year, and recently started on NLP (natural language processing). I am taking this opportunity to write a series of practical courses on deep learning for NLP machine translation.

This series will go from principles and data processing to hands-on practice and application deployment, and includes the following installments (still being updated):

  • NLP Machine Translation Deep Learning Practical Course · Zero (Basic Concepts)
  • NLP Machine Translation Deep Learning Practical Course · One (RNN Base)
  • NLP Machine Translation Deep Learning Practical Course · Two (RNN+Attention Base)
  • NLP Machine Translation Deep Learning Practical Course · Three (CNN Base)
  • NLP Machine Translation Deep Learning Practical Course · Four (Self-attention Base)
  • NLP Machine Translation Deep Learning Practical Course · Five (Application Deployment)

For this tutorial, see the blog: me.csdn.net/chinateleco…

Open source: github.com/xiaosongshi…

Personal homepage: www.yansongsong.cn/

0. Project background

In the last article we briefly introduced NLP machine translation; this time we will walk through an RNN-based translation model in a hands-on way.

0.1 Introduction to the RNN-based seq2seq translation model

Seq2seq structure

The RNN-based seq2seq architecture consists of an encoder and a decoder, and the decoder is used differently at training time and at inference time. The specific structure is shown in the following two figures:

As you can see, the structure is quite simple (compared with the CNN- and attention-based models). Next we will explore the model's inner workings by implementing it in code.
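Before diving into the Keras implementation, here is a minimal, framework-free sketch of the control flow in those figures (`encode_step` and `decode_step` are hypothetical placeholders, not functions from this project). At training time the decoder consumes the ground-truth previous token (teacher forcing); at inference time, as sketched below, it feeds its own previous prediction back in:

```python
# Hypothetical sketch of seq2seq inference; encode_step / decode_step
# stand in for the RNN cells and are not part of this tutorial's code.
def translate(src_ids, encode_step, decode_step, go_id, eos_id, max_len=30):
    # Encoder: read the whole source sentence, keeping only the final state
    state = None
    for tok in src_ids:
        state = encode_step(tok, state)

    # Decoder (inference): start from <GO>, feed each prediction back in
    outputs, prev = [], go_id
    for _ in range(max_len):
        prev, state = decode_step(prev, state)
        if prev == eos_id:
            break
        outputs.append(prev)
    return outputs
```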

1. Data preparation

1.1 Downloading Data

 

A lot of translation data in many languages is available at www.manythings.org/anki/; the Chinese-English data set has been selected for this tutorial.

Training data download address: www.manythings.org/anki/cmn-en…

Unzip cmn-eng.zip and you will find the cmn.txt file, which looks like this:

```python
with open('cmn.txt', 'r', encoding='utf-8') as f:
    data = f.read().splitlines()
print(data[:4])
# ['Tom died.\t…', 'Tom quit.\t…', 'Tom swam.\t…', 'Trust me.\t…']
# (the Chinese half of each pair was garbled in the original capture)
```

Each translated pair is on one line, with English on the left and Chinese on the right, using \t as the boundary between them.
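For example, one line can be split into its two halves like this (the Chinese half is shown as a placeholder; newer dumps of this file may also append an attribution column, hence the `[:2]`):

```python
line = "Tom died.\t<Chinese translation>"  # placeholder pair from cmn.txt
eng, chn = line.split("\t")[:2]
print(eng)  # -> Tom died.
```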

1.2 Data Preprocessing

Training the network requires processing the data into a format the network can accept.

For this data, that specifically means converting characters into numbers (sentence digitization) and normalizing sentence lengths.
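As a toy illustration of both steps (the vocabulary and ids here are made up):

```python
# Hypothetical toy vocabulary
w_dict = {"<UNK>": 0, "<PAD>": 1, "i": 2, "love": 3, "you": 4}

ids = [w_dict.get(w, w_dict["<UNK>"]) for w in "i love you".split(" ")]
print(ids)                          # digitized: [2, 3, 4]
print(ids + [1] * (6 - len(ids)))   # length-normalized to 6: [2, 3, 4, 1, 1, 1]
```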

Sentence digitization

For the data preprocessing implementation, you can refer to my blog post: "Deep Application" NLP Named Entity Recognition (NER) Open-Source Hands-On Tutorial.

English and Chinese characters are processed separately.

English processing

Every word in English is separated by a space (except in abbreviations, which are treated as single words), but punctuation marks are not separated from the words they follow, so they need special treatment.

Here I use a simple method that inserts a space before each punctuation mark:

```python
def split_dot(strs, dots=", . ! ?"):
    # Insert a space before every punctuation mark so it becomes its own token
    for d in dots.split(" "):
        strs = strs.replace(d, " " + d)
    return strs
```
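For example:

```python
print(split_dot("Hi, Tom."))  # -> "Hi , Tom ."
```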

Using this method, split the sentences into tokens and build the word/index dictionaries:

```python
def get_eng_dicts(datas):
    # Count word frequencies (loop body partly reconstructed from fragments)
    w_all_dict = {}
    for sample in datas:
        for token in sample.split(" "):
            w_all_dict[token] = w_all_dict.get(token, 0) + 1
    # Keep the 7000-2 most frequent words; ids 0/1 are reserved for <UNK>/<PAD>
    sort_w_list = sorted(w_all_dict.items(), key=lambda d: d[1], reverse=True)
    w_keys = [x for x, _ in sort_w_list[:7000-2]]
    w_keys.insert(0, "<PAD>")
    w_keys.insert(0, "<UNK>")
    w_dict = {x: i for i, x in enumerate(w_keys)}
    i_dict = {i: x for i, x in enumerate(w_keys)}
    return w_dict, i_dict
```

Chinese language processing

When dealing with the Chinese side, you will find both traditional and simplified characters, so it is best to convert everything to one form (reference address):

```
pip install opencc-python-reimplemented
```

Then convert traditional Chinese to simplified Chinese:

```python
import opencc

cc = opencc.OpenCC('t2s')  # 't2s' = traditional to simplified
```
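For example:

```python
print(cc.convert("漢字"))  # -> "汉字"
```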

Then use jieba to segment the sentences into words:

```python
import jieba

def get_chn_dicts(datas):
    # Count word frequencies after jieba segmentation
    w_all_dict = {}
    for sample in datas:
        for token in jieba.cut(sample):
            w_all_dict[token] = w_all_dict.get(token, 0) + 1
    # Keep the 10000-4 most frequent words; 4 ids are reserved for
    # <UNK>/<PAD>/<GO>/<EOS>
    sort_w_list = sorted(w_all_dict.items(), key=lambda d: d[1], reverse=True)
    w_keys = [x for x, _ in sort_w_list[:10000-4]]
    w_keys.insert(0, "<EOS>")
    w_keys.insert(0, "<GO>")
    w_keys.insert(0, "<PAD>")
    w_keys.insert(0, "<UNK>")
    w_dict = {x: i for i, x in enumerate(w_keys)}
    i_dict = {i: x for i, x in enumerate(w_keys)}
    return w_dict, i_dict
```
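As a quick sanity check of the segmentation (this sentence is the example from jieba's own README; results can vary with the dictionary version):

```python
import jieba

print("/ ".join(jieba.cut("我来到北京清华大学")))  # -> 我/ 来到/ 北京/ 清华大学
```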

Now let's do the padding:

```python
def padding(lists, lens=LENS):
    # Pad every sequence with the <PAD> id (1) up to lens, truncating longer
    # ones (body reconstructed; the original post only kept the signature)
    list_ret = []
    for l in lists:
        l = l[:lens] + [1] * (lens - len(l))
        list_ret.append(l)
    return list_ret
```
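With the reconstruction above, padding behaves like this:

```python
print(padding([[5, 6, 7], [8]], lens=5))
# -> [[5, 6, 7, 1, 1], [8, 1, 1, 1, 1]]   (1 is the <PAD> id)
```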

Finally, tie all the processing steps together:

```python
import numpy as np

if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict, id2eng = get_eng_dicts(df["eng"])
    chn_dict, id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    # Encoder input; decoder input starts with <GO>, decoder target ends with <EOS>
    enc_in = [[get_val(e, eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>", chn_dict)] + [get_val(e, chn_dict) for e in jieba.cut(chn)] + [get_val("<EOS>", chn_dict)] for chn in df["chn"]]
    dec_out = [[get_val(e, chn_dict) for e in jieba.cut(chn)] + [get_val("<EOS>", chn_dict)] for chn in df["chn"]]

    enc_in_ar = np.array(padding(enc_in, 32))
    dec_in_ar = np.array(padding(dec_in, 30))
    dec_out_ar = np.array(padding(dec_out, 30))
```

The following output is displayed:

```
(TF_GPU) D:\Files\Prjs\Pythons\Kerases\MNT_RNN>C:/Datas/Apps/RJ/Miniconda3/envs/TF_GPU/python.exe d:/Files/Prjs/Pythons/Kerases/MNT_RNN/mian.py
Using TensorFlow backend.
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiaos\AppData\Local\Temp\jieba.cache
Loading model cost 0.788 seconds.
Prefix dict has been built succesfully.
['<UNK>', '<PAD>', '.', 'I', 'to', 'the', 'you', 'a', '?', 'is', 'Tom', 'He', 'in', 'of', 'me', ',', 'was', 'for', 'have', 'The']
['<UNK>', '<PAD>', '<GO>', '<EOS>', …]   (the first 20 Chinese tokens; the entries after the special symbols were garbled in the original capture)
```
2. Modeling and training

2.1 Model building and hyperparameters

A two-layer LSTM network is used:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding, CuDNNLSTM
from keras.optimizers import Adam

# From the model summary below: EN_VOCAB_SIZE = 7000, CH_VOCAB_SIZE = 10000,
# HIDDEN_SIZE = 256 (these constants are defined elsewhere in the script)

def get_model():
    # ----- Encoder: embedding + two CuDNNLSTM layers; keep both layers' final states -----
    encoder_inputs = Input(shape=(None,))
    emb_inp = Embedding(output_dim=128, input_dim=EN_VOCAB_SIZE)(encoder_inputs)
    encoder_h1, encoder_state_h1, encoder_state_c1 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)(emb_inp)
    encoder_h2, encoder_state_h2, encoder_state_c2 = CuDNNLSTM(HIDDEN_SIZE, return_state=True)(encoder_h1)

    # ----- Decoder (training): initialized with the encoder states, fed the target sequence -----
    decoder_inputs = Input(shape=(None,))
    emb_target = Embedding(output_dim=128, input_dim=CH_VOCAB_SIZE)(decoder_inputs)
    lstm1 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)
    lstm2 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)
    decoder_dense = Dense(CH_VOCAB_SIZE, activation='softmax')

    decoder_h1, _, _ = lstm1(emb_target, initial_state=[encoder_state_h1, encoder_state_c1])
    decoder_h2, _, _ = lstm2(decoder_h1, initial_state=[encoder_state_h2, encoder_state_c2])
    decoder_outputs = decoder_dense(decoder_h2)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # ----- Inference: a standalone encoder, and a decoder that steps one token at a time -----
    encoder_model = Model(encoder_inputs, [encoder_state_h1, encoder_state_c1, encoder_state_h2, encoder_state_c2])

    decoder_state_input_h1 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_c1 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_h2 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_c2 = Input(shape=(HIDDEN_SIZE,))

    decoder_h1, state_h1, state_c1 = lstm1(emb_target, initial_state=[decoder_state_input_h1, decoder_state_input_c1])
    decoder_h2, state_h2, state_c2 = lstm2(decoder_h1, initial_state=[decoder_state_input_h2, decoder_state_input_c2])
    decoder_outputs = decoder_dense(decoder_h2)

    decoder_model = Model(
        [decoder_inputs, decoder_state_input_h1, decoder_state_input_c1,
         decoder_state_input_h2, decoder_state_input_c2],
        [decoder_outputs, state_h1, state_c1, state_h2, state_c2])

    return model, encoder_model, decoder_model
```

2.2 Model configuration and training

A custom accuracy metric (my_acc) is defined to make the training display more informative; Keras's built-in accuracy cannot be used directly with this sparse, sequence-shaped output.

```python
import keras.backend as K
from keras.models import load_model

def my_acc(y_true, y_pred):
    # y_true has shape (batch, len, 1); squeeze the last axis with max and
    # compare with the argmax of the predicted distribution
    acc = K.cast(K.equal(K.max(y_true, axis=-1),
                         K.cast(K.argmax(y_pred, axis=-1), K.floatx())), K.floatx())
    return acc

if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict, id2eng = get_eng_dicts(df["eng"])
    chn_dict, id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    enc_in = [[get_val(e, eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>", chn_dict)] + [get_val(e, chn_dict) for e in jieba.cut(chn)] + [get_val("<EOS>", chn_dict)] for chn in df["chn"]]
    dec_out = [[get_val(e, chn_dict) for e in jieba.cut(chn)] + [get_val("<EOS>", chn_dict)] for chn in df["chn"]]

    enc_in_ar = np.array(padding(enc_in, 32))
    dec_in_ar = np.array(padding(dec_in, 30))
    dec_out_ar = np.array(padding(dec_out, 30))

    model, encoder_model, decoder_model = get_model()
    model.load_weights('e2c1.h5')  # resume from an existing checkpoint
    opt = Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.99, epsilon=1e-08)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=[my_acc])
    # The fit arguments were truncated in the original post; the batch size and
    # validation split here are placeholders (64 epochs per the results below)
    model.fit([enc_in_ar, dec_in_ar], np.expand_dims(dec_out_ar, -1),
              batch_size=64, epochs=64, validation_split=0.1)
    encoder_model.save("enc1.h5")
    decoder_model.save("dec1.h5")
```

The results after 64 epochs of training are as follows:

```
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 128)    896000      input_1[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 128)    1280000     input_2[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)        [(None, None, 256),  395264      embedding_1[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_3 (CuDNNLSTM)        [(None, None, 256),  395264      embedding_2[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)        [(None, 256), (None, 526336      cu_dnnlstm_1[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_4 (CuDNNLSTM)        [(None, None, 256),  526336      cu_dnnlstm_3[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 10000)  2570000     cu_dnnlstm_4[0][0]
==================================================================================================
19004/19004 [==============================] - 98s 5ms/step - loss: 0.1371 - my_acc: 0.9832 - val_loss: 2.7299 - val_my_acc: 0.7412
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1234 - my_acc: 0.9851 - val_loss: 2.7378 - val_my_acc: 0.7410
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1132 - my_acc: 0.9867 - val_loss: 2.7477 - val_my_acc: 0.7419
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1050 - my_acc: 0.9879 - val_loss: 2.7660 - val_my_acc: 0.7426
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0983 - my_acc: 0.9893 - val_loss: 2.7569 - val_my_acc: 0.7408
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0933 - my_acc: 0.9903 - val_loss: 2.7775 - val_my_acc: 0.7414
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0885 - my_acc: 0.9911 - val_loss: 2.7885 - val_my_acc: 0.7420
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0845 - my_acc: 0.9920 - val_loss: 2.7914 - val_my_acc: 0.7423
```
3. Model application and prediction

Select some data from the training set for testing:

```python
if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict, id2eng = get_eng_dicts(df["eng"])
    chn_dict, id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    enc_in = [[get_val(e, eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>", chn_dict)] + [get_val(e, chn_dict) for e in jieba.cut(chn)] + [get_val("<EOS>", chn_dict)] for chn in df["chn"]]
    dec_out = [[get_val(e, chn_dict) for e in jieba.cut(chn)] + [get_val("<EOS>", chn_dict)] for chn in df["chn"]]

    enc_in_ar = np.array(padding(enc_in, 32))
    dec_in_ar = np.array(padding(dec_in, 30))
    dec_out_ar = np.array(padding(dec_out, 30))

    encoder_model = load_model("enc1.h5", custom_objects={"my_acc": my_acc})
    decoder_model = load_model("dec1.h5", custom_objects={"my_acc": my_acc})

    for k in range(16000, 16020):  # loop bounds were garbled in the original; 16000-16020 assumed
        test_data = enc_in_ar[k:k+1]
        # Encode the source sentence once
        h1, c1, h2, c2 = encoder_model.predict(test_data)
        # Decode token by token, starting from <GO>
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = chn_dict["<GO>"]
        outputs = []
        while True:
            output_tokens, h1, c1, h2, c2 = decoder_model.predict([target_seq, h1, c1, h2, c2])
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            outputs.append(sampled_token_index)
            target_seq[0, 0] = sampled_token_index
            if sampled_token_index == chn_dict["<EOS>"] or len(outputs) > 28:
                break
        # The source-sentence print is reconstructed from the sample output below
        print("> " + " ".join([id2eng[i] for i in enc_in[k]]))
        print("< " + " ".join([id2chn[i] for i in outputs[:-1]]))
```

The test results are as follows; essentially all of the translations are correct. (Only the English source lines survived in this capture; the Chinese output lines were garbled.)

```
> I can understand you to some extent .
> I can't recall the last time we met .
> I can't remember which is my racket .
> I can't stand that noise any longer .
> I can't stand this noise any longer .
> I caught the man stealing the money .
> I could not afford to buy a bicycle .
> I couldn't answer all the questions .
> I couldn't think of anything to say .
> I cry every time I watch this movie .
> I did not participate in the dialog .
> I didn't really feel like going out .
> I don't care a bit about the future .
```