preface

In practical engineering there are few end-to-end chatbots built directly with deep learning, but here we look at how to use the seq2seq model to implement a simple one. This article uses TensorFlow to train a seq2seq-based chatbot so that, after training on a corpus, the bot can answer questions.

seq2seq

The mechanism and principle of seq2seq are covered in the previous article "Seq2seq Model for Deep Learning".

Recurrent neural network

The seq2seq model is built from recurrent neural networks. The popular variants are the vanilla RNN, LSTM and GRU; their mechanisms and principles are covered in the previous articles "Recurrent Neural Networks", "LSTM Neural Network" and "GRU Neural Network".

Training sample set

The training set consists mostly of QA pairs, and plenty of open data sets are available for download. Here we simply use a corpus in which questions and answers alternate line by line: line 1 is a question, line 2 its answer, line 3 the next question, line 4 its answer, and so on.
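As a rough illustration of reading such a corpus, a loader might look like the following sketch (the file name corpus.txt and the lower-casing are assumptions, not tied to any particular data set):

# A minimal sketch of loading an alternating question/answer corpus.
# 'corpus.txt' is a hypothetical file name; odd lines are questions, even lines answers.
def load_corpus(path='corpus.txt'):
    with open(path, encoding='utf-8') as f:
        lines = [line.strip().lower() for line in f if line.strip()]
    return lines  # consumed pairwise later: lines[i] is a question, lines[i + 1] its answer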

Data preprocessing

To train on the data it must first be converted to numbers: the values 0 to N represent the whole vocabulary, each value standing for one word, with the vocabulary size defined by VOCAB_SIZE. We also define the maximum and minimum lengths of a question and of an answer, plus four special symbols: UNK stands for unknown words (any word outside the top VOCAB_SIZE words is treated as unknown), GO marks the start of the decoder input, EOS marks the end of an answer, and PAD is used for padding. Because every QA pair fed into the same seq2seq model must have the same input and output lengths, shorter questions and answers are padded with PAD.

limit = {
    'maxq': 10,
    'minq': 0,
    'maxa': 8,
    'mina': 3
}

UNK = 'unk'
GO = '<go>'
EOS = '<eos>'
PAD = '<pad>'
VOCAB_SIZE = 1000

Filter according to QA length limits.

def filter_data(sequences):
    filtered_q, filtered_a = [], []
    raw_data_len = len(sequences) // 2

    for i in range(0, len(sequences), 2):
        qlen, alen = len(sequences[i].split(' ')), len(sequences[i + 1].split(' '))
        if qlen >= limit['minq'] and qlen <= limit['maxq']:
            if alen >= limit['mina'] and alen <= limit['maxa']:
                filtered_q.append(sequences[i])
                filtered_a.append(sequences[i + 1])

    filt_data_len = len(filtered_q)
    filtered = int((raw_data_len - filt_data_len) * 100 / raw_data_len)
    print(str(filtered) + '% filtered from original data')

    return filtered_q, filtered_a
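As a hypothetical usage (building on the load_corpus sketch above), filtering and then tokenizing the corpus could look like this:

lines = load_corpus()
qlines, alines = filter_data(lines)          # keep only pairs within the length limits
qtokenized = [q.split(' ') for q in qlines]  # questions as lists of words
atokenized = [a.split(' ') for a in alines]  # answers as lists of words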

We also need frequency statistics for all words in the corpus, keeping the top N most frequent words as the vocabulary, where N is the VOCAB_SIZE defined above. In addition, we need two mappings: from an index to the corresponding word, and from a word to its index.

import itertools

import nltk


def index_(tokenized_sentences, vocab_size):
    freq_dist = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    vocab = freq_dist.most_common(vocab_size)
    index2word = [GO] + [EOS] + [UNK] + [PAD] + [x[0] for x in vocab]
    word2index = dict([(w, i) for i, w in enumerate(index2word)])
    return index2word, word2index, freq_dist
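A quick usage sketch (assuming the qtokenized and atokenized lists from the filtering step): build the vocabulary over both sides of the corpus and look a word up in either direction.

index2word, word2index, freq_dist = index_(qtokenized + atokenized, vocab_size=VOCAB_SIZE)
wid = word2index.get('you', word2index[UNK])  # word -> id, falling back to the unknown token
print(wid, index2word[wid])                   # id -> word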

In our seq2seq model, questions fed to the encoder have different lengths, so questions that are too short are filled with PAD. For example, with a length of 10 the question "how are you" becomes "how are you pad pad pad pad pad pad pad". For the decoder, the input starts with GO and ends with EOS and is likewise padded to a fixed length, so "fine thank you" becomes "go fine thank you eos pad pad pad pad pad". The target is the same as the decoder input but shifted by one position: the leading GO is removed, giving "fine thank you eos pad pad pad pad pad pad".

import numpy as np


def zero_pad(qtokenized, atokenized, w2idx):
    data_len = len(qtokenized)
    # +2 due to '<go>' and '<eos>'
    idx_q = np.zeros([data_len, limit['maxq']], dtype=np.int32)
    idx_a = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)
    idx_o = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)

    for i in range(data_len):
        q_indices = pad_seq(qtokenized[i], w2idx, limit['maxq'], 1)
        a_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 2)
        o_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 3)
        idx_q[i] = np.array(q_indices)
        idx_a[i] = np.array(a_indices)
        idx_o[i] = np.array(o_indices)

    return idx_q, idx_a, idx_o


def pad_seq(seq, lookup, maxlen, flag):
    if flag == 1:
        indices = []
    elif flag == 2:
        indices = [lookup[GO]]
    elif flag == 3:
        indices = []
    for word in seq:
        if word in lookup:
            indices.append(lookup[word])
        else:
            indices.append(lookup[UNK])
    if flag == 1:
        return indices + [lookup[PAD]] * (maxlen - len(seq))
    elif flag == 2:
        return indices + [lookup[EOS]] + [lookup[PAD]] * (maxlen - len(seq))
    elif flag == 3:
        return indices + [lookup[EOS]] + [lookup[PAD]] * (maxlen - len(seq) + 1)
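Putting the pieces together (a hypothetical usage sketch), the tokenized QA pairs become three integer arrays, one for each input the model will need:

idx_q, idx_a, idx_o = zero_pad(qtokenized, atokenized, word2index)
# idx_q: encoder inputs,  shape (N, maxq),     questions padded with '<pad>'
# idx_a: decoder inputs,  shape (N, maxa + 2), '<go>' ... '<eos>' plus padding
# idx_o: decoder targets, shape (N, maxa + 2), same as idx_a shifted past the '<go>'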

The above structures are then persisted for use in training.
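One simple way to persist them is with NumPy arrays plus a pickle for the vocabulary (a sketch; the file names are hypothetical):

import pickle

import numpy as np

np.save('idx_q.npy', idx_q)
np.save('idx_a.npy', idx_a)
np.save('idx_o.npy', idx_o)
with open('metadata.pkl', 'wb') as f:
    pickle.dump({'index2word': index2word, 'word2index': word2index}, f)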

Build the graph

encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
targets = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
weights = tf.placeholder(dtype=tf.float32, shape=[batch_size, sequence_length])

Create four placeholders: one for the encoder input, one for the decoder input, one for the decoder target, and one for the loss weights. batch_size is the number of samples fed in per step, and sequence_length is the sequence length we defined earlier.
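The snippets here never show how the values fed into the weights placeholder are built. A common choice, which is only an assumption about this code, is a mask of ones with zeros at the '<pad>' positions so that padding does not contribute to the loss:

import numpy as np

def make_weights(targets_batch, pad_id):
    # hypothetical helper: 1.0 for real target tokens, 0.0 for '<pad>' positions
    w = np.ones_like(targets_batch, dtype=np.float32)
    w[targets_batch == pad_id] = 0.0
    return w

# e.g. train_weights = make_weights(train_targets, word2index[PAD])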

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)

Create the recurrent network structure, here using LSTM cells: hidden_size is the number of hidden units in each cell, and MultiRNNCell stacks the cells because we want a deeper network, num_layers being the number of LSTM layers.

results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
    tf.unstack(encoder_inputs, axis=1),
    tf.unstack(decoder_inputs, axis=1),
    cell,
    num_encoder_symbols,
    num_decoder_symbols,
    embedding_size,
    feed_previous=False
)

We build the seq2seq model with the embedding_rnn_seq2seq function that TensorFlow provides; you could also wire up the encoder and decoder from LSTM cells yourself, but embedding_rnn_seq2seq is more convenient. tf.unstack expands encoder_inputs and decoder_inputs into lists of time steps. num_encoder_symbols and num_decoder_symbols correspond to our vocabulary sizes, and embedding_size is the dimension of the embedding layer. The feed_previous argument is important: False means the training phase, where decoder_inputs is fed to the decoder; True means the prediction phase, where there are no decoder inputs and each decoder step instead takes the previous step's output as its input.

logits = tf.stack(results, axis=1)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets=targets, weights=weights)
pred = tf.argmax(logits, axis=2)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

The loss is then created with sequence_loss, computed from the outputs of embedding_rnn_seq2seq. The same outputs are also used for prediction: taking the argmax over the vocabulary axis gives the index of the predicted word at each step. The optimizer is AdamOptimizer.

Create a session

saver = tf.train.Saver()  # create the saver once the graph above is built

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())
    epoch = 0
    while epoch < 5000000:
        epoch = epoch + 1
        print("epoch:", epoch)
        for step in range(0, 1):
            print("step:", step)
            train_x, train_y, train_target = loadQA()
            train_encoder_inputs = train_x[step * batch_size:step * batch_size + batch_size, :]
            train_decoder_inputs = train_y[step * batch_size:step * batch_size + batch_size, :]
            train_targets = train_target[step * batch_size:step * batch_size + batch_size, :]
            op = sess.run(train_op, feed_dict={encoder_inputs: train_encoder_inputs, targets: train_targets,
                                               weights: train_weights, decoder_inputs: train_decoder_inputs})
            cost = sess.run(loss, feed_dict={encoder_inputs: train_encoder_inputs, targets: train_targets,
                                             weights: train_weights, decoder_inputs: train_decoder_inputs})
            print(cost)
            step = step + 1
        if epoch % 100 == 0:
            saver.save(sess, model_dir + '/model.ckpt', global_step=epoch + 1)

A tf.train.Saver object is used to save and restore the model. To be safe, the model is saved at regular intervals, so the next run can resume from the latest checkpoint instead of starting over. Note that this loop keeps feeding the same single batch rather than iterating over the data in mini-batches.
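If you do want to sweep over the whole data set in mini-batches, a sketch along these lines could replace the inner loop (an assumption, not part of the original code; make_weights is the hypothetical helper introduced earlier, and num_epochs is a hypothetical setting):

train_x, train_y, train_target = loadQA()
num_batches = len(train_x) // batch_size
for epoch in range(1, num_epochs + 1):
    for step in range(num_batches):
        lo, hi = step * batch_size, (step + 1) * batch_size
        feed = {encoder_inputs: train_x[lo:hi],
                decoder_inputs: train_y[lo:hi],
                targets: train_target[lo:hi],
                weights: make_weights(train_target[lo:hi], word2index[PAD])}
        _, cost = sess.run([train_op, loss], feed_dict=feed)  # one optimization step per batch
    if epoch % 100 == 0:
        saver.save(sess, model_dir + '/model.ckpt', global_step=epoch)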

Prediction

with tf.device('/cpu:0'):
    batch_size = 1
    sequence_length = 10
    num_encoder_symbols = 1004
    num_decoder_symbols = 1004
    embedding_size = 256
    hidden_size = 256
    num_layers = 2

    encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
    decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
    targets = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
    weights = tf.placeholder(dtype=tf.float32, shape=[batch_size, sequence_length])

    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)

    results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        tf.unstack(encoder_inputs, axis=1),
        tf.unstack(decoder_inputs, axis=1),
        cell,
        num_encoder_symbols,
        num_decoder_symbols,
        embedding_size,
        feed_previous=True,
    )
    logits = tf.stack(results, axis=1)
    pred = tf.argmax(logits, axis=2)

    saver = tf.train.Saver()
    with tf.Session() as sess:
        module_file = tf.train.latest_checkpoint('./model/')
        saver.restore(sess, module_file)
        map = Word_Id_Map()
        encoder_input = map.sentence2ids(['you', 'want', 'to', 'turn', 'twitter', 'followers', 'into', 'blog', 'readers'])

        # pad the question up to length 10; 3 is the id of '<pad>' given
        # index2word = [GO, EOS, UNK, PAD] + vocab
        encoder_input = encoder_input + [3 for i in range(0, 10 - len(encoder_input))]
        encoder_input = np.asarray([np.asarray(encoder_input)])
        decoder_input = np.zeros([1, 10])
        print('encoder_input : ', encoder_input)
        print('decoder_input : ', decoder_input)
        pred_value = sess.run(pred, feed_dict={encoder_inputs: encoder_input, decoder_inputs: decoder_input})
        print(pred_value)
        sentence = map.ids2sentence(pred_value[0])
        print(sentence)

The prediction phase builds the same graph, loads the model saved during training, and then predicts an answer to a question. Prediction here runs on the CPU rather than the GPU. The graph-building steps and parameters are the same as in training, except that feed_previous is set to True for embedding_rnn_seq2seq, since there is no decoder input at prediction time, and the loss function and optimizer are dropped; only the prediction op is kept.
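The Word_Id_Map helper used above is not shown in these snippets; a minimal sketch of what it presumably does, built on the persisted index2word/word2index mappings (the real interface in the repository may differ):

import pickle

class Word_Id_Map(object):
    def __init__(self, path='metadata.pkl'):  # hypothetical file written during preprocessing
        with open(path, 'rb') as f:
            meta = pickle.load(f)
        self.index2word = meta['index2word']
        self.word2index = meta['word2index']

    def sentence2ids(self, words):
        # map words to ids, falling back to the unknown token
        return [self.word2index.get(w, self.word2index[UNK]) for w in words]

    def ids2sentence(self, ids):
        # map ids back to words
        return [self.index2word[i] for i in ids]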

Running the script, the predicted reply comes out as something like ['how', 'do', 'you', 'do', 'this', ...].

github

Github.com/sea-boat/se…


Related reading:

LSTM Neural Network

Recurrent Neural Networks

Seq2seq Model for Deep Learning

Neural Networks for Machine Learning

GRU Neural Network
