
Intelligent bots can be seen everywhere in daily life: Siri, who chats on the iPhone; AlphaGo, who plays Go; the naughty and lovable Microsoft XiaoIce… They are all intelligent enough to interact with humans. These intelligent bots seem magical and far out of reach, but in fact, with a little hands-on effort, we can build one of our own.

This article will teach you how to build a dumb, sorry, "intelligent" chatbot from scratch.


To build a chatbot, you first need to understand a few related concepts — natural language processing (NLP), a science that combines linguistics, computer science, and mathematics to make computers "understand" human language. It includes many branches: text reading, speech recognition, syntactic analysis, natural language generation, human-machine dialogue, information retrieval, information extraction, text proofreading, text classification, automatic summarization, machine translation, textual entailment, and so on.

Don't let that list scare you away. Since this article is about building a "dumb" chatbot from scratch, it doesn't matter if you don't understand all of it! Just follow my lead step by step.


0x1 Basic Concepts

This part covers the underlying principles. Readers who aren't interested can skip it; it does not affect the code implementation that follows.

01 | Neural network

At the foundation of artificial intelligence are "neural networks", on which many complex applications (such as pattern recognition and automatic control) and advanced models (such as deep learning) are based. Learning artificial intelligence must start with them.

So the question is, what is a neural network? Simply put, neural networks mimic the neuronal networks of the human brain so that computers can "think". The details are not repeated here; there are many accessible explanations online.

This article uses a recurrent neural network (RNN). Let's look at the simplest kind, the basic recurrent neural network:

Although the diagram looks abstract, it's actually quite easy to understand. x, o, and s are vectors: x is the value of the input layer, o is the value of the output layer, and s is the value of the hidden layer (which actually contains many nodes); U is the weight matrix from the input layer to the hidden layer, and V is the weight matrix from the hidden layer to the output layer. So what is W? In a recurrent neural network, the hidden layer's value s depends not only on x and U but also on the hidden layer's value at the previous time step, and W is the weight matrix from the previous hidden layer to the current one. Unrolled, it looks like this:
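Written as equations, the standard formulation of this basic RNN is (f and g are activation functions, typically tanh for the hidden layer and softmax for the output):

$$s_t = f(U x_t + W s_{t-1}), \qquad o_t = g(V s_t)$$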

With that, the logic is much clearer: this is a simple recurrent neural network. Our dumb, sorry, "intelligent" chatbot uses a recurrent neural network, trained continuously on a corpus using lexical and syntactic analysis, and supplemented and refined with semantic analysis.


02 | Deep learning framework

There are many deep learning frameworks suitable for RNNs. The chatbot in this article is based on Google's open-source TensorFlow. As can be seen from the number of stars on GitHub, TensorFlow is a very popular deep learning framework, and it easily supports distributed computation on CPUs/GPUs. Below are some features of the current mainstream deep learning frameworks; you can pick a framework to explore based on your interests:


03 | SEQ2SEQ model

As the name implies, the seq2seq (sequence-to-sequence) model is like a translation model: the input is a sequence (such as an English sentence) and the output is another sequence (such as the corresponding French translation of that sentence). The most important aspect of this structure is that the lengths of the input and output sequences are variable.

Here’s an example:

In a dialogue system: input ("hello") -> output ("你好").

The input is one English word and the output is two Chinese characters. We ask (input) a question, and the machine automatically generates (output) an answer. The input and output here are clearly sequences of indeterminate length.

Let’s take another longer example:

I teased the Xiaohuangji ("Little Yellow Chicken") bot: "What are you daydreaming about in broad daylight?" The answer was: "Oh, ha ha, don't worry about it."

Step 1: Apply the bidirectional maximum matching algorithm to segment the sentence. The forward result is "broad daylight / doing what / sweet dream / ah"; the reverse result is the same. Since forward and reverse agree, there is no ambiguity to resolve. Longer words are preferred, such as "broad daylight" and "doing what" (a runnable sketch of this algorithm follows after Step 3).

Step 2: Assume the hash function is f(), and that f("broad daylight") points to the initial hash entry [大, 11, P]. This entry then points to the "3-character-word index" and on to the corresponding word list.

Step 3: Look up the matched entry's structure <大白天, …> in the word list. Its body contains an Ans field, and one of the pointers in that field leads to the answer "Oh, ha ha, don't worry about it."
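To make Step 1 concrete, here is a minimal Python sketch of bidirectional maximum matching. The toy dictionary, the maximum word length, and the single-character fallback are assumptions for illustration only; a real segmenter uses a full lexicon:

# A minimal sketch of bidirectional maximum matching (toy dictionary, illustrative only)
DICT = {"大白天", "做什么", "美梦", "呢"}
MAX_LEN = 3  # length of the longest dictionary entry

def forward_match(sent):
    words, i = [], 0
    while i < len(sent):
        # Try the longest candidate first; fall back to a single character
        for j in range(min(len(sent), i + MAX_LEN), i, -1):
            if sent[i:j] in DICT or j == i + 1:
                words.append(sent[i:j])
                i = j
                break
    return words

def backward_match(sent):
    words, j = [], len(sent)
    while j > 0:
        # Scan from the longest candidate ending at position j
        for i in range(max(0, j - MAX_LEN), j):
            if sent[i:j] in DICT or i == j - 1:
                words.insert(0, sent[i:j])
                j = i
                break
    return words

print(forward_match("大白天做什么美梦呢"))   # ['大白天', '做什么', '美梦', '呢']
print(backward_match("大白天做什么美梦呢"))  # the same, so no ambiguity to resolve

Here the two directions agree, so no ambiguity handling is needed, exactly as in Step 1; when they differ, a segmenter typically keeps the result with fewer words.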

This is the basic idea of seq2seq. With the principles and technology in hand, the next step is to implement it!


0x2 Corpus Preparation

With the background covered, let's say no more and move straight to building the intelligent chatbot. First, we need to prepare the training corpus.

01 | Corpus collation

The training corpus is downloaded from GitHub (a collection of Chinese and English dialogue-system corpora: https://github.com/candlewill/Dialog_Corpus). We use xiaohuangji50w_fenciA.conv (the Xiaohuangji "Little Yellow Chicken" corpus) for training.

Opening it after the download, the corpus looks like this:

We can understand the dialogue text, but what are these E's and M's? It's actually easy to see from the picture: M marks a single utterance, and E marks the beginning and end of a conversation.

After we get the corpus, we split it into two files, questions and answers: Question.txt and Answer.txt:

import re

def prepare(num_dialogs=50000):
    with open("xhj.conv") as fopen:
        # Each dialog in the corpus looks like "E\nM <question>\nM <answer>\n"
        reg = re.compile("E\nM (.*?)\nM (.*?)\n")
        match_dialogs = re.findall(reg, fopen.read())
        if num_dialogs >= len(match_dialogs):
            dialogs = match_dialogs
        else:
            dialogs = match_dialogs[:num_dialogs]
        questions = []
        answers = []
        for que, ans in dialogs:
            questions.append(que)
            answers.append(ans)
        # Save to the data/ folder
        save(questions, "data/Question.txt")
        save(answers, "data/Answer.txt")

def save(dialogs, file):
    with open(file, "w") as fopen:
        fopen.write("\n".join(dialogs))

In the end, we get 50,000 question-and-answer pairs:

02 | Building the vocabulary mapping

At this point, you might ask: does this "intelligent" chatbot simply match the question we type against Question.txt and then find and output the corresponding answer from Answer.txt?

Of course, it’s not that simple. Essentially, chatbots generate a new response based on the context of the question, rather than pulling a corresponding response from a database.

So how does the machine know what to answer? Here is Google's seq2seq schematic:



To put it simply: every sentence we type is split into words and vectorized. These word vectors serve as the input layer; they are multiplied by a weight matrix to reach the hidden layer, and the hidden layer is then multiplied by another weight matrix to produce the final output vector. Mapping this vector back onto the word-vector vocabulary gives us the result we want.

This is easy to implement in code, because TensorFlow takes care of the complex underlying logic for us. Here we do the final pass over the vocabulary:

def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    with open(input_file) as f:
        counter = 0
        for line in f:
            counter += 1
            tokens = [word for word in line.strip()]
            for word in tokens:
                # Filter out non-Chinese characters
                if u'\u4e00' <= word <= u'\u9fff':
                    if word in vocabulary:
                        vocabulary[word] += 1
                    else:
                        vocabulary[word] = 1
        # START_VOCABULART holds the special tokens (e.g. __PAD__, __GO__, __EOS__, __UNK__) defined elsewhere
        vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
        # Keep only the most frequent characters
        vocabulary_size = 3500
        if len(vocabulary_list) > vocabulary_size:
            vocabulary_list = vocabulary_list[:vocabulary_size]
        print(input_file + " vocabulary size:", len(vocabulary_list))
        with open(output_file, "w") as ff:
            for word in vocabulary_list:
                ff.write(word + "\n")
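With the vocabulary file in hand, converting a sentence into a sequence of IDs is straightforward. A minimal sketch (the file path and the convention that ID 3 is the unknown token __UNK__ are assumptions for illustration):

# Minimal sketch: map a sentence onto the vocabulary (assumed layout: one word per line)
def load_vocab(path):
    with open(path) as f:
        return {word.strip(): idx for idx, word in enumerate(f)}

def sentence_to_ids(sentence, vocab, unk_id=3):
    # Characters missing from the vocabulary map to the unknown token
    return [vocab.get(ch, unk_id) for ch in sentence]

vocab = load_vocab("data/vocab_question.txt")  # hypothetical output of gen_vocabulary_file
print(sentence_to_ids("大白天做什么美梦呢", vocab))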




0x3 Start training

01 | Training

Once our corpus is ready, we can start training. The training itself is actually very simple: the core is to instantiate TensorFlow's Seq2SeqModel and train it in a continuous loop. Below are the core code and the parameter settings for training:

import math
import numpy as np
import tensorflow as tf
import seq2seq_model  # a local copy of TensorFlow's legacy seq2seq_model.py

# Size of the source (input) vocabulary
vocabulary_encode_size = 3500
# Size of the target (output) vocabulary
vocabulary_decode_size = 3500
# Buckets are an efficient way to handle sentences of different lengths:
# a (5, 10) bucket holds questions up to 5 words with answers up to 10
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
# Number of units in each layer
layer_size = 256
# Number of network layers
num_layers = 3
# Batch size for training
batch_size = 64
# max_gradient_norm: gradients are clipped to this norm
# learning_rate: the initial learning rate
# learning_rate_decay_factor: learning rate decay factor
# forward_only: False means the decoder uses decoder_inputs as its input. For example, with inputs 'GO, W, X, Y, Z', the correct output should be 'W, X, Y, Z, EOS'; even if the output at the first time step is not 'W', 'W' is still used as the input at the second time step. When True, only the first decoder input ('GO') is given, and the decoder's actual output at each step is fed as the input to the next step.
model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_encode_size, target_vocab_size=vocabulary_decode_size, buckets=buckets, size=layer_size, num_layers=num_layers, max_gradient_norm=5.0, batch_size=batch_size, learning_rate=0.5, learning_rate_decay_factor=0.97, forward_only=False)

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'  # prevent out-of-memory errors

with tf.Session(config=config) as sess:
    # Resume previous training if a checkpoint exists
    ckpt = tf.train.get_checkpoint_state('.')
    if ckpt is not None:
        print(ckpt.model_checkpoint_path)
        model.saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())

    # read_data() and the *_vec paths (the vectorized corpora) are defined elsewhere in the project
    train_set = read_data(train_encode_vec, train_decode_vec)
    test_set = read_data(test_encode_vec, test_decode_vec)

    train_bucket_sizes = [len(train_set[b]) for b in range(len(buckets))]
    train_total_size = float(sum(train_bucket_sizes))
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size for i in range(len(train_bucket_sizes))]

    loss = 0.0
    total_step = 0
    previous_losses = []
    # Keep training, saving the model every so often
    while True:
        # Pick a bucket at random, weighted by how much data it contains
        random_number_01 = np.random.random_sample()
        bucket_id = min([i for i in range(len(train_buckets_scale)) if train_buckets_scale[i] > random_number_01])

        encoder_inputs, decoder_inputs, target_weights = model.get_batch(train_set, bucket_id)
        _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, False)

        loss += step_loss / 500
        total_step += 1

        print(total_step)
        if total_step % 500 == 0:
            print(model.global_step.eval(), model.learning_rate.eval(), loss)

            # If the model is not improving, reduce the learning rate
            if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                sess.run(model.learning_rate_decay_op)
            previous_losses.append(loss)
            # Save the model
            checkpoint_path = "chatbot_seq2seq.ckpt"
            model.saver.save(sess, checkpoint_path, global_step=model.global_step)
            loss = 0.0
            # Evaluate the model on the test data
            for bucket_id in range(len(buckets)):
                if len(test_set[bucket_id]) == 0:
                    continue
                encoder_inputs, decoder_inputs, target_weights = model.get_batch(test_set, bucket_id)
                _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
                eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                print(bucket_id, eval_ppx)




02 | Question-and-answer results in practice

If our model just keeps training, how do we know when it is good enough to stop? What is the threshold for stopping training? Here we introduce an evaluation metric for language models: perplexity.

① What is perplexity?

PPL is used in natural language processing (NLP) to measure the quality of a language model. It estimates the probability of a sentence occurring based on the probability of each word, normalized by sentence length. The formula, in its standard length-normalized form, is:
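$$\mathrm{PPL}(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_1 \cdots w_{i-1})}}$$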

S is the sentence, N is the sentence length, and p(w_i) is the probability of the i-th word. The probability of the first word is p(w_1 | w_0), where w_0 is START, a placeholder marking the beginning of the sentence.

This equation can be understood as follows: the smaller the PPL, the larger the p(w_i) values, and the more probable the sentence.

Perplexity can also be thought of as the average branching factor: the number of candidate choices when predicting the next word. If someone reports that a model's PPL has dropped to 90, it can be understood intuitively as the model having about 90 reasonable candidates for each word when generating a sentence. The fewer candidates, the more accurate we generally consider the model, which also explains why a smaller PPL means a better model.
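As a quick numerical illustration (the per-word probabilities below are invented):

import math

# Hypothetical per-word probabilities p(w_i | w_0 ... w_{i-1}) for a 4-word sentence
probs = [0.2, 0.1, 0.25, 0.05]

# PPL(S) = P(w_1 ... w_N)^(-1/N), computed in log space for numerical stability
ppl = math.exp(-sum(math.log(p) for p in probs) / len(probs))
print(ppl)  # ~7.95: on average about 8 "reasonable choices" per word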

For our training run, the most recent perplexity values look like this:

By the time of publication, this model had been training for 27 hours, and its perplexity was still proving hard to bring down, so training a model really does take some patience. With a GPU available for training, this would go much faster.

We used the current model to have some conversations and found that it was taking shape:
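The chat loop behind these conversations can be sketched roughly as follows, modeled on the decode loop in TensorFlow's legacy translate.py example. This is only a sketch: it assumes the model, buckets, and vocabulary built above, the hypothetical sentence_to_ids helper from the earlier sketch, an inverse id_to_word list, and that IDs 0-3 are the special tokens:

# Rough decoding sketch (assumptions noted above; a fuller version would cut the output at EOS)
with tf.Session() as sess:
    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_encode_size,
                                       target_vocab_size=vocabulary_decode_size, buckets=buckets,
                                       size=layer_size, num_layers=num_layers, max_gradient_norm=5.0,
                                       batch_size=1, learning_rate=0.5,
                                       learning_rate_decay_factor=0.97,
                                       forward_only=True)  # forward_only=True: decode instead of train
    model.saver.restore(sess, tf.train.get_checkpoint_state('.').model_checkpoint_path)
    while True:
        ids = sentence_to_ids(input("> "), vocab)
        # Use the smallest bucket the question fits into
        bucket_id = min(i for i, (source_size, _) in enumerate(buckets) if source_size >= len(ids))
        encoder_inputs, decoder_inputs, target_weights = model.get_batch({bucket_id: [(ids, [])]}, bucket_id)
        _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
        # Greedy decoding: pick the most likely word at every step
        outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
        print("".join(id_to_word[i] for i in outputs if i > 3))  # skip special tokens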

  

At this point, our "intelligent chatbot" is done! But it is easy to see that the bot still acts silly from time to time, and many of its answers are irrelevant, which is why we also call it the "dumb bot".

0x4 Epilogue

So far, we have trained a question-answering bot from scratch. It is still a little "dumb" and there are many words it doesn't understand, but the whole pipeline runs and produces some results. The remaining work is to keep improving the algorithm, the parameters, and the corpus. The corpus is an especially critical part, probably taking up 50%-70% of the workload; this article used a corpus that had already been processed, which saved a lot of time. In practice, most developers spend their time on corpus preprocessing: data cleaning, word segmentation, part-of-speech tagging, stop-word removal, and so on.

Later, I will share more on the topic of corpus preprocessing, so stay tuned.



Related literature and reference materials:

Getting started with machine learning (http://www.cnblogs.com/subconscious/p/4107357.html)

Implementing a neural network from scratch in Python (http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/)

Recurrent neural networks (https://zybuluo.com/hanbingtao/note/541458)

Language model evaluation metrics (https://blog.csdn.net/index20001/article/details/78884646)

TensorFlow seq2seq (https://github.com/google/seq2seq)


From the WeChat official account: Tencent DeepOcean