
In daily life we often encounter chatbots: Taobao online customer service, WeChat auto-replies, reception robots in physical stores, and so on. Chatbots can free us from repetitive, tedious work. To develop a chatbot, you first need to master three basic concepts: Chinese word segmentation, the mathematical representation of text, and text similarity calculation.

Chinese word segmentation

Chinese word segmentation splits a sentence into its individual words. After receiving a sentence, the robot first segments it so that it can respond according to the keywords it contains. In Python, the jieba library handles this for us.

Segmenting a sentence with jieba:

```python
import jieba

# the sentence to segment: "Python is an object-oriented, dynamically typed language."
s = 'Python是一门面向对象的动态类型语言。'

for word in jieba.cut(s):
    print(word)
```

Results:

```
Python
是
一门
面向对象
的
动态
类型
语言
。
```

Read in order, these tokens say "Python is an object-oriented, dynamically typed language."

Mathematical representation of text

The language a computer uses directly is machine code, i.e. instructions encoded in binary. Text therefore needs to be translated into a form the machine can recognize before the machine can process it.

In mathematics, a word can be represented by a vector. This is called a word vector.

Here’s a dictionary:

Python, 是 (is), 一门 (a), 面向对象 (object-oriented), 的 (attributive particle), 动态 (dynamic), 类型 (type), 语言 (language)

Given the following three words:

Python, 面向对象 (object-oriented), 人工智能 (artificial intelligence)

The word vectors of these three words are:

• Python: (1, 0, 0, 0, 0, 0, 0, 0)
• 面向对象 (object-oriented): (0, 0, 0, 1, 0, 0, 0, 0)
• 人工智能 (artificial intelligence): (0, 0, 0, 0, 0, 0, 0, 0)

As you can see, if a position in the dictionary holds the given word, that position is set to 1; otherwise it is set to 0.

Word vector conversion using Python:

```python
# dictionary
word_vector_list = ["Python", "是", "一门", "面向对象", "的", "动态", "类型", "语言"]

# word vector: 1 where the dictionary word equals the given word, 0 elsewhere
def get_word_vector_result(word):
    return [1 if w == word else 0 for w in word_vector_list]

word1 = "Python"
word2 = "面向对象"  # object-oriented
word3 = "人工智能"  # artificial intelligence

print(get_word_vector_result(word1))
print(get_word_vector_result(word2))
print(get_word_vector_result(word3))
```

Results:

```
[1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0]
```

Now that we know how to represent word vectors, how to represent a sentence as a vector?

Using the same dictionary:

Python, 是 (is), 一门 (a), 面向对象 (object-oriented), 的 (attributive particle), 动态 (dynamic), 类型 (type), 语言 (language)

Given the following two sentences:

Python是一门高级语言 (Python is a high-level language)

我们在学习Python (We're learning Python)

The vectors of the two sentences are:

• Python是一门高级语言: (1, 1, 1, 0, 0, 0, 0, 1)
• 我们在学习Python: (1, 0, 0, 0, 0, 0, 0, 0)

To obtain a sentence vector, first segment the sentence, then check each dictionary word against the sentence's segments one by one: if the dictionary word appears among the segments, that position is set to 1; otherwise it is set to 0.

In fact, converting a sentence to a vector is essentially the same word-vector conversion; the only difference is that a sentence is composed of multiple segments.

Sentence vector conversion using Python:

```python
import jieba

# dictionary
word_vector_list = ["Python", "是", "一门", "面向对象", "的", "动态", "类型", "语言"]

s1 = "Python是一门高级语言"  # Python is a high-level language
s2 = "我们在学习Python"      # We're learning Python

# sentence vector: 1 if the dictionary word appears among the sentence's segments
def get_vector(sentence):
    segments = list(jieba.cut(sentence))
    return [1 if w in segments else 0 for w in word_vector_list]

print(get_vector(s1))
print(get_vector(s2))
```
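If jieba segments the two sentences into the tokens listed above, the printed vectors match the ones we worked out by hand:

```
[1, 1, 1, 0, 0, 0, 0, 1]
[1, 0, 0, 0, 0, 0, 0, 0]
```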

Python's gensim library also provides a word2vec tool to help with vector conversion:

```python
from gensim.models import word2vec

# load the corpus; Text8Corpus expects whitespace-separated tokens
sentences = word2vec.Text8Corpus('xxx.txt')

# train the model (this computes the word vectors); when the corpus contains
# few segments, lower min_count (default 5) to 1 so rare words are kept
model = word2vec.Word2Vec(sentences, min_count=1)

# save the model locally
model.save('word2vec.model')

# get the vectors of given words ('word1' and 'word2' are placeholders)
model.wv['word1', 'word2']
```
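Once saved, the model can be reloaded later instead of being retrained. A minimal sketch, assuming the word2vec.model file produced above ('word1' is a placeholder for any word that exists in the corpus):

```python
from gensim.models import word2vec

# reload the previously saved model from disk
model = word2vec.Word2Vec.load('word2vec.model')

# list the corpus words most similar to the given word
print(model.wv.most_similar('word1'))
```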

Text similarity calculation

A chatbot typically has a number of question-and-answer templates built in: we take the sentence the user sends, match it against the template library, and respond with the corresponding answer. Since the sentences users send are unpredictable, we look for the statement in the template library that is most similar to the user's sentence, which is where text similarity calculation comes in. Text similarity involves the following concepts: Euclidean distance, Manhattan (city block) distance, and cosine similarity.

Euclidean distance

In a rectangular coordinate system, given two points a(x1, y1) and b(x2, y2), the distance between them is calculated as follows:


d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

The resulting distance is the Euclidean distance. Extended to n-dimensional space, the formula becomes:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \dots + (n_1 - n_2)^2}

So how do you calculate the similarity of two sentences by Euclidean distance?

Given the sentence vectors of the following two sentences:

• Python是一门高级语言 (Python is a high-level language): (1, 1, 1, 0, 0, 0, 0, 1)
• 我们在学习Python (We're learning Python): (1, 0, 0, 0, 0, 0, 0, 0)

Applying the Euclidean distance calculation formula, we can get:

d = \sqrt{(1 - 1)^2 + (1 - 0)^2 + (1 - 0)^2 + (0 - 0)^2 + \dots + (1 - 0)^2} = \sqrt{3} \approx 1.73

The closer d is to 0, the more similar the two sentences are.
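As a quick check, here is a small sketch (plain Python, no external libraries) that computes the Euclidean distance between the two sentence vectors above:

```python
import math

# the two sentence vectors from the example above
v1 = [1, 1, 1, 0, 0, 0, 0, 1]  # Python是一门高级语言
v2 = [1, 0, 0, 0, 0, 0, 0, 0]  # 我们在学习Python

# Euclidean distance: square root of the sum of squared coordinate differences
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
print(d)  # 1.7320508... = sqrt(3)
```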

Manhattan distance

Manhattan distance is also known as city block distance. In a rectangular coordinate system, the Manhattan distance between two points is calculated as follows:

d = |x_1 - x_2| + |y_1 - y_2|

That is, the distance is the sum of the two legs of the right triangle whose hypotenuse connects the two points.

Extended to n-dimensional space, the formula becomes:

d = |x_1 - x_2| + |y_1 - y_2| + \dots + |n_1 - n_2|

Again, the closer d is to 0, the more similar the two sentences are.
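For the same two sentence vectors, the Manhattan distance is |1 - 1| + |1 - 0| + |1 - 0| + 0 + ... + |1 - 0| = 3; a minimal sketch:

```python
# the two sentence vectors from the Euclidean distance example
v1 = [1, 1, 1, 0, 0, 0, 0, 1]
v2 = [1, 0, 0, 0, 0, 0, 0, 0]

# Manhattan distance: sum of absolute coordinate differences
d = sum(abs(a - b) for a, b in zip(v1, v2))
print(d)  # 3
```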

Cosine similarity

The essence of cosine similarity is to compute the cosine of the angle between two vectors. For two n-dimensional vectors, the formula is:


\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \dots + x_n y_n}{\sqrt{x_1^2 + x_2^2 + \dots + x_n^2} \cdot \sqrt{y_1^2 + y_2^2 + \dots + y_n^2}}

Substituting two n-dimensional sentence vectors into the equation yields the cosine similarity of the two sentences.

The cosine ranges over [-1, 1]: the closer the value is to 1, the more similar the two sentences; the closer to -1, the less similar.
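For the two sentence vectors used earlier, the dot product is 1 and the norms are 2 and 1, so the cosine similarity is 1 / (2 × 1) = 0.5. A minimal sketch in plain Python:

```python
import math

# the two sentence vectors from the distance examples
v1 = [1, 1, 1, 0, 0, 0, 0, 1]
v2 = [1, 0, 0, 0, 0, 0, 0, 0]

# cosine similarity: dot product divided by the product of the norms
dot = sum(a * b for a, b in zip(v1, v2))
norm1 = math.sqrt(sum(a * a for a in v1))
norm2 = math.sqrt(sum(b * b for b in v2))
print(dot / (norm1 * norm2))  # 0.5
```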

A complete code example for calculating the similarity of segmented words

```python
import jieba
from gensim.models import word2vec

# open the raw corpus fenci.txt
file1 = open('fenci.txt', encoding='utf-8')

# save the segmentation result to fenci_result.txt
# open() modes: 'r' -> read only, 'w' -> truncate the file if it already exists,
# 'x' -> create and write to a new file, 'a' -> append
file2 = open('fenci_result.txt', mode='w', encoding='utf-8')

# read every line, strip whitespace, segment it, and write the
# space-separated segments to the result file
lines = file1.readlines()
for line in lines:
    replaced_line = line.replace(' ', '').replace('\t', '').replace('\r', '').replace('\n', '')
    seg_list = jieba.cut(replaced_line)
    file2.write(' '.join(seg_list) + ' ')  # trailing space keeps lines from merging

# close the resources
file1.close()
file2.close()

# load the newly generated corpus
sentences = word2vec.Text8Corpus('fenci_result.txt')

# use word2vec to train the model (here it computes the word vectors); because
# the corpus contains few segments, lower min_count (default 5) to 1
model = word2vec.Word2Vec(sentences, min_count=1)

# name the model word2vec.model and save it as a local file
model.save('word2vec.model')

# get word vectors; the segments whose vectors are requested must exist in the corpus
word_vec_arr = model.wv['Python', '面向对象']

# calculate the similarity of two segments; both must exist in the corpus
s1 = model.wv.similarity('Python', '面向对象')
print(s1)
```
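Tying this back to the template library idea from the beginning of this section: once the user's sentence and every template sentence are vectorized, answering the user is just a matter of picking the template with the highest similarity. A minimal sketch, assuming sentence vectors produced by a function like the get_vector() defined earlier:

```python
import math

def cosine(v1, v2):
    # cosine similarity of two equal-length vectors (0.0 if either is all zeros)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def best_template_index(user_vector, template_vectors):
    # index of the template sentence most similar to the user's sentence
    scores = [cosine(user_vector, t) for t in template_vectors]
    return scores.index(max(scores))
```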