In the previous two chapters, Text similarity computing (1) and Text similarity computing (2), we covered TFIDF and the vector space model, and then introduced the topic model. In this chapter we will actually try both of them out. I'm not going to cover word vectors here; they don't have much to do with these two techniques, but the code for them is included at the end.

0. Prepare tools

As the saying goes, to do a good job one must first sharpen one's tools, so let's get our tools ready. Here we use the Python gensim toolkit; the address is radimrehurek.com/gensim/inde… . This toolkit is very powerful and I won't go through every feature; it has everything we need, it is very simple to use, and it can even be deployed in a distributed fashion. If you are interested, the official site has a detailed introduction.

Why don't I write it myself? That question.... ha ha... ha ha... I can't.....

As for installation, you need Python 2.6 or above (obviously), NumPy 1.3 or above, and SciPy 0.7 or above; the last two are Python scientific computing packages.

Installing with easy_install is straightforward, so I won't go over it here. It might be a little harder on Windows, but I haven't used Windows in a long time; on my machine it took three or four commands. The official gensim documentation also has installation instructions, and if you still can't get it installed, just Google or Baidu it; there is always a solution.

Besides gensim, there is one more package to install: the jieba word segmentation package, which is also easy to install.

1. Data preparation

Data preparation is a technical job. My professional ethics are very high, so I did not use the company's data and had to find my own; grabbing a ready-made online corpus felt too low, so I crawled some data myself.

First I aimed at the country's currently most popular full-stack technology site (SegmentFault), then at an automotive site, and started crawling with a crawler I wrote myself. I think it turned out pretty well: my crawler consists of a scheduler plus crawling and parsing plugins, the plugins can be written in any language, Chrome can even be hooked in directly to crawl pure-JS single-page sites, and it supports a proxy pool. If you are interested I can also write about crawler topics later; it is distributed, so you can add machines to increase crawling capacity.

OK, enough gossip. With two sites crawled, we can get to work. The reason for crawling two kinds of sites is to make the LDA topic model easier to understand later on.

2. Clean the data

After the data is crawled, what we need to do is clean it. As I said in my earlier article on machine learning skills, data cleaning is a necessary skill for an algorithm engineer: without good data, it doesn't matter how good the algorithm is.

Once you have the data, write a script to:

  • First, extract the title, the author, the time, and so on; these are easy to pull out with regular expressions or XPath.
  • Then strip out the HTML tags with a pile of regular expressions; what's left is basically the body text. The SegmentFault articles also need special handling, since the code blocks in them are useless to me.
  • Finally, strip out punctuation and special symbols, normalize the formatting, and end up with each article looking like this (a small sketch of the cleaning step follows below):

    ID [TAB]TITLE [TAB]CONTENT

There are 11,628 articles in total, roughly 6,000 automotive and about 6,000 technical (from SegmentFault). Well, the data is basically clean.
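
The cleaning itself was just a pile of regular expressions. A minimal sketch of the idea (not the original cleaning script; the patterns and helper names here are my own assumptions):

import re

def clean_html(raw):
    # Drop script/style blocks first, then any remaining HTML tags
    text = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", raw)
    text = re.sub(r"<[^>]+>", " ", text)
    # Strip punctuation and special symbols, keep word characters and whitespace
    text = re.sub(r"(?u)[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def write_record(out, doc_id, title, content):
    # One article per line: ID <TAB> TITLE <TAB> CONTENT
    out.write("%s\t%s\t%s\n" % (doc_id, clean_html(title), clean_html(content)))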

4. Training data

Everyone probably thinks this section is the hard part. In fact, with jieba doing the word segmentation and gensim doing the heavy lifting, the code is very simple, fewer than 50 lines. Let's walk through it step by step.

4.1 Word segmentation — build a dictionary — prepare numerical corpus

Word segmentation is the foundation, so segment the words first:

from gensim import corpora,models,similarities,utils
import jieba
import jieba.posseg as pseg
jieba.load_userdict( "user_dic.txt" ) # Load custom dictionary, mainly some computer words and car model words
# Define the original corpus set
train_set=[]
f=open("./data/all.txt")
lines=f.readlines()
for line in lines:
    content = (line.lower()).split("\t") [2] + (line.lower()).split("\t") [1]
    # The etl function removes useless symbols; cut_all=False means don't use full-mode segmentation
    word_list = filter(lambda x: len(x)>0,map(etl,jieba.cut(content,cut_all=False)))
    train_set.append(word_list)
f.close()
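
The etl function used above is not shown in the snippet; here is a minimal sketch of what it might look like (my own assumption, not the author's original helper):

def etl(token):
    # Keep tokens made up of letters and digits (CJK characters count as
    # alphanumeric for unicode strings); drop punctuation/whitespace tokens
    token = token.strip()
    return token if token.isalnum() else ""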

This train_set is the raw corpus. Next we feed the corpus into a dictionary to build the dictionary.

# create a dictionary
dictionary = corpora.Dictionary(train_set)
# Filter out extremely low-frequency noise words
dictionary.filter_extremes(no_below=1,no_above=1,keep_n=None)
# Save the dictionary for future use
dictionary.save(output + "all.dic")

After the corpus is fed into the dictionary, every word gets a numeric id (1, 2, 3, ...). This numbering is the first step of vectorization, and we save the dictionary for later use. Then we generate the numerical corpus:

corpus = [dictionary.doc2bow(text) for text in train_set]

At this point the corpus variable is two-dimensional: each row represents one document, listing the id and frequency of each word in it, and each row looks like this:

[(1, 2), (2, 4), (5, 2), ...] which means the word with id 1 appears twice, the word with id 2 appears four times, and so on.
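
As a tiny illustration (the tokens and ids below are made up, not from the real corpus):

sample_doc = ["engine", "turbo", "engine"]
print(dictionary.doc2bow(sample_doc))
# -> something like [(12, 2), (48, 1)]; the actual ids depend on the dictionary,
#    and words not present in the dictionary are silently ignored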

OK, the preparation is done. The original articles have been digitized through segmentation -> dictionary building -> corpus generation, and the rest is simple.

4.2 The TFIDF model

With the numerical corpus, we can generate a TFIDF model:

# Generate TFIDF model using digital corpus
tfidfModel = models.TfidfModel(corpus)
# Save the tfidfModel
tfidfModel.save(output + "allTFIDF.mdl")

This line is the key: from the raw numerical corpus we generated a TFIDF model. What can this model do? gensim overloads the [] operator, so we can pass in an original vector like [(1,2),(2,4),(5,2), ...] and get back a TFIDF vector like [(1,0.98),(2,0.23),(5,0.56), ...], which says that the word with id 1 is more important than the two that follow. This vector can also be used as the raw vector input for the LDA step below.
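
For a single document it looks like this (a sketch; the ids and weights shown are only illustrative):

one_doc_tfidf = tfidfModel[corpus[0]]
# e.g. [(1, 0.98), (2, 0.23), (5, 0.56), ...]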

Then we convert the whole corpus to TFIDF vectors and store the result as index data, so it can be used for similarity search later:

# Convert the whole corpus into TFIDF vectors; the tfidfModel can take the two-dimensional corpus directly
tfidfVectors = tfidfModel[corpus]
# Build the index and save it
indexTfidf = similarities.MatrixSimilarity(tfidfVectors)
indexTfidf.save(output + "allTFIDF.idx")

At this point we have generated a model file (allTFIDF.mdl) and a TFIDF vector index over the whole corpus (allTFIDF.idx). Together with the dictionary from before (all.dic), we now have three files. We will talk about how to use them later; for now let's move on to LDA.

4.3 The LDA model

The last article said so much about LDA, but in gensim it is just the few lines of code below, and yes, this is the legendary machine learning. Let's just say gensim wraps it all very succinctly.

# Generate the LDA model from the TFIDF vectors; id2word is the dictionary mapping ids to words, num_topics is the number of topics. We use 50 here; too many topics takes unbearably long to train.
lda = models.LdaModel(tfidfVectors, id2word=dictionary, num_topics=50)
# Save the model
lda.save(output + "allLDA50Topic.mdl")
# Convert all TFIDF vectors to LDA vectors
corpus_lda = lda[tfidfVectors]
# Build the index over the LDA vectors and save it
indexLDA = similarities.MatrixSimilarity(corpus_lda)
indexLDA.save(output + "allLDA50Topic.idx")

It's only three steps, but it takes quite a lot of time. If you have logging turned on you can see what is going on; I picked out a few of the topics, like the ones below. Clearly the first few are about cars and the later ones are about technology, so it looks reasonably good.

#38 (0.020): 0.003* Novelty + 0.003* Jun + 0.002* Touran + 0.002* Equipped + 0.002* Metropolitan + 0.001* Except + 0.001* Envision
#27 (0.020): 0.003* Configuration + 0.003* Interior + 0.003* Model + 0.002* Airbag + 0.002* Lucky + 0.002* Ten thousand Yuan + 0.002* Yizhi +
#0 (0.020): 0.004* Pentium + 0.003* Acceleration + 0.003* Carnival + 0.002* Throttle + 0.002* Elysee + 0.002* SEC
#49 (0.020): 0.004* Tiger + 0.004* Saab + 0.004* Erno + 0.002* Lexus + 0.002* Model + 0.002* Lotto
#26 (0.020): 0.011* List + 0.009* Stream + 0.007* hotkeys + 0.006* Crash + 0.002* God + 0.002* Confusion + 0.002* mailbox
#21 (0.020): 0.035* Command + 0.018* Browser + 0.007* third party + 0.007* Installation + 0.006* console
topic #25 (0.020): 0.064* file + 0.004* constraint + 0.004* Exercise + 0.003* Copy to + 0.003* will do + 0.003* decompile
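
If you don't want to dig through the training log, the topics can also be printed directly. A small sketch (the call follows the gensim API, but argument names have changed across gensim versions, so check yours):

# Print a few topics with their top words
for topic in lda.print_topics(num_topics=10, num_words=7):
    print(topic)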

This step produces allLDA50Topic.mdl and allLDA50Topic.idx; together with the previous three files that makes five files in total. OK, take a break, have a coke, and continue to the next step.

5. Verify the results

Okay, we already used machine learning in part 4; now let's see how well it actually works.

We have saved model and index data for both TFIDF and LDA, so let's take two new articles, find the articles most similar to each of them, and use that to check how reliable the two models are.

I opened a random car website and picked an article about cars (a BMW review), and took one of my own earlier technical articles (about search engines); in each case only a random paragraph was used for testing.

This is an article about the new BMW X1 Li, and I think many BMW fans are ready to comment........

Under normal circumstances, a search engine assumes the index does not change too much, so the index is split into a full index and an incremental index; the full index is generally built daily.......

Ok, the articles are selected; first load the data files saved earlier:

# load the dictionary
dictionary = corpora.Dictionary.load(output + "all.dic")
# Load the TFIDF model and index
tfidfModel = models.TfidfModel.load(output+"allTFIDF.mdl")
indexTfidf = similarities.MatrixSimilarity.load(output + "allTFIDF.idx")
# Load the LDA model and index
ldaModel = models.LdaModel.load(output + "allLDA50Topic.mdl")
indexLDA = similarities.MatrixSimilarity.load(output + "allLDA50Topic.idx")

Then segment the test data, vectorize it with TFIDF and compute similarities, then vectorize with LDA and compute similarities:

# query holds the test text; segment it first
query_bow = dictionary.doc2bow(filter(lambda x: len(x)>0,map(etl,jieba.cut(query,cut_all=False))))
# Vectorize with the TFIDF model
tfidfvect = tfidfModel[query_bow]
# Then vectorize with LDA; since our LDA was trained on top of TFIDF, we feed it tfidfvect
ldavec = ldaModel[tfidfvect]
# TFIDF similarity
simstfidf = indexTfidf[tfidfvect]
# LDA similarity
simlda = indexLDA[ldavec]

All right, that's it. That's all the code. Too easy... Now let's see what comes out.
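
simstfidf and simlda are arrays of similarity scores, one per training document, in corpus order. A small sketch (my own addition, not part of the original script) for pulling out the top 10 of each:

# (position in the corpus, similarity score), highest score first
top_tfidf = sorted(enumerate(simstfidf), key=lambda item: -item[1])[:10]
top_lda = sorted(enumerate(simlda), key=lambda item: -item[1])[:10]
print(top_tfidf)
print(top_lda)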

6. Output results

Let’s look at the TFIDF results first

  • Car test article TFIDF results (3 randomly selected from the top 10 results)

    Preferential car purchase recommendation: BMW X3 discounted 35,000 to 70,000 yuan / Porsche Macan competitiveness analysis (vs. BMW X3) / BMW 2014 new car outlook: more than ten new models

  • Technical test article TFIDF results (3 randomly selected from the top 10 results)

    Golang to write a search engine (0x06) / That little thing about indexes [search engine] / Sphinx introduction and principle exploration

Clearly the results are pretty good: the first set is all about BMWs, and the second is about search engines and indexes. Now let's look at the LDA results. LDA is mainly about topic classification rather than keyword matching, so what we are checking is whether the test articles land in the right category: if the most similar articles happen to be technical ones for the technical test and automotive ones for the car test, the model is doing well.

  • Car test article LDA results (randomly selected 3 out of the top 10 results)

    Editor's pick: the most beautiful mid-level car, the new FAW-Volkswagen CC / 250,000 yuan fashionable quality: luxury compact cars, Mercedes-Benz A-Class / iPhone mobile html5 upload picture orientation problem solved

  • Technical test article LDA results (3 randomly selected from the top 10 results)

    Java multithreading core technology overview (with source code) / spring session principle analysis / concurrent lock file mode

From the results, they are basically reliable, but the car test produced one bad case: "iPhone mobile html5 upload picture orientation problem solved" is a technical article, yet it showed up in the car category.

7. Result analysis

Let's analyze the results. With the TFIDF model, on the existing data set (about 12,000 articles) the recommendations feel strongly relevant; that is exactly where TFIDF is simple and effective, since it extracts the key words of an article very well, so the recommendations feel on point. But it has problems of its own:

  • For short texts (such as Weibo posts), the text is too short for TFIDF to extract important keywords reliably, or it extracts the wrong ones, so the recommendations become unreliable.
  • Judging a word's importance purely by frequency feels incomplete. Take this article: a human reader would say "text similarity" matters most, but TFIDF might well decide that "model" is the most important word. For a purely text-based recommender, this kind of content-based recommendation probably fits vertical sites like SegmentFault best: someone reading an article may want to see similar articles to go deeper into the field, so this algorithm is fairly reliable there. From what I can see, though, SegmentFault actually recommends by tags, which works better but is more manual, and it breaks down if people tag their posts carelessly.

Now look at the LDA model. LDA is mainly used for text clustering, and it is topic-based. Whether it works as a recommendation algorithm depends on the scenario: when the data sample is small its recommendations can look weak (even with a large sample they may not look great), and the granularity is very coarse. But precisely because it is coarse, it is better suited to content discovery. For example, if I am interested in digital gadgets, I am not only interested in the iPhone; anything under the "digital" topic interests me, so LDA can recommend things under that topic to me, which is more useful than only ever showing me more iPhone articles after I read one about the iPhone.

What can we do about a bad case like the one LDA produced in the previous section? Since changing the model itself is unlikely, we can only start from a few other angles:

  • If it is just the occasional one or two, you can live with it.
  • If there are many, first adjust the number of topics, and then there are several LDA parameters that can be tuned (this is where the algorithm engineer earns their keep); a small sketch follows after this list.
  • Another approach is to wash the input data as clean as possible and remove unwanted noise (patience and carefulness are essential skills for an algorithm engineer). So different models matter for different scenarios; choose the right model for your scenario to get the right effect.
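
A sketch of the kind of knobs that can be tuned when retraining (the parameter names follow the gensim LdaModel API; the values are illustrative assumptions, not the settings used in this article):

lda = models.LdaModel(
    tfidfVectors,
    id2word=dictionary,
    num_topics=100,   # more or fewer topics changes the granularity
    passes=5,         # extra passes over the corpus: slower but more stable
    alpha='auto',     # let gensim learn the document-topic prior
)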

8. A few words at the end

This article is only a very basic piece on text similarity, meant to give the most intuitive feel for the TFIDF model and the LDA model; and yes, along the way we used the currently hottest machine learning technology.

In fact, models like LDA and word2vec are highly abstract mathematically and largely detached from any particular scenario; they are pure math, so they don't have to be used only for text processing. They are just as useful for traffic analysis or user behavior analysis. That is what an algorithm engineer should be thinking about: how to apply a good algorithm to an existing scenario.

Imagine we want to group our users and see which of them have similar interests. We could actually do it this way:

  • First, suppose we have a pile of browsing behavior data, where each record is a user clicking a link or clicking a button.
  • Group these records by user, so that each row of the new data is the sequence of one user's actions, in time order, something like User A: [browse page A, click button B, browse page C, ...].
  • Now, using one of the essential skills of an algorithm engineer, imagination, we treat each user's behavior sequence as an article and each action as a word, and run LDA over it (a small sketch follows below). The resulting topics are then, in effect, user categories: users with similar behavior fall under the same topic, so the users are grouped, and arguably users in the same group have similar interests. If you think this works, try it on your company's user data and see what happens :)
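
A minimal sketch of that idea (the data and event names below are made up for illustration):

from gensim import corpora, models

# Each user's action sequence is one "document"; each action is one "word"
user_actions = {
    "user_a": ["view_page_a", "click_button_b", "view_page_c"],
    "user_b": ["view_page_a", "view_page_c", "click_button_d"],
}

docs = list(user_actions.values())
behavior_dict = corpora.Dictionary(docs)
behavior_corpus = [behavior_dict.doc2bow(doc) for doc in docs]

# The topics now play the role of "user groups"
user_lda = models.LdaModel(behavior_corpus, id2word=behavior_dict, num_topics=5)
for user, actions in user_actions.items():
    print(user, user_lda[behavior_dict.doc2bow(actions)])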

9. Postscript to the postscript

Finally, all of the code is on GitHub, and you can see it there. The code is fairly simple, no more than 200 lines, and the core is what I listed above. There is also the code and usage for word2vec, which I won't cover in this article.

If you want to play with this yourself, you can get a corpus from Wikipedia, which opens all of its data to the world for analysis, including Chinese; the address is dumps.wikimedia.org/zhwiki/late… . But Wikipedia doesn't have that much Chinese text; the big Chinese corpus is Baidu Baike, which, ha ha, is not only closed but also anti-crawler and anti-hotlinking. Still, here is an address for 100GB of raw Baidu Baike pages: pan.baidu.com/s/1i3wvfil, extraction password: neqs, courtesy of Penny Liang, the second crawler king of Asia.

Well, today's article is a bit long, so let's stop here. The algorithm series will be set aside for a while: work has been too busy recently, and with this phase wrapped up I will come back to algorithms later, since there will be some more interesting algorithms to use at work. The next articles will mainly be about system architecture, and my own search engine hasn't been finished yet because I'm too busy, so it will be a while, sorry :) but rest assured, this series will not be left unfinished.


Welcome to follow my WeChat official account, which mainly talks about search, recommendation, and advertising technology, plus some rambling. Articles are posted there first :) scan the QR code or search for the WeChat ID XJJ267 or the account name.