Text similarity computing (2)

I won't get into word vectors in this article; they don't have much to do with the two models covered here, but the word2vec code is included at the end.

0. Prepare tools

As the old saying goes, to do a good job one must first sharpen one's tools, so let's get ours ready. Here we use the Python gensim toolkit (radimrehurek.com/gensim/inde…). It is very powerful and I won't walk through its features one by one; it has everything we need, it is very simple to use, and it can even be deployed in a distributed fashion. If you are interested, the official site has a detailed introduction.

Why not write it myself? That question.... ha ha... ha ha... I can't.....

As for installation, you need Python 2.6 or above (obviously), NumPy 1.3 or above, and SciPy 0.7 or above; the last two are Python scientific computing packages.

Installing with easy_install is straightforward, so I won't go over it here. It might be a little tricky on Windows, but I haven't used Windows in a long time; on my machine it took three or four commands. The official gensim documentation also has installation instructions, and if you get stuck, Google or Baidu will always turn up a solution.

Besides gensim, there is one more package to install for word segmentation: jieba, which is also easy to install.

1. Data preparation

Data preparation is skilled work. My professional ethics are high, so I didn't use the company's data and had to find my own; grabbing a ready-made corpus off the internet seemed too cheap, so I crawled some data myself.

First I aimed at the country's currently most popular full-stack technology site (SegmentFault, of course), then at an automotive website, and started crawling with a crawler I wrote myself. I think it turned out pretty well: the scheduler, the fetcher, and the parser are separate plug-ins, so you can write them in any language, Chrome can even be hooked in directly to crawl pure-JS single-page sites, and it supports a proxy pool. If you are interested I can write about the crawler another time; it is distributed, so you can add machines to increase crawling capacity.

Here's a quick story. When I was crawling SegmentFault, I first went through the list pages of all the articles. Originally I only wanted to pull the titles, but the site didn't seem to react at all, so I got greedy and started crawling the detail pages without the proxy pool. My IP got blocked after a bit more than 6,000 articles. So now I can only reach SegmentFault over a VPN. A VPN!! Routing through Singapore just to visit a Hangzhou website!! If any administrator or operations person sees this article, please unblock me; I mean no harm, or I wouldn't be writing about it here.

Also, as an old programmer and a back-end person, writing back-end stuff in a front-end-oriented community full of post-90s developers means not many people will read it; I should probably be posting on OSChina, CNBlogs, or CSDN instead. What kind of spirit is this? The spirit of internationalism! The truth is I really like SegmentFault's color scheme, its Markdown rendering is excellent, and it looks great on a phone too, so I write here. Few readers is fine; it's mainly for myself anyway.

OK, enough gossip. With two websites crawled we can get to work. The reason for crawling two different types of websites is to help illustrate the LDA topic model later on.

2. Clean the data

After the data is crawled, the next step is data cleaning. As I said in my previous article on machine learning skills, data cleaning is an essential skill for an algorithm engineer; without good data, it doesn't matter how good the algorithm is.

Once you have the data, write a script to do the following:

  • First, extract the title, the author, the time, and so on; these come out easily with regular expressions or XPath.

  • Then strip out the HTML tags with a pile of regexes; what remains is basically the body text. SegmentFault also handles the code in article content specially, so the code blocks were useless to me anyway.

  • Finally, strip out the punctuation and special symbols and tidy up the formatting, so that each record ends up in the form below (a rough sketch of such a cleaning script appears a little further down).

ID [TAB]TITLE [TAB]CONTENT

There are 11,628 articles in total, about 6,000 automotive and about 6,000 technology (SegmentFault) articles. Well, the data is basically clean.
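
For illustration only, a cleaning script along those lines might look roughly like this. The regexes, the clean/write_corpus helpers, and the input format are my own assumptions, not the author's actual code:

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of the cleaning step: strip HTML tags, collapse whitespace,
# and write one "ID\tTITLE\tCONTENT" line per article.
import re

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag remover
SPACE_RE = re.compile(r"\s+")

def clean(raw_html):
    """Strip HTML tags and collapse runs of whitespace."""
    return SPACE_RE.sub(" ", TAG_RE.sub(" ", raw_html)).strip()

def write_corpus(articles, path="./data/all.txt"):
    """articles: an iterable of (doc_id, title, raw_html_content) tuples (hypothetical input)."""
    with open(path, "w") as out:
        for doc_id, title, raw_content in articles:
            out.write("%s\t%s\t%s\n" % (doc_id, clean(title), clean(raw_content)))
```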

4. Training data

Everyone probably thinks this section is the hard part, but with jieba and gensim the code is actually very simple, no more than 50 lines. Let's go through it step by step.

4.1 Word segmentation — build a dictionary — prepare numerical corpus

Word segmentation is the foundation, so the first step is to segment the text:

```python
from gensim import corpora, models, similarities, utils
import jieba
import jieba.posseg as pseg

jieba.load_userdict("User_di.txt")  # load a custom dictionary

train_set = []
f = open("./data/all.txt")
lines = f.readlines()
for line in lines:
    # column 2 is the content, column 1 is the title
    content = (line.lower()).split("\t")[2] + (line.lower()).split("\t")[1]
    # cut words; the etl function strips useless symbols
    word_list = filter(lambda x: len(x) > 0, map(etl, jieba.cut(content, cut_all=False)))
    train_set.append(word_list)
f.close()
```
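
The etl function used above is not shown here; a minimal version that simply drops anything that is not a Chinese character, a letter, or a digit might look like this (my own assumption, not the original implementation):

```python
# -*- coding: utf-8 -*-
import re

def etl(token):
    # Hypothetical etl(): keep Chinese characters, letters and digits, drop everything else.
    return re.sub(u"[^\u4e00-\u9fa5a-zA-Z0-9]", "", token)
```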

train_set is now the raw corpus. Next, feed it into a Dictionary to build the dictionary:

```python
dictionary = corpora.Dictionary(train_set)
dictionary.filter_extremes(no_below=1, no_above=1, keep_n=None)
dictionary.save(output + "all.dic")  # save the dictionary; it is loaded again later as all.dic
```

Once the corpus is imported into the dictionary, every word is assigned a number 1, 2, 3, …; this numbering is the first step of vectorization, and then the dictionary is saved. The numerical corpus is generated next:

```python
corpus = [dictionary.doc2bow(text) for text in train_set]
```

The corpus variable is now two-dimensional: each row represents one document and records the number and frequency of every word in it, so each row looks like this:

[(1, 2), (2, 4), (5, 2), …], meaning the word numbered 1 appears twice, the word numbered 2 appears four times, and so on.
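
To make the numbering concrete, here is a tiny toy example (the words and numbers are mine, purely for illustration):

```python
from gensim import corpora

toy_docs = [["car", "engine", "car"], ["engine", "index", "search", "search"]]
toy_dict = corpora.Dictionary(toy_docs)
print(toy_dict.token2id)              # e.g. {'car': 0, 'engine': 1, 'index': 2, 'search': 3}
print(toy_dict.doc2bow(toy_docs[0]))  # [(0, 2), (1, 1)]: word 0 appears twice, word 1 once
```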

OK, the preparation is done: cut words -> build the dictionary -> generate the corpus, and the original articles have been fully digitized. The rest is straightforward.

4.2 The TFIDF model

With the numerical corpus in hand, we can generate a TFIDF model:

```python
tfidfModel = models.TfidfModel(corpus)
tfidfModel.save(output + "allTFIDF.mdl")  # save the model
```

This line is the key step: from the raw numerical corpus we generate a TFIDF model. What can this model do? Gensim overloads the [] operator, so we can pass in a raw vector like [(1, 2), (2, 4), (5, 2), …] and get back a TFIDF vector like [(1, 0.98), (2, 0.23), (5, 0.56), …], which says the word numbered 1 is more important than the two that follow. This TFIDF vector can also serve as the raw input for the LDA model below.
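
Continuing the script above, converting a single bag-of-words vector looks like this (the weights shown are only illustrative):

```python
# corpus[0] is the bag-of-words vector of the first article, e.g. [(1, 2), (2, 4), ...]
print(tfidfModel[corpus[0]])  # e.g. [(1, 0.98), (2, 0.23), ...]: TFIDF weights instead of raw counts
```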

Next we convert the entire corpus into TFIDF vectors and store them as an index, so we can search against them later:

```python
# Convert all corpus vectors into TFIDF form; tfidfModel accepts the whole two-dimensional corpus
tfidfVectors = tfidfModel[corpus]
# Build the similarity index and save it
indexTfidf = similarities.MatrixSimilarity(tfidfVectors)
indexTfidf.save(output + "allTFIDF.idx")
```

So far we have a model file (allTFIDF.mdl) and a TFIDF vector index over the whole corpus (allTFIDF.idx). Together with the dictionary file from before (all.dic), that is three files. We will see how to use them later; now let's move on to LDA.

4.3 The LDA model

The previous article said a great deal about LDA; from gensim's point of view it is just the few lines of code below, and yes, this is the legendary machine learning. Let's just say gensim wraps it almost too succinctly.

```python
# Generate the LDA model from the TFIDF vectors.
# id2word is the dictionary that maps ids back to words; num_topics is the number of topics.
# We use 50 here; with too many topics the training time becomes unbearable.
lda = models.LdaModel(tfidfVectors, id2word=dictionary, num_topics=50)
lda.save(output + "allLDA50Topic.mdl")  # save the model

# Convert all TFIDF vectors into LDA vectors
corpus_lda = lda[tfidfVectors]

# Build the index and save it
indexLDA = similarities.MatrixSimilarity(corpus_lda)
indexLDA.save(output + "allLDA50Topic.idx")
```

It's only three steps, but it takes quite a while, and if you have logging turned on you can watch what's going on. I picked out a few of the topics below; obviously the first few are car-related and the last few are technology-related, so it looks fairly reasonable.

```
topic #38 (0.020): 0.003*novel + 0.003*jun + 0.002*piece + 0.002*urban + 0.002*except + 0.001*ong + 0.001*coldwell
topic #27 (0.020): 0.003*configuration + 0.003*interior + 0.003*model + 0.002*airbag + 0.002*lucky + 0.002*ten thousand yuan + 0.002*yizhi
topic #0 (0.020): 0.004*pentium + 0.003*acceleration + 0.003*carnival + 0.002*throttle + 0.002*elysee + 0.002*sec + 0.004*tiger + 0.004*saab + 0.004*orono + 0.002*lexus + 0.002*model + 0.002*letto
topic #26 (0.020): 0.009*list + 0.009*stream + 0.007*hotkeys + 0.006*crash + 0.002*god + 0.002*confusion + 0.002*mailbox
topic #21 (0.020): 0.035*command + 0.018*browser + 0.007*third-party + 0.007*install + 0.006*console
topic #25 (0.020): 0.064*files + 0.004*constraints + 0.004*exercises + 0.003*copy to + 0.003*will do + 0.003*decompile
```
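
You don't have to dig through the training log to see these, by the way; gensim can also print topics directly. A small sketch (note the parameter names vary slightly between gensim versions):

```python
# Print a few topics with their top words, instead of reading the training log.
for topic in lda.print_topics(num_topics=10, num_words=7):
    print(topic)
```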

This produces allLDA50Topic.mdl and allLDA50Topic.idx; together with the previous three files, that makes five files in total. OK, take a break, have a coke, and then on to the next step.

5. Verify the results

Okay, we’ve already used machine learning in part 4, and now we’re going to see how well it works.

We have saved the model and vector index data for both TFIDF and LDA, so let's take two new articles and see which existing articles are most similar to them, to check how reliable the two models are.

I opened a random car website and picked an article about cars (a BMW review), then took one of my own earlier technology articles (about search engines), and randomly selected just one paragraph from each for testing.

This is an article about the new BMW X1 Li, and I think many BMW fans are ready to comment........

Under normal circumstances, the default search engine will think that the index is not too much change, so the index is divided into full index and incremental index two parts, full index is generally day.......

OK, the test articles are chosen. First, load the data files saved earlier:

```python
dictionary = corpora.Dictionary.load(output + "all.dic")
tfidfModel = models.TfidfModel.load(output + "allTFIDF.mdl")
indexTfidf = similarities.MatrixSimilarity.load(output + "allTFIDF.idx")
ldaModel = models.LdaModel.load(output + "allLDA50Topic.mdl")
indexLDA = similarities.MatrixSimilarity.load(output + "allLDA50Topic.idx")
```

Then segment the test text, turn it into a TFIDF vector and query the TFIDF index, then turn that into an LDA vector and query the LDA index:

```python
# Segment the query text and turn it into a bag-of-words vector
query_bow = dictionary.doc2bow(filter(lambda x: len(x) > 0, map(etl, jieba.cut(query, cut_all=False))))
# Convert the bag-of-words vector into a TFIDF vector
tfidfvect = tfidfModel[query_bow]
# The LDA model was trained on TFIDF vectors, so feed it the TFIDF vector
ldavec = ldaModel[tfidfvect]
# TFIDF similarity against all indexed articles
simstfidf = indexTfidf[tfidfvect]
# LDA similarity against all indexed articles
simlda = indexLDA[ldavec]
```
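
To turn those similarity arrays into the top-10 lists shown in the next section, you would sort by score and map the indices back to the original articles. A sketch (how you map an index back to a title depends on how you stored all.txt, so treat this as an assumption):

```python
# simstfidf and simlda each hold one similarity score per indexed article,
# in the same order as the articles in all.txt.
top10_tfidf = sorted(enumerate(simstfidf), key=lambda item: -item[1])[:10]
top10_lda = sorted(enumerate(simlda), key=lambda item: -item[1])[:10]
for doc_idx, score in top10_tfidf:
    print(doc_idx, score)  # doc_idx maps back to the article order in all.txt
```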

All right, that’s it. That’s all the code. Too easy… So let’s see what happens.

6. Output results

Let’s look at the TFIDF results first

  • Car test article TFIDF results (3 randomly selected from the top 10 results)

Preferential car purchase recommendation: BMW X3 discounted 35,000 to 70,000 yuan
Porsche Macan competitiveness analysis vs. BMW X3
BMW 2014 new car outlook: more than ten new models

  • Technical test article TFIDF results (3 randomly selected from the top 10 results)

Writing a search engine in Golang (0x06): a bit about indexes
[Search engine] Sphinx introduction and principle exploration

Obviously, the results are pretty good: the first set is all about BMWs, and the second is about search engines and indexes.

Now let's look at the LDA results. LDA is mainly for topic-level classification rather than keyword matching, so what we check is whether the test articles are classified correctly: are the most similar articles returned indeed automotive for the car test and technical for the tech test? If so, the model is doing its job.

  • Car test article LDA results (randomly selected 3 out of the top 10 results)

The most beautiful mid-size car in the editor's mind: the new FAW-Volkswagen CC
250,000-yuan fashionable quality: 4 luxury compact cars, Mercedes-Benz A-Class
iPhone html5 … (the bad case discussed below)

  • Technical test article LDA results (3 randomly selected from the top 10 results)

Java multithreading core technology overview (with source code)
springsession principle analysis
Concurrent lock / file mode

Judging from the results it is basically reliable, but there is one bad case in the car results: "iPhone mobile html5 upload picture direction problem solved" is a technical article, yet it shows up among the car results.

7. Result analysis

Let's analyze the results. For the TFIDF model, on the existing data set (about 12,000 articles) the recommendations feel strongly relevant, and that is exactly where TFIDF shines: it is simple and effective, and it extracts an article's key words very well, which is why the recommendations feel so on point. But it has problems of its own.

  • For short texts (such as Weibo posts), the text is so short that TFIDF struggles to extract the important keywords, or extracts the wrong ones, which makes the recommendations unreliable.

  • Judging a word's importance purely by term frequency is not comprehensive. For example, in this article a human reader would say "text similarity" is the most important term, but TFIDF might well decide that "model" is. For a purely text-based recommendation system, text-similarity recommendations suit vertical sites such as SegmentFault: someone reading an article there probably wants to see similar articles and go deeper into the field, so this algorithm is fairly reliable. From what I can see, though, SegmentFault actually uses tag-based recommendation, which works well but is more manual, and it gets messy if authors tag their posts carelessly.

Now look at the LDA model. LDA is mainly used for text clustering and it is topic-based; whether it works as a recommendation algorithm depends on the scenario. When the data sample is small, its recommendations can look poor, and even with enough data they can still look coarse-grained. But precisely because the granularity is coarse, it is better suited to content discovery. For example, if I am interested in digital gadgets, I am not interested only in the iPhone; anything under the "digital" topic interests me, so LDA can recommend things from the whole digital topic to me, which is far more useful than pushing yet another iPhone article just because I read one.

What do we do when LDA produces a bad case like the one in the previous section? Since we are unlikely to change the model itself, we can only work on a few other things.

  • If it’s just the occasional one or two, you can put up with it.

  • If there are many, you can only adjust the number of topics first, and then there are some parameters in the LDA that can be adjusted.

  • Another way is to clean the input data as thoroughly as possible and remove useless impurities (patience and care are essential skills for an algorithm engineer). In short, choosing the right model for your scenario matters a great deal; only the right model for the right scenario gives the right results.

8. A few words at the end

This article covers only the most basic side of text similarity, but it gives the most intuitive feel for the TFIDF and LDA models, and along the way we even used the hottest machine learning technology.

In fact, models like LDA and word2vec are mathematical abstractions that are largely decoupled from any concrete scenario; they are pure math, so they do not have to be used only for text processing. They can be just as useful for traffic analysis or user behavior analysis. This is what an algorithm engineer should be thinking about: how a good algorithm can be applied to the scenarios you already have.

Imagine if we wanted to categorize our users and see which ones have similar interests. We can actually do it this way:

  • First, if we have a bunch of browsing behavior data, each piece of data records the user clicking on a link, or clicking on a button.

  • Group these browsing events by the user dimension; each record in the new data set is then one user's action sequence, ordered by time, something like user A: [viewed page A, clicked button B, viewed page C, …].

  • Now apply one of the essential skills of an algorithm engineer: imagination. Treat each user's behavior sequence as an article and each action as a word, run LDA, and the resulting topics are effectively user categories: users with similar behavior land under the same topic, so they end up grouped together. Can we then say that users in the same group have similar interests? If you think this works, try it on your company's user data and see 🙂 (a minimal sketch of the idea follows this list).
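
Here is a minimal sketch of that idea; every event name and the topic count are hypothetical, just to show the shape of it:

```python
from gensim import corpora, models

# Each "document" is one user's action sequence; each "word" is one event (hypothetical names).
user_actions = {
    "user_a": ["view_page_a", "click_button_b", "view_page_c"],
    "user_b": ["view_page_c", "click_button_b", "view_page_d"],
    "user_c": ["view_page_x", "click_button_y"],
}

behavior_dict = corpora.Dictionary(user_actions.values())
behavior_corpus = [behavior_dict.doc2bow(actions) for actions in user_actions.values()]

# Cluster users into a handful of "interest" topics.
user_lda = models.LdaModel(behavior_corpus, id2word=behavior_dict, num_topics=5)
for user, actions in user_actions.items():
    print(user, user_lda[behavior_dict.doc2bow(actions)])  # topic distribution per user
```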

9. Postscript

Finally, all of the code is on GitHub; you can find it there. The code is fairly simple, no more than 200 lines, and the core of it is what I listed above. The repository also has the word2vec code and how to use it, which I won't go into in this article.

If you want to play with this yourself, you can get a corpus from Wikipedia; they open all of their data to the world, including Chinese, at dumps.wikimedia.org/zhwiki/late… . Wikipedia's Chinese corpus is not that large, though; the big Chinese corpus is Baidu Baike, which, ha ha, is not only closed but also anti-crawler and anti-leech. Still, here is an address for 100 GB of raw Baidu Baike pages: pan.baidu.com/s/1i3wvfil, extraction password: neqs, courtesy of Penny Liang, Asia's second-greatest crawler master.

Well, today's article ran a bit long, so I'll stop here. I'm going to set the algorithm series aside for a while; work has been too busy lately, and now that this stretch is over I'll come back to algorithms later, since there are some more interesting algorithms in use at work now. The next articles will mainly be about system architecture, and my own search engine will have to wait a while too because I don't have time to finish it, sorry 🙂 But rest assured, I won't leave things unfinished.

Welcome to follow my WeChat official account, where I mainly talk about search, recommendation, and advertising technology, plus some nonsense; articles get posted there first 🙂 Scan the QR code or search for the WeChat ID XJJ267.