Article content correlation statistics

As the straight left

 

For content relevance, the only approach I can think of is to compare the tags extracted from the two articles.

A set of high-frequency words, that is, tags, can be extracted from each article and stored in the database. Tags with the highest frequency are stored first, and tags with the lowest frequency are stored last. Assume the tag table is structured as follows:

PageTag

Field    Meaning          Type
Id                        INT
TagId    The tag ID       INT
PageId   The article ID   INT

 

Within a single article (same PageId), a tag with a higher frequency must have a smaller Id than a tag with a lower frequency.

The idea behind the comparison is that the more tags two articles share, and the higher the frequency of those shared tags, the more similar the two articles are. This seems to be what is usually called a “weight”, although I do not know how weights are formally defined.
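The storage convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `top_tags` and `page_tag_rows`, the naive word-splitting rule, and the `tag_ids` lookup are all hypothetical and not part of the original design. It simply shows rows being generated so that, within one article, more frequent tags receive smaller Id values.

```python
from collections import Counter
import re


def top_tags(text, limit=10):
    """Return up to `limit` words from `text`, most frequent first (naive split)."""
    words = re.findall(r"[a-z]+", text.lower())
    # most_common() sorts by count descending; ties keep first-seen order.
    return [word for word, _ in Counter(words).most_common(limit)]


def page_tag_rows(page_id, tags, tag_ids, next_id):
    """Build (Id, TagId, PageId) rows; earlier (more frequent) tags get smaller Ids."""
    return [(next_id + i, tag_ids[tag], page_id) for i, tag in enumerate(tags)]


# Hypothetical data: a tiny "article" and a made-up word-to-TagId lookup.
tags = top_tags("sql tags sql weight tags sql", limit=3)
rows = page_tag_rows(1, tags, {"sql": 7, "tags": 8, "weight": 9}, next_id=100)
```

A real extractor would need stop-word filtering and language-aware tokenization. The only point here is the ordering of Id values within a single PageId.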

The SQL statement is as follows:

SELECT a.PageId, SUM(b.Row) AS Weight
FROM PageTag AS a,
     (SELECT TagId, ROW_NUMBER() OVER (ORDER BY Id DESC) AS Row
      FROM PageTag
      WHERE PageId = @PageId) AS b   -- @PageId: the ID of the article being compared
WHERE a.PageId <> @PageId            -- exclude the article itself
  AND a.TagId = b.TagId
GROUP BY a.PageId

The result is a list of articles and the corresponding similarity. The greater the Weight, the higher the similarity.

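ROW_NUMBER() with ORDER BY Id DESC numbers the tags of the given article so that the highest-frequency tag (the one with the smallest Id) gets the largest number. SUM(Row) AS Weight then adds up these numbers for every tag that another article shares with it. In addition, if two articles have many identical tags, the sum is higher, so Weight reflects both the frequency and the number of shared tags.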
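To make the arithmetic concrete, the same calculation can be written in plain Python. This is a sketch rather than a replacement for the query: the function name `similarity_weights` and the sample rows are made up for illustration, and it assumes each (PageId, TagId) pair appears at most once in PageTag.

```python
from collections import defaultdict


def similarity_weights(rows, page_id):
    """rows: iterable of (Id, TagId, PageId) tuples mirroring the PageTag table."""
    # Number the reference article's tags from 1 in descending Id order,
    # like ROW_NUMBER() OVER (ORDER BY Id DESC): the smallest Id gets the largest number.
    own = sorted((r for r in rows if r[2] == page_id), key=lambda r: r[0], reverse=True)
    rank = {tag_id: n for n, (_, tag_id, _) in enumerate(own, start=1)}

    # For every other article, sum the numbers of the tags it shares.
    weights = defaultdict(int)
    for _, tag_id, other_page in rows:
        if other_page != page_id and tag_id in rank:
            weights[other_page] += rank[tag_id]
    return dict(weights)


# Hypothetical sample data.
rows = [
    (1, 10, 1), (2, 11, 1), (3, 12, 1),  # article 1: tags 10, 11, 12 (10 is most frequent)
    (4, 10, 2), (5, 12, 2),              # article 2 shares tags 10 and 12
    (6, 12, 3), (7, 13, 3),              # article 3 shares only tag 12
]
weights = similarity_weights(rows, 1)
```

For article 1, tags 10, 11 and 12 are numbered 3, 2 and 1. Article 2 therefore scores 3 + 1 = 4 and article 3 scores 1, so article 2 ranks as more similar.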

Note: This algorithm does not perform well in practice. Besides the tag extraction itself not being very accurate, the algorithm is flawed. For example, a very long article may yield more than 10 extracted tags, while a short article yields only 2 or 3. In that case the weights of long articles are generally greater than those of short articles, which introduces a statistical bias.

I offer this only as a rough starting point, in the hope that it prompts better ideas.
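One way to see where this bias comes from is to compute the largest Weight the query can return for an article with a given number of tags. The helper name `max_weight` below is hypothetical. The formula follows from the row numbers running from 1 to n, so a perfect match sums to n(n+1)/2.

```python
def max_weight(tag_count):
    """Largest Weight obtainable when the reference article has `tag_count` tags."""
    # Row numbers run 1..tag_count, so matching every tag sums to n(n+1)/2.
    return tag_count * (tag_count + 1) // 2


long_article = max_weight(10)   # reference article with 10 tags
short_article = max_weight(3)   # reference article with 3 tags
```

A 10-tag article can reach a Weight of 55, while a 3-tag article can never exceed 6, so raw weights for articles of different lengths are not directly comparable.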