Article content correlation statistics
As the straight left
For content relevance, the only approach I can think of is to compare the tags extracted from the two articles.
Each article yields a set of high-frequency words, its tags, which are stored in the database. Tags are stored in descending order of frequency: the most frequent first, the least frequent last. Assume the tag table is structured as follows:
PageTag
field | meaning | type |
---|---|---|
Id | The row ID | INT |
TagId | The tag ID | INT |
PageId | The article ID | INT |
Within a single article (same PageId), a tag with a higher frequency always has a smaller Id than a tag with a lower frequency.
The idea behind the comparison is that the more tags two articles share, and the higher the frequency of those shared tags, the more similar the two articles are. There seems to be a concept called "weights" involved here, though I do not know much about it.
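The storage rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say how tags are extracted, so `extract_tags` uses a plain word count, and `page_tag_rows`, `tag_ids`, and `next_id` are hypothetical names standing in for whatever assigns the Id and TagId values in the real database.

```python
import re
from collections import Counter

def extract_tags(text, max_tags=10, min_length=3):
    """Return the most frequent words of `text`, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_tags]]

def page_tag_rows(page_id, tags, tag_ids, next_id):
    """Build (Id, TagId, PageId) rows; Id grows as tag frequency drops."""
    rows = []
    for offset, tag in enumerate(tags):
        # Hypothetical TagId assignment: reuse a known ID or take the next one.
        tag_id = tag_ids.setdefault(tag, len(tag_ids) + 1)
        rows.append((next_id + offset, tag_id, page_id))
    return rows

text = "sql index sql query index sql cache"
tags = extract_tags(text, max_tags=3)
tag_ids = {}
rows = page_tag_rows(page_id=1, tags=tags, tag_ids=tag_ids, next_id=100)
for row in rows:
    print(row)
```

Because the rows are generated in frequency order, the most frequent tag of an article always receives the smallest Id, which is the property the comparison relies on.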
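Expressed in SQL:
-- @PageId is the ID of the reference article
SELECT a.PageId, SUM(b.RowNum) AS Weight
FROM PageTag AS a,
     -- Rank the reference article's tags: the most frequent tag (smallest Id) gets the largest number
     (SELECT TagId, ROW_NUMBER() OVER (ORDER BY Id DESC) AS RowNum
      FROM PageTag
      WHERE PageId = @PageId) AS b
WHERE a.PageId <> @PageId
  AND a.TagId = b.TagId
GROUP BY a.PageId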
The result is a list of articles and the corresponding similarity. The greater the Weight, the higher the similarity.
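ROW_NUMBER() numbers the tags of the reference article so that a more frequent tag receives a larger number, and Weight is the sum of these numbers over the tags the two articles share. Sharing high-frequency tags therefore raises the Weight, and sharing many tags raises it further because more numbers are added, so the Weight reflects both the frequency and the count of the shared tags.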
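To make the calculation concrete, here is a small Python sketch that mirrors the query on in-memory rows. It is an illustration, not a replacement for the SQL: the function name `similarity_weights` and the sample rows are made up, and it assumes each tag appears at most once per article.

```python
from collections import defaultdict

def similarity_weights(rows, page_id):
    """rows: iterable of (Id, TagId, PageId). Returns {PageId: Weight}."""
    # Tags of the reference article, ordered by Id descending.
    ref = sorted((r for r in rows if r[2] == page_id),
                 key=lambda r: r[0], reverse=True)
    # Same numbering as ROW_NUMBER() OVER (ORDER BY Id DESC):
    # the least frequent tag gets 1, the most frequent gets the largest number.
    rank = {tag_id: n for n, (_, tag_id, _) in enumerate(ref, start=1)}

    weights = defaultdict(int)
    for _, tag_id, other_page in rows:
        # Sum the numbers of the tags that other articles share with the reference.
        if other_page != page_id and tag_id in rank:
            weights[other_page] += rank[tag_id]
    return dict(weights)

rows = [
    (1, 10, 1), (2, 11, 1), (3, 12, 1),  # article 1: tag 10 most frequent, 12 least
    (4, 10, 2), (5, 12, 2),              # article 2 shares tags 10 and 12
    (6, 11, 3),                          # article 3 shares tag 11
    (7, 99, 4),                          # article 4 shares nothing
]
weights = similarity_weights(rows, page_id=1)
print(weights)
```

Article 2 scores 3 + 1 = 4 and article 3 scores 2; article 4 does not appear at all, just as it would be missing from the query result.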
Note: This algorithm does not work well in practice. Besides the limited accuracy of the extracted tags, the algorithm itself is flawed. For example, a very long article may yield more than 10 tags, while a short article yields only 2 or 3. In that case the Weight of long articles is generally greater than that of short ones, which introduces a statistical bias.
I offer this rough idea in the hope that it draws out better ones.
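The bias is easy to reproduce with made-up tag lists. The sketch below recomputes the Weight in plain Python for one long and one short candidate article. The last function, `weight_per_tag`, is not part of the original algorithm; it is only one hypothetical way to compensate, dividing by the number of tags the candidate article has.

```python
def weight(reference_tags, candidate_tags):
    """reference_tags is ordered most frequent first, as in the PageTag table."""
    n = len(reference_tags)
    candidate = set(candidate_tags)
    # The most frequent reference tag counts n, the least frequent counts 1.
    return sum(n - i for i, tag in enumerate(reference_tags) if tag in candidate)

def weight_per_tag(reference_tags, candidate_tags):
    """Hypothetical normalization: average Weight per candidate tag."""
    return weight(reference_tags, candidate_tags) / len(set(candidate_tags))

reference = ["a", "b", "c", "d"]
long_article = ["a", "b", "c"] + [f"x{i}" for i in range(7)]  # 10 tags, 3 shared
short_article = ["a", "b"]                                    # 2 tags, both shared

print(weight(reference, long_article), weight(reference, short_article))
print(weight_per_tag(reference, long_article), weight_per_tag(reference, short_article))
```

Every tag of the short article matches the reference, yet the long article still gets the larger raw Weight simply because it has more tags to match with. Dividing by the tag count reverses the order in this example, but whether that is the right correction for real data is an open question.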