Introduction to recommendation engine algorithm learning: Collaborative filtering, clustering, classification

 

The Method of Structure, the Way of Algorithms

 

Introduction

Yesterday I came across several keywords: semantic analysis, collaborative filtering, intelligent recommendation, and I got excited. So from yesterday afternoon until 3 o'clock this morning, I studied recommendation engines and gained a preliminary understanding. In the future I will study the topic slowly and thoroughly (my future work will also be related to it), and of course this article will gradually be expanded and improved.

As a preliminary, introductory piece on recommendation engines, this article cuts out most of the details and focuses on using the simplest language to briefly introduce how a recommendation engine works and the ideas behind its relevant algorithms. To keep things simple, it quotes some text I posted on Weibo on January 7 (specially organized here for convenient reading in the future). I tried to keep this article short; however, contrary to my wishes, as it was supplemented and improved, it grew longer and longer.

At the same time, each of the algorithms mentioned here will be elaborated one by one in future articles; this piece is only a brief introduction, with the specifics to come later. If you have any questions, please feel free to comment or criticize. Thank you.

 

1. Recommendation engine principle

A recommendation engine does its utmost to collect information about users and their behavior, casting a wide net, so to speak, to fish more often, and then delivers "special love for special you" based on similarity. The principle is shown in the figure below (figure from one of the references of this article: Exploring the Inner Secrets of the Recommendation Engine):

 

2. Classification of recommendation engines

Recommendation engines are classified according to the following criteria:

  1. According to whether different recommendations are made for different users based on their data, engines divide into recommendation based on mass behavior (items recommended by site administrators, or the most popular items computed from the aggregated feedback of all users) and personalized recommendation engines (which help you find like-minded, similar-tasted friends, and then make recommendations on that basis).
  2. According to the data source, engines divide into demographic-based (users similar in age or gender), content-based (items sharing the same keywords and tags, taking no human factors into account), and collaborative-filtering-based recommendation (finding relevance among items, content, or users; divided into three subclasses, described below).
  3. According to how the recommendation model is built, engines divide into item-and-user-based (a two-dimensional user-item matrix describing user preferences, clustering algorithms), association-rule-based (the Apriori algorithm is the most influential algorithm for mining frequent itemsets of Boolean association rules), and model-based recommendation (machine learning, that is, making a computer learn continuously like a human brain, is a sub-field of artificial intelligence).

As for the collaborative-filtering-based recommendation in the second classification above (by data source): with the development of Web 2.0, websites increasingly encourage user participation and user contribution, so recommendation mechanisms based on collaborative filtering came into being. The principle is very simple: based on users' preferences for items or information, find the relevance of the items or content themselves, or find the relevance among users, and then make recommendations based on that relevance.

The recommendation based on collaborative filtering can be divided into three subcategories:

  1. User-based recommendation (find similar neighbors through common tastes and preferences, the k-nearest-neighbor algorithm; your friends like it, so you might like it too),
  2. Item-based recommendation (find similarity between items and recommend similar items; you like item A, item C is similar to A, so you may also like C),
  3. Model-based recommendation (construct a recommendation model from sample user preference data, then predict recommendations from real-time user preference data).

We can see that collaborative filtering makes maximum use of the similarity between users or between items and then recommends on the basis of that information. Collaborative filtering is described in more detail below.

However, in general practice, we usually divide recommendation engines into two categories:

  • The first type is collaborative filtering: collaborative filtering based on similar users (all the information and traces left by users' interaction with the system or the Internet, i.e., the numerous connections among users) and collaborative filtering based on similar items (finding the similarity between items as far as possible);
  • The second type is recommendation based on content analysis (surveys, emails, or analysis of blog content by the recommendation engine).

 

3. Recommendation mechanism of Sina Weibo

In Sina Weibo's friend-recommendation mechanism: 1. I am not friends with A, but many of my friends are, i.e., A and I have many common friends, so the system recommends A to me (Sina calls these "mutual friends"); 2. Many of the people I follow have followed B, so the system presumes I may also like B and recommends B to me (Sina calls these "indirect followers").

But in Sina's actual operation these two methods are mixed together; for example, many of the people I follow follow B, yet many of the people following B are also my friends. These recommendations are collectively known as collaborative filtering based on similar users (which amounts to finding the links among users, whether through your friends or through the people you follow).

Of course, there is another category, such as popular-user recommendation, which is the mass-behavior-based recommendation mentioned above, i.e., following the crowd. The system assumes that since everyone likes something, perhaps you will too. Once everyone knows that Yao Chen ranks first in follower count on Sina Weibo, they scramble to follow her, pushing the count ever higher. The two recommendation methods are shown in the figure below:

However, neither user-based recommendation nor mass-behavior-based recommendation truly looks for the common interests, preferences, and tastes between users, because much of the time a friend of a friend may not become your friend, and some people hold themselves clearly above the crowd: what all of you pursue, I disdain. Therefore, finding common concerns and points of interest by analyzing the content of users' microblogs is the proper way. Of course, Sina Weibo recently let users tag their own microblog content, so that in the future tags and keywords common to related users can be found within it. This is recommendation based on content analysis, shown in the diagram below:

The only question is, who will go to great lengths to tag a tweet? So Sina Weibo will have to find another way to analyze its content better; otherwise the system load would be too large and unaffordable, having to scan all the microblog content of its massive user base.

However, I think we can start from the keywords of a microblog (its tag cloud) and the tags each user sets for themselves (the more tags two users have in common, the more similar they can be judged to be), as shown in the left and right parts of the picture below:

In other words, defining similar users through common friends and indirect followers is unreliable; finding similar users through analysis of microblog content is what is feasible. Going a step further, after obtaining a tag cloud through content analysis, finding the same or similar tags in the tag cloud is certainly a better way to find similar users than the existing methods of recommending friends through mutual friends and defining similar users through indirect followers.
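As an illustration only (the user names and tag sets below are invented, not Sina Weibo data), here is a minimal Python sketch of judging user similarity by tag overlap, using the Jaccard index (shared tags over all tags):

# Hypothetical sketch: user similarity from shared tags (Jaccard index).
def tag_similarity(tags_a, tags_b):
    """|intersection| / |union| of two tag sets."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / len(tags_a | tags_b)

user_tags = {
    "u1": {"machine-learning", "music", "travel"},
    "u2": {"machine-learning", "music", "photography"},
    "u3": {"cooking", "travel"},
}
print(tag_similarity(user_tags["u1"], user_tags["u2"]))  # 0.5, fairly similar
print(tag_similarity(user_tags["u1"], user_tags["u3"]))  # 0.25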

3.1. Combination of multiple recommendation methods

Recommendations on existing Web sites often do not simply use a single recommendation mechanism or strategy, they often mix multiple methods together to achieve better results.

For example, Amazon uses, besides user-based recommendation, content-based recommendation (items with the same keywords and tags), such as new-product recommendations, and item-based collaborative filtering (you like A, C is similar to A, so you may also like C), such as bundles and "customers who bought/viewed this also bought/viewed" lists.

In short, multiple recommendation methods are combined by weighting (a linear formula combines several different recommendations by certain weights; the specific weight values need repeated trials on a test data set to achieve the best recommendation effect), switching, partitioning, layering, and so on. However, whatever the recommendation, it is generally covered by the recommendation methods mentioned above.

4. Collaborative filtering recommendation

Collaborative filtering is a typical approach to exploiting collective intelligence. To understand collaborative filtering, start with a simple question: if you wanted to watch a movie right now and didn't know which one, what would you do? Most people ask friends, or their "neighborhood" in a broader sense, what movies have been recommended recently, and we tend to take recommendations from friends with similar tastes. That is the core idea of collaborative filtering. How much information can you read from the picture below?

4.1. Recommended steps for collaborative filtering

To make collaborative filtering recommendations, the following steps are generally required:

1) For collaborative filtering, collecting user preferences is the key. Preferences can be obtained from user behaviors such as rating (different users rate different works differently; close ratings mean similar preferences and tastes, so the users can be judged similar), voting, forwarding, saving, bookmarking, marking, commenting, click streams, page dwell time, whether a purchase was made, and so on. As noted in point 2 below, all of this information can be digitized and represented as a two-dimensional matrix.

2) After collecting user behavior data, we denoise and normalize it (obtaining a two-dimensional user-preference matrix: one dimension is the list of users, the other the list of items, and each value is the user's preference for the item, generally a floating-point value in [0, 1] or [-1, 1]). Here is a brief introduction to the denoising and normalization operations:

  • The so-called denoising: user behavior data is generated while users use the application, so it may contain a lot of noise and user mis-operations. We can filter the noise out of the behavior data with classical data mining algorithms, making our analysis more accurate (similar to denoising web pages).
  • Normalization: unify the data from the various behaviors into the same value range, so that the weighted sum of overall preferences is more accurate. The simplest normalization is to divide each class of data by the maximum value in that class, guaranteeing the normalized values lie in [0, 1]. As for weighting, it is easy to understand: each behavior's weight differs, much like a singing contest where advancing is decided by votes and an audience vote is worth 1 point while an expert judge's vote is worth 5 (see the sketch after this list).
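A minimal sketch of this normalize-then-weight step, assuming two made-up behavior types (clicks and star ratings) and illustrative weights:

# Divide each behavior's values by that behavior's maximum so everything
# falls in [0, 1], then take a weighted sum. All numbers are invented.
def normalize(values):
    m = max(values)
    return [v / m for v in values] if m else values

clicks  = [3, 10, 6]   # raw click counts per item
ratings = [2, 5, 4]    # raw 1-5 star ratings per item
n_clicks, n_ratings = normalize(clicks), normalize(ratings)

# A rating counts 5x as much as a click, echoing the
# "audience vote = 1 point, judge vote = 5 points" analogy above.
prefs = [1 * c + 5 * r for c, r in zip(n_clicks, n_ratings)]
print(prefs)   # [2.3, 6.0, 4.6]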

3) How do we find similar users and items? By computing the similarity between users or between items.

4) There are many ways to compute similarity, but they are all vector-based: compute the distance between two vectors, and the closer the distance, the greater the similarity. In recommendation, given the two-dimensional user-item preference matrix, we can take one or several users' preferences for two items as vectors and compute the similarity between the two items, or take two users' preferences for one or several items as vectors and compute the similarity between the two users.

Similarity algorithms can be used for either user or item similarity. Take item similarity computation as an example: from the rating matrix, select the users who have rated both items i and j, and compute the similarity s(i, j) over these common users' rating vectors, as shown in the figure below, where rows represent users and columns represent items (note that the common ratings are extracted from the i and j columns to form a pair of vectors for the similarity calculation):

So it is very simple: to find the similarity between items, hold the users fixed and look at multiple users' ratings of the two items; to find the similarity between users, hold the items fixed and look at the two users' ratings of certain items.

5) The two kinds of similarity computed this way serve, respectively, as the basis for the two collaborative filtering recommendations: user-based and item-based.

Common methods for computing similarity include Euclidean distance, the Pearson correlation coefficient (for example, when two users have rated several movies in common, the Pearson correlation coefficient can determine whether their tastes and preferences agree), cosine similarity, and the Tanimoto coefficient. The Euclidean distance and Pearson correlation coefficient are briefly introduced below:

  • Euclidean distance was originally used to compute the distance between two points in Euclidean space. Suppose x and y are two points in n-dimensional space; the Euclidean distance between them is

    $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

You can see that when n = 2, the Euclidean distance is just the distance between two points in the plane. When Euclidean distance is used to express similarity, the following conversion is generally used, so that the smaller the distance, the greater the similarity (and the divisor never becomes 0):

    $sim(x, y) = \frac{1}{1 + d(x, y)}$

 

  • Cosine similarity: treat the two items i and j as two m-dimensional vectors in user space, and compute their similarity as the cosine of the angle between them. For an m×n rating matrix, the similarity of i and j is

    $sim(i, j) = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\|\vec{i}\| \, \|\vec{j}\|}$

    (where "·" denotes the inner product of the two vectors).

  • The Pearson correlation coefficient is generally used to compute how closely two variables are related. To make the result accurate, we must first find the users who have rated both items. Let U be the set of users who have rated both i and j; then the corresponding Pearson correlation coefficient is

    $sim(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2} \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}}$

where $R_{u,i}$ is user u's rating of item i, and $\bar{R}_i$ (with a bar) is the average rating of item i over the users in U.
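Here is a small Python sketch of the three measures just described, applied to two rating vectors already restricted to their common users; the numbers are invented:

import math

def euclidean_sim(x, y):
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)          # smaller distance -> larger similarity

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def pearson_sim(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Ratings of items i and j by the same three users (made-up numbers):
i, j = [4, 3, 5], [5, 3, 4]
print(euclidean_sim(i, j), cosine_sim(i, j), pearson_sim(i, j))
# -> roughly 0.414, 0.98, 0.5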

6) Computing similar neighbors. Neighbors fall into two categories: 1. A fixed number of neighbors, K-neighborhoods (or fix-size neighborhoods), which take only the K nearest points as neighbors regardless of how "distant" they are, as shown in part A of the figure below; 2. Similarity-threshold-based neighbors: all points inside the region centered on the current point with radius K are its neighbors, as shown in part B of the figure below.
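A tiny sketch of the two neighbor-selection rules, with made-up similarity scores:

# sims maps each candidate neighbor to its similarity with the current user.
def k_neighbors(sims, k):
    """Fixed-size rule: the K most similar candidates, however similar."""
    return sorted(sims, key=sims.get, reverse=True)[:k]

def threshold_neighbors(sims, threshold):
    """Threshold rule: every candidate whose similarity clears the bar."""
    return [u for u, s in sims.items() if s >= threshold]

sims = {"A": 0.9, "B": 0.75, "C": 0.4, "D": 0.1}
print(k_neighbors(sims, 2))            # ['A', 'B']
print(threshold_neighbors(sims, 0.5))  # ['A', 'B']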

Let me also introduce the k-nearest-neighbor (KNN) classification algorithm: it is theoretically mature and one of the simplest machine learning algorithms. The idea is that a sample belongs to a category if most of the k samples most similar to it (i.e., its nearest neighbors in feature space) belong to that category.

7) The user-based CF computed as above (user-based recommendation: find similar neighbor users through common tastes and preferences, the K-neighbor algorithm; your friends like it, so you may too) and the item-based CF (item-based recommendation: find similarity between items and recommend similar items; you like item A, C is similar to A, so you may also like C) together constitute the two classic collaborative filtering recommendations.

 

4.2 User similarity vs. item similarity

The three similarity formulas in Section 4.1 above apply to different scenarios. In fact, there is a basic difference between user-similarity-based and item-similarity-based calculation: user similarity is computed over the row vectors of the rating matrix, while item similarity is computed over its column vectors. The three formulas can then be applied to either, as shown in the figure below:

 

(where 0 means not rated)

  • Using the item-similarity formula, compute the similarity of two column vectors, e.g., Item3 and Item4;
  • Using the user-similarity formula, compute the similarity of two row vectors, e.g., User3 and User4 (see the sketch after this list).
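A small sketch of pulling those row and column vectors out of a rating matrix (values invented, 0 = not rated); either pair can then be fed to any of the Section 4.1 formulas:

matrix = [
    [4, 0, 3, 5],   # User1
    [0, 5, 4, 0],   # User2
    [5, 4, 2, 0],   # User3
    [2, 4, 0, 3],   # User4
]

user3, user4 = matrix[2], matrix[3]    # row vectors -> user similarity
item3 = [row[2] for row in matrix]     # column vector for Item3
item4 = [row[3] for row in matrix]     # column vector for Item4
# e.g. cosine or Pearson similarity over (user3, user4) or (item3, item4)
print(user3, user4, item3, item4)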

An example is better than a thousand words. Let’s look at a specific example based on user similarity calculation.

Suppose we have a group of users who show a preference for a group of books. The more users like a book, the higher the rating. Let’s show it through a matrix, with rows representing users and columns representing books.

As shown below, all ratings range from 1 to 5, with 5 being the most liked. The first user (row 1) has a rating of 4 for the first book (column 1), and an empty cell indicates that the user has not rated the book.



Using the user-based collaborative filtering method, the first thing we need to do is to calculate the similarity between users based on their evaluation of books.



Let's consider this from the point of view of a single user, looking at the first row of the figure. A common practice is to represent each user with a vector (or array) containing that user's preferences; this makes it straightforward to apply a variety of similarity measures.



In this example, we will use cosine similarity to calculate the similarity between users.



When we compare the first user to the other five users, we can see how similar he is to the other users.



For most similarity measures, the closer the vectors, the more similar the users. In this example the first user is very similar to the second and third users, having two books in common; less similar to the fourth and fifth users, with only one book in common; and not at all similar to the last user, with no book in common.



More generally, we can compute the similarity between every pair of users and represent the results in a similarity matrix. This is a symmetric matrix; the background color of a cell indicates how similar two users are, with darker red meaning more similar.

So we find the users most similar to the first user, remove the books that user has already rated, weight the books rated by the most similar users, and then compute the sum.

In this case we take n = 2, meaning that to produce recommendations we need to find the two users most similar to the target user; here those are the second and third users. Since the first user has already rated the first and fifth books, the recommendations are the third book (score 4.5) and the fourth book (score 3).
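The following runnable sketch mirrors this walk-through. The matrix is a made-up stand-in for the article's figure (0 = not rated); cosine similarity picks the n = 2 nearest users, and each unrated book gets a similarity-weighted score. With these invented numbers the output comes out close to the figures quoted above (book 3 ≈ 4.5, book 4 = 3):

import math

ratings = [            # rows = users, columns = books; 0 = not rated
    [4, 0, 0, 0, 5],   # user 1 (target): has rated books 1 and 5
    [5, 0, 4, 0, 4],   # user 2: two books in common with user 1
    [4, 0, 5, 3, 5],   # user 3: two books in common
    [0, 3, 0, 0, 4],   # user 4: one book in common
    [3, 0, 0, 0, 0],   # user 5: one book in common
    [0, 4, 0, 2, 0],   # user 6: nothing in common
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

target = ratings[0]
# n = 2: keep the two users most similar to the target user
neighbors = sorted(ratings[1:], key=lambda r: cosine(target, r), reverse=True)[:2]

# Similarity-weighted average rating for each book the target has not rated
scores = {}
for book in range(len(target)):
    if target[book] == 0:
        num = sum(cosine(target, r) * r[book] for r in neighbors if r[book] > 0)
        den = sum(cosine(target, r) for r in neighbors if r[book] > 0)
        if den:
            scores[book] = num / den

print({b + 1: round(s, 2) for b, s in scores.items()})   # {3: 4.47, 4: 3.0}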

Also, when should item-based CF be used, and when user-based? See weibo.com/1580904460/… :

Generally speaking, if the number of items is small, say no more than 100,000, and does not grow significantly, use item-based CF. Why? As @wuzh670 said, when the number of items is small and does not grow much, the relationships among items are relatively stable over a period of time (compared with the relationships among users), so the need to update item similarities in real time drops greatly and the efficiency of the recommendation system improves greatly. Using item-based CF is then the wiser choice.

Conversely, when the number of items is huge, user-based CF is recommended. Of course, in practice it depends on the situation, as shown in the figure below (excerpted from Xiang Liang's Recommendation System Practice):

 

5. Clustering algorithm

Clustering, popularly speaking, embodies the saying "birds of a feather flock together; people of a kind cluster together". Clustering is a classic data mining problem: its purpose is to divide data into multiple clusters so that objects within the same cluster are highly similar, while objects in different clusters differ greatly.

5.1 k-means clustering algorithm

The k-means clustering algorithm closely resembles the expectation-maximization algorithm for mixtures of normal distributions, in that both try to find the centers of the natural clusters in the data. The algorithm assumes that object attributes form vectors in space, and it aims to minimize the sum of squared errors within each group.

The k-means algorithm first randomly chooses K center locations (points in space representing cluster centers), then assigns each data item to the nearest center. After the assignment is complete, each cluster center is moved to the average position of all the nodes assigned to that cluster, and the whole assignment process starts over. This repeats until the assignments no longer change. The figure below shows the k-means clustering process for two clusters:

 

The following code is a Python implementation of this k-means clustering algorithm (adapted from Programming Collective Intelligence; a simple Pearson-based distance is defined so that the default `distance` argument works):

# K-means clustering algorithm
import random
from math import sqrt

def pearson(v1, v2):
  # 1 - Pearson correlation, so that smaller values mean "closer"
  n = len(v1)
  sum1, sum2 = sum(v1), sum(v2)
  sum1sq = sum(x ** 2 for x in v1)
  sum2sq = sum(x ** 2 for x in v2)
  psum = sum(a * b for a, b in zip(v1, v2))
  num = psum - (sum1 * sum2 / n)
  den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
  if den == 0: return 0
  return 1.0 - num / den

def kcluster(rows, distance=pearson, k=4):
  # Determine the minimum and maximum values for each dimension
  ranges = [(min(row[i] for row in rows), max(row[i] for row in rows))
            for i in range(len(rows[0]))]

  # Create k randomly placed centers
  clusters = [[random.random() * (ranges[i][1] - ranges[i][0]) + ranges[i][0]
               for i in range(len(rows[0]))] for j in range(k)]

  lastmatches = None
  for t in range(100):
    print('Iteration %d' % t)
    bestmatches = [[] for i in range(k)]

    # Find the nearest center for each row
    for j in range(len(rows)):
      row = rows[j]
      bestmatch = 0
      for i in range(k):
        d = distance(clusters[i], row)
        if d < distance(clusters[bestmatch], row): bestmatch = i
      bestmatches[bestmatch].append(j)

    # If the result is the same as last time, the whole process ends
    if bestmatches == lastmatches: break
    lastmatches = bestmatches

    # Move each center to the average position of all its members
    for i in range(k):
      avgs = [0.0] * len(rows[0])
      if len(bestmatches[i]) > 0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m] += rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j] /= len(bestmatches[i])
        clusters[i] = avgs

  # Return k lists, each holding the row indices of one cluster
  return bestmatches
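A tiny usage demo with made-up 2-D points; a Euclidean distance is passed explicitly, since the Pearson default is degenerate on such short, near-constant vectors:

points = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # points near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],   # points near (8, 8)
]

def euclidean(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

groups = kcluster(points, distance=euclidean, k=2)
print(groups)   # e.g. [[0, 1, 2], [3, 4, 5]] (cluster order varies per run)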

K-means is an unsupervised learning method in machine learning. Here is a brief introduction to supervised and unsupervised learning:

  • The task of supervised learning is to learn a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam, tagging web pages by category, and recognizing handwritten input. Many algorithms are used to build supervised learners; the most common include neural networks, Support Vector Machines (SVMs), and naive Bayes classifiers.
  • The task of unsupervised learning is to make sense of data without any labels of right or wrong. It is most commonly used to group similar inputs into logical clusters. It can also be used to reduce the dimensionality of a data set so as to focus only on the most useful attributes, or to detect trends. Common unsupervised learning approaches include k-means, hierarchical clustering, and self-organizing maps.

 

5.2 Canopy clustering algorithm

The basic principle of the canopy clustering algorithm is as follows: first, a cheap, approximate distance measure is used to efficiently divide the data into groups, each called a canopy (which we translate as "canopy"); canopies may overlap. Then a strict distance measure is used to accurately compute the distances between points in the same canopy and assign each point to the most appropriate cluster. Canopy clustering is often used as preprocessing for k-means clustering, to find a suitable value of K and the cluster centers.
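A minimal sketch of the canopy idea as described above: a cheap distance with a loose threshold T1 and a tight threshold T2 (T1 > T2). The thresholds, data, and structure are illustrative assumptions, not Mahout's actual implementation:

import random

def canopy(points, dist, t1, t2):
    """t1 > t2: loose and tight thresholds for the cheap distance."""
    canopies = []
    remaining = list(range(len(points)))
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        for p in remaining[:]:            # iterate over a copy while removing
            d = dist(points[center], points[p])
            if d < t1:
                members.append(p)         # loosely close: joins this canopy
            if d < t2:
                remaining.remove(p)       # tightly close: cannot seed a new canopy
        canopies.append(members)
    return canopies

points = [(0, 0), (0.5, 0.5), (9, 9), (9.5, 9)]
d = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])   # cheap Manhattan distance
print(canopy(points, d, t1=3.0, t2=1.5))   # e.g. [[0, 1], [2, 3]]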

5.3. Fuzzy K-means clustering algorithm

Fuzzy k-means clustering is an extension of k-means. Its basic principle is the same as k-means, but its results allow an object to belong to multiple clusters; that is, it belongs to the overlapping clustering algorithms introduced earlier. To understand the difference between fuzzy k-means and k-means, we need to spend some time on a concept called the fuzziness factor.

Like k-means, fuzzy k-means loops over the set of vectors to be clustered, but instead of assigning each vector to the nearest cluster, it computes the degree of association between the vector and each cluster. Suppose there is a vector v and k clusters, and the distances from v to the k cluster centers are d1, d2, …, dk; then the association u1 between v and the first cluster can be calculated as (reconstructed here in the standard fuzzy c-means form):

    $u_1 = \frac{1}{\sum_{j=1}^{k} (d_1 / d_j)^{2/(m-1)}}$

To compute the association between v and another cluster, simply replace d1 with the corresponding distance. From the above calculation we can see that when m approaches 2, the association approaches 1, and when m approaches 1, the association approaches the distance to the cluster, so the value of m lies in the interval (1, 2). The larger m is, the greater the degree of fuzziness; m is the fuzziness parameter we just mentioned.
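A small sketch of this membership computation, assuming the standard fuzzy c-means form reconstructed above:

def membership(distances, m=1.5):
    """Degree to which a vector belongs to cluster 1, given its distances
    d1..dk to all k cluster centers and fuzziness parameter m > 1."""
    d1 = distances[0]
    return 1.0 / sum((d1 / dj) ** (2.0 / (m - 1)) for dj in distances)

print(membership([1.0, 2.0, 4.0], m=1.5))  # ~0.938: close to center 1 -> high u1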

Other clustering algorithms are not covered in this article. Issues such as cold start, data sparsity, scalability, portability, interpretability, and the diversity and value of recommended information will be explained in the future.

 

6. Classification algorithm

Next, among the many classification algorithms, this article introduces decision tree learning and Bayes' theorem.

6.1 Decision tree learning

Let’s cut to the chase. A decision tree, as its name implies, is a kind of tree built on strategic choices.

In machine learning, a decision tree is a predictive model: it represents a mapping between object attributes and object values. Each node in the tree represents an object, each branching path a possible attribute value, and each leaf corresponds to the value of the objects reached along the path from the root to that leaf. A decision tree has a single output; if multiple outputs are desired, independent decision trees can be built to handle the different outputs. The machine learning technique that generates decision trees from data is called decision tree learning.

The theory is too abstract, so here are two easy-to-understand examples:

First example: in layman's terms, decision-tree classification works like sizing up a blind date. Imagine a girl's mother wants to introduce her to a boyfriend, and the following conversation ensues:

Daughter: How old are you? Mother: twenty-six. Daughter: Are you handsome? Mother: He’s handsome. Daughter: Do you have a high income? Mother: Not very high, medium condition. Daughter: Are you a civil servant? Mother: Yes, I work for the tax bureau. Daughter: Ok, I’ll go and see.

The girl's decision-making process is a typical classification-tree decision: it divides men into two categories, to see and not to see, by age, looks, income, and whether they are a civil servant. Suppose her requirements are: under 30 years old, average-or-better looks, and a civil servant with high or at least medium income. Then the process can be represented by the following figure:

Put simply, a decision tree works like a company screening résumés during hiring: if your credentials are excellent, say a PhD from Tsinghua University, you get an interview call immediately; if you did not graduate from a key university but have solid project experience, you may still be invited to interview. Each case is analyzed on its own merits, and a decision follows.

The second example comes from Tom M. Mitchell's book Machine Learning:

Wang's goal is to predict, from the weather forecast for the coming week, when people will play golf. He knows that the decision depends on the weather: the outlook may be sunny, overcast, or rainy; temperature is in degrees Fahrenheit; relative humidity is a percentage; and it may or may not be windy. In this way we can construct a decision tree like the following (deciding whether the weather makes it a good day to play tennis):

The above decision tree corresponds to the expression (Outlook = Sunny ∧ Humidity <= 70) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak). The best classification attribute is obtained as shown in the figure below:

In the figure above, the information gain of two different attributes, humidity and wind, is calculated: the gain from splitting on humidity, 0.151, exceeds the gain from splitting on wind, 0.048. Put plainly, for the question of whether a Saturday morning is suitable for playing tennis, it is better to use humidity rather than wind as the classification attribute.
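A short sketch that reproduces these two numbers with entropy and information gain, using the per-value positive/negative counts of Mitchell's classic PlayTennis data (9 positive, 5 negative examples overall):

import math

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

def gain(total_pos, total_neg, splits):
    """splits: list of (pos, neg) counts for each attribute value."""
    total = total_pos + total_neg
    rest = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(total_pos, total_neg) - rest

# Humidity: High -> (3+, 4-), Normal -> (6+, 1-)
print(round(gain(9, 5, [(3, 4), (6, 1)]), 3))   # 0.151
# Wind: Weak -> (6+, 2-), Strong -> (3+, 3-)
print(round(gain(9, 5, [(6, 2), (3, 3)]), 3))   # 0.048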

How the ID3 algorithm forms a decision tree

OK, the figure below shows the partial decision tree formed after the first step of the ID3 algorithm. Putting it all together makes a lot of sense: 1. Overcast examples are always positive, so Overcast becomes a leaf node that is always Yes; 2. ID3 does no backtracking and is locally rather than globally optimal; there are also decision trees with post-pruning.

6.2 The basis of Bayesian classification: Bayes’ theorem

Bayes' theorem: given a conditional probability, how do we obtain the probability with the two events exchanged? That is, knowing P(A|B), how do we compute P(B|A)? First, what conditional probability means:

      P(A|B) is the probability of event A given that event B has already occurred. Its basic formula is $P(A|B) = \frac{P(AB)}{P(B)}$.

Bayes' theorem is useful because we often encounter this kind of situation in life: P(A|B) can easily be obtained directly, while P(B|A) is hard to obtain directly, yet P(B|A) is what we actually care about. Bayes' theorem opens the road from P(A|B) to P(B|A).

Below, Bayes' theorem is given directly without proof (netizens pointed out problems with the formula originally shown here, to be verified and corrected later); in its standard form:

      $P(B|A) = \frac{P(A|B) \, P(B)}{P(A)}$
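A tiny numeric illustration with made-up probabilities, expanding P(A) by the law of total probability:

# P(B|A) = P(A|B) * P(B) / P(A), with P(A) = P(A|B)P(B) + P(A|~B)P(~B).
p_b = 0.01              # prior: P(B)
p_a_given_b = 0.9       # easy-to-measure direction: P(A|B)
p_a_given_not_b = 0.05
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))   # ~0.1538: the hard-to-measure P(B|A)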

 

7. Recommend instance extension

7.1. Reading recommendations

Here’s an excerpt from 36KR:

Beijing Very Technology is also bullish on reading-recommendation apps; they put a great deal of energy into one (a 60-person team for a year) and just launched the iPhone version of "Cool Cloud Reading" today.

Why put so many people into a reading app? Li Peng, the CEO, told me that more than half the team works on back-end matters, including algorithms such as semantic analysis and machine learning. Their aim is to "semanticize" the Internet, work out people's interests, and finally recommend to each person what interests them. On the iPhone, Cool Cloud basically does what Zite does on the iPad: users' "likes" and "dislikes", and taps on a given media source or tag, tell Cool Cloud to show more such content in the future.

This goal is common to most reading-recommendation apps, but Cool Cloud's approach seems more extreme. Besides scraping more than 100,000 articles from the Internet every day, they also index the programming of 200 TV stations across the country so that users can search videos by text and receive similar recommendations. The general approach is to record all these programs, convert the sound to text, and finally build summaries and indexes.

Is the algorithm a typical recommendation system applies as complex as the collaborative filtering described above? The following is quoted from my Weibo post of January 21:

1. Most reading-recommendation apps tag articles by content, e.g., "algorithm", "iPhone" (clicking a tag adds weight to it), and invite feedback on articles: like or dislike. The recommendation system records every click, eventually forming a tag cloud of the user's labels (at the same time, users holding the same or similar tags can be identified as similar users, which is user-based recommendation). Then, for each new article entering the system, its keywords are extracted and matched against the user's tag orientation before pushing.

2. Current mobile news readers do classify content (technology, education, and so on), but they usually do not collect ratings the way web pages do, so they cannot record users' behavioral characteristics, nor recommend follow-up articles after a new piece comes out. Still, a number of mobile reading recommenders have appeared, such as @Cool Cloud Reading, Zhiyue, and so on.

But the average user's habit is to read the day's news and be done with it. For example, how many users are willing to register an account just to comment on an article? In my view, the key is how to get users to pay the extra cost of using such readers and to change their habits.

I also have a question about the sentence above: "record these video programs first, then convert the sound to text". We already know that for music, a service like Douban FM might do the following:

  1. You like some songs and I like some songs; if the songs you and I like largely coincide, the system defines you and me as friends, i.e., similar users. This is user-based collaborative filtering: friends like it, so you may like it;
  2. There is also song-based recommendation: you like song A, and another song B is similar to A (for example, both are about love and are sad), so the system guesses you may also like B and recommends it to you. This is item-based collaborative filtering.

Judging that users are similar because they repeatedly listen to the same songs gives user-based collaborative filtering; judging that songs are similar gives item-based collaborative filtering. But here the problem arises: the same singer, the same song, fine; but how is the similarity of two pieces of music defined and judged? By having the system analyze each song's spectrum? The tempo of each song? While that seems workable, it is unlikely to be practical.

I think the music is tagged (I think videos should be tagged too, so they can be indexed later; recording entire videos is still unreliable), with tags such as "love" and "sentimental"; songs carrying the same tags can then be judged similar. But the key is: how are the tags produced? Speech recognition?

 

7.2 How to generate the tags

In the early days this can be done by hand, by crawlers, or by buying a database; once traffic grows, consider UGC, i.e., user-generated content. But users are unlikely to tag music themselves; it is too much trouble (for example, Sina Weibo recently added a "tag" prompt to every post, but how many users care about it?). Of course, some systems automatically generate some tags for you (and let you add your own), such as Sina Blog:

How is that done? My idea is as follows:

  1. Behind the scenes, the system scans your text once and extracts some keywords as candidate tags for you to choose from. Which keywords? High-frequency words, of course: scan the whole passage and count the frequency of each word.
  2. Then take the top K. As in the screenshot above, "algorithm" appeared four times in that article and "blog" three times, so the system automatically suggests these labels for you.
  3. As for the data structure or method used to count keyword frequencies: generally a hash table plus a heap (see my earlier post analyzing hash table algorithms), or a trie (see "From Trie Tree to Suffix Tree"). But a trie becomes more difficult with Chinese characters, so hash plus heap is the ideal choice (see the sketch after this list).
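A minimal top-K keyword sketch using a hash-based counter plus a heap; splitting on whitespace here is a stand-in for real (especially Chinese) word segmentation:

import heapq
from collections import Counter

def top_k_keywords(text, k=2):
    freq = Counter(text.lower().split())        # hash-based counting
    return heapq.nlargest(k, freq.items(), key=lambda kv: kv[1])

post = "algorithm blog algorithm data blog algorithm"
print(top_k_keywords(post))   # [('algorithm', 3), ('blog', 2)]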

Similarly for video, it should go roughly the same way: 1. Have the system or machine read the video content, convert the video to text, then extract the high-frequency keywords (keyword extraction involves a key problem, word segmentation, which this blog will elaborate on later) and use the extracted keywords as the video's tags; 2. Then build an index and summaries (what kind of index? An inverted index. On inverted indexes, see The Art of Programming, chapters 23-24: Young's matrix search and the practice of non-duplicate hash encoding of inverted-index keywords), so that users or the system can search conveniently later (this part was discussed and summarized with friends of The Art of Programming).
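A minimal inverted-index sketch for this tagged-video idea, with made-up video ids and tags: map each tag to the list of videos that carry it, so lookup is by keyword rather than by scanning everything.

from collections import defaultdict

videos = {
    "v1": ["love", "sentimental"],
    "v2": ["news", "finance"],
    "v3": ["love", "comedy"],
}

index = defaultdict(list)
for vid, tags in videos.items():
    for tag in tags:
        index[tag].append(vid)      # tag -> all videos tagged with it

print(index["love"])   # ['v1', 'v3']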

Details will be elaborated later.

 

8. References

  1. My Weibo posts of January 7 and January 21 (linked in the left sidebar of this blog);
  2. Exploring the Inner Secrets of the Recommendation Engine, Zhao Chenting, Ma Chune;
  3. Programming Collective Intelligence, Toby Segaran;
  4. An overview of collaborative filtering for recommendation systems;
  5. www.cnblogs.com/leoo2sk/;
  6. Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997 (a pioneering work in machine learning);
  7. zh.wikipedia.org/wiki/%E5%86…;
  8. www.36kr.com/p/75415.htm…;
  9. Algorithms of the Intelligent Web, Chapter 3, Recommendation Systems (implements user and item similarity computation; worth a look);
  10. Collaborative filtering recommendation algorithms and content-based filtering algorithms: time.geekbang.org/article/194…

 

Afterword

OK, this article is only preliminary; many problems and gaps remain to be filled. At the same time, everything here is still only my own understanding and has not yet been applied in actual work, so the understanding is not deep and has not been tested against true knowledge; it all has to be verified in later practice. If readers find any problems or errors in this article or this blog, please feel free to point them out. A million thanks. The end. 2011.01.12.