[Editor’s note] This is the official account of Xianjian, the intelligent recommendation product of 4Paradigm (the Fourth Paradigm). It focuses on the computer science field, especially cutting-edge research in artificial intelligence, and aims to share AI knowledge with the public and promote understanding of artificial intelligence from a professional perspective. We also hope to provide an open platform for discussion, communication, and learning for people working in or around artificial intelligence, so that everyone can enjoy the value created by AI as soon as possible.


This article briefly summarizes the important concepts and terms you need to know when learning about recommendation systems, in the hope that readers who want to get started will benefit from it.

1. Recommendation system

A recommendation system is essentially an information filter: it alleviates information overload and helps people make better decisions. Its basic principle is to build a model of each user's interests from past behavior (purchases, ratings, clicks, etc.) and then apply a recommendation algorithm to select the content the user is most likely to be interested in.

2. Data

To perform the computation above, three kinds of data are needed:

2.1 User data: the data used to build the user model, which varies with the recommendation algorithm. Typical examples include user interest points, user profiles, and the user's social connections.

2.2 Content data: the data used to describe the main attributes of a piece of recommendable content; these are content-specific, such as a film's director, actors, genre, and style.

2.3 User-content data: data reflecting the relationship between users and content, which can be divided into explicit and implicit types. Explicit data directly reflects a user's interest in content, such as reviews, ratings, and purchases; implicit data reflects that interest only indirectly, such as clicks and search records.
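
To make these categories concrete, here is a minimal Python sketch of how the three kinds of data might be represented; every name and value is hypothetical and chosen purely for illustration, not taken from any real system.

```python
# Illustrative only: minimal stand-ins for the three kinds of data above.
# All names and values are hypothetical.

# 2.1 User data: attributes used to build the user model.
user_data = {
    "user_42": {"age": 31, "region": "Beijing", "interests": ["sci-fi", "drama"]},
}

# 2.2 Content data: attributes describing each recommendable item.
content_data = {
    "movie_7": {"director": "Jane Doe", "genre": "sci-fi", "style": "hard SF"},
}

# 2.3 User-content data: explicit feedback states interest directly...
explicit_interactions = [("user_42", "movie_7", 4.5)]  # (user, item, rating)

# ...while implicit feedback only hints at it.
implicit_interactions = [("user_42", "movie_7", "click")]
```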

3. Algorithms

Current mainstream recommendation algorithms can be divided into the following six categories (one common taxonomy among several):

3.1 Content-based Recommendation: recommends content similar to what the user has liked in the past.

3.2 Collaborative Filtering (CF): recommends content to the current user based on the interests of similar users (a minimal sketch follows this list).

3.3 Demographic-based Recommendation: recommends content based on shared demographic attributes of users, such as age and region.

3.4 Knowledge-based Recommendation: recommends specific content to specific users based on domain knowledge about the users and the content.

3.5 Community-based Recommendation: recommends content that a user's friends are interested in, based on the user's social relationships.

3.6 Hybrid Recommender System: a combination of the above algorithms.
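
To make the collaborative filtering idea in 3.2 concrete, here is a minimal user-based CF sketch in Python. The rating matrix and all numbers are made up for illustration; real systems add normalization, neighborhood selection, and much more.

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# User-based CF: score items for user 0 by a similarity-weighted average
# of the other users' ratings.
target = 0
sims = np.array([cosine(R[target], R[u]) if u != target else 0.0
                 for u in range(len(R))])
scores = sims @ R / (np.abs(sims).sum() + 1e-9)

unrated = R[target] == 0
print(np.where(unrated)[0], scores[unrated])  # candidate items and their scores
```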

4. Data preprocessing

In addition to normalization and variable substitution, the data preprocessing techniques most relevant to recommendation systems are similarity calculation, sampling, and dimensionality reduction.

4.1 Similarity calculation

Similarity is usually measured in one of two ways: by computing a similarity score directly, or by computing a distance. Distance is essentially a measure of dissimilarity: the smaller the distance, the higher the similarity.

4.1.1 Similarity measurement

4.1.1.1 Cosine similarity

The most common similarity measure is cosine similarity. For two vectors x and y in n-dimensional space, it is computed as cos(x, y) = (x · y) / (‖x‖ ‖y‖). Geometrically this is the cosine of the angle between the two vectors, and it lies between -1 and 1: -1 means the vectors point in exactly opposite directions, 1 means they point in exactly the same direction, and the rest fall in between.

4.1.1.2 Pearson correlation coefficient

Another common similarity measure is the Pearson correlation coefficient, ρ(X, Y) = cov(X, Y) / (σX σY), which measures the linear correlation between two random variables X and Y. Its value also lies between -1 and 1: -1 means perfectly negative linear correlation, 1 means perfectly positive linear correlation, and the remaining values fall in between.

4.1.1.3 Jaccard Coefficient

The Jaccard coefficient measures the similarity of two sets: J(A, B) = |A ∩ B| / |A ∪ B|, the size of the intersection divided by the size of the union.
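
Below is a minimal sketch of the three similarity measures above in plain NumPy; the sample vectors and sets are made up for illustration. Note that the Pearson correlation is just cosine similarity applied to mean-centered vectors.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y, in [-1, 1]."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_correlation(x, y):
    """Linear correlation between x and y, in [-1, 1]."""
    # Equivalent to cosine similarity after centering each vector.
    return cosine_similarity(x - np.mean(x), y - np.mean(y))

def jaccard_coefficient(a, b):
    """|A intersect B| / |A union B| for two sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

x = np.array([4.0, 5.0, 1.0])
y = np.array([3.5, 4.0, 2.0])
print(cosine_similarity(x, y), pearson_correlation(x, y))
print(jaccard_coefficient({"item1", "item2"}, {"item2", "item3"}))  # 1/3
```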

4.1.2 Distance measurement

4.1.2.1 Euclidean Distance

The most common distance measure is the Euclidean distance, the straight-line distance between two points in n-dimensional space.

4.1.2.2 Manhattan Distance

The Manhattan distance, also known as the city block distance, is the sum of the absolute differences along each dimension.

4.1.2.3 Chebyshev Distance

The Chebyshev distance is the maximum absolute difference along any single dimension.

4.1.2.4 Minkowski Distance

The Minkowski distance, d(x, y) = (Σi |xi - yi|^p)^(1/p), is a generalization of the Euclidean distance (p = 2), the Manhattan distance (p = 1), and the Chebyshev distance (p → ∞).

4.1.2.5 Standardized Euclidean Distance

The standardized Euclidean distance addresses an important shortcoming of the four distances above: they treat all dimensions identically, ignoring differences in scale. It divides each dimension by that dimension's standard deviation before computing the Euclidean distance.

4.1.2.6 Mahalanobis Distance

The Mahalanobis distance generalizes the standardized Euclidean distance by using the full inverse covariance matrix; when the covariance matrix is diagonal, it reduces to the standardized Euclidean distance.
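
The following sketch computes all six distances with NumPy. The points and the small data set are made up for illustration; the standard deviations and covariance matrix for the last two distances are estimated from that same hypothetical sample.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(x - y))           # sum of per-dimension differences
chebyshev = np.max(np.abs(x - y))           # largest single-dimension difference

def minkowski(x, y, p):
    # p=1 -> Manhattan, p=2 -> Euclidean, p -> infinity -> Chebyshev
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# A small hypothetical sample, used only to estimate scale and covariance.
data = np.array([[1.0, 2.0, 3.0], [4.0, 0.0, 3.0], [2.0, 1.0, 5.0],
                 [0.0, 3.0, 1.0], [3.0, 2.0, 4.0]])

# Standardized Euclidean: rescale each dimension by its standard deviation.
std = data.std(axis=0)
standardized_euclidean = np.sqrt(np.sum(((x - y) / std) ** 2))

# Mahalanobis: account for correlations via the inverse covariance matrix.
cov_inv = np.linalg.inv(np.cov(data.T))
mahalanobis = np.sqrt((x - y) @ cov_inv @ (x - y))

print(euclidean, manhattan, chebyshev, minkowski(x, y, 3))
print(standardized_euclidean, mahalanobis)
```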

4.2 Sampling

Sampling is mainly used in two places in data mining. First, in the data preprocessing and post-processing stages, computation may be performed on a sample to keep the scale of the calculation manageable. Second, in the modeling stage, trained models are usually cross-validated, and sampling is used to split the full data set into training and test sets.

Sampling generally means random sampling, which applies when all sample points can be treated as interchangeable. There is also stratified sampling: when the data naturally divides into distinct subsets, each subset should be sampled separately.
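
Here is a minimal sketch of both kinds of sampling for a train/test split, using scikit-learn's train_test_split; the data is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples with binary labels (e.g. clicked or not).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Simple random sampling: every point treated as interchangeable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified sampling: preserve the label proportions in each subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```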

4.3 Dimensionality Reduction

In statistical learning theory, as the dimensionality of the samples increases, the complexity of the model to be learned grows exponentially with it; this is often called the "curse of dimensionality". It means that the number of samples needed to learn a model in a high-dimensional space to the same precision as in a low-dimensional space grows exponentially.

Dimensionality reduction is the usual remedy for the curse of dimensionality. There are generally two approaches. One is to select, from the high-dimensional data, the dimensions that best express it and use those to represent the data; this is called feature selection. The other is to map the high-dimensional data into a lower-dimensional space through some transformation; this is called feature extraction.

Principal Component Analysis (PCA) is the most widely used feature extraction method. Through eigendecomposition, it obtains each principal direction's contribution to the overall variance of the data (equivalently, to minimizing the mean squared reconstruction error), making it possible to judge quantitatively how much of the data's information each direction carries. The most important directions are kept and the less significant ones discarded, reducing the dimensionality of the data.

Singular Value Decomposition (SVD) is another major feature extraction method. Through matrix factorization, it maps data from a high-dimensional space into a low-dimensional one, reducing the dimensionality of the data.
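
A minimal sketch of both techniques with scikit-learn follows; the data is randomly generated and the choice of 5 components is arbitrary, purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Hypothetical data: 100 samples in 20 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# PCA: keep the directions of greatest variance.
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)          # shape (100, 5)
print(pca.explained_variance_ratio_)  # each component's share of the variance

# Truncated SVD: low-rank factorization, common for sparse rating matrices.
svd = TruncatedSVD(n_components=5, random_state=0)
X_svd = svd.fit_transform(X)          # shape (100, 5)
```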

5. Data mining – Classification

Classification is a central task in data mining, and there are many methods, each with its own assumptions about the data and its own theoretical underpinnings. Below are some of the most representative algorithms.

5.1 KNN (K-Nearest Neighbors)

KNN is the easiest classifier to understand, and it trains no model at all. To predict an unknown sample, it finds the K known samples closest to it and predicts the unknown sample's class from the classes of those K points.

Its main disadvantages are that it requires a very large sample size and, because there is no trained model, every prediction requires computing the distance to every known sample, which makes prediction computationally expensive.
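
A minimal KNN sketch with scikit-learn on its built-in iris data set; K = 5 is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" just stores the samples; all the work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```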

5.2 Decision Tree

A decision tree abstracts the classification process into a tree. It splits branches so as to maximize information gain, and stops splitting once an impurity threshold is reached, yielding the final decision tree.

Its main advantage is that both training and prediction are very fast; its disadvantage is that its accuracy is sometimes lower than that of other classifiers. Ensemble learning can largely overcome this, for example Random Forest (based on bagging) and GBDT (based on boosting), both extensions of the decision tree: they combine the outputs of many decision trees into a more accurate classifier.
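
Below is a minimal decision tree sketch with scikit-learn. Here criterion="entropy" makes splits maximize information gain, and min_impurity_decrease plays the role of the impurity threshold described above; the data set and threshold value are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Split on information gain; stop splitting when the impurity reduction
# falls below the threshold.
tree = DecisionTreeClassifier(criterion="entropy", min_impurity_decrease=0.01)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```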

5.3 Rule-based Classifier

Rule-based classifiers classify using rules of the form "if ... then ...". Their applicability is limited, and reliable rules are difficult to obtain.

5.4 Bayesian Classifier

Bayesian classifiers are in fact a family of classifiers. They use Bayes' theorem: given the values of a sample's features, they combine an estimated prior probability with the estimated likelihood to compute the posterior probability that the sample belongs to each class.
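
A minimal sketch using scikit-learn's Gaussian naive Bayes, one common member of this family: it estimates a per-class prior and per-feature Gaussian likelihoods, then returns posterior probabilities. The data set is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit priors and Gaussian likelihoods, then apply Bayes' theorem.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))  # posterior probability for each class
```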

5.5 Artificial Neural Network (ANN)

A neural network is a computational model consisting of a large number of interconnected nodes (neurons). Each node applies a specific output function, called an activation function. Each connection between two nodes carries a weight for the signal passing through it; these weights constitute the network's memory. The network's output varies with its connection pattern, weight values, and activation functions. The network itself is usually an approximation of some algorithm or function, or the expression of a logical strategy.
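
As a minimal illustration in plain NumPy, here is a forward pass through a tiny network with one hidden layer; the random weights stand in for values that training would normally learn, and the layer sizes are arbitrary.

```python
import numpy as np

def relu(z):
    # A common activation function: max(0, z) elementwise.
    return np.maximum(0.0, z)

# One hidden layer: output = W2 @ activation(W1 @ x + b1) + b2.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden -> 2 outputs

x = np.array([0.5, -1.2, 3.0])
hidden = relu(W1 @ x + b1)   # weighted sum, then activation
output = W2 @ hidden + b2    # the weights are the network's "memory"
print(output)
```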

5.6 Support Vector Machine (SVM)

The support vector machine is the representative linear classifier. Unlike Bayesian classifiers, which first estimate probability densities and then derive a discriminant function, linear classifiers estimate the linear discriminant directly, minimizing an objective function by convex optimization to obtain the final linear discriminant.

It is one of the most popular classifiers, generally considered fast in training and prediction and reliable in accuracy, and is therefore widely used across many fields.
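
A minimal linear SVM sketch with scikit-learn's LinearSVC, which solves the convex hinge-loss problem directly; the data set and the value of C are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the linear discriminant directly, without estimating densities.
svm = LinearSVC(C=1.0, max_iter=10000)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```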

5.7 Ensemble Learning

The idea of ensemble learning is to combine several weak classifiers into one strong classifier; the two main families are bagging and boosting.
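
A minimal sketch of both families with scikit-learn on randomly generated data: Random Forest for bagging and gradient-boosted trees for boosting. All parameters are illustrative defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions combined.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Boosting: trees trained sequentially, each correcting its predecessors.
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print(rf.score(X_test, y_test), gbdt.score(X_test, y_test))
```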

5.8 Classifier evaluation

Classifier evaluation measures how well a classifier performs. The main criteria are:

Precision / Recall: precision and recall, computed from the confusion matrix

F1: a single score that combines precision and recall

ROC: a curve for visually comparing classifier performance

AUC: the area under the ROC curve, a quantitative summary of it

MAE: mean absolute error

RMSE: root mean square error
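
The sketch below computes each of these metrics with scikit-learn on made-up predictions; the first four apply to classification labels or scores, while MAE and RMSE are typically computed on predicted ratings.

```python
import numpy as np
from sklearn.metrics import (
    f1_score, mean_absolute_error, mean_squared_error,
    precision_score, recall_score, roc_auc_score,
)

# Hypothetical classifier outputs.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])                   # hard labels
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3])  # scores

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))  # AUC summarizes the ROC curve

# Hypothetical rating predictions, for MAE / RMSE.
r_true = np.array([4.0, 3.5, 5.0, 2.0])
r_pred = np.array([3.8, 3.0, 4.5, 2.5])
print(mean_absolute_error(r_true, r_pred))
print(np.sqrt(mean_squared_error(r_true, r_pred)))
```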

References:

  1. Recommender Systems Handbook
  2. Recommendation System Practice

The above content is published by 4Paradigm's Xianjian account for learning and exchange only; copyright belongs to the original author.


If you found this useful, feel free to like, bookmark, and share it with the people around you.

Related reading:

Getting started with recommendation systems: a knowledge list you shouldn't miss

Essentials | Five research hotspots of personalized recommendation systems: explainable recommendation (5)

Essentials | Five research hotspots of personalized recommendation systems: user profiling (4)

Every member of 4Paradigm contributes to the early arrival of artificial intelligence. In this account, you can read about academic frontiers, knowledge, and industry news from the computer field, as well as internal sharing from 4Paradigm members.

For more information, search for and follow our official Weibo account @Xianjian and WeChat official account (ID: DSFSXJ).