1. What is the LDA model?
Understanding LDA can be broken into the following five parts:
- One function: the gamma function.
- Four distributions: binomial, multinomial, beta, and Dirichlet.
- One concept and one idea: conjugate priors and the Bayesian framework.
- Two models: pLSA and LDA.
- One sampling method: Gibbs sampling.
The abbreviation LDA has two meanings: one is Linear Discriminant Analysis, the other is the probabilistic topic model Latent Dirichlet Allocation. This article discusses the latter.
According to its Wikipedia entry, LDA was proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003. It is a topic model that gives the topic of each document in a document collection in the form of a probability distribution; after extracting the topic distributions of some documents by analyzing them, you can then do topic clustering or text classification based on those distributions. It is also a typical bag-of-words model: a document is a set of words with no order among them. In addition, a document can contain multiple topics, and each word in the document is generated by one of those topics.
How do humans generate documents? First list several topics, then pick a topic with a certain probability, pick words belonging to that topic with a certain probability, and finally combine them into an article.
LDA works in the reverse direction: given a document, infer its topic distribution.
In the LDA model, a document is generated as follows (a small sampling sketch follows the list):
- From the Dirichlet distribution $\alpha$, sample to generate the topic distribution $\theta_i$ of document $i$.
- From the topic multinomial distribution $\theta_i$, sample to generate the topic $z_{i,j}$ of the $j$-th word of document $i$.
- From the Dirichlet distribution $\beta$, sample to generate the word distribution $\phi_{z_{i,j}}$ corresponding to topic $z_{i,j}$.
- From the word multinomial distribution $\phi_{z_{i,j}}$, sample to finally generate the word $w_{i,j}$.
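The generative process above can be simulated directly with NumPy. The following is a minimal sketch: the hyperparameter values, the toy vocabulary, and the document length are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["university", "teacher", "course", "market", "enterprise", "finance"]
K, V = 2, len(vocab)          # number of topics, vocabulary size
alpha = np.full(K, 0.5)       # assumed Dirichlet hyperparameter for topic distributions
beta = np.full(V, 0.1)        # assumed Dirichlet hyperparameter for word distributions
N = 8                         # words per document (assumed)

# One word distribution phi_k per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(beta, size=K)

def generate_document():
    theta = rng.dirichlet(alpha)              # topic distribution of this document
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)            # sample a topic for this word position
        w = rng.choice(V, p=phi[z])           # sample a word from that topic's distribution
        words.append(vocab[w])
    return words

print(generate_document())
```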
The beta distribution is the conjugate prior of the binomial distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution. In addition, LDA has a graphical model structure similar to a Bayesian network.
1.1 Understanding of the five distributions
Let's explain the concepts that appeared above.
- **Binomial distribution**

  The binomial distribution builds on the Bernoulli distribution. The Bernoulli distribution, also known as the two-point or 0-1 distribution, is a discrete distribution whose random variable takes only two values, positive or negative {+, -}. The binomial distribution is what you get from repeating a Bernoulli trial $n$ times. In short, one trial gives a Bernoulli distribution; $n$ repeated trials give a binomial distribution.
- **Multinomial distribution**

  This is the binomial distribution extended to multiple dimensions. In a multinomial distribution, the random variable of a single trial no longer takes just the values 0 and 1, but one of multiple discrete values $(1, 2, 3, \dots, k)$. For example, the outcomes of $N$ rolls of a six-sided die follow a multinomial distribution with $k = 6$. Its probability mass function is:

  $$P(x_1, x_2, \dots, x_k; n, p_1, p_2, \dots, p_k) = \frac{n!}{x_1! \, x_2! \cdots x_k!} \, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$

  where $x_i$ is the number of times outcome $i$ occurs and $\sum_i x_i = n$.
- **Conjugate prior distribution**

  In Bayesian statistics, if the posterior distribution and the prior distribution belong to the same family, they are called conjugate distributions, and the prior is called the conjugate prior of the likelihood function.
- **Beta distribution**

  The conjugate prior of the binomial distribution. Given parameters $\alpha > 0$ and $\beta > 0$, the probability density function of a random variable $x$ taking values in $[0, 1]$ is:

  $$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

  Note: $\Gamma(\cdot)$ is the gamma function, $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt$; for a positive integer $n$, $\Gamma(n) = (n-1)!$. This is the "one function" mentioned at the beginning.
- **Dirichlet distribution**

  The generalization of the beta distribution to higher dimensions. The density function of the Dirichlet distribution has the same form as that of the beta distribution:

  $$f(x_1, x_2, \dots, x_k; \alpha_1, \alpha_2, \dots, \alpha_k) = \frac{1}{B(\vec{\alpha})} \prod_{i=1}^{k} x_i^{\alpha_i - 1}, \qquad B(\vec{\alpha}) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}$$

  (A short numerical sketch of these four distributions follows the list.)
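To make these distributions concrete, here is a minimal sampling sketch with NumPy and SciPy; all parameter values are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Binomial: number of "positive" outcomes in n Bernoulli trials with success probability p
print(rng.binomial(n=10, p=0.3, size=5))

# Multinomial: counts over k = 6 faces after n = 60 rolls of a fair die
print(rng.multinomial(n=60, pvals=np.full(6, 1 / 6)))

# Beta(alpha, beta): a distribution over probabilities in [0, 1]
print(stats.beta(a=2, b=5).pdf(0.3))   # density at x = 0.3
print(rng.beta(a=2, b=5, size=3))      # samples

# Dirichlet(alpha): a distribution over probability vectors (each sample sums to 1)
print(rng.dirichlet(alpha=[2.0, 5.0, 3.0], size=2))
```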
So far we can see that the binomial distribution is very similar to the multinomial distribution, and the beta distribution is very similar to the Dirichlet distribution.
If you want to go deeper into the underlying principles, you can refer to "A plain-language understanding of the LDA topic model" in the references, or keep reading first and come back to the detailed formulas later.
In summary, the following information can be obtained.
- The beta distribution is the conjugate prior of the binomial distribution: for non-negative real numbers $\alpha$ and $\beta$, we have the following relationship (a numerical check of this update follows the list):

  $$\text{Beta}(p \mid \alpha, \beta) + \text{BinomCount}(m_1, m_2) = \text{Beta}(p \mid \alpha + m_1, \beta + m_2)$$

  where $m_1$ and $m_2$ are the counts of the two outcomes observed in the binomial data. When the observed data follow a binomial distribution and both the prior and the posterior of the parameter are beta distributions, we speak of Beta-Binomial conjugacy.
- The Dirichlet distribution is the conjugate prior of the multinomial distribution, and the general expression is:

  $$\text{Dir}(\vec{p} \mid \vec{\alpha}) + \text{MultCount}(\vec{m}) = \text{Dir}(\vec{p} \mid \vec{\alpha} + \vec{m})$$

  This Dirichlet-Multinomial conjugacy applies when the observed data follow a multinomial distribution and both the prior and the posterior of the parameters are Dirichlet distributions.
- The fixed pattern of Bayesian thinking:

  prior distribution + sample information $\Rightarrow$ posterior distribution.
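The Beta-Binomial update above can be verified numerically. A minimal sketch, assuming a made-up coin with true bias 0.7, a Beta(2, 2) prior, and 50 observed flips:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 2.0, 2.0                    # prior over the coin bias p: Beta(2, 2)
flips = rng.binomial(1, 0.7, size=50)     # observed data from a coin with true bias 0.7

m1 = int(flips.sum())                     # count of heads
m2 = len(flips) - m1                      # count of tails

# Conjugacy: Beta prior + binomial counts => the posterior is again a Beta distribution
post_alpha, post_beta = alpha + m1, beta + m2
print(f"posterior: Beta({post_alpha}, {post_beta})")
print("posterior mean of p:", post_alpha / (post_alpha + post_beta))
```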
1.2 Understanding the three basic models
Before discussing the LDA model, let's understand the basic models step by step: the Unigram model, the mixture of unigrams model, and the pLSA model, which is the one most similar to LDA. To simplify the description, we first define some variables:
- $w$ denotes a word, and $V$ denotes the number of words in the vocabulary (a fixed value).
- $z$ denotes a topic, and $k$ is the number of topics (predetermined, a fixed value).
- $D = (d_1, \dots, d_M)$ denotes the corpus, where $M$ is the number of documents in the corpus (a fixed value).
- $d = (w_1, \dots, w_N)$ denotes a document, where $N$ is the number of words in the document (a random variable).
- **Unigram model**

  For a document $d = (w_1, w_2, \dots, w_N)$ made up of words $w_i$, the probability of generating document $d$ is:

  $$p(d) = p(w_1, w_2, \dots, w_N) = p(w_1)\,p(w_2)\cdots p(w_N)$$
- **Mixture of unigrams model**

  The generative process of this model is: choose a single topic $z$ for a document, and then generate the whole document from that topic, so all the words in the document come from one topic. Suppose the topics are $z_1, \dots, z_k$; the probability of generating document $d$ is:

  $$p(d) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z) = p(z_1)\prod_{n=1}^{N} p(w_n \mid z_1) + \cdots + p(z_k)\prod_{n=1}^{N} p(w_n \mid z_k)$$
- **pLSA model**

  Once you understand the pLSA model, you are only one step away from the LDA model: add a Bayesian framework to pLSA and you get LDA.
In the mixture of unigrams model above, we assumed that a document is generated by only one topic, but in reality an article usually covers multiple topics, and these topics appear in the document with different probabilities. For example, when introducing a country, one commonly covers several topics such as education, economy, and transportation. So how are documents generated in pLSA?
Suppose there are $K$ possible topics and $V$ possible words in total; let's play a game of rolling dice.
**1.** Suppose that each time you write a document, you first make one K-sided "document-topic" die (rolling it yields one of the K topics) and K different V-sided "topic-word" dice (each of these dice corresponds to one topic, the K dice corresponding to the K topics, and each face of a die corresponds to a candidate word, the V faces corresponding to V candidate words).

For example, let K = 3 and make a "document-topic" die with the three topics education, economy, and transportation. Then, with V = 3, make three three-sided "topic-word" dice: the three faces of the education die carry the words university, teacher, course; the three faces of the economy die carry market, enterprise, finance; and the three faces of the transportation die carry highway, car, plane.
**2.** Each time you write a word, first roll the "document-topic" die to choose a topic; having obtained the topic, pick up the "topic-word" die corresponding to that topic and roll it to choose the word to write.

For example, roll the "document-topic" die and suppose (with some probability) you get the topic education; then roll the education "topic-word" die and obtain (with some probability) one of the words on it: university.
This dice-rolling process of producing words can be simplified to: "first choose a topic with a certain probability, then choose a word with a certain probability".
Finally, you repeatedly roll the "document-topic" die and the "topic-word" dice, repeating N times (producing N words) to complete one document; repeating the whole procedure M times produces M documents.
The above process, abstracted, is the document generation model of pLSA. In this process we do not care about the order in which words appear, so pLSA is a bag-of-words approach. The whole generation process amounts to: choose a topic for the document, then let the chosen topic generate the words.
Conversely, now that the documents have been generated, how do we infer their topics from them? This process of inferring the hidden topics (distributions) from the observed documents (the reverse of generating the documents) is exactly the purpose of topic modeling: to automatically discover the topics (distributions) in a document collection.
Documents $d$ and words $w$ are the samples we can observe, so for any document, $P(w_j \mid d_i)$ is known. Based on a large amount of known document-word data $P(w_j \mid d_i)$, we train the document-topic probabilities $P(z_k \mid d_i)$ and the topic-word probabilities $P(w_j \mid z_k)$, as shown in the following formula:

$$P(w_j \mid d_i) = \sum_{k=1}^{K} P(w_j \mid z_k) \, P(z_k \mid d_i)$$

Therefore, the generation probability of each word in a document is:

$$P(d_i, w_j) = P(d_i) \, P(w_j \mid d_i) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) \, P(z_k \mid d_i)$$

Since $P(d_i)$ can be computed in advance, while $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ are unknown, $\theta = \big(P(w_j \mid z_k),\, P(z_k \mid d_i)\big)$ is the parameter we want to estimate; put simply, we want to find the $\theta$ that maximizes the likelihood of the observed data.
What methods can be used for this estimation? Commonly used parameter estimation methods include maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP), Bayesian estimation, and so on. Since the parameters to be estimated involve the latent variable $z$, we can use the EM algorithm. For details of the EM algorithm, see the earlier section on the EM algorithm.
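The EM iteration for pLSA can be written compactly with NumPy. Below is a minimal sketch of the E-step and M-step using a made-up document-word count matrix; it only illustrates the updates and is not an optimized implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-word count matrix n(d, w): M documents x V words (made-up data)
n_dw = rng.integers(1, 5, size=(6, 10)).astype(float)
M, V = n_dw.shape
K = 2  # number of topics (assumed)

# Random initialization of P(z|d) and P(w|z)
p_z_d = rng.dirichlet(np.ones(K), size=M)   # shape (M, K)
p_w_z = rng.dirichlet(np.ones(V), size=K)   # shape (K, V)

for _ in range(100):
    # E-step: responsibilities P(z | d, w) proportional to P(z | d) * P(w | z)
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # shape (M, K, V)
    p_z_dw = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate P(w | z) and P(z | d) from expected counts
    expected = n_dw[:, None, :] * p_z_dw               # shape (M, K, V)
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print("P(z|d) for the first document:", p_z_d[0])
```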
1.3 The LDA model
In fact, if you understand the pLSA model, you are almost there with the LDA model, because LDA simply adds a Bayesian layer on top of pLSA; that is, LDA is the Bayesian version of pLSA (precisely because it is Bayesianized, LDA takes prior knowledge into account and therefore adds two prior parameters).
Let's revisit how a document is generated in the LDA model, as described at the beginning of this article:
- With prior probability $P(d_i)$, select a document $d_i$.
- From the Dirichlet distribution $\alpha$, sample to generate the topic distribution $\theta_i$ of document $d_i$; in other words, the topic distribution $\theta_i$ is generated by a Dirichlet distribution with hyperparameter $\alpha$.
- From the topic multinomial distribution $\theta_i$, sample to generate the topic $z_{i,j}$ of the $j$-th word of document $d_i$.
- From the Dirichlet distribution $\beta$, sample to generate the word distribution $\phi_{z_{i,j}}$ corresponding to topic $z_{i,j}$; in other words, the word distribution $\phi_{z_{i,j}}$ is generated by a Dirichlet distribution with parameter $\beta$.
- From the word multinomial distribution $\phi_{z_{i,j}}$, sample to finally generate the word $w_{i,j}$.
In LDA, topic selection and word selection are still two random processes. It is still possible to extract the topic “education” from the topic distribution {education: 0.5, economy: 0.3, transportation: 0.2} first, and then extract the word “university” from the corresponding word distribution {university: 0.5, teacher: 0.3, course: 0.2}.
So what is the difference between pLSA and LDA? The difference is this:

In pLSA, the topic distribution and the word distribution are uniquely determined; we can state clearly that the topic distribution is, say, {education: 0.5, economy: 0.3, transportation: 0.2} and the word distribution is {university: 0.5, teacher: 0.3, course: 0.2}. In LDA, however, the topic distribution and the word distribution are no longer uniquely fixed and cannot be given exactly. For example, the topic distribution might be {education: 0.5, economy: 0.3, transportation: 0.2} or {education: 0.6, economy: 0.2, transportation: 0.2}; we can no longer say which, because it is a random variable. But however it varies, it still obeys a certain distribution: in LDA the topic distribution and the word distribution are drawn randomly from Dirichlet priors. Because LDA is the Bayesian version of pLSA, the topic distribution and the word distribution are given randomly by prior knowledge.
In other words, on top of pLSA, LDA adds prior distributions for these two parameters, $P(z_k \mid d_i)$ and $P(w_j \mid z_k)$ (i.e. it Bayesianizes them): a Dirichlet prior $\alpha$ on the topic distribution and a Dirichlet prior $\beta$ on the word distribution.
To sum up, LDA is really just the Bayesian version of pLSA. After documents are generated, both models need to infer their topic distributions and word distributions from the documents (that is, both essentially estimate the probability of a topic given a document and the probability of a word given a topic), but they use different parameter inference methods: pLSA uses maximum likelihood estimation to infer the two parameters as unknown but fixed quantities, whereas LDA treats these two parameters as random variables and places Dirichlet priors on them.
Therefore, the essential difference between pLSA and LDA is that they use different ideas to estimate the unknown parameters: the former takes a frequentist view while the latter takes a Bayesian view.
For LDA parameter estimation via Gibbs sampling, see the references at the end of this article.
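To give a concrete feel for what Gibbs sampling for LDA involves, here is a minimal collapsed Gibbs sampling sketch. The toy corpus, symmetric hyperparameters, and iteration count are assumptions for illustration; a real implementation would add burn-in and convergence checks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids (made up for illustration)
docs = [[0, 1, 2, 1], [3, 4, 5, 4], [0, 2, 1, 0], [4, 5, 3, 5]]
V, K = 6, 2                 # vocabulary size, number of topics
alpha, beta = 0.5, 0.1      # symmetric Dirichlet hyperparameters (assumed values)

# Count matrices: document-topic counts, topic-word counts, topic totals
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)
z = []                      # current topic assignment of every word position

# Randomly initialize the topic assignment of each word
for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        k = rng.integers(K)
        z_d.append(k)
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1
    z.append(z_d)

# Collapsed Gibbs sampling: resample each word's topic given all other assignments
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the current assignment from the counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Full conditional: p(z=k | rest) ~ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Estimated topic distribution theta of each document
theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
print(theta.round(2))
```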
2. How to determine the number of topics in LDA?
- Based on experience and subjective judgment, with continuous tuning; highly practical and the most commonly used.
- Based on perplexity (mainly used to compare the quality of two models); see the sketch after this list.
- The log marginal likelihood method, which is also quite common.
- Nonparametric methods: the HDP method based on the Dirichlet process, proposed by Teh et al.
- Based on similarity between topics: compute the cosine distance, KL divergence, etc., between topic vectors.
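As an illustration of the perplexity-based approach, the sketch below trains gensim LDA models with different numbers of topics and compares their per-word perplexity. The tiny tokenized corpus and the candidate topic counts are made up; in practice you would evaluate on a held-out corpus rather than the training data.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny toy corpus of tokenized documents (made up for illustration)
texts = [
    ["university", "teacher", "course"],
    ["market", "enterprise", "finance"],
    ["university", "course", "finance"],
    ["enterprise", "market", "teacher"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=0)
    # log_perplexity returns a per-word likelihood bound; perplexity = 2 ** (-bound)
    bound = lda.log_perplexity(corpus)
    print(f"{k} topics: perplexity = {2 ** (-bound):.2f}")
```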
3. How to use topic models to solve the cold-start problem in recommendation systems?
The cold-start problem in recommendation systems is about how to make personalized recommendations for users when a large amount of user data is not yet available, so as to optimize click-through rate, conversion rate, or user experience (time on site, retention, etc.). Cold-start problems generally fall into three categories: user cold start, item cold start, and system cold start.
- User cold start: making recommendations for a new user who has no or very little behavioral history;
- Item cold start: finding potentially interested users for a new item, such as a product or a movie, that has no ratings or user behavior data yet;
- System cold start: how to design a personalized recommendation system for a newly launched website.
The usual solution to the cold-start problem is content-based recommendation. Take the Hulu scenario as an example. For user cold start, we want to use the user's registration information (e.g. age, gender, hobbies), search keywords, or other information obtained from legitimate sources (for example, a user logs in with a Facebook account and authorizes access to their Facebook friends and comments) to predict the user's topics of interest. Once we have the user's interest topics, we can find other users with the same interest topics and use their historical behavior to predict which movies the user will be interested in.
Similarly, for item cold start, we can predict a movie's topics from its director, actors, category, keywords, and other metadata, then find similar movies based on the topic vector and recommend the new movie to users who liked those similar movies in the past. Topic models (pLSA, LDA, etc.) can be used to obtain the topics of both users and movies.
Taking users as an example, we treat each user as a document in the topic model and the user's features as the words of that document, so each user is represented as a bag of features. After training the topic model, co-occurring features will tend to fall under the same topics, and each user thereby obtains a topic distribution. The topic distribution of each movie can be obtained in a similar way.
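A minimal sketch of this "user as document, features as words" idea, using scikit-learn's LatentDirichletAllocation on made-up user feature bags (the feature names and the number of topics are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each user is a "document" whose "words" are the user's features (made-up data)
users = [
    "age_18_25 male gaming basketball",
    "age_18_25 male gaming esports",
    "age_36_45 female cooking parenting",
    "age_36_45 female parenting travel",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(users)             # user-feature count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
user_topics = lda.fit_transform(X)              # each row is a user's topic distribution

print(user_topics.round(2))
# Movies can be handled in the same way (movie as document, metadata as words),
# and users and movies can then be matched by comparing their topic vectors.
```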
**So how do we solve the system cold-start problem?** First, we can obtain the topic vectors of every user and every movie as above. In addition, we also need to know the preference between user topics and movie topics, that is, which movie topics are likely to appeal to users of a given topic. When the system has no data at all, this has to be specified with prior knowledge; and since the number of topics is usually small, once the system is launched and a small amount of data has been collected, we can obtain a fairly accurate estimate of the preferences between topics.
4. References
A plain-language understanding of the LDA topic model
5. Code implementation
LDA model application: Seeing Through Hillary Clinton’s emails
Machine Learning
Author: @mantchs
GitHub: github.com/NLP-LOVE/ML…
Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]