LDA topic model for text mining
Author: Zheng Pei
Introduction
Topic models are an important tool for text mining and have received a great deal of attention in industry and academia in recent years. In text mining, a large proportion of the data is unstructured, and it is difficult to extract the relevant, expected information from it directly. Topic models can identify the topics in documents and mine the hidden information in a corpus, and are widely used for topic aggregation, information extraction from unstructured text, feature selection, and similar tasks.
Latent Dirichlet Allocation (LDA) is one of the most representative topic models. LDA was proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003 to infer the topic distribution of documents. It can:
- Discover hidden thematic patterns in a corpus;
- Annotate documents by topic;
- Use those annotations to organize, summarize, and retrieve documents.
1. What is a topic?
From the vocabulary perspective: a topic is a shared representation of one or several documents, a piece of hidden semantics, a pattern of words that commonly appear together; grouping those common words into a class gives a weak form of classification;
From the probability perspective: each topic is a probability distribution over all words; a topic assigns higher probability to words that tend to appear together, so there is some correlation among the co-occurring words;
From the machine learning perspective, a topic model is a typical application of hierarchical Bayesian networks to data (documents or images): each document contains multiple topics, latent variables represent the shared structure among documents, and topic models rest on the bag-of-words or bag-of-features assumption (so word order carries no meaning).
2. Why is it latent?
A Bayesian network describes the relationships among variables through the following:
- the edges connecting the nodes and the direction of each edge;
- the probability distributions of the nodes, i.e., the prior and posterior distributions.
When the relationship between variables cannot be described directly, latent (hidden) nodes are introduced. In LDA, word co-occurrence is captured through the posterior probability of the latent topic nodes, which assigns a high probability to words that co-occur.
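In standard LDA notation (the textbook mixture decomposition, stated here with d a document, w a word, and z one of the K latent topics), the probability of a word appearing in a document factors through the topics:

$$p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)$$

A word therefore receives high probability in a document whenever some topic that is probable in that document also assigns high probability to that word, which is exactly how co-occurrence patterns are encoded.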
Repeated structure is represented by plate (box) notation. In such a diagram, nodes represent variables and edges represent possible dependencies; latent nodes are drawn hollow, observed nodes are shaded, and plates denote repeated structure. Putting these pieces together gives the overall graphical structure of the LDA model.
3. Document generation process of LDA model
In the LDA model, a document is generated as follows:
- From a Dirichlet distribution with parameter α, sample the topic distribution θ_i of document i;
- From the multinomial distribution θ_i, sample the topic z_{i,j} of the j-th word of document i;
- From a Dirichlet distribution with parameter β, sample the word distribution φ_{z_{i,j}} corresponding to topic z_{i,j};
- From the multinomial distribution φ_{z_{i,j}}, sample the word w_{i,j} (the whole process is sketched in code below).
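As a concrete illustration, here is a minimal sketch of this generative process in Python with NumPy. The number of topics K, the vocabulary size V, the symmetric priors, and the document lengths are illustrative assumptions, not values from the article or the project.

```python
import numpy as np

# Illustrative sizes and priors (assumptions, not values from the article).
K, V = 5, 1000                     # number of topics, vocabulary size
alpha = np.full(K, 0.1)            # Dirichlet prior over doc-topic distributions
beta = np.full(V, 0.01)            # Dirichlet prior over topic-word distributions
doc_lengths = [50, 80, 120]

rng = np.random.default_rng(0)

# One topic-word distribution ("topic-word die") per topic: phi_k ~ Dir(beta).
phi = rng.dirichlet(beta, size=K)          # shape (K, V)

documents = []
for n_words in doc_lengths:
    theta = rng.dirichlet(alpha)           # doc-topic distribution ("doc-topic die")
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)         # roll the doc-topic die: pick a topic
        w = rng.choice(V, p=phi[z])        # roll that topic's word die: pick a word
        words.append(w)
    documents.append(words)
```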
The LDA process described above generates the corpus document by document; it is the generative story, not the actual training procedure. To generate each word of a document, two dice are rolled: first a doc-topic die to obtain a topic, then the topic-word die of that topic to obtain the word. The two dice are thrown in turn every time a word of a document is generated.
If there are N words in the corpus, God rolls the dice 2N times, alternating between the doc-topic dice and the topic-word dice. Some of these rolls are in fact interchangeable, so we can equivalently reorder the 2N rolls: in the first N rolls, throw only the doc-topic dice to obtain the topics of all words in the corpus; then, based on each word's topic number, in the last N rolls throw only the topic-word dice to generate the N words. At this point we can write down the joint distribution of words and topics.
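Integrating out the dice parameters, the joint distribution usually derived in collapsed treatments of LDA (e.g., in LDA Math Gossip [2]) is:

$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \cdot \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}$$

where $\vec{n}_k$ is the vector of term counts assigned to topic k, $\vec{n}_m$ is the vector of topic counts in document m, and $\Delta(\cdot)$ is the normalizing constant of the Dirichlet distribution.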
4. LDA model training
Based on the joint probability distribution from the previous section, we can sample from it with Gibbs Sampling, which yields the Gibbs Sampling formula of the LDA model.
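The collapsed Gibbs Sampling formula usually given for LDA (with word i being term t of document m, and the subscript ¬i meaning that the current word is excluded from the counts) is:

$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \;\propto\; \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k'=1}^{K}\left(n_{m,\neg i}^{(k')} + \alpha_{k'}\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t'=1}^{V}\left(n_{k,\neg i}^{(t')} + \beta_{t'}\right)}$$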
According to the formula, we have two goals:
- Estimate the model parameters φ (the K topic-word distributions) and θ (the document-topic distributions);
- For a new document, compute its topic distribution θ.
Training process:
- For every word in every document of the corpus, randomly assign a topic number z;
- Rescan the corpus; for each word, resample its topic with the Gibbs Sampling formula and update its assignment in the corpus;
- Repeat step 2 until Gibbs Sampling converges;
- Count the topic-word co-occurrence frequency matrix of the corpus; this matrix is the LDA model (a code sketch of the procedure is given below).
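The training loop above can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration under assumptions, not the project's actual implementation: the function name, the symmetric priors, and the fixed iteration count (standing in for a real convergence check) are all illustrative.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids in [0, V).
    Returns the topic-word count matrix and the document-topic count matrix.
    """
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_kt = np.zeros((K, V))              # topic-word counts
    n_mk = np.zeros((M, K))              # document-topic counts
    n_k = np.zeros(K)                    # total words assigned to each topic
    z = []                               # topic assignment for every word

    # Step 1: randomly assign a topic number to every word.
    for m, doc in enumerate(docs):
        z_m = rng.integers(K, size=len(doc))
        z.append(z_m)
        for w, k in zip(doc, z_m):
            n_kt[k, w] += 1
            n_mk[m, k] += 1
            n_k[k] += 1

    # Steps 2-3: rescan the corpus and resample topics until "convergence"
    # (here simply a fixed number of sweeps).
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]
                # Remove the current word's assignment from the counts.
                n_kt[k, w] -= 1
                n_mk[m, k] -= 1
                n_k[k] -= 1
                # Full conditional p(z_i = k | rest), up to a constant.
                p = (n_mk[m] + alpha) * (n_kt[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Put the word back under its newly sampled topic.
                n_kt[k, w] += 1
                n_mk[m, k] += 1
                n_k[k] += 1
                z[m][i] = k

    # Step 4: the topic-word co-occurrence counts (plus priors) are the model.
    return n_kt, n_mk
```

Here `docs` is assumed to already be tokenized into integer word ids; in practice one would monitor convergence (for example via perplexity) rather than running a fixed number of sweeps.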
From this topic-word frequency matrix we can compute each probability p(word | topic), and thereby the model parameters φ; these are the K topic-word dice. The dice parameters θ corresponding to the documents of the corpus can also be computed during the training process above: after Gibbs Sampling has converged, count the frequency of each topic in every document, compute each probability p(topic | doc), and so obtain each document's θ.
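In the counting notation used earlier, a standard smoothed form of these estimates is:

$$\hat{\varphi}_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{V}\left(n_k^{(t')} + \beta_{t'}\right)}, \qquad \hat{\theta}_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K}\left(n_m^{(k')} + \alpha_{k'}\right)}$$

i.e., the counts from the converged sampler normalized per topic and per document, with the Dirichlet priors acting as smoothing.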
Because the parameters θ are tied to the individual documents of the training corpus and do not help us understand new documents, they generally do not need to be kept when the model is finally stored. In practice, during LDA training we average the results of a number of iterations taken after Gibbs Sampling has converged when estimating the parameters, which yields a higher-quality model.
With a trained LDA model, for a new document doc we only need to note that in the Gibbs Sampling formula the topic-word part p(word | topic) is stable and invariant: it is supplied by the model learned from the training corpus. So during sampling we only need to estimate the topic distribution θ of the new document. The algorithm is as follows (a code sketch is given below):
1. For each word in the current document, randomly initialize a topic number z;
2. For each word, resample its topic with the Gibbs Sampling formula;
3. Repeat the above process until Gibbs Sampling converges;
4. Count the topic frequencies in the document; this gives the document's θ.
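A minimal sketch of this inference step, reusing the conventions of the training sketch above (the trained topic-word counts n_kt are held fixed; the function name and defaults are illustrative assumptions):

```python
import numpy as np

def infer_theta(doc, n_kt, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Estimate the topic distribution theta of a new document.

    doc: list of integer word ids; n_kt: trained topic-word count matrix.
    Only the new document's topic assignments are resampled.
    """
    rng = np.random.default_rng(seed)
    K, V = n_kt.shape
    n_k = n_kt.sum(axis=1)                   # words per topic in the trained model
    n_dk = np.zeros(K)                       # topic counts for the new document
    z = rng.integers(K, size=len(doc))       # step 1: random initialization
    for k in z:
        n_dk[k] += 1

    for _ in range(n_iter):                  # steps 2-3: resample until convergence
        for i, w in enumerate(doc):
            k = z[i]
            n_dk[k] -= 1                     # exclude the current word
            p = (n_dk + alpha) * (n_kt[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            n_dk[k] += 1
            z[i] = k

    # Step 4: the smoothed topic frequencies are the document's theta.
    return (n_dk + alpha) / (n_dk.sum() + K * alpha)
```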
5. Project implementation
Project source code address: momodel.cn/workspace/5…
6. References
[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3(Jan): 993-1022.
[2] Rickjin. LDA Math Gossip.
Mo (momodel.cn) is a Python-enabled online modeling platform for artificial intelligence that helps you quickly develop, train, and deploy models.
The Mo Artificial Intelligence Club is initiated by Mo's R&D and product team and is committed to lowering the barrier to developing and using artificial intelligence. The team has experience in big data processing and analysis, visualization, and data modeling, has undertaken intelligent projects in many fields, and has full design and development capabilities from the underlying layers to the front end. Its main research interests are big data management and analysis and artificial intelligence technology, which it uses to promote data-driven scientific research.