(NLP Intelligent dialogue robot practical training course based on Transformer)
One Architecture, One Course, One Universe
Gavin of the Starry Sky Intelligent Dialogue Robot project sees the Transformer as the art of embracing data uncertainty.
The Transformer’s architecture, training, and inference are all carried out within the mathematical framework of Bayesian neural network uncertainty. The Encoder-Decoder architecture, the multi-head attention mechanism, Dropout, and residual networks can all be read as concrete implementations of a Bayesian neural network. The many variants and practices built on the Transformer likewise handle data uncertainty in a Bayesian spirit: using Embeddings of various kinds to supply better prior information is an application of the Bayesian idea of integrating and handling uncertainty in information representation. High scores in modern NLP competitions are mostly achieved by ensembling Transformer models such as RoBERTa, GPT, ELECTRA, and XLNet to counter, as far as possible, the uncertainty in how the model represents information and reasons over it.
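To make the “Dropout as a Bayesian approximation” reading concrete, here is a minimal sketch (not taken from the course code; the model, layer sizes, and function names are illustrative) of Monte Carlo dropout: keeping Dropout active at inference time and averaging several stochastic forward passes turns a single point prediction into a predictive distribution whose spread can serve as an uncertainty estimate.

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, n_classes=3, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),          # the layer we keep stochastic at inference
            nn.Linear(d_hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    """Run several stochastic forward passes and return the mean and std of the
    class probabilities; the std is a rough per-class uncertainty signal."""
    model.train()                        # deliberately keep Dropout active
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    model.eval()
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(4, 16)                   # a batch of 4 dummy inputs
mean_p, std_p = mc_dropout_predict(TinyClassifier(), x)
print(mean_p.shape, std_p.shape)         # torch.Size([4, 3]) twice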
From the standpoint of mathematical principles, the training objective of traditional machine learning and deep learning algorithms is generally built on maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP) under naive Bayesian assumptions, whose core is to find the single best set of model parameters. The core of the Bayesian approach, by contrast, is to compute the posterior predictive distribution, which expresses information and handles uncertainty better because the model itself reports its uncertainty. For a Bayesian architecture, prior knowledge from multiple perspectives is the foundation: when there is little or no data, inference relies mainly on the prior distribution over model parameters (for example the classical Gaussian distribution); as data accumulate, the parameters are continually updated so that the model distribution moves closer to the distribution of the real data. Inference then (in theory) integrates over all parameter settings, so a Bayesian neural network can attach a probability-based confidence interval to its results and thereby better capture data uncertainty in various inference tasks.
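To make the contrast concrete, the three objectives can be written side by side; this is the standard textbook formulation (not taken from the course materials), with D the training data, θ the model parameters, and (x, y) a new input-output pair:

\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; p(D \mid \theta)

\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(D \mid \theta)\, p(\theta)

p(y \mid x, D) = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta

The first two return a single parameter vector; only the last integrates over the posterior p(θ | D), which is what yields a predictive distribution and hence an explicit measure of uncertainty.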
Of course, because Bayesian models are expensive in CPU, memory, and network terms, computing the evidence P(B) over all model distributions in a Bayesian neural network is intractable in practical engineering, indeed almost impossible. Project implementations therefore resort to sampling techniques such as MCMC (Collapsed Gibbs Sampling, Metropolis-Hastings, Rejection Sampling) and Variational Inference (Mean Field and stochastic methods) to reduce the cost of training and inference. When the Transformer is viewed through this Bayesian lens, it balances the various factors to achieve the best approximation it can. For example, the multi-head self-attention mechanism, which is more cost-effective in CPU and memory than CNNs and RNNs, is used to express information from more integrated perspectives, and during Decoder training, multi-dimensional prior information is generally used to achieve faster training and higher-quality models. In practical engineering, the Transformer can also integrate Embeddings from different sources; for example, the Transformer implementation of the Starry Sky Intelligent Dialogue Robot integrates One-hot encoding, Word2vec, fastText, GRU, BERT, and other encodings to express information at more levels and from more perspectives.
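As an illustration of the “Embeddings from different sources” idea, here is a minimal sketch (assumptions: PyTorch, a trainable lookup table standing in for pretrained Word2vec/fastText vectors, and a small GRU standing in for a contextual encoder such as BERT; this is not the Starry Sky robot’s actual implementation) of fusing several token representations by concatenation and projection.

import torch
import torch.nn as nn

class FusedEmbedding(nn.Module):
    def __init__(self, vocab_size=1000, d_static=100, d_ctx=64, d_model=128):
        super().__init__()
        # stand-in for pretrained static vectors (Word2vec / fastText would be
        # loaded into this table in a real system)
        self.static = nn.Embedding(vocab_size, d_static)
        # stand-in for a contextual encoder (a GRU here; BERT could play this role)
        self.ctx_embed = nn.Embedding(vocab_size, d_ctx)
        self.gru = nn.GRU(d_ctx, d_ctx, batch_first=True)
        # projection that merges all views into one d_model-sized vector per token
        self.proj = nn.Linear(d_static + d_ctx, d_model)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        static_vecs = self.static(token_ids)             # (batch, seq, d_static)
        ctx_out, _ = self.gru(self.ctx_embed(token_ids)) # (batch, seq, d_ctx)
        return self.proj(torch.cat([static_vecs, ctx_out], dim=-1))

tokens = torch.randint(0, 1000, (2, 7))                  # dummy batch of token ids
print(FusedEmbedding()(tokens).shape)                    # torch.Size([2, 7, 128])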
By embracing data uncertainty and building on properties such as the Bayesian conjugate prior distribution, the Transformer forms an ideal framework that can integrate diverse prior knowledge for rich information expression and for relatively cheap training and inference. In theory, the Transformer is well suited to any data that is a “set of units”; computer vision, speech, and natural language processing all fall into this category, so the Transformer can be expected to dominate these fields for decades to come.
This course takes the Transformer architecture as its cornerstone, distills the most valuable content in NLP, and focuses on the full-life-cycle knowledge needed to hand-build an industrial intelligent business dialogue robot. After completing it, you will not only have connected all the core links of NLU, NLI, and NLG in NLP from the perspectives of algorithms, source code, and hands-on practice, but will also have the knowledge system, tools, methods, and reference source code to independently develop an industry-leading intelligent business dialogue robot and to join the top 1% of the industry in hard NLP skills.
Course features: 101 chapters of hands-on NLP lessons; 5,137 fine-grained NLP knowledge points; nearly 1,200 code cases covering all of the course content; a 10,000+ line, entirely hand-written implementation of an industrial intelligent business dialogue robot; AI-related mathematics taught inside concrete architecture scenarios and project cases; and, with the Attention mechanism under Bayesian deep learning as the foundation, full-life-cycle walkthroughs of five NLP competitions, including their complete code implementations.
Chapter 3: Language Model Details and Transformer XL Source Code
1. Mathematical essence analysis and code practice of MLE, one of the most important formulas in artificial intelligence
2. Mathematical principle of Language Model, Chain Rule analysis and Sparsity problem
3. Markov Assumption: first-order, second-order, and third-order analysis
4. Language Model: Unigram and analysis of its problems, Bigram and dependency order, N-gram
5. Using a Unigram to train a Language Model: analysis and practice
6. Using a Bigram to train a Language Model: analysis and practice
7. Using an N-gram to train a Language Model: analysis and practice
8. Spelling correction case study: detailed explanation and source code implementation of a simplified Naive Bayes error-correction algorithm
9. Evaluating a Language Model with PPL (Perplexity) based on Average Log Likelihood (a runnable sketch follows this list)
10. Laplace Smoothing analysis and specific methods for selecting optimal K based on PPL
11. Interpolation Smoothing: weighted average of different N-gram probabilities
12. Analysis of the Good-Turing Smoothing algorithm
13. Architecture analysis of how the vanilla Transformer Language Model handles long text
14. Vanilla Transformer training losses: Multiple Positions Loss, Intermediate Layer Losses, Multiple Targets Losses
15. Three core problems of the vanilla Transformer: segment context fragmentation, difficulty distinguishing positions, and low prediction efficiency
16. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
17. Segment-Level Recurrence with State Reuse: mathematical principle and implementation analysis (see the sketch after this list)
18. Relative Positional Encoding
19. Analysis of the tricks Transformer XL uses to reduce the complexity of matrix operations
20. Thoughts on the use of the caching mechanism in language models
21. Transformer XL data preprocessing: complete source code implementation and debugging
22. Transformer XL MemoryTransformerLM: complete source code implementation and debugging
23. Transformer XL PartialLearnableMultiHeadAttention: source code implementation and debugging
24. Transformer XL PartialLearnableDecoderLayer: source code implementation and debugging
25. Transformer XL AdaptiveEmbedding: source code implementation and debugging
26. Transformer XL relative position encoding PositionalEncoding: source code implementation and debugging
27. Transformer XL Adaptive Softmax: analysis and complete source code implementation
28. Transformer XL training: complete source code implementation and debugging
29. Transformer XL memory update, read, and maintenance revealed
30. Transformer XL unit testing
31. Transformer XL case debugging and visualization
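As a taste of items 6, 9, and 10 above, here is a minimal self-contained sketch (illustrative only, not the course code) that trains a bigram language model with add-k (Laplace) smoothing and scores held-out text with perplexity computed from the average log-likelihood; sweeping k and keeping the value with the lowest PPL is the selection method item 10 refers to.

import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, k=1.0):
    # add-k smoothing (Laplace when k=1) keeps nonzero mass for unseen bigrams
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

def perplexity(sentences, unigrams, bigrams, k=1.0):
    vocab_size = len(unigrams)
    log_sum, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w_prev, w in zip(tokens, tokens[1:]):
            log_sum += math.log(bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, k))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)   # PPL = exp(-average log-likelihood)

train = ["the cat sat", "the dog sat", "a cat ran"]
uni, bi = train_bigram(train)
print(perplexity(["the cat ran"], uni, bi))  # lower is better; grid-search k by PPL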
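And for item 17, a deliberately simplified sketch (assumptions: a single PyTorch attention layer, no relative positional encoding, memory limited to the immediately preceding segment; not the course's Transformer XL code) of segment-level recurrence with state reuse: the cached hidden states of the previous segment are detached from the computation graph and prepended to the current segment when forming the attention context.

import torch
import torch.nn as nn

class RecurrentSegmentLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, seg, memory=None):
        # seg: (batch, seg_len, d_model); memory: cached states of the previous segment
        context = seg if memory is None else torch.cat([memory, seg], dim=1)
        out, _ = self.attn(query=seg, key=context, value=context)
        new_memory = seg.detach()        # reuse the states, but stop gradients through them
        return out, new_memory

layer = RecurrentSegmentLayer()
stream = torch.randn(2, 3 * 5, 64)       # a long sequence split into 3 segments of length 5
memory = None
for seg in stream.split(5, dim=1):
    out, memory = layer(seg, memory)     # each segment also attends over the cached context
print(out.shape)                          # torch.Size([2, 5, 64])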
More lessons can be found in Gavin’s open lectures on the Starry Sky Intelligent Dialogue Robot.
Here’s a selection of previous open course videos:
www.bilibili.com/video/BV1N3…
www.bilibili.com/video/BV1aS…