This post introduces PLOME, a Chinese spelling error correction model published by Tencent's AI Platform Department at ACL 2021. Paper: aclanthology.org/2021.acl-lo…
Function
PLOME is designed for the Chinese Spelling Correction (CSC) task, which aims to detect and correct spelling errors in Chinese text.
Abstract
- The paper proposes PLOME, a pre-trained masked language model that incorporates misspelling knowledge for Chinese spelling correction (CSC) and jointly learns semantic and misspelling knowledge during pre-training
- Instead of the fixed token "[MASK]" used by BERT, PLOME masks selected tokens with similar characters drawn from a confusion set
- Besides character prediction, PLOME also takes character pronunciations and strokes as input, and learns misspelling knowledge at both the character and phonetic levels by recovering the real character and the real pronunciation at each masked position
- PLOME models pronunciation and stroke sequences with GRU networks
Contributions
- The proposed method significantly outperforms the previous state-of-the-art methods
- The source code and pre-trained models are released for the community: github.com/liushulinle…
- PLOME is the first task-specific language model designed for Chinese spelling correction, and the first to model the task at both the character and phonetic levels
- By introducing pinyin and strokes, PLOME can model the similarity between arbitrary characters
Model framework
Below is the main framework of PLOME.
Confusion Set based Masking Strategy
Like BERT, PLOME masks 15% of the input tokens, but unlike BERT it does not replace them with the fixed token "[MASK]". Instead, each selected token is transformed in one of four ways (a minimal sketch of this strategy follows the list):
- 60% of the time it is replaced with a phonologically similar character
- 15% of the time it is replaced with a visually similar character
- 15% of the time it is left unchanged
- 10% of the time it is replaced with a character picked at random from the vocabulary
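A minimal sketch of this confusion-set masking, assuming hypothetical dictionaries `phonetic_conf` and `visual_conf` that map each character to lists of phonologically / visually similar characters; it only illustrates the sampling ratios above and is not the released pre-processing code.

```python
import random

def confusion_mask(tokens, phonetic_conf, visual_conf, vocab, mask_rate=0.15):
    """Replace ~15% of tokens using a confusion-set masking strategy (illustrative sketch)."""
    masked = list(tokens)
    targets = []  # (position, original character) pairs the model must recover
    for i, ch in enumerate(tokens):
        if random.random() >= mask_rate:
            continue
        targets.append((i, ch))
        r = random.random()
        if r < 0.60:    # phonologically similar character
            masked[i] = random.choice(phonetic_conf.get(ch, [ch]))
        elif r < 0.75:  # visually similar character
            masked[i] = random.choice(visual_conf.get(ch, [ch]))
        elif r < 0.90:  # keep the original character
            masked[i] = ch
        else:           # random character from the vocabulary
            masked[i] = random.choice(vocab)
    return masked, targets
```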
Embedding Layer
As the framework diagram above shows, the final embedding of each character is the sum of its character embedding, position embedding, phonic embedding, and shape embedding. The first two are obtained from lookup tables; the vocabulary size and embedding dimension are the same as in BERT-base.
The Unihan Database is used to obtain the character-to-pinyin mapping (without tones); the pinyin letters of each character are fed into a 1-layer GRU network to generate the phonic embedding, the expectation being that similar pinyin sequences yield similar embeddings. An example is shown in the middle of the figure.
Stroke order is used to represent each Chinese character as a sequence of strokes, with stroke data obtained from the Chaizi Database. To model the visual relationship between characters, the stroke sequence of each character is fed into another 1-layer GRU network to generate the shape embedding; an example is shown at the bottom of the figure.
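A minimal PyTorch sketch of how the four embeddings could be combined, with 1-layer GRUs encoding each character's pinyin letters and strokes; module names, shapes, and the use of the final GRU state are illustrative assumptions, not PLOME's released implementation.

```python
import torch
import torch.nn as nn

class PlomeStyleEmbedding(nn.Module):
    """Illustrative sketch: per-character embedding = char + position
    + phonic (GRU over pinyin letters) + shape (GRU over strokes)."""
    def __init__(self, vocab_size, n_pinyin_letters, n_stroke_types,
                 hidden=768, max_len=512):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.letter_emb = nn.Embedding(n_pinyin_letters, hidden)
        self.stroke_emb = nn.Embedding(n_stroke_types, hidden)
        self.phonic_gru = nn.GRU(hidden, hidden, num_layers=1, batch_first=True)
        self.shape_gru = nn.GRU(hidden, hidden, num_layers=1, batch_first=True)

    def forward(self, char_ids, pinyin_letter_ids, stroke_ids):
        # char_ids: (batch, seq); pinyin_letter_ids, stroke_ids: (batch, seq, sub_len)
        b, s = char_ids.shape
        positions = torch.arange(s, device=char_ids.device).unsqueeze(0).expand(b, s)
        # Encode each character's pinyin-letter / stroke sequence; use the final GRU state.
        _, phonic = self.phonic_gru(self.letter_emb(pinyin_letter_ids.reshape(b * s, -1)))
        _, shape = self.shape_gru(self.stroke_emb(stroke_ids.reshape(b * s, -1)))
        phonic = phonic[-1].view(b, s, -1)
        shape = shape[-1].view(b, s, -1)
        return self.char_emb(char_ids) + self.pos_emb(positions) + phonic + shape
```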
Transformer Encoder
The Transformer encoder in PLOME has the same architecture as BERT-base: 12 layers, a hidden size of 768 per layer, and 12 attention heads.
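For concreteness, these are the BERT-base hyperparameters; a minimal sketch of building an encoder of this size with the Hugging Face transformers library (shown only to make the sizes explicit, not how the released PLOME code necessarily constructs its encoder):

```python
from transformers import BertConfig, BertModel

# BERT-base-sized encoder: 12 layers, hidden size 768, 12 attention heads.
config = BertConfig(hidden_size=768, num_hidden_layers=12,
                    num_attention_heads=12, intermediate_size=3072)
encoder = BertModel(config)
```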
Output Layer
PLOME makes two predictions for each selected character (a sketch of the two prediction heads follows this list):
- Character prediction: similar to BERT, PLOME predicts the original character of each masked token from the embeddings produced by the last Transformer layer
- Pronunciation prediction: pronunciation errors dominate Chinese spelling mistakes, with about 80% of spelling errors being phonological. To learn misspelling knowledge at the phonetic level, PLOME also predicts the true pronunciation of each masked token, represented as toneless pinyin.
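A minimal sketch of the two prediction heads over the final Transformer states; the layer names and shapes are illustrative assumptions, not the released code.

```python
import torch.nn as nn

class DualPredictionHead(nn.Module):
    """Character and pronunciation (pinyin) prediction over the encoder output."""
    def __init__(self, hidden_size, n_chars, n_pinyins):
        super().__init__()
        self.char_head = nn.Linear(hidden_size, n_chars)      # character distribution
        self.pinyin_head = nn.Linear(hidden_size, n_pinyins)  # pronunciation distribution

    def forward(self, hidden):
        # hidden: (batch, seq, hidden_size) from the last Transformer layer
        char_logits = self.char_head(hidden)
        pinyin_logits = self.pinyin_head(hidden)
        return char_logits, pinyin_logits
```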
Learning
Since the model makes two kinds of predictions, training is driven by two objectives: character prediction and pronunciation prediction. L_c is the character-prediction objective, where l_i is the real character of x_i; L_p is the pronunciation-prediction objective, where r_i is the real pinyin of x_i.
The overall objective is defined as the sum of the two:
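Written out from the definitions above (the notation is reconstructed here and may differ slightly from the paper's equations), the objectives take roughly this form, where M is the set of masked positions:

```latex
\mathcal{L}_c = -\sum_{i \in M} \log p_c(y_i = l_i \mid X), \qquad
\mathcal{L}_p = -\sum_{i \in M} \log p_p(g_i = r_i \mid X), \qquad
\mathcal{L} = \mathcal{L}_c + \mathcal{L}_p
```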
Fine-tuning Procedure
PLOME is designed for the CSC task of detecting and correcting spelling errors in Chinese text. Formally, given a sequence of n characters X = {x1, x2, …, xn}, the model is expected to generate the target sequence Y = {y1, y2, …, yn} in which the errors are corrected.
The learning objectives are exactly the same as in pre-training. The process differs from pre-training in two ways: (1) the confusion-set masking strategy is removed; (2) predictions are made for all input characters rather than only for the selected tokens as in pre-training.
Inference: since PLOME is pre-trained to predict both a character distribution and a pronunciation distribution for each masked token, the two are combined into a joint distribution. The joint probability p_j(y_i = j | X) that the original character of x_i is the j-th character in the vocabulary combines the character and pronunciation predictions, p_j(y_i = j | X) = p_c(y_i = j | X) · p_p(g_i = j_p | X), where p_c and p_p are defined in Equations 1 and 2 and j_p denotes the pronunciation of the j-th character. To compute this, an index matrix I ∈ R^{n_c × n_p} is constructed, where I_{i,j} is set to 1 if the i-th character is pronounced as the j-th pinyin and 0 otherwise; the joint distribution is then obtained by combining p_c with the pronunciation probabilities selected through I.
The joint probability is used as the prediction distribution: for each input token, the character with the highest joint probability is selected as the final output. Because the joint distribution takes both character and pronunciation predictions into account, it is more accurate.
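A minimal sketch of this joint-distribution computation for a single token, assuming `p_char` holds the n_c character probabilities, `p_pinyin` the n_p pronunciation probabilities, and `index_matrix` the 0/1 matrix described above; it illustrates the idea rather than reproducing the released inference code.

```python
import numpy as np

def joint_prediction(p_char, p_pinyin, index_matrix):
    """p_char: (n_chars,), p_pinyin: (n_pinyins,), index_matrix: (n_chars, n_pinyins) 0/1.
    Returns the index of the character with the highest joint probability."""
    # For each character j, pick out the probability of its own pronunciation j_p.
    pron_of_char = index_matrix @ p_pinyin   # (n_chars,)
    p_joint = p_char * pron_of_char          # element-wise product of the two predictions
    p_joint = p_joint / p_joint.sum()        # renormalize (optional)
    return int(np.argmax(p_joint))
```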
Experiments
Pre-training
Dataset: wiki2019zh, which consists of 1 million Chinese Wikipedia pages, is used as the pre-training corpus, together with 3 million news articles. Splitting the pages and articles into sentences yields 162.1 million sentences; contiguous sentences are then concatenated into text fragments of at most 510 characters to form training examples.
Parameter settings: omitted here; see the paper.
Fine-tuning
Dataset: the training data consists of 10K manually annotated samples from the SIGHAN 2013, 2014, and 2015 datasets and 271K automatically generated samples.
Evaluation data: the latest SIGHAN test set, which contains 1,100 texts and 461 error types, is used to evaluate the proposed model.
Evaluation metrics: the commonly used precision, recall, and F1 scores (a sketch of one common sentence-level variant follows this subsection).
Parameter settings: omitted here; see the paper.
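For reference, a minimal sketch of sentence-level correction precision/recall/F1 as they are commonly computed for CSC; exact definitions vary slightly across papers, and this is not the paper's official evaluation script.

```python
def sentence_level_correction_prf(sources, predictions, references):
    """One common variant of sentence-level correction metrics for CSC.
    sources, predictions, references: parallel lists of sentences (strings)."""
    tp = fp = fn = 0
    for src, pred, ref in zip(sources, predictions, references):
        has_error = src != ref
        changed = pred != src
        if has_error and pred == ref:
            tp += 1   # erroneous sentence, fully and correctly corrected
        if changed and pred != ref:
            fp += 1   # model edited the sentence but the result is wrong
        if has_error and pred != ref:
            fn += 1   # erroneous sentence left wrong (missed or mis-corrected)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```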
Main Results
From the test results table above, the following conclusions can be drawn:
- Without fine-tuning, the pre-trained models in the middle group of the table already achieve strong results, even significantly better than the supervised method PN. This suggests that the confusion-set-based masking strategy enables the model to learn task-specific knowledge during pre-training.
- cBERT-Finetune outperforms BERT-Finetune on all metrics, suggesting that the proposed masking strategy provides essential knowledge that cannot be learned through fine-tuning alone.
- With phonic and shape embeddings added, PLOME-Finetune outperforms cBERT-Finetune by 2.3% and 2.8% on sentence-level detection and correction, respectively. This suggests that pinyin and strokes provide useful information that is difficult to learn from the confusion set alone.
- SpellGCN uses the same confusion set as this work but a different strategy to learn the knowledge it contains: SpellGCN builds a GCN to model this information, whereas PLOME learns it from large-scale data during pre-training. PLOME achieves better performance on all metrics, indicating that its approach models such knowledge more effectively.
- As the results above show, on the entire test set as well as on SIGHAN, PLOME outperforms BERT and SpellGCN on almost all metrics.
- The proposed model was also evaluated on SIGHAN13 and SIGHAN14, and PLOME consistently outperformed all comparison models.
Effects of Prediction Strategy
PLOME predicts three distributions for each character: the character distribution p_c, the pronunciation distribution p_p, and the joint distribution p_j; the latter two involve pronunciation prediction. The CSC task requires character prediction, so only the effects of the character distribution p_c and the joint distribution p_j are compared.
The joint distribution is observed to outperform the character distribution on all evaluation metrics; the gap in precision is especially noticeable. Because the joint distribution considers both character and pronunciation predictions, its predictions are more accurate.
Effects of Initialization Strategy
In general, the initialization strategy has a significant impact on the performance of deep models. Four baselines based on cBERT and PLOME are implemented. The experiments show that both cBERT and PLOME achieve better performance when initialized with BERT; in particular, recall improves significantly on all evaluations. The paper attributes this to two reasons: (1) the rich semantic information in BERT effectively improves generalization; (2) PLOME consists of two 1-layer GRU networks and a 12-layer Transformer encoder, totaling more than 110M parameters, and training such a large model from scratch easily gets stuck in local optima.
Phonic/Shape Embedding Visualization
The phonic and shape embeddings generated by the GRU networks for each character are visualized. Using cosine similarity over the 768-dimensional embeddings produced by the GRU, the 30 characters closest to the character for "ingot" are selected and projected with t-SNE. On the one hand, almost all of the nearest characters, such as the character for "blossom", are visually similar to "ingot"; on the other hand, characters that are similar to each other lie very close together. These observations indicate that the learned shape embeddings model shape similarity well. The figure also shows that pinyin similar to "ding" are clustered together.
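A minimal sketch of this kind of analysis, assuming `shape_embeddings` is a (vocab_size, 768) NumPy array of learned embeddings and `chars` is the corresponding character list; it is only an illustration, not the paper's plotting code.

```python
import numpy as np
from sklearn.manifold import TSNE

def nearest_and_project(shape_embeddings, chars, query_char, k=30):
    """Find the k characters closest to query_char by cosine similarity,
    then project their embeddings to 2D with t-SNE for plotting."""
    normed = shape_embeddings / np.linalg.norm(shape_embeddings, axis=1, keepdims=True)
    query = normed[chars.index(query_char)]
    sims = normed @ query                     # cosine similarity to the query character
    top_idx = np.argsort(-sims)[: k + 1]      # keep the query itself plus its k neighbors
    coords = TSNE(n_components=2, perplexity=5).fit_transform(normed[top_idx])
    return [chars[i] for i in top_idx], coords
```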
Converging Speed of Various Models
Thanks to the confusion-set-based masking strategy, which learns task-specific knowledge during pre-training, cBERT and PLOME already outperform BERT at the beginning of training. In addition, the proposed models need fewer training steps to reach relatively good performance: PLOME needs only 7K steps to reach a score of 80%, whereas BERT needs 47K steps.
Closing remarks
- This article took real effort to write; please credit the source when reposting
- I hope it helped you learn something; likes and support are of course appreciated 🙏
- Feel free to discuss in the comments section