This article is the fourth in my Prompt series. In the previous posts I covered two classic works on prompt design, AutoPrompt and Null Prompt. Today's paper explores hybrid prompts as well as prompt pre-training, and comes from the well-known Tsinghua CoAI and Tsinghua NLP teams.
The paper, PPT: Pre-trained Prompt Tuning for Few-shot Learning, was uploaded to arXiv in September 2021. Co-first authors Yuxian Gu and Xu Han are from Tsinghua University.
Motivation
With sufficient data, prompt tuning performs comparably to traditional fine-tuning, but in the few-shot setting prompt tuning is much worse.
The authors attribute this to the initialization of the soft prompt, so they try adding soft prompts during the pre-training phase to obtain a better initialization. This is Pre-trained Prompt Tuning (PPT).
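To make the setting concrete, here is a minimal sketch of soft prompt tuning, assuming a HuggingFace-style encoder such as BERT; `SoftPromptWrapper`, the prompt length, and the random initialization are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of soft prompt tuning (illustrative, not the paper's code):
# a handful of trainable prompt embeddings are prepended to the input
# embeddings while the pre-trained backbone stays frozen.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, backbone, num_prompt_tokens=20):
        super().__init__()
        self.backbone = backbone                     # frozen pre-trained model
        for p in self.backbone.parameters():
            p.requires_grad = False
        hidden = backbone.config.hidden_size
        # the only trainable parameters: the soft prompt P
        self.soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden) * 0.02)

    def forward(self, input_ids, attention_mask):
        embeds = self.backbone.get_input_embeddings()(input_ids)
        prompt = self.soft_prompt.unsqueeze(0).expand(embeds.size(0), -1, -1)
        embeds = torch.cat([prompt, embeds], dim=1)  # sequence becomes [P; x]
        prompt_mask = torch.ones(
            embeds.size(0), prompt.size(1),
            dtype=attention_mask.dtype, device=attention_mask.device,
        )
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.backbone(inputs_embeds=embeds, attention_mask=attention_mask)
```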
Pilot Experiments
The authors conduct several pilot experiments on prompt tuning:
1. Mixed Prompt Tuning (Hard + Soft)
The authors combine soft prompts with three manually designed hard prompts and two automatically generated hard prompts, where P denotes the soft prompt and S the input sentence. The results are as follows:
Hybrid prompts do help prompt tuning, but it still falls short of fine-tuning.
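As a rough sketch of what such a hybrid input looks like (the template below is invented for illustration and is not one of the paper's five), the hard prompt and the sentence S become ordinary token ids, while the soft prompt P is prepended as trainable embeddings, e.g. by the `SoftPromptWrapper` sketched above:

```python
# Hypothetical hybrid (hard + soft) input construction; the template text
# is made up, not one of the paper's actual prompts.
def build_hybrid_inputs(tokenizer, sentence):
    # hard prompt tokens and the input sentence S are ordinary token ids
    text = f"{sentence} It was {tokenizer.mask_token} ."
    return tokenizer(text, return_tensors="pt")

# inputs = build_hybrid_inputs(tokenizer, "A gripping, well-acted thriller.")
# outputs = wrapper(inputs["input_ids"], inputs["attention_mask"])
# the model then sees [soft P ; S ; hard prompt with a mask slot]
```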
2. Verbalizer choice
As shown above, the authors compare different verbalizers under the same prompt template and find that the choice of verbalizer makes a big difference. In general, words that explain the meaning of the corresponding label work better.
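To show what the verbalizer does in this setup, here is a hedged sketch; the model name, template, and label words ("great"/"terrible") are placeholders for illustration, not the paper's choices.

```python
# Illustrative verbalizer: map each label to a label word and score labels
# by the MLM logits of those words at the [MASK] position.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

verbalizer = {"positive": "great", "negative": "terrible"}  # label -> label word

def classify(sentence):
    text = f"{sentence} It was {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    scores = {
        label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
        for label, word in verbalizer.items()
    }
    return max(scores, key=scores.get)
```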
3. Initializing soft prompt tokens with word embeddings
The authors try four initialization strategies that previous work has validated as effective on small models. However, initializing the soft prompt tokens with the embeddings of concrete words brings little or even negative benefit on a model with 11B parameters.
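One such strategy could look roughly like the sketch below, which copies the embeddings of a few concrete words into the soft prompt instead of initializing it randomly; the word list and prompt length are placeholders, not the authors' exact settings.

```python
# Illustrative word-embedding initialization for the soft prompt; the words
# used here are arbitrary single-token examples, not the paper's choices.
import torch
import torch.nn as nn

def init_soft_prompt_from_words(model, tokenizer, words, num_prompt_tokens=20):
    emb = model.get_input_embeddings().weight        # vocabulary embedding table
    ids = tokenizer.convert_tokens_to_ids(words)     # words must be single tokens
    # repeat the chosen ids to fill every soft prompt position, then truncate
    ids = (ids * (num_prompt_tokens // len(ids) + 1))[:num_prompt_tokens]
    return nn.Parameter(emb[ids].clone().detach())

# wrapper.soft_prompt = init_soft_prompt_from_words(
#     model, tokenizer, ["good", "bad", "movie", "it"]
# )
```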
Moreover, none of the three methods above fully solves the problem of prompt tuning in the few-shot setting.
Method
(Placeholder; to be filled in later.)
Experiments
The authors use 32 training samples and 32 validation samples for each dataset. The classification results are as follows:
The main conclusions are as follows:
- Comparison across fine-tuning: the larger the model, the better fine-tuning performs, which shows that large models retain an advantage in few-shot settings.
- Comparison among prompt-tuning methods: PPT clearly outperforms vanilla PT and LM Adaptation on most datasets, while a simple combination of PPT and hard prompts (Hybrid PPT) achieves the best results on almost all datasets. This suggests that prompt pre-training and hybrid prompts may be complementary.
- Comparison between PPT and fine-tuning: PPT outperforms fine-tuning on most English datasets and all Chinese datasets, which indicates that PPT bridges the gap between the MLM pre-training objective and downstream tasks better than fine-tuning does.
- Variance of prompt-tuning performance: in the few-shot setting, prompt tuning is very unstable across datasets, while the variance of PPT's performance decreases significantly on all datasets.
- To be continued…