
Continuous Prompts

The original intention of prompting was to find a way for a pre-trained language model (PLM) to better produce the desired output. However, the prompt does not have to be natural language that humans can understand; it only needs to be something the machine can process. Hence there is also a line of work exploring continuous prompts, which act directly in the model's embedding space. Continuous prompts remove two constraints:

  1. The embeddings of the template words can be any vectors in the natural-language embedding space, rather than only the embeddings of a limited set of real words
  2. The template's parameters are no longer taken directly from the PLM; they are independent parameters of their own, which can be tuned with the training data of the downstream task

Prefix Tuning, first proposed by Li et al., prepends a sequence of continuous vectors (the prefix) to the input sentence. The PLM parameters are kept frozen and only the prefix vectors are trained. To handle generation tasks, different concatenation schemes are defined for different model architectures: autoregressive models of the GPT type use $[\text{PREFIX}; x; y]$, while encoder-decoder models use $[\text{PREFIX}; x; \text{PREFIX}'; y]$.

The position indices of the $\text{Prefix}$, $x$, and $y$ parts of the input are denoted $\text{P}_{\text{idx}}$, $\text{X}_{\text{idx}}$, and $\text{Y}_{\text{idx}}$ respectively. A trainable matrix $P_\theta \in \mathbb{R}^{\lvert \text{P}_{\text{idx}}\rvert \times \dim(h_i)}$ is initialized, where


$$
h_i = \begin{cases} P_\theta[i,:], \quad &\text{if}\ i \in \text{P}_{\text{idx}}\\ \mathbf{LM}_{\phi}(z_i, h_{<i}), \quad &\text{otherwise} \end{cases}
$$

The formula above means that if index $i$ belongs to the prefix, its vector is taken from $P_\theta$; otherwise, the corresponding vector is produced by the pre-trained model with frozen parameters. The training objective is:


$$
\mathop{\text{max}}\limits_{\phi}\ \log p_{\phi}(y\mid x) = \sum\limits_{i\in \text{Y}_{\text{idx}}} \log p_{\phi}(z_i\mid h_{<i})
$$

In implementation, $P_\theta$ is simply a matrix, which can be realized, for example, with an `nn.Linear()` layer.
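As a rough illustration of the idea (not the authors' implementation, which additionally reparameterizes $P_\theta$ through an MLP for training stability), a minimal PyTorch sketch might look like the following; the model name `gpt2`, the prefix length, and the class name are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer  # assumed available

class PrefixTuningSketch(nn.Module):
    """Minimal sketch: trainable prefix vectors prepended to a frozen GPT-2."""
    def __init__(self, prefix_len=10, model_name="gpt2"):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(model_name)
        for p in self.lm.parameters():              # freeze all PLM parameters (phi)
            p.requires_grad = False
        dim = self.lm.config.n_embd
        # P_theta in R^{|P_idx| x dim(h_i)}: the only trainable parameters
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, input_ids, labels=None):
        batch = input_ids.size(0)
        tok_emb = self.lm.transformer.wte(input_ids)              # embeddings of [x; y]
        pre_emb = self.prefix.unsqueeze(0).expand(batch, -1, -1)  # rows of P_theta
        inputs_embeds = torch.cat([pre_emb, tok_emb], dim=1)      # [PREFIX; x; y]
        if labels is not None:
            # prefix positions are excluded from the loss (label -100 is ignored)
            ignore = torch.full((batch, self.prefix.size(0)), -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([ignore, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)

# usage sketch: only self.prefix receives gradients
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = PrefixTuningSketch(prefix_len=10)
ids = tokenizer("summarize: the article ...", return_tensors="pt").input_ids
loss = model(ids, labels=ids).loss
loss.backward()
```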

OptiPrompt likewise searches for prompts in a continuous space, but its "templates" are not limited to prefixes; continuous tokens can also appear in the middle of a sentence.

First, define a prompt template following AutoPrompt:


$$
[x]\ [v]_1\ [v]_2\ \dots\ [v]_m\ [\text{MASK}]
$$

where $[v]_i$ is a continuous vector (with the same dimension as BERT's input embeddings). OptiPrompt also considers using manually constructed discrete prompts as a starting point for the search in continuous space, in order to build better prompts. For example, the manual prompt $[x]\ \text{is}\ [\text{MASK}]\ \text{citizen}$ can be converted to


$$
[x]\ [v]_1\ [\text{MASK}]\ [v]_2
$$

where $[v]_1$ and $[v]_2$ are initialized with the input embeddings of "is" and "citizen" respectively.
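A minimal sketch of this initialization step (an illustration under assumptions, not OptiPrompt's actual code; the model name and variable names are made up here): the learnable vectors are simply copied from the frozen BERT input embeddings of the corresponding manual-prompt words and then trained on the downstream data.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumed available

bert = BertModel.from_pretrained("bert-base-uncased")
tok = BertTokenizer.from_pretrained("bert-base-uncased")
for p in bert.parameters():          # the PLM stays frozen; only the [v]_i are trained
    p.requires_grad = False

emb = bert.get_input_embeddings()                    # BERT's word embedding table
ids = tok.convert_tokens_to_ids(["is", "citizen"])   # words from "[x] is [MASK] citizen"

# [v]_1, [v]_2: trainable continuous vectors initialized from the embeddings of
# "is" and "citizen"; they are then spliced into "[x] [v]_1 [MASK] [v]_2"
v = nn.Parameter(emb.weight[ids].detach().clone())   # shape: (2, hidden_size)
```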

Hard-Soft Prompt Hybrid Tuning methods combine manual design with automatic learning. Rather than using a purely learnable prompt template, they insert some learnable embeddings into a manually designed template. Continuous prompts are generally better than discrete ones, but is there still room for improvement? The P-Tuning method proposed by Liu et al. addresses the problem of capturing the correlation between prompt tokens.

Previously, continuous prompts were generated by training a matrix and concatenating rows indexed from it, so each prompt token embedding was learned independently. Ideally, these prompt token embeddings should be correlated with one another rather than learned in isolation. To address this, P-Tuning introduces a Prompt Encoder (as shown in Figure (b) below).

Figure (a) shows the traditional discrete prompt; we call the component that produces discrete prompt tokens the Prompt Generator. In Figure (b), some virtual (pseudo) tokens such as [unused1], [unused2], ... are fed in first; the number of such tokens is a hyperparameter and their insertion positions can be adjusted. These pseudo tokens are passed through a Prompt Encoder to obtain continuous vectors $h_0,\dots,h_m$, where

$$
\begin{aligned} h_i &= \text{MLP}([\overrightarrow{h_i}; \overleftarrow{h_i}])\\ &= \text{MLP}([\text{LSTM}(h_{0:i}); \text{LSTM}(h_{i:m})]) \end{aligned}
$$

That is, the Prompt Encoder is just a simple network composed of a BiLSTM and an MLP (a code sketch of this encoder is given after the template examples below). The authors also found that adding some anchor tokens (domain- or task-specific tokens) helps to optimize the template. For example, consider a textual entailment task, where the input consists of a premise and a hypothesis. A continuous template is


$$
[\text{PRE}]\ [\text{continuous tokens}]\ [\text{HYP}]\ [\text{continuous tokens}]\ [\text{MASK}]
$$

Adding an anchor token "?" to it works even better; the template now becomes


$$
[\text{PRE}]\ [\text{continuous tokens}]\ [\text{HYP}]\ \text{?}\ [\text{continuous tokens}]\ [\text{MASK}]
$$
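As a minimal sketch of the BiLSTM+MLP prompt encoder described above (the number of pseudo tokens, hidden size, and layer sizes are assumptions, and this is not the official P-Tuning code):

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Sketch of P-Tuning's prompt encoder: BiLSTM over pseudo-token
    embeddings followed by an MLP (sizes are illustrative assumptions)."""
    def __init__(self, num_pseudo_tokens=6, hidden=768):
        super().__init__()
        # embeddings of the pseudo tokens [unused1], [unused2], ...
        self.pseudo_emb = nn.Embedding(num_pseudo_tokens, hidden)
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self):
        idx = torch.arange(self.pseudo_emb.num_embeddings)   # 0 .. m
        x = self.pseudo_emb(idx).unsqueeze(0)                 # (1, m+1, hidden)
        # the BiLSTM output at position i concatenates the forward state over
        # h_{0:i} and the backward state over h_{i:m}
        out, _ = self.lstm(x)
        return self.mlp(out).squeeze(0)                       # h_0, ..., h_m

h = PromptEncoder()()   # continuous prompt vectors to splice into the input
print(h.shape)          # torch.Size([6, 768])
```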

You may ask: how is P-Tuning optimized? In fact, there are two cases, depending on the amount of annotated data (a sketch follows the list below)

  1. Annotated data is scarce. In this case, only the embeddings behind $[\text{P}_0]\sim[\text{P}_m]$ are optimized; in other words, we simply update the Prompt Encoder's parameters
  2. Annotated data is abundant. In this case, all parameters are released and the whole model is fine-tuned directly
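A hedged sketch of these two regimes (the learning rates, model name, and the stand-in prompt encoder below are assumptions for illustration):

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumed available

plm = BertModel.from_pretrained("bert-base-uncased")   # pre-trained language model
prompt_encoder = nn.LSTM(768, 384, bidirectional=True,
                         batch_first=True)             # stand-in for the BiLSTM+MLP encoder above

# Case 1: scarce annotated data -> freeze the PLM, update only the prompt
# encoder (i.e. only the embeddings behind [P_0] ~ [P_m] are learned)
for p in plm.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(prompt_encoder.parameters(), lr=1e-3)

# Case 2: abundant annotated data -> release all parameters and fine-tune jointly
for p in plm.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    list(plm.parameters()) + list(prompt_encoder.parameters()), lr=2e-5
)
```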