
EMNLP 2021 Findings includes a paper titled TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning, which trains sentence embeddings without supervision using a Transformer structure. The network architecture is shown below

Specifically, some noise is first added to the input text, e.g. by deleting, swapping, adding, or masking some words. The Encoder must encode the noisy sentence into a fixed-size vector, and the Decoder then has to reconstruct the original, noise-free sentence. That's the whole idea, but there are quite a few details, starting with the training objective


$$
\begin{aligned}
J_{\text{SDAE}}(\theta) &= \mathbb{E}_{x\sim D}[\log P_{\theta}(x\mid \tilde{x})]\\
&=\mathbb{E}_{x\sim D}\Big[\sum_{t=1}^l \log P_{\theta}(x_t\mid \tilde{x})\Big]\\
&=\mathbb{E}_{x\sim D}\Big[\sum_{t=1}^l \log \frac{\exp(h_t^T e_t)}{\sum_{i=1}^N \exp(h_t^T e_i)}\Big]
\end{aligned}
$$

$D$ is the training set; $x = x_1 x_2 \cdots x_l$ is an input sentence of length $l$; $\tilde{x}$ is the sentence after noise has been added to $x$; $e_t$ is the word embedding of $x_t$; $N$ is the vocabulary size; $h_t$ is the Decoder's hidden state at decoding step $t$
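
As a sanity check, here is a minimal PyTorch sketch of this objective (shapes and names are purely illustrative, not the authors' code): the logit for vocabulary word $i$ at step $t$ is the dot product $h_t^T e_i$, and the loss is the standard token-level cross-entropy

import torch
import torch.nn.functional as F

l, d, N = 12, 768, 30522              # sentence length, hidden size, vocab size (illustrative)
h = torch.randn(l, d)                 # h_t: Decoder hidden state at step t
E = torch.randn(N, d)                 # e_i: word embedding of vocabulary word i
x = torch.randint(0, N, (l,))         # x_t: token ids of the original (noise-free) sentence

logits = h @ E.T                      # entry (t, i) is h_t^T e_i
loss = F.cross_entropy(logits, x)     # mean over t of -log softmax(h_t E^T)[x_t]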

Unlike the original Transformer, the Decoder decodes using only the fixed-size vector output by the Encoder. Specifically, the cross-attention between Encoder and Decoder is formally expressed as follows:


$$
\begin{aligned}
&H^{(k)}=\text{Attention}(H^{(k-1)}, [s^T], [s^T])\\
&\text{Attention}(Q,K,V) = \text{Softmax}\Big(\frac{QK^T}{\sqrt{d}}\Big)V
\end{aligned}
$$

where $H^{(k)}\in \mathbb{R}^{t\times d}$ holds the hidden states of the $t$ decoding steps at the $k$-th Decoder layer; $d$ is the dimension of the sentence vector (the dimension of the Encoder output); and $[s^T]\in \mathbb{R}^{1\times d}$ is the sentence (row) vector output by the Encoder. As the formula shows, no matter which layer the cross-attention is in, $K$ and $V$ are always $s^T$. The authors designed it this way to deliberately add a bottleneck to the model: if the sentence vector $s^T$ encoded by the Encoder is not accurate enough, the Decoder will struggle to reconstruct the sentence. In other words, the bottleneck pressures the Encoder into producing more accurate encodings. After training, if you need to extract sentence vectors, you only need the Encoder
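
To make the bottleneck concrete, here is a rough PyTorch sketch of the formula above (my own illustration of my reading of the paper, not the reference implementation):

import math
import torch

def attention(Q, K, V):
    # Softmax(Q K^T / sqrt(d)) V
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ V

t, d = 12, 768                        # decoding steps, sentence-vector dimension
H_prev = torch.randn(t, d)            # H^(k-1): previous layer's hidden states (queries)
s = torch.randn(1, d)                 # [s^T]: the single sentence vector from the Encoder

H_k = attention(H_prev, s, s)         # K and V are both just s^T
# With only one key, every decoding position attends entirely to s^T, so all
# decoded information must flow through this single vector: that is the bottleneck.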

By tuning these choices on the STS datasets, the authors found the best combination to be:

  1. Add noise by deleting words, with the deletion ratio set to 60% (a minimal sketch of this noise follows the list)
  2. Use the output of the [CLS] position as a sentence vector
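
Here is a minimal sketch of the deletion noise (purely illustrative; sentence-transformers' DenoisingAutoEncoderDataset, used below, ships its own implementation):

import random

# del_ratio=0.6 means each word is dropped with probability 0.6
def delete_words(text, del_ratio=0.6):
    words = text.split()
    kept = [w for w in words if random.random() > del_ratio]
    if not kept:                      # make sure at least one word survives
        kept = [random.choice(words)]
    return ' '.join(kept)

print(delete_words('TSDAE trains sentence embeddings from unlabeled text'))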

Results

From the reported results, TSDAE basically punches SimCSE and kicks BERT-flow around

Personal summary

If I were a reviewer, the one question I would particularly like to ask is: "How does your approach differ from BART's?"

TSDAE has been packaged into the sentence-transformers pip library. For the complete training process, you can refer to the sentence-transformers usage and fine-tuning tutorial; on top of that, you can train TSDAE simply by swapping in the dataset and loss shown below

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses, models

# Build the encoder: BERT with [CLS] pooling, as recommended above
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = ['...']  # replace with your list of raw, unlabeled sentences

# Create a special denoising dataset that adds noise on the fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)

# A standard PyTorch DataLoader to batch the data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Denoising autoencoder loss; tie_encoder_decoder=True ties the Decoder's weights to the Encoder's
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

# Model training
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)
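
Once training finishes, only the Encoder is used, as mentioned earlier; with sentence-transformers that is just model.encode:

# After training, sentence vectors come from the Encoder alone
embeddings = model.encode(['An example sentence', 'Each sentence becomes one vector'])
print(embeddings.shape)  # (2, 768) for a bert-base encoder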