BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Paper link

Abstract

BART is a denoising autoencoder designed for pre-training sequence-to-sequence models.
Building on earlier pre-training architectures, BART is trained in two steps:
(1) corrupt the text with an arbitrary noising function;
(2) learn a model to reconstruct the original text from the corrupted version.
It is especially effective for text generation tasks such as question answering, dialogue, and summarization, while matching RoBERTa on other task types.

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.
BART is trained by (1) corrupting text with an arbitrary noising function,
and (2) learning a model to reconstruct the original text.

Introduction

BART is a pre-trained model that combines Bidirectional and Auto-Regressive Transformers.
The base model uses a standard Transformer-based neural machine translation architecture, which can be seen as generalizing BERT (its bidirectional encoder), GPT (its left-to-right decoder), and many other more recent pre-training schemes.

BART uses a standard Transformer-based neural machine translation architecture
which, despite its simplicity, can be seen as generalizing BERT (due to the
bidirectional encoder), GPT (with the left-to-right decoder), and many other
more recent pretraining schemes.

BERT masks randomly selected tokens, but its bidirectional encoder predicts each masked token independently, so it is not well suited to text generation.
GPT uses a unidirectional, autoregressive decoder to predict the next token, but it cannot exploit bidirectional context.
BART combines the two in a sequence-to-sequence architecture: a bidirectional encoder encodes the corrupted text, and an autoregressive decoder predicts the original text token by token; for fine-tuning, the uncorrupted document is fed to the model.
(The corruption is not limited to masking a single token; whole contiguous spans may be removed.)
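To make the encoder/decoder split concrete, here is a minimal sketch (my own, not from the paper or the original post) that runs a released BART checkpoint as a denoiser using the Hugging Face transformers library: the bidirectional encoder reads a corrupted sentence containing a `<mask>` span, and the autoregressive decoder generates the reconstruction.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library is installed;
# the facebook/bart-large checkpoint is downloaded on first use.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Corrupted input: the bidirectional encoder sees the masked span in full context.
corrupted = "BART is a denoising <mask> for pretraining sequence-to-sequence models."
inputs = tokenizer(corrupted, return_tensors="pt")

# The autoregressive decoder reconstructs the text left to right.
output_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```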

Model

Base model: 6 layers in both encoder and decoder
Large model: 12 layers in both encoder and decoder
The architecture is close to BERT (a standard Transformer sequence-to-sequence model), with the two differences quoted below:
BART drops the extra feed-forward network that BERT adds before word prediction, yet it still contains roughly 10% more parameters than an equivalently sized BERT model.

(1) each layer of the decoder additionally performs cross-attention over
the final hidden layer of the encoder (as in the transformer
sequence-to-sequence model)
(2) BERT uses an additional feed-forward network before word prediction,
which BART does not. In total, BART contains roughly 10% more parameters
than the equivalently sized BERT model.
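As a rough illustration of these sizes, the sketch below (my own, assuming the Hugging Face transformers library; the hyperparameters mirror the commonly used base configurations and are not taken from the post) instantiates both models from scratch and compares their parameter counts.

```python
from transformers import BartConfig, BartModel, BertConfig, BertModel

# BART base-style config: 6 encoder + 6 decoder layers; every decoder layer
# cross-attends over the encoder's final hidden states.
bart_cfg = BartConfig(
    encoder_layers=6, decoder_layers=6, d_model=768,
    encoder_attention_heads=12, decoder_attention_heads=12,
    encoder_ffn_dim=3072, decoder_ffn_dim=3072,
)
# BERT base-style config for comparison (the vocabularies differ, so the
# printed ratio is only indicative of the ~10% figure quoted above).
bert_cfg = BertConfig(
    num_hidden_layers=12, hidden_size=768,
    num_attention_heads=12, intermediate_size=3072,
)

bart = BartModel(bart_cfg)   # randomly initialized, no pretrained weights
bert = BertModel(bert_cfg)
print(f"BART parameters: {bart.num_parameters():,}")
print(f"BERT parameters: {bert.num_parameters():,}")
```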

Pre-training: Noising the Input

They experimented with several kinds of noising transforms (and note that the choice of noising function is an open question worth further study). A small code sketch of these transforms follows the list.

  1. Token Masking : as in BERT, randomly sampled tokens are replaced with a mask token.
  2. Token Deletion : random tokens are deleted from the input; unlike masking, the model must also figure out which positions are missing.
  3. Sentence Permutation : the document is split into sentences and the sentences are shuffled into a random order.
  4. Document Rotation : a token is chosen uniformly at random and the document is rotated so that it starts with that token; the model must identify the true start of the document.
  5. Text Infilling : spans of text are sampled with lengths drawn from a Poisson distribution ($\lambda = 3$) and each span (possibly covering several tokens, or none) is replaced by a single mask token, similar to SpanBERT but with a different length distribution.
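The following is a minimal Python sketch of these five corruptions, written for illustration only (it is not the authors' implementation; the probabilities and the plain `<mask>` string are my own assumptions):

```python
import random
import numpy as np

MASK = "<mask>"

def token_masking(tokens, p=0.15):
    # (1) Token Masking: replace randomly sampled tokens with a mask symbol.
    return [MASK if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.15):
    # (2) Token Deletion: drop tokens; the model must infer which positions are missing.
    return [t for t in tokens if random.random() >= p]

def sentence_permutation(sentences):
    # (3) Sentence Permutation: shuffle whole sentences into a random order.
    return random.sample(sentences, len(sentences))

def document_rotation(tokens):
    # (4) Document Rotation: pick a token uniformly at random and rotate the
    # document so it starts there; the model must find the true start.
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]

def text_infilling(tokens, lam=3.0, p=0.1):
    # (5) Text Infilling: replace spans with a single mask; span lengths are
    # drawn from Poisson(lam); a length-0 span just inserts an extra mask.
    out, i = [], 0
    while i < len(tokens):
        if random.random() < p:
            span = int(np.random.poisson(lam))
            out.append(MASK)           # one mask token regardless of span length
            if span == 0:
                out.append(tokens[i])  # keep the current token so the loop advances
                i += 1
            else:
                i += span              # the span's tokens disappear behind the mask
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "the quick brown fox jumps over the lazy dog".split()
print(text_infilling(tokens))
```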

Fine-tuning BART

After pre-training, BART is fine-tuned for the downstream task. The paper describes several setups:

  1. Sequence Classification Tasks : the same input is fed into both the encoder and the decoder, and the final hidden state of the decoder (at the last token) is passed to a linear classifier that predicts the label (see the sketch after this list).
  2. Token Classification Tasks : the same input is again fed into both the encoder and the decoder, and the decoder's top hidden state for each token is used as that token's representation for classification.
  3. Sequence Generation Tasks : because the decoder is autoregressive, BART can be fine-tuned directly on generation tasks such as summarization and abstractive question answering.
  4. Machine Translation : somewhat surprisingly, simply adding one more encoder helps (Edunov et al., 2019). The whole model is trained end-to-end: a new, randomly initialized encoder replaces BART's embedding layer and learns to map foreign-language words into an input that BART can de-noise into English. Training proceeds in two steps:
    Step 1 : freeze the pre-trained BART encoder and decoder, and train the randomly initialized source encoder on bitext (parallel translation data).
    Step 2 : unfreeze all parameters and fine-tune the whole model once more.
    Previous work (Edunov et al., 2019) has shown that models can be improved by incorporating pre-trained encoders, but gains from using pre-trained language models in decoders have been limited.
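As a concrete, deliberately simplified sketch of setup 1, the snippet below feeds the same input to BART's encoder and decoder and classifies from the final decoder state at the last token. It assumes the Hugging Face transformers library and PyTorch; the class name, the facebook/bart-base checkpoint, and the single-example pooling (no padding handled) are my own illustrative choices, not the paper's code.

```python
import torch.nn as nn
from transformers import BartModel, BartTokenizer

class BartSequenceClassifier(nn.Module):
    """Illustrative head: same input to encoder and decoder, classify from the
    final decoder hidden state at the last (EOS) token."""
    def __init__(self, name="facebook/bart-base", num_labels=2):
        super().__init__()
        self.bart = BartModel.from_pretrained(name)
        self.classifier = nn.Linear(self.bart.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bart(input_ids=input_ids,
                        attention_mask=attention_mask,
                        decoder_input_ids=input_ids)   # same input on both sides
        eos_state = out.last_hidden_state[:, -1, :]    # assumes no padding
        return self.classifier(eos_state)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartSequenceClassifier()
batch = tokenizer(["BART is a denoising autoencoder."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)   # torch.Size([1, 2])
```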

Datasets

  1. SQuAD : (Rajpurkar et al., 2016) an extractive question answering task on Wikipedia paragraphs; answers are text spans extracted from a given document context.
  2. MNLI : (Williams et al., 2017) a bitext classification task to predict whether one sentence entails another.
  3. ELI5 : (Fan et al., 2019) a long-form abstractive question answering dataset.
  4. XSum : (Narayan et al., 2018) a news summarization dataset with highly abstractive summaries.
  5. ConvAI2 : (Dinan et al., 2019) a dialogue response generation task, conditioned on context and a persona.
  6. CNN/DM : (Hermann et al., 2015) a news summarization dataset whose summaries are typically closely related to the source sentences.

Results

Comparing pre-training objectives
"I'm just standing high enough, that's all!!!"

  1. The effectiveness of each pre-training method varies a lot from task to task (the pure language model is best at ELI5 but poor at SQuAD).
  2. Token masking is crucial (objectives built only on document rotation or sentence permutation perform noticeably worse).
  3. Left-to-right pre-training improves generation.
  4. BART achieves the most consistently strong performance (it is the best on every task except ELI5).

Large-scale pre-training setup:
The setup basically follows RoBERTa: a large model with 12 layers in each of the encoder and decoder, a batch size of 8000, and 500,000 training steps; 30% of the tokens in each document are masked with text infilling and all sentences are permuted. To help the model fit the training data better, dropout is disabled for the final 10% of training steps. (A small config summary follows below.)
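For quick reference, the stated setup can be collected into a small config sketch (the dictionary keys are my own naming and are not tied to any specific framework):

```python
# Hedged summary of the large-scale pre-training setup described above.
pretrain_config = {
    "encoder_layers": 12,
    "decoder_layers": 12,
    "batch_size": 8000,
    "total_steps": 500_000,
    "mask_ratio": 0.30,           # 30% of tokens masked via text infilling
    "permute_sentences": True,    # all sentences shuffled
    "disable_dropout_after": 0.9, # dropout turned off for the final 10% of steps
}
```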




Future work

  1. New methods for corrupting documents during pre-training
  2. Corruption schemes tailored to particular documents or end tasks
