Discourse-Aware Neural Extractive Text Summarization

2020-07-08 nlp study group PV:

Discourse-Aware Neural Extractive Text Summarization

tags: `references` `NLP`

論文
BERT 的sentence-based extractive model 通常extracte出多餘的或沒有訊息的summary 且 long dependency on document 也是一個問題(since bert pretrain is sentence-based)

DISCOBERT extracts 一些子句的discourse unit(而非整個句子)更精細的來選出較好結果
而long dependency on discourse unit的問題透過使用RST tree and coreference mentions(Graph Convolutional Networks) 解決
通常summary 分成兩種

extractive summary : Directly selects sentences from the document to assemble into a summary.
abstractive summary : Genterated word-by-word after encoding whole document

各自優點
- extractive 有更好的呈現，和較快
- abstractive summary 更靈活，產生較少多餘文章
Hybrid -> pipeline or candidate 原目的是丟棄所選句子中沒有信息的短語但是會受到pipeline分開產生而產生的語意斷層
EDU(elementary discourse unit):

而在這個paper使用EDU當成基礎單元而非原來的句子，來進行提取壓縮和減少句子間的冗言贅字

Here is the example of document

透過使用EDU可以減少使用多餘的細節，保留了更多的包含主要概念或事件的資訊，從而產生更簡潔，內容更豐富的摘要。此外使用已知的知識來finetune 句子內的資訊，透過使用2種graphic model (都是base on EDU)

RST graph $G_R$
Coreference graph $G_C$ 有點像是BFS+cluster:找定一點(core) explore 與其互相影響的 other point(event/concepts) ，亦用來讓model找long-dependency
最後使用 Graph Convolutional Network (GCN)(bse on EDU) 來解決long-range interaction問題

主要的貢獻有三

提出discourse base 的summarization model 生成簡潔且較少冗句的摘要
結構化兩種類型的 graphic model
在summary 上贏過bert
Discourse Graph Construction

Initial with disconnection on all point
Discourse focus on unit間的關係
In the RST framework, document can be segmented into (1)contiguous,(2)adjacent and (3)non-overlapping text spans(=EDU,其通常標記為 1.Nucleus or 2.Satellite)

Nucleus : 通常位於中心位置
Satellite : 位於外圍，且這類在內容和語意相依性不太重要。

Note : EDU 存在相依性，代表著它們互相的修飾關係，但是當考慮修飾關係會使得summary more ambiguous，因此修飾關係不在model的考量當中。

RST Graph

當選擇句子作為提取摘要的候選者時，假設每個句子在語法上是獨立的。但是對於EDU，有些額外限制以確保語法。

在圖中 [2]EDU是完整句子而[3]不是。只看那些不完整EDU->需要了解EDU之間的依賴性，以確保所選組合的語法合法。
We use the converted dependency version of the tree to build the RST Graph G_R, by initializing an empty graph and treating every discourse dependency from the i-th EDU to the j-th EDU as a directed edge, i.e., G_R[i][j] = 1

Coreference Graph

Text summarization 通常會有’position bias’ 問題：即為關鍵訊息聚集在特定地方。i.e.,在新聞中大部分關鍵訊息都是在一開始就描述的。
Coreference meaning

演算法

首先使用Stanford CoreNLP來檢測文章中的所有coreference clusters。對於每個coreference clusters，將提及同一cluster的所有語篇單元link在一起。在所有coreference clusters上重複此過程，以創建最終的Coreference Graph。

但是，在文檔的中間或末尾仍然散佈了大量的信息，匯總模型通常會忽略這些信息。
經過觀察大約25％的 oracle sentence 出現在CNNDM數據集中的前10個句子之外。
此外，在長篇新聞文章中，整個文檔中經常有多個核心人物和事件。但是，現有的神經模型在建模這樣的長篇上下文時效果不佳，尤其是當存在多個模棱兩可的共指關係(兩種詞共指同物)要解析時。

在圖一中也顯示
構造圖$G_C$時，由於提到了“ Pulitzer prizes”，因此1-1、2-1、20-1和22-1之間的邊都連接在一起。

DISCOBERT Model

首先使用BERT encode整個篇章，使用BERT得到的hidden state表示，每個EDU內部做selfattentive span extractor得到新的EDU的表示，由得到的EDU表示和兩個矩陣表示$G_R$和$G_C$，做GCN得到EDU新的表示，通過MLP預測EDU是否被抽取出來做EDU（0-1序列標註）[ where each EDU $d_i$ is scored by neural networks]。
而在生成過程中，需要進一步考慮語篇依賴性，以確保輸出摘要的連貫性和語法性。
Note :Insert <CLS> and <SEP> tokens at the beginning and the end of each "sentence", respectively

Document Encoder

Self-Attentive Span Extractor
Note : all the W matrices and b vectors are parameters to learn,$h^S = {h^S_1,…h^S_n} \in R^{d_h*n}$

Graph Encoder

LN(·) represents Layer Normalization, $N_i$ denotes the neighorhood of the i-th EDU node.
For different graphs, the parameter of DGEs are not shared. If we use both graphs, their output are concatenated with:(合2圖的結果再丟進2分類器))
Loss:BCE loss
we sort y^ in descending order, and select EDUs accordingly. Note that the dependencies between EDUs are also enforced in prediction to ensure grammacality of generated summaries
Experiments
Datasets
NYT(New York Times)
CNNDM(CNN and Dailymail)

Toolkit

See et al.用於extract summaries
Stanford CoreNLP for sentence boundary detection,tokenization and parsing
Detail
Theedges in $G_C$ are undirected, while those in $G_R$ are directional.
AllenNLP as the code framework
DGL as the implementation of graph encoding

Datasets	training	validation	test
CNNDM	287226	13368	11490
NYT	137778	17222	17223

Compare with State-of-the-art Baselines

Extractive Models:
- BanditSum is a policy gradient methods
- NeuSum is a seq2seq architecture, where the attention mechanism scores the document and emits the index as the selection
Compressive Models:
- JECS(BiLSTM as the encoder)The first stage is selecting sentences, and the second stage is sentence compression by pruning constituency parsing tree
BERT-based Models:
- BertSum(model re-implementation as baseline)
- PNBert proposed a BERT-based model with various training strategies, including “reinforcement learning” and “Pointer Networks”
- HiBert is a hierarchical BERT-based model for document encoding, which is further pretrained with unlabeled data

Hardware	NVIDIA P100 card
Mini-batch	6
length	768 BPEs
pre-trained	bert-base-uncased
Iter	80000
evaluation metrics	ROUGE
validation criteria	R-2

Note:EDU之間的依賴性對於所選EDU的語法至關重要。這是學習依賴關係的兩個步驟：head inheritance andtree conversion。
Head inheritance:為每個有效的非終止樹定義Head node。對於每個葉節點，頭部都是自身。我們根據非終止樹的Nucleus數確定其Head node
ex:

If model selects “[5])(N) being carried … Liberia.” as a candidate span, we will enforce the model to select “[3](N) and shows … 8,” and “[2](N) This … series,” as well.
通過篇章分析，可以在篇章上構造得到一棵樹，樹的葉子節點是EDU，樹上的邊代表的是對應子節點的重要性程度，N代表主要，S代表次要，可以認為S是N的補充。相鄰兩個子節點可以有三種關係，N-N,N-S,S-N。
作者提出假設：S依賴N,所以存在一條路徑從S指向N；如果兩個節點都是N，就認為是右N依賴做N。
根據這個假設，可以將RST discourse tree轉成成RST dependence graph。

Results

CNNDM

The second section lists the performance of baseline models, including non-BERT-based and BERTbased variants
By a significant margin (0.52/0.61/1.04 on R-1/-2/-L on F1).
NYT
DISCOBERT provides 1.30/1.29/1.82 gain on R-1/-2/-L over the BERT baseline. However, the use of discourse graphs does not help much in this case.

Grammatical problem

由於句子的分割和部分選擇，我們模型的輸出在語法上可能不如原始句子。需要手動檢查並自動評估模型輸出，並觀察到總體而言，考慮到RST依賴樹限制了EDU之間的修辭關係，所生成的摘要仍然是語法上的。一組簡單而有效的後處理規則在某些情況下有助於完成EDU。
- Automatic Grammar Checking
- Human Evaluation
- Examples & Analysis : We notice that a decent
  amount of irrelevant details are removed from the
  extracted summary.
  In this example “[‘Johnny is believed to have drowned,]1 [but actually he is fine,’]2 [the police say.]3”, only selecting the second EDU yields a sentence “actually he is fine”, which is not clear who is ‘he’ mentioned here.

發現錯誤主要源自RST依賴關係解析和語篇解析的解析錯誤。
RST依賴關係的錯誤分類和手工製定的依賴關係解決規則損害了生成輸出的語法和連貫性。常見的標點符號問題包括逗號過多或缺失以及引號缺失。

Conclusion

在本文中，我們提出了DISCOBERT，它使用EDU作為最小選擇基礎來減少摘要冗句，並利用兩種類型的圖model作為來捕獲EDU之間以及長期依賴性。
在兩個摘要生成數據集上驗證了所提出的方法，並觀察到相對於baseline模型的一致改進。
在以後的工作中，我們將探索更好的圖形編碼方法，並將圖model應用於需要長文檔編碼的其他任務。

忘了做圖之後再++ 南部網路真滴慢

Discourse-Aware Neural Extractive Text Summarization

tags: references NLP

Discourse Graph Construction

RST Graph

Coreference Graph

演算法

DISCOBERT Model

Document Encoder

Graph Encoder

Experiments

Datasets

Toolkit

Compare with State-of-the-art Baselines

Results

CNNDM

NYT

Grammatical problem

Conclusion

tags: `references` `NLP`