LSTMEmbed- Learning Word and Sense Representations from a Large Semantically Annotated Corpus with Long Short-Term Memories

2020-08-28 nlp study group PV:

論文網址

問題：一字多義在傳統的word2vec並不能很好的表示
利用LSTM從語意註釋的corpus中學習word representation
API:WordNet,BabelNet,Freebase

Abstract

對於人來說：一字多義（模糊）不是問題，因為常常使用上下文以及語言常識來判斷意義與上下文的落差。因此這個模型想要做的是能在一定程度上處理語言模糊問題。

原本的Word2vec：在同一個空間中把所以有一字多義的意義放在同一個vector中(每一種詞的各種含義混在一起，最常出現的representation優先)。
context embedding(GPT2,Elmo)是根據上下文，而可能把同一個意義的字map到不同的空間。

多語言詞彙語義：這一篇論文主要目的是把相同的詞義的不同詞map到相同的word2vec在同一空間中，同字不同義map到不同的vec(所以training是by word annotation)

在解決解決詞彙多義性問題的一連串工作提出了意義嵌入的創建，即將單詞中每個單詞的不同意義分開embedding。缺點之一：沒有考慮詞語順序。而在基於RNN考慮字詞結構的embedding在速度和品質方面較差。

在圖中 word vector 和sense vector是分開的
ex:首先，模棱兩可的單詞“ bank”位於與其同時出現的單詞附近（圖中的正方形），其次，最接近的銀行含義（點代表金融機構的含義，並與它的地理含義）聚集在兩個分開的區域中，而與與此相關的詞（可能含糊不清）沒有明顯的關聯。
Main idea
主要想法：

★用biLSTM考慮到word ordering
把pre-train的embedding當成training objective
LSTM不只學習上下文資訊，亦可以表示各個單詞的representation 以及sense
與已經存在的knowledge resource結合，可以利用先前就已經得到的semantic information

Word Embedding

Glove train on word-word co-occurences, good for single word；however fail to represent non-dominant ‘sense’ of word.(Glove只能學到很常出現的字意)
另一個問題： bar和pub，bar和stick 要相同，但是pub和stick應要無關 (word sense disambiguous)

Solve this issue

To make more similar for same word type,vice versa.：Word type extract from PPDB or WordNet
Context2vec learning sentence and word embedding(large raw text corpora)
Base on word2vec ：embedding extracted from the output topmost weight matrix(使用output matrix as embedding而不是與input最接近的)

Sense Embedding
與上述每種方法旨在學習詞彙表述(word2vec)的方法相比，sense embedding將各個詞義表示為單獨的向量。

使用額外API：
embedding的主要訓練方法是根據knowledge base，依賴預先定義好的model，i.e.:WordNet,BabelNet,Freebase.
SensEmbed：’SOTA of WSD and NE’：一個用於詞義消除歧義和實體鏈接的先進工具，用於構建詞義標註的語料庫，而該語料庫又可用於使用word2vec訓練詞義的向量空間模型。’SenseEmbed using BabelNet &distributional information from text corpora’。
Spliting the vector which came from pretrained word embedding into their respective senses
★AutoExtend:延伸word2vec，在學習過程加上了wordnet的資訊:(words are sums of their lexemes and synsets are sums of their lexemes.)(AE架構)參考網站
作者認為單詞集合W和synset集合S之間存在這一種線性映射關係
- S =E⊗W
- $\hat W$=D⊗S
  於是這個就可以很自然看成一個encode和decode的過程了，損失函數自然是讓decode後的結果和encode前的一致。
  不使用額外API：
  Contexualize embedding(不使用額外的API)
  在不使用額外API時，在訓練之後所表示的這些向量僅被識別為彼此不同，而沒有清楚地識別其內在含義，所以要額外的上下文資訊。
每一個單字都train多個代表的vector(by上下文representation cluster pool)
在訓練w2v過程，如果一個單詞所訓練出來的向量和之前的差很多，就為該單詞延伸一個新的sense vector
ELmo：作者表示雖然是train by bilstm，但是每個token由三個向量表示，其中兩個是上下文向量。

這些模型通常由於缺乏與詞義語義資源(babelnet…)的聯繫而難以評估。

LSTMEmbed旨在學習單詞sense representation，並與諸如BabelNet的多語言詞彙語義資源鏈接，同時處理’word ordering’並使用預訓練好的embedding作為目標。

Model

Sense tagged input $s_i$這個可能是一個單字或是word sense(from inventory ex:BabelNet)

主要架構圖

★ 在這裡的訓練過程當成是w2v的CBOW

左半邊

因為是雙向lstm 對特定單字$s_i$，其embedding包含了$s_{i-W},…s_{i-1}$以及$s_{i+1}…s_{i+W}$
每一個$s_i$都有其對應的embedding(by lookup table) $v(s_i)\in R^n$
進入LSTM後
Concate 2個LSTM output丟入一層的Linear layer 得到左半邊的LSTMembed

右半邊

用來與左半邊得出來的$out_{LSTMEmbed}$比較，$emb(s_i)$是使用pretrain好的embedding vector
根據使用的預訓練數據集和註釋，這個$s_i$可以是單字或是word sense

而最後面的objective loss則是cosine similarity

Once the training is over, we obtain latent semantic representations of words and senses jointly in the same vector space from the look-up table,
i.e., the embedding matrix between the input and the LSTM, with the embedding vector of an item 's' given by v(s)

在訓練完後，使用lookup tabel的embedding，在相同的向量空間中共同獲得單詞和sense的 semantic representations。

和一般bi-lstm不同處

使用帶有註釋的語料庫（包括單詞和word sense）來train。
從單個lookup table中的單詞和感官的embedding形式，在左右兩個不同方向的LSTM之間共享
一種新的學習方法，它以一組預訓練的embedding為目標，這使我們能夠學習大詞彙量的embedding

Implement detail

sense-level tasks
word level tasks

Training data

Sense inventory: BabelNet(大型多語言百科全書和semantic網絡，包括大約1600萬個通過語義關係鏈接的概念和命名實體條目)
Training corpus:BabelWiki(包含英語維基百科的多語言語料庫，使用Babelfy自動標註了命名的實體和概念)
30億個tokens和約300萬個unique tokens

Embedding setting

look-up table : 200-dim
丟棄前1000個最常出現的token
batch size : 2048
1 epoch
Optimizer : Adam or AME(Adaptive Moment Estimation)
$emb(s_i)$ : 400-dim，使用w2v的SkipGram window_size=10，negtive sampling=10，sub-sampling of frequent words set to $10^3$

Experimen

Sense evaluation

第一組實驗旨在展示模型在需要語義（而不僅僅是詞彙）相關性的任務中的影響。我們分析了兩個任務，Cross-Level Semantic Similarity(跨級別語義相似性)和Most Frequent Sense Induction

Cross-Level Semantic Similarity(word-to-sense)

為了最好地評估embedding以區分單詞的各種含義的能力，選擇了SemEval-2014的跨級別語義相似性任務其中包括單詞間的相似度作為其子任務之一: CLSS單詞感知相似性數據集包含500個單詞，每個單字與WordNet候選單詞的簡短列表配對，並對其進行人類評分其semantic similarity。
- 為了計算單詞間的相似度，使用了lookup table的embedding space，並使用cosine similarity來計算相似度。

MeerkatMafia:which uses Latent Semantic Analysis and WordNet glosses to get word-sense similarity measurements
SemantiKLU :an approach based on a distributional semantic model trained on a large Web corpus from different sources
SimCompass : which combines word2vec with information from WordNet.

Most Frequent Sense Induction

WSD system comparsion:counting the word sense pairs in an annotated corpus such as SemCor
94比較爛…

Word-based Evaluation

基於單詞的評估，目的是證明模型能夠解決傳統上使用基於單詞的模型處理的任務

Synonym Recognition

給定目標詞和一組替代詞，此任務的目的是從集合中選擇與目標詞含義最相似的成員(考試囉)

Dataset: synonym questions of the TOEFL & ESL

單詞相似性任務

WordSim353
根據不同預訓練embedding與LSTMEmbed的餘弦相似度之間的Spearman相關性來計算LSTMEmbed的性能。

Conclusions

這是一種基於雙向LSTM的新模型，用於聯合學習單詞和word sense的 embedding
更好地學習semantic representation
semantic representation能夠正確反映單詞和意義表示之間的相似性，在word-to-sense以及most frequent sense induction有不錯表現。
還能夠在基於標准單詞的語義評估中有不錯表現