Efficient Estimation of Word Representations in Vector Space

2020-04-08 nlp study group PV:

論文網址: Efficient Estimation of Word Representations in Vector Space

4 model architectures:

`NNLM; RNNLM; CBOW; Skip-Gram`

為了比較模型好壞，先定義接下來訓練深度模型的複雜度皆為:$$O = E\times T\times Q$$

E: 迭代次數
T: 訓練集的詞個數
Q: 模型參數

old: NNLM, RNNLM

What’s NNLM (Feedforward Neural Net Model)?

(上圖是原始paper(A Neural Probabilistic Language Model)的圖，annotation可能會跟下面word2vec的對不起來，下面使用原本word2vec的annotation)

有4層: input, projection, hidden, output

input layer: 前N words使用one-hot encoding成V維的向量，V是(vocab size)
- 注意這裡是用前N words，而不是用所有words來訓練，這是和word2vec中CBOW投影層的差異!!
projection layer: input(NxV)會使用同一個projection matrix(VxD)投影到projection layer P(NxD)
- D是投影後的維度
- 共用一個projection matrix，所以這裡的cost還算低
hidden layer: 隱藏層來計算整個word的機率，有H個neuron
output layer有V個neuron

所以整體的模型參數量是$$Q=N×D+N×D×H+H×V$$

其中output layer的HxV最重要
- 有一些優化的方法，例如hierarchical softmax，使用binary tree representations of the vocabulary(Huffman tree)，可以降到$\log_2(V)$
所以其實主要的計算量在hidden layer

What’s RNNLM (Recurrent Neural Net Language Model)?

Remark:

$y(t)$ produces a probability distribution over words

只有input, hidden, output層，訓練複雜度是$$Q=D×H+H×V$$

D和隱藏層H有相同的維度
- 使用hierarchical softmax + huffman tree，H×V可以降低為H×$\log_2V$，所以大部分的複雜度來自D×H

new: CBOW (Continuous Bag-of-Words Model), Skip-Gram (Continuous Skip-Gram model)

overview (from cs224n)
(圖片來自[7]）

What’s CBOW?

cs224n note:
http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf

loss function:

和NNLM相似，但刪除了hidden層，並且投影層是所有的words共用(NNLM是前N words共用)

所有的单词都投影到同一个位置（所有向量取平均值）
- 这样不考虑单词的位置顺序信息，叫做词袋模型
- 詞的順序對於不影響投影
会用到将来的词，例如如果窗口 windows 为 2，这样训练中心词的词向量时，会选取中心词附近的 4 个上下文词（前面 2 个后面 2 个）

整體的模型參數量為:$$Q=N×D+D×log(V)$$

log(V)是用到了hierarchical softmax + huffman tree

What’s Skip-Gram?

cs224n note:http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf

loss function:

跟CBOW相似，不過是根據中心的詞去預測上下文

通過實驗發現，windows越大效果越好(但cost也越大)
距離較遠的word通常關聯性較小，所以透過抽取較少的樣本(降低機率)來降低對距離較遠word的權重

整體複雜度:$$Q=C×(D+D×log(V))$$

C是window size的2倍，也就是要預測的word個數
- 也就是說預測每個word所需的參數量是D+D×log(V)
- 看到logV就知道用到了hierarchical softmax

Experiment Result

1. Experiment task 1:

–主要： There are totally 8869 semantic questions and 10675 syntatic questions.
(They create the correct word pairs manually, and then randomly connecting two words to form the word pair question.)

–在table 3中多了一個MSR task: SemEval-2012: Semantic Evaluation Exercises
e.g. Singular/Plural: year:years law:?[4]

Evaluation: Accuracy (預測的字，必須剛好是正確的word pair，才算對！預測的僅是相似字也不算對)
Training data: 不同實驗用不同training data,主要為Google News corpus (6B tokens)。

Testing data:
5 types of semantic questions & 9 types of syntatic questions

Experiment result & conclusion: 雖然paper跑的實驗很多，但主要結論是CBOW＆skip-gram準確率高且花的時間短（需要運算量較低）。另外，增加training data也須增加維度

We have to increase both vector dimensionality and the amount of the training data together.

(使用1 CPU去train,CBOW花一天;skip gram花約三天)

從表格可看出NNLM花的時間明顯多於其他兩個模型。（使用mini-batch asynchronous grandient decent and Adagrad (a adaptive learning rate procedure) ）

MSR: SemEval-2012: Semantic Evaluation Exercises
e.g. Singular/Plural: year:years law:?[4]
MSR的training set: LDC corpora（320M words, 82K vocabulary).

2. Experiment task 2: Microsoft Sentence Completion Challenge

The Microsoft Research Sentence Completion Challenge

Training data: many novels
https://www.kaggle.com/c/mlsd-hw3/data
Testing data: https://www.kaggle.com/c/mlsd-hw3/data

– Skip-gram model作法：

– Skip-gram + RNNLMs: weighted combination

Result

補充

DistBelief
https://www.iotone.com/term/distbelief/t187
dataset

–LDC corpora
–SemEval-2012: Semantic Evaluation Exercises
–Microsoft Sentence Completion Challenge

(延伸討論)Word2vec implement detail

Word2vec hidden layer沒有activation function

在input-hidden層，沒有非線性變換，而是簡單地把所有vector加總並取平均，以減少計算複雜度

Hierarchical Softmax v.s. Negative sampling

論文：Distributed Representations of Words and Phrases
and their Compositionality
實際上，input layer有CBOW和Skip-gram兩種版本，output也有Hierarchical Softmax和Negative sampling兩種版本

Hierarchical Softmax

以下引用[6]

一般正常input-hidden-output的model如果套用在embedding training，因為output層的softmax計算量很大(要去算所有詞的softmax機率，再去找機率最大值)

透過huffman tree + Hierarchical Softmax，tree的根節點是每一個word，透過一步步走到leaf來求得softmax的值

從root到leaf只需要log(V)步

如此可以大幅的加快求得softmax的速度

缺點: 但是如果我们的训练样本里的中心词w是一个很生僻的词，那么就得在霍夫曼树中辛苦的向下走很久了

於是有了Negative Sampling

Negative sampling

定義window內的為正樣本，window外的為負樣本，如此就可以不用把全部的word拿進來一起train

只需要window內的所有正樣本 + sampling一定數量的負樣本就足夠訓練模型

透過Unigram distribution來模擬負樣本(不在window內的word)被選中的機率

設計這個分佈時希望詞被抽到的機率要跟這個詞出現的頻率有關，出現在文本中的頻率越高越高越有可能被抽到

公式為:$$P(w_i) = \frac{ {f(w_i)}^{3/4} }{\sum_{j=0}^{n}\left( {f(w_j)}^{3/4} \right ) }$$
$f(w_i)$代表$w_i$出現次數(頻率)，3/4是實驗try出來的數據

例如：有一個詞編號是 100，它出現在整個文本中 1000 次，所以 100 在 unigram table 就會出現 1000 ^ 0.75 = 177 次

至於要選幾個詞當 negative sample，paper 中建議如下

Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5.

Subsampling of frequent words

英文中 “the”, “a”, “in”，中文中的「的」、「是」等等這種詞，其實在句子中並沒有辦法提供太多資訊但又常常出現，對訓練沒有太大幫助，所以就用一個機率來決定這個詞是否要被丟掉，公式如下$$P(f_i) = (\sqrt{\frac{f(w_i)}{0.001}} + 1) \cdot \frac{0.001}{f(w_i)}$$

idea

table 2
(3 training epochs; stochastic gradient decent and backpropogation; learning rate=0.025 and decreased it linearly)
Reference

4 model architectures:

NNLM; RNNLM; CBOW; Skip-Gram

old: NNLM, RNNLM

What’s NNLM (Feedforward Neural Net Model)?

What’s RNNLM (Recurrent Neural Net Language Model)?