tags: study
paper
DSMI lab
paper: Sequence to Sequence Learning with Neural Networks
Abstract and Introduction
- DNNs cannot be used to map sequences to sequences directly because they only work when the dimensionality of the input and output is fixed and known
- Task: English-to-French translation on the WMT’14 dataset
- Advantages of the proposed method:
- Goal: given an input sentence $(x_1, x_2, \ldots, x_T)$ and its corresponding output sentence $(y_1, y_2, \ldots, y_{T'})$ (where $T$ need not equal $T'$), estimate $p(y_1, y_2, \ldots, y_{T'} \mid x_1, x_2, \ldots, x_T)$
- $p(y_1, y_2, \ldots, y_{T'} \mid x_1, x_2, \ldots, x_T)=\prod_{t=1}^{T'} p(y_{t} \mid v, y_1, \ldots, y_{t-1})$, where $v$ is the fixed-dimensional representation of the input sequence given by the encoder LSTM's last hidden state, and each distribution $p(y_{t} \mid v, y_1, \ldots, y_{t-1})$ is represented with a softmax over all the words in the vocabulary (see the sketch after this list)
- Feeding the input sequence in reversed order works better
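As a concrete illustration of the factorization above, here is a minimal sketch, assuming PyTorch (the module layout, the layer sizes, and names like `Seq2Seq.log_prob` are my own, not the paper's code): the encoder LSTM's final state plays the role of $v$, and the conditional log-probability is the sum of per-step log-softmax scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    """Encoder-decoder LSTM; sizes here are illustrative, not the paper's."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def log_prob(self, src_ids, tgt_ids):
        """log p(y_1..y_T' | x_1..x_T) = sum_t log softmax(. | v, y_<t)[y_t]."""
        # v = the encoder's final (hidden, cell) state, used to initialize the decoder
        _, v = self.encoder(self.src_emb(src_ids))
        # teacher forcing: feed y_0..y_{T'-1} (position 0 is a BOS token), predict y_1..y_T'
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids[:, :-1]), v)
        logp = F.log_softmax(self.proj(dec_out), dim=-1)      # (batch, T'-1, vocab)
        gold = tgt_ids[:, 1:].unsqueeze(-1)                   # next-token targets
        return logp.gather(-1, gold).squeeze(-1).sum(dim=1)   # (batch,)

# toy usage with random token ids (batch of 1)
model = Seq2Seq(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (1, 7))   # in the paper this would be the reversed source
tgt = torch.randint(0, 120, (1, 9))
print(model.log_prob(src, tgt))
```

The same quantity is what the paper maximizes during training and uses to score hypotheses during beam-search decoding.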
Experiment
Dataset: WMT’14 English to French dataset
- 12M sentences
- Vocabulary: 160,000 most frequently used English words and 80,000 most frequently used French words
- Words that do not appear in the vocabulary are replaced with "UNK" (see the vocabulary sketch after this list)
- Evaluation: decoding directly with the LSTM, and also rescoring the 1000-best lists generated by the baseline SMT system
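A small sketch of the vocabulary step above, assuming plain whitespace tokenization (the toy corpus and helper names are illustrative): keep only the most frequent words on each side and map everything else to "UNK".

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words; everything else maps to UNK."""
    counts = Counter(word for sent in sentences for word in sent.split())
    words = [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(["UNK"] + words)}

def to_ids(sentence, vocab):
    unk = vocab["UNK"]
    return [vocab.get(w, unk) for w in sentence.split()]

# toy corpus standing in for the WMT'14 data
en_corpus = ["the cat sat on the mat", "the dog sat"]
fr_corpus = ["le chat est assis sur le tapis", "le chien est assis"]

en_vocab = build_vocab(en_corpus, max_size=160_000)  # paper: 160k source words
fr_vocab = build_vocab(fr_corpus, max_size=80_000)   # paper: 80k target words
print(to_ids("the cat barked", en_vocab))            # unseen word -> UNK id
```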
Objective: maximize the log probability of a correct translation $T$ given the source sentence $S$, averaged over the training set $\mathcal{S}$: $\frac{1}{|\mathcal{S}|}\sum_{(T,S)\in\mathcal{S}}\log p(T\mid S)$; at test time, output the most likely translation $\hat{T}=\mathop{\arg\max}\limits_{T} p(T\mid S)$
Decoding: left-to-right beam search (a sketch of the algorithm appears at the end of these notes)
Reverse the source sentence:
- Improves the performance, but the authors don’t have a complete explanation XDD
- After the input sentence is reversed, its first few words are much closer to the first few words of the output sentence; the paper attributes the gain to the many short-term dependencies this introduces (the "minimal time lag" between corresponding words shrinks). Getting the beginning of the output right then makes the rest of the generation more accurate (something like "a good start is half the battle"?)
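A tiny preprocessing sketch of the reversal trick (the helper name and the `<EOS>` marker placement are my own, though the paper does train the decoder to emit an end-of-sentence symbol): only the source side is reversed, the target keeps its normal order.

```python
def make_training_pair(src_tokens, tgt_tokens):
    """Reverse the source sentence; leave the target in its original order."""
    src = list(reversed(src_tokens))      # "a b c"  ->  "c b a"
    tgt = list(tgt_tokens) + ["<EOS>"]    # decoder learns when to stop
    return src, tgt

print(make_training_pair("john admires mary".split(),
                         "jean admire marie".split()))
# (['mary', 'admires', 'john'], ['jean', 'admire', 'marie', '<EOS>'])
```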
Training details
- 1000-dimensional word embeddings (the paper does not say how the word embeddings are obtained)
- LSTM: 4 layers, 1000 cells in each layer
- Parameters initialized from a uniform distribution between -0.08 and 0.08
- Parallelization: 8 GPUs in total, about 10 days of training
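The reported hyperparameters, written out as a hedged configuration sketch (assuming PyTorch and separate encoder/decoder LSTMs; the exact module layout and the output projection layer are assumptions, and at full vocabulary size this allocates well over a gigabyte of parameters):

```python
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB = 160_000, 80_000
EMB, HIDDEN, LAYERS = 1000, 1000, 4        # 1000-dim embeddings, 4 layers x 1000 cells

encoder_emb = nn.Embedding(SRC_VOCAB, EMB)
decoder_emb = nn.Embedding(TGT_VOCAB, EMB)
encoder = nn.LSTM(EMB, HIDDEN, num_layers=LAYERS)
decoder = nn.LSTM(EMB, HIDDEN, num_layers=LAYERS)
output_proj = nn.Linear(HIDDEN, TGT_VOCAB)

# initialize every parameter uniformly in [-0.08, 0.08], as reported in the paper
for module in (encoder_emb, decoder_emb, encoder, decoder, output_proj):
    for p in module.parameters():
        nn.init.uniform_(p, -0.08, 0.08)
```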
Experimental Results
The best result is obtained by ensembling LSTMs trained from different random initializations.
Model analysis:
- Can distinguish sentences that use the same words in a different order, as well as handle sentences that express the same meaning with different wording
- Performance remains good on long sentences (left plot: x-axis is sentence length)
- Performance also holds up when a sentence contains many rare words (right plot: x-axis is the average frequency rank, within the whole vocabulary, of the words appearing in the sentence)
Supplementary notes
- SMT system: Statistical Machine Translation
- Builds a statistical translation model through statistical analysis of large parallel corpora
- Does not rely on hand-written grammar rules, so it generalizes easily to translation between different language pairs
- Word-based translation: translate word by word
- Phrase-based translation: group words into phrases where appropriate and translate phrase by phrase
- Syntax-based translation: use syntactic analysis (e.g., a parse tree) to guide the translation
- Hierarchical phrase-based translation: a combination of phrase-based and syntax-based translation
- Beam search: a sketch of the algorithm is given below
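The paper's decoder uses a simple left-to-right beam search: each partial hypothesis in the beam is extended by every vocabulary word, only the B most likely partials are kept, and a hypothesis is moved to the finished set as soon as <EOS> is appended. Below is a hedged, model-agnostic sketch; `step_fn` (returning next-token log-probabilities given the prefix) and the toy example are my own assumptions, not the paper's code.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=2, max_len=20):
    """Left-to-right beam search.

    step_fn(prefix) -> dict mapping next token -> log-probability.
    Returns the highest-scoring hypothesis as (token list, score).
    """
    beam = [([bos], 0.0)]            # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, logp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # complete hypothesis leaves the beam
            else:
                beam.append((prefix, score))       # keep as a partial hypothesis
            if len(beam) == beam_size:
                break
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[1])

# toy step function standing in for the decoder's softmax (token 0 = <EOS>, 1 = <BOS>)
def toy_step(prefix):
    table = {1: {2: math.log(0.6), 3: math.log(0.4)},
             2: {3: math.log(0.7), 0: math.log(0.3)},
             3: {0: math.log(0.9), 2: math.log(0.1)}}
    return table[prefix[-1]]

print(beam_search(toy_step, bos=1, eos=0, beam_size=2))
```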