tags: study
paper
DSMI lab
paper: Effective Approaches to Attention-based Neural Machine Translation
Introduction
- Neural Machine Translation (NMT) requires minimal domain knowledge and is conceptually simple
- NMT generalizes well to very long word sequences and does not need to explicitly store gigantic phrase tables or language models
- The concept of “attention”: learn alignments between different modalities
	- image caption generation task: visual features of a picture vs. text description
	- speech recognition task: speech frames vs. text
- Proposed method: two novel types of attention-based models
	- global approach: attends to all source words at every time step
	- local approach: attends to only a subset of source words at a time
Neural Machine Translation
- Goal: translate a source sentence $x_1, x_2, \dots, x_n$ into a target sentence $y_1, y_2, \dots, y_m$
- A basic form of NMT consists of two components:
	- Encoder: computes a representation $s$ of the source sentence
- Decoder: generates one target word at a time
$p(y_j \mid y_{<j}, s) = \mathrm{softmax}(g(h_j))$, where $g$ is a transformation function that outputs a vocabulary-sized vector and $h_j$ is the RNN hidden unit.
- Training objective: $J_t = \sum_{(x,y) \in D} -\log p(y \mid x)$, where $D$ is the parallel training corpus (see the sketch below).
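A minimal PyTorch sketch of this encoder-decoder setup and the negative log-likelihood objective (toy sizes; names like `TinyNMT`, `src`, `tgt_in` are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    """Toy encoder-decoder: p(y_j | y_<j, s) = softmax(g(h_j))."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.g = nn.Linear(dim, tgt_vocab)  # g: hidden state -> vocabulary-sized vector

    def forward(self, src, tgt_in):
        _, s = self.encoder(self.src_emb(src))        # s: source representation (final states)
        h, _ = self.decoder(self.tgt_emb(tgt_in), s)  # h_j for every target position
        return self.g(h)                              # logits; softmax is applied inside the loss

# J_t = sum over (x, y) in D of -log p(y | x), via cross-entropy on next-word prediction
model = TinyNMT()
src = torch.randint(0, 1000, (2, 7))      # toy batch of source word ids
tgt_in = torch.randint(0, 1000, (2, 5))   # target prefix y_<j
tgt_out = torch.randint(0, 1000, (2, 5))  # gold next words y_j
logits = model(src, tgt_in)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tgt_out.reshape(-1))
loss.backward()
```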
Attention-based model
- Bahdanau et al. use a bidirectional encoder
- Bahdanau et al. use a deep-output and a maxout layer
- Bahdanau et al. use a more complicated alignment function (the paper's simpler score functions perform better): $e_{ij} = v_a^\top \tanh(W_a h_i + U_a \bar{h}_j)$ (see the score-function sketch after this list)
- Local attention: focuses on a small window of source positions centered at a predicted aligned position $p_t$, which is cheaper than attending to the whole source sentence (see the local-attention sketch after this list)
- Input-feeding approach: concatenate the attentional vector $\tilde{h}_t$ with the decoder input at the next time step
	- makes the model fully aware of previous alignment choices
	- creates a very deep network spanning both horizontally (across time) and vertically (across layers)
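For reference, a small sketch contrasting Bahdanau's additive score with the dot-product score (one of the paper's simpler global-attention choices); dimensions and tensor names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 8
h_t = torch.randn(1, dim)    # current decoder hidden state
h_s = torch.randn(5, dim)    # encoder hidden states (5 source words)

# Bahdanau-style additive score: e_j = v^T tanh(W_a h_t + U_a h_bar_j)
W_a = nn.Linear(dim, dim, bias=False)
U_a = nn.Linear(dim, dim, bias=False)
v = nn.Linear(dim, 1, bias=False)
additive = v(torch.tanh(W_a(h_t) + U_a(h_s))).squeeze(-1)  # shape: (5,)

# Dot-product score: score(h_t, h_bar_s) = h_t . h_bar_s
dot = h_s @ h_t.squeeze(0)                                  # shape: (5,)

# Either score yields alignment weights and a context vector the same way
a_t = F.softmax(dot, dim=-1)  # alignment weights over source positions
c_t = a_t @ h_s               # context vector: weighted sum of encoder states
```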
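And a rough sketch of local-p attention (a Gaussian-weighted window of half-width D around a predicted position p_t) plus input feeding; the tensor names and sizes here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

S, dim, D = 12, 8, 2          # source length, hidden size, window half-width
h_s = torch.randn(S, dim)     # encoder states
h_t = torch.randn(dim)        # current decoder state

# Predict an aligned source position p_t in [0, S): p_t = S * sigmoid(v_p^T tanh(W_p h_t))
W_p, v_p = torch.randn(dim, dim), torch.randn(dim)
p_t = S * torch.sigmoid(v_p @ torch.tanh(W_p @ h_t))

# Dot-product alignment, then favor positions near p_t with a Gaussian (sigma = D / 2)
scores = h_s @ h_t
positions = torch.arange(S, dtype=torch.float)
gauss = torch.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
a_t = F.softmax(scores, dim=-1) * gauss
c_t = a_t @ h_s               # context vector from the local window

# Attentional vector h~_t = tanh(W_c [c_t; h_t]); input feeding concatenates it
# with the next target word embedding before the next decoder step
W_c = torch.randn(dim, 2 * dim)
h_tilde = torch.tanh(W_c @ torch.cat([c_t, h_t]))
next_word_emb = torch.randn(dim)                           # embedding of the next target word
next_decoder_input = torch.cat([next_word_emb, h_tilde])   # size 2 * dim
```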
Experiments
Training data: WMT'14 English-German
- 4.5M sentence pairs.
- 116M English words, 110M German words
- vocabularies: top 50K most frequent words for both languages
Model:
- stacked LSTM with 4 layers
- each layer has 1000 cells
- 1000-dimensional embeddings (a configuration sketch follows below)
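A hedged sketch of an encoder with these sizes (4-layer stacked LSTM, 1000 cells per layer, 1000-dimensional embeddings, 50K vocabulary); the class name, dropout value, and wiring are assumptions, not the authors' code:

```python
import torch.nn as nn

class StackedLSTMEncoder(nn.Module):
    """Encoder with the reported sizes: 4 layers x 1000 cells, 1000-dim embeddings, 50K vocab."""
    def __init__(self, vocab_size=50000, emb_dim=1000, hidden=1000, num_layers=4, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=num_layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) word indices from the 50K-word vocabulary
        outputs, (h_n, c_n) = self.lstm(self.embed(src_ids))
        return outputs, (h_n, c_n)  # outputs feed attention; (h_n, c_n) initializes the decoder
```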
Results:
Analysis
Reference
- PyTorch implementation: https://github.com/AotY/Pytorch-NMT
- Slides: https://slideplayer.com/slide/7710523/