Summary
This paper analyzes two encoders, the Gated RNN and grConv (introduced in this paper), to show:
- Performance degrades
    - when the length of the sentence increases
    - when the number of unknown words increases
- grConv learns the grammatical structure of a sentence automatically (i.e., without supervision)
Neural Networks for Sequences
Gated RNN
Works on a variable-length sequence x by maintaining a hidden state h over time. The hidden state h at time step t is updated by
\[h^{(t)}=f(h^{(t-1)},x_t)\]
This allows the RNN to learn the distribution over the next input,
\[p(x_{t+1}|x_t,\ \dots\ ,x_1)\]
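For concreteness, a minimal NumPy sketch of one such gated update, using the GRU-style hidden unit that the paper's encoder-decoder models build on (the parameter names and the dictionary layout are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_rnn_step(h_prev, x_t, p):
    """One update h^(t) = f(h^(t-1), x_t) with GRU-style gating.

    `p` holds illustrative weight matrices (names/shapes are assumptions):
    Wz, Uz: update gate; Wr, Ur: reset gate; W, U: candidate state.
    """
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                   # interpolate old vs. new
```

Predicting p(x_{t+1} | x_t, ..., x_1) then typically amounts to a softmax readout on top of h^(t).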
grConv (Gated Recursive Conv Net)
Notions
- A binary ConvNet whose weights are recursively applied to the input sequence until it outputs a single fixed-length vector.
Notations
\[\text{input sequence: }x = (x_1,x_2,\dots,x_T),\ x_j \in \mathbb{R}^d\\ \text{weight matrices: } W^l, W^r, G^l, G^r\\ \text{recursion level: } t \in [1,T-1]\]
How to compute? (a NumPy sketch follows the equations below)
- Activation of the j-th hidden unit at level t is computed by
\[h_j^{(t)}=\omega_c\tilde{h}_j^{(t)}+\omega_l h_{j-1}^{(t-1)}+\omega_r h_j^{(t-1)}\\ h_j^{(0)}=Ux_j\ \ (U \text{ projects the input into the hidden space})\]
- The new (candidate) activation h tilde is computed from the previous level's activations as
\[\tilde{h}_j^{(t)} = \phi \left( W^l h_{j-1}^{(t-1)} + W^r h_j^{(t-1)} \right)\\ (\phi \text{ = element-wise nonlinearity})\]
- Gating coefficients ω_c, ω_l, ω_r are computed by
\[\begin{bmatrix}\omega_c \\ \omega_l \\ \omega_r\end{bmatrix}= \frac{\exp \left( G^l h_{j-1}^{(t-1)} + G^r h_j^{(t-1)} \right)}{\sum_{k=1}^{3} \left[ \exp \left( G^l h_{j-1}^{(t-1)} + G^r h_j^{(t-1)} \right) \right]_k}\\ (G^l, G^r \in \mathbb{R}^{3 \times d})\]
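A minimal NumPy sketch of one recursion level and the full encoder, directly following the equations above. phi = tanh and the array shapes are illustrative assumptions; as noted under Notions, the same weights are reused at every level:

```python
import numpy as np

def grconv_level(H, Wl, Wr, Gl, Gr):
    """One grConv recursion level: T' activations in, T' - 1 out.

    H      : (T', d) activations h^{(t-1)} from the previous level
    Wl, Wr : (d, d)  weights for the candidate activation
    Gl, Gr : (3, d)  weights for the gating coefficients
    Returns the new activations and the (T' - 1, 3) gate coefficients.
    """
    left, right = H[:-1], H[1:]                    # h_{j-1}^{(t-1)}, h_j^{(t-1)}
    h_tilde = np.tanh(left @ Wl.T + right @ Wr.T)  # candidate (phi = tanh here)
    w = np.exp(left @ Gl.T + right @ Gr.T)
    w /= w.sum(axis=1, keepdims=True)              # softmax over (w_c, w_l, w_r)
    H_new = w[:, 0:1] * h_tilde + w[:, 1:2] * left + w[:, 2:3] * right
    return H_new, w

def grconv_encode(X, U, Wl, Wr, Gl, Gr):
    """Apply T - 1 levels to a (T, d_in) input; returns one d-dim vector."""
    H = X @ U.T                                    # h_j^{(0)} = U x_j
    while H.shape[0] > 1:
        H, _ = grconv_level(H, Wl, Wr, Gl, Gr)
    return H[0]
```

The shared weights Wl, Wr, Gl, Gr are what make the network recursive: the same binary merge is applied T - 1 times until a single vector remains.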
Intuitions
- Each node's activation is a gated choice among the new (candidate) activation, the activation from its left child, and the activation from its right child.
- grConv may learn to parse the input sequence without supervision.
Experiment & Results
The authors evaluate encoder-decoder models on English-to-French translation. Two encoders are trained: an RNN encoder (RNNenc) and grConv. Both use a Gated RNN decoder.
BLEU Score Results
Both models suffer significantly as sentence length or the number of unknown words increases.
grConv’s Parsing Structure
The sample sentence “Obama is the President of the United States.” is parsed by grConv.
grConv extracts the vector representation of the sentence by first merging “of the United States” together with “is the President of”, and finally combining this with “Obama is” and “.”, which correlates well with our intuition. One way to read out this structure is sketched below.
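A hedged sketch of how such a structure could be read out from a trained model: build the pyramid with grconv_level from the sketch above and, for each node, record which gate dominates (the argmax readout is an assumed visualization rule, not a procedure from the paper):

```python
def merge_decisions(X, U, Wl, Wr, Gl, Gr):
    """Per level, label each node by its dominant gate:
    'c' = merged its two children, 'l'/'r' = mostly copied one child."""
    H = X @ U.T
    levels = []
    while H.shape[0] > 1:
        H, w = grconv_level(H, Wl, Wr, Gl, Gr)     # from the sketch above
        levels.append(["clr"[k] for k in w.argmax(axis=1)])
    return levels
```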
Future Research Directions
- Toward larger vocabularies → research is needed to lower computation and memory costs
- Prevent underperformance on long sentences
- New decoder architectures (since both encoders in this paper suffer on long sentences, the problem may lie in the decoder)