Summary

Improves the Skip-gram model by using Negative Sampling and subsampling of frequent words

1. Original Skip-gram model

The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. Given a sequence of training words $w_1, w_2,\dots,w_T$, the objective of the Skip-gram model is to maximize the average log probability

\[\frac{1}{T}\sum^T_{t=1}\sum_{-c \le j \le c, \ j \ne 0 } \log p(w_{t+j}|w_t) \tag{1}\]

where $c$ is the size of the training context
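As an illustration of the $(w_t, w_{t+j})$ pairs this objective sums over, here is a minimal sketch on a toy tokenized sentence (the function name, corpus, and window size are illustrative assumptions):

```python
def skipgram_pairs(tokens, c=2):
    """Generate the (w_t, w_{t+j}) training pairs for a context window of size c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

# Toy example with a window of size 1
print(skipgram_pairs(["the", "quick", "brown", "fox"], c=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ...]
```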

The basic Skip-gram formulation defines $p(w_{t+j} \mid w_t)$ using the softmax function

\[p(w_o|w_I) = \frac{\exp \left( v_{w_o}'^T v_{w_I} \right)}{\sum_{w=1}^{W} \exp \left( v_w'^T v_{w_I} \right)} \tag{2}\]

where $v_w$ and $v_w'$ are the input and output vector representations of $w$, and $W$ is the number of words in the vocabulary.
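A minimal NumPy sketch of equation (2), assuming the input and output vectors are stored as two $W \times d$ matrices (the matrix and function names are assumptions). Note that the denominator sums over every word in the vocabulary, which is what makes the full softmax expensive for large $W$:

```python
import numpy as np

def skipgram_softmax_prob(i_input, i_output, V_in, V_out):
    """Equation (2): p(w_o | w_I) with a full softmax over the vocabulary.

    V_in, V_out: (W, d) matrices whose rows are the input vectors v_w
    and output vectors v'_w (names are illustrative).
    """
    scores = V_out @ V_in[i_input]   # v'_w^T v_{w_I} for every word w
    scores -= scores.max()           # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[i_output] / exp_scores.sum()
```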

1-1. Hierarchical Softmax

Hierarchical softmax uses a binary tree representation of the output layer with the $W$ words as its leaves; each inner node explicitly represents the relative probabilities of its child nodes. As a result, evaluating $p(w_o \mid w_I)$ requires only about $\log_2(W)$ node computations instead of $W$.
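A minimal sketch of how a probability is evaluated under hierarchical softmax, assuming the root-to-leaf path for the output word has already been extracted from the binary tree (the argument names and path representation are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(v_input, path_node_vectors, path_signs):
    """p(w_o | w_I) as a product of sigmoids along the root-to-leaf path.

    path_node_vectors: output vectors of the inner nodes on the path to w_o
    path_signs:        +1 or -1 at each node, depending on whether the path
                       continues to the left or right child
    (both arguments are illustrative; the tree itself is typically built
    from word counts, e.g. as a Huffman tree)
    """
    prob = 1.0
    for v_node, sign in zip(path_node_vectors, path_signs):
        prob *= sigmoid(sign * np.dot(v_node, v_input))
    return prob
```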

2. Improving the Skip-gram model

2-1. Negative Sampling

Instead of computing the probability distribution over all possible words, Negative Sampling reformulates the problem as a binary classification task

The model learns to distinguish between correct word pairs (positive samples) and $k$ randomly drawn incorrect word pairs (negative samples). The objective is defined as

\[\log \sigma \left( v_{w_o}'^T v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma \left( -v_{w_i}'^T v_{w_I} \right) \right]\]
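A minimal sketch of evaluating this objective for a single $(w_I, w_o)$ pair with $k$ sampled negatives (the vector arguments and function names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_in, v_out_pos, v_out_negs):
    """Negative Sampling objective for one (w_I, w_o) pair.

    v_in:        input vector v_{w_I}
    v_out_pos:   output vector v'_{w_o} of the observed context word
    v_out_negs:  (k, d) matrix of output vectors for words drawn from P_n(w)
    """
    positive = np.log(sigmoid(np.dot(v_out_pos, v_in)))
    negatives = np.sum(np.log(sigmoid(-v_out_negs @ v_in)))
    return positive + negatives
```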

How to choose Negative Samples

Negative Sampling draws samples from a noise distribution $P_n(w)$, which is a free parameter. The paper found that the unigram distribution $U(w)$ raised to the power 3/4 significantly outperformed both the plain unigram and the uniform distributions.

\[P_n(w) = \frac{U(w)^{3/4}}{Z}\]

where $Z$ is the normalization constant.
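A minimal sketch of building $P_n(w)$ from raw word counts and drawing negatives from it (the function names and toy counts are assumptions):

```python
import numpy as np

def noise_distribution(counts):
    """Unigram counts raised to the 3/4 power, normalized by Z."""
    weights = np.asarray(counts, dtype=float) ** 0.75
    return weights / weights.sum()

def sample_negatives(counts, k, rng=None):
    """Draw k negative word indices from P_n(w)."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(counts), size=k, p=noise_distribution(counts))

# Example: word indices 0..3 with counts [100, 10, 5, 1]
print(sample_negatives([100, 10, 5, 1], k=5))
```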

2-2. Subsampling of Frequent Words

Motivation: Frequent words (e.g., “the”) provide less information value than rare words, and their vector representations do not change significantly after training on many examples

Solution: Subsampling → Each word $w_i$ in the training set is discarded with probability

\[P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}\]

where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold (typically around $10^{-5}$)
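A minimal sketch of this subsampling step (the frequency table, function name, and the clamp at 0 for words rarer than $t$ are assumptions):

```python
import numpy as np

def subsample(tokens, freqs, t=1e-5, rng=None):
    """Discard each occurrence of w_i with probability 1 - sqrt(t / f(w_i)).

    freqs: dict mapping each word to its relative frequency f(w_i) in the corpus.
    Words with f(w_i) < t get a negative discard probability, so they are
    always kept (hence the clamp at 0).
    """
    rng = rng or np.random.default_rng()
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - np.sqrt(t / freqs[w]))
        if rng.random() >= p_discard:
            kept.append(w)
    return kept
```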

3. Learning Phrases

Motivation: Many phrases have a meaning that is not a simple composition of the meanings of their individual words

Solution:

  1. Find words that appear frequently together and infrequently in other contexts, and replace them with single tokens (see the sketch after this list)
    • EX) “New York Times” is replaced by a unique token, while “this is” remains unchanged
  2. Train the Skip-gram model on the resulting tokens
    • In theory the model could be trained on all n-grams, but that would be too memory intensive, so phrases are kept only when their co-occurrence score exceeds a threshold
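A minimal sketch of the phrase detection step, using the bigram score $\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$ from the paper; the threshold value, discount $\delta$, and function names here are assumptions:

```python
from collections import Counter

def find_phrases(sentences, delta=5, threshold=1e-4):
    """Merge high-scoring bigrams into single tokens (e.g. "new_york").

    score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj));
    delta discounts very infrequent word pairs.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                wi, wj = sent[i], sent[i + 1]
                score = (bigrams[(wi, wj)] - delta) / (unigrams[wi] * unigrams[wj])
                if score > threshold:
                    out.append(wi + "_" + wj)  # replace the pair with one token
                    i += 2
                    continue
            out.append(sent[i])
            i += 1
        merged.append(out)
    return merged
```

Running several passes of this over the corpus allows longer phrases (e.g. three-word names) to form from already-merged tokens.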