Summary
Improved Skip-gram model by using Negative Sampling and subsampling of frequent words
1. Original Skip-gram model
Training objective of the Skip-gram model is to find word representation that are useful for predicting the surrounding words in a sentence of a document. Given a sequence of training words $w_1, w_2,…,w_T$, the objective of the Skip-gram model is to maximize
\[\frac{1}{T}\sum^T_{t=1}\sum_{-c \le j \le c, \ j \ne 0 } \log p(w_{t+j}|w_t)\]where c is the size of the training context
The basic Skip-gram formulation defines $p(w_{t+j} \mid w_t)$ using softmax function
\[p(w_o|w_I) = \frac{\exp \left( v_{w_o}'^T v_{w_I} \right)}{\sum_{w=1}^{W} \exp \left( v_w'^T v_{w_I} \right)} \tag{2}\]where $v_w$ and $v_w’$ are input and output vector representations of w, and W is the number of words in the vocabulary.
1-1. Hierarchical Softmax
Uses a binary tree representation of the output layer with the W words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes
2. Improving Skip-gram model
2-1. Negative Sampling
Instead of computing the probability distribution over all possible words, Negative Sampling reformulates the problem as a binary classification task
The model learns to distinguish between correct word pairs (positive samples) and randomly generated incorrect word pairs (negative samples). The objective is defined as below.
\[\log \sigma \left( v_{w_o}'^T v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma \left( -v_{w_i}'^T v_{w_I} \right) \right]\]How to choose Negative Samples
Negative Sampling uses a noise distribution $P_n(w)$, which is a free-parameter. This paper found that sampling negative words in proportion to their frequency (unigram distribution $U$) raised to the power of 3/4 outperformed significantly.
\[P_n(w) = \frac{U(w)^{3/4}}{Z}\]2-2. Subsampling of Frequent Words
Motivation: The vector representation of frequent words shouldn’t change significantly after training on many examples
Solution: Subsampling → Each word $w_i$ in the training set is discarded with probability
\[P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}\]where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold
3. Learning Phrases
Motivation: Many phrases have a meaning!
Solution:
- Find words that appear frequently together and infrequently in other contexts
- EX) “New York Times” will be replaced by unique tokens, while “this is” will remain unchanged
- Train Skip-gram model using all n-grams
- This could be too memory intensive, so need to threshold at some degree