Principle and Implementation of Negative Sampling in Word2Vec

Problem Description
Negative Sampling is an efficient training technique for the Word2Vec model, designed to address the computational cost that the traditional Skip-gram model incurs from large vocabularies. For every center word, Skip-gram must compute a Softmax probability distribution over the entire vocabulary, which typically contains tens of thousands to millions of words, making each update extremely expensive. Negative Sampling recasts this multi-class classification problem as a small set of binary classification problems and updates parameters only for the positive pair and a handful of negative samples, thereby significantly accelerating training.

Step-by-Step Analysis of the Core Principle

  1. Original Objective Function of Skip-gram
    • Objective: Given a center word \(w_c\), predict the words \(w_o\) within its context window.
    • Original Softmax Loss:

\[ P(w_o | w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i=1}^{V} \exp(\mathbf{u}_i^\top \mathbf{v}_c)} \]

 where \(V\) is the vocabulary size, \(\mathbf{v}_c\) is the center word vector, and \(\mathbf{u}_o\) is the context word vector. Computing the normalization term in the denominator requires traversing the entire vocabulary, resulting in \(O(V)\) complexity for every prediction.
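
To make the \(O(V)\) cost concrete, here is a minimal NumPy sketch of the Softmax computation above; the matrix stacking all context vectors, the shapes, and the names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

# Illustrative cost of the full Softmax: scoring one (center, context) pair
# still requires a dot product with every row of the V x d context matrix U.
def softmax_prob(U, v_c, o):
    scores = U @ v_c                 # O(V * d) work just for the normalizer
    scores -= scores.max()           # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()   # P(w_o | w_c)

rng = np.random.default_rng(0)
V, d = 50_000, 100
U, v_c = rng.normal(size=(V, d)), rng.normal(size=d)
print(softmax_prob(U, v_c, o=42))
```
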
  2. Improvement Idea of Negative Sampling
    • Reformulate the problem: Determine whether a word pair \((w_c, w_o)\) forms a "true context pair" (positive sample) or a "random noise pair" (negative sample).
    • New Objective Function (Binary Logistic Loss):

\[ \log \sigma(\mathbf{u}_o^\top \mathbf{v}_c) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} [\log \sigma(-\mathbf{u}_k^\top \mathbf{v}_c)] \]

 where \(\sigma\) is the Sigmoid function. The first term pushes the score of the true context pair toward 1, while the second term pushes the scores of the \(K\) negative samples toward 0. Negative samples are drawn from a noise distribution \(P_n(w)\), typically proportional to word frequency raised to the 3/4 power (\(P_n(w) \propto \text{freq}(w)^{3/4}\)), which balances high-frequency and low-frequency words.
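
A minimal sketch of this binary logistic objective for a single positive pair and \(K\) sampled negatives; the vector shapes and names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling loss for one positive pair (w_c, w_o) and K sampled noise words.
# v_c: center vector (d,), u_o: true context vector (d,), U_neg: noise vectors (K, d).
def neg_sampling_loss(v_c, u_o, U_neg):
    pos = np.log(sigmoid(u_o @ v_c))           # log sigma(u_o . v_c)
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()  # sum_k log sigma(-u_k . v_c)
    return -(pos + neg)                        # minimize the negated objective
```
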
  3. Negative Sample Sampling Strategy

    • The number of samples \(K\) is usually between 5 and 20, which is much smaller than the vocabulary size \(V\).
    • Design of the Noise Distribution \(P_n(w)\):
      • The original paper uses a smoothed version of the unigram distribution that dampens the sampling probability of high-frequency words (e.g., "the," "is"), so that frequent but uninformative words do not dominate the negative samples.
      • Modified formula: \(P_n(w_i) = \frac{\text{freq}(w_i)^{3/4}}{\sum_j \text{freq}(w_j)^{3/4}}\).
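
A small sketch of building this noise distribution and drawing negatives from it, assuming a toy tokenized corpus and \(K = 5\); the original C implementation precomputes a large unigram sampling table, whereas `np.random.choice` is used here for brevity.

```python
import numpy as np
from collections import Counter

# Build P_n(w) proportional to freq(w)^(3/4) from a toy tokenized corpus.
tokens = "the cat sat on the mat the cat likes the mat".split()
counts = Counter(tokens)
vocab = list(counts)
freqs = np.array([counts[w] for w in vocab], dtype=np.float64)
probs = freqs ** 0.75
probs /= probs.sum()

rng = np.random.default_rng(0)
neg_ids = rng.choice(len(vocab), size=5, p=probs)   # K = 5 negative samples
print([vocab[i] for i in neg_ids])
```
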
  4. Gradient Calculation and Parameter Update

    • Negating the objective (and using \(1 - \sigma(x) = \sigma(-x)\)) gives the loss to minimize:

\[ J = -\log \sigma(\mathbf{u}_o^\top \mathbf{v}_c) - \sum_{k=1}^{K} \log \sigma(-\mathbf{u}_k^\top \mathbf{v}_c) \]

  • Gradient with respect to the center word vector \(\mathbf{v}_c\):

\[ \frac{\partial J}{\partial \mathbf{v}_c} = [\sigma(\mathbf{u}_o^\top \mathbf{v}_c) - 1]\, \mathbf{u}_o + \sum_{k=1}^{K} \sigma(\mathbf{u}_k^\top \mathbf{v}_c)\, \mathbf{u}_k \]

 Only the positive-sample vector \(\mathbf{u}_o\), the \(K\) negative-sample vectors \(\mathbf{u}_k\), and the center vector \(\mathbf{v}_c\) need to be updated, reducing the per-step computational complexity from \(O(V)\) to \(O(K)\).
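
A sketch of the corresponding SGD step under these gradients; the learning rate, function name, and argument layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One SGD step for a single positive pair plus K negatives; only v_c, u_o,
# and the K rows of U_neg are touched, so the work is O(K), not O(V).
def sgd_step(v_c, u_o, U_neg, lr=0.025):
    g_pos = sigmoid(u_o @ v_c) - 1.0            # scalar in (-1, 0)
    g_neg = sigmoid(U_neg @ v_c)                # (K,) values in (0, 1)

    grad_v = g_pos * u_o + U_neg.T @ g_neg      # dJ/dv_c
    u_o   = u_o   - lr * g_pos * v_c            # dJ/du_o = g_pos * v_c
    U_neg = U_neg - lr * np.outer(g_neg, v_c)   # dJ/du_k = g_neg[k] * v_c
    v_c   = v_c   - lr * grad_v
    return v_c, u_o, U_neg
```
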
  5. Implementation Steps Example
    • Step 1: Build the vocabulary, count word frequencies, and calculate the sampling probability for each word.
    • Step 2: For each training sample \((w_c, w_o)\), sample \(K\) negative sample words \(w_1, \dots, w_K\).
    • Step 3: Calculate the loss function and update parameters:
      • Positive sample: Increase the similarity between \(\mathbf{v}_c\) and \(\mathbf{u}_o\).
      • Negative samples: Decrease the similarity between \(\mathbf{v}_c\) and \(\mathbf{u}_k\).
    • Step 4: Iterate over the corpus repeatedly until the model converges (a compact end-to-end sketch follows this list).
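
Putting the four steps together, the following is a self-contained sketch of the training loop on a toy corpus; the corpus, hyperparameters, and initialization are illustrative assumptions, and practical implementations typically add frequent-word subsampling, a precomputed sampling table, and learning-rate decay.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
tokens = ("the quick brown fox jumps over the lazy dog " * 50).split()

# Step 1: vocabulary, frequencies, and the 3/4-power noise distribution.
counts = Counter(tokens)
vocab = {w: i for i, w in enumerate(counts)}
ids = np.array([vocab[w] for w in tokens])
noise = np.array([counts[w] for w in vocab], dtype=np.float64) ** 0.75
noise /= noise.sum()

V, d, K, window, lr = len(vocab), 50, 5, 2, 0.025
Vc = rng.normal(scale=0.1, size=(V, d))   # center ("input") vectors v
U = np.zeros((V, d))                      # context ("output") vectors u

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(5):                    # Step 4: repeat until convergence
    for pos, c in enumerate(ids):
        lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
        for o in np.concatenate([ids[lo:pos], ids[pos + 1:hi]]):
            neg = rng.choice(V, size=K, p=noise)       # Step 2: K negatives
            v_c, u_o, U_neg = Vc[c], U[o], U[neg]
            g_pos = sigmoid(u_o @ v_c) - 1.0           # Step 3: gradients touch
            g_neg = sigmoid(U_neg @ v_c)               # only K + 2 vectors
            grad_v = g_pos * u_o + U_neg.T @ g_neg
            U[o] -= lr * g_pos * v_c                   # pull u_o toward v_c
            U[neg] -= lr * np.outer(g_neg, v_c)        # push u_k away from v_c
            Vc[c] -= lr * grad_v

print(Vc[vocab["fox"]][:5])               # first dimensions of a learned vector
```

Initializing the context vectors to zero and the center vectors to small random values follows a common word2vec convention; other initialization schemes work as well.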

Key Advantages

  • Improved Computational Efficiency: Avoids the full-vocabulary Softmax, reducing the per-update cost from \(O(V)\) to \(O(K)\) and speeding up training by tens of times in practice.
  • Preserved Semantic Quality: By contrasting true context pairs with sampled noise pairs, the model still learns word vectors that capture meaningful semantic relationships.