Principles and Optimization of the Skip-gram Model in Word2Vec

Problem Description
Skip-gram is a core model in Word2Vec, used for learning word embeddings. Its objective is to predict context words from a center word, thereby capturing semantic and syntactic relationships between words in vector space. For example, given the center word "artificial intelligence," the model should be able to predict context words such as "technology" and "deep learning." Key issues that Skip-gram needs to address include: how to define the prediction task, design the loss function, and employ optimization methods like Negative Sampling to reduce computational complexity.

Step-by-Step Explanation

  1. Basic Model Idea

    • Input and Output: Skip-gram takes the one-hot encoding of a center word (e.g., "cat") as input and uses a neural network to predict all words within its context window (e.g., 2 words before and after).
    • Core Assumption: Semantically similar words appear in similar contextual environments. The model is trained to bring the vectors of similar words closer together in space.
    • Example: For the sentence "The cat is sleeping on the sofa," with a window size of 2, the context words for the center word "is" are {The, cat, sleeping, on} (see the pair-generation sketch below).
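A minimal sketch of how the (center, context) training pairs described above can be generated from a tokenized sentence; the simple whitespace tokenization and the function name are illustrative assumptions:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs using a symmetric context window."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "The cat is sleeping on the sofa".split()
print([ctx for c, ctx in skipgram_pairs(sentence) if c == "is"])
# ['The', 'cat', 'sleeping', 'on']
```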
  2. Model Structure Breakdown

    • Input Layer: A one-hot vector of dimension V (where V is the vocabulary size). For example, with a vocabulary ["cat", "is", "sleeping", "on", "sofa"], the one-hot vector for "is" is [0,1,0,0,0].
    • Hidden Layer: A fully connected layer without an activation function, with weight matrix W₁ (dimensions V × d, where d is the word vector dimension). Multiplying the one-hot input vector by W₁ simply selects the corresponding row of W₁, which is the word's d-dimensional vector.
    • Output Layer: Another weight matrix W₂ (dimensions d × V) transforms the hidden layer output into V-dimensional scores, which are then passed through a Softmax function to calculate the probability of each word being a context word.
    • Mathematical Formulation:
      • Vector for center word c: v_c = W₁^T · one_hot(c)
      • Score for context word o: score_o = v_c^T · W₂[:, o]
      • Probability: P(o|c) = exp(score_o) / ∑_{w=1}^V exp(score_w)
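The forward pass described above can be written directly from these formulas. Below is a small NumPy sketch; the toy vocabulary, embedding dimension d, and random initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "is", "sleeping", "on", "sofa"]
V, d = len(vocab), 3                      # vocabulary size, embedding dimension
W1 = rng.normal(scale=0.1, size=(V, d))   # input weights: row i is word i's vector
W2 = rng.normal(scale=0.1, size=(d, V))   # output weights

def predict_context_probs(center_idx):
    one_hot = np.zeros(V)
    one_hot[center_idx] = 1.0
    v_c = W1.T @ one_hot                  # hidden layer: equals W1[center_idx]
    scores = v_c @ W2                     # V-dimensional scores
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()  # Softmax: P(o | c) for every word o

print(predict_context_probs(vocab.index("is")))
```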
  3. Loss Function and Training Challenges

    • Loss Function: Uses negative log-likelihood loss. For center word c and context word o, the loss is:
      J = -log P(o|c)
    • Computational Bottleneck: The denominator of the Softmax requires summing over the entire vocabulary (∑ exp(score_w)). When V is large (e.g., millions), the computational cost becomes extremely high.
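To make the bottleneck concrete, the sketch below computes J = -log P(o|c) for a single pair with a large vocabulary; the normalization term sums over all V scores, which is the O(V) cost paid for every training pair (the sizes and random vectors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 100                   # a realistically large vocabulary
v_c = rng.normal(size=d)             # center-word vector
W2 = rng.normal(size=(d, V)) * 0.01  # output weight matrix

scores = v_c @ W2                    # one score per vocabulary word
o = 123                              # index of the true context word

# log of the Softmax denominator: a sum over the full vocabulary (O(V)).
log_Z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
loss = -(scores[o] - log_Z)          # J = -log P(o | c)
print(loss)
```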
  4. Optimization Method: Negative Sampling

    • Idea: Transforms the multi-class classification problem into multiple binary classification problems. For each true context word (positive sample), K non-context words (negative samples) are randomly sampled. The model is trained to distinguish positive from negative samples.
    • New Loss Function:
      J = -log σ(v_o^T v_c) - ∑_{k=1}^{K} log σ(-v_{n_k}^T v_c)
      where σ is the Sigmoid function, v_o is the positive sample word vector, and v_{n_k} are the negative sample word vectors.
    • Negative Sample Sampling Strategy: Negative samples are drawn from the unigram (word-frequency) distribution raised to the 3/4 power, which smooths the distribution: frequent words such as "the" are still sampled more often than rare words such as "artificial intelligence," but not as overwhelmingly as their raw frequency would dictate.
    • Advantage: Reduces computational complexity from O(V) to O(K+1), where K is typically set between 5 and 20.
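A sketch of the negative-sampling loss and the frequency^(3/4) sampling distribution; the word counts, K, vector dimension, and random vectors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, K = 10_000, 100, 5
counts = rng.integers(1, 1000, size=V).astype(float)  # toy word-frequency counts
neg_dist = counts ** 0.75                              # unigram^(3/4) distribution
neg_dist /= neg_dist.sum()

W_in  = rng.normal(scale=0.1, size=(V, d))   # center-word (input) vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word (output) vectors

c, o = 42, 7                                 # indices of center / true context word
neg = rng.choice(V, size=K, p=neg_dist)      # K sampled negative words
v_c, v_o, V_neg = W_in[c], W_out[o], W_out[neg]

# J = -log sigma(v_o . v_c) - sum_k log sigma(-v_{n_k} . v_c)
loss = -np.log(sigmoid(v_o @ v_c)) - np.log(sigmoid(-V_neg @ v_c)).sum()
print(loss)                                  # only K + 1 dot products, not V
```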
  5. Training Process Example

    • Input center word: "is", true context word: "cat".
    • Positive sample: ("is", "cat") → label 1.
    • Negative samples: Randomly sample K=2 words (e.g., "eat", "run") → label 0.
    • The model computes a Sigmoid probability for each of these three samples and updates v_c, v_o, and the negative-sample vectors via gradient descent (see the update sketch below).
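One gradient-descent step for this example can be sketched as follows; the learning rate, vector dimension, and initial vectors are illustrative assumptions, and the gradients are derived from the negative-sampling loss above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 8, 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_c   = rng.normal(scale=0.1, size=d)        # vector of center word "is"
v_o   = rng.normal(scale=0.1, size=d)        # vector of positive word "cat"
V_neg = rng.normal(scale=0.1, size=(2, d))   # vectors of negatives "eat", "run"

# Prediction error of each binary classifier: sigmoid output minus label.
err_pos = sigmoid(v_o @ v_c) - 1.0           # positive sample, label 1
err_neg = sigmoid(V_neg @ v_c) - 0.0         # negative samples, label 0

# Gradient-descent updates (gradient of the loss w.r.t. each vector).
grad_c = err_pos * v_o + err_neg @ V_neg
v_o   -= lr * err_pos * v_c
V_neg -= lr * np.outer(err_neg, v_c)
v_c   -= lr * grad_c
```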
  6. Key Techniques and Extensions

    • Subsampling of Frequent Words: High-frequency words like "the" and "is" are discarded with a certain probability to balance the training frequency between common and rare words.
    • Vector Results: After training, each row of the matrix W₁ is the word vector for the corresponding word. Semantically similar words (e.g., "cat" and "dog") will have a high cosine similarity between their vectors.
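The two points above can be illustrated briefly. The subsampling rule shown here is the discard probability 1 - sqrt(t / f(w)) commonly quoted for Word2Vec; the word frequencies, threshold t, and the stand-in "trained" matrix W₁ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Subsampling: frequent words are discarded with probability 1 - sqrt(t / f(w)).
t = 1e-5
freq = {"the": 0.05, "is": 0.03, "cat": 4e-4, "sofa": 1e-4}  # relative frequencies
discard = {w: max(0.0, 1.0 - np.sqrt(t / f)) for w, f in freq.items()}
print(discard["the"], discard["sofa"])   # "the" is dropped far more often

# Word vectors are the rows of W1; similar words have high cosine similarity.
vocab = ["cat", "dog", "sofa"]
W1 = rng.normal(size=(len(vocab), 100))  # stands in for a trained V x d matrix

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vec = {w: W1[i] for i, w in enumerate(vocab)}
print(cosine(vec["cat"], vec["dog"]))    # high after real training; random here
```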

Summary
Skip-gram learns word vectors through a local context-prediction task, and techniques such as Negative Sampling address the computational cost of large vocabularies, making it a classic method for learning word embeddings. Its core idea is to approximate the expensive Softmax with simple binary classification while preserving the distributed semantic information of words.