Understanding Word2vec and Its Architectures

Feb 17, 2025

Word2vec Overview

Introduction

  • Word2vec is a family of models for producing word embeddings.
  • Originally implemented in C by Tomas Mikolov and colleagues at Google.
  • Reimplemented in other languages, including Python (Gensim) and Java (DL4J).
  • Uses a shallow, two-layer neural network trained on natural-language text.
  • Input: a corpus of text; output: a set of vectors that capture the semantic distribution of the words.

Key Concepts

  • Word embedding: each word is represented as a point (vector) in a multi-dimensional space.
  • Semantic similarity: words whose vectors are closer in that space are more similar semantically (a minimal cosine-similarity sketch follows this list).
  • CBOW and Skip-Gram: the two architectures used to produce the embeddings.
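
Semantic similarity between embeddings is commonly measured with cosine similarity. Below is a minimal sketch using made-up three-dimensional vectors, purely for illustration; real Word2vec embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings with illustrative values only.
king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.1])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # high: semantically close
print(cosine_similarity(king, apple))  # lower: semantically distant
```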

CBOW and Skip-Gram Architectures

  • Both architectures start from a corpus, from which a vocabulary of distinct words (tokens) is extracted.
  • Each token is represented with a one-hot encoding (a short sketch follows this list).
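
The following is a minimal sketch of building a vocabulary and one-hot vectors from a toy corpus; the corpus and helper names are made up for illustration.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()

# Vocabulary of distinct tokens, each mapped to a fixed index.
vocab = sorted(set(corpus))
token_to_index = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """One-hot vector of length |V| with a single 1 at the token's index."""
    vec = np.zeros(len(vocab))
    vec[token_to_index[token]] = 1.0
    return vec

print(one_hot("fox"))
```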

CBOW (Continuous Bag-of-Words)

  • Predicts the current (centre) token from a context window of surrounding words.
  • Input: the one-hot encodings of the context tokens.
  • Hidden layer: the average of the embedding vectors of the context words (see the sketch after this list).
  • Because of the averaging, positional information about the tokens is lost.
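
A minimal sketch of the CBOW forward pass, assuming an input weight matrix W_in whose rows are the word embeddings and an output matrix W_out; the matrix sizes, random initialisation, and token indices are made up for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)

# Weight matrices (randomly initialised here; learned during training).
W_in = rng.normal(size=(vocab_size, embedding_dim))   # rows are word embeddings
W_out = rng.normal(size=(embedding_dim, vocab_size))

def cbow_forward(context_indices):
    """Predict the centre token from the averaged context embeddings."""
    hidden = W_in[context_indices].mean(axis=0)    # average of the context vectors
    scores = hidden @ W_out                        # one score per vocabulary token
    return np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

probs = cbow_forward([2, 3, 5, 6])  # context around the token at position 4
print(probs.argmax())               # index of the predicted centre token
```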

Skip-Gram

  • Predicts the context words from the current (centre) token.
  • Input: the one-hot encoding of the current token.
  • Output: a probability distribution over the vocabulary for each context position.
  • Hidden layer: the word embedding of the current token.
  • Produces one error vector per context position; these are combined and backpropagated (see the sketch after this list).
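
A minimal sketch of the Skip-Gram forward pass and its error vectors, using the same hypothetical W_in/W_out naming as the CBOW sketch above; all sizes and indices are made up for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(1)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # rows are word embeddings
W_out = rng.normal(size=(embedding_dim, vocab_size))

def skipgram_forward(center_index, context_indices):
    """Score every vocabulary token as a possible context word of the current token."""
    hidden = W_in[center_index]                       # embedding of the current token
    scores = hidden @ W_out
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax over the vocabulary

    # One error vector per context position: prediction minus the one-hot target.
    errors = [probs - np.eye(vocab_size)[t] for t in context_indices]
    return probs, np.sum(errors, axis=0)              # combined errors drive backpropagation

probs, error = skipgram_forward(4, [2, 3, 5, 6])
print(probs.argmax(), error.shape)
```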

Training and Parameters

  • Context window size (C): the number of adjacent tokens included in the context (see the sketch after this list).
  • Backpropagation: updates the weight matrices whose rows contain the word embeddings.
  • Softmax function: applied at the last layer to turn the scores into a probability distribution over the vocabulary.
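
A minimal sketch of how the context window size C determines the (centre, context) training pairs; the sentence and function name are made up for illustration (the softmax itself appears in the CBOW and Skip-Gram sketches above).

```python
def context_pairs(tokens, C=2):
    """Yield (centre token, context tokens) pairs for a window of C tokens on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - C):i]
        right = tokens[i + 1:i + 1 + C]
        yield center, left + right

sentence = "the quick brown fox jumps".split()
for center, context in context_pairs(sentence, C=2):
    print(center, "->", context)
```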

Implementations

Shell, C Implementation

  • The original C implementation is run from the shell.
  • Command: $BIN_DIR/word2vec, with parameter options that control training.
  • Parameters include size, window, threads, and binary.

Gensim, Python

  • The model is initialised with an iterable of tokenised sentences used for training.
  • Parameters include size, window, min_count, workers, and seed.
  • The tutorial starts from specific initial values for these parameters (a hedged example follows this list).
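
A minimal Gensim sketch, assuming a made-up toy corpus; note that the size parameter used in older tutorials was renamed vector_size in Gensim 4.x.

```python
from gensim.models import Word2Vec

# Toy corpus: an iterable of tokenised sentences (illustrative only).
sentences = [
    ["word2vec", "produces", "word", "embeddings"],
    ["skip", "gram", "predicts", "context", "words"],
    ["cbow", "predicts", "the", "current", "word"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality ("size" in Gensim < 4.0)
    window=5,         # context window size
    min_count=1,      # ignore tokens rarer than this
    workers=4,        # training threads (full reproducibility requires workers=1)
    seed=42,
)

print(model.wv["word2vec"][:5])               # first 5 dimensions of one embedding
print(model.wv.most_similar("cbow", topn=3))  # nearest neighbours in the vector space
```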

DL4J, Java

  • Key parameters include batchSize, minWordFrequency, and layerSize.
  • The model is initialised using the builder pattern.

References

  • The original sources, along with more in-depth tutorials and documentation.

Additional Info

  • Limitations to address include the curse of dimensionality and the meaning conflation deficiency (a single vector per word conflates all of its senses).
  • Changing or extending the word embeddings without retraining the model remains a challenge.