Understanding Word2vec and Its Architectures

Feb 17, 2025

Word2vec Overview

Introduction

  • Word2vec is a family of models for producing word embeddings.
  • Originally implemented in C by Tomas Mikolov and colleagues at Google.
  • Reimplemented in other languages, including Python (Gensim) and Java (DL4J).
  • Uses a shallow, two-layer neural network trained on natural-language text.
  • Input: a corpus of text; output: a set of vectors that capture the semantic distribution of the words.

Key Concepts

  • Word embedding: each word is represented as a point (vector) in a multi-dimensional space.
  • Semantic similarity: words whose vectors are closer in that space are more similar semantically (a minimal cosine-similarity sketch follows this list).
  • CBOW and Skip-Gram: the two architectures used to produce the embeddings.
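
Semantic similarity between embeddings is commonly measured with cosine similarity. Below is a minimal sketch using made-up three-dimensional vectors, purely for illustration; real Word2vec embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings with illustrative values only.
king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.1])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # high: semantically close
print(cosine_similarity(king, apple))  # lower: semantically distant
```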

CBOW and Skip-Gram Architectures

  • Both architectures start from a corpus, from which a vocabulary of distinct words (tokens) is extracted.
  • Each token is represented with a one-hot encoding (a short sketch follows this list).
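
The following is a minimal sketch of building a vocabulary and one-hot vectors from a toy corpus; the corpus and helper names are made up for illustration.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()

# Vocabulary of distinct tokens, each mapped to a fixed index.
vocab = sorted(set(corpus))
token_to_index = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """One-hot vector of length |V| with a single 1 at the token's index."""
    vec = np.zeros(len(vocab))
    vec[token_to_index[token]] = 1.0
    return vec

print(one_hot("fox"))
```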

CBOW (Continuous Bag-of-Words)

  • Predicts the current (centre) token from a context window of surrounding words.
  • Input: the one-hot encodings of the context tokens.
  • Hidden layer: the average of the embedding vectors of the context words (see the sketch after this list).
  • Because of the averaging, positional information about the tokens is lost.
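
A minimal sketch of the CBOW forward pass, assuming an input weight matrix W_in whose rows are the word embeddings and an output matrix W_out; the matrix sizes, random initialisation, and token indices are made up for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)

# Weight matrices (randomly initialised here; learned during training).
W_in = rng.normal(size=(vocab_size, embedding_dim))   # rows are word embeddings
W_out = rng.normal(size=(embedding_dim, vocab_size))

def cbow_forward(context_indices):
    """Predict the centre token from the averaged context embeddings."""
    hidden = W_in[context_indices].mean(axis=0)    # average of the context vectors
    scores = hidden @ W_out                        # one score per vocabulary token
    return np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

probs = cbow_forward([2, 3, 5, 6])  # context around the token at position 4
print(probs.argmax())               # index of the predicted centre token
```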

Skip-Gram

  • Predicts the context words from the current (centre) token.
  • Input: the one-hot encoding of the current token.
  • Output: a probability distribution over the vocabulary for each context position.
  • Hidden layer: the word embedding of the current token.
  • Produces one error vector per context position; these are combined and backpropagated (see the sketch after this list).
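
A minimal sketch of the Skip-Gram forward pass and its error vectors, using the same hypothetical W_in/W_out naming as the CBOW sketch above; all sizes and indices are made up for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(1)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # rows are word embeddings
W_out = rng.normal(size=(embedding_dim, vocab_size))

def skipgram_forward(center_index, context_indices):
    """Score every vocabulary token as a possible context word of the current token."""
    hidden = W_in[center_index]                       # embedding of the current token
    scores = hidden @ W_out
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax over the vocabulary

    # One error vector per context position: prediction minus the one-hot target.
    errors = [probs - np.eye(vocab_size)[t] for t in context_indices]
    return probs, np.sum(errors, axis=0)              # combined errors drive backpropagation

probs, error = skipgram_forward(4, [2, 3, 5, 6])
print(probs.argmax(), error.shape)
```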

Training and Parameters

  • Context window size (C): the number of adjacent tokens included in the context (see the sketch after this list).
  • Backpropagation: updates the weight matrices whose rows contain the word embeddings.
  • Softmax function: applied at the last layer to turn the scores into a probability distribution over the vocabulary.
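
A minimal sketch of how the context window size C determines the (centre, context) training pairs; the sentence and function name are made up for illustration (the softmax itself appears in the CBOW and Skip-Gram sketches above).

```python
def context_pairs(tokens, C=2):
    """Yield (centre token, context tokens) pairs for a window of C tokens on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - C):i]
        right = tokens[i + 1:i + 1 + C]
        yield center, left + right

sentence = "the quick brown fox jumps".split()
for center, context in context_pairs(sentence, C=2):
    print(center, "->", context)
```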

Implementations

Shell, C Implementation

  • The original C implementation is run from the shell.
  • Command: $BIN_DIR/word2vec, with parameter options that control training.
  • Parameters include size, window, threads, and binary.

Gensim, Python

  • The model is initialised with an iterable of tokenised sentences used for training.
  • Parameters include size, window, min_count, workers, and seed.
  • The tutorial starts from specific initial values for these parameters (a hedged example follows this list).
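
A minimal Gensim sketch, assuming a made-up toy corpus; note that the size parameter used in older tutorials was renamed vector_size in Gensim 4.x.

```python
from gensim.models import Word2Vec

# Toy corpus: an iterable of tokenised sentences (illustrative only).
sentences = [
    ["word2vec", "produces", "word", "embeddings"],
    ["skip", "gram", "predicts", "context", "words"],
    ["cbow", "predicts", "the", "current", "word"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality ("size" in Gensim < 4.0)
    window=5,         # context window size
    min_count=1,      # ignore tokens rarer than this
    workers=4,        # training threads (full reproducibility requires workers=1)
    seed=42,
)

print(model.wv["word2vec"][:5])               # first 5 dimensions of one embedding
print(model.wv.most_similar("cbow", topn=3))  # nearest neighbours in the vector space
```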

DL4J, Java

  • Key parameters include batchSize, minWordFrequency, and layerSize.
  • The model is initialised using the builder pattern.

References

  • The original sources, along with more in-depth tutorials and documentation.

Additional Info

  • Limitations to address include the curse of dimensionality and the meaning conflation deficiency (a single vector per word conflates all of its senses).
  • Changing or extending the word embeddings without retraining the model remains a challenge.