word2vec
What is word2vec?
- Word2vec is an approach for creating word embeddings.
- A word embedding is a representation of a word as a numeric vector.
- Besides word2vec, there are other methods for creating word embeddings, such as fastText, GloVe, ELMo, BERT, GPT-2, etc.

Image: Word2Vec Overview. Source: Stanford CS224N Notes
Model Architecture
Word2vec is based on the idea that a word’s meaning is defined by its context. The context is represented by the surrounding words.

Image: A word and its context. Image by Author
There are two word2vec architectures proposed in the paper:
- CBOW (Continuous Bag-of-Words) – a model that predicts the current word based on its context words.
- Skip-Gram – a model that predicts context words based on the current word.
By definition, both CBOW and Skip-Gram are multi-class classification models; a worked example of the training pairs they use follows the figures below.

Image: CBOW Model: High-level Overview.

Image: Skip-Gram Model: High-level Overview.
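To make the two setups concrete, here is a minimal sketch (my own illustration, assuming a whitespace tokenizer and a window size of 2, neither of which comes from the original text) that generates training pairs from a tokenized sentence: CBOW uses the context words as input and the center word as target, while Skip-Gram does the reverse.
# Illustrative sketch: building CBOW and Skip-Gram training pairs with a window size of 2.
tokens = "data analysis automates analytical model building".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(tokens):
    # Context = up to `window` words on each side of the center word
    context = [
        tokens[j]
        for j in range(max(0, i - window), min(len(tokens), i + window + 1))
        if j != i
    ]
    cbow_pairs.append((context, center))                   # context words -> center word
    skipgram_pairs.extend((center, c) for c in context)    # center word -> each context word

print(cbow_pairs[2])       # (['data', 'analysis', 'analytical', 'model'], 'automates')
print(skipgram_pairs[:2])  # [('data', 'analysis'), ('data', 'automates')]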
The word2vec model is very simple and has only two layers:
- An Embedding layer, which takes a word ID and returns its 300-dimensional vector.
- A Linear (Dense) layer with a Softmax activation. The model is set up for a multi-class classification task, where the number of classes equals the number of words in the vocabulary (see the PyTorch sketch after the architecture figures below).

Image: CBOW Model: Architecture in Details.

Image: Skip-Gram Model: Architecture in Details.
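As a rough illustration of this two-layer architecture, a CBOW variant could be sketched in PyTorch as follows; the class and variable names are mine, not the referenced implementation's, and only the 300-dimensional embedding size comes from the description above.
import torch
import torch.nn as nn

class CBOWModel(nn.Module):
    # Only two layers: Embedding (word ID -> 300-d vector) and Linear (-> vocabulary-sized logits).
    def __init__(self, vocab_size: int, embed_dim: int = 300):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch_size, context_size) IDs of the surrounding words
        embedded = self.embeddings(context_ids)   # (batch, context, embed_dim)
        averaged = embedded.mean(dim=1)           # average the context word vectors
        return self.linear(averaged)              # (batch, vocab_size) logits; Softmax is applied inside the loss

A Skip-Gram variant looks the same, except that the forward pass takes a single center-word ID and the output logits are scored against each context word.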
Data Preparation
It is better to create the vocabulary:
- Either by filtering out rare words that occur fewer than N times in the corpus;
- Or by keeping only the top N most frequent words.
The vocabulary is usually represented as a dictionary data structure:
vocab = {
    "a": 1,
    "analysis": 2,
    "analytical": 3,
    "automates": 4,
    "building": 5,
    "data": 6,
    ...
}

Image: How to create Vocabulary from a text corpus.

Image: How to Encode words with Vocabulary IDs.
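A minimal sketch of both steps, building the vocabulary from a corpus and encoding words as IDs; the tiny corpus and the frequency threshold below are assumptions made purely for illustration.
from collections import Counter

corpus = [
    "data analysis automates analytical model building",
    "a data model automates data analysis",
]
min_count = 1  # keep words that occur at least N times (illustrative threshold)

# Count word frequencies over the whole corpus
counts = Counter(word for sentence in corpus for word in sentence.split())

# Keep the words above the frequency threshold (or take the top-N most frequent instead)
kept = sorted(word for word, freq in counts.items() if freq >= min_count)
vocab = {word: word_id for word_id, word in enumerate(kept, start=1)}
# vocab == {"a": 1, "analysis": 2, "analytical": 3, "automates": 4, "building": 5, "data": 6, "model": 7}

# Encode words as vocabulary IDs, skipping out-of-vocabulary words
encoded = [vocab[word] for word in "data analysis".split() if word in vocab]
print(encoded)  # [6, 2]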
Objective Function
For the Skip-Gram model, for each position t = 1, …, T in the corpus, the model predicts the context words within a window of fixed size m, given the center word w_t. The likelihood is:

$$ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta) $$

The objective function is the average negative log likelihood:

$$ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) $$

Minimizing the objective function is equivalent to maximizing the likelihood, i.e. how well the model predicts context words.

For a center word c and a context word o, the probability is computed with a Softmax over the vocabulary V:

$$ P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)} $$

where v_c is the vector of the center word c and u_o is the vector of the context word o.
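As a toy numeric check of the Softmax above (random vectors, purely illustrative):
import torch

vocab_size, embed_dim = 6, 4
torch.manual_seed(0)
U = torch.randn(vocab_size, embed_dim)  # context ("output") vectors u_w, one per vocabulary word
v_c = torch.randn(embed_dim)            # vector of the center word c

scores = U @ v_c                        # dot products u_w . v_c for every word w
probs = torch.softmax(scores, dim=0)    # P(o | c) for every candidate context word o
print(probs.sum())                      # tensor(1.), up to floating-point rounding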
Training Details
Word2vec is trained as a multi-class classification model using Cross-Entropy loss.
TODO: add details on the dataset prep, optimizer choice, ...
Notes:
- build_vocab_from_iterator
- max_embed_norm
- expect a batch of inputs in the forward method.
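A minimal training-step sketch tying these notes together; the CBOWModel class comes from the architecture sketch above, and the max_norm value, batch shapes, and Adam optimizer are assumptions rather than the choices of the original paper or the referenced implementation.
import torch
import torch.nn as nn

EMBED_MAX_NORM = 1.0   # caps each embedding vector's norm (illustrative value)
vocab_size = 5000      # illustrative vocabulary size

model = CBOWModel(vocab_size)  # two-layer model from the sketch above
model.embeddings = nn.Embedding(vocab_size, 300, max_norm=EMBED_MAX_NORM)  # norm-capped embedding layer
criterion = nn.CrossEntropyLoss()                          # multi-class classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer choice is an assumption

# One training step on a batch: forward expects a batch of inputs
context_batch = torch.randint(0, vocab_size, (8, 4))  # 8 examples, 4 context word IDs each
target_batch = torch.randint(0, vocab_size, (8,))     # the center words to predict
loss = criterion(model(context_batch), target_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()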
References
- Word2vec with PyTorch: Implementing the Original Paper (Towards Data Science)
- Efficient Estimation of Word Representations in Vector Space (arXiv)
- Distributed Representations of Words and Phrases and their Compositionality
- Distributed Representations of Sentences and Documents
- Enriching Word Vectors with Subword Information
- The Illustrated Word2vec
- An introduction to word embeddings for text analysis