Summary

Word2Vec is a seminal NLP model by Google that generates word embeddings to capture semantic relationships between words based on their context within a corpus, using either a Continuous Bag-of-Words (CBOW) or Skip-Gram architecture.

Abstract

Word2Vec is a pivotal model in the field of natural language processing (NLP) that transforms text into numerical representations, known as word embeddings. These embeddings are designed to reflect the semantic nuances of words by placing similar words in close proximity within a continuous vector space. The model operates through two primary architectures: Continuous Bag-of-Words (CBOW), which predicts a target word from its surrounding context, and Skip-Gram, which predicts the context given a target word. Word2Vec's ability to capture word relationships is contingent upon the size and quality of the training corpus. The model has inspired several extensions and alternatives, such as GloVe, FastText, ELMo, and BERT, each offering improvements or different approaches to word representation. Word2Vec and its successors have a wide range of applications, including semantic similarity measurement, text classification, machine translation, information retrieval, and recommendation systems. Despite its contributions, Word2Vec has limitations, such as its inability to handle word order and out-of-vocabulary words, and the constraint of fixed-size vectors for all words.

Opinions

Word2Vec is considered an important breakthrough in NLP for its ability to produce dense vector representations of words that capture semantic meaning.
The CBOW architecture is noted for its efficiency in prediction, as it uses the average of context word vectors to infer the target word.
Skip-Gram is praised for its performance on smaller datasets and its capacity to capture more nuanced word relationships.
The model's effectiveness is acknowledged to be highly dependent on the size and diversity of the training corpus.
Extensions like GloVe and FastText address some of Word2Vec's limitations, such as handling rare words and incorporating global co-occurrence statistics.
Advanced models like ELMo and BERT, which provide contextualized embeddings, are seen as significant advancements in the evolution of word representation techniques.
Despite its influence, Word2Vec's drawbacks, including context independence and the challenge of out-of-vocabulary words, are recognized as areas for further improvement in NLP models.

Word2Vec (CBOW, Skip-gram) In Depth

Word2Vec is an important model for natural language processing (NLP) developed by researchers at Google.
Word2Vec is a group of related models used to produce word embeddings, which are dense vector representations of words in a continuous vector space.
A two-layer network to generate word embedding given a text corpus.
These embeddings capture semantic relationships between words based on their usage in a large corpus of text.

Word Embeddings: Word embeddings are fixed-size, dense vectors representing words. They capture semantic meaning in such a way that similar words have similar vectors. For example, “king” and “queen” might have vectors that are close to each other.

Architecture

Word2Vec has two main architectures for generating word embeddings:

Continuous Bag-of-Words (CBOW)
Skip-Gram

1. Continuous Bag-of-Words (CBOW)

Predict the target word (center word) from the surrounding context words.
The model averages the vectors of context words and uses this average to predict the target word.
Faster to train since it predicts only one word from multiple context words.

Practical Example:

For simplicity, imagine we got these five words: “google dream company software engineer”

1st Iteration:

Select the window size (window_size=3).
The target is to predict the center word from context words (Surrounding words).
We create the dataset where the context word is independent features and the center word is our output.

Convert it into one hot encoding.

Next step, we pass it to the Neural Network

2nd Iteration:

We go for the next three words

3rd Iteration:

Finally, for the last three words the process is shown below:

Getting Word Embeddings:

below is shown the process of getting word embeddings:

Skip-Gram

Predict the surrounding context words given a target word.
The model takes the target word and tries to predict each of the context words within a window.
Performs better on smaller datasets and can capture more complex relationships between words.

Practical Example:

We will use the above example:

1st Iteration:

Here, our input is our output value, and we predict the surrounding words.

2nd Iteration:

3rd Iteration:

Extensions and Alternatives

Several models build on or improve Word2Vec:

GloVe (Global Vectors for Word Representation): Combines global word co-occurrence statistics with local context-based learning.
FastText: Extends Word2Vec by representing words as n-grams of characters, improving representations for rare words.
ELMo (Embeddings from Language Models): Uses deep, contextualized word representations.
BERT (Bidirectional Encoder Representations from Transformers): Uses transformers for contextualized word embeddings, considering both left and right context.

Applications

Semantic Similarity: Measuring similarity between words.
Text Classification: Improving feature representation for classification tasks.
Machine Translation: Enhancing translation quality by providing better word embeddings.
Information Retrieval: Improving search results by understanding word semantics.
Recommendation Systems: Enhancing recommendations by understanding user preferences and item descriptions.

Limitations

Context Independence: It doesn’t consider the order of words or their syntactic roles.
Out-of-Vocabulary Words: It cannot handle words that were not present in the training corpus.
Fixed Embedding Size: All words are represented by fixed-size vectors, regardless of their frequency or importance.