LLM Architectures Explained: Word Embeddings (Part 2)

Deep Dive into the architecture & building real-world applications leveraging NLP Models starting from RNN to Transformer.

Posts in this Series

NLP Fundamentals
Word Embeddings (this post)
RNNs, LSTMs & GRUs
Sequence to Sequence Models
Attention Mechanism
Transformers
BERT
GPT
Llama
Mistral

· 1. Introduction
∘ 1.1 Word Embeddings
· 2. Fundamentals of Word Embeddings
∘ 2.1 Understanding Vectors and Vector Space
∘ 2.1.1 What is a Vector?
∘ 2.1.2 What is a Vector Space?
∘ 2.1.3 How Vectors Represent Words
∘ 2.1.4 Operations in Vector Space
∘ 2.1.5 Visualization of Vector Space
∘ 2.2 How Word Embeddings Represent Meaning
∘ 2.2.1 The Concept of Meaning in Word Embeddings
∘ 2.2.2 Capturing Meaning Through Context
∘ 2.2.3 Geometric Relationships in Embeddings
∘ 2.2.4 Dense Representation
∘ 2.2.5 Applications of Meaning in Word Embeddings
∘ 2.3 The Concept of Context in Word Embeddings
∘ 2.3.1 Why Context Matters
· 3. Word Embedding Techniques
· 3.1 Frequency-Based Methods (Shallow Embeddings)
∘ 3.1.1 Count Vectorizer
∘ 3.1.2 Bag-of-Words (BoW)
∘ 3.1.3 Term Frequency-Inverse Document Frequency (TF-IDF)
∘ 3.1.4 N-Grams
∘ 3.1.5 Co-occurrence Matrices
∘ 3.1.6 One-Hot Encoding
· 3.2 Static Embeddings
∘ 3.2.1 Word2Vec
∘ 3.2.1.1 Continuous Bag of Words (CBOW)
∘ 3.2.1.2 Skip-Gram
∘ 3.2.2 GloVE (Global Vectors for Word Representation)
∘ 3.2.3 FastText
· 3.3 Contextual Embeddings
∘ 3.3.1 Self Attention
∘ 3.3.2 BERT
∘ 3.3.3 ELMo
· 4. Training Word Embeddings
∘ 4.1 Continuous Bag-of-Words (CBOW) model
∘ 4.1.1 Continuous Bag-of-Words (CBOW) with Python and TensorFlow
∘ 4.2 Skip-gram Model
∘ 4.2.1 Skip-Gram Model with Python and TensorFlow
∘ 4.3 GloVE Model
∘ 4.3.1 GloVe Word Embeddings In Python
· 6. Model Training Optimisation
∘ 6.1 Improving predictive functions
∘ 6.1.1 Softmax-based approaches
∘ 6.1.2 Sampling-based approaches
· 7. Considerations for Deploying Word Embedding Models
· 8. How to choose an embedding model?
· 9. Conclusion
· 10. Test Your Knowledge!
∘ 10.1 DIY

1. Introduction

Word embeddings are a fundamental concept in the field of natural language processing (NLP). They are essentially a way to convert words into numerical representations, or vectors, in a continuous vector space. The goal is to capture the semantic meaning of words such that words with similar meanings have similar vector representations.

This blog post covers the essential aspects of word embeddings from basic to advanced levels, ensuring that readers gain a thorough understanding of the topic and its evolution within the context of NLP and LLMs.

Differentiation between contextualized and context-less (static) embedding models (adapted from Haj-Yahia et al., 2019) | Credits: SE3M

1.1 Word Embeddings

Definition: Word embeddings are dense, low-dimensional, and continuous vector representations of words that capture semantic and syntactic information.

Characteristics:

Dense: Unlike sparse representations like one-hot encoding, embeddings are dense, meaning most of their elements are non-zero.
Vector Space: Words are positioned in a vector space, allowing for mathematical operations and comparisons.
Dimensionality Reduction: The vector space is often reduced to a lower dimension for computational efficiency and visualization.
Semantic Similarity: Words with similar meanings have similar embeddings.

Example

If “king” is represented by a vector v_king and "queen" by v_queen, the relationship between these vectors can capture the gender difference, as in v_king - v_man + v_woman ≈ v_queen.

Need of Word Embeddings

While this does make some sense, why should we be motivated enough to learn and build these word embeddings?

With regard to speech or image recognition systems, all the information is already present in the form of rich dense feature vectors embedded in high-dimensional datasets like audio spectrograms and image pixel intensities.
However, when it comes to raw text data, especially with count-based models like Bag of Words, we are dealing with individual words that may have their own identifiers but do not capture the semantic relationships among words.
This leads to huge sparse word vectors for textual data and thus if we do not have enough data, we may end up getting poor models or even overfitting the data due to the curse of dimensionality.

Comparing feature representations for audio, image and text (Credits: Dipanjan (DJ) Sarkar)

How are Word Embeddings used?

They are used as input to machine learning models: take the words → get their numeric representation → use in training or inference.
To represent or visualize any underlying patterns of usage in the corpus that was used to train them.

Let’s take an example to understand how a word vector is generated: take the emojis that are most frequently used in certain conditions and transform each emoji into a vector, with the conditions as our features.

2. Fundamentals of Word Embeddings

2.1 Understanding Vectors and Vector Space

In the context of natural language processing (NLP) and word embeddings, understanding vectors and vector space is fundamental because they form the mathematical foundation for representing words and their relationships.

2.1.1 What is a Vector?

A vector is a mathematical object that has both magnitude (length) and direction. In simple terms, a vector can be thought of as an ordered list of numbers that represents a point in space. For example, in a 2-dimensional space, a vector can be represented as:

v = [v1, v2]

where v1 and v2 are the components of the vector in the two dimensions (e.g., x and y axes).

In NLP, words are represented as vectors in a multi-dimensional space, where each dimension captures a different aspect or feature of the word’s meaning.

2.1.2 What is a Vector Space?

A vector space is a mathematical structure formed by a collection of vectors that can be added together and multiplied by scalars (numbers) to produce another vector within the same space. Vector spaces are defined by their dimensionality (e.g., 2D, 3D, etc.), which refers to the number of coordinates required to specify any point within that space.

In the context of word embeddings, we work with high-dimensional vector spaces, often with hundreds or even thousands of dimensions. Each word is mapped to a unique vector in this space.

2.1.3 How Vectors Represent Words

When we represent words as vectors in a vector space, the goal is to capture the semantic meaning of the words. Words with similar meanings or that appear in similar contexts should be close to each other in the vector space. The process of training word embeddings learns these vector representations based on the context in which words appear in large text corpora.

For example, the words “king” and “queen” might be represented by vectors that are close to each other in the vector space because they share similar contexts (e.g., royalty, leadership).

2.1.4 Operations in Vector Space

Vector spaces allow us to perform various operations that are useful in NLP:

1. Addition and Subtraction:

By adding or subtracting vectors, we can explore relationships between words. For example, the famous analogy:

v_king - v_man + v_woman ≈ v_queen

This operation shows how a vector close to “queen” can be obtained by taking the vector for “king”, subtracting the vector for “man”, and adding the vector for “woman.”

2. Dot Product:

The dot product of two vectors provides a measure of similarity between them. A high dot product suggests that two word vectors point in a similar direction and therefore share context, although the value also grows with the magnitudes of the vectors.

3. Cosine Similarity:

Cosine similarity is a commonly used measure of similarity between two vectors, calculated as the cosine of the angle between them. This is particularly useful when comparing word vectors, as it helps to normalize the magnitude of the vectors, focusing solely on their direction.
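
The dot product and cosine similarity described above can be sketched in a few lines of plain Python. The three vectors are hypothetical values chosen for illustration (they do not come from a trained model), and the helper names `dot` and `cosine_similarity` are our own.

```python
import math

def dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (independent of vector length)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 3-dimensional vectors, invented for illustration only.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(dot(a, b))                # 28.0
print(cosine_similarity(a, b))  # 1.0, up to floating-point rounding
print(cosine_similarity(a, c))  # 0.0
```

Note how `b` has a large dot product with `a` partly because it is longer, while cosine similarity reports only that the two point in the same direction.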

2.1.5 Visualization of Vector Space

While the vector spaces used in NLP are often high-dimensional (far beyond our ability to visualize directly), it’s common to reduce the dimensions to 2D or 3D using techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding). This reduction allows us to visualize how words are positioned relative to each other, revealing clusters of semantically related words.

t-SNE projection of word/sense embeddings. Green labels show the two sense embeddings for dogychairman, whereas yellow and red labels show the nearest neighbours for the two senses. Best viewed in colour. Credits
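
As a rough illustration of the dimensionality reduction step, the sketch below projects a handful of vectors to 2D with PCA, computed via NumPy's SVD. The 5-dimensional "embeddings" are made-up numbers, not the output of any real model; in practice you would load trained vectors and likely use a library implementation of PCA or t-SNE.

```python
import numpy as np

# Hypothetical 5-dimensional "embeddings"; the values are invented.
words = ["cat", "dog", "car", "truck"]
X = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.1, 0.1],
    [0.1, 0.0, 0.9, 0.8, 0.7],
    [0.0, 0.1, 0.8, 0.9, 0.8],
])

# PCA: center the data, then project onto the top-2 right singular vectors.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T   # shape (4, 2)

for word, (x, y) in zip(words, X_2d):
    print(f"{word:>6}: ({x:+.3f}, {y:+.3f})")
```

In the 2D projection, "cat" and "dog" should land near each other and far from "car" and "truck", which is the kind of clustering such plots are meant to reveal.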

2.2 How Word Embeddings Represent Meaning

Word embeddings are a powerful tool in natural language processing (NLP) because they allow words to be represented in a way that captures their semantic meaning and relationships with other words. Unlike traditional methods like one-hot encoding, which treat words as independent and unrelated entities, word embeddings encode rich information about the context and usage of words in a compact, dense vector form.

2.2.1 The Concept of Meaning in Word Embeddings

The core idea behind word embeddings is the distributional hypothesis from linguistics: words that occur in the same contexts tend to have similar meanings.

For example, consider the words “cat” and “dog.” These words often appear in similar contexts (e.g., “The cat/dog is playing with a ball”). Therefore, their embeddings should be close to each other in the vector space, reflecting their similar meanings.

2.2.2 Capturing Meaning Through Context

Word embeddings are typically learned from large text corpora by analyzing the contexts in which words appear. The learning process involves mapping each word to a vector in a high-dimensional space such that the geometry of this space captures semantic relationships between words.

1. Co-occurrence Statistics:

Word embeddings are often derived from co-occurrence statistics, where the vector for each word is learned based on the words that frequently appear nearby in text. For instance, in the Word2Vec model, the embeddings are trained such that words with similar co-occurrence patterns have similar vector representations.

2. Contextual Similarity:

Words that appear in similar contexts (i.e., surrounded by the same set of words) are given similar vector representations. For example, the words “king” and “queen” might frequently appear in similar contexts (e.g., “The _ ruled the kingdom”), leading their embeddings to be close in the vector space.

2.2.3 Geometric Relationships in Embeddings

The true power of word embeddings lies in the geometric relationships between the vectors in the embedding space. These relationships encode different types of meaning and semantic information.

1. Synonymy and Similarity:

Words with similar meanings have embeddings that are close to each other in the vector space. For example, the words “happy” and “joyful” might be represented by vectors that are close together, indicating their semantic similarity.

2. Analogies and Semantic Relationships:

One of the fascinating properties of word embeddings is their ability to capture analogical relationships through vector arithmetic. A famous example is the analogy:

(King - Man) approximates to (Queen - Woman) | Credits: Avi Chawla

This means that the vector difference between “king” and “man” is similar to the vector difference between “queen” and “woman.” This kind of arithmetic shows that embeddings capture complex semantic relationships like gender, royalty, or even geographical relationships (e.g., “Paris - France + Italy ≈ Rome”).

3. Hierarchical Relationships:

Some embeddings also capture hierarchical relationships. For example, in a well-trained embedding space, “dog” might be close to “animal” and “cat,” reflecting the hierarchy where “dog” and “cat” are both types of “animals.”

4. Polysemy and Contextual Meaning:

While traditional word embeddings struggle with polysemy (words with multiple meanings), newer models like contextualized embeddings (e.g., BERT, ELMo) have advanced this concept. In these models, the embedding of a word changes depending on the context in which it appears, allowing the model to capture the different meanings of a word like “bank” (e.g., river bank vs. financial bank).
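
The analogy arithmetic from item 2 can be tried out on a toy example. The vectors below are hand-made, with dimensions loosely standing for royalty, maleness, and food-ness; they are assumptions for illustration, not values from a trained model, and the `analogy` helper is a hypothetical name.

```python
import math

# Hand-made 3-D vectors (royalty, maleness, food-ness); invented values,
# not taken from any trained model.
vectors = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "apple": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates mirrors common practice, since the nearest vector to the result is otherwise often one of the inputs.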

2.2.4 Dense Representation

Word embeddings are called dense representations because they condense the meaning of a word into a relatively small number of dimensions (e.g., 100–300), where each dimension captures a different aspect of the word’s meaning or context. This is in contrast to sparse representations like one-hot encoding, where each word is represented by a long vector with mostly zeros.

For example, consider the following word embeddings for “cat” and “dog” in a 3-dimensional space:

These vectors are close to each other, reflecting the similar meanings of “cat” and “dog.”
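
To make "close to each other" concrete, here is a minimal sketch using made-up 3-dimensional vectors (hypothetical values, not from a trained model) and Euclidean distance. A third word, "car", is added only for contrast.

```python
import math

# Hypothetical 3-D embeddings; the numbers are made up for illustration.
cat = [0.7, 0.5, 0.1]
dog = [0.8, 0.4, 0.2]
car = [0.1, 0.2, 0.9]

def euclidean(u, v):
    """Straight-line distance between two points in the vector space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(round(euclidean(cat, dog), 3))  # 0.173
print(round(euclidean(cat, car), 3))  # 1.044
```

The small cat-dog distance compared with the cat-car distance is what "similar meanings have nearby vectors" looks like numerically.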

2.2.5 Applications of Meaning in Word Embeddings

The ability of word embeddings to capture meaning has broad applications in NLP:

Similarity and Relatedness: Embeddings are used to measure how similar two words are, which is useful in tasks like information retrieval, clustering, and recommendation systems.
Semantic Search: Word embeddings enable more intelligent search engines that can understand synonyms and related terms.
Machine Translation: Embeddings help align words across languages, facilitating more accurate translations.
Sentiment Analysis: By understanding the meaning of words in context, embeddings improve the accuracy of sentiment classification.

2.3 The Concept of Context in Word Embeddings

Word embeddings use context to capture the meaning of words by analyzing the words that appear in close proximity to the target word across a large corpus of text. The idea is that words that occur in similar contexts tend to have similar meanings.

1. Context Window:

When training word embeddings, a context window is often used to define the span of words around the target word that are considered as its context. For example, in a sentence like “The cat sat on the mat,” if the target word is “cat” and the context window size is 1 (one word on each side), the context words would be “The” and “sat.” With a window size of 2, “on” would be included as well.
The size of the context window can affect the quality of the embeddings. A smaller window focuses on closer words, capturing more specific relationships, while a larger window might capture more general semantic relationships.

2. Contextual Similarity:

The training process adjusts the word vectors so that words with similar contexts end up with similar vectors. For example, the words “cat” and “dog” might have similar contexts like “pets,” “animals,” “home,” etc. As a result, their vectors will be close to each other in the embedding space.

3. Contextualized Embeddings:

Traditional word embeddings like Word2Vec and GloVe produce a single vector for each word, regardless of its context. However, newer models like ELMo and BERT produce contextualized embeddings, where the vector for a word changes depending on its context.
For example, in BERT, the word “bank” in “river bank” and “financial bank” would have different vectors, reflecting their different meanings in these contexts. This allows for a more nuanced understanding of words and their meanings.
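
The context window from item 1 can be sketched as a small helper. This assumes the window size counts words on each side of the target and that tokens come from a simple whitespace split; `context_words` is a hypothetical name.

```python
def context_words(tokens, index, window_size):
    """Return the tokens within `window_size` positions of tokens[index]."""
    start = max(0, index - window_size)
    end = min(len(tokens), index + window_size + 1)
    return [tokens[j] for j in range(start, end) if j != index]

tokens = "The cat sat on the mat".split()
target = tokens.index("cat")   # position 1

print(context_words(tokens, target, 1))  # ['The', 'sat']
print(context_words(tokens, target, 2))  # ['The', 'sat', 'on']
```

Because "cat" is the second word, the window is clipped on the left: there is only one word before it, however large the window.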

2.3.1 Why Context Matters

Context is crucial because the meaning of a word is often ambiguous without it. The same word can have different meanings depending on the words around it, and understanding this is key to natural language understanding.

1. Disambiguation:

Context helps in disambiguating words with multiple meanings (polysemy). For instance, the word “bark” could mean the sound a dog makes or the outer covering of a tree. The context (e.g., “The dog barked loudly” vs. “The tree’s bark was rough”) helps determine the correct meaning.

2. Synonyms and Related Words:

Words with similar meanings or that are used in similar contexts will have similar embeddings. For example, “happy” and “joyful” might appear in similar contexts, such as “feeling” or “emotion,” leading to similar embeddings.

3. Capturing Nuances:

By leveraging context, embeddings can capture subtle differences in meaning. For example, “big” and “large” might have similar embeddings, but the context might reveal slight differences in their use, like “big opportunity” vs. “large amount.”

3. Word Embedding Techniques

In Natural Language Processing (NLP), the generation of word embeddings lies at the heart of understanding language semantics. These embeddings, dense numerical representations of words, capture semantic relationships and enable machines to process textual data effectively. Several techniques have been developed to generate word embeddings, each offering unique insights into the semantic structure of language. Let’s explore some of the prominent methods:

3.1 Frequency-Based Methods (Shallow Embeddings)

3.1.1 Count Vectorizer

When collecting word data for distributional word representations, one can begin with a simple count of the words in a series of documents. The number of times each vocabulary word appears in a document forms that document's count vector. CountVectorizer converts text into fixed-length vectors by counting how many times each word appears; because word order is discarded, the result is a bag-of-words representation.

from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Convert the result to a dense matrix and print it
print("Count Vectorized Matrix:\n", X.toarray())

# Print the feature names
print("Feature Names:\n", vectorizer.get_feature_names_out())

Output:

Count Vectorized Matrix:
 [[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Feature Names:
 ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

Limitations

High dimensionality due to large vocabulary size.
Ignores semantic meaning and context of words.
No consideration of word order.
Produces sparse feature matrices.
Limited in capturing long-range word relationships.
Cannot handle synonyms; treats each word as distinct.
Struggles with out-of-vocabulary (OOV) words in new documents.
Sensitive to document length, potentially introducing bias.
Rare words may introduce noise without meaningful contribution.
Frequent words can dominate feature space unless managed.
Assigns equal importance to all terms, lacking discriminative power.
Resource-intensive for large corpora, with potential scalability issues.

3.1.2 Bag-of-Words (BoW)

BoW is a text representation technique that represents a document as an unordered set of words and their respective frequencies. It discards the word order and captures the frequency of each word in the document, creating a vector representation.

This is a very flexible and intuitive approach, and the easiest of the feature extraction methods. The text/sentence is represented as a list of counts of unique words; for this reason, this method is also referred to as count vectorization. To vectorize our documents, all we have to do is count how many times each word appears.

The bag-of-words model weighs words based on occurrence, but in practice the most common words like “is”, “the”, and “and” add little value. Stop words (introduced in my blog in this series) are therefore removed before count vectorization.

Example

The vocabulary is the set of unique words in these documents.

Vocabulary: [‘dog’, ‘a’, ‘live’, ‘in’, ‘home’, ‘hut’, ‘the’, ‘is’]

Code
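
A minimal bag-of-words sketch in plain Python is shown below. The two documents are hypothetical stand-ins invented for this example, so the resulting vocabulary differs slightly from the list above, and no stop words are removed.

```python
from collections import Counter

# Hypothetical documents, invented for this sketch.
documents = [
    "the dog is in the hut",
    "a dog is in a home",
]

# Build the vocabulary: the sorted set of unique words across all documents.
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for doc in tokenized for word in doc})

# Represent each document as a vector of word counts over the vocabulary.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['a', 'dog', 'home', 'hut', 'in', 'is', 'the']
for vec in vectors:
    print(vec)     # [0, 1, 0, 1, 1, 1, 2] then [2, 1, 1, 0, 1, 1, 0]
```

Each vector has one slot per vocabulary word, so the two documents become comparable fixed-length vectors even though word order is lost.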

Limitations

Ignores word order and context, losing semantic meaning.
High-dimensional, sparse vectors can lead to computational inefficiency.
Fails to capture semantic similarity between words.
Struggles with polysemy (multiple meanings) and synonymy (same meaning).
Sensitive to vocabulary size and selection.
Does not capture multi-word phrases or expressions.
Common words (stop words) can dominate the representation.
Requires extensive preprocessing (tokenization, stop word removal, etc.).
Not suitable for complex tasks requiring nuanced language understanding.

3.1.3 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). The basic idea is that if a word appears frequently in a document but not in many other documents, it should be given more importance.

The TF-IDF score for a term t in a document d within a corpus D is calculated as the product of two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).

1. Term Frequency (TF)

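Term Frequency (TF) measures how frequently a term appears in a document. It is often normalized by the total number of terms in the document to prevent bias toward longer documents:

$$\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$$

2. Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) measures how important a term is in the entire corpus. It decreases the weight of terms that appear in many documents and increases the weight of terms that appear in fewer documents:

$$\mathrm{IDF}(t, D) = \log\left(\frac{N}{1 + |\{d \in D : t \in d\}|}\right)$$

Here N is the total number of documents in the corpus D, and the set in the denominator contains the documents in which t appears. The “+1” is added to the denominator to prevent division by zero in case the term doesn’t appear in any document.

3. Combining TF and IDF: TF-IDF

The TF-IDF score is calculated by multiplying the TF value with the IDF value for a term t in a document d:

$$\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$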

Importance: Helps in identifying important words in documents and is commonly used in information retrieval and text mining.

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data (documents)
documents = [
    "The cat sat on the mat.",
    "The cat sat on the bed.",
    "The dog barked."
]
# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit the model and transform the documents into TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names (unique words in the corpus)
feature_names = vectorizer.get_feature_names_out()
# Convert the TF-IDF matrix into an array
tfidf_array = tfidf_matrix.toarray()
# Display the TF-IDF matrix
print("Feature Names (Words):", feature_names)
print("\nTF-IDF Matrix:")
print(tfidf_array)

Output:

Feature Names (Words): ['barked' 'bed' 'cat' 'dog' 'mat' 'on' 'sat' 'the']

TF-IDF Matrix:
[[0.         0.         0.37420726 0.         0.49203758 0.37420726
  0.37420726 0.58121064]
 [0.         0.49203758 0.37420726 0.         0.         0.37420726
  0.37420726 0.58121064]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163]]
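
The scores above come from scikit-learn's variant of TF-IDF, which by default smooths the IDF as ln((1 + N) / (1 + df)) + 1 and L2-normalizes each row, so they will not match a hand calculation with the simpler definitions from this section. A minimal sketch of those simpler definitions (TF as count divided by document length, IDF with "+1" in the denominator) on the same documents follows. The natural logarithm and the very simple tokenizer are assumptions made here.

```python
import math

documents = [
    "The cat sat on the mat.",
    "The cat sat on the bed.",
    "The dog barked.",
]
# Lowercase and strip the trailing period; a deliberately simple tokenizer.
tokenized = [doc.lower().replace(".", "").split() for doc in documents]

def tf(term, doc_tokens):
    """Term frequency: count of `term` divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with +1 in the denominator."""
    df = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

for term in ["mat", "cat", "the"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
# mat 0.0676
# cat 0.0
# the -0.0959
```

With only three documents, the "+1" pushes the IDF of a word that occurs everywhere ("the") below zero, which is one reason libraries tend to use smoothed variants.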

Limitations

Does not capture the context or meaning of words.
High-dimensional and sparse vectors for large vocabularies.
Does not handle synonyms or polysemy effectively.
Can over-penalize longer documents.
Limited to linear relationships, unable to capture complex patterns.
Static and does not adapt to new contexts or evolving language.
Not effective for very short or very long documents.
Insensitive to word order.

3.1.4 N-Grams

N-grams are contiguous sequences of N words that are treated as a single unit in text analysis. For the sentence “Mary had a little lamb”, the sequences “Mary had”, “had a”, “a little”, and “little lamb” are bi-grams (N=2). Many N-grams may not appear frequently enough in the data to be useful, leading to sparse and less meaningful representations.

Types:

Unigram: Single word.
Bigram: Pair of words.
Trigram: Sequence of three words.

Importance: Captures context and word dependencies in text.

import nltk
from nltk.util import ngrams
from collections import Counter

# Sample text data
text = "The quick brown fox jumps over the lazy dog"
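# Tokenize the text into words.
# word_tokenize needs NLTK's Punkt tokenizer data: run nltk.download("punkt") once
# (recent NLTK releases look for "punkt_tab" instead).
tokens = nltk.word_tokenize(text)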
# Generate Unigrams (1-gram)
unigrams = list(ngrams(tokens, 1))
print("Unigrams:")
print(unigrams)
# Generate Bigrams (2-gram)
bigrams = list(ngrams(tokens, 2))
print("\nBigrams:")
print(bigrams)
# Generate Trigrams (3-gram)
trigrams = list(ngrams(tokens, 3))
print("\nTrigrams:")
print(trigrams)
# Count frequency of each n-gram (for demonstration)
unigram_freq = Counter(unigrams)
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)
# Print frequencies (optional)
print("\nUnigram Frequencies:")
print(unigram_freq)
print("\nBigram Frequencies:")
print(bigram_freq)
print("\nTrigram Frequencies:")
print(trigram_freq)

Output:

Unigrams:
[('The',), ('quick',), ('brown',), ('fox',), ('jumps',), ('over',), ('the',), ('lazy',), ('dog',)]
Bigrams:
[('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog')]
Trigrams:
[('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Unigram Frequencies:
Counter({('The',): 1, ('quick',): 1, ('brown',): 1, ('fox',): 1, ('jumps',): 1, ('over',): 1, ('the',): 1, ('lazy',): 1, ('dog',): 1})
Bigram Frequencies:
Counter({('The', 'quick'): 1, ('quick', 'brown'): 1, ('brown', 'fox'): 1, ('fox', 'jumps'): 1, ('jumps', 'over'): 1, ('over', 'the'): 1, ('the', 'lazy'): 1, ('lazy', 'dog'): 1})
Trigram Frequencies:
Counter({('The', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps', 'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1})

Limitations

High Dimensionality and Sparsity
Lack of Semantic Understanding
Context Ignorance
Scalability Issues
Sensitivity to Noise and Rare Words
Difficulty in Capturing Polysemy and Synonymy
Inability to Generalise Across Languages
Difficulty in Capturing Long-Term Dependencies

3.1.5 Co-occurrence Matrices

Co-occurrence matrices capture the frequency with which words appear together in a given context window. These matrices quantify the statistical relationships between words, providing a basis for generating word embeddings.

Let’s consider a simple example to illustrate the concept of co-occurrence matrices. Suppose we have a corpus consisting of the following three documents:

Document 1: “The quick brown fox jumps over the lazy dog.”
Document 2: “The brown dog barks loudly.”
Document 3: “The lazy cat sleeps peacefully.”

We want to construct a co-occurrence matrix based on the words in these documents within a window size of 1. This means we consider the occurrence of each word with its immediate neighboring words. We’ll ignore punctuation and treat words in a case-insensitive manner.

First, let’s construct a vocabulary based on unique words in the corpus:

Vocabulary: [the, quick, brown, fox, jumps, over, lazy, dog, barks, loudly, cat, sleeps, peacefully]

Next, we create a co-occurrence matrix where rows and columns represent words from the vocabulary. The value in each cell (i, j) of the matrix indicates the number of times the word i co-occurs with word j within the specified window size.

            the  quick  brown  fox  jumps  over  lazy  dog  barks  loudly  cat  sleeps  peacefully
the           0      1      1    0      0     1     2    0      0       0    0       0           0
quick         1      0      1    0      0     0     0    0      0       0    0       0           0
brown         1      1      0    1      0     0     0    1      0       0    0       0           0
fox           0      0      1    0      1     0     0    0      0       0    0       0           0
jumps         0      0      0    1      0     1     0    0      0       0    0       0           0
over          1      0      0    0      1     0     0    0      0       0    0       0           0
lazy          2      0      0    0      0     0     0    1      0       0    1       0           0
dog           0      0      1    0      0     0     1    0      1       0    0       0           0
barks         0      0      0    0      0     0     0    1      0       1    0       0           0
loudly        0      0      0    0      0     0     0    0      1       0    0       0           0
cat           0      0      0    0      0     0     1    0      0       0    0       1           0
sleeps        0      0      0    0      0     0     0    0      0       0    1       0           1
peacefully    0      0      0    0      0     0     0    0      0       0    0       1           0

This co-occurrence matrix captures the frequency of co-occurrence of each word with every other word within a window size of 1. For example, the entry at row ‘the’ and column ‘lazy’ has a value of 2, because ‘lazy’ appears directly next to ‘the’ twice in the corpus (“the lazy dog” in Document 1 and “The lazy cat” in Document 3). Because the window is symmetric, the matrix is symmetric as well.

This example demonstrates how co-occurrence matrices can be constructed and utilized to capture the statistical relationships between words in a text corpus, providing valuable insights for various natural language processing tasks such as word embeddings, sentiment analysis, and named entity recognition.

One of the primary applications of co-occurrence matrices in NLP is generating word embeddings. By analyzing the co-occurrence patterns of words within a corpus, co-occurrence matrices can capture the contextual information surrounding each word. GloVe is trained on such global co-occurrence counts, while Word2Vec learns from the same local co-occurrence information implicitly by predicting words from their context windows. Both create dense, low-dimensional vector representations of words, where the geometric relationships between vectors reflect semantic similarities between words. This enables tasks such as measuring word similarity, analogy detection, and semantic search.


import numpy as np
import pandas as pd
from collections import Counter
from sklearn.preprocessing import normalize

# Example corpus
corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love deep learning",
    "deep learning and machine learning are related"
]

# Tokenize the sentences
corpus = [sentence.lower().split() for sentence in corpus]

# Build the vocabulary: the unique words across all sentences
vocab = set([word for sentence in corpus for word in sentence])
vocab = sorted(vocab)  # Sorting for consistent order
vocab_size = len(vocab)

# Initialize an empty co-occurrence matrix
co_occurrence_matrix = np.zeros((vocab_size, vocab_size))

# Define the window size
window_size = 2

# Create a mapping from word to index
word2idx = {word: i for i, word in enumerate(vocab)}

# Populate the co-occurrence matrix
for sentence in corpus:
    for i, word in enumerate(sentence):
        word_idx = word2idx[word]
        start = max(0, i - window_size)
        end = min(len(sentence), i + window_size + 1)
        
        for j in range(start, end):
            if i != j:
                context_word = sentence[j]
                context_idx = word2idx[context_word]
                co_occurrence_matrix[word_idx, context_idx] += 1

# Convert the matrix to a DataFrame for better visualization
co_occurrence_df = pd.DataFrame(co_occurrence_matrix, index=vocab, columns=vocab)

# Normalize the co-occurrence matrix
co_occurrence_normalized = normalize(co_occurrence_matrix, norm='l1', axis=1)

# Convert the normalized matrix to a DataFrame for better visualization
co_occurrence_normalized_df = pd.DataFrame(co_occurrence_normalized, index=vocab, columns=vocab)

# Display the co-occurrence matrix
print("Co-occurrence Matrix:")
print(co_occurrence_df)

# Display the normalized co-occurrence matrix
print("\nNormalized Co-occurrence Matrix:")
print(co_occurrence_normalized_df)

Limitations

High dimensionality leads to inefficiency and storage issues.
Extreme sparsity due to most word pairs rarely co-occurring.
Limited ability to capture deep semantic relationships.
Context independence, not differentiating between different meanings of the same word.
Scalability challenges with large corpora, as matrix size grows quadratically.
Bias towards frequent words, unreliable for rare words or phrases.
Limited expressiveness, unable to capture complex linguistic structures.
Noisy data due to the presence of stop words and less meaningful word pairs.

3.1.6 One-Hot Encoding

One-Hot Encoding is a basic method for representing words in NLP. Each word in the vocabulary is represented as a unique vector, with all elements set to 0 except one, which corresponds to the word’s index in the vocabulary.

Example: Given this vocabulary of 10,000 words, what’s the simplest way to represent each word numerically?

Well, you could assign an integer index to each word:

Our vocabulary of 10,000 words, with each word assigned an index. (Credits)

So, some examples:

The vector representation for our first vocabulary word “aardvark” will be [1, 0, 0, 0, …, 0], which is a “1” in the first position followed by 9,999 zeroes.
The vector representation for our second vocabulary word “ant” will be [0, 1, 0, 0, …, 0], which is a “0” in the first position, a “1” in the second position, and 9,998 zeroes afterwards.
and so on …

This process is called one-hot vector encoding. You may have also heard of this approach being used to represent labels in multi-class classification problems.

Now, say our NLP project is building a translation model and we want to translate the English input sentence “the cat is black” into another language. We first need to represent each word with a one-hot encoding. We would first look up the index of the first word, “the”, and find that its index in our 10,000-long vocabulary list is 8676.

We could then represent the word “the” using a length 10,000 vector, where every entry is a 0 aside from the entry at position 8676, which is a 1. (Credits)

We do this index look-up for every word in the input sentence, and create a vector to represent each input word. The whole process looks a bit like this as a GIF:

GIF showing the one-hot encoding of the words in the input sentence “the cat is black”. (Credits)

Note that this process has generated a very sparse (mostly zero) feature vector for each input word (here, the terms “feature vector”, “embedding”, and “word representation” are used interchangeably).

These one-hot vectors are a quick and easy way to represent words as vectors of real-valued numbers.

Note:

What if you wanted to generate a representation of the whole sentence, not just each word? The simplest methods either concatenate or average the sentence’s constituent word embeddings (or some mixture of both). More advanced methods, like encoder-decoder RNN models, will read embeddings of each word sequentially in order to gradually build up a dense representation of sentence meaning through layers of transformations.

Code:

def one_hot_encode(text):
    words = text.split()
    # Sort the unique words so the word-to-index mapping is the same on every run
    vocabulary = sorted(set(words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    one_hot_encoded = []
    for word in words:
        # All zeros except a single 1 at the word's index
        one_hot_vector = [0] * len(vocabulary)
        one_hot_vector[word_to_index[word]] = 1
        one_hot_encoded.append(one_hot_vector)

    return one_hot_encoded, word_to_index, vocabulary

# sample
example_text = "cat in the hat dog on the mat bird in the tree"

one_hot_encoded, word_to_index, vocabulary = one_hot_encode(example_text)

print("Vocabulary:", vocabulary)
print("Word to Index Mapping:", word_to_index)
print("One-Hot Encoded Matrix:")
for word, encoding in zip(example_text.split(), one_hot_encoded):
    print(f"{word}: {encoding}")

Output:

Vocabulary: ['bird', 'cat', 'dog', 'hat', 'in', 'mat', 'on', 'the', 'tree']
Word to Index Mapping: {'bird': 0, 'cat': 1, 'dog': 2, 'hat': 3, 'in': 4, 'mat': 5, 'on': 6, 'the': 7, 'tree': 8}
One-Hot Encoded Matrix:
cat: [0, 1, 0, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
hat: [0, 0, 0, 1, 0, 0, 0, 0, 0]
dog: [0, 0, 1, 0, 0, 0, 0, 0, 0]
on: [0, 0, 0, 0, 0, 0, 1, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
mat: [0, 0, 0, 0, 0, 1, 0, 0, 0]
bird: [1, 0, 0, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
tree: [0, 0, 0, 0, 0, 0, 0, 0, 1]

The problems with sparse one-hot encodings

We’ve done our one-hot encoding and have successfully represented each word as a vector of numbers. Plenty of NLP projects have done just this, but the end results can be mediocre, especially when the training dataset is small. This is because one-hot vectors aren’t a great input representation method.

Why is one-hot encoding of words sub-optimal?

1. Lack of Semantic Similarity: One-hot encoding fails to capture semantic relationships between words. For instance, “cat” and “tiger” are represented as entirely distinct vectors, offering no indication of their similarity. This is problematic for tasks like analogy-based vector operations, where we’d expect operations such as “cat - small + large” to yield something akin to “tiger” or “lion.” One-hot encoding lacks the richness required for such tasks.

2. High Dimensionality: The dimensionality of one-hot vectors scales linearly with the vocabulary size. As the vocabulary grows, the feature vectors become increasingly large, exacerbating the curse of dimensionality. This not only increases the number of parameters that need to be estimated but also demands exponentially more data to train a model that generalizes well.

3. Computational Inefficiency: One-hot encoded vectors are sparse and high-dimensional, with most elements being zero. Many machine learning models, especially neural networks, struggle with such sparse data. The large feature space can also lead to memory and storage challenges, particularly if the models do not efficiently handle sparse matrices.
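The first point can be checked directly. The sketch here builds one-hot vectors for a small hypothetical vocabulary (the five words are chosen only for illustration) and computes the cosine similarity between them. Any two different words score exactly 0, so “cat” is no closer to “tiger” than it is to “duvet”.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration
vocab = ["cat", "tiger", "lion", "duvet", "zombie"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(one_hot("cat"), one_hot("tiger")))  # 0.0
print(cosine_similarity(one_hot("cat"), one_hot("duvet")))  # 0.0
print(cosine_similarity(one_hot("cat"), one_hot("cat")))    # 1.0
```

Because every pair of distinct one-hot vectors is orthogonal, no amount of training data changes these similarities; the representation itself carries no notion of relatedness.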

3.2 Static Embeddings

Dense vectors, or word embeddings, address the limitations of one-hot encoding by providing more informative and compact representations of words.

Dimensionality Reduction: Instead of having a vector length equal to the vocabulary size, embeddings typically use vectors with much smaller dimensions (e.g., 50, 100, or 300).
Semantic Proximity: Dense vectors place semantically similar words close to each other in the vector space.
For example, the cosine similarity between the vectors for “cat” and “dog” will be higher than that between “cat” and “fish”.

Example:

Word2Vec: Learns embeddings by predicting a word from its context or vice versa.
GloVe: Uses matrix factorization to derive embeddings.
Transformers (e.g., BERT, GPT): Generate contextual embeddings that capture word meanings based on surrounding context.

What’s the most important problem that one-hot vectors have, that dense embeddings solve?

The core problem that embeddings solve is generalisation.

The generalisation issue. If we assume that words like “cat” and “tiger” are indeed similar, we want some way to pass that information on to the model. This becomes especially important if one of the words is rare (e.g. “liger”), since it can piggy-back on the computation path that a similar, more common word takes through the model. This is because, during training, the model learns to treat the input “cat” a certain way, by sending it through layers of transformations defined by weights and bias parameters. When the network finally sees “liger”, if its embedding is similar to “cat”, then it will take a similar path to “cat” instead of the network having to learn how to handle it completely from scratch. It’s very difficult to make predictions about things unlike anything you’ve seen before; it is much easier if they’re related to something you have seen.

This means embeddings allow us to build much more generalisable models–instead of the network needing to scramble to learn many disparate ways to handle disconnected input, we instead let similar words “share” parameters and computation paths.

Towards dense, semantically-meaningful representation

If we take 5 example words from our vocabulary (say… the words “aardvark”, “black”, “cat”, “duvet” and “zombie”) and examine their embedding vectors created by the one-hot encoding method discussed above, the result would look like this:

Word vectors using one-hot encoding. Each word is represented by a vector that is mostly zeroes, except there is a single “1” in the position dictated by that word’s index in the vocabulary. Note: it’s not that “black”, “cat”, and “duvet” have the same feature vector, it just looks like it here.

But, as humans speaking some language, we know that words are these rich entities with many layers of connotation and meaning. Let’s hand-craft some semantic features for these 5 words. Specifically, let’s represent each word as having some sort of value between 0 and 1 for four semantic qualities, “animal”, “fluffiness”, “dangerous”, and “spooky”:

Hand-crafted semantic features for 5 words in the vocabulary.

So, to explain a couple of examples:

Given the word “aardvark”, I’ve given it a high value for the feature “animal” (since it’s very much an animal), and relatively low values for “fluffiness” (aardvarks have short bristles), “dangerous” (they’re small, nocturnal, pig-like burrowing mammals), and “spooky” (they’re charming).
Given the word “cat”, I’ve given it a high value for the features “animal” and “fluffiness” (self-explanatory), a medium value for “dangerous” (self-explanatory if you’ve ever had a pet cat), and a medium value for “spooky” (try doing an image search for “sphynx cat”).

Plotting words based on semantic feature values

We’ve worked our way to the main point:

Each semantic feature can be thought of as a single dimension in the broader, higher-dimensional semantic space.

In the above made-up dataset, there are four semantic features, and we can plot two of these at a time as a 2D scatter plot (see below). Each feature is a different axis/dimension.
The coordinates of each word within this space are given by its specific values on the features of interest. For example, the coordinates of the word “aardvark” on the 2D plot of fluffiness vs. animal are (x=0.97, y=0.03).

Plotting word feature values on either 2 or 3 axes.

Similarly, we could consider the three features (“animal”, “fluffiness” and “dangerous”) and plot the position of words in this 3D semantic space. For example, the coordinates of the word “duvet” are (x=0.01, y=0.84, z=0.12), indicating that “duvet” is highly associated with the concept of fluffiness, is maybe slightly dangerous, and not an animal.
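To make the idea of “position in semantic space” concrete, here is a minimal sketch that treats three of the hand-crafted features as coordinates and compares words by cosine similarity and Euclidean distance. The “duvet” values and the first two “aardvark” values come from the coordinates quoted above; every other number is an assumption invented for this illustration.

```python
import numpy as np

# Feature order: (animal, fluffiness, dangerous).
# "duvet" and the first two "aardvark" values come from the text;
# the remaining numbers are assumptions made up for this sketch.
features = {
    "aardvark": np.array([0.97, 0.03, 0.15]),  # "dangerous" value assumed
    "cat":      np.array([0.95, 0.90, 0.50]),  # all three values assumed
    "duvet":    np.array([0.01, 0.84, 0.12]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two points in feature space."""
    return float(np.linalg.norm(a - b))

for w1, w2 in [("cat", "aardvark"), ("cat", "duvet"), ("aardvark", "duvet")]:
    sim = cosine_similarity(features[w1], features[w2])
    dist = euclidean_distance(features[w1], features[w2])
    print(f"{w1} vs {w2}: cosine={sim:.2f}, distance={dist:.2f}")
```

With these assumed values, “cat” comes out fairly similar to both “aardvark” (shared “animal” feature) and “duvet” (shared “fluffiness”), while “aardvark” and “duvet” have almost nothing in common.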

This is a hand-crafted toy example, but actual embedding algorithms will of course automatically generate embedding vectors for all the words in an input corpus. If you’d like, you can think of word embedding algorithms like word2vec as unsupervised feature extractors for words.

Word embedding algorithms like word2vec are unsupervised feature extractors for words.

What Is the Dimensionality of Word Embedding?

In general, the dimensionality of word embedding refers to the number of dimensions in which the vector representation of a word is defined. This is typically a fixed value determined while creating the word embedding. The dimensionality of the word embedding represents the total number of features that are encoded in the vector representation.

Different methods to generate word embeddings can result in different dimensionality. Most commonly, word embeddings have dimensions ranging from 50 to 300, although higher or lower dimensions are also possible.

For example, the figure below shows the word embeddings for “king”, “queen”, “man”, and “woman” in a 3-dimensional space:

3.2.1 Word2Vec

Word2Vec, introduced by Mikolov et al. at Google, is a popular prediction-based method that learns word embeddings by predicting the surrounding words within a context window. This approach yields dense vector representations that capture semantic relationships between words.

The approach introduced by Bengio opened new opportunities for NLP researchers to modify the technique and the architecture itself, to create a method that’s computationally less expensive. Why?

The method that Bengio et al. proposed takes words from the vocabulary and feeds them into a feed-forward neural network with an embedding layer, hidden layer(s) and a softmax function.

These embeddings have associated learnable vectors, which optimize themselves through back propagation. Essentially, the first layer of the architecture yields word embeddings, since it’s a shallow network.

The problem with this architecture is that it’s computationally expensive, particularly in the computation between the projection layer and the hidden layer. There are two reasons:

The values produced in the projection layer are dense.
The output layer that follows the hidden layer computes a probability distribution over all the words in the vocabulary.

To address this issue, researchers (Mikolov et al. in 2013) came along with a model called ‘Word2Vec’.

A Word2Vec model essentially addresses the issues of Bengio’s NLM.

It removes the hidden layer altogether, but the projection layer is shared for all words, just like in Bengio’s model. The downside is that this simpler model, without a non-linear hidden layer, won’t be able to represent the data as precisely as the full neural network can when there’s less data.

On the other hand, with a larger dataset, it can represent the data precisely in the embedding space. It also reduces computational complexity, so the model can be trained on much larger datasets.

There are two neural embedding methods for Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

Difference between SkipGram and CBOW training architectures . Credits: Kavita Ganeshan

3.2.1.1 Continuous Bag of Words (CBOW)

CBOW is a variant of the word2vec model that predicts the center word from a (bag of) context words. So given all the words in the context window (excluding the middle one), CBOW tells us the most likely word at the center.

For example, say we have a window size of 2 on the following sentence. Given the words (“Quick”, “Brown”, “and”), we want the network to predict “Fox”.

The input of the Skip-gram network needs to change to take in multiple words. Instead of a “one hot” vector as the input, we use a “bag-of-words” vector. It’s the same concept, except that we put 1s in multiple positions (corresponding to the context words).

CBOW Training Samples | Credits: Praveen Kumar Anwla

With a window size of 2, skip-gram will generate (up to) four training samples per center word, whereas CBOW only generates one. With skip-gram, we saw that multiplying with a one-hot vector just selects a row from the hidden layer weight matrix. What happens when you multiply with a bag-of-words vector instead?

The result is that it selects the corresponding rows and sums them together.

For the CBOW architecture, we also divide this sum by the number of context words to calculate their average word vector. So the output of the hidden layer in the CBOW architecture is the average of all the context word vectors. From there, the output layer is identical to the one in skip-gram.
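The select-sum-average behaviour described above can be verified with a few lines of NumPy. This is a minimal sketch under assumed toy sizes (8 words, 4 dimensions) and hypothetical context indices; it is not the word2vec implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 8, 4                    # tiny sizes, for illustration only
W = rng.normal(size=(vocab_size, embedding_dim))    # hidden-layer weight matrix

context_indices = [1, 2, 5]                         # hypothetical indices of the context words

# Bag-of-words input: a 1 in every position that holds a context word
bow = np.zeros(vocab_size)
bow[context_indices] = 1.0

summed = bow @ W                                    # selects the three rows and sums them
hidden = summed / len(context_indices)              # CBOW averages the context vectors

# The same result via direct row lookups, with no matrix multiplication
print(np.allclose(summed, W[context_indices].sum(axis=0)))   # True
print(np.allclose(hidden, W[context_indices].mean(axis=0)))  # True
```

In practice the lookup form is the one that gets used, since it touches only the rows that matter instead of multiplying by a vector that is almost entirely zeros.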

When the output layer uses hierarchical softmax, the model avoids calculating a probability distribution over all the words in the vocabulary and instead evaluates only about log2(V) nodes per prediction, where V is the vocabulary size. This makes training considerably faster and more efficient.

3.2.1.2 Skip-Gram

First of all, you know you can’t feed a word just as a text string to a neural network, so we need a way to represent the words to the network. To do this, we first build a vocabulary of words from our training documents–let’s say we have a vocabulary of 10,000 unique words. We’re going to represent an input word like “ants” as a one-hot vector. This vector will have 10,000 components (one for every word in our vocabulary) and we’ll place a “1” in the position corresponding to the word “ants”, and 0s in all of the other positions. The output of the network is a single vector (also with 10,000 components) containing, for every word in our vocabulary, the probability that a randomly selected nearby word is that vocabulary word. Here’s the architecture of our neural network.

SkipGram Model Architecture | Credits: Praveen Kumar Anwla

There is no activation function on the hidden layer neurons, but the output neurons use softmax. Kindly note that when training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when you evaluate the trained network on an input word, the output vector will actually be a probability distribution (i.e., a bunch of floating point values, not a one-hot vector).

The Hidden Layer

For our example, we’re going to say that we’re learning word vectors with 300 features. So the hidden layer is going to be represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron).

300 features is what Google used in their published model trained on the Google news dataset (you can download it from here). The number of features is a “hyper parameter” (which is nothing but the Embedding Dimension of each word) that you would just have to tune to your application (that is, try different values and see what yields the best results).

A general approach for dealing with words in your text data is to one-hot encode them. You will have thousands or millions of unique words in your vocabulary. Computations with such one-hot vectors are very inefficient because almost every value in a one-hot vector is 0, so in the matrix multiplication between a one-hot vector and the first hidden layer almost all of the multiplications are by zero and contribute nothing to the output.

We use embeddings to solve this problem and greatly improve the efficiency of our network. Embeddings are just like a fully-connected layer. We will call this layer the embedding layer and its weights the embedding weights.

Now, instead of doing the matrix multiplication between the inputs and the hidden layer, we directly grab the values from the embedding weight matrix. We can do this because multiplying a one-hot vector by the weight matrix returns the row of the matrix corresponding to the index of the “1” input unit. If you look at the rows of this weight matrix, these are actually what will be our word vectors.

If you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the “1”. Here’s a small example to give you a visual.

So, we use this Weight Matrix as lookup table. We encode the words as integers, for example ‘cool’ is encoded as 512, ‘hot’ is encoded as 764. Then to get hidden layer output value for ‘cool’ we just simply need to lookup the 512th row in the weight matrix. This process is called Embedding Lookup. The number of dimension in the hidden layer output is the embedding dimension.
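The equivalence between the one-hot multiplication and the lookup is easy to confirm numerically. The sketch uses the 10,000 x 300 sizes and the index 512 from the example above, with randomly initialised weights; treating 512 as a zero-based row index is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embedding_dim = 10_000, 300
W = rng.normal(size=(vocab_size, embedding_dim))  # randomly initialised embedding weights

cool_index = 512                 # integer code for 'cool' (zero-based here, by assumption)

# Route 1: multiply a 1 x 10,000 one-hot vector by the 10,000 x 300 matrix
one_hot = np.zeros(vocab_size)
one_hot[cool_index] = 1.0
via_matmul = one_hot @ W

# Route 2: embedding lookup, i.e. read the row directly
via_lookup = W[cool_index]

print(via_lookup.shape)                      # (300,)
print(np.allclose(via_matmul, via_lookup))   # True
```

Route 1 performs three million multiplications to produce a result that Route 2 obtains by reading 300 numbers, which is why embedding layers are implemented as lookups.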

Kindly note that at the very beginning of training, all weights in the Embedding matrix are initialized to random values.

Note: The quality of word embeddings increases with higher dimensionality. However, after reaching some threshold, the marginal gain diminishes. Typically, the dimensionality of the vectors is set to be between 100 and 1,000.

Output Layer

The 1 x 300 word vector for “ants” then gets fed to the output layer. In order to guarantee a probability based representation of the output word, a softmax activation function is used in the output layer and the following error function E is adopted during training:

At the same time, to reduce computational effort, a linear activation function is used for the hidden neurons and the same weights are used to embed all inputs (CBOW) or all outputs (Skip-gram).

Since the output layer is a softmax regression classifier, each output neuron (one per word in our vocabulary!) will produce an output between 0 and 1, and the sum of all these output values will add up to 1. Specifically, each output neuron has a weight vector which it multiplies against the word vector from the hidden layer, then it applies the function exp(x) to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes. Above is an illustration of calculating the output of the output neuron for the word “car”.
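The output-layer computation can be sketched in NumPy as follows. The sizes are tiny stand-ins for 10,000 and 300, the weights are random, and `target_index` is a hypothetical label. The loss shown is the standard cross-entropy against a one-hot target, which is an assumption about the error function E rather than a transcription of it.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4            # tiny stand-ins for 10,000 and 300

hidden = rng.normal(size=embedding_dim)      # word vector coming from the hidden layer
W_out = rng.normal(size=(embedding_dim, vocab_size))  # one weight vector (column) per output neuron

def softmax(scores):
    """exp(x) for each score, divided by the sum over all output neurons."""
    shifted = scores - scores.max()          # subtracting the max avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = hidden @ W_out                      # one dot product per vocabulary word
probs = softmax(scores)                      # values in (0, 1) that sum to 1

target_index = 3                             # hypothetical index of the true context word
loss = -np.log(probs[target_index])          # cross-entropy against a one-hot target

print(probs)
print(loss)
```

Note that the normalising sum runs over every word in the vocabulary, which is the cost that sampling-based and hierarchical approaches try to avoid.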

Kindly note that at the very beginning of training, all weights in the Output Matrix are set to 0.

Continuous skip-gram, or skip-gram, is similar to CBOW. Instead of predicting the target word (wt) from its context, it predicts the words surrounding the target word. The training objective is to learn representations, or embeddings, that are good at predicting nearby words.

It also takes a window of “n” words on each side of the target word. For instance, if n=2 and the sentence is “the dog is playing in the park”, then the word fed into the model will be “playing”, and the target words will be (dog, is, in, the).
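Here is a small sketch of how skip-gram training pairs could be generated from a sentence, assuming the window covers n words on each side of the center word. The helper name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the dog is playing in the park".split()

# Context words generated for the center word "playing" with n=2
pairs_n2 = skipgram_pairs(tokens, window_size=2)
print([context for center, context in pairs_n2 if center == "playing"])
# ['dog', 'is', 'in', 'the']

# A window of n=3 reaches every other word in this short sentence
pairs_n3 = skipgram_pairs(tokens, window_size=3)
print([context for center, context in pairs_n3 if center == "playing"])
# ['the', 'dog', 'is', 'in', 'the', 'park']
```

Each (center, context) pair becomes one training example, which is why skip-gram produces several samples per center word.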

3.2.2 GloVE (Global Vectors for Word Representation)

GloVe from Stanford combines the advantages of count-based and prediction-based methods by leveraging co-occurrence statistics to train word embeddings. By optimizing a global word-word co-occurrence matrix, GloVe generates embeddings that capture both local and global semantic relationships.

GloVe is an unsupervised learning algorithm that obtains vector word representations by analyzing the co-occurrence statistics of words in a text corpus. These word vectors capture the semantic meaning and relationships between words.

The key idea behind GloVe is to learn word embeddings by examining the probability of word co-occurrences across the entire corpus. It constructs a global word-word co-occurrence matrix and then factorizes it to derive word vectors representing words in a continuous vector space.

These word vectors have gained popularity in natural language processing (NLP) tasks due to their ability to capture semantic relationships between words. They are used in various applications such as machine translation, sentiment analysis, text classification, and more, where understanding the meaning and context of words is crucial.

Contextual understanding allows us to understand words from their surrounding words.

GloVe embeddings have been widely used alongside other embedding techniques, such as Word2Vec and FastText, significantly improving NLP models’ performance.

How are GloVe Word Embeddings Created?

The basic methodology of the GloVe model is to first create a huge word-context co-occurrence matrix consisting of (word, context) pairs such that each element in this matrix represents how often a word occurs with the context (which can be a sequence of words). The idea then is to apply matrix factorization to approximate this matrix as depicted in the following figure.

Conceptual model for the GloVe model’s implementation (Credits: Dipanjan (DJ) Sarkar)

Considering the Word-Context (WC) matrix, Word-Feature (WF) matrix and Feature-Context (FC) matrix, we try to factorize WC = WF x FC, such that we aim to reconstruct WC from WF and FC by multiplying them. For this, we typically initialize WF and FC with some random weights and attempt to multiply them to get WC’ (an approximation of WC) and measure how close it is to WC. We do this multiple times using Stochastic Gradient Descent (SGD) to minimize the error.

Finally, the Word-Feature matrix (WF) gives us the word embeddings for each word, where F can be preset to a specific number of dimensions.

A very important point to remember is that both Word2Vec and GloVe models are very similar in how they work. Both of them aim to build a vector space where the position of each word is influenced by its neighboring words based on their context and semantics. Word2Vec starts with local individual examples of word co-occurrence pairs and GloVe starts with global aggregated co-occurrence statistics across all words in the corpus.
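The factorization idea can be sketched in a few lines. This follows the simplified WC ≈ WF x FC description only: it uses a made-up 4 x 4 count matrix, plain full-batch gradient descent instead of SGD, and an unweighted squared error. The actual GloVe objective differs (it fits the logarithm of the counts with a weighting function and bias terms), so treat this as an illustration of the concept rather than the GloVe algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word-context co-occurrence counts (made up for this sketch)
WC = np.array([
    [0., 3., 1., 0.],
    [3., 0., 2., 1.],
    [1., 2., 0., 4.],
    [0., 1., 4., 0.],
])
n_words, n_contexts = WC.shape
n_features = 2                                   # embedding dimension F

# Random initial Word-Feature and Feature-Context matrices
WF = rng.normal(scale=0.1, size=(n_words, n_features))
FC = rng.normal(scale=0.1, size=(n_features, n_contexts))

def reconstruction_error(WF, FC):
    """Mean squared difference between WC and its approximation WF x FC."""
    return float(np.mean((WC - WF @ FC) ** 2))

initial_error = reconstruction_error(WF, FC)
learning_rate = 0.01
for step in range(2000):
    diff = WF @ FC - WC                          # WC' - WC
    grad_WF = diff @ FC.T                        # gradient of 0.5 * ||diff||^2 w.r.t. WF
    grad_FC = WF.T @ diff                        # gradient of 0.5 * ||diff||^2 w.r.t. FC
    WF -= learning_rate * grad_WF
    FC -= learning_rate * grad_FC

final_error = reconstruction_error(WF, FC)
print(f"error before: {initial_error:.3f}, after: {final_error:.3f}")
print(WF)                                        # each row is a 2-dimensional word embedding
```

With only two features the reconstruction cannot be exact, but the error drops well below its starting value and each row of WF ends up as a compact vector for one word.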

3.2.3 FastText

Limitations of Word2Vec

While Word2Vec was a game-changer for NLP, we will see how there was still some room for improvement:

Out of Vocabulary (OOV) Words: In Word2Vec, an embedding is created for each word. As such, it can’t handle any words it has not encountered during training.

For example, words such as “tensor” and “flow” are present in the vocabulary of Word2Vec. But if you try to get embedding for the compound word “tensorflow”, you will get an out of vocabulary error.

Morphology: For words with the same root, such as “eat” and “eaten”, Word2Vec doesn’t do any parameter sharing. Each word is learned uniquely based on the context it appears in. Thus, there is scope for utilizing the internal structure of the word to make the process more efficient.

To solve the above challenges, Bojanowski et al. proposed a new embedding method called FastText. Their key insight was to use the internal structure of a word to improve vector representations obtained from the skip-gram method.

The modification to the skip-gram method is applied as follows:

1. Sub-word generation

For a word, we generate character n-grams of length 3 to 6 present in it.

We take a word and add angle brackets to denote the beginning and end of the word.

Then, we generate character n-grams of length n. For example, for the word “eating”, character n-grams of length 3 can be generated by sliding a window of 3 characters from the opening angle bracket until the closing angle bracket is reached. Here, we shift the window one step each time.

Thus, we get a list of character n-grams for a word.

Examples of different length character n-grams are given below:
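As a rough illustration, the sliding-window procedure can be written as a short function. The name `char_ngrams` and its parameters are made up for this sketch, and the real FastText implementation handles some edge cases differently.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in angle brackets."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters one step at a time
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("eating", min_n=3, max_n=3))
# ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']

print(len(char_ngrams("eating")))  # n-grams of length 3 to 6
# 18
```

The angle brackets let the model tell prefixes and suffixes apart from the same characters inside a word: “<ea” marks the start of “eating”, while “eat” could occur anywhere.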

Since there can be huge number of unique n-grams, we apply hashing to bound the memory requirements. Instead of learning an embedding for each unique n-gram, we learn total B embeddings where B denotes the bucket size. The paper used a bucket of a size of 2 million.

Each character n-gram is hashed to an integer between 1 and B. Though this could result in collisions, it helps control the vocabulary size. The paper uses the FNV-1a variant of the Fowler-Noll-Vo hashing function to hash character sequences to integer values.
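A minimal sketch of the bucketing step, assuming the standard 32-bit FNV-1a constants and a zero-based bucket index (so values run from 0 to B-1 rather than 1 to B). The function names are hypothetical, and the library's own implementation may differ in details such as how non-ASCII bytes are treated.

```python
FNV_OFFSET_BASIS = 2166136261   # 32-bit FNV offset basis
FNV_PRIME = 16777619            # 32-bit FNV prime

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of a string."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                        # XOR the byte in first...
        h = (h * FNV_PRIME) % 2**32      # ...then multiply, keeping 32 bits
    return h

def bucket_index(ngram, num_buckets=2_000_000):
    """Map an n-gram to one of `num_buckets` embedding rows (zero-based)."""
    return fnv1a_32(ngram) % num_buckets

for ngram in ["<ea", "eat", "ing", "ng>"]:
    print(ngram, bucket_index(ngram))
```

Two different n-grams can land in the same bucket and then share one embedding vector; that is the price paid for a fixed-size table.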

2. Skip-gram with negative sampling

To understand the pre-training, let’s take a simple toy example. We have a sentence with a center word “eating” and need to predict the context words “am” and “food”.

1. First, the embedding for the center word is calculated by taking the sum of the vectors for its character n-grams and the whole word itself.

2. For the actual context words, we directly take their word vector from the embedding table without adding the character n-grams.

3. Now, we collect negative samples randomly with probability proportional to the square root of the unigram frequency. For each actual context word, 5 random negative words are sampled.

4. We take the dot product between the center word and each actual context word and apply the sigmoid function to get a match score between 0 and 1.

5. Based on the loss, we update the embedding vectors with SGD optimizer to bring actual context words closer to the center word but increase distance to the negative samples.
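The scoring steps above can be sketched numerically as follows. All vectors are random stand-ins, the unigram counts are hypothetical, and the loss is the usual negative-sampling (binary logistic) loss, which is assumed here rather than taken from the text. The SGD update itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                                      # small embedding size, for illustration

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: center vector = whole-word vector + sum of its character n-gram vectors
ngram_vectors = rng.normal(scale=0.1, size=(6, dim))   # stand-ins for 6 n-gram vectors
word_vector = rng.normal(scale=0.1, size=dim)
center = word_vector + ngram_vectors.sum(axis=0)

# Step 3: sampling probabilities proportional to sqrt(unigram frequency)
counts = np.array([100, 25, 4, 1])           # hypothetical unigram counts
sampling_probs = np.sqrt(counts) / np.sqrt(counts).sum()

# Steps 2 and 3: one actual context word and 5 sampled negative words
context = rng.normal(scale=0.1, size=dim)
negatives = rng.normal(scale=0.1, size=(5, dim))

# Step 4: dot products squashed to match scores between 0 and 1
positive_score = sigmoid(center @ context)
negative_scores = sigmoid(negatives @ center)

# Step 5: the loss is low when the positive score is near 1 and the negative scores near 0
loss = -np.log(positive_score) - np.log(1.0 - negative_scores).sum()
print(sampling_probs)
print(positive_score, negative_scores, loss)
```

Taking the square root flattens the distribution: in this toy example the most frequent word is 100 times more common than the rarest, but only 10 times more likely to be drawn as a negative.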

3.3 Contextual Embeddings

Contextual embeddings genuinely showed some promising results in learning the relationships between words.

For instance, these models are capable of generating context-aware representations, thanks to their self-attention mechanism. This allows embedding models to dynamically generate embeddings for a word based on the context it is used in. As a result, if a word appears in a different context, the model produces a different representation.

This is precisely depicted in the image below for different uses of the word “Bank”.

For visualization purposes, the embeddings have been projected into 2d space using t-SNE.

Glove vs. BERT on understanding different senses of a word | Credits: Avi Chawla

The static embedding models, GloVe and Word2Vec, produce the same embedding for different usages of a word.

However, contextualized embedding models don’t.

In fact, contextualized embeddings understand the different meanings/senses of the word “Bank”:

A financial institution
Sloping land
A long ridge, and more.

Different senses were taken from Princeton’s WordNet database here: WordNet.

As a result, they addressed the major limitations of static embedding models.

3.3.1 Self Attention

Converts Static Embeddings into Dynamic Contextual Embeddings

The image demonstrates how the meaning of the word “Apple” changes depending on its context in a sentence. This is achieved through contextual embeddings, where the representation of a word adapts based on the surrounding words. Let’s break down the example.

Left Side: “I ate an Apple”

Sentence: “I ate an Apple.”
Context: Here, “Apple” clearly refers to the fruit.

This equation shows how the final representation of “Apple” is influenced by the surrounding words:

I: The pronoun indicating the speaker.
ate: The verb suggesting an eating action.
an: The article that supports the context of a single item, typically a countable noun.
Apple: The word itself, influenced by the context.

Outcome: The vector for “Apple” here leans heavily towards the meaning of the fruit, due to the words “ate” and “an.”

Right Side: “I bought an Apple”

Sentence: “I bought an Apple.”
Context: In this context, “Apple” is more likely to refer to the tech company, suggesting a purchase of an Apple product like an iPhone or MacBook.

This equation shows how the context changes the meaning:

I: The pronoun again indicating the speaker.
bought: The verb implying a purchase.
an: The article as before.
Apple: Here, the word shifts towards the context of a product or brand.

Outcome: The vector for “Apple” in this sentence shifts towards representing the company or its products, due to the context provided by “bought.”

Context Matters: The word “Apple” can mean different things based on the surrounding words. In “I ate an Apple,” it means the fruit. In “I bought an Apple,” it suggests a product from Apple Inc.

Contextual Embeddings: These embeddings adjust the vector for “Apple” according to the context, providing a more accurate understanding of the word’s meaning in each sentence. This approach helps in understanding language more naturally, capturing the nuances that static embeddings miss.

Now we need to create such a contextual representation for every word in the sentence, just as we did above for “Apple”.
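To see how mixing in the surrounding words changes the vector for “Apple”, here is a stripped-down sketch of self-attention. It uses random static embeddings and scaled dot-product attention with no learned query, key, or value projections, so it shows only the mechanism; real transformer layers add those learned projections, multiple heads, and positional information.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical static embeddings: one fixed random vector per word
vocab = ["I", "ate", "bought", "an", "Apple"]
static = {word: rng.normal(size=dim) for word in vocab}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(sentence):
    """Scaled dot-product self-attention with no learned projections."""
    X = np.stack([static[word] for word in sentence])    # (tokens, dim)
    weights = softmax(X @ X.T / np.sqrt(dim))             # each row sums to 1
    return weights @ X                                    # context-mixed vectors

ate = self_attention(["I", "ate", "an", "Apple"])
bought = self_attention(["I", "bought", "an", "Apple"])

# "Apple" starts from the same static vector in both sentences...
apple_ate, apple_bought = ate[-1], bought[-1]

# ...but its contextual vector differs once the neighbours are mixed in
print(np.allclose(apple_ate, apple_bought))   # False
```

Each output row is a weighted average of every word vector in the sentence, which is the sense in which the final “Apple” vector is built from “I”, the verb, “an”, and “Apple” itself.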

Contextual embeddings address the limitations of static embeddings by providing word representations that change based on their context. These embeddings are generated by deep learning models trained to understand the context in which words appear.

Context-Aware: The representation of a word is influenced by the words around it, allowing for different meanings based on context.
Dynamic Representations: Words have multiple representations depending on their usage in different sentences.
Enhanced Semantic Understanding: Contextual embeddings capture more complex relationships and nuanced meanings.

One of the key advancements in this area is the development of ELMo (Embeddings from Language Models). ELMo considers the entire sentence to establish word meaning. It has significantly improved performance on various NLP tasks.

Following ELMo, BERT (Bidirectional Encoder Representations from Transformers) took the concept further. BERT analyzes words in relation to all the other words in a sentence, rather than in isolation. This has led to even more nuanced language models.

3.3.2 BERT

Transformer models, like BERT, use attention mechanisms to weigh the relevance of all words in a sentence. This has been a game-changer for feature extraction in NLP. It has enabled more accurate predictions and better understanding of language nuances.

BERT is a transformer-based model that learns contextualized embeddings for words. It considers the entire context of a word by considering both left and right contexts, resulting in embeddings that capture rich contextual information.

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for first, second in word_pairs:
    # Encode the two words as a batch of two inputs; pad so both have equal length
    tokens = tokenizer([first, second], return_tensors='pt', padding=True)
    with torch.no_grad():
        outputs = model(**tokens)

    # Extract the embedding of the [CLS] token (position 0) for each word
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    similarity = torch.nn.functional.cosine_similarity(cls_embedding[0], cls_embedding[1], dim=0)

    print(f"Similarity between '{first}' and '{second}' using BERT: {similarity.item():.3f}")

3.3.3 ELMo

ELMo (Embeddings from Language Models) was developed by researchers at the Allen Institute for AI. ELMo embeddings showed performance improvement in question-answering and sentiment analysis tasks. The official paper — https://arxiv.org/pdf/1802.05365.pdf

ELMo (Embeddings from Language Models) represents a significant advancement in feature extraction for Natural Language Processing (NLP). This technique leverages deep, contextualized word embeddings to capture syntax and semantics, as well as polysemy — words with multiple meanings.

Unlike traditional embeddings, ELMo analyzes words within the context of surrounding text, leading to a richer understanding. Here’s a simplified example in Python:

from allennlp.modules.elmo import Elmo, batch_to_ids

# Initialize ELMo
options_file = 'elmo_options.json'
weight_file = 'elmo_weights.hdf5'
elmo = Elmo(options_file, weight_file, num_output_representations=1)

# Example sentences
sentences = [['I', 'have', 'a', 'green', 'apple'], ['I', 'have', 'a', 'green', 'thumb']]

# Convert sentences to character ids
character_ids = batch_to_ids(sentences)

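# Get ELMo embeddings (the module returns a dict)
embeddings = elmo(character_ids)
# 'elmo_representations' is a list with one tensor per requested output representation
vectors = embeddings['elmo_representations'][0]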

ELMo’s dynamic word embeddings are a game-changer, providing nuanced word representations that reflect different meanings based on context. This has paved the way for more sophisticated NLP applications, enhancing the performance of various tasks in feature extraction and beyond.

Training a custom embedding model can be beneficial in use cases where the scope is limited. Training an embedding model that generalizes well, however, can be a laborious exercise: collecting and pre-processing text data is cumbersome, and the training process can be computationally expensive too.

The good news for anyone building AI systems is that embeddings, once created, can also generalize across tasks and domains. Some of the well-known pre-trained embeddings available to use are:

1. Embeddings Models by OpenAI

OpenAI, the company behind ChatGPT and the GPT series of Large Language Models, also provides three embeddings models: text-embedding-ada-002, text-embedding-3-small and text-embedding-3-large.

OpenAI models can be accessed using the OpenAI API.

2. Gemini Embeddings Model by Google

text-embedding-004 (last updated in April 2024) is the model offered by Google Gemini. It can be accessed via the Gemini API

3. Voyage AI

Voyage AI embedding models are recommended by Anthropic, the provider of the Claude series of Large Language Models. Voyage offers several embeddings models like voyage-large-2-instruct, voyage-law-2 and voyage-code-2.

4. Mistral AI Embeddings

Mistral is the company behind LLMs like Mistral and Mixtral. They offer a 1024-dimension embedding model by the name of mistral-embed. It is accessed via the Mistral API.

5. Cohere Embeddings

Cohere, the developers of Command, Command R and Command R+ LLMs also offer a variety of embedding models like embed-english-v3.0, embed-english-light-v3.0, embed-multilingual-v3.0, etc. and can be accessed via cohere API.

4. Training Word Embeddings

Word embeddings were proposed by Bengio et al. (2001, 2003) to tackle what’s known as the curse of dimensionality, a common problem in statistical language modelling.

It turns out that Bengio’s method could train a neural network such that each training sentence informs the model about a number of semantically neighboring words; this is known as a distributed representation of words. The neural network not only established relationships between different words, but also preserved those relationships in terms of both semantic and syntactic properties.

This introduced a neural network architecture approach that laid the foundation for many current approaches.

This neural network has three components:

An embedding layer that generates word embeddings, with parameters shared across words.
A hidden layer of one or more layers introduces non-linearity to the embeddings.
A softmax function that produces probability distribution over all the words in the vocabulary.

Let’s understand how a neural network language model works with the help of code.

(Here are links to the Notebook and original Paper)

Step 1: Indexing the words.

We start by indexing the words. For each word in the sentence, we’ll assign a number to it.

import torch
import torch.nn as nn
import torch.optim as optim

raw_sentence = ["i like dog", "i love coffee", "i hate milk"]
word_list = " ".join(raw_sentence).split()
word_list = list(set(word_list))
word2id = {w: i for i, w in enumerate(word_list)}
id2word = {i: w for i, w in enumerate(word_list)}
n_class = len(word2id)

Step 2: Building the model.

We will build the model exactly as described in the paper.

# Example hyperparameters (assumed values for this toy corpus)
n_step = 2    # number of context words fed to the model
n_hidden = 2  # hidden layer size
m = 2         # embedding dimension

class NNLM(nn.Module):
    def __init__(self):
        super(NNLM, self).__init__()
        self.embeddings = nn.Embedding(n_class, m)  # embedding layer or look up table
        self.hidden1 = nn.Linear(n_step * m, n_hidden, bias=False)
        self.d = nn.Parameter(torch.ones(n_hidden))  # bias of the hidden layer
        self.hidden2 = nn.Linear(n_hidden, n_class, bias=False)
        self.hidden3 = nn.Linear(n_step * m, n_class, bias=False)  # direct connection from embeddings to output
        self.b = nn.Parameter(torch.ones(n_class))  # bias of the output layer

    def forward(self, X):
        X = self.embeddings(X)  # embeddings: [batch_size, n_step, m]
        X = X.view(-1, n_step * m)  # first layer: concatenate the context embeddings
        tanh = torch.tanh(self.d + self.hidden1(X))  # tanh layer
        output = self.b + self.hidden3(X) + self.hidden2(tanh)  # summing up all the layers with bias
        return output

We’ll start by initializing an embedding layer. An embedding layer is a lookup table.

Once the input index of the word is embedded through an embedding layer, it’s then passed through the first hidden layer with bias added to it. The output of these two is then passed through a tanh function.

If you remember from the diagram in the original paper, the output of the embedding layer is also passed directly into the final layer, where it is summed with the output of the tanh layer and the output bias.

output = self.b + self.hidden3(X) + self.hidden2(tanh)

Now, in the last step we will calculate the probability distribution over the entire vocabulary.

Step 3: Loss and optimization function.

Now that we have the output from the model, we need to make sure that we pass it through the softmax function to get the probability distribution.

We’re using cross entropy loss.

criterion = nn.CrossEntropyLoss()

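The cross entropy loss combines two operations: the log softmax function and the negative log likelihood loss, or NLLLoss. The former applies the softmax normalization and takes the logarithm, while the latter computes the negative log likelihood of the correct word. This is why the model returns raw scores and no explicit softmax layer is needed.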
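To make the decomposition concrete, here is a minimal NumPy sketch of the two steps. The function names and the toy scores are illustrative assumptions, not part of PyTorch; the sketch only mirrors what the cross entropy loss computes on raw scores.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll_loss(log_probs, targets):
    # Pick the log-probability of the correct class in each row and average
    return -log_probs[np.arange(len(targets)), targets].mean()

def cross_entropy(logits, targets):
    # Cross entropy = negative log likelihood applied to log softmax
    return nll_loss(log_softmax(logits), targets)

# Toy scores for two examples over a three-word vocabulary (made-up numbers)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
targets = np.array([0, 2])
loss = cross_entropy(logits, targets)
print(float(loss))
```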

For optimization, we use Adam optimizer.

Step 4: Training.

Finally, we train the model.
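The four steps can be put together in one short script. This is a minimal sketch under stated assumptions: the make_batch helper, the hyperparameter values, the learning rate and the number of epochs are illustrative choices, and sorted() is used only to make the word indices deterministic.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

raw_sentence = ["i like dog", "i love coffee", "i hate milk"]
word_list = sorted(set(" ".join(raw_sentence).split()))
word2id = {w: i for i, w in enumerate(word_list)}
id2word = {i: w for w, i in word2id.items()}
n_class = len(word2id)
n_step, n_hidden, m = 2, 2, 2  # assumed toy hyperparameters

def make_batch(sentences):
    # Hypothetical helper: the first n_step words are the input, the last word is the target
    inputs, targets = [], []
    for sentence in sentences:
        words = sentence.split()
        inputs.append([word2id[w] for w in words[:-1]])
        targets.append(word2id[words[-1]])
    return torch.LongTensor(inputs), torch.LongTensor(targets)

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(n_class, m)
        self.hidden1 = nn.Linear(n_step * m, n_hidden, bias=False)
        self.d = nn.Parameter(torch.ones(n_hidden))
        self.hidden2 = nn.Linear(n_hidden, n_class, bias=False)
        self.hidden3 = nn.Linear(n_step * m, n_class, bias=False)
        self.b = nn.Parameter(torch.ones(n_class))

    def forward(self, X):
        X = self.embeddings(X).view(-1, n_step * m)
        tanh = torch.tanh(self.d + self.hidden1(X))
        return self.b + self.hidden3(X) + self.hidden2(tanh)

model = NNLM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
input_batch, target_batch = make_batch(raw_sentence)

losses = []
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(input_batch)             # raw scores: [batch_size, n_class]
    loss = criterion(output, target_batch)  # log softmax + NLL in one call
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Predict the last word of each sentence from its first two words
predicted = model(input_batch).argmax(dim=1)
print([id2word[i.item()] for i in predicted])
```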

In a nutshell, word embeddings can be defined as a dense representation of words in the form of vectors in a low-dimensional space. The vectors are learnable parameters: they are updated during backpropagation using a loss function, and they try to capture good relationships between words, preserving both semantic and syntactic properties.

*“As it turned out that neural network based models significantly outperformed statistical based models”* **Mikolov et al. (2013)**.

Softmax Function

So far, you’ve seen how the softmax function plays a vital role in predicting the words around a given context. But it suffers from a complexity issue.

Recall the equation of the softmax function:

$$p(w_t \mid c) = \frac{\exp(y_{w_t})}{\sum_{i=1}^{V} \exp(y_i)}$$

Where w_t is the target word, c is the context words, y_i is the output score for the i-th word, and V is the number of words in the vocabulary.

If you look at the equation above, the cost of the softmax function grows with the number of classes V, because the sum in the denominator runs over all of them. If V = 3, the softmax function returns a probability distribution over three categories.

But, in NLP we usually deal with thousands, sometimes millions of words. Getting a probability distribution over that many words will make the computation really expensive and slow.

Keep in mind that the softmax function returns the exact probability distribution, so it gets slower as the vocabulary grows: for each word (wt), it sums over the entire vocabulary in the denominator.
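A small NumPy sketch shows where the cost comes from. The vocabulary sizes and the random scores are illustrative assumptions; the point is only that every prediction needs one exponential per vocabulary word.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged
    exp_scores = np.exp(scores - scores.max())
    # The denominator sums over every word in the vocabulary
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
for vocab_size in (3, 10_000, 1_000_000):
    scores = rng.normal(size=vocab_size)  # stand-in for the model's output scores
    probs = softmax(scores)
    print(vocab_size, "terms in the denominator for a single prediction")
```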

Several techniques are commonly used to train word embeddings. These techniques vary in their approach to learning the semantic relationships between words, as well as their computational efficiency and effectiveness. Some of the most popular word embedding training techniques are:

CBOW (Continuous Bag-of-Words): CBOW predicts a target word from its surrounding context. The model takes a window of surrounding words as input and tries to predict the target word in the center of the window. It is fast to train and tends to represent frequent words well.
Skip-gram: Skip-gram reverses the CBOW setup: instead of predicting the target word from its context, it predicts the context words from the target word. The model takes a target word as input and tries to predict the surrounding context words. Skip-gram is more computationally intensive than CBOW but tends to represent rare words better.
Negative Sampling: Negative sampling avoids computing a full softmax over the vocabulary. For each observed (positive) word pair, the model also draws a few random (negative) pairs that do not co-occur in the corpus, and learns to tell the two apart. Only the sampled words are updated at each step, which speeds up training and can improve the quality of the word embeddings.
Hierarchical Softmax: Hierarchical Softmax is a technique that is used to speed up the training process of word embeddings. In traditional training methods, the model must compute the probability of each word in the vocabulary for each training example. This can be computationally expensive, especially for larger vocabularies. Hierarchical Softmax solves this problem by using a binary tree to represent the probability distribution over the vocabulary. This technique can significantly speed up training times for larger vocabularies.
Subword Information: Subword information is a technique that is used to improve the representation of rare words and words with misspellings. In this technique, the model learns representations not only for individual words but also for their subword components (e.g., prefixes, suffixes, and stems). This can improve the model’s ability to handle out-of-vocabulary words and reduce the impact of misspellings on word representations.

These techniques can be used in combination with each other to improve the quality and efficiency of word embeddings. Choosing the appropriate training technique depends on the size and complexity of the dataset, the desired speed of training, and the specific NLP task at hand.
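As an illustration of negative sampling, here is a minimal NumPy sketch of the loss for one positive pair. The vector tables, sizes and word ids are made-up assumptions, and the negatives are drawn uniformly for simplicity, whereas word2vec draws them from a smoothed unigram distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    # Positive pair: reward a high dot product between center and true context word
    positive_term = -np.log(sigmoid(center_vec @ context_vec))
    # Negative pairs: reward low dot products with the sampled noise words
    negative_term = -np.log(sigmoid(-(negative_vecs @ center_vec))).sum()
    return positive_term + negative_term

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 50, 5  # k negatives per positive pair (assumed values)
input_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))
output_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))

center_id, context_id = 3, 17  # hypothetical word ids of one observed pair
negative_ids = rng.choice(vocab_size, size=k, replace=False)

loss = negative_sampling_loss(
    input_vectors[center_id],
    output_vectors[context_id],
    output_vectors[negative_ids],
)
print("loss touches", k + 1, "output vectors instead of", vocab_size, "->", float(loss))
```

Each update involves k + 1 output vectors rather than the whole vocabulary, which is where the speed-up comes from.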

Let’s discuss them in detail:

4.1 Continuous Bag-of-Words (CBOW) model

Training the Continuous Bag-of-Words (CBOW) model is crucial in obtaining word embeddings that can effectively capture semantic relationships in a given corpus. In this section, we will explore the training process of the CBOW model, including data preprocessing, building the context window, creating input-output pairs, defining the neural network architecture, and optimizing the model’s parameters.

Data Preprocessing: Before training the CBOW model, we need to preprocess the text data. The preprocessing steps typically involve tokenization, removing punctuation, converting text to lowercase, and handling special characters. Additionally, we may remove stopwords (common words with little semantic value) and perform stemming or lemmatization to reduce word variations to their base forms.
Building the Context Window: The central idea behind CBOW is to predict a target word based on its surrounding context words. To achieve this, we define a context window size, which determines the number of words on either side of the target word that will be considered context words. A larger context window allows the model to capture more context but may lead to increased computational overhead.
Creating Input-Output Pairs: Once the context window is defined, we slide it over the preprocessed text data. We extract the context words within the window for each target word to create input-output pairs. For example, if the context window size is set to 1 (one word on either side of the target), and the sentence is “The quick brown fox jumps,” the input-output pairs would be:
– Input: [The, brown] Output: quick
– Input: [quick, fox] Output: brown
– Input: [brown, jumps] Output: fox
Defining the Neural Network Architecture: With the input-output pairs ready, we can define the CBOW neural network architecture. The architecture usually consists of an input layer, a hidden layer, and an output layer. Each word in the context window will be represented as a one-hot encoded vector in the input layer. The hidden layer contains the embedding layer, where the word representations are learned, and the output layer predicts the target word.
Training the CBOW Model: The training process involves feeding the input-output pairs into the CBOW model and adjusting the model’s parameters to minimize the prediction error. Common optimization algorithms, such as stochastic gradient descent (SGD) or Adam, update the model’s weights during training. The optimization process aims to find the word embeddings that best capture the semantic relationships between words in the corpus.
Softmax Activation Function: The output layer of the CBOW model typically employs the softmax activation function. Softmax converts the raw output scores into a probability distribution, allowing the model to predict the most likely word for a given context. The target word’s one-hot encoded vector is compared to the predicted probability distribution, and the error is backpropagated through the network to update the model’s parameters.
Training Epochs and Batch Size: During training, we iterate over the input-output pairs multiple times, known as epochs. The number of epochs determines how often the training dataset is processed. Additionally, input-output pairs are usually divided into batches to accelerate training and utilize parallelism. The batch size is a hyperparameter that controls the number of samples processed in each training step.
Evaluation during Training: To monitor the CBOW model’s training progress and prevent overfitting, it is essential to evaluate the model’s performance on a validation set during training. The validation set contains input-output pairs that are distinct from the training set. By assessing the model’s performance on this set, we can determine if it is generalizing well to unseen data and whether it is appropriate to stop training or make adjustments.
Hyperparameter Tuning: As mentioned previously, CBOW has several hyperparameters, including the context window size, embedding dimension, learning rate, and batch size. Hyperparameter tuning involves systematically experimenting with different combinations of hyperparameters to find the optimal configuration that yields the best performance on the validation set.
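The input-output pair step described above can be sketched in plain Python. The function name cbow_pairs is a hypothetical helper; with one word on either side of the target, it yields three pairs for the example sentence.

```python
def cbow_pairs(tokens, window_size):
    """Return (context, target) pairs for targets that have a full window on both sides."""
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        # Words to the left of the target followed by words to the right
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "The quick brown fox jumps".split()
for context, target in cbow_pairs(tokens, window_size=1):
    print("Input:", context, "Output:", target)
```

Widening the window to two words on either side leaves a single pair for this sentence, with "brown" as the target.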

Training the CBOW model involves preprocessing the text data, creating input-output pairs, defining the neural network architecture, and optimizing the model’s parameters using an optimization algorithm. By training the CBOW model on a large corpus of text data, we obtain word embeddings that capture the contextual relationships between words, empowering us to leverage these embeddings for various downstream NLP tasks, such as word similarity, text classification, and sentiment analysis. The success of the CBOW model lies in its ability to efficiently produce meaningful word representations, facilitating better language understanding and enhancing the performance of NLP applications.

4.1.1 Continuous Bag-of-Words (CBOW) with Python and TensorFlow

Implementing Continuous Bag-of-Words (CBOW) with Python involves setting up the environment, preparing the data, creating the CBOW neural network architecture, training the model, and evaluating its performance. Below is a step-by-step guide to implementing CBOW using Python and TensorFlow, one of the popular deep learning frameworks for NLP.

**CBOW Architecture | Credits: Praveen Kumar Anwla**

1. Set Up the Environment: Ensure you install Python and TensorFlow. You can install TensorFlow using pip:

pip install tensorflow

2. Prepare the Data: Load your text corpus and preprocess it. Tokenize the sentences, remove punctuation, convert text to lowercase, and create a vocabulary with unique words. Assign an index to each word in the vocabulary.

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample corpus
corpus = [
    "the quick brown fox jumps",
    "over the lazy dog",
    "hello world",
    # Add more sentences as needed
]
# Tokenize and create vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1

3. Create Input-Output Pairs: For CBOW, create input-output pairs by sliding a context window over the sentences. The context window size determines the number of words on either side of the target word to be considered context words.

import numpy as np

context_window = 2
def generate_data(corpus, context_window, tokenizer):
    sequences = tokenizer.texts_to_sequences(corpus)
    X, y = [], []
    for sequence in sequences:
        for i in range(context_window, len(sequence) - context_window):
            context = sequence[i - context_window : i] + sequence[i + 1 : i + context_window + 1]
            target = sequence[i]
            X.append(context)
            y.append(target)
    return np.array(X), np.array(y)
X_train, y_train = generate_data(corpus, context_window, tokenizer)

4. Create CBOW Model Architecture: Define the CBOW neural network architecture using TensorFlow. The model consists of an embedding layer, followed by an average pooling layer, and a dense output layer.

embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=context_window*2),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

5. Train the CBOW Model: Train the CBOW model using the prepared input-output pairs.

epochs = 50
batch_size = 16

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)

6. Evaluate the CBOW Model: After training, you can evaluate the CBOW model’s performance on word similarity tasks, analogy tasks, or any other specific NLP evaluation task.

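# Perform evaluation on held-out data if available
# (X_test, y_test are assumed to be built with generate_data from a separate test corpus)
# The model was compiled without metrics, so evaluate() returns only the loss
test_loss = model.evaluate(X_test, y_test, batch_size=batch_size)
print(f"Test Loss: {test_loss}")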
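For the word similarity evaluation mentioned in step 6, one option is to rank words by cosine similarity of their embedding rows. The sketch below uses a small hand-made matrix as a hypothetical stand-in for trained weights; with the Keras model above, the matrix could be read from the embedding layer with model.layers[0].get_weights()[0].

```python
import numpy as np

def most_similar(word, word_index, embedding_matrix, top_n=3):
    index_word = {i: w for w, i in word_index.items()}
    # Normalise rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    unit = embedding_matrix / np.clip(norms, 1e-12, None)
    scores = unit @ unit[word_index[word]]
    # Rank all rows, skipping the query word and rows without a word (e.g. padding)
    ranked = [int(i) for i in np.argsort(-scores)
              if int(i) != word_index[word] and int(i) in index_word]
    return [(index_word[i], float(scores[i])) for i in ranked[:top_n]]

# Made-up 2-dimensional vectors; index 0 is left for padding, as in the Tokenizer's word_index
word_index = {"cat": 1, "dog": 2, "car": 3}
embedding_matrix = np.array([
    [0.0, 0.0],   # padding row
    [1.0, 0.1],   # cat
    [0.9, 0.2],   # dog
    [-0.2, 1.0],  # car
])
neighbours = most_similar("cat", word_index, embedding_matrix, top_n=2)
print(neighbours)
```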

Implementing Continuous Bag-of-Words (CBOW) with Python and TensorFlow involves data preprocessing, defining the CBOW neural network architecture, training the model, and evaluating its performance. Following these steps, you can create word embeddings using CBOW and utilize them for various NLP tasks, such as word similarity, sentiment analysis, and text classification. Remember to adjust hyperparameters, context window size, and other settings based on your specific NLP task and dataset for optimal results.

Let’s understand how a CBOW model works with the help of code.

(here are links to the Notebook and original paper)

To begin with, the way we encode words as numbers stays the same as before.

Step 1: Define a function to create a context window with n words from the right and left of the target word.

def CBOW(raw_text, window_size=2):
   data = []
   for i in range(window_size, len(raw_text) - window_size):
       # two words to the left and two words to the right of the target
       # (the four explicit indices below assume window_size=2)
       context = [raw_text[i - window_size], raw_text[i - (window_size - 1)], raw_text[i + (window_size - 1)], raw_text[i + window_size]]
       target = raw_text[i]
       data.append((context, target))
   return data

The function takes two arguments: the data and the window size. The window size defines how many words we take from the left and from the right of the target word.

The for loop, for i in range(window_size, len(raw_text) - window_size):, iterates over a range that starts at the window size and ends window_size words before the end of the sentence. With a window size of 2, it skips the words at index 0 and 1 and stops 2 words before the sentence ends, so every target word has a full context on both sides.

Inside the for loop, we separate the context and target words and store them in a list.

For example, if the sentence is “The dog is eating and the cat is lying on the floor”, CBOW with window 2 will take the words ‘The’, ‘dog’, ‘eating’ and ‘and’ as context, making ‘is’ the target word.

Let i = window size = 2, then:

context = [raw_text[2 - 2], raw_text[2 - (2 - 1)], raw_text[2 + (2 - 1)], raw_text[2 + 2]]
target = raw_text[2]

Let’s call the function and see the output.

raw_text = "The dog is eating and the cat is lying on the floor".split()
data = CBOW(raw_text)
print(data[0])

Output:
(['The', 'dog', 'eating', 'and'], 'is')

Step 2: Build the model.

Building a CBOW is similar to building the NNLM we did earlier, but actually much simpler.

In the CBOW model, we reduce the hidden layer to only one. So all together we have: an embedding layer, a hidden layer which passes through the ReLU layer, and an output layer.

class CBOW_Model(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW_Model, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # lookup table for the context words
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # sum the embeddings of all context words into a single vector
        embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        return out

This model is pretty straightforward. The indices of the context words are fed into the embedding layer, and the summed embeddings are passed through the hidden layer followed by the nonlinear activation layer, i.e. ReLU. Finally, the output layer returns a score for every word in the vocabulary.

Step 3: Loss and optimization function.

Similar to NNLM, we use the same technique for calculating the probability distribution over all the words in the vocabulary, i.e. nn.CrossEntropyLoss().

For optimization, we use Stochastic Gradient Descent. You can use the Adam optimizer as well. In NLP, Adam is a common default because it often converges faster than SGD.

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

Step 4: Training

Training is the same as the NNLM model.

loss_function = nn.CrossEntropyLoss()

for epoch in range(50):
    total_loss = 0

    for context, target in data:
        # make_context_vector maps the context words to a tensor of word indices
        context_vector = make_context_vector(context, word_to_ix)
        output = model(context_vector)
        target = torch.tensor([word_to_ix[target]])
        total_loss += loss_function(output, target)

    # optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

make_context_vector turns words into numbers.
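The helper and the vocabulary lookup are not shown above, so here is a minimal end-to-end sketch of how the pieces might fit together. The body of make_context_vector, the word_to_ix dictionary, the cbow_pairs helper and the embedding size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

raw_text = "The dog is eating and the cat is lying on the floor".split()
word_to_ix = {word: ix for ix, word in enumerate(sorted(set(raw_text)))}

def make_context_vector(context, word_to_ix):
    # Assumed helper: look up each context word's index and pack them into a LongTensor
    return torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

def cbow_pairs(tokens, window_size=2):
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

class CBOW_Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # Sum the context embeddings into one vector of shape [1, embedding_dim]
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        return self.linear2(self.activation_function1(self.linear1(embeds)))

data = cbow_pairs(raw_text)
model = CBOW_Model(len(word_to_ix), embedding_dim=16)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for epoch in range(50):
    total_loss = 0
    for context, target in data:
        output = model(make_context_vector(context, word_to_ix))
        total_loss = total_loss + loss_function(output, torch.tensor([word_to_ix[target]]))
    # One optimizer step per epoch on the summed loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    losses.append(total_loss.item())

print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
```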

It’s worth noting that the authors of this paper found that NNLM preserves linear relationships between related words. For example, the relationship between ‘king’ and ‘queen’ is the same as the one between ‘men’ and ‘women’, i.e. NNLM preserves gender linearity.

Similarly, models such as CBOW and any neural network model that we’ll be discussing next will preserve linear relationships, even though we specifically define nonlinearity in the neural network.

4.2 Skip-gram Model

Here, we’ll provide a step-by-step guide on how to use a skip-gram model.

1. Data Preparation:

The first step is to prepare your training data, typically a large text corpus. The text is tokenized into individual words and optionally preprocessed by removing punctuation, converting to lowercase, etc.

2. Context-Target Pairs:

The skip-gram model aims to predict the surrounding context words for each word in the training data. The context is defined by a window size, which determines the number of words before and after the target word that are considered context words.
Consider an example sentence: “I love to eat pizza.”
If we set the window size to 2, the context-target pairs for the word “love” would be:
Context: [I, to, eat]
Target: love

Similarly, we create context-target pairs for all the words in the training data.

3. Neural Network Architecture:

The skip-gram model is a shallow neural network with a single hidden layer, the projection layer.
The input layer represents the target word, and the projection layer represents the word embeddings or vector representations.
The projection layer has weights that correspond to each word in the vocabulary. Each weight vector represents the word embedding for that particular word.
The size of the projection layer (the dimensionality of the word embeddings) is a hyperparameter that needs to be specified before training.

4. Training:

The objective of training the skip-gram model is to maximize the probability of correctly predicting the context words given the target word.
This is typically done using stochastic gradient descent (SGD) or other optimization algorithms.
The training process involves updating the weights of the projection layer to minimize the loss between the predicted and actual context words.
The model learns to adjust the word embeddings such that similar words have similar vector representations in the embedding space.

5. Word Embeddings:

Once the skip-gram model is trained, the word embeddings are extracted from the projection layer.
These word embeddings capture the semantic relationships between words in the training data.
The dimensionality of the word embeddings, determined by the size of the projection layer, can be chosen based on the desired trade-off between computational efficiency and semantic expressiveness.
The word embeddings can be used as input features for various downstream NLP tasks or for measuring word similarity, clustering words, and other linguistic analyses.
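The context-target pairs from step 2 can be generated with a few lines of plain Python. The function name skipgram_pairs is a hypothetical helper; unlike a fixed-size window, it clips the window at the sentence boundaries, which is why “love” gets only one word on its left.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs, clipping the window at sentence boundaries."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "I love to eat pizza".split()
pairs = skipgram_pairs(tokens, window_size=2)
love_context = [context for target, context in pairs if target == "love"]
print(love_context)  # ['I', 'to', 'eat']
```

Each (target, context) pair becomes one training example, so a wider window produces more examples per target word.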

4.2.1 Skip-Gram Model with Python and TensorFlow

Let’s understand how a skip-gram model works with the help of code.

(here are links to the Notebook and original paper)

A skipgram model is the same as the CBOW model with one difference. The difference lies in creating the context and the target word.

Step 1: Setting target and context variable.

Since skipgram takes a single target word and predicts n context words, we just need to flip the CBOW setup from the previous model.

def skipgram(word_sequence, window_size=1):
    skip_grams = []
    for i in range(window_size, len(word_sequence) - window_size):
        target = word_sequence[i]
        # the word window_size positions before and the word window_size positions after the target
        context = [word_sequence[i - window_size], word_sequence[i + window_size]]
        for w in context:
            skip_grams.append([target, w])
    return skip_grams

As you can see, the function is almost the same.

Here, you need to understand that when the window size is 1, we take one word before and after the target word.

When we call the function, the output looks something like this:

print(skipgram(word_sequence)[0:2])

Output:
[['my', 'During'], ['my', 'second']]

As you can see, the target word is ‘my’ and the two words are ‘During’ and ‘second’.

Essentially, we’re trying to create a pair of words such that each pair will contain a target word. Depending on the context window, it will contain the neighboring words.

Step 2: Building the model.

The model is pretty straightforward.

class skipgramModel(nn.Module):
   def __init__(self):
       super(skipgramModel, self).__init__()
       self.embedding = nn.Embedding(voc_size, embedding_size)
       self.W = nn.Linear(embedding_size, embedding_size, bias=False)
       self.WT = nn.Linear(embedding_size, voc_size, bias=False)

   def forward(self, X):
       embeddings = self.embedding(X)
       hidden_layer = nn.functional.relu(self.W(embeddings))
       output_layer = self.WT(hidden_layer)
       return output_layer

The loss function and optimisation remain the same.

model = skipgramModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Once we’ve defined everything, we can train the model.

for epoch in range(5000):
    # random_batch() is a helper (not shown) that samples target/context pairs;
    # with nn.Embedding it must return word indices, not one-hot vectors
    input_batch, target_batch = random_batch()
    input_batch = torch.LongTensor(input_batch)  # nn.Embedding expects integer indices
    target_batch = torch.LongTensor(target_batch)

    optimizer.zero_grad()
    output = model(input_batch)
    # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
    loss = criterion(output, target_batch)
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
    loss.backward()
    optimizer.step()

The skip-gram model is more computationally expensive than CBOW because each target word produces one prediction per context word, so the cost grows with the window size. The more distant words also tend to be slightly less related to the current word.

4.3 GloVE Model

Implementing GloVe embeddings in practical applications within Natural Language Processing involves several key steps, from accessing pre-trained embeddings to fine-tuning them for specific tasks:

1. Accessing Pre-trained GloVe Embeddings

Initially, it is essential to obtain pre-trained GloVe embeddings. These embeddings are available in various dimensions (e.g., 50, 100, 300) and trained on extensive text corpora. You can access them from repositories or the GloVe website.

2. Loading GloVe Embeddings into Models

Load the downloaded GloVe embeddings into your preferred platform or library, such as TensorFlow or PyTorch, and map words to their corresponding vectors using dictionaries or embedding matrices.

3. Integrating GloVe Embeddings in NLP Models

Use the GloVe vectors as the initial weights of an embedding layer within NLP models. For instance, in TensorFlow, these vectors can be loaded as the weights of an Embedding layer, allowing the network to start from these pre-trained representations.

4. Fine-tuning GloVe Embeddings (Optional)

Depending on the task, fine-tuning GloVe embeddings can optimize model performance. You can freeze the embeddings (trainable=False) to preserve their pre-trained features or update them during training (trainable=True) to adapt to specific domain nuances.

5. Customizing for Specific NLP Tasks

Tailor the GloVe embeddings for specialized NLP tasks. For instance, in sentiment analysis or text classification, feed these embeddings into models like recurrent neural networks (RNNs) or convolutional neural networks (CNNs) to classify sentiments or categorize texts.

6. Evaluating and Tuning Models

Assess model performance using validation sets and metrics pertinent to the task (accuracy, F1-score, etc.). Adjust hyperparameters to enhance model accuracy and generalization, including learning rate, architecture, and embedding dimensions.

7. Iterating and Refinement

Iterate through different approaches, experiment with various architectures, and consider ensembling techniques to refine model performance. To optimize results, fine-tune both the model and the GloVe embeddings.

Utilizing GloVe embeddings in NLP models empowers them with enriched semantic representations, enabling better comprehension of textual data. Effectively leveraging these embeddings contributes to superior performance across various NLP applications, enhancing language understanding, sentiment analysis, and information retrieval systems.
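Steps 2 and 3 usually come down to building one matrix whose rows line up with the model's word indices. Here is a minimal NumPy sketch; the function name build_embedding_matrix, the tiny 3-dimensional vectors and the word_index dictionary are hypothetical stand-ins for a real GloVe file and tokenizer vocabulary.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings, dim):
    # Row 0 stays zero for padding; words without a pre-trained vector also stay zero
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    missing = []
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is None:
            missing.append(word)
        else:
            matrix[idx] = vector
    return matrix, missing

# Made-up vectors standing in for real GloVe entries
glove = {
    "king": np.array([0.5, 0.1, 0.3], dtype="float32"),
    "queen": np.array([0.4, 0.2, 0.3], dtype="float32"),
}
word_index = {"king": 1, "queen": 2, "zyzzyva": 3}

matrix, missing = build_embedding_matrix(word_index, glove, dim=3)
print(matrix.shape, missing)
```

In Keras, such a matrix can then be passed to an Embedding layer, for example through embeddings_initializer=tf.keras.initializers.Constant(matrix), with trainable=False to freeze it or trainable=True to fine-tune it.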

4.3.1 GloVe Word Embeddings In Python

Using GloVe embeddings in Python involves a few steps. You’ll either train your embeddings or use pre-trained ones. Here’s a basic overview using pre-trained embeddings in Python:

1. Downloading Pre-trained GloVe Embeddings

GloVe provides pre-trained word vectors trained on large corpora. You can download them from the GloVe website or other repositories.

2. Loading GloVe Embeddings into Python

Once downloaded, you’ll load these embeddings into your Python environment. You can use the embeddings directly or convert them into a Python dictionary for easy access.

import numpy as np

# Load GloVe embeddings into a dictionary: word -> vector
def load_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line holds a word followed by its vector components
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_embeddings_path = 'path_to_glove_file/glove.6B.100d.txt'  # Adjust the path to your downloaded GloVe file
glove_embeddings = load_embeddings(glove_embeddings_path)

3. Using GloVe Embeddings

Once loaded, you can use these embeddings in various NLP tasks. For example, finding the embedding of a specific word or performing operations on word vectors:

import numpy as np

# Accessing word embeddings
word = 'example'
if word in glove_embeddings:
    embedding = glove_embeddings[word]
    print(f"Embedding for '{word}': {embedding}")
else:
    print(f"'{word}' not found in embeddings")

# Finding similarity between word embeddings
from scipy.spatial.distance import cosine
word1 = 'king'
word2 = 'queen'
similarity = 1 - cosine(glove_embeddings[word1], glove_embeddings[word2])
print(f"Similarity between '{word1}' and '{word2}': {similarity}")
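
Beyond pairwise similarity, "operations on word vectors" usually means vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made 3-dimensional toy vectors (an assumption, chosen so the example is reproducible without the GloVe file); cosine_similarity and most_similar are hypothetical helper names. With a real glove_embeddings dictionary the same functions would apply unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query, embeddings, exclude=(), topn=3):
    """Rank words by cosine similarity to the query vector."""
    scores = {w: cosine_similarity(query, v)
              for w, v in embeddings.items() if w not in exclude}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Toy vectors (assumptions): axis 0 ~ royalty, axis 1 ~ male, axis 2 ~ female.
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

# king - man + woman should land closest to queen
query = toy["king"] - toy["man"] + toy["woman"]
print(most_similar(query, toy, exclude={"king", "man", "woman"}, topn=1))
```

The input words are excluded from the ranking because the query vector usually stays very close to them.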

4. Using GloVe Embeddings in Models

You can integrate these embeddings into your NLP models as input features for tasks like sentiment analysis, text classification, or any other application requiring word representations.
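
One simple way to turn word vectors into input features is to average them per sentence. The sketch below is only one option among many; sentence_vector and the 2-dimensional toy vectors are assumptions for illustration, and unknown tokens are simply skipped.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens; all-zero if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)

# Toy stand-in (assumption) for a loaded GloVe dictionary.
toy_embeddings = {
    "good":  np.array([1.0, 0.0], dtype="float32"),
    "movie": np.array([0.0, 1.0], dtype="float32"),
}

features = sentence_vector(["a", "good", "movie"], toy_embeddings, dim=2)
print(features)  # [0.5 0.5]
```

The resulting fixed-length vector can be fed to any classifier. Averaging discards word order, which is why sequence models such as RNNs or CNNs usually consume the per-token vectors instead.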

Remember to adjust the file paths and methods according to your specific use case and the dimensionality of the GloVe embeddings you’ve downloaded (e.g., glove.6B.100d.txt refers to 100-dimensional vectors trained on a 6-billion-token corpus). If not already in your environment, ensure you have the necessary dependencies installed, such as NumPy for array operations and SciPy for similarity computations.

How to Use GloVe Word Embeddings In Gensim

Gensim doesn’t directly support training GloVe embeddings, but it provides a convenient way to load pre-trained GloVe embeddings and work with them in Python. Here’s a simple guide on how to use gensim to load pre-trained GloVe embeddings:

First, ensure you have gensim installed. You can install it via pip:

pip install gensim

Once installed, you can load pre-trained GloVe embeddings using gensim:

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import get_tmpfile

# Replace with the path to your downloaded GloVe file
glove_file = 'path_to_glove_file/glove.6B.100d.txt'

# Convert GloVe format to Word2Vec format (adds the "<vocab_size> <dimensions>" header line)
word2vec_temp_file = get_tmpfile("glove_word2vec.txt")
glove2word2vec(glove_file, word2vec_temp_file)

# Load the converted GloVe embeddings using Gensim
glove_model = KeyedVectors.load_word2vec_format(word2vec_temp_file)

# Note: in Gensim 4.x the conversion step is deprecated; the GloVe file can be loaded directly with
# glove_model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

This code loads the GloVe embeddings from the file specified and stores them in glove_model.

Once loaded, you can perform various operations with the loaded model, such as finding the vector for a specific word or calculating the similarity between words:

# Example usage
word = 'example'
if word in glove_model:
    embedding = glove_model[word]
    print(f"Embedding for '{word}': {embedding}")
else:
    print(f"'{word}' not found in embeddings")

word1 = 'king'
word2 = 'queen'
similarity = glove_model.similarity(word1, word2)
print(f"Similarity between '{word1}' and '{word2}': {similarity}")

This code snippet demonstrates how to access the embedding of a specific word and find the similarity between two words using the loaded GloVe model.

Adjust the file path (glove_file) to point to your downloaded GloVe file, considering the specific dimensionality of the GloVe embeddings you are using (glove.6B.100d.txt refers to 100-dimensional vectors trained on a 6-billion-token corpus).

Summary so far:

The Neural Network Language Model (NNLM), or Bengio’s model, outperforms earlier statistical models such as the n-gram model.
NNLM also tackles the curse of dimensionality and preserves contextual, linguistic regularities and patterns through its distributed representation.
NNLM is computationally expensive.
Word2Vec models tackle this computational cost by removing the hidden layer and sharing the weights.
The downside of Word2Vec is that, without a non-linear hidden layer, it represents the data less expressively than a full neural network. The upside is that it is much more efficient, so it can be trained on far more data and still produce very accurate high-dimensional word vectors.
Word2Vec has two models: CBOW and Skip-gram. The former is faster to train than the latter.

6. Model Training Optimisation

So what are the different approaches that will make computation inexpensive and fast, while making sure that the approximation is not compromised?

In this section, we’ll cover different approaches that can reduce the computational time. Instead of computing exact probabilities over the full vocabulary, we approximate them, either over the whole vocabulary using a cheaper structure or over a small sample of words. This reduces complexity and increases processing speed.

We will discuss two approaches: softmax-based approaches and sampling-based approaches.
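
To see why the full softmax is the bottleneck, here is a rough NumPy sketch. The vocabulary size, dimension and random vectors are arbitrary assumptions; full_softmax is a hypothetical helper. The point is that a single prediction touches every one of the V output vectors, while a balanced binary tree over the same vocabulary would need only about log2(V) decisions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 50                           # assumed sizes
output_embeddings = rng.normal(size=(vocab_size, dim)) # one output vector per word
h = rng.normal(size=dim)                               # hidden/context representation

def full_softmax(h, output_embeddings):
    logits = output_embeddings @ h      # V dot products: cost grows linearly with V
    logits -= logits.max()              # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()              # normalising sum over the whole vocabulary

probs = full_softmax(h, output_embeddings)
print(probs.shape)                        # (10000,)
print(math.ceil(math.log2(vocab_size)))   # 14 binary decisions in a balanced tree
```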

6.1 Improving predictive functions

In this section, we explore three methods for making prediction cheaper: one that modifies the softmax function (hierarchical softmax) and two that replace the softmax altogether (noise contrastive estimation and negative sampling).

6.1.1 Softmax-based approaches

Softmax-based approaches modify the softmax to get a cheaper approximation of the predicted word’s probability, rather than eliminating it altogether. Two such methods are the hierarchical softmax approach and the CNN approach; this section walks through hierarchical softmax.

Hierarchical softmax

Hierarchical softmax was introduced by Morin and Bengio in 2005 as an alternative to the full softmax function, replacing the flat output layer with a hierarchical, tree-structured layer. The version described here borrows the binary Huffman tree, which reduces the cost of calculating a word’s probability from the order of the vocabulary size |V| to about log2(|V|) binary decisions.

Coding a Huffman tree is very complicated. I’ll try to explain it without using code, but you can find the notebook here and try it out.

To understand the H-softmax, we need to understand the workings of the Huffman tree.

The Huffman tree is a binary tree built from the words in the vocabulary, based on their frequency in the document.

Take for example this text: “the cat is eating and the dog is barking”. In order to create a Huffman tree, we need to calculate the frequency of every word in the vocabulary.

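raw_text = "the cat is eating and the dog is barking".split()

word_to_id = {w: i for i, w in enumerate(set(raw_text))}
id_to_word = {i: w for w, i in word_to_id.items()}
word_frequency = {w: raw_text.count(w) for w in word_to_id}

# Sort by word so the printed order is stable across runs
print(dict(sorted(word_frequency.items())))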

Output:
{'and': 1, 'barking': 1, 'cat': 1, 'dog': 1, 'eating': 1, 'is': 2, 'the': 2}

The next step is to create a Huffman tree. The way we do it is by taking the two least frequent words. In our example, we have a lot of words that occur only once, so we’re free to take any two. Let’s take ‘dog’ and ‘and’. We will then join the two leaf nodes with a parent node whose frequency is the sum of the two.

In the next step, we’ll take another word that is least frequent (again, a word that occurs only once) and we’ll put it beside the node that holds the sum of the first two. Remember that less frequent words go to the left side, and more frequent words go to the right side.

Similarly, we’ll keep on building the tree until we’ve used all the words from the vocabulary.

Remember, the least frequent words end up at the bottom of the tree, so they get the longest codes.
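
For readers who prefer code, the merging procedure above can be sketched with Python's heapq. This is a minimal illustration and not the notebook's Tree class: huffman_codes is a hypothetical helper, and the convention "left = 0, right = 1" is an assumption. Because many words tie at frequency 1, the exact codes depend on tie-breaking and may differ from the notebook's output, while the total encoded length stays optimal.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(frequencies):
    """Return {word: list of 0/1 bits} built by repeatedly merging the two rarest nodes."""
    tie = count()  # unique tie-breaker so the heap never has to compare the payloads
    heap = [(freq, next(tie), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f_left, _, left = heapq.heappop(heap)     # least frequent node goes left
        f_right, _, right = heapq.heappop(heap)   # next least frequent goes right
        heapq.heappush(heap, (f_left + f_right, next(tie), (left, right)))

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):               # inner node: (left child, right child)
            walk(node[0], prefix + [0])
            walk(node[1], prefix + [1])
        else:                                     # leaf: a word
            codes[node] = prefix
    walk(heap[0][2], [])
    return codes

raw_text = "the cat is eating and the dog is barking".split()
frequencies = Counter(raw_text)
codes = huffman_codes(frequencies)
for word, code in sorted(codes.items()):
    print(word, code)
```

More frequent words such as "the" and "is" never receive longer codes than the words that occur once.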

print(Tree.wordid_code)

Output:
{0: [0, 1, 1],
 1: [0, 1, 0],
 2: [1, 1, 1, 1],
 3: [1, 1, 1, 0],
 4: [0, 0],
 5: [1, 1, 0],
 6: [1, 0]}

Once the tree is created, we can then start the training.

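With the Huffman tree, we no longer learn an output embedding w′ for every word. Instead, each inner node of the tree has its own vector, and we calculate the probability of turning right or left at each inner node using a sigmoid function.

$$p(\text{right} \mid n, c) = \sigma(h^\top w'_n), \qquad p(\text{left} \mid n, c) = 1 - \sigma(h^\top w'_n)$$

where n is the node, c is the context, h is the hidden representation of the context, and w′_n is the vector of node n.

As you will find in the code below, a sigmoid function is used to decide whether to go right or left. The probability of a word is the product of these decisions along the path from the root to the word’s leaf. It’s also important to know that the probabilities of all the words should sum up to 1. This ensures that the H-softmax has a normalized probability distribution over all the words in the vocabulary.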
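
The normalisation claim can be checked numerically. The sketch below reuses the codes from the Tree.wordid_code output above and assigns a random vector to every inner node; the random vectors, the dimension and the convention "bit 1 means right" are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Codes copied from the Tree.wordid_code output above (word id -> path from the root).
wordid_code = {0: [0, 1, 1], 1: [0, 1, 0], 2: [1, 1, 1, 1], 3: [1, 1, 1, 0],
               4: [0, 0], 5: [1, 1, 0], 6: [1, 0]}

rng = np.random.default_rng(0)
dim = 8
h = rng.normal(size=dim)   # stand-in for the hidden/context vector

# One vector per inner node; an inner node is identified by the path prefix leading to it.
inner_vectors = {}
for code in wordid_code.values():
    for depth in range(len(code)):
        inner_vectors.setdefault(tuple(code[:depth]), rng.normal(size=dim))

def word_probability(code):
    """Multiply the left/right probabilities along the path from the root to the leaf."""
    prob = 1.0
    for depth, bit in enumerate(code):
        p_right = sigmoid(h @ inner_vectors[tuple(code[:depth])])
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob

probs = {wid: word_probability(code) for wid, code in wordid_code.items()}
total = sum(probs.values())
print(len(inner_vectors), total)
```

With 7 words there are 6 inner nodes, and the word probabilities add up to 1 (up to floating-point error) without any sum over the vocabulary.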

import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipGramModel(nn.Module):
    def __init__(self, emb_size, emb_dimension):
        super(SkipGramModel, self).__init__()
        self.emb_size = emb_size            # vocabulary size V
        self.emb_dimension = emb_dimension
        # 2V - 1 rows: V leaves (words) plus V - 1 inner nodes of the Huffman tree
        self.w_embeddings = nn.Embedding(2*emb_size-1, emb_dimension, sparse=True)
        self.v_embeddings = nn.Embedding(2*emb_size-1, emb_dimension, sparse=True)
        self._init_emb()

    def _init_emb(self):
        initrange = 0.5 / self.emb_dimension
        self.w_embeddings.weight.data.uniform_(-initrange, initrange)
        self.v_embeddings.weight.data.uniform_(-0, 0)  # i.e. initialised to all zeros

    def forward(self, pos_w, pos_v, neg_w, neg_v):
        # The pos_* and neg_* index pairs get opposite signs inside the sigmoid below,
        # one sign per branch direction taken in the tree
        emb_w = self.w_embeddings(torch.LongTensor(pos_w))
        neg_emb_w = self.w_embeddings(torch.LongTensor(neg_w))
        emb_v = self.v_embeddings(torch.LongTensor(pos_v))
        neg_emb_v = self.v_embeddings(torch.LongTensor(neg_v))
        # Row-wise dot products between the paired vectors
        score = torch.sum(torch.mul(emb_w, emb_v), dim=1)
        score = F.logsigmoid(-1 * score)
        neg_score = torch.sum(torch.mul(neg_emb_w, neg_emb_v), dim=1)
        neg_score = F.logsigmoid(neg_score)
        # loss = -[sum log sigmoid(-Xw.T * θv) over pos pairs + sum log sigmoid(Xw.T * θv) over neg pairs]
        loss = -1 * (torch.sum(score) + torch.sum(neg_score))
        return loss

6.1.2 Sampling-based approaches

Sampling-based approaches completely eliminate the softmax layer.

We’ll discuss two approaches: noise contrastive estimation, and negative sampling.

Noise contrastive estimation

Noise contrastive estimation (NCE) is an approximation method that replaces the softmax layer and reduces the computational cost. It does so by converting the prediction problem into a classification problem.

This section will contain a lot of mathematical explanations.

NCE takes an unnormalised multinomial function (i.e. a function that has multiple labels and whose output has not been passed through a softmax layer), and converts it to a binary logistic regression.

In order to learn the distribution that predicts the target word (wt) from some specific context (c), we need to create two classes: positive and negative. The positive class contains samples from the training data distribution, while the negative class contains samples from a noise distribution Q, and we label them 1 and 0 respectively. The noise distribution is a unigram distribution of the training set.

For every target word and its context, we draw k noise samples from Q, so that noise samples are k times more frequent than samples from the data distribution Pd(w | c).

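Because we are effectively sampling words from both distributions, a sampled word comes from a mixture of the two, weighted by how often each one is sampled. Hence,

$$P_{\text{mix}}(w \mid c) = \frac{1}{k+1} P_d(w \mid c) + \frac{k}{k+1} Q(w)$$

As mentioned earlier, NCE is a binary classifier with the true label ‘1’ and the false label ‘0’. Intuitively,

When y=1,

$$P(y=1 \mid w, c) = \frac{\frac{1}{k+1} P_d(w \mid c)}{\frac{1}{k+1} P_d(w \mid c) + \frac{k}{k+1} Q(w)} = \frac{P_d(w \mid c)}{P_d(w \mid c) + k\,Q(w)}$$

When y=0,

$$P(y=0 \mid w, c) = \frac{k\,Q(w)}{P_d(w \mid c) + k\,Q(w)}$$

Our aim is to develop a model with parameters θ, such that given a context c, its predicted probability Pθ(w | c) approximates the original data distribution Pd(w | c).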

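Generally, the expectation over the noise distribution is approximated by sampling. We do that by generating k noise samples {wij} for each training word wi with context ci, which gives the objective:

$$J_\theta = -\sum_{i}\left[\log P(y=1 \mid w_i, c_i) + \sum_{j=1}^{k} \log P(y=0 \mid w_{ij}, c_i)\right]$$

The model’s probability is itself a softmax:

$$P_\theta(w \mid c) = \frac{\exp(s_\theta(w \mid c))}{Z_\theta(c)}$$

where Zθ(c) is the normalizing term from the softmax, which, as you recall, is what we are trying to eliminate. One way to eliminate Zθ(c) is to make it a learnable parameter. Essentially, we turn the normalizer from a fixed quantity that has to be recomputed by summing over all the words in the vocabulary again and again into a value that the model adjusts during training.

But, as it turns out, Mnih et al. (2013) stated that Zθ(c) can be fixed at 1. Even though this makes it static again, it normalizes quite well: Zoph et al. (2016) found that Zθ(c)=1 produces a model with low variance.

With Zθ(c)=1, we can replace Pθ(w | c) with exp(sθ(w | c)), so that the loss function can be written as:

$$J_\theta = -\sum_{i}\left[\log\frac{\exp(s_\theta(w_i \mid c_i))}{\exp(s_\theta(w_i \mid c_i)) + k\,Q(w_i)} + \sum_{j=1}^{k}\log\frac{k\,Q(w_{ij})}{\exp(s_\theta(w_{ij} \mid c_i)) + k\,Q(w_{ij})}\right]$$

One thing to keep in mind is that as we increase the number of noise samples k, the NCE gradient approaches the likelihood gradient of the normalised (softmax) model.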

In conclusion, NCE is a way of learning a data distribution by comparing it against a noise distribution, and modifying the learning parameters such that the model Pθ is almost equal to Pd.
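
As a rough numerical sketch of the idea, assuming the normalizer is fixed at 1 so that the model probability is simply exp(score): the classifier assigns probability exp(s) / (exp(s) + k·Q(w)) to "this word came from the data" and k·Q(w) / (exp(s) + k·Q(w)) to "this word is noise". The function name nce_loss, the toy scores and k = 5 are assumptions for illustration; in a real model the scores would come from the network.

```python
import numpy as np
from collections import Counter

raw_text = "the cat is eating and the dog is barking".split()
counts = Counter(raw_text)
vocab = sorted(counts)

# Unigram noise distribution Q over the vocabulary
Q = np.array([counts[w] for w in vocab], dtype=float)
Q /= Q.sum()

def nce_loss(target_score, target_id, noise_scores, noise_ids, k):
    """NCE loss for one (context, target) pair, with the normalizer fixed at 1."""
    true_term = np.exp(target_score) / (np.exp(target_score) + k * Q[target_id])
    kq = k * Q[noise_ids]
    noise_term = kq / (np.exp(noise_scores) + kq)
    return -(np.log(true_term) + np.log(noise_term).sum())

rng = np.random.default_rng(0)
k = 5
noise_ids = rng.choice(len(vocab), size=k, p=Q)   # k noise words drawn from Q
noise_scores = rng.normal(size=k)                 # stand-ins for the model's scores
target_id = vocab.index("cat")

loss_low = nce_loss(0.0, target_id, noise_scores, noise_ids, k)
loss_high = nce_loss(3.0, target_id, noise_scores, noise_ids, k)
print(loss_low > loss_high)  # True: a higher score for the true word lowers the loss
```

Each update touches only the target word and the k sampled noise words, never the whole vocabulary.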

Negative sampling

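It’s important to understand NCE, because negative sampling is a modified, more simplified version of it.

To begin with, recall that as we increase the number of noise samples k, the NCE gradient approaches the likelihood gradient of the normalised (softmax) model.

The way negative sampling works is that it gets rid of the noise term k Q(w) by replacing it with 1. Intuitively,

When y=1,

$$P(y=1 \mid w, c) = \frac{P_\theta(w \mid c)}{P_\theta(w \mid c) + 1}$$

In negative sampling we use a sigmoid function, so we’ll transform the above equation to:

$$P(y=1 \mid w, c) = \frac{1}{1 + \dfrac{1}{P_\theta(w \mid c)}}$$

We know that Pθ(w | c) is replaced with exp(sθ(w | c)).

Therefore,

$$P(y=1 \mid w, c) = \frac{1}{1 + \exp(-s_\theta(w \mid c))}$$

This makes the equation shorter. It has to compute 1 instead of the noise term, so the equation becomes computationally efficient. But why do we care to simplify NCE?

One reason is that we’re concerned with high-quality word vector representations rather than with an exact probability distribution, so we can simplify the model as long as the word embeddings it produces retain their quality.

If we substitute k Q(w) = 1 into the final NCE loss, we get:

$$J_\theta = -\sum_{i}\left[\log\frac{\exp(s_\theta(w_i \mid c_i))}{\exp(s_\theta(w_i \mid c_i)) + 1} + \sum_{j=1}^{k}\log\frac{1}{\exp(s_\theta(w_{ij} \mid c_i)) + 1}\right]$$

Dividing the numerator and denominator of the first fraction by exp(sθ(wi | ci)),

therefore,

$$J_\theta = -\sum_{i}\left[\log\frac{1}{1 + \exp(-s_\theta(w_i \mid c_i))} + \sum_{j=1}^{k}\log\frac{1}{1 + \exp(s_\theta(w_{ij} \mid c_i))}\right]$$

Since we’re dealing with the sigmoid function, i.e.

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

we can modify the equation above to:

$$J_\theta = -\sum_{i}\left[\log\sigma(s_\theta(w_i \mid c_i)) + \sum_{j=1}^{k}\log\sigma(-s_\theta(w_{ij} \mid c_i))\right]$$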

(Here are the links to the Notebook and original paper)
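
In code, the negative sampling loss for one training pair is minus the sum of log σ(score) for the target word and log σ(−score) for each noise word. The sketch below is not the notebook's implementation: negative_sampling_loss, log_sigmoid and the 2-dimensional toy vectors are assumptions chosen so the result can be verified by hand.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(h, target_vec, noise_vecs):
    """Loss for one (context, target) pair and its k noise words."""
    positive = log_sigmoid(target_vec @ h)            # pull the true word towards the context
    negative = log_sigmoid(-(noise_vecs @ h)).sum()   # push the noise words away
    return -(positive + negative)

h = np.array([1.0, 0.0])                          # context representation
target_vec = np.array([2.0, 0.0])                 # score  2 with h
noise_vecs = np.array([[-2.0, 0.0],               # score -2 with h
                       [0.0, 1.0]])               # score  0 with h

loss = negative_sampling_loss(h, target_vec, noise_vecs)
print(loss)
```

Here the loss equals 2·log(1 + e^-2) + log 2: the two confident terms contribute little, and the undecided noise word (score 0) contributes log 2.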

7. Considerations for Deploying Word Embedding Models

Same pipeline: when deploying your model, use the same preprocessing pipeline that was used to create the training data for the word embeddings. If you use a different tokenizer or a different method of handling white space, punctuation, etc., you might end up with incompatible inputs.
Out-of-vocabulary (OOV) words: some words in your input will not have a pre-trained vector. One option is to replace those words with “UNK”, which means unknown, and then handle them separately.
Dimension mismatch: vectors can have many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout.
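
The last two points can be guarded against with a small lookup wrapper. This is a sketch under stated assumptions: lookup is a hypothetical helper, the "UNK" entry and the 3-dimensional toy vectors are made up for the example, and falling back to an all-zero vector when no "UNK" entry exists is just one possible policy.

```python
import numpy as np

def lookup(tokens, embeddings, expected_dim, unk_token="UNK"):
    """Map tokens to vectors, falling back to the UNK vector and checking dimensions."""
    unk_vector = embeddings.get(unk_token, np.zeros(expected_dim, dtype="float32"))
    rows = []
    for token in tokens:
        vector = embeddings.get(token, unk_vector)
        if vector.shape != (expected_dim,):
            raise ValueError(
                f"'{token}' has shape {vector.shape}, expected ({expected_dim},)"
            )
        rows.append(vector)
    if not rows:
        return np.empty((0, expected_dim), dtype="float32")
    return np.stack(rows)

# Toy embeddings (assumption); a real table would come from the loaded GloVe file.
toy_embeddings = {
    "cat": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "UNK": np.array([0.0, 0.0, 1.0], dtype="float32"),
}

batch = lookup(["cat", "axolotl"], toy_embeddings, expected_dim=3)
print(batch.shape)  # (2, 3)
```

Failing loudly on a dimension mismatch at lookup time is usually easier to debug than a shape error deep inside the model.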

8. How to choose an embedding model?

Ever since the release of ChatGPT and the advent of the aptly described LLM Wars, there has also been a mad rush in developing embedding models. There are many evolving standards of evaluating LLMs and embeddings alike. There is no right answer to “Which embeddings model to use?”. However, you may notice particular embeddings working better for specific use cases (like summarization, text generation, classification etc.)

OpenAI used to recommend different embedding models for different use cases. However, they now recommend the text-embedding-3 models (text-embedding-3-small and text-embedding-3-large) for all tasks.

MTEB Leaderboard at Hugging Face evaluates almost all available embedding models across seven use cases — Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity (STS) and Summarisation.

Another important consideration is cost. With OpenAI models you can incur significant costs if you are working with a lot of documents. The cost of open-source models will depend on the implementation.

9. Conclusion

The journey of word embeddings has evolved from simple one-hot encoding to advanced transformer-based models. Starting with methods like Word2Vec and GloVe, which provided static embeddings, the field has moved to contextual embeddings with ELMo, BERT, and GPT, enabling more nuanced and sophisticated understanding and generation of human language. This evolution reflects significant advancements in capturing the complexities of human language, culminating in the powerful capabilities of contemporary LLMs and transformers.

10. Test Your Knowledge!

1. How would you interpret the cosine similarity between two word vectors, and what does it signify about the relationship between the words? — Expected Answer: Cosine similarity measures the cosine of the angle between two vectors in a vector space. A cosine similarity close to 1 indicates that the vectors point in nearly the same direction and likely represent words with similar meanings. A cosine similarity of 0 indicates that the vectors are orthogonal, meaning the words are unrelated. A negative cosine similarity means the vectors point in opposing directions, which suggests dissimilar usage, though not necessarily opposite meanings, since antonyms often appear in similar contexts.

2. Discuss the limitations of vector space representations in capturing polysemy and homonymy. How do modern embedding techniques address these limitations? — Expected Answer: Traditional vector space models like Word2Vec assign a single vector to each word, which fails to capture polysemy (multiple related meanings of a word) and homonymy (words that share the same form but have unrelated meanings, such as “bank”). Modern embedding techniques like contextual embeddings (e.g., BERT) address this by generating different vectors for the same word based on its context, effectively capturing the different meanings.

3. What are the trade-offs between using a large vocabulary size and reducing the vocabulary size when training word embeddings? — Expected Answer: A large vocabulary size allows the model to capture a wider range of words and nuances but increases computational complexity and memory requirements. Reducing the vocabulary size can lead to faster training and reduced resource usage but may miss important words or nuances, leading to poorer embeddings for less frequent words.

4. In what scenarios would you prefer to use a sampling-based approach (like negative sampling) over a full softmax in training word embeddings? — Expected Answer: Negative sampling is preferred when working with large datasets and vocabularies because it significantly reduces the computational cost by only updating a small subset of weights. It is particularly useful when the focus is on capturing similarity between words rather than modeling the full distribution over the vocabulary, as in models like Word2Vec.

5. How do BERT embeddings differ from Word2Vec embeddings in terms of capturing word meaning and context? — Expected Answer: BERT embeddings are contextual, meaning they generate different embeddings for the same word depending on the surrounding context. This allows BERT to capture the dynamic meaning of words based on their usage in a sentence. In contrast, Word2Vec produces static embeddings where each word has a single vector representation, irrespective of context.

6. Explain how FastText improves on Word2Vec by incorporating subword information. How does this affect the quality of embeddings for rare words? — Expected Answer: FastText improves on Word2Vec by representing words as bags of character n-grams, which allows it to generate embeddings for words by composing the embeddings of their subwords. This approach helps in generating better embeddings for rare or out-of-vocabulary words, as it can leverage subword information to create meaningful representations even when full word data is sparse.

7. Discuss the use of hierarchical softmax and its impact on training efficiency for large datasets. — Expected Answer: Hierarchical softmax is a technique used to approximate the softmax function in large vocabulary settings. It structures the vocabulary into a binary tree, allowing the model to compute probabilities in logarithmic time rather than linear time. This greatly improves training efficiency, particularly when dealing with very large datasets and vocabularies, by reducing the number of computations required.

8. Explain the challenges of optimizing embeddings for domain-specific applications, and how you would address these challenges. — Expected Answer: Domain-specific applications may require embeddings that capture nuances unique to the domain, which general-purpose embeddings might miss. Challenges include limited domain-specific data, vocabulary mismatch, and the need for fine-tuning. To address these, one could use transfer learning, where a general-purpose model is fine-tuned on domain-specific data, or train embeddings from scratch using domain-specific corpora.

9. How would you evaluate the quality of word embeddings for a specific task like sentiment analysis? What metrics would you use? — Expected Answer: The quality of word embeddings for sentiment analysis can be evaluated using both intrinsic and extrinsic metrics. Intrinsic evaluations include tasks like word similarity or analogy tasks. Extrinsic evaluation involves using the embeddings in a downstream task (like sentiment analysis) and measuring performance using metrics such as accuracy, F1 score, or AUC. The embeddings’ ability to differentiate between positive and negative sentiment words in the task-specific context is crucial.

10. Discuss the potential pitfalls of relying solely on intrinsic evaluation metrics like analogy tasks when assessing the quality of word embeddings. — Expected Answer: Intrinsic metrics like analogy tasks often test the geometric properties of embeddings but may not correlate well with downstream task performance. They can give a false sense of quality, as they do not account for the specific requirements of a particular application. For instance, embeddings that perform well on analogy tasks may still fail to capture the nuances required for sentiment analysis or named entity recognition. Extrinsic evaluation, therefore, provides a more practical measure of embedding quality for specific tasks.

11. What are some potential challenges in deploying contextual embeddings like BERT in a low-latency environment, and how would you mitigate them? — Expected Answer: Deploying BERT in a low-latency environment is challenging due to its large size and computational complexity, which can lead to slow inference times. Mitigation strategies include model distillation (reducing the size of the model while retaining performance), quantization (reducing the precision of the model’s weights), or using more efficient variants like DistilBERT or ALBERT. Additionally, techniques like caching embeddings for common phrases or using a two-stage approach where a simpler model filters inputs before passing them to BERT can help reduce latency.

12. How would you handle the issue of out-of-vocabulary (OOV) words when deploying a word embedding model in production? — Expected Answer: Handling OOV words can be approached by using models that incorporate subword information, such as FastText, which can generate embeddings for unseen words by breaking them into known subwords. Another approach is to use contextual models like BERT, which can infer the meaning of OOV words based on context. In some cases, you might also maintain a fallback mechanism, where OOV words are mapped to a generic vector representing unknown words or handled through techniques like character-level embeddings or hashing.
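
The subword idea behind questions 6 and 12 can be sketched in a few lines. The helper below is a simplified illustration, not FastText's actual implementation: char_ngrams is a hypothetical name, and hashing the n-grams into buckets and summing their vectors is omitted. It only shows how a word is split into character n-grams with boundary markers, using 3 to 6 characters as the assumed default range.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with words seen during training
shared = set(char_ngrams("barking")) & set(char_ngrams("barked"))
print(sorted(shared))
```

Because "barked" shares pieces such as "bar" and "<bark" with "barking", a subword model can build a reasonable vector for it even if the full word never appeared in training.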

10.1 DIY

Which embedding model would you choose, and why? How would you select the size of the embedding model?
What is the “curse of dimensionality,” and how does it relate to NLP?

Thank you for reading!

Connect with me!

Vipra
