The article provides a comprehensive guide to creating word embeddings from scratch using the Word2Vec algorithm in Python, with a focus on deep learning techniques.
Abstract
The article delves into the process of generating word embeddings, which are vector representations of words, by utilizing deep learning methods. The author, who previously relied on pre-trained embeddings, now aims to elucidate the creation process for readers. The pipeline involves reading and preprocessing text, creating data points, converting these into one-hot encoded matrices, and training a neural network to extract the weights as word embeddings. The neural network architecture includes an input layer corresponding to the vocabulary size, a hidden layer representing the embedding dimension, and an output layer with a softmax activation function. The article emphasizes the importance of context window size in defining the relationships between words and uses one-hot encoding to transform words into numerical vectors for neural network processing. The final step involves training the network with Keras and TensorFlow and visualizing the resulting word embeddings to demonstrate semantic similarity.
Opinions
The author believes that understanding the process of creating word embeddings is crucial for a deeper comprehension of language modeling and feature learning in NLP.
There is an opinion that pre-trained word embeddings, while useful, do not provide the same level of insight as creating embeddings from scratch.
The article suggests that the choice of context window size is critical for capturing the semantic relationships between words.
The author values the practical application of word embeddings and provides a GitHub repository with the full code for readers to explore and learn from.
The use of one-hot encoding is presented as a necessary step to make text data computable for neural networks.
The visualization of word embeddings is considered an effective method to validate the model's ability to capture semantic similarities.
The author recommends using pre-trained embeddings from sources like Stanford's GloVe for practical applications, especially when dealing with larger datasets and vocabularies.
Creating Word Embeddings: Coding the Word2Vec Algorithm in Python using Deep Learning
Understanding the intuition behind word embedding creation with deep learning
When I was writing another article that showcased how to use word embeddings in a text classification task, I realized that I had always used pre-trained word embeddings downloaded from an external source (for example https://nlp.stanford.edu/projects/glove/). I started thinking about how to create word embeddings from scratch, and that is how this article was born. My main goal is for people to read this article alongside my code snippets and get an in-depth understanding of the logic behind the creation of vector representations of words.
The short version of the creation of the word embeddings can be summarized in the following pipeline:
Read the text -> Preprocess text -> Create (x, y) data points -> Create one hot encoded (X, Y) matrices -> train a neural network -> extract the weights from the input layer
In this article, I will briefly explain every step of the way.
From wiki: Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. The term word2vec literally translates to word to vector. For example,
“dad” = [0.1548, 0.4848, …, 1.864]
“mom” = [0.8785, 0.8974, …, 2.794]
The most important feature of word embeddings is that similar words in a semantic sense have a smaller distance (either Euclidean, cosine or other) between them than words that have no semantic relationship. For example, words like “mom” and “dad” should be closer together than the words “mom” and “ketchup” or “dad” and “butter”.
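To make the notion of distance concrete, here is a minimal sketch that compares the Euclidean and cosine distances between a few vectors. The 3-dimensional vectors are made up for illustration and are not real embeddings; the helper names euclidean and cosine_distance are my own.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two vectors of equal length
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 means same direction, larger means less similar
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical vectors, chosen by hand so that "mom" and "dad" are close
vectors = {
    "mom": [0.9, 0.8, 0.1],
    "dad": [0.8, 0.9, 0.2],
    "ketchup": [0.1, 0.2, 0.9],
}

for other in ("dad", "ketchup"):
    print(other,
          round(euclidean(vectors["mom"], vectors[other]), 3),
          round(cosine_distance(vectors["mom"], vectors[other]), 3))
```

With these hand-picked values, both distances from "mom" to "dad" come out smaller than those from "mom" to "ketchup", which is the property a good embedding should have.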
Word embeddings are created using a neural network with one input layer, one hidden layer and one output layer.
To create word embeddings, the first thing that is needed is text. Let us create a simple example: 12 sentences stating some well-known facts about a fictional royal family:
The future king is the prince
Daughter is the princess
Son is the prince
Only a man can be a king
Only a woman can be a queen
The princess will be a queen
Queen and king rule the realm
The prince is a strong man
The princess is a beautiful woman
The royal family is the king and queen and their children
Prince is only a boy now
A boy will be a man
The computer does not understand that the words king, prince and man are semantically closer to each other than to the words queen, princess and daughter. All it sees are characters encoded in binary. So how do we make the computer understand the relationship between certain words? By creating X and Y matrices and using a neural network.
When creating the training matrices for word embeddings, one of the hyperparameters is the window size of the context (w). The minimum value for this is 1, because without context the algorithm cannot work. Let us take the first sentence and assume that w = 2.
The future king is the prince
Take the word the as the focus word: the 2 words to its left and the 2 words to its right (because w = 2), where they exist, are the so-called context words. So we can start building our data points:
From 6 words we are able to create 18 data points. In practice, we do some preprocessing of the text and remove stop words like is, the, a, etc. By scanning our whole text document and appending the data we create the initial input which we can then transform into a matrix form.
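The count of 18 can be checked with a short sketch of the sliding window. The function name context_pairs is hypothetical, and I lowercase the sentence before splitting, which is an assumption about the preprocessing.

```python
def context_pairs(words, window=2):
    """Return (focus word, context word) pairs for one tokenized sentence."""
    pairs = []
    for i, focus in enumerate(words):
        # Clip the window at the sentence boundaries
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((focus, words[j]))
    return pairs

sentence = "The future king is the prince".lower().split()
pairs = context_pairs(sentence, window=2)

print(len(pairs))
print(pairs[:3])
```

Words at the edges of the sentence have fewer neighbours, which is why 6 words give 2 + 3 + 4 + 4 + 3 + 2 = 18 pairs rather than 24.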
The full pipeline to create the (X, Y) word pairs given a list of strings texts:
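A minimal sketch of such a pipeline follows. The stop-word list, the regular expression used for cleaning and the function names preprocess and build_word_pairs are my assumptions for illustration; the full code lives in the repository.

```python
import re

# Assumed stop-word list; a real project would use a larger one
STOP_WORDS = {"the", "a", "and", "is", "be", "will"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop everything except letters, and remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [word for word in text.split() if word not in stop_words]

def build_word_pairs(texts, window=2):
    """Scan every text and collect (focus word, context word) pairs."""
    word_pairs = []
    for text in texts:
        words = preprocess(text)
        for i, focus in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    word_pairs.append((focus, words[j]))
    return word_pairs

texts = [
    "The future king is the prince",
    "Daughter is the princess",
    "Son is the prince",
    "Only a man can be a king",
    "Only a woman can be a queen",
    "The princess will be a queen",
    "Queen and king rule the realm",
    "The prince is a strong man",
    "The princess is a beautiful woman",
    "The royal family is the king and queen and their children",
    "Prince is only a boy now",
    "A boy will be a man",
]

word_pairs = build_word_pairs(texts, window=2)
print(word_pairs[:3])
```

With this particular stop-word list the 12 sentences contain 21 unique words, which matches the vocabulary size used later in the article.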
After the initial creation of the data points, we need to assign a unique integer (often called index) to each unique word of our vocabulary. This will be used further on when creating one-hot encoded vectors.
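One way to build this mapping is sketched below. The function name create_unique_word_dict is hypothetical, and sorting the words alphabetically before numbering them is my choice so that the indices are reproducible.

```python
def create_unique_word_dict(words):
    """Map every unique word to an integer index (alphabetical order)."""
    unique_words = sorted(set(words))
    return {word: index for index, word in enumerate(unique_words)}

# Small illustrative input; in the article the words come from the word pairs
words = ["future", "king", "prince", "king", "queen"]
unique_word_dict = create_unique_word_dict(words)
print(unique_word_dict)
```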
After using the above function on the text we get the dictionary:
What we created up to this point is still not neural network friendly because what we have as data is the pairs of (focus word, context word). In order for the computer to start doing computations, we need a clever way to transform these data points into data points made up of numbers. One such clever way is the one-hot encoding technique.
One-hot encoding transforms a word into a vector of zeros in which a single coordinate, the one representing that word, is equal to 1. The vector size is equal to the number of unique words in the document. For example, let us define a simple list of strings:
a = ['blue', 'sky', 'blue', 'car']
There are 3 unique words: blue, sky and car. One hot representation for each word:
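The encoding can be sketched in a few lines. The helper one_hot is hypothetical, and the alphabetical ordering of the words (blue, car, sky) is an assumption; any fixed ordering works as long as it is used consistently.

```python
def one_hot(word, word_index):
    """Return a list of zeros with a single 1 at the index of the word."""
    vector = [0] * len(word_index)
    vector[word_index[word]] = 1
    return vector

a = ['blue', 'sky', 'blue', 'car']

# Assign an index to every unique word, in alphabetical order
word_index = {word: i for i, word in enumerate(sorted(set(a)))}
encoded = {word: one_hot(word, word_index) for word in word_index}

for word, vector in encoded.items():
    print(word, vector)
```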
We will be creating two matrices, X and Y, with the exact same technique. The X matrix will be created using the focus words and the Y matrix will be created using the context words.
Recall the first three data points which we created from the texts about the royal family:
The final sizes of these matrices will be n x m, where
n - number of created data points (pairs of focus words and context words)
m - number of unique words
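A minimal sketch of building both matrices with NumPy is shown below. The function name build_matrices and the three example pairs are assumptions for illustration.

```python
import numpy as np

def build_matrices(word_pairs, unique_word_dict):
    """One-hot encode focus words into X and context words into Y."""
    n = len(word_pairs)        # number of data points
    m = len(unique_word_dict)  # number of unique words
    X = np.zeros((n, m))
    Y = np.zeros((n, m))
    for row, (focus, context) in enumerate(word_pairs):
        X[row, unique_word_dict[focus]] = 1
        Y[row, unique_word_dict[context]] = 1
    return X, Y

# Illustrative data: three pairs over a three-word vocabulary
unique_word_dict = {"future": 0, "king": 1, "prince": 2}
word_pairs = [("future", "king"), ("future", "prince"), ("king", "future")]

X, Y = build_matrices(word_pairs, unique_word_dict)
print(X.shape, Y.shape)
```

Every row of X and of Y contains exactly one 1, so both matrices are very sparse. For large vocabularies a sparse matrix format would be the more memory-friendly choice.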
We now have X and Y matrices built from the focus word and context word pairs. The next step is to choose the embedding dimension. I will choose the dimension to be equal to 2 in order to later plot the words and see whether similar words form clusters.
Neural network architecture
The hidden layer dimension is the size of our word embedding. The output layer's activation function is softmax. The activation function of the hidden layer is linear. The input dimension is equal to the total number of unique words (remember, our X matrix is of the dimension n x 21). Each input node will have two weights connecting it to the hidden layer. These weights are the word embeddings! After the training of the network, we extract these weights and discard the rest. We do not necessarily care about the output.
For the training of the network, we will use Keras and TensorFlow:
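To see why the input weights are the embeddings, here is a small NumPy sketch of one forward pass. The weights are random rather than trained, biases are left out for brevity, and the names W_in and W_out are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, embed_size = 21, 2   # 21 unique words, 2-dimensional embedding

W_in = rng.normal(size=(n_words, embed_size))    # input -> hidden weights
W_out = rng.normal(size=(embed_size, n_words))   # hidden -> output weights

x = np.zeros(n_words)
x[5] = 1.0                    # one-hot vector for the word with index 5

hidden = x @ W_in             # linear activation: nothing else is applied
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()          # softmax over the whole vocabulary

print(hidden, W_in[5])
```

Because x is zero everywhere except at one position, the product x @ W_in simply selects row 5 of W_in. Each row of the input weight matrix is therefore the 2-dimensional vector of one word.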
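The sketch below shows what such a training step can look like with the Keras functional API bundled in TensorFlow. The toy vocabulary, the toy pairs, the optimizer and the number of epochs are all assumptions chosen to keep the example small; they are not the settings used for the plot in this article.

```python
import numpy as np
from tensorflow import keras

# Toy stand-ins for the real data (assumed for illustration only)
unique_word_dict = {"boy": 0, "king": 1, "man": 2, "prince": 3, "queen": 4}
pairs = [("king", "man"), ("man", "king"), ("prince", "boy"),
         ("boy", "prince"), ("king", "queen"), ("queen", "king")]

n_words = len(unique_word_dict)
eye = np.eye(n_words, dtype="float32")
X = eye[[unique_word_dict[focus] for focus, _ in pairs]]      # focus words
Y = eye[[unique_word_dict[context] for _, context in pairs]]  # context words

embed_size = 2  # dimension of the word embedding

# Input -> linear hidden layer (the embedding) -> softmax output
inp = keras.Input(shape=(n_words,))
hidden = keras.layers.Dense(embed_size, activation="linear")(inp)
out = keras.layers.Dense(n_words, activation="softmax")(hidden)

model = keras.Model(inputs=inp, outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, Y, batch_size=len(pairs), epochs=100, verbose=0)

# The first weight array is the input -> hidden kernel: one row per word
weights = model.get_weights()[0]
embedding_dict = {word: weights[idx] for word, idx in unique_word_dict.items()}
```

The resulting embedding_dict maps every word to its 2-dimensional vector, which is the form needed for plotting.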
After the training of the network, we can obtain the weights and plot the results:
import matplotlib.pyplot as plt

# Plot every word at its 2-dimensional embedding coordinates
plt.figure(figsize=(10, 10))
for word in unique_word_dict:
    coord = embedding_dict.get(word)
    plt.scatter(coord[0], coord[1])
    plt.annotate(word, (coord[0], coord[1]))
plt.show()
Visualization of the embeddings
As we can see, the words ‘man’, ‘future’, ‘prince’ and ‘boy’ form one cluster and the words ‘daughter’, ‘woman’ and ‘princess’ form another, in separate corners of the plot. All this was achieved from just 21 unique words and 12 sentences.
In practice, pre-trained word embeddings are often used, with typical embedding dimensions of 100, 200 or 300. I personally use the embeddings stored here: https://nlp.stanford.edu/projects/glove/.