#30DaysOfNLP
NLP-Day 10: Why You Should Care About Word Vectors
Introducing Word2Vec by uncovering word embeddings

Yesterday, we finished our hands-on implementation of Latent Semantic Analysis. We created our matrix of topic vectors and even tried to make sense of it.
However, till now we ignored the context. The neighbors. The entourage of a single word. We ignored how the relationship between a word and its surroundings affects the overall meaning of a text.
Now, it’s time to change all of that. It’s time to discover word vectors.
In the following sections, we’re going to learn what a word vector is and why it’s so useful and exciting. We will learn how a word vector is created and how we can utilize pre-trained models and external libraries.
We have a lot of ground to cover, so take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP: Why You Should Care About Word Vectors.
What is a word vector?
So far we created Bag-Of-Words and TF-IDF vectors, ignoring the nearby, more relevant surroundings of a word. We basically threw every word, we could get our hands on, into a huge statistical bag.
However, it is reasonable to assume that words appearing close together in a sentence are defined by a stronger, more meaningful relationship than words that appear widely apart. The closer the words the more context they convey.
So how do we capture a word’s surroundings?
With word vectors. A word vector tries to represent those nearby relationships by creating smaller “bags” from a neighborhood of only a few words. Usually less than 10 tokens.
By focusing only on the close company a word keeps, it is able to create a numerical vector representation of the semantics, the literal and implied meaning of a word.
In order to compute such a representation, Tomas Mikolov and this team of researchers at Google introduced Word2Vec in 2013.
Word2Vec utilizes a neural network to learn the meaning of a text by processing a large corpus of unlabeled documents. The main idea here is to learn word embeddings, a set of weights by either predicting the surrounding words or the target word based on its company. The set of weights can then be used to produce a word vector, encapsulating the word’s semantics.

Due to the unsupervised nature of the training process, word vectors are especially powerful.
Why is it useful, applications?
Word vectors are useful since they enable us to perform meaningful vector algebra. Thus, making it possible to execute semantic queries, analogies, and vector-oriented reasoning in general.
Questions in the form of "a is to b as c is to? are now possible to answer.
Consider, the following question: "A 'man' is to 'woman' as 'uncle' is to?". The answer is, of course, "aunt". Visualizing such word pairs in a vector space could look like the following.

The Word2Vec model “knows” that the direction and distance between "man" and "woman" are roughly the same as between "uncle" and "aunt". This measure allows for vector-oriented reasoning to work.
We will, however, almost never hit the exact word vector in our vocabulary, but the closest will most likely suffice. This is especially useful for semantic queries in search engines or to simply overcome the rigidity of pattern matching.
How is a word vector created?
Now, we have a basic understanding of what a word vector is. But how do we calculate these vector representations?
There are two different possible ways to train a Word2Vec model.
- The skip-gram approach tries to predict a word’s surrounding context.
- The continuous bag-of-words (CBOW) approach predicts a target word based on its nearby company.
Although we have two different techniques to train our model and obtain the word embeddings. We will most likely rely on pre-trained models. The computation of word vectors can be very resource-intensive, requiring a lot of data as well as computation time.
So how does skip-gram actually work?
With the skip-gram approach, we try to predict the surrounding words based on an input word. We can think about this as a fixed size “sliding window” that runs over our sentence.

We create an n-gram of words based on the sliding window. The word in the center, the input token, can be represented by a One-Hot-Vector. Next, we try to predict the surrounding words. Depending on “how big our window” is, we need multiple iterations per training round.

In the example above, we try to predict the previous and next words based on the input token. We utilize the softmax function, in order to output probabilities ranging between 0 and 1.
Once completed the training process, we obtain a set of weights that represent the semantic meaning. Words with similar meanings will have similar vectors since our model was trained to predict similar surroundings.
We can simply take the dot product of the embedding and a one-hot encoded input word to create our word vector.
And what about CBOW?
The training process with the CBOW approach is basically the same as before. However, now we try to predict a target word based on its nearby company. The surroundings are also captured by a fixed-size sliding window.

The CBOW approach is much faster to train since we only need to predict a single target per training epoch. It is also considered to show higher accuracy for more frequently appearing words.
Implementations, libraries to use in the real world
As mentioned earlier, training a Word2Vec model can be pretty resource-intensive. We most likely don’t have access to a huge training corpus that is required to create decent vector representations. And we probably don’t have the computational resources to train the model.
Fortunately, we can rely on pre-trained models.
A popular library we can use is gensim.
So if you haven’t already make sure to pip install gensim and let’s get started.
First of all, we need to download a pre-trained Word2Vec model. A popular choice is a model trained on Google News Documents which can be downloaded here.






