Similarity Metrics in NLP
Euclidean distance, dot product, and cosine similarity

When we convert language into a machine-readable format, the standard approach is to use dense vectors.
Dense vectors are typically generated by a neural network. They let us represent words and sentences as high-dimensional vectors, organized so that each vector's geometric position encodes meaning.

There is a particularly well-known example of this, where we take the vector of King, subtract the vector Man, and add the vector Woman. The closest matching vector to the resultant vector is Queen.
We can apply the same logic to longer sequences, too, like sentences or paragraphs — and we will find that similar meaning corresponds with proximity/orientation between those vectors.
So, similarity is important — and what we will cover here are the three most popular metrics for calculating that similarity.
Euclidean Distance
Euclidean distance (often called L2 distance, because it is the L2 norm of the difference between two vectors) is the most intuitive of the metrics. Let’s define three vectors:
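As a concrete starting point, here is a minimal sketch with assumed values. The vectors `a`, `b`, and `c` below are chosen because they reproduce the distances quoted in this section (0.014, 1.145, and 1.136); they may not be the exact vectors used for the original figures.

```python
# Three example vectors. These values are an assumption, picked to be
# consistent with the distances quoted in the text.
a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

# a and b differ only slightly in each component; c is far from both.
for name, vec in (("a", a), ("b", b), ("c", c)):
    print(name, vec)
```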

Just by looking at these vectors, we can confidently say that a and b are nearer to each other, and this becomes even clearer when we visualize each on a chart:

Clearly, a and b are closer together — and we calculate that using Euclidean distance:
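For reference, the standard Euclidean distance between two n-dimensional vectors can be written as follows. The symbols u, v, and n are notation assumed here, chosen to match the d(a, c) style used in this section.

```latex
d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
```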

To apply this formula to our two vectors, a and b, we do:

And we get a distance of 0.014. Performing the same calculation for d(a, c) returns 1.145, and d(b, c) returns 1.136. Clearly, a and b are nearer in Euclidean space.
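These three distances can be checked with a few lines of Python. This is a dependency-free sketch: the vector values are assumed (they are the ones that reproduce the numbers above), and `euclidean` is a helper name introduced here for illustration.

```python
import math

# Assumed example vectors; they reproduce the distances quoted in the text.
a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

def euclidean(u, v):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)
d_bc = euclidean(b, c)

print(round(d_ab, 3), round(d_ac, 3), round(d_bc, 3))  # 0.014 1.145 1.136
```

On Python 3.8 or later, the standard library's `math.dist(a, b)` computes the same quantity.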
Dot Product
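One drawback of Euclidean distance is that it does not consider orientation on its own. It measures only how far apart the endpoints of two vectors are, so two vectors that point in the same direction but have very different magnitudes still count as far apart. This is where we can use our other two metrics. The first of those is the dot product.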
The dot product considers direction (orientation) and also scales with vector magnitude.
We care about orientation because similar meaning (as we will often find) can be represented by the direction of the vector — not necessarily the magnitude of it.
For example, we may find that our vector's magnitude correlates with the frequency of a word that it represents in our dataset. Now, the word hi means the same as hello, and this may not be represented if our training data contained the word hi 1000 times and hello just twice.
So, vectors' orientation is often seen as being just as important (if not more so) as distance.
The dot product is calculated using:
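In standard notation, the dot product has an algebraic form and an equivalent geometric form. The geometric form is the one that contains the cosθ term discussed in this section. The symbols u, v, n, and θ (the angle between the two vectors) are notation assumed here.

```latex
u \cdot v = \sum_{i=1}^{n} u_i v_i = \lVert u \rVert \, \lVert v \rVert \cos\theta
```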

The dot product considers the angle between vectors. Where the angle is ~0°, the cosθ component of the formula equals ~1. If the angle is nearer to 90° (orthogonal/perpendicular), the cosθ component equals ~0, and at 180° the cosθ component equals ~-1.
Therefore, the cosθ component increases the result when the angle between the two vectors is smaller. So, for vectors of fixed magnitude, a higher dot product corresponds to closer alignment in orientation.
Again, let’s apply this formula to our two vectors, a and b:
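A minimal sketch of that calculation in plain Python, using the algebraic form (the sum of element-wise products). The values of `a` and `b` are assumed example vectors, and `dot` is a helper name introduced here.

```python
# Assumed example vectors (illustrative values).
a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]

def dot(u, v):
    """Sum of the element-wise products of u and v."""
    return sum(ui * vi for ui, vi in zip(u, v))

a_dot_b = dot(a, b)
print(round(a_dot_b, 4))  # 0.0167
```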

Clearly, the dot product calculation is straightforward (the simplest of the three) — and this gives us benefits in terms of computation time.
However, there is one drawback. It is not normalized, meaning vectors with larger magnitudes will tend to score higher dot products, despite being less similar.
For example, if we calculate a·a, we would expect a higher score than a·c (a is an exact match to a). But that’s not how it works, unfortunately: c has a much larger magnitude than a, so a·c comes out higher than a·a.
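The effect is easy to reproduce. In this sketch the vectors are assumed example values (with c much longer than a), and `dot` is a helper defined here for illustration.

```python
# Assumed example vectors: c has a much larger magnitude than a.
a = [0.01, 0.07, 0.1]
c = [0.91, 0.57, 0.6]

def dot(u, v):
    """Sum of the element-wise products of u and v."""
    return sum(ui * vi for ui, vi in zip(u, v))

a_dot_a = dot(a, a)  # exact match, roughly 0.015
a_dot_c = dot(a, c)  # different vector, roughly 0.109

# The exact match scores lower than the larger, less similar vector.
print(a_dot_c > a_dot_a)  # True
```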

So, in reality, the dot-product is used to identify the general orientation of two vectors — because:
- Two vectors that point in a similar direction return a positive dot-product.
- Two perpendicular vectors return a dot-product of zero.
- Vectors that point in opposing directions return a negative dot-product.
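The three cases above can be illustrated with simple 2-D vectors. The vectors below are hypothetical values picked only to make the signs obvious.

```python
def dot(u, v):
    """Sum of the element-wise products of u and v."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical 2-D vectors chosen for illustration.
u = [1.0, 0.0]
similar = [2.0, 1.0]        # points in roughly the same direction as u
perpendicular = [0.0, 3.0]  # at 90 degrees to u
opposite = [-1.0, -0.5]     # points roughly away from u

print(dot(u, similar))        # 2.0  (positive)
print(dot(u, perpendicular))  # 0.0  (zero)
print(dot(u, opposite))       # -1.0 (negative)
```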
Cosine Similarity
Cosine similarity considers vector orientation, independent of vector magnitude.
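In standard notation, with u and v as the two vectors and θ the angle between them, cosine similarity is written as follows. The name sim is a label assumed here.

```latex
\text{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```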

The first thing we should be aware of in this formula is that the numerator is, in fact, the dot product — which considers both magnitude and direction.
In the denominator, we have the strange double vertical bars — these mean ‘the length of’. So, we have the length of u multiplied by the length of v. The length, of course, considers magnitude.
When we take a function that considers both magnitude and direction and divide that by a function that considers just magnitude — those two magnitudes cancel out, leaving us with a function that considers direction independent of magnitude.
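One way to see the cancellation is to scale a vector and compare both metrics. This is a sketch with hypothetical vectors `u` and `v`, and helper functions defined here for illustration: stretching `v` by a factor of 10 multiplies the dot product by 10 but leaves the cosine similarity unchanged.

```python
import math

def dot(u, v):
    """Sum of the element-wise products of u and v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    """Length (L2 norm) of u."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two lengths."""
    return dot(u, v) / (norm(u) * norm(v))

# Hypothetical vectors chosen for illustration.
u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 0.5]
v_scaled = [10 * x for x in v]  # same direction, 10x the magnitude

print(dot(u, v), dot(u, v_scaled))  # 5.5 55.0
print(math.isclose(cosine_similarity(u, v),
                   cosine_similarity(u, v_scaled)))  # True
```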
We can think of cosine similarity as a normalized dot product! And it clearly works. The cosine similarity of a and b is near 1 (where 1 means the two vectors point in exactly the same direction):
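A dependency-free sketch of that calculation follows. The vectors are assumed example values and `cosine_similarity` is a helper written here, not a library function.

```python
import math

# Assumed example vectors (illustrative values).
a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot_uv = sum(ui * vi for ui, vi in zip(u, v))
    len_u = math.sqrt(sum(ui * ui for ui in u))
    len_v = math.sqrt(sum(vi * vi for vi in v))
    return dot_uv / (len_u * len_v)

cos_ab = cosine_similarity(a, b)
print(round(cos_ab, 4))  # 0.9998
```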

Comparing a and c again, this time with the sklearn implementation of cosine similarity, gives us a much more sensible result than the dot product did:
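The sklearn function in question is most likely `sklearn.metrics.pairwise.cosine_similarity`, which expects 2-D inputs such as `[a]` and `[c]`. To stay dependency-free, the sketch below computes the same quantity by hand, using assumed example vectors and a helper defined here.

```python
import math

# Assumed example vectors (illustrative values).
a = [0.01, 0.07, 0.1]
c = [0.91, 0.57, 0.6]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot_uv = sum(ui * vi for ui, vi in zip(u, v))
    len_u = math.sqrt(sum(ui * ui for ui in u))
    len_v = math.sqrt(sum(vi * vi for vi in v))
    return dot_uv / (len_u * len_v)

cos_aa = cosine_similarity(a, a)  # exact match, approximately 1
cos_ac = cosine_similarity(a, c)  # approximately 0.7235

# Unlike the raw dot product, the exact match now scores highest.
print(round(cos_ac, 4), cos_aa > cos_ac)  # 0.7235 True
```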

That’s all for this article covering the three distance/similarity metrics — Euclidean distance, dot product, and cosine similarity.
It’s worth being aware of how each works and their pros and cons — as they’re all used heavily in machine learning, and particularly NLP.
You can find Python implementations of each metric in this notebook.
Thanks for reading!
*All images are by the author except where stated otherwise






