Dagang Wei

Summary

The article discusses the importance of understanding similarity measures—dot product, cosine similarity, and Euclidean distance—in machine learning, explaining their mathematical formulas, interpretations, and applications.

Abstract

The article "Essential Math for Machine Learning: The Cosine Law and Similarity Measures" delves into the mathematical foundations of similarity measures critical for various machine learning tasks. It introduces the dot product, which reflects both the directional alignment and the magnitudes of two vectors, and highlights its utility in scenarios where both direction and strength of the relationship are important. Cosine similarity is presented as a measure that focuses solely on the orientation of vectors, ignoring their magnitudes, making it particularly useful in text analysis where document length can vary significantly. The Euclidean distance, representing the straight-line distance between points in a multi-dimensional space, is shown to be sensitive to feature magnitude and is suitable for applications where the relative scale of features is significant. The Law of Cosines is explained to illustrate the interconnection between these similarity measures, providing a geometric understanding of their relationships. The article concludes by emphasizing the importance of selecting the appropriate similarity measure to enhance the effectiveness of machine learning models.

Opinions

  • The author suggests that the choice of similarity measure can significantly impact the performance of machine learning models.
  • It is implied that cosine similarity is often preferred in text analysis due to its disregard for document length.
  • The article endorses the Law of Cosines as a fundamental concept that links the three similarity measures.
  • The author provides Python code snippets to demonstrate practical applications of the mathematical concepts discussed.

Essential Math for Machine Learning: The Cosine Law and Similarity Measures

Image generated with Gemini

This article is part of the series Essential Math for Machine Learning.

Introduction

Machine learning algorithms often rely on understanding how similar or dissimilar different data points are. Whether it’s recommending movies, clustering customer profiles, or detecting similar-looking images, the concept of similarity lies at the heart of many ML tasks. In this blog post, we’ll explore three fundamental similarity measures and examine when they’re most appropriate to use:

  • Dot Product
  • Cosine Similarity
  • Euclidean Distance

Understanding Similarity

Intuitively, when we say two objects are similar, we imply that they share certain characteristics or features. Mathematically, we can represent objects as vectors in a multi-dimensional space. Each dimension represents a feature. Similarity measures are tools that take these vector representations and calculate a numerical score quantifying how close or alike those vectors are.

Dot Product

The dot product of two vectors combines how well their directions align with how large their magnitudes are. Here’s how it works:

Formula: If a = [a1, a2, …, an] and b = [b1, b2, …, bn] are two vectors, then their dot product is calculated as: a · b = a1b1 + a2b2 + … + anbn

Geometric Meaning: The dot product equals the product of the magnitudes of the two vectors and the cosine of the angle between them: a · b = ||a|| ||b|| cos(θ).

When to Use: The dot product is useful when you’re interested in both the similarity of direction and the strength of the relationship between your vectors. For vectors of fixed magnitude, a higher dot product means they point in more similar directions; in general, scaling either vector up also scales the dot product.

Python Code

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, -1, 0])
dot_product = np.dot(a, b)
print(dot_product)  # Output: 2
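
To make the role of magnitude concrete, here is a small sketch that scales one vector and compares the results. The helper name angle_between is introduced only for this illustration; it is not part of NumPy.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -1.0, 0.0])

def angle_between(u, v):
    """Angle in radians between two non-zero vectors (illustrative helper)."""
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against tiny floating-point overshoots outside [-1, 1]
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

dot_ab = np.dot(a, b)           # 1*4 + 2*(-1) + 3*0 = 2.0
dot_scaled = np.dot(10 * a, b)  # scaling a by 10 scales the dot product by 10

# The direction of a is unchanged, so the angle to b stays the same
same_angle = np.isclose(angle_between(a, b), angle_between(10 * a, b))
print(dot_ab, dot_scaled, same_angle)
```

The dot product grows tenfold even though the two vectors are no more "aligned" than before, which is why it is usually read as a mix of direction and strength.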

Cosine Similarity

Cosine similarity focuses purely on the angle between two vectors, giving a score of how closely aligned the vectors are, regardless of their magnitudes.

Formula: If a and b are two vectors, the cosine similarity is: cos(θ) = (a · b) / (||a|| ||b||)

where:

  • ||a|| is the magnitude (length) of vector a
  • ||b|| is the magnitude (length) of vector b

Interpretation: Cosine similarity values range from -1 to 1. A value of 1 indicates perfect similarity (vectors point in the same direction). 0 means the vectors are orthogonal (no similarity). -1 signifies opposite directions.

When To Use: Use cosine similarity when you care about the orientation of the vectors, while discounting differences in their magnitudes. This is common in text analysis, where document length discrepancies shouldn’t overly diminish similarity scores.

Python Code

import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 4, 6])  # Notice b has twice the magnitude of a 
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_sim)  # Output: 1.0 (up to floating-point rounding)
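
The text-analysis case can be sketched with toy word counts. Everything here is hypothetical: the three-word vocabulary, the counts, and the cosine_similarity helper, which returns 0.0 for an all-zero vector purely as a convention chosen for this example (the formula itself is undefined there).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v.

    Returns 0.0 if either vector is all zeros; that is a convention
    picked for this sketch, since the formula would divide by zero.
    """
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))

# Hypothetical word counts over the vocabulary ["cat", "dog", "fish"]
short_doc = np.array([2, 1, 0])
long_doc = np.array([20, 10, 0])   # same word proportions, ten times longer
other_doc = np.array([0, 1, 5])    # mostly about a different topic

sim_same_topic = cosine_similarity(short_doc, long_doc)
sim_other_topic = cosine_similarity(short_doc, other_doc)
print(sim_same_topic, sim_other_topic)
```

The short and long documents score essentially 1 despite the tenfold length difference, while the off-topic document scores close to 0.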

Euclidean Distance

Euclidean distance is the familiar “straight-line” distance between two points in multi-dimensional space.

Formula: The Euclidean distance between vectors a and b is: ||a - b|| = sqrt((a1 - b1)² + (a2 - b2)² + … + (an - bn)²)

Interpretation: Larger Euclidean distances indicate greater dissimilarity. Euclidean distance is sensitive to the overall magnitude of the vectors.

When to Use: Euclidean distance is suitable when the relative scale of the features matters and when finding the most similar instances (in terms of pure geometric distance) is desirable.

Python Code

import numpy as np

a = np.array([1, 2, 3])
b = np.array([5, 0, 1])
euclidean_dist = np.linalg.norm(a - b)
print(euclidean_dist)  # Output: 4.8989... (the square root of 24)
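
As a rough illustration of this sensitivity to magnitude, the sketch below compares one vector against two others: one with the same direction but twice the length, and one with the same length but a different direction. The vectors and the cos_sim helper are chosen only for this example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice as long
c = np.array([3.0, 2.0, 1.0])  # same length as a, different direction

def cos_sim(u, v):
    """Cosine similarity of two non-zero vectors (illustrative helper)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

dist_ab = np.linalg.norm(a - b)  # sqrt(14), about 3.74
dist_ac = np.linalg.norm(a - c)  # sqrt(8), about 2.83

print(dist_ab, dist_ac)
print(cos_sim(a, b), cos_sim(a, c))
```

By Euclidean distance, c is the closer neighbor of a; by cosine similarity, b is the better match. Neither answer is wrong, they simply measure different things.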

The Cosine Law Connection

The Law of Cosines provides a fundamental link between these three similarity measures. Recall the Law of Cosines, which relates the sides and an angle of any triangle:

c² = a² + b² - 2ab cos(θ)

where:

  • c is the length of the side opposite angle θ
  • a and b are the lengths of the other two sides
  • cos(θ) is the cosine of angle θ

Let’s see how this connects to our similarity measures:

Euclidean Distance and the Cosine Law: If we represent vectors a and b as two sides of a triangle, the vector a - b represents the third side. Substituting into the Law of Cosines we get:

||a - b||² = ||a||² + ||b||² - 2 ||a|| ||b|| cos(θ)

Notice the appearance of the magnitudes of a and b and the cosine of the angle θ between them.

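Cosine Similarity: Expanding the left-hand side with the dot product gives ||a - b||² = (a - b) · (a - b) = ||a||² + ||b||² - 2 (a · b). Comparing this with the equation above shows that a · b = ||a|| ||b|| cos(θ), and dividing both sides by ||a|| ||b|| gives the formula for cosine similarity: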

cos(θ) = (a · b) / (||a|| ||b||)

The Law of Cosines underpins the relationships between Euclidean distance, dot product, and cosine similarity. Essentially, these similarity measures offer different ways to express the relationships between triangle sides and their included angle.
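
The connection can be checked numerically. This is only a sanity check on one arbitrary pair of vectors, not a proof: it recovers cos(θ) once from the three side lengths via the Law of Cosines and once from the dot-product formula.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([5.0, 0.0, 1.0])

norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
dist = np.linalg.norm(a - b)  # length of the third side, a - b

# cos(theta) recovered from the three side lengths (Law of Cosines)
cos_from_sides = (norm_a**2 + norm_b**2 - dist**2) / (2 * norm_a * norm_b)

# cos(theta) from the dot-product formula (cosine similarity)
cos_from_dot = np.dot(a, b) / (norm_a * norm_b)

print(np.isclose(cos_from_sides, cos_from_dot))
print(np.isclose(dist**2, norm_a**2 + norm_b**2 - 2 * np.dot(a, b)))
```

Both checks should print True, which reflects that the squared Euclidean distance, the dot product, and the cosine similarity are tied together by a single identity.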

Geometric Proof of the Law of Cosines

Triangle Setup: Consider an arbitrary triangle ABC with angle θ at vertex C. Let a be the length of AC, b the length of BC, and c the length of AB (the side opposite θ). Draw the altitude from vertex B to side AC, meeting it at point D, and let h be its length. For simplicity, assume D falls between A and C, so that it divides AC into two segments: CD of length x and DA of length (a - x).

Pythagorean Theorem (twice):

In right triangle BCD: b² = h² + x²

In right triangle ABD: c² = h² + (a - x)²

Isolate h² from the first equation: h² = b² - x²

Substitute this expression for h² into the second equation:

c² = (b² - x²) + (a - x)²

Expand the squared term:

c² = b² - x² + a² - 2ax + x²

Notice that -x² and x² cancel out.

c² = a² + b² - 2ax

Trigonometry with Right Triangle BCD:

cos(θ) = x / b (cosine definition in a right triangle)

x = b cos(θ)

Final Substitution: Substitute this value of x into the simplified equation:

c² = a² + b² - 2ab cos(θ)
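
The steps of the proof can be traced on a concrete triangle. The side lengths and the angle below are arbitrary choices for illustration; the variable names follow the proof (a for AC, b for BC, x for CD, h for the altitude BD).

```python
import math

a, b = 5.0, 4.0           # lengths of AC and BC (arbitrary example values)
theta = math.radians(60)  # angle at vertex C

x = b * math.cos(theta)   # CD: distance from C to the foot of the altitude
h = b * math.sin(theta)   # BD: length of the altitude from B

# c² from the Pythagorean theorem in right triangle ABD
c_sq_pythagoras = h**2 + (a - x)**2

# c² from the Law of Cosines
c_sq_cosine_law = a**2 + b**2 - 2 * a * b * math.cos(theta)

print(c_sq_pythagoras, c_sq_cosine_law)
```

Both values come out at 21 up to floating-point rounding, matching 25 + 16 - 2 · 5 · 4 · 0.5.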

Algebraic Proof of the Law of Cosines

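Alternatively, treat the sides of the triangle as vectors. With a and b as two sides, the third side is the vector

c = a - b

so the dot product of c with itself is:

c · c = (a - b) · (a - b)

The dot product is distributive and commutative, so

(a - b) · (a - b) = a · (a - b) - b · (a - b) = a · a + b · b - 2 (a · b)

and therefore

c · c = a · a + b · b - 2 (a · b)

Since v · v = ||v||² for any vector v, and a · b = ||a|| ||b|| cos(θ), this is exactly the Law of Cosines:

||c||² = ||a||² + ||b||² - 2 ||a|| ||b|| cos(θ)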

Conclusion

Choosing the right similarity measure is essential for building effective machine learning models. Dot product, cosine similarity, and Euclidean distance each offer strengths depending on whether you care about overall magnitudes, directions, or a combination of both. Remember the relationships between these measures as they can help inform your decision in various ML applications.
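
As a closing illustration, the three measures can disagree about which candidate is the "most similar" to a query. The query, the candidate vectors, and their labels below are made up for this sketch and chosen so that each measure picks a different winner.

```python
import numpy as np

query = np.array([1.0, 1.0])
candidates = {
    "A": np.array([10.0, 0.0]),  # very long, 45 degrees away from the query
    "B": np.array([1.0, 1.2]),   # close to the query in space
    "C": np.array([3.0, 3.0]),   # same direction as the query, three times longer
}

def cos_sim(u, v):
    """Cosine similarity of two non-zero vectors (illustrative helper)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Larger is better for dot product and cosine; smaller is better for distance
best_by_dot = max(candidates, key=lambda k: np.dot(query, candidates[k]))
best_by_cosine = max(candidates, key=lambda k: cos_sim(query, candidates[k]))
best_by_distance = min(candidates, key=lambda k: np.linalg.norm(query - candidates[k]))

print(best_by_dot, best_by_cosine, best_by_distance)  # A C B
```

The dot product rewards the long vector A, cosine similarity rewards the perfectly aligned C, and Euclidean distance rewards the nearby B.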
