avatarJun Xie

Summary

The web content discusses three primary metrics for measuring vector similarity in semantic search: squared L2, inner product, and cosine similarity, noting their relationships and implications for search results when vectors are normalized or not.

Abstract

The article titled "Semantic Search Similarity Metrics" outlines the three prevalent metrics used to quantify the similarity or distance between vectors in semantic search applications. These metrics are squared L2 distance, also known as Euclidean distance, inner product, and cosine similarity. The text explains that while squared L2 distance is typically used to measure distance, inner product and cosine similarity are employed to assess similarity. The article further clarifies that when vectors are normalized, the three metrics become equivalent, with cosine similarity and inner product yielding identical results, and the squared L2 distance being proportional to the inner product by a factor of two. However, when vectors are not normalized, the choice of metric can significantly affect search performance, necessitating empirical evaluation to determine the most effective metric for optimizing recall in search results.

Opinions

  • The author suggests that the choice of similarity metric can be flexible when dealing with normalized vectors, as all three metrics will produce equivalent results.
  • In cases where vectors are not normalized, the author advises conducting experiments to ascertain which metric yields better recall, implying that the performance of these metrics is context-dependent.
  • The article implies that normalization of vectors is a key factor in the interchangeability of the metrics, emphasizing its importance in semantic search applications.
  • The author provides a formulaic relationship between squared L2 distance and inner product for non-normalized vectors, indicating that the difference between these metrics is a matter of scale rather than kind.

Semantic Search Similarity Metrics

There are three common similarity metrics used in the semantic search to measure the similarity/distance between vectors, including squared L2 (l2), inner product (ip) and cosine similarity (cosine).

For L2, it is commonly named as Euclidean distance. Assume there are two vectors with n dimension:

Squared L2 distance:

In semantic search, we normally use distance. Inner product is used for similarity. This way, we get its distance by doing a subtraction.

We apply the same subtraction idea to the cosine similarity.

In fact, if x and y are normalized, then all three distances are equivalent. The context is that normalized, then inner product = = 1

  1. Cosine similarity and inner product are the same.
  2. For l2 and ip relationship is l2=2*ip. This way, the two distances are just proportional, which doesn’t change the search result.

Based on the above logic, given normalized, feel free to choose any distance metric. If not normalized, then it is better to run experiments to figure out which one performs better in terms of recall.

Machine Learning
Recommended from ReadMedium