Making Recommendations Using OpenAI Embeddings and Nearest Neighbor Search
Online recommendations are omnipresent. E-commerce sites, streaming services, and online news outlets all use recommendation systems to suggest similar items or content. These recommendations are often based on a user’s past behavior or on the behavior of similar users.
This article delves into using embeddings and nearest neighbor search to make recommendations. Specifically, we’ll use a dataset of news articles from AG’s corpus and build a model that can suggest similar articles given one particular article.
Prerequisites
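Before diving into the core algorithm, we need to import the necessary Python packages and functions. Make sure pandas and the openai package are installed; pickle is part of the Python standard library and needs no installation. Note that the openai.embeddings_utils module used here ships only with pre-1.0 releases of the openai package. The visualization section additionally uses NumPy, scikit-learn, and matplotlib.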
# imports
import pandas as pd
import pickle
from openai.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

# constants
EMBEDDING_MODEL = "text-embedding-ada-002"

Loading Data
The AG news dataset, which contains a collection of news articles, is used in this case. You can preview a few rows of the dataset to understand its structure and contents. The dataset contains the title, description, and labels for each news article.
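As a minimal sketch of this step, the snippet below loads a few rows with pandas and previews them. The inline CSV rows and the column names `title`, `description`, and `label` are made up for illustration; in practice you would pass the path of your own copy of the AG news CSV to `pd.read_csv` and adjust the column names to match.

```python
import io

import pandas as pd

# Hypothetical stand-in for the AG news CSV: these rows and column names are
# illustrative assumptions, not the real dataset.
sample_csv = io.StringIO(
    "title,description,label\n"
    "Election update,Leaders debate policy ahead of the vote.,World\n"
    "Cup final,The home side won the final on penalties.,Sports\n"
    "Chip launch,A new processor promises faster laptops.,Sci/Tech\n"
)

# Replace sample_csv with the path to your dataset file.
df = pd.read_csv(sample_csv)

# Preview the structure and contents.
print(df.head())

# The descriptions are the strings that get embedded later on.
article_descriptions = df["description"].tolist()
```

Keeping the descriptions in a plain list makes it easy to refer to an article by its position, which is how the source article is selected in the rest of this walkthrough.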
Cache Embeddings
As generating embeddings can be a computationally intensive task, it is a good practice to cache the embeddings. This way, the same embeddings can be reused later, avoiding redundant computations.
The cache is implemented as a Python dictionary that maps a tuple of text and model to an embedding (a list of floats) and saved as a pickle file.
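The cache has to exist before anything can read from it. One way to set it up, as a minimal sketch: try to load the pickle file and fall back to an empty dictionary when the file is not there yet. The file location and the helper names `load_cache` and `save_cache` are assumptions made for this example, and the stored vector is a placeholder rather than a real model output.

```python
import os
import pickle
import tempfile

# Hypothetical cache location; in practice, pick a persistent path in your project.
embedding_cache_path = os.path.join(tempfile.gettempdir(), "embedding_cache_sketch.pkl")


def load_cache(path: str) -> dict:
    """Return the pickled cache stored at `path`, or an empty dict if none exists."""
    try:
        with open(path, "rb") as cache_file:
            return pickle.load(cache_file)
    except FileNotFoundError:
        return {}


def save_cache(cache: dict, path: str) -> None:
    """Write the whole cache to `path` as a pickle file."""
    with open(path, "wb") as cache_file:
        pickle.dump(cache, cache_file)


embedding_cache = load_cache(embedding_cache_path)

# Keys are (text, model) tuples; values are embeddings (lists of floats).
# The vector below is a made-up placeholder, not a real embedding.
embedding_cache[("example text", "text-embedding-ada-002")] = [0.1, 0.2, 0.3]
save_cache(embedding_cache, embedding_cache_path)

# Loading again returns the entry that was just saved.
reloaded = load_cache(embedding_cache_path)
```

Only unpickle cache files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.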
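# embedding_cache (a dict) and embedding_cache_path (a str) must be defined before this function
def embedding_from_string(string: str, model: str = EMBEDDING_MODEL, embedding_cache=embedding_cache) -> list:
    """Return the embedding of a given string, using the cache to avoid recomputation."""
    if (string, model) not in embedding_cache:
        # cache miss: compute the embedding and persist the updated cache
        embedding_cache[(string, model)] = get_embedding(string, model)
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]

Make Recommendations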
To recommend similar articles, we use a three-step process:
- First, generate the embeddings for all article descriptions.
- Next, calculate the distance between the source article’s embedding and the embeddings of all other articles.
- Finally, return the articles closest to the source article in the embedding space.
The nearest neighbors (i.e., the articles with the smallest distances to the source article) are the recommended articles.
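To make the distance and ranking steps concrete, here is a small NumPy sketch that works on toy vectors and does not depend on the OpenAI helper functions. The function names `cosine_distances` and `nearest_neighbors` and the 2-D "embeddings" are made up for illustration; real embeddings have far more dimensions, but the arithmetic is the same.

```python
import numpy as np


def cosine_distances(query: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) from `query` to each row of `embeddings`."""
    unit_query = query / np.linalg.norm(query)
    unit_rows = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - unit_rows @ unit_query


def nearest_neighbors(embeddings: np.ndarray, source_index: int, k: int) -> list[int]:
    """Indices of the k rows closest to the source row, excluding the source itself."""
    distances = cosine_distances(embeddings[source_index], embeddings)
    order = np.argsort(distances, kind="stable")  # closest first
    return [int(i) for i in order if i != source_index][:k]


# Toy 2-D "embeddings" (made-up numbers).
toy_embeddings = np.array([
    [1.0, 0.0],  # 0: the source article
    [0.9, 0.1],  # 1: points in almost the same direction as the source
    [0.0, 1.0],  # 2: orthogonal to the source
    [0.7, 0.7],  # 3: somewhere in between
])

neighbors = nearest_neighbors(toy_embeddings, source_index=0, k=2)
print(neighbors)  # [1, 3]
```

The source is dropped from its own result list because its distance to itself is zero, which would otherwise always make it the top recommendation.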
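def print_recommendations_from_strings(strings: list[str], index_of_source_string: int, k_nearest_neighbors: int = 1, model=EMBEDDING_MODEL) -> list[int]:
    """Return the indices of the k nearest neighbors of the source string."""
    # get embeddings for all strings (cached where possible)
    embeddings = [embedding_from_string(string, model=model) for string in strings]
    query_embedding = embeddings[index_of_source_string]
    # cosine distance from the source embedding to every embedding
    distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")
    indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
    # drop the source itself, then keep only the k closest articles
    neighbors = [int(i) for i in indices_of_nearest_neighbors if i != index_of_source_string]
    return neighbors[:k_nearest_neighbors]

Making Recommendations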
With everything in place, we can now start making recommendations. Let’s consider an example of finding articles similar to a specific article about Tony Blair.
tony_blair_articles = print_recommendations_from_strings(
    strings=article_descriptions,
    index_of_source_string=0,
    k_nearest_neighbors=5,
)

The model identifies other articles that mention Tony Blair, along with articles on related political themes, as similar. This suggests that the approach can detect semantic similarity between articles, which makes it a useful building block for recommendation systems.
Visualization of embeddings with t-SNE
Although we can get a sense of the effectiveness of our recommender system from the examples above, it’s also helpful to visualize the space of article descriptions to see how they cluster together. We can do this by reducing the dimensionality of the embeddings using t-SNE (t-Distributed Stochastic Neighbor Embedding), a common technique for visualizing high-dimensional data.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# get embeddings for all descriptions as a 2D array (t-SNE expects an array, not a list of lists)
description_embeddings = np.array(
    [embedding_from_string(description, model=EMBEDDING_MODEL) for description in article_descriptions]
)

# compute 2D components with t-SNE
tsne = TSNE(n_components=2, random_state=0)
components = tsne.fit_transform(description_embeddings)

# plot components
plt.figure(figsize=(10, 10))
scatter = plt.scatter(components[:, 0], components[:, 1], alpha=0.3)
plt.title('t-SNE of article descriptions')
plt.show()

The resulting plot will likely show some degree of clustering, where similar articles group together in the embedding space.
Evaluating the recommender system
While our examples have demonstrated some success of the recommender system, it’s important to assess its performance more systematically. Depending on the nature of your data and your use case, different metrics might be appropriate. In our case, we might be interested in whether similar articles, as determined by their category labels, tend to be recommended.
def evaluate_recommendation_quality(
    strings: list[str],
    labels: list[str],
    index_of_source_string: int,
    k_nearest_neighbors: int = 1,
    model=EMBEDDING_MODEL,
) -> float:
    """Evaluate the quality of a recommender by checking whether recommended articles belong to the same category."""
    # get indices of nearest neighbors
    indices_of_nearest_neighbors = print_recommendations_from_strings(
        strings=strings,
        index_of_source_string=index_of_source_string,
        k_nearest_neighbors=k_nearest_neighbors,
        model=model,
    )
    # check what fraction of recommended articles belong to the same category
    source_label = labels[index_of_source_string]
    recommended_labels = [labels[i] for i in indices_of_nearest_neighbors]
    return recommended_labels.count(source_label) / len(recommended_labels)

This evaluation function computes the fraction of recommended articles that belong to the same category as the source article. A higher score indicates better recommendation quality.

# evaluate the quality of recommendations for the first few articles
for i in range(5):
    print(f"\nArticle #{i+1}")
    print(f"Recommendation quality: {evaluate_recommendation_quality(article_descriptions, df['label'].tolist(), i, k_nearest_neighbors=5)}")

The output of this evaluation is a score between 0 and 1 for each article, giving a numerical indication of how well our recommender system is working.
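Scores for single articles can be noisy, so it may help to average them over many source articles. The sketch below shows the idea on toy data in plain Python. The helper names, the labels, and the neighbor lists are all made up for illustration; in practice the neighbor indices would come from the nearest neighbor search.

```python
def same_label_fraction(labels: list[str], source_index: int, neighbor_indices: list[int]) -> float:
    """Fraction of the neighbors that share the source article's label."""
    source_label = labels[source_index]
    matches = sum(1 for i in neighbor_indices if labels[i] == source_label)
    return matches / len(neighbor_indices)


def mean_same_label_fraction(labels: list[str], neighbors_by_source: dict[int, list[int]]) -> float:
    """Average of the per-article fractions over all source articles."""
    scores = [
        same_label_fraction(labels, source_index, neighbor_indices)
        for source_index, neighbor_indices in neighbors_by_source.items()
    ]
    return sum(scores) / len(scores)


# Toy labels and neighbor lists (made-up values).
toy_labels = ["World", "World", "Sports", "Sports"]
toy_neighbors = {
    0: [1, 2],  # one of two neighbors matches -> 0.5
    1: [0],     # the only neighbor matches    -> 1.0
    2: [0, 1],  # no neighbor matches          -> 0.0
    3: [2],     # the only neighbor matches    -> 1.0
}

mean_score = mean_same_label_fraction(toy_labels, toy_neighbors)
print(mean_score)  # 0.625
```

Category labels are only a rough proxy for similarity, so a score like this is best read as a sanity check rather than a definitive measure of recommendation quality.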
Conclusion
This article walked you through how to use embeddings and nearest neighbor search to build a recommendation system. We demonstrated the use of OpenAI’s text-embedding-ada-002 model to generate embeddings for news articles, which were then used to recommend similar articles based on cosine distance in the embedding space. We also discussed methods for caching embeddings, visualizing the embeddings using t-SNE, and evaluating the quality of the recommendations. These steps provide a foundation upon which more sophisticated recommendation systems can be built.
