
Making Recommendations Using OpenAI Embeddings and Nearest Neighbor Search

Online recommendations are omnipresent. E-commerce sites, streaming services, and online news outlets all use recommendation systems to suggest similar items or content. These recommendations are often based on a user’s past behavior or on the behavior of similar users.

This article delves into using embeddings and nearest neighbor search to make recommendations. Specifically, we’ll use a dataset of news articles from AG’s corpus and build a model that can suggest similar articles given one particular article.

Prerequisites

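Before diving into the core algorithm, we need to import the necessary Python packages and functions. Make sure pandas and the OpenAI embeddings utility module (openai.embeddings_utils) are installed; pickle ships with Python’s standard library. Note that openai.embeddings_utils is only available in pre-1.0 releases of the openai package.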

# imports
import pandas as pd
import pickle
from openai.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

# constants
EMBEDDING_MODEL = "text-embedding-ada-002"

Loading Data

The AG news dataset, which contains a collection of news articles, is used in this case. You can preview a few rows of the dataset to understand its structure and contents. The dataset contains the title, description, and labels for each news article.
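
The article does not show the loading step itself, so here is a minimal sketch of what it might look like with pandas. The column names (title, description, label) follow the description above, but the sample rows are placeholders invented for illustration, and `load_articles` is a hypothetical helper, not part of any library. In practice you would pass the path of your own copy of the dataset instead of the in-memory sample.

```python
import io

import pandas as pd

# Placeholder rows in the assumed column layout (NOT real AG news articles).
SAMPLE_CSV = io.StringIO(
    "title,description,label\n"
    "Placeholder headline A,Placeholder description A,World\n"
    "Placeholder headline B,Placeholder description B,Sports\n"
    "Placeholder headline C,Placeholder description C,Business\n"
)

REQUIRED_COLUMNS = {"title", "description", "label"}


def load_articles(source, n_rows=None) -> pd.DataFrame:
    """Read articles from a CSV path or file-like object and check the expected columns."""
    df = pd.read_csv(source, nrows=n_rows)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"dataset is missing columns: {sorted(missing)}")
    return df


df = load_articles(SAMPLE_CSV)
# The recommender below works on the description text of each article.
article_descriptions = df["description"].tolist()
print(df.head())
```

The names `df` and `article_descriptions` match the ones used by the code later in this article.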

Cache Embeddings

Generating embeddings is slow and, with a hosted API, costs money on every request, so it is good practice to cache them. This way, the same embeddings can be reused later, avoiding redundant computations.

The cache is implemented as a Python dictionary that maps a tuple of text and model to an embedding (a list of floats) and saved as a pickle file.
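
The function below relies on two names, `embedding_cache` and `embedding_cache_path`, that have to exist before it is defined. A minimal sketch of how they might be set up follows; the file name is a hypothetical choice, and `load_embedding_cache` is an illustrative helper rather than part of the OpenAI utilities.

```python
import pickle

# Hypothetical location of the pickle file; pick any writable path.
embedding_cache_path = "recommendations_embeddings_cache.pkl"


def load_embedding_cache(path: str) -> dict:
    """Return the cache stored at `path`, or an empty dict if no cache file exists yet."""
    try:
        with open(path, "rb") as embedding_cache_file:
            return pickle.load(embedding_cache_file)
    except FileNotFoundError:
        return {}


# Maps (text, model) tuples to embeddings (lists of floats).
embedding_cache = load_embedding_cache(embedding_cache_path)
```

Only unpickle cache files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.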

def embedding_from_string(string: str, model: str = EMBEDDING_MODEL, embedding_cache=embedding_cache) -> list:
    """Return the embedding of a string, reusing the cached value when there is one."""
    if (string, model) not in embedding_cache:
        # cache miss: request the embedding, then save the updated cache to disk
        embedding_cache[(string, model)] = get_embedding(string, model)
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]

Make Recommendations

To recommend similar articles, we use a three-step process:

First, generate the embeddings for all article descriptions.
Next, calculate the distance between the source article’s embedding and the embeddings of all other articles.
Finally, return the articles closest to the source article in the embedding space.

The nearest neighbors (i.e., the articles with the smallest distances to the source article) are the recommended articles.

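def print_recommendations_from_strings(strings: list[str], index_of_source_string: int, k_nearest_neighbors: int = 1, model=EMBEDDING_MODEL) -> list[int]:
    # embed every string (cached embeddings are reused)
    embeddings = [embedding_from_string(string, model=model) for string in strings]
    query_embedding = embeddings[index_of_source_string]
    distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")
    # indices sorted from nearest to farthest; the source itself comes first
    indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
    # drop the source article and keep only the k closest matches
    recommended_indices = [int(i) for i in indices_of_nearest_neighbors if i != index_of_source_string]
    return recommended_indices[:k_nearest_neighbors]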
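
To make the distance and ranking steps concrete without calling the API, here is a self-contained sketch in plain Python. The vectors are made-up three-dimensional stand-ins for real embeddings, and `cosine_distance` and `k_nearest_neighbors` are illustrative helpers, not the functions from the OpenAI utilities.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest_neighbors(vectors: list[list[float]], source_index: int, k: int) -> list[int]:
    """Return the indices of the k vectors closest to the source, excluding the source."""
    distances = [cosine_distance(vectors[source_index], v) for v in vectors]
    # rank all indices from smallest to largest distance
    ranked = sorted(range(len(vectors)), key=lambda i: distances[i])
    return [i for i in ranked if i != source_index][:k]


# Made-up vectors standing in for article embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],  # source
    [0.9, 0.1, 0.0],  # points almost the same way as the source
    [0.0, 1.0, 0.0],  # orthogonal to the source
    [0.7, 0.7, 0.0],  # halfway between
]
print(k_nearest_neighbors(toy_embeddings, source_index=0, k=2))  # prints [1, 3]
```

Sorting every distance is fine for a few thousand articles; larger collections usually call for an approximate nearest neighbor index instead.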

Example Recommendations

With everything in place, we can now start making recommendations. Let’s consider an example of finding articles similar to a specific article about Tony Blair.

tony_blair_articles = print_recommendations_from_strings(
    strings=article_descriptions,
    index_of_source_string=0,
    k_nearest_neighbors=5,
)

The model ranks other articles mentioning Tony Blair, along with articles on related political themes, as the most similar. This suggests that the approach captures semantic similarity between articles, which is what a recommendation system needs, although a single example is anecdotal rather than conclusive.

Visualization of embeddings with t-SNE

Although we can get a sense of the effectiveness of our recommender system from the examples above, it’s also helpful to visualize the space of article descriptions to see how they cluster together. We can do this by reducing the dimensionality of the embeddings using t-SNE (t-Distributed Stochastic Neighbor Embedding), a common technique for visualizing high-dimensional data.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# get embeddings for all descriptions
description_embeddings = [
    embedding_from_string(description, model=EMBEDDING_MODEL)
    for description in article_descriptions
]

# compute 2D components with t-SNE (the list of embeddings is converted to an array first)
tsne = TSNE(n_components=2, random_state=0)
components = tsne.fit_transform(np.array(description_embeddings))

# plot components
plt.figure(figsize=(10, 10))
plt.scatter(components[:, 0], components[:, 1], alpha=0.3)
plt.title("t-SNE of article descriptions")
plt.show()

The resulting plot will likely show some degree of clustering, where similar articles group together in the embedding space.

Evaluating the recommender system

While our examples have demonstrated some success of the recommender system, it’s important to assess its performance more systematically. Depending on the nature of your data and your use case, different metrics might be appropriate. In our case, we might be interested in whether similar articles, as determined by their category labels, tend to be recommended.

def evaluate_recommendation_quality(
    strings: list[str],
    labels: list[str],
    index_of_source_string: int,
    k_nearest_neighbors: int = 1,
    model=EMBEDDING_MODEL,
) -> float:
    """Return the fraction of recommended articles that share the source article's category."""
    # get indices of nearest neighbors
    indices_of_nearest_neighbors = print_recommendations_from_strings(
        strings=strings,
        index_of_source_string=index_of_source_string,
        k_nearest_neighbors=k_nearest_neighbors,
        model=model,
    )
    # keep the k closest articles, excluding the source article itself
    recommended_indices = [
        i for i in indices_of_nearest_neighbors if i != index_of_source_string
    ][:k_nearest_neighbors]
    # check what fraction of recommended articles belong to the same category
    source_label = labels[index_of_source_string]
    recommended_labels = [labels[i] for i in recommended_indices]
    return recommended_labels.count(source_label) / len(recommended_labels)


# A higher score indicates better recommendation quality.
# evaluate the quality of recommendations for the first few articles
labels = df["label"].tolist()
for i in range(5):
    quality = evaluate_recommendation_quality(article_descriptions, labels, i, k_nearest_neighbors=5)
    print(f"\nArticle #{i+1}")
    print(f"Recommendation quality: {quality}")

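Each score is the fraction of recommended articles that share the source article’s category, so it ranges from 0 (no recommendation matches) to 1 (every recommendation matches). Together, the scores give a numerical indication of how well our recommender system is working.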
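
Scores for five articles are still a small sample. One way to summarize quality over many source articles is to average the per-article fractions, as in the sketch below. It is written in plain Python with made-up labels and neighbor lists so that it runs on its own; `same_label_fraction` and `mean_same_label_fraction` are illustrative names, not part of the code above.

```python
def same_label_fraction(source_label: str, recommended_labels: list[str]) -> float:
    """Fraction of recommended labels equal to the source label (0.0 if none were recommended)."""
    if not recommended_labels:
        return 0.0
    matches = sum(label == source_label for label in recommended_labels)
    return matches / len(recommended_labels)


def mean_same_label_fraction(labels: list[str], neighbors_by_source: dict[int, list[int]]) -> float:
    """Average the same-label fraction over every source article in `neighbors_by_source`."""
    scores = [
        same_label_fraction(labels[source], [labels[i] for i in neighbors])
        for source, neighbors in neighbors_by_source.items()
    ]
    return sum(scores) / len(scores)


# Made-up category labels and recommendations, for illustration only.
toy_labels = ["World", "World", "Sports", "World", "Sports"]
toy_neighbors = {
    0: [1, 3],  # both recommendations share the "World" label -> 1.0
    2: [4, 0],  # one of two shares the "Sports" label -> 0.5
}
print(mean_same_label_fraction(toy_labels, toy_neighbors))  # prints 0.75
```

With real data, `neighbors_by_source` would hold the indices returned by the recommender for each source article.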

Conclusion

This article walked you through how to use embeddings and nearest neighbor search to build a recommendation system. We demonstrated the use of OpenAI’s text-embedding-ada-002 model to generate embeddings for news articles, which were then used to recommend similar articles based on cosine distance in the embedding space. We also discussed methods for caching embeddings, visualizing the embeddings using t-SNE, and evaluating the quality of the recommendations. These steps provide a foundation upon which more sophisticated recommendation systems can be built.
