So you want to build an AI application powered by LLM: Let’s talk about Embedding and Semantic Search
Embedding and Semantic Search for an AI application powered by LLM
So, you’ve decided to join the Large Language Model trend and build an AI application that incorporates LLM. It’s highly probable that your application will perform semantic searches within a vector database and then use the LLM to derive solutions from the retrieved search results. I can’t fault you for that, as I’m doing the same thing.
This article continues the series on building an AI application that utilizes LLMs, following the last article (“So You Want to Build an AI Application powered by LLM: Let’s Talk About Data Pre-Processing”), and delves into embeddings and semantic search.
Let’s clear one thing up: you do not need a vector database to perform a semantic search, or any other search for that matter. I predict that the full range of vector database functionality will be incorporated into existing databases in the near future. If you want to build on a platform that will not disappear anytime soon, your best bet is Elasticsearch, not because it has superior technology, but because it has a broad existing user base. If you are using Postgres as your database, you can also utilize pgvector: open-source vector similarity search for Postgres (https://github.com/pgvector/pgvector). Vector database vendors will claim that you may lose powerful functionality such as metadata filtering (most people know how to write ‘SELECT * FROM table WHERE column_name = value’) and sparse-dense hybrid indexing, which we will discuss in the next article. A rough sketch of combining metadata filtering with a pgvector similarity search follows.
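For instance, here is a rough sketch of what a metadata filter combined with a pgvector similarity search could look like from Python. The table name, columns, and filter value are hypothetical, and the query embedding is assumed to already be a Python list of 1,536 floats.

import psycopg2

# Hypothetical table: patent_chunks(id, chunk_id, title, text, assignee, embedding vector(1536))
query_embedding = [0.0] * 1536  # replace with your real query embedding

conn = psycopg2.connect("dbname=patents")
cur = conn.cursor()

# '<=>' is pgvector's cosine distance operator; the WHERE clause is plain SQL metadata filtering
cur.execute(
    """
    SELECT id, chunk_id, title
    FROM patent_chunks
    WHERE assignee = %s
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    ("ACME Corp", str(query_embedding)),
)
rows = cur.fetchall()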
For the purpose of this article, we will be using Apache Parquet to store all our data, including embeddings, and will utilize Faiss for performing searches. Although ScaNN performs better than Faiss, it is quite finicky to use. I hope Google ports ScaNN to Jax soon.
Embedding
I will leave it to experts in the field of NLP to explain how embeddings work, but I will attempt to explain it in practical terms.
- If you use OpenAI’s embedding model (text-embedding-ada-002) to generate embeddings, you will receive a list of 1,536 numbers. For instance, if you create an embedding for “hi” (1 token), the embedding model will return a list of 1,536 numbers. Similarly, if you create an embedding for a text consisting of 8,192 tokens, the embedding model will still return a list of 1,536 numbers.
- You also need to consider the cost of creating embeddings. At $0.0004 per 1,000 tokens it may seem quite affordable, right? In reality, it can be very expensive. Let me illustrate with an example: suppose there are 10,000,000 patent records with an average of 40,000 tokens of text each, and your overlapping windowed passage chunking strategy inflates the token count by roughly 2.6x. You would have just spent (10,000,000 x 40,000 x 2.6) / 1,000 x $0.0004 = $416,000 on embeddings alone (the arithmetic is sketched in code after this list). It is crucial to start creating embeddings only once you have cleansed the text and settled on a chunking strategy. You certainly don’t want to go through this process multiple times, as I have.
- OpenAI’s infrastructure is not entirely stable, even when https://status.openai.com displays all green bars, which are all lies. Expect your embedding processes to fail on several occasions. Your code should include a strategy to resume the process from the point where it failed.
- You will likely use the code to obtain embeddings for your text concurrently. Be aware of your rate limits, as detailed here (https://platform.openai.com/account/rate-limits), to ensure your process does not come to a halt due to exceeding your rate limits.
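To put the cost bullet above in concrete terms, here is the same back-of-the-envelope calculation in code; the record count, tokens per record, and chunking overhead are the example numbers from above.

# Back-of-the-envelope embedding cost estimate (example numbers from above)
num_records = 10_000_000        # patent records
tokens_per_record = 40_000      # average text length in tokens
chunking_overhead = 2.6         # overlapping windowed passages duplicate text
price_per_1k_tokens = 0.0004    # text-embedding-ada-002 pricing at the time of writing

total_tokens = num_records * tokens_per_record * chunking_overhead
cost = total_tokens / 1_000 * price_per_1k_tokens
print(f"${cost:,.0f}")          # $416,000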
Massive Text Embedding Benchmark (MTEB)
OpenAI’s embedding model is not the only option available. Hugging Face hosts the Massive Text Embedding Benchmark (MTEB) Leaderboard, a Hugging Face Space by mteb, which measures the performance of text embedding models on diverse embedding tasks. The article “MTEB: Massive Text Embedding Benchmark” (huggingface.co) provides a good explanation of the leaderboard.
I should explore other text embedding models, as it is not necessary to use OpenAI’s text embeddings in order to use their GPT models. Furthermore, many of the open-source models on the leaderboard can be run for free on your own hardware; a minimal sketch of using one of them follows.
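For example, here is a minimal sketch of generating embeddings with an open-source model from the leaderboard via the sentence-transformers library. The model name is just one of many possibilities, and note that it produces 384-dimensional embeddings rather than OpenAI’s 1,536.

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small, free model; it returns 384-dimensional embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["An apparatus for detecting objects using radar."])
print(embeddings.shape)  # (1, 384)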
Patent Records Embedding Example
This example expands on the previous article (“So You Want to Build an AI Application powered by LLM: Let’s Talk About Data Pre-Processing”), in which we processed over 1 million patent records and created 2.6 million chunked records using the chunking by overlapping windowed passage strategy. We chose to store this data in Apache Parquet format with the following columns:
- ID
- CHUNK_ID
- TITLE
- TEXT
Creating embeddings for these texts is a fairly straightforward process in which you call OpenAI’s API to generate embeddings using the “text-embedding-ada-002” model, while accounting for the expected timeout errors and staying within the constraints of your rate limit.
Please note that this code is just a sample and you will need to revise the code to fit your purpose.
import openai
import os
import math
import time
import pandas as pd
from tqdm import tqdm
import concurrent.futures
from concurrent.futures import as_completed
from dotenv import load_dotenv
# Define your API key and model for OpenAI
load_dotenv()
MODEL = "text-embedding-ada-002"
openai.api_key = os.getenv("OPENAI_API_KEY")
def get_text_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
# Process a batch of rows
def process_batch_text(batch, model):
    try:
        return batch['text'].apply(lambda x: get_text_embedding(x, model=model))
    except Exception as e:
        print(f"Error while processing batch: {e}")
        raise e
def process_and_save_chunks(df, start_row, batch_size, model, num_workers, output_file_template):
    num_batches = math.ceil((len(df) - start_row) / batch_size)
    # Make sure the embedding column exists before we assign lists of floats into it
    df['embedding'] = None

    def process_batch_with_retry(batch, model, batch_index, num_batches):
        success = False
        while not success:
            try:
                embeddings = process_batch_text(batch, model)
                success = True
            except Exception as e:
                print(f"Failed to process batch {batch_index+1}/{num_batches}: {e}. Retrying in 60 seconds...")
                time.sleep(60)
        return embeddings

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = []
        for i in range(num_batches):
            start = start_row + i * batch_size
            end = start_row + (i + 1) * batch_size
            batch = df.iloc[start:end]
            future = executor.submit(process_batch_with_retry, batch, model, i, num_batches)
            futures.append((i, future))
        for i, future in tqdm(futures, desc="Batches"):
            start = start_row + i * batch_size
            end = start_row + (i + 1) * batch_size
            embeddings = future.result()
            # .loc slicing is inclusive, hence end-1; this assumes the default RangeIndex
            df.loc[start:end-1, 'embedding'] = embeddings
            output_file = output_file_template.format(start, end)
            print(f"Saving batch {i+1}/{num_batches} to {output_file}")
            df.iloc[start:end].to_parquet(output_file, index=False)
Then call these functions:
input_df = pd.read_parquet('input/patent_records.parquet')
output_file_template = 'output/patent_records_embedding_{}_{}.parquet'
process_and_save_chunks(df=input_df, start_row=0, batch_size=1000, model='text-embedding-ada-002', num_workers=4, output_file_template=output_file_template)
This process will generate multiple output files based on your batch_size setting, with embeddings stored in the “embedding” column as a list of 1,536 numbers. For example, if batch_size is set to 1,000, then each Parquet file will hold 1,000 records. You will need to merge all these Parquet files into one Parquet file afterward; a small sketch of that step follows.
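One minimal way to merge the per-batch files with pandas, assuming they all fit in memory (otherwise process them file by file or use pyarrow.dataset):

import glob
import pandas as pd

# The glob pattern matches the output_file_template used above
files = sorted(glob.glob('output/patent_records_embedding_*.parquet'))
merged = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
merged.to_parquet('output/patent_records_embedding.parquet', index=False)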
Semantic Search
Semantic search can be explained in the context of text embeddings, which are compact numerical representations that encapsulate the meaning and context of words, phrases, or entire documents. By transforming text into embeddings, we can quantify and compare the semantic relationships between different pieces of text.
Semantic search leverages these embeddings to enhance search capabilities. By converting both the search query and the target documents into embeddings, we can compare their semantic relationships and retrieve the most relevant results. This approach goes beyond traditional keyword-based searches, as it can identify relevant documents even if they don’t share the exact same words as the search query.
In essence, semantic search powered by text embeddings allows for more accurate and meaningful search results by understanding and comparing the underlying meaning of the text, rather than just focusing on keyword matches. This approach is particularly useful in fields with large amounts of text data, such as patent records or academic literature, where finding relevant information quickly and accurately is of utmost importance.
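As a toy illustration, with made-up three-dimensional vectors standing in for real 1,536-dimensional embeddings, comparing embeddings usually comes down to cosine similarity (which is exactly what the Faiss code below computes by L2-normalizing vectors and taking inner products):

import numpy as np

# Toy vectors standing in for real 1,536-dimensional embeddings
query = np.array([0.2, 0.9, 0.1])
doc_a = np.array([0.25, 0.85, 0.05])  # semantically close to the query
doc_b = np.array([0.9, 0.1, 0.4])     # semantically distant

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(query, doc_a))  # close to 1.0 -> likely relevant
print(cosine_similarity(query, doc_b))  # much lower -> less relevant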
Is Semantic Search Good?
In my personal opinion, semantic search is simply average. It neither excels nor disappoints; it falls into the “okay” category. It doesn’t live up to the hype as the groundbreaking AI application some people claim it to be.
ANN Benchmarks
Anyway… Before we dive into the details, it’s important to understand the various options available to you. ANN Benchmarks (http://ann-benchmarks.com/index.html) provides a list of different approximate nearest neighbor libraries, along with their benchmarking results.
According to the ANN Benchmark website, Google Research’s ScaNN: Efficient Vector Similarity Search outperforms all other methods. However, for some reason, I have not been able to use this tool outside of Google Colab due to its complicated dependency on Google TensorFlow. As a result, we will resort to Meta’s Faiss: A library for efficient similarity search, which may not be as performant as ScaNN but is still sufficient for most use cases.
Patent Records Semantic Search Example
In the previous section, we generated embeddings for all 2.6 million chunked patent records and stored them in an Apache Parquet file with the following columns:
- ID
- CHUNK_ID
- TITLE
- TEXT
- EMBEDDING
Now, we will use Meta’s Faiss to perform a semantic search on the patent records. The code is structured as follows:
Please note that this code is just a sample and you will need to revise the code to fit your purpose. If you want to experiment with the semantic search without creating an embedding dataset then you can download existing embedding datasets from here (https://blog.devgenius.io/list-of-embedding-archives-a5f5db33510b).
import os
import time
import faiss
import numpy as np
import pandas as pd
import openai
import pyarrow.parquet as pq
from dotenv import load_dotenv
# Define your API key and model for OpenAI
load_dotenv()
MODEL = "text-embedding-ada-002"
openai.api_key = os.getenv("OPENAI_API_KEY")
CHUNK_SIZE = 500
# Define a function to search the Faiss index
def search_faiss_index(dataframe, query_embedding, num_neighbors=5):
    # Create an embedding matrix from the dataframe
    embedding_matrix = np.vstack(dataframe['embedding'].values).astype('float32')
    # Normalize the embeddings
    faiss.normalize_L2(embedding_matrix)
    faiss.normalize_L2(query_embedding)
    # Create a Faiss index
    index = faiss.IndexFlatIP(embedding_matrix.shape[1])
    index.add(embedding_matrix)
    # Perform a search using the Faiss index
    distances, neighbors = index.search(query_embedding, num_neighbors)
    # Create a dataframe with search results
    search_results = dataframe.iloc[neighbors[0]].copy()
    search_results['distances'] = distances[0]
    return search_results[['distances', 'id', 'chunk_id', 'title', 'text']]
# Define a function to search the parquet file in chunks
def get_search_results(file_name, query_embedding, num_neighbors=5):
    # Initialize an empty dataframe to store the search results
    all_search_results = pd.DataFrame()
    table = pq.read_table(file_name)
    file_num_rows = table.num_rows
    for i in range(0, file_num_rows, CHUNK_SIZE):
        chunk = table.slice(i, CHUNK_SIZE).to_pandas()
        search_results = search_faiss_index(chunk, query_embedding, num_neighbors)
        # Append the chunk's search results to the all_search_results dataframe
        all_search_results = pd.concat([all_search_results, search_results], ignore_index=True)
    return all_search_results
# Define a function to get the query embedding
def get_query_embedding(search, model=MODEL, retry_timeout=15, max_retries=5):
    for i in range(max_retries):
        try:
            query_response = openai.Embedding.create(input=search, model=model)
            query_embedding = np.array(query_response["data"][0]["embedding"], dtype='float32')
            # Faiss expects a 2D array of shape (num_queries, dimension)
            return query_embedding.reshape(1, -1)
        except Exception as e:
            print(f"DEBUG: OpenAI Embedding API attempt {i + 1} failed: {e}. Retrying in {retry_timeout} seconds...")
            time.sleep(retry_timeout)
            retry_timeout *= 2
    raise RuntimeError("An error occurred while transforming the search query to an embedding.")
Then call these functions:
filename = 'output/patent_records_embedding.parquet'
search = 'Your query'
query_embedding = get_query_embedding(search)
search_results = get_search_results(filename, query_embedding)
Note: This code can return hundreds or even thousands of records in the search results because the search is conducted in chunks of 500 records to avoid exhausting resources. You will need to filter them to remove duplicates and rank them accordingly, for example as in the sketch below.
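One minimal way to collapse the per-chunk hits into a ranked list, assuming the column names used above:

# With IndexFlatIP on L2-normalized vectors, 'distances' are cosine similarities,
# so higher is better. Keep the best-scoring chunk per patent id and take the top 10.
top_results = (
    search_results.sort_values('distances', ascending=False)
    .drop_duplicates(subset='id')
    .head(10)
)
print(top_results[['distances', 'id', 'title']])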
Evaluating Semantic Search Results
Evaluating semantic search results can be a challenging task, as it requires assessing the relevance and quality of the results in relation to the user’s query intent. Consider patent records as an example: if you are searching through 1 million or even 10 million records, do you have a team of patent lawyers capable of definitively determining that, given a search query, these records should be returned in a specific ranking order? If you do, you can use metrics and methods such as the following to evaluate semantic search results (a small worked example follows the list):
- Precision and Recall: Precision measures the proportion of relevant documents retrieved compared to the total number of documents retrieved, while recall measures the proportion of relevant documents retrieved compared to the total number of relevant documents in the collection. A balanced measure that combines both precision and recall is the F1 score, which is the harmonic mean of precision and recall.
- Mean Average Precision (MAP): MAP is a widely-used metric that evaluates the average precision of search results at different recall levels. It considers the order of search results and is well-suited for tasks where the position of relevant documents matters.
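Here is a toy sketch of precision, recall, F1, and average precision for a single query; the IDs and ground-truth labels are made up, and MAP is simply the mean of the average precision across many such queries.

# Hypothetical ground truth (e.g. labeled by your patent lawyers) and ranked results
relevant = {"US123", "US456", "US789"}
retrieved = ["US123", "US999", "US456", "US000", "US111"]

hits = [doc for doc in retrieved if doc in relevant]
precision = len(hits) / len(retrieved)   # 2/5 = 0.4
recall = len(hits) / len(relevant)       # 2/3 ~= 0.67
f1 = 2 * precision * recall / (precision + recall)

# Average precision: mean of precision@k at each rank k where a relevant doc appears
precisions_at_hits = [
    len([d for d in retrieved[:k + 1] if d in relevant]) / (k + 1)
    for k, doc in enumerate(retrieved) if doc in relevant
]
average_precision = sum(precisions_at_hits) / len(relevant)

print(precision, recall, f1, average_precision)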
I hope you’ve enjoyed this article. In my next piece, I plan to delve into hybrid search and re-ranking. If you have any questions or comments, please feel free to share them here.
Resources
- So You Want to Build an AI Application powered by LLM: Let’s Talk About Data Pre-Processing
- pgvector: Open-source vector similarity search for Postgres (https://github.com/pgvector/pgvector)
- Massive Text Embedding Benchmark (MTEB) Leaderboard, a Hugging Face Space by mteb
- MTEB: Massive Text Embedding Benchmark (huggingface.co)
- ANN Benchmarks (http://ann-benchmarks.com/index.html)
- Google Research’s ScaNN: Efficient Vector Similarity Search
- Meta’s Faiss: A library for efficient similarity search