Visualizing FAISS Vector Space to Understand its Influence on RAG Performance
Visualizing embeddings using renumics-spotlight reveals useful insights into RAG generation behavior.
With the ever-improving performance of open-source large language models, they are finding their way into various applications, including writing and analyzing code, recommendations, text summarization and question-answering (QA). When it comes to QA, LLMs typically fall short on questions related to documents not used during their training. And there are many such internal documents that should remain behind corporate walls for compliance, trade-secret or privacy reasons. When queried about such documents, LLMs are known to hallucinate, producing content that is irrelevant, fabricated or inconsistent.
One available technique to deal with this challenge is Retrieval-Augmented Generation (RAG). It enhances the LLM's response by referencing an authoritative knowledge base outside its training data sources prior to response generation. A RAG application consists of a retriever system to fetch relevant document snippets from a corpus, and an LLM to generate responses using the retrieved snippets as context. Naturally, the quality of the corpus and its representation in the vector space, termed embeddings, plays a significant role in RAG accuracy.
In this article, let's look at how to visualize the multi-dimensional embeddings of the FAISS vector space in 2-D using the visualization library renumics-spotlight. We will look for opportunities to improve RAG response accuracy by varying certain key vectorization parameters. For our LLM of choice, we will be adopting TinyLlama 1.1B Chat, a compact model with the same architecture and tokenizer as Llama 2 [1]. It has the advantages of a significantly smaller resource footprint and fast run time, but without a proportional drop in accuracy, which makes it ideal for rapid experimentation.
Table of Contents
1.0 Environment Setup
2.0 Design and Implementation
2.1 Module LoadFVectorize
2.2 The main Module
3.0 Test Run
3.1 Testing Chunk Size and Overlap Parameters
4.0 Final Thoughts
1.0 Environment Setup
This experimentation will be conducted on a MacBook Air M1 with 8GB RAM. The version of Python used here is 3.10.5. First, let's create a virtual environment to manage this project. To create and activate the environment, let's run the following:
python3.10 -m venv mychat
source mychat/bin/activate
Library renumics-spotlight uses UMAP-like visualizations that reduce the high-dimensional embeddings to a more manageable 2-D view while preserving key properties [2]. Let's proceed to install all the required libraries:
pip install langchain faiss-cpu sentence-transformers flask-sqlalchemy psutil unstructured pdf2image unstructured_inference pillow_heif opencv-python pikepdf pypdf
pip install renumics-spotlight
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
The last line above installs the llama-cpp-python library with Metal support, which will be used to load TinyLlama with hardware acceleration on the M1 processor. With Metal, the computation runs on the GPU.
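As a quick aside, if you want to confirm that layers are actually being offloaded to the Metal GPU, you can load the model through LangChain's LlamaCpp wrapper with the n_gpu_layers parameter set. The snippet below is only a sketch: the model path is the quantized TinyLlama file used later in this article, and n_gpu_layers=-1 (offload all layers) assumes a reasonably recent llama-cpp-python build.
# Sketch: verify Metal GPU offload when loading the quantized TinyLlama.
# n_gpu_layers=-1 offloads all layers in recent llama-cpp-python builds;
# use a large positive number on older versions.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/tinyllama_gguf/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=True,   # load-time logs include details of the backend in use
)
print(llm.invoke("Say hello in one short sentence."))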
Since we have the environment ready, let’s take a look at the system design followed by its implementation.
2.0 Design and Implementation
There are two modules for this QA system as illustrated in Fig. 1.
Module LoadFVectorize involves loading PDF or web documents. For the initial test and visualization, the document of interest is a 440-page vendor deployment guide that was released recently (Dec 2023) and is quite unlikely to have been seen by the LLM during its training. This module handles the splitting and vectorization of the documents.
The second module involves loading the LLM and instantiating a FAISS retriever, followed by the creation of a retrieval chain encompassing the LLM, the retriever and a custom prompt for questioning. Finally, it launches the vector space visualization.
The details of both modules are described below.
2.1 Module LoadFVectorize
This module comprises three functions:
- Function load_doc handles the loading of an online PDF document, splits it at 512 characters per chunk with an overlap of 100 characters, and returns the document list.
- Function vectorize calls the above function load_doc to get the chunked list of documents, creates the embeddings, commits them to local directory opdf_index, and returns the FAISS instance.
- Function load_db checks whether a FAISS vectorstore is on disk within directory opdf_index and attempts to load it. Otherwise, it invokes function vectorize to load and vectorize the document. It finally returns a FAISS object.
The full listing of this module’s code is shown below.
# LoadFVectorize.py
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
# access an online pdf
def load_doc() -> 'List[Document]':
    loader = OnlinePDFLoader("https://support.riverbed.com/bin/support/download?did=7q6behe7hotvnpqd9a03h1dji&version=9.15.0")
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)
    return docs

# vectorize and commit to disk
def vectorize(embeddings_model) -> 'FAISS':
    docs = load_doc()
    db = FAISS.from_documents(docs, embeddings_model)
    db.save_local("./opdf_index")
    return db

# attempts to load vectorstore from disk
def load_db() -> 'FAISS':
    embeddings_model = HuggingFaceEmbeddings()
    try:
        db = FAISS.load_local("./opdf_index", embeddings_model)
    except Exception as e:
        print(f'Exception: {e}\nNo index on disk, creating new...')
        db = vectorize(embeddings_model)
    return db
2.2 The main Module
The main module initially defines the prompt template for TinyLlama in the following format:
<|system|>{context}</s><|user|>{question}</s><|assistant|>
To further reduce the LLM memory footprint, we will adopt a quantized version of TinyLlama from TheBloke's HuggingFace repo [3], which essentially uses fewer bits for the model's parameters. If you are interested in additional background on this LLM or further details on the enabling technologies, feel free to check out our earlier article. To load the quantized LLM in the GGUF format, LlamaCpp is used. Using the FAISS object returned by the previous module, a FAISS retriever is created. With the above objects, the RetrievalQA chain is then instantiated and used for questioning.
The following code extract captures these steps.
# main.py
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_community.embeddings import HuggingFaceEmbeddings
import LoadFVectorize
from renumics import spotlight
import pandas as pd
import numpy as np
# Prompt template
qa_template = """<|system|>
You are a friendly chatbot who always responds in a precise manner. If answer is
unknown to you, you will politely say so.
Use the following context to answer the question below:
{context}</s>
<|user|>
{question}</s>
<|assistant|>
"""
# Create a prompt instance
QA_PROMPT = PromptTemplate.from_template(qa_template)
# load LLM
llm = LlamaCpp(
    model_path="./models/tinyllama_gguf/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf",
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    verbose=False,
    n_ctx=2048
)
# vectorize and create a retriever
db = LoadFVectorize.load_db()
faiss_retriever = db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}, max_tokens_limit=1000)
# Define a QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=faiss_retriever,
    chain_type_kwargs={"prompt": QA_PROMPT}
)
query = 'What versions of TLS supported by Client Accelerator 6.3.0?'
result = qa_chain({"query": query})
print(f'--------------\nQ: {query}\nA: {result["result"]}')
visualize_distance(db,query,result["result"])
The vector space visualization itself is handled by the last line in the above code listing, visualize_distance, which is also defined in this module.
In function visualize_distance, we first need to access the FAISS object's attribute __dict__, which is a dictionary of its instance variables. This gives us access to the docstore. Instance variable index_to_docstore_id is itself a dictionary mapping key indices to value docstore-ids. The total document count used for vectorization is given by the index object's attribute ntotal.
vs = db.__dict__.get("docstore")
index_list = db.__dict__.get("index_to_docstore_id").values()
doc_cnt = db.index.ntotal
To enable an approximate reconstruction of the vector space, we simply invoke the index object's method reconstruct_n with default parameters:
embeddings_vec = db.index.reconstruct_n()
Since we have the list of docstore-ids as index_list, let's find the relevant document object for each id and use it to create a list of lists containing the docstore-id, document metadata, document content, as well as its embedding within the vector space, as per the following listing:
doc_list = list()
for i, doc_id in enumerate(index_list):
    a_doc = vs.search(doc_id)
    doc_list.append([doc_id, a_doc.metadata.get("source"), a_doc.page_content, embeddings_vec[i]])
The list of lists is then used to create a dataframe with the column headers, which will subsequently be used in a spotlight call to create the visualization.
df = pd.DataFrame(doc_list,columns=['id','metadata','document','embedding'])
Before we proceed with the visualization, we need to find a way to incorporate the question and answer. Separate dataframes for the question and the answer are created and then merged with the main dataframe, so that we can see where both of them appear along with the rest of the document chunks in the vector space:
# add rows for question and answer
embeddings_model = HuggingFaceEmbeddings()
question_df = pd.DataFrame(
    {
        "id": "question",
        "question": question,
        "embedding": [embeddings_model.embed_query(question)],
    })
answer_df = pd.DataFrame(
    {
        "id": "answer",
        "answer": answer,
        "embedding": [embeddings_model.embed_query(answer)],
    })
df = pd.concat([question_df, answer_df, df])
To find the Euclidean distance between the question and the documents in this space, we create the embedding for the question and then use numpy's linalg.norm on the difference between each document embedding and the question embedding:
question_embedding = embeddings_model.embed_query(question)
# add column for vector distance
df["dist"] = df.apply(
lambda row: np.linalg.norm(
np.array(row["embedding"]) - question_embedding
),axis=1,)
Once we have the dataframe, we simply call spotlight.show to generate the visualization.
spotlight.show(df)
Voila! This launches spotlight in a browser window with the pandas dataframe ready for exploration.
3.0 Test Run
To facilitate our first experiment, below is the sample question that was chosen:
What versions of TLS supported by Client Accelerator 6.3.0?
As per the adopted document, the correct answer is:
Client Accelerator 6.3.0 supports TLS 1.1 or 1.2.
Additionally, the following extra information may be included in the response.
You must enable this capability using the following CLI command on the Client Accelerator:
(config) # policy id ssl backend client-tls-1.2
Even though TLS 1.1 is not mentioned in the syntax of this command, using this command automatically enables support for both TLS 1.1 and 1.2.
And below was the response from TinyLlama for the above question:
Client Accelerator 6.3.0 supports TLS 1.1 or 1.2 as the default supported TLS versions with Client Accelerator. You must enable this capability using the following CLI command on the Client Accelerator: ``` (config) # policy id ssl backend client-tlss1.2 ``` Note that this command only applies to TLS 1.1 or TLS 1.2. If you need to support older TLS versions, you can use the `ssl backend` command with the `client-tlss1.0` or `client-tlss1.1` option instead.
This response looks rather similar to the actual answer, but it is not entirely correct as these TLS versions are not its default.
Fig. 2 depicts the screenshot of spotlight. The top-left table section displays all columns of the dataframe, and the visualization is shown within the Similarity Map tab view.
You can use the visible column button to control the displayed columns. Sorting the table by "dist" shows the question, answer, and the most relevant document snippets at the top. Looking at the embeddings visualization, nearly all documents appear here as a single cluster. This is probably reasonable, as the original PDF is a deployment guide for a specific product. If we click on the filter icon within the Similarity Map tab, it highlights only the selected document list, which is rather tightly clustered, with the rest greyed out, as shown in Fig. 3.
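As an aside, instead of filtering manually in the UI, you could flag the chunks the retriever actually returns for the question in an extra dataframe column before calling spotlight.show, so the similarity map can highlight them directly. This is only a sketch assuming the objects from the earlier listings (faiss_retriever, df, query) are in scope; the "retrieved" column name and the matching-by-page-content approach are illustrative choices, not part of the original code.
# Sketch: mark which chunks were returned by the retriever for this query.
retrieved_docs = faiss_retriever.get_relevant_documents(query)
retrieved_texts = {d.page_content for d in retrieved_docs}

# New column: True for chunks the retriever selected, False otherwise
# (the question and answer rows have no document text and default to False).
df["retrieved"] = df["document"].apply(
    lambda text: isinstance(text, str) and text in retrieved_texts
)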
3.1 Testing Chunk Size and Overlap Parameters
Since the retriever is a key influencer of RAG performance, let's look at a couple of parameters that influence the embeddings space. Table 1 captures and tabulates the responses of TinyLlama to the same question while the TextSplitter's chunk size (1000, 2000) and/or overlap (100, 200) parameters are varied during document splitting, as sketched below.
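For reference, one way to run this sweep is to re-implement the loading and vectorization steps with the chunking parameters exposed, rebuilding the index for each combination. This is only a sketch: build_index, the per-combination index directories, and the assumption that Table 1 covers the four (chunk size, overlap) combinations are illustrative choices, not part of the original LoadFVectorize module.
# Sketch: rebuild the FAISS index for each (chunk_size, chunk_overlap) combination.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

def build_index(chunk_size: int, chunk_overlap: int) -> FAISS:
    loader = OnlinePDFLoader("https://support.riverbed.com/bin/support/download?did=7q6behe7hotvnpqd9a03h1dji&version=9.15.0")
    documents = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = splitter.split_documents(documents)
    embeddings_model = HuggingFaceEmbeddings()
    db = FAISS.from_documents(docs, embeddings_model)
    db.save_local(f"./opdf_index_{chunk_size}_{chunk_overlap}")  # one index per combination
    return db

for chunk_size, chunk_overlap in [(1000, 100), (1000, 200), (2000, 100), (2000, 200)]:
    db = build_index(chunk_size, chunk_overlap)
    print(chunk_size, chunk_overlap, db.index.ntotal)  # number of chunks per setting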
At first glance, the LLM responses for all combinations appear similar. However, if we compare the correct answer and each response carefully, the accurate answer comes from the combination (1000, 200). The incorrect details in the other responses are highlighted in red. To explain this behavior, Fig. 4 depicts the embeddings map for each combination side-by-side.
Going from left to right with increasing chunk size, the vector space becomes sparser with fewer chunks. Going from bottom to top, where the overlap is doubled, the vector space characteristics do not change dramatically. In all these maps, the entire collection still appears more or less as a single cluster, with only a few outliers. This is clearly reflected in the generated responses, which are rather similar. If the query were located, say, in the center of the cluster, the responses would rather likely change significantly with these parameter changes, as the nearest neighbors would likely differ.
If your RAG application routinely fails to provide an expected answer for certain questions, generating a visualization like the one above with those questions may reveal additional insights on how best to split the corpus to improve the overall performance.
For further illustration, let's visualize a vector space occupied by two Wikipedia documents from unrelated domains. To achieve this, we simply modify the first line of function load_doc within module LoadFVectorize to instantiate a WebBaseLoader with two URLs, one for the Grammy Awards and the other for the JWST telescope, as shown below.
from langchain_community.document_loaders import WebBaseLoader

def load_doc():
    loader = WebBaseLoader(['https://en.wikipedia.org/wiki/66th_Annual_Grammy_Awards','https://en.wikipedia.org/wiki/James_Webb_Space_Telescope'])
    documents = loader.load()
    ...
The rest of the code stays intact. Running this modified code, we get the vector space visualization as depicted in Fig. 5.
As expected, there are two distinct non-overlapping clusters here. If we were to pose a question lying outside either cluster, the context we get from the retriever is, at the very least, not going to be helpful for the LLM, and will most likely be detrimental. Just for fun, I decided to ask the same question posed previously. And sure enough, the LLM started hallucinating:
Client Accelerator 6.3.0 supports the following versions of Transport Layer Security (TLS): 1. TLS 1.2 2. TLS 1.3 3. TLS 1.2 with Extended Validation (EV) certificates 4. TLS 1.3 with EV certificates 5. TLS 1.3 with SHA-256 and SHA-384 hash algorithms …
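One simple guard against this failure mode is to inspect the retrieval distances before handing the context to the LLM. The sketch below uses FAISS's similarity_search_with_score, which returns L2 distances for this index type; the threshold value is an arbitrary assumption that you would tune for your own corpus.
# Sketch: flag questions whose nearest chunks are far from the query embedding.
DIST_THRESHOLD = 1.0  # made-up value for illustration; tune per corpus

docs_and_scores = db.similarity_search_with_score(query, k=3)
best_score = min(score for _, score in docs_and_scores)

if best_score > DIST_THRESHOLD:
    print(f"Nearest chunk distance {best_score:.2f} exceeds threshold; "
          "the corpus likely does not cover this question.")
else:
    result = qa_chain({"query": query})
    print(result["result"])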
In our system design here, we used FAISS for the vector store. If you are using ChromaDB and wondering how to perform a similar visualization, you are in luck. Markus Stoll, one of the developers of library renumics-spotlight, wrote an interesting article about it here. Check it out.
4.0 Final Thoughts
Retrieval-Augmented Generation (RAG) allows us to harness large language model capabilities even when an LLM has not been trained on your internal documents. RAG involves the retrieval of a number of relevant document chunks from a vectorstore, which are then used by the LLM as context for its generation. Accordingly, the quality of your embeddings plays an important role in RAG performance.
In this article, we demonstrated and visualized the impact of a couple of key vectorization parameters on the overall LLM performance. Our LLM of choice was TinyLlama 1.1B Chat due to its significantly smaller resource footprint while still boasting good accuracy. Using library renumics-spotlight, we showed how to represent the entire FAISS vector space with a dataframe, which was then used to visualize the embeddings with a single line of code. Spotlight's intuitive UI helps one explore the vector space in view of the question, which allows for a better understanding of the LLM's responses. By tweaking certain vectorization parameters, we are able to influence its generation behavior for improved accuracy.
Thank you for reading!
References
1. https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
2. https://github.com/Renumics/spotlight
3. https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF