Kennedy Selvadurai, PhD

Summary

The article discusses visualization of the FAISS vector space to understand its influence on Retrieval-Augmented Generation (RAG) performance, using TinyLlama 1.1B Chat for experimentation and the renumics-spotlight library for 2D visualization of embeddings.

Abstract

The article explores the relationship between vector space representation using FAISS and the accuracy of responses in RAG applications. It demonstrates how to visualize high-dimensional embeddings in 2D using the renumics-spotlight library, allowing for the analysis of how different vectorization parameters, such as chunk size and overlap, can affect RAG performance. The experiments are conducted using TinyLlama 1.1B Chat, a compact model with efficient resource usage. The article also provides detailed instructions for setting up the environment, designing and implementing the system, and executing a test run, including the visualization and interpretation of the results in the context of a real-world question-answering scenario. By examining the vector space and the clustering of documents, the authors show how to improve RAG response accuracy by fine-tuning vectorization parameters.

Opinions

  • The author believes that the performance of RAG can be significantly improved by carefully selecting vectorization parameters when dealing with internal documents.
  • The author sees TinyLlama 1.1B Chat as an advantageous choice for rapid experimentation due to its small resource footprint and high accuracy.
  • The visualization of embeddings with renumics-spotlight is highly recommended by the author as it provides valuable insights into the behavior of RAG systems.
  • There is an opinion that traditional LLMs may "hallucinate" irrelevant, fabricated, or inconsistent content when queried about documents not seen during training.
  • The author suggests that the specific choice of chunk size and overlap in text splitting has a noticeable impact on the quality of RAG responses.
  • It is implied that visualization tools like renumics-spotlight can be instrumental in identifying the reasons behind suboptimal RAG performance when the correct answer is not provided.

Visualizing FAISS Vector Space to Understand its Influence on RAG Performance

Visualizing embeddings using renumics-spotlight reveals useful insights into RAG generation behavior.

Generated on Canva, as prompted by author

With the ever-improving performance of open-source large language models, they are finding their way into various applications, including writing and analyzing code, recommendations, text summarization and question-answering (QA). When it comes to QA, LLMs typically fall short on questions related to documents not used during their training. And there are many such internal documents that should remain behind corporate walls to ensure compliance, protect trade secrets or preserve privacy. When queried about such documents, LLMs are known to hallucinate, producing content that is irrelevant, fabricated or inconsistent.

One available technique to deal with this challenge is Retrieval-Augmented Generation (RAG). It enhances the LLM response by referencing an authoritative knowledge base outside its training data sources prior to response generation. A RAG application consists of a retriever system to fetch relevant document snippets from a corpus, and an LLM to generate responses using the retrieved snippets as context. Naturally, the quality of the corpus and its representation in the vector space, termed embeddings, plays a significant role in RAG accuracy.

In this article, let’s look at how to visualize the multi-dimensional embeddings of the FAISS vector space in 2-D using the visualization library renumics-spotlight. We will look for opportunities to improve RAG response accuracy by varying certain key vectorization parameters. For our LLM of choice, we will be adopting TinyLlama 1.1B Chat, a compact model with the same architecture and tokenizer as Llama 2 [1]. It has the advantages of a significantly smaller resource footprint and faster run time, without a proportional drop in accuracy. This makes it ideal for rapid experimentation.

Table of Contents
1.0 Environment Setup
2.0 Design and Implementation
  2.1 Module LoadFVectorize
  2.2 The main Module
3.0 Test Run
  3.1 Testing Chunk Size and Overlap Parameters
4.0 Final Thoughts

1.0 Environment Setup

This experimentation will be conducted on a MacBook Air M1 with 8GB RAM. The version of Python used here is 3.10.5. Initially, let’s create a virtual environment to manage this project. To create and activate the environment, let’s run the following:

python3.10 -m venv mychat
source mychat/bin/activate

The renumics-spotlight library uses UMAP-like dimensionality reduction to bring the high-dimensional embeddings down to a more manageable 2D visualization while preserving key properties [2]. Let’s proceed to install all the required libraries:

pip install langchain faiss-cpu sentence-transformers flask-sqlalchemy psutil unstructured pdf2image unstructured_inference pillow_heif opencv-python pikepdf pypdf
pip install renumics-spotlight
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

The last line above installs the llama-cpp-python library with Metal support, which will be used to load TinyLlama with hardware acceleration on the M1 processor. With Metal, the computation runs on the GPU.
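As an aside, if you want an intuition for what such a UMAP-like reduction does independently of Spotlight, below is a minimal sketch using the umap-learn package. This package is not part of the installs above and the snippet is purely illustrative; Spotlight performs an equivalent step internally.

# illustrative only: reduce a matrix of embeddings to 2D, similar in spirit
# to what spotlight does internally (assumes `pip install umap-learn`)
import numpy as np
import umap

embeddings = np.random.rand(500, 768).astype("float32")  # stand-in for real embeddings
reducer = umap.UMAP(n_components=2, random_state=42)
coords_2d = reducer.fit_transform(embeddings)             # shape: (500, 2)
print(coords_2d.shape)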

Since we have the environment ready, let’s take a look at the system design followed by its implementation.

2.0 Design and Implementation

There are two modules for this QA system as illustrated in Fig. 1.

Fig. 1. System Architecture. Image by author

Module LoadFVectorize is responsible for loading PDF or web documents. For the initial test and visualization, the document of interest is a 440-page vendor deployment guide that was released recently (Dec 2023) and is quite unlikely to have been seen by the LLM during its training. This module also handles the splitting and vectorization of the documents.

The second module involves loading the LLM and instantiating a FAISS retriever, followed by the creation of a retrieval chain encompassing the LLM, the retriever and a custom prompt for questioning. Finally, it launches the vector space visualization.

The details of both modules are described below.

2.1 Module LoadFVectorize

This module comprises 3 functions:

  1. Function load_doc handles the loading of an online PDF document, splits it into chunks of 512 characters with an overlap of 100 characters, and returns the document list.
  2. Function vectorize calls the above function load_doc to get the chunked list of documents, creates the embeddings, commits them to the local directory opdf_index, and returns the FAISS instance.
  3. Function load_db checks whether a FAISS vectorstore exists on disk within directory opdf_index and attempts to load it. Otherwise, it invokes function vectorize to load and vectorize the document. It finally returns a FAISS object.

The full listing of this module’s code is shown below.

# LoadFVectorize.py

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# access an online pdf
def load_doc() -> 'List[Document]':
    loader = OnlinePDFLoader("https://support.riverbed.com/bin/support/download?did=7q6behe7hotvnpqd9a03h1dji&version=9.15.0")
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)
    return docs

# vectorize and commit to disk
def vectorize(embeddings_model) -> 'FAISS':
    docs = load_doc()
    db = FAISS.from_documents(docs, embeddings_model)
    db.save_local("./opdf_index")
    return db

# attempts to load vectorstore from disk
def load_db() -> 'FAISS':
    embeddings_model = HuggingFaceEmbeddings()
    try:
        db = FAISS.load_local("./opdf_index", embeddings_model)
    except Exception as e:
        print(f'Exception: {e}\nNo index on disk, creating new...')
        db = vectorize(embeddings_model)
    return db

2.2 The main Module

The main module initially defines the prompt template for TinyLlama in the following format: <|system|>{context}</s><|user|>{question}</s><|assistant|>

To further reduce the LLM memory footprint, we will adopt a quantized version of TinyLlama from TheBloke’s HuggingFace repo [3], which uses fewer bits for the model’s parameters. If you are interested in additional background on this LLM, or in further details on the enabling technologies, feel free to check out our earlier article. To load the quantized LLM in the GGUF format, LlamaCpp is used. Using the FAISS object returned by the previous module, a FAISS retriever is created. With the above objects, the RetrievalQA chain is then instantiated and used for questioning.

The following code extract captures these steps.

# main.py
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_community.embeddings import HuggingFaceEmbeddings
import LoadFVectorize
from renumics import spotlight
import pandas as pd
import numpy as np

# Prompt template 
qa_template = """<|system|>
You are a friendly chatbot who always responds in a precise manner. If answer is 
unknown to you, you will politely say so.
Use the following context to answer the question below:
{context}</s>
<|user|>
{question}</s>
<|assistant|>
"""

# Create a prompt instance 
QA_PROMPT = PromptTemplate.from_template(qa_template)
# load LLM
llm = LlamaCpp(
    model_path="./models/tinyllama_gguf/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf",
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    verbose=False,
    n_ctx=2048
)
# vectorize and create a retriever
db = LoadFVectorize.load_db()
faiss_retriever = db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}, max_tokens_limit=1000)
# Define a QA chain 
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=faiss_retriever,
    chain_type_kwargs={"prompt": QA_PROMPT}
)

query = 'What versions of TLS supported by Client Accelerator 6.3.0?'

result = qa_chain({"query": query})
print(f'--------------\nQ: {query}\nA: {result["result"]}')

visualize_distance(db,query,result["result"])

The vector space visualization itself is handled by the last line in the above code listing, visualize_distance, which is also defined in this module.

In function visualize_distance, we first access the FAISS object’s attribute __dict__, a dictionary of its instance variables. This gives us access to the docstore. The instance variable index_to_docstore_id is itself a dictionary mapping index positions to docstore ids. The total number of documents used for vectorization is given by the index object’s attribute ntotal.

    vs = db.__dict__.get("docstore")
    index_list = db.__dict__.get("index_to_docstore_id").values()
    doc_cnt = db.index.ntotal

To obtain an approximate reconstruction of the vector space, we simply invoke the index object’s method reconstruct_n with default parameters:

    embeddings_vec = db.index.reconstruct_n()
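The reconstructed array contains one row per stored vector. As an optional sanity check (my addition, not part of the original listing), its shape can be verified against the index dimensionality:

    # one reconstructed embedding per chunk, dimensionality taken from the index
    assert embeddings_vec.shape == (doc_cnt, db.index.d)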

Since we have the list of docstore-ids as index_list, let’s look up each corresponding document object and use it to build a list of lists containing the docstore-id, document metadata, document content, and its embedding within the vector space, as per the following listing:

    doc_list = list()
    for i, doc_id in enumerate(index_list):
        a_doc = vs.search(doc_id)
        doc_list.append([doc_id, a_doc.metadata.get("source"), a_doc.page_content, embeddings_vec[i]])

The list of lists is then used to create a dataframe with the column headers, which will subsequently be used in a spotlight call to create the visualization.

    df = pd.DataFrame(doc_list,columns=['id','metadata','document','embedding'])

Before we proceed with the visualization, we need to find a way to incorporate the question and answer. Separate dataframes for the question and the answer are created and then merged with the main dataframe, so that both can be shown alongside the rest of the document chunks in the vector space:

    # add rows for question and answer
    embeddings_model = HuggingFaceEmbeddings()
    question_df = pd.DataFrame(
        {
            "id": "question",
            "question": question,
            "embedding": [embeddings_model.embed_query(question)],
        })
    answer_df = pd.DataFrame(
        {
            "id": "answer",
            "answer": answer,
            "embedding": [embeddings_model.embed_query(answer)],
        })
    df = pd.concat([question_df, answer_df, df])

To find the Euclidean distance between the question and the documents in this space, we create the embedding for the question and then use numpy’s linalg.norm on the difference between each document embedding and the question embedding:

    question_embedding = embeddings_model.embed_query(question)
    # add column for vector distance
    df["dist"] = df.apply(                                                                                                                                                                         
        lambda row: np.linalg.norm(
            np.array(row["embedding"]) - question_embedding
        ),axis=1,)

Once we have the dataframe, we simply call spotlight.show to generate the visualization.

spotlight.show(df)

Voila! This launches spotlight in a browser window with the pandas dataframe ready for exploration.
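For reference, putting the fragments above together, the whole function might look like the sketch below. This is a reconstruction assembled from the pieces shown in this section rather than the author’s verbatim listing; it relies on the imports already at the top of main.py.

# reconstruction of visualize_distance, assembled from the fragments above
def visualize_distance(db, question: str, answer: str) -> None:
    # access the docstore and the index-to-docstore-id mapping
    vs = db.__dict__.get("docstore")
    index_list = db.__dict__.get("index_to_docstore_id").values()
    doc_cnt = db.index.ntotal
    # approximate reconstruction of all stored vectors
    embeddings_vec = db.index.reconstruct_n()
    # collect id, source, content and embedding for every chunk
    doc_list = list()
    for i, doc_id in enumerate(index_list):
        a_doc = vs.search(doc_id)
        doc_list.append([doc_id, a_doc.metadata.get("source"),
                         a_doc.page_content, embeddings_vec[i]])
    df = pd.DataFrame(doc_list, columns=['id', 'metadata', 'document', 'embedding'])
    # add rows for the question and the answer
    embeddings_model = HuggingFaceEmbeddings()
    question_df = pd.DataFrame({
        "id": "question",
        "question": question,
        "embedding": [embeddings_model.embed_query(question)],
    })
    answer_df = pd.DataFrame({
        "id": "answer",
        "answer": answer,
        "embedding": [embeddings_model.embed_query(answer)],
    })
    df = pd.concat([question_df, answer_df, df])
    # Euclidean distance of every row's embedding to the question embedding
    question_embedding = embeddings_model.embed_query(question)
    df["dist"] = df.apply(
        lambda row: np.linalg.norm(np.array(row["embedding"]) - question_embedding),
        axis=1,
    )
    spotlight.show(df)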

3.0 Test Run

For our first experiment, below is the sample question that was chosen:

What versions of TLS supported by Client Accelerator 6.3.0?

As per the adopted document, the correct answer is:

Client Accelerator 6.3.0 supports TLS 1.1 or 1.2.

Additionally, the following extra information may be included in the response.

You must enable this capability using the following CLI command on the Client Accelerator:

(config) # policy id ssl backend client-tls-1.2

Even though TLS 1.1 is not mentioned in the syntax of this command, using this command automatically enables support for both TLS 1.1 and 1.2.

And below is the response from TinyLlama to the above question:

Client Accelerator 6.3.0 supports TLS 1.1 or 1.2 as the default supported TLS versions with Client Accelerator. You must enable this capability using the following CLI command on the Client Accelerator: ``` (config) # policy id ssl backend client-tlss1.2 ``` Note that this command only applies to TLS 1.1 or TLS 1.2. If you need to support older TLS versions, you can use the `ssl backend` command with the `client-tlss1.0` or `client-tlss1.1` option instead.

This response looks rather similar to the actual answer, but it is not entirely correct, as these TLS versions are not its defaults.

Fig. 2 depicts the screenshot of spotlight. The top-left table section displays all columns of the dataframe and the visualization is shown within the Similarity Map tab view.

Fig.2. Screenshot of Spotlight UI. Image by author

You can use the visible columns button to control which columns are displayed. Sorting the table by “dist” shows the question, answer, and the most relevant document snippets at the top. The embeddings visualization depicts nearly all documents here as a single cluster. This is probably reasonable, as the original PDF is a deployment guide for a specific product. If we click on the filter icon within the Similarity Map tab, it highlights only the selected document list, which is rather tightly clustered, with the rest greyed out, as shown in Fig. 3.

Fig. 3. Filtered visualization of the question, answer and top-10 documents. Image by author
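If you prefer to check the same ranking programmatically rather than through the UI, the dataframe assembled earlier can be sorted on the dist column directly. A purely illustrative one-liner, assuming df is still in scope:

    # ten nearest rows to the question by Euclidean distance
    print(df.sort_values("dist").head(10)[["id", "dist", "document"]])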

3.1 Testing Chunk Size and Overlap Parameters

Since the retriever is a key influencer of RAG performance, let’s look at a couple of parameters that influence the embedding space. Table 1 tabulates the responses of TinyLlama to the same question while the TextSplitter’s chunk size (1000, 2000) and/or overlap (100, 200) parameters are varied during document splitting.

Table 1. LLM response vs splitter chunk size and overlap

At first glance, the LLM responses for all combinations appear similar. However, if we compare the correct answer and each response carefully, the accurate answer comes from combination (1000, 200). The incorrect details in the other responses are highlighted in red. To explain this behavior, Fig. 4 depicts the embeddings map for each combination side-by-side.

Fig. 4. Embedding space visualization vs splitter chunk size and overlap. Image by author

Going from left to right with increasing chunk size, we see the vector space becoming sparser with fewer chunks. Going from bottom to top, where the overlap was doubled, the vector space characteristics did not change dramatically. In all these maps, the entire collection still appears more or less as a single cluster, with only a few outliers. This is reflected in the generated responses, which are rather similar. Had the query been located, say, in the center of the cluster, the responses would likely have changed significantly as these parameters changed, since the nearest neighbors would likely differ.

If your RAG application is routinely not providing an expected answer for certain questions, generating a visualization such as above with those questions may reveal additional insights on how best to split the corpus to improve the overall performance.
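To run such an experiment yourself, the splitter parameters only need to be exposed in load_doc. A minimal sketch is shown below, assuming a parameterised variant of the function from module LoadFVectorize; the extra arguments are my addition, not the author’s original signature.

# hypothetical parameterised variant of load_doc for the chunking experiments
from langchain_community.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_doc(chunk_size: int = 512, chunk_overlap: int = 100) -> 'List[Document]':
    loader = OnlinePDFLoader("https://support.riverbed.com/bin/support/download?did=7q6behe7hotvnpqd9a03h1dji&version=9.15.0")
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(documents)

# rebuild the index and repeat the query for each (chunk_size, overlap) pair of Table 1
for size, overlap in [(1000, 100), (1000, 200), (2000, 100), (2000, 200)]:
    docs = load_doc(chunk_size=size, chunk_overlap=overlap)
    # ... vectorize, query and visualize as in the earlier sections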

And for further illustration, let’s visualize the vector space occupied by two Wikipedia documents from unrelated domains. To achieve this, we simply modify the first line of function load_doc within module LoadFVectorize to instantiate a WebBaseLoader with two URLs, one for the Grammy Awards and the other for the JWST telescope, as shown below.

from langchain_community.document_loaders import WebBaseLoader

def load_doc() -> 'List[Document]':
    loader = WebBaseLoader(['https://en.wikipedia.org/wiki/66th_Annual_Grammy_Awards','https://en.wikipedia.org/wiki/James_Webb_Space_Telescope'])
    documents = loader.load()
    ...

The rest of the code stays intact. Running this modified code, we get the vector space visualization as depicted in Fig. 5.

Fig. 5. Embeddings visualization for documents from unrelated domains. Image by author

As expected, there are two distinct non-overlapping clusters here. If we were to pose a question lying outside of either cluster, the context we get back from the retriever is not only unlikely to be helpful to the LLM, but will most likely be detrimental. And just for fun, I decided to ask the same question posed previously. Sure enough, the LLM started hallucinating:

Client Accelerator 6.3.0 supports the following versions of Transport Layer Security (TLS): 1. TLS 1.2 2. TLS 1.3 3. TLS 1.2 with Extended Validation (EV) certificates 4. TLS 1.3 with EV certificates 5. TLS 1.3 with SHA-256 and SHA-384 hash algorithms …

In our system design here, we used FAISS for the vector store. If you are using ChromaDB and wondering how to perform a similar visualization, you are in luck. Markus Stoll, one of the developers of the renumics-spotlight library, wrote an interesting article about it here. Check it out.

4.0 Final Thoughts

Retrieval-Augmented Generation (RAG) allows us to harness the capabilities of large language models even when the LLM was not trained on our internal documents. RAG involves retrieving a number of relevant document chunks from a vectorstore, which are then used by the LLM as context for its generation. Accordingly, the quality of your embeddings plays an important role in RAG performance.

In this article, we demonstrated and visualized the impact of a couple of key vectorization parameters on overall LLM performance. Our LLM of choice was TinyLlama 1.1B Chat, owing to its significantly smaller resource footprint while still offering good accuracy. Using the renumics-spotlight library, we showed how to represent the entire FAISS vector space with a dataframe, which was then used to visualize the embeddings with a single line of code. Spotlight’s intuitive UI helps one explore the vector space in relation to the question, which allows for a better understanding of the LLM’s responses. By tweaking certain vectorization parameters, we were able to influence its generation behavior for improved accuracy.

Thank you for reading!

References

1. https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
2. https://github.com/Renumics/spotlight
3. https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
