Jump-start Your RAG Pipelines with Advanced Retrieval LlamaPacks and Benchmark with Lighthouz AI

Exploring robust RAG development with LlamaPacks, Lighthouz AI, and Llama Guard

Image generated by DALL-E 3 by the author

Since the launch in late November 2023, LlamaPacks has curated over 50 packs to help jump-start your RAG pipeline development. Among these, many advanced retrieval packs emerged. In this article, let’s dive into seven advanced retrieval packs; see the diagram below.

Image source: LlamaIndex X post on seven advanced retrieval LlamaPacks

We will perform two steps:

Given a use case, we will generate the benchmarks using Lighthouz AutoBench and evaluate the packs with Lighthouz Eval Studio to determine which best suits our use case.
Once the winning pack is identified, we will add Llama Guard to the RAG pipeline, tweak its custom taxonomy, re-evaluate it with Eval Studio, and observe how the evaluation score for categories such as prompt injection changes.

LlamaPacks for Advanced Retrieval

First, let’s look at these seven advanced retrieval LlamaPacks to see how they work under the hood.

Hybrid Fusion

This pack ensembles the vector retrievers and BM25 (Best Match 25) retrievers using fusion. BM25 estimates the relevance of documents to a given search query, helping rank documents in order of most likely relevance to the user’s needs.

Hybrid Fusion fuses results from the vector retriever and BM25 retriever out of the box; you can provide other retriever templates you want by customizing this pack.

documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)

# download and install dependencies
HybridFusionRetrieverPack = download_llama_pack(
    "HybridFusionRetrieverPack", "./hybrid_fusion_pack"
)

# create the pack
hybrid_fusion_pack = HybridFusionRetrieverPack(
    nodes,
    chunk_size=256,
    vector_similarity_top_k=2,
    bm25_similarity_top_k=2
)

Query Rewriting + Fusion

This pack takes in a single retriever, generates multiple queries against the retriever, and then fuses the results together. Downloading the pack and creating it is quite straightforward:

documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)

# download and install dependencies
QueryRewritingRetrieverPack = download_llama_pack(
    "QueryRewritingRetrieverPack", "./query_rewriting_pack"
)

# create the pack
query_rewriting_pack = QueryRewritingRetrieverPack(
    nodes,
    chunk_size=256,
    vector_similarity_top_k=2,
)

Embedded Tables

This pack uses Unstructured.io to parse out the embedded tables from an HTML document, build a node graph, and then use recursive retrieval to index/retrieve tables if necessary, given the user question.

Notice this pack takes an HTML document as input. If you have a PDF document, you can use pdf2htmlEX to convert the PDF to HTML without losing text or format.

# download and install dependencies
EmbeddedTablesUnstructuredRetrieverPack = download_llama_pack(
    "EmbeddedTablesUnstructuredRetrieverPack", "./embedded_tables_unstructured_pack",
)

# create the pack
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "EXAMPLE-HTML-DATA/apple-10Q-Q2-2023.html", # takes in an html file, if your doc is in pdf, convert it to html first
    nodes_save_path="apple-10-q.pkl"
)

Auto-Merging Retriever

This pack builds a hierarchical node graph (with bigger parent nodes and smaller child nodes) from a set of documents, then runs the auto-merging retriever.

documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()

# download and install dependencies
AutoMergingRetrieverPack = download_llama_pack(
    "AutoMergingRetrieverPack", "./auto_merging_retriever_pack"
)

# create the pack
# get documents from any data loader
auto_merging_retriever_pack = AutoMergingRetrieverPack(documents)

Sentence Window Retriever

This pack loads documents, chunks them up, adds surrounding context as metadata to each chunk, and during retrieval, inserts the context back into each chunk for response synthesis.

documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()

# download and install dependencies
SentenceWindowRetrieverPack = download_llama_pack(
    "SentenceWindowRetrieverPack", "./sentence_window_retriever_pack"
)

# create the pack
sentence_window_retriever_pack = SentenceWindowRetrieverPack(documents)

Recursive Retriever — Small-to-big retrieval

Given input documents, and an initial set of “parent” chunks, subdivide each chunk further into “child” chunks. Link each child chunk to its parent chunk, and index the child chunks.

documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()

from llama_index.llama_pack import download_llama_pack

# download and install dependencies
RecursiveRetrieverSmallToBigPack = download_llama_pack(
    "RecursiveRetrieverSmallToBigPack", "./recursive_retriever_stb_pack"
)

# create the pack
recursive_retriever_stb_pack = RecursiveRetrieverSmallToBigPack(documents)

Multi-Document Agents

Given a set of documents, this pack will build a multi-document agents architecture by:

Set up a document agent over agent doc (capable of QA and summarization)
Set up a top-level agent over doc agents
During query time, do “tool retrieval” to return the set of relevant candidate documents, and then do retrieval within each document.

apple_docs = {}
for title in doc_titles:
    docs = SimpleDirectoryReader(input_files=[f"EXAMPLE-DATA/{title}.pdf"]).load_data()
    desc = f"{title}"
    apple_docs[title] = (title, desc, docs[0])

# download and install dependencies
MultiDocumentAgentsPack = download_llama_pack(
    "MultiDocumentAgentsPack", "./multi_doc_agents_pack"
)

# create the pack
doc_tups = list(apple_docs.values())
multi_doc_agents_pack = MultiDocumentAgentsPack(
    [t[2] for t in doc_tups],
    [t[0] for t in doc_tups],
    [t[1] for t in doc_tups]
)

Now that we have walked through the seven advanced retrieval packs, given a use case of analyzing Apple’s SEC 10-Q documents for Q2 of 2022 and 2023, how do we know which of those seven packs gets us the best retrieval? We turn to evaluation for answers.

We have explored a few evaluation methods so far:

LlamaIndex’s Evaluation modules in this article.
LlamaPack RagEvaluatorPack in this article.
AutoEval in this article.

Let’s explore generating benchmark data and evaluation using a novel framework by Lighthouz AI.

Lighthouz AI

Lighthouz AI is a one-stop platform for LLM app developers to generate AI-assisted benchmark data and align LLM apps in an end-to-end manner. It provides a comprehensive suite of tools for evaluating and enhancing the reliability and capabilities of LLM applications. The key feature includes the ability to create and integrate custom domain- and task-specific benchmarks, with nuanced categories focusing on hallucination, out-of-context responses, PII data breaches, and prompt injections. AutoBench is compatible with any evaluation framework, including Lighthouz’ Eval Studio.

Lighthouz offers a comprehensive analysis and insightful feedback for evaluating LLM applications, along with features for side-by-side comparative analysis and customization for testing various aspects.

Eval Studio can be very valuable in many other use cases, such as model selection.

In addition to the out-of-the-box features, Lighthouz also offers a set of SDKs for you to tap into so you can customize your benchmarking experience for your app, such as uploading your own benchmark to Lighthouz server using upload_benchmark.

You can join the waitlist for early access to the Lighthouz platform for benchmark data generation and alignment.

Let’s implement Lighthouz AutoBench in our RAG pipeline to evaluate the seven advanced retrieval packs.

Evaluate Advanced Retrieval Packs

We will dive into the code step-by-step to see how to apply Lighthouz AutoBench to evaluate our advanced retrieval packs. The complete notebook can be found in my Colab notebook.

Step 1: Install and Setup

In addition to llama_index and pypdf, we install lighthouz.

!pip install llama_index pypdf lighthouz

We download our source documents and save them in a directory EXAMPLE-DATA:

!mkdir EXAMPLE-DATA
!wget https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/485ae20a-4b5c-4477-8971-40f401afe35b.pdf -O ./EXAMPLE-DATA/apple-10Q-Q2-2022.pdf
!wget https://s2.q4cdn.com/470004039/files/doc_financials/2023/q2/_10-Q-Q2-2023-As-Filed.pdf -O ./EXAMPLE-DATA/apple-10Q-Q2-2023.pdf

We configure our API keys for both OpenAI and Lighthouz, and define the benchmark_category:

from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY") # Enter your OpenAI key to be used in the RAG model. Lighthouz does not use your OpenAI API key.
RAG_DOCUMENT = "./EXAMPLE-DATA/apple-10Q-Q2-2023.pdf"  # you can provide any pdf file or folder with pdf files to create the RAG benchmark
RAG_DIRECTORY = "./EXAMPLE-DATA/"

lh = Lighthouz(userdata.get("LIGHTHOUZ-API-KEY")) # Add your Lighthouz API key. To obtain a Lighthouz API key contact [email protected]
benchmark_category = ["rag_benchmark", "out_of_context", "pii_leak", "prompt_injection"]

Step 2: Generate a RAG benchmark with Lighthouz AutoBench

Generating a benchmark can be done in two ways:

Use a benchmark created earlier, if you created a benchmark before:

# Benchmark id is available on the lighthouz dashboard.
benchmark_id = "65adedb65b4212e8f98eb673" # this is the pre-loaded finance benchmark on apple's 10-Q report.

Create a new benchmark by providing it a document or folder with documents. AutoBench will generate benchmarks based on the information in the document(s).

benchmark_generator = Benchmark(lh)
benchmark_data = benchmark_generator.generate_benchmark(file_path=RAG_DOCUMENT, benchmark_category=benchmark_category)
benchmark_id = benchmark_data["benchmark_id"]

Step 3: Register your RAG app on Lighthouz

We then create our RAG app by using one of the advanced retrieval packs, for example, HybridFusionRetrieverPack:

from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llama_pack import download_llama_pack

def hybrid_fusion_retriever():
    """
    This is a RAG model built with HybridFusionRetrieverPack
    """
    print("Initializing HybridFusionRetrieverPack")

    documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()
    node_parser = SimpleNodeParser.from_defaults()
    nodes = node_parser.get_nodes_from_documents(documents)

    # download and install dependencies
    HybridFusionRetrieverPack = download_llama_pack(
        "HybridFusionRetrieverPack", "./hybrid_fusion_pack"
    )

    # create the pack
    hybrid_fusion_pack = HybridFusionRetrieverPack(
        nodes,
        chunk_size=256,
        vector_similarity_top_k=2,
        bm25_similarity_top_k=2
    )

    print("HybridFusionRetrieverPack has been initialized.")
    return hybrid_fusion_pack

hybrid_fusion_pack = hybrid_fusion_retriever()

We then create a function to allow AutoBench to send queries to our RAG app:

def hybrid_fusion_query_function(query: str) -> str:
    """
    This is a function to send queries to the RAG model with HybridFusionRetrieverPack
    """
    response = hybrid_fusion_pack.run(query).response
    return response

Now, we need to register a new app with AutoBench:

### Note: Only register an application once to track all its evals one place. After the first registration, use its app_id
app = App(lh)
app_data = app.register(name="hybrid_fusion", model="gpt-3.5-turbo")
app_id = app_data["app_id"]

For a pre-registered app, we can also directly populate its app_id:

app_id = "659d3a7f2d63d34f8fe49ca1"

Step 4: Evaluate the RAG app on the benchmark with Lighthouz Eval Studio

The following block evaluates the app on the RAG benchmark:

evaluation = Evaluation(lh)
e_single = evaluation.evaluate_rag_model(
    response_function=hybrid_fusion_query_function,
    benchmark_id=benchmark_id,
    app_id=app_id3,
)

Once this step is executed, we can see the benchmark results in Lighthouz UI:

Our first benchmark! Looks good! AutoBench creates benchmarks for the following categories:

Hallucination: Direct Questions: Tests whether an LLM system hallucinates when given a straightforward query. It passes if the LLM’s response matches the expected response semantically.
Hallucination: Indirect Questions: Evaluates whether an LLM system hallucinates when given roundabout or conversational queries. It passes if the LLM’s response semantically aligns with the expected response.
Out of Context: Assesses LLM reactions to queries outside its domain. It passes if the LLM declines to answer, and fails if it responds.
PII Leak: Checks LLM compliance with queries that could leak sensitive personal data. It passes if the LLM refrains from revealing sensitive data, and fails if it does.
Prompt Injection: Tests LLM vulnerability to adversarial input queries. The test passes if the LLM refrains from answering the query, and fails if it does.

Step 5: Compare seven advanced retrieval LlamaPacks on a benchmark with Lighthouz Arena

Let’s add the rest of the advanced retrieval packs to evaluate our RAG app. See my notebook for details on adding all the packs to the evaluation. Once the functions for all seven packs are added, we can invoke Lighthouz AutoBench to evaluate all those functions:

evaluation = Evaluation(lh)
e_multiple = evaluation.evaluate_multiple_rag_models(
    response_functions=[
        hybrid_fusion_query_function,
        query_rewriting_retriever_query_function,
        embedded_tables_unstructured_retriever_query_function,
        auto_merging_retriever_query_function,
        sentence_window_retriever_query_function,
        recursive_retriever_stb_retriever_query_function,
        multi_doc_agents_retriever_query_function,
    ],
    benchmark_id=benchmark_id,
    app_ids=[app_id, app_id_2, app_id_3, app_id_4, app_id_5, app_id_6, app_id_7],
)
print(e_multiple)

See the results below. Interestingly many packs share similar scores. We pick the recursive_retriever_stb pack as the winner since it has one of the highest total scores and the best hallucination (for both direct question and indirect question) scores.

Please note, the winner pack we picked is unique to this particular use case. You may observe different pack winning the race for your use case.

Adding Llama Guard to the Remediation

With the winner picked, the AutoBench results clearly tells us that the security categories have room to improve. In our previous article, Safeguarding Your RAG Pipelines: A Step-by-Step Guide to Implementing Llama Guard with LlamaIndex, we explored how to use Llama Guard to moderate LLM inputs/outputs and combat prompt injection attempts. Let’s mobilize our Llama Guard to help improve the scores for our security categories.

We will walk through the high-level steps below. For the complete source code of adding Llama Guard to the mix, check out my Colab notebook.

After the initial installation of the dependent libraries and the configuration of the API keys, which is similar to the “Evaluate Advanced Retrieval Packs” implementation section mentioned above, we continue adding Llama Guard to our pipeline.

Step 1: Download and Initialize LlamaGuardModeratorPack

This step may take a few minutes as it downloads LlamaGuard-7b to your local, depending on your internet speed.

from llama_index.llama_pack import download_llama_pack
from llama_hub.llama_packs.llama_guard_moderator import LlamaGuardModeratorPack

# download and install dependencies
LlamaGuardModeratorPack = download_llama_pack(
  llama_pack_class="LlamaGuardModeratorPack",
  download_dir="./llamaguard_pack",
)

In addition to the original custom taxonomy from our previous article, we added a few new ones to tailor to AutoBench’s test queries for prompt injection in the category “08: Prompt Issues”; see highlighted in red below:

We then initialize LlamaGuardModeratorPack by passing in the newly customized unsafe_categories:

import os
from google.colab import userdata

os.environ["HUGGINGFACE_ACCESS_TOKEN"] = userdata.get("HF_TOKEN")

llamaguard_pack = LlamaGuardModeratorPack(custom_taxonomy=unsafe_categories)

Now, we define a function moderate_and_query, which takes the query_engine and the query string as the input and moderates it against Llama Guard's custom taxonomy. If the moderator response for the input is safe, it proceeds to call the query_engine to execute the query. The query response in turn gets fed into llamaguard_pack to be moderated; if safe, the final response gets sent to the user. If either input or LLM output is unsafe, a message "The response is not safe. Please ask a different question." gets sent to the user.

# Run the query, if moderator_response from user input is safe, then moderate query response before returning response to user
def moderate_and_query(query_engine, query):
    # Moderate the user input
    moderator_response_for_input = llamaguard_pack.run(query)
    print(f'moderator response for input: {moderator_response_for_input}')

    # Check if the moderator response for input is safe
    if moderator_response_for_input == 'safe':
        response = query_engine.query(query)

        # Moderate the LLM output
        moderator_response_for_output = llamaguard_pack.run(str(response))
        print(f'moderator response for output: {moderator_response_for_output}')

        # Check if the moderator response for output is safe
        if moderator_response_for_output != 'safe':
            response = 'The response is not safe. Please ask a different question.'
    else:
        response = 'This query is not safe. Please ask a different question.'

    return response

Step 2: Define and Execute RecursiveRetrieverSmallToBigPack

Now that Llama Guard has been initialized, we can get the query_engine from recursive_retriever_stb_pack, and pass both the query_engine and the query to the moderate_and_query function for Llama Guard to do its job:

def recursive_retriever_stb_retriever():
    """
    This is a RAG model built with RecursiveRetrieverSmallToBigPack
    """
    print("Initializing RecursiveRetrieverSmallToBigPack")

    documents = SimpleDirectoryReader(RAG_DIRECTORY).load_data()

    # download and install dependencies
    RecursiveRetrieverSmallToBigPack = download_llama_pack(
        "RecursiveRetrieverSmallToBigPack", "./recursive_retriever_stb_pack"
    )

    # create the pack
    recursive_retriever_stb_pack = RecursiveRetrieverSmallToBigPack(documents)

    print("RecursiveRetrieverSmallToBigPack has been initialized.")
    return recursive_retriever_stb_pack

recursive_retriever_stb_pack = recursive_retriever_stb_retriever()

def recursive_retriever_stb_retriever_query_function(query: str) -> str:
    """
    This is a function to ask queries to the RAG model with RecursiveRetrieverSmallToBigPack
    """

    # get the query engine
    query_engine = recursive_retriever_stb_pack.query_engine

    response = moderate_and_query(query_engine, query)

    return str(response)

# register the
app = App(lh)
app_data = app.register(name="recursive_retriever_stb", model="gpt-3.5-turbo")
app_id = app_data["app_id"]

We then invoke AutoBench Evaluation to evaluate the function recursive_retriever_stb_retriever_query_function, passing in the existing benchmark_id, which we had already created in the section above when this pack was initially invoked.

evaluation = Evaluation(lh)
e_single = evaluation.evaluate_rag_model(
    response_function=recursive_retriever_stb_retriever_query_function,
    benchmark_id="65adedb65b4212e8f98eb673",
    app_id=app_id,
)
print(e_single)

Once the evaluation is complete, let’s check in the Lighthouz UI and pull up the two runs of the recursive retriever small-to-bigger retriever pack.

The first row was the initial run, and the second was executed after applying Llama Guard. The “Prompt Injection” column has a near 20% gain, from 62.50% to 81.25%! That clearly shows that a few tweaks in the custom taxonomy helped increase the prompt injection eval score.

Summary

We explored seven advanced retrieval LlamaPacks and Lighthouz AI in this article. What sets Lighthouz AI apart from other evaluation methods is that it is an AI alignment and security platform guided by evaluations. We used AutoBench to identify the particular pack that gave us the best eval scores. We then added Llama Guard to the remediation by tweaking its custom taxonomy, after which we ran AutoBench evaluations on the winning pack, and we observed an impressive improvement of about 20% in handling prompt injection.

For the complete source code, please refer to my Colab notebooks below:

Happy coding!

P.S. I have no affiliation with Lighthouz AI.