
Rethinking Embedding-based Retrieval-Augmented Generation (RAG) for Semantic Search and Large Language Models (LLMs)

Dall-E 2: a surrealist dream-like oil painting by Salvador Dalí of a cat playing checkers (https://openai.com/dall-e-2)

This post discusses a simple experiment I put together to explore potential pitfalls in the common embedding-based Retrieval-Augmented Generation (RAG) paradigm. In this approach, a pre-trained embedding model converts passages of text into embeddings, and a query is transformed into an embedding using the same model. A similarity metric, most commonly cosine similarity, then ranks the passages against the query. Based on this ranking, the top-K passages are considered the most relevant and are passed to the LLM as context.
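As a minimal, illustrative sketch of this retrieval step (not the experiment code itself; the passages below are placeholders and the model choice is arbitrary):

from sentence_transformers import SentenceTransformer

# Toy corpus of passages (placeholders) and a query.
passages = [
    "The study uses administrative data from all public schools.",
    "School-based management decentralizes decision-making to schools.",
    "The program was rolled out in several regions of the Philippines.",
]
query = "What data was used in this paper?"

# Embed passages and query with the same model, then rank by cosine similarity.
model = SentenceTransformer("thenlper/gte-base")
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)

scores = (query_emb @ passage_emb.T)[0]            # cosine similarities (unit-norm vectors)
top_k = scores.argsort()[::-1][:2]                 # indices of the top-K passages
context = "\n\n".join(passages[i] for i in top_k)  # context handed to the LLM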

Information retrieval, the discipline behind the “R” in RAG, is useful for many use cases, including semantic search, question answering, recommendation systems, and LLM-based chat applications. Retrieving relevant texts is especially crucial for LLM-based systems because of the limited context windows of current LLMs. Additionally, RAG mitigates LLM hallucination by grounding the output; hallucination is the phenomenon where LLMs sound confident yet generate non-factual responses. But again, at the current state of LLMs, RAG does NOT prevent hallucination. It only mitigates it.

The experiment compares the top embedding models based on the Massive Text Embedding Benchmark (MTEB) leaderboard on HuggingFace in retrieving the relevant passage in a document given a query. The link to the Google Colab notebook is available below so you can review the implementation and run the experiment yourself!

Overview

In this section, I describe the motivation for this experiment. I also share a high-level view of the experiment. Finally, I provide the details for the model selection criteria I used.

Motivation

This simple experiment aims to provide insights into the need for experimentation and research when developing AI-based solutions, for example, the need to implement re-ranking or to fine-tune a retrieval model tailored to your specific use case.

The result of the experiment is expected to serve as the premise for a brief discussion on why there exist risks when treating semantic search and context retrieval for LLMs merely as an application development problem. I also hope that this will help justify the need for investing in research in organizations instead of simply relying on off-the-shelf solutions or basing decisions on reported findings by third parties.

As in any machine learning and AI problem, there is no “free lunch.” This means that we can take inspiration from available resources, but it is dangerous to expect that the successes we see will immediately translate to our use case by simply following what others have done. Concretely, if your use case likely belongs to the “long tail” of the model's training data, all the more effort and research are needed. When integrating AI solutions into your organization or applications, these nuances must be considered.

The flow

These are the main steps of the experiment. First, I implemented a pipeline that loads a PDF document using LangChain’s PyMuPDF document loader. The document is then split into passages, and each passage is transformed into an embedding using a specified model. Afterward, using cosine similarity, I retrieve contexts for a query related to my use case of finding which data was used in the document. Finally, I tested whether the empirically relevant passage was retrieved among the top-K most relevant passages.

This pipeline covers the basic flow of a simple semantic search system.

Model selection

For comparability, I used embedding models with the same dimensionality: models with 768 embedding dimensions and a context length of 512 tokens. I also constrained the selection by model size; only models below 1 GB were considered. The values for these filters were taken from the MTEB leaderboard hosted on HuggingFace Spaces.

I also sorted the list based on the total average score (as of 2023–08–19) and chose the top 5 models. The shortlisted models are:

  • BAAI/bge-base-en
  • thenlper/gte-base
  • intfloat/e5-base-v2
  • intfloat/e5-base
  • hkunlp/instructor-base

The experiment

The full code is available in this Google Colab Notebook. Try testing it with your own document and see if these models work for your queries!

I outline the different parts of the notebook in the following sections with some explanations.

Installing the needed packages

We begin by installing the required packages: the HuggingFace transformers, Sentence Transformers, LangChain, PyMuPDF, and InstructorEmbedding libraries. We import tqdm but do not install it since it is already available in Colab.

!pip install transformers[torch] &> /dev/null
!pip install sentence-transformers &> /dev/null
!pip install InstructorEmbedding &> /dev/null
!pip install langchain &> /dev/null
!pip install pymupdf &> /dev/null

Defining the models and instructions

Next, we define the models that we have shortlisted for the experiment.

Some recent embedding models use instruction prefixes to encode the asymmetry in tasks such as retrieval and question answering. We therefore provide the query and passage instructions for the models that require them, following the notes in each model’s card on which instructions to use.

# Define the models and their corresponding instructions for both the query and the passage, if required.

model_instructions = {
    "BAAI/bge-base-en": {
        "query": "Represent this sentence for searching relevant passages:",
        "passage": "",
    },
    "thenlper/gte-base": {
        "query": "",
        "passage": "",
    },
    "intfloat/e5-base-v2": {
        "query": "query:",
        "passage": "passage:",
    },
    "intfloat/e5-base": {
        "query": "query:",
        "passage": "passage:",
    },
    "hkunlp/instructor-base": {
        "query": "Represent the query for retrieving passages:",
        "passage": "Represent the passage for retrieval:",
    },
}

Downloading the document

We download the test document from a URL and save it as document.pdf in the current working directory.

# Download a document
file_path = "document.pdf"

# The effects of school-based management in the Philippines : an initial assessment using administrative data (English)
!wget https://documents1.worldbank.org/curated/en/692901468296405564/pdf/WPS5248.pdf -O {file_path}

Implementing the document loader and splitter

We define a function that implements the loading of the document using LangChain’s document loader leveraging PyMuPDF. The function also splits the document into passages.

We split the document based on the tokenizer of the model and set the maximum chunk size (passage length) to 500 tokens. While we chose embedding models with a 512-token context size, we limit the passage length to 500 tokens to leave room for some models' instruction prefixes.

We also set the chunks to overlap by 32 tokens. There is a more precise reason why overlapping chunks help in LLM applications (so-called “fringe effects”), but it is not important for our current experiment. Still, it’s nice to have this in place.

from typing import Optional, Union
from pathlib import Path
from langchain.docstore.document import Document
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import (
    NLTKTextSplitter,
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TextSplitter,
)


MAX_TOKENS = 500  # 500 instead of 512 to leave room for instruction tokens.
CHUNK_OVERLAP = 32


def get_text_splitter(tokenizer, max_tokens=512, chunk_overlap=32):
    chunk_size = max_tokens

    # Create a text splitter
    # cls = CharacterTextSplitter
    cls = RecursiveCharacterTextSplitter

    text_splitter = cls.from_huggingface_tokenizer(
        tokenizer,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
    )

    return text_splitter


def get_doc(file_path, tokenizer, max_tokens=512, chunk_overlap=32):
    text_splitter = get_text_splitter(tokenizer, max_tokens=max_tokens, chunk_overlap=chunk_overlap)

    # Load the document
    doc = PyMuPDFLoader(file_path)
    doc_splits = doc.load_and_split(text_splitter)

    return doc_splits
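As a quick sanity check of the splitter, the snippet below loads a tokenizer for one of the shortlisted models and inspects the resulting passages. This is an illustrative addition, not part of the original notebook:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
doc_splits = get_doc(file_path, tokenizer, max_tokens=MAX_TOKENS, chunk_overlap=CHUNK_OVERLAP)

print(f"Number of passages: {len(doc_splits)}")
print(doc_splits[0].page_content[:200])  # preview the first passage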

Generation of embeddings

After implementing the document processing methods, we can generate embeddings for the passages using the different models.

The following code block iterates over the list of models in our selection. We then load the model from the HuggingFace Hub.

We segment the documents using the respective tokenizer of the model since not all models use the same tokenizer. This ensures that we do not accidentally pass a passage that would exceed the context length of the embedding model.

from tqdm.auto import tqdm

from sentence_transformers import SentenceTransformer
from InstructorEmbedding import INSTRUCTOR


query = "What data was used in this paper?"
model_similarity = {}
model_doc_splits = {}

for mname, minfo in tqdm(model_instructions.items()):

    if mname == "hkunlp/instructor-base":
        model = INSTRUCTOR(mname)

        doc_splits = get_doc(file_path, tokenizer=model.tokenizer, max_tokens=MAX_TOKENS, chunk_overlap=CHUNK_OVERLAP)
        model_doc_splits[mname] = doc_splits

        input_passage = [[minfo["passage"], split.page_content] for split in doc_splits]
        input_query = [[minfo["query"], query]]
    else:
        model = SentenceTransformer(mname)

        doc_splits = get_doc(file_path, tokenizer=model.tokenizer, max_tokens=MAX_TOKENS, chunk_overlap=CHUNK_OVERLAP)
        model_doc_splits[mname] = doc_splits

        input_passage = [" ".join([minfo["passage"], split.page_content]).strip() for split in doc_splits]
        input_query = [" ".join([minfo["query"], query]).strip()]

    passage_embeddings = model.encode(input_passage, normalize_embeddings=True)
    query_embedding = model.encode(input_query, normalize_embeddings=True)

    similarity = query_embedding @ passage_embeddings.T
    model_similarity[mname] = similarity

    # print(similarity)

Next, we apply the appropriate formatting of the passage and query inputs, taking the instruction requirement into account where necessary. All models share the same interface since they build on the SentenceTransformer class; however, for the Instructor model, each input must be formatted as an [instruction, content] pair.

After this, we encode the passages and the query into their respective embeddings and normalize them. Because the embeddings are normalized to unit length, the dot product of two embeddings is equivalent to their cosine similarity.
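A quick numerical check of this equivalence (illustrative only, using random vectors):

import numpy as np

rng = np.random.default_rng(0)
q, p = rng.normal(size=768), rng.normal(size=768)

# Cosine similarity of the raw vectors...
cosine = q @ p / (np.linalg.norm(q) * np.linalg.norm(p))

# ...equals the dot product of the unit-normalized vectors.
dot = (q / np.linalg.norm(q)) @ (p / np.linalg.norm(p))

assert np.isclose(cosine, dot)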

We store the similarity scores and the passages for evaluation.

Evaluation

Now that we have the similarity scores of the query for all passages, we can simulate the retrieval of the relevant context. We want to measure the rank of the relevant passage after sorting all passages by cosine similarity with respect to the query embedding.

We can validate the accuracy of the retrieval by finding the exact phrase we expect the relevant passage to contain. In this case, we reference two relevant phrases in the document:

  • “We use school administrative data” (from the body)
  • “using the administrative dataset of all public schools” (from the abstract)

These phrases appear in passages that an effective embedding model should retrieve, so passages containing them should appear at the top of the ranking.

In the code block below, we sort the scores from highest to lowest for each model and identify which passages correspond to the top-K scores. For each passage, we check whether it contains the reference text and report the passage's rank if the reference text is found.

top_k = 50
ref_texts = [
    "We use school administrative data",  # In the body
    "using the administrative dataset of all public schools"  # In the abstract
]

for ref_text in ref_texts:
    print(f"Reference: {ref_text}")

    for mname in tqdm(model_similarity):
        found = False

        for ix, i in enumerate(model_similarity[mname][0].argsort()[::-1][:top_k], 1):
            if ref_text in model_doc_splits[mname][i].page_content:
                found = True
                break
        if found:              
            print(f"{mname}: Found in top {ix}/{len(model_doc_splits[mname])}...")
            print("=" * 50)
            print()    
        else:
            print(f"{mname}: Not found in top {ix}/{len(model_doc_splits[mname])}...")
            print("=" * 50)
            print()
    print()

The evaluation script above returns the following output.

For the “We use school administrative data” reference, all models except the Instructor model performed significantly worse in retrieving the relevant passage. If we were using an LLM with a 4,096-token context window, only the Instructor model would have included this passage in the context: with passages of roughly 500 tokens, only around eight retrieved passages fit in the context, and the other models ranked the passage 25th or lower.

Reference: We use school administrative data

BAAI/bge-base-en: Found in top 25/65...
==================================================

thenlper/gte-base: Found in top 45/65...
==================================================

intfloat/e5-base-v2: Found in top 27/65...
==================================================

intfloat/e5-base: Found in top 38/65...
==================================================

hkunlp/instructor-base: Found in top 3/57...
==================================================

On the other hand, the reference “using the administrative dataset of all public schools” found in the document's abstract appears to have been easily retrievable by the models. Still, the former passage is more relevant as it contains a more detailed description of the dataset used.

Reference: using the administrative dataset of all public schools

BAAI/bge-base-en: Found in top 3/65...
==================================================

thenlper/gte-base: Not found in top 50/65...
==================================================

intfloat/e5-base-v2: Found in top 1/65...
==================================================

intfloat/e5-base: Found in top 4/65...
==================================================

hkunlp/instructor-base: Found in top 4/57...
==================================================

The drastic difference in the models’ performance in finding the relevant passages for these two reference phrases, using the same query, demonstrates the need to understand the models we choose rather than simply copying what worked for others.

One may argue that this is just a single document, but that makes the point all the more relevant: it supports the case that AI is generally not a one-size-fits-all solution, at least in its current state. It also highlights the importance of transparency in how we present model outputs to users, since there is no certainty with AI solutions. We should always err on the side of caution whenever accuracy or model hallucination is discussed.

Suggested solutions

Here, I outline some suggestions on what we can do to mitigate these pitfalls.

  • Build a trusted “gold dataset” to validate the relevance and accuracy of your retrieval system. Note that we can use LLMs to generate an initial set of data that we can manually curate to build our “gold dataset.”
  • Implement a re-ranking model based on this “gold dataset” to increase the chances that the relevant passage can be included in our context.
  • Fine-tune a text embedding model based on our data and use case.
  • Implement a hybrid search combining semantic-based retrieval systems with a keyword-based search for increased recall (a simplified sketch follows this list).
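As an example of the last point, here is a simplified hybrid-search sketch. It is illustrative only; a real system would typically use BM25 (for example via rank_bm25 or Elasticsearch) rather than the toy keyword-overlap score below, and the function names here are hypothetical:

import numpy as np

def keyword_score(query: str, passage: str) -> float:
    # Toy keyword score: fraction of query terms appearing in the passage.
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def hybrid_rank(query, passages, passage_embeddings, query_embedding, alpha=0.5):
    # passage_embeddings (N x d) and query_embedding (d,) are assumed unit-normalized.
    semantic = passage_embeddings @ query_embedding                  # cosine similarities
    keyword = np.array([keyword_score(query, p) for p in passages])  # keyword overlap
    combined = alpha * semantic + (1 - alpha) * keyword              # weighted blend
    return combined.argsort()[::-1]                                  # passage indices, best first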

Note that I am using the word “mitigate” since perfection is a utopia for the current state of AI. This means that errors will occur. So, only when you and your organization become cognizant of this and embrace this dark side of AI will you be truly worthy of AI’s promises!

Perfection is a utopia for the current state of AI

Some thoughts

At this point, I will circle back to one of the motivations I listed above — that this experiment is expected to serve as the premise for a brief discussion on why there exist risks when treating semantic search and context retrieval for LLMs merely as an application development problem.

Building chatbots and LLM-powered applications is exciting. Open-source frameworks accelerate the pace at which various applications are developed. However, many of the critical parts are abstracted away by these frameworks. I want to demonstrate why we should be circumspect in developing applications that leverage AI. A nice-looking chatbot is only useful when it can deliver the most relevant content it is supposed to provide.

The risk comes when you assume that using some embedding model to generate a context for an LLM to consume solves the critical hallucination problem. Also, basing your decision on some public leaderboard results may not be optimal. If you built an LLM-powered application that leverages embedding-based RAG, and you did not validate the relevance performance of the embedding model for your use case, then I suggest you spend some time reviewing it. No LLM can provide the answer or generate an accurate response when the relevant information is not passed to it!

Building AI solutions is not just an IT or software engineering problem; it requires research and subject matter expertise to identify relevant contexts. Nonetheless, we have tools and resources to perform the needed “sleuthing” to improve the applications and solutions we build, no matter what background you come from!

Ultimately, we need to understand that AI applications, at least at this time, are not just about writing API requests. There’s a more exciting side to it — researching what works!
