Fabio Matricardi

Summary

This article discusses a new approach to improve the accuracy of Retrieval Augmented Generation (RAG) applications, focusing on re-ranking techniques with document pre-processing and the use of a third-party judge.

Abstract

The article addresses the issue of decreased accuracy in RAG applications when dealing with long documents or multiple sources. It introduces the concept of re-ranking, which involves organizing retrieved document chunks according to relevance to avoid losing context and improve overall performance. The author proposes a pipeline that leverages a summarization model, such as T5, to extract salient points from lengthy documents and reduce noise for the generator. This approach aims to overcome limitations in current RAG methods, such as the challenge of processing large amounts of text and accessing important information buried in lengthy contexts. The author also suggests using a third-party language model, such as OpenChat, to act as a judge and evaluate the performance of different AI assistants.

Opinions

  • Current RAG methods struggle with processing large amounts of text and accessing important information buried in lengthy contexts.
  • Summarization models, such as T5, can be used to extract relevant information and reduce noise for the generator.
  • Re-ranking retrieved document chunks according to relevance can improve the performance of RAG applications.
  • A third-party language model can be used as a judge to evaluate the performance of different AI assistants.
  • The proposed approach aims to overcome limitations in current RAG methods and improve the accuracy of generated responses.
  • The author suggests using OpenChat as a judge to score the performance of different models.
  • The proposed pipeline can be adapted to various use cases and tailored to specific requirements.

Re-Ranking is All You Need?

Increasing RAG accuracy is not an easy feat: meet LangChain re-ranking with document pre-processing techniques and a third-party judge!

Photo by Maarten van den Heuvel on Unsplash

When it comes to creating a complex Retrieval Augmented Generation application we meet a few hurdles: as soon as we have a long document or multiple documents we see a huge drop in the accuracy of the answers.

Do we have to work on the chunk length? Is it a problem of metadata? Are we simply asking the wrong questions? Is the document clear enough?

How can we overcome these issues? Can we leverage the increased context lengths to bind together relevance and meaning?

In this article we are going to answer a few of these questions: we all face them when working with generative AI.

The quest for Better RAG strategies

When it comes to a specific knowledge base, Large Language Models tend to hallucinate because they cannot find the answer. For these use cases, RAG is the best option. Retrieval Augmented Generation (RAG) is a technique that brings together information retrieval and generative models.

When we use Retrieval Augmented Generation (RAG), we add a piece of information to the prompt during the process of generating responses. This information is usually a paragraph or a snippet of text that we find by searching through a database using special search techniques. When it’s time for the LLM to generate a response, this retrieved text is given to it as additional input.
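As a minimal illustration of that idea (the snippet, question, and the llm call here are invented placeholders, not the article's actual code), RAG at its core is just prompt assembly:

# Minimal sketch of the RAG idea: paste the retrieved snippet into the
# prompt before the question. All names and texts are illustrative.
retrieved_snippet = "The warranty covers manufacturing defects for 24 months."
question = "How long does the warranty last?"

augmented_prompt = f"""Answer the question using only the context below.

Context:
{retrieved_snippet}

Question: {question}
Answer:"""

# The augmented prompt is then sent to whatever LLM you are using,
# e.g. answer = llm(augmented_prompt)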

NOTE: this strategy is really effective for factual information, but at the current state of the art it still falls short on multi-step reasoning questions. In fact, the retrieval part is affected by the type of embeddings and the chunk size. So how do we get answers to complex reasoning questions where multiple document sources must be retrieved? How can we ensure the context is not lost?

Photo by Alvaro Reyes on Unsplash

There are actually a few limitations to the Retrieval Augmented Generation approach. The first is that we usually put the text corpus inside a vector store database, splitting it into chunks. The second is that we rely on a similarity search between the question and the chunks that match it the most.

The chunk size is the granularity we apply to the entire text: if it is too small we lose the general context; if it is too large the chunk as a whole may show only a poor similarity to the question.

Similarity search is the basic approach to matching a question with a text that contains the answer while preserving semantic meaning: however, semantic similarity and relevance are not the same thing.
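To make these two steps concrete, here is a rough sketch of the usual LangChain setup; the splitter values and the embedding model are example choices on my side, not the ones used later in this article:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1) Split the corpus into overlapping chunks: this is where the granularity trade-off lives
splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=50)
chunks = splitter.split_documents(documents)  # `documents` loaded beforehand

# 2) Embed the chunks and run a plain similarity search against the question
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)
similar_chunks = db.similarity_search("your question here", k=4)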

A new Approach: Summarization + Re-Ranking

A few weeks ago Anthony Alcaraz published an amazing article titled Crafting Knowledgeable AI with Retrieval Augmentation: A Guide to Best Practices. It is not a code showcase or a Python project, but a set of guidelines for better results in RAG. He also suggested summarizing lengthy passages:

Apply summarization models like BART or T5 fine-tuned on your data to extract salient points from lengthy retrieved texts, reducing noise for the generator.

I took the idea and decided to use a slim T5 model to create the context for the whole Question & Answer pipeline.
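Here is a minimal sketch of that summarization step with the Hugging Face transformers pipeline; the checkpoint and the length settings are illustrative, and a long document has to be fed in slices because T5's input window is small:

from transformers import pipeline

# Any small T5-style summarization checkpoint will do; "t5-small" is just an example
summarizer = pipeline("summarization", model="t5-small")

passage = open("my_document.txt").read()
summary = summarizer(passage, max_length=150, min_length=40,
                     do_sample=False, truncation=True)[0]["summary_text"]
print(summary)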

My idea of Pre-Processing pipeline

The main points I want to highlight are:

  • we do not need to get stuck on one model alone: we can stitch together more than one model and assign specific tasks to each
  • use a slim T5 model to pre-process the documents, creating a summary and extracting relevant questions in the process
  • suggested questions are presented to the user up front, giving hints about the content of the text

The second main point, as you can see in the picture above, is that we apply a Re-Ranking strategy on the retrieved chunks.

Re-ranking is a technique used to prevent the LLM from overlooking text lost in the middle of a long context. While Large Language Models (LLMs) are incredibly powerful, they do have some limitations. They can struggle when it comes to processing large amounts of text at once and referencing specific information. Recent research has even shown that LLM performance tends to be highest when relevant information is located at the beginning or end of the input context. When models have to access important information buried in the middle of lengthy contexts, their performance can significantly degrade.

LangChain re-ranking is a ready-to-use pipeline that reorders the content according to relevance: it places the most important content at the beginning and at the end of the text to be injected into the LLM generation pipeline. You can read more in my previous article:

Let’s have a look at it: the pipeline is really simple because LangChain does all the heavy lifting.

from langchain.document_transformers import LongContextReorder

retriever = db.as_retriever(search_kwargs={"k": k})

# Get relevant documents ordered by relevance score
context_set = retriever.get_relevant_documents(query)

# Reorder the documents:
# Less relevant documents will be in the middle of the list and more
# relevant elements at the beginning / end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(context_set)

The code is taken from the official LangChain documentation. The basic elements are:

  • a retriever with k number of results for the similarity search on the Vector Database (retriever)
  • your retrieved set of document chunks from the Similarity search (context_set)
  • a LongContextReorder instance (reordering)

And that is it. The output is a list of LangChain Document objects, containing the text and the metadata. You can peek at this article to see the differences between a normal retriever and a re-ranked one.
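As a quick, optional sanity check, you can loop over the reordered list and print each Document to see the new order (page_content and metadata are the standard LangChain Document attributes):

# Inspect the reordered chunks: each element is a Document carrying
# the chunk text (page_content) and its metadata
for doc in reordered_docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])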

The point now is to take this reordered set of Documents and inject, in a strategic position, the summary produced by the T5 model.

I did it like this: don’t be intimidated by the code, I will go through it.

from langchain.chains import StuffDocumentsChain, LLMChain
from langchain.prompts import PromptTemplate

reordered_docs.insert(-2, summarization)
#print(str(reordered_docs))

# We prepare and run a custom Stuff chain with the reordered docs as context
# and the Summary in 3rd last position.
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
stuff_prompt_override = """<|system|>\n</s>\n<|user|>\nGiven this text extracts:
-----
{context}
-----
Please answer the following question:
{query}</s>\n<|assistant|>"""
prompt = PromptTemplate(
    template=stuff_prompt_override, input_variables=["context", "query"]
)

# Instantiate the chain
llm_chain = LLMChain(llm=llm, prompt=prompt)
chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
)
result = chain.run(input_documents=reordered_docs, query=query)

  • with the list insert method, we put the summary Document at position -2: reordered_docs.insert(-2, summarization)
  • NOTE that summarization must be a LangChain Document object, not plain text. Here is an example of how to create a Document from plain text:
from langchain.schema.document import Document
# From SUMfinal, plain text for the summary, we create a 
# SUMMARIZATION DOCUMENT: we use metadata to identify the document
# title and type = summary 
docsum = Document(page_content = SUMfinal, metadata = {
    'source': '/content/EN_Vector Search Is Not All You Need by Anthony Alc.txt',
    'type': 'summary'})

Everything else is the creation of a PromptTemplate that will run in a LangChain chain. It is a stuff chain, meaning that LangChain will go through all the Documents in the reordered set and stuff only their text into the template, in the section called context.
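Conceptually, what the stuff chain builds is nothing more than this (a simplified rendering that ignores the per-document prompt):

# Rough equivalent of what StuffDocumentsChain sends to the LLM:
# join the text of every reordered Document and drop it into {context}
context_text = "\n\n".join(doc.page_content for doc in reordered_docs)
final_prompt = stuff_prompt_override.format(context=context_text, query=query)
print(final_prompt[:500])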

Photo by Sasun Bughdaryan on Unsplash

Is Re-Ranking All You Need?

Evaluation of RAG is a hot topic with an open debate around it. Very smart people created a dedicated library for it, called RAGAS. Honestly, I tried to use it with an open-source LLM (I don’t want to rely on ChatGPT…) and the process failed every time.

So I decided to use another brand-new approach: a third-party Language Model acting as a judge (also known as JudgeLM: Fine-tuned Large Language Models are Scalable Judges). With my limited computational resources I used a similar approach inspired by the official paper:

  • use the GPTQ version of OpenChat3.5 (a really promising model) in Google Colab
  • collect the generations from Zephyr-7B on the same questions, once with simple re-ranking and once with re-ranking plus summary injection
  • ask OpenChat to act as the judge and score the two models, with reasoning.

In the last pages of the paper there are examples of all the prompts used to train the evaluation model: I used this one with OpenChat:

JLM_template = f'''GPT4 User: You are a helpful and precise assistant for checking the quality of the answer.
[Question]
{question}
[The Start of Assistant 1's Answer]
{answer2}
[The End of Assistant 1's Answer]
[The Start of Assistant 2's Answer]
{answer1}
[The End of Assistant 2's Answer]
[System]
We would like to request your feedback on the performance of two AI assistants in response to the
user question displayed above.
Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant
receives an overall score on a scale of 1 to 10, where a higher score indicates better overall
performance.
Please first output a single line containing only two values indicating the scores for Assistant 1 and
2, respectively, following the format below:
Assistant 1's Score: here the score
Assistant 2's Score: here the score 

In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the
order in which the responses were presented does not affect your judgment.<|end_of_turn|>GPT4 Assistant:
'''
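For completeness, here is a rough sketch of how the judging step can be run; the pipeline call and the GPTQ repository name are assumptions on my side (loading GPTQ weights through transformers also requires optimum/auto-gptq to be installed), not the exact Colab code behind the results below:

from transformers import pipeline

# Hypothetical judge setup: any instruction-tuned model can play this role.
# The repository name below is assumed; adjust it to the checkpoint you use.
judge = pipeline("text-generation",
                 model="TheBloke/openchat_3.5-GPTQ",
                 device_map="auto")

output = judge(JLM_template, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output[len(JLM_template):])  # keep only the judge's verdict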

Here are the results:

OpenChat3.5 as Judge

Review and Conclusions

This article set out to explore a new approach to RAG. First of all, the use of a tiny T5 model for summarization and topic extraction: in this way we get a super fast summary and recommended questions derived from the main topics.

The second goal was to see whether providing a wider context would ensure higher accuracy when answering general or open questions about the documents.

I believe there is really promising room for improvement here. Try it out yourself and give me some feedback. There is no better way to learn than doing it yourself (with guidance, of course).

If this story provided value and you like these topics, consider subscribing to Medium to unlock more resources. Medium is a big community with high-quality content: you can certainly find what you need here.

  1. Sign up for a Medium membership using my link — ($5/month to read unlimited Medium stories)
  2. Follow me on Medium
  3. Highlight what you want to remember, and if you have doubts or suggestions simply drop a comment on the article: I will promptly reply
  4. Read my latest articles https://medium.com/@fabio.matricardi

Do you want to read more? Here are some topics:

Artificial Intelligence
Python
Open Source
Local Gpt
Hugging Face