Fabio Matricardi

Summary

The article discusses the limitations of Retrieval Augmented Generation (RAG) in Large Language Models (LLMs) and proposes an orchestration strategy using open source models to improve results, with a focus on metadata and keyword extraction.

Abstract

The article begins by addressing the issue of LLMs often hallucinating and the need for a production-ready chatbot to be grounded in factual truth. It then introduces Retrieval Augmented Generation (RAG) as a solution, but acknowledges its limitations, such as the handling of the text corpus, chunking, and similarity search mechanisms. The author proposes an orchestration strategy using open source models to enhance RAG results, emphasizing the importance of metadata and keyword extraction. The author suggests that including keywords in the document summary and in the chunks, as part of the metadata, can exponentially improve RAG results. The article also discusses the use of KeyBERT for keyword extraction and its advantages, such as speed and a small memory footprint.

Opinions

  1. The author believes that relying solely on OpenAI or Cohere for balancing costs and revenues may not be the best approach and suggests using open source models for better results.
  2. The author argues that it is naive to think that any unstructured document can be used in the RAG pipeline without proper organization and structure.
  3. The author emphasizes the importance of data ingestion as the first step in any data-science pipeline and cautions against underrating its significance.
  4. The author criticizes the out-of-the-box approach of Langchain Document Loader, stating that it is bound to fail due to the lack of control over metadata and content quality.
  5. The author suggests that high-quality data content and metadata are essential for achieving high-quality results in the RAG pipeline.
  6. The author proposes a tailor-made pipeline, including more steps in the data ingestion phase and document construction, for better results in the RAG pipeline.
  7. The author concludes that metadata, once considered a mere byproduct of data management, is now emerging as a driving force in the field of artificial intelligence.

Metadata Metamorphosis: from Plain Data to Enhanced Insights with Retrieval Augmented Generation

Discover how metadata, the hidden gem of your knowledge base, can be transformed into a powerful tool for enriching your RAG pipeline and unlocking new possibilities.

image created by the author

The mainstream focus on Large Language Models has recently shifted to solid Retrieval Augmented Generation, also known as RAG. LLMs hallucinate far too often, and a production-ready chatbot must be grounded in factual truth. But how can we achieve this feat? Can we rely only on OpenAI or Cohere while balancing costs and revenues? Can't we improve the results with open source models?

In this article I will explore an orchestration strategy to achieve better results using only Hugging Face open source models.

If you are new to the Artificial Intelligence world or you want to Learn how to start to Build Your Own AI, download This Free eBook

Image by Eli Digital Creative from Pixabay

Where is the mechanism weak?

When it comes to a specific knowledge base, Large Language Models tend to hallucinate because they cannot find the answer. For these use cases RAG is the best option. Retrieval Augmented Generation (RAG) is a technique that brings together information retrieval and generative models.

However, recent studies (and, I would say, the complaints of many of us greedy users of LLMs… 😂) have shown that there are several limits to the efficiency of Retrieval Augmented Generation.

When we use Retrieval Augmented Generation (RAG), we add a piece of information to the prompt during the process of generating responses. This information is usually a paragraph or a snippet of text that we find by searching through a database using special search techniques. When it’s time for the LLM to generate a response, this retrieved text is given to it as additional input.
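To make the idea concrete, here is a minimal sketch of how that augmented prompt can be assembled (the retrieved_snippets list stands in for whatever your retriever returns; the names are purely illustrative):

# Minimal sketch: the retrieved text is simply prepended to the question as extra context.
# retrieved_snippets stands in for the output of your retriever (names are illustrative).
def build_rag_prompt(question, retrieved_snippets):
    context = "\n\n".join(retrieved_snippets)
    return ("Answer the question using only the context below.\n\n"
            f"CONTEXT:\n{context}\n\n"
            f"QUESTION: {question}\nANSWER:")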

NOTE: this strategy is really effective for factual information, but at the current state of the art it still struggles with multi-step reasoning questions. In fact, the retrieval part is affected by the type of embeddings and the chunk size. So how do we get answers to complex reasoning questions where multiple document sources must be retrieved? How can we ensure the context is not lost?

While Retrieval Augmented Generation (RAG) has emerged as a promising approach for enhancing text generation, it is essential to acknowledge its inherent limitations. These limitations stem from the interplay between the text corpus, chunking, and similarity search mechanisms employed in RAG.

Text Corpus and Chunking

One key limitation lies in the handling of the text corpus, the vast collection of text data from which relevant information is retrieved. The standard practice involves splitting the corpus into chunks, which are essentially smaller segments of text. However, this chunking process introduces a trade-off between granularity and context preservation.

If the chunks are too small, the overall context of the text may be lost, making it difficult for the model to grasp the broader meaning and relationships between different pieces of information. This can lead to generated text that lacks coherence and fails to capture the nuances of the original text.

Conversely, if the chunks are too large, the similarity search process may not be able to identify the most relevant portions of the text with sufficient precision. This is because large chunks may exhibit a high degree of similarity to the query, even if they do not contain the specific answer or information sought. Consequently, the generated text may not address the user’s intent accurately.
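To see the trade-off in practice, here is a small sketch (the chunk sizes and the input file are just placeholders) that splits the same text at two different granularities:

# Illustrative only: the same text split with a small and a large chunk size.
from langchain.text_splitter import TokenTextSplitter

with open("any_long_document.txt", encoding="utf8") as f:   # placeholder file
    sample_text = f.read()

small_chunks = TokenTextSplitter(chunk_size=100, chunk_overlap=10).split_text(sample_text)
large_chunks = TokenTextSplitter(chunk_size=1000, chunk_overlap=10).split_text(sample_text)

print(len(small_chunks), "small chunks: lots of them, but little context each")
print(len(large_chunks), "large chunks: few of them, but each may dilute the query match")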

Similarity Search and Semantic Meaning

Another limitation arises from the reliance on similarity search as the primary mechanism for matching a query to relevant text. Similarity search algorithms typically measure the degree of textual similarity between two pieces of text based on their word overlap or shared patterns. While this approach can be effective in identifying textually similar passages, it does not necessarily guarantee that the retrieved text is semantically relevant to the query.

Semantic relevance goes beyond mere textual similarity and encompasses the underlying meaning and intent of the text. Two pieces of text may share many words or even phrases, but if they convey different meanings or address different concepts, they are not semantically relevant to each other.

RAG’s reliance on similarity search may lead to the retrieval of text that is textually similar but semantically irrelevant, resulting in generated text that fails to address the user’s true intent. This limitation highlights the importance of incorporating semantic understanding and reasoning into RAG models to ensure that the retrieved text is not only similar but also meaningful.
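A tiny sketch can illustrate what a retriever actually computes when it scores passages against a query. Whether the word-overlap passage or the truly on-topic passage wins depends entirely on the embedding model, which is exactly the limitation discussed here (the model name below is only an example; it assumes sentence-transformers is installed):

# Sketch: scoring two passages against a query with cosine similarity.
# Whether the word-overlap passage or the on-topic one scores higher
# depends on the embedding model (the model name is just an example).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I recover access to my bank account?"
word_overlap = "We set up camp on the bank of the river, taking into account the access road."
on_topic = "Steps to reset your online banking password and restore your login."

q_emb, a_emb, b_emb = model.encode([query, word_overlap, on_topic])
print("word-overlap passage:", float(util.cos_sim(q_emb, a_emb)))
print("on-topic passage:    ", float(util.cos_sim(q_emb, b_emb)))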

You can find all the code in this article in my GitHub repo: useful if you like to code along 😏

Overcoming the Limitations: “metadata is all you need”

At the heart of RAG lies data: but what if this data is of poor quality? What if it has no organization and structure?

It is naive to think that we can plug any unstructured document into LangChain, start using the RAG pipeline, and pray that everything will work out fine by some kind of magic.

It may be fine for one document, but what happens once you have a huge vector database with hundreds of documents or more?

My proof of concept here is that we can exponentially improve RAG results if we add keywords to the document summary and to the chunks as part of the metadata: the query will then be matched not only by similarity search but also against the keywords extracted during the pre-processing stage and stored as Document metadata.

Infographics by the author

I decided to build an orchestra of models to perform high-level pre-processing of the Documents, because data ingestion is the first step of any data-science pipeline and cannot be discarded or underrated. The refrain is still valid: Garbage In, Garbage Out.

I already talked about the importance of summarization: a fast summarization is the key to overcoming the first issue of RAG, the one related to the text corpus and chunking granularity. In fact, a good summary can:

  • be extracted from any document and stored in a separate database (it can even be a pickle file with no embeddings at all, as shown in the sketch below!)
  • always be used as part of the context, together with the similarity search of the query against the vector store database
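A minimal sketch of that first idea (the summaries dictionary and the file name are just placeholders; the summarization model itself is not covered here):

# Store document-level summaries in a plain pickle file, completely outside
# the vector store: no embeddings involved, just a lookup by filename.
import pickle

summaries = {"Governing societies with Artificial Intelligence.txt":
             "A short summary of the document goes here..."}   # placeholder content

with open("summaries.pkl", "wb") as f:
    pickle.dump(summaries, f)

# At query time the summary can be loaded back and prepended to the context
with open("summaries.pkl", "rb") as f:
    summaries = pickle.load(f)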

Why do we need metadata?

The second huge limitation of easy RAG is the gap between similarity search and semantic meaning. Having a larger context window is an advantage we cannot waste on the wrong chunks: we would only risk losing the meaning in a huge amount of useless information.

I recently stumbled upon KeyBERT. KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.
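The same three steps can be sketched by hand (simplified, without KeyBERT's MMR/diversity options; it assumes sentence-transformers and scikit-learn are installed, and the model name is only an example):

# 1) embed the whole document, 2) embed candidate n-grams, 3) rank candidates by cosine similarity
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer, util

doc = "Retrieval Augmented Generation grounds language models on an external knowledge base."

candidates = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc]).get_feature_names_out()
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode([doc])
cand_embs = model.encode(list(candidates))

scores = util.cos_sim(doc_emb, cand_embs)[0]
top5 = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)[:5]
print(top5)   # the highest-scoring n-grams are the keywords that best describe the document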

It is really an easy, one-liner yet powerful library that can be customized to become a powerful ally in the Exploratory Document Analysis process.

At its core KeyBERT uses an embedding model of your choice (and here I suggest you pick the right one for you, according to your preferences in terms of language capabilities, max tokens, and use case) to extract keywords (n-grams) from a text. The good thing is that it is super fast and has a small memory footprint.

Running KeyBERT on my Mac with Gradio

Imagine this: I am running the orchestra on a 16 GB RAM Intel Mac, with no GPU, and I am able to use LaMini-Flan-T5-77M for the summarization (plain PyTorch transformers), KeyBERT for the keywords (plain PyTorch transformers and the KeyBERT library), and Mistral-7B GGUF quantized to 4-bit (CTransformers library).

One of the amazing features of KeyBERT is the option to highlight the extracted keywords directly in the text. The picture below shows the terminal of the Gradio application from the GIF above.

parameter highlight=True to display the results from the original text

You can learn more from the official project page on GitHub:

official GitHub page

How can all of this be done, and why?

The out-of-the-box approach of the Langchain Document Loader is bound to fail. The main reason is that you don't have control over the metadata, and you don't have control over the quality of the content itself.

If you want a high quality retriever you need high quality data content and metadata. My approach consists of a few more steps in the data ingestion phase and Document construction, and only a few more in the RAG pipeline itself.

  1. Use Langchain only to load the document (pdf, txt, docx…).
  2. Use a text editor to clean up all the noise in the data (headers, footers, line breaks, and so on).
  3. Fill in the main metadata keys: document title, original filename, author, URL if any.
  4. Chunk the documents by tokens (to gain control of the final context window).
  5. Run KeyBERT on the chunks and pair each chunk's text with the metadata from point 3 and its keywords.

Let’s see the code…

Create a new virtual environment and activate it. We have to install a few packages for a local run. If you prefer to use a Google Colab notebook, you can find it in the GitHub repo.

pip install torch
pip install transformers
pip install langchain
pip install rich
pip install gradio
pip install keybert
pip install tiktoken

Tiktoken is required because we want to split the documents into chunks by counting tokens instead of characters.
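A quick way to see what counting tokens means in practice (the encoding name below is just an example):

# tiktoken turns text into tokens; a token-based splitter sizes chunks on this count,
# not on the number of characters
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # example encoding
sample = "Metadata provides context, structure, and meaning to data."
print(len(sample), "characters ->", len(enc.encode(sample)), "tokens")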

For the purpose of the test I will use an amazing article by Giles Crouch | Digital Anthropologist — it is a really good read.

Now it is time to import the libraries and create a few functions to be used in our custom pipeline. Remember, a tailor-made pipeline is the key to achieving better results.

from tqdm.rich import trange, tqdm
from rich.panel import Panel
from rich.markdown import Markdown
from rich.text import Text
import warnings
warnings.filterwarnings(action='ignore')
import datetime
from rich.console import Console
console = Console(width=110)   # rich console used for pretty printing throughout
from transformers import pipeline
import gradio as gr
import os
from keybert import KeyBERT

The first function is for KeyBERT. We are going to use a multilingual embedding model, the same one we will use for the vector store. This is not the fastest embedding model: if you don't have RAM constraints you can use KeyBERT(model='multi-qa-MiniLM-L6-cos-v1').

from keybert import KeyBERT
kw_model = KeyBERT(model='intfloat/multilingual-e5-base')

Now that we have an instance of KeyBERT (kw_model) we use the function extract_keys(text, ngram, dvsity), where we pass the text to be processed, the number of n-grams (words) to be extracted, and the diversity ratio (see the documentation for clarification).

#########################################################################################
#########    EXTRACT THE KEYWORDS FROM A TEXT, given nGram and Diversity     ############
#########    Return a LIST of tags                                           ############
#########################################################################################
def extract_keys(text, ngram,dvsity):
    a = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, ngram), 
                          stop_words='english',
                          use_mmr=True, diversity=dvsity, 
                          highlight=True)     #highlight=True
    tags = []
    for kw in a:
        tags.append(str(kw[0]))
    return tags

Let’s run it on an extract of the text mentioned above:

screenshot from the provided Google Colab Notebook

As we can see, it is also visually appealing (with the highlight function), but above all it returns a list of the extracted keywords.

Cool, right? Try for yourself how to iterate over chunks of text to extract and save the keywords. It is important that you store all the information in a JSON-like format.

filename = '/content/2023-12-03 18.41.12 Governing societies with Artificial Intelligence .txt'
with open(filename, encoding="utf8") as f:
  fulltext = f.read()          # the with block also closes the file for us
console.print("Text has been saved into variable [bold]fulltext")
# For now we fill in the main metadata fields by hand: in the future they should come
# from user input in a GUI
title = 'Governing societies with Artificial Intelligence'
filename = '2023-12-03 18.41.12 Governing societies with Artificial Intelligence .txt'
author = 'Giles Crouch'
url = 'https://gilescrouch.medium.com/governing-society-and-artificial-intelligence-23882b9ce473'

Now that we have the entire text (I already cleaned it up, as part of the EDA process), let’s create a basic LangChain Document, ready for a Vector Store DB.

Let’s run the TextSplitter with Token counts:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import TokenTextSplitter
TOKENtext_splitter = TokenTextSplitter(chunk_size=350, chunk_overlap=10)
splitted_text = TOKENtext_splitter.split_text(fulltext) #create a list

Now splitted_text contains a list of chunks (as plain strings).

screenshot from Google Colab Notebook
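A quick sanity check on the result (purely illustrative):

print(len(splitted_text), "chunks created")
print(splitted_text[0][:200])   # peek at the beginning of the first chunk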

We are going to iterate over the chunks and store the text, the main metadata, and the keywords in a JSON-like format (a Python dictionary):

keys = []
for i in trange(0,len(splitted_text)):
  text = splitted_text[i]
  keys.append({'document' : filename,
              'title' : title,
              'author' : author,
              'url' : url,
              'doc': text,
              'keywords' : extract_keys(text, 1, 0.34)
  })

As you can see, the main metadata (author, title…) is included in every chunk: this is the starting point for greater flexibility during the similarity search, where we can also filter the results by filename, author, or even keywords.

Let’s print the second chunk we created in the keys dictionary:
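For example, with the rich console we created earlier (index 1 is simply the second element of the keys list):

# print the second chunk together with its metadata and extracted keywords
console.print(keys[1])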

The official Langchain text loader is quite straightforward: it requires only 2 lines of code, but the only metadata you get is the source (meaning the filename of the loaded document):

from langchain.document_loaders import TextLoader
loader = TextLoader("./index.md")
loader.load()

#This is the result
[
    Document(page_content='---\nsidebar_position: 0\n---\n# Document loaders\n\n
Use document loaders to load data from a source as `Document`\'s. A `Document`
is a piece of text\nand associated metadata. For example, there are document
loaders for loading a simple `.txt` file, for loading the text\ncontents of any
web page, or even for loading a transcript of a YouTube video.\n\nEvery document
loader exposes two methods:\n1. "Load": load documents from the configured
source\n2. "Load and split": load documents from the configured source and
split them using the passed in text splitter\n\nThey optionally implement:\n\n
3. "Lazy load": load documents into memory lazily\n',
    metadata={'source': '../docs/docs/modules/data_connection/document_loaders/index.md'})
]

The Document object has a page_content and a metadata section. We want to make sure that the metadata section is always tailored to the content.

Let’s create a tailor-made Document set

LangChain allows you to create Document objects from scratch. And this is exactly what we do here:

############### CREATE CHUNKS DOC DATABASE ##################
from langchain.schema.document import Document
goodDocs = []
for i in range(0,len(keys)):
  goodDocs.append(Document(page_content = keys[i]['doc'],
                          metadata = {'source': keys[i]['document'],
                              'type': 'chunk',
                              'title': keys[i]['title'],
                              'author': keys[i]['author'],
                              'url' : keys[i]['url'],
                              'keywords' : keys[i]['keywords']
                              }))

The result is a list of LangChain Documents that we can pass to our VectorStore of choice. If we print one element we can see the result of all our efforts.

a LangChain Document with metadata enrichment
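As a sketch of that last step (FAISS is just one possible vector store; this assumes faiss-cpu and sentence-transformers are installed, and the query string is only an example):

# Index the enriched Documents; the embedding model matches the one used for KeyBERT
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")
db = FAISS.from_documents(goodDocs, embeddings)

results = db.similarity_search("Who should govern artificial intelligence?", k=3)
for r in results:
    print(r.metadata["title"], "|", r.metadata["keywords"][:5])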

Conclusions

At the heart of RAG there is data and metadata, the often overlooked yet invaluable information that accompanies data. Metadata provides context, structure, and meaning to data, making it easier for machines to understand and utilize. In the context of RAG, metadata plays a crucial role in enriching the retrieval process, enabling the model to identify the most relevant and informative passages from a vast knowledge base.

Metadata, once considered a mere byproduct of data management, is now emerging as a driving force in the field of artificial intelligence. By harnessing the power of metadata in RAG, we can not only enrich our knowledge bases but also elevate the capabilities of machines to understand, learn, and generate human-quality text. The future of AI is intertwined with metadata, and we are only beginning to discover the transformative potential of this hidden gem.

Next to do? Learn how to run a similarity search powered by the many filters that the user can set in the GUI: by keywords, then by filename or title, by author, and so on.
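As a first hedged sketch of that idea, here is a plain Python post-filter on top of the FAISS store from the previous snippet (the keyword and query are only examples; a real implementation would expose these filters in the Gradio GUI):

# Combine similarity search with a metadata filter, here by extracted keyword
def search_with_keyword(db, query, keyword, k=10):
    hits = db.similarity_search(query, k=k)
    return [d for d in hits if keyword in d.metadata.get("keywords", [])]

for doc in search_with_keyword(db, "regulation of artificial intelligence", "governance"):
    print(doc.metadata["title"], "|", doc.metadata["keywords"])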

Hope you enjoyed the article. If this story provided value and you wish to show a little support, you could:

  1. Clap a lot of times for this story
  2. Highlight the parts more relevant to be remembered (it will be easier for you to find it later, and for me to write better articles)
  3. Learn how to start to Build Your Own AI, download This Free eBook
  4. Sign up for a Medium membership using my link — ($5/month to read unlimited Medium stories)
  5. Follow me on Medium
  6. Read my latest articles https://medium.com/@fabio.matricardi
