The Simple Principle Behind Retrieval Augmented Generation in Large Language Models

Understand RAG intuitively and implement a chat pipeline with your documents using LangChain and Llama v2

In what can only be described as the blink of an eye, large language models have exploded into the public consciousness.

Even if you have nothing to do with tech, you (and your grandma) have heard of ChatGPT!

It’s easy, accessible, and mighty — even on the free tier.

ChatGPT is pretty good at tackling general questions like:

What is the speed of a rock in free fall from a height of 10 meters?

But what if you want it to calculate something proprietary, like:

What is the exact trajectory of landing a SpaceX rocket?

In the first case, it gets the answer right (about 14 m/s, from v = √(2gh) with g ≈ 9.8 m/s², in case you were wondering), but in the second case…

IT FAILS.

Were you a SpaceX employee, you wouldn’t want to risk inputting your proprietary data into an external LLM — that could be a big security risk for the organization!

What do you do in such a case where proprietary documents are involved?
Wouldn’t it be wonderful if you could pose questions to your documents too?

Well, one of the methods is to train an LLM with your data.

If you wish to do that, I hope you have a billion dollars in the bank, or you own OpenAI or Microsoft.

Training or fine-tuning a large language model is not only astronomically expensive but also time-intensive!

But if you neither have a billion dollars nor own OpenAI, don't worry.

There are still ways you can do it.

One of the most popular methods is called Retrieval Augmented Generation (RAG).

Sounds very complex?

Despite how it sounds, the concept is mind-blowingly simple.

And by the end of the post, you will be able to implement a script to chat with your very own local documents!

Retrieval Augmented Generation through an Analogy

Let’s start with a scenario that has absolutely nothing to do with tech:

Gordon is a British Master Chef — he has all the English recipes at the back of his head, and the art of making them at his fingertips. He could whip you up the perfect Cornish Pasty in the blink of an eye.

He’s the man to go to for all things (delicacies) British!

But one fine day, he has a visit from his Indian friend and Chef — Sanjeev, who has a challenge for Gordon — to cook an Indian specialty lamb curry called Rogan Josh.

Despite being a great chef, Gordon goes blank. Not surprising, since his specialty is British food.

Thankfully, there’s a recipe book!

And Gordon is up for the challenge. He knows how to cook after all — he just needs to extend his knowledge!

He glances at the index to find the recipe, quickly flips through the book, and finds the page with all the ingredient lists and the steps!

He follows the steps diligently and is done in no time!

Time for the tasting!

Sanjeev is already impressed with the aroma and takes a bite.

Although not exactly as expected, he is happy with the result! It’s not bad at all for someone who was not trained to make this recipe.

Image copyright of the author (also contains a CC image of the curry from Wikimedia)

And they continue enjoying the meal…

But, you are not here for the curry…You are here to understand RAG.

So I’ll take you a bit closer to our topic.

Gordon is an expert at cooking, which means he has a broad knowledge of cooking methods and ingredients: he knows that chili is spicy, that lemon is sour, and how meat fries in oil. He does not need to learn any of it from scratch. His knowledge gap? Indian food. → In a sense, you can think of Gordon as any LLM, such as GPT or Llama: trained on vast amounts of data but still left with a few domain-specific knowledge gaps.
He could practice the Indian recipes too, and gain expertise in them. But that would take months of reading and cooking. → This would be equivalent to fine-tuning the LLM.
Since he does not have the time and the request is immediate, his only option is to: 1. refer to a recipe book, 2. search for the recipe, 3. follow the instructions, leveraging his existing knowledge of cooking, and prepare the dish. → This, my dear reader, is what the LLM world calls Retrieval Augmented Generation!

Et voilà.

In other words, at the risk of sounding too technical:

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.¹

RAG fills the knowledge gaps in LLMs by using external sources.

The technique was first introduced in the 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, which also coined the term.

In our analogy, Gordon is a human who can easily look at the recipe book’s index, find the exact page, and follow the steps.

Unfortunately, it’s not that easy for an LLM: it needs a retrieval system.

Thankfully we have an open-source framework ready to be used.

RAG with LangChain

Let’s now boil down the image above into a technical vanilla diagram of how RAG works.

Looks simple now, doesn’t it?

All we need is an external knowledge base, an LLM, and a retrieval system!
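
Before reaching for real libraries, it can help to see those three pieces as a toy. The sketch below is only an illustration under loud assumptions: the knowledge base is three made-up sentences, the "retrieval system" just counts shared words, and fake_llm is a hypothetical stand-in that echoes the context instead of calling a model. None of these names come from LangChain.

```python
import re
from collections import Counter

# Hypothetical mini knowledge base (stands in for the recipe book).
KNOWLEDGE_BASE = [
    "Rogan Josh is a lamb curry flavoured with Kashmiri chili and yogurt.",
    "A Cornish pasty is a baked pastry filled with beef, potato and swede.",
    "Beavers build dams that create ponds and wetlands.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = Counter(tokenize(question))

    def overlap(doc):
        return sum((question_words & Counter(tokenize(doc))).values())

    return sorted(documents, key=overlap, reverse=True)[:k]


def augment(question, context_docs):
    """Put the retrieved context in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def fake_llm(prompt):
    """Placeholder for a real model call: it only echoes the first context line."""
    return prompt.split("\n")[1]


question = "What is Rogan Josh?"
prompt = augment(question, retrieve(question, KNOWLEDGE_BASE))
answer = fake_llm(prompt)
print(answer)
```

A real pipeline replaces word counting with embeddings and fake_llm with an actual model, but the flow (retrieve, augment, generate) stays the same.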

External knowledge base:

You need some documents or files of your own that you want answers from. That is the external knowledge base, like the recipe book in our analogy.

For this example, I chose a recent article from Wired and converted it to a PDF — Spying on Beavers From Space Could Help Save California | WIRED

This article was published on 28th December 2023, after the Llama-2 7B chat model was trained, so the model cannot already know its contents (we will cross-check this at the end).

Language Model:

Since we are implementing the whole pipeline locally, you need to download your preferred model. This is our equivalent to the Master Chef — Gordon.

You’ll use the GPT4All package, which supports many language models. In this post, I use the Llama v2 7B Chat model. The exact file used is llama-2-7b-chat.Q8_0.gguf, but you can try other models too; just make sure the file type is .gguf.

Also, make sure you have the following dependencies/libraries installed:

# Install langchain
pip install langchain

# Install vectorStore
pip install faiss-cpu

# Install gpt4all
pip install gpt4all

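# Install huggingfaceHub
pip install huggingface-hub

# Install sentence-transformers (required by HuggingFaceEmbeddings)
pip install sentence-transformers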

# Install PyPdf for working with PDFs
pip install pypdf

With everything installed, load the model in Python:

# Import LLM packages
from langchain.llms import GPT4All

# Path to the local model file you want to use
llama_7b_path = r'/model_path/llama-2-7b-chat.Q8_0.gguf'

# Create the LLM instance
llm = GPT4All(model=llama_7b_path,
              max_tokens=1000, 
              verbose=True, 
              repeat_last_n=0)

With our LLM ready, let's focus on the retrieval system.

You know now that the retrieval system should have the following key components:

Indexing
Retrieving
Augmenting

All this can be done using an open-source Python framework called LangChain². It was designed to simplify the creation of applications using LLMs.

Now let’s look at what it does while coding along the way and implementing your very own RAG pipeline!

1. Indexing

This is the system that loads the knowledge base (loading), breaks it down into smaller pieces (chunking), converts it to vector representations for retrieval (embedding), and stores these representations (vector database).

#1. INDEXING

# import all packages required for Indexing
from langchain.document_loaders import PyPDFLoader #load
from langchain.text_splitter import RecursiveCharacterTextSplitter #chunk
from langchain.embeddings import HuggingFaceEmbeddings #embedding
from langchain.vectorstores.faiss import FAISS #vector stores

# Loading
loader = PyPDFLoader("WIRED_article.pdf")
documents = loader.load_and_split()
print(len(documents))

The output is the number of documents the loader produced, which is normally one per PDF page; in this case, it is 7. Now let's break them down into smaller pieces, or chunks.

# Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64
)
texts = text_splitter.split_documents(documents)

print(len(texts))
print(texts)

This should give 16 chunks, each at most 1024 characters long and sharing up to 64 characters with its neighbor. If you print the texts, you will see each chunk's content along with its metadata.
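
If chunk size and overlap feel abstract, here is a deliberately simplified sketch of the idea. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on paragraph and sentence boundaries); chunk_text is a hypothetical helper that just slides a fixed-width window over the characters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. That overlap is what keeps a sentence that straddles a boundary from being cut off from its context.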

Now we have to convert them into vector representations, called embeddings, and store them.

For this, we will use the sentence transformers³ model from Hugging Face.

Another package we will use is called FAISS⁴ — or Facebook AI Similarity Search — which easily retrieves documents based on the similarity of their vector representations.

# Embedding
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)


# Storing in vector database
vectorstore = FAISS.from_documents(texts, embeddings)

# Define the retriever for the stored indices
retriever = vectorstore.as_retriever()

Now you have converted the PDF document into retrievable vectors stored in the vector database.
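
What does "retrievable" mean here? Roughly: given a query vector, find the stored vectors closest to it. The sketch below shows that idea with cosine similarity on made-up three-dimensional vectors. Treat it as an assumption-laden toy: real embeddings have hundreds of dimensions, FAISS can use other measures such as Euclidean distance, and top_k and toy_store are hypothetical names, not part of FAISS or LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the labels of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label)
              for label, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Made-up vectors standing in for embedded chunks.
toy_store = {
    "chunk about beaver dams": [0.9, 0.1, 0.0],
    "chunk about satellites": [0.1, 0.9, 0.1],
    "chunk about cooking": [0.0, 0.1, 0.9],
}

query_vector = [0.8, 0.3, 0.0]  # pretend embedding of the user question
nearest = top_k(query_vector, toy_store, k=2)
print(nearest)
```

A vector database does the same job, only with indexing tricks that keep the search fast when there are millions of vectors instead of three.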

2. Retrieving

This is a system that fetches the user query (user input), converts the query to vectors just as we did for the document above, and then returns the relevant information from the vector store based on similarity to the query (retrieving).

#2. RETRIEVING

# import all packages required for Retrieving
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

#create a retrieval chain with the document source in the vector store
qa_chain_with_sources = RetrievalQA.from_chain_type(llm=llm,
                                                    retriever=retriever,
                                                    callbacks=[StdOutCallbackHandler()],
                                                    return_source_documents=True)

3. Augmenting

This is the system that enhances the user input prompt with the retrieved context (augment), ensuring that the LLM has all the required information to create an apt response (LLM output).

#3. AUGMENTING

# Send the augmented user query to the LLM to get the response
response = qa_chain_with_sources({"query": "what is EEAGER and why is it used?"})

# Print the generated answer
print(response["result"])
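
The chain hides the augmentation itself, so here is a rough sketch of what "stuffing" retrieved chunks into a prompt looks like. The template wording and the build_augmented_prompt helper are my own assumptions for illustration, not LangChain's exact prompt, and the chunks are placeholder strings rather than real text from the article.

```python
# Illustrative template; the wording is an assumption, not the library's own.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the answer is not in the context, say you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_augmented_prompt(question, retrieved_chunks):
    """Join the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for what the retriever would return.
retrieved_chunks = [
    "First retrieved chunk of the article (placeholder text).",
    "Second retrieved chunk of the article (placeholder text).",
]

augmented_prompt = build_augmented_prompt(
    "what is EEAGER and why is it used?", retrieved_chunks
)
print(augmented_prompt)
```

The LLM never sees your whole document, only this one prompt. That is why the quality of the retrieved chunks matters so much.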

That’s it!

If all goes well, you should get output similar to this:

“EEAGER stands for Earth Engine Automated Geospatial Elements Recognition. It is a machine learning algorithm developed by Google that can identify beaver dams in satellite imagery with a high degree of accuracy. EEAGER is used to help scientists and conservationists monitor beaver populations, track their movements, and understand their role in ecosystems. By using EEAGER to analyze satellite imagery, researchers can identify beaver dams and ponds more quickly and accurately than by manually reviewing images, which can save time and resources.”

Not bad! It matches what’s in the document. You can even inspect the source chunks the answer was drawn from:

response['source_documents']
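
The raw list can be long, so a small helper makes it easier to skim. The sketch below assumes each source document exposes page_content and metadata attributes (with a "page" key, as PDF loaders typically provide). summarize_sources is a hypothetical helper, and fake_response is a stand-in so the snippet runs without a model.

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars=80):
    """Return one short 'page N: preview' line per source document."""
    lines = []
    for doc in response["source_documents"]:
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"page {page}: {preview}")
    return lines


# Stand-in for a real chain response (placeholder values only).
fake_response = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(
            page_content="Placeholder chunk text\nfrom the PDF.",
            metadata={"source": "WIRED_article.pdf", "page": 2},
        ),
    ],
}

summary = summarize_sources(fake_response)
for line in summary:
    print(line)
```

With the real pipeline, you would pass response instead of fake_response.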

Also, a quick look at the document shows that this is a pretty decent summary, and the acronym EEAGER is correct.

But hang on, what if this information was already there in the model?

Well, you can do a quick check with the model, without the external sources:

# Check model output without external vector sources

vanilla_model = GPT4All(model=llama_7b_path,
                        max_tokens=1000, 
                        verbose=True, 
                        repeat_last_n=0)

output = vanilla_model.generate(["what is EEAGER and why is it used?"])

# Print output
print(output.generations[0][0].text)

The output is something about eager evaluation in databases. Nothing to do with our document!

But just to double-check, let's give it some context by saying that the question relates to a Wired article about beavers:

output = vanilla_model.generate(["what is EEAGER and why is it used? "
                                 "The context is a Wired article about beavers"])

# Print output
print(output.generations[0][0].text)

This time the output does contain some information about beavers, but it is not related to our document, and it does not get the acronym right. The correct expansion is only available in the external source we used.

What does that mean?

That means your RAG pipeline worked perfectly: you successfully chatted with your local data!

Congratulations!

Putting it all together

To summarize, in this post:

You learned Retrieval Augmented Generation with the help of a fun and intuitive explanation
You got an overview of the frameworks available to implement it (LangChain)
You implemented a local RAG pipeline from scratch in Python to chat with your local document!

Below I have put the entire script together, for you to use it easily!

You can refer to the repository with the article pdf used here: GitHub
Direct download link to the model used: Download

Hope you enjoyed reading!

Until next time!

Fin.

Sources:

1. NVIDIA Blog, What is RAG: https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
2. LangChain documentation: https://python.langchain.com/docs/get_started/introduction
3. Sentence transformers: https://huggingface.co/sentence-transformers
4. FAISS: https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/