Sascha Heyer

Generative AI - Document Retrieval and Question Answering with LLMs

Apply LLMs to your domain-specific data

With Large Language Models (LLMs), we can integrate domain-specific data to answer questions. This is especially useful for data unavailable to the model during its initial training, like a company's internal documentation or knowledge base.

The architecture is called Retrieval Augmented Generation (RAG) or, less commonly, Generative Question Answering.

This article helps you understand how to implement this architecture using LLMs and a vector database. With it, we can significantly decrease the hallucinations that are commonly associated with LLMs.

It can be used for a wide range of use cases and reduces the time we need to interact with documents. There is no longer any need to dig through search results for answers: the LLM takes care of finding the most relevant documents and uses them to generate the answer right from your documents.

Jump to the Notebook and Code

All the code for this article is ready to use in a Google Colab notebook. If you have questions, please reach out to me via LinkedIn or Twitter.

LangChain

I extensively use LangChain, an open-source software development framework designed to simplify the creation of applications that use large language models (LLMs).

In this article, it is used for tasks like data loading, document chunking, vector stores, and interaction with the text and embedding models.

Using LangChain is not a requirement, but it reduces implementation time and helps keep the solution easily maintainable.

Google implemented a Vertex AI LangChain integration that was initially merged in April 2023.

Documents

As documents for this example, we use the full Google Cloud Vertex AI documentation, https://cloud.google.com/vertex-ai/sitemap.xml. This way, our LLM is able to answer domain-specific questions about Vertex AI.

Any type of document will work as long as it is available in some textual form.

Combining LLMs and Vector Databases for Enhanced Question Answering Systems

Let’s examine how we can combine the strengths of LLMs and Vector Databases to create a powerful document retrieval and question-answering system.

Architecture overview (source: author)

I will use the terms vector and embedding interchangeably throughout the rest of the article.

Steps involved (for full code, see shared notebook above):

Get the Documents and Preprocess

LangChain supports a wide range of options to load data. We load the data based on the sitemap using LangChain's sitemap document loader.
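Here is a minimal sketch of that loading step, assuming the SitemapLoader available in the LangChain version used at the time of writing:

from langchain.document_loaders import SitemapLoader

# Load every page listed in the Vertex AI documentation sitemap.
# In a notebook, the async fetching may additionally require nest_asyncio.apply().
loader = SitemapLoader(web_path="https://cloud.google.com/vertex-ai/sitemap.xml")
documents = loader.load()
print(f"Loaded {len(documents)} documents")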

Documents can be quite large and contain a lot of text. Therefore, we need to split each document into smaller chunks. This reduces the size of the text that is sent to the LLM.

This has two main advantages:

  1. The embedding created for a document chunk more accurately represents the information relevant to our question.
  2. During retrieval, we receive smaller documents and can keep the LLM context small. This leads to lower latency and lower costs.
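As a sketch, the splitting can be done with LangChain's RecursiveCharacterTextSplitter; the chunk size and overlap below are illustrative values, not necessarily the ones used in the notebook:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded documents into overlapping chunks.
# chunk_size and chunk_overlap are illustrative values.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")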

After this process, we have a folder containing the document chunks.

Those chunk documents are used in the next steps to

  1. Create the document embeddings
  2. Answer our questions

Document Embedding

The embedding step transforms our documents into a vector representation called embedding. We can use an LLM to encode each document chunk into a high-dimensional vector. This process captures the semantic information of the document in the form of a vector.

Again, I use LangChain to call the Vertex AI PaLM 2 embedding model. As already said, LangChain is absolutely not a must; for this task, you could also rely on Google's Vertex AI SDK.

from langchain.embeddings import VertexAIEmbeddings

# Uses the Vertex AI PaLM 2 embedding model behind the scenes.
embeddings = VertexAIEmbeddings()

# Embed a single example text; embed_documents returns one vector per input text.
text = "DoiT is a great company"
doc_result = embeddings.embed_documents([text])

After this process, we have a vector representation (embedding) for each document chunk.

Storing in Vector Database

Once we have our document vectors, we store them in a vector database. This database enables efficient similarity search among the vectors, helping us retrieve the most relevant documents for a given query.
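As an illustration, the chunks can be pushed into the vector database through LangChain's vector store interface. The sketch below uses the Matching Engine integration and assumes an existing index and endpoint; all IDs are placeholders, and the exact class and parameters depend on your LangChain version:

from langchain.vectorstores import MatchingEngine

# Connect to an existing Vertex AI Matching Engine index and endpoint.
# All IDs below are placeholders.
vector_store = MatchingEngine.from_components(
    project_id="your-project-id",
    region="us-central1",
    gcs_bucket_name="your-staging-bucket",
    index_id="your-index-id",
    endpoint_id="your-endpoint-id",
    embedding=embeddings,
)

# Add the document chunks; the store embeds them and upserts the vectors.
vector_store.add_texts(texts=[chunk.page_content for chunk in chunks])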

Query Processing

When a question is asked, we use the LLM, in our case Google's Vertex AI PaLM 2 embedding model, to transform the question into a vector, much like we did with the documents in the previous step.
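A minimal sketch of this step, reusing the embeddings wrapper from above (the example question is made up for illustration):

# Embed the user question with the same embedding model used for the documents.
question = "How do I deploy a model to a Vertex AI endpoint?"
question_embedding = embeddings.embed_query(question)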

Document Retrieval

Based on the query from the previous step, we search the vector database for the document vectors that are most semantically similar to the question vector. The documents associated with these vectors are the most relevant to our query and will be used as context for our LLM.

The vector database only contains the embeddings and an identifier without the actual text.

To match the vector results to the actual documents, I again use LangChain, which takes the identifiers and maps them back to the document chunks.
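With the vector store from the earlier sketch, retrieval and the mapping back to the document chunks can look like this; the result is what later goes into the prompt as context (k is an illustrative value):

# Retrieve the document chunks that are semantically closest to the question.
matching_docs = vector_store.similarity_search(question, k=4)

# Concatenate the chunk texts into the context for the LLM.
matching_engine_response = "\n\n".join(doc.page_content for doc in matching_docs)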

Answer Generation

Finally, the retrieved documents are fed back as context into the LLM. The LLM generates an answer based on the information provided. The important part is the prompt structure.

prompt=f"""
Follow exactly those 3 steps:
1. Read the context below and aggregate this data
Context : {matching_engine_response}
2. Answer the question using only this context
3. Show the source for your answers
User Question: {question}


If you don't have any context and are unsure of the answer, reply that you don't know about this topic and are always learning.
"""

Tuning vs. Indexing

To get an LLM to answer questions based on our domain-specific knowledge, we can either fine-tune the model or let the LLM use an external index that can be queried at runtime.

Indexing has some advantages over tuning:

  • New documents are available in real-time, compared to tuning, which might require a couple of hours.
  • We circumvent the context size limitations. Most LLMs allow around 4,000 tokens per request, which makes it impossible to provide a large amount of data at once. With the indexing approach, our LLM can rely on a virtually unlimited amount of data, because the retrieval step sends only the relevant documents.
  • Restricted documents that aren't supposed to be available to everyone can be filtered at runtime, whereas a tuned model knows nothing about access restrictions to documents.
  • Cheaper because no LLM fine-tuning is required.
  • Explainable thanks to the underlying data, which helps verify whether an answer is correct when needed. We know the ground truth.
  • Combined with prompt engineering, we can avoid hallucinations.

Cloud Architecture

It is possible to implement this as a batch or stream process, depending on the requirements.

This solution relies solely on Google Cloud products. We use the Vertex AI PaLM API to create the document and question embeddings. To answer questions based on the retrieved document candidates, we also use the PaLM API, this time with the PaLM text model.

The embeddings are stored in Matching Engine, a vector database on Google Cloud that provides vector similarity matching. This similarity matching is used to find the documents that match our question.

Handling a large number of embeddings can be challenging. I recommend using a managed service like Google's Matching Engine or Pinecone. If you have never used Google Matching Engine before, check out my dedicated article and YouTube video.

Everything can be nicely bound together using Cloud Run to provide an API that can be used in your application. The Cloud Run services can take care of adding or updating documents in the index (streaming) and of returning the right answer to your question. You could also implement a Cloud Function trigger that processes new documents as soon as they are uploaded to Google Cloud Storage.

Keep in mind the documents are processed into chunks. To properly handle an update, we need to be able to identify all chunks related to a document. This needs to be done for the documents stored on Cloud Storage as well as the embeddings stored in Matching Engine. This can be solved by combining a document ID + chunk ID. If you want to learn more about streaming with Vertex AI Matching Engine, check out my dedicated article.
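As a small sketch, such an identifier could be built like this; the naming scheme is an assumption for illustration, not the one from the notebook:

# Build a stable identifier per chunk so that all chunks of a document can be
# found and replaced together in Cloud Storage and Matching Engine.
def chunk_id(document_id: str, chunk_index: int) -> str:
    return f"{document_id}_chunk_{chunk_index:04d}"

chunk_ids = [chunk_id("vertex-ai-docs-overview", i) for i, _ in enumerate(chunks)]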

As already mentioned, it heavily depends on your requirements.

For the batch approach, we can additionally introduce a Cloud Scheduler job or a Vertex AI Pipeline that adds new documents to the index, for example, every night.

Follow me for an upcoming production-ready stream implementation.

Multilingual Capabilities

Many Large Language Models are trained on multiple languages and can handle them. This is particularly useful if your documents are, for example, in English but your question is in Spanish. The LLM can translate the right answer found in an English document to Spanish 🤯. The PaLM 2 model is, at the time of writing this article (June 2023), available only in English. A multilingual release is on the roadmap.

Quotas

We process approximately 10,000 documents using the Vertex AI PaLM API. During the implementation, I ran into quota issues, which ultimately forced me to implement exponential backoff.
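A simple sketch of such a retry with exponential backoff; the wrapped call and the limits are illustrative:

import random
import time

# Retry the embedding call with exponential backoff and jitter when the API rejects it.
def embed_with_backoff(texts, max_retries=5):
    for attempt in range(max_retries):
        try:
            return embeddings.embed_documents(texts)
        except Exception:
            # Wait 1s, 2s, 4s, ... plus a bit of jitter before retrying.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Embedding request kept failing after retries")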

I've checked the Vertex AI quotas but could not find any quota that was hitting its limit. If you know which quotas are affected, let me know. I assume this is due to the preview state of the API, and the quota is not yet visible in the UI. I am in contact with Google and will update this article accordingly.

Conclusion

This article delved into the realm of Generative AI and Large Language Models (LLMs), showcasing how LLMs combined with vector databases can transform document retrieval and question-answering systems.

Through a step-by-step process involving data preprocessing, document embeddings, vector database storage, and LLM-based query transformation, users can obtain precise and contextually relevant answers from their documents.

Moreover, the article sheds light on the benefits of utilizing indexing techniques over fine-tuning LLMs, including real-time document availability, reduction of context limitations, data access control, cost-effectiveness, and verifiability.

Generative AI Series

I’ve written a series of articles, and there’s more to come. Stay tuned by following me.

  1. Generative AI — The Evolution of Machine Learning Engineering
  2. Generative AI — Getting Started with PaLM 2
  3. Generative AI — Best Practices for LLM Prompt Engineering
  4. Generative AI — Document Retrieval and Question Answering with LLMs
  5. Generative AI — Mastering the Language Model Parameters for Better Outputs
  6. Generative AI — Understand and Mitigate Hallucinations in LLMs
  7. Generative AI — Learn the LangChain Basics by Building a Berlin Travel Guide
  8. Generative AI — Image Generation using Vertex AI Imagen
  9. Generative AI — Protect your LLM against Prompt Injection in Production
  10. Generative AI — AWS Bedrock Approach vs. Google & OpenAI
  11. Generative AI — How to Fine Tune LLMs
  12. more to come over the next weeks

Thanks for reading

Your feedback and questions are highly appreciated. You can find me on LinkedIn or connect with me via Twitter @HeyerSascha. Even better, subscribe to my YouTube channel ❤️.

Machine Learning
Large Language Models
Vertex AI
Generative AI
Google Cloud Platform