How To Build an LLM-Powered App To Chat with PapersWithCode
Keep up with the latest ML research
Do you find it difficult to keep up with the latest ML research? Are you overwhelmed by the massive number of papers about LLMs, vector databases, or RAGs?
In this post, I will show how to build an AI assistant that mines this large amount of information easily. You’ll ask it your questions in natural language and it’ll answer according to relevant papers it finds on Papers With Code.
On the backend side, this assistant will be powered with a Retrieval Augmented Generation (RAG) framework that relies on a scalable serverless vector database, an embedding model from VertexAI, and an LLM from OpenAI.
On the front-end side, this assistant will be integrated into an interactive and easily deployable web application built with Streamlit.
Every step of this process will be detailed below with an accompanying source code that you can reuse and adapt👇.
Ready? Let’s dive in 🔍.
If you’re interested in ML content, detailed tutorials, and practical tips from the industry, follow my newsletter. It’s called The Tech Buffet.
1 — Collect data from Papers With Code
Papers With Code (a.k.a. PWC) is a free website where researchers and practitioners can find and follow the latest state-of-the-art ML papers, source code, and datasets.
Luckily, it’s also possible to interact with PWC through an API to programmatically retrieve research papers. If you look at this Swagger UI, you can find all the available endpoints and try them out.
Let’s, for example, search papers on a specific keyword.
Here’s how to do it from the interface: you locate the papers/ endpoint, fill in the query (q) argument, and hit the Execute button.
Equivalently, you can perform this same search by hitting this URL.
The output response shows the first page of results only. The following pages are available by accessing the next key.
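To make this pagination concrete, here’s a quick way to inspect those keys from Python (the printed values are illustrative):

import requests

response = requests.get(
    "https://paperswithcode.com/api/v1/papers/?q=Large+Language+Models"
).json()

# The payload is paginated: "count" is the total number of matches,
# "next" links to the following page, and "results" holds the current page.
print(response["count"])         # e.g. 7200
print(response["next"])          # URL of page 2
print(len(response["results"]))  # 50 papers per page by default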
By exploiting this structure, we can retrieve the 7200 papers matching “Large Language Models”. This can simply be done with a function that requests the URL and loops over all the pages.
import math
import urllib.parse

import requests
from tqdm import tqdm


def extract_papers(query: str):
    """Retrieve all Papers With Code results matching `query`."""
    query = urllib.parse.quote(query)
    url = f"https://paperswithcode.com/api/v1/papers/?q={query}"
    response = requests.get(url).json()

    count = response["count"]
    results = response["results"]

    # The API returns 50 results per page, so round up to cover the last page
    num_pages = math.ceil(count / 50)
    for page in tqdm(range(2, num_pages + 1)):
        url = f"https://paperswithcode.com/api/v1/papers/?page={page}&q={query}"
        response = requests.get(url).json()
        results += response["results"]

    return results


query = "Large Language Models"
results = extract_papers(query)
print(len(results))
# 7200
Once the results are extracted, we convert them from their raw JSON format into LangChain Documents to simplify chunking and indexing.
Document objects have two parameters:
- page_content (str): to store the text of the paper’s abstract
- metadata (dict): to store additional information. In our use case we’ll keep: id, arxiv_id, url_pdf, title, authors, published
from langchain.docstore.document import Document

documents = [
    Document(
        page_content=result["abstract"],
        metadata={
            "id": result["id"] if result["id"] else "",
            "arxiv_id": result["arxiv_id"] if result["arxiv_id"] else "",
            "url_pdf": result["url_pdf"] if result["url_pdf"] else "",
            "title": result["title"] if result["title"] else "",
            "authors": result["authors"] if result["authors"] else "",
            "published": result["published"] if result["published"] else "",
        },
    )
    for result in results
]
Prior to embedding the documents, we need to chunk them into smaller pieces. This helps overcome LLMs’ limitations in terms of input tokens and provides fine-grained information per chunk.
After chunking the documents with a chunk_size of 1200 characters and a chunk_overlap of 200, we end up with over 11K splits.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["."],
)
splits = text_splitter.split_documents(documents)

len(splits)
# 11308
2 — Create an index on Upstash
To be able to store document embeddings (and metadata) somewhere, we first have to create an index.
In this tutorial, we’ll use Upstash, a serverless database.
To create an index in it, you need to log in here and follow the instructions to fill in some parameters:
- Region: pick the one closest to your location.
- Dimensions: set it to 768 (VertexAI’s embedding dimension).
- Distance metric: set it to cosine.
Once the index is created, you need to install the upstash-vector package:
pip install upstash-vector
This allows you to establish a connection to the index.
from upstash_vector import Index

index = Index(
    url="<UPSTASH_URL>",
    token="<UPSTASH_TOKEN>",
)
URL and token are the credentials you’ll need to connect to your index. Keep them safe and don’t version them with the code. They’re available in your account’s settings.
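Since these credentials shouldn’t be hardcoded, a safer option is to read them from environment variables. Here’s a minimal sketch (the variable names UPSTASH_URL and UPSTASH_TOKEN are my own choice):

import os

from upstash_vector import Index

# Export UPSTASH_URL and UPSTASH_TOKEN in your shell (or load them from a .env file)
# before running this; the variable names are arbitrary.
index = Index(
    url=os.environ["UPSTASH_URL"],
    token=os.environ["UPSTASH_TOKEN"],
)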
3 — Embed the chunks and index them into Upstash
To embed the chunks and index them into the vector database, we’ll create a simple class that mimics LangChain’s VectorStore implementation.
This class will be named UpstashVectorStore and will have the following methods:
- an __init__ constructor that expects an Upstash Index and an Embeddings object
- add_documents to embed documents and index them in batches
- similarity_search_with_score to query the index and retrieve the top_k most relevant documents along with their corresponding scores
Here’s the full implementation:
from typing import List, Optional, Tuple, Union
from uuid import uuid4

from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from tqdm import tqdm
from upstash_vector import Index


class UpstashVectorStore:
    def __init__(self, index: Index, embeddings: Embeddings):
        self.index = index
        self.embeddings = embeddings

    def delete_vectors(
        self,
        ids: Union[str, List[str]] = None,
        delete_all: bool = None,
    ):
        if delete_all:
            self.index.reset()
        else:
            self.index.delete(ids)

    def add_documents(
        self,
        documents: List[Document],
        ids: Optional[List[str]] = None,
        batch_size: int = 32,
    ):
        texts = []
        metadatas = []
        all_ids = []

        for document in tqdm(documents):
            text = document.page_content
            metadata = document.metadata
            # Store the chunk text in the metadata so it can be recovered at query time
            metadata = {"context": text, **metadata}
            texts.append(text)
            metadatas.append(metadata)

            # Embed and upsert a full batch
            if len(texts) >= batch_size:
                ids = [str(uuid4()) for _ in range(len(texts))]
                all_ids += ids
                embeddings = self.embeddings.embed_documents(texts, batch_size=250)
                self.index.upsert(
                    vectors=zip(ids, embeddings, metadatas),
                )
                texts = []
                metadatas = []

        # Flush the remaining documents that didn't fill a full batch
        if len(texts) > 0:
            ids = [str(uuid4()) for _ in range(len(texts))]
            all_ids += ids
            embeddings = self.embeddings.embed_documents(texts)
            self.index.upsert(
                vectors=zip(ids, embeddings, metadatas),
            )

        n = len(all_ids)
        print(f"Successfully indexed {n} dense vectors to Upstash.")
        print(self.index.stats())
        return all_ids

    def similarity_search_with_score(
        self,
        query: str,
        k: int = 4,
    ) -> List[Tuple[Document, float]]:
        # Embed the query and retrieve the k closest vectors with their metadata
        query_embedding = self.embeddings.embed_query(query)
        query_results = self.index.query(
            query_embedding,
            top_k=k,
            include_metadata=True,
        )

        output = []
        for query_result in query_results:
            score = query_result.score
            metadata = query_result.metadata
            # The original chunk text was stored under the "context" key
            context = metadata.pop("context")
            doc = Document(
                page_content=context,
                metadata=metadata,
            )
            output.append((doc, score))
        return output
Let’s use this class to index the chunks:
from langchain.embeddings import VertexAIEmbeddings
from upstash_vector import Index

index = Index(
    url="<UPSTASH_URL>",
    token="<UPSTASH_TOKEN>",
)

embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")

upstash_vector_store = UpstashVectorStore(index, embeddings)
ids = upstash_vector_store.add_documents(splits, batch_size=25)
This process may take a while depending on the number of splits, your connection speed, and the chosen batch size.
When the indexing process is done, you can check the vectors and their metadata from the UI: this helps with quick sanity checks and makes it easy to manage (e.g. delete) records.
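You can also run a quick sanity check directly from code by sending a test query to the freshly populated index (the query below is arbitrary):

# Any test query works here; we just want to confirm that relevant chunks come back
results = upstash_vector_store.similarity_search_with_score(
    "retrieval augmented generation", k=3
)

for doc, score in results:
    # Each hit returns its similarity score and the metadata we indexed
    print(round(score, 3), doc.metadata["title"])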
4 — Ask questions about the indexed papers
With the abstracts being correctly indexed into Upstash, we can now interact with them in natural language and ask specific questions about ML topics.
This is much easier than it looks.
To do this, let’s first define a function that, given a question, retrieves related documents from the vector store and uses them to build a prompt.
def get_context(query, vector_store):
    # Retrieve the most relevant chunks and concatenate them into a single context string
    results = vector_store.similarity_search_with_score(query)
    context = ""
    for doc, score in results:
        context += doc.page_content + "\n===\n"
    return context


def get_prompt(question, context):
    template = """
Your task is to answer questions by using a given context.
Don't invent anything that is outside of the context.
Answer in at least 350 characters.
%CONTEXT%
{context}
%Question%
{question}
Hint: Do not copy the context. Use your own words
Answer:
"""
    prompt = template.format(question=question, context=context)
    return prompt
Feeling uninspired? Here’s a question to get you started:
“What are the problems behind the Retrieval Augmented Generation (RAG) framework?”
query = (
    "What are the problems behind the Retrieval Augmented Generation (RAG) framework?"
)
context = get_context(query, upstash_vector_store)
prompt = get_prompt(query, context)
Here’s what the prompt looks like after receiving the context:
Your task is to answer questions by using a given context.
Don't invent anything that is outside of the context.
Answer in at least 350 characters.
%CONTEXT%
Retrieval-Augmented Generation (RAG) is a promising approach for
mitigating the hallucination of large language models (LLMs).
However, existing research lacks rigorous evaluation of the impact
of retrieval-augmented generation on different large language models,
which make it challenging to identify the potential bottlenecks in the
capabilities of RAG for different LLMs. In this paper, we systematically
investigate the impact of Retrieval-Augmented Generation on large language
models. We analyze the performance of different large language models in
4 fundamental abilities required for RAG, including noise robustness,
negative rejection, information integration, and counterfactual robustness.
To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a
new corpus for RAG evaluation in both English and Chinese. RGB divides the
instances within the benchmark into 4 separate testbeds based on the
aforementioned fundamental abilities required to resolve the case.
Then we evaluate 6 representative LLMs on RGB to diagnose the challenges
of current LLMs when applying RAG
===
Despite their remarkable capabilities, large language models (LLMs) often
produce responses containing factual inaccuracies due to their sole reliance
on the parametric knowledge they encapsulate. Retrieval-Augmented Generation
(RAG), an ad hoc approach that augments LMs with retrieval of relevant
knowledge, decreases such issues. However, indiscriminately retrieving and
incorporating a fixed number of retrieved passages, regardless of whether
retrieval is necessary, or passages are relevant, diminishes LM versatility
or can lead to unhelpful response generation. We introduce a new framework
called Self-Reflective Retrieval-Augmented Generation (Self-RAG)
that enhances an LM's quality and factuality through retrieval and
self-reflection. Our framework trains a single arbitrary LM that adaptively
retrieves passages on-demand, and generates and reflects on retrieved
passages and its own generations using special tokens, called reflection
tokens. Generating reflection tokens makes the LM controllable during the
inference phase, enabling it to tailor its behavior to diverse task
requirements
...
Let’s now pass it to an LLM to generate an answer.
from langchain.chat_models import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_deployment="<AZURE_DEPLOYMENT>",
    model="<MODEL_NAME>",
)

answer = llm.predict(prompt)
Ta-da! 🥁
The Retrieval Augmented Generation (RAG) framework can suffer from
problems such as indiscriminate retrieval and incorporation of passages
that may not be necessary or relevant, leading to unhelpful response
generation. Additionally, existing research lacks rigorous evaluation
of the impact of RAG on different Large Language Models (LLMs),
making it difficult to identify potential bottlenecks in the capabilities
of RAG for different LLMs. To address these issues, researchers have
proposed the Self-Reflective Retrieval-Augmented Generation (Self-RAG)
framework, which enhances an LM's quality and factuality through
retrieval and self-reflection. Another challenge with LLMs is their
forgetfulness, as they do not improve over time or acquire new knowledge
like humans do. To address this, researchers have explored the use of RAG
to improve problem-solving performance and proposed the ARM-RAG system,
which learns from its successes without requiring high training costs.
Pretty decent, right?
Let’s wrap this up with this diagram to summarize the full workflow.
5 — Integrate into a Streamlit application
To interact with the RAG from a UI, we can integrate it into a Streamlit application.
Here’s what it looks like:
Want to try the app locally and play with the code? Be my guest.
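The complete application lives in the repository, but here’s a minimal sketch of how the UI could be wired to the pieces we built above (the rag module and the names imported from it are placeholders, not the project’s actual layout):

import streamlit as st

# Placeholder imports: these refer to the objects and functions defined in the
# previous sections (vector store, prompt builder, LLM client).
from rag import get_context, get_prompt, llm, upstash_vector_store

st.title("Chat with Papers With Code")

question = st.chat_input("Ask a question about ML research")
if question:
    with st.chat_message("user"):
        st.write(question)

    # Retrieve relevant chunks, build the prompt, and generate the answer
    context = get_context(question, upstash_vector_store)
    prompt = get_prompt(question, context)
    answer = llm.predict(prompt)

    with st.chat_message("assistant"):
        st.write(answer)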
Some takeaways
Many of you have already built RAGs to chat with your data.
Here’s some honest feedback on how useful such projects really are. My goal is not to discourage you from building or experimenting with RAGs, but to offer a nuanced opinion that tempers the hype around this solution.
Let’s start with the benefits:
- RAGs allow you to access external data. For example, the app we built provides correct answers about recent open-source LLMs such as Mistral or Llama 2. If you ask ChatGPT questions about these models, here’s what you get.
- With RAGs, you’re able to cite the source documents that ground the generated response (see the sketch after this list). This increases user trust and helps with debugging and interpretability.
- RAGs limit LLMs’ propensity to hallucinate because they ground their answers in external data only.
- RAGs are relatively easy to build. They don’t require expensive computing resources since no model is being trained or finetuned.
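To illustrate the point about citing sources: since we stored each paper’s title and url_pdf as metadata, surfacing sources next to the answer only takes a small variation of the retrieval step (a sketch, not the app’s exact code):

def get_context_with_sources(query, vector_store):
    # Same retrieval as get_context, but also collect the papers behind each chunk
    results = vector_store.similarity_search_with_score(query)
    context = ""
    sources = []
    for doc, score in results:
        context += doc.page_content + "\n===\n"
        # title and url_pdf were stored as metadata at indexing time
        sources.append((doc.metadata["title"], doc.metadata["url_pdf"]))
    return context, sources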
RAGs, however, are not a magical solution that works out of the box. If you want to industrialize them in the “corporate” world, there are some key aspects that you must consider.
- Their impact depends on the data you feed them. In our example app, we only used the paper abstracts. While that provides a good start to answer generic questions, it doesn’t help when queries are too complex and need access to the full text.
- Off-the-shelf RAG implementations rarely work well. They’re good for demo purposes, but once you start digging deep into the answers, you quickly realize that the quality is disappointing. That’s why you need extensive tuning, evaluation metrics, and humans in the loop.
- RAGs are not the solution to everything. Some applications like style copying are best performed with model fine-tuning.
- RAGs are limited by the context size of the LLM, and even if an LLM has a 1M-token context window, that doesn’t mean it’s a good idea to prompt it with that much data.
Conclusion
If you’ve made it this far, I’d like to first thank you for your time.
Now if you’re interested in boosting this assistant and taking it to the next level, here are some ideas to explore or implement.
- Use the full text instead of the abstracts
- Complement vector retrieval with metadata filtering
- Try a hybrid search instead of a dense search: keyword search surprisingly boosts semantic search.
- Re-rank the documents after retrieval: this lets you retrieve a wider range of documents and then zoom in on the most important ones (see the sketch after this list)
- Expand the user query
- Fine-tune the embeddings
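To illustrate the re-ranking idea, here’s a sketch that uses a cross-encoder from sentence-transformers (the model name is one common public choice, not something used in this project):

from sentence_transformers import CrossEncoder

def rerank(query, vector_store, k_retrieve=20, k_final=4):
    # Cast a wide net with the vector search first
    results = vector_store.similarity_search_with_score(query, k=k_retrieve)
    # Then score each (query, chunk) pair with a cross-encoder
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, doc.page_content) for doc, _ in results]
    scores = cross_encoder.predict(pairs)
    # Keep only the chunks the cross-encoder considers most relevant
    ranked = sorted(zip(results, scores), key=lambda item: item[1], reverse=True)
    return [doc for (doc, _), _ in ranked[:k_final]]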
Here’s an article I previously wrote on improving retrieval techniques.
Check it out here👇.
Thanks for reading 📖.