Ahmed Besbes


How To Build an LLM-Powered App To Chat with PapersWithCode

Keep up with the latest ML research

Photo by Patrick Tomasso on Unsplash

Do you find it difficult to keep up with the latest ML research? Are you overwhelmed by the massive number of papers about LLMs, vector databases, or RAGs?

In this post, I will show you how to build an AI assistant that mines this large body of information easily. You’ll ask it questions in natural language and it’ll answer based on the relevant papers it finds on Papers With Code.

On the backend side, this assistant will be powered by a Retrieval Augmented Generation (RAG) framework that relies on a scalable serverless vector database, an embedding model from VertexAI, and an LLM from OpenAI.

On the front-end side, this assistant will be integrated into an interactive and easily deployable web application built with Streamlit.

Every step of this process will be detailed below, with accompanying source code that you can reuse and adapt👇.

Ready? Let’s dive in 🔍.

If you’re interested in ML content, detailed tutorials, and practical tips from the industry, follow my newsletter. It’s called The Tech Buffet.

1 — Collect data from Papers With Code

Papers With Code (a.k.a. PWC) is a free website where researchers and practitioners can find and follow the latest state-of-the-art ML papers, source code, and datasets.

Image modified by the author

Luckily, it’s also possible to interact with PWC through an API to programmatically retrieve research papers. If you look at this Swagger UI, you can find all the available endpoints and try them out.

Let’s, for example, search papers on a specific keyword.

Here’s how to do it from the interface: you locate the papers/ endpoint, fill in the query (q) argument

Screenshot by the author

and hit the execute button.

Screenshot by the author

Equivalently, you can perform this same search by hitting this URL.

The output response shows the first page of results only. The following pages are available by accessing the next key.

By exploiting this structure, we can retrieve 7200 papers matching “Large Language Models”. This can simply be done with a function that requests the URL and loops over all the result pages.

import math
import urllib.parse

import requests
from tqdm import tqdm

def extract_papers(query: str):
    """Fetch all papers matching `query` from the Papers With Code API."""
    query = urllib.parse.quote(query)
    url = f"https://paperswithcode.com/api/v1/papers/?q={query}"
    response = requests.get(url).json()

    count = response["count"]
    results = response["results"]

    # The API paginates results (50 per page): loop over the remaining pages.
    num_pages = math.ceil(count / 50)
    for page in tqdm(range(2, num_pages + 1)):
        url = f"https://paperswithcode.com/api/v1/papers/?page={page}&q={query}"
        response = requests.get(url).json()
        results += response["results"]
    return results

query = "Large Language Models"

results = extract_papers(query)

print(len(results))
# 7200

Once the results are extracted, we convert them from their raw JSON format into LangChain Documents to simplify chunking and indexing.

Document objects have two attributes:

  • page_content (str): to store the text of the paper’s abstract
  • metadata (dict): to store additional information. In our use case, we’ll keep: id, arxiv_id, url_pdf, title, authors, and published.

from langchain.docstore.document import Document

documents = [
    Document(
        page_content=result["abstract"],
        metadata={
            "id": result["id"] if result["id"] else "",
            "arxiv_id": result["arxiv_id"] if result["arxiv_id"] else "",
            "url_pdf": result["url_pdf"] if result["url_pdf"] else "",
            "title": result["title"] if result["title"] else "",
            "authors": result["authors"] if result["authors"] else "",
            "published": result["published"] if result["published"] else "",
        },
    )
    for result in results
]

Prior to embedding the documents, we need to chunk them into smaller pieces. This helps overcome LLMs’ limitations in terms of input tokens and provides fine-grained information per chunk.

After chunking the documents with a chunk_size of 1200 characters and a chunk_overlap of 200, we end up with over 11K splits.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["."],
)
splits = text_splitter.split_documents(documents)

len(splits)
# 11308

Image by the author

2 — Create an index on Upstash

To be able to store document embeddings (and metadata) somewhere, we first have to create an index.

In this tutorial, we’ll use Upstash, a serverless vector database.

To create an index in it, you need to log in here and follow the instructions to fill in some parameters:

  • Region: pick the one closest to your location.
  • Dimensions: set it to 768 (the dimension of VertexAI’s embeddings).
  • Distance metric: set it to cosine.

Screenshot by the author

Once the index is created, you need to install the upstash-vector package:

pip install upstash-vector

This allows you to establish a connection to the index.

from upstash_vector import Index

index = Index(
    url="<UPSTASH_URL>", 
    token="<UPSTASH_TOKEN>"
)

The URL and token are the credentials you need to connect to your index. They’re available in your account settings. Keep them safe and don’t version them with the code.
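To avoid hardcoding them, you can, for example, read them from environment variables. Here’s a minimal sketch, assuming you’ve exported UPSTASH_URL and UPSTASH_TOKEN in your shell (these variable names are just a convention used here, not something required by Upstash):

import os

from upstash_vector import Index

# Read the credentials from the environment instead of hardcoding them
index = Index(
    url=os.environ["UPSTASH_URL"],
    token=os.environ["UPSTASH_TOKEN"],
)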

3 — Embed the chunks and index them into Upstash

To embed the chunks and index them into the vector database, we’ll create a simple class that mimics LangChain’s VectorStore implementation.

This class will be named UpstashVectorStore and will have the following methods:

  • An __init__ constructor that expects an Upstash Index and an Embeddings object
  • add_documents to embed documents and index them in batches
  • similarity_search_with_score to query the index and retrieve the top_k most relevant documents along with their corresponding scores

Here’s the full implementation:

from typing import List, Optional, Tuple, Union
from uuid import uuid4
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from tqdm import tqdm
from upstash_vector import Index


class UpstashVectorStore:
    def __init__(self, index: Index, embeddings: Embeddings):
        self.index = index
        self.embeddings = embeddings

    def delete_vectors(
        self,
        ids: Union[str, List[str]] = None,
        delete_all: bool = None,
    ):
        if delete_all:
            self.index.reset()
        else:
            self.index.delete(ids)

    def add_documents(
        self,
        documents: List[Document],
        ids: Optional[List[str]] = None,
        batch_size: int = 32,
    ):
        texts = []
        metadatas = []
        all_ids = []

        for document in tqdm(documents):
            text = document.page_content
            metadata = document.metadata
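            # Keep the chunk text in the metadata under "context" so it can be rebuilt into a Document at query time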
            metadata = {"context": text, **metadata}
            texts.append(text)
            metadatas.append(metadata)

            if len(texts) >= batch_size:
                ids = [str(uuid4()) for _ in range(len(texts))]
                all_ids += ids
                embeddings = self.embeddings.embed_documents(texts, batch_size=250)
                self.index.upsert(
                    vectors=zip(ids, embeddings, metadatas),
                )
                texts = []
                metadatas = []

        if len(texts) > 0:
            ids = [str(uuid4()) for _ in range(len(texts))]
            all_ids += ids
            embeddings = self.embeddings.embed_documents(texts)
            self.index.upsert(
                vectors=zip(ids, embeddings, metadatas),
            )

        n = len(all_ids)
        print(f"Successfully indexed {n} dense vectors to Upstash.")
        print(self.index.stats())
        return all_ids

    def similarity_search_with_score(
        self,
        query: str,
        k: int = 4,
    ) -> List[Tuple[Document, float]]:
        query_embedding = self.embeddings.embed_query(query)
        query_results = self.index.query(
            query_embedding,
            top_k=k,
            include_metadata=True,
        )
        output = []
        for query_result in query_results:
            score = query_result.score
            metadata = query_result.metadata
            context = metadata.pop("context")
            doc = Document(
                page_content=context,
                metadata=metadata,
            )
            output.append((doc, score))
        return output

Let’s use this class to index the chunks:

from langchain.embeddings import VertexAIEmbeddings
from upstash_vector import Index

index = Index(
    url="<UPSTASH_URL>",
    token="<UPSTASH_TOKEN>",
)
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")

upstash_vector_store = UpstashVectorStore(index, embeddings)
ids = upstash_vector_store.add_documents(splits, batch_size=25)

This process may take a while depending on the number of splits, your connection speed, and the chosen batch size.

Image by the author

When the indexing process is done, you can check the vectors and the corresponding metadata from the UI: this helps with quick sanity checks and easy management (e.g. deletion) of records.

Screenshot by the author
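You can also run a quick sanity check directly from code. Here’s a minimal sketch that reuses the UpstashVectorStore we just populated (the query string is only an example):

# Retrieve the top matches for a test query and print their titles and scores
results = upstash_vector_store.similarity_search_with_score(
    "retrieval augmented generation", k=3
)
for doc, score in results:
    print(f"{score:.3f} - {doc.metadata['title']}")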

4 — Ask questions about the indexed papers

With the abstracts now correctly indexed in Upstash, we can interact with them in natural language and ask specific questions about ML topics.

This is much easier than it looks.

To do this, let’s first define a function that, given a question, retrieves related documents from the vector store and uses them to build a prompt.

def get_context(query, vector_store):
    results = vector_store.similarity_search_with_score(query)
    context = ""

    for doc, score in results:
        context += doc.page_content + "\n===\n"
    return context

def get_prompt(question, context):
    template = """
    Your task is to answer questions by using a given context.

    Don't invent anything that is outside of the context.
    Answer in at least 350 characters.

    %CONTEXT%
    {context}

    %Question%
    {question}

    Hint: Do not copy the context. Use your own words
    
    Answer:
    """
    prompt = template.format(question=question, context=context)
    return prompt

Feeling uninspired? Here’s a question to get you started:

“What are the problems behind the Retrieval Augmented Generation (RAG) framework?”

query = (
    "What are the problems behind the Retrieval Augmented Generation (RAG) framework?"
)

context = get_context(query, upstash_vector_store)
prompt = get_prompt(query, context)

Here’s what the prompt looks like after receiving the context:

Your task is to answer questions by using a given context.

Don't invent anything that is outside of the context.
Answer in at least 350 characters.

%CONTEXT%

Retrieval-Augmented Generation (RAG) is a promising approach for
mitigating the hallucination of large language models (LLMs). 
However, existing research lacks rigorous evaluation of the impact 
of retrieval-augmented generation on different large language models,
which make it challenging to identify the potential bottlenecks in the 
capabilities of RAG for different LLMs. In this paper, we systematically 
investigate the impact of Retrieval-Augmented Generation on large language 
models. We analyze the performance of different large language models in 
4 fundamental abilities required for RAG, including noise robustness, 
negative rejection, information integration, and counterfactual robustness. 
To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a 
new corpus for RAG evaluation in both English and Chinese. RGB divides the 
instances within the benchmark into 4 separate testbeds based on the 
aforementioned fundamental abilities required to resolve the case. 
Then we evaluate 6 representative LLMs on RGB to diagnose the challenges 
of current LLMs when applying RAG
===
Despite their remarkable capabilities, large language models (LLMs) often 
produce responses containing factual inaccuracies due to their sole reliance
 on the parametric knowledge they encapsulate. Retrieval-Augmented Generation 
(RAG), an ad hoc approach that augments LMs with retrieval of relevant 
knowledge, decreases such issues. However, indiscriminately retrieving and 
incorporating a fixed number of retrieved passages, regardless of whether 
retrieval is necessary, or passages are relevant, diminishes LM versatility 
or can lead to unhelpful response generation. We introduce a new framework 
called Self-Reflective Retrieval-Augmented Generation (Self-RAG) 
that enhances an LM's quality and factuality through retrieval and 
self-reflection. Our framework trains a single arbitrary LM that adaptively 
retrieves passages on-demand, and generates and reflects on retrieved 
passages and its own generations using special tokens, called reflection 
tokens. Generating reflection tokens makes the LM controllable during the 
inference phase, enabling it to tailor its behavior to diverse task 
requirements
...

Let’s now pass it to an LLM to generate an answer.

from langchain.chat_models import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_deployment="<AZURE_DEPLOYMENT>",
    model="<MODEL_NAME>",
)

answer = llm.predict(prompt)

Ta-da! 🥁

The Retrieval Augmented Generation (RAG) framework can suffer from 
problems such as indiscriminate retrieval and incorporation of passages 
that may not be necessary or relevant, leading to unhelpful response 
generation. Additionally, existing research lacks rigorous evaluation 
of the impact of RAG on different Large Language Models (LLMs), 
making it difficult to identify potential bottlenecks in the capabilities 
of RAG for different LLMs. To address these issues, researchers have 
proposed the Self-Reflective Retrieval-Augmented Generation (Self-RAG) 
framework, which enhances an LM's quality and factuality through 
retrieval and self-reflection. Another challenge with LLMs is their 
forgetfulness, as they do not improve over time or acquire new knowledge 
like humans do. To address this, researchers have explored the use of RAG 
to improve problem-solving performance and proposed the ARM-RAG system, 
which learns from its successes without requiring high training costs.

Pretty decent, right?

Let’s wrap this up with a diagram that summarizes the full workflow.

Image by the author

5 — Integrate into a Streamlit application

To interact with the RAG from a UI, we can integrate it into a Streamlit application.

Here’s what it looks like:

GIF by the author

Want to try the app locally and play with the code? Be my guest.
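If you just want a feel for the wiring, here’s a minimal sketch of what a Streamlit front end around this RAG could look like. It assumes the pieces built in the previous sections (get_context, get_prompt, the llm object, and the populated UpstashVectorStore) are importable from a local module, hypothetically named rag:

import streamlit as st

# Hypothetical local module exposing the components built in this tutorial
from rag import get_context, get_prompt, llm, upstash_vector_store

st.title("Chat with Papers With Code")
question = st.text_input("Ask a question about the latest ML research")

if question:
    with st.spinner("Searching the indexed papers..."):
        context = get_context(question, upstash_vector_store)
        prompt = get_prompt(question, context)
        answer = llm.predict(prompt)
    st.write(answer)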

Some takeaways

Many of you have already built RAGs to chat with your data.

I'm giving you honest feedback on the real usefulness of such projects. My goal is not to discourage you from building RAGs or experimenting with them but to provide a nuanced opinion to mitigate the hype over this solution.

Let’s start with the benefits:

  • RAGs allow you to access external data. For example, the app we built provides correct answers about recent open-source LLMs such as Mistral or Llama 2. If you ask ChatGPT questions about these models, here’s what you get.
Pretty disappointing, right? Screenshot by the author
  • With RAGs, you’re able to cite the source documents that ground the generated response. This increases user trust and helps with debugging and interpretability.
  • RAGs limit LLMs’ propensity to hallucinate because they ground their answers in external data.
  • RAGs are relatively easy to build. They don’t require expensive computing resources since no model is being trained or fine-tuned.

RAGs, however, are not a magical solution that works out of the box. If you want to industrialize them in the “corporate” world, there are some key aspects that you must consider.

  • Their impact depends on the data you feed them. In our example app, we only used the paper abstracts. While that provides a good start for answering generic questions, it doesn’t help when queries are more complex and need access to the full text.
  • Off-the-shelf RAG implementations rarely work well. They’re good for demo purposes, but once you start digging into the answers, you quickly realize that the quality is disappointing. That’s why you need extensive tuning, evaluation metrics, and humans in the loop.
  • RAGs are not the solution to everything. Some applications like style copying are best performed with model fine-tuning.
  • RAGs are limited by the context size of the LLM, and even if an LLM has a context window of 1M tokens, that doesn’t mean it’s a good idea to prompt it with that much data.

Conclusion

If you’ve made it this far, I’d like to first thank you for your time.

Now, if you’re interested in boosting this assistant and taking it to the next level, here are some ideas to explore or implement:

  • Use the full text instead of the abstracts
  • Complement vector retrieval with metadata filtering
  • Try a hybrid search instead of a dense search: keyword search surprisingly boosts semantic search.
  • Re-rank the documents after retrieval: this allows us to retrieve a wider pool of documents and zoom in on the most relevant ones (see the sketch after this list).
  • Expand the user query
  • Fine-tune the embeddings
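For the re-ranking idea, here’s a minimal sketch using a cross-encoder from the sentence-transformers library. The model name is just a common public checkpoint, not one used in this tutorial, and the sketch reuses the UpstashVectorStore built earlier:

from sentence_transformers import CrossEncoder

# Cross-encoders score (query, passage) pairs jointly: slower than a
# bi-encoder, but usually more accurate for ranking a short candidate list
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What are the problems behind the Retrieval Augmented Generation (RAG) framework?"

# Retrieve a wider pool of candidates than we actually need...
candidates = upstash_vector_store.similarity_search_with_score(query, k=20)

# ...then re-score them with the cross-encoder and keep the best few
pairs = [(query, doc.page_content) for doc, _ in candidates]
scores = reranker.predict(pairs)
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top_docs = [doc for (doc, _), _ in reranked[:4]]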

Here’s an article I previously wrote on improving retrieval techniques.

Check it out here👇.

Thanks for reading 📖.
