Cris Velasquez

Summary

This article provides an end-to-end guide for developing a research chatbot using OpenAI functions and the arXiv API to automate scientific knowledge retrieval.

Abstract

The article discusses the challenges of staying updated with the vast amount of scientific publications and proposes a solution using AI and semantic search, specifically OpenAI functions, to transform natural language inputs into structured outputs or function calls. The guide focuses on Python code to search and process scientific literature using OpenAI functions and the arXiv API, streamlining the retrieval, summarization, and presentation of academic research findings. The solution architecture involves a multi-layered approach to processing and delivering scientific knowledge to users, including user interface, conversation management, query processing, OpenAI API integration, interaction with external APIs, article retrieval and summarization, and response generation. The article also covers installation of necessary libraries, project setup, core functionalities, and interaction with the research chatbot.

Bullet points

  • The article discusses the challenge of staying updated with the vast amount of scientific publications.
  • It proposes a solution using AI and semantic search, specifically OpenAI functions.
  • The guide focuses on Python code to search and process scientific literature using OpenAI functions and the arXiv API.
  • The solution architecture involves a multi-layered approach to processing and delivering scientific knowledge to users.
  • The article covers installation of necessary libraries, project setup, core functionalities, and interaction with the research chatbot.
  • The core functionalities include embedding generation, retrieving academic papers, ranking and summarization, and integration and use of OpenAI functions.
  • The article also provides examples of user-system interaction flow.
  • The article concludes with practical applications of the research chatbot in academic research, industry R&D, education, and data science and AI.
  • Future work will focus on leveraging the latest advancements in AI and machine learning to address challenges and ensure the system remains at the forefront of technology.

Automating Scientific Knowledge Retrieval with AI in Python

End-to-End Guide for Developing a Research Chatbot with OpenAI Functions, Capable of Semantic Search Across arXiv

The sheer volume of scientific publications, datasets, and scholarly articles available today poses a challenge for researchers, academics, and professionals striving to stay abreast of the latest developments in their fields.

This challenge underscores the necessity for innovative approaches to streamline the process of scientific knowledge retrieval, making it both efficient and effective.

AI and semantic search have shown remarkable promise in transforming the way we access and interact with information. At the forefront of these innovations is the application of OpenAI functions, which transform natural language inputs into structured outputs or function calls.

For instance, when tasked with a query about the latest advancements in renewable energy technologies, OpenAI’s models can sift through recent publications, identify key papers and findings, and summarize research trends without being limited to specific keywords.

This capability not only accelerates the research process but also uncovers connections and insights that might not be immediately evident through conventional search methods.

The purpose of this article is to provide end-to-end Python code to search and process scientific literature, utilizing OpenAI functions and the arXiv API to streamline the retrieval, summarization, and presentation of academic research findings.

This guide is structured as follows:

Solution Architecture

Core Python Functions

Interacting with the Research Chatbot

Challenges and Improvements

1. Solution Architecture

The solution architecture for the research chatbot delineates a multi-layered approach to processing and delivering scientific knowledge to users.

The workflow is designed to handle complex user queries, interact with external APIs, and provide informative responses.

The architecture incorporates various components that facilitate the flow of information from initial user input to the final response delivery.

Figure 1. Solution Architecture for Automatic Scientific Knowledge Retrieval with OpenAI Functions and the arXiv API.

1. User Interface (UI): The user submits queries through this interface, in this case from a Jupyter notebook.

2. Conversation Management: This module handles the dialogue, ensuring context is maintained throughout the user interaction.

3. Query Processing: The user’s query is interpreted here, which involves understanding the intent and preparing it for subsequent actions.

4. OpenAI API Integration (Embedding & Completion):

  • The Completion request directly processes queries that can be answered without retrieval, generating an immediate response.
  • The Embedding request is used for queries that need academic paper retrieval, generating a vector to find relevant documents.

5. External APIs (arXiv): This is where the chatbot interacts with external databases like arXiv to fetch scientific papers based on the query.

6. Get Articles & Summarize: This function retrieves articles and then uses the embeddings to prioritize which articles to summarize based on the query’s context.

7. PDF Processing, Text Extraction & Chunking: If detailed information is needed, the system processes the PDFs, extracts text, and chunks it into smaller pieces, preparing for summarization.

8. Response Generation:

  • It integrates responses from the OpenAI API Completion service.
  • It includes summaries of articles retrieved and processed from the arXiv API, which are based on the embeddings generated earlier.

9. Presentation to User: The final step where a cohesive response, combining AI-generated answers and summaries of articles, is presented to the user.

2. Getting Started in Python

2.1 Installation of Necessary Libraries

We utilize a variety of Python libraries, each serving a specific function to facilitate the retrieval and processing of scientific knowledge. Here is an overview of each library and its role:

  • scipy: Essential for scientific computing, offering modules for optimization, linear algebra, integration, and more.
  • tenacity: Facilitates retrying of failed operations, particularly useful for reliable requests to external APIs or databases.
  • tiktoken: A fast BPE tokenizer designed for OpenAI's models, enabling efficient tokenization of text for models like GPT-4.
  • termcolor: Enables colored terminal output, useful for differentiating log messages or outputs for easier debugging.
  • openai: Official library for interacting with OpenAI's APIs (e.g., GPT-3.5 and GPT-4), crucial for querying and receiving AI model responses.
  • requests: For making HTTP requests to web services or APIs, supporting data retrieval and interaction with scientific resources.
  • arxiv: Simplifies searching, fetching, and managing scientific papers from arXiv.org.
  • pandas: Key for data manipulation and analysis, offering structures and functions for handling large datasets.
  • PyPDF2: Enables text extraction from PDF files, vital for processing scientific papers in PDF format.
  • tqdm: Generates progress bars for loops or long-running processes, improving the user experience.
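
All of these can be installed in one step; a minimal example, assuming you are working in a Jupyter notebook (from a shell, drop the leading "!" and pin versions as needed for your environment):

# Install the required packages (run once, e.g. from a notebook cell)
!pip install scipy tenacity tiktoken termcolor openai requests arxiv pandas PyPDF2 tqdm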

2.2 Setting Up the Environment

First, you’ll need to create an account on OpenAI’s platform and obtain an API key from the API section of your account settings.

import openai

openai.api_key = "API_KEY"  # replace with your own API key

GPT_MODEL = "gpt-3.5-turbo-0613"
EMBEDDING_MODEL = "text-embedding-ada-002"
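
Hardcoding the key is fine for quick experiments, but a safer pattern is to read it from an environment variable. A small sketch, assuming the key has been exported as OPENAI_API_KEY in your shell:

import os
import openai

# Safer alternative: read the key from the environment
# rather than committing it to source control
openai.api_key = os.environ.get("OPENAI_API_KEY")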

2.3 Project Setup

Creating a structured directory for managing downloaded papers or data is crucial for organization and easy access. Here’s how you can set up the necessary directories:

  1. Create Directory Structure: Decide on a structure that suits your project’s needs. For managing downloaded papers, a ./data/papers directory is suggested.
  2. Implementation: Use Python’s os library to check for the existence of these directories and create them if they don't exist:
import os

# Paths used by the functions below (paper_dir_filepath is an assumed name; adjust to your setup)
directory = './data/papers'
data_dir = directory
paper_dir_filepath = './data/arxiv_library.csv'

if not os.path.exists(directory):
    os.makedirs(directory)

This snippet ensures that your script can run on any system without manual directory setup, making your project more portable and user-friendly.

3. Core Functionalities

The research chatbot, designed to facilitate scientific knowledge retrieval, integrates several core functionalities.

These are centered around processing natural language queries, retrieving and summarizing academic content, and enhancing user interactions with advanced NLP techniques.

Below, we detail these functionalities, underscored by specific code snippets that illustrate their implementation.

3.1 Embedding Generation

To understand and process user queries effectively, the chatbot leverages embeddings — a numerical representation of text that captures semantic meanings. This is crucial for tasks like determining the relevance of scientific papers to a query.

from tenacity import retry, wait_random_exponential, stop_after_attempt
import openai

@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def embedding_request(text):
    # Return the full API response; callers read response["data"][0]["embedding"]
    response = openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
    return response

This function, equipped with a retry mechanism, requests embeddings from OpenAI’s API, ensuring robustness in face of potential API errors or rate limits.
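
A quick way to sanity-check the helper is to embed a short string and inspect the vector; a minimal usage example, assuming the function returns the full API response as above:

# Example usage: embed a query and pull out the embedding vector
response = embedding_request("market efficiency in emerging economies")
query_embedding = response["data"][0]["embedding"]
print(len(query_embedding))  # text-embedding-ada-002 vectors have 1536 dimensions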

3.2 Retrieving Academic Papers

Upon understanding a query, the chatbot fetches relevant academic papers, demonstrating its ability to interface directly with external databases like arXiv.

import arxiv
from csv import writer

# Function to get articles from arXiv
def get_articles(query, library=paper_dir_filepath, top_k=5):
    """
    Searches for and retrieves the top 'top_k' academic papers related to a user's query from the arXiv database.
    For each article found, it stores relevant information such as the title, summary, and URLs in a list.
    It also downloads the PDF of each paper and appends a reference (title, download path, and embedding
    of the paper title) to the CSV file specified by 'library', so the papers and their embeddings can be
    retrieved and analyzed later. This function is used by read_article_and_summarize.
    """
    search = arxiv.Search(
        query=query, max_results=top_k, sort_by=arxiv.SortCriterion.Relevance
    )
    result_list = []
    for result in search.results():
        result_dict = {
            "title": result.title,
            "summary": result.summary,
            # Take the first URL as the article page and the second as the PDF link
            "article_url": [x.href for x in result.links][0],
            "pdf_url": [x.href for x in result.links][1],
        }
        result_list.append(result_dict)

        # Store a reference (title, local PDF path, title embedding) in the library file
        response = embedding_request(text=result.title)
        file_reference = [
            result.title,
            result.download_pdf(data_dir),
            response["data"][0]["embedding"],
        ]

        # Append the reference to the CSV library
        with open(library, "a") as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(file_reference)
    return result_list
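
Assuming paper_dir_filepath points at the CSV library created in the setup step, a single call is enough to populate it; a brief usage example:

# Example usage: fetch the top 5 arXiv papers for a topic and record them in the library
articles = get_articles("proximal policy optimization", top_k=5)
for article in articles:
    print(article["title"], "-", article["pdf_url"])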

3.3 Ranking and Summarization

With relevant papers at hand, the system ranks them based on their relatedness to the query and summarizes the content to provide concise, insightful information back to the user.

import pandas as pd
from scipy import spatial

# Function to rank strings by relatedness to a query string
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> list[str]:
    """
    Ranks and returns a list of strings from a DataFrame based on their relatedness to a given query string.
    The function first obtains an embedding for the query string. It then calculates the relatedness of each
    string in the DataFrame to the query using 'relatedness_fn', which defaults to cosine similarity between
    their embeddings, sorts the strings in descending order of relatedness, and returns the top 'top_n'.
    """
    query_embedding_response = embedding_request(query)
    query_embedding = query_embedding_response["data"][0]["embedding"]

    strings_and_relatednesses = [
        (row["filepath"], relatedness_fn(query_embedding, row["embedding"]))
        for _, row in df.iterrows()
    ]

    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n]

3.4 Summarizing Academic Papers

Following the identification of relevant papers, the chatbot employs a summarization process to distill the essence of scientific documents.

import ast
import concurrent.futures
import tiktoken
from tqdm import tqdm

# Function to summarize chunks and return an overall summary
# (read_pdf, create_chunks, and extract_chunk are helpers used below; a sketch of them follows this listing)
def summarize_text(query):
    """
    Automates summarizing academic papers relevant to a user's query. The process includes:
    1. Reading Data: Reads 'arxiv_library.csv' containing information about papers and their embeddings.
    2. Identifying Relevant Paper: Compares query's embedding to embeddings in the CSV to find closest match.
    3. Extracting Text: Reads the PDF of the identified paper and converts its content into a string.
    4. Chunking Text: Divides the extracted text into manageable chunks for efficient processing.
    5. Summarizing Chunks: Each text chunk is summarized using the 'extract_chunk' function in parallel.
    6. Compiling Summaries: Combines individual summaries into a final comprehensive summary.
    7. Returning Summary: Provides a condensed overview of the paper, focusing on key insights relevant to the user's query.
    """

    # A prompt to dictate how the recursive summarizations should approach the input paper
    summary_prompt = """Summarize this text from an academic paper. Extract any key points with reasoning.\n\nContent:"""

    # If the library is empty (no searches have been performed yet), we perform one and download the results
    library_df = pd.read_csv(paper_dir_filepath).reset_index()
    if len(library_df) == 0:
        print("No papers searched yet, downloading first.")
        get_articles(query)
        print("Papers downloaded, continuing")
        library_df = pd.read_csv(paper_dir_filepath).reset_index()
    library_df.columns = ["title", "filepath", "embedding"]
    library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)
    strings = strings_ranked_by_relatedness(query, library_df, top_n=1)
    print("Chunking text from paper")
    pdf_text = read_pdf(strings[0])

    # Initialise tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")
    results = ""

    # Chunk up the document into 1500 token chunks
    chunks = create_chunks(pdf_text, 1500, tokenizer)
    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]
    print("Summarizing each chunk of text")

    # Parallel process the summaries
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(text_chunks)
    ) as executor:
        futures = [
            executor.submit(extract_chunk, chunk, summary_prompt)
            for chunk in text_chunks
        ]
        with tqdm(total=len(text_chunks)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(1)
        for future in futures:
            data = future.result()
            results += data

    # Final summary
    print("Summarizing into overall summary")
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""Write a summary collated from this collection of key points extracted from an academic paper.
                        The summary should highlight the core argument, conclusions and evidence, and answer the user's query.
                        User query: {query}
                        The summary should be structured in bulleted lists following the headings Core Argument, Evidence, and Conclusions.
                        Key points:\n{results}\nSummary:\n""",
            }
        ],
        temperature=0,
    )
    return response
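
summarize_text relies on three helpers that are not shown above: read_pdf, create_chunks, and extract_chunk. A minimal sketch of what they might look like, assuming a recent PyPDF2 (where PdfReader is available) and the chat completion API for per-chunk summaries; the complete code linked below contains the authoritative versions:

import PyPDF2
import openai

def read_pdf(filepath):
    """Extract the raw text from a PDF file, page by page."""
    with open(filepath, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        return "".join(page.extract_text() or "" for page in reader.pages)

def create_chunks(text, n, tokenizer):
    """Split tokenized text into chunks of at most n tokens."""
    tokens = tokenizer.encode(text)
    for i in range(0, len(tokens), n):
        yield tokens[i : i + n]

def extract_chunk(content, template_prompt):
    """Summarize a single chunk of text using the chat model."""
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=[{"role": "user", "content": template_prompt + content}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]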

3.5 Integration and Use of OpenAI Functions

The research chatbot leverages OpenAI functions, a powerful feature of the OpenAI API, to enhance its ability to process and respond to complex queries.

These functions allow for a more seamless interaction between the chatbot and various external data sources and tools, significantly enriching the user’s experience by providing detailed, accurate, and contextually relevant information.

OpenAI functions are designed to extend the capabilities of OpenAI models by integrating external computation or data retrieval directly into the model’s processing flow.

3.5.1 Custom OpenAI Functions

  1. get_articles Function: Retrieves academic papers relevant to the user’s query from the arXiv database, showcasing the chatbot’s ability to access real-time data from external sources.
  2. read_article_and_summarize Function: Reads the full text of a previously retrieved paper and returns a concise summary, showcasing the chatbot’s ability to condense long documents for the user.

Implementation:

# Function to initiate our get_articles and read_article_and_summarize functions
arxiv_functions = [
    {
        "name": "get_articles",
        "description": """Use this function to get academic papers from arXiv to answer user questions.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                            User query in JSON. Responses should be summarized and should include the article URL reference
                            """,
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_article_and_summarize",
        "description": """Use this function to read whole papers and provide a summary for users.
        You should NEVER call this function before get_articles has been called in the conversation.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                            Description of the article in plain text based on the user's query
                            """,
                }
            },
            "required": ["query"],
        },
    }
]

The incorporation of these functions into the chatbot’s workflow demonstrates an advanced use case of OpenAI’s API, where custom functions tailored to specific tasks — like academic research — are executed based on conversational context.
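
The glue between the schema above and the Python functions is a small dispatcher: it sends the conversation to the chat endpoint and, if the model responds with a function_call, runs the matching local function. Below is a simplified sketch of how such a chat_completion_with_function_execution helper could be wired up; the complete code contains the full version used in the examples that follow.

import json
import openai

def chat_completion_with_function_execution(messages, functions=None):
    """Call the chat endpoint and execute a function if the model requests one."""
    response = openai.ChatCompletion.create(
        model=GPT_MODEL, messages=messages, functions=functions
    )
    full_message = response["choices"][0]
    if full_message["finish_reason"] == "function_call":
        return call_arxiv_function(messages, full_message)
    return response

def call_arxiv_function(messages, full_message):
    """Route the model's function_call to the matching local function."""
    name = full_message["message"]["function_call"]["name"]
    arguments = json.loads(full_message["message"]["function_call"]["arguments"])

    if name == "get_articles":
        results = get_articles(arguments["query"])
        # Hand the raw results back to the model so it can phrase the final answer
        messages.append({"role": "function", "name": name, "content": str(results)})
        return openai.ChatCompletion.create(model=GPT_MODEL, messages=messages)

    if name == "read_article_and_summarize":
        return summarize_text(arguments["query"])

    raise ValueError(f"Function {name} not recognised")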

3.6 Complete Code

See the complete code, with the required functions and chatbot interaction, for the end-to-end implementation.

For this type of project, as well as a range of other innovative, data-driven initiatives in AI, data science, and tech, we encourage readers to explore the wealth of resources available at www.entreprenerdly.com.

https://entreprenerdly.com

4. Interacting with the Research Chatbot

This section delves into the implementation and functionality of a research chatbot, alongside examples that illustrate the user-system interaction flow.

4.1 Implementation Overview

The chatbot is built on the OpenAI API, utilizing models such as GPT-3.5 or GPT-4, which are capable of understanding complex queries and generating human-like responses.

The implementation involves setting up an interface (either a command-line interface or a web-based UI) through which users can input their queries. The system then processes these queries, interacts with the OpenAI API, and presents the responses back to the user.

4.2 Functionality

The core functionality of the research chatbot includes:

  1. Query Understanding: The chatbot first interprets the user’s query, leveraging the OpenAI model’s comprehension capabilities to grasp the context and intent behind the question.
  2. Information Retrieval: Depending on the query, the chatbot may directly generate an answer using its trained knowledge base or fetch relevant scientific papers and documents to construct a response.
  3. Response Generation: The chatbot synthesizes the information it has retrieved or generated into a coherent, concise answer that it then presents to the user.

4.3 User-System Interaction Flow

  1. User Query Example: A user asks, “What are the latest advancements in quantum computing?”
  2. Processing the Query: The query is passed to the OpenAI API, for example:
response = openai.Completion.create(
  engine="davinci",
  prompt="What are the latest advancements in quantum computing?",
  max_tokens=100
)
  3. Generating a Response: The system formulates an answer, possibly summarizing recent breakthroughs in quantum computing.
  4. Presenting the Response: The chatbot outputs the synthesized information, structured for user comprehension.
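
The next subsections assume a small Conversation helper that simply accumulates the message history passed to the API; a minimal sketch of what it might look like:

class Conversation:
    """Minimal container for the running message history."""

    def __init__(self):
        self.conversation_history = []

    def add_message(self, role, content):
        self.conversation_history.append({"role": role, "content": content})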

4.3.1 Retrieve Relevant Papers

This stage involves a user querying the chatbot to identify and retrieve papers on a specified topic:

# Start with a system message
paper_system_message = """You are arXivGPT, a helpful assistant that pulls academic papers to answer user questions.
You summarize the papers clearly so the user can decide which to read to answer their question.
You always provide the article_url and title so the user can understand the name of the paper and click through to access it.
Begin!"""
paper_conversation = Conversation()
paper_conversation.add_message("system", paper_system_message)


# Add a user message
paper_conversation.add_message("user", "What is the latest on Market Efficiency?") # How does PPO reinforcement learning work?
chat_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)

assistant_message = chat_response["choices"][0]["message"]["content"]
paper_conversation.add_message("assistant", assistant_message)
display(Markdown(assistant_message))
Figure 2. A summary of recent academic papers discussing various aspects of market efficiency, highlighting their key contributions and findings.

4.3.2 Summarizing Articles

Following the retrieval of relevant papers, the chatbot further processes the user’s request by summarizing the contents of the specified articles, enhancing the interaction by providing concise, insightful summaries.

# Add another user message to induce our system to use the second tool
paper_conversation.add_message(
    "user",
    "Can you read the Market-Aware Models for Efficient Cross-Market Recommendation paper for me and give me a summary", # "Can you read the PPO sequence generation paper for me and give me a summary"
)
updated_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)
display(Markdown(updated_response["choices"][0]["message"]["content"]))
Figure 3. A detailed breakdown of the process for generating a comprehensive summary of an academic paper on market-aware models for cross-market recommendation.

5. Challenges and Solutions

5.1 Integrating Diverse Data Sources

  • Challenge: Scientific knowledge is dispersed across numerous platforms and formats, from academic journals to preprint servers and institutional repositories.
  • Solution: One should develop a modular data ingestion framework capable of interfacing with various APIs and web scraping techniques to fetch and normalize data from multiple sources.

5.2 User-System Interaction Flow

  • Challenge: Maintaining a natural and engaging interaction flow, especially for complex queries that require multiple steps of information retrieval and processing.
  • Solution: To enhance user experience, we could implement a multi-threaded request handling system, allowing the chatbot to process information retrieval in the background while maintaining an interactive session with the user.

5.3 Ensuring Continuous Learning and Improvement

  • Challenge: Ensuring the chatbot continuously learns and improves from user interactions to enhance its accuracy and effectiveness over time.
  • Solution: Implement a feedback loop mechanism where users can rate the relevance and accuracy of the chatbot’s responses. This feedback is used to fine-tune the models and improve response quality.

5.4 Real-Time Data Synchronization

  • Challenge: Keeping the chatbot’s database synchronized with the latest scientific publications in real time. As new research is constantly being published, ensuring the chatbot provides the most current information is a significant challenge.
  • Solution: One could implement a real-time data synchronization mechanism using webhooks and RSS feeds from major scientific publication databases. This would allow the system to automatically update its repository with new publications as soon as they become available (a minimal sketch of the RSS route follows).
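
As an illustration of that idea, the sketch below polls an arXiv category feed and reports entries not seen before. It assumes the feedparser package and arXiv's public RSS feed URLs, and would need scheduling (cron, a background thread, or a task queue) to run continuously:

import feedparser

seen_links = set()  # in practice, persist this (e.g. alongside the CSV library)

def poll_arxiv_feed(category="cs.AI"):
    """Fetch the arXiv RSS feed for a category and return entries not seen before."""
    feed = feedparser.parse(f"http://export.arxiv.org/rss/{category}")
    new_entries = []
    for entry in feed.entries:
        if entry.link not in seen_links:
            seen_links.add(entry.link)
            new_entries.append({"title": entry.title, "link": entry.link})
    return new_entries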

6. Practical Applications

6.1 Academic Research

Researchers across various disciplines can significantly benefit from this system to streamline their literature review processes and uncover relevant studies efficiently. When researchers enter specific queries related to their topics, the system quickly searches through vast collections of scientific papers, identifying and summarizing key findings, methodologies, and results.

6.2 Industry R&D

In the fast-paced environments of pharmaceuticals, engineering, and technology R&D departments, staying updated with the latest scientific discoveries is crucial for innovation and maintaining competitive advantages. The system offers these industries a powerful tool to quickly access cutting-edge research, experimental results, and technological advancements.

6.3 Education

Educators and students alike can utilize the system to enrich the learning experience and support academic research. Teachers can find up-to-date information to prepare their lectures, ensuring that the content they deliver is current and relevant. Similarly, students can use the system to find sources, references, and case studies for their essays, projects, or theses.

6.4 Data Science and AI

For data scientists and AI researchers, the system serves as a critical resource for sourcing datasets, understanding complex algorithms, and benchmarking against existing research. Users can query the system for the most recent and relevant datasets available for their specific projects, including details on dataset size, diversity, and application.

Conclusion and Future Work

The development and implementation of this research and scientific knowledge retrieval system underscore the transformative potential of AI in enhancing the accessibility and efficiency of scientific inquiry.

Future work will focus on leveraging the latest advancements in AI and machine learning to address the challenges identified, ensuring that the system remains at the forefront of technology and continues to serve the needs of its diverse user base.

Thank you for taking the time to read. If you found the article insightful, please consider clapping to support future content.👏

Entreprenerdly.com hosts the full suite of tutorials, code, and strategies designed to empower you with actionable knowledge.

https://entreprenerdly.com