Automating Scientific Knowledge Retrieval with AI in Python
An End-to-End Guide to Developing a Research Chatbot with OpenAI Functions, Capable of Semantic Search Across arXiv

The sheer volume of scientific publications, datasets, and scholarly articles available today poses a challenge for researchers, academics, and professionals striving to stay abreast of the latest developments in their fields.
This challenge underscores the necessity for innovative approaches to streamline the process of scientific knowledge retrieval, making it both efficient and effective.
AI and semantic search have shown remarkable promise in transforming the way we access and interact with information. At the forefront of these innovations is the application of OpenAI functions, which transform natural language inputs into structured outputs or function calls.
For instance, when tasked with a query about the latest advancements in renewable energy technologies, OpenAI’s models can sift through recent publications, identify key papers and findings, and summarize research trends without being limited to specific keywords.
This capability not only accelerates the research process but also uncovers connections and insights that might not be immediately evident through conventional search methods.
The purpose of this article is to provide end-to-end Python code to search and process scientific literature, utilizing OpenAI functions and the arXiv API to streamline the retrieval, summarization, and presentation of academic research findings.
This guide is structured as follows:
Solution Architecture
Getting Started in Python
Core Functionalities
Interacting with the Research Chatbot
Challenges and Improvements
1. Solution Architecture
The solution architecture for the research chatbot delineates a multi-layered approach to processing and delivering scientific knowledge to users.
The workflow is designed to handle complex user queries, interact with external APIs, and provide informative responses.
The architecture incorporates various components that facilitate the flow of information from initial user input to the final response delivery.

1. User Interface (UI): The user submits queries through this interface; in this case, a Jupyter notebook.
2. Conversation Management: This module handles the dialogue, ensuring context is maintained throughout the user interaction.
3. Query Processing: The user’s query is interpreted here, which involves understanding the intent and preparing it for subsequent actions.
4. OpenAI API Integration (Embedding & Completion):
- The Completion request processes the query directly and generates an immediate response for queries that do not require paper retrieval.
- The Embedding Request is used for queries that need academic paper retrieval, generating a vector to find relevant documents.
5. External APIs (arXiv): This is where the chatbot interacts with external databases like arXiv to fetch scientific papers based on the query.
6. Get Articles & Summarize: This function retrieves articles and then uses the embeddings to prioritize which articles to summarize based on the query’s context.
7. PDF Processing, Text Extraction & Chunking: If detailed information is needed, the system processes the PDFs, extracts text, and chunks it into smaller pieces, preparing for summarization.
8. Response Generation:
- It integrates responses from the OpenAI API Completion service.
- It includes summaries of articles retrieved and processed from the arXiv API, which are based on the embeddings generated earlier.
9. Presentation to User: The final step where a cohesive response, combining AI-generated answers and summaries of articles, is presented to the user.
2. Getting Started in Python
2.1 Installation of Necessary Libraries
We utilize a variety of Python libraries, each serving a specific function to facilitate the retrieval and processing of scientific knowledge. Here is an overview of each library and its role:
- scipy: Essential for scientific computing, offering modules for optimization, linear algebra, integration, and more.
- tenacity: Facilitates retrying of failed operations, particularly useful for reliable requests to external APIs or databases.
- tiktoken: A fast BPE tokenizer designed for use with OpenAI's models, enabling efficient tokenization of text for models like GPT-4.
- termcolor: Enables colored terminal output, useful for differentiating log messages or outputs for easier debugging.
- openai: The official library for interacting with OpenAI's APIs and models such as GPT-3.5 and GPT-4, crucial for querying and receiving model responses.
- requests: For making HTTP requests to web services or APIs, used here for data retrieval and interaction with external resources.
- arxiv: Simplifies searching, fetching, and managing scientific papers from arXiv.org.
- pandas: Key for data manipulation and analysis, offering structures and functions for handling large datasets.
- PyPDF2: Enables text extraction from PDF files, vital for processing scientific papers in PDF format.
- tqdm: Generates progress bars for loops or long-running processes, improving the user experience.
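All of these can be installed with pip. A typical command, run from a notebook cell, might look like the one below; note that the snippets in this guide use the pre-1.0 openai client interface (openai.Embedding.create, openai.ChatCompletion.create), so pinning openai below version 1.0 is assumed:
!pip install scipy tenacity tiktoken termcolor "openai<1" requests arxiv pandas PyPDF2 tqdm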
2.2 Setting Up the Environment
First, you’ll need to create an account on OpenAI’s platform and obtain an API key from the API section of your account settings.
import openai

openai.api_key = "API_KEY"
GPT_MODEL = "gpt-3.5-turbo-0613"
EMBEDDING_MODEL = "text-embedding-ada-002"
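To avoid hardcoding the key in source files, you can also read it from an environment variable (this assumes the key has been exported as OPENAI_API_KEY beforehand):
import os
import openai

# Read the API key from the environment instead of hardcoding it
openai.api_key = os.environ.get("OPENAI_API_KEY")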
2.3 Project Setup
Creating a structured directory for managing downloaded papers or data is crucial for organization and easy access. Here’s how you can set up the necessary directories:
- Create Directory Structure: Decide on a structure that suits your project's needs. For managing downloaded papers, a ./data/papers directory is suggested.
- Implementation: Use Python's os library to check for the existence of these directories and create them if they don't exist:
import os

directory = './data/papers'
if not os.path.exists(directory):
    os.makedirs(directory)
This snippet ensures that your script can run on any system without manual directory setup, making your project more portable and user-friendly.
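The functions in the next section also assume a download directory (data_dir) and a CSV "library" file (paper_dir_filepath) that records each downloaded paper. A minimal sketch of that setup, inferred from how these names are used below, could look like this:
import os
import pandas as pd

# Directory where downloaded PDFs will be stored (used by get_articles)
data_dir = os.path.join(os.curdir, "data", "papers")
os.makedirs(data_dir, exist_ok=True)

# CSV "library" recording each paper's title, local file path, and title embedding
paper_dir_filepath = "./data/arxiv_library.csv"
if not os.path.exists(paper_dir_filepath):
    pd.DataFrame(list()).to_csv(paper_dir_filepath)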
3. Core Functionalities
The research chatbot, designed to facilitate scientific knowledge retrieval, integrates several core functionalities.
These are centered around processing natural language queries, retrieving and summarizing academic content, and enhancing user interactions with advanced NLP techniques.
Below, we detail these functionalities, underscored by specific code snippets that illustrate their implementation.
3.1 Embedding Generation
To understand and process user queries effectively, the chatbot leverages embeddings — a numerical representation of text that captures semantic meanings. This is crucial for tasks like determining the relevance of scientific papers to a query.
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def embedding_request(text):
    response = openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
    return response
This function, equipped with a retry mechanism, requests embeddings from OpenAI's API, ensuring robustness in the face of potential API errors or rate limits. It returns the full API response; callers extract the vector from response["data"][0]["embedding"].
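As a quick, hypothetical example of how it is used elsewhere in the code:
# Example: request an embedding for a short query string
response = embedding_request("advances in perovskite solar cells")
vector = response["data"][0]["embedding"]
print(len(vector))  # text-embedding-ada-002 embeddings have 1536 dimensions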
3.2 Retrieving Academic Papers
Upon understanding a query, the chatbot fetches relevant academic papers, demonstrating its ability to interface directly with external databases like arXiv.
import arxiv
from csv import writer

# Function to get articles from arXiv
def get_articles(query, library=paper_dir_filepath, top_k=5):
    """
    Searches for and retrieves the top 'top_k' academic papers related to a user's query from the arXiv database.
    The search criteria are the user's query and the number of results, limited to 'top_k'.
    For each article found, it stores relevant information such as the title, summary, and URLs in a list.
    It also downloads the PDF of each paper and stores references, including the title, download path, and embedding of the paper title, in the CSV file specified by 'library'.
    This keeps a record of the papers and their embeddings for later retrieval and analysis, and is relied on by read_article_and_summarize.
    """
    search = arxiv.Search(
        query=query, max_results=top_k, sort_by=arxiv.SortCriterion.Relevance
    )
    result_list = []
    for result in search.results():
        result_dict = {}
        result_dict.update({"title": result.title})
        result_dict.update({"summary": result.summary})
        # Taking the first url provided
        result_dict.update({"article_url": [x.href for x in result.links][0]})
        result_dict.update({"pdf_url": [x.href for x in result.links][1]})
        result_list.append(result_dict)

        # Store references in library file: title, local PDF path, and title embedding
        response = embedding_request(text=result.title)
        file_reference = [
            result.title,
            result.download_pdf(data_dir),
            response["data"][0]["embedding"],
        ]

        # Append the reference to the CSV library
        with open(library, "a") as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(file_reference)

    return result_list
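A hypothetical invocation (the query string is purely illustrative):
# Fetch the five most relevant arXiv papers for a query and print their titles
results = get_articles("proximal policy optimization in reinforcement learning")
for paper in results:
    print(paper["title"], "-", paper["pdf_url"])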
3.3 Ranking and Summarization
With relevant papers at hand, the system ranks them based on their relatedness to the query and summarizes the content to provide concise, insightful information back to the user.
from scipy import spatial
import pandas as pd

# Function to rank strings by relatedness to a query string
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> list[str]:
    """
    Ranks and returns a list of strings from a DataFrame based on their relatedness to a given query string.
    The function first obtains an embedding for the query string. Then it calculates the relatedness of each string in the DataFrame to the query,
    using the provided 'relatedness_fn', which defaults to the cosine similarity between their embeddings.
    It sorts these strings in descending order of relatedness and returns the top 'top_n' strings.
    """
    query_embedding_response = embedding_request(query)
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["filepath"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n]
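For example, once get_articles has populated the library CSV, the ranking can be applied like this (the column handling mirrors summarize_text below):
import ast

# Load the stored references and parse the embedding column back into Python lists
library_df = pd.read_csv(paper_dir_filepath).reset_index()
library_df.columns = ["title", "filepath", "embedding"]
library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)

# File paths of the three stored papers most related to the query
top_filepaths = strings_ranked_by_relatedness("graph neural networks", library_df, top_n=3)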
3.4 Summarizing Academic Papers
Following the identification of relevant papers, the chatbot employs a summarization process to distill the essence of scientific documents.
import ast
import concurrent.futures
import tiktoken
from tqdm import tqdm

# Function to summarize chunks and return an overall summary
def summarize_text(query):
    """
    Automates summarizing academic papers relevant to a user's query. The process includes:
    1. Reading Data: Reads 'arxiv_library.csv' containing information about papers and their embeddings.
    2. Identifying Relevant Paper: Compares the query's embedding to the embeddings in the CSV to find the closest match.
    3. Extracting Text: Reads the PDF of the identified paper and converts its content into a string.
    4. Chunking Text: Divides the extracted text into manageable chunks for efficient processing.
    5. Summarizing Chunks: Each text chunk is summarized using the 'extract_chunk' function in parallel.
    6. Compiling Summaries: Combines individual summaries into a final comprehensive summary.
    7. Returning Summary: Provides a condensed overview of the paper, focusing on key insights relevant to the user's query.
    """
    # A prompt to dictate how the recursive summarizations should approach the input paper
    summary_prompt = """Summarize this text from an academic paper. Extract any key points with reasoning.\n\nContent:"""

    # If the library is empty (no searches have been performed yet), we perform one and download the results
    library_df = pd.read_csv(paper_dir_filepath).reset_index()
    if len(library_df) == 0:
        print("No papers searched yet, downloading first.")
        get_articles(query)
        print("Papers downloaded, continuing")
        library_df = pd.read_csv(paper_dir_filepath).reset_index()
    library_df.columns = ["title", "filepath", "embedding"]
    library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)
    strings = strings_ranked_by_relatedness(query, library_df, top_n=1)

    print("Chunking text from paper")
    pdf_text = read_pdf(strings[0])

    # Initialise tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")
    results = ""

    # Chunk up the document into 1500-token chunks
    chunks = create_chunks(pdf_text, 1500, tokenizer)
    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

    print("Summarizing each chunk of text")
    # Parallel process the summaries
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(text_chunks)
    ) as executor:
        futures = [
            executor.submit(extract_chunk, chunk, summary_prompt)
            for chunk in text_chunks
        ]
        with tqdm(total=len(text_chunks)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(1)
        for future in futures:
            data = future.result()
            results += data

    # Final summary
    print("Summarizing into overall summary")
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""Write a summary collated from this collection of key points extracted from an academic paper.
                The summary should highlight the core argument, conclusions and evidence, and answer the user's query.
                User query: {query}
                The summary should be structured in bulleted lists following the headings Core Argument, Evidence, and Conclusions.
                Key points:\n{results}\nSummary:\n""",
            }
        ],
        temperature=0,
    )
    return response
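summarize_text relies on three helpers (read_pdf, create_chunks, extract_chunk) that appear in the complete code. The sketch below shows one plausible shape for them, assuming PyPDF2 for text extraction and the same chat model for per-chunk summaries; treat it as illustrative rather than the exact implementation:
import PyPDF2

def read_pdf(filepath):
    """Read a PDF from disk and return its text as a single string."""
    reader = PyPDF2.PdfReader(filepath)
    pdf_text = ""
    for page in reader.pages:
        pdf_text += (page.extract_text() or "") + "\n"
    return pdf_text

def create_chunks(text, n, tokenizer):
    """Tokenize the text and yield successive chunks of roughly n tokens."""
    tokens = tokenizer.encode(text)
    for i in range(0, len(tokens), n):
        yield tokens[i : i + n]

def extract_chunk(content, template_prompt):
    """Summarize a single text chunk with the chat model and return the summary text."""
    prompt = template_prompt + content
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]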
3.5 Integration and Use of OpenAI Functions
The research chatbot leverages OpenAI functions, a powerful feature of the OpenAI API, to enhance its ability to process and respond to complex queries.
These functions allow for a more seamless interaction between the chatbot and various external data sources and tools, significantly enriching the user’s experience by providing detailed, accurate, and contextually relevant information.
OpenAI functions are designed to extend the capabilities of OpenAI models by integrating external computation or data retrieval directly into the model’s processing flow.
3.5.1 Custom OpenAI Functions
- get_articles: Retrieves academic papers relevant to the user's query from the arXiv database, showcasing the chatbot's ability to access real-time data from external sources.
- read_article_and_summarize: Reads the full text of a downloaded paper and returns a concise summary, building on the results already fetched by get_articles.
Implementation:
# Function definitions exposing get_articles and read_article_and_summarize to the model
arxiv_functions = [
    {
        "name": "get_articles",
        "description": """Use this function to get academic papers from arXiv to answer user questions.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": """
                        User query in JSON. Responses should be summarized and should include the article URL reference
                    """,
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_article_and_summarize",
        "description": """Use this function to read whole papers and provide a summary for users.
        You should NEVER call this function before get_articles has been called in the conversation.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": """
                        Description of the article in plain text based on the user's query
                    """,
                }
            },
            "required": ["query"],
        },
    },
]
The incorporation of these functions into the chatbot’s workflow demonstrates an advanced use case of OpenAI’s API, where custom functions tailored to specific tasks — like academic research — are executed based on conversational context.
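To make this concrete, the sketch below shows a simplified version of the dispatch logic; names not defined above, such as chat_completion_with_function_execution, are illustrative. The arxiv_functions list is passed to the Chat Completions API, and any function call the model requests is routed to the matching Python function:
import json

def chat_completion_with_function_execution(messages, functions=arxiv_functions):
    """Send the conversation to the model and execute any function it decides to call."""
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=messages,
        functions=functions,
        function_call="auto",
    )
    message = response["choices"][0]["message"]

    # If the model answered directly, return its content as-is
    if message.get("function_call") is None:
        return message["content"]

    # Otherwise, route the requested call to the matching Python function
    name = message["function_call"]["name"]
    arguments = json.loads(message["function_call"]["arguments"])
    if name == "get_articles":
        return str(get_articles(arguments["query"]))
    elif name == "read_article_and_summarize":
        # read_article_and_summarize corresponds to the summarize_text function defined above
        summary_response = summarize_text(arguments["query"])
        return summary_response["choices"][0]["message"]["content"]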
3.6 Complete Code
See the complete code, with all required helper functions and the chatbot interaction loop, for the end-to-end implementation.