Patrick Kalkman


How to Build a Document-Based Q&A System Using OpenAI and Python

Leveraging the power of Large Language Models and the LangChain framework for an innovative approach to document querying

Sarah watches the airplanes while waiting for her connection. Image generated by Midjourney, prompt by author.

Amid the ceaseless hum of a bustling airport, Sarah was absorbed in the choreographed ballet of airplanes, their soaring ascents and elegant descents marking time as she awaited her connecting flight.

With unoccupied hours stretching ahead, she decided to seize the moment. Clasping a warming cup of coffee, she claimed a nearby desk, ready to dive into work amidst the ambient hustle of the airport.

As CTO of NeonShield, a pioneering cyber security firm, Sarah was immersed in crafting a tender for a prospective client. This critical document brimmed with technical explanations detailing how NeonShield would deftly integrate its cyber defenses into the client’s IT infrastructure.

She had answered similar questions before, but each response demanded meticulous customization to suit the unique format of the tender. The essential information lay somewhere within NeonShield’s vast data repository, but locating it was a Herculean task. A sigh of exasperation subtly punctuated the airport’s constant murmur.

Hiroshi, sitting next to her, noticed her frustration and casually offered help. Grateful for the unexpected break in routine, Sarah found herself detailing her predicament.

Hiroshi listened as Sarah described her arduous quest to pinpoint precise data in her company’s burgeoning information vault.

Hiroshi explained that he worked for a startup that uses AI, especially Large Language Models, to solve business problems. He asked Sarah, “Have you ever heard of OpenAI, ChatGPT, and LLMs?”

As Hiroshi explained how OpenAI could parse and comprehend documents for efficient information retrieval, Sarah was captivated by this intriguing prospect.

The conversation blossomed into a lively discussion on the potential of this technology. As Hiroshi elaborated, Sarah could envision myriad ways it could streamline data retrieval at NeonShield and accelerate its tender response process.

This exchange left her mind buzzing with possibilities. AI appeared to be the panacea for many data-related issues plaguing NeonShield. Embarking on her return journey, she anticipated sharing this discovery with her team.

Back at NeonShield, Sarah primed herself to explore this new avenue. She contacted Mustafa, a savvy software engineer in her team known for his adeptness with new technologies and a master’s thesis on practical AI applications. Intrigued by Sarah’s account of AI, Mustafa agreed to help develop a prototype.

Join us as we embark on a journey alongside Sarah and Mustafa. Harnessing Sarah’s deep understanding of NeonShield’s data needs and Mustafa’s technical prowess, they set out to integrate OpenAI into the company’s systems.

Their mission? To develop a prototype that would query their extensive document database, with OpenAI promising to be the key to unlocking this challenge.

In this story, we will witness them overcoming various hurdles, innovating, and problem-solving in their quest for efficient data retrieval.

The DocuVortex prototype can be found in this GitHub repository. Instructions on how to install and run the application can be found in the README.md.

After delving into the nuances of creating a Document-Based Q&A System in this story, be sure to continue your learning journey with our follow-up article, Harnessing Local Language Models — A Guide to Transitioning From OpenAI to On-Premise Power where we discuss in detail the process of transitioning from cloud-based AI solutions to local language models.

The first meeting on AI integration

Back at NeonShield, Sarah found Mustafa engrossed in his work. He was renowned within the company for his ability to quickly understand and implement new technologies. Mustafa had written his Master’s thesis on practical applications of AI, and Sarah was eager to tap into his expertise.

She walked into his office, the air filled with the hum of computer fans and the muted clacking of Mustafa’s keyboard. Catching his attention, she began to relay the conversation at the airport and the potential of OpenAI as a solution to their data search challenges.

Mustafa listened intently, his eyes narrowing in concentration. Once Sarah finished, he leaned back in his chair, silent momentarily. Then, he sketched a rough architecture of how OpenAI could be integrated into NeonShield’s existing systems.

“Alright, let’s get into the technical aspects, then,” Mustafa began. “Our process can be divided into two phases: Ingestion and Searching.

Ingestion

The Ingestion phase: converting documents to embeddings and storing them in a vector database. Image by the author.

“In the Ingestion phase, we’ll prepare our data for later use. It’s about processing all the documents you want to be searchable. We’ll convert each document into smaller pieces called ‘chunks.’ Each chunk might be a paragraph or a sentence, depending on what works best.

“We then pass these chunks of text to the OpenAI API. The API uses a sophisticated language model to convert each chunk of text into an ‘embedding.’ These embeddings are high-dimensional vectors representing each text chunk's semantic content. Essentially, they’re a numerical summary of what each chunk is about.

“We store these embeddings in a special vector database designed to search and retrieve high-dimensional vectors efficiently. Each embedding is linked to the original chunk of text it came from, so we can retrieve the text later.
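
In code, this step boils down to sending each chunk to OpenAI’s embeddings endpoint. Here is a minimal sketch using LangChain’s OpenAIEmbeddings wrapper; the example chunks are made up, and in DocuVortex they come from the parsed PDFs:

from langchain.embeddings import OpenAIEmbeddings

# Hypothetical chunks; in DocuVortex these come from the parsed PDF pages.
chunks = [
    "NeonShield integrates its cyber defenses into the client's IT infrastructure.",
    "Incident response is handled by a 24/7 security operations team.",
]

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
vectors = embeddings.embed_documents(chunks)

# Each chunk is now a high-dimensional vector (1536 numbers for text-embedding-ada-002).
print(len(vectors), len(vectors[0]))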

Searching

Converting a question to an embedding and querying the Vector Db, sending the question and result to the OpenAI API—image by the author.

“Moving on to the Searching phase. This is where we respond to a user query. When a user asks a question, we first convert that question into an embedding, again using the OpenAI API.

“We then search our vector database for the embeddings most similar to the question embedding. The idea is that similar questions will have similar embeddings. By finding the closest embeddings in our database, we see the chunks of text most relevant to the user’s query.

“We retrieve the original text chunks associated with these closest embeddings and send them, along with the user’s question, to the OpenAI API again. The API generates a coherent and contextually appropriate response based on this information.

“And that’s essentially how we can integrate OpenAI into NeonShield’s systems. The Ingestion phase processes and stores information about our documents, and the Searching phase retrieves and uses this information to answer user questions,” Mustafa concluded.
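
In code, the retrieval step Mustafa describes is a similarity search against the vector store. A minimal sketch with LangChain and Chroma; the collection name and persist directory are assumed values:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Re-open the vector store that the Ingestion phase populated.
vector_store = Chroma(
    collection_name="docuvortex",          # assumed name
    embedding_function=OpenAIEmbeddings(),
    persist_directory="./db",              # assumed location
)

# The question is embedded and compared against the stored chunk embeddings;
# the k closest chunks come back as Documents.
relevant_chunks = vector_store.similarity_search(
    "How does NeonShield integrate its defenses into a client's infrastructure?", k=4
)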

“Wow, Mustafa, that was a lot to take in! I have a ton of questions swirling around in my head now,” Sarah admitted.

“That’s fantastic, Sarah! Let’s dive into them. There’s no better way to understand all of this than by asking questions,” Mustafa responded with enthusiasm.

Answering Sarah’s questions

Sarah: “Mustafa, we’ll send some pretty sensitive stuff to this OpenAI thing. How sure are we that our secrets are safe?”

Mustafa: “Good point, Sarah. I mean, security is super important, right? OpenAI is pretty tight about this stuff. They don’t hang onto our data for more than a month, and they’re not peeking at it to improve their models.

“But if we’ve got some super top-secret stuff, we should either leave it out of the ingestion process or look at doing the whole thing in-house. We’ll have to see what suits our policy.”

Sarah: “So, why are we chopping up our documents? Couldn’t we use whole PDFs?”

Mustafa: “Well, Sarah, think about it like this: we’re making a super detailed table of contents or index for a book. We could summarize the book in a sentence or two, but we’d lose many specifics. By breaking it down, we can ensure we don’t miss any juicy details when answering queries.”

Sarah: “Mustafa, the OpenAI service isn’t free. Any idea what kind of costs we’re looking at here? I know it’s early days, but the suits upstairs will want some figures soon.”

Mustafa: “Yeah, that’s a biggie, Sarah. It will depend on how much text we must process and how many questions we expect. OpenAI usually charges by the amount of text — or ‘tokens’ — we process.”

Sarah: “Okay, we’re also billed for embedding creation, right? And we’re storing them to save some cash?”

Mustafa: “Exactly, Sarah. We get billed for each embedding, so storing them in our Vector Database makes sense. This way, we pay once, but we can use them as many times as we need.”
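
To make the “pay per token” idea concrete, here is a back-of-the-envelope calculation. The page count and the per-token rate are purely illustrative assumptions; check OpenAI’s current pricing before quoting figures:

# Rough, illustrative cost estimate for embedding a document set once.
pages = 5_000                   # assumed number of pages to ingest
chars_per_page = 3_000          # assumed average characters per page
tokens_per_char = 1 / 4         # rule of thumb: roughly 4 characters per token in English
price_per_1k_tokens = 0.0001    # assumed embedding rate in USD; verify against current pricing

total_tokens = pages * chars_per_page * tokens_per_char
embedding_cost = total_tokens / 1_000 * price_per_1k_tokens
print(f"~{total_tokens:,.0f} tokens, roughly ${embedding_cost:.2f} to embed once")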

Sarah: “So this Vector Database, that’s on us, right? What do we need to get it up and running?”

Mustafa: “Yeah, we’ll need to host it ourselves. We’ve got a few options to pick from, like FAISS, Chroma DB, Annoy, or Elasticsearch’s vector fields. We’ll pick one based on how much data we’ve got, how quick we need it to be, and what our infrastructure can handle. We’ve got to make sure our servers can take the heat.”

Sarah: “Okay, last one. In the Searching phase, what will this OpenAI give us back?”

Mustafa: “Alright, so in the Searching phase, we’re sending the relevant text chunks and the user’s question over to the OpenAI API. Then, OpenAI goes to work and gives us a well-crafted response. It will be a chunk of text that uses the info from our documents to answer the question best. It’s like having a super-smart assistant who always knows where to find the right information!”

“Alright, Mustafa, that’s plenty to chew on for now,” Sarah said. “What’s our timeline for a prototype? Just take a few of those PDFs we’ve got up on our website and see if you can show us how this all works in action.”

Mustafa grinned at her question. “Ah, the inevitable ‘how long’ query! Let’s see… I think I can whip up something to show you by the end of this week.”

Demo Day

Under the gentle warmth of a sun-soaked Friday afternoon, Mustafa was primed and ready to unveil the project that had been his labor of love for the entire week. Brimming with curiosity, Sarah was just as eager to witness the fruits of his work.

Sarah, her eyes sparkling with anticipation, started off the conversation. “So, Mustafa, how did it pan out?”

Mustafa responded with a grin that was the telltale sign of triumph. “It shaped up well; I dare say. I’ve succeeded in crafting both the ingest and query components we envisioned. I’m itching to give you a tour!”

“But before we dive in,” Mustafa interjected, “I thought I’d christen our project ‘DocuVortex.’ What do you think?”

Sarah responded, clearly impressed, “Sounds cool! But why ‘DocuVortex’?”

Mustafa explained, “I always believe that naming an application, even if it’s just a working title, gives it a persona.

“As for ‘DocuVortex,’ I drew inspiration from the image that our process conjures up in my mind: a mighty vortex, voraciously consuming our documents and stowing them away into a unique repository that’s at our beck and call whenever we have questions.”

The image that inspired Mustafa to come up with the name DocuVortex. Image generated by Midjourney, prompt by author.

“Shall we kick off with the ingest component?” With a nod from Sarah, Mustafa embarked on the walkthrough.

Ingesting

“The structure of the prototype is quite simple. I’ve got a folder in the root called ‘docs’ where we can stash all the PDF files for ingestion. These are then converted into chunks and embeddings and stored in Chroma DB. Here, take a look.”

The folder structure of the repository, image by the author.

Mustafa gestured towards a single PDF in the folder titled ‘Cyber Security.’ He ran the ingest application by typing python ingest.py into the terminal, explaining, "I've added a bunch of log statements so you can follow along. Check it out."

The logging messages when ingesting a PDF document. Image by the author.

Sarah seemed to catch on quickly. “So, this process connects to the OpenAI API to create the embeddings and then stores them in the Chroma database, right?”

“Spot on, Sarah,” Mustafa confirmed with an approving nod.

Switching gears, Mustafa then moved on to the second part of the demonstration. “Now, let’s try querying the document.”

Querying

He explained that he had created a small terminal application that could take a question and provide an answer. Typing python query.py into the terminal, he asked: "What is the conclusion of the 'CyberSecurity Risk' document we ingested?"

Sarah looked on as the application churned out a coherent answer. Not only that, but it also displayed the sources from which it had drawn its information.

“See, the sources are the results from the Chroma DB query,” Mustafa explained, pointing out the chunks of text that were sent along with the question to the OpenAI API.

Mustafa beamed with satisfaction, happy with the outcome of a week’s hard work, while Sarah watched on, visibly impressed with the promising prototype.

Mustafa’s enthusiasm was contagious as he began to share yet another feature he had developed — a Streamlit application. He believed this feature would simplify the demonstration of their work to the broader organization.

Querying using the Streamlit app

“Sarah,” he began excitedly, “I’ve also developed a different, perhaps more interactive way to query the documents. It uses Streamlit, and I think it could be particularly effective when showing our work to the rest of the team.”

With a few keystrokes, he started the application by typing streamlit run streamlit_app.py into the terminal. An interactive webpage came to life where search queries could be directly entered.

“It operates quite like a chat box,” Mustafa explained, “remembering your previous messages to maintain the flow of conversation.”

Sarah was visibly taken aback. She gazed at the interactive tool on the screen, clearly amazed by its potential. “Wow, Mustafa,” she marveled, “this is brilliant. You’re right. This tool will make demonstrating our project to the rest of the organization so much easier.”

The Streamlit application used for querying documents. Image by the author.
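
For reference, a minimal version of such a Streamlit front end could look like the sketch below. It assumes the VortexQuery class shown later in this story, the module name is hypothetical, and the actual streamlit_app.py in the repository may differ:

# streamlit_app.py (sketch): a chat-like front end around VortexQuery.
import streamlit as st

from vortex_query import VortexQuery  # hypothetical module name

st.title("DocuVortex")

# Keep the query object and the conversation in the session state so the
# app remembers previous messages between interactions.
if "query" not in st.session_state:
    st.session_state.query = VortexQuery()
    st.session_state.messages = []

question = st.text_input("Ask a question about the ingested documents")
if question:
    answer, _sources = st.session_state.query.ask_question(question)
    st.session_state.messages.append((question, answer))

for asked, answered in st.session_state.messages:
    st.markdown(f"**You:** {asked}")
    st.markdown(f"**DocuVortex:** {answered}")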

“Now, let’s transition to the part I’m particularly interested in,” Sarah said. “Could you walk me through how you’ve implemented all of this?”

DocuVortex implementation

Let’s dive into the DocuVortex implementation details. We’ll start with the process of ingesting PDF documents.

Ingestion

The heart of this process is a class I created named VortexPdfParser. It's assigned the crucial task of parsing the PDF documents in the 'docs' folder. Here's a look at its key function:

def clean_text_to_docs(self) -> List[docstore.Document]:
    # Extract the raw page texts and the document metadata from the PDF.
    raw_pages, metadata = self.parse_pdf()

    # Cleaning steps applied to every page of raw text.
    cleaning_functions: List = [
        self.merge_hyphenated_words,
        self.fix_newlines,
        self.remove_multiple_newlines,
    ]

    cleaned_text_pdf = self.clean_text(raw_pages, cleaning_functions)
    # Split the cleaned text into chunks and wrap them in Document objects.
    return self.text_to_docs(cleaned_text_pdf, metadata)

This function performs three main actions:

  1. It extracts metadata from the document, such as the title, author, and creation date.
  2. It gathers all the text from the PDF’s pages and tidies it up. The cleaning process involves merging hyphen-split words, eradicating multiple newline characters, and replacing newline characters with spaces.
  3. It then chunks the text into segments of 1000 characters, allowing an overlap of 200 characters for context preservation (a minimal sketch of this step follows the list).
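
The chunking step itself can be handled by LangChain’s text splitter. A minimal sketch, assuming the splitter and parameters match the description above (the text_to_docs helper in the repository may do this differently):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # overlap between consecutive chunks to preserve context
)

# cleaned_page stands in for one page of cleaned text; the metadata is illustrative.
cleaned_page = "NeonShield deploys its cyber defenses in three phases ..."
docs = splitter.create_documents(
    [cleaned_page],
    metadatas=[{"title": "Cyber Security", "page": 1}],
)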

Next, I crafted another class named VortexIngester. This class is responsible for transforming the text chunks into embeddings and storing them in our vector database, Chroma DB. Here's a glimpse at the source code for this class:

class VortexIngester:
    def __init__(self, content_folder: str):
        self.content_folder = content_folder

    def ingest(self) -> None:
        vortex_content_iterator = VortexContentIterator(self.content_folder)
        vortex_pdf_parser = VortexPdfParser()

        # Parse every PDF in the content folder into cleaned, chunked Documents.
        chunks: List[docstore.Document] = []
        for document in vortex_content_iterator:
            vortex_pdf_parser.set_pdf_file_path(document)
            document_chunks = vortex_pdf_parser.clean_text_to_docs()
            chunks.extend(document_chunks)
            logger.info(f"Extracted {len(document_chunks)} chunks from {document}")

        # Convert the chunks into embeddings and store them in Chroma DB.
        embeddings = OpenAIEmbeddings(client=None)
        logger.info("Loaded embeddings")
        vector_store = Chroma.from_documents(
            chunks,
            embeddings,
            collection_name=COLLECTION_NAME,
            persist_directory=PERSIST_DIRECTORY,
        )

        logger.info("Created Chroma vector store")
        vector_store.persist()
        logger.info("Persisted Chroma vector store")

Throughout this process, the LangChain library has been a cornerstone for me. It’s a robust framework designed for building applications that capitalize on the power of language models. As a result, the OpenAIEmbeddings class you see in use here comes directly from this forward-thinking framework.

The method Chroma.from_documents is another gem from the LangChain toolbox. Working hand-in-hand with the OpenAIEmbeddings class, it transforms each text chunk into embeddings and stores them in the Chroma vector database.

Now, what is the persist_directory argument for? Well, it's simply there to tell Chroma DB where to store the database when it's time for data persistence. So, all in all, these tools from LangChain have significantly streamlined the process and made the implementation more efficient.
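
Tying it together, ingest.py only needs to point the VortexIngester at the docs folder. A sketch of what that driver script might look like; the module name is hypothetical and the actual script in the repository may differ:

# ingest.py (sketch): ingest all PDFs in the docs folder.
from dotenv import load_dotenv

from vortex_ingester import VortexIngester  # hypothetical module name

load_dotenv()                      # makes the OPENAI_API_KEY available
ingester = VortexIngester("docs/")
ingester.ingest()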

Querying Process

In contrast to the ingestion code, the querying part of the system is relatively straightforward, as you’re about to see. The critical player in querying Chroma DB and forwarding the query to the OpenAI API is the VortexQuery class.

Our first step in this process is to create an instance of a Large Language Model (LLM). We’ve chosen OpenAI’s gpt-3.5-turbo, one of their latest models, as our LLM.

A feature we leverage here is the temperature setting, which we’ve set to 0. The temperature controls how much creative freedom the model has when generating responses; at 0, the answers stay as deterministic and as close to the retrieved text as possible.

Next, we return to our old friend, the OpenAIEmbeddings class, and instantiate the Chroma database. LangChain’s ConversationalRetrievalChain then comes into play, giving us an object to which we can send our questions.

An important detail is that we also record previous questions and answers. This is crucial as it provides context for the model to understand and respond to the incoming questions.

Here’s the relevant part of the code for the VortexQuery class:

class VortexQuery:
    def __init__(self):
        load_dotenv()  # loads the OpenAI API key from .env
        self.chain = self.make_chain()
        self.chat_history = []

    def make_chain(self):
        # The chat model that generates the final answer; temperature 0 keeps
        # the responses deterministic.
        model = ChatOpenAI(
            client=None,
            model="gpt-3.5-turbo",
            temperature=0,
        )
        embedding = OpenAIEmbeddings(client=None)

        # The same Chroma collection that the ingestion step populated.
        vector_store = Chroma(
            collection_name=COLLECTION_NAME,
            embedding_function=embedding,
            persist_directory=PERSIST_DIRECTORY,
        )

        return ConversationalRetrievalChain.from_llm(
            model,
            retriever=vector_store.as_retriever(),
            return_source_documents=True,
        )

    def ask_question(self, question: str):
        # Pass the question plus the conversation so far; the chain retrieves
        # the relevant chunks and asks the model to answer.
        response = self.chain({"question": question, "chat_history": self.chat_history})

        answer = response["answer"]
        source = response["source_documents"]
        self.chat_history.append(HumanMessage(content=question))
        self.chat_history.append(AIMessage(content=answer))

        return answer, source

“So, that’s the crux of how the querying process is set up,” Mustafa concluded.
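
Around this class, query.py only needs a small read-and-answer loop. A sketch of what that terminal script might look like; the module name is hypothetical and the actual script in the repository may differ:

# query.py (sketch): a minimal terminal loop around VortexQuery.
from vortex_query import VortexQuery  # hypothetical module name

query = VortexQuery()
while True:
    question = input("Question (empty to quit): ")
    if not question:
        break
    answer, sources = query.ask_question(question)
    print(answer)
    for document in sources:
        print("Source:", document.metadata)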

Unit tests

With the spirit of rigor and precision, Sarah queried, “I know it’s our standard practice to write unit tests for our applications. They help ensure that our software performs its intended function and continues to do so even as we make modifications.

“Given that this is a prototype, I wonder if you also found it necessary to adopt unit tests here?”

Mustafa nodded in understanding before he explained,

“Indeed, the creation of a prototype often presents unique challenges. However, I still found it helpful to incorporate unit tests, especially for the more complex components like the PDF parser.

“To add another layer of reliability, I set up a GitHub action that triggers these unit tests every time a change is pushed to the repository.”

Sarah couldn’t hide her admiration. “That’s commendable, Mustafa!”
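
For reference, a unit test for the parser’s cleaning helpers might look something like the sketch below. It assumes merge_hyphenated_words and fix_newlines behave as their names and the earlier description suggest; the module name is hypothetical and the repository’s actual tests may differ:

# test_vortex_pdf_parser.py (sketch): not the repository's actual tests.
from vortex_pdf_parser import VortexPdfParser  # hypothetical module name


def test_merge_hyphenated_words():
    parser = VortexPdfParser()
    assert parser.merge_hyphenated_words("cyber-\nsecurity risks") == "cybersecurity risks"


def test_fix_newlines():
    parser = VortexPdfParser()
    assert parser.fix_newlines("line one\nline two") == "line one line two"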

Charting the Path Forward: Roadmap to Beta

Before we roll out this groundbreaking innovation to the rest of our team, let’s pause for a moment. We’ve made incredible strides already, but let’s map out the areas we need to refine and enhance.

This way, we’ll have a clear game plan to transform this exciting prototype into a fully functioning beta version of our NeonShield DocuVortex application.

“Mustafa, you’ve done an amazing job turning our concept into a working prototype. Your ingenuity has demonstrated how NeonShield can harness the power of OpenAI to address our business challenges when drafting tenders for potential clients,” Sarah began.

“You’ve managed to process, clean, and chunk our own PDFs, creating and storing embeddings of these chunks in a vector database. The fact that we can directly query our documents will reduce the time and cost of tender creation. This could even boost our capacity to bid, which could mean more revenue.”

Sarah paused for a moment, then added, “But to make this vision a reality and to evolve from this brilliant prototype to a powerful beta version, there are a few key areas we need to focus on:

  1. First, we must look into deploying Chroma DB within our existing Kubernetes cluster rather than running it in memory.
  2. Second, we should test the ingestion process with various PDF files across our organization to ensure its effectiveness and robustness.
  3. Third, let’s see how we can integrate other types of documents into the system, such as our internal wiki. We’ll also need to find the best way to parse these documents.
  4. Fourth, it would be worth exploring other large language models that could enhance our application.
  5. Fifth, we should Dockerize the ingestion component and schedule it to run at specific intervals.
  6. Sixth, we should estimate the operational costs of running at full capacity.
  7. And finally, let’s work on enhancing and Dockerizing the Streamlit application for a more seamless user experience.”

As they discussed these focal points, Sarah and Mustafa could see the next phase of the NeonShield DocuVortex journey unfolding before them, bringing them one step closer to their ambitious vision.

The DocuVortex prototype can be found in this GitHub repository. Instructions on how to install and run the application can be found in the README.md.

Happy coding!
