Free AI web copilot to create summaries, insights and extended knowledge, download it at here

10129

Abstract

span>)</pre></div>Finally, store the generated embeddings in the weaviate DB. For this, you need to configure the client to perform batch processing that can handle large amounts of data.<div id="cbf5"><pre>client.batch.configure( batch_size=10, dynamic=True, timeout_retries=3, )

for i in range(0, len(embeddings)): item = embeddings.iloc[i]

pdf_obj = {
    <span class="hljs-string">"embedded_values"</span>: item[<span class="hljs-string">"embedded_values"</span>],
}

<span class="hljs-keyword">try</span>:
    client.batch.add_data_object(pdf_obj, <span class="hljs-string">"PDF"</span>)
<span class="hljs-keyword">except</span> BaseException <span class="hljs-keyword">as</span> error:
    <span class="hljs-built_in">print</span>(<span class="hljs-string">"Import Failed at: "</span>, i)
    <span class="hljs-built_in">print</span>(<span class="hljs-string">"An exception occurred: {}"</span>.<span class="hljs-built_in">format</span>(error))
    <span class="hljs-comment"># Stop the import on error</span>
    <span class="hljs-keyword">break</span>

<span class="hljs-built_in">print</span>(<span class="hljs-string">"Status: "</span>, <span class="hljs-built_in">str</span>(i) + <span class="hljs-string">"/"</span> + <span class="hljs-built_in">str</span>(<span class="hljs-built_in">len</span>(embeddings) - <span class="hljs-number">1</span>))

client.batch.flush() print("Job done...") return True</pre></div>Code Link:<a href="https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/store.py">https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/store.py</a><h1 id="f8b0">Decoding the Schema📉</h1><figure id="c3c2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*FRlWSqEkQ_yRkeN9"><figcaption>Weviate Schema | Skanda Vivek</figcaption></figure>Weaviate supports Graph query language (<a href="https://graphql.org">GraphQL</a>) that supports that is built over Graph data structures hence faster data access rates and updation. Like any other query language, you can perform Get, Aggregate, and Explore on the Weaviate schema. These can be easily performed on the Weaviate console that provides prebuilt UI to test the GraphQL queries:<h2 id="69e9">Get⬇️</h2>As per the schema, you can Get the embedding values which display the values stored inside the schema which is in the form of JSON. The command for the same is shown below:<div id="0cac"><pre>{ Get { PDF { embedded_values } } }</pre></div>Sample response:<figure id="40ed"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*IHvgR0HLJXQNuBZL"><figcaption>Weviate Sample Query | Skanda Vivek</figcaption></figure>Learn more about the Get function <a href="https://weaviate.io/developers/weaviate/api/graphql/get">here</a>.<h2 id="4f91">Aggregate➕</h2>It’s easy to perform aggregate functions like mean, median, mode, count, etc., on numeric columns. Other datatypes like String, Boolean, and Text are also supported. A sample query to perform an aggregate function is shown below:<div id="ec50"><pre>{Aggregate { PDF { embedded_values { count maximum mean median minimum mode sum type } } } }</pre></div>Sample response:<figure id="3f4a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*YpK6VYaXL3y7WtI4"><figcaption>Weviate Sample Query | Skanda Vivek</figcaption></figure>Other available functions for different data types can be found in the <a href="https://weaviate.io/developers/weaviate/api/graphql/aggregate">Weaviate GrpahQL Aggregate docs</a>.<h1 id="cdfd">Retrieve the closest match</h1>As I talked about earlier — the key is finding the closest match(es) to the user input. You can run weaviate hybrid search on the schema, using Langchain, and OpenAI to return the closest match.First, run a <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">recursive text splitter</a> on the extracted text. Then, the ChatPromptTemplate is used for creating chat prompts, and the SystemMessagePromptTemplate and HumanMessagePromptTemplate modules are used for generating prompts for system and human messages, respectively. These modules aid in creating interactive chat components using Langchain.<div id="a1e5"><pre>from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain.vectorstores import Weaviate from langchain.chains import RetrievalQAWithSourcesChain from langchain import OpenAI

from langchain.prompts.chat import ( ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, )

import streamlit as st

@st.cache_data def chat_with_pdf(_text, _embeddings, chat_prompt): text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0) texts = text_splitter.split_text(_text)

system_template=<span class="hljs-string">"""Use the following pieces of context to answer the users question. 


Begin!
----------------
{summaries}
"""</span>

messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template(<span class="hljs-string">"{question}"</span>)
]
prompt = ChatPromptTemplate.from_messages(messages)</pre></div><p id="8333">The <a href="https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa">RetrievalQAWithSourcesChain</a> is used for Question-Answering(QA) using sources of information. Finally, pass the prompt to the `get_chain` function that searches using `docsearch`. The response is the answer to the prompt given by the user based on the most relevant document.</p><div id="a497"><pre>docsearch = Weaviate.from_texts(
texts,
_embeddings,
weaviate_url=st.secrets[<span class="hljs-string">"WEAVIATE_CLUSTER_URL"</span>],
by_text=<span class="hljs-literal">False</span>,
metadatas=[{<span class="hljs-string">"source"</span>: <span class="hljs-string">f"<span class="hljs-subst">{i}</span>-pl"</span>} <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(texts))],

)

def get_chain(store): chain_type_kwargs = {"prompt": prompt} chain = RetrievalQAWithSourcesChain.from_chain_type( OpenAI(temperature=0), chain_type="stuff", retriever=store.as_retriever(), chain_type_kwargs=chain_type_kwargs, reduce_k_below_max_tokens=True ) return chain

chain = get_chain(docsearch)

# chain = RetrievalQAWithSourcesChain.from_chain_type( # OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever() # )

response = chain( {"question": chat_prompt}, return_only_outputs=True, )

print(response["answer"]) return response["answer"]</pre></div>Code Link:<a href="https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/chat.py">https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/chat.py</a><h1 id="a107">Developing the UI (Frontend):</h1>Now that you are done creating the app, its time to bring it to life!<figure id="cac7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*kGK-iOAcUX1G8drRr22pCg.png"><figcaption><a href="https://tenor.com/view/its-alive-young-frankenstein-gif-18700938">https://tenor.com/view/its-alive-young-frankenstein-gif-18700938</a></figcaption></figure><a href="https://streamlit.io">Streamlit</a> helps build and share web apps in just a few minutes. The advantages of using streamlit are:<ul><li>Customizable pre-built components for all our needs</li><li>Instant deployment</li><li>Responsive design</li><li>High compatibility with different libraries</li><li>Pre-built components</li></ul>For building UI close to the one shown in the image above, you can easily use components like:<ul><li><a href="https://docs.streamlit.io/library/api-reference/widgets/st.file_uploader">File uploader</a></li><li><a href="https://docs.streamlit.io/library/api-reference/text/st.hea

Options

der">Header</a></li><li><a href="https://docs.streamlit.io/library/api-reference/status/st.warning">Warnings</a></li><li><a href="https://docs.streamlit.io/library/api-reference/chat/st.chat_input">Chat input (textbox)</a></li><li><a href="https://docs.streamlit.io/library/api-reference/text/st.text">Text display</a></li><li><a href="https://github.com/AI-Yash/st-chat">Chat interface</a></li><li><a href="https://docs.streamlit.io/streamlit-community-cloud/get-started/deploy-an-app/connect-to-data-sources/secrets-management">Secrets</a></li></ul>Structure your streamlit application by setting the title and use the <a href="https://docs.streamlit.io/library/api-reference/widgets/st.file_uploader">file uploader component</a> that accepts a PDF as input and scrapes and returns the text.<div id="b5ef"><pre># External libraries import streamlit as st from streamlit_chat import message import time from PyPDF2 import PdfReader

# Internal file imports from extract import extract_text from embeddings import create_embeddings from store import store_embeddings from qa import search_qa from chat import chat_with_pdf

# Start of streamlit application st.title("PDF QA Bot using Langchain")

# Intitialization st.header("File upload") file = st.file_uploader("Choose a file (PDF)", type="pdf", help="file to be parsed")

if file is not None: # @st.cache_data data = extract_text(file) # st.text(data, help="Extracted text from uploaded pdf")</pre></div>Then run the create_embeddings function in the embeddings.py file.<div id="68b2"><pre># Create, display, search and query the embeddings st.header("Create Embeddings") texts, embeddings, embeds_df = create_embeddings( data, st.secrets["OPENAI_API_KEY"] ) st.text("Created successfully...")

if store_embeddings(embeds_df): st.success("Data saved successfully...", icon="✅") else: st.error("Operation not successful. Please reach out to support...", icon="❌")

else: st.error("Upload the file to proceed further", icon="🚨")</pre></div>Last but not least is the chat component. The user input is passed as arguments along with scrapped text and embeddings to return the response from the LLM we created.<div id="23c1"><pre># Chat component to chat with your uploaded PDF st.header("Chat with your PDF...") if 'generated' not in st.session_state: st.session_state['generated'] = []

if 'past' not in st.session_state: st.session_state['past'] = []

def get_text(): input_text = st.chat_input("Enter your query to retrieve answer from my knowledge...") return input_text

user_input = get_text()

if user_input: output = chat_with_pdf(data, embeddings, user_input)

st.session_state.past.append(user_input)
st.session_state.generated.append(output)

if st.session_state['generated']:

<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(st.session_state[<span class="hljs-string">'generated'</span>])-<span class="hljs-number">1</span>, -<span class="hljs-number">1</span>, -<span class="hljs-number">1</span>):
    message(st.session_state[<span class="hljs-string">"generated"</span>][i], key=<span class="hljs-built_in">str</span>(i))
    message(st.session_state[<span class="hljs-string">'past'</span>][i], is_user=<span class="hljs-literal">True</span>, key=<span class="hljs-built_in">str</span>(i) + <span class="hljs-string">'_user'</span>)</pre></div><p id="7fc3">Finally, we are all done. Let’s run the app locally:</p><div id="f6ba"><pre>streamlit run app/main.py</pre></div><p id="5632">This opens a new environment in <a href="http://localhost:8501">http://localhost:8501</a> that renders the app. If this works as required, it’s time to take the application live and make it available for users all over the Internet!</p><p id="7576">Commit the changes to reflect in the GitHub repository. You can now head to the online Streamlit app, link to the GitHub repo — and reboot the application to see the changes.</p><figure id="36f8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*L5Yd6-zQcCr9y8ltC3dmPg.png"><figcaption><a href="https://github.com/LLM-Projects/docs-qa-bot">Docs QA Bot</a> | Skanda Vivek</figcaption></figure><h1 id="b8da">Conclusion</h1><p id="c74c">This article showcases how to harness the power of LLMs on your PDF documents using tools and techniques that are now becoming a staple of building customized LLM-based solutions. The key aspects covered are:</p><ul><li>LLMs (Using ChatGPT/GPT-4)</li><li>Vectorization and databases (Using Weaviate)</li><li>Prompt engineering/chaining (Using Langchain)</li><li>Frontend and UI (Using Streamlit)</li></ul><p id="49ab">While this space is growing so fast with new LLMs, database solutions, and prompt engineering techniques coming out every day — these can be used in a modular fashion and swapped out accordingly. With slight modifications, you can also build a similar chatbot for other file formats like CSV, Docs, PPTX, etc.</p><p id="22e3">I hope this helps you build a real-life application or prototype with minimal overhead. I can’t wait to see what you all build! The entire GitHub repo is below:</p><div id="4b8f" class="link-block">
      <a href="https://github.com/LLM-Projects/docs-qa-bot">
        <div>
          <div>
            <h2>GitHub - LLM-Projects/docs-qa-bot</h2>
            <div><h3>Contribute to LLM-Projects/docs-qa-bot development by creating an account on GitHub.</h3></div>
            <div><p>github.com</p></div>
          </div>
          <div>
            <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*QhX1TWWuwkGj21Cy)"></div>
          </div>
        </div>
      </a>
    </div><p id="a0e1"><i>If you like this post, <a href="https://skanda-vivek.medium.com/">follow me </a>— I write on Generative AI in real-world applications and, more generally, on the intersections between data and society.</i></p><p id="9ee6"><i>Feel free to connect with me on <a href="https://www.linkedin.com/in/skanda-vivek-01619311b/">LinkedIn</a>!</i></p><p id="80bf"><b>Here are some related articles:</b></p><p id="7cd6"><a href="https://readmedium.com/when-should-you-fine-tune-llms-2dddc09a404a">When Should You Fine-Tune LLMs?</a></p><p id="a506"><a href="https://readmedium.com/llm-economics-chatgpt-vs-open-source-dfc29f69fec1">LLM Economics: ChatGPT vs Open-Source</a></p><p id="4c08"><a href="https://readmedium.com/4-crucial-factors-for-evaluating-large-language-models-in-industry-applications-f0ec8f6d4e9e">4 Crucial Factors for Evaluating Large Language Models in Industry Applications</a></p><p id="5260"><a href="https://skanda-vivek.medium.com/deploying-open-source-llms-as-apis-ec026e2187bc">Deploying Open-Source LLMs As APIs</a></p><p id="831a"><a href="https://readmedium.com/how-do-you-build-a-chatgpt-powered-app-89c83f3e2143">How Do You Build A ChatGPT-Powered App?</a></p><p id="ee64"><a href="https://readmedium.com/extractive-vs-generative-q-a-which-is-better-for-your-business-5a8a1faab59a">Extractive vs Generative Q&amp;A — Which is better for your business?</a></p><p id="d52d"><a href="https://readmedium.com/fine-tune-transformer-models-for-question-answering-on-custom-data-513eaac37a80">Fine-Tune Transformer Models For Question Answering On Custom Data</a></p><p id="e804"><a href="https://readmedium.com/unleashing-the-power-of-generative-ai-for-your-customers-70297f1c9698">Unleashing the Power of Generative AI For Your Customers</a></p><p id="9b2f"><a href="https://readmedium.com/deploy-llms-using-azure-ml-804c40f8635e">Deploy LLMs Using Azure ML</a></p></article></body>

Build A Custom AI Based ChatBot Using Langchain, Weviate, and Streamlit

A comprehensive guide to building a customized chatbot using Generative AI, a popular vector database, prompt chaining, and UI tools

As multiple organizations are racing to build customized LLMs, a common question I have been asked is — what are the tools out there to streamline this process?

In this article, I show you how to build a fully functional application for engaging in conversations through a chatbot built on top of your documents. This application employs the power of ChatGPT/GPT-4 (or any other large language model) to extract information from document data stored as embeddings in a vector database, and Langchain for prompt chaining. Here’s a preview:

So let’s dive in!

Building the app🏗️

First, create a new folder named `app` where the source code for the application resides. This acts as the entry point for the streamlit application. Then create folders that perform different tasks like extracting text from PDF, creating text embeddings, storing embeddings, and finally — chatting. The `app` directory looks like this:

PDF Upload

Upload a PDF and extract text for further processing.

from PyPDF2 import PdfReader
import streamlit as st

@st.cache_data()
def extract_text(_file):
    """
        :param file: the PDF file to extract
    """
    content = ""
    reader = PdfReader(_file)
    number_of_pages = len(reader.pages)

    # Scrape text from multiple pages
    for i in range(number_of_pages):
        page = reader.pages[i]
        text = page.extract_text()
        content = content + text

    return content

Code Link:

https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/extract.py

Generate Text Embeddings

Vectorizing is converting data into dimensions. Vector embeddings assign values to PDF text. Vector embeddings speed up the process of comparing similar texts, and assign similarity scores based on vector similarity metrics like cosine similarity, nearest neighbor search, etc. Here, we use embeddings to find the documents that are most similar to the user input, so that the answers generated are the most valid. This is called Retrieval Augmented Generation (RAG) — I’ve written a blog on RAG if you want to learn more:

Build Industry-Specific LLMs Using Retrieval Augmented Generation

Organizations are in a race to adopt Large Language Models. Let’s dive into how you can build industry-specific LLMs…

towardsdatascience.com

We use OpenAI embeddings provided by Langchain. The PDF text is often too large for our OpenAI API key to handle. One common solution is breaking the text into smaller chunks in the form of documents that are easier to store and work with.

We then store the embeddings in a vector database. The advantage of a vector database over a traditional database is that it speeds up standard vector tasks like comparing/filtering using cosine similarity or other vector metrics.

import pandas as pd
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
import streamlit as st
@st.cache_data
def create_embeddings(text, OPENAI_API_KEY):
    text_splitter = CharacterTextSplitter(        
        separator = "\n",
        chunk_size = 1000,
        chunk_overlap  = 200,
        length_function = len,
    )

    texts = text_splitter.create_documents([text])

    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
    embeddings_query_result = []

    for i in range(len(texts)):
        query_result = embeddings.embed_query(texts[i].page_content)
        embeddings_query_result = embeddings_query_result + query_result
embeddings_df = pd.DataFrame(embeddings_query_result)
embeddings_df.columns = ["embedded_values"]
return texts, embeddings, embeddings_df

Store the embeddings in a vector store

Now that we are done with generating the embeddings, we now need to store them in the vector database. In our case, we are using Weaviate. The reason for choosing weaviate are:

Open Source
Multiple search options like vector, hybrid, and generative searches
A vibrant community of developers and support

First, create a weaviate client to initiate the connection with our weaviate cluster — note that the cluster type can either be authenticated or not authenticated.

client = weaviate.Client(
    url=st.secrets["WEAVIATE_CLUSTER_URL"],
    # auth_client_secret=weaviate.AuthApiKey(
    #     api_key=st.secrets["WEAVIATE_AUTH_KEY"]
    # ),
)

After this, create a new schema. This is essential to store the data to weaviate. It is a JSON object that contains keys like classes, vectorizers, and properties.

schema = {
    "classes": [
        {
            "class": "PDF",
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizeClassName": False,
                    "vectorizePropertyName": False,
                }
            },
            "vectorizer": "text2vec-openai",
            "properties": [
                {
                    "name": "embeddings",
                    "dataType": ["number"],
                    "moduleConfig": {
                        "text2vec-transformers": {
                            "skip": False,
                            "vectorizePropertyName": False,
                            "vectorizeClassName": False,
                        }
                    },
                }
            ],
        }
    ]
}

client.schema.create(schema)
print('Schema created...')

Finally, store the generated embeddings in the weaviate DB. For this, you need to configure the client to perform batch processing that can handle large amounts of data.

client.batch.configure(
    batch_size=10,
    dynamic=True,
    timeout_retries=3,
)

for i in range(0, len(embeddings)):
    item = embeddings.iloc[i]

    pdf_obj = {
        "embedded_values": item["embedded_values"],
    }

    try:
        client.batch.add_data_object(pdf_obj, "PDF")
    except BaseException as error:
        print("Import Failed at: ", i)
        print("An exception occurred: {}".format(error))
        # Stop the import on error
        break

    print("Status: ", str(i) + "/" + str(len(embeddings) - 1))

client.batch.flush()
print("Job done...")
return True

Code Link:

https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/store.py

Decoding the Schema📉

Weaviate supports Graph query language (GraphQL) that supports that is built over Graph data structures hence faster data access rates and updation. Like any other query language, you can perform Get, Aggregate, and Explore on the Weaviate schema. These can be easily performed on the Weaviate console that provides prebuilt UI to test the GraphQL queries:

Get⬇️

As per the schema, you can `Get` the embedding values which display the values stored inside the schema which is in the form of JSON. The command for the same is shown below:

{
Get {
  PDF {
    embedded_values
      }
    }
 }

Sample response:

Learn more about the `Get` function here.

Aggregate➕

It’s easy to perform aggregate functions like mean, median, mode, count, etc., on numeric columns. Other datatypes like String, Boolean, and Text are also supported. A sample query to perform an aggregate function is shown below:

{Aggregate {
PDF {
embedded_values {
count
maximum
mean
median
minimum
mode
sum
type
      }
    }
  }
}

Sample response:

Other available functions for different data types can be found in the Weaviate GrpahQL Aggregate docs.

Retrieve the closest match

As I talked about earlier — the key is finding the closest match(es) to the user input. You can run weaviate hybrid search on the schema, using Langchain, and OpenAI to return the closest match.

First, run a recursive text splitter on the extracted text. Then, the `ChatPromptTemplate` is used for creating chat prompts, and the `SystemMessagePromptTemplate` and `HumanMessagePromptTemplate` modules are used for generating prompts for system and human messages, respectively. These modules aid in creating interactive chat components using Langchain.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Weaviate
from langchain.chains import RetrievalQAWithSourcesChain
from langchain import OpenAI

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

import streamlit as st

@st.cache_data
def chat_with_pdf(_text, _embeddings, chat_prompt):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(_text)

    system_template="""Use the following pieces of context to answer the users question. 


    Begin!
    ----------------
    {summaries}
    """

    messages = [
        SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template("{question}")
    ]
    prompt = ChatPromptTemplate.from_messages(messages)

The RetrievalQAWithSourcesChain is used for Question-Answering(QA) using sources of information. Finally, pass the prompt to the `get_chain` function that searches using `docsearch`. The response is the answer to the prompt given by the user based on the most relevant document.

docsearch = Weaviate.from_texts(
    texts,
    _embeddings,
    weaviate_url=st.secrets["WEAVIATE_CLUSTER_URL"],
    by_text=False,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)

def get_chain(store):
    chain_type_kwargs = {"prompt": prompt}
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        OpenAI(temperature=0), 
        chain_type="stuff", 
        retriever=store.as_retriever(),
        chain_type_kwargs=chain_type_kwargs,
        reduce_k_below_max_tokens=True
    )
    return chain

chain = get_chain(docsearch)


# chain = RetrievalQAWithSourcesChain.from_chain_type(
#     OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever()
# )

response = chain(
    {"question": chat_prompt},
    return_only_outputs=True,
)

print(response["answer"])
return response["answer"]

Code Link:

https://github.com/LLM-Projects/docs-qa-bot/blob/main/app/chat.py

Developing the UI (Frontend):

Now that you are done creating the app, its time to bring it to life!

https://tenor.com/view/its-alive-young-frankenstein-gif-18700938

Streamlit helps build and share web apps in just a few minutes. The advantages of using streamlit are:

Customizable pre-built components for all our needs
Instant deployment
Responsive design
High compatibility with different libraries
Pre-built components

For building UI close to the one shown in the image above, you can easily use components like:

Structure your streamlit application by setting the title and use the file uploader component that accepts a PDF as input and scrapes and returns the text.

# External libraries
import streamlit as st
from streamlit_chat import message
import time
from PyPDF2 import PdfReader

# Internal file imports
from extract import extract_text
from embeddings import create_embeddings
from store import store_embeddings
from qa import search_qa
from chat import chat_with_pdf

# Start of streamlit application
st.title("PDF QA Bot using Langchain")

# Intitialization
st.header("File upload")
file = st.file_uploader("Choose a file (PDF)", type="pdf", help="file to be parsed")

if file is not None:
    # @st.cache_data
    data = extract_text(file)
    # st.text(data, help="Extracted text from uploaded pdf")

Then run the `create_embeddings` function in the `embeddings.py` file.

# Create, display, search and query the embeddings
st.header("Create Embeddings")
texts, embeddings, embeds_df = create_embeddings(
    data, st.secrets["OPENAI_API_KEY"]
)
st.text("Created successfully...")

if store_embeddings(embeds_df):
    st.success("Data saved successfully...", icon="✅")
else:
    st.error("Operation not successful. Please reach out to support...", icon="❌")

else:
st.error("Upload the file to proceed further", icon="🚨")

Last but not least is the chat component. The user input is passed as arguments along with scrapped text and embeddings to return the response from the LLM we created.

# Chat component to chat with your uploaded PDF
st.header("Chat with your PDF...")
if 'generated' not in st.session_state:
    st.session_state['generated'] = []

if 'past' not in st.session_state:
    st.session_state['past'] = []

def get_text():
    input_text = st.chat_input("Enter your query to retrieve answer from my knowledge...")
    return input_text

user_input = get_text()

if user_input:
    output = chat_with_pdf(data, embeddings, user_input)

    st.session_state.past.append(user_input)
    st.session_state.generated.append(output)

if st.session_state['generated']:

    for i in range(len(st.session_state['generated'])-1, -1, -1):
        message(st.session_state["generated"][i], key=str(i))
        message(st.session_state['past'][i], is_user=True, key=str(i) + '_user')

Finally, we are all done. Let’s run the app locally:

streamlit run app/main.py

This opens a new environment in http://localhost:8501 that renders the app. If this works as required, it’s time to take the application live and make it available for users all over the Internet!

Commit the changes to reflect in the GitHub repository. You can now head to the online Streamlit app, link to the GitHub repo — and reboot the application to see the changes.

Conclusion

This article showcases how to harness the power of LLMs on your PDF documents using tools and techniques that are now becoming a staple of building customized LLM-based solutions. The key aspects covered are:

LLMs (Using ChatGPT/GPT-4)
Vectorization and databases (Using Weaviate)
Prompt engineering/chaining (Using Langchain)
Frontend and UI (Using Streamlit)

While this space is growing so fast with new LLMs, database solutions, and prompt engineering techniques coming out every day — these can be used in a modular fashion and swapped out accordingly. With slight modifications, you can also build a similar chatbot for other file formats like CSV, Docs, PPTX, etc.

I hope this helps you build a real-life application or prototype with minimal overhead. I can’t wait to see what you all build! The entire GitHub repo is below:

GitHub - LLM-Projects/docs-qa-bot

Contribute to LLM-Projects/docs-qa-bot development by creating an account on GitHub.

github.com

If you like this post, follow me — I write on Generative AI in real-world applications and, more generally, on the intersections between data and society.

Feel free to connect with me on LinkedIn!

Here are some related articles:

When Should You Fine-Tune LLMs?

LLM Economics: ChatGPT vs Open-Source

4 Crucial Factors for Evaluating Large Language Models in Industry Applications

Deploying Open-Source LLMs As APIs

How Do You Build A ChatGPT-Powered App?

Extractive vs Generative Q&A — Which is better for your business?

Fine-Tune Transformer Models For Question Answering On Custom Data

Unleashing the Power of Generative AI For Your Customers

Deploy LLMs Using Azure ML