Shweta Lodha


In this article, I’ll walk you through all the steps required to query your PDFs and get responses from them.

Let’s get started by importing the required packages.

Import Required Packages

import langchain
import os
import openai
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import CharacterTextSplitter
import nltk
nltk.download("punkt")

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"  # placeholder: replace with your actual key

Get OpenAI API Key

To get an OpenAI API key, go to https://openai.com/, log in, and generate a key from your account.

Once you have the key, set it in an environment variable (I’m using Windows).

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"
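
Hardcoding the key in a script makes it easy to leak by accident. As a minimal sketch, assuming the key has already been set outside the script (for example with `setx OPENAI_API_KEY "YOUR_KEY"` on Windows), the code can read the variable and fail early if it is missing. The helper name `get_openai_key` is hypothetical and is not part of langchain or openai.

```python
import os


def get_openai_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Hypothetical helper for illustration; not part of langchain or openai.
    """
    key = os.environ.get(var_name, "").strip()
    # Reject both a missing value and the tutorial placeholder.
    if not key or key == "YOUR_KEY":
        raise RuntimeError(
            f"Set the {var_name} environment variable to your real key first."
        )
    return key
```

The returned value could then be passed wherever the walkthrough reads `os.environ['OPENAI_API_KEY']`.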

Load PDF

For loading the PDF file, we can use UnstructuredFileLoader as shown below:

loader = UnstructuredFileLoader('SamplePDF.pdf')
documents = loader.load()

# Optional: to load the file as a list of elements instead, create the loader like this
loader = UnstructuredFileLoader('SamplePDF.pdf', mode='elements')

Split Documents Into Chunks

Once the PDF is loaded, we need to divide the large text into chunks. You can choose the chunk size based on your needs; here I’m using a chunk size of 800 and a chunk overlap of 0.

text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
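
To build intuition for what chunk size and chunk overlap mean, here is a simplified sketch in plain Python. It cuts fixed-width character windows, which is an assumption made for illustration: CharacterTextSplitter itself splits on a separator and merges the pieces, so its chunks will not match these exactly. The function name `split_into_chunks` is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 0) -> List[str]:
    """Cut text into fixed-width windows (illustrative only, not langchain's algorithm)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 2000
chunks = split_into_chunks(sample, chunk_size=800, chunk_overlap=0)
print([len(c) for c in chunks])  # [800, 800, 400]
```

With an overlap of 0, as in this walkthrough, no text is shared between neighbouring chunks. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer spans a chunk boundary.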

Prepare Model And Embeddings

At this point, our data is ready. What remains is to generate embeddings, associate them with the text, select a large language model, and stuff the data into it. All of these steps take just a few lines of code, as shown below:

embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
doc_search = Chroma.from_documents(texts, embeddings)
chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=doc_search)
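
Conceptually, these lines embed each chunk, find the chunks closest to the question, and "stuff" them into a single prompt for the model. The sketch below illustrates that idea in plain Python under loud assumptions: a toy word-count vector stands in for OpenAIEmbeddings, a brute-force cosine search stands in for Chroma, and the prompt wording is invented for illustration. All function names are hypothetical, and this is not how langchain implements the chain.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (brute-force search)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_stuff_prompt(query: str, chunks: List[str], k: int = 2) -> str:
    """Place the retrieved chunks into one prompt; the wording is an assumption."""
    context = "\n\n".join(top_chunks(query, chunks, k))
    return (
        "Use the context below to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
```

In the real pipeline, the resulting prompt is what the language model receives, which is why the number and size of retrieved chunks must fit within the model's context limit.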

Create Query And Get Response

Now we are ready to ask questions and get responses.

query = "What are the effects of homelessness?"
chain.run(query)

On executing the above query, I received this response:

‘ The effects of homelessness can include personal, health, abuse, familial, and societal impacts.’

I hope you find this walkthrough useful.

If anything is unclear, I recommend watching my video recording, which demonstrates this flow end to end.
