How to Create a PDF Bill Extractor Using Python and OpenAI API
In this tutorial, we’ll walk through the process of creating a PDF bill extractor using Python and the OpenAI API. The project reads the text of PDF invoices and asks a language model to pull out specific fields, such as the invoice ID, issue date, and amount. Let’s dive into the step-by-step guide.

Step 1: Install Python on Your Machine
Ensure you have Python installed on your machine. You can download it from the official Python website and follow the installation instructions for your operating system.
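Once Python is installed, a short check can confirm that the interpreter is recent enough. This is a minimal sketch, assuming a floor of Python 3.9 because the pandas 2.2.1 pin used later in this tutorial requires it; the function name and the constant are illustrative, not part of the original project.

```python
import sys

# Assumed minimum: pandas 2.2.1 (pinned later in this tutorial) needs Python 3.9+.
MIN_VERSION = (3, 9)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter meets the assumed minimum version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_recent_enough():
        print(f"Python {current} looks fine for this tutorial.")
    else:
        print(f"Python {current} is older than 3.9; consider upgrading.")
```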
Step 2: Create a Virtual Environment
It’s recommended to work within a virtual environment to manage dependencies. You can create a virtual environment using the following commands:
python -m venv myenv
source myenv/bin/activate # For Unix/Mac
myenv\Scripts\activate # For Windows
Step 3: Set Up OpenAI API Key
Sign up for the OpenAI API and obtain your API key. Set up an environment variable named OPENAI_API_KEY and assign your API key to it.
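After setting the variable with the command in this step, a quick check from Python can confirm that it is visible to the process that will run the app. This is a minimal sketch; the helper name api_key_is_set is hypothetical. Because the helper code in Step 5 calls load_dotenv, putting a line of the form OPENAI_API_KEY=your-api-key in a .env file should work as an alternative to exporting the variable.

```python
import os


def api_key_is_set(env=None):
    """Return True when OPENAI_API_KEY is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if api_key_is_set():
        print("OPENAI_API_KEY is set.")
    else:
        print("OPENAI_API_KEY is missing; export it or add it to a .env file.")
```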
export OPENAI_API_KEY="your-api-key"
Step 4: Install Required Libraries
Create a requirements.txt file with the following dependencies and install them using pip:
aiohttp==3.9.3
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
blinker==1.7.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
dataclasses-json==0.6.4
distro==1.9.0
frozenlist==1.4.1
gitdb==4.0.11
GitPython==3.1.42
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.4
httpx==0.27.0
idna==3.6
Jinja2==3.1.3
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
langchain==0.1.13
langchain-community==0.0.29
langchain-core==0.1.33
langchain-text-splitters==0.0.1
langsmith==0.1.31
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.1
mdurl==0.1.2
multidict==6.0.5
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.14.2
orjson==3.9.15
packaging==23.2
pandas==2.2.1
pillow==10.2.0
protobuf==4.25.3
pyarrow==15.0.2
pydantic==2.6.4
pydantic_core==2.16.3
pydeck==0.8.1b0
Pygments==2.17.2
pypdf==4.1.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rich==13.7.1
rpds-py==0.18.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
SQLAlchemy==2.0.28
streamlit==1.32.2
tenacity==8.2.3
toml==0.10.2
toolz==0.12.1
tornado==6.4
tqdm==4.66.2
typing-inspect==0.9.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
watchdog==4.0.0
yarl==1.9.4
Install the requirements using:
pip install -r requirements.txt
Step 5: Create a Helper Function
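Before writing the helper module, it may be worth confirming that the packages the code imports directly were installed into the active environment. The sketch below relies on the standard library's importlib.metadata; the package list and the find_missing helper are assumptions for illustration, derived from the imports used in this tutorial's code.

```python
from importlib import metadata

# Packages that helpers.py and app.py import directly (PyPI distribution names).
DIRECT_DEPENDENCIES = [
    "langchain",
    "openai",
    "pypdf",
    "pandas",
    "python-dotenv",
    "streamlit",
]


def find_missing(packages):
    """Return the packages from the list that are not installed here."""
    missing = []
    for name in packages:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(DIRECT_DEPENDENCIES)
    print("Missing packages:", missing if missing else "none")
```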
Create a Python file named helpers.py and add the following code:
import ast  # Parse Python literals without executing code
import os  # Import os for system-related functions
import re  # Import re for regular expressions

import openai  # Import OpenAI Python library
import pandas as pd  # Import pandas for data manipulation
from dotenv import find_dotenv, load_dotenv  # Import dotenv for loading environment variables
from langchain.llms import OpenAI  # Import the OpenAI API wrapper
from langchain.prompts import PromptTemplate  # Import PromptTemplate from langchain for generating prompts
from pypdf import PdfReader  # Import the PdfReader from the pypdf library for PDF processing

load_dotenv(find_dotenv())  # Load environment variables from .env file if present
openai.api_key = os.getenv("OPENAI_API_KEY")  # Set OpenAI API key from environment variable


# Define a function to extract text from a PDF file
def get_pdf_text(pdf_doc):
    text = ""
    pdf_reader = PdfReader(pdf_doc)
    for page in pdf_reader.pages:
        text += page.extract_text() or ""  # Pages without extractable text add nothing
    return text


# Define a function to extract data from text using OpenAI API
def extracted_data(pages_data):
    # Define a template for the prompt to be sent to OpenAI API.
    # The keys in the expected output match the DataFrame columns in create_docs;
    # the example values are illustrative placeholders.
    template = """Extract all the following values : Invoice ID, DESCRIPTION, Issue Date,
    UNIT PRICE, AMOUNT, Bill For, From and Terms from: {pages}
    Expected output: remove any dollar symbols {{'Invoice ID': '1001329','DESCRIPTION': 'Office Chair','Issue Date': '5/4/2023','UNIT PRICE': '1100.00','AMOUNT': '1100.00', 'Bill For': 'james', 'From': 'excel company', 'Terms': 'pay this now'}}
    """
    # Create a PromptTemplate object with the template
    prompt_template = PromptTemplate(input_variables=["pages"], template=template)
    # Create an OpenAI object for language model interaction
    llm = OpenAI(temperature=0.7)
    # Format the prompt with pages_data and send it to the OpenAI API
    full_response = llm.invoke(prompt_template.format(pages=pages_data))
    return full_response


# Define a function to create documents from uploaded PDFs
def create_docs(user_pdf_list):
    # Create a pandas DataFrame to store extracted data
    df = pd.DataFrame(
        {
            "Invoice ID": pd.Series(dtype="int"),
            "DESCRIPTION": pd.Series(dtype="str"),
            "Issue Date": pd.Series(dtype="str"),
            "UNIT PRICE": pd.Series(dtype="str"),
            "AMOUNT": pd.Series(dtype="int"),
            "Bill For": pd.Series(dtype="str"),
            "From": pd.Series(dtype="str"),
            "Terms": pd.Series(dtype="str"),
        }
    )
    # Iterate through each uploaded PDF file
    for filename in user_pdf_list:
        raw_data = get_pdf_text(filename)  # Extract text from the PDF file
        llm_extracted_data = extracted_data(raw_data)  # Extract data using OpenAI API
        # Use regex to locate the dictionary in the OpenAI API response
        pattern = r"{(.+)}"  # With re.DOTALL, "." also matches newlines
        match = re.search(pattern, llm_extracted_data, re.DOTALL)
        if not match:
            print("No match found.")
            continue  # Skip this file; there is nothing to add to the DataFrame
        # Convert the matched text to a dictionary without executing it as code
        try:
            data_dict = ast.literal_eval("{" + match.group(1) + "}")
        except (ValueError, SyntaxError):
            print("Could not parse the response as a dictionary.")
            continue
        print(data_dict)
        # Add the extracted data to the DataFrame
        df = pd.concat([df, pd.DataFrame([data_dict])], ignore_index=True)
    return df  # Return the DataFrame containing extracted data
Step 6: Create the Main Application
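Before building the app, the parsing step from create_docs can be tried on its own, without calling the API. The sketch below is an assumption-laden illustration rather than the course's code: parse_llm_dict and the sample reply are hypothetical. It uses ast.literal_eval, which accepts only Python literals, so text returned by the model is never executed.

```python
import ast
import re


def parse_llm_dict(response_text):
    """Parse the text between the first '{' and the last '}' as a dict.

    Returns None when no braces are found or the contents are not a valid
    Python literal, so callers can skip that file instead of crashing.
    """
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


# Made-up reply text for illustration only.
sample_reply = "Here is the data: {'Invoice ID': '1001329', 'AMOUNT': '2200.00'}"
print(parse_llm_dict(sample_reply))
print(parse_llm_dict("Sorry, I could not read this invoice."))
```

Returning None for unparseable replies lets the calling loop decide whether to skip the file or report it to the user.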
Create a Python file named app.py and add the following code:
import streamlit as st  # Import Streamlit library for building web apps
from helpers import create_docs  # Import the helper that builds the DataFrame


# Define the main function for the Streamlit web app
def main():
    # Set page configuration for the web app
    st.set_page_config(page_title="Bill Extractor")
    st.title("Bill Extractor AI Assistant...🤖")  # Display title on the web app
    # Upload Bills
    pdf_files = st.file_uploader(
        "Upload your bills in PDF format only", type=["pdf"], accept_multiple_files=True
    )  # Add a file uploader widget for uploading PDF bills
    extract_button = st.button("Extract bill data...")  # Add a button to trigger data extraction
    if extract_button:  # If the extract button is clicked
        if not pdf_files:  # Nothing to process until at least one PDF is uploaded
            st.warning("Please upload at least one PDF bill first.")
            return
        with st.spinner("Extracting... it takes time..."):  # Display a spinner while extracting data
            data_frame = create_docs(pdf_files)  # Call the create_docs function to extract data
            st.write(data_frame.head())  # Display the first few rows of the extracted data
            # Clean and process the data (remove dollar symbols, convert to float, calculate average)
            data_frame["AMOUNT"] = data_frame["AMOUNT"].replace(r"[\$,]", "", regex=True).astype(float)
            st.write("Average bill amount: ", data_frame["AMOUNT"].mean())  # Display average bill amount
            # Convert the data frame to CSV format and prepare for download
            convert_to_csv = data_frame.to_csv(index=False).encode("utf-8")
            # Add a download button for downloading the data as a CSV file
            st.download_button(
                "Download data as CSV",
                convert_to_csv,
                "CSV_Bills.csv",
                "text/csv",
                key="download-csv",
            )
        st.success("Success!!")  # Display success message once extraction has finished


# Invoking the main function when the script is executed directly
if __name__ == "__main__":
    main()
Final Step: Run the Application
Run the application using the following command:
streamlit run app.py
This will launch the web interface for your PDF bill extractor. Users can upload PDF bills, extract data, and download the extracted information as a CSV file.
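The cleaning step in app.py strips dollar signs and commas from the AMOUNT column before averaging. To see what that pattern does without launching the app, here is a plain-Python sketch of the same idea. It swaps the pandas call for the standard library's re and statistics modules; clean_amount and the sample amounts are made up for illustration.

```python
import re
from statistics import mean


def clean_amount(raw):
    """Strip dollar signs and thousands separators, then convert to float."""
    return float(re.sub(r"[\$,]", "", str(raw)))


# Made-up amounts for illustration only.
raw_amounts = ["$1,100.00", "2200.00", "$75"]
cleaned = [clean_amount(value) for value in raw_amounts]
average = mean(cleaned)
print(cleaned, average)
```

A value that is not numeric after cleaning raises ValueError here, which is the same kind of failure to expect from the astype(float) call in the app when the model returns unexpected text.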
You can test the app with these sample files
Congratulations! You’ve successfully created a PDF bill extractor using Python and the OpenAI API.
Note: The code in this tutorial is part of the “LangChain LLM (GPT-3.5)” course on Udemy. Consider enrolling in the course to learn more about the code and related concepts.





