How to Create a PDF Bill Extractor Using Python and OpenAI API

In this tutorial, we’ll walk through the process of creating a PDF bill extractor using Python and the OpenAI API. This project will allow you to extract specific information from PDF invoices using artificial intelligence. Let’s dive into the step-by-step guide.

Step 1: Install Python on Your Machine

Ensure you have Python installed on your machine. You can download it from the official Python website and follow the installation instructions for your operating system.
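
Before going further, it can help to confirm that the interpreter is recent enough for the pinned packages in Step 4. The sketch below assumes a minimum of Python 3.9, which is what the pinned pandas 2.2.1 release requires; the helper name meets_minimum is made up for this example.

```python
import sys

# Assumed minimum: pandas 2.2.1 (pinned in Step 4) requires Python 3.9 or newer.
MINIMUM = (3, 9)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old for the pinned packages"
    print("Python", sys.version.split()[0], status)
```

If the check reports an older interpreter, installing a newer Python before creating the virtual environment avoids pip resolution errors later.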

Step 2: Create a Virtual Environment

It’s recommended to work within a virtual environment so this project’s dependencies stay isolated from other projects. Create one, then activate it with the command that matches your operating system:

python -m venv myenv
source myenv/bin/activate  # For Unix/Mac
myenv\Scripts\activate  # For Windows
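
To double-check that the activation worked, one option is to ask Python itself. This is a minimal sketch that relies on the standard sys.prefix and sys.base_prefix attributes; the function name in_virtualenv is only illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtualenv())
```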

Step 3: Set Up OpenAI API Key

Sign up for the OpenAI API and obtain your API key. Then set an environment variable named OPENAI_API_KEY to that key. The command below works on Unix/Mac; on Windows Command Prompt, use set OPENAI_API_KEY=your-api-key instead.

export OPENAI_API_KEY="your-api-key"
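
A quick way to confirm that the variable is visible to Python is sketched below. The helper names get_api_key and mask are hypothetical, and the key is masked so it never appears in full in your terminal.

```python
import os


def get_api_key(env=None):
    """Return the OPENAI_API_KEY value, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in this shell")
    return key


def mask(key):
    """Show only the last four characters of the key."""
    return "*" * max(len(key) - 4, 0) + key[-4:]


if __name__ == "__main__":
    try:
        print("Key found:", mask(get_api_key()))
    except RuntimeError as err:
        print(err)
```

If the script reports that the variable is not set, re-run the export (or set) command in the same terminal session you will use to start the app.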

Step 4: Install Required Libraries

Create a requirements.txt file with the following pinned dependencies:

aiohttp==3.9.3
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
blinker==1.7.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
dataclasses-json==0.6.4
distro==1.9.0
frozenlist==1.4.1
gitdb==4.0.11
GitPython==3.1.42
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.4
httpx==0.27.0
idna==3.6
Jinja2==3.1.3
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
langchain==0.1.13
langchain-community==0.0.29
langchain-core==0.1.33
langchain-text-splitters==0.0.1
langsmith==0.1.31
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.1
mdurl==0.1.2
multidict==6.0.5
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.14.2
orjson==3.9.15
packaging==23.2
pandas==2.2.1
pillow==10.2.0
protobuf==4.25.3
pyarrow==15.0.2
pydantic==2.6.4
pydantic_core==2.16.3
pydeck==0.8.1b0
Pygments==2.17.2
pypdf==4.1.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rich==13.7.1
rpds-py==0.18.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
SQLAlchemy==2.0.28
streamlit==1.32.2
tenacity==8.2.3
toml==0.10.2
toolz==0.12.1
tornado==6.4
tqdm==4.66.2
typing-inspect==0.9.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
watchdog==4.0.0
yarl==1.9.4

Install the requirements using:

pip install -r requirements.txt
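
After the install finishes, you may want to confirm that the packages the code imports directly ended up at the pinned versions. The following sketch uses the standard importlib.metadata module; the DIRECT_DEPENDENCIES subset and the function names are assumptions chosen for this example, with version numbers copied from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages imported directly by helpers.py and app.py (a subset of requirements.txt).
DIRECT_DEPENDENCIES = {
    "langchain": "0.1.13",
    "openai": "1.14.2",
    "pandas": "2.2.1",
    "pypdf": "4.1.0",
    "python-dotenv": "1.0.1",
    "streamlit": "1.32.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(pins=DIRECT_DEPENDENCIES):
    """Map each package name to a (pinned, installed) pair."""
    return {name: (pinned, installed_version(name)) for name, pinned in pins.items()}


if __name__ == "__main__":
    for name, (pinned, found) in report().items():
        status = "ok" if found == pinned else "check"
        print(f"{name:15} pinned={pinned:10} installed={found} [{status}]")
```

Any line marked "check" means the package is missing or at a different version than the one pinned in requirements.txt.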

Step 5: Create the Helper Functions

Create a Python file named helpers.py and add the following code:

import os  # Import os for reading environment variables
import re  # Import re for regular expressions

import openai  # Import OpenAI Python library
import pandas as pd  # Import pandas for data manipulation
from dotenv import find_dotenv, load_dotenv  # Import dotenv for loading environment variables
from langchain.prompts import PromptTemplate  # Import PromptTemplate from langchain for generating prompts
from langchain_community.llms import OpenAI  # Import the OpenAI LLM wrapper
from pypdf import PdfReader  # Import PdfReader from the pypdf library for PDF processing

load_dotenv(find_dotenv())  # Load environment variables from .env file if present
openai.api_key = os.getenv("OPENAI_API_KEY")  # Set OpenAI API key from environment variable

# Define a function to extract text from a PDF file
def get_pdf_text(pdf_doc):
    text = ""
    pdf_reader = PdfReader(pdf_doc)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text

# Define a function to extract data from text using OpenAI API
def extracted_data(pages_data):
    # Define a template for the prompt; doubled braces become literal braces after formatting.
    # The values in the expected output are illustrative samples only.
    template = """Extract all the following values: Invoice ID, DESCRIPTION, Issue Date,
        UNIT PRICE, AMOUNT, Bill For, From and Terms from: {pages}

        Expected output (remove any dollar symbols): {{'Invoice ID': '1001329', 'DESCRIPTION': 'Office Chair', 'Issue Date': '5/4/2023', 'UNIT PRICE': '1100.00', 'AMOUNT': '1100.00', 'Bill For': 'james', 'From': 'excel company', 'Terms': 'pay this now'}}
        """
    # Create a PromptTemplate object with the template
    prompt_template = PromptTemplate(input_variables=["pages"], template=template)
    # Create an OpenAI object; temperature=0 keeps the extraction as repeatable as possible
    llm = OpenAI(temperature=0)
    # Format the prompt with pages_data and send it to the OpenAI API
    full_response = llm.invoke(prompt_template.format(pages=pages_data))

    return full_response

import ast  # Standard library; parses the model's dict literal without executing code


# Define a function to create documents from uploaded PDFs
def create_docs(user_pdf_list):
    # Create an empty pandas DataFrame with one column per extracted field
    df = pd.DataFrame(
        {
            "Invoice ID": pd.Series(dtype="int"),
            "DESCRIPTION": pd.Series(dtype="str"),
            "Issue Date": pd.Series(dtype="str"),
            "UNIT PRICE": pd.Series(dtype="str"),
            "AMOUNT": pd.Series(dtype="int"),
            "Bill For": pd.Series(dtype="str"),
            "From": pd.Series(dtype="str"),
            "Terms": pd.Series(dtype="str"),
        }
    )

    # Iterate through each uploaded PDF file
    for filename in user_pdf_list:
        raw_data = get_pdf_text(filename)  # Extract text from the PDF file
        llm_extracted_data = extracted_data(raw_data)  # Extract data using OpenAI API

        # Use regex to pull the {...} block out of the OpenAI API response.
        # re.DOTALL lets "." match newlines, so the block may span several lines.
        pattern = r"{(.+)}"
        match = re.search(pattern, llm_extracted_data, re.DOTALL)

        if not match:
            print("No match found.")
            continue  # Skip this file instead of reusing stale or undefined data

        try:
            # literal_eval accepts only Python literals, unlike eval(), which runs arbitrary code
            data_dict = ast.literal_eval("{" + match.group(1) + "}")
        except (ValueError, SyntaxError, TypeError):
            print("Could not parse the model response.")
            continue
        print(data_dict)

        # Add the extracted data to the DataFrame
        df = pd.concat([df, pd.DataFrame([data_dict])], ignore_index=True)

    return df  # Return the DataFrame containing extracted data
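
The parsing step in create_docs can be tried on its own, without calling the API. The sketch below assumes the model replies with a Python-style dict literal, as the prompt's expected output asks for; parse_reply and the sample string are made up for this illustration. It uses ast.literal_eval, which only accepts literals and never executes code found in the reply.

```python
import ast
import re


def parse_reply(reply):
    """Return the dict embedded in `reply`, or None if none can be parsed."""
    # Greedy match with DOTALL: from the first "{" through the last "}".
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical model reply, including chatter before the dict and a line break inside it.
sample = "Here you go:\n{'Invoice ID': '1001329', 'AMOUNT': '1100.00',\n 'Bill For': 'james'}"
print(parse_reply(sample))
```

Returning None for anything unparseable lets the caller skip a problematic invoice rather than crash the whole batch.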

Step 6: Create the Main Application

Create a Python file named app.py and add the following code:

import streamlit as st  # Import Streamlit library for building web apps
from helpers import create_docs  # Import only the helper function this app uses

# Define the main function for the Streamlit web app
def main():
    # Set page configuration for the web app
    st.set_page_config(page_title="Bill Extractor")
    st.title("Bill Extractor AI Assistant...🤖")  # Display title on the web app

    # Upload Bills
    pdf_files = st.file_uploader(
        "Upload your bills in PDF format only", type=["pdf"], accept_multiple_files=True
    )  # Add a file uploader widget for uploading PDF bills

    extract_button = st.button("Extract bill data...")  # Add a button to trigger data extraction

    if extract_button:  # If the extract button is clicked
        with st.spinner("Extracting... it takes time..."):  # Display a spinner while extracting data
            data_frame = create_docs(pdf_files)  # Call the create_docs function to extract data
            st.write(data_frame.head())  # Display the first few rows of the extracted data

            # Clean and process the data (remove dollar symbols and commas, convert to float, calculate average)
            data_frame["AMOUNT"] = data_frame["AMOUNT"].replace(r"[\$,]", "", regex=True).astype(float)
            st.write("Average bill amount: ", data_frame["AMOUNT"].mean())  # Display average bill amount

            # Convert the data frame to CSV format and prepare for download
            convert_to_csv = data_frame.to_csv(index=False).encode("utf-8")

            # Add a download button for downloading the data as a CSV file
            st.download_button(
                "Download data as CSV",
                convert_to_csv,
                "CSV_Bills.csv",
                "text/csv",
                key="download-csv",
            )
        st.success("Success!!")  # Display success message once extraction finishes

# Invoking the main function when the script is executed directly
if __name__ == "__main__":
    main()
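
The AMOUNT cleanup in app.py relies on a small regular expression. As a rough illustration of what that pattern does, here is the same cleanup in plain Python with re.sub rather than pandas; clean_amount and the sample values are invented for this sketch.

```python
import re
from statistics import mean


def clean_amount(value):
    """Strip dollar signs and thousands separators, then convert to float."""
    return float(re.sub(r"[\$,]", "", str(value)))


amounts = ["$1,100.00", "250", "$75.50"]  # hypothetical values for illustration
cleaned = [clean_amount(a) for a in amounts]
print("Cleaned amounts:", cleaned)
print("Average bill amount:", mean(cleaned))
```

The pandas call in app.py applies the same substitution to every row of the column at once before converting it to float.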

Final Step: Run the Application

Run the application using the following command:

streamlit run app.py

This will launch the web interface for your PDF bill extractor. Users can upload PDF bills, extract data, and download the extracted information as a CSV file.

You can test the app with your own sample PDF invoices.

Congratulations! You’ve successfully created a PDF bill extractor using Python and the OpenAI API.

Note: The code in this tutorial is part of the “LangChain LLM (GPT-3.5)” course on Udemy. Enroll in the course to learn more about the code and related concepts.