Ankur Garg

Summary

The article provides a comprehensive guide on extracting data from PDF forms using Python, specifically the PyPDF2 library, by understanding the PDF document object model and differentiating between XFA-based and Acroform PDF forms.

Abstract

The article delves into the process of data extraction from PDF forms, a common task for data scientists dealing with unstructured data in PDF format. It focuses on the use of the PyPDF2 library in Python to handle this task. The author, who works in the financial sector, found existing resources inadequate for extracting data from PDF forms and thus created this guide. The article explains the structure of PDF files, the two primary types of PDF forms (XFA-based and Acroforms), and provides code examples to extract data from each type. For XFA-based forms, the article illustrates accessing the embedded XML structure, while for Acroforms, it demonstrates the use of PyPDF2's getFormTextFields() method. The guide aims to simplify the process of data extraction for analysts and developers working with PDF forms.

Opinions

  • The author emphasizes the importance of understanding the object model of PDF documents for effective data mining.
  • PyPDF2 is highlighted as a preferred tool due to its Pure-Python nature, allowing it to run on any Python platform without dependencies.
  • The article suggests that the versatility and popularity of PDF format stem from its structured object model, which ensures consistent rendering across various platforms.
  • The author expresses a need for a comprehensive guide on extracting data from PDF forms, indicating a gap in available resources prior to this article.
  • The use of iText RUPS to visualize the PDF object model is recommended as a way to better understand the structure of PDF files.
  • The author advocates for the simplicity of parsing XML data from XFA-based PDF forms once the XML structure is accessed.
  • The article concludes by asserting that with the right understanding and tools, extracting data from PDF forms can be straightforward and enjoyable.

How to Extract Data from PDF Forms Using Python

Understanding the Object Model of PDF Documents for Data Mining

Photo by Leon Dewiwje on Unsplash

Introduction

PDF, or Portable Document Format, is one of the most common file formats in use today. It is widely used across enterprises, government offices, healthcare and other industries. As a result, a large body of unstructured data exists in PDF format, and extracting and analysing this data to generate meaningful insights is a common task for data scientists.

I work for a financial institution and recently came across a situation where we had to extract data from a large volume of PDF forms. While there is a good body of work describing simple text extraction from PDF documents, I struggled to find a comprehensive guide to extracting data from PDF forms. My objective in writing this article is to provide such a guide.

There are several Python libraries dedicated to working with PDF documents, some more popular than others. I will be using PyPDF2 for this article. PyPDF2 is a pure-Python library built as a PDF toolkit. Being pure Python, it can run on any Python platform without any dependencies or external libraries. You can install it with pip by running the command below.

pip install PyPDF2

Once you have installed PyPDF2, you should be all set to follow along. We will take a quick look at the structure of PDF files, as it will help us better understand the programmatic basis of extracting data from PDF forms. I will briefly discuss the two types of PDF forms that are widely used. We will then jump right into examples that extract data from each of the two types.

Structure of a PDF file

Instead of looking at a PDF document as a monolith, look at it as a collection of objects arranged in a set pattern. If you open a PDF file in a text editor such as Notepad, the content may not make much sense and may appear to be junk. However, if you use a tool that provides low-level access to PDF objects, you can see and appreciate the underlying structure. For example, look at Figure 1 below, for which I used iText RUPS to open a simple PDF document. The image on the left shows the document as opened in a reader application (Acrobat Reader). The middle image displays the low-level object model of this document as rendered by iText RUPS. The image on the right shows the data stream that captures the content of the first page of the PDF. As you can see, the object model (middle image) has a set pattern and encapsulates all of the metadata needed to render the document independently of the software, hardware, operating system and so on. This structure is what makes PDF so versatile and popular.

Figure 1 — Structure of a PDF File
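To make the idea of "a collection of objects" concrete, here is a minimal sketch that lists the objects in a tiny hand-written PDF body using only the standard library. The sample bytes, the function name list_objects and the regular expressions are illustrative assumptions. Real PDFs use compressed streams, nested dictionaries and cross-reference tables, which need a proper parser such as PyPDF2.

```python
import re

# A tiny hand-written PDF body, for illustration only (not a complete, valid file).
SAMPLE_PDF = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj
trailer
<< /Root 1 0 R /Size 4 >>
%%EOF
"""

# Matches "<number> <generation> obj ... endobj" and captures the body in between.
OBJECT_RE = re.compile(rb"(\d+) (\d+) obj\s*(.*?)\s*endobj", re.DOTALL)
# Captures the name that follows the /Type key inside an object body.
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def list_objects(data):
    """Return {object number: value of /Type} for each object found in the bytes."""
    found = {}
    for match in OBJECT_RE.finditer(data):
        number = int(match.group(1))
        type_match = TYPE_RE.search(match.group(3))
        found[number] = type_match.group(1).decode() if type_match else None
    return found


objects = list_objects(SAMPLE_PDF)
print(objects)
```

Each numbered object is a dictionary that points to other objects by number (for example, "2 0 R" refers to object 2), and the trailer names the root object. This is the pattern that a tool like iText RUPS renders as a tree.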

PDF Forms

There are two primary types of PDF forms.

  1. XFA (XML Forms Architecture) based Forms
  2. Acroforms

Adobe (the company that developed the PDF format) has an application called AEM (Adobe Experience Manager) Forms Designer, which enables customers to create and publish PDF forms. Adobe uses the term PDF form to refer to the interactive and dynamic forms created with AEM Forms Designer. These PDF forms are built on Adobe’s XML Forms Architecture (XFA). They can be dynamic in nature and can reflow PDF content based on user input.

There is another type of PDF form, called an Acroform. Acroform is Adobe’s older, original interactive form technology, introduced in 1996 as part of the PDF 1.2 specification. An Acroform combines a traditional PDF that defines the static layout with interactive form fields bolted on top. First, you design the form layout using Microsoft Word, Adobe InDesign, Adobe Illustrator or a similar tool. Then you add the form elements: fields, dropdown controls, checkboxes, script logic and so on.
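In the object model, which the following sections walk through, the difference between the two types shows up as a single key: both keep an ‘/AcroForm’ dictionary in the document catalog, and an XFA form additionally stores its XML under an ‘/XFA’ key inside it. The sketch below classifies a catalog on that basis. It uses plain Python dictionaries as stand-ins for PDF objects, and form_type and the sample catalogs are hypothetical names for illustration only.

```python
def form_type(catalog):
    """Classify a PDF by the form technology its catalog dictionary advertises."""
    acroform = catalog.get("/AcroForm")
    if acroform is None:
        return "no form"
    if "/XFA" in acroform:
        return "XFA"
    return "AcroForm"


# Hypothetical stand-ins for the catalog dictionaries of three PDF files.
xfa_catalog = {
    "/Type": "/Catalog",
    "/AcroForm": {"/Fields": [], "/XFA": ["datasets", "<stream>"]},
}
acro_catalog = {
    "/Type": "/Catalog",
    "/AcroForm": {"/Fields": [{"/T": "name", "/V": "Jane"}]},
}
plain_catalog = {"/Type": "/Catalog"}

for label, catalog in [("xfa", xfa_catalog), ("acro", acro_catalog), ("plain", plain_catalog)]:
    print(label, "->", form_type(catalog))
```

Some files carry both an XFA packet and regular Acroform fields, so a check like this is only a first hint about which extraction approach to try.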

Extracting Data from XFA Based PDF Forms

Figure 2 below shows a screenshot of the XFA based PDF form that we will use as an example for this exercise. This is a Currency Transactions Report form used by banks and other institutions to report certain financial transactions to the regulatory agency. It is a dynamic form in which you can add and remove sections based on the amount of information that needs to be reported. I have partially filled this form with some dummy data.

Figure 2 — XFA Form Example

Figure 3 shows the object model of this form. The XML document shown on the right side of the image is what makes up the XFA. It is stored as the value of the XFA key inside the AcroForm dictionary (look at the object model on the left side of the image). The AcroForm dictionary is a child element of the Catalog dictionary, which in turn is housed inside the Root of this PDF file. All we need to do is use PyPDF2 to access the XML document from the object structure of this file. Once we have access to the XML, it is a simple exercise to parse the XML document and read the values of the various form elements, which can then be stored in a Python list, NumPy array, pandas DataFrame and so on for analysis.

Figure 3 — Object Model of Example XFA
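The nesting described above (Root, then AcroForm, then XFA) amounts to following a fixed path of dictionary keys. Here is a minimal sketch of that walk, with plain Python dictionaries standing in for the PDF objects. The helper get_path and the sample trailer are hypothetical and not part of PyPDF2.

```python
def get_path(root, keys):
    """Follow a sequence of keys through nested dictionaries; return None if any is missing."""
    current = root
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical stand-in for a PDF trailer; in a real file the values are PDF objects.
trailer = {
    "/Root": {
        "/Type": "/Catalog",
        "/AcroForm": {
            "/Fields": [],
            "/XFA": ["template", "<stream 1>", "datasets", "<stream 2>"],
        },
    }
}

xfa = get_path(trailer, ["/Root", "/AcroForm", "/XFA"])
missing = get_path(trailer, ["/Root", "/Outlines"])
print(xfa)
print(missing)
```

A fixed path is quick when you know where a key lives. A recursive search is more forgiving when the exact location of the key is not known in advance.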

Below is the code to extract the XML that makes up this form.

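import PyPDF2 as pypdf  # PdfFileReader and the camelCase calls below are the pre-3.0 PyPDF2 API


def findInDict(needle, haystack):
    """Recursively search nested dictionaries for a key and return its value."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Some PDF objects cannot be resolved; skip them and keep searching
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None


# Keep the file open while reading, because objects are loaded from it on demand
with open('CTRX_filled.pdf', 'rb') as pdfobject:
    pdf = pypdf.PdfFileReader(pdfobject)
    xfa = findInDict('/XFA', pdf.resolvedObjects)
    xml = xfa[7].getObject().getData()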

The code first imports the PyPDF2 library under the alias pypdf. It then defines findInDict, a function that finds an element of a dictionary given its key. You will recall from the discussion above that our XML is embedded inside a dictionary referenced by the key ‘/XFA’. This function helps me navigate the complicated object model of the PDF file, which is basically a set of dictionaries nested inside other dictionaries. After the function definition, I read in the PDF form and create a PdfFileReader object. The resolvedObjects attribute of this class exposes the PDF object model as a set of Python dictionaries. I then invoke the findInDict function to extract the value stored under ‘/XFA’, which is an array as shown in Figure 4 below.

Figure 4 — XFA Array
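An XFA array of this kind alternates packet names with the streams that hold their content, so the entries can be paired up and looked up by name rather than by position. Here is a sketch, assuming that alternating layout. pair_packets and the sample values are hypothetical, and in a real file the second item of each pair would be an IndirectObject rather than a string.

```python
def pair_packets(xfa_array):
    """Pair up [name, stream, name, stream, ...] into a {name: stream} dictionary."""
    names = xfa_array[0::2]    # items at even positions are packet names
    streams = xfa_array[1::2]  # items at odd positions are the matching streams
    return dict(zip(names, streams))


# Hypothetical XFA array; the packet names and their order vary from file to file.
sample_xfa = [
    "preamble", "<stream 0>",
    "config", "<stream 1>",
    "template", "<stream 2>",
    "datasets", "<stream 3>",
    "postamble", "<stream 4>",
]

packets = pair_packets(sample_xfa)
print(list(packets))
print(packets["datasets"])  # in this sample, the same entry as sample_xfa[7]
```

Looking a packet up by name keeps the code working if another file stores its packets in a different order.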

The element at index 7 of this array (the eighth entry, since Python counts from zero) is the actual XML content that makes up the form. It is an IndirectObject, a reference that points to an actual object. Such references help reduce the size of the file when the same object appears in multiple places. The getObject() method used in the last line of the code retrieves the actual object. If the object is a text object, calling str() on it should give you the actual text. Otherwise, the getData() method is needed to retrieve the data from the object. Below is a snapshot of a portion of the XML retrieved in the last line of the code above. You can see some of the dummy address data I entered into the sample form. You can easily parse this data out of the XML and use it for further analysis.

Figure 5 — Snapshot of the XML retrieved from the XFA PDF form
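As a sketch of that parsing step, the standard library's xml.etree.ElementTree can flatten such a document into a dictionary of field values. The XML below is a made-up sample, loosely shaped like form data. The element names in a real XFA form will differ, and leaf_values is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML for illustration; real element names depend on the form.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<form>
  <address>
    <street>123 Main St</street>
    <city>Springfield</city>
    <zip></zip>
  </address>
</form>
"""


def leaf_values(xml_bytes):
    """Return {tag: text} for every element that has no children and non-empty text."""
    root = ET.fromstring(xml_bytes)
    values = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if len(element) == 0 and text:
            # Drop any "{namespace}" prefix that ElementTree puts in front of the tag.
            tag = element.tag.split("}")[-1]
            values[tag] = text
    return values


print(leaf_values(SAMPLE_XML))
```

If a form repeats a section, the same tag appears more than once and a flat dictionary keeps only the last value. In that case, collecting the values into a list per tag would be the safer choice.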

Extracting Data from Acroforms

This one will be relatively easy, as we have already discussed most of the concepts related to the PDF object model in the sections above. Below is a sample income tax form that I will be using as an example. I have put some dummy data in it.

Figure 6 — Acroform Example

Figure 7 below shows the object model of this form.

Figure 7 — Acroform Sample Object Model

The values of individual form fields are referenced by the key ‘/V’, which is embedded inside ‘/Fields’, which in turn is embedded inside ‘/AcroForm’. ‘/AcroForm’ is a child of the root Catalog dictionary of this PDF file. We could reuse the approach from the XFA form and call the ‘findInDict’ function to retrieve the ‘/Fields’ dictionary and then read the values of the individual fields. Fortunately, PyPDF2 provides a more direct way to do this. The PdfFileReader class provides a getFormTextFields() method that returns a dictionary of the form's text fields and their values. Below is the short code. Figure 8 shows the output. The dictionary can easily be converted into a list or a pandas DataFrame for further processing.

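import PyPDF2 as pypdf  # PdfFileReader and getFormTextFields are the pre-3.0 PyPDF2 API

# Keep the file open while the reader pulls the form fields out of it
with open('incometaxform_filled.pdf', 'rb') as pdfobject:
    pdf = pypdf.PdfFileReader(pdfobject)
    fields = pdf.getFormTextFields()
print(fields)

Figure 8 — AcroForm example output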
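To illustrate the conversion step, here is a minimal sketch that turns a dictionary of field values into rows and then into CSV text, using only the standard library. The field names and values are made up. Representing unfilled fields as None is an assumption, and fields_to_rows and rows_to_csv are hypothetical helpers.

```python
import csv
import io

# Hypothetical dictionary, shaped like the output of a form-field reader.
fields = {"first_name": "Jane", "last_name": "Doe", "wages": "50000", "spouse_name": None}


def fields_to_rows(field_dict):
    """Turn {field name: value} into a list of (name, value) rows, with None as ''."""
    return [(name, "" if value is None else str(value)) for name, value in field_dict.items()]


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = fields_to_rows(fields)
print(rows_to_csv(rows))
```

The same list of rows could also be handed to pandas to build a DataFrame when a large batch of forms needs to be analysed together.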

Conclusion

Extracting data from PDF forms is easy once you understand the underlying object model, and PyPDF2 is a powerful library that gives you access to it. Have fun with your data!
