Amazing lifestyle

Summary

The web content provides a guide on extracting text from PDF files using Python with the PDFPlumber library and saving the extracted data as a CSV file, utilizing Google Colab for execution.

Abstract

The article outlines a method for data scientists to efficiently handle data extraction from PDFs using PDFPlumber, a Python library specifically designed for this purpose. It details the steps to install the necessary package, import required modules, and mount Google Drive to access PDF files stored there. The guide then demonstrates how to select a file, extract text from a PDF page, and finally, save the extracted text into a CSV file using pandas DataFrame. The process is illustrated with code snippets and screenshots, emphasizing the ease and efficiency of the workflow when using PDFPlumber in a Google Colab environment.

Opinions

  • The article suggests that manually copying and pasting data from PDFs is an inefficient and impractical approach for data scientists.
  • It posits that using PDFPlumber simplifies and streamlines interactions with PDF files, implying that it is a superior tool compared to traditional methods.
  • The author conveys that integrating PDFPlumber with Google Colab and Google Drive provides a seamless and powerful setup for handling PDF data extraction tasks.
  • The use of Python code in the browser through Google Colab is highlighted as a beneficial feature for executing the demonstrated tasks.

[Python-Pdfplumber-PDF] Extract Text from PDF and save as CSV.

Data scientists often have to work with information that is stored in PDFs. Some simply copy and paste the data they need, but that is the slowest and least effective way to work in the long run, and it might not even be possible with some PDFs. Using pdfplumber makes working with PDFs easier and smoother.

The code in this guide runs in Google Colab, which lets you write and execute Python code in the browser.

1. Installing and Importing Packages

!pip install pdfplumber -q

Importing packages:

import pdfplumber
from google.colab import drive
import os

# pdfplumber: extracts text from PDF files
# os: changes and creates directories
# drive: connects to your Google Drive

2. Mounting Google Drive

drive.mount('/content/gdrive')

3. Choose a File

os.chdir("/content/gdrive/MyDrive/Colab Notebooks") 
#going to the place where the reports are
os.getcwd()

/content/gdrive/MyDrive/Colab Notebooks

week_files = os.listdir()
# week_files now holds the names of everything in the current folder
print(week_files)
# printing the result of os.listdir()
# after running this code, the output is ['test-file.pdf']
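
os.listdir() returns every entry in the folder, not only PDFs. If the folder holds other files too, a small filter can help. The following is a minimal sketch; the helper name list_pdf_files is hypothetical and not part of the original walkthrough.

```python
import os


def list_pdf_files(folder="."):
    """Return the sorted names of files in `folder` that end in .pdf."""
    return sorted(
        name
        for name in os.listdir(folder)
        # compare in lowercase so .PDF is matched too, and skip directories
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder, name))
    )


# In the Colab session above this could be used as:
# week_files = list_pdf_files("/content/gdrive/MyDrive/Colab Notebooks")
```
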

pdfplumber.open()

Now let's start working on our PDF. The command pdfplumber.open('path/to/the/file') opens the file and returns a PDF object.

pdf = pdfplumber.open('/content/gdrive/MyDrive/Colab Notebooks/test-file.pdf')

.pages

We use .pages to access the pages of the PDF. It is a list, so we pick a page by putting its index in square brackets. Python counts from 0, so the first page has index 0, not 1.

pdf.pages
# link a variable to the first page
page = pdf.pages[0]
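
Because page numbers in a PDF viewer start at 1 while list indexes start at 0, it is easy to be off by one. The sketch below shows one way to convert between the two; get_page is a hypothetical helper, and it works on any list-like object such as pdf.pages.

```python
def get_page(pages, page_number):
    """Return the page for a 1-based page number (1 is the first page)."""
    if not 1 <= page_number <= len(pages):
        raise IndexError(f"page_number must be between 1 and {len(pages)}")
    # convert the 1-based page number to a 0-based list index
    return pages[page_number - 1]


# With the pdf object opened above, this would be:
# page = get_page(pdf.pages, 1)   # same page as pdf.pages[0]
```
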

.extract_text()

This is the method that extracts the text from the page. Let's assign its result to text and check what it returns.

# extracting the text
text = page.extract_text()

#checking out what the variable text contains now
text
# printing our variable
print(text)
# splitting the text into a list of lines
text.split('\n')
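
The steps so far read only the first page. For a PDF with several pages, the same extract_text() call can be repeated over pdf.pages. The following is a rough sketch under that assumption; extract_all_text and read_pdf_text are hypothetical names, and pages without extractable text are skipped.

```python
def extract_all_text(pages):
    """Join the text of every page, skipping pages with no extractable text."""
    chunks = []
    for page in pages:
        page_text = page.extract_text()
        if page_text:  # None or "" means nothing was extracted
            chunks.append(page_text)
    return "\n".join(chunks)


def read_pdf_text(path):
    """Open a PDF with pdfplumber and return the text of all its pages."""
    import pdfplumber  # imported here so the helper above works without it

    # the with-block closes the file once extraction is done
    with pdfplumber.open(path) as pdf:
        return extract_all_text(pdf.pages)


# Example with the file used in this guide:
# text = read_pdf_text('/content/gdrive/MyDrive/Colab Notebooks/test-file.pdf')
```

Opening the file in a with-block is a design choice here: it releases the file handle even if extraction fails partway through.
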
# import package
import pandas as pd
# convert the list of lines to a DataFrame
records_df = pd.DataFrame(text.split('\n'))
records_df.head()
# save as new_excel.csv
records_df.to_csv("new_excel.csv")
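
If pandas is not needed for anything else, the same result can be approximated with Python's built-in csv module. This is a minimal sketch, not the method used above: the function name save_lines_to_csv and the column names line_number and text are assumptions made for illustration, and blank lines are dropped.

```python
import csv


def save_lines_to_csv(text, csv_path):
    """Write each non-empty line of `text` as a row: (line_number, line)."""
    rows = [
        (number, line)
        for number, line in enumerate(text.split("\n"), start=1)
        if line.strip()  # skip blank lines
    ]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["line_number", "text"])  # header row
        writer.writerows(rows)
    return len(rows)  # number of data rows written


# With the text extracted above:
# save_lines_to_csv(text, "new_excel.csv")
```

The csv writer quotes any line that contains a comma, so each extracted line stays in a single column.
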