[Python-Pdfplumber-PDF] Extract Text from PDF and save as CSV.

Summary

This article is a guide to extracting text from PDF files in Python with the pdfplumber library and saving the extracted data as a CSV file, using Google Colab to run the code.

Abstract

The article outlines a method for data scientists to efficiently handle data extraction from PDFs using PDFPlumber, a Python library specifically designed for this purpose. It details the steps to install the necessary package, import required modules, and mount Google Drive to access PDF files stored there. The guide then demonstrates how to select a file, extract text from a PDF page, and finally, save the extracted text into a CSV file using pandas DataFrame. The process is illustrated with code snippets and screenshots, emphasizing the ease and efficiency of the workflow when using PDFPlumber in a Google Colab environment.

Opinions

The article suggests that manually copying and pasting data from PDFs is an inefficient and impractical approach for data scientists.
It posits that using PDFPlumber simplifies and streamlines interactions with PDF files, implying that it is a superior tool compared to traditional methods.
The author conveys that integrating PDFPlumber with Google Colab and Google Drive provides a seamless and powerful setup for handling PDF data extraction tasks.
The use of Python code in the browser through Google Colab is highlighted as a beneficial feature for executing the demonstrated tasks.

[Python-Pdfplumber-PDF] Extract Text from PDF and save as CSV.

Data scientists often have to work with information stored in PDFs. Some simply copy and paste the data they need, but that is the slowest and least effective way to work in the long run, and with some PDFs it is not even possible. Using pdfplumber makes working with PDFs much easier and smoother.

Google Colab lets you write and execute Python code in the browser, and it is where the code in this guide is run.
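The installation and Drive-mounting steps are not shown in this copy of the article, so the following is only a sketch of what they typically look like in Colab. It assumes the package is installed with `pip install pdfplumber` and that Drive is mounted at `/content/gdrive`, which matches the path used in step 3. The `mount_drive` helper and the `MOUNT_POINT` name are illustrative and do not come from the original.

```python
# In a Colab cell, the package would typically be installed first with:
#   !pip install pdfplumber

try:
    # google.colab only exists inside a Colab runtime
    from google.colab import drive
except ImportError:
    drive = None

# Assumed mount point, chosen to match the path used later in the article
MOUNT_POINT = "/content/gdrive"


def mount_drive(mount_point=MOUNT_POINT):
    """Mount Google Drive when running in Colab; return True if mounted."""
    if drive is None:
        # Not running in Colab, so there is nothing to mount
        return False
    drive.mount(mount_point)
    return True
```

Outside Colab the helper simply returns `False`, so the rest of a notebook can still be tried locally with a local folder.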

3. Choose a File

import os
os.chdir("/content/gdrive/MyDrive/Colab Notebooks")
#moving to the folder where the PDF reports are stored
os.getcwd()
#shows the current working directory

/content/gdrive/MyDrive/Colab Notebooks

My Drive

Files in My Drive

The PDF file in the Colab Notebooks folder

week_files = os.listdir()
#week_files holds the names of all files in the current folder
print(week_files)
#prints the list returned by os.listdir()
#the result after running this code is ['test-file.pdf']
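The `page` object used in the next step has to come from an opened PDF. A minimal sketch, assuming the usual pdfplumber pattern of `pdfplumber.open()` followed by indexing `.pages`; the helper names `keep_pdfs` and `first_page_text` are made up for illustration and are not part of the original article:

```python
def keep_pdfs(names):
    """Return only the names that end in .pdf, sorted alphabetically."""
    return sorted(name for name in names if name.lower().endswith(".pdf"))


def first_page_text(pdf_path):
    """Open a PDF with pdfplumber and return the text of its first page."""
    # Imported here so that keep_pdfs() works even without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]  # .pages is a list of page objects
        return page.extract_text()
```

With the folder listing from above, `keep_pdfs(week_files)` would filter out any non-PDF files, and `first_page_text("test-file.pdf")` would return the first page's text.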

.extract_text()

This is the method that extracts the text from a page. Let's assign its result to a variable called text and check what it returns.

# extracting the text (page is a single page taken from the opened PDF's .pages list)
text = page.extract_text()

#checking out what the variable text contains now
text

the result after running the code

#printing our variable
print(text)
#splitting the text into a list of lines
text.split('\n')
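Splitting on `'\n'` can leave empty strings and stray whitespace in the list. A small sketch of one way to tidy this up, assuming blank lines carry no information and can be dropped; `clean_lines` is a hypothetical helper, not something the article defines:

```python
def clean_lines(text):
    """Split extracted text into lines, dropping blanks and outer whitespace."""
    if not text:
        # Nothing was extracted (for example, an empty page)
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `clean_lines(text)` in place of `text.split('\n')` would give a list that is ready to load into a table without empty rows.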

the result after running the code

#import package
import pandas as pd
# convert the list of text lines to a one-column dataframe
records_df = pd.DataFrame(text.split('\n'))
records_df.head()
# save as new_excel.csv
records_df.to_csv("new_excel.csv")
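If pandas is not needed for anything else, a similar one-column file can be written with the standard library alone. This is a sketch under the assumption that each text line becomes one CSV row; the column name `line` and the function names are illustrative choices, not the article's:

```python
import csv
import io


def lines_to_csv_text(lines, header="line"):
    """Return CSV text with one header row and one row per input line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    for line in lines:
        # Lines containing commas or quotes are quoted automatically
        writer.writerow([line])
    return buffer.getvalue()


def save_lines_as_csv(lines, csv_path):
    """Write the CSV text to `csv_path` (a hypothetical output location)."""
    # newline="" stops Python from translating the "\n" line endings
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        f.write(lines_to_csv_text(lines))
```

For example, `save_lines_as_csv(text.split('\n'), "new_excel.csv")` would write one row per extracted line, without the index column that pandas adds by default.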

1. Installing and Importing Packages

2. Mounting Google Drive

3. Choose a File

pdfplumber.open()

.pages

.extract_text()