
[Python-Pdfplumber-PDF] Extract Text from PDF and save as CSV.
Data scientists often have to work with information that is stored in PDFs. Some will simply copy and paste the data they need, but that is the slowest and least effective way to work in the long run, and it might not even be possible with some PDFs. Using pdfplumber makes working with PDFs easier and smoother.
We will run the code in Google Colab, which allows you to write and execute Python code in the browser.
1. Installing and Importing Packages
!pip install pdfplumber -q

Importing packages:
import pdfplumber
from google.colab import drive
import os
#PDF Plumber: Extracts text from PDF files
#OS: Changes/creates directories
#Drive: Connects to your Google Drive
2. Mounting Google Drive
drive.mount('/content/gdrive')
3. Choose a File
os.chdir("/content/gdrive/MyDrive/Colab Notebooks")
#going to the place where the reports are
os.getcwd()
#the result is /content/gdrive/MyDrive/Colab Notebooks



week_files = os.listdir()
#week_files now holds the names of all the files in this folder
print(week_files)
#printing the list of file names
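os.listdir() returns every entry in the folder, so in a folder that also holds notebooks or subfolders, week_files would contain more than just PDFs. Here is a minimal sketch of keeping only the PDF names. The helper name list_pdfs is hypothetical and not part of the original notebook:

```python
import os

def list_pdfs(folder="."):
    """Return the sorted names of the PDF files directly inside folder."""
    return sorted(
        name
        for name in os.listdir(folder)
        # compare in lower case so 'REPORT.PDF' is kept as well
        if name.lower().endswith(".pdf")
        # skip subfolders whose name happens to end in .pdf
        and os.path.isfile(os.path.join(folder, name))
    )

# in the notebook this could replace the plain os.listdir() call:
# week_files = list_pdfs()
```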
#the result is ['test-file.pdf'] after running this code.
pdfplumber.open()
Now let's start working on our PDF. The command pdfplumber.open('path/to/the/file') opens the file.
pdf = pdfplumber.open('/content/gdrive/MyDrive/Colab Notebooks/test-file.pdf')
We use .pages to access the pages of the PDF, and we pass the page number as the index.
Python starts counting from 0, so the first page has index 0, not 1.
pdf.pages
# Link a variable to the first page
page = pdf.pages[0]
extract_text() is the function that extracts the text from the page. Let's assign its result to text and check what it returns.
# extracting the text
text = page.extract_text()
#checking out what the variable text contains now
text
#printing our variable
print(text)
#splitting the text into a list of lines
text.split('\n')
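The cells above read only the first page. For a PDF with several pages, one possible approach is to loop over pdf.pages and join the text of each page. The sketch below is an assumption about how you might do this, not part of the original notebook: join_page_text and extract_pdf_text are hypothetical helper names, and the `or ""` guard is there because extract_text() may return nothing for a page without a text layer.

```python
def join_page_text(pages):
    """Join the text of every page, one page after the other."""
    parts = []
    for page in pages:
        # a page without extractable text may yield None or an empty string
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def extract_pdf_text(path):
    """Open the PDF at path and return the text of all its pages."""
    import pdfplumber  # installed earlier with pip

    # the with-block closes the file once we are done with it
    with pdfplumber.open(path) as pdf:
        return join_page_text(pdf.pages)

# hypothetical usage in the notebook:
# text = extract_pdf_text('/content/gdrive/MyDrive/Colab Notebooks/test-file.pdf')
```

Keeping the joining step in its own function means it can be tried on any objects that offer an extract_text() method, without needing a PDF at hand.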
#import packages
import pandas as pd
import csv
# convert the list of lines to a dataframe
records_df = pd.DataFrame(text.split('\n'))
records_df.head()
# save as new_excel.csv
records_df.to_csv("new_excel.csv")
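Note that to_csv also writes the DataFrame index as an extra first column unless index=False is passed. As an alternative sketch, the csv module imported above can write the lines directly, one line of text per row. The function name write_lines_csv, the header name line, and the choice to skip blank lines are assumptions made for illustration:

```python
import csv

def write_lines_csv(text, csv_path):
    """Write each non-empty line of text as one row of a one-column CSV."""
    lines = [line.strip() for line in text.split("\n")]
    lines = [line for line in lines if line]  # drop blank lines
    # newline="" lets the csv module control the line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["line"])  # header row
        for line in lines:
            writer.writerow([line])
    return len(lines)

# hypothetical usage in the notebook:
# write_lines_csv(text, "new_excel.csv")
```

The csv writer quotes any line that itself contains a comma, so each line of the PDF stays in a single cell when the file is opened in a spreadsheet.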



