Big $$$: OCR Scanned PDFs with Pytesseract and ImageMagick
A Step-by-Step Guide for Windows and Mac
In this article, I’m gonna show you how to use Pytesseract and ImageMagick to extract text from scanned PDFs. This technique has been super helpful for me, and I’ve made over 50 grand just by scraping various websites with it. It’s become a key player in my toolkit.
Right now, I’m working on another project using this tech, and I’m hoping it’ll bring in a steady income of at least 10k a month. My partner and I think we’ve found something that doesn’t exist on the market yet, and we’re pretty sure it’ll be a huge hit with customers. It’s a medical app, so it’s definitely something that could make a real difference in people’s lives.
So, if you want to learn how to use Pytesseract and ImageMagick like a pro, keep reading! You never know, it might just help you come up with the next big idea too.
Introduction
Optical Character Recognition (OCR) is a technology that converts scanned documents, images, or PDFs containing text into searchable and editable digital formats. In this article, we will explore how to set up and use Pytesseract, a Python wrapper for Google's Tesseract OCR engine, and ImageMagick, a powerful image processing library (used here through its Python binding, Wand), to OCR scanned PDFs on both Windows and Mac computers.
Set up Pytesseract and ImageMagick
Windows:
Step 1: Download and Install Python
To get started, download and install the latest version of Python from the official website (https://www.python.org/downloads/windows/). Make sure to select the option "Add Python to PATH" during the installation process.
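Step 2: Install Pytesseract and Wand
Open the Command Prompt and run the following command to install the Pytesseract and Wand Python libraries. The script later in this article imports both; Pillow is installed automatically as a Pytesseract dependency:
pip install pytesseract Wand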
Step 3: Install Tesseract OCR
Download the Tesseract OCR installer for Windows from the UB Mannheim wiki (https://github.com/UB-Mannheim/tesseract/wiki), which hosts the Windows builds. Run the installer and follow the installation instructions. After the installation is complete, add the Tesseract installation folder (typically C:\Program Files\Tesseract-OCR) to your system PATH.
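Step 4: Install ImageMagick
Visit the ImageMagick download page (https://imagemagick.org/script/download.php) and download the appropriate Windows binary release. Run the installer and follow the installation instructions. Because the script uses Wand, make sure to select the option "Install development headers and libraries for C and C++" so that Wand can link to ImageMagick; "Install legacy utilities (e.g. convert)" is only needed if you also want the command-line tools. If Wand cannot find ImageMagick afterwards, set the MAGICK_HOME environment variable to the ImageMagick installation folder. ImageMagick also relies on Ghostscript to read PDF files, so install Ghostscript (https://ghostscript.com/) if PDF conversion fails.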
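If adding Tesseract to your PATH is not an option, Pytesseract can also be pointed at the executable directly through its tesseract_cmd setting. The sketch below is one way to do that, not the only one: find_tesseract and configure_pytesseract are hypothetical helper names, and the default install path is an assumption, so adjust it to match your installation.

```python
import shutil
from pathlib import Path

# Assumed default location of the Windows installer; change this if you
# picked a different folder during installation.
DEFAULT_WINDOWS_PATHS = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
)


def find_tesseract(candidates=DEFAULT_WINDOWS_PATHS):
    """Return the path to the tesseract executable, or None if not found."""
    # Prefer whatever is already reachable through PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the candidate locations, in order.
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract():
    """Point pytesseract at the executable located by find_tesseract()."""
    import pytesseract  # imported lazily so find_tesseract() works without it

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once at the top of your script is enough; later calls to pytesseract.image_to_string will use the configured path.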
Mac:
Step 1: Install Homebrew
Homebrew is a package manager for Mac that simplifies the installation process for many applications. If you do not have Homebrew installed, follow the instructions on the official website (https://brew.sh/).
Step 2: Install Python
Open the Terminal and run the following command to install Python via Homebrew:
brew install python
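Step 3: Install Pytesseract and Wand
Run the following command to install the Pytesseract and Wand Python libraries. The script later in this article imports both; Pillow is installed automatically as a Pytesseract dependency. Use pip3 if pip is not on your PATH:
pip install pytesseract Wand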
Step 4: Install Tesseract OCR
Run the following command to install Tesseract OCR via Homebrew:
brew install tesseract
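Step 5: Install ImageMagick
ImageMagick relies on Ghostscript to read PDF files; if PDF conversion fails later, install it with brew install ghostscript. Run the following command to install ImageMagick via Homebrew: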
brew install imagemagick
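Before moving on, it can help to confirm that everything installed above is actually visible to Python. The following is a minimal sketch using only the standard library; check_ocr_setup is a hypothetical helper name, and the Ghostscript executable names (gs on Mac, gswin64c or gswin32c on Windows) are assumptions about a typical install. It only checks that the packages and executables can be found, not that Wand can link to ImageMagick.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report which pieces of the OCR toolchain can be found."""
    return {
        # Python packages: looked up without importing them
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
        "Wand": importlib.util.find_spec("wand") is not None,
        # Executables: looked up on PATH
        "tesseract": shutil.which("tesseract") is not None,
        "ghostscript": any(
            shutil.which(name) is not None
            for name in ("gs", "gswin64c", "gswin32c")
        ),
    }


if __name__ == "__main__":
    for name, found in check_ocr_setup().items():
        print(f"{name:12} {'OK' if found else 'MISSING'}")
```

Anything reported as MISSING points back to the matching installation step. On Windows, Ghostscript may be installed but not on PATH, so a MISSING entry there is a hint rather than proof.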
OCR Scanned PDFs with Pytesseract and ImageMagick
Now that you have set up Pytesseract and ImageMagick, you can use the following Python script to OCR scanned PDFs:
import os
import re
import sys

import pytesseract
from PIL import Image
from wand.image import Image as WandImage

input_file = sys.argv[1]
output_file = sys.argv[2]

# Create the temporary folder if it does not exist yet
os.makedirs('temp_images', exist_ok=True)

# Convert PDF to image files; a multi-page PDF is written as
# page-0.jpg, page-1.jpg, and so on
with WandImage(filename=input_file, resolution=300) as img:
    img.compression_quality = 99
    img.save(filename='temp_images/page.jpg')

# Sort pages by their number so that page-10 comes after page-2
def page_number(name):
    match = re.search(r'\d+', name)
    return int(match.group()) if match else 0

# Perform OCR using Pytesseract
text = ''
for file in sorted(os.listdir('temp_images'), key=page_number):
    with Image.open(f'temp_images/{file}') as img:
        text += pytesseract.image_to_string(img)

# Save the OCR text to a file
with open(output_file, 'w', encoding='utf-8') as f:
    f.write(text)

# Clean up temporary images
for file in os.listdir('temp_images'):
    os.remove(f'temp_images/{file}')

To use this script, save it as "pdf_ocr.py" and run the following command in your Terminal (Mac) or Command Prompt (Windows), replacing "input.pdf" with the path to your scanned PDF and "output.txt" with the desired output file name:
python pdf_ocr.py input.pdf output.txt
This script performs the following actions:
- Converts the input PDF to a series of images using Wand, the Python binding for ImageMagick. The images are saved in a temporary folder called "temp_images", one file per page. The resolution parameter is set to 300 DPI for better OCR accuracy.
- Iterates through the images in page order, performs OCR on each one using Pytesseract, and appends the recognized text to a string variable.
- Writes the recognized text to the specified output file.
- Cleans up the temporary images by removing them from the "temp_images" folder.
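The same four steps can also be wrapped in a reusable function. The sketch below is one possible variant, not the script's original design: render_pages, recognize, and ocr_pdf are hypothetical names, pages are saved one at a time with zero-padded file names so a plain sort keeps them in order, and tempfile.TemporaryDirectory takes care of the cleanup even when a step fails. It assumes Wand, Pillow, and Pytesseract are installed as described in the setup steps.

```python
import tempfile
from pathlib import Path


def render_pages(pdf_path, out_dir, dpi=300):
    """Rasterize each PDF page to a zero-padded JPEG using Wand."""
    from wand.image import Image as WandImage  # lazy import; needs ImageMagick

    with WandImage(filename=str(pdf_path), resolution=dpi) as pdf:
        for index, page in enumerate(pdf.sequence):
            with WandImage(image=page) as single:
                single.format = 'jpeg'
                single.compression_quality = 99
                target = Path(out_dir) / f'page-{index:04d}.jpg'
                single.save(filename=str(target))


def recognize(image_path):
    """Run Tesseract on one image file and return the recognized text."""
    import pytesseract  # lazy import; needs the Tesseract executable
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_pdf(pdf_path, render=render_pages, ocr=recognize):
    """OCR a scanned PDF and return its text, one page after another."""
    # The temporary folder is deleted on exit, even if a step raises.
    with tempfile.TemporaryDirectory() as tmp:
        render(pdf_path, tmp)
        pages = sorted(Path(tmp).glob('*.jpg'))
        return '\n'.join(ocr(page) for page in pages)
```

With this in place, text = ocr_pdf('input.pdf') returns the recognized text as a string. Passing render and ocr as parameters is a design choice that lets you swap in a different rasterizer or OCR call without touching the rest of the function.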
Conclusion
With Pytesseract and ImageMagick, you can easily OCR scanned PDFs on both Windows and Mac platforms. By following this step-by-step guide, you can set up the necessary tools and create a simple Python script to convert scanned PDFs into searchable and editable text files. The script is short and easy to customize, so you can adapt it to your specific needs and improve your document processing workflows.