Renu Khandelwal

An Introduction to Optical Character Recognition for Beginners

Your first step towards reading text from unstructured data

In this article, you will learn

  • What is Optical Character Recognition (OCR)?
  • Usage of OCR
  • Simple code to read text from PDF files and images

You have scanned copies of several documents like certificates of courses candidates have taken. The course certificate could be a PDF or a JPEG or a PNG file. How can you extract vital information like the name of the candidate, name of the course completed, and the date when the course was taken?

Optical Character Recognition (OCR)

OCR is a technology that converts handwritten, typed, or scanned text, or text inside images, into machine-readable text.

You can use OCR to extract text from any image file, PDF document, scanned document, printed document, or handwritten document, provided the text is legible.

Usage of OCR

Some of the common uses of OCR are

  • Creating automated workflows by digitizing PDF documents across different business units
  • Eliminating manual data entry by digitizing printed documents such as passports, invoices, and bank statements
  • Creating secure access to sensitive information by digitizing ID cards, credit cards, etc.
  • Digitizing printed books, as Project Gutenberg does

Reading a PDF file

Here you will read the contents of a PDF file. You need to install the PyPDF2 library, a Python library that handles PDF tasks such as

  • Extracting document information like title, author, etc.
  • Splitting documents page by page
  • Encrypting and decrypting PDF files

!pip install pypdf2

You can download a sample W4 form as a PDF

Importing the library

import PyPDF2

Extract the number of pages and PDF file information

Open the PDF file in binary mode by passing 'rb' as the mode. Pass pdfFileObj to PdfReader() to read the file stream. len(pdfReader.pages) gives the total number of pages in the PDF file, and the metadata attribute holds the PDF file's information such as author, creator, producer, subject, and title. PyPDF2 releases before 3.0 named these PdfFileReader(), numPages, and getDocumentInfo(); those names no longer work in PyPDF2 3.x.

filename = r'\PDFfiles\W4.pdf'
# Open the PDF in binary mode and hand the stream to the reader
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObj)
num_pages = len(pdfReader.pages)
# metadata holds the document information (title, author, and so on)
info = pdfReader.metadata
print("No. of Pages: ", num_pages)
print("Title: ", info.title)
print("Author: ", info.author)
print("Subject: ", info.subject)
print("Creator: ", info.creator)
print("Producer: ", info.producer)

Retrieve the text from all the pages in the PDF file

Iterate over pdfReader.pages, which yields the pages of the PDF file in order, and extract the text of each page using extract_text() (getPage() and extractText() in PyPDF2 releases before 3.0). In the end, close the file using close().

text = ""
# The loop reads each page in turn and accumulates its text
for page_number, pageObj in enumerate(pdfReader.pages, start=1):
    page_text = pageObj.extract_text()
    text += page_text
    print("Page Number", page_number)
    print("Content", page_text)
pdfFileObj.close()

A word of caution: text extracted from a PDF this way is not always in the right order, and the spacing can also differ slightly from the original.
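
One way to soften the spacing problem is to normalize whitespace after extraction. The following is a minimal sketch, not part of PyPDF2: the helper name normalize_whitespace and the sample string are hypothetical, and the helper does nothing about text that comes out in the wrong order.

```python
import re


def normalize_whitespace(raw_text):
    """Collapse irregular spacing in text extracted from a PDF page."""
    cleaned_lines = []
    for line in raw_text.splitlines():
        # Replace runs of spaces and tabs with a single space.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # drop lines that are empty after stripping
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Hypothetical sample imitating unevenly spaced extracted text.
sample = "Form   W-4 \n\n  Employee's    Withholding  Certificate\t\n"
print(normalize_whitespace(sample))
```

You could call this helper on the text variable once all the pages have been read.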

Reading a Text from an Image

You will use pytesseract, which is a Python wrapper for Google's Tesseract optical character recognition (OCR) engine, to read the text embedded in images.

You will need to understand some of the configuration options that can be applied using pytesseract

  • Page segmentation modes (psm)
  • OCR engine modes (oem)
  • Language (l)

Page Segmentation Mode (psm)

psm defines how Tesseract splits, or segments, an image into lines of text or words.

Options for page segmentation modes (psm):

  • 0: Orientation and script detection (OSD) only.
  • 1: Automatic page segmentation with OSD.
  • 2: Automatic page segmentation, but no OSD, or OCR.
  • 3: Fully automatic page segmentation, but no OSD. (Default)
  • 4: Assume a single column of text of variable sizes.
  • 5: Assume a single uniform block of vertically aligned text.
  • 6: Assume a single uniform block of text.
  • 7: Treat the image as a single text line.
  • 8: Treat the image as a single word.
  • 9: Treat the image as a single word in a circle.
  • 10: Treat the image as a single character.
  • 11: Sparse text. Find as much text as possible in no particular order.
  • 12: Sparse text with OSD.
  • 13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

OCR Engine Mode (oem)

Tesseract offers several engine modes, which differ in speed and accuracy.

  • 0: Legacy engine only.
  • 1: Neural nets LSTM engine only.
  • 2: Legacy + LSTM engines.
  • 3: Default, based on what is available.

Language (l)

Tesseract supports multiple languages. The language data for each language you intend to work with must be installed alongside Tesseract itself; pytesseract only passes the language code on to it. The default language is eng.
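
The three options described above are passed to Tesseract together as one configuration string. As a minimal sketch, the hypothetical helper below (build_tesseract_config is not part of pytesseract) assembles that string and rejects psm and oem values outside the ranges listed above.

```python
def build_tesseract_config(psm=3, oem=3, lang="eng"):
    """Return a config string such as '--oem 3 --psm 3 -l eng'."""
    if psm not in range(0, 14):
        raise ValueError("psm must be between 0 and 13")
    if oem not in range(0, 4):
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm} -l {lang}"


# LSTM engine only, single uniform block of text, English.
print(build_tesseract_config(psm=6, oem=1))
```

The returned string can be passed as the config argument of pytesseract.image_to_string().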

Image used for reading text

Importing required libraries

import pytesseract
import cv2

Read the image file using OpenCV, then apply a configuration option for pytesseract to read the text from the image. You can try different options for psm and oem and check the differences in the output.

image_Filename = r'\Apparel_tag.jpg'
# Read the file using OpenCV and show the image
img = cv2.imread(image_Filename)
cv2.imshow("Apparel Tag", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
# Set the configuration for reading text from the image using pytesseract
custom_config = r'--oem 1 --psm 8 -l eng'
text = pytesseract.image_to_string(img, config=custom_config)
print(text)

Extracted text from the image
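
To compare several page segmentation modes on the same image, you could loop over them and collect the output of each. The sketch below is only an illustration: compare_psm_modes and fake_ocr are hypothetical names, and fake_ocr is a stand-in so the example runs without an image or a Tesseract installation. When no ocr function is supplied, the helper falls back to pytesseract.image_to_string.

```python
def compare_psm_modes(image, psm_modes, ocr=None):
    """Run OCR once per page segmentation mode and collect the results."""
    if ocr is None:
        # Imported lazily so the sketch can run without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    results = {}
    for psm in psm_modes:
        config = f"--oem 1 --psm {psm} -l eng"
        results[psm] = ocr(image, config=config).strip()
    return results


# Stand-in OCR function (hypothetical) that only echoes the config it received.
def fake_ocr(image, config=""):
    return f"text read with [{config}]\n"


for psm, text in compare_psm_modes(None, [6, 8, 11], ocr=fake_ocr).items():
    print(psm, "->", text)
```

With a real image, you would pass the array returned by cv2.imread() as image and leave out the ocr argument.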

Best Practices for OCR using pytesseract

  • Try different combinations of configuration options for pytesseract to find what works best for your use case
  • Make sure the text is not skewed, leave some white space around the text, and light the image evenly to avoid dark borders
  • Use an image resolution of at least 300 DPI; 300 to 600 DPI works well
  • Use a font size of 12 pt. or more for better results
  • Apply pre-processing techniques such as binarizing, de-noising, rotating the image to deskew it, and sharpening
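
To make the binarizing step concrete, here is a minimal pure-Python sketch of Otsu's method, one common way to choose a binarization threshold. It is an illustration under assumptions: the function names and the tiny pixel list are hypothetical, and pixels are taken to be grayscale integers from 0 to 255. In practice you would more likely rely on OpenCV, whose cv2.threshold() function accepts a cv2.THRESH_OTSU flag.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map each pixel to black (0) or white (255) around the threshold."""
    return [255 if value > threshold else 0 for value in pixels]


# Hypothetical pixels: three dark (text) values and three light (paper) values.
pixels = [10, 12, 11, 200, 205, 198]
threshold = otsu_threshold(pixels)
print(binarize(pixels, threshold))
```

The sketch works on a flat list of pixel values; a real image would first be converted to grayscale and flattened.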

Conclusion:

OCR results depend on the quality of the input data. Cleanly segmented text with no noise in the background gives better results. In the real world this is not always possible, so we need to apply multiple pre-processing techniques for OCR to give better results.

References:

https://pypi.org/project/PyPDF2/

https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy
