Cndro

Summary

This web content provides a comprehensive guide on using the Python Pytesseract module for Optical Character Recognition (OCR) to extract text from images.

Abstract

The article is a tutorial on how to utilize the Python Pytesseract module, which is a wrapper for the Google Tesseract OCR library, to convert text within images into a machine-readable string format. It discusses the relevance of OCR in fields such as data science and computer vision, emphasizing its importance in extracting text data from images. The installation process for Pytesseract on a Windows OS is outlined, including the necessary steps to set up the environment and specify the path to the tesseract.exe binary. The tutorial includes sample code demonstrating how to import the required libraries, open and read an image, and print the extracted text. It also provides visual examples of text extraction from images with varying amounts of text, showcasing the module's capabilities and versatility. The article concludes by encouraging readers to engage with the content by clapping, sharing, and following for more tutorials, and it points to further reading on specialized OCR services like TAGGUN.

Opinions

  • The author positions OCR as a valuable tool for professionals dealing with text data extraction from images.
  • The tutorial is presented as user-friendly and accessible, suitable for those with a basic understanding of Python.
  • The article implies that Pytesseract is a preferred tool for OCR tasks due to its integration with the powerful Google Tesseract library.
  • By providing step-by-step instructions and code examples, the author conveys a didactic approach, aiming to facilitate the learning process for readers.
  • The inclusion of further reading suggestions indicates the author's view that readers may benefit from exploring more specialized OCR solutions for specific use cases.
  • The encouragement for reader interaction suggests the author values community engagement and continuous learning.

How to Extract Text from Images in Python Using Pytesseract OCR

A tutorial on how to convert text from images into a machine-readable format with the help of the Python Pytesseract module.

The Pytesseract module is a Python wrapper for Google's Tesseract OCR engine. In this tutorial, we will use it to convert the words in an image to a string.

Optical Character Recognition (OCR) is a field of research within pattern recognition, artificial intelligence, and computer vision. Data scientists, software engineers, and people in many other work environments use it to extract text from images that are known to contain text data.

Installation

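To use Pytesseract, we need to install the Python package along with Pillow, which is used to open the images. Pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed on the machine. In this tutorial, we will use the Windows operating system.

The Python packages can be installed with pip: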

Pytesseract

pip install pytesseract

Pillow

pip install pillow
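
After running the two commands, it can be useful to confirm that both packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name check_packages is hypothetical and not part of Pytesseract or Pillow. Note that Pillow is imported under the name PIL.

```python
import importlib.util


def check_packages(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    # find_spec returns None when a top-level module cannot be located.
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Pillow installs a module called "PIL", not "pillow".
status = check_packages(["pytesseract", "PIL"])
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If either package is reported as missing, rerun the corresponding pip command in the same Python environment.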

Pytesseract needs to know where the tesseract.exe binary is located. While installing Tesseract, copy the installation path and keep it, because we will assign it to pytesseract's tesseract_cmd setting in the code later.
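
Instead of relying only on a hard-coded path, the binary can also be looked up at run time. The following is a rough sketch, not something the library provides: find_tesseract is a hypothetical helper that first checks the system PATH and then falls back to a list of candidate locations that you supply.

```python
import shutil
from pathlib import Path


def find_tesseract(extra_candidates=()):
    """Return the path to the tesseract binary, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable.
    found = shutil.which("tesseract")
    if found:
        return found
    # Fall back to explicit locations, e.g. the path copied during installation.
    for candidate in extra_candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# The candidate below is the installation path used in this tutorial;
# replace it with the path from your own machine.
tesseract_path = find_tesseract(
    ["C:\\Users\\CNDRO\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"]
)
print(tesseract_path)
```

When a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling any OCR function. A result of None means that Tesseract is probably not installed or that the candidate path is wrong.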

Sample One

We will convert the image below to text using the pytesseract module:

Code:

# Import the libraries
from PIL import Image
import pytesseract

# Path to the Tesseract installation that we copied earlier
pytesseract.pytesseract.tesseract_cmd = "C:\\Users\\CNDRO\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"

# Name of the image shown above
image_path = "brush.png"

# Open the image and store it in an image object
img = Image.open(image_path)

# Extract the text from the image
text = pytesseract.image_to_string(img)

# Display the result without trailing whitespace
print(text.strip())
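
The sample trims the end of the string before printing because recent Tesseract versions usually append a page separator (a form feed character) and a newline to the recognized text. A slightly more thorough cleanup could look like the sketch below; clean_ocr_text is a hypothetical helper, and the raw string is a made-up stand-in for OCR output rather than a real result.

```python
def clean_ocr_text(raw: str) -> str:
    """Remove page separators and trailing whitespace from OCR output."""
    # Drop form feed characters, which mark page boundaries.
    without_separator = raw.replace("\x0c", "")
    # Strip trailing spaces from every line.
    lines = [line.rstrip() for line in without_separator.splitlines()]
    # Remove blank lines at the start and end of the text.
    return "\n".join(lines).strip()


# Made-up raw output, used only to illustrate the cleanup.
raw = "First line  \nSecond line\n\n\x0c"
print(clean_ocr_text(raw))
```

Blank lines inside the text are kept, so paragraph breaks that Tesseract detected are preserved.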

Output:

Sample Two

If an image contains a lot of text, we can use the pytesseract module in exactly the same way. We will demonstrate this with the image below:

Code:

# Import the libraries
from PIL import Image
import pytesseract

# Path to the Tesseract installation that we copied earlier
pytesseract.pytesseract.tesseract_cmd = "C:\\Users\\CNDRO\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"

# Name of the image shown above
image_path = "behind.png"

# Open the image and store it in an image object
img = Image.open(image_path)

# Extract the text from the image
text = pytesseract.image_to_string(img)

# Display the result without trailing whitespace
print(text.strip())
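
The two samples differ only in the image file name, so the shared steps could be wrapped in a small function. This is one possible sketch rather than a recommended API: extract_text and extract_many are hypothetical names, and the OCR packages are imported inside the function so that the helpers can be defined even before those packages are installed.

```python
def extract_text(image_path, tesseract_cmd=None):
    """Run OCR on one image and return the recognized text."""
    # Imported lazily; requires the pytesseract and pillow packages.
    from PIL import Image
    import pytesseract

    # Only override the binary location when a path is given.
    if tesseract_cmd is not None:
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    # The context manager closes the image file after OCR finishes.
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img).strip()


def extract_many(image_paths, extractor=extract_text):
    """Apply an extractor to several images and map each path to its text."""
    return {str(path): extractor(path) for path in image_paths}
```

Assuming both sample images are in the working directory and Tesseract is on the PATH, extract_many(["brush.png", "behind.png"]) would return a dictionary with the text of each image. If Tesseract is not on the PATH, call extract_text directly and pass the installation path as tesseract_cmd.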

Output:

Thanks for reading this post. If you found this post helpful, clap, share, and follow us for more tutorial posts.
