
Data Science with Python — Optical Character Recognition

Photo by David Travis on Unsplash

This article is part of the “Data Science with Python” series. You can find the other stories in the series at the end of this article.

Optical character recognition (OCR) is a powerful technology that has transformed the way we process and analyze text data. OCR is a method of converting scanned images, PDFs, or other documents into editable and searchable text.

Python, with its rich set of libraries and tools, has emerged as a popular language for OCR tasks. We’ll see how.

Overview of OCR

OCR is a multi-stage process that involves preprocessing images, extracting relevant features from the images, and classifying the features into the corresponding characters or words.

One of the most popular libraries for OCR in Python is Tesseract, an open-source OCR engine developed by Google. Tesseract is capable of recognizing text in over 100 languages and can handle a wide range of input formats, including scanned images and PDFs (it can even attempt handwriting, though with much lower accuracy than printed text). Tesseract can be easily integrated with Python using the pytesseract library, which provides a simple and intuitive interface for OCR tasks.

pip install pytesseract
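
For example, once the Tesseract engine itself is installed on your machine, extracting text from an image takes only a few lines (the file name image.jpg is a placeholder):

from PIL import Image
import pytesseract

# Run Tesseract on the image and get the recognized text back as a string
text = pytesseract.image_to_string(Image.open('image.jpg'))
print(text)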

Another important library for OCR in Python is OpenCV, a computer vision library that provides a wide range of image processing functions. OpenCV can be used to perform image preprocessing tasks such as binarization, noise reduction, and deskewing, which can help improve the accuracy of OCR. OpenCV can also be used for feature extraction tasks, such as detecting lines and edges in images.

pip install opencv-python

In addition to the Python package, you may need the full OpenCV distribution for some features, which you can download from the official OpenCV website. Likewise, pytesseract is only a wrapper: the Tesseract engine itself must be installed on your system.

Finally, for classification, Python provides several machine learning libraries, such as Scikit-Learn. Scikit-Learn offers a wide range of classification algorithms, including K-nearest neighbors (KNN), support vector machines (SVM), and neural networks, all of which can be applied to OCR. It also provides tools for evaluating classifier performance, such as accuracy, precision, and recall.

Processing Images for OCR

Image preprocessing is an important step in OCR, as it helps to improve the quality of the image and make it easier to extract relevant features. There are several techniques that can be used for image preprocessing, such as binarization, noise reduction, and deskewing.

Binarization is the process of converting an image into a binary format, where each pixel is either black or white. This can be useful for improving the contrast of the image and removing any noise or artifacts.

import cv2
from skimage import io, filters


# Binarization using OpenCV
img = cv2.imread('image.jpg')

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply thresholding using Otsu's method
# This sets the threshold value automatically based on the image histogram
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]


# Binarization using scikit-image
img = io.imread('image.jpg', as_gray=True)

# Apply thresholding using Otsu's method
# This sets the threshold value automatically based on the image histogram
thresh = filters.threshold_otsu(img)

# Convert the image to binary format using the threshold
binary = img > thresh

Noise reduction is another important technique for preprocessing images for OCR. Noise can be caused by factors such as low lighting conditions, scanning artifacts, or camera shake.

# Apply Gaussian blur to reduce noise
blur = cv2.GaussianBlur(img, (5, 5), 0)
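
Gaussian blur smooths out general sensor noise, but scanned documents often suffer from salt-and-pepper noise, for which a median filter usually works better. A minimal sketch, assuming the 8-bit grayscale gray image from the binarization snippet above:

# Apply a median filter; the kernel size (here 3) must be an odd integer
denoised = cv2.medianBlur(gray, 3)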

Deskewing is the process of straightening an image that is tilted or skewed. This can be useful for improving the accuracy of OCR, as it ensures that the characters are properly aligned and easier to recognize.

import numpy as np

img = cv2.imread('image.jpg', 0)  # the flag 0 loads the image directly in grayscale

thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Find contours in the binary image
# RETR_EXTERNAL retrieves only the external contours, CHAIN_APPROX_SIMPLE compresses horizontal, vertical, and diagonal segments and leaves only their end points.
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Find the contour with the largest area (i.e., the text region)
max_contour = max(contours, key=cv2.contourArea)

# Find the minimum area rectangle that encloses the contour
rect = cv2.minAreaRect(max_contour)

# Find the four corners of the rectangle (useful for visualizing the text region)
box = cv2.boxPoints(rect)
box = np.int0(box)  # on NumPy 2.x, use box.astype(int) instead

# Find the angle of rotation of the rectangle
# Note: the angle convention of minAreaRect depends on the OpenCV version;
# with the classic convention (angles in [-90, 0)), near-vertical boxes need adjusting
angle = rect[2]
if angle < -45:
    angle = 90 + angle

# Rotate the image by the angle of rotation to deskew it
rows, cols = img.shape
M = cv2.getRotationMatrix2D((cols/2,rows/2), angle, 1)
deskewed = cv2.warpAffine(img, M, (cols, rows))

Feature Extraction and Classification

After preprocessing the images, we need to extract features from the image that can be used for classification. There are various techniques available for feature extraction, such as edge detection, histogram of oriented gradients (HOG), and scale-invariant feature transform (SIFT).

Today, I’ll use HOG. HOG computes the gradient magnitude and orientation at each pixel in the image and then creates a histogram of gradients in each cell of a grid. The resulting feature vector can be used for classification.

from skimage.feature import hog
from sklearn import svm

# Extract features using HOG
# (in scikit-image >= 0.19, replace multichannel=False with channel_axis=None)
fd, hog_image = hog(deskewed, orientations=9, pixels_per_cell=(8, 8),
                    cells_per_block=(2, 2), visualize=True, multichannel=False)

Once we have extracted features from the images, we need to train a machine learning model to classify them. We will use a support vector machine (SVM) classifier. SVM is a popular machine learning algorithm for classification tasks.

# Prepare data for training
# (a real training set needs many samples and at least two classes;
# a single feature/label pair is shown here only to illustrate the API)
features = [fd]
labels = ['A']

# Train the SVM classifier
clf = svm.SVC(kernel='linear', C=1, probability=True)
clf.fit(features, labels)

So what happens in the two snippets above?

First, we use the hog function from the skimage.feature module to extract the HOG features from the deskewed image. We specify the number of orientation bins, the size of the cells in which the gradients are computed, and the size of the blocks over which the histograms of gradients are normalized.

Next, we prepare the features and labels for training. Let’s assume our image is the letter ‘A’. In this case, we only have one image with the letter ‘A’, so we use the HOG features as the feature vector and the label ‘A’.

Finally, we train an SVM classifier using the SVC class from the sklearn module. We specify a linear kernel and a regularization parameter C of 1. We also set probability to True to enable probability estimates for the classifier.
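
Once the classifier is trained on a realistic dataset (not just the single sample above), it can classify new characters from their HOG features. A minimal sketch, where new_fd is assumed to be the HOG feature vector of another preprocessed character image:

# Predict the character label for a new feature vector
prediction = clf.predict([new_fd])

# Because we set probability=True, we can also inspect class probabilities
probabilities = clf.predict_proba([new_fd])
print(prediction[0], probabilities)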

Evaluating and Improving OCR Performance

After training an OCR model, it’s important to evaluate its performance on a separate dataset to ensure that it is accurate and reliable. There are several metrics that can be used to measure OCR performance, including precision, recall, and F1 score.

Precision measures the proportion of true positive predictions out of all positive predictions made by the OCR model. Recall measures the proportion of true positive predictions out of all actual positive instances in the dataset. The F1 score is the harmonic mean of precision and recall, F1 = 2 × (precision × recall) / (precision + recall), and provides a single number that summarizes the overall performance of the OCR model.

from sklearn.metrics import precision_score, recall_score, f1_score

# Assume y_true and y_pred are arrays of true and predicted labels
# (the default average='binary' only works for binary labels;
# for multi-class character labels, use average='macro' or average='weighted')
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
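
As a quick sanity check, here is the same computation on a small set of made-up labels (purely illustrative, not real OCR output):

# Illustrative (made-up) labels for six test characters
y_true = ['A', 'B', 'A', 'C', 'B', 'A']
y_pred = ['A', 'B', 'C', 'C', 'B', 'A']

# With average='macro', each of the three classes contributes equally to the score
print(f1_score(y_true, y_pred, average='macro'))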

To improve the performance of an OCR model, several techniques can be used. One approach is to use a larger training dataset, which can help the model learn more robust features and reduce overfitting. Another approach is to adjust the hyperparameters of the model, such as the learning rate or regularization strength, to achieve better performance.

To use a larger training dataset to improve OCR performance, we can simply gather more data and add it to our training set. Here’s an example of how to load images from a directory and feed them to an ImageDataGenerator in Keras (but first, you need to install Keras using pip install keras):

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=10,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=False,
                                   fill_mode='nearest')

train_generator = train_datagen.flow_from_directory(
    '/path/to/training/images',
    target_size=(256, 256),
    batch_size=32,
    class_mode='categorical')
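
The generator can then be passed straight to a Keras model during training. A minimal sketch, assuming model is a compiled Keras classifier whose input shape and number of output classes match the generator:

# Train on batches produced by the generator;
# by default, Keras runs one full pass over the directory per epoch
model.fit(train_generator, epochs=10)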

To adjust the hyperparameters of an OCR model, we can use techniques like grid search or random search to find the optimal values. Here’s an example of how to use grid search in Keras:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
from keras.wrappers.scikit_learn import KerasClassifier  # removed in newer TensorFlow; use the scikeras package instead
from sklearn.model_selection import GridSearchCV

# Define a function to create the OCR model
def create_model(learning_rate=0.001, dropout_rate=0.2):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(784,)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    optimizer = Adam(lr=learning_rate)  # in newer Keras, use learning_rate= instead of lr=
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Create the Keras classifier
model = KerasClassifier(build_fn=create_model, epochs=10, batch_size=32, verbose=0)

# Define the hyperparameters to search over
param_grid = {'learning_rate': [0.001, 0.01, 0.1],
              'dropout_rate': [0.1, 0.2, 0.3]}

# Perform grid search
# (X_train and y_train are assumed to exist: flattened 784-pixel images and one-hot labels)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, y_train)

# Print the best parameters and score
print(f'Best Parameters: {grid_result.best_params_}')
print(f'Best Score: {grid_result.best_score_:.4f}')
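
As mentioned above, random search is an alternative to grid search: instead of trying every combination, it samples a fixed number of them, which scales much better as the search space grows. A sketch using scikit-learn’s RandomizedSearchCV with the same model and parameter lists:

from sklearn.model_selection import RandomizedSearchCV

# Sample 5 of the 9 possible combinations at random instead of trying them all
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=5, cv=3)
random_result = random_search.fit(X_train, y_train)

print(f'Best Parameters: {random_search.best_params_}')
print(f'Best Score: {random_search.best_score_:.4f}')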

This code may seem a bit complex. Explaining every parameter would take ages, so here’s a quick walkthrough:

A function called create_model is defined to create the OCR model with a given learning rate and dropout rate. The model is a sequential neural network with two layers, the first being a dense layer with 64 units and a ReLU activation function, and the second being a dense layer with 10 units and a softmax activation function. The dropout rate is applied after the first layer to prevent overfitting. The optimizer used is Adam with the specified learning rate. The model is then compiled with a categorical cross-entropy loss function and accuracy metric.

A KerasClassifier object is created with the create_model function as the build function, and the number of epochs and batch size are specified.

A dictionary called param_grid is defined to specify the hyperparameters to search over. In this case, the learning rate and dropout rate are the two hyperparameters to tune, and a list of values to test for each hyperparameter is specified.

GridSearchCV is then used to perform the hyperparameter search using the KerasClassifier model and the defined hyperparameter grid. The cv parameter specifies the number of folds for cross-validation.

Finally, the best parameters and score are printed to the console. The best_params_ attribute of the grid_result object returns a dictionary of the best hyperparameters found during the search, and the best_score_ attribute returns the highest cross-validation score achieved by the model.

Keep in mind that OCR performance can vary widely depending on the specific use case and the quality of the input images. Sometimes, it won’t be possible to develop a good model. For example, OCR performance may be lower on handwritten text or low-resolution images than on printed text or high-resolution images.

Final Note

Python provides a rich ecosystem of libraries and tools that make OCR accessible and relatively easy to implement. As OCR technology continues to improve, it is likely to become even more widespread and useful in a variety of fields, from document digitization to automated data entry.

In an upcoming article, I’ll talk about a real use case of OCR. Be sure to follow me if you don’t want to miss it!

To explore the other stories of this series, click below!

To explore more of my Python stories, click here! You can also access all my content by checking this page.

If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!

If you’re not subscribed to medium yet and wish to support me or get access to all my stories, you can use my link:
