Data Science with Python — OCR Use Case
This article is part of the “Data Science with Python” series.
In a previous article, we talked about how to perform OCR with Python. Today, we’ll see an application of OCR using a famous dataset: The MNIST (Modified National Institute of Standards and Technology) Handwritten Digits Dataset.
What is the MNIST Handwritten Digits Dataset?
The MNIST (Modified National Institute of Standards and Technology) Handwritten Digits Dataset is a well-known dataset in the field of machine learning and computer vision. It was developed by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges as part of their research on machine learning algorithms for handwritten digit recognition.
The MNIST dataset contains 70,000 grayscale images of handwritten digits, each measuring 28x28 pixels. The images are labeled with their corresponding numerical values, which range from 0 to 9. The dataset is split into two parts: 60,000 images are used for training, and 10,000 images are used for testing.
The MNIST dataset has become a benchmark for testing and comparing machine learning algorithms for OCR. It is widely used in academic and industrial research, and many researchers have achieved high levels of accuracy using various machine learning techniques. However, the dataset has some limitations and challenges. For example, the dataset only contains images of handwritten digits, and it may not be representative of real-world OCR applications that require recognition of different font styles, handwriting styles, and languages. Nevertheless, the MNIST dataset remains an important resource for researchers and developers working on OCR and related fields.
Setting Up the Environment
I advise you to create a virtual environment or a miniconda environment in which to install the required libraries. We will mainly need Keras and TensorFlow.
pip install keras
To install TensorFlow, run either
pip install tensorflow
or
pip install tensorflow-cpu
depending on your version of Python and your OS.
You should also install OpenCV. I explained how to do it in the previous article of this series: you can install it either with pip or from the official releases on the OpenCV website.
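If you go the pip route, the package name usually used is opencv-python:
pip install opencv-python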
Loading the Dataset
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
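As a quick sanity check, we can print the shapes of the arrays; they should reflect the 60,000/10,000 split and the 28x28 images described above:
# check the shapes of the training and test sets
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)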
Visualizing the Images
We can use matplotlib to visualize the images of this dataset:
import matplotlib.pyplot as plt
# plot the first image in the dataset
plt.imshow(x_train[0], cmap=plt.get_cmap('gray'))
plt.show()
# plot the first 25 images in the dataset
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.imshow(x_train[i], cmap=plt.get_cmap('gray'))
    plt.axis('off')
plt.show()

Preprocessing the Data
Preprocessing the data is an important step. The purpose of preprocessing is to transform the raw image data into a format that is suitable for machine learning algorithms.
One common preprocessing technique is normalization, which involves scaling the pixel values of the images to a common range. This helps to reduce the impact of variations in brightness and contrast on the OCR accuracy.
We can start by normalizing the data. We can also binarize it and, optionally, reduce the noise (although we skip that step here, since this dataset does not work well with OpenCV's median blur).
import cv2

# normalize the data (cast to float32 so OpenCV can process it)
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
# binarize the data
x_train = cv2.threshold(x_train, 0.5, 1, cv2.THRESH_BINARY)[1]
x_test = cv2.threshold(x_test, 0.5, 1, cv2.THRESH_BINARY)[1]
Then we can also one-hot encode our labels and reshape our data. The purpose of doing this is to prepare the MNIST dataset for input to a convolutional neural network (CNN) model.
In the case of the MNIST dataset, the labels represent the numerical values of the digits from 0 to 9. However, these numerical values should not be interpreted as continuous values, but rather as discrete categories. Therefore, one-hot encoding is used to transform the labels into a binary vector where each category is represented by a unique bit position. For example, the label 2 would be encoded as [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
Reshaping the data is necessary because CNN models require input data to be in a specific format, typically a 4D tensor with dimensions [batch_size, height, width, channels]. In the case of the MNIST dataset, the images are grayscale with dimensions of 28x28 pixels. Therefore, the data needs to be reshaped into a 4D tensor with dimensions [batch_size, 28, 28, 1], where the “1” represents the number of channels (grayscale images have one channel).
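Here is a minimal sketch of these two steps, using to_categorical from keras.utils for the labels and NumPy reshaping for the images (the variable names follow the code above):
from keras.utils import to_categorical

# one-hot encode the labels: e.g. 2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# reshape the images into a 4D tensor: (batch_size, 28, 28, 1)
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)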
Building the OCR Model
Now, we will build a convolutional neural network (CNN) model using the preprocessed MNIST dataset.
I’ll start with the code, and I’ll explain it:
import keras

# define the model
model = keras.models.Sequential()
model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(keras.layers.Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(10, activation='softmax'))
# compile the model
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])
# train the model
model.fit(x_train, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(x_test, y_test))
The first step is to define the CNN model using the Sequential model class from the Keras API. The architecture of the model consists of several layers:
- Conv2D layers: These are the core building blocks of a CNN, and they apply a set of filters to the input image to extract meaningful features. Each filter is a small matrix of weights that slides over the input image, computing the dot product between the filter and a local patch of the input. The resulting output is called a feature map, which highlights specific patterns in the input image. The choice of 32 and 64 filters in the two Conv2D layers is a common practice in CNNs and can be adjusted depending on the complexity of the problem.
- MaxPooling2D layers: These layers reduce the spatial dimensions of the feature maps by taking the maximum value within each non-overlapping window. This operation helps to extract the most important features while reducing the computational complexity of the model.
- Dropout layer: Dropout is a regularization technique that randomly sets a fraction of the input units to zero during training. This helps to prevent overfitting by forcing the model to learn more robust features that are useful for making predictions.
- Flatten layer: This layer flattens the output of the previous layer into a 1D vector, which can be fed into a fully connected neural network layer.
- Dense layers: These layers are fully connected neural network layers that perform a linear transformation of the input followed by a nonlinear activation function. The number of units in the last dense layer corresponds to the number of classes in the classification problem.
- Activation functions: An activation function is a mathematical function that is applied to the output of each neuron in a neural network. It introduces nonlinearity into the network, allowing it to learn more complex and sophisticated relationships between the inputs and the outputs.
The choice of this architecture is based on its ability to extract useful features from the input images, while reducing the number of parameters in the model. The use of convolutional layers allows the model to learn local patterns in the images, while the pooling layers reduce the dimensionality of the feature maps. The ReLU activation function introduces nonlinearity into the model, making it more flexible and expressive. The Dropout layer helps to prevent overfitting by randomly dropping out units during training, while the dense layers perform the final classification task.
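If you want to check the resulting architecture and parameter counts for yourself, a quick way is to print the model summary:
# display the layers and number of parameters
model.summary()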
After defining the model architecture, we compile the model by specifying the loss function, optimizer, and metrics to be used during training. In this case, we use the categorical cross-entropy loss function, Adadelta optimizer, and accuracy as the evaluation metric.
Finally, we train the model using the fit() method, which takes the preprocessed MNIST dataset as input. We specify the batch size, number of epochs, and validation data to be used during training. The verbose parameter is set to 1 to display the progress of the training.
We can then save the model for further use:
model.save('mnist.h5')
And load it later:
model = keras.models.load_model('mnist.h5')
Evaluating the Model
Once we have trained the model, we need to evaluate its performance on the test set. We use the evaluate() method to compute the test loss and accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Test loss: 0.8606541156768799
Test accuracy: 0.8409000039100647
The test accuracy gives us an idea of how well the model is likely to perform on new, unseen data.
We can also use the model to make predictions on new data with the predict() method. Let's pick some random test images, predict their labels, and display the results:
import random
import numpy as np

predictions = model.predict(x_test)
indices = random.sample(range(0, len(x_test)), 30)
plt.figure(figsize=(10, 10))
for i, index in enumerate(indices):
    plt.subplot(5, 6, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_test[index].reshape(28, 28), cmap=plt.cm.binary)
    plt.xlabel(np.argmax(predictions[index]))
plt.show()

Well, it looks good!
Final Note
OCR may seem hard, but the hard part is really understanding how the model is built from its many layers. Each layer has a purpose, and it's important to understand it. Everything else is not far from things you already know.
I hope you enjoyed this article! I tried to be brief, as it would take hours to properly explain OCR.