

Machine versus human learning in traffic sign classification

In project 2 of the great Udacity self-driving car nanodegree (https://www.udacity.com/drive), we are invited to recognize traffic signs using the best up-to-date techniques available in the world!

There are some analogies between machine and human learning. We can use our own way of learning to improve machine learning, but we can also use machine learning to better understand how we learn!

Topics:

1 — The recognition project
2 — Analogies between human and machine learning
3 — Solution approach
4 — Results
5 — Conclusion

1. The Traffic Sign Classifier project

In this project, we use data from the German traffic sign dataset. It comes from a challenge, sponsored by the German Ministry of Education and Research in 2011, to find the best algorithms for recognizing the signs.

http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset

It has 43 classes of traffic signs and more than 50,000 sample images, correctly labeled.

The number of samples is not the same for every class.

Different number of samples per class — training data

Before approaching the solution, let’s briefly discuss some concepts of machine learning.

2. Analogies between human and machine learning

2.1 Overfitting and underfitting

We go to school to learn something useful and to apply it in the real world.

There are two risks in the learning process: not learning enough (underfitting), and learning too much (overfitting).

The first item is easy. Our brainpower must be enough to learn what is useful, and the school also has to challenge us. The neural networks of 10 years ago, for example, usually had three layers. Anything more than this didn’t work in practice because of hardware and software limitations: the network wouldn’t converge, or it would take forever to do so. There was a lack of brainpower. Nowadays, we have hardware (such as GPUs) and software (TensorFlow, Keras) for very complicated neural networks.

The overfitting item is more subtle. By learning too much, I mean memorizing the training data exactly, as if it were the holy truth of the world, and not being able to generalize. It means memorizing noise instead of information.

The blue curve has zero error, but doesn’t generalize the behavior of the data

Overfitting can be a problem because it is harder to identify. We can deceive ourselves into thinking the accuracy is OK.

Underfitting is evident in a simple accuracy analysis: the model performs poorly even on the training data. An overfit model scores well on the training data, so the problem only shows up on data it has never seen.

Underfit: zero grade in school, easily identifiable.

Overfit: the guy who has a perfect grade in school, in all subjects, but outside school knows nothing of the real world. Or someone who has a PhD in nuclear advanced theoretical gravitational quantum physics, but works as a waiter in a restaurant, because his knowledge is so specific it has no real-world application.
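To make the difference concrete, here is a minimal sketch that is not part of the project code. It assumes a toy problem (points between 0 and 1, labeled 1 when above 0.5, with 20% of the labels flipped as noise) and compares a model that memorizes the training set with a simple rule that only captures the general trend. All names and numbers here are illustrative assumptions.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points x in [0, 1) labeled 1 if x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # noisy label
        data.append((x, label))
    return data

def memorizer(train):
    """Overfit model: answers with the label of the closest training point."""
    def predict(x):
        return min(train, key=lambda point: abs(point[0] - x))[1]
    return predict

def simple_rule(x):
    """Stands in for a model that learned only the general trend."""
    return int(x > 0.5)

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(500, rng)
test = make_data(2000, rng)
memo = memorizer(train)

print("memorizer   train/test:", accuracy(memo, train), accuracy(memo, test))
print("simple rule train/test:", accuracy(simple_rule, train), accuracy(simple_rule, test))
```

The memorizer gets a perfect grade on the training data because it also memorized the noise, yet it does worse than the simple rule on data it has not seen.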

2.2 Training, Validation and Test sets

One way to avoid overfitting is to split the data into Training, Validation and Test sets.

The Test set is set apart from the others. The Training and Validation sets are used in each epoch of the model. We train our neural network using only the training data, and then we validate its performance with the validation set.

It is like we do in school. We have a lot of exercises to study at home. Then we have an examination, in school.

Each epoch is like a complete study of training material and the examination in the school. These are analogies to train and validation data.

In the case of the project, I did the following separation:

Number of training examples = 31,367

Number of validation examples = 7,842

Number of testing examples = 12,630
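The training and validation numbers above are consistent with holding out 20% of the 39,209 non-test samples for validation. The article does not say how the split was made, so the sketch below is only an assumption: it shuffles the sample indices and sets aside a fraction of them. The function name and the seed are hypothetical.

```python
import math
import random

def split_train_validation(n_samples, val_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_samples * val_fraction)  # validation size, rounded up
    return indices[n_val:], indices[:n_val]

# 31,367 + 7,842 = 39,209 samples outside the test set
train_idx, val_idx = split_train_validation(31367 + 7842)
print(len(train_idx), len(val_idx))
```

The indices can then be used to select the matching images and labels, so that no sample ends up in both sets.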

To have an idea of what will happen in the real world, we use a test set.

The test set is completely apart from training; it must be something the model has never seen before. At the end of the day, what counts is how we perform in the real world, not in school.

2.3 Randomization

In each round (or epoch), shuffling the Training and Validation samples helps the network not to converge prematurely to some particular solution. This is because the neural network uses backpropagation to adapt the weights step by step, so the order in which the samples are presented influences the result.

from sklearn.utils import shuffle

# Shuffle images and labels together so each image keeps its label
X_train_shuff, y_train_shuff = shuffle(X_train_norm, y_out)
X_test_shuff, y_test_shuff = shuffle(X_test_norm, y_test)

2.4 Grayscale

When we are learning something new, one of the first things we need to do is eliminate the less relevant details, to make the amount of information smaller.

This is analogous to converting the images to grayscale. We lose some information, but we make the sets smaller while keeping the relevant information.

import cv2

# Convert a color image to grayscale (COLOR_BGR2GRAY expects BGR channel order)
def grayscale(img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Grayscale images — we can still recognize the symbols

2.5 Dropout Layer

Dropout is an easy way to avoid overfitting. During training, it randomly removes some of the neural connections of a layer.

The idea here is to make the network more robust: even without the whole information, it has to perform well.

An analogy is training to recognize a whole image using only part of it. We can do this easily, for example when we recognize a brand from a partly hidden logo.
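As a rough illustration of the mechanism, here is a minimal sketch of dropout applied to the outputs of a layer. It is not taken from the project code: the keep probability of 0.5 and the function name are assumptions, and the surviving values are scaled by 1/keep_prob (the common "inverted dropout" variant) so that the expected sum stays the same.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob during training."""
    if not training:
        # At prediction time nothing is dropped
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10
print(dropout(layer_output, keep_prob=0.5, rng=rng))
print(dropout(layer_output, keep_prob=0.5, rng=rng, training=False))
```

Because a different random subset is removed at every step, the network cannot rely on any single connection and has to spread the information around.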

2.6 Equilibrium of data

The set of traffic signs presented is imbalanced. There are signs with many samples, and others with few.

It is like someone studying a lot of math, but no history. The school has to teach an even amount of each one.

We can artificially increase the number of samples of the under-represented signs, using image augmentation.

Sets with fewer than 1000 samples will have a 200% increase
Sets with fewer than 2000 samples will have a 100% increase
Sets with more than 2000 samples will have a 20% increase
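The three rules above can be written as a small function. This is only a sketch: the function names and the example class counts are made up, and a class with exactly 2000 samples, which the rules do not cover, is assumed here to get the 20% increase.

```python
def increase_percent(n_samples):
    """Percentage of extra (augmented) samples for a class of the given size."""
    if n_samples < 1000:
        return 200
    if n_samples < 2000:
        return 100
    return 20  # assumption: exactly 2000 samples is treated like "more than 2000"

def extra_samples(n_samples):
    return n_samples * increase_percent(n_samples) // 100

def balanced_counts(class_counts):
    """Class sizes after adding the augmented samples."""
    return {label: n + extra_samples(n) for label, n in class_counts.items()}

counts = {"class_a": 200, "class_b": 1500, "class_c": 2200}  # made-up counts
print(balanced_counts(counts))
```

The classes do not end up with identical sizes, but the gap between the smallest and the largest ones becomes much narrower.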

2.7 Image augmentation

In the case of the traffic sign project, there are some perturbations we can apply to the images to make the model more robust: translation, warping, shadowing and so on. They simulate the different conditions found on the road: sun, night, snow, rain.

These random modifications were made:

Translation of at most 2 pixels
Rotation of at most 15 degrees
Small warp perspective (2 pixels of difference)

Here are some examples.

This is also very common in real life. One speech training technique is to speak with a pen in the mouth, to make speaking more difficult. The Barcelona team trains on a smaller field, or with fewer players than the opposing team, to make the main team better. Or Rocky, the boxer, who trains in the snow, in the sun, in the rain!
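Of the random modifications listed in this section, the translation is the simplest to sketch. The code below is an illustration only, not the project code: it shifts a grayscale image, stored as a list of rows, by a random offset of at most 2 pixels in each direction and fills the uncovered border with zeros. The function names and the zero fill are assumptions. Rotation and perspective warp are not covered here.

```python
import random

def translate(img, dx, dy, fill=0):
    """Shift a 2D image by dx columns and dy rows, padding with `fill`."""
    height, width = len(img), len(img[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < height and 0 <= src_c < width:
                out[r][c] = img[src_r][src_c]
    return out

def random_translate(img, rng, max_shift=2):
    """Translate by a random offset of at most max_shift pixels per axis."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(img, dx, dy)

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(translate(img, dx=1, dy=0))
print(random_translate(img, random.Random(0)))
```

Each augmented copy keeps the label of the original image, since a shift of a couple of pixels does not change which sign it shows.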

3. Neural Network Architecture

To guide us in this project, we are advised to use TensorFlow with the LeNet architecture as a starting point. LeNet is a neural network design by Yann LeCun (http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf)

LeNet is only a starting point. We can change the number of layers, the width of the layers, and so on. My final configuration uses two convolutional layers, followed by a flatten layer and three fully connected layers:

Layer 1: Convolutional. Input = 32x32x1. Output = 28x28x20.
Activation: tanh. Pooling: Input = 28x28x20. Output = 14x14x20.
Layer 2: Convolutional. Input = 14x14x20. Output = 10x10x48.
Activation: tanh. Pooling: Input = 10x10x48. Output = 5x5x48.
Flatten. Input = 5x5x48. Output = 1200.
Layer 3: Fully Connected. Input = 1200. Output = 120, Activation: tanh
Layer 4: Fully Connected. Input = 120. Output = 84, Activation: tanh
Layer 5: Fully Connected. Input = 84. Output = 43 (one hot encoding).
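The shapes in the list above can be checked with a little arithmetic. The sketch below assumes 5x5 kernels with stride 1 and no padding for both convolutions (the first matches the TensorFlow snippet in this article, the second is inferred from 14 shrinking to 10) and 2x2 pooling with stride 2 (inferred from the halved sizes). The helper names are hypothetical.

```python
def conv_valid(size, kernel, stride=1):
    """Output side length of a convolution without padding ('VALID')."""
    return (size - kernel) // stride + 1

def pool(size, window=2, stride=2):
    """Output side length of a pooling layer."""
    return (size - window) // stride + 1

def lenet_shapes(side=32):
    """Shapes after each stage of the network described above."""
    shapes = []
    side = conv_valid(side, 5)            # Layer 1: 32 -> 28
    shapes.append((side, side, 20))
    side = pool(side)                     # Pooling: 28 -> 14
    shapes.append((side, side, 20))
    side = conv_valid(side, 5)            # Layer 2: 14 -> 10
    shapes.append((side, side, 48))
    side = pool(side)                     # Pooling: 10 -> 5
    shapes.append((side, side, 48))
    shapes.append((side * side * 48,))    # Flatten: 5 * 5 * 48 = 1200
    return shapes

for shape in lenet_shapes():
    print(shape)
```

The flattened size of 1200 is what the first fully connected layer receives as input.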

The first two layers are convolutional layers. Each one uses 2D filters because we’re working with an image, which has two dimensions (and I used grayscale, to ignore the third dimension).

Here is my interpretation of how convolution works. It is like a filter sliding over every position of the image. Because it is a linear filter, the resulting value will be greater when there is an exact match between the image and the filter weights, and zero if they are not correlated at all. It is equivalent to trying to find small features (or better, kernels) of the image. We try to find one piece of the puzzle at a time: the eyes, nose, ears, hair, and so on.

Source http://stats.stackexchange.com/questions/116362/what-does-the-convolution-step-in-a-convolutional-neural-network-do
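The sliding filter described above can be sketched in a few lines of plain Python. This is an illustration, not the TensorFlow implementation: the function name, the tiny image and the edge-detecting kernel are made up for the example. As in most neural network libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def slide_filter(image, kernel):
    """Slide the kernel over every valid position and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Dark on the left, bright on the right: a vertical edge in the middle
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Kernel that responds to a dark-to-bright transition
kernel = [[-1, 1],
          [-1, 1]]

response = slide_filter(image, kernel)
print(response)
```

The response is largest exactly where the patch under the filter looks like the kernel (the edge), and zero over the flat regions.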

The original LeCun paper even mentions using kernels of different sizes. In the puzzle analogy, one type of filter looks for larger pieces, and another for smaller pieces.

The first layer extracts these direct features, while the later layers are levels of abstraction over the first layer. First we try to identify each piece of the puzzle, then we group the pieces to form a larger piece, then we group these groups of pieces.

Finally, the Flatten layer transforms the image into a single vector of information. From this point on, the dense layers work as a usual neural network.

Since there’s no way to tell beforehand the best architecture for the neural network (or at least, I do not know one), the definition of metaparameters and architecture is mostly empirical: number of layers of each type, activation function, number of neurons in each layer. It is quite time-consuming, since there are infinitely many combinations. We also have to take care not to overfit the network, since it would then make wrong predictions on new data.

The complete code is on GitHub. But here is one example of a convolutional layer in TensorFlow:

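import tensorflow as tf  # TensorFlow 1.x API

# Layer 1: Convolutional. Input = 32x32x1. Output = 28x28x20.
# mu, sigma and the input tensor x are assumed to be defined earlier.
w1 = tf.Variable(tf.truncated_normal([5, 5, 1, 20], mean=mu, stddev=sigma))  # 5x5 kernels, 1 input channel, 20 filters
b1 = tf.Variable(tf.zeros(20))

# Stride 1 with 'VALID' padding: 32 - 5 + 1 = 28
l1_conv = tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='VALID', name='l1_conv') + b1

# Activation.
l1_act = tf.nn.tanh(l1_conv, name='l1_act')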

Fully connected layer:

# Layer 4: Fully Connected. Input = 120. Output = 84.
w4 = tf.Variable(tf.truncated_normal([120, 84], mean=mu, stddev=sigma))
b4 = tf.Variable(tf.zeros(84))

# Linear part: l3 is the output of Layer 3
l4 = tf.add(tf.matmul(l3, w4), b4, name='l4')

# Activation.
l4 = tf.nn.tanh(l4)

4. Results

The final model had a validation accuracy of 96.0% and a test accuracy of 92.6%.

The accuracy could be improved by better tuning of the parameters (optimizer, epochs, activation function, number of layers, architecture of the solution), or by using another color space and more image augmentation. At the time I did this work, I didn’t have CUDA installed. CUDA would make the training process much faster, as shown in project 3 of this course (Teaching a car to drive itself).

The pipeline was written in a Jupyter notebook, and can be found at:

https://github.com/asgunzi/CarND-Traffic-Sign-Classifier-Project

5. Conclusions

Since our computers are (still) millions of times dumber than us, they need hundreds of thousands of samples to generalize, as shown before, while we humans need just a few.

To recognize traffic signs, we need only one image for each case. Or not. This example does not take into account that we live for decades, and that our brains are recognizing images all day, every day.

Truly, we are trained on billions of images that feed our internal neural network.

For automatic image recognition, one bottleneck is data: high-quality data, in large quantities.

Filming the road with a dashcam to get a large number of traffic signs is not enough, because I would have to isolate each image and then label it correctly. That is a huge amount of manual work!

One relatively cheap way to label images is to use a service like Amazon Mechanical Turk. It works by sending the images to several people around the world, who recognize and label the images in their spare time, for a few cents each.

The second bottleneck is the neural network. The advent of TensorFlow, Keras and powerful computers has allowed us to do tasks that would have been impossible a few years ago. But there is still a lot to do: better neural network architectures, automatic tuning of them (today it is basically trial and error on metaparameters), how to make them more robust, and how to make better use of the trained networks we already have.

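Term 1 projects:

Advanced Lane Finding (https://chatbotslife.com/advanced-lane-line-project-7635ddca1960)
Teaching a car to drive itself (https://chatbotslife.com/teaching-a-car-to-drive-himself-e9a2966571c5)
Vehicle detection and tracking (https://chatbotslife.com/vehicle-detection-and-tracking-using-computer-vision-baea4df65906#.9hsuqtv9c)
Traffic Sign Classifier (https://readmedium.com/machine-versus-human-learning-in-traffic-sign-classification-2819e49e5e9#.5ekgw81ln)
Review on Udacity Term 1 (https://readmedium.com/the-udacity-self-driving-car-nanodegree-term-1-384b75dcb987#.pckv4ce3e)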

Other writings: https://medium.com/@arnaldogunzi

Main blog: https://ideiasesquecidas.com/

Written while listening to Asa Branca — Luiz Gonzaga.