M2M Day 191: Deconstructing a self-driving car model (based on my current knowledge)
This post is part of Month to Master, a 12-month accelerated learning project. For May, my goal is to build the software part of a self-driving car.
Now that I have working self-driving car code (see the video from yesterday), over the next few days, I plan to deconstruct the code and try to understand exactly how it works.
Today, I’ll be looking specifically at “the model”, which can be considered the meat of the code: The model defines how input images are converted into steering instructions.
I don’t have too much time today, so I won’t be describing fully how the code works (since I don’t yet know and still need to do plenty of research). Instead, I’ll make some hypotheses about what the lines of code might mean and then document the open questions that I’ll need to further research.
This will set me up to learn the material in a structured way.
Here’s the code for the self-driving model in its entirety. It’s only 50 lines of code plus comments and spaces (which is pretty nuts, since it’s driving a car and stuff…)
import tensorflow as tf
import scipy

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides=[1, stride, stride, 1], padding='VALID')

x = tf.placeholder(tf.float32, shape=[None, 66, 200, 3])
y_ = tf.placeholder(tf.float32, shape=[None, 1])

x_image = x

#first convolutional layer
W_conv1 = weight_variable([5, 5, 3, 24])
b_conv1 = bias_variable([24])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1, 2) + b_conv1)

#second convolutional layer
W_conv2 = weight_variable([5, 5, 24, 36])
b_conv2 = bias_variable([36])

h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 2) + b_conv2)

#third convolutional layer
W_conv3 = weight_variable([5, 5, 36, 48])
b_conv3 = bias_variable([48])

h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 2) + b_conv3)

#fourth convolutional layer
W_conv4 = weight_variable([3, 3, 48, 64])
b_conv4 = bias_variable([64])

h_conv4 = tf.nn.relu(conv2d(h_conv3, W_conv4, 1) + b_conv4)

#fifth convolutional layer
W_conv5 = weight_variable([3, 3, 64, 64])
b_conv5 = bias_variable([64])

h_conv5 = tf.nn.relu(conv2d(h_conv4, W_conv5, 1) + b_conv5)

#FCL 1
W_fc1 = weight_variable([1152, 1164])
b_fc1 = bias_variable([1164])

h_conv5_flat = tf.reshape(h_conv5, [-1, 1152])
h_fc1 = tf.nn.relu(tf.matmul(h_conv5_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

#FCL 2
W_fc2 = weight_variable([1164, 100])
b_fc2 = bias_variable([100])

h_fc2 = tf.nn.relu(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

h_fc2_drop = tf.nn.dropout(h_fc2, keep_prob)

#FCL 3
W_fc3 = weight_variable([100, 50])
b_fc3 = bias_variable([50])

h_fc3 = tf.nn.relu(tf.matmul(h_fc2_drop, W_fc3) + b_fc3)

h_fc3_drop = tf.nn.dropout(h_fc3, keep_prob)

#FCL 4
W_fc4 = weight_variable([50, 10])
b_fc4 = bias_variable([10])

h_fc4 = tf.nn.relu(tf.matmul(h_fc3_drop, W_fc4) + b_fc4)

h_fc4_drop = tf.nn.dropout(h_fc4, keep_prob)

#Output
W_fc5 = weight_variable([10, 1])
b_fc5 = bias_variable([1])

y = tf.mul(tf.atan(tf.matmul(h_fc4_drop, W_fc5) + b_fc5), 2)

Line-by-line commentary
Now, I’ll work through the code in chunks and describe what I think each chunk means/does.
import tensorflow as tf
import scipy

The first two lines are straightforward.
We’re importing the TensorFlow library (which we will refer to as “tf” elsewhere in the code) and the SciPy library. TensorFlow is a Python library written by Google, which will help abstract away most of the ground-level machine learning implementations. SciPy will help with the math stuff.
Nothing new to learn here.
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

Okay, so here I think we are defining two helper functions, which basically means we can use the notion of “weight_variable” and “bias_variable” elsewhere in our code without having to redefine them every single time.
In machine learning, the function we are trying to solve is typically represented as Wx+b = y, where we are given x (the list of input images) and y (the list of corresponding steering instructions), and want to find the best combination of W and b to make the equation balance.
W and b aren’t actually single numbers, but instead collections of coefficients. These collections are multidimensional and the size of these collections corresponds to the number of nodes in the machine learning network. (At least, this is how I understand it right now).
So, in the above code, weight_variable creates W and bias_variable creates b, in the generalized sense.
These functions take an input called “shape”, which basically defines the dimensionality of W and b.
The W values are initialized with a function called “truncated_normal”. I’m pretty sure this means that when a collection of W’s is initially created, the values of the individual coefficients are randomly assigned based on the normal distribution (i.e. a bell curve) with a standard deviation of 0.1. The standard deviation more or less defines how spread out we want the initial coefficients to be. The b values, by contrast, are not random: they all start at a constant 0.1.
So, surprisingly, I think I mostly understand this code. At first glance, I wasn’t sure what was going on, but writing this out helped me collect my thoughts.
What I still need to learn: I need to learn more about the Wx + b = y structure, why it is used, how it works, etc., but I understand the fundamentals of the code.
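To make the Wx + b = y idea a bit more concrete, here is a rough plain-Python sketch (not TensorFlow, and not the model's actual code). The helper names `truncated_normal` and `affine` and the toy sizes are made up for illustration, and the sketch assumes that "truncated" means redrawing any sample that lands more than two standard deviations from the mean.

```python
import random

def truncated_normal(n, stddev=0.1, seed=0):
    """Draw n values from a normal distribution centered on 0, redrawing
    any value that falls more than 2 standard deviations from the mean."""
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        v = rng.gauss(0.0, stddev)
        if abs(v) <= 2 * stddev:
            values.append(v)
    return values

def affine(W, x, b):
    """Compute y = Wx + b for a weight matrix W (a list of rows),
    an input vector x, and a bias vector b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Hypothetical toy sizes: 3 inputs, 2 outputs.
flat = truncated_normal(6)
W = [flat[0:3], flat[3:6]]
b = [0.1, 0.1]            # same constant the post's bias_variable uses
x = [1.0, 2.0, 3.0]
y = affine(W, x, b)
print(y)
```

Training would then nudge the entries of W and b so that y gets closer to the recorded steering values. The sketch only shows the starting point.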
def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides=[1, stride, stride, 1], padding='VALID')

I believe that this conv2d thing is a function that performs a kernel convolution on some input. Kernel convolutions are a more general class of the image manipulations I learned about a few days ago.
As far as I understand, a kernel convolution manipulates the image to highlight some characteristic of that image, whether that is the image’s edges, corners, etc.
This particular characteristic is defined by “the kernel”, which seems to be defined using strides=[1, stride, stride, 1] from above, although I don’t know what strides means or exactly how this works.
It seems like there are three inputs to this image manipulation function: 1. The kernel/strides (to say how to manipulate the image); 2. x (which is the image itself); and 3. W (which I guess is a set of coefficients that are used to blend different image manipulations together in some capacity).
I have to learn more about W’s role in all of this.
At a high-level though, this function is manipulating the image in some way to automatically reduce the image into distinct features that are more conducive to training the model.
What I still need to learn: How exactly is the convolution function being defined mathematically, and how does W play a role in this?
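One way to start on this question: going by the TensorFlow documentation for tf.nn.conv2d, the second argument (W here) is the filter, i.e. the kernel itself, and strides only sets how far the window moves between applications. Below is a minimal plain-Python sketch of that idea for a single-channel image. The function name `conv2d_valid`, the toy image, and the 2x2 kernel are all invented for illustration. Like TensorFlow, the sketch does not flip the kernel.

```python
def conv2d_valid(image, kernel, stride):
    """Slide `kernel` over `image` (both lists of rows) with the given
    stride and no padding ('VALID'), summing elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for top in range(0, len(image) - kh + 1, stride):
        row = []
        for left in range(0, len(image[0]) - kw + 1, stride):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

# Toy 4x4 "image" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hypothetical 2x2 kernel that responds to left-to-right brightness jumps.
kernel = [
    [-1, 1],
    [-1, 1],
]

print(conv2d_valid(image, kernel, stride=1))  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
print(conv2d_valid(image, kernel, stride=2))  # [[0, 0], [0, 0]]
```

With a stride of 1 the edge shows up as the 18s in the middle column. With a stride of 2 the window jumps right over the edge and the output is smaller, which suggests that the stride trades detail for a more compact result.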
x = tf.placeholder(tf.float32, shape=[None, 66, 200, 3])
y_ = tf.placeholder(tf.float32, shape=[None, 1])

x_image = x

These next few lines seem pretty straightforward. Once again, we refer back to the equation Wx + b = y.
Here we are essentially defining placeholders for the x and y variables. These placeholders set the variables’ dimensions (remember: these variables represent a collection of values, not just a single number).
We are setting up x to expect to receive an image of certain dimensions, and we are setting up y to expect a single number as an output (i.e. the steering angle).
We then rename x to “x_image” to remind ourselves that x is an image, because… why not.
Nothing new to learn here.
#first convolutional layer
W_conv1 = weight_variable([5, 5, 3, 24])
b_conv1 = bias_variable([24])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1, 2) + b_conv1)

Okay, we are now onto our first convolutional layer.
We define W_conv1, which is just a specific instance of the weight_variable I explained above (with the shape [5, 5, 3, 24]). I’m not sure how or why the shape was set in this particular way.
We then define b_conv1, which is just a specific instance of the bias_variable I explained above (with the shape [24]). This 24 likely needs to match the 24 from the W_conv1 shape, but I’m not sure why (other than this is going to help make the matrix multiplication work).
h_conv1 is an intermediate object that applies the convolution function to the inputs x_image and W_conv1, adds b_conv1 to the output of the convolution, and then processes everything through a function called relu.
This relu thing sounds familiar, but I can’t remember exactly what it does. My guess is that it’s some kind of “squasher” or normalizing function, that smooths everything out in some capacity, whatever that means. I’ll have to look into it.
While I can read most of the code, I’m not exactly sure why a “convolutional layer” is set up in this way.
What I still need to learn: What is a convolutional layer, what is it supposed to do, and how does it do it?
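Two pieces of this chunk can be checked with a few lines of plain Python. ReLU (rectified linear unit) replaces negative values with zero and passes everything else through unchanged, and TensorFlow documents the conv2d filter shape as [filter_height, filter_width, in_channels, out_channels]. The sketch below is only an illustration. The `relu` helper and the variable names are made up and are not part of the model code.

```python
def relu(values):
    """Replace every negative value with 0 and leave the rest unchanged."""
    return [max(0.0, v) for v in values]

# Shape of W_conv1 from the post, read as
# [filter_height, filter_width, in_channels, out_channels].
filter_height, filter_width, in_channels, out_channels = [5, 5, 3, 24]
num_weights = filter_height * filter_width * in_channels * out_channels
num_biases = out_channels  # one bias per output channel, hence shape [24]

print(relu([-2.0, -0.5, 0.0, 0.5, 2.0]))  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(num_weights, num_biases)            # 1800 24
```

Read this way, the first layer is 24 separate 5x5 filters applied across the 3 color channels, and the 24 in b_conv1 matches because each filter gets its own bias.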
#second convolutional layer
W_conv2 = weight_variable([5, 5, 24, 36])
b_conv2 = bias_variable([36])

h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 2) + b_conv2)

#third convolutional layer
W_conv3 = weight_variable([5, 5, 36, 48])
b_conv3 = bias_variable([48])

h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 2) + b_conv3)

#fourth convolutional layer
W_conv4 = weight_variable([3, 3, 48, 64])
b_conv4 = bias_variable([64])

h_conv4 = tf.nn.relu(conv2d(h_conv3, W_conv4, 1) + b_conv4)

#fifth convolutional layer
W_conv5 = weight_variable([3, 3, 64, 64])
b_conv5 = bias_variable([64])

h_conv5 = tf.nn.relu(conv2d(h_conv4, W_conv5, 1) + b_conv5)

We proceed to have four more convolutional layers, which function in the exact same way as the first layer, but instead of using x_image as an input, they use the output from the previous layer (i.e. the h_conv thing).
I’m not sure how we decided to use five layers and how and why the shapes of each W_conv are different.
What I still need to learn: Why five layers, and how do we pick the shapes for each?
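The shapes are at least partly constrained by arithmetic. With 'VALID' padding (no padding), each convolution shrinks the image to floor((size - kernel) / stride) + 1 along each dimension. The sketch below traces the 66x200 input through the five layers using the kernel sizes, strides, and channel counts from the code above. The helper name `valid_output_size` is made up for this illustration.

```python
def valid_output_size(size, kernel, stride):
    """Output length along one dimension for a 'VALID' (unpadded) convolution."""
    return (size - kernel) // stride + 1

# (kernel size, stride, output channels) for the five layers in the post.
layers = [(5, 2, 24), (5, 2, 36), (5, 2, 48), (3, 1, 64), (3, 1, 64)]

height, width = 66, 200  # input image size from the x placeholder
for kernel, stride, channels in layers:
    height = valid_output_size(height, kernel, stride)
    width = valid_output_size(width, kernel, stride)
    print(height, width, channels)

flat_size = height * width * channels
print(flat_size)  # 1152
```

The final feature map comes out as 1 x 18 x 64, and 1 * 18 * 64 = 1152, which matches the 1152 used when h_conv5 is flattened for the first fully connected layer. This explains why the shapes have to line up, though not why these particular sizes were chosen in the first place.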
#FCL 1
W_fc1 = weight_variable([1152, 1164])
b_fc1 = bias_variable([1164])

h_conv5_flat = tf.reshape(h_conv5, [-1, 1152])
h_fc1 = tf.nn.relu(tf.matmul(h_conv5_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

#FCL 2
W_fc2 = weight_variable([1164, 100])
b_fc2 = bias_variable([100])

h_fc2 = tf.nn.relu(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

h_fc2_drop = tf.nn.dropout(h_fc2, keep_prob)

#FCL 3
W_fc3 = weight_variable([100, 50])
b_fc3 = bias_variable([50])

h_fc3 = tf.nn.relu(tf.matmul(h_fc2_drop, W_fc3) + b_fc3)

h_fc3_drop = tf.nn.dropout(h_fc3, keep_prob)

#FCL 4
W_fc4 = weight_variable([50, 10])
b_fc4 = bias_variable([10])

h_fc4 = tf.nn.relu(tf.matmul(h_fc3_drop, W_fc4) + b_fc4)

h_fc4_drop = tf.nn.dropout(h_fc4, keep_prob)

Next, we have four FCLs, which I believe stands for “Fully Connected Layers”.
The setup for these layers seems similar to the convolution steps, but I’m not exactly sure what’s happening here. I think this is just vanilla neural network stuff (which I write as if to pretend I fully understand “vanilla neural network stuff”).
Anyway, I’ll have to look more into this.
What I still need to learn: What is a FCL, and what is happening in each FCL step?
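As a starting point, each FCL step in the code has the same pattern: multiply the input by W, add b, apply relu, then apply dropout. Here is a rough plain-Python sketch of one such step on toy numbers. The helper names `dense_relu` and `dropout` and all the values are made up for illustration. The dropout helper follows the behavior TensorFlow documents for tf.nn.dropout with a keep probability: each value is kept with probability keep_prob and scaled up by 1 / keep_prob, otherwise set to zero.

```python
import random

def dense_relu(x, W, b):
    """One fully connected step: relu(x . W + b), where W has one row
    per input and one column per output (like tf.matmul(x, W))."""
    outputs = []
    for j in range(len(b)):
        total = sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        outputs.append(max(0.0, total))
    return outputs

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob (scaled by 1 / keep_prob),
    otherwise replace it with 0."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

# Hypothetical toy layer: 2 inputs feeding 3 outputs.
x = [1.0, 2.0]
W = [[1.0, -1.0, 0.5],
     [0.5, -1.0, 0.0]]
b = [0.1, 0.1, 0.1]

rng = random.Random(0)
h = dense_relu(x, W, b)
print(h)
print(dropout(h, 0.5, rng))   # some values zeroed, survivors doubled
print(dropout(h, 1.0, rng))   # keep_prob of 1.0 leaves h unchanged
```

"Fully connected" here means every output is computed from every input, unlike a convolution, which only looks at a small window at a time. Dropout is typically used only during training (with keep_prob below 1) and switched off with keep_prob = 1.0 when the model is actually making predictions.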
#Output
W_fc5 = weight_variable([10, 1])
b_fc5 = bias_variable([1])

y = tf.mul(tf.atan(tf.matmul(h_fc4_drop, W_fc5) + b_fc5), 2)

Finally, we take the outputs of the final FCL layer, do some crazy trigonometric manipulations and then output y, the predicted steering angle.
This step seems to just be “making the math work out”, but I’m not sure.
What I still need to learn: How and why is the output being calculated in this way?
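One property of this output line can be checked directly: atan maps any number into the range (-π/2, π/2), so multiplying by 2 keeps y strictly between -π and π no matter how large the value coming out of the last layer is. The sketch below uses Python's math module to show this. The function name `steering_output` is made up, and whether bounding the angle (presumably in radians) is the real reason for this design is an assumption.

```python
import math

def steering_output(z):
    """Mirror the post's output line: y = 2 * atan(z)."""
    return 2 * math.atan(z)

# Even extreme inputs stay inside (-pi, pi).
for z in [-1000.0, -1.0, 0.0, 1.0, 1000.0]:
    print(z, steering_output(z))
```

Near zero the function is almost a straight line (y is roughly 2z), so small values pass through nearly untouched while extreme values get squashed toward the limits.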
Done.
That took longer than expected — mostly because I was able to parse more than I expected.
It’s sort of crazy how much of the implementation has been abstracted away by the TensorFlow library, and how little of the underlying math knowledge is necessary to build a fully capable self-driving car model.
It seems like the only important thing for us to know as model constructors is how to set the depth (e.g. number of layers) of the model, the shapes of each layer, and the types of the layers.
My guess is that this might be more of an art than science, but likely an educated art.
I’ll start digging into my open questions tomorrow.
