Free AI web copilot to create summaries, insights and extended knowledge, download it at here

Abstract

ious sentences were confusing to you, all you need to know is when you take a derivative, that value is telling you what direction to take to get to the highest point on your graph. However, we want to get to the lowest point, because our graph’s y-axis is error and we want to minimize the error, we will go to the opposite direction of the gradient by taking a negative value of the gradient.<figure id="943a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*vfglxAeL7CUhfG95vhVdLw.png"><figcaption><a href="https://www.youtube.com/watch?v=b4Vyma9wPHo">https://www.youtube.com/watch?v=b4Vyma9wPHo</a></figcaption></figure>You start somewhere on the graph, and you keep going the opposite direction of the gradient in order to reach the minimum. You can see the gradient descent in python implementation below.<figure id="8f47"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ip_HlJZZejJVEm8NkGoSlQ.png"><figcaption></figcaption></figure><h1 id="7ecf">Chain Rule</h1>In the example above, there was only one weight, which is not common in the real world. Let’s go over an example with multiple weights and how chain rule is applied to calculate the derivatives.We will represent weights and loss function as a graph. Please note that “a” and “b” represent weights and “f” represents a loss function we want to minimize. We will see how adjusting the weights affects the output using the chain rule.<figure id="e858"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*AhJ2SS6MptCRkq0_tz6pLA.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>The gradient of the loss with respect to the weights can be represented in two ways, because there are two weights. If you were to define a function for the above forward pass, it’d be like this:<figure id="bdb9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*fXbBl8yLKYj38D4y1TqXIA.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>Let’s start with calculating dL/da = df/da. The problem is the loss function f does not know about a. Every node is only aware of its adjacent nodes.<figure id="51d1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*tREDSx4n2ddslOygsuO6eA.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>Therefore, in order to calculate df/da, you need to use the <a href="https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-1a/a/chain-rule-review">chain rule</a>. Simply put, the chain rule is used when you have composite functions.<figure id="14c2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*EqcwwY7S6SPlCedG0ZXcLQ.gif"><figcaption><a href="https://i.pinimg.com/originals/c2/f1/a3/c2f1a350856c3ee7c66fdcec1c66b3f3.gif">https://i.pinimg.com/originals/c2/f1/a3/c2f1a350856c3ee7c66fdcec1c66b3f3.gif</a></figcaption></figure>Since df/da cannot be calculated directly from the f node, which does not know about the node a, you’ll do df/da = df/dc * dc/da.Now, let us calculate dL/db = df/db. Since there are two edges coming out of the node b, when we back-propagate, we need to use the sum rule to add two paths to the node b.<figure id="f318"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Zr8EA8wehfVJ5r2lEL7f2w.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>Each path can be calculated by using the chain rule: df/dc * dc/db and df/dd * dd/db and we can sum them at the end.In reality, those weights are vectorized, so all weights are calculated at the same time when we do back-propagation.<h1 id="6a6b">Complexity of Back Propagation</h1>Let’s think about our neural network as a graph. A node represents an operation and an edge r

Options

epresents a weight. In order to calculate gradient for every weight, every edge has to be visited at least once. Therefore, the complexity of back propagation is linear in the number of edges.<h1 id="2133">How to reach the minimum fast?</h1>Sometimes it may take awhile to reach the minimum while backpropagating. There are some tricks/adjustments you can make to reach the minimum fast.<ol><li>Adjust the learning Rates</li></ol>Learning rates are the value you multiply by the gradient to be subtracted from the loss function in order to reach the minimum.<figure id="1dbc"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*xcF9RNNIG7rSoncH2W2qTQ.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>If you make the learning rate small, you can ensure to reach the minimum, but it’d take awhile.<figure id="b35c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*sIug1U01irFjvcyrWBxt0w.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>If your learning rate is too big, you might not be able to reach the minimum, because you can jump over the minimum. So you want to make the learning rate big enough to converge to get close enough to the minimum quickly and adjust it later to be small enough to reach the minimum.2. Adjust the momentumWith the use of momentum, the gradient of the past steps are accumulated instead of using only the gradient of the current step to guide the search.<figure id="96d7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*RLI8jGv8uFdRjUk3xSD9tQ.png"><figcaption><a href="https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/">https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/</a></figcaption></figure>The momentum is used for the following reason. In the diagram above, consider the point A. If we were to sum those vectors up, the opposite directions cancel out, so the components along the w1 direction cancels out, but the w2 direction increases, which is the ideal path we want to make towards the optimum.3. Change the batch sizeBatch is the entire training set to compute gradient. Instead of using the entire batch, you can either choose random training samples at a time (stochastic) or use a small batch of data at a time (mini-batch).<figure id="5ef8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*AqG5NPLN_uGFTg_pSN5jpw.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>Stochastic method significantly decreases the time of computing gradient descent; however, it’s using only one example at a time, so its path to an optimum is noisier and can be more random than that of the batch gradient. But it doesn’t have a huge impact as we don’t care what kind of path this method is taking (only for now) as long as the minimum point has been reached with a shorter training time.<h1 id="58ea">Local and Global Minimum</h1>If we have multiple minima, we want to reach the global minimum. However, it is possible that we can get trapped in a local minimum like the diagram below.<figure id="f691"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*jddMSRiR1uVebViR6NwJAg.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure>In that case, the best way for you to get out of the local minimum is restart randomly. A hack to get out of a local minimum is to do random restart, which raises the probability of reaching a minimum somewhere else.<figure id="2498"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*RWNGeMWmyc0TzAd8c4IXXA.png"><figcaption><a href="https://twitter.com/random_forests">https://twitter.com/random_forests</a></figcaption></figure></article></body>

Back-propagation Demystified in 7 Minutes

In the past 10 years, the best-performing machine learning implementations such as the facial recognition and the speech recognizers have resulted from a technique called “deep learning.” Deep learning is composed of neural networks, which are a means to perform some task by analyzing training examples. How do neurons of the neural networks learn from training examples?

There are three steps.

Let your neural network guess the output.
Evaluate how well your network did.
Modify/Teach your neural network based on the evaluation from step 2. a.k.a. backpropagation

Step 1. Let your NN guess by forward pass.

When you make prediction using neural networks,

In order to train a neural network, you need to evaluate how good your prediction was. That’s when a loss function comes in to play.

Step 2. Evaluate how well your network did.

Loss functions compare what the output is supposed to be with the prediction that was made from the neural network. One of the most common examples of loss functions would be a mean squared error (MSE). It calculates the difference between the truth and prediction and square it. The reason that you square the result is to make the negative difference not relevant.

If you have multiple predictions and truths, this is what you do.

In this case,

your loss function would give you the error of 36. You want to make your mean squared error as low as possible. Ideally, you want to make it 0, but we’d be satisfied with any value that is close enough to it.

Step 3. Modify/Teach your neural network

Your error value is somewhere on the graph below. You might have to either increase weight or decrease weight for you to reach the minimum error possible. This step is called back-propagation. In order to achieve this goal of reaching the minimum error, there are multiple ways you can take. I will discuss two popular approaches.

https://www.edureka.co/blog/backpropagation/

Approach 1. Take a derivative and set it equal to 0 and solve for it

In calculus class, we learned that in order to reach the optimal point, you can take a derivative of the function with respect to the input, set it equal to 0 and solve for it.

By solving for w, we get w = truth/input. In this case, the truth was 12, the input was 3, so our w becomes 4. We finally taught our neural network!

This approach is simple and fast but it doesn’t always work because most loss functions don’t have closed form derivatives.

Approach 2. Gradient Descent

Gradient in calculus is a collective of partial derivatives. And it’s proven that gradient is the direction of the steepest ascent. If those previous sentences were confusing to you, all you need to know is when you take a derivative, that value is telling you what direction to take to get to the highest point on your graph. However, we want to get to the lowest point, because our graph’s y-axis is error and we want to minimize the error, we will go to the opposite direction of the gradient by taking a negative value of the gradient.

https://www.youtube.com/watch?v=b4Vyma9wPHo

You start somewhere on the graph, and you keep going the opposite direction of the gradient in order to reach the minimum. You can see the gradient descent in python implementation below.

Chain Rule

In the example above, there was only one weight, which is not common in the real world. Let’s go over an example with multiple weights and how chain rule is applied to calculate the derivatives.

We will represent weights and loss function as a graph. Please note that “a” and “b” represent weights and “f” represents a loss function we want to minimize. We will see how adjusting the weights affects the output using the chain rule.

The gradient of the loss with respect to the weights can be represented in two ways, because there are two weights. If you were to define a function for the above forward pass, it’d be like this:

Let’s start with calculating dL/da = df/da. The problem is the loss function f does not know about a. Every node is only aware of its adjacent nodes.

Therefore, in order to calculate df/da, you need to use the chain rule. Simply put, the chain rule is used when you have composite functions.

https://i.pinimg.com/originals/c2/f1/a3/c2f1a350856c3ee7c66fdcec1c66b3f3.gif

Since df/da cannot be calculated directly from the f node, which does not know about the node a, you’ll do df/da = df/dc * dc/da.

Now, let us calculate dL/db = df/db. Since there are two edges coming out of the node b, when we back-propagate, we need to use the sum rule to add two paths to the node b.

Each path can be calculated by using the chain rule: df/dc * dc/db and df/dd * dd/db and we can sum them at the end.

In reality, those weights are vectorized, so all weights are calculated at the same time when we do back-propagation.

Complexity of Back Propagation

Let’s think about our neural network as a graph. A node represents an operation and an edge represents a weight. In order to calculate gradient for every weight, every edge has to be visited at least once. Therefore, the complexity of back propagation is linear in the number of edges.

How to reach the minimum fast?

Sometimes it may take awhile to reach the minimum while backpropagating. There are some tricks/adjustments you can make to reach the minimum fast.

Adjust the learning Rates

Learning rates are the value you multiply by the gradient to be subtracted from the loss function in order to reach the minimum.

If you make the learning rate small, you can ensure to reach the minimum, but it’d take awhile.

If your learning rate is too big, you might not be able to reach the minimum, because you can jump over the minimum. So you want to make the learning rate big enough to converge to get close enough to the minimum quickly and adjust it later to be small enough to reach the minimum.

2. Adjust the momentum

With the use of momentum, the gradient of the past steps are accumulated instead of using only the gradient of the current step to guide the search.

https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/

The momentum is used for the following reason. In the diagram above, consider the point A. If we were to sum those vectors up, the opposite directions cancel out, so the components along the w1 direction cancels out, but the w2 direction increases, which is the ideal path we want to make towards the optimum.

3. Change the batch size

Batch is the entire training set to compute gradient. Instead of using the entire batch, you can either choose random training samples at a time (stochastic) or use a small batch of data at a time (mini-batch).

Stochastic method significantly decreases the time of computing gradient descent; however, it’s using only one example at a time, so its path to an optimum is noisier and can be more random than that of the batch gradient. But it doesn’t have a huge impact as we don’t care what kind of path this method is taking (only for now) as long as the minimum point has been reached with a shorter training time.

Local and Global Minimum

If we have multiple minima, we want to reach the global minimum. However, it is possible that we can get trapped in a local minimum like the diagram below.

In that case, the best way for you to get out of the local minimum is restart randomly. A hack to get out of a local minimum is to do random restart, which raises the probability of reaching a minimum somewhere else.