Mazen Ahmed



Understanding Gradient Descent

With video explanation | Data Series | Episode 4.2

This article plans to expand on episode 4.1, explaining Gradient Descent and how it is used to minimise our cost function in Linear Regression. Knowledge of derivatives and partial derivatives will be helpful.

Linear Regression Recap

From the previous episode we calculated the regression line for our humidity and temperature data to be:

We obtained this from the minimum of the cost function, shown in the graph below.

The algorithm we use in order to obtain the parameter values that give this minimum cost is called gradient descent.

Overview

The idea of gradient descent is that we start at a random point on our cost function graph, for example here:

We then use partial derivatives to make our way down to the minimum.

We then look at what parameter values produce this minimum cost and use that in our regression line.

Gradient Descent in 2 Dimensions

Let's take a look at a simplified version of gradient descent, with just one parameter θ₁ (where θ₀ = 0), to get the general idea of what is happening.

We plot the cost function J(θ₁) and see how it changes according to θ₁. Please see episode 4.1 for how this cost function is derived.

The derivative of our function J(θ₁) is shown in orange.

  • This essentially gives the slope (or gradient) of our cost function at any value of θ₁.
  • For example, as shown in dark orange, at θ₁ = 2, our cost function has a slope of -1.
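
To make the idea of "slope at a point" concrete, here is a minimal sketch in Python. The cost function below is an assumption made purely for illustration: J(θ₁) = (θ₁ − 2.5)² is a hypothetical bowl-shaped curve, not the actual cost function plotted in this episode. It was picked only because its slope at θ₁ = 2 is also -1.

```python
def cost(theta1):
    # Hypothetical cost function chosen for illustration only.
    return (theta1 - 2.5) ** 2

def slope(theta1):
    # Derivative of the hypothetical cost above: dJ/dθ₁ = 2(θ₁ − 2.5).
    return 2 * (theta1 - 2.5)

for theta1 in (2.0, 2.5, 3.0):
    print(f"θ₁ = {theta1}: cost = {cost(theta1)}, slope = {slope(theta1)}")
```

A negative slope tells us the cost falls as θ₁ increases, a positive slope tells us the cost falls as θ₁ decreases, and a slope of zero means we are at the bottom of the bowl.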

Gradient Descent Algorithm with 1 parameter

Gradient descent uses our derivative function to reach our minimum cost with the following algorithm.

Let’s break this down:

  • θ₁ = 0 | Initialises our parameter θ₁
  • Repeat until convergence | Essentially means repeat until we find a minimum point. This is when the derivative d/dθ₁ J(θ₁) = 0 and our value of θ₁ remains roughly the same
  • On each iteration we change the value of θ₁ by subtracting alpha (α), a value we set ourselves, multiplied by our derivative d/dθ₁ J(θ₁).
  • Performing the above operation brings θ₁ closer to our minimum
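
The steps above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the derivative belongs to a hypothetical cost function J(θ₁) = (θ₁ − 2.5)², and the names `gradient_descent`, `tolerance` and `max_iterations` are made up for this sketch. The learning rate of 0.1 is also an arbitrary choice for this toy function.

```python
def derivative(theta1):
    # Slope of a hypothetical cost J(θ₁) = (θ₁ − 2.5)²; an assumption for illustration.
    return 2 * (theta1 - 2.5)

def gradient_descent(alpha, theta1=0.0, tolerance=1e-6, max_iterations=10_000):
    """Repeat θ₁ := θ₁ − α · dJ/dθ₁ until θ₁ stops changing noticeably."""
    for iteration in range(1, max_iterations + 1):
        step = alpha * derivative(theta1)   # alpha multiplied by the slope
        theta1 = theta1 - step              # move downhill
        if abs(step) < tolerance:           # θ₁ barely moved: convergence
            return theta1, iteration
    return theta1, max_iterations

theta1, iterations = gradient_descent(alpha=0.1)
print(f"θ₁ ≈ {theta1:.4f} after {iterations} iterations")
```

Because the slope shrinks as we get closer to the minimum, the steps get smaller and smaller on their own, even though alpha stays fixed.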

We can apply this algorithm to a cost function of any shape to move towards a minimum.

Example:

Eventually, after n iterations, we reach our minimum and find that the value of θ₁ that minimises our cost function is 2.5.

If we try to keep iterating:

Our value for θ₁ remains the same, so we have reached convergence and the algorithm has accomplished its mission.

The Learning Rate Alpha ( α )

In the example above we set the learning rate alpha as 0.5. This value determines “how quickly” we approach our minimum.

If alpha is too small, say 0.0001, it may take a very long time to reach our minimum, which uses a lot of computing power. The upside, however, is that the minimum value found will be very accurate.

If alpha is too large, say 10, we may end up overshooting and missing our minimum point.

So we compromise by choosing a value of alpha between 0.001 and 1. This in general leads to accurate results fairly quickly with relatively little computing power.

One way to choose a good learning rate alpha is to try values that increase by a factor of 10: try 0.001, then 0.01, then 0.1 and finally 1, and narrow down to a learning rate which:

a) Reaches minimum cost efficiently

b) Reaches an accurate minimum
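
As a rough sketch of this trial-and-error search, the snippet below tries each learning rate on a hypothetical cost function J(θ₁) = (θ₁ − 2.5)². The function, the helper name `run`, the stopping tolerance and the "blown up" threshold are all assumptions made for illustration. The results for a real cost function will differ.

```python
def derivative(theta1):
    # Slope of the hypothetical cost J(θ₁) = (θ₁ − 2.5)² (an assumption).
    return 2 * (theta1 - 2.5)

def run(alpha, tolerance=1e-6, max_iterations=100_000):
    """Return (iterations, θ₁) on convergence, or (None, θ₁) if we never settle."""
    theta1 = 0.0
    for iteration in range(1, max_iterations + 1):
        step = alpha * derivative(theta1)
        theta1 -= step
        if abs(theta1) > 1e12:        # overshooting further and further away
            return None, theta1
        if abs(step) < tolerance:     # θ₁ has stopped changing
            return iteration, theta1
    return None, theta1

results = {alpha: run(alpha) for alpha in (0.001, 0.01, 0.1, 1, 10)}
for alpha, (iterations, theta1) in results.items():
    if iterations is None:
        print(f"alpha = {alpha}: did not converge")
    else:
        print(f"alpha = {alpha}: θ₁ ≈ {theta1:.4f} after {iterations} iterations")
```

On this toy function the smaller learning rates need many more iterations, while alpha = 10 overshoots by more on every step. Alpha = 1 also fails here, because θ₁ just bounces between 0 and 5 forever, which shows why it is worth trying several values rather than trusting one.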

Gradient Descent in 3 Dimensions

The algorithm for gradient descent in 3D has the same concept as in 2D, but now we are applying the algorithm to both θ₀ and θ₁.

Because we are working in 3 dimensions we have to use what are called partial derivatives to change our values of θ₀ and θ₁ to approach our minimum.

Using the same cost function as in the previous episode, the partial derivatives for both θ₀ and θ₁ are given in orange.

Just as in 2D, the partial derivatives give the slope of the cost function, but this time either in the θ₀ plane or the θ₁ plane:
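
For reference, here is what these partial derivatives look like if we assume the cost function is the usual mean squared error for a regression line θ₀ + θ₁x over m data points. That form, and the 1/(2m) scaling, are assumptions in this sketch, so please check them against the cost function in the previous episode.

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)

\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}
```

Each partial derivative treats the other parameter as a constant, which is why it gives the slope in only one direction.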

Gradient Descent Algorithm with 2 parameters

  • This algorithm is conceptually the same as gradient descent in 1 dimension.
  • We are minimising across the θ₀ plane and the θ₁ plane shown above to reach our minimum cost.
  • In Python we can call a function that does all this maths for us, which I plan to cover in a future episode.
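
Before reaching for a library, it can help to see the two-parameter version written out by hand. The sketch below assumes a mean squared error cost for the line θ₀ + θ₁x. The function name `gradient_descent_2d`, the made-up data and the choice of alpha = 0.1 with 5000 iterations are all illustrative assumptions, not the values used in this series.

```python
def gradient_descent_2d(xs, ys, alpha=0.1, iterations=5000):
    """Fit y ≈ θ₀ + θ₁x by gradient descent on an assumed mean squared error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every data point with the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to θ₀ and θ₁.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters together, using the same set of errors.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent_2d(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # close to 1.0 and 2.0
```

Note that both gradients are calculated before either parameter changes, so θ₀ and θ₁ are updated at the same time.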

Gradient Descent in multiple Dimensions

Gradient descent in N dimensions involves more variables: instead of just one input x being mapped to an output y, as in 2 dimensions, we now have several inputs.

With the temperature and humidity example we had the following data and regression line formula.

With N dimensions we will be looking to map multiple inputs x to our output temperature: not just humidity, but perhaps also pressure and wind speed, to see how they affect temperature.

Here each input column is named 𝑥₁, 𝑥₂, 𝑥₃ and is assigned its own parameter θ₁, θ₂ and θ₃ respectively.

It is very difficult to visualise in 4 dimensions, so I won’t be able to show how our cost function changes according to our parameters θ₀, θ₁, θ₂ and θ₃ as shown in 2D and 3D.

The concept remains the same as in 2D and 3D, but now we apply the same gradient descent algorithm to all parameters.

Gradient Descent Algorithm with multiple parameters

For n parameters θ₀, θ₁, θ₂, …, θₙ the algorithm remains the same, but we update all n parameters via gradient descent to reach the parameter values that produce our minimum cost.

So for n parameters we have the general gradient descent formula:

Gradient Descent Algorithm for n parameters
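
Written out in the standard textbook form, the update rule looks like this. The notation is an assumption here, since it may differ slightly from the figure.

```latex
\text{repeat until convergence:} \quad
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n)
\qquad \text{for } j = 0, 1, \dots, n \text{ (all updated simultaneously)}
```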

If you can understand gradient descent in both 1 dimension and 2 dimensions, don't worry if the n-dimensional gradient descent algorithm looks confusing: the computer does all this for us!

I hope you now have a better understanding of gradient descent and what it's all about, and I would really appreciate a few claps to keep me going!
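
To show that the computer really is doing the same thing for every parameter, here is a minimal sketch in plain Python. It assumes a mean squared error cost, and the data is made up (the three input columns could stand for humidity, pressure and wind speed, but the numbers are not real measurements). The function name and the settings alpha = 0.1 and 5000 iterations are also illustrative assumptions.

```python
def gradient_descent(rows, ys, alpha=0.1, iterations=5000):
    """Gradient descent for y ≈ θ₀ + θ₁x₁ + … + θₙxₙ on an assumed mean squared error cost."""
    m = len(rows)
    # Prepend x₀ = 1 to every row so θ₀ is treated like any other parameter.
    X = [[1.0] + list(row) for row in rows]
    thetas = [0.0] * len(X[0])
    for _ in range(iterations):
        # Prediction error for each row with the current parameters.
        errors = [sum(t * x for t, x in zip(thetas, row)) - y
                  for row, y in zip(X, ys)]
        # One partial derivative per parameter θⱼ.
        gradients = [sum(e * row[j] for e, row in zip(errors, X)) / m
                     for j in range(len(thetas))]
        # Update every parameter at the same time.
        thetas = [t - alpha * g for t, g in zip(thetas, gradients)]
    return thetas

# Made-up data generated from y = 1 + 2x₁ + 3x₂ − x₃.
rows = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
ys = [1, 3, 4, 0, 5]

thetas = gradient_descent(rows, ys)
print([round(t, 3) for t in thetas])  # close to [1.0, 2.0, 3.0, -1.0]
```

The loop is the same as in the two-parameter case. The only difference is that the gradients and parameters are stored in lists, so the same code works for any number of inputs.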


If you have any questions please leave them below!
