Ruslan Brilenkov



2 Deep Learning Methods Against Overfitting

A more complex system does not always mean better performance, but there are ways to improve it.


This article focuses on deep learning; however, the general principles hold true for any machine learning (ML) algorithm.

So, even if you are just starting out with ML, I hope this article can still teach you something useful.

If you would like to recall the flow of the neural network, such as feedforward and backpropagation, I invite you to check out this article:

Otherwise, let us begin!

First, we briefly cover what overfitting is and why it is a problem.

Then, we look at two simple yet powerful methods to prevent overfitting in deep learning.

Why is overfitting a problem?

Let us define overfitting:

In statistics, overfitting is a modeling error that occurs when a function is too closely aligned to a limited set of data points and fails to fit additional data or predict future observations.

This may occur when the model is too complex or is trained for too long. Such a model begins to fit the noise and eventually is unable to predict the general trend.

In other words, the model memorizes the training data instead of learning the underlying trend.
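To make the idea of memorizing concrete, here is a minimal sketch in plain Python. The data (a noisy straight line), the two toy models, and all function names are my own assumptions for illustration; they do not come from any library. One model only memorizes the training points, the other fits a simple trend.

```python
import random

random.seed(0)

def make_data(n):
    """Sample n points from y = 2x + 1 plus Gaussian noise (a made-up trend)."""
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + random.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def memorizer(train_xs, train_ys):
    """A model that only memorizes: it returns the label of the closest training point."""
    def predict(x):
        nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
        return train_ys[nearest]
    return predict

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

slope, intercept = fit_line(train_x, train_y)

def line(x):
    return slope * x + intercept

memo = memorizer(train_x, train_y)

print("memorizer train/test MSE:", mse(memo, train_x, train_y), mse(memo, test_x, test_y))
print("line      train/test MSE:", mse(line, train_x, train_y), mse(line, test_x, test_y))
```

The memorizer scores a perfect zero on the training set, because every training point is its own nearest neighbor, but that guarantee disappears on new points. The line cannot reproduce the noise, so it has to capture the trend instead.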

This is a big problem because an overfitted model is useless at predicting new, unseen, out-of-sample data, and such prediction is usually the whole point of ML.

Intuitive example

Let us make an analogy with a student who is studying for an exam. Suppose that the student found the solutions to the exams from previous years. Instead of understanding the material, our poor fellow simply memorized the answers. Unsurprisingly, this student could not pass the exam because the questions were not copied from the previous years.

This is an example of how our poor student overfitted the data instead of understanding the underlying logic, which resulted in a failure to handle unseen data.

Being complex systems, neural networks have a lot of parameters to train and adjust. As a result, there is a high risk of overfitting the data, which we want to avoid.

It is not always necessary to use a complex system to model the data, but if one chooses to do so, there are ways to improve the result.

Let us discuss the two methods to prevent our deep learning network from overfitting.

Method 1: Dropout Layers

In 2014, Srivastava and collaborators published a paper in which they presented the idea of dropout layers.

Let me first quote the authors and then give my understanding of the process.

The term “dropout” refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections.

— Srivastava et al., 2014

Here is my illustration of the dropout layer in action:

An illustration of the dropout layer in action: randomly dropping network units. Made by Author.

During the learning process, our deep learning network may become dependent on some of the weights in its (hidden) layers and ignore the rest. As a result, the network will perfectly fit the training data and fail on unseen data.

By adding dropout layers, we ensure that the network's knowledge is spread across the whole network of neurons/weights rather than depending too much on any specific unit.

The units to drop are chosen at random at each training step, so that no neuron is given an advantage.

As a result, the network is much better at generalizing to out-of-sample data.

Note: there are no weights to learn in the dropout layer, in comparison to the other layers.
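A minimal sketch of what such a layer computes, in plain Python, may help here. The function name `dropout` and its parameters are hypothetical, and I am assuming the "inverted" variant (surviving units are rescaled during training), which is a common choice in modern frameworks rather than something this article prescribes.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected output is unchanged.
    At inference time the input passes through untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # One independent coin flip per unit; there is nothing to learn here.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(42)
hidden = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]
print(dropout(hidden, rate=0.5, rng=rng))         # a fresh random mask on every call
print(dropout(hidden, rate=0.5, training=False))  # unchanged at inference time
```

Note that the function holds no state between calls, which matches the remark above: the only setting is the drop rate, and it is chosen by hand rather than learned.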

Method 2: Batch Normalization

Although the batch normalization technique was designed primarily to stabilize and speed up training, which includes keeping exploding gradients* in check, it also helps with the overfitting problem.

This technique was described in detail in the paper by Ioffe and Szegedy, published in 2015.

Exploding gradients problem: if not controlled, the weights of a deep learning network can become too large, which leads to extreme gradient values. In other words, the gradients explode, and the loss function returns infinite/NaN values.

Normalization is the process of rescaling data into a fixed range, such as -1 to +1 or 0 to +1. Which range to choose is largely a matter of preference.

For example, when working with image data, one usually scales the input channels into the -1 to +1 range instead of the original 0 to 255. This normalizes the input layer, while the hidden layers remain unnormalized.
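As a small illustration of this input scaling, here is a sketch in plain Python. The helper name `scale_pixels` and its parameters are hypothetical, and I am assuming 8-bit pixel values.

```python
def scale_pixels(pixels, low=-1.0, high=1.0):
    """Linearly map 8-bit pixel values from [0, 255] to [low, high]."""
    return [low + (high - low) * p / 255.0 for p in pixels]

print(scale_pixels([0, 51, 255]))           # mapped to the -1..+1 range
print(scale_pixels([0, 51, 255], low=0.0))  # mapped to the 0..+1 range
```

Both ranges come from the same linear map, which is why the choice between them is mostly a matter of preference.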

Due to the random initialization of the weights, training may cause some of the network's weights to grow exponentially. This would lead to the exploding gradient problem.
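To see why large weights are dangerous, consider a deliberately simplified sketch: a chain of identical scalar layers y = weight * x. This toy model and the name `gradient_magnitude` are my assumptions; a real network multiplies by weight matrices and activation derivatives, but the compounding effect is the same.

```python
def gradient_magnitude(weight, depth, upstream=1.0):
    """Backpropagate a gradient through `depth` identical layers y = weight * x.

    By the chain rule, each layer multiplies the incoming gradient by `weight`.
    """
    grad = upstream
    for _ in range(depth):
        grad *= weight
    return grad

for w in (0.9, 1.0, 1.5):
    print(w, [gradient_magnitude(w, d) for d in (10, 50, 2000)])
```

With a weight only slightly above 1, the gradient grows geometrically with depth and eventually overflows to infinity, which is the infinite/NaN behavior described in the note on exploding gradients.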

So, one can place batch normalization layers after the other layers to keep deep learning training smooth.

How does batch normalization work?

There are two statistics operating inside the batch normalization layer: the mean and the standard deviation of the batch input channels.

A batch normalization layer normalizes the batch by subtracting the mean and dividing by the standard deviation. So, no input channel can take the upper hand over the others.

This operation ensures that the output of the previous layer is normalized.

Note: these two statistics (the mean and the standard deviation) are non-trainable because they are computed during the forward pass rather than learned through backpropagation. The layer described in the paper also applies a trainable scale and shift (γ and β) after the normalization.
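The forward pass of such a layer can be sketched in a few lines of plain Python. This is a simplified training-time version for a single feature; the function name `batch_norm` is hypothetical, `gamma` and `beta` stand for the scale and shift from the Ioffe and Szegedy paper, and `eps` is a small constant I assume for numerical safety.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then apply a scale and a shift.

    The mean and variance are computed from the batch itself (not learned);
    gamma and beta are the values a framework would learn during training.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # population variance
    inv_std = 1.0 / math.sqrt(var + eps)                    # eps avoids division by zero
    return [gamma * (x - mean) * inv_std + beta for x in batch]

activations = [250.0, 10.0, -40.0, 120.0]
normalized = batch_norm(activations)
print(normalized)
```

Whatever the scale of the incoming activations, the output has roughly zero mean and unit variance before the scale and shift are applied, so the next layer always sees inputs in a comparable range.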

Additionally, when using batch normalization, we may not need to apply dropout layers at all, as noted in the paper cited above:

Furthermore, batch normalization regularizes the model and reduces the need for Dropout.

— Ioffe and Szegedy, 2015

However, there is no golden rule for deep learning architecture. The way to find the best architecture is to test several candidates. So, I would probably still use both batch normalization and dropout layers to guard against overfitting.

With this, we are concluding this article.

Summary

In this article, we discussed two methods to prevent a deep neural network from overfitting: dropout layers and batch normalization.

Modern deep learning architectures may not need dropout layers when using batch normalization, because the latter also helps with the overfitting problem. But there is no golden rule.

It is good to test various architectures to find out which one performs the best in your case.

Thank you for reading until the end. I hope you enjoyed it and learned something new.

Last but not least, if you have found any error, have a question or a comment, please do not hesitate to contact me (below).

Is deep learning a good starting point for beginners?

In general, I would not encourage people to start with neural networks and deep learning. Without a proper understanding of the inner intricacies, neural networks are just magical black boxes.

There are a lot of simpler algorithms and models which can perform reasonably well on many tasks.

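One can start with understanding supervised and unsupervised machine learning, classification, regression, etc. A few of my previous articles aim to give an overview with some hands-on examples:

Machine Learning for Complete Beginners. Introduction. (https://medium.datadriveninvestor.com/machine-learning-for-complete-beginners-introduction-61b3a961b5ae)

7 Types of ML Classification Algorithms. (https://medium.datadriveninvestor.com/7-types-of-ml-classification-algorithms-af5ee5bcba2e)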


Contact

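LinkedIn: https://www.linkedin.com/in/ruslan-brilenkov/

I recently started a YouTube channel (https://bit.ly/RBrilenkovYT) where I talk about different topics, including data science and AI news, research, and life in general. It is a steep learning curve for me, but I invite you to check it out.

Never miss a story: join my mailing list (https://ruslan-brilenkov.medium.com/subscribe)!

GitHub: https://github.com/RuslanBrilenkov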

References:

Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15 (2014) [https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf]

Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015) [https://arxiv.org/pdf/1502.03167.pdf]


More content at plainenglish.io
